This data analytics project is a case study of Bellabeat. I use R to clean and analyze Fitbit tracker data, with the goal of identifying trends in how consumers use non-Bellabeat smart devices and applying those insights to Bellabeat's marketing strategy.
Table of Contents
1. Summary
2. Ask Phase
2.1 Business Task
3. Prepare Phase
3.1 Dataset Used
3.2 Confidentiality and Availability Of Information
3.3 Information About The Dataset
3.4 Verification and Organization of Data
3.5 Integrity and Reliability Of Information
4. Process Phase
4.1 Installing packages and opening libraries
4.2 Importing datasets
4.3 Preview our datasets
4.4 Cleaning and formatting
4.5 Merging datasets
5. Analyze And Share Phase
5.1 Type of users per activity level
5.2 Steps and minutes asleep per weekday
5.3 Hourly steps throughout the day
5.4 Correlations
5.5 Use of smart device
5.5.1 Days used smart device
5.5.2 Time used smart device per day
6. Conclusion (Act Phase)
1. Summary
Bellabeat is a technology company that specializes in women's health and wellness products. They offer a range of devices, such as smartwatches, trackers, and earbuds, designed to help women monitor their health and wellness. These products track menstrual cycles, pregnancy, sleep patterns, stress levels, and other health metrics, and provide personalized insights and advice. The company also offers a companion app that allows users to view their health data and track their progress.
2. Ask Phase
2.1 Business Task
Identify trends in how consumers use non-Bellabeat smart devices to apply insights into Bellabeat’s marketing strategy.
Stakeholders
Urška Sršen - Bellabeat co-founder and Chief Creative Officer
Sando Mur - Bellabeat co-founder and key member of the Bellabeat executive team
Bellabeat Marketing Analytics team
3. Prepare Phase
3.1 Dataset Used
Our case study uses the FitBit Fitness Tracker Data as its source of information. This dataset can be found on Kaggle and was made available by Mobius.
3.2 Confidentiality and Availability Of Information
According to the dataset's metadata, the data is open source: the owner has waived all copyright and related or neighboring rights to the work worldwide, to the fullest extent permitted by law. Anyone may copy, modify, distribute, and perform the work, even for commercial purposes, without asking permission.
3.3 Information About The Dataset
The dataset was generated by respondents to a survey distributed via Amazon Mechanical Turk between March 12 and May 12, 2016. Thirty eligible Fitbit users consented to share their personal tracker data, which includes minute-level records of physical activity, heart rate, and sleep monitoring. Variation in the output reflects the different Fitbit trackers used and each individual's tracking behavior and preferences.
We have access to 18 CSV files, each containing different quantitative data tracked by Fitbit. The data is considered "long" format as each row represents a single time point for a particular user, with multiple rows for each user as data is tracked on a daily and hourly basis. Every user is assigned a unique ID for identification purposes.
3.4 Verification and Organization of Data
Because the dataset is small, I was able to sort and filter the tables using pivot tables in Google Sheets. This allowed me to examine the attributes and observations in each table and identify relationships between the tables. I counted the number of users in each table and confirmed that the analysis covers a 31-day period.
3.5 Integrity and Reliability Of Information
The dataset has several limitations: a sample of only 30 users and no demographic information, which could introduce sampling bias and raises questions about how well the sample represents the wider population. In addition, the data dates from 2016 and covers only a 2-month survey period. Our case study will therefore adopt an operational approach to address these limitations.
4. Process Phase
I have chosen to conduct my analysis in R because of its accessibility, its ability to handle the amount of data involved, and its support for insightful and impactful data visualizations. This will enable me to communicate my results effectively to stakeholders and support data-driven decisions.
4.1 Installing Packages And Opening Libraries
In order to conduct a comprehensive analysis, we will select and install relevant packages in R. The following packages will be utilized:
Installing Packages
install.packages("tidyverse")
install.packages("here")
install.packages("skimr")
install.packages("janitor")
install.packages("lubridate")
install.packages("ggpubr")
install.packages("ggrepel")
Opening Libraries
library(tidyverse)
library(here)
library(skimr)
library(janitor)
library(lubridate)
library(ggpubr)
library(ggrepel)
These packages will provide us with various tools and functions to handle, analyze and visualize the data, allowing us to carry out a thorough investigation and draw meaningful conclusions.
4.2 Importing Datasets
Having reviewed the available datasets, we will import those that help us answer our business task. Our analysis focuses on the following datasets:
Daily_Activity
Daily_Sleep
Hourly_Steps
Due to the small sample size, we will not take into account weight (8 users) and heart rate (7 users) in this analysis.
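As a minimal sketch of the import step, assuming the files are read with base R's read.csv: the snippet below writes a tiny stand-in CSV to a temporary file so that it runs anywhere. The sample rows, the column names, and the temporary path are illustrative assumptions; in the real project the path would point to the downloaded Kaggle file for each dataset.

```r
# Stand-in for one of the Fitbit CSV files (hypothetical rows, for illustration only).
csv_path <- tempfile(fileext = ".csv")
writeLines(
  c("Id,ActivityDate,TotalSteps",
    "1001,4/12/2016,13162",
    "1002,4/12/2016,4200"),
  csv_path
)

# Import the file into a data frame; the same call would be repeated
# for Daily_Sleep and Hourly_Steps with their own file paths.
Daily_Activity <- read.csv(csv_path)
```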
4.3 Preview Our Datasets
We will examine our chosen data frames using the head and str functions in R and review the summary of each column.
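The preview step could look roughly like this. The toy data frame below is a hypothetical stand-in for Daily_Activity (its values are made up), and counting distinct Id values is an extra check we assume is useful for confirming the number of users.

```r
# Hypothetical stand-in for Daily_Activity; the real table comes from the CSV import.
Daily_Activity <- data.frame(
  Id = c(1001, 1001, 1002),
  ActivityDate = c("4/12/2016", "4/13/2016", "4/12/2016"),
  TotalSteps = c(13162L, 10735L, 4200L)
)

head(Daily_Activity)     # first rows of the table
str(Daily_Activity)      # column names and types
summary(Daily_Activity)  # summary statistics per column

# Number of distinct users in the table.
n_users <- length(unique(Daily_Activity$Id))
```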
4.4 Cleaning And Formatting
Now that we have familiarized ourselves with our data structures, we will process them to identify any errors and inconsistencies.
4.4.1 Duplicates
We will look for any duplicates:
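One way to count duplicates, sketched with base R's duplicated function on a hypothetical stand-in for Daily_Sleep (the rows are made up, with one deliberate duplicate):

```r
# Hypothetical stand-in for Daily_Sleep containing one exact duplicate row.
Daily_Sleep <- data.frame(
  Id = c(1001, 1001, 1001, 1002),
  SleepDay = c("4/12/2016 12:00:00 AM", "4/12/2016 12:00:00 AM",
               "4/13/2016 12:00:00 AM", "4/12/2016 12:00:00 AM"),
  TotalMinutesAsleep = c(327L, 327L, 384L, 412L)
)

# duplicated() flags every row that repeats an earlier row; sum() counts them.
n_duplicates <- sum(duplicated(Daily_Sleep))
```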
4.4.2 Remove Duplicates And N/A
Knowing the number of observations in each data frame (413 in Daily_Sleep), we can remove the duplicate rows and any rows with missing values from Daily_Sleep.
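A minimal sketch of the removal step, assuming dplyr's distinct and tidyr's drop_na are used (both ship with the tidyverse). The toy rows are hypothetical and include one duplicate and one missing value.

```r
library(dplyr)
library(tidyr)

# Hypothetical stand-in for Daily_Sleep: one duplicate row and one NA.
Daily_Sleep <- data.frame(
  Id = c(1001, 1001, 1002, 1003),
  SleepDay = c("4/12/2016", "4/12/2016", "4/12/2016", "4/12/2016"),
  TotalMinutesAsleep = c(327L, 327L, 412L, NA)
)

Daily_Sleep <- Daily_Sleep %>%
  distinct() %>%  # keep one copy of each identical row
  drop_na()       # drop rows that contain any missing value

# Verify that no duplicates remain.
remaining_duplicates <- sum(duplicated(Daily_Sleep))
```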
4.4.3 Clean and rename columns
To ensure consistency in column names across datasets and facilitate merging them later on, we will convert all column names to lowercase so that they follow a consistent syntax and format.
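The renaming could be done with dplyr's rename_with, as sketched below on a hypothetical one-row stand-in for Daily_Activity. The same call would be applied to each data frame.

```r
library(dplyr)

# Hypothetical stand-in with the mixed-case column names of the raw files.
Daily_Activity <- data.frame(
  Id = 1001,
  ActivityDate = "4/12/2016",
  TotalSteps = 13162L
)

# Apply tolower() to every column name.
Daily_Activity <- rename_with(Daily_Activity, tolower)
names(Daily_Activity)
```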
4.4.4 Consistency Of Date and Time Columns
After confirming and converting our column names to lowercase, the next step is to clean the date-time format for the Daily_Activity and Daily_Sleep data frames. This is important as we plan to merge the two data frames. As the time is not relevant in the Daily_Sleep data frame, we will use the as_date function instead of as_datetime.
We will check our cleaned datasets:
head(Daily_Activity)
head(Daily_Sleep)
We will convert the date string to date-time for our Hourly_Steps data set.
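The date conversions described in this section might be sketched as follows. The input formats ("4/12/2016" and "4/12/2016 1:00:00 PM") and the lowercase column names are assumptions about the raw files, and the rows are made up. For Daily_Sleep only the date part of the string is parsed, so the time is dropped.

```r
library(dplyr)
library(lubridate)

# Hypothetical one-row stand-ins with lowercase column names.
Daily_Activity <- data.frame(id = 1001, activitydate = "4/12/2016")
Daily_Sleep    <- data.frame(id = 1001, sleepday = "4/12/2016 12:00:00 AM")
Hourly_Steps   <- data.frame(id = 1001, activityhour = "4/12/2016 1:00:00 PM",
                             steptotal = 373L)

# Daily tables: rename the date column and convert it to a Date.
Daily_Activity <- Daily_Activity %>%
  rename(date = activitydate) %>%
  mutate(date = as_date(date, format = "%m/%d/%Y"))

Daily_Sleep <- Daily_Sleep %>%
  rename(date = sleepday) %>%
  mutate(date = as_date(date, format = "%m/%d/%Y"))  # trailing time is ignored

# Hourly table: keep the time of day, so parse to a date-time (UTC by default).
Hourly_Steps <- Hourly_Steps %>%
  rename(date_time = activityhour) %>%
  mutate(date_time = mdy_hms(date_time))
```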
4.5 Merging Datasets
We will merge Daily_Activity and Daily_Sleep to see any correlation between variables by using ID and date as their primary keys.
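A minimal sketch of the merge, assuming an inner join on both keys with base R's merge. The name Daily_Activity_Sleep and the toy rows are hypothetical. With an inner join, only user-days that appear in both tables are kept.

```r
# Hypothetical stand-ins that already share the key columns id and date.
Daily_Activity <- data.frame(
  id = c(1001, 1001, 1002),
  date = as.Date(c("2016-04-12", "2016-04-13", "2016-04-12")),
  totalsteps = c(13162L, 10735L, 4200L)
)
Daily_Sleep <- data.frame(
  id = c(1001, 1002),
  date = as.Date(c("2016-04-12", "2016-04-12")),
  totalminutesasleep = c(327L, 412L)
)

# Inner join on id and date (merge() keeps only matching rows by default).
Daily_Activity_Sleep <- merge(Daily_Activity, Daily_Sleep, by = c("id", "date"))
```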
5. Analyze and Share Phase
Our objective is to explore how the usage patterns of FitBit users can inform and potentially enhance BellaBeat's marketing strategy.
5.1 Type of users per activity level
As we lack demographic variables in our sample, our aim is to determine the user types based on the available data. We can accomplish this by classifying users according to their daily activity level, as measured by the number of steps taken. To do so, we will categorize users into four groups, based on the following criteria:
Sedentary - Less than 5000 steps per day.
Lightly active - Between 5000 and 7499 steps per day.
Fairly active - Between 7500 and 9999 steps per day.
Very active - 10000 or more steps per day.
This classification scheme has been derived from the article titled "Counting Steps: A Guide to Increasing Physical Activity Through Pedometer Use" (available at https://www.10000steps.org.au/articles/healthy-lifestyles/counting-steps/). By using this approach, we can gain a better understanding of the activity levels of our users and identify any relevant patterns or trends that may inform our marketing strategy.
We first will calculate the Daily_Steps_Average by user.
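The averaging and classification could be sketched like this. The merged data frame name, the column names, and the step values are hypothetical; the thresholds follow the four categories listed above, with the assumption that exactly 10000 steps counts as very active.

```r
library(dplyr)

# Hypothetical merged data: two days of step counts for four users.
Daily_Activity_Sleep <- data.frame(
  id = c(1, 1, 2, 2, 3, 3, 4, 4),
  totalsteps = c(3000, 4000, 6000, 7000, 8000, 9000, 11000, 13000)
)

# Average daily steps per user.
Daily_Steps_Average <- Daily_Activity_Sleep %>%
  group_by(id) %>%
  summarise(mean_daily_steps = mean(totalsteps))

# case_when() checks conditions in order, so each threshold only needs an upper bound.
User_Type <- Daily_Steps_Average %>%
  mutate(user_type = case_when(
    mean_daily_steps < 5000  ~ "sedentary",
    mean_daily_steps < 7500  ~ "lightly active",
    mean_daily_steps < 10000 ~ "fairly active",
    TRUE                     ~ "very active"
  ))
```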
Now that we have added a new column indicating the user_type, we can create a data frame that shows the percentage of users in each category, which will allow us to visualize the distribution more clearly on a graph.
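A percentage data frame of this kind might be built as follows. The input rows and the names User_Type_Percent, total, percent, and label are hypothetical.

```r
library(dplyr)

# Hypothetical classified users: two per category.
User_Type <- data.frame(
  id = 1:8,
  user_type = rep(c("sedentary", "lightly active",
                    "fairly active", "very active"), each = 2)
)

User_Type_Percent <- User_Type %>%
  count(user_type, name = "total") %>%          # users per category
  mutate(
    percent = total / sum(total),               # share of all users
    label = paste0(round(100 * percent), "%")   # text label for a chart
  )
```

The label column is a convenience for annotating a pie or bar chart with ggplot2.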
We can now see that users are fairly evenly distributed across activity levels when measured by daily steps. This suggests that users of all activity levels wear smart devices.
5.2 Steps and Minutes Asleep Per Weekday
We are interested in determining the most active and restful days of the week for our users, as well as whether they meet recommended activity and sleep goals. To this end, we are using our date column to determine the day of the week for each record, and calculating the average number of steps and minutes of sleep for each weekday.
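The weekday averages could be computed roughly as below, assuming lubridate's wday is used to derive the weekday. The data frame and column names and the values are hypothetical; the four dates are two Sundays and two Mondays in April 2016.

```r
library(dplyr)
library(lubridate)

# Hypothetical merged data covering two Sundays and two Mondays.
Daily_Activity_Sleep <- data.frame(
  date = as.Date(c("2016-04-10", "2016-04-17", "2016-04-11", "2016-04-18")),
  totalsteps = c(6000, 7000, 8000, 9000),
  totalminutesasleep = c(450, 470, 400, 420)
)

Weekday_Steps_Sleep <- Daily_Activity_Sleep %>%
  # Ordered weekday factor starting on Monday, so plots sort Monday to Sunday.
  mutate(weekday = wday(date, label = TRUE, abbr = FALSE, week_start = 1)) %>%
  group_by(weekday) %>%
  summarise(
    daily_steps = mean(totalsteps),
    daily_sleep = mean(totalminutesasleep)
  )
```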
Based on the graphs, we can observe that the users generally meet the recommended amount of steps per day of 7500, except for Sundays. However, the users do not meet the recommended amount of sleep, which is 8 hours per night.
5.3 Hourly Steps Throughout The Day
To further explore our analysis, we aim to determine the specific times of day when users are most active. This can be achieved by utilizing the Hourly_Steps data frame and parsing out the date_time column.
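As a sketch of this step, the snippet below extracts the hour with lubridate's hour function rather than splitting the date_time string, which is a swapped-in but equivalent way to group by time of day. The column names and values are hypothetical.

```r
library(dplyr)
library(lubridate)

# Hypothetical hourly data: two users, two hours each.
Hourly_Steps <- data.frame(
  id = c(1, 1, 2, 2),
  date_time = as.POSIXct(
    c("2016-04-12 08:00:00", "2016-04-12 13:00:00",
      "2016-04-12 08:00:00", "2016-04-12 13:00:00"),
    tz = "UTC"
  ),
  steptotal = c(100, 500, 300, 700)
)

Hourly_Average <- Hourly_Steps %>%
  mutate(hour = hour(date_time)) %>%   # hour of day, 0 to 23
  group_by(hour) %>%
  summarise(average_steps = mean(steptotal))
```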
Based on the data analysis, it appears that users tend to be more active between the hours of 8am and 7pm. Specifically, they seem to take more steps during lunchtime, from 12pm to 2pm, as well as in the evenings, between 5pm and 7pm.
5.4 Correlations
We can also examine whether there is a correlation between the following pairs of variables:
Daily steps and daily sleep
Daily steps and calories
Based on the analysis, there appears to be no meaningful linear correlation between daily steps and minutes asleep. In this sample, walking more on a given day is not associated with sleeping more or less, although a sample this small cannot establish that the two variables are fully independent.
On the other hand, we observe a positive correlation between daily steps and calories burned. This suggests that as the number of steps walked increases, the number of calories burned also tends to increase. This relationship is likely due to the fact that physical activity leads to an increase in energy expenditure, which in turn can lead to a higher calorie burn. However, it is important to note that correlation does not necessarily imply causation, and further analysis may be needed to establish a cause-and-effect relationship between these variables.
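To put a number on these relationships, one option is the Pearson correlation coefficient from base R's cor. This is a sketch on made-up values, not the study's data; the column names are assumptions.

```r
# Hypothetical merged data with steps, sleep, and calories per day.
Daily_Activity_Sleep <- data.frame(
  totalsteps = c(2000, 5000, 8000, 11000, 14000),
  totalminutesasleep = c(420, 380, 450, 400, 430),
  calories = c(1500, 1800, 2100, 2500, 2800)
)

# Pearson correlation: values near 0 mean no linear relationship,
# values near 1 mean a strong positive one.
cor_steps_sleep <- cor(Daily_Activity_Sleep$totalsteps,
                       Daily_Activity_Sleep$totalminutesasleep)
cor_steps_calories <- cor(Daily_Activity_Sleep$totalsteps,
                          Daily_Activity_Sleep$calories)
```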
5.5 Use of smart device
5.5.1 Days Used Smart Device
To determine the frequency of device usage among the users in our sample, we can create a new data frame that groups the data by Id and calculates the number of days each user used their device. Then, we can use this information to classify each user into one of three categories based on their level of usage:
high use: users who used their device between 21 and 31 days
moderate use: users who used their device between 11 and 20 days
low use: users who used their device between 1 and 10 days
Here's an example of how we can create the new data frame and classification column with the classification explained above.
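A minimal sketch under stated assumptions: each row of the merged data is one day of use for one user, the column is named id after the lowercase conversion, and exactly 10 days counts as low use. The row counts below are made up.

```r
library(dplyr)

# Hypothetical merged data: one row per user per day of use.
Daily_Activity_Sleep <- data.frame(id = rep(c(1, 2, 3), times = c(25, 15, 5)))

Daily_Use <- Daily_Activity_Sleep %>%
  group_by(id) %>%
  summarise(days_used = n()) %>%   # number of days with a record
  mutate(usage = case_when(
    days_used <= 10 ~ "low use",       # 1 to 10 days
    days_used <= 20 ~ "moderate use",  # 11 to 20 days
    TRUE            ~ "high use"       # 21 to 31 days
  ))
```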
To better visualize the results, we can create a percentage data frame that shows the proportion of users in each usage level.
With the percentage data frame in hand, we can create a plot to visualize the frequency of device usage among the users in our sample.
Half of the users in our sample use their smart devices frequently: 50% used their device on between 21 and 31 days of the 31-day interval. This suggests that a substantial share of users are consistently engaged with their devices.
However, a significant portion of our sample (38%) used their device on 10 days or fewer, which suggests there is room to encourage device usage and increase user engagement. This could inform our marketing and product development strategies, as we may want to focus on features and promotions designed specifically to increase engagement. The moderate usage group (12%) could also be an important target audience for certain marketing efforts.
5.5.2 Time Used Smart Device Per Day
Merging the Daily_Use data frame and Daily_Activity data frame will allow us to see how many minutes users wear their device per day and filter the results by daily use of the device.
We need to generate a new data frame that computes the total number of minutes each user wore the device per day, and classify the data into one of three categories:
All day: if the device was worn for the entire day.
More than half day: if the device was worn for more than half of the day.
Less than half day: if the device was worn for less than half of the day.
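The classification above could be sketched as follows. It rests on an assumption not stated in the text: that minutes worn per day is the sum of the four activity-minute columns, compared against the 1440 minutes in a day. The column names and values are hypothetical, and exactly half a day is assigned to "More than half day".

```r
library(dplyr)

# Hypothetical daily records for three users.
Daily_Activity <- data.frame(
  id = c(1, 2, 3),
  veryactiveminutes = c(30, 20, 5),
  fairlyactiveminutes = c(20, 10, 5),
  lightlyactiveminutes = c(200, 150, 100),
  sedentaryminutes = c(1190, 700, 400)
)

Minutes_Worn <- Daily_Activity %>%
  mutate(
    # Assumption: time worn = sum of all tracked activity minutes.
    total_minutes_worn = veryactiveminutes + fairlyactiveminutes +
      lightlyactiveminutes + sedentaryminutes,
    percent_minutes_worn = 100 * total_minutes_worn / 1440,
    worn = case_when(
      percent_minutes_worn >= 100 ~ "All day",
      percent_minutes_worn >= 50  ~ "More than half day",
      TRUE                        ~ "Less than half day"
    )
  )
```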
To facilitate the visualization of our results, we will create four new data frames. These data frames will allow us to compare the total usage with the usage by category and to make meaningful conclusions.
The first data frame will display the total number of users, and the percentage of time the device was worn, classified by the three categories we created (All day, More than half day, Less than half day). By calculating the total minutes worn by each user per day, we will determine the percentage of minutes worn in each of the three categories. This data frame will have columns for the total number of users, the percentage of users wearing the device all day, the percentage of users wearing the device more than half day, and the percentage of users wearing the device less than half day.
The remaining three data frames will be filtered by category of daily users, to better understand the differences in usage patterns. The second data frame will show data only for users who wore the device All day. The third data frame will show data only for users who wore the device More than half day. The fourth data frame will show data only for users who wore the device Less than half day. These data frames will have columns for the total number of users and the total number of minutes worn by each user in each category.
By creating these four data frames, we can compare the overall usage with the usage by category, and draw meaningful conclusions about device usage patterns. We can arrange these data frames on the same visualization to facilitate comparisons and analysis.
head(minutes_worn_percent)
head(minutes_worn_highuse)
head(minutes_worn_moduse)
head(minutes_worn_lowuse)
Now that we have created the four data frames and organized the worn level categories, we can visualize our results using the following plots. To facilitate comparisons and analysis, all plots have been arranged together.
Our plots show that 36% of all users wear the device all day, while 60% wear it for more than half the day, and only 4% wear it for less than half the day. When we filter the total users by the number of days they used the device and the amount of time they wore it each day, we can observe some interesting patterns.
For users who used the device between 21 and 31 days (high use), only 6.8% wear it all day. The vast majority (88.9%) wear it for more than half the day but not all day.
Moderate users, who used the device between 11 and 20 days, tend to wear it less on a daily basis.
Interestingly, low users who used the device between 1 and 10 days tend to wear it for longer periods on the days they do use it.
These observations highlight the different usage patterns of users across different time frames and usage levels. By analyzing these patterns, we can gain insights into how users interact with the device and use this information to inform our future decisions and strategies.
6. Conclusion (Act Phase)
Bellabeat's mission is to empower women by providing them with the data to discover themselves. Based on the results of our analysis, I would recommend that we utilize our own tracking data for further analysis to better understand our target audience. The datasets we used had a small sample and lacked demographic details of users, which may have introduced biases in our analysis. As our main target audience is young and adult women, it is important that we continue to identify trends and insights that can inform a marketing strategy specifically tailored to them.
By utilizing our own tracking data, we can gather more accurate and representative information about our users' behaviors and preferences. This will allow us to develop more personalized and effective marketing campaigns that resonate with our target audience and help them discover the full potential of our products. Overall, our goal is to continue advancing Bellabeat's mission of empowering women through data-driven insights and innovation.
Based on our analysis, we recommend the following:
Daily notification on steps and posts on app: This recommendation is focused on encouraging customers to reach at least the daily recommended steps by CDC, which is 8,000. We can send alarms to customers who haven't reached the goal, as well as create posts on the app explaining the benefits of reaching that goal. We can also highlight the positive correlation between steps and calories. This strategy will motivate customers to walk more and also keep them informed about the benefits of an active lifestyle.
Reward system: This recommendation suggests creating a game on the app that encourages customers to maintain a certain level of activity for a period of time. The game would consist of different levels based on the amount of steps walked every day. Customers would win stars for each level that they complete, and these stars would be redeemable for merchandise or discounts on other Bellabeat products. This strategy will encourage customers to be more active and also reward them for their efforts.
Notification and sleep techniques: This recommendation suggests offering helpful resources to customers to help them sleep better. Customers could set a desired time to go to sleep and receive a notification minutes before to prepare for sleep. We could also offer breathing advice, podcasts with relaxing music, and sleep techniques to help customers improve their sleep habits. This strategy will help customers to improve their overall health and well-being.
By implementing these recommendations, Bellabeat can empower women by providing them with the data and tools they need to discover themselves and improve their health and wellness.
Based on our analysis, we discovered that only 50% of users used their device on 21 or more of the 31 days, and only 36% wear the device all day when they do use it. To promote Bellabeat's products, we can highlight their water resistance and long-lasting batteries, as well as their fashionable and elegant designs. Customers can confidently wear these products every day to any occasion without having to worry about battery life.