Case Study: How Does a Bike-Share Navigate Speedy Success?

Introduction

This exploratory analysis is my case study for the Capstone Project requirement of the Google Data Analytics Professional Certificate. It uses a bike-share company's data on its customers' trips over 12 months (January 2021 - December 2021).

Link to the scope of work.

About the company

Cyclistic is a bike-share program that features more than 5,800 bicycles and 600 docking stations. Cyclistic sets itself apart by also offering reclining bikes, hand tricycles, and cargo bikes, making bike-share more inclusive to people with disabilities and riders who can’t use a standard two-wheeled bike. The majority of riders opt for traditional bikes; about 8% of riders use the assistive options. Cyclistic users are more likely to ride for leisure, but about 30% use them to commute to work each day.

In 2016, Cyclistic launched a successful bike-share offering. Since then, the program has grown to a fleet of 5,824 bicycles that are geotracked and locked into a network of 692 stations across Chicago. The bikes can be unlocked from one station and returned to any other station in the system anytime.

Until now, Cyclistic’s marketing strategy relied on building general awareness and appealing to broad consumer segments. One approach that helped make these things possible was the flexibility of its pricing plans: single-ride passes, full-day passes, and annual memberships. Customers who purchase single-ride or full-day passes are referred to as casual riders. Customers who purchase annual memberships are Cyclistic members.

Cyclistic’s finance analysts have concluded that annual members are much more profitable than casual riders. Although the pricing flexibility helps Cyclistic attract more customers, the director of marketing believes that maximizing the number of annual members will be key to future growth. Rather than creating a marketing campaign that targets all-new customers, she believes there is a very good chance to convert casual riders into members. She notes that casual riders are already aware of the Cyclistic program and have chosen Cyclistic for their mobility needs.

Processes for data analysis phases:

1. ASK questions and define the problem (Business Challenge/Objective/Question)

A. Identify the business task: Design marketing strategies aimed at converting casual riders into annual members. To do that, however, the marketing analyst team needs to better understand:

How do annual members and casual riders use Cyclistic bikes differently?
Why would casual riders buy Cyclistic annual memberships?
How can Cyclistic use digital media to influence casual riders to become members?

B. Consider key stakeholders:

The director of marketing: She is responsible for the development of campaigns and initiatives to promote the bike-share program. These may include email, social media, and other channels.
Cyclistic executive team: The notoriously detail-oriented executive team will decide whether to approve the recommended marketing program.
Cyclistic marketing analytics team: A team of data analysts who are responsible for collecting, analyzing, and reporting data that helps guide Cyclistic marketing strategy.

2. PREPARE data by collecting and storing the information (Data generation, collection, storage, and data management)

A. Download data and store it appropriately: I downloaded Cyclistic’s historical trip data, created a folder on my secured drive to house the files using appropriate file-naming conventions, and created a subfolder for the original .csv files so that I keep an unmodified copy of the data.

B. Identify how it’s organized: The data were organized as separate files by month and year. The data were saved as .csv files within .zip folders. The data included the following attributes:

ride_id: a unique ID per ride
rideable_type: the type of bicycle
started_at: the date and time the ride started
ended_at: the date and time the ride ended
start_station_name: the name of the start station
start_station_id: a unique ID for the start station
end_station_name: the name of the end station
end_station_id: a unique ID for the end station
start_lat: the latitude of the start station
start_lng: the longitude of the start station
end_lat: the latitude of the end station
end_lng: the longitude of the end station
member_casual: the type of membership

C. Sort and filter the data: I focus on the 2021 period because it is the most relevant period for the business task.

D. Data integrity:

Physical integrity: the data is stored on a safe, reliable physical platform.
Logical integrity: the data is accurate, correct, unchanged, and meets necessary compliance standards.

E. Data bias: The data is representative of the population as a whole and can be used for trend identification in this project.

F. Data credibility: Good data must be ROCCC (Reliable, Original, Comprehensive, Current, and Cited). Here the ROCCC process is used to check data credibility.

Reliable: The data can be considered accurate, complete, and unbiased information that's been vetted and proven fit for use.
Original: The data is collected directly from the original source.
Comprehensive: The data contains all critical information needed to answer the question or find the solution.
Current: The data is released on a monthly schedule. Therefore, the data is current and relevant.
Cited: The data is properly cited in the project.

G. Data limitation:

The data does not have riders' unique IDs. Therefore, the data can only answer questions about how many trips were made, not questions about how many individual riders made them. Moreover, the data cannot show how many trips each individual rider makes within each user type, and so cannot show how much each individual rider spends. That information would give insight into whether casual riders could afford to purchase an annual subscription.
The data does not have riders' demographics such as age, gender, occupation, and geographical location. The riders' age, gender, and occupation would provide insight into what marketing tactics to use, while the riders' geographical location would provide insight into whether a rider is a resident or a tourist.

H. Data ethics and privacy:

The data has been made available by Motivate International Inc. under this license.
Data-privacy rules prohibit anyone from using riders’ personally identifiable information. This means that no one will be able to connect pass purchases to credit card numbers to determine whether casual riders live in the Cyclistic service area or whether they have purchased multiple single passes.

3. PROCESS data by cleaning and checking the information (Data cleaning/data integrity)

A. Choose the tools: The combined size of the 12 datasets is close to 1 GB. Cleaning the data in a spreadsheet would be time-consuming and slow compared to SQL or R. I chose R simply because I can do data wrangling, analysis, and visualization on the same platform.

B. Transform the data: Involves the processes such as data migration, data warehousing, data integration, and data wrangling.

1. Load Packages

In the first step, I need to load some R packages.

Note: To see my R code with explanations, click View Notebook below.

2. Import Data

Import customer's historical trip details over 12 months (January 2021 - December 2021)

3. Inspect Data Structures

Since I am working with a large data set, I need to get an overview of its structure and look for incongruencies.

Conclusion:

All datasets have the same number of columns and the same column names.
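
A check like this can be sketched in a few lines. The sketch below is written in Python for illustration (it is not the R code used in this project), and the two tiny monthly files and the names `MONTHLY_CSV` and `read_header` are hypothetical stand-ins for the real monthly .csv files.

```python
import csv
import io

# Hypothetical stand-ins for two monthly files; real files would be opened from disk.
MONTHLY_CSV = {
    "2021-01": "ride_id,rideable_type,started_at,ended_at,member_casual\n"
               "A1,classic_bike,2021-01-05 08:00:00,2021-01-05 08:12:00,member\n",
    "2021-02": "ride_id,rideable_type,started_at,ended_at,member_casual\n"
               "B1,electric_bike,2021-02-09 17:30:00,2021-02-09 17:55:00,casual\n",
}


def read_header(csv_text):
    """Return the list of column names from the first row of a CSV document."""
    return next(csv.reader(io.StringIO(csv_text)))


headers = {month: read_header(text) for month, text in MONTHLY_CSV.items()}
reference = headers["2021-01"]

# Months whose column names (or column order) differ from the reference month.
mismatched = [month for month, cols in headers.items() if cols != reference]

print("Columns per file:", {month: len(cols) for month, cols in headers.items()})
print("Files that differ from 2021-01:", mismatched)
```

An empty list of mismatched files corresponds to the conclusion that every monthly file shares the same columns.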

4. Compare Data Type

Inspect the data type and look for incongruencies.

Conclusion:

The data type in all datasets is consistent.

5. Combine multiple Datasets

After inspecting the data structures and data types and finding no mismatches, I can proceed to the next step, merging the data, so that I can analyze it all in one go.
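
As a rough illustration of the merge (in Python rather than the R used in this project), monthly files with identical headers can simply be stacked row by row. The sample rows and the helper name `combine` are hypothetical.

```python
import csv
import io

# Hypothetical monthly files that share the same header.
MONTHLY_CSV = [
    "ride_id,started_at,ended_at,member_casual\n"
    "A1,2021-01-05 08:00:00,2021-01-05 08:12:00,member\n"
    "A2,2021-01-06 09:00:00,2021-01-06 09:30:00,casual\n",
    "ride_id,started_at,ended_at,member_casual\n"
    "B1,2021-02-09 17:30:00,2021-02-09 17:55:00,casual\n",
]


def combine(csv_texts):
    """Stack the rows of several CSV documents that share the same header."""
    combined = []
    for text in csv_texts:
        # DictReader yields one dict per data row, keyed by column name.
        combined.extend(csv.DictReader(io.StringIO(text)))
    return combined


all_trips = combine(MONTHLY_CSV)
print(len(all_trips), "rows in the combined data set")
```

Stacking is only safe once the column names and types have been confirmed to match, which is why the inspection steps come first.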

6. Rename Columns

I changed the names of some columns for better readability.

7. Add New Columns

I added a column as follows:

ride_length : for ride length in minutes.
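
A minimal sketch of how such a column can be derived from the start and end timestamps, shown in Python for illustration rather than the R code used here. The timestamp format and the function name `ride_length_minutes` are assumptions.

```python
from datetime import datetime

# Assumed timestamp layout, e.g. "2021-06-01 17:00:00".
TIMESTAMP_FORMAT = "%Y-%m-%d %H:%M:%S"


def ride_length_minutes(started_at, ended_at):
    """Return the ride length in minutes (negative if the end precedes the start)."""
    start = datetime.strptime(started_at, TIMESTAMP_FORMAT)
    end = datetime.strptime(ended_at, TIMESTAMP_FORMAT)
    return (end - start).total_seconds() / 60


print(ride_length_minutes("2021-06-01 17:00:00", "2021-06-01 17:23:30"))
```

Because the value is a plain difference of two timestamps, a record whose end time is earlier than its start time produces a negative ride length, which is one kind of dirty data to look for later.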

Note: I could estimate the ride distance by calculating the distance between the start and end points (latitude, longitude). But what about a trip that starts and ends at the same station? A user can ride around the city and return to the start station, which makes the calculated ride distance zero. A ride speed field would have the same problem, since speed = distance / time elapsed would also be zero. So I did not create ride distance and ride speed fields, to avoid creating biased data.
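
To make the problem concrete, here is a small Python sketch of a straight-line (great-circle) distance using the haversine formula. It is an illustration only, not something computed in this project; the coordinates are hypothetical and the Earth radius is the usual mean value.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lng1, lat2, lng2):
    """Great-circle distance in kilometres between two (lat, lng) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lng2 - lng1)
    a = (math.sin(d_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


# Hypothetical round trip: the ride starts and ends at the same station.
round_trip = haversine_km(41.88, -87.62, 41.88, -87.62)
print(round_trip)
```

For the round trip the start and end points coincide, so the computed distance is zero no matter how far the rider actually travelled, and any speed derived from it would be zero as well.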

C. Check the data for errors: Involves the processes such as data validation

8. Identifying dirty data

I inspected the data to find out if there were any errors in the data.

A. Find duplicate values in the trip_id column.

Conclusion:

No duplicate records were found.
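
One simple way to run such a check, sketched in Python for illustration (the project itself used R): count each ID and report those seen more than once. The sample IDs and the name `find_duplicates` are hypothetical.

```python
from collections import Counter

# Hypothetical trip IDs; "B2" appears twice on purpose.
trip_ids = ["A1", "B2", "C3", "B2"]


def find_duplicates(ids):
    """Return the sorted list of IDs that occur more than once."""
    counts = Counter(ids)
    return sorted(value for value, n in counts.items() if n > 1)


print(find_duplicates(trip_ids))
```

An empty result corresponds to the conclusion that no duplicate records exist.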

B. Find out the number of mean, min, max, empty, missing, and unique values in each variable.

Conclusion:

The start_station, start_station_id, end_station, and end_station_id columns have empty values (blank cells).
The end_lat and end_lng columns have 4771 missing values (NA).
The ride_length column has a minimum value of -58 minutes. The data source explains that any trips below 1 minute in length were potentially false starts or users trying to re-dock a bike to ensure it was secure.
The ride_length column has a maximum value of 55944 minutes, which is about 932 hours or nearly 39 days. I assume this trip was taken by staff as they serviced and inspected the system, or that the bike was in the warehouse for maintenance. So, for this case, I investigated in depth by finding out which stations this trip involved. I identify the outliers with a statistical calculation in a later step.

C. Find strange station names.

Conclusion:

Among the top 30 rows by ride_length, many have the station name "Base - 2132 W Hubbard Warehouse" (station id: "Hubbard Bike-checking (LBS-WH-TEST)"), which means these trips were taken by staff. Therefore, these records should be removed, because they are not user records.

D. Find out the outliers.

Conclusion:

The 0% value (the minimum) is -58 minutes and the 1% value is 0.5 minutes. As mentioned previously, any trips below 1 minute in length were potentially false starts or users trying to re-dock a bike to ensure it was secure.
The difference between the 2% and 100% values is about 55944 minutes (almost 39 days), which is not realistic, while the difference between the 2% and 99% values is about 128 minutes, which makes much more sense. Based on this, in this statistical analysis of ride length, the 100% value will be considered an outlier.
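
The percentile comparison can be sketched as follows. This is a Python illustration with made-up ride lengths, not the R calculation used in this project; only the extreme value 55944 is taken from the text, and the choice of the "inclusive" quantile method is an assumption.

```python
import statistics

# Hypothetical ride lengths in minutes: 1..100 plus the extreme value from the text.
ride_lengths = [float(m) for m in range(1, 101)] + [55944.0]

# quantiles(..., n=100) returns the 99 cut points between the 1% and 99% levels.
cut_points = statistics.quantiles(ride_lengths, n=100, method="inclusive")
p01, p99 = cut_points[0], cut_points[98]
maximum = max(ride_lengths)

print("1% value:", p01)
print("99% value:", p99)
print("100% value (max):", maximum)
print("Gap between 99% and 100%:", maximum - p99)
```

A very large gap between the 99% value and the maximum, compared with the spread of the rest of the data, is the signal that the top of the distribution contains outliers.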

D. Clean the data: Involves the processes such as data cleansing

9. Remove Dirty Data

After identifying the errors in the data, I now need to remove them all to get clean data.
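
A minimal sketch of such a filter, in Python for illustration. The exact rules applied in the R notebook are not reproduced here: the 1-minute lower bound and the warehouse station name come from the findings above, while the upper cutoff `MAX_MINUTES`, the sample rows, and the helper `is_clean` are hypothetical.

```python
WAREHOUSE_STATION = "Base - 2132 W Hubbard Warehouse"  # staff/test station named above
MIN_MINUTES = 1.0    # trips under 1 minute are treated as false starts or re-docks
MAX_MINUTES = 130.0  # hypothetical upper cutoff; the real threshold is an assumption

# Hypothetical rows: (start_station, ride_length in minutes)
trips = [
    ("Streeter Dr & Grand Ave", 24.0),
    ("Streeter Dr & Grand Ave", -58.0),   # negative ride length
    ("Clark St & Elm St", 0.5),           # false start
    (WAREHOUSE_STATION, 45.0),            # staff record
    ("Clark St & Elm St", 55944.0),       # extreme outlier
    ("", 12.0),                           # blank station name
]


def is_clean(trip):
    """Keep rows with a named, non-warehouse station and a plausible ride length."""
    station, minutes = trip
    if not station or station == WAREHOUSE_STATION:
        return False
    return MIN_MINUTES <= minutes <= MAX_MINUTES


clean_trips = [trip for trip in trips if is_clean(trip)]
print(len(trips) - len(clean_trips), "rows removed,", len(clean_trips), "rows kept")
```

Keeping the thresholds as named constants makes it easy to rerun the cleaning with a different cutoff and compare the results.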

10. Remove Unused Columns

I removed some columns as follows:

I removed start_station_id and end_station_id because these columns duplicate the information in the start station name and end station name. Moreover, using the name as the station variable is easier to understand than the ID.
I removed trip_id and end_time because these columns are no longer needed.

11. Export Data

Now that the data is clean, it's time to export the data set as a CSV file for further analysis.

E. Document the cleaning process: I saved my R documentation of all data cleaning and manipulation, with code explanations, on my Kaggle.

4. ANALYZE data to find patterns, relationships, and trends. (Data exploration, visualization, and analysis)

A. Aggregate the data and perform calculations to identify trends and relationships:

I. Conduct descriptive analysis on the ride length (all figures in minutes).

1. The summary statistics on the ride length

mean: straight average (total ride length / number of rides).
sd: standard deviation of ride length.
min: smallest value of ride length.
median: midpoint value in the sorted array of ride length.
max: largest value of ride length.
mode: the value that appears most often in ride length.
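
These six statistics can be computed with a few standard-library calls. The sketch below is in Python for illustration (not the R code used in this project), the ride lengths are hypothetical, and the sample standard deviation is an assumed choice.

```python
import statistics

# Hypothetical ride lengths in minutes.
ride_lengths = [5.0, 10.0, 10.0, 15.0, 60.0]

summary = {
    "mean": statistics.mean(ride_lengths),      # total ride length / number of rides
    "sd": statistics.stdev(ride_lengths),       # sample standard deviation
    "min": min(ride_lengths),
    "median": statistics.median(ride_lengths),
    "max": max(ride_lengths),
    "mode": statistics.mode(ride_lengths),      # most frequent value
}

for name, value in summary.items():
    print(f"{name}: {value:.2f}")
```

In this toy data the single long ride pulls the mean well above the median, which is the same effect that long casual rides have on an average ride length.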

2. The Distribution of ride length

Analyzing the result:

The visualization displays the distribution of ride length for both member and casual riders, each with five summary statistics being emphasized: the median, first and third quartile (two hinges in the plot), minimum and maximum (two whiskers in the plot), as well as all outlying points (in black).
The result shows statistically that casual riders have longer ride length than member riders.

3. The average ride length by hour of the day

Analyzing the result:

For member riders: the average ride length changes smoothly and tends to be flat, with its lowest point at 5:00.
For casual riders: the average ride length is longer from 10:00 to 16:00. There is a long-term fluctuation that increases from 6:00 to 14:00 (upward trend) and decreases from 14:00 to 6:00 (downward trend).
Overall, casual riders have longer ride lengths than member riders.
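
The aggregation behind a chart like this is a group-by on user type and start hour. A minimal Python sketch follows for illustration; the rows and the helper name `average_by_hour` are hypothetical, and the project's own aggregation was done in R.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical rows: (user type, start time, ride length in minutes)
trips = [
    ("member", "2021-06-01 08:05:00", 12.0),
    ("member", "2021-06-01 08:40:00", 14.0),
    ("casual", "2021-06-01 14:10:00", 30.0),
    ("casual", "2021-06-01 14:45:00", 20.0),
    ("casual", "2021-06-01 08:15:00", 18.0),
]


def average_by_hour(rows):
    """Return {(user_type, hour): average ride length in minutes}."""
    totals = defaultdict(lambda: [0.0, 0])  # running [sum, count] per group
    for user_type, started_at, minutes in rows:
        hour = datetime.strptime(started_at, "%Y-%m-%d %H:%M:%S").hour
        bucket = totals[(user_type, hour)]
        bucket[0] += minutes
        bucket[1] += 1
    return {key: total / count for key, (total, count) in totals.items()}


averages = average_by_hour(trips)
for (user_type, hour), minutes in sorted(averages.items()):
    print(f"{user_type} {hour:02d}:00 -> {minutes:.1f} min")
```

The same grouping works for days of the week or months by swapping the key extracted from the start time.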

Analyzing the result:

There is a seasonal pattern that differs between weekdays (Monday to Friday) and the weekend (Saturday and Sunday).

Analyzing the result:

For member riders, on weekdays, the average ride length changes smoothly, with its lowest point at 5:00. There is a long-term fluctuation that increases from 5:00 to 17:00 (upward trend) and decreases from 17:00 to 5:00 (downward trend). On the weekend, the average ride length is longer from 11:00 to 17:00, and there is a long-term fluctuation that increases from 0:00 to 14:00 (upward trend) and decreases from 14:00 to 0:00 (downward trend).
For casual riders, on weekdays, the average ride length is longer from 10:00 to 15:00 and shorter from 4:00 to 8:00. There is a long-term fluctuation that increases from 6:00 to 13:00 (upward trend) and decreases from 13:00 to 6:00 (downward trend). On the weekend, the average ride length is longer from 9:00 to 17:00 and shorter from 0:00 to 5:00, and there is a long-term fluctuation that increases from 4:00 to 14:00 (upward trend) and decreases from 14:00 to 4:00 (downward trend).

Analyzing the result:

Casual and member riders have longer ride lengths on the weekend than on weekdays.
Overall, casual riders have longer ride lengths than member riders.

4. The average ride length by days of the week

Analyzing the result:

Casual and member riders have a similar pattern, where the average ride length on the weekend is slightly longer than on weekdays.
Overall, casual riders have longer ride lengths than member riders.

5. The average ride length by day

Analyzing the result:

The daily change in the average ride length has no trend, seasonality or cyclic behavior. There are random fluctuations and no strong patterns.
Overall, casual riders have longer ride lengths than member riders.

6. The average ride length by month

Analyzing the result:

For member riders: the average ride length changes smoothly and stays flat. There is no apparent trend.
For casual riders: the average ride length is longer from March to May and shorter from November to December. There is a long-term fluctuation that increases from January to May (upward trend) and decreases from June to December (downward trend).
Overall, casual riders have longer ride lengths than member riders.

II. Conduct descriptive analysis on the number of rides.

1. The number of rides

Analyzing the result:

Member riders have more rides than casual riders.

2. The number of rides by hour of the day

Analyzing the result:

The separate changing trends of the number of rides for member and casual riders are similar, where the number goes lowest at 3:00-4:00 and highest at 17:00. There is a long-term fluctuation that increases from 3:00-4:00 to 17:00 (upward trend) and decreases from 17:00 to 3:00-4:00 (downward trend).

Member riders have more rides than casual riders.

Analyzing the result:

There is a seasonal pattern that differs between weekdays (Monday to Friday) and the weekend (Saturday and Sunday).

Analyzing the result:

For member riders, on weekdays, the number of rides is lowest at 3:00 and highest at 17:00. It increases from 3:00 to 8:00 (upward trend), decreases from 8:00 to 10:00 (downward trend), increases again from 10:00 to 17:00, and decreases again from 17:00 to 3:00. On the weekend, the number is lowest at 4:00 and highest at 13:00, with a long-term fluctuation that increases from 4:00 to 13:00 (upward trend) and decreases from 13:00 to 4:00 (downward trend).
For casual riders, on weekdays, the number of rides is lowest at 4:00 and highest at 17:00, with a long-term fluctuation that increases from 4:00 to 17:00 (upward trend) and decreases from 17:00 to 4:00 (downward trend). On the weekend, the number is lowest at 4:00 and highest at 15:00, with a long-term fluctuation that increases from 4:00 to 15:00 (upward trend) and decreases from 15:00 to 4:00 (downward trend).

Analyzing the result:

Member and casual riders have more rides on weekdays than on the weekend.
On weekdays, member riders have more rides than casual riders.
On the weekend, casual riders have more rides than member riders.
Overall, member riders have more rides than casual riders.
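
Splitting ride counts into weekday and weekend groups only needs the day of the week of each start time. The following Python sketch is an illustration with hypothetical rows, not the R code used for the charts; `day_kind` is a hypothetical helper name.

```python
from collections import Counter
from datetime import datetime

# Hypothetical rows: (user type, start time)
trips = [
    ("member", "2021-06-01 08:05:00"),  # Tuesday
    ("member", "2021-06-02 17:10:00"),  # Wednesday
    ("casual", "2021-06-05 14:00:00"),  # Saturday
    ("casual", "2021-06-06 15:30:00"),  # Sunday
    ("member", "2021-06-06 11:00:00"),  # Sunday
]


def day_kind(started_at):
    """Classify a start time as 'weekday' (Mon-Fri) or 'weekend' (Sat-Sun)."""
    weekday = datetime.strptime(started_at, "%Y-%m-%d %H:%M:%S").weekday()
    return "weekend" if weekday >= 5 else "weekday"  # Monday is 0, Sunday is 6


ride_counts = Counter((user_type, day_kind(ts)) for user_type, ts in trips)
for key, count in sorted(ride_counts.items()):
    print(key, count)
```

Note that weekdays cover five days and the weekend only two, so comparing totals per group favors weekdays; dividing by the number of days in each group gives a per-day comparison.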

3. The number of rides by days of the week

Analyzing the result:

For member riders, Wednesday has the highest number of rides and Sunday the lowest. There is a long-term fluctuation that increases from Sunday to Wednesday (upward trend) and decreases from Wednesday to Sunday (downward trend).
For casual riders, Saturday has the highest number of rides and Tuesday the lowest. There is a long-term fluctuation that increases from Tuesday to Saturday (upward trend) and decreases from Saturday to Tuesday (downward trend).

4. The number of rides by day

Analyzing the result:

The separate changing trends of the number of rides for member and casual riders are similar, there is a clear and decreasing trend.
Overall, member riders have more rides than casual riders.

Analyzing the result:

The daily change during 2021 has no trend, seasonality or cyclic behavior. There are random fluctuations and no strong patterns.

5. The number of rides by month

Analyzing the result:

The separate trends in the number of rides for member and casual riders are similar: the number is lowest in the first two months and highest in mid-year.
For member riders, there is a long-term fluctuation that increases from February to August (upward trend) and decreases from August to December (downward trend).
For casual riders, there is a long-term fluctuation that increases from February to July (upward trend) and decreases from July to December (downward trend).

III. Conduct descriptive analysis on the bike type.

1. The number of bike type usage

Analyzing the result:

Classic bike is the most popular bike type.

2. The number of bike type usage by user type

Analyzing the result:

Classic bike is the most popular bike type.

IV. Conduct descriptive analysis on the station.

1. The number of stations

2. The top 10 most visited stations

3. The top 10 most visited stations by user type

4. Geographical distribution

5. SHARE data with the audience. (Communicating and interpreting results)

A. Create effective data visualizations: I am choosing Tableau as the platform lends itself to “drag and drop” functionality and allows one to create simple yet clear visuals and to join data from various sources.

B. Present the findings: I am choosing Google Slides for the presentation tool.

6. ACT on the data and use the analysis results. (Putting the insight to work to solve the problem)

A. Final conclusion based on the analysis:

1. Member

Ride length average : 13.02 minutes
Number of rides : 2,497,534
Busiest hour : 15:00-19:00
Busiest day : Monday, Tuesday, Wednesday, Thursday, and Friday. (Weekday)
Busiest month : June-October (Summer-Early Fall)
Popular bike : Classic bike
Most visited station : Stations along the coast and downtown.
Purpose : For leisure and commuting

2. Casual

Ride length average : 23.95 minutes
Number of rides : 1,984,785
Busiest hour : 15:00-19:00
Busiest day : Saturday and Sunday. (Weekend)
Busiest month : June-September (Late Spring-Summer)
Popular bike : Classic bike
Most visited station : Stations along the coast.
Purpose : For leisure
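
The headline figures above can be put side by side with a little arithmetic. The Python sketch below uses only the ride counts and average ride lengths reported in this conclusion; the variable names are illustrative.

```python
# Figures taken from the conclusion above.
member_rides = 2_497_534
casual_rides = 1_984_785
member_avg_minutes = 13.02
casual_avg_minutes = 23.95

total_rides = member_rides + casual_rides
member_share = 100 * member_rides / total_rides        # percent of all rides
casual_share = 100 * casual_rides / total_rides
length_ratio = casual_avg_minutes / member_avg_minutes  # casual vs member ride length

print(f"Total rides: {total_rides:,}")
print(f"Member share: {member_share:.1f}%  Casual share: {casual_share:.1f}%")
print(f"Casual rides last {length_ratio:.2f}x as long as member rides on average")
```

In other words, members account for a little over half of all rides, while the average casual ride lasts close to twice as long as the average member ride.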

B. Recommendations:

1. Best promotion time, bike type, and location for Casual
The best times to launch the new marketing campaign are between 15:00 and 19:00, on weekends and holidays, from June to September. The promotion should focus on the classic bike. A strategy targeted at stations along the coast will reach the maximum number of casual riders.

2. Improve the benefits for Member
Since casual riders have longer ride lengths than members, a longer ride-length limit should be offered as one of the member benefits. Most casual riders use Cyclistic’s bikes for leisure, so the marketing campaign should relate to health, for example health statistics on the mobile app for annual subscribers.

3. Launch the mobile app
Riders can create a profile that contains data such as rider ID, age, gender, occupation, and geographical location. This will give insight into what marketing tactics to use, while the purchase history will give insight into whether casual riders can afford to purchase an annual subscription.

C. Further explorations: Some level of identifiable data is needed to perform further analysis at a personal level. This data might include:

Riders’ ID. It will provide insight into how many trips are made by individual riders.
Riders’ demographics (age, gender, occupation, and geographical location). The riders’ age, gender, and occupation data will provide insight into what marketing tactics to use, while the riders’ geographical location will provide insight into whether the rider is a resident or a tourist.
Purchase history. It will provide insight into how much cost is spent per individual rider.

Acknowledgement

Thanks to the Google Data Analytics Professional Certificate provided by Coursera! By working through all the courses included in the Certificate and using the data analysis roadmap provided by the Capstone Project, I gained the key skills and resources to accomplish this case study independently.

Thank you all for reading!

ISRIL CANIAGO
