Cyclistic Customer Usage Trends in 2022
Ask
The Marketing Department of Cyclistic has asked the Analytics team to deliver actionable insights into the usage characteristics of Cyclistic’s customer base in order to determine how to convert existing Casual users into Membership holders, who represent a more profitable customer segment.
Setup
library(tidyverse)
library(lubridate)
Prepare
The data for this project was obtained from the company’s records in the form of 12 zip-compressed CSV files. A bash script downloaded all files pertaining to calendar year 2022 from the server; they were stored locally in an RStudio project folder and managed on the team’s local server as a Git repository. Only members of the team were given access to the source data. The CSV files were set to read-only to maintain the integrity of the source data. After the CSV files were reviewed, the zip files were removed from the repository.
#!/bin/bash
# Import data from source
# Source URL: https://divvy-tripdata.s3.amazonaws.com/index.html
# Set the base URL for the zip files
base_url="https://divvy-tripdata.s3.amazonaws.com/"
# Iterate through months 1 to 12
for month in {1..12}; do
# Add leading zero if the month is single digit
if ((month < 10)); then
month="0${month}"
fi
# Construct the file URL
file_url="${base_url}2022${month}-divvy-tripdata.zip"
# Download the file using curl
curl -O "$file_url"
done
The individual datasets were concatenated using the rbind() function with the following script:
# Set the directory path where the CSV files are located
directory <- "~/Desktop/gdac_cs_1/source"
# Get a list of all CSV files in the directory
csv_files <- list.files(directory, pattern = "\\.csv$", full.names = TRUE)
# Initialize an empty data frame to store the concatenated data
combined_data <- data.frame()
# Loop through each CSV file and concatenate the data
for (file in csv_files) {
# Read the CSV file
data <- read.csv(file)
# Concatenate the data using rbind
combined_data <- rbind(combined_data, data)
}
The complete dataset was written to the file 2022-all-divvy-tripdata.csv, which was also set to read-only. The dataset was then assigned to the object “df”.
df <- read.csv("~/R/gdac_cs_1/source/2022-all-divvy-tripdata.csv")
names(df)
[1] "...1" "ride_id" "rideable_type" "started_at" "ended_at" "start_station_name"
[7] "start_station_id" "end_station_name" "end_station_id" "start_lat" "start_lng" "end_lat"
[13] "end_lng" "member_casual"
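The unnamed first column, shown as "...1" above, is consistent with R's row names having been saved alongside the data when the combined file was written. The original write step is not shown, so this is an assumption. The following is a minimal sketch, using two small hypothetical files in a temporary directory rather than the project data, of combining files in one pass and writing the result without that extra column:

```r
# Hypothetical stand-in data: two tiny monthly files in a temporary directory
src_dir <- file.path(tempdir(), "source_demo")
dir.create(src_dir, showWarnings = FALSE)
write.csv(data.frame(ride_id = c("a1", "a2"), member_casual = c("member", "casual")),
          file.path(src_dir, "202201-demo.csv"), row.names = FALSE)
write.csv(data.frame(ride_id = "b1", member_casual = "member"),
          file.path(src_dir, "202202-demo.csv"), row.names = FALSE)

# Read every CSV into a list, then bind once instead of growing a data frame in a loop
csv_files <- list.files(src_dir, pattern = "\\.csv$", full.names = TRUE)
combined_demo <- do.call(rbind, lapply(csv_files, read.csv))

# row.names = FALSE keeps R's row index out of the file,
# so no unnamed first column appears on re-import
out_file <- file.path(tempdir(), "all-demo.csv")
write.csv(combined_demo, out_file, row.names = FALSE)

reimported <- read.csv(out_file)
names(reimported)
```

Binding once with do.call(rbind, ...) gives the same result as the loop but avoids repeatedly copying a growing data frame, which matters with several million rows.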
Process
Upon inspection of the data, the naming conventions were found to be acceptable for the analysis, and it was determined that there was no obvious concern regarding bias in the population. The team chose to create new variables to gain clearer insight into the dataset: trip_duration, day_of_week, day_of_year, and month_of_year. It was also determined that observations with a rideable_type of docked_bike needed to be cleaned out, as this value was not pertinent to the business task.
colnames(df)[1] <- "id"
df$trip_duration <- as.numeric(difftime(df$ended_at, df$started_at, units = "hours"))
df$day_of_week <- wday(df$started_at)
df$day_of_year <- yday(df$started_at)
df$month_of_year <- month(df$started_at)
names(df)
[1] "id" "ride_id" "rideable_type" "started_at" "ended_at" "start_station_name"
[7] "start_station_id" "end_station_name" "end_station_id" "start_lat" "start_lng" "end_lat"
[13] "end_lng" "member_casual" "day_of_week" "trip_duration" "day_of_year" "month_of_year"
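The derived variables can be checked on a couple of made-up timestamps. This is a sketch with hypothetical values, assuming the started_at and ended_at columns hold text in "YYYY-MM-DD HH:MM:SS" form; parsing them explicitly with ymd_hms() avoids relying on implicit conversion. With lubridate's default settings, wday() numbers the week from Sunday (1) to Saturday (7).

```r
library(lubridate)

# Hypothetical sample rows, not taken from the dataset
demo <- data.frame(
  started_at = c("2022-01-02 10:00:00", "2022-07-04 18:15:00"),
  ended_at   = c("2022-01-02 10:30:00", "2022-07-04 18:30:00")
)

# Parse the text timestamps explicitly before doing date arithmetic
start_time <- ymd_hms(demo$started_at)
end_time   <- ymd_hms(demo$ended_at)

demo$trip_duration <- as.numeric(difftime(end_time, start_time, units = "hours"))
demo$day_of_week   <- wday(start_time)   # 1 = Sunday under the default week start
demo$day_of_year   <- yday(start_time)
demo$month_of_year <- month(start_time)

demo
```

Passing label = TRUE to wday() or month() returns day and month names instead of numbers, which can make tables and plot axes easier to read.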
An initial summary was run on the numeric variable trip_duration to determine the distribution of the data across the dataset, followed by a boxplot.
summary(df$trip_duration)
     Min.   1st Qu.    Median      Mean   3rd Qu.      Max.
-172.5558    0.0969    0.1714    0.3241    0.3078  689.7875
A table comparing the categorical variables member_casual and rideable_type showed a disparity in the docked_bike value between subscriber types: docked_bike rides were recorded only for casual riders.
# Table for member_casual v. rideable_type
table(df$member_casual, df$rideable_type)
         classic_bike docked_bike electric_bike
  casual       891459      177474       1253099
  member      1709755           0       1635930
boxplot(df$trip_duration)
The boxplot showed notable outliers above the third quartile and revealed that there were ride duration values of zero or less. The following histogram gave further insight into the trends in trip duration.
df %>%
  filter(trip_duration <= 5,
         trip_duration > 0) %>%
ggplot(aes(x = trip_duration))+
geom_histogram(bins = 10)+
labs(x = "Ride Duration in Hours", y = "Count", title = "Histogram of Ride Durations")
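Before choosing cut-offs, it can help to count how many rides fall outside the plausible range. The following is a minimal sketch on hypothetical durations (in hours), not the actual data; on the real dataset the same expressions would be applied to df$trip_duration.

```r
# Hypothetical trip durations in hours, including a negative, a zero and long outliers
trip_duration <- c(-0.20, 0.00, 0.10, 0.17, 0.31, 0.45, 0.90, 2.50, 26.00)

n_nonpositive <- sum(trip_duration <= 0)   # zero or negative durations
n_over_hour   <- sum(trip_duration > 1)    # rides longer than one hour

# Share of rides that a 0 < duration <= 1 hour filter would keep
share_kept <- mean(trip_duration > 0 & trip_duration <= 1) * 100

c(nonpositive = n_nonpositive,
  over_one_hour = n_over_hour,
  percent_kept = round(share_kept, 1))
```

Reporting these counts alongside the filter makes it clear how much of the data a given threshold removes.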
Since the vast majority of rides lasted under 1 hour, the data was filtered to rides of at most 1 hour to more accurately reflect trends, and any zero or negative trip durations were removed. Observations with docked_bike in rideable_type were filtered out as well.
# Filter out docked bikes, limit ride duration to target values, and reset the data frame
df <- df %>%
  filter(trip_duration > 0,
         trip_duration <= 1,
         rideable_type != "docked_bike")
df %>%
ggplot(aes(x = trip_duration))+
geom_histogram(bins = 10)+
labs(x = "Ride Duration in Hours", y = "Count", title = "Histogram of Ride Duration")
Analyze
Tables were generated from the categorical variables member_casual, rideable_type, day_of_week, and month_of_year.
# Table for member_casual v. month_of_year
addmargins(table(df$member_casual, df$month_of_year))
               1       2       3       4       5       6       7       8       9      10      11      12     Sum
  casual   16995   19266   76575  108087  236938  320073  354181  316799  265353  189089   92633   42249 2038238
  member   84686   93690  192893  243453  351689  396756  413990  423723  401602  347358  235676  136165 3321681
  Sum     101681  112956  269468  351540  588627  716829  768171  740522  666955  536447  328309  178414 5359919
# Table for member_casual v. day_of_week
addmargins(table(df$member_casual, df$day_of_week))
               1       2       3       4       5       6       7     Sum
  casual  330201  241640  236151  247859  278413  298235  405739 2038238
  member  383094  470266  515720  521024  529003  463922  438652 3321681
  Sum     713295  711906  751871  768883  807416  762157  844391 5359919
# Prop table for member_casual v. day_of_week
prop.table(table(df$member_casual, df$day_of_week),2)*100
                1        2        3        4        5        6        7
  casual 46.29235 33.94268 31.40845 32.23624 34.48198 39.13039 48.05108
  member 53.70765 66.05732 68.59155 67.76376 65.51802 60.86961 51.94892
# Prop table for member_casual v. month_of_year
prop.table(table(df$member_casual, df$month_of_year),2)*100
                1        2        3        4        5        6        7        8        9       10       11       12
  casual 16.71404 17.05620 28.41710 30.74671 40.25266 44.65123 46.10705 42.78050 39.78574 35.24840 28.21519 23.68032
  member 83.28596 82.94380 71.58290 69.25329 59.74734 55.34877 53.89295 57.21950 60.21426 64.75160 71.78481 76.31968
# Prop table for member_casual v. rideable_type
prop.table(table(df$member_casual, df$rideable_type),2)*100
         classic_bike electric_bike
  casual     32.64091      42.80911
  member     67.35909      57.19089
The tables show that members account for the majority of rides in every month and on every day of the week. The difference is less marked on weekends (days 1 and 7, i.e. Sunday and Saturday under wday()'s default numbering). Casual riders also prefer electric bikes over classic bikes, while the opposite is true for members. It is not surprising that ridership for both groups increases in late spring and peaks during the summer months.
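The proportion tables above use column percentages (margin 2), i.e. each rider group's share of the rides on a given bicycle type. The statement about preference within a group corresponds more directly to row percentages (margin 1). The following sketch uses the pre-filter counts for classic and electric bikes reported in the Process section; the filtered dataset would give somewhat different figures.

```r
# Counts copied from the unfiltered member_casual v. rideable_type table (docked_bike omitted)
rides <- matrix(
  c(891459, 1253099,
    1709755, 1635930),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("casual", "member"), c("classic_bike", "electric_bike"))
)

# Row percentages: how each rider group splits its rides between bicycle types
row_pct <- prop.table(rides, margin = 1) * 100
round(row_pct, 1)
```

Each row sums to 100, so the larger value in a row indicates the bicycle type that group used more often.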
Share
#Average trip duration v. user type
df %>%
group_by(member_casual) %>%
summarise(mean_trip_dur = mean(trip_duration)*60) %>%
ggplot(aes(member_casual, mean_trip_dur, fill = member_casual))+
geom_col()+
labs(x = "Subscriber Type", y = "Minutes", title = "User Type v. Mean Trip Duration in Minutes", fill = "Subscriber Type")
# Median trip duration v. member type
df %>%
group_by(member_casual) %>%
summarise(median_trip_dur = median(trip_duration)*60) %>%
ggplot(aes(member_casual, median_trip_dur, fill = member_casual))+
geom_col()+
labs(x = "Subscriber Type", y = "Minutes", title = "User Type v. Median Trip Duration in Minutes", fill = "Subscriber Type")
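The numbers behind the two bar charts can also be inspected directly. The following is a minimal sketch on hypothetical durations (in hours), not the actual data, that computes the mean and median in minutes for each subscriber type in a single summary table:

```r
library(dplyr)

# Hypothetical durations in hours for a handful of rides
demo <- data.frame(
  member_casual = c("casual", "casual", "casual", "member", "member", "member"),
  trip_duration = c(0.20, 0.30, 0.70, 0.10, 0.15, 0.35)
)

# One row per subscriber type, with durations converted from hours to minutes
duration_summary <- demo %>%
  group_by(member_casual) %>%
  summarise(
    mean_minutes   = mean(trip_duration) * 60,
    median_minutes = median(trip_duration) * 60,
    rides          = n()
  )

duration_summary
```

Showing the mean and median side by side indicates how strongly longer rides pull the average upward within each group.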
# Plots of resulting tables
df %>%
count(member_casual, rideable_type) %>%
group_by(rideable_type) %>%
ggplot(aes(rideable_type, n, fill = member_casual))+
geom_col()+
labs(x = "Bicycle Type", y = "Total Rides", title = "Total Number of Rides in 2022 per Bicycle and Subscriber Type", fill = "Subscriber Type")
df %>%
filter(!rideable_type == "docked_bike") %>%
count(member_casual, rideable_type) %>%
group_by(rideable_type) %>%
mutate(prop = n / (sum(n))*100) %>%
ggplot(aes(rideable_type, prop, fill = member_casual))+
geom_col()+
labs(title = "Subscriber Preference of Bicycle Type as Percentage", x = "Bicycle Type", y = "Percentage", fill = "Subscriber Type")
# Plots of resulting tables
df %>%
count(member_casual, rideable_type) %>%
group_by(member_casual) %>%
mutate(prop = n / (sum(n))*100) %>%
ggplot(aes(member_casual, prop, fill = rideable_type))+
geom_col()+
labs(x = "Subscriber Type", y = "Percentage of Total Rides", title = "Percentage of Total Number of Rides in 2022 per Subscriber Type", fill = "Bicycle Type")
# Plot of Total Annual Rides per Day of the Week
df %>%
count(member_casual, day_of_week) %>%
group_by(day_of_week) %>%
ggplot(aes(day_of_week, n, fill = member_casual))+
geom_col()+
labs(x = "Day of the Week", y = "Total Rides", title = "Total Annual Rides per Day of the Week", fill = "Subscriber Type")
# Plots of Proportion of Subscriber Type Usage v. Day of the Week
df %>%
count(member_casual, day_of_week) %>%
group_by(day_of_week) %>%
mutate(prop = n / (sum(n))*100) %>%
ggplot(aes(day_of_week, prop, fill = member_casual))+
geom_col()+
labs(x = "Day of the Week", y = "Percentage", title = "Proportion of Subscriber Type Usage v. Day of the Week", fill = "Subscriber Type")
# Plot of Total Annual Rides per Month of the Year
df %>%
count(member_casual, month_of_year) %>%
group_by(month_of_year) %>%
ggplot(aes(month_of_year, n, fill = member_casual))+
geom_col()+
labs(x = NULL, y = "Total Rides", title = "Total Annual Rides per Month of Year", fill = "Subscriber Type")
Act
The Analytics team has determined that there are clearly defined differences in the way that the two subscriber groups utilize the services of Cyclistic.
Since the goal of this project was to determine how Casual subscribers could be encouraged to become more profitable Member subscribers, key characteristics of their behavior should be considered. Casual subscribers tend to be more strongly affected by the onset of moderate weather, so a marketing campaign aimed at existing Casual subscribers in late winter to early spring would likely be most effective, as this is when they are preparing to get out and use Cyclistic’s services more actively. Existing Members appear to be less affected by weather changes, as they tend to rely on the service for their daily commute.
The campaign should point out the likely cost savings of a Member subscription, as Casual riders trend towards slightly longer ride durations. If they would save money per ride by committing to a Member subscription, and no longer need to be as concerned about the length of their rides, they may find the new arrangement more appealing and of greater personal value. Casual subscribers also tend to prefer the electric bikes even though they cost more per minute than the classic bikes. Offering a 30-day discount on electric bike shares for new Member subscribers would likely be an incentive to sign up for an annual Member subscription.