BUSINESS BACKGROUND¶
Pizza Hut is the first international pizza brand in Vietnam, present since 2006. With dynamic positioning targeting young customers, it has been the market leader in terms of both revenue and store count, with 130 stores.
During 2019-2023, Covid significantly changed consumer behavior in the F&B industry. Customers now seek not only a quality product but also a richer culinary experience. To maintain its leading position, Pizza Hut has implemented many renovation campaigns to adapt to these changing patterns.
In 2023, Pizza Hut opened two signature stores with new design concepts to improve dine-in quality. It also helps customers reduce queuing time through the Bring Your Own Device (BYOD) feature for ordering at the table, which in turn encourages customers to engage with online ordering on the website or through the in-house app.
In terms of R&D, the brand regularly introduces product innovations such as 'Hot pot pizza', 'Crispy shrimp pancake pizza' or 'Silkworm pizza'. This unique fusion practice has helped a Western-origin dish become more familiar to Vietnamese tastes.
Pizza Hut's communication activities also spark vibrant discussion on social networks thanks to youthful language, a unique approach, and novel ideas, in a context where fast-food brands mostly talk about promotions.
However, the market is becoming increasingly competitive due to the emergence of new competitors and the diversity of culinary options. Beyond the renovation activities mentioned, additional insights from data, such as Customer Lifetime Value (CLV) and customer churn rate, are critical for assessing business growth.
The Business Analyst team is required to report to the Head of Department an accurate evaluation of the company, a forecast of its future performance, and feasible recommendations.
OUTLINE¶
- Import Data, EDA and RFM Segmentation
- Import Data
- EDA
- RFM Segmentation
- Churn and CLV Prediction
- Churn Risk Modeling
- CLV Model
- Final Model for CLV and Churn Probability
- Churn Probability and CLV Model for Cluster 1 and 3
- Churn Probability and CLV Model for Cluster 0 and 2
- Additional Insights
- Retention Rate
- New Customer by Month
- Recommendations
1. Import Data, EDA and RFM Segmentation¶
Import Data¶
import numpy as np
import pandas as pd
from scipy import stats
%matplotlib inline
import datetime as dt
from IPython.display import display, HTML

# Function to display dataframes side by side
def display_side_by_side(*args):
    html_str = ''
    for df in args:
        html_str += df.to_html() + '\xa0\xa0\xa0'
    display(HTML('<div style="display: flex;">' + html_str + '</div>'))

pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
pd.set_option('display.float_format', lambda x: '%.3f' % x)

import warnings
warnings.filterwarnings('ignore')
# Import Data
from config import DATA_PATH
data = pd.read_csv(DATA_PATH, parse_dates=["TransactionDate"])
# Normalize timestamps to midnight (keep the date component only)
data['TransactionDate'] = data['TransactionDate'].dt.normalize()
# Exclude data from July (the data only runs up to 1st July)
data = data[data['TransactionDate'] < '2023-07-01']
EDA¶
data.drop(columns=['Unnamed: 0'], inplace=True)
data.info() # there are no missing values
data[['TransactionDate', 'SalesAmount']].describe()
<class 'pandas.core.frame.DataFrame'>
Index: 1397202 entries, 0 to 1397201
Data columns (total 9 columns):
 #   Column           Non-Null Count    Dtype
---  ------           --------------    -----
 0   BillID           1397202 non-null  int64
 1   Channel          1397202 non-null  object
 2   OrderFrom        1397202 non-null  object
 3   TransactionDate  1397202 non-null  datetime64[ns]
 4   SalesAmount      1397202 non-null  float64
 5   CustomerID       1397202 non-null  int64
 6   CustomerGender   1397202 non-null  object
 7   VoucherStatus    1397202 non-null  object
 8   Province         1397202 non-null  object
dtypes: datetime64[ns](1), float64(1), int64(2), object(5)
memory usage: 106.6+ MB
| | TransactionDate | SalesAmount |
|---|---|---|
| count | 1397202 | 1397202.000 |
| mean | 2022-08-01 12:10:43.237555968 | 310330.403 |
| min | 2021-10-01 00:00:00 | -31605.000 |
| 25% | 2022-02-22 00:00:00 | 178089.000 |
| 50% | 2022-07-21 00:00:00 | 261690.000 |
| 75% | 2023-01-09 00:00:00 | 364380.750 |
| max | 2023-06-30 00:00:00 | 57731681.000 |
| std | NaN | 317534.538 |
The minimum value of SalesAmount is negative, so we need to inspect the negative records and their values to decide how to handle them.
data[data['SalesAmount'] <= 0].head()
| | BillID | Channel | OrderFrom | TransactionDate | SalesAmount | CustomerID | CustomerGender | VoucherStatus | Province |
|---|---|---|---|---|---|---|---|---|---|
| 1052712 | 1052712 | Take Away | STORE | 2023-01-12 | -31605.000 | 1996642 | Female | No | Nothern Provinces |
| 1212340 | 1212340 | Take Away | STORE | 2023-03-29 | -1104.000 | 768491 | Male | No | Hanoi |
display_side_by_side(data[data['CustomerID'] == 1996642][['CustomerID', 'SalesAmount']],
data[data['CustomerID'] == 768491][['CustomerID', 'SalesAmount']])
| | CustomerID | SalesAmount |
|---|---|---|
| 289260 | 1996642 | 255165.000 |
| 1048367 | 1996642 | 3821364.000 |
| 1052712 | 1996642 | -31605.000 |
| | CustomerID | SalesAmount |
|---|---|---|
| 14910 | 768491 | 480696.000 |
| 258382 | 768491 | 527982.000 |
| 359281 | 768491 | 395003.000 |
| 515940 | 768491 | 390227.000 |
| 658327 | 768491 | 450617.000 |
| 790788 | 768491 | 483823.000 |
| 1133172 | 768491 | 666475.000 |
| 1186516 | 768491 | 232791.000 |
| 1212096 | 768491 | 279389.000 |
| 1212340 | 768491 | -1104.000 |
| 1233398 | 768491 | 242667.000 |
Based on these customers' other transactions, the two negative records likely contain erroneous sales amounts: even ignoring the minus sign, the values are very low compared with each customer's normal SalesAmount. So we'll drop these two records.
data = data[data['SalesAmount'] > 0]
data.nunique() # There is no duplicate value in billID
BillID             1397200
Channel                  3
OrderFrom                4
TransactionDate        638
SalesAmount         529684
CustomerID          718050
CustomerGender           3
VoucherStatus            2
Province                 4
dtype: int64
import matplotlib.pyplot as plt
import seaborn as sns

group = ["CustomerGender", "VoucherStatus", "Province", "OrderFrom", "Channel"]
for x in group:
    plt.figure(figsize=(7, 4))

    # Plotting the average sales amount
    avg_sales = data.groupby(x)['SalesAmount'].mean()
    plt.subplot(1, 2, 1)
    sns.barplot(x=avg_sales.index, y=avg_sales.values, palette='viridis')
    plt.title('Average Sales Amount per ' + x)
    plt.xticks(rotation=90)

    # Plotting the total sales amount
    total_sales = data.groupby(x)['SalesAmount'].sum()
    plt.subplot(1, 2, 2)
    sns.barplot(x=total_sales.index, y=total_sales.values, palette='viridis')
    plt.title('Total Sales Amount per ' + x)
    plt.xticks(rotation=90)

    plt.tight_layout()
    plt.show()
- Certain customer segments are more likely to buy pizza than others, with a slight majority of male customers; however, nearly 60% of bills do not reveal the customer's gender.
- Two cities provide 63.3% of sales, with Hanoi accounting for the largest share, followed by Ho Chi Minh City.
- Most of the revenue in Hanoi comes from stores, while the website is the "golden source" for Ho Chi Minh City.
- Customers take advantage of vouchers through the app and website, but still mostly order from stores; store orders have the lowest average amount.
- Takeaway and delivery account for nearly 98% of orders, although "Dine In" and "Take away" show opposite tendencies.
- Customers tend to spend more when dining in.
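Claims like "two cities provide 63.3% of sales" come down to a share-of-total computation per group. A minimal, self-contained sketch on toy data (the helper name and numbers are illustrative, not from the real dataset):

```python
import pandas as pd

def sales_share_by(df: pd.DataFrame, col: str) -> pd.Series:
    """Each group's share of total SalesAmount, largest first."""
    share = df.groupby(col)['SalesAmount'].sum() / df['SalesAmount'].sum()
    return share.sort_values(ascending=False)

# Toy data for illustration only (hypothetical values)
toy = pd.DataFrame({
    'Province': ['Hanoi', 'Hanoi', 'Ho Chi Minh City', 'Other'],
    'SalesAmount': [400.0, 200.0, 300.0, 100.0],
})
print(sales_share_by(toy, 'Province'))
```

Applying the same helper to the real `data` with `col='Province'` would reproduce the shares discussed above.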
# Count the number of orders per customer
order_counts = data.groupby('CustomerID')['BillID'].count()
# Filter customers with only one order
customers_with_one_order = order_counts[order_counts == 1]
# Get the count of customers with only one order
number_of_customers_with_one_order = len(customers_with_one_order)
print("Number of customers with only one order:", number_of_customers_with_one_order)
Number of customers with only one order: 495255
from datetime import timedelta
# Find the final date in the 'TransactionDate' column
final_date = data['TransactionDate'].max()
# Calculate the cutoff date for the last 6 months from the final date
cutoff_date = final_date - timedelta(days=180)
# Count customers with at least one order placed before the last-6-months cutoff
# (note: this includes customers who may also have ordered after the cutoff)
customers_before_cutoff = data[data['TransactionDate'] < cutoff_date].groupby('CustomerID').size()
# Get the count of such customers
number_of_customers_before_cutoff = len(customers_before_cutoff)
print("Number of customers with at least one order before the 6-month cutoff:", number_of_customers_before_cutoff)
Number of customers with at least one order before the 6-month cutoff: 575736
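The filter above counts every customer with at least one order before the cutoff, including those who ordered again afterwards. A stricter definition of inactivity looks only at each customer's last order date. A minimal, self-contained sketch on toy data (helper name and values are hypothetical):

```python
import pandas as pd

def customers_inactive_since(df: pd.DataFrame, cutoff) -> int:
    """Count customers whose LAST order falls strictly before `cutoff`."""
    last_order = df.groupby('CustomerID')['TransactionDate'].max()
    return int((last_order < cutoff).sum())

# Toy data: customer 2 is the only one whose last order predates the cutoff
toy = pd.DataFrame({
    'CustomerID': [1, 1, 2, 3],
    'TransactionDate': pd.to_datetime(['2022-01-01', '2023-05-01',
                                       '2022-06-01', '2023-06-01']),
})
print(customers_inactive_since(toy, pd.Timestamp('2023-01-01')))  # → 1
```

Running this helper on the real `data` would give a count no larger than the figure above.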
# Sort values to calculate time since last order
data_sorted = data.sort_values(['CustomerID', 'TransactionDate'])
# Lag transaction time
data_sorted['lag'] = data_sorted.groupby('CustomerID')['TransactionDate'].shift(1)
# Convert timedelta to days
data_sorted['diff'] = (data_sorted['TransactionDate'] - data_sorted['lag'])/np.timedelta64(1, 'D')
# Calculate the average time between orders
average_time_between_orders = data_sorted['diff'].mean()
print("Average time between customer orders:", average_time_between_orders)
Average time between customer orders: 79.64693513951262
# Customers who made at least one order in 2023
active_customers_2023 = data[data['TransactionDate'].dt.year == 2023].groupby('CustomerID').size()
# Get the count of customers active in 2023
number_of_active_customers_2023 = len(active_customers_2023)
print("Number of customers who made an order in 2023:", number_of_active_customers_2023)
Number of customers who made an order in 2023: 240184
sns.histplot(data_sorted['diff'].dropna(), kde=True)  # distplot is deprecated in recent seaborn
plt.show()
- Here we get a glimpse of the distribution of gaps between purchases: the majority of customers have a gap of less than 100 days, with an average of about 80 days.
- The majority of Pizza Hut's customers are one-time customers (495,255 customers).
Other Conclusion (From analysis in PowerBI):¶
- Pizza Hut is a "hot spot" during weekends and holiday seasons: customers mostly buy on Saturday and Sunday, and daily sales during weekends and holidays are higher than on normal days.
- The number of customers choosing Dine In tends to increase, while the number choosing Takeaway decreases.
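The weekend pattern reported from PowerBI could be cross-checked in pandas by averaging daily sales per weekday. A self-contained sketch on toy data (helper name and values are hypothetical):

```python
import pandas as pd

def avg_daily_sales_by_weekday(df: pd.DataFrame) -> pd.Series:
    """Average total SalesAmount per calendar day, grouped by weekday name."""
    daily = df.groupby(df['TransactionDate'].dt.date)['SalesAmount'].sum()
    daily.index = pd.to_datetime(daily.index)
    return daily.groupby(daily.index.day_name()).mean()

# Toy data: 2023-06-03 is a Saturday, 2023-06-05 a Monday
toy = pd.DataFrame({
    'TransactionDate': pd.to_datetime(['2023-06-03', '2023-06-03', '2023-06-05']),
    'SalesAmount': [100.0, 300.0, 150.0],
})
print(avg_daily_sales_by_weekday(toy))
```

On the real `data`, Saturday and Sunday rows would be expected to come out highest if the PowerBI finding holds.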
RFM Segmentation¶
today = max(data['TransactionDate']) + timedelta(days=1)
# Calculate RFM values
RFM = data.groupby(['CustomerID']).agg({
'TransactionDate': lambda x: (today - x.max()).days,
'BillID': 'count',
'SalesAmount': 'sum'})
# Rename column
RFM.columns = ['Recency', 'Frequency', 'Monetary']
fig, ax = plt.subplots(3, 1, figsize=(8, 12))
# Plot Monetary
sns.kdeplot(data=RFM, x='Monetary', ax=ax[0], fill=True, palette="crest", linewidth=3)
ax[0].set_title('Monetary Distribution')
# Plot Frequency
sns.kdeplot(data=RFM, x='Frequency', ax=ax[1], fill=True, palette="crest", linewidth=3)
ax[1].set_title('Frequency Distribution')
# Plot Recency
sns.kdeplot(data=RFM, x='Recency', ax=ax[2], fill=True, palette="crest", linewidth=3)
ax[2].set_title('Recency Distribution')
plt.tight_layout()
plt.show()
Here the Monetary and Frequency distributions are highly right-skewed, so we will try some transformation techniques to reduce the skewness.
from numpy import sqrt, log10

# Define transformation methods
transformations = {
    'none': lambda x: x,
    'sqrt': sqrt,
    'log10': log10,
    'boxcox': lambda x: stats.boxcox(x)[0]
}

# Calculate and print skewness for each transformation
for name, transform in transformations.items():
    recency_transformed = transform(RFM['Recency'])
    recency_skew = stats.skew(recency_transformed)
    frequency_transformed = transform(RFM['Frequency'])
    frequency_skew = stats.skew(frequency_transformed)
    monetary_transformed = transform(RFM['Monetary'])
    monetary_skew = stats.skew(monetary_transformed)
    print(f"\n{name.capitalize()} transformation:")
    print(f"Skewness of Recency: {recency_skew}")
    print(f"Skewness of Frequency: {frequency_skew}")
    print(f"Skewness of Monetary: {monetary_skew}")
None transformation:
Skewness of Recency: 0.04356166369846459
Skewness of Frequency: 12.369499427471565
Skewness of Monetary: 18.954224583976302

Sqrt transformation:
Skewness of Recency: -0.49097751362347036
Skewness of Frequency: 3.4505122033956948
Skewness of Monetary: 2.83530947499132

Log10 transformation:
Skewness of Recency: -1.632735789608205
Skewness of Frequency: 1.8799794010067354
Skewness of Monetary: 0.6165928123199858

Boxcox transformation:
Skewness of Recency: -0.3102933245609721
Skewness of Frequency: 0.8771981678957986
Skewness of Monetary: -0.01735377665060057
Based on the result, we'll choose:
- None for Recency
- Box-Cox for Frequency and Monetary
# Preview Skewness of data
sns.displot(RFM['Recency'])
sns.displot(stats.boxcox(RFM['Frequency'])[0])
sns.displot(stats.boxcox(RFM['Monetary'])[0])
plt.show()
After the transformation, the Monetary distribution is quite symmetric, while Frequency remains highly right-skewed because the number of one-time customers is so large.
from sklearn.preprocessing import MinMaxScaler
# Box-cox Transformation
RFM['Frequency_boxcox'], _ = stats.boxcox(RFM['Frequency'])
RFM['Monetary_boxcox'], _ = stats.boxcox(RFM['Monetary'])
# Data for transformation
fit_RFM = RFM[['Recency', 'Frequency_boxcox', 'Monetary_boxcox']]
fit_RFM.columns = ['Recency', 'Frequency', 'Monetary']
# Scale Data
scaler = MinMaxScaler()
scaler.fit(fit_RFM)
transformed_RFM = scaler.transform(fit_RFM)
import logging
from sklearn.cluster import KMeans
# Suppress warnings
logging.getLogger('matplotlib.font_manager').disabled = True
from yellowbrick.cluster import KElbowVisualizer
Elbow_M = KElbowVisualizer(KMeans(), k=10, metric='distortion')
Elbow_M.fit(transformed_RFM)
Elbow_M.show()  # 4 clusters is the optimal number
<Axes: title={'center': 'Distortion Score Elbow for KMeans Clustering'}, xlabel='k', ylabel='distortion score'>
from sklearn.metrics import silhouette_score
# Fitting KMeans Model
model = KMeans(n_clusters=4, init='k-means++', random_state=3, n_init=30)
model.fit(transformed_RFM)
labeled_RFM = pd.DataFrame(transformed_RFM, columns=['Recency', 'Frequency', 'Monetary'], index=fit_RFM.index)
labeled_RFM['Cluster'] = model.predict(labeled_RFM)
# Evaluate model
silhouette = silhouette_score(labeled_RFM[['Recency', 'Frequency', 'Monetary']], labeled_RFM['Cluster'], sample_size=40000)
print("Silhouette Coefficient:", silhouette) # The silhouette score is quite good
Silhouette Coefficient: 0.5688068708100638
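Beyond the elbow plot, the choice of k could also be sanity-checked by comparing silhouette scores across several candidate values. A minimal sketch on synthetic blobs (function name and data are illustrative, not from the notebook):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def silhouette_by_k(X, k_values, random_state=3):
    """Fit KMeans for each k and return {k: silhouette score}."""
    scores = {}
    for k in k_values:
        labels = KMeans(n_clusters=k, n_init=10, random_state=random_state).fit_predict(X)
        scores[k] = silhouette_score(X, labels)
    return scores

# Toy data: two well-separated blobs, so k=2 should score highest
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.1, (50, 3)), rng.normal(5, 0.1, (50, 3))])
print(silhouette_by_k(X, [2, 3, 4]))
```

Run on `transformed_RFM`, the same comparison would show whether k=4 also maximizes the silhouette, not just the elbow criterion.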
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
fig = plt.figure(figsize=(12, 10))
ax = fig.add_subplot(111, projection='3d')
scatter = ax.scatter(labeled_RFM['Recency'],
labeled_RFM['Frequency'],
labeled_RFM['Monetary'],
c=labeled_RFM['Cluster'].astype(int),
cmap='viridis',
alpha=0.7)
ax.set_title('Clusters', fontsize=16, pad=20)
ax.set_xlabel('Recency', fontsize=14, labelpad=10)
ax.set_ylabel('Frequency', fontsize=14, labelpad=10)
ax.set_zlabel('Monetary', fontsize=14, labelpad=10)
ax.view_init(elev=20, azim=120)
cbar = plt.colorbar(scatter, shrink=0.5)
cbar.set_label('Cluster Number', rotation=270, labelpad=15)
ax.tick_params(axis='both', which='major', labelsize=12)
plt.show()
From this angle it is quite hard to see the difference between clusters in terms of Monetary, but we can clearly see the four clusters separated by differences in Recency and Frequency.
# Create new df
transformed_RFM_df = pd.DataFrame(transformed_RFM,
index=RFM.index,
columns=RFM[['Recency', 'Frequency', 'Monetary']].columns)
transformed_RFM_df['Cluster'] = labeled_RFM.Cluster
# Create long table
RFM_melt = pd.melt(transformed_RFM_df.reset_index(),
id_vars=['CustomerID', 'Cluster'],
value_vars=['Recency', 'Frequency', 'Monetary'],
var_name='Attribute',
value_name='Value')
plt.title("Snake plot")
sns.lineplot(x='Attribute', y='Value', hue='Cluster', data=RFM_melt)
plt.show()
The snake plot is distorted in the Monetary section: applying MinMaxScaler to this column, combined with a few outliers with very large total purchase values, has compressed most customers' scaled values to near zero.
To check whether the clustering is effective, we will use the original Monetary values instead.
# Assign Cluster number
RFM['Cluster'] = labeled_RFM.Cluster
# Calculate Average RFM Values of each Cluster
print(RFM.groupby('Cluster')[['Recency', 'Frequency', 'Monetary']].mean())
         Recency  Frequency     Monetary
Cluster
0        162.068      1.000   307799.016
1         97.538      4.851  1565071.450
2        486.570      1.000   289320.191
3        394.498      2.704   816652.064
We can see clear differences between the four groups across all three RFM values. Specifically, Cluster 0 consists of recent customers who have purchased only once, with an average Monetary value.
Cluster 1 is the cluster with all three scores at the best level.
Cluster 2 has the lowest three scores among the four groups and a very high likelihood of churn.
Cluster 3 is a group with high purchase frequency and value who have not returned for a long time; there is a high risk of losing this group to competitors.
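For reporting, the numeric cluster IDs can be mapped to descriptive segment names reflecting the profiles above. The labels below are hypothetical suggestions, not part of the original analysis:

```python
import pandas as pd

# Hypothetical segment labels based on the cluster profiles described above
SEGMENT_LABELS = {
    0: 'Recent one-timers',
    1: 'Champions',
    2: 'Lost one-timers',
    3: 'At-risk loyalists',
}

def label_clusters(rfm: pd.DataFrame, mapping=SEGMENT_LABELS) -> pd.DataFrame:
    """Return a copy of the RFM table with a human-readable Segment column."""
    out = rfm.copy()
    out['Segment'] = out['Cluster'].map(mapping)
    return out

# Toy usage
toy = pd.DataFrame({'Cluster': [0, 1, 2, 3]})
print(label_clusters(toy)['Segment'].tolist())
```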
# Building Relative Importance Heatmap
cluster_avg = RFM.groupby(['Cluster']).agg({'Recency':'mean',
'Frequency':'mean',
'Monetary':'mean'})
population_avg = RFM[['Recency','Frequency','Monetary']].mean()
relative_imp = cluster_avg/population_avg - 1
plt.figure(figsize=(8,2))
plt.title('Relative Importance of Attributes')
sns.heatmap(data=relative_imp, annot=True, fmt='.2f', cmap='RdYlGn')
plt.show()
The farther the values are from 0, the more distinct the groups. All clusters have values quite far from 0, showing that KMeans has divided the data into four well-separated groups.
2. Churn and CLV Prediction¶
In this section, we will use the Beta-Geometric/Negative Binomial Distribution (BG/NBD) model to calculate customer churn probability and predict each customer's expected purchase frequency, and the Gamma-Gamma model to predict the average Monetary value. We'll then predict customer CLV based on these two models.
Because these models only apply to customers with repeat purchases, we will fit and evaluate them on Clusters 1 and 3 only.
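Conceptually, the two models combine multiplicatively. In the standard notation (x = repeat-purchase count, t_x = recency, T = customer age, m_x = average observed order value):

```latex
P(\text{churn}) \;=\; 1 - P(\text{alive} \mid x,\, t_x,\, T)
\qquad
\mathrm{CLV}(t) \;=\;
\underbrace{E\!\left[X(t) \mid x,\, t_x,\, T\right]}_{\text{BG/NBD: expected purchases}}
\;\times\;
\underbrace{E\!\left[M \mid m_x,\, x\right]}_{\text{Gamma-Gamma: expected order value}}
```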
A. Churn Risk Modeling¶
# Import Model
from lifetimes import BetaGeoFitter, GammaGammaFitter
from lifetimes.utils import summary_data_from_transaction_data, calibration_and_holdout_data
from lifetimes.plotting import (plot_period_transactions,
                                plot_cumulative_transactions,
                                plot_incremental_transactions,
                                plot_calibration_purchases_vs_holdout_purchases)
# Create data and auto-calculate RFM values (the BG/NBD model defines recency and
# frequency differently from our RFM calculation above)
churn_RFM = summary_data_from_transaction_data(data.reset_index(drop=True), 'CustomerID', 'TransactionDate', 'SalesAmount')
bgf = BetaGeoFitter(penalizer_coef=0.0)
bgf.fit(churn_RFM['frequency'], churn_RFM['recency'], churn_RFM['T'])
t = 180 # Forecast for the next 180 days
churn_RFM['expected_purchases'] = bgf.conditional_expected_number_of_purchases_up_to_time(t, churn_RFM['frequency'], churn_RFM['recency'], churn_RFM['T'])
# Alive Probability of Customers
churn_RFM['Retention'] = bgf.conditional_probability_alive(churn_RFM['frequency'], churn_RFM['recency'], churn_RFM['T'])
churn_RFM['Churn'] = 1 - churn_RFM['Retention']
Next we want to evaluate how well the model performs on future data. We'll split the data into a training (calibration) period and a holdout (observation) period, train the BG/NBD model, and evaluate performance with four plots:
- Calibration period histogram: does the model fit the training data?
- Cumulative transaction plot: does the model predict cumulative sales well?
- Incremental transaction plot: does the model capture the overall trend in transactions?
- Conditional expectations plot: can the model predict the number of purchases a customer will make based on the training data?
Note: except for graph 1, in this section we will only consider the model's performance on data from customers in Cluster 1 and 3, two groups of customers with repeated purchases.
1. Calibration Period Histogram¶
plot_period_transactions(bgf).set_yscale('log')
As we can see, the model predicts the number of customers per repeat-transaction count quite accurately; the gap grows at 5 and 6 repeat purchases because there is little data for those customers.
2. Cumulative Transaction Plot¶
# Split Calibration and Holdout data
summary_cal_holdout = calibration_and_holdout_data(data.reset_index(drop=True), 'CustomerID', 'TransactionDate',
calibration_period_end='2023-01-01',
observation_period_end='2023-06-30',
monetary_value_col='SalesAmount')
summary_cal_holdout = summary_cal_holdout[summary_cal_holdout['frequency_cal'] > 0]
# Add Frequency column to data
data.index = data['CustomerID']
data['Frequency'] = RFM.Frequency
# Plot Cumulative transaction of Model vs Actual data
bgf.fit(summary_cal_holdout['frequency_cal'], summary_cal_holdout['recency_cal'], summary_cal_holdout['T_cal'])
plot_cumulative_transactions(bgf, data[data['Frequency'] > 1].reset_index(drop=True), 'TransactionDate', 'CustomerID', 637, 456)
<Axes: title={'center': 'Tracking Cumulative Transactions'}, xlabel='day', ylabel='Cumulative Transactions'>
The vertical line marks the boundary between the calibration period (left) and the holdout period (right). The model tracks the actual cumulative transaction count closely over time, with no significant gap between the two lines.
3. Incremental Transaction Plot¶
plot_incremental_transactions(bgf, data[data['Frequency'] > 1].reset_index(drop=True), 'TransactionDate', 'CustomerID', 637, 456)
<Axes: title={'center': 'Tracking Daily Transactions'}, xlabel='day', ylabel='Transactions'>
The model captures the general trend in daily purchases: the predicted line trends down and follows the recent slight downward movement in the actual data.
4. Conditional Expectations Plot¶
plot_calibration_purchases_vs_holdout_purchases(bgf, summary_cal_holdout)
<Axes: title={'center': 'Actual Purchases in Holdout Period vs Predicted Purchases'}, xlabel='Purchases in calibration period', ylabel='Average of Purchases in Holdout Period'>
The distance between the two lines tends to decrease, showing that the model predicts the number of purchases quite well.
from sklearn.metrics import mean_absolute_error
predicted_purchases = bgf.conditional_expected_number_of_purchases_up_to_time(t,
summary_cal_holdout['frequency_cal'],
summary_cal_holdout['recency_cal'],
summary_cal_holdout['T_cal'])
actual_purchases = summary_cal_holdout['frequency_holdout']
mae_frequency = mean_absolute_error(actual_purchases, predicted_purchases)
print(f'Mean Absolute Error: {mae_frequency}')
Mean Absolute Error: 0.8605611140417139
The MAE for frequency is below 1, which is very good given that Clusters 1 and 3 have very high average purchase frequency.
5. Alive Probability¶
- The BG/NBD model assumes that "death" can only occur after a repeat purchase: churn happens at the moment of a purchase, and the first purchase is reserved to signal a customer's "birth".
- Because of this, customers with only one transaction get a 100% probability of being alive, which is suspect. To account for this limitation, we'll only predict churn risk for customers who have made at least one repeat transaction, i.e. customers in Clusters 1 and 3.
# Keep only repeat customers (in lifetimes, 'frequency' counts repeat purchases)
repeat_customers = churn_RFM[churn_RFM['frequency'] > 0].copy()
repeat_customers['Cluster'] = labeled_RFM.Cluster
# Plot alive probability
repeat_customers['prob_alive'] = bgf.conditional_probability_alive(repeat_customers['frequency'], repeat_customers['recency'], repeat_customers['T'])
sns.histplot(repeat_customers['prob_alive'], kde=True)  # distplot is deprecated in recent seaborn
plt.show()
print(repeat_customers.groupby("Cluster")["prob_alive"].mean())
Cluster
1   0.608
3   0.101
Name: prob_alive, dtype: float64
We can see that there is a clear difference in average prob_alive between the two Clusters, with Cluster 1 having much higher prob_alive than Cluster 3.
B. CLV Model¶
The assumption that the transaction frequency (how often a customer purchases) is independent of the monetary value (how much they spend per transaction) is central to the Gamma-Gamma model. To test this assumption we'll use Pearson Correlation.
repeat_customers[['frequency', 'monetary_value']].corr()
| | frequency | monetary_value |
|---|---|---|
| frequency | 1.000 | -0.012 |
| monetary_value | -0.012 | 1.000 |
- The correlation of -0.012 is very low, so we can use the model.
- Besides the independence assumption, the model also assumes that monetary values are gamma-distributed. To check this, we'll plot the histogram of the Box-Cox-transformed monetary value with the fitted gamma PDF overlaid.
from scipy.stats import gamma
repeat_customers['monetary_value_boxcox'], _ = stats.boxcox(repeat_customers['monetary_value'])
# Fit a gamma distribution to monetary values
ag, loc, scale = gamma.fit(repeat_customers['monetary_value_boxcox'])
# Generate values from fitted gamma distribution
gamma_fitted = gamma.pdf(np.linspace(min(repeat_customers['monetary_value_boxcox']), max(repeat_customers['monetary_value_boxcox']), 100), ag, loc, scale)
# Plot histogram and fitted gamma PDF
plt.figure()
plt.hist(repeat_customers['monetary_value_boxcox'], bins=30, density=True, alpha=0.75, label='Observed data')
plt.plot(np.linspace(min(repeat_customers['monetary_value_boxcox']), max(repeat_customers['monetary_value_boxcox']), 100), gamma_fitted, 'r-', label='Gamma fit')
plt.xlabel('Monetary Value')
plt.ylabel('Density')
plt.title('Fit of Gamma Distribution')
plt.legend()
plt.show()
The fit of the Gamma distribution to the data seems quite good, meaning that we can proceed with this model to predict CLV of customers.
ggf = GammaGammaFitter(penalizer_coef=0.0)
summary_cal_holdout['monetary_value_cal_boxcox'], mon_val_lmbda = stats.boxcox(summary_cal_holdout['monetary_value_cal'])
ggf.fit(summary_cal_holdout['frequency_cal'], summary_cal_holdout['monetary_value_cal_boxcox'])
<lifetimes.GammaGammaFitter: fitted with 166916 subjects, p: 525.60, q: 540.76, v: 16.96>
# Predict the expected number of transactions in the next 180 days since 2023-01-01
days = (dt.datetime(2023, 6, 30) - dt.datetime(2023, 1, 1)).days
predicted_bgf = bgf.predict(days,
summary_cal_holdout['frequency_cal'],
summary_cal_holdout['recency_cal'],
summary_cal_holdout['T_cal'])
# Predict the average order value
monetary_pred_boxcox = ggf.conditional_expected_average_profit(
summary_cal_holdout['frequency_cal'],
summary_cal_holdout['monetary_value_cal_boxcox'])
# Inverse Box-Cox transformation to map predictions back to the original scale
def inverse_boxcox(y, lambda_):
    if lambda_ == 0:
        return np.exp(y)
    else:
        return (y * lambda_ + 1) ** (1 / lambda_)
monetary_pred = inverse_boxcox(monetary_pred_boxcox, mon_val_lmbda)
# Calculate the predicted sales
sales_pred = predicted_bgf * monetary_pred
# Actual values calculation
actual = summary_cal_holdout['monetary_value_holdout'] * summary_cal_holdout['frequency_holdout']
import matplotlib.pyplot as plt
from sklearn.metrics import mean_absolute_error
# Calculate the mean of the training set
train_actual = summary_cal_holdout['monetary_value_cal'] * summary_cal_holdout['frequency_cal']
median_sales = train_actual.median()
def evaluate(actual, sales_prediction, median_sales):
    actual_log = np.log1p(actual)                # Log-transform the actual values
    sales_pred_log = np.log1p(sales_prediction)  # Log-transform the predictions

    print(f"Total Sales Actual: {np.round(actual.sum())}")
    print(f"Total Sales Predicted: {np.round(sales_prediction.sum())}")

    # Calculate and print the percentage difference
    percentage_difference = ((sales_prediction.sum() - actual.sum()) / actual.sum()) * 100
    print(f"Percentage Difference between Actual Sales and Predicted Sales: {percentage_difference:.2f}%")
    print(f"Individual Mean Absolute Error: {mean_absolute_error(actual, sales_prediction)}")

    # Naive benchmark: predict the training median for every customer
    naive_predictions = np.full(shape=actual.shape, fill_value=median_sales)
    naive_mae = mean_absolute_error(actual, naive_predictions)
    print(f"Naive Mean Absolute Error: {naive_mae}")

    plt.scatter(sales_pred_log, actual_log, alpha=0.5)
    plt.xlabel('Log of Prediction')
    plt.ylabel('Log of Actual')
    plt.title('Log-Log Plot of Actual vs. Prediction')
    plt.show()
# Apply the function with actual and predicted values
evaluate(actual, sales_pred, median_sales)
Total Sales Actual: 47080105051.0
Total Sales Predicted: 45781350436.0
Percentage Difference between Actual Sales and Predicted Sales: -2.76%
Individual Mean Absolute Error: 291539.45987536316
Naive Mean Absolute Error: 469809.1762456828
from xgboost import XGBRegressor
from sklearn.model_selection import train_test_split
# Prepare the features and target
X = summary_cal_holdout[['frequency_cal', 'recency_cal', 'T_cal', 'monetary_value_cal']]
y = summary_cal_holdout['monetary_value_holdout'] * summary_cal_holdout['frequency_holdout']
# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Initialize the XGBRegressor
xgb = XGBRegressor(objective='reg:squarederror', random_state=42)
# Fit the model
xgb.fit(X_train, y_train)
# Make predictions
xgb_predictions = xgb.predict(X_test)
# Calculate the MAE of the XGBRegressor
xgb_mae = mean_absolute_error(y_test, xgb_predictions)
print(f"XGBRegressor Mean Absolute Error: {xgb_mae}")
XGBRegressor Mean Absolute Error: 302864.0472673454
- The MAE of the Gamma-Gamma model is lower than both benchmarks, XGBoost and the naive prediction (we use the median rather than the mean because the raw Monetary data is skewed; the mean-based benchmark has an MAE above 700,000), indicating that the model is suitable for predicting CLV for these two clusters.
- Besides that, total actual and predicted sales are very close, off by only 2.8%. Many customers have zero actual sales but a positive predicted value; we can interpret this as the holdout period not being long enough for those customers to make their next transaction.
3. Final Model for CLV and Churn Probability¶
Here, we will calculate each customer's churn probability and their CLV over the next 6 months, for each cluster.
Churn Probability and CLV Model for Cluster 1 and 3¶
# Data for fitting model
temp = summary_data_from_transaction_data(data.reset_index(drop=True), 'CustomerID', 'TransactionDate', monetary_value_col='SalesAmount')
cluster_1_3 = temp.loc[temp.frequency > 0, :].copy()
# BG/NBD - Churn Probability
bgf = BetaGeoFitter(penalizer_coef=0.0)
bgf.fit(cluster_1_3['frequency'], cluster_1_3['recency'], cluster_1_3['T'])
cluster_1_3['Churn'] = 1 - bgf.conditional_probability_alive(cluster_1_3['frequency'], cluster_1_3['recency'], cluster_1_3['T'])
# Gamma-Gamma
ggf = GammaGammaFitter(penalizer_coef=0.0)
cluster_1_3['monetary_value_boxcox'], monetary_lmbda = stats.boxcox(cluster_1_3['monetary_value'])
# Fitting ggf model
ggf.fit(cluster_1_3['frequency'],
cluster_1_3['monetary_value_boxcox'])
# Avg Monetary predict
expected_avg_profit = ggf.conditional_expected_average_profit(
cluster_1_3['frequency'],
cluster_1_3['monetary_value_boxcox']
)
expected_avg_profit_true = inverse_boxcox(expected_avg_profit, monetary_lmbda)
# Avg frequency predict
expected_transactions = bgf.predict(
180, # 6 months
cluster_1_3['frequency'],
cluster_1_3['recency'],
cluster_1_3['T']
)
cluster_1_3['clv'] = expected_transactions * expected_avg_profit_true
cluster_1_3['Cluster'] = labeled_RFM.Cluster
print(cluster_1_3.groupby("Cluster")["Churn"].mean())
print(cluster_1_3.groupby("Cluster")["clv"].mean())
Cluster
1    0.364
3    0.862
Name: Churn, dtype: float64
Cluster
1    355426.359
3     38039.774
Name: clv, dtype: float64
Here, the difference between the two Clusters is clear: Cluster 1 has a much higher average CLV and a much lower average churn probability than Cluster 3, showing that customers in Cluster 1 are very valuable.
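Under the BG/NBD model, `conditional_probability_alive` has a closed form, so Churn = 1 − P(alive) can be computed directly. A minimal numpy sketch with made-up parameters `r, alpha, a, b` (the notebook fits these with `BetaGeoFitter` instead) shows the expected behavior — a recently active customer has a lower churn probability than one silent for months:

```python
import numpy as np

def p_alive(x, t_x, T, r, alpha, a, b):
    """Closed-form BG/NBD P(customer still 'alive' | frequency x, recency t_x, age T)."""
    x, t_x, T = np.asarray(x, float), np.asarray(t_x, float), np.asarray(T, float)
    ratio = (a / (b + np.maximum(x, 1) - 1)) * ((alpha + T) / (alpha + t_x)) ** (r + x)
    out = 1.0 / (1.0 + ratio)
    return np.where(x == 0, 1.0, out)  # customers with zero repeat purchases are assumed alive

# Made-up illustrative parameters, NOT the fitted values from the notebook
r, alpha, a, b = 0.25, 4.0, 0.8, 2.5

# Same purchase count, different recency: the recent buyer is far more likely alive
recent = p_alive(x=3, t_x=170, T=180, r=r, alpha=alpha, a=a, b=b)
stale = p_alive(x=3, t_x=30, T=180, r=r, alpha=alpha, a=a, b=b)
print("churn (recent buyer):", 1 - recent)
print("churn (stale buyer): ", 1 - stale)
```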
Churn Probability and CLV Model for Cluster 0 and 2¶
data_sorted.index = data_sorted['CustomerID']
data_sorted['Cluster'] = RFM.Cluster
average_diff = data_sorted[(data_sorted['Cluster'] == 1) | (data_sorted['Cluster'] == 3)].reset_index(drop=True).groupby('CustomerID')['diff'].mean()
# average_diff is a Series of each customer's mean number of days between consecutive orders
count, bins_count = np.histogram(average_diff, bins=20)
# using numpy np.cumsum to calculate the CDF
pdf = count / sum(count)
cdf = np.cumsum(pdf)
# plotting PDF and CDF
plt.plot(bins_count[1:], cdf, label="CDF")
plt.xlabel("Days")
plt.ylabel("Percentage of Customers Buying Again")
plt.legend()
Following a heuristic popularized on Hackernoon, we use the CDF (Cumulative Distribution Function) of inter-purchase gaps to choose the threshold: the number of days since the last order beyond which a customer is considered churned.
A good rule of thumb is to find the interval where the curve has an inflection point --> we choose a threshold at the 0.93 quantile (93% of customers).
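The threshold logic can be sketched on synthetic inter-purchase gaps (assumed exponential here purely for illustration): take the 0.93 quantile of average gaps as the churn cutoff, then scale each customer's recency against that cutoff and cap the ratio at 1.

```python
import numpy as np

# Synthetic average days between orders per customer (NOT the real distribution)
rng = np.random.default_rng(42)
avg_gap_days = rng.exponential(scale=45, size=5000)

# 0.93 quantile = "days of silence" after which we call a customer churned
churn_threshold = np.quantile(avg_gap_days, 0.93)

# Heuristic churn probability: recency relative to the threshold, capped at 1
recency = np.array([10.0, 60.0, 400.0])
churn = np.minimum(recency / churn_threshold, 1.0)
print("threshold (days):", churn_threshold)
print("churn probabilities:", churn)
```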
#Filter for cluster 0 and 2
cluster_0_2 = RFM[(RFM['Cluster'] == 0) | (RFM['Cluster'] == 2)].copy()  # .copy() avoids SettingWithCopyWarning
#Define threshold for defining churn
churn_threshold = average_diff.reset_index()['diff'].quantile(0.93)
#Probability of churn
cluster_0_2['Churn_1'] = cluster_0_2['Recency']/churn_threshold
cluster_0_2['Churn'] = np.where(cluster_0_2['Churn_1']>1, 1, cluster_0_2['Churn_1'])
cluster_0_2.groupby('Cluster')['Churn'].mean()
Cluster
0    0.517
2    1.000
Name: Churn, dtype: float64
Cluster 2 has a churn probability of 100%, meaning that no customer in this group has made a purchase in the past year. Note, however, that this method is only a heuristic approximation, not a fully reliable estimate.
Because Cluster 2 has a churn probability of 100%, we will assume the CLV of this Cluster in the next 6 months is 0.
For Cluster 0, we will use XGBRegressor to predict the CLV of this Cluster for the next 6 months.
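The supervised setup below hinges on a temporal split: features (RFM) are built only from transactions up to a cutoff date, and the target is each customer's total spend in the following 180 days, so there is no target leakage. A toy sketch (using column names matching the notebook's, on invented rows):

```python
import pandas as pd

# Tiny invented transaction log (illustrative only)
tx = pd.DataFrame({
    'CustomerID': [1, 1, 1, 2, 2, 3],
    'TransactionDate': pd.to_datetime(
        ['2022-01-05', '2022-06-01', '2023-03-01', '2022-02-10', '2023-04-15', '2022-05-20']),
    'SalesAmount': [100.0, 150.0, 80.0, 200.0, 60.0, 90.0],
})

# Split at 180 days before the last observed transaction
cutoff = tx['TransactionDate'].max() - pd.to_timedelta(180, unit='d')
in_df = tx[tx['TransactionDate'] <= cutoff]
out_df = tx[tx['TransactionDate'] > cutoff]

# Target: total spend per customer in the 180 days after the cutoff (0 if silent)
target = out_df.groupby('CustomerID')['SalesAmount'].sum()

# Features: RFM computed strictly from the pre-cutoff window
snapshot = in_df['TransactionDate'].max() + pd.Timedelta(days=1)
features = in_df.groupby('CustomerID').agg(
    Recency=('TransactionDate', lambda s: (snapshot - s.max()).days),
    Frequency=('TransactionDate', 'count'),
    Monetary=('SalesAmount', 'sum'),
).join(target.rename('spend_180_total')).fillna(0)
print(features)
```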
# Calculate cutoff date for target variable
n_days = 180
max_date = data['TransactionDate'].max()
cutoff = max_date - pd.to_timedelta(n_days, unit='d')
# Split data
temporal_in_df = data[data['TransactionDate'] <= cutoff]
temporal_out_df = data[data['TransactionDate'] > cutoff]
# Making target data
targets_df = temporal_out_df[['CustomerID', 'SalesAmount']].reset_index(drop=True).\
groupby('CustomerID').sum().rename({'SalesAmount':'spend_180_total'}, axis=1).assign(spend_180_flag=1)
# RFM
max_day = temporal_in_df['TransactionDate'].max() + timedelta(days=1)
# Calculate RFM values
features_df = temporal_in_df.reset_index(drop=True).groupby(['CustomerID']).agg({
'TransactionDate': lambda x: (max_day - x.max())/pd.to_timedelta(1,"day"),
'BillID': 'count',
'SalesAmount': 'sum'}).merge(targets_df, left_index=True, right_index=True, how='left').fillna(0)
features_df['Cluster'] = RFM.Cluster
# Rename column
features_df.rename(columns={'TransactionDate': 'Recency', 'BillID': 'Frequency', 'SalesAmount': 'Monetary'}, inplace=True)
# MACHINE LEARNING FOR REGRESSION
X = features_df[['Recency', 'Frequency', 'Monetary']]
y_spend = features_df['spend_180_total']
##Train and test split
from sklearn.model_selection import KFold, cross_val_score, train_test_split
X_train_reg, X_test_reg, y_train_reg, y_test_reg = train_test_split(X, y_spend, test_size=0.2, random_state=3)
#Create XGB Regression model and Tuning Model
xgb_reg = XGBRegressor(n_estimators=100, learning_rate=0.05, random_state=2, n_jobs=12)
kf = KFold(n_splits=5, random_state=1, shuffle=True)
cv_scores = cross_val_score(xgb_reg, X_train_reg, y_train_reg, cv=kf, scoring='neg_mean_absolute_error')
# Fit the model
xgb_reg.fit(X_train_reg, y_train_reg)
# Print the cross-validation scores
print('Cross-validation scores:', cv_scores)
#Predict X
predictions_reg = xgb_reg.predict(X_test_reg)
predictions_reg = np.where(predictions_reg < 0, 0, predictions_reg)
#Mean Absolute Error
print('Mean Absolute Error of XGB Regression Model: {}'.format(mean_absolute_error(y_test_reg, predictions_reg)))
Cross-validation scores: [-141070.05543361 -141681.34271143 -143925.69511609 -139769.06362863 -144515.03960475]
Mean Absolute Error of XGB Regression Model: 141027.53955605143
import optuna
from optuna.samplers import TPESampler
def objective(trial):
params = {
"learning_rate": trial.suggest_float("learning_rate", 0.01, 1.0),
"n_estimators": trial.suggest_int("n_estimators", 100, 1000),
"max_depth": trial.suggest_int("max_depth", 1, 15),
"min_child_weight": trial.suggest_int("min_child_weight", 1, 10),
"gamma": trial.suggest_float("gamma", 0.0, 1.0),
"colsample_bytree": trial.suggest_float("colsample_bytree", 0.1, 1.0),
"subsample": trial.suggest_float("subsample", 0.1, 1.0),
"reg_alpha": trial.suggest_float("reg_alpha", 1e-8, 10.0, log=True),
"reg_lambda": trial.suggest_float("reg_lambda", 1e-8, 10.0, log=True),
"scale_pos_weight": trial.suggest_float("scale_pos_weight", 1e-6, 500.0, log=True)
}
model = XGBRegressor(**params, n_jobs=12)
model.fit(X_train_reg, y_train_reg)
# Use the validation set for evaluation
preds = model.predict(X_test_reg)
mae = mean_absolute_error(y_test_reg, preds)
return -mae # return negative MAE for maximization
optuna.logging.set_verbosity(optuna.logging.WARNING)
sampler = TPESampler(seed=1)
study = optuna.create_study(direction="maximize", sampler=sampler)
study.optimize(objective, n_trials=100)
best_params = study.best_params
best_score = study.best_value
print(f"Best parameters: {best_params}")
print(f"Best negative MAE score: {best_score}")
Best parameters: {'learning_rate': 0.8044705559265699, 'n_estimators': 343, 'max_depth': 1, 'min_child_weight': 8, 'gamma': 0.5212677432318678, 'colsample_bytree': 0.46482910817480444, 'subsample': 0.10165275993209091, 'reg_alpha': 1.412096705991424e-06, 'reg_lambda': 0.8675148598594582, 'scale_pos_weight': 3.3352054545636855e-05}
Best negative MAE score: -140164.43003257317
# FINAL MODEL AND CLUSTER 0 CLV
# Fit model
XGB_reg_tuned = XGBRegressor(reg_lambda=16, min_child_weight=12, max_depth=10, learning_rate=0.25, random_state=3)  # manually chosen hyperparameters; the Optuna result could be reused instead via XGBRegressor(**study.best_params)
XGB_reg_tuned.fit(X, y_spend)
# Predict
clv_predict = XGB_reg_tuned.predict(RFM[RFM['Cluster'] == 0][['Recency', 'Frequency', 'Monetary']])
clv_predict_tune = np.where(clv_predict < 0, 0, clv_predict) # Setting CLV = 0 for negative values
result_df = pd.concat([pd.DataFrame(clv_predict_tune).set_axis(['clv'], axis=1),
RFM[RFM['Cluster'] == 0][['Recency', 'Frequency', 'Monetary']].reset_index()],
axis=1)
result_df.index = result_df['CustomerID']
result_df['Cluster'] = RFM.Cluster
cluster_0_clv = result_df.groupby('Cluster')['clv'].mean()
cluster_0_clv.head()
Cluster
0    50440.219
Name: clv, dtype: float32
# Concat 2 df
cluster_0_2 = cluster_0_2.merge(result_df, left_index=True, right_index=True, how='left', suffixes=('', '_x'))
cluster_0_2 = cluster_0_2[['Frequency', 'Recency', 'Monetary', 'Churn', 'clv', 'Cluster']].fillna(0)
cluster_1_3.rename(columns={'frequency': 'Frequency', 'recency': 'Recency', 'monetary_value': 'Monetary'}, inplace=True)
cluster_1_3 = cluster_1_3.drop(['T'], axis=1)
final_clv = pd.concat([cluster_0_2, cluster_1_3], axis=0)
final_clv.sort_index(ascending=True, inplace=True)
print("Total CLV of each Cluster: ", final_clv.groupby('Cluster')['clv'].sum()) # Total CLV
print("Average CLV of each Cluster: ", final_clv.groupby('Cluster')['clv'].mean()) # Average CLV
print("Churn Probability of each Cluster: ", final_clv.groupby('Cluster')['Churn'].mean()) # Average Churn Probability
Total CLV of each Cluster:  Cluster
0    10771205926.307
1    49285196011.201
2              0.000
3     3115913941.070
Name: clv, dtype: float64
Average CLV of each Cluster:  Cluster
0     50440.218
1    355426.359
2         0.000
3     38039.774
Name: clv, dtype: float64
Churn Probability of each Cluster:  Cluster
0    0.517
1    0.364
2    1.000
3    0.862
Name: Churn, dtype: float64
4. Additional Insights¶
In this section, we will look at the company's monthly retention rate and its number of new customers over time.
Retention Rate¶
# Get the first month of each Customer
data = data.reset_index(drop=True)
def get_month(x):
return dt.datetime(x.year, x.month, 1)
data['transaction_month'] = data['TransactionDate'].apply(get_month)
first_month = data.groupby("CustomerID").agg({'transaction_month': 'min'}).reset_index().rename(columns={'transaction_month': 'first_month'})
cohort = data.merge(first_month, how='left', on='CustomerID')
# Calculate the months since Customer's first transaction month
def month_diff(a, b):
return 12*(a.dt.year - b.dt.year) + (a.dt.month - b.dt.month)
cohort['diff'] = month_diff(cohort.transaction_month, cohort.first_month)
cohort_pivot = cohort.groupby(['first_month', 'diff'])['CustomerID'].nunique().reset_index()\
.pivot_table(values="CustomerID", index="first_month", columns="diff")
# Draw the heatmap of MoM retention rate
mom_retention_rate = cohort_pivot.divide(cohort_pivot.iloc[:, 0], axis=0).round(3) * 100
plt.figure(figsize=(18,14))
plt.title("MoM Retention Rate for Customer Transaction Data")
ax = sns.heatmap(data=mom_retention_rate, annot=True, vmin=0, vmax=10, cmap='crest', fmt=".1f")
ax.set_yticklabels(mom_retention_rate.index.strftime('%Y-%m-%d'))
plt.xlabel('Months since first Order')
plt.ylabel('Month')
plt.show()
- We can see that the company's retention rate generally decays within each cohort over time. For the 2021-10 cohort, 14.5% of new customers purchased again in the month after their first order, but this figure fell to 6.8% by month 20.
- First-month retention across cohorts also declined: from 14.5% for the 2021-10 cohort to 7% for the 2023-05 cohort.
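The heatmap arithmetic above reduces to dividing each cohort row by its month-0 size. A toy sketch with hypothetical cohort counts (chosen only to echo the 14.5% figure, not taken from the data):

```python
import pandas as pd

# Hypothetical unique-customer counts: rows = acquisition cohort, columns = months since first order
counts = pd.DataFrame(
    {0: [1000, 800], 1: [145, 96], 2: [90, 60]},
    index=pd.to_datetime(['2021-10-01', '2021-11-01']),
)
counts.index.name = 'first_month'
counts.columns.name = 'months_since_first'

# Retention = cohort size in month n divided by cohort size in month 0, as a percentage
retention = (counts.divide(counts.iloc[:, 0], axis=0) * 100).round(1)
print(retention)
```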
New Customer by Month¶
#Draw the number of new customer by month overtime
plt.figure(figsize=(12, 6))
sns.lineplot(cohort.groupby('first_month')['CustomerID'].nunique())
plt.xlabel('Month')
plt.ylabel('Number of Customer')
plt.show()
We can see that Pizza Hut's number of new customers decreased sharply between 2021-10 and 2023-06, from nearly 70,000 new customers in October 2021 to just over 20,000 in June 2023.
RECOMMENDATIONS¶
Cluster 0¶
BEHAVIOR:
- Nearly 60% of customers use the “Take away” channel.
- 57.6% of them order at the Store and 22.71% via the Website.
- Most of the people who order in Hanoi, Northern & Southern Provinces use “Take away” while people who order in Ho Chi Minh City choose “Delivery”.
RECOMMENDATION:
- Increase retention through loyalty programs, discounts, and promotions to encourage customer purchases.
- Focus on optimizing the in-store customer experience, as well as the online ordering process and website interface.
- Ask for feedback or reviews after the first purchase and use those insights to improve products and services.
- Use geo-targeting or location-based marketing to send them timely and relevant messages when they are near the store or in the delivery area.
Cluster 1¶
BEHAVIOR:
- 53.58% of customers use the “Delivery” channel.
- 40.9% order at the Store and 21.27% via the App. Over time, customers have tended to order more via the App, while orders via the Call Center have decreased.
- Although this Cluster ranks only third in number of customers, it has the highest number of voucher users.
- Most of the people who order in Northern & Southern Provinces use “Take away” while people who order in Hanoi & Ho Chi Minh City choose “Delivery”.
RECOMMENDATION:
- Optimize delivery services: timely deliveries and methods to maintain food temperature.
- Provide attractive, personalized voucher offers to increase customer loyalty.
- Utilize up-selling strategies to optimize revenue and enhance the customer experience.
- Improve the ordering experience on the mobile app, advertise the special benefits of ordering via the app, and build an appropriate promotion-code program.
Cluster 2¶
BEHAVIOR:
- The preference for “Take Away” was 58.45%, while “Delivery” was 40.69%.
- 59.1% of customers ordered at the Store, and 22.03% via the Website.
- Most of the people who order in Hanoi, Northern & Southern Provinces use “Take Away” while people who order in Ho Chi Minh City choose “Delivery”.
- Customers last purchased in August 2022.
RECOMMENDATION:
- Deprioritize this group: with an estimated churn probability of 100% and a projected CLV of 0, reactivation spending here is unlikely to pay off.
Cluster 3¶
BEHAVIOR:
- Most of the people who order in Northern & Southern Provinces use “Take Away” while people who order in Hanoi & Ho Chi Minh City choose “Delivery”.
- These customers used to be frequent buyers, but they have not purchased in a long time.
- These customers prefer “Delivery” (49.97%) as the service channel, while placing their orders mostly at the Store (48.96%) or via the Website (22.68%).
RECOMMENDATION:
- Craft personalized win-back campaigns for inactive but previously frequent buyers, offering compelling incentives or discounts to encourage their return.
- Consider reaching out via phone calls with exclusive reactivation offers to reconnect with these customers personally.
- Showcase any service improvements or enhancements made since their last purchase to entice them back.
- Promote testimonials from returning customers to develop trust in the brand's quality and service.
General Recommendations¶
MARKETING & PROMOTIONS:
- Provide culturally immersive experiences by deploying pizza food trucks on weekends or holidays at famous locations in Hanoi or Ho Chi Minh City, such as Hoan Kiem Lake or Nguyen Hue Walking Street.
- Use KOLs (key opinion leaders) and KOCs (key opinion consumers) to expand brand recognition.
- Open another signature restaurant in Hanoi to boost customers' dine-in experience.
- Use trending TV shows as a marketing channel.
- Promote marketing campaigns via engaging activities and challenges on social media.
--> By taking the above actions, Pizza Hut can attract new customers at a time when the number of new customers is trending downward month after month.
ENHANCING CUSTOMER LOYALTY:
- Strengthen the loyalty program with tiered rewards and exclusive benefits such as vouchers, discounts, and coupons.
- Optimize the Pizza Hut mobile app for a seamless and engaging user experience, offering exclusive promotions and features.
--> Since Pizza Hut's retention rate tends to decrease over time, this is an important action for maintaining a consistent base of loyal customers.
INNOVATION IN SERVICES & PRODUCTS:
- Continue introducing innovative menu items to cater to diverse tastes and consider limited-time releases to create excitement and encourage exploration of the menu (Ex: Introducing new flavors for current L1MO - 1 meter pizza).
- For customizable products (My Box), create an interactive, digital experience where customers can customize their pizza with their favorite toppings, crust, and size.
FEEDBACK & CONTINUOUS DEVELOPMENT:
- Establish easy-to-access feedback mechanisms for continuous improvement, and reward customers for submitting feedback.
- Analyze customer feedback to identify areas for enhancement in service, menu, or overall experience.
- Improve information systems to reduce unknown values in the Gender category, giving the company the specific insight needed for targeted marketing plans and selling strategies on holidays such as International Women’s Day, Vietnamese Women’s Day, or Men’s Day.