Analyze customer purchase patterns using statistical analysis and hypothesis testing
💡 Tip: Use this notebook to practice alongside the solved project. Type your own code and take notes!
Note: To execute Python code, you'll need to set up a backend service. Currently, this is a placeholder.
Options: Pyodide (browser-based), Thebe (JupyterHub), or a custom FastAPI backend with a Python execution service.
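If you go the custom-backend route, here is a minimal sketch of what such a service could look like (the /execute endpoint name and request shape are illustrative assumptions, not part of this project; exec() on untrusted input is unsafe without sandboxing):
import io
import contextlib
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class CodeRequest(BaseModel):
    code: str  # Python source submitted by the notebook frontend

@app.post("/execute")
def execute(req: CodeRequest):
    # Capture stdout so print() output can be returned to the frontend
    buffer = io.StringIO()
    try:
        with contextlib.redirect_stdout(buffer):
            exec(req.code, {})  # WARNING: sandbox this in any real deployment
        return {"ok": True, "stdout": buffer.getvalue()}
    except Exception as exc:
        return {"ok": False, "error": str(exc), "stdout": buffer.getvalue()}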
We will analyze e-commerce purchase data to understand customer behavior and validate hypotheses using statistical testing.
Dataset Source: Purchase Data on Google Drive
import pandas as pd
import numpy as np
from scipy import stats
from sklearn.preprocessing import LabelEncoder
import matplotlib.pyplot as plt
import seaborn as sns
Why this code?
We import essential libraries:
- **pandas**: For data manipulation and analysis
- **numpy**: For numerical operations
- **scipy**: For statistical testing
- **sklearn**: For data preprocessing (LabelEncoder)
- **matplotlib & seaborn**: For data visualization
df = pd.read_csv('/path/to/purchase_data.csv')
df.head()
Why this code?
Load the dataset and display the first few rows to understand its structure.
Before building any model or drawing conclusions, we need to:
- Check data types and missing values
- Handle missing data
- Check for duplicates
- Encode categorical variables
df.info()
Why this code?
Check data types and missing values. This helps us understand what preprocessing is needed.
df.isnull().sum()
Why this code?
Identify how many missing values exist in each column. This is crucial for data quality assessment.
# For categorical columns with missing values, we can fill with 0
df['Product_Category_2'] = df['Product_Category_2'].fillna(0)
df['Product_Category_3'] = df['Product_Category_3'].fillna(0)
# Drop rows with remaining null values
df.dropna(inplace=True)
Why this code?
Handle missing data:
- **For Product_Category_2 & 3**: Fill with 0 (indicating no secondary/tertiary category)
- **Why?** Many products don't have secondary/tertiary categories
- **Alternative rejected**: Deleting all rows with nulls would lose too much data (a quick sanity check on this is sketched below)
- **Alternative considered**: Forward/backward fill - not suitable for categorical data
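As that sanity check (a sketch; run it before the fill step above to see the full picture), you can quantify how many rows a blanket dropna() would discard:
rows_before = len(df)
rows_after = len(df.dropna())
# Fraction of the dataset a blanket dropna() would throw away
print(f"Rows lost: {rows_before - rows_after} ({(rows_before - rows_after) / rows_before:.1%})")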
df.duplicated().sum()
Why this code?
Check for duplicate rows. If the same customer bought identical products, that's valid data, not a duplicate.
Why do we need to encode?
from sklearn.preprocessing import LabelEncoder
encoding_dict = {}
for column in df.columns:
    if df[column].dtype == "object":
        le = LabelEncoder()
        df[column] = le.fit_transform(df[column])
        encoding_dict[column] = dict(zip(le.classes_, le.transform(le.classes_)))
for key, value in encoding_dict.items():
    print(f"Mappings for {key}: {value}")
Why this code?
Apply Label Encoding to all categorical columns:
- Store the mapping so we can interpret results later
- Why Label Encoding? Simple, efficient, and works well for statistical tests
- Why not One-Hot? Would create too many columns for this analysis (a sketch of the one-hot alternative follows)
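For comparison, a minimal sketch of the rejected one-hot alternative using pandas (run it on the raw dataframe, before the label-encoding step above, since that step removes the object columns):
# One-hot encoding expands each categorical column into one indicator column
# per category, which is why it was rejected for this analysis
object_cols = [c for c in df.columns if df[c].dtype == "object"]
one_hot_df = pd.get_dummies(df, columns=object_cols)
print(f"Columns before: {df.shape[1]}, after one-hot: {one_hot_df.shape[1]}")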
It was observed historically that males aged 18-25 spent an average of 10,000. Is this still true?
from scipy.stats import ttest_1samp
# Filter: Males (Gender=1), Age 18-25 (Age=1)
male_18_25 = df[(df["Gender"] == 1) & (df["Age"] == 1)]
# Draw a fixed-size random sample (random_state makes the test reproducible)
sample = male_18_25.sample(3600, random_state=0)
# Perform one sample t-test
population_mean = 10000
t_statistic, p_value = ttest_1samp(sample["Purchase"], population_mean)
print(f"T-Statistic: {t_statistic}")
print(f"P-Value: {p_value}")
print(f"Sample Mean: {sample['Purchase'].mean()}")
if p_value > 0.05:
    print("\n✓ Fail to reject Null Hypothesis: Mean is still approximately 10,000")
else:
    print("\n✗ Reject Null Hypothesis: Mean has significantly changed from 10,000")
Why this code?
**One Sample T-Test**: Compare the sample mean to a known population mean.
**Why a t-test?**
- We're comparing one sample to a known value
- The population standard deviation is unknown, so a t-test is appropriate (with n = 3,600, the t and z tests give nearly identical results)
- Assumes an approximately normal distribution; with large samples the test is robust to this
**Random State**: Ensures reproducibility
**Interpretation**:
- If p > 0.05: Historical average still holds
- If p < 0.05: Customer behavior has changed
A confidence-interval view of the same result is sketched below.
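As that complementary view (a sketch, not part of the original project), a 95% confidence interval for the sample mean tells the same story as the t-test: if 10,000 falls inside the interval, we fail to reject the null.
from scipy import stats
# 95% confidence interval for the mean purchase amount in the sample
ci_low, ci_high = stats.t.interval(0.95, len(sample) - 1,
                                   loc=sample['Purchase'].mean(),
                                   scale=stats.sem(sample['Purchase']))
print(f"95% CI for the mean: ({ci_low:.2f}, {ci_high:.2f})")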
Do males and females in the 18-25 age group have the same average purchase amount?
from scipy.stats import ttest_ind
# Filter data
male_18_25 = df[(df["Gender"] == 1) & (df["Age"] == 1)]
female_18_25 = df[(df["Gender"] == 0) & (df["Age"] == 1)]
# Create samples
male_sample = male_18_25.sample(3600, random_state=0)
female_sample = female_18_25.sample(1100, random_state=0)
# Check variances
male_variance = male_sample["Purchase"].var()
female_variance = female_sample["Purchase"].var()
print(f"Male Variance: {male_variance}")
print(f"Female Variance: {female_variance}")
# Perform two sample t-test (assuming equal variance)
t_statistic, p_value = ttest_ind(male_sample["Purchase"], female_sample["Purchase"], equal_var=True)
print(f"\nT-Statistic: {t_statistic}")
print(f"P-Value: {p_value}")
print(f"Male Mean: {male_sample['Purchase'].mean()}")
print(f"Female Mean: {female_sample['Purchase'].mean()}")
if p_value > 0.05:
    print("\n✓ Fail to reject Null: Males and females spend similarly")
else:
    print("\n✗ Reject Null: Spending differs between genders")
Why this code?
**Independent Two Sample T-Test**: Compare the means of two independent groups.
**Why this test?**
- Two separate groups (Male vs Female)
- Testing if their means are equal
- Assumes independence and approximately normal distributions
**Equal Variance Assumption**:
- We set `equal_var=True` based on the variance comparison
- If variances are very different, use `equal_var=False` (Welch's t-test)
**Why not reject based on the variance difference?**
- Small differences in variance are acceptable
- The t-test is robust to moderate variance differences with large samples
A formal variance check is sketched below.
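To make the eyeball variance comparison rigorous, one could run Levene's test (a sketch, not in the original project) and choose the t-test variant from its result:
from scipy.stats import levene, ttest_ind
# Levene's test: the null hypothesis is that the two groups have equal variances
levene_stat, levene_p = levene(male_sample['Purchase'], female_sample['Purchase'])
# If the variances look unequal, fall back to Welch's t-test (equal_var=False)
equal_var = levene_p > 0.05
t_statistic, p_value = ttest_ind(male_sample['Purchase'], female_sample['Purchase'],
                                 equal_var=equal_var)
print(f"Levene p-value: {levene_p:.4f} -> equal_var={equal_var}")
print(f"T-Statistic: {t_statistic:.4f}, P-Value: {p_value:.4f}")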
✅ What we learned:
- How to audit a dataset for missing values and duplicates, and fill domain-appropriate defaults
- How to label-encode categorical columns while keeping the mappings for interpretation
- How to frame business questions as one-sample and two-sample t-tests
⚡ Key Statistical Concepts:
- Null and alternative hypotheses, p-values, and the 0.05 significance level
- One-sample t-test (sample mean vs a known value) and independent two-sample t-test (two group means)
- The equal-variance assumption and Welch's t-test as the fallback
💡 Real-World Application:
E-commerce teams can use these tests to check whether historical spending benchmarks still hold and whether customer segments differ in average spend before acting on those assumptions.