
E-commerce Purchase Data Analysis

Data Analysis · Intermediate

📊 Dataset: Purchase Dataset

Analyze customer purchase patterns using statistical analysis and hypothesis testing


📚 Solved Project with Notes

Good morning, everyone! In this project, we will analyze e-commerce purchase data to understand customer behavior and validate hypotheses using statistical testing.

Dataset Information

Dataset Source: Purchase Data on Google Drive

Python Code
import pandas as pd
import numpy as np
from scipy import stats
from sklearn.preprocessing import LabelEncoder
import matplotlib.pyplot as plt
import seaborn as sns
💡 Why this code?

We import the essential libraries:

  • **pandas**: data manipulation and analysis
  • **numpy**: numerical operations
  • **scipy**: statistical testing
  • **sklearn**: data preprocessing (LabelEncoder)
  • **matplotlib & seaborn**: data visualization
Python Code
df = pd.read_csv('/path/to/purchase_data.csv')
df.head()
💡 Why this code?

Load the dataset and display the first few rows to understand its structure.

Data Dictionary

  • User_ID: Unique identifier for each customer
  • Product_ID: Unique identifier for each product
  • Gender: Gender of the customer (M/F)
  • Age: Age group of the customer
  • Occupation: Occupation code of the customer
  • City_Category: Region category (A, B, C)
  • Stay_In_Current_City_Years: Years living in current city
  • Marital_Status: 0 = Single, 1 = Married
  • Product_Category_1/2/3: Product categories
  • Purchase: Purchase amount in currency units

Exploratory Data Analysis (EDA)

Why EDA?

Before building any model or drawing conclusions, we need to:

  1. Understand the data structure
  2. Check for missing values (null values)
  3. Identify duplicates
  4. Detect outliers
  5. Understand distributions and relationships
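Steps 4 and 5 don't get a dedicated cell below, so here is a minimal sketch (assuming the Purchase column loaded above, and reusing the imports from the first cell) using the 1.5×IQR rule and a histogram:

# Minimal sketch: IQR-based outlier check and distribution plot for Purchase
q1 = df['Purchase'].quantile(0.25)
q3 = df['Purchase'].quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = df[(df['Purchase'] < lower) | (df['Purchase'] > upper)]
print(f"Rows outside the 1.5*IQR fences: {len(outliers)}")

sns.histplot(df['Purchase'], bins=50)
plt.title('Purchase Amount Distribution')
plt.show()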
Python Code
df.info()
💡 Why this code?

Check data types and missing values. This helps us understand what preprocessing is needed.
Python Code
df.isnull().sum()
💡 Why this code?

Identify how many missing values exist in each column. This is crucial for data quality assessment.
Python Code
# For categorical columns with missing values, we can fill with 0
df['Product_Category_2'] = df['Product_Category_2'].fillna(0)
df['Product_Category_3'] = df['Product_Category_3'].fillna(0)

# Drop rows with remaining null values
df.dropna(inplace=True)
💡 Why this code?

Handle missing data:

  • **Product_Category_2 & 3**: fill with 0, indicating no secondary/tertiary category
  • **Why?** Many products simply don't have secondary or tertiary categories
  • **Alternative rejected**: deleting all rows with nulls would lose too much data
  • **Alternative considered**: forward/backward fill, which is not suitable for categorical data
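To make that trade-off concrete, a quick check (not part of the original notebook; it only makes sense if run before the fill above) of how many rows a blanket dropna would discard:

# Hypothetical check, run BEFORE filling: how much data would dropna() discard?
rows_before = len(df)
rows_after = len(df.dropna())
lost = rows_before - rows_after
print(f"dropna() would remove {lost} of {rows_before} rows ({100 * lost / rows_before:.1f}%)")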
Python Code
df.duplicated().sum()
💡 Why this code?

Check for duplicate rows. Note that if the same customer bought the same product more than once, that is valid repeat-purchase data, not a duplicate record.
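Since a repeat purchase by the same customer is legitimate, a small sketch distinguishing full-row duplicates from repeat buyer/product pairs (column names taken from the data dictionary above):

# Full-row duplicates vs. repeat User/Product pairs
full_duplicates = df.duplicated().sum()
repeat_pairs = df.duplicated(subset=['User_ID', 'Product_ID']).sum()
print(f"Identical rows (true duplicates): {full_duplicates}")
print(f"Repeat User/Product pairs (often valid repeat purchases): {repeat_pairs}")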

Data Encoding

Why do we need to encode?

  • Machine learning models require numerical input
  • Categorical columns (Gender, City_Category) must be converted to numbers

Encoding Methods:

  1. Label Encoding: Converts to 0, 1, 2... (alphabetical order)
  2. Ordinal Encoding: Specific order (e.g., Low, Medium, High)
  3. One-Hot Encoding: Creates binary columns for each category
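To contrast the three methods, a small illustrative sketch on a toy column (the notebook itself applies Label Encoding in the next cell):

# Toy example contrasting the three encoding methods
toy = pd.DataFrame({'City_Category': ['A', 'C', 'B', 'A']})

# 1. Label Encoding: alphabetical integer codes
le = LabelEncoder()
print(le.fit_transform(toy['City_Category']))      # [0 2 1 0]

# 2. Ordinal Encoding: an order we specify explicitly
order = {'A': 0, 'B': 1, 'C': 2}
print(toy['City_Category'].map(order).tolist())    # [0, 2, 1, 0]

# 3. One-Hot Encoding: one binary column per category
print(pd.get_dummies(toy['City_Category'], prefix='City'))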
Python Code
from sklearn.preprocessing import LabelEncoder

encoding_dict = {}
for column in df.columns:
    if df[column].dtype == "object":
        le = LabelEncoder()
        df[column] = le.fit_transform(df[column])
        encoding_dict[column] = dict(zip(le.classes_, le.transform(le.classes_)))

for key, value in encoding_dict.items():
    print(f"Mappings for {key}: {value}")
💡 Why this code?

Apply Label Encoding to all categorical columns:

  • Store the mappings so we can interpret the encoded values later
  • **Why Label Encoding?** Simple, efficient, and works well for statistical tests
  • **Why not One-Hot?** It would create too many columns for this analysis

Statistical Hypothesis Testing

Steps in Hypothesis Testing:

  1. Data Collection: We have our dataset
  2. Sample Preparation: Extract relevant data subset
  3. Hypothesis Formation:
    • Null Hypothesis (H₀): No difference/relationship exists
    • Alternate Hypothesis (H₁): Difference/relationship exists
  4. Apply Statistical Test: Choose appropriate test
  5. Interpret Results:
    • If p-value > 0.05: Fail to reject Null Hypothesis
    • If p-value < 0.05: Reject Null Hypothesis (statistically significant)
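The decision rule in step 5 can be captured in a tiny helper (a hypothetical convenience function, not used in the original cells), with 0.05 as the significance level:

# Hypothetical helper for the decision rule used throughout this project
def interpret_p(p_value, alpha=0.05):
    if p_value < alpha:
        return "Reject H0 (statistically significant)"
    return "Fail to reject H0"

print(interpret_p(0.03))   # Reject H0 (statistically significant)
print(interpret_p(0.40))   # Fail to reject H0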

Hypothesis 1: Is the average purchase of males aged 18-25 still 10,000?

It was observed historically that males aged 18-25 spent an average of 10,000. Is this still true?

  • H₀: Mean purchase = 10,000
  • H₁: Mean purchase ≠ 10,000
  • Test: One Sample T-Test
Python Code
from scipy.stats import ttest_1samp

# Filter: Males (Gender=1), Age 18-25 (Age=1)
male_18_25 = df[(df["Gender"] == 1) & (df["Age"] == 1)]

# Draw a reproducible random sample of 3,600 purchases
sample = male_18_25.sample(3600, random_state=0)

# Perform one sample t-test
population_mean = 10000
t_statistic, p_value = ttest_1samp(sample["Purchase"], population_mean)

print(f"T-Statistic: {t_statistic}")
print(f"P-Value: {p_value}")
print(f"Sample Mean: {sample["Purchase"].mean()}")

if p_value > 0.05:
    print("\n✓ Fail to reject Null Hypothesis: Mean is still approximately 10,000")
else:
    print("\n✗ Reject Null Hypothesis: Mean has significantly changed from 10,000")
💡 Why this code?

**One Sample T-Test**: compares a sample mean to a known population mean.

**Why a t-test?**

  • We are comparing one sample to a known value
  • The population standard deviation is unknown, so the t-test is the appropriate choice (for samples this large it converges to the z-test anyway)
  • It assumes normality, which is a reasonable approximation for large samples

**Random State**: ensures the sample, and therefore the result, is reproducible.

**Interpretation**:

  • If p > 0.05: the historical average still holds
  • If p < 0.05: customer behavior has changed
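As a complementary check (a sketch reusing the same `sample` as above), a 95% confidence interval for the mean tells the same story: if 10,000 falls outside the interval, the t-test rejects H₀ at the 5% level.

# 95% confidence interval for the sample mean (complements the t-test)
n = len(sample)
mean = sample['Purchase'].mean()
sem = stats.sem(sample['Purchase'])   # standard error of the mean
ci_low, ci_high = stats.t.interval(0.95, df=n - 1, loc=mean, scale=sem)
print(f"95% CI for mean purchase: ({ci_low:.2f}, {ci_high:.2f})")
print(f"Hypothesized mean 10,000 inside CI: {ci_low <= 10000 <= ci_high}")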

Hypothesis 2: Do men and women (18-25) spend equally?

Do males and females in the 18-25 age group have the same average purchase amount?

  • H₀: Mean purchase (Males) = Mean purchase (Females)
  • H₁: Mean purchase (Males) ≠ Mean purchase (Females)
  • Test: Independent Two Sample T-Test
Python Code
from scipy.stats import ttest_ind

# Filter data
male_18_25 = df[(df["Gender"] == 1) & (df["Age"] == 1)]
female_18_25 = df[(df["Gender"] == 0) & (df["Age"] == 1)]

# Create samples
male_sample = male_18_25.sample(3600, random_state=0)
female_sample = female_18_25.sample(1100, random_state=0)

# Check variances
male_variance = male_sample["Purchase"].var()
female_variance = female_sample["Purchase"].var()
print(f"Male Variance: {male_variance}")
print(f"Female Variance: {female_variance}")

# Perform two sample t-test (assuming equal variance)
t_statistic, p_value = ttest_ind(male_sample["Purchase"], female_sample["Purchase"], equal_var=True)

print(f"\nT-Statistic: {t_statistic}")
print(f"P-Value: {p_value}")
print(f"Male Mean: {male_sample["Purchase"].mean()}")
print(f"Female Mean: {female_sample["Purchase"].mean()}")

if p_value > 0.05:
    print("\n✓ Fail to reject Null: Males and females spend similarly")
else:
    print("\n✗ Reject Null: Spending differs between genders")
💡 Why this code?

**Independent Two Sample T-Test**: compares the means of two independent groups.

**Why this test?**

  • Two separate groups (male vs. female)
  • We are testing whether their means are equal
  • It assumes independence and approximately normal distributions

**Equal Variance Assumption**:

  • We set `equal_var=True` based on the variance comparison above
  • If the variances were very different, we would use `equal_var=False` (Welch's t-test)

**Why not abandon the equal-variance assumption over the observed difference?**

  • Small differences in variance are acceptable
  • The t-test is robust to moderate variance differences when samples are large
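To make the variance check formal rather than eyeballed, a hedged sketch using Levene's test to choose between the pooled and Welch variants:

# Levene's test for equal variances, then pick the matching t-test variant
levene_stat, levene_p = stats.levene(male_sample['Purchase'], female_sample['Purchase'])
use_equal_var = levene_p > 0.05   # fail to reject equal variances
print(f"Levene p-value: {levene_p:.4f} -> equal_var={use_equal_var}")

t_statistic, p_value = stats.ttest_ind(male_sample['Purchase'], female_sample['Purchase'], equal_var=use_equal_var)
print(f"T-Statistic: {t_statistic:.4f}, P-Value: {p_value:.4f}")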

Key Takeaways

What we learned:

  1. Data Preprocessing: Handle nulls, duplicates, and encoding
  2. EDA: Understand data before analysis
  3. Hypothesis Testing: Validate business assumptions
  4. Statistical Interpretation: Use p-values correctly

Key Statistical Concepts:

  • Null vs Alternate hypotheses
  • P-values and significance levels
  • One-sample vs Two-sample tests
  • The importance of random sampling

💡 Real-World Application:

  • Validate business assumptions before decisions
  • Use statistical tests to avoid biased conclusions
  • Always consider sample size and limitations

💻 Practice Here

💡 Tip: Use this notebook to practice alongside the solved project. Type your own code and take notes!


💡 Tips for Learning

  • ✓ First, carefully read through the solved project to understand the approach
  • ✓ Pay attention to the "why" behind each decision and what alternatives were considered
  • ✓ Then, switch to the practice section and try to replicate the steps yourself
  • ✓ Don't copy-paste; type the code to build muscle memory
  • ✓ Experiment with the code: change parameters and see what happens
  • ✓ Take notes of your learnings and insights