Analyze customer purchase patterns using statistical analysis and hypothesis testing
💡 Tip: Use this notebook to practice alongside the solved project. Type your own code and take notes!
Note: To execute Python code, you'll need to set up a backend service. Currently, this is a placeholder.
Options: Pyodide (browser-based), Thebe (JupyterHub), or a custom FastAPI backend with a Python execution service.
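If you go the custom-backend route, here is a minimal sketch of what such a service could look like (the /execute endpoint name and request shape are illustrative assumptions, not part of this project; exec() on untrusted input is unsafe without sandboxing):
import io
import contextlib
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class CodeRequest(BaseModel):
    code: str  # Python source submitted by the notebook frontend

@app.post("/execute")
def execute(req: CodeRequest):
    # Capture stdout so print() output can be returned to the frontend
    buffer = io.StringIO()
    try:
        with contextlib.redirect_stdout(buffer):
            exec(req.code, {})  # WARNING: sandbox this in any real deployment
        return {"ok": True, "stdout": buffer.getvalue()}
    except Exception as exc:
        return {"ok": False, "error": str(exc), "stdout": buffer.getvalue()}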
We will analyze e-commerce purchase data to understand customer behavior and validate hypotheses using statistical testing.
Dataset Source: Purchase Data on Google Drive
import pandas as pd
import numpy as np
from scipy import stats
from sklearn.preprocessing import LabelEncoder
import matplotlib.pyplot as plt
import seaborn as sns
Why this code?
We import essential libraries:
- **pandas**: For data manipulation and analysis
- **numpy**: For numerical operations
- **scipy**: For statistical testing
- **sklearn**: For data preprocessing (LabelEncoder)
- **matplotlib & seaborn**: For data visualization
df = pd.read_csv('/path/to/purchase_data.csv')
df.head()
Why this code?
Load the dataset and display the first few rows to understand its structure.
Before building any model or drawing conclusions, we need to:
- Check data types and missing values
- Handle missing data
- Check for duplicates
- Encode categorical variables
df.info()
Why this code?
Check data types and missing values. This helps us understand what preprocessing is needed.
df.isnull().sum()
Why this code?
Identify how many missing values exist in each column. This is crucial for data quality assessment.
# For categorical columns with missing values, we can fill with 0
df['Product_Category_2'] = df['Product_Category_2'].fillna(0)
df['Product_Category_3'] = df['Product_Category_3'].fillna(0)
# Drop rows with remaining null values
df.dropna(inplace=True)
Why this code?
Handle missing data:
- **For Product_Category_2 & 3**: Fill with 0 (indicating no secondary/tertiary category)
- **Why?** Many products don't have secondary/tertiary categories
- **Alternative rejected**: Deleting all rows with nulls would lose too much data (a quick sanity check on this is sketched below)
- **Alternative considered**: Forward/backward fill - not suitable for categorical data
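As that sanity check (a sketch; run it before the fill step above to see the full picture), you can quantify how many rows a blanket dropna() would discard:
rows_before = len(df)
rows_after = len(df.dropna())
# Fraction of the dataset a blanket dropna() would throw away
print(f"Rows lost: {rows_before - rows_after} ({(rows_before - rows_after) / rows_before:.1%})")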
df.duplicated().sum()
Why this code?
Check for duplicate rows. If the same customer bought identical products, that's valid data, not a duplicate.
Why do we need to encode?
from sklearn.preprocessing import LabelEncoder
encoding_dict = {}
for column in df.columns:
    if df[column].dtype == "object":
        le = LabelEncoder()
        df[column] = le.fit_transform(df[column])
        encoding_dict[column] = dict(zip(le.classes_, le.transform(le.classes_)))
for key, value in encoding_dict.items():
    print(f"Mappings for {key}: {value}")
Why this code?
Apply Label Encoding to all categorical columns:
- Store the mapping so we can interpret results later
- Why Label Encoding? Simple, efficient, and works well for statistical tests
- Why not One-Hot? Would create too many columns for this analysis (a sketch of the one-hot alternative follows)
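For comparison, a minimal sketch of the rejected one-hot alternative using pandas (run it on the raw dataframe, before the label-encoding step above, since that step removes the object columns):
# One-hot encoding expands each categorical column into one indicator column
# per category, which is why it was rejected for this analysis
object_cols = [c for c in df.columns if df[c].dtype == "object"]
one_hot_df = pd.get_dummies(df, columns=object_cols)
print(f"Columns before: {df.shape[1]}, after one-hot: {one_hot_df.shape[1]}")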
It was observed historically that males aged 18-25 spent an average of 10,000. Is this still true?
from scipy.stats import ttest_1samp
# Filter: Males (Gender=1), Age 18-25 (Age=1)
male_18_25 = df[(df["Gender"] == 1) & (df["Age"] == 1)]
# Draw a fixed-size random sample (random_state makes the test reproducible)
sample = male_18_25.sample(3600, random_state=0)
# Perform one sample t-test
population_mean = 10000
t_statistic, p_value = ttest_1samp(sample["Purchase"], population_mean)
print(f"T-Statistic: {t_statistic}")
print(f"P-Value: {p_value}")
print(f"Sample Mean: {sample['Purchase'].mean()}")
if p_value > 0.05:
    print("\n✓ Fail to reject Null Hypothesis: Mean is still approximately 10,000")
else:
    print("\n✗ Reject Null Hypothesis: Mean has significantly changed from 10,000")
Why this code?
**One Sample T-Test**: Compare the sample mean to a known population mean.
**Why a t-test?**
- We're comparing one sample to a known value
- The population standard deviation is unknown, so a t-test is appropriate (with n = 3,600, the t and z tests give nearly identical results)
- Assumes an approximately normal distribution; with large samples the test is robust to this
**Random State**: Ensures reproducibility
**Interpretation**:
- If p > 0.05: Historical average still holds
- If p < 0.05: Customer behavior has changed
A confidence-interval view of the same result is sketched below.
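As that complementary view (a sketch, not part of the original project), a 95% confidence interval for the sample mean tells the same story as the t-test: if 10,000 falls inside the interval, we fail to reject the null.
from scipy import stats
# 95% confidence interval for the mean purchase amount in the sample
ci_low, ci_high = stats.t.interval(0.95, len(sample) - 1,
                                   loc=sample['Purchase'].mean(),
                                   scale=stats.sem(sample['Purchase']))
print(f"95% CI for the mean: ({ci_low:.2f}, {ci_high:.2f})")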
Do males and females in the 18-25 age group have the same average purchase amount?
from scipy.stats import ttest_ind
# Filter data
male_18_25 = df[(df["Gender"] == 1) & (df["Age"] == 1)]
female_18_25 = df[(df["Gender"] == 0) & (df["Age"] == 1)]
# Create samples
male_sample = male_18_25.sample(3600, random_state=0)
female_sample = female_18_25.sample(1100, random_state=0)
# Check variances
male_variance = male_sample["Purchase"].var()
female_variance = female_sample["Purchase"].var()
print(f"Male Variance: {male_variance}")
print(f"Female Variance: {female_variance}")
# Perform two sample t-test (assuming equal variance)
t_statistic, p_value = ttest_ind(male_sample["Purchase"], female_sample["Purchase"], equal_var=True)
print(f"\nT-Statistic: {t_statistic}")
print(f"P-Value: {p_value}")
print(f"Male Mean: {male_sample['Purchase'].mean()}")
print(f"Female Mean: {female_sample['Purchase'].mean()}")
if p_value > 0.05:
    print("\n✓ Fail to reject Null: Males and females spend similarly")
else:
    print("\n✗ Reject Null: Spending differs between genders")
Why this code?
**Independent Two Sample T-Test**: Compare the means of two independent groups.
**Why this test?**
- Two separate groups (Male vs Female)
- Testing if their means are equal
- Assumes independence and approximately normal distributions
**Equal Variance Assumption**:
- We set `equal_var=True` based on the variance comparison
- If variances are very different, use `equal_var=False` (Welch's t-test)
**Why not reject based on the variance difference?**
- Small differences in variance are acceptable
- The t-test is robust to moderate variance differences with large samples
A formal variance check is sketched below.
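To make the eyeball variance comparison rigorous, one could run Levene's test (a sketch, not in the original project) and choose the t-test variant from its result:
from scipy.stats import levene, ttest_ind
# Levene's test: the null hypothesis is that the two groups have equal variances
levene_stat, levene_p = levene(male_sample['Purchase'], female_sample['Purchase'])
# If the variances look unequal, fall back to Welch's t-test (equal_var=False)
equal_var = levene_p > 0.05
t_statistic, p_value = ttest_ind(male_sample['Purchase'], female_sample['Purchase'],
                                 equal_var=equal_var)
print(f"Levene p-value: {levene_p:.4f} -> equal_var={equal_var}")
print(f"T-Statistic: {t_statistic:.4f}, P-Value: {p_value:.4f}")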
✅ What we learned:
- How to audit a dataset for missing values and duplicates, and fill domain-appropriate defaults
- How to label-encode categorical columns while keeping the mappings for interpretation
- How to frame business questions as one-sample and two-sample t-tests
⚡ Key Statistical Concepts:
- Null and alternative hypotheses, p-values, and the 0.05 significance level
- One-sample t-test (sample mean vs a known value) and independent two-sample t-test (two group means)
- The equal-variance assumption and Welch's t-test as the fallback
💡 Real-World Application:
E-commerce teams can use these tests to check whether historical spending benchmarks still hold and whether customer segments differ in average spend before acting on those assumptions.