The above code imports all the required libraries for this particular project.
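The import cell itself is not reproduced in this section; based on the calls used below, it would look roughly like the following sketch (inferred from the code, not copied from the original notebook):

import pandas as pd                       # data loading and manipulation
import matplotlib.pyplot as plt           # plotting the cluster distributions
from sklearn.impute import SimpleImputer  # mean imputation of numerical features
from sklearn.cluster import KMeans        # k-means clustering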
folder_path = "/content/drive/MyDrive/Customer segmentation" # Folder path (Google Drive)
test_data = pd.read_csv("/content/drive/MyDrive/Customer segmentation/Test.csv")
train_data = pd.read_csv("/content/drive/MyDrive/Customer segmentation/Train.csv")
The code above points to the location of the data files in Google Drive. The variable "test_data" stores the contents of the file "Test.csv", and similarly the variable "train_data" holds the data from the file "Train.csv".
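For these paths to resolve in Google Colab, Google Drive has to be mounted first (that step is not shown in the snippet above). A minimal sketch using the standard Colab helper:

from google.colab import drive
drive.mount("/content/drive")  # After mounting, files under MyDrive appear at /content/drive/MyDrive/...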
# Identify and handle missing values
numerical_features = ["Age", "Work_Experience", "Family_Size"]
missing_numerical = train_data[numerical_features].isnull().any()
if missing_numerical.any():
    imputer = SimpleImputer(strategy="mean")  # Replace missing numerical values with the column mean
    train_data[numerical_features] = imputer.fit_transform(train_data[numerical_features])
    test_data[numerical_features] = imputer.transform(test_data[numerical_features])  # Reuse the means learned from the training data
STEP 2: DATA PREPROCESSING
# Identify and handle categorical features
categorical_features = ["Gender", "Graduated", "Profession"] # Adjust as needed
missing_categorical = train_data[categorical_features].isnull().any()
if missing_categorical.any():
    # fillna on a slice with inplace=True does not modify the original DataFrame, so assign back with .loc
    train_data.loc[:, categorical_features] = train_data[categorical_features].fillna(value="Missing")
    test_data.loc[:, categorical_features] = test_data[categorical_features].fillna(value="Missing")  # Fill with the same value for consistency

# Encode categorical features (one-hot encoding)
train_encoded = pd.get_dummies(train_data[categorical_features], prefix="category_", drop_first=True)
test_encoded = pd.get_dummies(test_data[categorical_features], prefix="category_", drop_first=True)
# Align the test columns with the training columns in case a category appears in only one of the files
test_encoded = test_encoded.reindex(columns=train_encoded.columns, fill_value=0)
# Combine encoded features with numerical features
train_x = pd.concat([train_data[numerical_features], train_encoded], axis=1)
test_x = pd.concat([test_data[numerical_features], test_encoded], axis=1)
The above code cleans and preprocesses the data. The following preprocessing steps are performed:
1. Handling missing values: blank entries in the numerical features (Age, Work_Experience, Family_Size) are replaced with the column mean, while blank entries in the categorical features are replaced with the value "Missing".
2. One-hot encoding of the categorical features, which converts text values into numeric indicator columns (a toy example is shown below).
Finally, the encoded features are combined with the numerical features and stored in the variables "train_x" and "test_x" respectively.
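As a quick illustration of what the one-hot encoding step produces, here is a toy example with made-up values (not rows from the actual dataset), using the same pd.get_dummies call as above:

# Toy example: one-hot encoding turns text categories into numeric indicator columns
toy = pd.DataFrame({"Gender": ["Male", "Female", "Male"],
                    "Graduated": ["Yes", "No", "Yes"]})
print(pd.get_dummies(toy, prefix="category_", drop_first=True))
# With drop_first=True, one category per feature is dropped, leaving e.g. category__Male and category__Yes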
STEP 3: K-MEANS CLUSTERING
k = 4
kmeans = KMeans(n_clusters=k, random_state=42)
kmeans.fit(train_x)
train_labels = kmeans.labels_
if test_x is not None:
    test_labels = kmeans.predict(test_x)  # Assign each test row to the nearest learned centroid
print(pd.Series(train_labels).value_counts()) # Examine cluster sizes
This code performs k-means clustering, a technique that groups similar data points together. It takes a dataset (train_x) and splits it into a predefined number of clusters (k, here 4). Using the k-means algorithm, each data point is assigned to the closest cluster, where closeness is measured by the distance to the cluster centroids. The code then provides labels for each data point in train_x, indicating its assigned cluster. For new data (test_x), it can predict cluster labels for those points as well. Finally, it analyzes the distribution of data points across the clusters by printing the number of points in each cluster.
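To make the "closest cluster" idea concrete, the label of any row can be reproduced by hand from the fitted centroids. The following is a small sketch (not part of the original notebook) that compares a manual distance calculation with kmeans.predict for the first training row:

import numpy as np

row = train_x.iloc[[0]]  # First training row, kept as a one-row DataFrame
# Euclidean distance from this row to each of the k cluster centroids
dists = np.linalg.norm(kmeans.cluster_centers_ - row.to_numpy(dtype=float), axis=1)
print("Distances to the", k, "centroids:", dists)
print("Closest centroid:", dists.argmin(), "| label from kmeans.predict:", kmeans.predict(row)[0])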
OUTPUT:
STEP 4: ANALYZING CHARACTERISTICS OF EACH CLUSTER
# Plot the distribution of each numerical feature, broken down by cluster
for feature in numerical_features:
    plt.figure(figsize=(8, 6))
    for cluster in range(k):
        cluster_data = train_data[train_labels == cluster]  # Rows assigned to this cluster
        plt.hist(cluster_data[feature], label=f"Cluster {cluster}", alpha=0.5)
    plt.title(f"Distribution of {feature} Across Clusters")
    plt.xlabel(feature)
    plt.ylabel("Frequency")
    plt.legend()
    plt.show()
plt.figure(figsize=(8, 6))
plt.scatter(train_data["Age"], train_data["Work_Experience"], c=train_labels)
plt.xlabel("Age")
plt.ylabel("Work Experience")
plt.title("Relationship Between Age and Work Experience (Colored by Cluster)")
plt.show()
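Alongside the plots, a compact numeric profile of each cluster can make the comparison easier. Here is a minimal sketch (not part of the original notebook) that reuses the labels computed in Step 3:

# Mean of each numerical feature per cluster, plus the cluster sizes
profile = train_data[numerical_features].groupby(train_labels).mean()
profile["Count"] = pd.Series(train_labels).value_counts().sort_index()
print(profile.round(2))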
The age distribution within the clusters reveals interesting trends. Both Cluster 0 and Cluster 1 show a peak frequency in the 30s and 40s age range. However, there is a divergence in the younger and older demographics: Cluster 1 appears to have a larger population in their 20s, while Cluster 0 has a higher frequency of individuals in their 50s and 60s. It is important to note that data for only two clusters is visible in this plot; including the remaining clusters would provide a clearer picture of the overall age distribution across all groups.
The graph titled "Distribution of Work_Experience Across Clusters" shows the distribution of work experience in two out of four clusters, Cluster 0 and Cluster 3. The x-axis labeled "Work_Experience" shows the number of years of work experience, ranging from 0 to 14. The y-axis labeled "Frequency" indicates the number of people in each category.
The data suggests that Cluster 0 has a significantly higher number of people with little to no work experience (0-2 years) compared to Cluster 3. There are around 350 people in Cluster 0 with 0-2 years of experience, whereas Cluster 3 only has about 50 people in that category. Conversely, Cluster 3 appears to have a higher frequency of people with more work experience (4 or more years) than Cluster 0. There are around 200 people in Cluster 3 with 4-6 years of experience, whereas Cluster 0 only has about 50 people in that category.

The graph titled "Distribution of Family_Size Across Clusters" shows the distribution of family size across four clusters. The x-axis labeled "Family_Size" shows the number of family members, ranging from 1 to 9. The y-axis labeled "Frequency" indicates the number of families in each category.
The graph shows that Cluster 1 has the most families, followed by Cluster 2, Cluster 0 and Cluster 3 (in descending order). Cluster 1 has the highest frequency of families with 4 members, while Cluster 2 has the highest frequency of families with 3 members. Interestingly, Clusters 0 and 3 have a very similar distribution of family sizes, peaking at around 2-3 family members.
The graph titled "Relationship Between Age and Work Experience (Colored by Cluster)" shows the average work experience for people in different age groups, across four clusters. The x-axis labeled "Age" shows the age of the person, ranging from 20 to 90. The y-axis labeled "Work Experience" shows the average number of years of work experience. The data for each age group is represented by a colored line, where the color corresponds to a specific cluster. However, there is no legend explaining which color corresponds to which cluster.