Clustering

Agglomerative clustering on supermarket data to analyze customer attribute trends and their relation to a customer's spending score

Introduce the Problem

Introducing the Data

Data Visualization

Pre-Processing Data

Modeling (Clustering)

Storytelling

Impact Session

References & Codes

The objective of this project will be to identify what features of a customer affect their spending score when shopping at a supermarket. Some of the questions that will be asked are:

What features affect a customer's spending score the most?
Do any features correlate more strongly with the spending score than others?
What would be the meaning behind the outcomes of the relationships?

These problems will be explored through a method called clustering.

Clustering is the process of separating data into unlabeled groups, or clusters, so that the data points in each cluster are similar to each other. In this project we will be using agglomerative clustering, which is a type of hierarchical clustering. The algorithm is hierarchical because of its bottom-up approach: it starts with every data point as its own cluster, then repeatedly merges the most similar clusters, moving up the hierarchy with each merge.
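To make the bottom-up idea concrete, here is a minimal sketch in plain Python. It is not the code used in this project: it assumes single linkage (the distance between two clusters is the distance between their closest members) because that is the simplest rule to write out, and the points are made up for illustration.

```python
from itertools import combinations
from math import dist

def agglomerate(points, n_clusters):
    """Merge the two closest clusters until n_clusters remain (single linkage)."""
    # Start with every point in its own cluster.
    clusters = [[p] for p in points]
    while len(clusters) > n_clusters:
        # Find the pair of clusters whose closest members are nearest.
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: min(
                dist(a, b) for a in clusters[ij[0]] for b in clusters[ij[1]]
            ),
        )
        # Merge cluster j into cluster i (i < j, so deleting j is safe).
        clusters[i].extend(clusters[j])
        del clusters[j]
    return clusters

# Made-up (income, spending score) pairs, not taken from the dataset.
toy_points = [(15, 39), (17, 40), (78, 90), (80, 88), (77, 12), (79, 15)]
groups = agglomerate(toy_points, n_clusters=3)
```

Each pass through the loop is one step up the hierarchy. Library implementations follow the same idea but use faster bookkeeping and offer other linkage rules, such as Ward or average linkage.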

The data I will be using is the Mall Customer Segmentation Data, found on Kaggle: https://www.kaggle.com/datasets/vjchoudhary7/customer-segmentation-tutorial-in-python

The data contains information from customers' membership cards for a supermarket mall. The features within the data are customer ID, age, gender, annual income (in thousands of dollars) and spending score. The spending score ranges from 1 to 100, with 1 meaning that the customer barely spends money and 100 meaning that the customer spends a lot of money.

To better understand the data, I wanted to take a look at the distribution of different features of the data. I created histograms of the spending score, annual income and ages of the customers.

Screen Shot 2022-12-04 at 10.54.44 PM.png
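The binning behind a histogram can be sketched without a plotting library. The helper below is hypothetical, and the ages are made up for illustration; it only shows how values are counted into 10-year bands and turned into shares of the total.

```python
from collections import Counter

def decade_shares(ages):
    """Return the share of customers in each 10-year age band."""
    # Label each age by the decade it falls in, e.g. 23 -> "20-29".
    bands = Counter(f"{(age // 10) * 10}-{(age // 10) * 10 + 9}" for age in ages)
    total = len(ages)
    return {band: count / total for band, count in sorted(bands.items())}

# Made-up ages for illustration only; not taken from the dataset.
sample_ages = [19, 21, 23, 31, 35, 36, 40, 47, 52, 64]
shares = decade_shares(sample_ages)
```

The same counting applies to annual income or spending score with a different bin width.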

Next, I created a heat map of the different features in the dataset in order to see which features correlate most strongly with each other. Ignoring the Customer ID mapping, we can see that the highest relational value is between the spending score and annual income. This gave me the idea of looking for clusters in the dataset by comparing these two features. To be sure that this relationship was the right place to begin, a scatter plot with all attributes was created.

Screen Shot 2022-12-04 at 11.11.50 PM.png

Screen Shot 2022-12-04 at 11.19.53 PM.png
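Each cell of a correlation heat map is a Pearson correlation coefficient between two columns. As a rough sketch of what those cells contain, the code below computes them by hand; the column names and values are made up for illustration and are not the project's data.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

def correlation_matrix(columns):
    """Pairwise Pearson correlations for a dict of column name -> values."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}

# Made-up values for illustration only; the column names are assumptions.
toy = {
    "age": [19, 21, 35, 47, 64],
    "income_k": [15, 16, 40, 78, 120],
    "spending_score": [39, 81, 50, 20, 10],
}
matrix = correlation_matrix(toy)
```

Values near 1 or -1 indicate a strong linear relationship and values near 0 a weak one. A low coefficient does not rule out cluster structure, which is why the scatter plots are worth checking as well.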

The main pre-processing step was determining the optimal number of clusters. I used a dendrogram, a technique that belongs to hierarchical methods such as agglomerative clustering. It works by starting with each point as its own cluster and gradually grouping points together to form larger clusters in a hierarchical manner, based on the distance from each “cluster” to the next. In the dendrogram, a horizontal line drawn at y = 150 crosses five vertical lines, meaning that 5 clusters would be optimal for our agglomerative clustering model.

Screen Shot 2022-12-04 at 11.44.57 PM.png

Screen Shot 2022-12-04 at 11.31.21 PM.png
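Reading a cluster count off a dendrogram amounts to cutting the merge tree at a chosen height. The sketch below shows this with SciPy on toy data; it is an illustration, not the code used in this project. Ward linkage, the toy points, and the cut height of 10 are all assumptions chosen to make the example easy to follow.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

# Toy data: three tight, well-separated groups (not the mall dataset).
offsets = np.array([[0, 0], [1, 0], [0, 1], [1, 1]])
centers = np.array([[0, 0], [100, 0], [0, 100]])
X = np.vstack([center + offsets for center in centers])

# Build the merge tree; Ward linkage is an assumption here.
Z = linkage(X, method="ward")

# Cutting at a height keeps only the merges that happen below it.
cut_height = 10
labels = fcluster(Z, t=cut_height, criterion="distance")
n_clusters = len(set(labels))
```

Passing `Z` to `scipy.cluster.hierarchy.dendrogram` draws the tree itself (this needs matplotlib). The number of vertical lines crossed by a horizontal line at the cut height equals `n_clusters`.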

I used the agglomerative clustering method. I chose it over K-means because it is less sensitive to outliers, and my data had larger standard deviations than anticipated. Another reason was its ability to group observations by how similar they are to one another. Since my dataset was not large, the higher computational cost of hierarchical clustering was not a concern, so I decided this would be the best route.

Screen Shot 2022-12-05 at 12.20.04 AM.png
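One common way to fit such a model is scikit-learn's `AgglomerativeClustering`. The source does not state which library or linkage was used, so the following is a sketch under assumptions: Ward linkage, and synthetic (income, spending score) pairs standing in for the real data.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Toy (income, spending score) pairs in five loose groups; not the real data.
rng = np.random.default_rng(0)
centers = np.array([[25, 20], [25, 80], [55, 50], [90, 15], [90, 85]])
X = np.vstack([c + rng.normal(scale=2.0, size=(20, 2)) for c in centers])

# n_clusters=5 mirrors the dendrogram reading above; Ward linkage is assumed.
model = AgglomerativeClustering(n_clusters=5, linkage="ward")
labels = model.fit_predict(X)
```

`labels` holds one cluster index per row of `X`, which can then be used to color a scatter plot of income against spending score.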

From my agglomerative clustering analysis and the visualizations, we can conclude the following:

The supermarket has the majority of its customers in the age groups of 20-30 and 30-40, which together make up approximately 52% of the customers in the dataset. The average annual income of the customers is $60,000. Within our clustering, we can see that the majority of the customers with a median income also have a mean spending score of 50. The second largest cluster is cluster 2, which consists of customers with a high annual income and a high spending score. The third largest cluster is cluster 0, which shows customers with a high annual income and a low spending score. We can infer that although the majority of the customers have a median income, 31% of this supermarket's customers have a high annual income. Since the only scatter plot in which we could identify a trend was the one with annual income, we can deduce that annual income is the feature that contributes most to distinguishing the customers who shop at this supermarket.
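Figures like the cluster sizes and shares above come from a per-cluster summary. A minimal sketch of that arithmetic follows; the helper is hypothetical and the rows are made up for illustration, so the numbers do not correspond to the results reported here.

```python
from collections import defaultdict
from statistics import mean

def summarize(rows):
    """Per-cluster size, share of customers, mean income and mean spending score."""
    by_cluster = defaultdict(list)
    for label, income, score in rows:
        by_cluster[label].append((income, score))
    total = len(rows)
    return {
        label: {
            "n": len(members),
            "share": len(members) / total,
            "mean_income": mean(i for i, _ in members),
            "mean_score": mean(s for _, s in members),
        }
        for label, members in sorted(by_cluster.items())
    }

# Made-up (cluster label, income in k$, spending score) rows for illustration.
rows = [(0, 88, 15), (0, 92, 19), (1, 55, 48), (1, 60, 52), (1, 57, 50), (2, 85, 82)]
summary = summarize(rows)
```

Sorting the result by `"n"` gives the ranking of clusters by size, and `"share"` gives the percentage of customers in each.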

After analyzing this data, I find that the number of observations (200) may not be a large enough sample to reflect the demographics of a supermarket. Since the clustering showed that the second and third largest clusters were made up of individuals with a high income, this could suggest that the supermarket is located in a high-income area. The data could be harmful because the supermarket might feel comfortable increasing item prices in response to these results. This could lead the supermarket to pursue profitability in a biased manner, catering only to those with at least an average income.