Market Basket Analysis — Simple Guide using Python

Royce Dcunha
6 min read · Jan 27, 2021


This article is about Market Basket Analysis, the Apriori algorithm & the Association Rule-Mining behind it.


Have you ever entered a store to buy a dozen eggs, and left the store with bread, milk, yogurt… and more?! So have I. Most of us give in to the urge to shop impulsively, and this is precisely how supermarkets (and other physical stores) keep increasing their profits. The Apriori algorithm is behind the Sneaky Psychology of Supermarkets.

I recently completed my Data Analytics Internship at Suven Consultants and Technology Pvt. Ltd. and performed Market Basket Analysis (MBA) on a Groceries dataset. Shout-out to George Wong for his simple, easy-to-understand blog post on MBA, which can be found here.

Why Association Rule Mining?

Association Rule Mining is used when we want to find associations between objects in a set, or frequent patterns in a transaction database, a relational database, or any other information repository. It tells us which items customers frequently buy together by generating a set of rules called Association Rules.

Why Market Basket?

  • To change the store layout based on Association Rules
  • To change the design of the catalog
  • To cross-market products in online stores
  • To identify the trending items customers buy
  • To send customized emails with add-on sales

Also, to get a clear understanding of the applications of MBA, make sure you check out the example of Walmart’s Beer-Diaper Parable.

Importing the required packages:

Let’s take a look at our dataset:

Which are the Top 20 “Hot” items:

Visualization of the top 20 “Hot” items:

Contribution of Top 20 “Hot” items to Total Sales:

This shows us that the top five items are responsible for 21.4% of all sales, and the top 20 items alone are responsible for over 50% of sales! This matters because we don’t want to find association rules for items that are bought very infrequently. With this information, we can limit the items we explore when creating our association rules. It also helps keep the number of possible itemsets manageable.
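As a rough illustration of how such a cumulative share could be computed, here is a minimal sketch using only the standard library. The toy transactions and the helper name `top_item_share` are invented for this example; they are not the original code or the Groceries data.

```python
from collections import Counter

# Toy transactions, invented for illustration (not the Groceries dataset).
transactions = [
    ["whole milk", "yogurt", "rolls/buns"],
    ["whole milk", "soda"],
    ["rolls/buns", "soda", "whole milk"],
    ["yogurt", "whole milk"],
    ["soda", "bottled water"],
]

def top_item_share(transactions, n):
    """Return (item, count, cumulative share) for the n best-selling items."""
    counts = Counter(item for basket in transactions for item in basket)
    total = sum(counts.values())  # total number of items sold
    running = 0
    result = []
    for item, count in counts.most_common(n):
        running += count
        result.append((item, count, running / total))
    return result

top = top_item_share(transactions, 2)
for item, count, cum_share in top:
    print(f"{item:12s} count={count} cumulative share={cum_share:.1%}")
```

On real data, the same idea with `n=20` would give the kind of cumulative percentages quoted above.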

Pruning the dataset for Frequently Bought Items:

Here, length_trans=2 indicates that we are interested in transactions with at least two items, and that the items we keep should together account for 40% of the total sales.
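One possible way to implement such a pruning step is sketched below. Only `length_trans` comes from the text above; the function name `prune_transactions`, the parameter `total_sales_perc`, and the toy data are hypothetical stand-ins, so treat this as an assumption about how the pruning might work rather than the original code.

```python
from collections import Counter

# Toy transactions, invented for illustration.
transactions = [
    ["whole milk", "yogurt", "rolls/buns"],
    ["whole milk", "soda"],
    ["rolls/buns", "soda", "whole milk"],
    ["yogurt", "whole milk"],
    ["soda", "bottled water"],
]

def prune_transactions(transactions, length_trans=2, total_sales_perc=0.4):
    """Keep the best-selling items that together cover `total_sales_perc`
    of all items sold, then keep only baskets that still contain at
    least `length_trans` of those items."""
    counts = Counter(item for basket in transactions for item in basket)
    total = sum(counts.values())
    keep, running = set(), 0
    for item, count in counts.most_common():
        if running / total >= total_sales_perc:
            break  # the kept items already cover the requested share
        keep.add(item)
        running += count
    pruned = []
    for basket in transactions:
        kept = [item for item in basket if item in keep]
        if len(kept) >= length_trans:
            pruned.append(kept)
    return pruned, keep

pruned, kept_items = prune_transactions(transactions)
print(kept_items)
print(pruned)
```

In this toy example only two items survive the 40% cut, so just the two baskets containing both of them remain.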

Association Rule Mining with FP Growth:

Based on Minimum Support:

The support of an itemset X, supp(X), is the proportion of transactions in the database in which the itemset X appears. It signifies the popularity of an itemset.

supp(X) = Number of transactions in which X appears / Total number of transactions
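The definition translates almost directly into code. The sketch below uses a made-up list of eight transactions and a helper named `support`, both chosen here purely for illustration.

```python
# Eight toy transactions, invented for illustration.
transactions = [
    {"onion", "potato", "burger"},
    {"onion", "potato", "burger", "milk"},
    {"onion", "potato"},
    {"potato", "burger"},
    {"milk", "burger"},
    {"onion", "milk"},
    {"milk"},
    {"potato", "milk"},
]

def support(itemset, transactions):
    """Proportion of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for basket in transactions if itemset <= set(basket))
    return hits / len(transactions)

supp_onion = support({"onion"}, transactions)
supp_onion_potato = support({"onion", "potato"}, transactions)
print(supp_onion, supp_onion_potato)
```

Here "onion" appears in 4 of the 8 transactions and {onion, potato} in 3 of them, so the supports are 0.5 and 0.375.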

If sales of a particular item beyond a certain proportion have a meaningful effect on profits, that proportion can be used as the support threshold. We can then treat itemsets whose support exceeds this threshold as significant itemsets.

Based on Confidence:

Confidence of a rule is defined as follows:

conf(X⟶Y) = supp(X ∪ Y) / supp(X)

It signifies the likelihood of item Y being purchased when item X is purchased. So, for the rule {Onion, Potato} => {Burger}, the confidence is supp({Onion, Potato, Burger}) / supp({Onion, Potato}).

It can also be interpreted as the conditional probability P(Y|X), i.e., the probability of finding the itemset Y in a transaction given that the transaction already contains X.
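To make the {Onion, Potato} => {Burger} rule concrete, here is a small sketch on made-up data. The transactions and the helper names `support` and `confidence` are assumptions for illustration only.

```python
# Eight toy transactions, invented for illustration.
transactions = [
    {"onion", "potato", "burger"},
    {"onion", "potato", "burger", "milk"},
    {"onion", "potato"},
    {"potato", "burger"},
    {"milk", "burger"},
    {"onion", "milk"},
    {"milk"},
    {"potato", "milk"},
]

def support(itemset, transactions):
    """Proportion of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(1 for b in transactions if itemset <= set(b)) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """conf(X -> Y) = supp(X u Y) / supp(X)."""
    both = set(antecedent) | set(consequent)
    return support(both, transactions) / support(antecedent, transactions)

conf_rule = confidence({"onion", "potato"}, {"burger"}, transactions)
print(f"conf({{onion, potato}} -> {{burger}}) = {conf_rule:.3f}")
```

Three baskets contain both onion and potato, and two of those also contain burger, so the confidence is 2/3.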

Confidence can give some important insights, but it also has a major drawback: it only takes into account the popularity of the itemset X and not the popularity of Y. If Y is itself very popular, a transaction containing X is likely to contain Y as well, which inflates the confidence even when X and Y are unrelated. To overcome this drawback, there is another measure called lift.

Lift:

The lift of a rule is defined as:

lift(X⟶Y) = supp(X ∪ Y) / (supp(X) × supp(Y))

This signifies the likelihood of the itemset Y being purchased when item X is purchased while taking into account the popularity of Y.

If the value of lift is greater than 1, it means that the itemset Y is more likely to be bought when itemset X is bought than it is in general, while a value less than 1 implies that itemset Y is less likely to be bought when itemset X is bought.
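The sketch below computes lift for two rules on made-up data, one landing above 1 and one below. The transactions and the helper names `support` and `lift` are hypothetical and only meant to illustrate the formula.

```python
# Eight toy transactions, invented for illustration.
transactions = [
    {"onion", "potato", "burger"},
    {"onion", "potato", "burger", "milk"},
    {"onion", "potato"},
    {"potato", "burger"},
    {"milk", "burger"},
    {"onion", "milk"},
    {"milk"},
    {"potato", "milk"},
]

def support(itemset, transactions):
    """Proportion of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(1 for b in transactions if itemset <= set(b)) / len(transactions)

def lift(antecedent, consequent, transactions):
    """lift(X -> Y) = supp(X u Y) / (supp(X) * supp(Y))."""
    both = set(antecedent) | set(consequent)
    return support(both, transactions) / (
        support(antecedent, transactions) * support(consequent, transactions)
    )

lift_burger = lift({"onion", "potato"}, {"burger"}, transactions)
lift_milk = lift({"onion"}, {"milk"}, transactions)
print(f"lift({{onion, potato}} -> {{burger}}) = {lift_burger:.3f}")
print(f"lift({{onion}} -> {{milk}}) = {lift_milk:.3f}")
```

In this toy data {onion} => {milk} has a confidence of 0.5, yet milk appears in 5 of the 8 baskets anyway, so its lift falls below 1. The burger rule, by contrast, has a lift above 1.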

Sorting the Association Rules:

Here, we have collected the rule with the maximum lift for each item that can be a consequent (that is, each item that appears on the right-hand side of a rule).
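The selection step can be sketched as follows. Note that this sketch enumerates itemsets by brute force rather than with FP Growth, which is only workable on tiny data. The thresholds, toy transactions, and the names `mine_rules` and `best_rule_per_consequent` are all assumptions made for illustration.

```python
from itertools import combinations

# Eight toy transactions, invented for illustration.
transactions = [
    {"onion", "potato", "burger"},
    {"onion", "potato", "burger", "milk"},
    {"onion", "potato"},
    {"potato", "burger"},
    {"milk", "burger"},
    {"onion", "milk"},
    {"milk"},
    {"potato", "milk"},
]

def support(itemset, transactions):
    """Proportion of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(1 for b in transactions if itemset <= set(b)) / len(transactions)

def mine_rules(transactions, min_support=0.2, min_confidence=0.5):
    """Brute-force rules with a single-item consequent (toy data only)."""
    items = sorted(set().union(*transactions))
    rules = []
    for size in range(2, len(items) + 1):
        for itemset in combinations(items, size):
            supp_xy = support(itemset, transactions)
            if supp_xy == 0 or supp_xy < min_support:
                continue
            for consequent in itemset:
                antecedent = tuple(i for i in itemset if i != consequent)
                conf = supp_xy / support(antecedent, transactions)
                if conf < min_confidence:
                    continue
                rules.append({
                    "antecedent": antecedent,
                    "consequent": consequent,
                    "support": supp_xy,
                    "confidence": conf,
                    "lift": conf / support([consequent], transactions),
                })
    return rules

def best_rule_per_consequent(rules):
    """Keep the highest-lift rule for each consequent, sorted by lift."""
    best = {}
    for rule in rules:
        current = best.get(rule["consequent"])
        if current is None or rule["lift"] > current["lift"]:
            best[rule["consequent"]] = rule
    return sorted(best.values(), key=lambda r: r["lift"], reverse=True)

ranked = best_rule_per_consequent(mine_rules(transactions))
for r in ranked:
    print(r["antecedent"], "=>", r["consequent"], f"lift={r['lift']:.2f}")
```

On a real dataset, a dedicated frequent-itemset algorithm such as FP Growth would replace the brute-force loop, while the "best rule per consequent" step would stay essentially the same.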

Take the rule {yogurt, whole milk, tropical fruit} => {root vegetables}. The support of the rule is 228, which means all the items together appear in 228 transactions in the dataset. (Here support is reported as a count of transactions rather than as a proportion.)

The confidence of the rule is 46%, which means that 46% of the time the antecedent items occurred, we also had the consequent in the transaction (i.e., 46% of the time, customers who bought the left-side items also bought root vegetables).

Another essential metric is lift. A lift of 2.23 means that root vegetables are 2.23 times as likely to be found in transactions that contain yogurt, whole milk, and tropical fruit as they are in transactions overall. Typically, a lift value of 1 indicates that the antecedent and the consequent occur independently of each other. Hence, the idea is to look for rules having a lift much higher than 1.

This is a significant piece of information, as it can prompt a retailer to bundle specific products like these together or to run a marketing scheme that offers a discount on root vegetables when they are bought along with these other three products.
