Friday, November 19, 2021

Market Basket Analysis using Python

Introduction 
Market Basket Analysis (MBA) is a form of association analysis and a popular data mining technique. It is a kind of knowledge discovery in data (KDD) and can be applied in many fields of work. This blog explains how to generate association rules using Python. I have used a grocery dataset and the Python programming language to perform the market basket analysis. There are various data mining techniques, and association rule mining is one of them.

Association Rule mining
Association rule mining finds interesting associations and relationships among large sets of data items. A rule shows how frequently an itemset occurs in the transactions. A typical example is Market Basket Analysis. Association rule learning is an important concept in machine learning, and it is employed in market basket analysis, web usage mining, continuous production, etc. It is an unsupervised learning technique that checks for the dependency of one data item on another and maps these dependencies so that they can be used profitably. It tries to find interesting relations or associations among the variables of a dataset and expresses them as rules. For example, if a customer buys bread, they are also likely to buy butter, eggs, or milk, so these products are placed on the same shelf or close together.

Association rule learning works on if-then statements, such as "if A then B". Here the "if" element is called the antecedent, and the "then" element is called the consequent. A relationship between two single items is known as single cardinality. As the number of items in a rule increases, the cardinality increases accordingly. To measure the associations among thousands of data items, there are several metrics. These metrics are given below:
  • Support
  • Confidence
  • Lift
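
To make the three metrics concrete, here is a minimal sketch in plain Python. The transactions are made-up toy data (not the groceries dataset), and the function names are my own, chosen only for illustration. Support is the fraction of transactions containing an itemset, confidence is the support of the whole rule divided by the support of the antecedent, and lift is the confidence divided by the support of the consequent.

```python
# Toy transactions (hypothetical data, not the groceries dataset).
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "eggs"},
    {"milk", "eggs"},
    {"bread", "butter", "eggs"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Estimate of P(consequent | antecedent)."""
    both = set(antecedent) | set(consequent)
    return support(both, transactions) / support(antecedent, transactions)


def lift(antecedent, consequent, transactions):
    """Confidence relative to the consequent's baseline support."""
    return (confidence(antecedent, consequent, transactions)
            / support(consequent, transactions))


# Rule: if bread then butter.
antecedent, consequent = {"bread"}, {"butter"}
print("support   :", round(support(antecedent | consequent, transactions), 3))
print("confidence:", round(confidence(antecedent, consequent, transactions), 3))
print("lift      :", round(lift(antecedent, consequent, transactions), 3))
```

A lift above 1 suggests the consequent shows up more often together with the antecedent than it does on its own, a lift near 1 suggests no association, and a lift below 1 suggests the items tend not to be bought together.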


Apriori Algorithm

This algorithm uses frequent itemsets to generate association rules. It is designed to work on databases that contain transactions. It uses a breadth-first search and a hash tree to count itemsets efficiently.

It is mainly used for market basket analysis and helps to identify the products that are likely to be bought together.
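
The level-wise (breadth-first) idea can be sketched in a few lines of plain Python. This is a simplified illustration on made-up transactions, not the implementation used by any library: it counts support by scanning all transactions and does not use a hash tree. The function name `apriori_frequent_itemsets` is hypothetical.

```python
from itertools import combinations


def apriori_frequent_itemsets(transactions, min_support):
    """Return {frozenset: support} for every itemset meeting min_support."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)

    def supp(candidate):
        return sum(candidate <= t for t in transactions) / n

    # Level 1: frequent single items.
    items = {item for t in transactions for item in t}
    current = {}
    for item in items:
        candidate = frozenset([item])
        s = supp(candidate)
        if s >= min_support:
            current[candidate] = s

    frequent = dict(current)
    k = 2
    while current:
        prev = set(current)
        # Join step: combine frequent (k-1)-itemsets into k-itemsets.
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Prune step: a k-itemset can only be frequent if all of its
        # (k-1)-subsets are frequent (the Apriori property).
        candidates = {
            c for c in candidates
            if all(frozenset(sub) in prev for sub in combinations(c, k - 1))
        }
        current = {}
        for c in candidates:
            s = supp(c)
            if s >= min_support:
                current[c] = s
        frequent.update(current)
        k += 1
    return frequent


# Toy transactions (hypothetical data).
toy = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "eggs"},
    {"milk", "eggs"},
    {"bread", "butter", "eggs"},
]
for itemset, s in sorted(apriori_frequent_itemsets(toy, 0.4).items(),
                         key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), round(s, 2))
```

The prune step is what keeps the search manageable: any candidate with an infrequent subset is discarded before its support is ever counted.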

Applications of Association Rule Learning

It has various applications in machine learning and data mining. Below are some popular applications of association rule learning:

  • Market Basket Analysis: It is one of the best-known applications of association rule mining. This technique is commonly used by big retailers to determine the associations between items.
  • Medical Diagnosis: Association rules can support diagnosis, as they help in identifying the probability of illness for a particular disease.
  • Protein Sequence: Association rules help in determining the synthesis of artificial proteins.
  • It is also used for catalog design, loss-leader analysis, and many other applications.
Implementation of Association Rule Technique using Python
Step 1: The first step in the implementation of the association rule technique is to import the necessary libraries. If a library is not installed and we get a ModuleNotFoundError, we first need to install it using the pip install command.
The following libraries are required for the implementation of the association rule technique:
import pandas as pd
import numpy as np
import plotly.express as px
import Orange
from Orange.data import Table,Domain, DiscreteVariable, ContinuousVariable
from orangecontrib.associate.fpgrowth import *
from apyori import apriori

Step 2: Load the dataset in the appropriate format. The following code creates the data frame and places a 1 where the item is present in the transaction and a 0 otherwise.
# Read each line of the CSV as one transaction (a list of item names)
data_list = []
with open("groceries.csv", "r") as f:
    for line in f:
        row = line.replace("\n", "").split(",")
        data_list.append(row)

# One row per transaction, one column per item; mark purchased items with 1
data = pd.DataFrame()
for row in range(len(data_list)):
    for col in data_list[row]:
        data.loc[row, col] = 1
data.fillna(0, inplace=True)  # items not in the transaction become 0
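
The cell-by-cell data.loc assignment above works, but it can be slow on a few thousand transactions. As an alternative, here is a minimal sketch that builds the same kind of 0/1 table in one step from a list of dictionaries. The three transactions are made-up stand-ins for the rows read from groceries.csv, and the final astype(int) is my own choice (the loop above leaves the values as floats).

```python
import pandas as pd

# Hypothetical parsed transactions, standing in for the rows of groceries.csv.
data_list = [
    ["citrus fruit", "yogurt"],
    ["whole milk"],
    ["yogurt", "whole milk", "soda"],
]

# One dict per transaction. pandas aligns the keys into columns and leaves
# NaN where an item is absent, which fillna(0) then turns into 0.
data = (pd.DataFrame([{item: 1 for item in row} for row in data_list])
          .fillna(0)
          .astype(int))
print(data)
```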

Step 3: This step performs exploratory data analysis to find the shape of the data, the item counts, each item's contribution to total sales, etc.

print("Shape of Data: ",data.shape)
print('Number of Transactions: ',data.shape[0])
print('Number of Items: ',data.shape[1])

Shape of Data: (9835, 169)
Number of Transactions: 9835
Number of Items: 169

Top 20 sold Items and their contribution to the total sales

total_count_of_items = sum(data.sum())
print("Total count of items: ", total_count_of_items)
item_sort_df = data.sum().sort_values(ascending = False).reset_index()
item_sort_df.rename(columns={item_sort_df.columns[0]:'item_name',item_sort_df.columns[1]:'item_count'}, inplace=True)
item_sort_df['item_perc'] = item_sort_df['item_count']/total_count_of_items #each item's contribution 
item_sort_df['total_perc'] = item_sort_df.item_perc.cumsum() #cumulative contribution of top items
print(item_sort_df[item_sort_df.total_perc <= 0.5].shape)

item_sort_df.head(20)


Step 4: Visualization of top 20 items
item_sort_df_top20 = item_sort_df.head(20)  # the 20 best-selling items from Step 3
item_sort_df_top20.plot.bar(x='item_name',
               y='item_perc',
               color='green',
               figsize=(8, 7),
               legend=True,
               fontsize=12,
               title="Top 20 Items with Highest Sales Percentage")




Step 5: Dataset Pruning:
In this step we reduce the size of the dataset by applying two parameters. We include only those transactions that contain at least 2 items (length_trans=2), and only the top-selling items whose cumulative share of total sales is within 0.4 (total_sales_perc=0.4).
After applying the prune_dataset function to the original dataset we get the following result.
output_df, item_counts = prune_dataset(input_df=data, length_trans=2, total_sales_perc=0.4)
print("Shape: ", output_df.shape)
print("Selected items: ", list(output_df.columns))

Shape: (4585, 13)
This means now we have 4585 transactions and 13 items in our pruned dataset.

Selected items:  ['bottled beer', 'bottled water', 'citrus fruit', 'other vegetables', 'pastry', 'rolls/buns', 'root vegetables', 'sausage', 'shopping bags', 'soda', 'tropical fruit', 'whole milk', 'yogurt']
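
The prune_dataset helper called above is not listed in this post. Below is a minimal sketch of what such a function might look like. It is my own reconstruction under the assumptions that items are ranked by sales, that the top items are kept while their cumulative share stays within total_sales_perc, and that transactions are kept when they contain at least length_trans of those items. It is shown on a made-up four-item table and is not guaranteed to reproduce the exact numbers above.

```python
import pandas as pd


def prune_dataset(input_df, length_trans=2, total_sales_perc=0.5):
    """Hypothetical sketch: keep top-selling items and long-enough transactions."""
    # Count how often each item was sold, most popular first.
    item_counts = input_df.sum().sort_values(ascending=False).reset_index()
    item_counts.columns = ["item_name", "item_count"]
    total = item_counts["item_count"].sum()
    item_counts["item_perc"] = item_counts["item_count"] / total
    item_counts["total_perc"] = item_counts["item_perc"].cumsum()

    # Keep the top items whose cumulative share stays within the threshold.
    kept = item_counts[item_counts["total_perc"] <= total_sales_perc]
    selected_items = list(kept["item_name"])

    # Keep transactions containing at least `length_trans` selected items.
    pruned = input_df[selected_items]
    pruned = pruned[pruned.sum(axis=1) >= length_trans]
    return pruned, kept


# Made-up 0/1 table: 4 transactions, 4 items.
toy = pd.DataFrame({
    "milk":  [1, 1, 1, 1],
    "bread": [1, 1, 0, 1],
    "soda":  [1, 0, 0, 0],
    "jam":   [0, 0, 1, 0],
})
toy_pruned, toy_counts = prune_dataset(toy, length_trans=2, total_sales_perc=0.8)
print(toy_pruned)
print(toy_counts)
```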

Step 6: Performing one-hot encoding
input_assoc_rules = output_df
# Describe every item as a discrete variable with the two values '0' and '1'
domain_grocery = Domain([DiscreteVariable.make(name=item, values=['0', '1']) for item in input_assoc_rules.columns])
data_gro_1 = Orange.data.Table.from_numpy(domain=domain_grocery, X=input_assoc_rules.to_numpy(), Y=None)
data_gro_1_en, mapping = OneHot.encode(data_gro_1, include_class=False)

Step 7: Generating Association Rules using min support = 0.01 and confidence =0.3

min_support = 0.01
num_trans = input_assoc_rules.shape[0] * min_support
print("Number of required transactions =", int(num_trans))
itemsets = dict(frequent_itemsets(data_gro_1_en, min_support=min_support))  # dict of itemset: support pairs
print(len(itemsets), "itemsets have a support of", min_support * 100, "%")

Number of required transactions = 45
166886 itemsets have a support of 1.0 %

995182 Raw rules data frame of 16628 rules generated using FP-Growth Algorithm

(pruned_rules_df[['antecedent', 'consequent',
                  'support', 'confidence', 'lift']].groupby('consequent')
                                                 .max()
                                                 .reset_index()
                                                 .sort_values(['lift', 'support', 'confidence'],
                                                              ascending=False))



Conclusion 1:
Whenever a customer purchases yogurt, whole milk and other vegetables, they may also purchase root vegetables. The confidence of this rule is 46% and the lift is 2.23, which means root vegetables are about 2.23 times more likely to appear in a basket that already contains the other three products than in a basket overall.

Step 8: Applying the Apriori algorithm to the pruned dataset with minimum support = 1% and confidence = 20%.

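# The use_colnames and metric arguments belong to mlxtend's API (not apyori's),
# so apriori and association_rules are imported from mlxtend here
from mlxtend.frequent_patterns import apriori, association_rules

freq_items = apriori(input_assoc_rules.astype(bool), min_support=0.01, use_colnames=True)
rules = association_rules(freq_items, metric="confidence", min_threshold=0.2)
rules = rules[['antecedents', 'consequents', 'support', 'confidence', 'lift']]
rules.sort_values(["confidence"], ascending=False).reset_index(drop=True).head(20)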




Conclusion 2:
Whenever a customer purchases root vegetables, yogurt and tropical fruit, they may also purchase whole milk. The support of this rule is 1%, the confidence is 70% and the lift is 1.58, which means whole milk is about 1.58 times more likely to appear in a basket that already contains the other three products than in a basket overall.
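
As a quick sanity check on how to read these numbers: since lift is confidence divided by the support of the consequent, the figures quoted in the two conclusions imply a baseline share of baskets for each consequent. The small sketch below does that arithmetic. The function name is my own, and because the quoted figures are rounded, the results are only approximate and refer to the pruned dataset.

```python
def baseline_support(confidence, lift):
    """Consequent support implied by lift = confidence / support(consequent)."""
    return confidence / lift


# Figures quoted in the two conclusions (rounded, so results are approximate).
root_veg = baseline_support(confidence=0.46, lift=2.23)
whole_milk = baseline_support(confidence=0.70, lift=1.58)

print(f"Implied share of baskets with root vegetables: {root_veg:.1%}")
print(f"Implied share of baskets with whole milk:      {whole_milk:.1%}")
```

This also shows why a rule with lower confidence can have the higher lift: whole milk is already in a large share of baskets, so a 70% confidence is a smaller jump over its baseline than 46% is for root vegetables.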

Acknowledgement

“I am thankful to mentors at https://internship.suvenconsultants.com for providing awesome problem statements and giving many of us a Coding Internship Experience. Thank you www.suvenconsultants.com”.

I would like to thank Kunal Gupta for his article that helped me a lot in completing the project.

