manasi

Friday, November 19, 2021

Market Basket Analysis using Python

Introduction

Market Basket Analysis (MB) is a form of association analysis and a popular data mining technique. It is a kind of knowledge discovery in data (KDD) and can be applied in many fields of work. This blog shows how to generate association rules using Python. Here I have used a grocery dataset to perform the market basket analysis. Association rule mining is one of several data mining techniques.

Association Rule mining

Association rule mining finds interesting associations and relationships among large sets of data items. An association rule shows how frequently an itemset occurs in a set of transactions. A typical example is Market Basket Analysis. Association rule learning is one of the important concepts of machine learning, and it is employed in market basket analysis, web usage mining, continuous production, etc. It is an unsupervised learning technique that checks for the dependency of one data item on another and maps these dependencies so that they can be used profitably. It tries to find interesting relations or associations among the variables of a dataset and expresses them as rules. For example, if a customer buys bread, they are also likely to buy butter, eggs, or milk, so these products are placed on the same shelf or close together.

Association rule learning works on the concept of if-then statements, such as if A then B. Here the "if" element is called the antecedent, and the "then" element is called the consequent. A relationship between two single items is said to have single cardinality. As the number of items in a rule increases, the cardinality increases accordingly. To measure the associations between thousands of data items, several metrics are used. These metrics are given below:

Support
Confidence
Lift
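
To make the three metrics concrete, here is a minimal sketch using their usual definitions: support is the fraction of transactions containing an itemset, confidence of A -> B is support(A and B) / support(A), and lift is confidence / support(B). The toy transactions and the function names are made up for illustration; they are not part of the grocery dataset or of any library.

```python
# Toy transactions (hypothetical, for illustration only)
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "eggs"},
    {"milk", "eggs"},
    {"bread", "butter", "eggs"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """How often the consequent appears in transactions that contain the antecedent."""
    both = set(antecedent) | set(consequent)
    return support(both, transactions) / support(antecedent, transactions)

def lift(antecedent, consequent, transactions):
    """Confidence relative to the consequent's baseline support; > 1 suggests a positive association."""
    return confidence(antecedent, consequent, transactions) / support(consequent, transactions)

sup = support({"bread", "butter"}, transactions)
conf = confidence({"bread"}, {"butter"}, transactions)
lft = lift({"bread"}, {"butter"}, transactions)
print("support:", sup, "confidence:", conf, "lift:", lft)
```

In this toy data, bread and butter appear together in 3 of 5 baskets, and 3 of the 4 baskets with bread also contain butter, so the rule bread -> butter has a lift above 1.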

Apriori Algorithm

This algorithm uses frequent itemsets to generate association rules. It is designed to work on databases that contain transactions. It uses a breadth-first search and a hash tree to count itemsets efficiently.

It is mainly used for market basket analysis and helps to identify products that are often bought together.
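
The level-wise (breadth-first) idea behind Apriori can be sketched in a few lines. This is a simplified illustration, not the implementation used later in this post: it counts candidates with a plain scan instead of a hash tree, and the function name and toy transactions are assumptions.

```python
from itertools import combinations

def apriori_frequent_itemsets(transactions, min_support):
    """Return {itemset: support} for all itemsets with support >= min_support."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)
    # Level 1 candidates: every single item
    items = sorted({item for t in transactions for item in t})
    candidates = [frozenset([item]) for item in items]
    frequent = {}
    k = 1
    while candidates:
        # Count each candidate with one pass over the transactions
        level = {}
        for c in candidates:
            sup = sum(c <= t for t in transactions) / n
            if sup >= min_support:
                level[c] = sup
        frequent.update(level)
        # Join step: merge frequent k-itemsets into (k+1)-item candidates
        k += 1
        joined = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent
        candidates = [
            c for c in joined
            if all(frozenset(s) in level for s in combinations(c, k - 1))
        ]
    return frequent

toy = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "eggs"},
    {"milk", "eggs"},
    {"bread", "butter", "eggs"},
]
frequent = apriori_frequent_itemsets(toy, min_support=0.4)
for itemset, sup in frequent.items():
    print(sorted(itemset), sup)
```

The prune step is what keeps the search tractable: an itemset can only be frequent if all of its subsets are frequent, so most larger candidates are discarded without being counted.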

Applications of Association Rule Learning

It has various applications in machine learning and data mining. Below are some popular applications of association rule learning:

Market Basket Analysis: It is one of the most popular applications of association rule mining. This technique is commonly used by big retailers to determine the association between items.
Medical Diagnosis: Association rules can support diagnosis and treatment, as they help in estimating the probability of a particular illness.
Protein Sequence: Association rules help in determining the synthesis of artificial proteins.
It is also used for catalog design, loss-leader analysis and many other applications.

Implementation of Association Rule Technique using Python

Step 1: The first step in the implementation of the association rule technique is to import the necessary libraries. If a library is not installed and we get a ModuleNotFoundError, we first need to install it using the pip install command.

The following libraries are required for the implementation of the association rule technique:

import pandas as pd

import numpy as np

import plotly.express as px

import Orange

from Orange.data import Table,Domain, DiscreteVariable, ContinuousVariable

from orangecontrib.associate.fpgrowth import *

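# Step 8 calls apriori(..., use_colnames=True) and association_rules(...), which is the mlxtend API, not apyori's.
# Note: this replaces the association_rules name brought in by the wildcard import above.
from mlxtend.frequent_patterns import apriori, association_rules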

Step 2: Load the dataset in the appropriate format. The following code creates the data frame and places a 1 where the item is present in the transaction and a 0 otherwise.

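data_list = []

# Read each line of the file as one transaction (a list of item names)
with open("groceries.csv", "r") as f:
    for x in f:
        row = x.replace("\n","").split(",")
        data_list.append(row)

# Build the transaction-by-item matrix: 1 where the item appears in the transaction
data = pd.DataFrame()
for row in range(len(data_list)):
    for col in data_list[row]:
        data.loc[row,col]=1
# Items absent from a transaction are NaN at this point; set them to 0
data.fillna(0,inplace=True)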

Step 3: This step performs exploratory data analysis to find the shape of the data, the item counts, each item's contribution to total sales, etc.

print("Shape of Data: ",data.shape)

print('Number of Transactions: ',data.shape[0])

print('Number of Items: ',data.shape[1])

Shape of Data:  (9835, 169)
Number of Transactions:  9835
Number of Items:  169

Top 20 sold Items and their contribution to the total sales
total_count_of_items = sum(data.sum())

print("Total count of items: ", total_count_of_items)

item_sort_df = data.sum().sort_values(ascending = False).reset_index()

item_sort_df.rename(columns={item_sort_df.columns[0]:'item_name',item_sort_df.columns[1]:'item_count'}, inplace=True)

item_sort_df['item_perc'] = item_sort_df['item_count']/total_count_of_items #each item's contribution 

item_sort_df['total_perc'] = item_sort_df.item_perc.cumsum() #cumulative contribution of top items

print(item_sort_df[item_sort_df.total_perc <= 0.5].shape)

item_sort_df.head(20)

Step 4: Visualization of top 20 items
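# Keep the 20 best-selling items (item_sort_df is already sorted by count)
item_sort_df_top20 = item_sort_df.head(20)
item_sort_df_top20.plot.bar(x='item_name',
               y='item_perc',
               color='green',
               figsize=(8, 7),
               legend=True,
               fontsize=12,
               title="Top 20 Items with Highest Sales Percentage ")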

Step 5: Dataset Pruning:

In this step we reduce the size of the dataset by applying two parameters. Only transactions that include at least 2 items (length_trans=2) are kept for the market basket analysis, and the items are limited to the top sellers whose cumulative share of total sales is at most 0.4 (total_sales_perc=0.4).
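
The body of prune_dataset is not shown in this post. The following is only a sketch of what such a function might look like, assuming it keeps the top-selling items up to a cumulative sales share and then keeps transactions containing at least length_trans of those items. The function name and parameters match the call used in this step; everything inside is an assumption, and the small data frame is made up.

```python
import pandas as pd

def prune_dataset(input_df, length_trans=2, total_sales_perc=0.5):
    """Sketch (assumed behaviour): keep top-selling items and sufficiently long transactions."""
    # Rank items by number of sales and compute each item's cumulative share
    item_count = input_df.sum().sort_values(ascending=False).reset_index()
    item_count.columns = ["item_name", "item_count"]
    item_count["item_perc"] = item_count["item_count"] / item_count["item_count"].sum()
    item_count["total_perc"] = item_count["item_perc"].cumsum()
    # Items whose cumulative share stays within the threshold
    selected = list(item_count.loc[item_count["total_perc"] <= total_sales_perc, "item_name"])
    # Transactions that contain at least `length_trans` of the selected items
    pruned = input_df[selected]
    pruned = pruned[pruned.sum(axis=1) >= length_trans]
    return pruned, item_count[item_count["item_name"].isin(selected)]

# Hypothetical 0/1 transaction matrix: 4 transactions, 4 items
toy = pd.DataFrame(
    {"a": [1, 1, 1, 1], "b": [1, 1, 0, 0], "c": [0, 1, 0, 0], "d": [0, 0, 0, 1]}
)
toy_pruned, toy_counts = prune_dataset(toy, length_trans=2, total_sales_perc=0.75)
print(toy_pruned.shape, list(toy_pruned.columns))
```

In the toy data, items a and b together make up 75% of all sales, and only the first two transactions contain both, so the pruned frame has 2 rows and 2 columns.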

Applying the prune_dataset function to the original dataset gives the following result.

output_df, item_counts = prune_dataset(input_df=data, length_trans=2,total_sales_perc=0.4)

print("Shape: ",output_df.shape)

print("Selected items: ", list(output_df.columns))

Shape:  (4585, 13)

This means now we have 4585 transactions and 13 items in our pruned dataset.

Selected items: ['bottled beer', 'bottled water', 'citrus fruit', 'other vegetables', 'pastry', 'rolls/buns', 'root vegetables', 'sausage', 'shopping bags', 'soda', 'tropical fruit', 'whole milk', 'yogurt']

Step 6: Performing one-hot encoding

input_assoc_rules = output_df 

domain_grocery = Domain([DiscreteVariable.make(name=item,values=['0', '1']) for item in input_assoc_rules.columns])

data_gro_1 = Orange.data.Table.from_numpy(domain=domain_grocery,  X=input_assoc_rules.to_numpy(),Y= None)

data_gro_1_en, mapping = OneHot.encode(data_gro_1, include_class=False)

Step 7: Generating association rules using minimum support = 0.01 and confidence = 0.3

min_support=0.01

num_trans = input_assoc_rules.shape[0]*min_support

print("Number of required transactions = ", int(num_trans))

itemsets = dict(frequent_itemsets(data_gro_1_en, min_support=min_support))   #dict-- key:value pair

print(len(itemsets), " itemsets have a support of ", min_support*100, "%")

Number of required transactions =  45
166886  itemsets have a support of  1.0 %

995182
Raw rules data frame of 16628 rules generated using FP-Growth Algorithm

(pruned_rules_df[['antecedent','consequent','support','confidence','lift']]
    .groupby('consequent')
    .max()
    .reset_index()
    .sort_values(['lift', 'support','confidence'], ascending=False))

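Conclusion 1:

Whenever a customer purchases yogurt, whole milk and other vegetables, they are also likely to purchase root vegetables: the rule has a confidence of 46% and a lift of 2.23, which means customers who buy the first three products are 2.23 times as likely to buy root vegetables as customers in general. These 4 products therefore have a high probability of occurring in the same shopping basket.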
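
The code that turns the frequent itemsets into rules is not shown in this post. As a rough illustration of what that step does, the sketch below derives rules from a dictionary of itemset supports and keeps those that meet a confidence threshold. The function name, the tuple layout and the toy supports are assumptions; this is not the Orange or mlxtend implementation.

```python
from itertools import combinations

def generate_rules(itemset_support, min_confidence):
    """Build (antecedent, consequent, support, confidence, lift) tuples.

    `itemset_support` maps frozenset -> support and must contain every
    frequent itemset together with all of its subsets.
    """
    rules = []
    for itemset, sup in itemset_support.items():
        if len(itemset) < 2:
            continue  # a rule needs at least one item on each side
        for r in range(1, len(itemset)):
            for combo in combinations(sorted(itemset), r):
                antecedent = frozenset(combo)
                consequent = itemset - antecedent
                conf = sup / itemset_support[antecedent]
                if conf >= min_confidence:
                    lift = conf / itemset_support[consequent]
                    rules.append((antecedent, consequent, sup, conf, lift))
    # Strongest associations first
    return sorted(rules, key=lambda rule: rule[4], reverse=True)

# Hypothetical supports for a toy dataset
toy_support = {
    frozenset({"bread"}): 0.8,
    frozenset({"butter"}): 0.6,
    frozenset({"eggs"}): 0.6,
    frozenset({"bread", "butter"}): 0.6,
    frozenset({"bread", "eggs"}): 0.4,
}
toy_rules = generate_rules(toy_support, min_confidence=0.7)
for antecedent, consequent, sup, conf, lift in toy_rules:
    print(sorted(antecedent), "->", sorted(consequent), sup, conf, lift)
```

With a confidence threshold of 0.7, only the two rules between bread and butter survive; the rules involving eggs fall below the threshold.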

Step 8: Applying the Apriori algorithm to the pruned dataset with minimum support = 1% and confidence = 20%.

freq_items = apriori(input_assoc_rules, min_support=0.01,use_colnames=True)
rules = association_rules(freq_items, metric="confidence", min_threshold=0.2)
rules = rules[['antecedents', 'consequents' , 'support', 'confidence', 'lift']]
rules.sort_values(["confidence"],ascending=False).reset_index(drop= True).head(20)

Conclusion 2:

Whenever a customer purchases root vegetables, yogurt and tropical fruit, they are also likely to purchase whole milk: the rule has a support of 1%, a confidence of 70% and a lift of 1.58, which means these customers are 1.58 times as likely to buy whole milk as customers in general. These 4 products therefore have a high probability of occurring in the same shopping basket.

Acknowledgement

“I am thankful to mentors at https://internship.suvenconsultants.com for providing awesome problem statements and giving many of us a Coding Internship Experience. Thank you www.suvenconsultants.com”.

I would like to thank Kunal Gupta for his article that helped me a lot in completing the project.

Friday, January 22, 2010

Software as a Service

Introduction:
The world has seen numerous so-called disruptive technologies come and go. Some have had a profound impact on how we run our businesses and go about our daily lives and many have not. Some were long lasting and others were gone in a flash. Software as a Service (SaaS) is proving to have great potential to impact our lives in almost every way. Take for example a small business in Kansas that now has immediate access to a global market by listing its products on eBay. Consider a mid-size company that, due to the cost of the software license and required infrastructure, could never afford a true sales force automation tool, but now thanks to salesforce.com has a best-of-breed CRM system for $59.00 per user per month, with no upfront cost. Families can now share photos with friends across the country with services like Flickr. The list goes on and on.

The examples above share the following key elements:
- The software is paid for as it is consumed;
- The consumer has no software, hardware, or infrastructure to purchase, install, or maintain;
- Apart from a personal computer and an Internet connection, all parts of the solution are provided by the software vendor.

SaaS, an acronym for Software as a Service, is a method of “Gaining Software functionality and benefits via web, at a lower cost and reduced complexity, as compared to its commercially licensed, internally operated counterpart”. Basically, SaaS refers to delivering software remotely as a web-based service, on a pay-per-use basis.
SaaS can also be described as an on-demand service that software vendors provide to customers, either from the vendors’ own servers or by downloading the software to the customer’s system and enabling it there for the period of the contract.

SaaS requires far less to be downloaded than locally housed applications. For this reason, it is also referred to as “Thin Client”. All the data resides on the hosted server; none is stored at the user’s site.

In times of recession and cost cutting, SaaS is considered especially helpful for businesses whose software requirements change frequently, as it eliminates the risk of obsolescence associated with traditional software implementations and provides immediate access to new functionality. It also significantly reduces the initial cost of deploying new software. For home users, it provides a cheaper way to access software: they subscribe to and pay for only the functionality they need, instead of paying license costs for every feature of a traditional package, including features they never use.

History of SaaS:
ASPs (Application Service Providers) came into the picture in the 1990s. An ASP offered the core functionality of SaaS, i.e. “Application hosted on server, delivered to users over internet”. The drawback of ASPs was that they offered all the features of the application to all users, in contrast to SaaS, where only selected features are provided to users, based on their needs.

The concept of "software as a service" started to circulate before 1999. One of the first SaaS applications was SiteEasy, a web-site-in-a-box for small businesses that launched in 1998 at Siteeasy.com. Developed by Atlanta-based firm WebTransit (co-founded by Gary Troutman and Drew Wilkins), SiteEasy was sold on a subscription basis for a monthly fee to its first customer in the fall of 1998.

SaaS overcomes these limitations by using a “Multi-Tenant” architecture, which allows multiple users to share a single instance of the application and database. This architecture allows vendors to keep infrastructure and maintenance costs low. But a question comes to mind: “what about the security of the data?”, as the database is shared among all the customers.

Next-generation SaaS addresses this by delivering “Single Tenant SaaS”, where each user is provided with a unique instance of the software application and database. In this architecture, the overall cost is kept low through data center automation and virtualization.
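
The difference between the two models can be illustrated with a small sketch. It assumes a shared table keyed by a tenant_id column for the multi-tenant case and one separate in-memory SQLite database per customer for the single-tenant case; the table, tenant and item names are made up, and real SaaS platforms will differ in many ways.

```python
import sqlite3

# Multi-tenant: one shared database, every row tagged with its tenant
shared = sqlite3.connect(":memory:")
shared.execute("CREATE TABLE orders (tenant_id TEXT, item TEXT)")
shared.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("acme", "widget"), ("acme", "gear"), ("globex", "bolt")],
)

def orders_for(tenant_id):
    """Every query must filter on tenant_id, or tenants would see each other's data."""
    rows = shared.execute(
        "SELECT item FROM orders WHERE tenant_id = ? ORDER BY item", (tenant_id,)
    )
    return [row[0] for row in rows]

# Single-tenant: each customer gets its own database instance
def new_tenant_db(items):
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE orders (item TEXT)")
    db.executemany("INSERT INTO orders VALUES (?)", [(item,) for item in items])
    return db

tenant_dbs = {
    "acme": new_tenant_db(["widget", "gear"]),
    "globex": new_tenant_db(["bolt"]),
}

acme_shared = orders_for("acme")
globex_isolated = [row[0] for row in tenant_dbs["globex"].execute("SELECT item FROM orders")]
print(acme_shared, globex_isolated)
```

In the shared model, isolation depends on every query carrying the tenant filter; in the per-tenant model, isolation comes from the separate database, at the cost of running more instances.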

SaaS Architecture and Functionality

SaaS architecture can be seen as an extension of distributed application architecture. A distributed application architecture has its database and application distributed across various locations, with the locations communicating with one another. It addresses the scalability issues of a centralized database architecture and the operational overhead of distributing, installing and maintaining client software on users’ desktops. Recent SaaS architectures include additional components needed to operate and manage the different environments in which SaaS customers run the software.

The recent SaaS architecture is divided into various components, each having its own specific functionality.

1) Distribution Tier:
With the increasing popularity of SaaS, more and more customers have adopted it, and with this growth in the number of customers a load-balancing mechanism has become crucial. The distribution tier is responsible for distributing the load evenly across servers, thereby improving overall throughput and resource utilization. It also ensures high availability and reduces overall response time.
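
As a rough illustration of what distributing the load evenly can mean, here is a round-robin sketch, one of the simplest balancing strategies. The class and server names are hypothetical, and real distribution tiers typically also consider server health and current load.

```python
from collections import Counter
from itertools import cycle

class RoundRobinBalancer:
    """Hands each incoming request to the next server in a fixed rotation."""

    def __init__(self, servers):
        self._rotation = cycle(servers)

    def route(self, request_id):
        # request_id is ignored here; round robin looks only at arrival order
        return next(self._rotation)

balancer = RoundRobinBalancer(["app-1", "app-2", "app-3"])
assignments = [balancer.route(request_id) for request_id in range(9)]
load = Counter(assignments)
print(assignments)
print(load)
```

Nine requests across three servers end up as three requests per server, which is the even spread the distribution tier aims for.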

2) Application Tier Component:
The application server in SaaS is divided into several components, each with a specific task to perform. This approach provides a systematic division of functionality, thereby reducing the overall complexity and ensuring high availability. The application server in SaaS architecture can be divided into the following components:

- Identity Management Server: With the growing number of customers, proper security management has become indispensable in SaaS architecture. The identity management server is responsible for handling user identity management in a standardized way, thereby ensuring the security of users’ data.
- Integration Server: As SaaS grows in popularity, more and more customers who previously used traditional software are adopting it, which gives rise to the need to integrate a SaaS solution with an existing software system. The integration server is responsible for this integration, making it easier for customers to adopt SaaS.
- Communication Server: It is responsible for handling all kinds of communication with SaaS users.

3) Administration Tier:
SaaS has emerged as a huge market in recent times, and handling it properly requires proper administration, which makes a separate administration tier crucial in SaaS architecture. This tier is responsible for tasks such as metering a customer’s usage of the software, billing and payments according to usage policies, and maintenance and support of SaaS environments to keep them up and running.
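
The metering and billing tasks described above can be sketched as follows. This assumes a flat price per unit of usage; the class name, tenant names and prices are hypothetical, and actual usage policies are usually more elaborate.

```python
from collections import defaultdict

class UsageMeter:
    """Records usage per customer and bills it at a flat price per unit."""

    def __init__(self, price_per_unit):
        self.price_per_unit = price_per_unit
        self._usage = defaultdict(int)

    def record(self, tenant, units):
        # Metering: accumulate the units a customer has consumed
        self._usage[tenant] += units

    def invoice(self, tenant):
        # Billing: pay-per-use, so the charge is usage times unit price
        return self._usage.get(tenant, 0) * self.price_per_unit

meter = UsageMeter(price_per_unit=2)
meter.record("tenant-a", 10)
meter.record("tenant-a", 5)
meter.record("tenant-b", 3)
invoice_a = meter.invoice("tenant-a")
invoice_b = meter.invoice("tenant-b")
print(invoice_a, invoice_b)
```

A customer with no recorded usage is billed nothing, which matches the pay-per-use model described in the introduction.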

4) Infrastructure monitoring:
For the business to function properly, SaaS infrastructure needs to be locally and geographically resilient. For this, the environment needs to be operationally monitored. Certain tools are used for this purpose. These include the “S.L.A. (Service Level Agreement) Reporting tool”, used to monitor and manage the S.L.A.; the “Support matrix”, for managing overall support functions, including customer service details, performance of support staff and customer satisfaction level; and the “Quality Matrix”, for maintaining the quality of service based on the customer feedback collected in the feedback loop created by this process.

5) Configuration tier:
Different customers require different configurations of the same software. This tier is responsible for managing and customizing the software based on the user’s choices. Customization covers both the functionality and the look and feel of the software.

Characteristics of SaaS Software:

- Network-based access to, and management of, commercially available software.
- Activities managed from central locations rather than at each customer’s site, enabling customers to access applications remotely via the web.
- Application delivery typically closer to a one-to-many model than to a one-to-one model, including architecture, pricing, partnering and management characteristics.
- Centralized feature updating, which obviates the need for end users to download patches and upgrades.
- Frequent integration into a larger network of communicating software, either as part of a mashup or as a plug-in to a platform as a service.

Conclusion:

Software as a Service is a kind of “Recession Repellent Tool” for organizations. It achieved huge popularity and business growth during the recession and continues to grow. Features such as data security, fault tolerance, high availability, easy integration, low initial cost and new add-ons in SaaS architecture make it a beneficial choice for both vendors and customers.

Manasi Kulkarni

Wednesday, January 20, 2010

The DOS command to find the MAC address of another machine is:

nbtstat -A <IP address of the other machine>

Tuesday, January 19, 2010

Hi this is Manasi Kulkarni
