Python for exploratory data analysis and association rules applied to an e-commerce dataset
2020 was a historic year for e-commerce. With social isolation, online sales broke many records. Despite the start of vaccination and a possible return to mobility, the trend is for that growth to continue in 2021.
Many professionals who work in e-commerce are unaware of what data analysis can do for the business. In this article, I’m going to talk a little about how to do some simple analyses with Python. These analyses can open up new opportunities, identify problems and provide useful information for managing an e-commerce business.
We will conduct an Exploratory Data Analysis (EDA) on an e-commerce data set available on Kaggle. After that, we are going to use the Apriori association algorithm, which is simply a method for exploring relationships between items.
The data set is available at: https://www.kaggle.com/roshansharma/online-retail
First we are going to import the libraries and the data:
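As a minimal sketch of this step, assuming the download is a CSV file with the columns shown below (the column names other than CustomerID are assumptions, `load_orders` is a hypothetical helper, and the four rows are made up so that the snippet runs without the real file):

```python
import io

import pandas as pd

# Hypothetical stand-in for the downloaded Kaggle file. The column names are
# assumptions about the data set and the rows are invented for illustration.
SAMPLE_CSV = """InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
1001,A1,MUG,6,2010-12-01 08:26:00,2.50,17850,United Kingdom
1001,A2,TEAPOT,2,2010-12-01 08:26:00,10.00,17850,United Kingdom
C1002,A1,MUG,-6,2010-12-02 09:00:00,2.50,17850,United Kingdom
2001,A3,PLATE,10,2010-12-03 10:15:00,1.50,,Netherlands
"""


def load_orders(source):
    """Read order lines from a CSV path or an open file-like object."""
    return pd.read_csv(
        source,
        parse_dates=["InvoiceDate"],  # parse timestamps instead of keeping text
        dtype={"InvoiceNo": str},     # keep invoice numbers as text
    )


# With the real download, pass the path to the file instead of the sample.
df = load_orders(io.StringIO(SAMPLE_CSV))
```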
As usual, I like to check the first lines of the data.
Let’s check the shape, column types and null values:
Initially we can see that we have 135080 null values and that the CustomerID variable is not in string format. If we call describe(), we will also see negative values in the data. These negative values come from canceled orders. I will remove them, as we will not need them.
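These checks could look roughly like the sketch below. It runs on a small made-up DataFrame, so the numbers it prints are not the ones from the real data set, and the column names other than CustomerID are assumptions:

```python
import numpy as np
import pandas as pd

# Made-up rows standing in for the real data set.
df = pd.DataFrame({
    "InvoiceNo": ["1001", "1001", "C1002", "2001"],
    "Description": ["MUG", "TEAPOT", "MUG", "PLATE"],
    "Quantity": [6, 2, -6, 10],
    "UnitPrice": [2.5, 10.0, 2.5, 1.5],
    "CustomerID": [17850.0, 17850.0, 17850.0, np.nan],
})

print(df.head())        # first lines of the data
print(df.shape)         # (number of rows, number of columns)
print(df.dtypes)        # type of each column

null_counts = df.isnull().sum()  # missing values per column
print(null_counts)

print(df.describe())    # summary statistics; negative minimums show up here
```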
We will make the necessary transformations:
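One possible version of these transformations is sketched below on made-up rows. The helper name `clean_orders` is hypothetical, and dropping the rows without a CustomerID is an assumption of this sketch, since the text only says that the negative values are removed:

```python
import numpy as np
import pandas as pd

# Made-up rows: one canceled line (negative quantity) and one missing customer.
raw = pd.DataFrame({
    "InvoiceNo": ["1001", "1001", "C1002", "2001"],
    "Description": ["MUG", "TEAPOT", "MUG", "PLATE"],
    "Quantity": [6, 2, -6, 10],
    "UnitPrice": [2.5, 10.0, 2.5, 1.5],
    "CustomerID": [17850.0, 17850.0, 17850.0, np.nan],
})


def clean_orders(df):
    """Return a cleaned copy of the order lines."""
    # Assumption: rows without a customer are dropped.
    out = df.dropna(subset=["CustomerID"]).copy()
    # Store the customer id as text, e.g. 17850.0 becomes "17850".
    out["CustomerID"] = out["CustomerID"].astype(int).astype(str)
    # Negative quantities come from canceled orders, so keep positive ones only.
    out = out[out["Quantity"] > 0]
    return out.reset_index(drop=True)


clean = clean_orders(raw)
print(clean)
```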
Now, let’s take a new look at the headers and answer the first question that came to my mind: which country has the highest sales value?
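A minimal sketch of this calculation, assuming the sales value of a line is Quantity times UnitPrice (both column names and the `TotalPrice` column are assumptions, and the rows are made up):

```python
import pandas as pd

# Made-up order lines for three countries.
df = pd.DataFrame({
    "Country": ["United Kingdom", "United Kingdom", "Netherlands", "Germany"],
    "Quantity": [6, 2, 10, 3],
    "UnitPrice": [2.5, 10.0, 1.5, 4.0],
})

# Sales value of each order line.
df["TotalPrice"] = df["Quantity"] * df["UnitPrice"]

# Total sales per country, highest first.
sales_by_country = (
    df.groupby("Country")["TotalPrice"].sum().sort_values(ascending=False)
)
print(sales_by_country)
```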
We can see this in a bar chart:
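A bar chart like this can be drawn with matplotlib. The sketch below uses made-up totals rather than the real results, and the labels are only suggestions:

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend, so the sketch runs anywhere
import matplotlib.pyplot as plt
import pandas as pd

# Made-up totals per country, already sorted from highest to lowest.
sales_by_country = pd.Series(
    {"United Kingdom": 120.0, "Netherlands": 45.0, "Germany": 30.0},
    name="TotalPrice",
)

fig, ax = plt.subplots(figsize=(8, 4))
ax.bar(list(sales_by_country.index), list(sales_by_country.values))
ax.set_xlabel("Country")
ax.set_ylabel("Total sales value")
ax.set_title("Sales value by country")
fig.tight_layout()
# In a notebook the figure is displayed automatically; in a script,
# fig.savefig("some_file.png") would write it to disk.
```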
This is a London store, so it makes sense that sales to the UK are much higher.
We will use this information to apply Apriori to the two countries with the highest sales: the UK and the Netherlands.
First, let’s understand the measures that we will work on: Support, Confidence, Lift and Conviction.
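- Support: The proportion of all orders that contain a given set of items. For a rule X => Y, it is the proportion of orders that contain both X and Y.
- Confidence: The Confidence measure is calculated on top of a rule (X => Y). It answers the question “If X is bought, what is the chance of Y being bought?”, that is, the support of X and Y together divided by the support of X.
- Lift: The confidence of the rule divided by the support of Y. It compares the chance of Y being bought when X is bought with the chance of Y being bought in general. A lift above 1 means that X and Y are bought together more often than would be expected if they were independent.
- Conviction: The Conviction measure looks at how often X occurs and Y does not occur, that is, at when the rule fails. It is calculated as (1 - support(Y)) / (1 - confidence(X => Y)), so higher values mean that the rule fails less often than would be expected if X and Y were independent.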
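To make the four measures concrete, here is a small worked example in plain Python. The five baskets are made up and the function names are my own, not part of any library:

```python
# Five made-up baskets; each basket is the set of items in one order.
baskets = [
    {"bread", "butter"},
    {"bread", "butter", "jam"},
    {"bread"},
    {"butter"},
    {"jam"},
]


def support(transactions, itemset):
    """Share of transactions that contain every item in itemset."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def confidence(transactions, x, y):
    """Chance of y being bought given that x is bought."""
    return support(transactions, set(x) | set(y)) / support(transactions, x)


def lift(transactions, x, y):
    """Confidence of x => y relative to the overall support of y."""
    return confidence(transactions, x, y) / support(transactions, y)


def conviction(transactions, x, y):
    """How much less often x => y fails than if x and y were independent."""
    conf = confidence(transactions, x, y)
    if conf == 1:
        return float("inf")  # the rule never fails
    return (1 - support(transactions, y)) / (1 - conf)


x, y = {"bread"}, {"butter"}
print(support(baskets, x | y))     # 2 of 5 baskets contain both items
print(confidence(baskets, x, y))   # 2 of the 3 bread baskets contain butter
print(lift(baskets, x, y))         # above 1: bought together more than by chance
print(conviction(baskets, x, y))
```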
If you want to know more details on how the whole theory behind it works, you can find it in the references at the end of the article.
Let’s get back to the code:
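A minimal sketch of this step, assuming columns named InvoiceNo, Description, Quantity and Country (the rows are made up). The last line, which turns quantities into True/False flags, is an extra step that is commonly done before mining rules and is an assumption of this sketch:

```python
import pandas as pd

# Made-up order lines.
orders = pd.DataFrame({
    "InvoiceNo": ["1001", "1001", "1001", "1002", "1003", "2001"],
    "Description": ["MUG", "TEAPOT", "MUG", "MUG", "PLATE", "TEAPOT"],
    "Quantity": [6, 2, 1, 3, 4, 1],
    "Country": ["United Kingdom"] * 5 + ["Netherlands"],
})

# Part 1: keep the UK orders only.
uk = orders[orders["Country"] == "United Kingdom"]

# Part 2: one row per order, one column per product,
# each cell holding the total quantity of that product in that order.
basket = (
    uk.groupby(["InvoiceNo", "Description"])["Quantity"]
    .sum()
    .unstack(fill_value=0)
)

# Association rules only need to know whether a product was in the order.
basket_flags = basket > 0
print(basket)
```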
The first part of the code above generates a data set with UK orders only. A pivot table is then generated where each row corresponds to an order, each column corresponds to a product, and each cell holds the total quantity of that product purchased in that order.
Next, we’ll apply the apriori algorithm:
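To show the idea without depending on a particular library, here is a small plain-Python sketch of Apriori on made-up baskets. The functions `apriori` and `rules` below are my own illustrative helpers; in practice a library implementation (for example the `apriori` and `association_rules` functions in mlxtend) is the more usual choice. The thresholds are chosen for this toy data and are much higher than the ones used with the real data set:

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {itemset: support} for every frequent itemset (level-wise search)."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)

    def support(itemset):
        return sum(1 for t in transactions if itemset <= t) / n

    # Level 1 candidates: every single item.
    current = {frozenset([item]) for t in transactions for item in t}
    frequent = {}
    k = 1
    while current:
        # Keep the candidates that reach the minimum support.
        level = {}
        for candidate in current:
            s = support(candidate)
            if s >= min_support:
                level[candidate] = s
        frequent.update(level)
        # Join step: combine frequent k-itemsets into (k+1)-item candidates.
        k += 1
        joined = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        current = {
            c for c in joined
            if all(frozenset(sub) in level for sub in combinations(c, k - 1))
        }
    return frequent


def rules(frequent, min_confidence):
    """Return (antecedent, consequent, support, confidence, lift) tuples."""
    found = []
    for itemset, supp in frequent.items():
        for size in range(1, len(itemset)):
            for antecedent in combinations(sorted(itemset), size):
                a = frozenset(antecedent)
                c = itemset - a
                conf = supp / frequent[a]
                if conf >= min_confidence:
                    found.append((a, c, supp, conf, conf / frequent[c]))
    return found


# Made-up baskets, one set of products per order.
baskets = [
    {"MUG", "TEAPOT"},
    {"MUG", "TEAPOT", "PLATE"},
    {"MUG"},
    {"TEAPOT", "PLATE"},
    {"MUG", "TEAPOT"},
]

frequent_itemsets = apriori(baskets, min_support=0.4)
found_rules = rules(frequent_itemsets, min_confidence=0.7)
for a, c, supp, conf, lift in found_rules:
    print(sorted(a), "=>", sorted(c), round(supp, 2), round(conf, 2), round(lift, 2))
```

Lowering `min_support` lets rarer itemsets through and usually produces more rules, which is why the threshold has to be chosen for each data set.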
Here we generated the frequent itemsets and the association rules.
Note that we need to define the minimum support.
For this database, we found only two association rules with a minimum support of 0.03.
We will perform the same procedure for the Netherlands database. This time, let’s try a minimum support of 0.07:
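Since only the country filter and the threshold change, the basket preparation can be wrapped in a helper. This is a sketch on made-up rows: `basket_for_country` is a hypothetical helper, the column names are assumptions, and only single-item supports are computed here. Mining the full rules would mean passing the resulting basket to an Apriori implementation with the 0.07 threshold.

```python
import pandas as pd

# Made-up order lines for two countries.
orders = pd.DataFrame({
    "InvoiceNo": ["1001", "1001", "1002", "1003", "1004", "2001", "2001", "2002"],
    "Description": ["MUG", "TEAPOT", "MUG", "PLATE", "MUG", "MUG", "TEAPOT", "MUG"],
    "Quantity": [6, 2, 3, 4, 1, 2, 1, 5],
    "Country": ["United Kingdom"] * 5 + ["Netherlands"] * 3,
})


def basket_for_country(df, country):
    """One row per order, one column per product, True if the product was bought."""
    subset = df[df["Country"] == country]
    quantities = (
        subset.groupby(["InvoiceNo", "Description"])["Quantity"]
        .sum()
        .unstack(fill_value=0)
    )
    return quantities > 0


MIN_SUPPORT_NL = 0.07  # threshold used for the Netherlands in the text

nl_basket = basket_for_country(orders, "Netherlands")
nl_support = nl_basket.mean()  # share of orders containing each product
nl_frequent_items = nl_support[nl_support >= MIN_SUPPORT_NL]

# The same helper gives the UK supports for comparison.
uk_support = basket_for_country(orders, "United Kingdom").mean()
print(nl_support)
print(uk_support)
```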
This code is similar to the code used to generate the UK rules. The objective is to show how the minimum support and the minimum confidence can vary from one data set to another. One country may have a more homogeneous purchasing profile and generate rules with greater support, while another country may generate rules with less support.
Finally, we come to the end of this article, whose main objective was to carry out an exploratory analysis and identify some association rules for a data set extracted from e-commerce.
I hope you guys enjoyed it, and see you soon!