2014 - 2020 Bangladeshi Ecommerce

Analysis and Dashboard of Ecommerce Data Using SQLite, Python, and Tableau.
DataData Addendum 1Data Addendum 2CodeDashboard
Overview & Objectives
The goal of this project was to identify any revenue trends, most purchased items, and most productive stores for an ecommerce business in Bangladesh, as well as potential avenues to take to increase productivity and sales. In order to do this, the analysis was performed in sections: KPIs over time, stores, products, and customers. The data used for this project was posted on Kaggle by user M Mohaiminul Islam and consists of customer orders, transaction information, item and store descriptions, and times from 2014 to 2020. The questions that inspired the project and that I wished to find answers to are as follows:
  1. Which divisions within Bangladesh are performing the best? Which ones are performing the worst?
  2. What items are the most popular and bringing in the most revenue?
  3. Is there any area in which the business could focus on or any marketing ventures they could purse to improve sales?
Method & Skills
The dataset has a little under 1million orders, with the original data being organized into multiple csv files. To improve the analysis, two more sources of data were used, namely Wikipedia for Bangladeshi network operator codes and Citypopulation for population values indifferent districts and divisions in Bangladesh. To analyze this data, SQLite and SQLiteStudio were used to create a SQLite database from the csv files. SQLite and Python were used to perform the exploration, cleaning, and joining of data, Python was used for analysis and data visualization, and Tableau was used to create a final dashboard that could be used by end users to monitor future quarterly performance in different divisions. The following skills were valuable in the data wrangling, analysis, and visualization phases, respectively:
  • SQLite: COUNT, AVG, SUM, DATE, SUBSTR, CAST, CASE, WHERE, JOIN, GROUP BY, and ORDER BY.
  • Python: NumPy, Pandas, Matplotlib, seaborn, and sqlite3.
  • Tableau: Tables, Bar Charts, and Dynamic Filtering.
Key Insights
The data does not include any costs, making profitability analyses infeasible. Regarding the questions that inspired this project, the answers are as follows:
  1. While there is minimal difference in performance between most stores in different divisions, Dhaka is the division that brings in the most customers and revenue per store. Rajshahi brings the least of both. It might be wise to place more efforts into Dhaka expansion and reduce stores in Rajshahi. Particularly, if we rate each of the districts within the divisions for the highest performer Dhaka and worst performer Rajshahi, we can see that the best and worst regions are not necessarily dictated by their population. It will be important to perform a deeper analysis of store locations within the divisions to see which one of the stores should be removed or where new stores should be propped.
  2. Food and beverages are the most sold products and should be prioritized to increase revenue. When isolating the top 10% of customers (in terms of amount of money spent), popularity of items shifts slightly to favor the healthier popular options. For example, healthy food over soda and protein and energy drinks over chocolate. Perhaps pricing of those items or deals should be rethought in order to increase profits among the top purchasers.
  3. KPIs dropped in Q2-2019 and have been climbing since then. It would be wise to inspect why this happened in terms of systems and operations so it can be safeguarded against in the future. There is not a specific time in which most revenue is made, signifying that there isn't a particular time in which to pursue advertising or sale offers. More customers pay with card than cash or mobile. It could be wise to pursue deals or promos with banks in order to improve sales and revenue.  40% of customers use Robi as their network operator. It could be beneficial to explore marketing ventures like cold calling or sending promos to Robi phone numbers.
Project Summary

This analysis will be performed in four parts: KPIs over time, stores, products, and customers. It will be followed by a dashboard allowing potential end users the ability to view quarterly performance of different divisions if more data were to flow in. We begin with the analysis of KPIs over time. The most important KPIs here are revenue, number of items purchased, and number of transactions made. These are all related metrics, meaning that they will likely show similar trends over time. Below is the initial query grouping the total revenue, total items purchased, and total transactions by hour, day, week, month, quarter, and year. Before continuing, note that all visualizations but the dashboard are static, with the link for the dashboard found above if one would like to use it.

Here we see a triple axis line graph depicting the hourly averages within each quarter of major KPIs that were previously introduced, namely revenue, number of items purchased, and number of transactions. Average revenue per hour is the bold purple line in the middle, average number of items purchased per hour is the dashed dark blue line at the top, and average number of transactions per hour is the dashed and dotted light blue line at the bottom. Note that each of these lines have a separate axis with a range of values distinct to the others. The main reason why this visualization was created in this way was to show how related each of the metrics are, with highs and lows occurring in the same or nearby quarters. This is an indication that looking at all of these metrics might be redundant; it would be beneficial to speak to the executive team at the business to have a better sense of what they are looking for when attempting to evaluate the business' performance. Also note the region surrounding each of the lines, as it corresponds to the standard error of the hourly averages in each quarter. As a matter of fact, there is a lot of hourly variability within the dataset regarding revenue, which one can see in the original code linked at the top of the page; there are many outliers in the higher extremes of revenue, indicating that averages might not be as useful as a metric as medians. This is explained by the fact that the metrics are being applied hourly, so it would also be useful to consider total KPI values like total quarterly revenue. Overall, however, even considering medians over averages and total KPI values, it is clear that fluctuations in hourly or even quarterly revenue are not too concerning. In general, KPIs remain constant throughout the years included in the dataset.

Here we see a heat map of total quarterly revenue over all the years in the dataset, including yearly averages on the far right column. From this map it is clear that the ecommerce business rocketed up at the beginning of 2014 and proceeded to plateau throughout the remaining years of the dataset.

One final note regarding KPIs is whether there is any trends within a single day. Maybe if revenue is highest during midday, it would be worthwhile to increase marketing efforts during those times of the day, for example. However, as we can see below, average hourly revenue remains constant no matter the hour of the day.

Let's now focus on the stores located in different divisions across Bangladesh. For context, divisions and districts are administrative divisions in Bangladesh, much like states and counties in the US. Bangladesh can be split into seven divisions, and each division will have a certain number of districts within it. Below we see a query creating a table counting the population, number of customers, number of stores, and total revenue for each district within each division. The goal for now will be to identify which divisions are achieving greater number of customers and revenue per store.

Below we see a bubble chart, with each bubble of a different color representing a different division. The size of the bubble corresponds to the total population of the division. The x-axis represents the total revenue of the division divided by the total number of stores in the division while the y-axis represents the total number of customers ordering within the division divided by the total number of stores in the division. Before touching on the two most noticeable divisions, let's first consider the two smallest divisions, namely Barisal and Sylhet. Does it make sense for them to be near the bottom in terms of both variables? This could lead to another interesting path to explore, as arguments could be made against expanding in those divisions only due to their populations. Nonetheless, as we see from the purple dot representing Rajshahi, a small population is not a necessity when it comes to poor performance. Rajshahi is the third largest division but also the division with the least number of customers and revenue per store. Clearly, population isn't the only factor to consider when analyzing division productivity. On the other side of the chart we have Dhaka, which is the largest division in terms of population and the best in terms of the two axes variables. Let's now dive further into the districts within the best performing division, Dhaka, and worst performing division, Rajshahi.

Here we have the table of all districts in Dhaka ordered by revenue per store. Although the largest district, Dhaka, dwarfs all other districts in terms of population and number of stores, it isn't the best with respect to our two variables. This poses the question: is having these two variables alone good enough to identify best and worst performing districts or even divisions for that matter? In my personal opinion, the answer is no, but these two variables offer valuable insight that might lead to the beginning of a larger analysis. In general, if we were to have to choose districts in which we would like to open new stores, I would also want to look at total population, average and median incomes in each district, how the districts are run in terms of local policy and geographical setting, and other variables.  

On the other hand, here we have the districts in the worst performing division Rajshahi. Here we have ordered the table in the opposite manner, with the districts with the lowest revenue per store listed above all the others. This table poses the same questions as the one above, as it would be just as unfortunate to remove stores in locations that have potential future growth than to add ones in locations that have no potential future growth. A further analysis and data is required to pick with of the stores within these districts are the worst performing or will become obsolete in the future.  

Now let's move on to looking at general trends for the products or items sold by the business. Here is a query creating a table noting the main supplier, total number purchased, and total revenue accrued from all items purchased. Note, item_name and desc are two separate variables in the query; item_name refers to the specific item whilst desc categorizes the item into groups. Therefore, the table created is grouped by all the items that were purchased, which allows for a better idea as to how many items suppliers are involved with given that multiple suppliers can be involved with a single category of item. For the remainder of this analysis, I have decided to break items up into classes and types. The way one could think about this could be in terms of supermarket aisles. For this analysis, item class can be thought of as aisle signs at a supermarket reading "produce", "dairy", or "bread". On the other hand, item types can be thought of as different sets of items found within each aisle; examples include "fruits" and "vegetables" at the "produce" aisle, "milk" and "yogurt" at the "dairy" aisle, or "loaves" and "tortillas" at the "bread" aisle. In sum, they are two hierarchical categorizations of items.

Here we have the table after the items were separated into their class and type. As we can see, the effect of grouping by item_name and not desc is that one can have multiple tuples of the same item class and item type but also have different (or the same!) supplier depending on what item was actually ordered there (think "Lays" vs. "Pringles").

Here we have a double axis bar graph representing both the average revenue per item and number of items purchased for all the item classes within the dataset. As we can see, although beverage and food aren't the most revenue-accruing items classes on average, they are the most purchased. As a matter of fact, if we consider the total revenue, they are in fact the most revenue-accruing item classes. As such, it would be a good idea to focus operational and marketing efforts to increase sales of items within these two item classes.

We also see the top ten item types (here written plainly as "Items") on the left of this combined plot. Soda, which belongs in the beverage class and healthy-food, which belongs in the food class are the top two most popular items, with up to five percent more purchases than chips or sparkling water. It would be wise to push these items out more than others, especially the ones at the bottom of the list. On the right of the combines plot we see the top ten most important suppliers in terms of the total revenue the items they provided accrued. In order to make a powerful statement regarding suppliers however, it is crucial to also have the cost of their products included in the dataset. Without the costs, it is impossible to know whether purchasing from each supplier actually leads to profit. If the costs were available, it would be more plausible to make a stronger judgement as to which products to push and which suppliers to drop or reinforce business with.

Finally, to finish off the analysis before jumping to the dashboard, we will be taking a look at customers. This is a query creating a table indicating the telephone network operator, main transaction type (as in card, cash, or mobile), bank name (if using card), amount spent, and amount of items purchased for each customer.

Here we have the distribution of the total amount spent by customers. The distribution seems to be quite normal in form, with a very slight right skew. Indicated below is also the 90th percentile of spenders. We will be visualizing what item classes and types are most popular within the top 10% of spenders to see if there are any changes and any items that should be specifically pushed to them with deals or bundle options.

Below we see the item class and top six item type popularities for the top 10% of spenders. The main difference that is spotted here is in the right plot depicting the popularity of item types. Here, healthier options like healthy-food and energy/protein-drinks win over the unhealthier options that were in their position in the previous popularity plot found previously in the analysis, i.e. soda and chocolate, respectively. There might be a case in favor of pushing healthier products to the higher spenders, or this visualization might just be telling us that healthier options are slightly more expensive. Further analysis is needed to reach a firm conclusion. For now let's move on to see how customers are spending their money, either via card, cash, or mobile.

Below we see a pie chart showing the distribution of transaction types used by customers. It is clear that the majority of customers use card. Now, it might be possible to come up with special cash back deals with banks, so it would be useful to look at which banks have customers that spend the most money in the ecommerce business.

An initial look at this plot shows us that overall, there isn't a big difference between the banks, and also that there are many different banks used by customers. This could be division or even district dependent, so an individual analysis for each division and district is recommended to see which banks are preferred and used the most there. Even so, in general, we have that the highest spending customers use NRB Global Bank Ltd, while the lowest spending customers use Dhaka Band Limited. The y-axis values aren't shown here to declutter the plot, but the value difference aren't very high between bars.

Finally, let's take a quick look at which network operators are the most popular. From the table below, we see that Robi provides for around 40% of customers. This might be a successful route to take in terms of marketing, as deals or direct marketing ventures could be performed to captures more of the Robi customer base.

That marks the conclusion of the analysis, and we will now finish off with a peak into the dashboard created specifically for end users wanting to monitor revenue and number of products sold per division in a quarterly fashion. Here we begin with filters at the top allowing the user to input or choose which division, year, and quarter to look at. Below those filters we have the quantity of products sold and total revenue for that quarter, alongside the percent change experienced from the previous quarter. Finally, at the bottom left, we have the top ten products (item types from before) ranked by their percentage of total purchases, while at the bottom right we have the performance of each district within the division by revenue per capita (of each division). This dashboard could be useful to quickly compare how the recently completed quarter compared with the last, for example. It could also be used to monitor changes in performance in districts within a specific division. Admittedly, this level of precision in terms of grouping by division, year, and quarter leads to a lack of trends being visualized anywhere, so this would be a dashboard to accompany a larger analysis like the one presented above.

Overall, answers to the posed questions were sufficiently provided, although it would intriguing to dive into deeper analyses if more data could be found, especially of cost. Some additionally questions I would pose for future analyses are as follows: Within Dhaka and Rajshahi districts, what further insights can more detailed geographical data lend us regarding which stores should be closed and where new stores should be propped?What items are bought together and/or in bulk? This could be asked like the following: how many people who buy sodas also by snacks? What items do the top suppliers supply? If more data is attained regarding costs, what items are leading to the most profit? What are the main fixed and variables costs? Which suppliers are the most expensive and which ones are bringing the most profits?