This hackathon project aims to explore a retail sales dataset. The purpose is to exercise ETL skills, the creation of visualisations, as well as of overall project management.
- The dataset, obtained from Kaggle, is comprised of 3 different CSV files:
- Stores (577 bytes): anonymized information about the 45 stores, indicating the type and size of store.
- Features (600 kilobytes): additional data related to the store, department, and regional activity for the given dates.
- Sales (13.3 megabytes): historical sales data, covering the period 2010-02-05 to 2012-11-01.
- The main aims are:
- Identify sales trends
- Evaluate the impact of promotional markdowns on sales
- Provide useful visualisations
- The project's aim is mostly exploratory in nature, there is no specific hypothesis to be tested. However, there are several expectations needing confirmation:
- There is a significant difference in sales between Holiday and non Holiday weeks;
- MarkDowns affect sales significantly;
- Unemployment and CPI influence sales.
-
High-level project layout:
- Extraction
- Preliminary exploration
- Cleaning
- Transformation and feature engineering
- Exploratory data analysis + visualisations
-
Data was 'collected' (i.e. downloaded) in CSV format. Data from each CSV file was then extracted as Pandas dataframe, subsequently merged into a unitary dataframe. It was kept in dataframe format for the remainder of the project.
-
Pandas dataframe format was chosen for it versatility: as the project required extensive manipulation of the data, Pandas' facilitation of data wrangling tasks made it the ideal framework.
-
Understanding the target variable - Weekly_Sales. Exploration always starts with the target variable. Since this is a continuous variables, looking into the distribution is a crucial step.
- Fig1 provides a histogram of the distribution, prior to any tampering.
- Fig3 is a histogram for after cleaning.
- Figs 4 and 5 offer additional visualisations of its distribution.
-
Sale trends were investigated at a high-level (i.e. store level over the whole period).
- Fig12 offers an interactive view of the overall sales trend.
- Fig13 explores the differences between high and low performers.
-
The effects of Mark Downs were explored using:
- Fig10, a correlation matrix, shows the relationship (linear) between various features
- Fig14, a scatterplot, investigates in more detail the effects of the best correlated Mark Down (no. 5) on sales.
The analysis applied structured data cleaning, transformation, and visualization for retail sales.
-
Notable decisions:
- Negative and anomalous sales were corrected or removed using conditional logic and median imputation.
- Outliers were addressed via quantile-based capping and log–z-score filtering.
- MarkDown features were cleaned, capped, and normalized through log transformation and z-score standardization.
- Temporal trends were explored with interactive Plotly line plots and Seaborn time-series visuals.
- Correlations and distributional patterns were examined through heatmaps, histograms, and scatterplots.
-
Further justification for each step can be found in the project notebook.
- The data was already anonymised, there are no identified ethical issues.
- No unfixed bugs.
- Weaknesses:
- Comparative analysis does not cover inter-store or inter-department.
- Trend analysis does not go into detail at department level, or year by year.
- Only explored linear relationships between the features.
- Improvements:
- More in-depth statistical analysis of the relationship between Weekly_Sales and the other features.
- Web-based method of disseminating the visualisations.
- An early challenge was merging the original 3 CSV files. I first needed to establish which is the main table (i.e. where the target variable is), then to understand how to map Features onto Sales (by a combination of Store + Department, rather than Store alone).
- Sales follows the expected Exponential distribution;
- There is a noticeable increase in sales during Holiday;
- High performing stores manage to create big sales spikes during Holidays, whilst low performing ones exhibit a constant level all year round;
- Between the 4 major holidays (mentioned by the dataset creators), only Christmas and Thanksgiving/Black Friday result in significant trend changes;
- There are mini spikes in sales at the beginning of each month;
- The Mark Downs are not very well correlated with sales (the best having a mere 0.17 positive correlation). Suggesting that they do not play a major role. Though, further analysis using statistical tests is required to confirm this.
- It remains unclear whether Mark Downs increase sales during Holidays (see Fig14).
- Unemployment and CPI are (at least linearly) not correlated to sales;
- The following are the main libraries used in the project:
- Pandas - create and manipulate dataframes;
- Matplotlib - foundation layer for visualisations;
- Seaborn - more advanced and visually pleasing visualisations;
- Plotly express - interactive visualisations.
- Data was obtained from Kaggle (https://www.kaggle.com/datasets/manjeetsingh/retaildataset).
