Demand forecasting for grocery stores using XGBoost on the Kaggle Favorita dataset (~3M rows of daily sales across 54 stores and 33 product families). Served through a Flask API with a React frontend and LLM-powered insight generation.
Download the Store Sales - Time Series Forecasting dataset from Kaggle and place the CSVs in backend/data/:
pip install kaggle
kaggle competitions download -c store-sales-time-series-forecasting
unzip store-sales-time-series-forecasting.zip -d backend/data/

You should have train.csv, stores.csv, oil.csv, holidays_events.csv, etc. in backend/data/.
cd backend
python -m venv venv
source venv/bin/activate
pip install -r requirements.txt
python train.py
python app.py

cd frontend
npm install
npm run dev

Opens on http://localhost:5173. Vite proxies /api to the Flask backend.
export OPENROUTER_API_KEY=sk-or-...

Insight generation uses openai/gpt-oss-safeguard-20b on OpenRouter, chosen for its low latency. Without the API key, the app shows a demo insight.
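The fallback to a demo insight can be sketched roughly as below. This is only an illustration, not the code in app.py: `DEMO_INSIGHT`, `get_insight`, and the `call_llm` callable are hypothetical names, and the actual request to OpenRouter is left to whatever client the backend uses.

```python
import os

# Hypothetical placeholder text; the real demo insight lives in the backend.
DEMO_INSIGHT = "Demo insight: set OPENROUTER_API_KEY to generate a live summary."


def get_insight(prompt, call_llm, env=os.environ):
    """Return an LLM summary if an API key is configured, else the demo text.

    call_llm is a hypothetical callable (prompt, api_key) -> str that wraps
    the actual OpenRouter request.
    """
    api_key = env.get("OPENROUTER_API_KEY")
    if not api_key:
        # No key configured: skip the network call entirely.
        return DEMO_INSIGHT
    return call_llm(prompt, api_key)
```

Passing the environment and the LLM call in as arguments keeps the fallback easy to test without a network connection.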
Features: day_of_week, month, is_weekend, is_holiday (from Ecuador holidays dataset), rolling averages (7/30-day), lag features (1/7-day), days since last promotion, oil price (forward-filled), plus encoded store/family/type/cluster IDs.
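A minimal sketch of how the calendar, lag, and rolling features could be built with pandas. It assumes the train.csv columns `date`, `store_nbr`, `family`, and `sales`; the function name and the output column names (`lag_1`, `rolling_mean_7`, and so on) are illustrative and may not match train.py. Holiday, promotion, oil, and ID-encoding features are omitted.

```python
import pandas as pd


def add_time_series_features(df: pd.DataFrame) -> pd.DataFrame:
    """Add calendar, lag, and rolling features per (store_nbr, family) series."""
    df = df.copy()
    df["date"] = pd.to_datetime(df["date"])
    df = df.sort_values(["store_nbr", "family", "date"]).reset_index(drop=True)

    # Calendar features
    df["day_of_week"] = df["date"].dt.dayofweek  # Monday=0 ... Sunday=6
    df["month"] = df["date"].dt.month
    df["is_weekend"] = (df["day_of_week"] >= 5).astype(int)

    # Lags and rolling means are computed within each store/family series
    grouped = df.groupby(["store_nbr", "family"])["sales"]
    for lag in (1, 7):
        df[f"lag_{lag}"] = grouped.shift(lag)
    for window in (7, 30):
        # shift(1) first so each window only sees days before the target day
        df[f"rolling_mean_{window}"] = grouped.transform(
            lambda s, w=window: s.shift(1).rolling(w, min_periods=1).mean()
        )
    return df
```

Shifting before rolling matters: without it, the rolling mean would include the same day's sales and leak the target into the features.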
Model: XGBoost regressor with a time-based train/test split (last 15 days held out). It reaches an RMSE of 214 and an MAE of 64 on the test set. Feature importance is dominated by lag and rolling features, as expected for time series.
Forecasting: Multi-step forecasts are produced recursively, feeding each day's prediction back as a lag input for subsequent days. This is not ideal, since errors compound along the horizon, but it is simple and works for a 14-day horizon.
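The recursive loop looks roughly like this. It is a simplified sketch: `predict_one` is a hypothetical stand-in for the trained model (any callable from a feature dict to a number), and only three of the features are rebuilt at each step.

```python
from collections import deque


def recursive_forecast(predict_one, history, horizon=14):
    """Forecast `horizon` days, feeding each prediction back as a lag input.

    history: recent observed sales, oldest first, with at least 7 values.
    """
    if len(history) < 7:
        raise ValueError("need at least 7 days of history for lag_7")
    window = deque(history, maxlen=30)  # keep only what the features need
    forecasts = []
    for _ in range(horizon):
        values = list(window)
        features = {
            "lag_1": values[-1],
            "lag_7": values[-7],
            "rolling_mean_7": sum(values[-7:]) / 7,
        }
        y_hat = predict_one(features)
        forecasts.append(y_hat)
        # The prediction is treated as if it were an observation, which is
        # why errors compound along the horizon.
        window.append(y_hat)
    return forecasts
```

For example, passing `lambda f: f["lag_7"]` as `predict_one` gives a seasonal-naive forecast that repeats the last observed week.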
API: Flask app loads the model at startup. The insight endpoint builds a prompt with forecast data and recent averages, sends it to an LLM, and returns a plain-English summary.
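As an illustration of the prompt-building step, the sketch below assembles a prompt from a forecast and two recent averages. The function name, argument names, and wording are assumptions for the example; the actual prompt in app.py may differ.

```python
def build_insight_prompt(store_nbr, family, forecast, avg_7, avg_30):
    """Assemble a plain-text prompt from forecast values and recent averages."""
    daily = ", ".join(f"{v:.1f}" for v in forecast)
    total = sum(forecast)
    return (
        f"You are a retail analyst. Store {store_nbr}, product family {family}.\n"
        f"Average daily sales, last 7 days: {avg_7:.1f}; last 30 days: {avg_30:.1f}.\n"
        f"Forecast for the next {len(forecast)} days: {daily} (total {total:.1f}).\n"
        "Summarize the trend in two or three plain-English sentences."
    )
```

Including the recent averages gives the LLM a baseline to compare the forecast against, so the summary can say whether demand is expected to rise or fall.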
