This repository contains the end-to-end pipeline, analysis, and results for the CSAI 253 – Machine Learning course project (Phase 2). We built and evaluated multiple tree‐based and ensemble classifiers to distinguish between benign and various attack types (DDoS, DoS, Mirai, Recon, MITM) using per-flow network features. The project was organized into data preparation, exploratory analysis, feature engineering, imbalance handling, outlier treatment, modeling, and final submission to the Kaggle competition csai-253-project-phase-2.
.
├── cache/                      # Temporary files and CatBoost logs
├── catboost_info/              # TensorBoard event logs for CatBoost training
├── data/
│   ├── train.csv               # Training data
│   ├── test.csv                # Test data
│   ├── phase2_students_before_cleaning.csv
│   ├── sample_submission.csv
│   └── Our_Competition_Submission.csv   # Final Kaggle submission (Private 0.9163 / Public 0.9146)
├── figures/
│   ├── class_distribution.png
│   ├── correlation_matrix.png
│   └── feature_importance.png
├── imbalance_analysis/         # Imbalance diagnostics and plots
├── Models/                     # Model artifacts and notebooks
│   ├── scaler.joblib
│   ├── selector.joblib
│   ├── xgb_model.joblib
│   ├── stacking_model.joblib
│   └── *.ipynb
├── notebooks/                  # Data profiling, cleaning, and preprocessing notebooks
│   ├── data_profilling.ipynb
│   ├── Feature_Descriptions.ipynb
│   ├── handling_duplicates.ipynb
│   ├── handling_imbalance.ipynb
│   ├── handling_outliers.ipynb
│   ├── model.ipynb
│   ├── scaling.ipynb
│   └── ydata_profiling_code.ipynb
├── Report/                     # PDF reports on methodology and rationale
│   ├── Columns Report.pdf
│   ├── Encoding Techniques.pdf
│   ├── Feature Descriptions & Preprocessing Report.pdf
│   ├── Feature Engineering Report.pdf
│   ├── [FINAL] PHASE 2 REPORT.pdf
│   ├── Handling Duplicates.pdf
│   ├── Handling Outliers.pdf
│   ├── Models Scaling.pdf
│   ├── Numerical Features Skewness Report.pdf
│   ├── Proper Treatment of Test Data in SMOTE Workflows.pdf
│   ├── Why You Should Split Your Data Before Correcting Skewness.pdf
│   └── skewness_report.txt
├── LICENSE
└── README.md
- Clone the repository:
  - `git clone https://github.com/amr-yasser226/intrusion-detection-kaggle.git`
  - `cd intrusion-detection-kaggle`
- Dependencies. A typical environment includes:
  - Python 3.8+
  - pandas, numpy, scikit-learn, xgboost, lightgbm, catboost, imbalanced-learn, ydata-profiling, optuna, matplotlib, seaborn, joblib, pdfkit
- Data
  - Place `train.csv` and `test.csv` in `/data` (see the loading sketch below).
  - Inspect `phase2_students_before_cleaning.csv` for the raw, uncleaned data.
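A quick sanity check once the files are in place. This is a minimal sketch that assumes the repository root as the working directory and a `label` target column; both may differ from the actual schema:

```python
# Sanity-check sketch: the paths and the "label" column name are assumptions.
import pandas as pd

train = pd.read_csv("data/train.csv")
test = pd.read_csv("data/test.csv")

print(train.shape, test.shape)
print(train["label"].value_counts(normalize=True))  # per-class share of flows
```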
- Exploratory Analysis & Profiling
  - Run `notebooks/data_profilling.ipynb` to generate profiling reports (see the sketch below).
  - Visualize distributions, skewness, and correlations.
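A minimal profiling sketch using ydata-profiling; the output path is an assumption for illustration, the notebook may write its report elsewhere:

```python
# Minimal profiling sketch; the report path is an assumption, not the notebook's actual output.
import pandas as pd
from ydata_profiling import ProfileReport

train = pd.read_csv("data/train.csv")
profile = ProfileReport(train, title="Phase 2 training data", minimal=True)
profile.to_file("figures/train_profile.html")
```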
- Preprocessing Pipelines
  - Deduplication: `handling_duplicates.ipynb` explores direct removal, weighting, and train–test aware grouping.
  - Skew correction: log1p, Yeo–Johnson, and Box–Cox, always fitted on the training split only (see *Why You Should Split Your Data Before Correcting Skewness.pdf* and the sketch after this list).
  - Outlier treatment: winsorization, Z-score filtering, and isolation forest (`handling_outliers.ipynb`).
  - Scaling: Standard, MinMax, Robust, and Quantile scalers (`scaling.ipynb`).
  - Imbalance handling: SMOTE, SMOTE-Tomek, class weights, EasyEnsemble, and RUSBoost (`handling_imbalance.ipynb`).
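The rule shared by the skew-correction and scaling steps is to fit transformers on the training split only, then apply the frozen transform to the test data. A minimal sketch of that pattern with scikit-learn; the `label` column name and the specific transformer choices are assumptions (the notebooks compare several options):

```python
# Fit-on-train-only sketch; "label" and the chosen transformers are assumptions.
import pandas as pd
from sklearn.preprocessing import PowerTransformer, RobustScaler

train = pd.read_csv("data/train.csv")
test = pd.read_csv("data/test.csv")

numeric_cols = train.select_dtypes(include="number").columns.drop("label", errors="ignore")

skew_fix = PowerTransformer(method="yeo-johnson")  # handles zeros/negatives, unlike Box-Cox
scaler = RobustScaler()                            # less outlier-sensitive than StandardScaler

# Fit on the training split only, then apply the frozen transforms to the test data.
train[numeric_cols] = skew_fix.fit_transform(train[numeric_cols])
test[numeric_cols] = skew_fix.transform(test[numeric_cols])

train[numeric_cols] = scaler.fit_transform(train[numeric_cols])
test[numeric_cols] = scaler.transform(test[numeric_cols])
```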
- Feature Engineering
  - Additional features (e.g. `rate_ratio`, `avg_pkt_size`, `burstiness`, `payload_entropy`, and time-cyclic features) are described in *Feature Engineering Report.pdf* and implemented in `scaling.ipynb` (see the sketch below).
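For illustration only, a sketch of how such derived features could be computed with pandas. The raw column names used here (`fwd_pkts`, `bwd_pkts`, `total_bytes`, `duration`, `hour`) are hypothetical stand-ins, not the dataset's actual per-flow fields:

```python
# Hypothetical feature-engineering sketch; input column names are placeholders.
import numpy as np
import pandas as pd

def add_engineered_features(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()
    eps = 1e-6  # guard against division by zero
    df["rate_ratio"] = df["fwd_pkts"] / (df["bwd_pkts"] + eps)
    df["avg_pkt_size"] = df["total_bytes"] / (df["fwd_pkts"] + df["bwd_pkts"] + eps)
    df["burstiness"] = df["total_bytes"] / (df["duration"] + eps)
    # Time-cyclic encoding so hour 23 and hour 0 end up close together.
    if "hour" in df.columns:
        df["hour_sin"] = np.sin(2 * np.pi * df["hour"] / 24)
        df["hour_cos"] = np.cos(2 * np.pi * df["hour"] / 24)
    return df
```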
- Model Training & Evaluation
  - XGBoost and stacking models in `model.ipynb` / `Phase_2 model.ipynb` (a minimal training sketch follows this list).
  - Hyperparameter tuning and Optuna-based LightGBM/CatBoost ensembles in `data_profilling.ipynb`.
  - Final models saved as `.joblib` artifacts in `/Models/`.
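A minimal training-and-export sketch for the XGBoost path, assuming the features are already numeric after preprocessing; the hyperparameters and the `label` column name are placeholders, not the tuned values from the report:

```python
# XGBoost sketch with placeholder hyperparameters; "label" is an assumed target name.
import joblib
import pandas as pd
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from xgboost import XGBClassifier

train = pd.read_csv("data/train.csv")
X = train.drop(columns=["label"])
y = LabelEncoder().fit_transform(train["label"])  # XGBoost expects integer class codes

X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)

model = XGBClassifier(n_estimators=500, max_depth=8, learning_rate=0.05, tree_method="hist")
model.fit(X_tr, y_tr)
print("Validation macro-F1:", f1_score(y_val, model.predict(X_val), average="macro"))

joblib.dump(model, "Models/xgb_model.joblib")
```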
- Results & Submission
  - Final Kaggle scores: 0.916289 (private) / 0.914581 (public).
  - Submission file: `data/Our_Competition_Submission.csv`.
Key takeaways:
- Deduplication first prevents leakage and skewed statistics.
- Skew correction must be fitted on the training data only to avoid over-optimistic metrics.
- Tree-based models are largely scale-invariant, but scaling benefits pipelines that mix learners.
- Outlier handling (winsorization, isolation forest) improves model robustness.
- Class imbalance is addressed via SMOTE (applied to the training split only; see the sketch after this list) and ensemble methods.
- XGBoost with tuned hyperparameters achieved the best standalone performance; stacking did not outperform it.
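A sketch of the resampling rule mentioned above: SMOTE is fit and applied on the training fold only, while validation and test data keep their original class distribution. The `label` column name is an assumption:

```python
# SMOTE-on-training-only sketch; "label" is an assumed target column name.
import pandas as pd
from imblearn.over_sampling import SMOTE
from sklearn.model_selection import train_test_split

train = pd.read_csv("data/train.csv")
X, y = train.drop(columns=["label"]), train["label"]

X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)

# Resample only the training fold; X_val / y_val stay untouched.
X_tr_res, y_tr_res = SMOTE(random_state=42).fit_resample(X_tr, y_tr)
print(y_tr.value_counts(), y_tr_res.value_counts(), sep="\n\n")
```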
To reproduce the results:
- Run the notebooks in order within a Jupyter environment, starting with data profiling and ending with `model.ipynb`.
- Generate figures in `/figures` and `/imbalance_analysis`.
- Train the final models and export `xgb_model.joblib` and `stacking_model.joblib`.
- Create the submission by loading `test.csv`, applying the saved preprocessing, predicting, and saving `Our_Competition_Submission.csv` (see the sketch below).
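A sketch of the submission step using the saved artifacts. The exact preprocessing order and the submission column name are defined by `model.ipynb` and `sample_submission.csv`, so treat the details below as assumptions:

```python
# Submission sketch; preprocessing order and the "label" column are assumptions.
import joblib
import pandas as pd

test = pd.read_csv("data/test.csv")  # drop any identifier columns first, as done in model.ipynb

scaler = joblib.load("Models/scaler.joblib")
selector = joblib.load("Models/selector.joblib")
model = joblib.load("Models/xgb_model.joblib")

X_test = selector.transform(scaler.transform(test))
preds = model.predict(X_test)

submission = pd.read_csv("data/sample_submission.csv")
submission["label"] = preds
submission.to_csv("data/Our_Competition_Submission.csv", index=False)
```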
This project is released under the MIT License.