Create a descriptive statistics program that analyzes the Hogwarts dataset without using built-in statistical functions like describe(), mean(), std(), etc.
- CSV Parsing: Manual reading and parsing of the training dataset
- Data Type Handling: Automatic detection and conversion of numerical features
- Missing Value Management: Identification and exclusion of empty/invalid entries
Expected output:

| | Feature 1 | Feature 2 | Feature 3 | Feature 4 |
|---|---|---|---|---|
| Count | 149.000000 | 149.000000 | 149.000000 | 149.000000 |
| Mean | 5.848322 | 3.051007 | 3.774497 | 1.205369 |
| Std | 5.906338 | 3.081445 | 4.162021 | 1.424286 |
| Min | 4.300000 | 2.000000 | 1.000000 | 0.100000 |
| 25% | 5.100000 | 2.800000 | 1.600000 | 0.300000 |
| 50% | 5.800000 | 3.000000 | 4.400000 | 1.300000 |
| 75% | 6.400000 | 3.300000 | 5.100000 | 1.800000 |
| Max | 7.900000 | 4.400000 | 6.900000 | 2.500000 |
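A minimal sketch of how these statistics can be computed without built-ins (the `ft_` helper names and the pandas-style linear interpolation for percentiles are assumptions, not the project's actual code):

```python
import math

def ft_mean(values):
    return sum(values) / len(values)

def ft_std(values):
    # Sample standard deviation (divide by n - 1), matching pandas describe()
    m = ft_mean(values)
    return math.sqrt(sum((v - m) ** 2 for v in values) / (len(values) - 1))

def ft_percentile(values, p):
    # Linear interpolation between the two closest ranks (pandas default)
    s = sorted(values)
    k = (len(s) - 1) * p / 100
    lo, hi = math.floor(k), math.ceil(k)
    if lo == hi:
        return s[int(k)]
    return s[lo] * (hi - k) + s[hi] * (k - lo)
```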
Usage:

```bash
python describe.py dataset_train.csv
```

Identify which Hogwarts course has the most homogeneous score distribution across all four houses (Gryffindor, Slytherin, Hufflepuff, Ravenclaw).
- CSV Parsing: Read the training dataset and extract student data by house
- Data Cleaning: Handle missing values and convert string scores to floats
- House Segregation: Separate students by their Hogwarts house for each subject
We used the Coefficient of Variation (CV) to measure homogeneity between houses:
CV = (standard_deviation / mean) × 100
The CV measures relative dispersion - the lower the CV, the more homogeneous the distribution.
Data Normalization:
To ensure fair comparison and avoid issues with negative or very small values, all grades are normalized using the global minimum across all subjects. Each score is shifted so that the minimum value becomes 1:
normalized_score = original_score - global_min + 1
This normalization is applied before calculating house averages and the CV.
- For each subject (see the sketch after this list):
- Apply global minimum normalization to all scores
- Calculate average score per house
- Compute standard deviation of the four house averages
- Calculate CV of house averages
- Identify subject with lowest CV
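A minimal sketch of these steps (the helper name and the use of the population standard deviation over the four averages are my assumptions):

```python
import math

HOUSES = ["Gryffindor", "Slytherin", "Hufflepuff", "Ravenclaw"]

def cv_of_house_averages(scores_by_house, global_min):
    # scores_by_house: house name -> list of raw scores for one subject
    averages = []
    for house in HOUSES:
        # Shift scores so the global minimum becomes 1 (normalization above)
        shifted = [s - global_min + 1 for s in scores_by_house[house]]
        averages.append(sum(shifted) / len(shifted))
    mean = sum(averages) / len(averages)
    # Population standard deviation of the four house averages
    std = math.sqrt(sum((a - mean) ** 2 for a in averages) / len(averages))
    return std / mean * 100  # CV in percent
```

The subject with the lowest returned CV is the most homogeneous.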
After applying global minimum normalization, the results show:
| Subject | CV (%) |
|---|---|
| Arithmancy | 0.57 |
| Astronomy | 2.02 |
| Herbology | 0.02 |
| Defense Against the Dark Arts | 0.02 |
| Divination | 0.02 |
| Muggle Studies | 1.76 |
| Ancient Runes | 0.39 |
| History of Magic | 0.02 |
| Transfiguration | 0.17 |
| Potions | 0.01 |
| Care of Magical Creatures | 0.00 |
| Charms | 0.03 |
| Flying | 0.40 |
**Most Homogeneous Course:** Care of Magical Creatures (CV: 0.00%)
To find the two most similar features, we compute the Pearson correlation coefficient r for every pair of courses:

```
         Σ[(x_i - x̄)(y_i - ȳ)]
r = ───────────────────────────────
    √[Σ(x_i - x̄)²] × √[Σ(y_i - ȳ)²]
```
- The average of each of the two subjects (x̄ and ȳ)
- The numerator: the sum of the products of deviations, Σ[(x_i - x̄)(y_i - ȳ)]
- The denominator: the square roots of the summed squared deviations of each subject, multiplied together
- r = numerator / denominator
- Loop through each course except the last one
- For each course, loop through all subsequent courses
- Track the names of the two courses with the highest correlation coefficient in absolute value seen so far
- Display the students' marks for those two subjects in a scatter plot (a sketch of the search follows)
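A sketch of the pairwise search (assumes a `scores` dict mapping course name to an aligned list of grades with missing values already removed):

```python
import math

def pearson(xs, ys):
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    den = (math.sqrt(sum((x - mean_x) ** 2 for x in xs))
           * math.sqrt(sum((y - mean_y) ** 2 for y in ys)))
    return num / den if den else 0.0

def most_correlated_pair(scores):
    courses = list(scores)
    best = ("", "", 0.0)
    for i in range(len(courses) - 1):         # each course except the last
        for j in range(i + 1, len(courses)):  # all subsequent courses
            r = pearson(scores[courses[i]], scores[courses[j]])
            if abs(r) > abs(best[2]):         # compare in absolute value
                best = (courses[i], courses[j], r)
    return best
```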
For two features, the coefficient always lies in [-1, 1], meaning that:
- coef close to 0 => no correlation between the two features
- coef close to 1 => strong correlation; the two features evolve in the same way
- coef close to -1 => strong inverse correlation; when the first feature increases, the second decreases
The highest correlation is between Astronomy and Defense Against the Dark Arts: -1.0000. This means the two courses are strongly but inversely correlated: the higher a student's grade in Astronomy, the lower their grade in Defense Against the Dark Arts, and vice versa.
Create a comprehensive pair plot visualization that displays the relationships between all Hogwarts courses, showing both distribution patterns and correlations between subjects for each house.
- CSV Parsing: Read the training dataset and extract student data
- House Segregation: Separate students by their Hogwarts house (Gryffindor, Slytherin, Hufflepuff, Ravenclaw)
- Feature Extraction: Extract all 13 course grades for visualization
The pair plot is a 13x13 matrix where each cell represents the relationship between two courses:
- Diagonal Elements: Histograms showing the distribution of grades for each individual course, separated by house
- Off-Diagonal Elements: Scatter plots showing the correlation between pairs of courses, with points colored by house
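A sketch of how such a grid can be built with matplotlib (assumes `data` maps each house to per-course grade lists aligned per student, with missing values already removed):

```python
import matplotlib.pyplot as plt

HOUSE_COLORS = {"Gryffindor": "red", "Slytherin": "green",
                "Hufflepuff": "gold", "Ravenclaw": "blue"}

def pair_plot(data, courses):
    n = len(courses)
    fig, axes = plt.subplots(n, n, figsize=(20, 20))
    for i, y_course in enumerate(courses):
        for j, x_course in enumerate(courses):
            ax = axes[i][j]
            for house, color in HOUSE_COLORS.items():
                if i == j:
                    # Diagonal: grade distribution for one course, per house
                    ax.hist(data[house][x_course], bins=20, alpha=0.5, color=color)
                else:
                    # Off-diagonal: scatter of one course against another
                    ax.scatter(data[house][x_course], data[house][y_course],
                               s=1, alpha=0.5, color=color)
            ax.set_xticks([])
            ax.set_yticks([])
    plt.show()
```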
Diagonal Histograms: Show the grade distribution for each course. Overlapping histograms by house reveal:
- Which houses perform better in specific subjects
- The spread and skewness of grades within each house
- Outliers and grade clustering patterns
Scatter Plots: Reveal correlations between course pairs:
- Positive correlation: Points form an upward trend (students good at one subject tend to be good at the other)
- Negative correlation: Points form a downward trend (inverse relationship between subjects)
- No correlation: Points appear randomly scattered
- House-specific patterns: Different colored clusters can reveal house-specific correlations
The pair plot provides a comprehensive overview of the Hogwarts academic landscape, enabling identification of:
- Courses with similar difficulty levels across houses
- Subject combinations that students excel at together
- House-specific academic strengths and weaknesses
- Overall correlation structure in the curriculum
This visualization complements the individual analyses by providing a holistic view of all course relationships simultaneously.
Model type: ŷ = σ(θ_0 + θ_1 x_1 + … + θ_n x_n)
Output nature: Probability ∈ [0, 1]
Logistic regression is a classification algorithm that predicts the probability of an instance belonging to a particular class. Unlike linear regression, it uses the sigmoid function to constrain outputs between 0 and 1.
The sigmoid (or logistic) function transforms any real-valued number into a probability:
```
            1
σ(z) = ───────────
        1 + e^(-z)
```
Where z = θ^T x (the linear combination of features and weights)
Properties:
- σ(z) → 1 when z → +∞
- σ(z) → 0 when z → -∞
- σ(0) = 0.5
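A minimal NumPy sketch of the sigmoid (the clipping bound is an assumption added to avoid overflow in exp):

```python
import numpy as np

def sigmoid(z):
    # Works elementwise on scalars and arrays
    z = np.clip(z, -500, 500)  # assumed guard against overflow in np.exp
    return 1.0 / (1.0 + np.exp(-z))
```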
For binary classification, we use the log loss (cross-entropy):
```
          1   m
J(θ) = - ─── Σ [y_i log(h_θ(x_i)) + (1 - y_i) log(1 - h_θ(x_i))]
          m  i=1
```
Where:
- h_θ(x) = σ(θ^T x) is the hypothesis (predicted probability)
- y_i ∈ {0, 1} is the true label
- m is the number of training examples
This cost function is convex, ensuring gradient descent converges to the global minimum.
To minimize the cost function, we update weights iteratively:
```
               ∂J(θ)            1   m
θ_j := θ_j - α ───── = θ_j - α ─── Σ (h_θ(x_i) - y_i) x_i^j
               ∂θ_j             m  i=1
```
Where:
- α is the learning rate (step size)
- The gradient tells us the direction to adjust weights
- We repeat until convergence
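A vectorized sketch of this update rule (the function name is illustrative; assumes the `sigmoid` helper above and a leading column of ones in `X` for the bias term):

```python
import numpy as np

def fit_binary(X, y, alpha=0.1, iterations=300):
    # X: (m, n) feature matrix, y: (m,) array of 0/1 labels
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(iterations):
        h = sigmoid(X @ theta)          # h_θ(x_i) for every example
        gradient = (X.T @ (h - y)) / m  # ∂J/∂θ_j of the log loss
        theta -= alpha * gradient       # simultaneous update of all θ_j
    return theta
```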
Since logistic regression is inherently binary (2 classes), but we need to classify into 4 houses, we use the One-vs-All (also called One-vs-Rest) multi-class strategy.
Principle:
1. Train K binary classifiers (one for each of the K classes):
   - Classifier 1: Gryffindor vs. {Hufflepuff, Ravenclaw, Slytherin}
   - Classifier 2: Hufflepuff vs. {Gryffindor, Ravenclaw, Slytherin}
   - Classifier 3: Ravenclaw vs. {Gryffindor, Hufflepuff, Slytherin}
   - Classifier 4: Slytherin vs. {Gryffindor, Hufflepuff, Ravenclaw}
2. Each classifier learns to answer: "Is this student in MY house?"
   - Output: the probability that the student belongs to that specific house
3. Prediction: for a new student, run all K classifiers and choose the house with the highest probability:

   predicted_house = argmax P(house_i | student), i ∈ {1, 2, 3, 4}
Mathematical Formulation:
For each class k:
- Transform labels: y_binary = 1 if y = k, else 0
- Train the classifier to minimize J_k(θ_k)
- Store the weights θ_k
At prediction time:
```
class = argmax σ(θ_k^T x)
          k
```
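A sketch of the One-vs-All loop built on the `fit_binary` and `sigmoid` helpers above (names are illustrative):

```python
import numpy as np

def train_one_vs_all(X, labels, houses, alpha=0.1, iterations=300):
    # One binary classifier per house: label 1 for "this house", 0 for the rest
    thetas = {}
    for house in houses:
        y_binary = np.array([1.0 if l == house else 0.0 for l in labels])
        thetas[house] = fit_binary(X, y_binary, alpha, iterations)
    return thetas

def predict_houses(X, thetas):
    houses = list(thetas)
    # (m, K) matrix of per-house probabilities; pick the argmax per student
    probs = np.column_stack([sigmoid(X @ thetas[h]) for h in houses])
    return [houses[i] for i in probs.argmax(axis=1)]
```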
To capture non-linear relationships between courses, we apply polynomial feature expansion:
Original features (13 courses):
x = [Arithmancy, Astronomy, Herbology, ..., Flying]
Polynomial features (degree 2):
- Linear terms: x_1, x_2, ..., x_13
- Quadratic terms: x_1², x_2², ..., x_13²
- Interaction terms: x_1·x_2, x_1·x_3, ..., x_12·x_13
Result: 104 features (13 linear + 13 quadratic + 78 interaction terms) that capture:
- Individual course performance (linear terms)
- Course difficulty patterns (quadratic terms)
- Cross-course correlations (interaction terms)
Example: Potions × Herbology might reveal students good at both (Hufflepuff pattern)
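A sketch of the degree-2 expansion (13 columns in, 104 out; the function name is illustrative):

```python
import numpy as np

def polynomial_features(X):
    n = X.shape[1]
    columns = [X, X ** 2]             # 13 linear + 13 quadratic terms
    for i in range(n - 1):            # 78 interaction terms x_i * x_j, i < j
        for j in range(i + 1, n):
            columns.append((X[:, i] * X[:, j]).reshape(-1, 1))
    return np.hstack(columns)
```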
To ensure stable gradient descent, we normalize all features:
```
         x - mean(x)
x_norm = ───────────
            std(x)
```
Benefits:
- All features on same scale (mean=0, std=1)
- Faster convergence
- Prevents features with large ranges from dominating
- Allows higher learning rate (α = 0.1)
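A sketch of the z-score scaling; returning the mean and std lets prediction reuse the training statistics (names are illustrative):

```python
import numpy as np

def normalize_fit(X):
    mean = X.mean(axis=0)
    std = X.std(axis=0)
    std[std == 0] = 1.0  # assumed guard: avoid division by zero on constant features
    return (X - mean) / std, mean, std

def normalize_apply(X, mean, std):
    # At prediction time, apply the *training* mean/std to the test features
    return (X - mean) / std
```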
Learning Rate (α = 0.1)
- Fixed theoretically based on normalized features
- With normalization, gradient is already scaled
- Higher LR → faster convergence
- Standard value for normalized data
Polynomial Degree (2)
- Degree 1: Too simple, linear boundaries only
- Degree 2: Captures non-linear interactions (~104 features)
- Degree 3: Too complex, ~455 features, overfitting risk
- Optimal: Degree 2 balances expressiveness and complexity
Iterations (300)
- Only hyperparameter tuned empirically
- Tested: 100, 200, 300, 400, 500, 750, 1000, 1500, 2000
- Convergence observed around 250-300 iterations
- Beyond 300: accuracy stagnates or decreases (overfitting)
- Load and parse training data (dataset_train.csv)
- Split into train/test sets (80/20)
- Apply polynomial features (degree 2)
- Normalize features (mean=0, std=1)
- Train 4 OvA classifiers using gradient descent
- Evaluate on test set to find optimal iterations
- Save model (weights, normalization params, label mapping); see the sketch after this list
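An illustrative end-to-end sketch tying the helpers above together (`load_dataset` is a hypothetical loader; the train/test split and model saving are omitted for brevity):

```python
import numpy as np

X_raw, labels = load_dataset("datasets/dataset_train.csv")  # hypothetical loader
X_poly = polynomial_features(X_raw)                          # degree-2 expansion
X_norm, mean, std = normalize_fit(X_poly)                    # z-score scaling
X = np.hstack([np.ones((X_norm.shape[0], 1)), X_norm])       # prepend bias column
houses = ["Gryffindor", "Hufflepuff", "Ravenclaw", "Slytherin"]
thetas = train_one_vs_all(X, labels, houses, alpha=0.1, iterations=300)
```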
- Load model (model.json)
- Parse test data (dataset_test.csv)
- Apply same transformations:
- Polynomial features (degree 2)
- Normalization (using training mean/std)
- Run all 4 classifiers
- Select house with highest probability
- Save predictions (houses.csv)
Target: 98% accuracy (Sorting Hat standard)
Achieved: ~98.5% on the test set
The model successfully replicates the Sorting Hat's decisions with high accuracy, demonstrating that:
- Course performance patterns encode house characteristics
- Polynomial features capture complex student profiles
- One-vs-All strategy effectively handles multi-class sorting
```bash
# Train the model
python logreg_train.py datasets/dataset_train.csv

# Make predictions
python logreg_predict.py datasets/dataset_test.csv

# Evaluate accuracy
python evaluate.py
```

Generated files:
- model.json: Complete model (weights, normalization params, mappings)
- weights.csv: Weight matrices only (compatibility)
- houses.csv: Predictions for the test set
Constraints respected:
- ✅ No sklearn or similar ML libraries
- ✅ Gradient descent implemented from scratch
- ✅ One-vs-All strategy manually coded
- ✅ NumPy used only for matrix operations (not ML functions)
Key features:
- Robust error handling
- Reproducible results (fixed random seed)
- Complete model serialization
- Modular, object-oriented design