42-DSLR

Built with the tools and technologies:

Built with: Markdown, GNU Bash, Python


Building the 42 DSLR project

Data Analysis

Objective

Create a descriptive statistics program that analyzes the Hogwarts dataset without using built-in statistical functions like describe(), mean(), std(), etc.

Data Processing

  1. CSV Parsing: Manual reading and parsing of the training dataset
  2. Data Type Handling: Automatic detection and conversion of numerical features
  3. Missing Value Management: Identification and exclusion of empty/invalid entries

Output Format

                    Feature 1    Feature 2    Feature 3    Feature 4
Count               149.000000   149.000000   149.000000   149.000000
Mean                5.848322     3.051007     3.774497     1.205369
Std                 5.906338     3.081445     4.162021     1.424286     # standard deviation
Min                 4.300000     2.000000     1.000000     0.100000
25%                 5.100000     2.800000     1.600000     0.300000
50%                 5.800000     3.000000     4.400000     1.300000
75%                 6.400000     3.300000     5.100000     1.800000
Max                 7.900000     4.400000     6.900000     2.500000
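Since describe(), mean(), std() and friends are off-limits, each statistic must be recomputed by hand. A minimal sketch of how that can be done (the ft_ helper names are hypothetical, not the project's actual functions; the percentile uses linear interpolation between closest ranks, matching pandas):

```python
import math

def ft_count(values):
    # Number of non-missing numeric entries
    return float(len(values))

def ft_mean(values):
    return sum(values) / len(values)

def ft_std(values):
    # Sample standard deviation (ddof=1), as pandas describe() reports
    mu = ft_mean(values)
    return math.sqrt(sum((v - mu) ** 2 for v in values) / (len(values) - 1))

def ft_percentile(values, p):
    # Linear interpolation between the two closest ranks
    s = sorted(values)
    k = (len(s) - 1) * p / 100.0
    lo, hi = int(math.floor(k)), int(math.ceil(k))
    if lo == hi:
        return s[lo]
    return s[lo] + (s[hi] - s[lo]) * (k - lo)
```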

Usage

python describe.py dataset_train.csv

Histogram Analysis

Objective

Identify which Hogwarts course has the most homogeneous score distribution across the four houses (Gryffindor, Slytherin, Hufflepuff, Ravenclaw).

Data Processing

  1. CSV Parsing: Read the training dataset and extract student data by house
  2. Data Cleaning: Handle missing values and convert string scores to floats
  3. House Segregation: Separate students by their Hogwarts house for each subject

Homogeneity Measurement

We used the Coefficient of Variation (CV) to measure homogeneity between houses:

CV = (standard_deviation / mean) × 100

The CV measures relative dispersion: the lower the CV, the more homogeneous the distribution.

Data Normalization:

To ensure fair comparison and avoid issues with negative or very small values, all grades are normalized using the global minimum across all subjects. Each score is shifted so that the minimum value becomes 1:

normalized_score = original_score - global_min + 1

This normalization is applied before calculating house averages and the CV.

Algorithm Steps

  1. For each subject:
    • Apply global minimum normalization to all scores
    • Calculate average score per house
    • Compute standard deviation of the four house averages
    • Calculate CV of house averages
  2. Identify subject with lowest CV
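The steps above can be sketched as follows; homogeneity_cv and the house_scores mapping are hypothetical names chosen for illustration:

```python
import math

def homogeneity_cv(house_scores, global_min):
    # house_scores: {house_name: [scores]} for a single subject
    # Shift every score so the global minimum across all subjects maps to 1
    averages = []
    for scores in house_scores.values():
        shifted = [s - global_min + 1 for s in scores]
        averages.append(sum(shifted) / len(shifted))
    # Standard deviation of the four house averages
    mean = sum(averages) / len(averages)
    std = math.sqrt(sum((a - mean) ** 2 for a in averages) / len(averages))
    return std / mean * 100  # Coefficient of Variation, in percent
```

The subject with the lowest returned value is the most homogeneous one.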

Results

After applying global minimum normalization, the results show:

Subject                          CV (%)
Arithmancy                       0.57
Astronomy                        2.02
Herbology                        0.02
Defense Against the Dark Arts    0.02
Divination                       0.02
Muggle Studies                   1.76
Ancient Runes                    0.39
History of Magic                 0.02
Transfiguration                  0.17
Potions                          0.01
Care of Magical Creatures        0.00
Charms                           0.03
Flying                           0.40

πŸ† Most Homogeneous Course: Care of Magical Creatures (CV: 0.00%)

Scatter Plot Analysis

Pearson's correlation coefficient

        Σ[(xi - x̄)(yi - ȳ)]
r = ─────────────────────────────
    √[Σ(xi - x̄)²] × √[Σ(yi - ȳ)²]

  1. Compute the mean of each of the two subjects (x̄ and ȳ)
  2. Numerator: sum of the products of the deviations from both means
  3. Denominator: square root of the summed squared deviations for each subject
  4. Divide the numerator by the product of the two square roots
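A from-scratch version of this formula might look like the following (pearson is a hypothetical name):

```python
import math

def pearson(x, y):
    # Pearson correlation coefficient between two lists of scores
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    num = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    den = math.sqrt(sum((xi - mx) ** 2 for xi in x)) * \
          math.sqrt(sum((yi - my) ** 2 for yi in y))
    return num / den if den != 0 else 0.0
```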

Algorithm Steps

  1. Loop through each course except the last one
  2. For each course, loop through all subsequent courses until the last one
  3. Keep the names of the two courses and the correlation coefficient result
  4. Display the students' marks for the two subjects with the highest correlation coefficient in absolute value in a scatter plot

Correlation coefficient analysis

For two features, the coefficient always lies in [-1, 1]:

  • coef close to 0 => no correlation between the two features
  • coef close to 1 => strong correlation; the two features evolve in the same way
  • coef close to -1 => strong correlation, but when the first feature increases, the second one decreases

Results

The highest correlation is between Astronomy and Defense Against the Dark Arts: -1.0000. This is a strong but decreasing relationship: the higher a student's grade in Astronomy, the lower their grade in Defense Against the Dark Arts, and vice versa.

Pair Plot Analysis

Objective

Create a comprehensive pair plot visualization that displays the relationships between all Hogwarts courses, showing both distribution patterns and correlations between subjects for each house.

Data Processing

  1. CSV Parsing: Read the training dataset and extract student data
  2. House Segregation: Separate students by their Hogwarts house (Gryffindor, Slytherin, Hufflepuff, Ravenclaw)
  3. Feature Extraction: Extract all 13 course grades for visualization

Visualization Structure

The pair plot is a 13x13 matrix where each cell represents the relationship between two courses:

  • Diagonal Elements: Histograms showing the distribution of grades for each individual course, separated by house
  • Off-Diagonal Elements: Scatter plots showing the correlation between pairs of courses, with points colored by house

Matrix Interpretation

Diagonal Histograms: Show the grade distribution for each course. Overlapping histograms by house reveal:

  • Which houses perform better in specific subjects
  • The spread and skewness of grades within each house
  • Outliers and grade clustering patterns

Scatter Plots: Reveal correlations between course pairs:

  • Positive correlation: Points form an upward trend (students good at one subject tend to be good at the other)
  • Negative correlation: Points form a downward trend (inverse relationship between subjects)
  • No correlation: Points appear randomly scattered
  • House-specific patterns: Different colored clusters can reveal house-specific correlations

Results

The pair plot provides a comprehensive overview of the Hogwarts academic landscape, enabling identification of:

  • Courses with similar difficulty levels across houses
  • Subject combinations that students excel at together
  • House-specific academic strengths and weaknesses
  • Overall correlation structure in the curriculum

This visualization complements the individual analyses by providing a holistic view of all course relationships simultaneously.

Logistic Regression

Definition

Model type: ŷ = σ(θ_0 + θ_1 x_1 + … + θ_n x_n). Output nature: a probability ∈ [0, 1].

Logistic regression is a classification algorithm that predicts the probability of an instance belonging to a particular class. Unlike linear regression, it uses the sigmoid function to constrain outputs between 0 and 1.

Sigmoid Function

The sigmoid (or logistic) function transforms any real-valued number into a probability:

         1
σ(z) = ─────────
       1 + e^(-z)

Where z = θ^T x (the linear combination of features and weights)

Properties:

  • σ(z) → 1 when z → +∞
  • σ(z) → 0 when z → -∞
  • σ(0) = 0.5
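A NumPy sketch of the sigmoid; the clip is an added numerical-stability guard against overflow in exp, an assumption rather than necessarily the project's code:

```python
import numpy as np

def sigmoid(z):
    # Logistic function; clipping z keeps np.exp from overflowing
    z = np.clip(z, -500, 500)
    return 1.0 / (1.0 + np.exp(-z))
```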

Cost Function

For binary classification, we use the log loss (cross-entropy):

          1   m
J(θ) = - ───  Σ [y_i log(h_θ(x_i)) + (1 - y_i) log(1 - h_θ(x_i))]
          m  i=1

Where:

  • h_θ(x) = σ(θ^T x) is the hypothesis (predicted probability)
  • y_i ∈ {0, 1} is the true label
  • m is the number of training examples

This cost function is convex, ensuring gradient descent converges to the global minimum.

Gradient Descent

To minimize the cost function, we update weights iteratively:

               ∂J(θ)            1   m
θ_j := θ_j - α ───── = θ_j - α ───  Σ (h_θ(x_i) - y_i) x_i^j
               ∂θ_j             m  i=1

Where:

  • α is the learning rate (step size)
  • The gradient tells us the direction to adjust weights
  • We repeat until convergence
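The update rule above can be sketched as a vectorized loop for one binary classifier; train_binary is a hypothetical name, and the convention of a leading bias column of ones in X is an assumption:

```python
import numpy as np

def train_binary(X, y, alpha=0.1, n_iter=300):
    # X: (m, n) feature matrix with a leading column of ones (bias term)
    # y: (m,) vector of 0/1 labels for one One-vs-All classifier
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(n_iter):
        h = 1.0 / (1.0 + np.exp(-X @ theta))   # predicted probabilities
        gradient = X.T @ (h - y) / m           # ∂J/∂θ, all weights at once
        theta -= alpha * gradient
    return theta
```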

One-vs-All (OvA) Strategy

Since logistic regression is inherently binary (2 classes), but we need to classify into 4 houses, we use the One-vs-All (also called One-vs-Rest) multi-class strategy.

Principle:

  1. Train K binary classifiers (one for each of the K classes)

    • Classifier 1: Gryffindor vs. {Hufflepuff, Ravenclaw, Slytherin}
    • Classifier 2: Hufflepuff vs. {Gryffindor, Ravenclaw, Slytherin}
    • Classifier 3: Ravenclaw vs. {Gryffindor, Hufflepuff, Slytherin}
    • Classifier 4: Slytherin vs. {Gryffindor, Hufflepuff, Ravenclaw}
  2. Each classifier learns to answer: "Is this student in MY house?"

    • Output: Probability that the student belongs to that specific house
  3. Prediction: For a new student, run all K classifiers and:

   predicted_house = argmax(P(house_i | student))
                             i ∈ {1, 2, 3, 4}

Choose the house with the highest probability

Mathematical Formulation:

For each class k:

  • Transform labels: y_binary = 1 if y = k, else 0
  • Train classifier to minimize: J_k(θ_k)
  • Store weights: θ_k

At prediction time:

class = argmax σ(θ_k^T x)
         k
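That prediction-time argmax might be sketched as follows (predict_house, thetas, and houses are hypothetical names):

```python
import numpy as np

def predict_house(x, thetas, houses):
    # thetas: list of K weight vectors, one per One-vs-All classifier
    # houses: list of K house names, in the same order as thetas
    probs = [1.0 / (1.0 + np.exp(-theta @ x)) for theta in thetas]
    return houses[int(np.argmax(probs))]
```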

Feature Engineering

To capture non-linear relationships between courses, we apply polynomial feature expansion:

Original features (13 courses):

x = [Arithmancy, Astronomy, Herbology, ..., Flying]

Polynomial features (degree 2):

- Linear terms: x_1, x_2, ..., x_13
- Quadratic terms: x_1², x_2², ..., x_13²
- Interaction terms: x_1·x_2, x_1·x_3, ..., x_12·x_13

Result: 104 features (13 linear + 13 quadratic + 78 interaction terms) that capture:

  • Individual course performance (linear terms)
  • Course difficulty patterns (quadratic terms)
  • Cross-course correlations (interaction terms)

Example: Potions × Herbology might reveal students good at both (a Hufflepuff pattern)
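A degree-2 expansion matching the counts above could be sketched as (polynomial_features is a hypothetical name):

```python
import numpy as np

def polynomial_features(X):
    # Degree-2 expansion: linear + squared + pairwise interaction columns
    m, n = X.shape
    cols = [X]                                   # linear terms: n columns
    cols.append(X ** 2)                          # quadratic terms: n columns
    for i in range(n):
        for j in range(i + 1, n):                # interactions: n*(n-1)/2
            cols.append((X[:, i] * X[:, j]).reshape(m, 1))
    return np.hstack(cols)
```

For n = 13 courses this yields 13 + 13 + 78 = 104 columns.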

Feature Normalization

To ensure stable gradient descent, we normalize all features:

         x - mean(x)
x_norm = ────────────
          std(x)

Benefits:

  • All features on same scale (mean=0, std=1)
  • Faster convergence
  • Prevents features with large ranges from dominating
  • Allows a higher learning rate (α = 0.1)
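A z-score normalization sketch; the training mean/std are returned so they can be reused on the test set, as the Prediction Process requires (normalize is a hypothetical name):

```python
import numpy as np

def normalize(X, mean=None, std=None):
    # Fit mean/std on training data; pass them back in at prediction time
    if mean is None:
        mean = X.mean(axis=0)
        std = X.std(axis=0)
    std = np.where(std == 0, 1.0, std)  # avoid division by zero
    return (X - mean) / std, mean, std
```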

Hyperparameters

Learning Rate (α = 0.1)

  • Fixed a priori rather than tuned, based on the normalized features
  • With normalization, the gradient is already well scaled
  • Higher LR → faster convergence
  • A standard value for normalized data

Polynomial Degree (2)

  • Degree 1: Too simple, linear boundaries only
  • Degree 2: Captures non-linear interactions (~104 features)
  • Degree 3: Too complex, ~455 features, overfitting risk
  • Optimal: Degree 2 balances expressiveness and complexity

Iterations (300)

  • Only hyperparameter tuned empirically
  • Tested: 100, 200, 300, 400, 500, 750, 1000, 1500, 2000
  • Convergence observed around 250-300 iterations
  • Beyond 300: accuracy stagnates or decreases (overfitting)

Training Process

  1. Load and parse training data (dataset_train.csv)
  2. Split into train/test sets (80/20)
  3. Apply polynomial features (degree 2)
  4. Normalize features (mean=0, std=1)
  5. Train 4 OvA classifiers using gradient descent
  6. Evaluate on test set to find optimal iterations
  7. Save model (weights, normalization params, label mapping)

Prediction Process

  1. Load model (model.json)
  2. Parse test data (dataset_test.csv)
  3. Apply same transformations:
    • Polynomial features (degree 2)
    • Normalization (using training mean/std)
  4. Run all 4 classifiers
  5. Select house with highest probability
  6. Save predictions (houses.csv)

Performance

Target: 98% accuracy (Sorting Hat standard)
Achieved: ~98.5% on the test set

The model successfully replicates the Sorting Hat's decisions with high accuracy, demonstrating that:

  • Course performance patterns encode house characteristics
  • Polynomial features capture complex student profiles
  • One-vs-All strategy effectively handles multi-class sorting

Usage

# Train the model
python logreg_train.py datasets/dataset_train.csv

# Make predictions
python logreg_predict.py datasets/dataset_test.csv

# Evaluate accuracy
python evaluate.py

Files Generated

  • model.json: Complete model (weights, normalization params, mappings)
  • weights.csv: Weight matrices only (compatibility)
  • houses.csv: Predictions for test set

Implementation Notes

Constraints respected:

  • ✅ No sklearn or similar ML libraries
  • ✅ Gradient descent implemented from scratch
  • ✅ One-vs-All strategy manually coded
  • ✅ NumPy used only for matrix operations (not ML functions)

Key features:

  • Robust error handling
  • Reproducible results (fixed random seed)
  • Complete model serialization
  • Modular, object-oriented design

About

Harry Potter and the Data Scientist: an introduction to logistic regression.
