This API helps students optimize their study time by:
- Personalized recommendations using reinforcement learning
- Efficient question selection based on previous performance data
- Adaptive difficulty progression preventing wasted effort
- Confidence building through strategic challenge scaling
- Smart preparation for exams with mixed-difficulty questions
Designed as the AI core for an educational app: it handles the recommendation logic while the app handles the UI and question delivery.
--------------------------

```
+------------------+                      +---------------------+
| 1. Take Initial  |                      |    Question Bank    |
|    Assessment    |<---------------------|   (By Difficulty)   |
+------------------+                      +---------------------+
         |                                           |
         v                                           v
+------------------+                      +---------------------+
| 2. Enter Test    |                      |    Reinforcement    |
| Scores & Needs   |--------------------->|    Learning API     |
+------------------+     User History     |   (This Project)    |
         |                                +---------------------+
         v                                           |
+------------------+                      +----------------------+
| 3. Get Question  |                      |   Study Interface    |
| Recommendations  |--------------------->| (Displays Questions) |
+------------------+   Difficulty List    +----------------------+
         |
         v
+------------------+                      +-------------------------------------+
| 4. Practice      |                      |                                     |
| with Smart       |--------------------->| Results are saved for future inputs |
| Recommendations  |                      +-------------------------------------+
+------------------+
```
Example input (difficulty 0 is the easiest, 4 the hardest):

```json
{
    "difficulty_scores": {
        "0": {"obtained": 95, "total": 100},
        "1": {"obtained": 85, "total": 100},
        "2": {"obtained": 70, "total": 100},
        "3": {"obtained": 60, "total": 100},
        "4": {"obtained": 40, "total": 100}
    },
    "questions_needed": 20
}
```

For 20 questions:

```
[4, 3, 1, 1, 4, 4, 3, 4, 0, 3, 3, 0, 4, 3, 1, 4, 1, 1, 4, 4]
```
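Assuming the API derives its state from these scores as per-difficulty proficiency ratios (obtained over total, which is an assumption about the internal representation), the conversion is straightforward:

```python
def scores_to_state(difficulty_scores):
    """Convert per-difficulty test scores into a proficiency vector,
    ordered from easiest (0) to hardest difficulty level."""
    levels = sorted(difficulty_scores, key=int)
    return [difficulty_scores[k]["obtained"] / difficulty_scores[k]["total"]
            for k in levels]

request = {
    "difficulty_scores": {
        "0": {"obtained": 95, "total": 100},
        "1": {"obtained": 85, "total": 100},
        "2": {"obtained": 70, "total": 100},
        "3": {"obtained": 60, "total": 100},
        "4": {"obtained": 40, "total": 100},
    },
    "questions_needed": 20,
}

state = scores_to_state(request["difficulty_scores"])
# state == [0.95, 0.85, 0.7, 0.6, 0.4]
```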
After cloning the repo, please install the required packages with:

```shell
pip install -r requirements.txt
```

```shell
# Train model with default settings (saves to models/agent.pth)
python src/train.py

# Train with progress visualization
python src/train.py --plot
```

Note: the model was trained on synthetic data. For better performance, train the model on real student data.

```shell
python src/main.py
```

This model simulates the student's knowledge progression. For each difficulty level $i$, it maintains a probability $p_i$ that the student answers a question of that difficulty correctly.

Correct Answer Update:

$$p_i \leftarrow p_i + \alpha \, c \, \frac{i+1}{N} \, p_i (1 - p_i)$$

Incorrect Answer Update:

$$p_i \leftarrow p_i - \alpha \, w \, \frac{N-i}{N} \, p_i (1 - p_i)$$

(One form consistent with the coefficient definitions and behavior described below; see Malpani et al. for the exact equations.)
Original Paper Equations (Malpani et al.)
Where:

- $\alpha$ = learning rate
- $c$ = correct coefficient
- $w$ = wrong coefficient
- $N$ = total number of difficulty types
- In this class, we maintain probabilities for different question types.
- We update probabilities along a logistic-like curve (the $p_i(1 - p_i)$ term is the derivative of the sigmoid).
- Harder questions get bigger boosts for correct answers but smaller penalties for wrong ones.
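A minimal sketch of such a student model. The class name, coefficient values, and exact scaling are illustrative assumptions, not the project's actual code; what it preserves is the described behavior: sigmoid-derivative scaling, bigger boosts and smaller penalties for harder levels.

```python
class StudentModel:
    """Tracks per-difficulty success probabilities p_i (illustrative sketch)."""

    def __init__(self, n_levels=5, alpha=0.1, c=0.2, w=0.05):
        self.n = n_levels
        self.alpha = alpha          # learning rate
        self.c = c                  # correct coefficient
        self.w = w                  # wrong coefficient
        self.p = [0.5] * n_levels   # start uncertain at every level

    def update(self, i, correct):
        # x = p(1 - p): derivative of the sigmoid, so updates are largest
        # when the student is most uncertain (p near 0.5)
        x = self.p[i] * (1 - self.p[i])
        if correct:
            # harder questions (larger i) give proportionally bigger boosts
            self.p[i] += self.alpha * self.c * (i + 1) / self.n * x
        else:
            # harder questions give proportionally smaller penalties
            self.p[i] -= self.alpha * self.w * (self.n - i) / self.n * x
        self.p[i] = min(max(self.p[i], 0.01), 0.99)  # keep in (0, 1)

model = StudentModel()
model.update(4, correct=True)    # hardest level, answered correctly
model.update(0, correct=False)   # easiest level, answered incorrectly
```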
This is the neural network that makes decisions (actor) and evaluates the states (critic).
Architecture Diagram

```
Input (State)
  │
  ├── Shared Layers (ReLU → Tanh)
  ├── Actor Head (Softmax)  → Action Probabilities
  └── Critic Head (Tanh)    → State Value
```
The shared layers learn general features, and the actor outputs the probability distribution over actions using softmax:

$$\pi(a \mid s) = \mathrm{softmax}(W_a h + b_a)$$

where $h$ is the hidden layer output and $W_a$, $b_a$ are the actor head's parameters.
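A framework-agnostic sketch of this forward pass. The layer sizes are illustrative, and the randomly initialized weights stand in for trained parameters:

```python
import numpy as np

rng = np.random.default_rng(0)
STATE_DIM, HIDDEN, N_ACTIONS = 5, 16, 5

# Random weights stand in for trained parameters (illustrative only)
W1 = rng.normal(size=(STATE_DIM, HIDDEN)) * 0.5
W2 = rng.normal(size=(HIDDEN, HIDDEN)) * 0.5
Wa = rng.normal(size=(HIDDEN, N_ACTIONS)) * 0.5
Wv = rng.normal(size=HIDDEN) * 0.5

def forward(state):
    h = np.tanh(np.maximum(state @ W1, 0.0) @ W2)  # shared layers: ReLU -> Tanh
    logits = h @ Wa                                # actor head
    probs = np.exp(logits - logits.max())          # numerically stable softmax
    probs /= probs.sum()
    value = float(np.tanh(h @ Wv))                 # critic head: Tanh -> scalar value
    return probs, value

probs, value = forward(np.array([0.95, 0.85, 0.70, 0.60, 0.40]))
```

The two heads share features but serve different roles: `probs` is a distribution over the five difficulty levels, while `value` is a single estimate of how good the current state is.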
The Actor-Critic algorithm combines policy optimization with value estimation.
Advantage:

$$A_t = R_t - V(s_t), \qquad R_t = \sum_{k=0}^{T-t} \gamma^k \, r_{t+k}$$

The advantage measures how much better an action was compared to the critic's expectation.
- Positive advantage = action was better than expected
- Negative advantage = action was worse than expected

In the code, the advantage is `returns - values`, where `returns` holds the discounted cumulative rewards.
Policy Loss:

$$L_{\text{policy}} = -\sum_t \log \pi(a_t \mid s_t) \, A_t$$

This adjusts the actor network to favor actions that lead to higher-than-expected rewards.

Value Loss:

$$L_{\text{value}} = \sum_t \left( R_t - V(s_t) \right)^2$$

This trains the critic to better estimate state values.
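In line with `advantage = returns - values`, the episode-level computation can be sketched in pure Python. All of the numbers below are illustrative stand-ins for logged episode data:

```python
def discounted_returns(rewards, gamma=0.99):
    """R_t = r_t + gamma * R_{t+1}, computed backwards over the episode."""
    returns, R = [], 0.0
    for r in reversed(rewards):
        R = r + gamma * R
        returns.append(R)
    return returns[::-1]

rewards   = [1.0, 1.0, 1.0]       # per-step rewards (illustrative)
values    = [0.9, 0.8, 0.7]       # critic outputs V(s_t) (illustrative)
log_probs = [-1.6, -1.2, -0.9]    # log pi(a_t | s_t) (illustrative)

returns    = discounted_returns(rewards)
advantages = [R - V for R, V in zip(returns, values)]

# Actor: push up actions whose advantage is positive
policy_loss = -sum(lp * A for lp, A in zip(log_probs, advantages))
# Critic: squared error between returns and value estimates
value_loss = sum(A * A for A in advantages)
```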
Simulates the tutoring process and calculates rewards:

$$r = 5(a+1) + 10\left(p_{\text{new}} - p_{\text{old}}\right) + 6c - \lambda \, \sigma \;\; \text{(penalty applied when } n \geq 5\text{)}$$

where:

- $a$: Action (difficulty level selected by the agent).
- $c$: Binary correctness indicator ($1$ for correct, $0$ for incorrect).
- $p_{\text{old}}, p_{\text{new}}$: Student proficiency before and after the update.
- $n$: Number of stored difficulty levels (from `self.difficulties`).
- $\sigma$: Standard deviation of difficulties (`np.std(self.difficulties)`).
- $\lambda$: Consistency-penalty weight (a placeholder symbol; the value is set in the code).
The function incentivizes harder questions via $ 5(a+1) $, rewards improvement through $ 10(p_{\text{new}} - p_{\text{old}}) $, grants $ 6c $ for correct answers, and encourages consistency by penalizing variability in difficulties when $ n \geq 5 $.
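A sketch of this reward function. The weight on the $\sigma$ penalty is an assumption (set to 1.0 here), since the text fixes only the other coefficients:

```python
import numpy as np

def reward(a, c, p_old, p_new, difficulties, sigma_weight=1.0):
    """Tutoring reward sketch; sigma_weight is an assumed penalty weight."""
    r = 5 * (a + 1)             # incentivize harder questions
    r += 10 * (p_new - p_old)   # reward proficiency improvement
    r += 6 * c                  # bonus for a correct answer
    if len(difficulties) >= 5:  # consistency penalty once n >= 5
        r -= sigma_weight * np.std(difficulties)
    return r

# Only 3 stored difficulties, so no variability penalty yet:
r = reward(a=2, c=1, p_old=0.50, p_new=0.55, difficulties=[2, 2, 3])
# 5*3 + 10*0.05 + 6*1 = 21.5

# With 5 varied difficulties the std-dev penalty kicks in:
r_penalized = reward(a=2, c=1, p_old=0.50, p_new=0.55,
                     difficulties=[0, 1, 2, 3, 4])
```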
The core update rule for policy parameters:

$$\theta \leftarrow \theta + \eta \, \nabla_\theta \log \pi_\theta(a_t \mid s_t) \, A_t$$

where $\eta$ is the actor's learning rate and $A_t$ is the advantage.

The critic's temporal difference update:

$$\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t), \qquad w \leftarrow w + \eta_c \, \delta_t \, \nabla_w V_w(s_t)$$

Encourages exploration through an entropy bonus added to the loss:

$$H(\pi) = -\sum_a \pi(a \mid s) \log \pi(a \mid s)$$
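As a sketch, the entropy term (scaled by `entropy_coeff`) is large for spread-out policies and small for peaked ones, so subtracting it from the loss rewards exploration:

```python
import math

def entropy(probs):
    """H(pi) = -sum_a pi(a|s) * log pi(a|s)"""
    return -sum(p * math.log(p) for p in probs if p > 0)

uniform = [0.2] * 5                          # maximally exploratory policy
peaked  = [0.96, 0.01, 0.01, 0.01, 0.01]     # nearly deterministic policy

entropy_coeff = 0.01
bonus_uniform = entropy_coeff * entropy(uniform)  # largest possible bonus
bonus_peaked  = entropy_coeff * entropy(peaked)   # much smaller bonus
```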
Hyperparameters (configurable) and Their Roles

| Parameter | Value | Purpose |
|---|---|---|
| $\gamma$ | 0.99 | Discount factor for future rewards |
| $\alpha$ | 0.1 | Student model's learning rate |
| `entropy_coeff` | 0.01 | Strength of entropy regularization |
| $c$ | 0.2 | Probability boost magnitude for correct answers |
| $w$ | 0.05 | Probability reduction for incorrect answers |
Action Selection Process

```python
def get_action(self, state):
    probs, value = self.model(state)     # Get network outputs
    probs = 0.9 * probs + 0.1 * uniform  # Add exploration noise (uniform over actions)
    dist = Categorical(probs)            # Create distribution
    action = dist.sample()               # Sample action
    return action, dist.log_prob(action), value, dist.entropy()
```

Key performance indicators:
- Final student proficiencies per question type: $\mathbf{p} = [p_1, p_2, p_3, p_4, p_5]$
- Average total reward across trials: $\bar{R} = \frac{1}{N}\sum_{i=1}^N \left( \sum_{t=0}^T r_t^{(i)} \right)$
- Action distribution statistics: $\text{std}(\pi)$, $\text{entropy}(\pi)$
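From logged trial data, these indicators reduce to a few lines. The reward totals below are illustrative numbers; the action list reuses the 20-question recommendation example from earlier:

```python
import numpy as np

# Total reward per trial, N = 4 trials (illustrative numbers)
trial_totals = np.array([310.0, 295.0, 305.0, 290.0])

mean_reward = trial_totals.mean()   # R-bar: average total reward across trials
reward_std  = trial_totals.std()    # spread across trials: stability indicator

# Empirical action distribution over a 20-question episode
actions = [4, 3, 1, 1, 4, 4, 3, 4, 0, 3, 3, 0, 4, 3, 1, 4, 1, 1, 4, 4]
counts = np.bincount(actions, minlength=5)
pi = counts / counts.sum()
action_entropy = -(pi[pi > 0] * np.log(pi[pi > 0])).sum()
```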
The system combines:
- Student modeling: Spaced repetition principles
- Deep RL: Actor-Critic with stabilization techniques
- Exploration strategies: Entropy regularization + noise injection
- Curriculum learning: Difficulty-based rewards
The average total reward is a metric that gives insight into agent effectiveness and training progress; its standard deviation across trials indicates system stability.
- Agent receives state containing student's current knowledge
- Actor network selects question type (action)
- Environment provides question and evaluates answer
- Student model updates knowledge probabilities
- New state and reward are calculated
- Episode repeats for 20 questions
- Agent updates network using all episode experiences
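The loop above can be sketched as a single self-contained episode. The policy, student update, and reward here are simplified stand-ins for the real components, kept only to show the flow of state, action, answer, update, and reward:

```python
import random

random.seed(0)
N_LEVELS, QUESTIONS_PER_EPISODE = 5, 20

def run_episode(policy, student_p):
    """One tutoring episode: recommend, answer, update, collect experience."""
    experiences = []
    for _ in range(QUESTIONS_PER_EPISODE):
        state = list(student_p)                        # 1. state = current knowledge
        action = policy(state)                         # 2. actor picks a difficulty
        correct = random.random() < student_p[action]  # 3. simulate the answer
        old_p = student_p[action]
        # 4. toy knowledge update (stand-in for the real student model)
        student_p[action] += 0.02 if correct else -0.01
        student_p[action] = min(max(student_p[action], 0.01), 0.99)
        # 5. reward: harder questions + improvement + correctness
        r = 5 * (action + 1) + 10 * (student_p[action] - old_p) + 6 * correct
        experiences.append((state, action, r))         # 6-7. stored for the update
    return experiences

random_policy = lambda s: random.randrange(N_LEVELS)   # placeholder for the actor
episode = run_episode(random_policy, [0.5] * N_LEVELS)
```

After the episode, a real agent would compute returns and advantages from `experiences` and apply the actor and critic updates described earlier.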