Tune the reward of the TD_LVFA algorithm

# Goal

The goal of this issue is to improve the performance of the [TD-LVFA agent](https://github.com/Mowibox/Checkers-RL/blob/main/TDLambda_LVFA.py) by tuning its reward function.
This involves adjusting the weight coefficients $w_i$ so that the reward better reflects the strategic depth of checkers, and using the reward function as a lens to discover which aspects of play matter strategically.

## Current Reward Function

As defined in the [documentation](https://github.com/Mowibox/Checkers-RL/wiki/Documentation#rewards), the reward is computed as follows:

```math
R = W + w_0 p + w_1 t + w_2 c_m + w_3 d + w_4 b + w_5 c_c + w_6 c_{kc}
```
where:

Symbol | Meaning | Description
-- | -- | --
$W$ | Win/Loss/Draw reward | +250 for win, -250 for loss, 0 for draw
$p$ | Pawn advantage | Difference in pawn count
$t$ | Threatened pawns | Pawns threatened by the opponent
$c_m$ | Captures available | Number of captures that can be made
$d$ | Diagonal pairs | Number of diagonally aligned pawns
$b$ | Backrow bridge control | Whether the backrow is controlled
$c_c$ | Central control (pawns) | Pawns controlling central tiles
$c_{kc}$ | Central control (kings) | Kings controlling central tiles

The weights $w_i$ can be adjusted to change how much each component contributes to the reward, according to its strategic importance. Note that some components occur more frequently than others and therefore have a greater cumulative impact for the same weight.

_The reward is based on the feature representation defined in the paper by Neto, H.C., Julia, R.M.S., Caexeta, G.S. et al. [[1]](https://github.com/Mowibox/Checkers-RL?tab=readme-ov-file#references)._
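
As a reading aid, the formula above can be sketched as a plain weighted sum. This is a minimal illustration, not the repository's implementation: the `Features` container, the `reward` function, and the weight values are all hypothetical placeholders chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class Features:
    """Hypothetical container for the feature values listed in the table."""
    p: float = 0.0     # pawn advantage
    t: float = 0.0     # threatened pawns
    c_m: float = 0.0   # captures available
    d: float = 0.0     # diagonal pairs
    b: float = 0.0     # backrow bridge control
    c_c: float = 0.0   # central control (pawns)
    c_kc: float = 0.0  # central control (kings)


def reward(features, weights, terminal=0.0):
    """Return R = W + sum_i w_i * f_i, with W passed as `terminal`."""
    values = (features.p, features.t, features.c_m, features.d,
              features.b, features.c_c, features.c_kc)
    if len(weights) != len(values):
        raise ValueError("expected one weight per feature (w_0 .. w_6)")
    return terminal + sum(w * v for w, v in zip(weights, values))


# Placeholder weights w_0 .. w_6 (assumed values, not the project's defaults).
example_weights = [1.0, -0.5, 0.5, 0.2, 0.3, 0.1, 0.2]
example_state = Features(p=2, t=1, c_m=1)

intermediate = reward(example_state, example_weights)
winning = reward(example_state, example_weights, terminal=250.0)
```

Writing the reward this way makes it easy to see that a feature with a small weight can still dominate if it is nonzero on most moves, which is why the weights should be tuned together rather than in isolation.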

## Where to Modify in Code

* Intermediate rewards are computed in the [`compute_intermediaite_rewards`](https://github.com/Mowibox/Checkers-RL/blob/main/CheckersRL.py#L247-L307) function.
* Terminal rewards (win/loss/draw) are handled in the [`step`](https://github.com/Mowibox/Checkers-RL/blob/main/CheckersRL.py#L239-L244) function.

You can experiment by modifying the weights or changing the form of the reward to optimize the agent's performance.
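
One possible way to organize such experiments is to vary a single weight at a time and keep the others fixed, so that any change in performance can be attributed to one component. The helper below is a hypothetical sketch of that idea; the function name, the scale factors, and the baseline weights are assumptions and do not come from the repository.

```python
def one_at_a_time_variants(base_weights, scales=(0.5, 2.0)):
    """Build candidate weight vectors that differ from the baseline in one entry.

    Returns a list of (index, scale, weights) tuples, where `weights` is a copy
    of `base_weights` with the entry at `index` multiplied by `scale`.
    """
    variants = []
    for index in range(len(base_weights)):
        for scale in scales:
            candidate = list(base_weights)  # copy so the baseline is untouched
            candidate[index] = base_weights[index] * scale
            variants.append((index, scale, candidate))
    return variants


# Placeholder baseline for w_0 .. w_6 (assumed values).
baseline = [1.0, -0.5, 0.5, 0.2, 0.3, 0.1, 0.2]
candidates = one_at_a_time_variants(baseline)
```

Each candidate would then be used to train an agent and be evaluated separately. This only probes one coefficient at a time, so interactions between components would need a broader search.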

## Reward shaping evaluation

You can compare performance using the [`benchmark.ipynb`](https://github.com/Mowibox/Checkers-RL/blob/main/benchmark.ipynb) notebook.

This allows you to benchmark the tuned TD(λ) agent against:
* a random agent,
* a TD(λ) agent,
* or an MCTS agent.
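
When comparing two reward configurations against the opponents listed above, a raw win rate over a limited number of games can be misleading. The sketch below summarizes a list of game outcomes with a win rate and a 95% Wilson score interval. It assumes outcomes are available as the strings `"win"`, `"loss"`, and `"draw"`; the notebook's actual output format may differ, and `win_rate_with_interval` is a hypothetical helper.

```python
import math


def win_rate_with_interval(outcomes, z=1.96):
    """Return (win_rate, lower, upper) using the Wilson score interval.

    `outcomes` is a sequence of "win", "loss", or "draw" strings (assumed
    format). Draws and losses both count as non-wins.
    """
    n = len(outcomes)
    if n == 0:
        raise ValueError("at least one game outcome is required")
    wins = sum(1 for outcome in outcomes if outcome == "win")
    p_hat = wins / n
    denom = 1.0 + z * z / n
    center = (p_hat + z * z / (2.0 * n)) / denom
    half_width = z * math.sqrt(p_hat * (1.0 - p_hat) / n
                               + z * z / (4.0 * n * n)) / denom
    return p_hat, center - half_width, center + half_width


# Made-up outcomes for illustration only: 60 wins, 30 losses, 10 draws.
games = ["win"] * 60 + ["loss"] * 30 + ["draw"] * 10
rate, low, high = win_rate_with_interval(games)
```

If the intervals of two configurations overlap heavily, more games are probably needed before concluding that one set of weights is better.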
