Tune the reward of the TD_LVFA algorithm #2

@Mowibox

Goal

The goal of this issue is to improve the performance of the TD-LVFA agent by tuning its reward function.
This involves adjusting the weight coefficients $w_i$ so that the reward better reflects the strategic depth of checkers, and using the reward function as a lens to discover which aspects of play matter strategically.

Current Reward Function

As defined in the documentation, the reward is computed as follows:

$$R = W + w_0 p + w_1 t + w_2 c_m + w_3 d + w_4 b + w_5 c_c + w_6 c_{kc}$$

where:

| Symbol | Meaning | Description |
| --- | --- | --- |
| $W$ | Win/Loss/Draw reward | +250 for win, -250 for loss, 0 for draw |
| $p$ | Pawn advantage | Difference in pawn count |
| $t$ | Threatened pawns | Pawns threatened by the opponent |
| $c_m$ | Captures available | Number of captures that can be made |
| $d$ | Diagonal pairs | Number of diagonally aligned pawns |
| $b$ | Backrow bridge control | Whether the backrow is controlled |
| $c_c$ | Central control (pawns) | Pawns controlling central tiles |
| $c_{kc}$ | Central control (kings) | Kings controlling central tiles |

The weights $w_i$ can be adjusted to change how much each strategic feature contributes to the reward. Note that some components will have a greater overall impact than others simply because they occur more frequently during a game.

The reward is based on the feature representation defined in the paper by Neto, H.C., Julia, R.M.S., Caexeta, G.S. et al. [1].
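To make the formula concrete, here is a minimal sketch of how the weighted sum could be evaluated. The names `RewardFeatures`, `TERMINAL_REWARD` and `shaped_reward` are hypothetical and are not taken from the repository, and the example weights are placeholders rather than the project's current values. The sketch also assumes that $W$ is 0 while the game is still in progress, which the issue does not state explicitly.

```python
from dataclasses import dataclass, astuple

# Hypothetical container: field names mirror the symbols in the table,
# not the identifiers used in the project's code.
@dataclass(frozen=True)
class RewardFeatures:
    p: float     # pawn advantage
    t: float     # threatened pawns
    c_m: float   # captures available
    d: float     # diagonal pairs
    b: float     # backrow bridge control (0 or 1)
    c_c: float   # central control (pawns)
    c_kc: float  # central control (kings)

# Terminal term W, as given in the table.
TERMINAL_REWARD = {"win": 250.0, "loss": -250.0, "draw": 0.0}

def shaped_reward(features, weights, outcome=None):
    """Compute R = W + sum_i w_i * f_i.

    Assumption: W is 0 while the game is ongoing (outcome is None).
    """
    values = astuple(features)  # (p, t, c_m, d, b, c_c, c_kc), in field order
    if len(weights) != len(values):
        raise ValueError(f"expected {len(values)} weights, got {len(weights)}")
    terminal = TERMINAL_REWARD[outcome] if outcome is not None else 0.0
    return terminal + sum(w * f for w, f in zip(weights, values))

# Placeholder weights (w_0 .. w_6), chosen only for illustration.
example_weights = (1.0, -0.5, 0.5, 0.25, 0.5, 0.25, 0.5)
example_features = RewardFeatures(p=2, t=1, c_m=1, d=4, b=1, c_c=2, c_kc=0)
mid_game_reward = shaped_reward(example_features, example_weights)
winning_reward = shaped_reward(example_features, example_weights, outcome="win")
```

Keeping the weights in a single tuple makes it easy to swap in a different weight vector without touching the feature extraction.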

Where to Modify in Code

You can experiment by modifying the weights or changing the form of the reward to optimize the agent's performance.
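One possible way to organise such experiments is a simple random search around the current weights. The sketch below is only an illustration: `sample_weights` and `random_search` are hypothetical helpers, and `evaluate` stands in for whatever procedure you use to train an agent with a given weight vector and score it (for example a win rate). Because the perturbation is multiplicative, each weight keeps its sign and a weight of 0 stays at 0.

```python
import random

def sample_weights(base, scale, rng):
    """Scale each weight by a random factor in [1 - scale, 1 + scale]."""
    return tuple(w * rng.uniform(1.0 - scale, 1.0 + scale) for w in base)

def random_search(base, evaluate, n_trials=20, scale=0.5, seed=0):
    """Return the best (weights, score) pair found, starting from `base`.

    `evaluate` is a user-supplied callable (hypothetical here) that maps a
    weight tuple to a score where higher is better.
    """
    rng = random.Random(seed)  # seeded so a run can be reproduced
    best_weights = tuple(base)
    best_score = evaluate(best_weights)
    for _ in range(n_trials):
        candidate = sample_weights(base, scale, rng)
        score = evaluate(candidate)
        if score > best_score:
            best_weights, best_score = candidate, score
    return best_weights, best_score
```

Since every evaluation implies a full training and benchmarking run, `n_trials` will usually have to stay small. Changing one weight at a time is a cheaper alternative when the aim is to understand what each feature contributes.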

Reward Shaping Evaluation

You can compare performance using the benchmark.ipynb notebook.

This allows you to benchmark the tuned TD(λ) agent against:

  • a random agent,
  • a TD(λ) agent,
  • or an MCTS agent.
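When comparing two weight settings, it helps to report the win rate together with an uncertainty estimate, since a few dozen games can be quite noisy. The following sketch is independent of the notebook's actual interface: it assumes you have collected one outcome label per game (`"win"`, `"loss"` or `"draw"`, from the tuned agent's point of view), and `summarize_outcomes` is a hypothetical helper. It uses the Wilson score interval for the win proportion.

```python
import math
from collections import Counter

def summarize_outcomes(outcomes, z=1.96):
    """Summarise game results with a Wilson score interval on the win rate.

    `outcomes` is a sequence of "win" / "loss" / "draw" labels (assumed format).
    z=1.96 corresponds to an approximate 95% interval.
    """
    n = len(outcomes)
    if n == 0:
        raise ValueError("no games to summarise")
    counts = Counter(outcomes)
    p = counts["win"] / n
    denom = 1.0 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return {
        "games": n,
        "wins": counts["win"],
        "losses": counts["loss"],
        "draws": counts["draw"],
        "win_rate": p,
        "ci_low": max(0.0, centre - half_width),
        "ci_high": min(1.0, centre + half_width),
    }
```

If the intervals of two reward configurations overlap heavily, playing more games is usually worthwhile before concluding that one set of weights is better.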
