# Understanding the Reinforcement Learning Loop

The following diagram illustrates the key mechanisms underlying the learning process in model-free reinforcement learning algorithms.
It shows how the agent interacts with the environment, collects experiences, and periodically updates its policy based on those experiences.

<div id="carousel" style="width: 100%; text-align:center; margin-bottom: 16px; border: 1px solid #ddd;">
  <img id="carousel-image" src="../_static/images/agent-env-step1.png" style="width: 100%; border-radius:8px;" alt="RL Loop">
  <div style="margin-bottom: 10px;">
    <button onclick="prevImage()" style="padding: 0 10px; margin-right: 10px;">⇦ Prev</button>
    <button onclick="nextImage()" style="padding: 0 10px;">Next ⇨</button>
  </div>
  <div id="caption" style="margin: 10px; text-align: left;"></div>
</div>

<script>
  const images = [
    {src: '../_static/images/agent-env-step1.png', caption: '<b>Step 1</b>: The agent receives the observable state from the environment.'},
    {src: '../_static/images/agent-env-step2.png', caption: '<b>Step 2</b>: The agent uses its policy to select an action, passing it to the environment.'},
    {src: '../_static/images/agent-env-step3.png', caption: '<b>Step 3</b>: The execution of the action results in a new state and produces a reward. The environment is specifically designed to provide feedback to the agent, returning (high) positive rewards for desirable states and low/negative rewards for undesirable states; rewards may be sparse, i.e. rewards for intermediate states may be zero. The agent records the action taken, the state transition and the reward received in its database of experiences (replay buffer). The agent repeats the process several times in order to fill the replay buffer with new transitions.'},
    {src: '../_static/images/agent-env-step4.png', caption: '<b>Step 4</b>: Periodically, after having collected enough experiences, the agent uses the experience data to update its policy, i.e. the way it selects actions. The learning algorithm defines the corresponding update mechanism.'},
  ];
  let index = 0;

  function updateImage() {
    document.getElementById('carousel-image').src = images[index].src;
    document.getElementById('caption').innerHTML = images[index].caption;
  }

  function nextImage() {
    index = (index + 1) % images.length;
    updateImage();
  }

  function prevImage() {
    index = (index - 1 + images.length) % images.length;
    updateImage();
  }

  updateImage();
</script>

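In code, the interaction loop shown above can be sketched with the Gymnasium API alone. The following is a minimal illustration that uses random action selection in place of a learned policy; it is not how Tianshou implements the loop, but it maps directly onto the four steps of the diagram.

```python
import gymnasium as gym

env = gym.make("CartPole-v1")
replay_buffer = []  # will hold (obs, action, reward, next_obs, done) transitions

obs, info = env.reset()
for step in range(1000):
    # Steps 1-2: observe the current state and let the policy select an action
    # (here: random actions, purely for illustration)
    action = env.action_space.sample()

    # Step 3: the environment returns the next state and a reward;
    # the transition is recorded in the replay buffer
    next_obs, reward, terminated, truncated, info = env.step(action)
    replay_buffer.append((obs, action, reward, next_obs, terminated or truncated))

    obs = next_obs
    if terminated or truncated:
        obs, info = env.reset()

    # Step 4 (omitted): periodically, the recorded experiences would be used to
    # update the policy; the update rule is what the learning algorithm defines
```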

Accordingly, the key entities involved in the learning process are:
 * The **environment**: This is the system the agent interacts with.
   It provides the agent with observable states and rewards based on the actions taken by the agent.
 * The agent's **policy**: This is the strategy used by the agent to decide which action to take in a given state.
   The policy can be deterministic or stochastic and is typically represented by a neural network in deep reinforcement learning
   (a minimal sketch of such a policy follows this list).
 * The **replay buffer**: This is a data structure used to store the agent's experiences, which consist of state transitions,
   actions taken, and rewards received.
   The agent learns from past experience by sampling mini-batches from the buffer during the policy update phase.
 * The **learning algorithm**: This defines how the agent updates its policy based on the experiences stored in the replay buffer.
   Different algorithms have different update mechanisms, which can significantly affect learning performance.
   In some cases, the algorithm may also involve additional components (typically further neural networks), such as target networks or
   value functions.
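
As a concrete illustration, here is a minimal stochastic policy represented by a small neural network, sketched in plain PyTorch. The two-layer architecture and the CartPole-like dimensions (4-dimensional state, 2 actions) are arbitrary choices for illustration; Tianshou's policy classes are considerably more general.

```python
import torch
from torch import nn


class StochasticPolicy(nn.Module):
    """A small MLP that maps a state to a categorical distribution over actions."""

    def __init__(self, state_dim: int, num_actions: int, hidden_dim: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, num_actions),
        )

    def forward(self, state: torch.Tensor) -> torch.distributions.Categorical:
        logits = self.net(state)  # unnormalized action preferences
        return torch.distributions.Categorical(logits=logits)


policy = StochasticPolicy(state_dim=4, num_actions=2)
dist = policy(torch.zeros(4))   # distribution over actions for a given state
action = dist.sample()          # stochastic action selection
```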

These entities have direct correspondences in Tianshou's codebase, as illustrated by the sketch following this list:
 * The environment is represented by an instance of a class that inherits from `gymnasium.Env`, which is a standard interface for
   reinforcement learning environments.
   In practice, environments are typically vectorized to enable parallel interactions, increasing efficiency.
 * The policy is encapsulated in the `Policy` class, which provides methods for action selection.
 * The replay buffer is implemented in the `ReplayBuffer` class.
   A `Collector` instance is used to manage the addition of new experiences to the replay buffer as the agent interacts with the
   environment.
   During the learning phase, the replay buffer may be sampled, providing an instance of `Batch` for the policy update.
 * The abstraction for learning algorithms is given by the `Algorithm` class, which defines how to update the policy using data from the
   replay buffer.
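
To make the correspondence concrete, the sketch below creates a vectorized environment and builds an experience mini-batch by hand. It assumes a recent Tianshou version in which `DummyVectorEnv` lives in `tianshou.env`, `Batch` lives in `tianshou.data`, and the vectorized environments follow the Gymnasium reset/step API; the interplay of `Collector`, `ReplayBuffer`, and `Algorithm` is only described in comments, since constructing a policy depends on the chosen algorithm and Tianshou version.

```python
import gymnasium as gym
import numpy as np
from tianshou.data import Batch
from tianshou.env import DummyVectorEnv

# The environment: any gymnasium.Env, wrapped in a vectorized environment so
# that several instances can be stepped in parallel
envs = DummyVectorEnv([lambda: gym.make("CartPole-v1") for _ in range(4)])
obs, info = envs.reset()  # Gymnasium-style API in recent Tianshou versions
print(obs.shape)  # (4, 4): one 4-dimensional observation per parallel environment

# Sampling the replay buffer during an update step yields a Batch, Tianshou's
# container for experience data; here we build one by hand for illustration
minibatch = Batch(
    obs=np.zeros((64, 4)),        # 64 observations
    act=np.zeros(64, dtype=int),  # the actions taken in those states
    rew=np.ones(64),              # the rewards received
    obs_next=np.zeros((64, 4)),   # the successor states
)
print(len(minibatch))  # 64

# In a full setup, a Collector drives the policy in `envs` and writes the
# resulting transitions into a ReplayBuffer, from which the Algorithm samples
# mini-batches like the one above during its update step.
```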

## The Training Process

The learning process itself is reified in Tianshou's `Trainer` class, which orchestrates the interaction between the agent and the
environment, manages the replay buffer, and coordinates the policy updates according to the specified learning algorithm.

In general, the process can be described as executing a number of epochs as follows (a schematic code skeleton follows the outline):

* **Epoch**:
  * Repeat until a sufficient number of steps is reached (for online learning, typically an environment step count)
    * **Training Step**:
      * For online learning algorithms …
        * **Collection Step**: collect state transitions in the environment by running the agent
        * (Optionally) conduct a test step if collected data indicates promising behaviour
      * **Update Step**: apply gradient updates using the algorithm’s update logic.
        The update is based on …
        * data from the preceding collection step only (on-policy learning)
        * data from the collection step and previous data (off-policy learning)
        * data from a user-provided replay buffer (offline learning)
  * **Test Step**:
    * Collect test episodes from dedicated test environments and evaluate agent performance
    * (Optionally) stop training early if performance is sufficiently high

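Schematically, this corresponds to the following plain-Python skeleton. It is purely illustrative: the callables stand in for the components described above, and Tianshou's `Trainer` implements this logic (with many more details) for you.

```python
from typing import Callable


def train(
    num_epochs: int,
    steps_per_epoch: int,
    collection_step: Callable[[], int],    # collects transitions, returns the number of env steps taken
    update_step: Callable[[], None],       # applies gradient updates based on collected data
    test_step: Callable[[], float],        # evaluates the agent, returning e.g. the mean test return
    stop_threshold: float = float("inf"),  # optional early-stopping criterion
) -> None:
    """Illustrative skeleton of the training process for online learning."""
    for epoch in range(num_epochs):
        steps_in_epoch = 0
        while steps_in_epoch < steps_per_epoch:
            # Training step = collection step followed by an update step
            steps_in_epoch += collection_step()
            update_step()

        # Test step: evaluate the agent on dedicated test environments
        mean_return = test_step()
        if mean_return >= stop_threshold:
            break  # performance is sufficiently high; stop training early
```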

```{admonition} Glossary
:class: note
The outline above introduces key terms that are used throughout Tianshou, in particular *epoch*, *training step*, *collection step*, *update step*, and *test step*.
```

Note that the above description encompasses several modes of model-free reinforcement learning, including:
 * online learning (where the agent continuously interacts with the environment in order to collect new experiences)
 * on-policy learning (where the policy is updated based on data collected using the current policy only)
 * off-policy learning (where the policy is updated based on data collected using the current and previous policies)
 * offline learning (where the replay buffer is pre-filled and not updated during training)

In Tianshou, the `Trainer` and `Algorithm` classes are specialised to handle these different modes accordingly.
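
The practical difference between the on-policy, off-policy, and offline modes lies mainly in which data reaches the update step. The following sketch makes this explicit in plain Python; it is illustrative only and does not reflect Tianshou's internal implementation.

```python
import random


def select_update_data(mode: str, buffer: list, latest_transitions: list) -> list:
    """Return the data an update step is based on, depending on the learning mode."""
    if mode == "on-policy":
        # Only data collected with the current policy is used; it is discarded afterwards
        data = list(latest_transitions)
        buffer.clear()
    elif mode == "off-policy":
        # Mini-batches are sampled from data collected with current and previous policies
        data = random.sample(buffer, min(len(buffer), 64))
    elif mode == "offline":
        # The buffer was pre-filled by the user and is never extended during training
        data = random.sample(buffer, min(len(buffer), 64))
    else:
        raise ValueError(f"unknown mode: {mode}")
    return data
```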