Commit b07d1df

Minor typos and tweaks.
1 parent 9fc7e68 commit b07d1df

File tree

3 files changed: +8 -8 lines changed


docs/Cheatsheets/reinforcement_learning_terms.md

Lines changed: 3 additions & 3 deletions
@@ -30,7 +30,7 @@ We refer to a single interaction between the agent and the environment as a *tim
 
 We are concerned with two types of sequences:
 - An *episode*: A complete sequence of timesteps starting from an initial state $s_0$ and ending with a terminal state $s_T$
-- A *trajectory*: Any sequence of timesteps from some arbitrary state $s_t$ to $s_{t+n}$
+- A *trajectory*: Any sequence of timesteps from some state $s_t$ to another state $s_{t+n}$
 If all sequences of actions from any $s_t$ are guaranteed to eventually reach some terminal state $s_T$, we refer to this as a *finite-horizon problem*. If instead we allow the trajectory to continue indefinitely, we refer to this as an *infinite-horizon problem*.
 
 A *return* $G$ is the cumulative reward obtained over a trajectory. In the finite-horizon case, the return is can be written simply as
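For readers skimming the diff, the return described in the context lines above is just a (possibly discounted) sum of rewards over a trajectory. The short Python sketch below is an illustrative aside, not part of the commit, and its reward list and discount factor are arbitrary choices.

# Illustrative sketch, not part of the commit: the finite-horizon return
# G_t = r_t + gamma*r_{t+1} + ... computed by folding the recursion
# G_t = r_t + gamma * G_{t+1} backwards over an episode's rewards.
def discounted_return(rewards, gamma=0.99):
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

# With gamma = 1 this reduces to the plain finite-horizon sum of rewards.
print(discounted_return([1.0, 0.0, 0.0, 5.0], gamma=0.9))  # 1 + 0.9**3 * 5 = 4.645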
@@ -99,9 +99,9 @@ V(s_t) &= r_t + \gamma V(s_{t+1}) \\
 &\vdots
 \end{aligned}
 $$
-These equalities are important because they show us that there are as many ways to write $V(s_t)$ as there are timesteps in a trajectory. We care about that because, in practice, we don't know the actual value of $V(s_t)$ for any state. Instead, we collect one trajectory at a time, and consider the return we calculate from each timestep as a *sample* from the return distribution at that state. We then train our critic $v(s)$ to predict the return we calculate for each state. This works because when we encounter the same state more than once we'll get a different return for it, so the critic will learn to predict the average return at that state. If we do this enough times, the critic will learn to predict the true value function.
+These equalities are important because they show us that there are as many ways to write $V(s_t)$ as there are timesteps in a trajectory. We care about that because, in practice, we don't know the actual value of $V(s_t)$ for any state. Instead, we collect one trajectory at a time, and consider the return we calculate from each timestep as a *sample* from the return distribution at that state. We then train our critic $v(s)$ to predict the return we calculate for each state. This works because when we encounter the same state more than once we'll get a different return each time, so the critic will learn to predict the average return at that state. If we do this enough times, the critic will learn to predict the true value function.
 
-However, when training the critic, one might look at the above equivalent ways of writing $V(s_t)$ and wonder, "which of these equations should I train the critic to predict?" To answer that question we will first rewrite the above equations by denoting each form of $V(s_t)$ as $V^{n}_t$, and we will introduce our critic to the calculation by replacing $V(s)$ with $v(s)$:
+However, when training the critic, one might look at the above equivalent ways of writing $V(s_t)$ and wonder, "which of these quantities should I train the critic to predict?" To answer that question we will first rewrite the above equations by denoting each form of $V(s_t)$ as $V^{n}_t$, and we will introduce our critic to the calculation by replacing $V(s)$ with $v(s)$:
 $$
 \begin{aligned}
 V^{(1)}_t &= r_t + \gamma v(s_{t+1}) \\
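The changed paragraphs above ask which bootstrapped target the critic should be regressed on. As a minimal sketch of the $V^{(n)}_t$ targets the docs go on to define (assuming the trajectory's rewards and states are plain Python lists and the critic $v$ is any callable; these names are illustrative, not from the RLGym codebase):

# Illustrative sketch, not part of the commit: the n-step target
# V^(n)_t = r_t + gamma*r_{t+1} + ... + gamma**(n-1)*r_{t+n-1} + gamma**n * v(s_{t+n}).
def n_step_target(rewards, states, t, n, v, gamma=0.99):
    target = sum((gamma ** k) * rewards[t + k] for k in range(n))
    target += (gamma ** n) * v(states[t + n])  # bootstrap from the critic at s_{t+n}
    return target

# n = 1 gives the one-step target r_t + gamma * v(s_{t+1}); larger n leans more on
# the sampled rewards and less on the critic's current (possibly biased) estimate.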

docs/Getting Started/quickstart.md

Lines changed: 4 additions & 4 deletions
@@ -75,8 +75,8 @@ def build_rlgym_v2_env():
 if __name__ == "__main__":
     from rlgym_ppo import Learner
 
-    # 32 processes
-    n_proc = 32
+    # 8 processes
+    n_proc = 8
 
     # educated guess - could be slightly higher or lower
     min_inference_size = max(1, int(round(n_proc * 0.9)))
@@ -86,8 +86,8 @@ if __name__ == "__main__":
 min_inference_size=min_inference_size,
 metrics_logger=None,
 ppo_batch_size=50000, # batch size - set this number to as large as your GPU can handle
-policy_layer_sizes=[2048, 2048, 1024, 1024], # policy network
-critic_layer_sizes=[2048, 2048, 1024, 1024], # value network
+policy_layer_sizes=[512, 512], # policy network
+critic_layer_sizes=[512, 512], # value network
 ts_per_iteration=50000, # timesteps per training iteration - set this equal to the batch size
 exp_buffer_size=150000, # size of experience buffer - keep this 2 - 3x the batch size
 ppo_minibatch_size=50000, # minibatch size - set this less than or equal to the batch size
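For context, the changed keyword arguments live inside the quickstart's Learner(...) call. A rough re-assembly of that call with the post-commit values is sketched below; arguments the diff does not show are omitted, and the pieces outside the diff (passing build_rlgym_v2_env as the first argument, the n_proc keyword, and learner.learn()) follow the rlgym_ppo quickstart rather than this diff, so check them against the current file.

# Approximate reconstruction, not the exact file contents: only values visible
# in this diff are filled in; everything else is omitted or assumed.
from rlgym_ppo import Learner

n_proc = 8
min_inference_size = max(1, int(round(n_proc * 0.9)))

learner = Learner(build_rlgym_v2_env,              # env factory defined earlier in quickstart.md
                  n_proc=n_proc,
                  min_inference_size=min_inference_size,
                  metrics_logger=None,
                  ppo_batch_size=50000,
                  policy_layer_sizes=[512, 512],   # smaller networks after this commit
                  critic_layer_sizes=[512, 512],
                  ts_per_iteration=50000,          # equal to the batch size
                  exp_buffer_size=150000,          # 2-3x the batch size
                  ppo_minibatch_size=50000)
learner.learn()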

docs/Rocket League/Configuration Objects/observation_builders.md

Lines changed: 1 addition & 1 deletion
@@ -14,7 +14,7 @@ To make your own observation builder, you'll need three methods:
 def get_obs_space(self, agent: AgentID) -> ObsSpaceType:
 
 # Called every time `TransitionEngine.create_base_state()` is called.
-def reset(self, initial_state: StateType, shared_info: Dict[str, Any]) -> None:
+def reset(self, agents: List[AgentID], initial_state: StateType, shared_info: Dict[str, Any]) -> None:
 
 # Called every time `TransitionEngine.step()` or `TransitionEngine.create_base_state()` is called.
 def build_obs(self, agents: List[AgentID], state: StateType, shared_info: Dict[str, Any]) -> Dict[AgentID, ObsType]:
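Since reset() now receives the list of agents, custom observation builders need to match the new signature. Below is a minimal sketch of a builder implementing the three methods; the rlgym.api import path, the obs-space tuple, and the zero-vector observation are assumptions for illustration, not something this diff specifies.

# Minimal sketch with an assumed import path and a placeholder observation;
# adjust the obs space and observation contents to your own setup.
from typing import Any, Dict, List

import numpy as np
from rlgym.api import AgentID, ObsBuilder, ObsSpaceType, ObsType, StateType


class ZeroObsBuilder(ObsBuilder):
    def get_obs_space(self, agent: AgentID) -> ObsSpaceType:
        return "real", 8  # placeholder: an 8-dimensional continuous observation

    def reset(self, agents: List[AgentID], initial_state: StateType,
              shared_info: Dict[str, Any]) -> None:
        pass  # no per-episode bookkeeping in this toy example

    def build_obs(self, agents: List[AgentID], state: StateType,
                  shared_info: Dict[str, Any]) -> Dict[AgentID, ObsType]:
        # One observation per requested agent, as the updated signature expects.
        return {agent: np.zeros(8, dtype=np.float32) for agent in agents}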
