docs/Cheatsheets/reinforcement_learning_terms.md
3 additions & 3 deletions
@@ -30,7 +30,7 @@ We refer to a single interaction between the agent and the environment as a *tim
We are concerned with two types of sequences:
- An *episode*: A complete sequence of timesteps starting from an initial state $s_0$ and ending with a terminal state $s_T$
-
- A *trajectory*: Any sequence of timesteps from some arbitrary state $s_t$ to $s_{t+n}$
+
- A *trajectory*: Any sequence of timesteps from some state $s_t$ to another state $s_{t+n}$
If all sequences of actions from any $s_t$ are guaranteed to eventually reach some terminal state $s_T$, we refer to this as a *finite-horizon problem*. If instead we allow the trajectory to continue indefinitely, we refer to this as an *infinite-horizon problem*.
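To make these terms concrete, here is a minimal sketch of collecting one episode. The Gymnasium-style `env.reset()`/`env.step()` interface, the `policy` callable, and the `max_steps` cap are illustrative assumptions, not part of the cheatsheet.

```python
# Minimal sketch: roll out one episode, i.e. a trajectory that runs from the
# initial state s_0 until a terminal state s_T is reached.
def collect_episode(env, policy, max_steps=1000):
    trajectory = []                      # list of (state, action, reward) timesteps
    state, _ = env.reset()               # initial state s_0
    for _ in range(max_steps):           # cap the rollout in case the horizon is infinite
        action = policy(state)
        next_state, reward, terminated, truncated, _ = env.step(action)
        trajectory.append((state, action, reward))
        state = next_state
        if terminated or truncated:      # terminal state s_T: the finite-horizon case
            break
    return trajectory
```

Any contiguous slice of `trajectory` is a trajectory in the sense above; the full list, ending at a terminal state, is an episode.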
A *return* $G$ is the cumulative reward obtained over a trajectory. In the finite-horizon case, the return can be written simply as
-
These equalities are important because they show us that there are as many ways to write $V(s_t)$ as there are timesteps in a trajectory. We care about that because, in practice, we don't know the actual value of $V(s_t)$ for any state. Instead, we collect one trajectory at a time, and consider the return we calculate from each timestep as a *sample* from the return distribution at that state. We then train our critic $v(s)$ to predict the return we calculate for each state. This works because when we encounter the same state more than once we'll get a different return for it, so the critic will learn to predict the average return at that state. If we do this enough times, the critic will learn to predict the true value function.
+
These equalities are important because they show us that there are as many ways to write $V(s_t)$ as there are timesteps in a trajectory. We care about that because, in practice, we don't know the actual value of $V(s_t)$ for any state. Instead, we collect one trajectory at a time, and consider the return we calculate from each timestep as a *sample* from the return distribution at that state. We then train our critic $v(s)$ to predict the return we calculate for each state. This works because when we encounter the same state more than once we'll get a different return each time, so the critic will learn to predict the average return at that state. If we do this enough times, the critic will learn to predict the true value function.
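As a hedged illustration of that training signal (not code from the cheatsheet), the sketch below computes the return observed from every timestep of one trajectory and lets a simple tabular critic converge to the average sampled return per state. The discount factor `gamma` and the tabular representation are assumptions; in practice $v(s)$ is usually a function approximator regressed toward the same targets.

```python
from collections import defaultdict

def returns_from(trajectory, gamma=0.99):
    """Return-to-go G_t sampled from every timestep of a (state, action, reward) trajectory."""
    returns, g = [], 0.0
    for _, _, reward in reversed(trajectory):   # accumulate rewards back to front
        g = reward + gamma * g
        returns.append(g)
    return list(reversed(returns))

value_sum = defaultdict(float)    # running sum of sampled returns per state
value_count = defaultdict(int)    # number of return samples per state

def update_critic(trajectory, gamma=0.99):
    """Treat each computed return as one sample of V(s_t) and record it."""
    for (state, _, _), g in zip(trajectory, returns_from(trajectory, gamma)):
        value_sum[state] += g       # states must be hashable for this toy tabular critic
        value_count[state] += 1

def v(state):
    """Critic estimate: the average of the returns sampled at this state so far."""
    return value_sum[state] / value_count[state] if value_count[state] else 0.0
```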
-
However, when training the critic, one might look at the above equivalent ways of writing $V(s_t)$ and wonder, "which of these equations should I train the critic to predict?" To answer that question we will first rewrite the above equations by denoting each form of $V(s_t)$ as $V^{n}_t$, and we will introduce our critic to the calculation by replacing $V(s)$ with $v(s)$:
+
However, when training the critic, one might look at the above equivalent ways of writing $V(s_t)$ and wonder, "which of these quantities should I train the critic to predict?" To answer that question we will first rewrite the above equations by denoting each form of $V(s_t)$ as $V^{n}_t$, and we will introduce our critic to the calculation by replacing $V(s)$ with $v(s)$:
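The rewritten equations are not reproduced in this diff, but as a hedged sketch of the usual construction, each $V^{n}_t$ sums $n$ discounted reward terms along the trajectory and then bootstraps with the critic at $s_{t+n}$. The discount factor `gamma` and the function name below are illustrative assumptions.

```python
def n_step_target(rewards, states, t, n, v, gamma=0.99):
    """V^n_t = sum_{k=0}^{n-1} gamma^k * r_{t+k}  +  gamma^n * v(s_{t+n})."""
    target = sum((gamma ** k) * rewards[t + k] for k in range(n))
    target += (gamma ** n) * v(states[t + n])   # replace the true V(s) with the critic v(s)
    return target
```

With $n$ spanning the rest of the trajectory (and $v(s_T) = 0$) this recovers the full Monte Carlo return; with $n = 1$ it is the familiar one-step TD target.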