diff --git a/units/en/unit3/deep-q-algorithm.mdx b/units/en/unit3/deep-q-algorithm.mdx
index 28e7fd50..5de8597f 100644
--- a/units/en/unit3/deep-q-algorithm.mdx
+++ b/units/en/unit3/deep-q-algorithm.mdx
@@ -94,11 +94,11 @@ We face a simple problem by calculating the TD target: how are we sure that **t

 We know that the accuracy of Q-values depends on what action we tried **and** what neighboring states we explored.

-Consequently, we don’t have enough information about the best action to take at the beginning of the training. Therefore, taking the maximum Q-value (which is noisy) as the best action to take can lead to false positives. If non-optimal actions are regularly **given a higher Q value than the optimal best action, the learning will be complicated.**
+Consequently, we don’t have enough information about the best action to take at the beginning of the training. Therefore, taking the action with the maximum Q-value (which is noisy) as the best action to take can lead to false positives. If non-optimal actions are regularly **given a higher Q value than the optimal best action, the learning will be complicated.**

 The solution is: when we compute the Q target, we use two networks to decouple the action selection from the target Q-value generation. We:

-- Use our **DQN network** to select the best action to take for the next state (the action with the highest Q-value).
-- Use our **Target network** to calculate the target Q-value of taking that action at the next state.
+- Use our **DQN network** to select the best action to take at the next state (the action with the highest Q-value)
+- Use our **Target network** to calculate the target Q-value of taking that action at the next state: the TD target is then the sum of the immediate reward and the discounted Q-value that the target network estimates for this _(next state, selected action)_ pair.

 Therefore, Double DQN helps us reduce the overestimation of Q-values and, as a consequence, helps us train faster and with more stable learning.

diff --git a/units/en/unit3/glossary.mdx b/units/en/unit3/glossary.mdx
index 2e408664..94c5d9c3 100644
--- a/units/en/unit3/glossary.mdx
+++ b/units/en/unit3/glossary.mdx
@@ -27,8 +27,8 @@ In order to obtain temporal information, we need to **stack** a number of frames
 our Deep Q-Network after certain **C steps**.

 - **Double DQN:** Method to handle **overestimation** of **Q-Values**. This solution uses two networks to decouple the action selection from the target **Value generation**:
-  - **DQN Network** to select the best action to take for the next state (the action with the highest **Q-Value**)
-  - **Target Network** to calculate the target **Q-Value** of taking that action at the next state.
+  - **DQN Network** to select the best action to take at the next state (the action with the highest **Q-Value**)
+  - **Target Network** to calculate the target **Q-Value** of taking that action at the next state, i.e. of the selected _(next state, action)_ pair.
 This approach reduces the **Q-Values** overestimation, it helps to train faster and have more stable learning.

 If you want to improve the course, you can [open a Pull Request.](https://github.com/huggingface/deep-rl-class/pulls)
diff --git a/units/en/unit3/quiz.mdx b/units/en/unit3/quiz.mdx
index 13f02957..bccccf52 100644
--- a/units/en/unit3/quiz.mdx
+++ b/units/en/unit3/quiz.mdx
@@ -94,9 +94,9 @@ But, with experience replay, **we create a replay buffer that saves experience s

   When we compute the Q target, we use two networks to decouple the action selection from the target Q value generation. We:

-  - Use our *DQN network* to **select the best action to take for the next state** (the action with the highest Q value).
-  - Use our *Target network* to calculate **the target Q value of taking that action at the next state**.
+  - Use our *DQN network* to **select the best action to take at the next state** (the action with the highest Q value)
+  - Use our *Target network* to calculate **the target Q value of taking that action at the next state**: the TD target is then the sum of the immediate reward and the discounted Q-value that the target network estimates for this _(next state, selected action)_ pair.
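To make the decoupling described in these edits concrete, here is a minimal PyTorch-style sketch of the Double DQN target computation. It is not part of the course files; the names `online_net` and `target_net`, the replay-batch tensors (`states`, `actions`, `rewards`, `next_states`, `dones`), and the discount `gamma` are assumptions for illustration.

```python
# Sketch of the Double DQN TD target (assumed setup, not from the course repo):
# `online_net` and `target_net` are torch.nn.Module Q-networks mapping a batch of
# states to per-action Q-values; the replay-buffer batch tensors and `gamma` exist.
import torch
import torch.nn.functional as F

with torch.no_grad():
    # 1) Action selection: the online DQN network picks the best action
    #    for the NEXT state (argmax over its own Q-value estimates).
    next_actions = online_net(next_states).argmax(dim=1, keepdim=True)

    # 2) Action evaluation: the target network estimates the Q-value of
    #    that (next state, selected action) pair.
    next_q = target_net(next_states).gather(1, next_actions).squeeze(1)

    # 3) TD target: immediate reward + discounted target-network estimate,
    #    zeroed out for terminal transitions.
    td_target = rewards + gamma * next_q * (1.0 - dones)

# The loss compares the online network's Q(s, a) for the sampled actions
# against this fixed target.
current_q = online_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
loss = F.mse_loss(current_q, td_target)
```

Using the online network only for the next-state argmax, and the target network only for evaluating that choice, is what distinguishes Double DQN from plain DQN with a target network, where the target network performs both the max and the evaluation.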