@@ -1158,5 +1158,5 @@
-We provide five configurations for several different GPT model sizes: 126M, 5B, 20B,
-40B, and 175B parameters. These configurations include carefully selected
+We provide nine configurations for several different GPT model sizes: 126M, 400M_improved, 1B_improved, 5B, 7B_improved, 20B,
+40B, 40B_improved, and 175B parameters. These configurations include carefully selected
 hyperparameters, which should be used as a guideline for any custom model
 configurations. All these configurations are provided in the `conf/training/gpt3/`
 directory. The desired configuration can be chosen by selecting the `training`
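The passage above describes picking one of the provided configurations through the `training` key of the launcher's Hydra config. As a minimal sketch of what that selection might look like on the command line (the `main.py` entry point and the exact config file name are assumptions, not confirmed by this diff):

```bash
# Hypothetical launcher invocation: the Hydra override `training=gpt3/5b`
# would presumably select conf/training/gpt3/5b.yaml as the training config.
python3 main.py training=gpt3/5b
```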
@@ -5545,7 +5545,7 @@ The table and chart below show the performance results.
 * Tensor and Pipeline Parallelism Conversion Support for GPT and T5
 * Supervised Fine-Tuning Support for GPT
 * RLHF (Reinforcement Learning from Human Feedback) for GPT
-* New GPT model sizes - 843M, 2B, 8B, 43B based on new and improved model configurations.
+* New GPT model sizes - 400M_improved, 1B_improved, 7B_improved, 40B_improved based on new and improved model configurations
 * List of GPT model configuration changes
 
 | Configuration | Previous | New |
@@ -5557,9 +5557,6 @@ The table and chart below show the performance results.
 | Bias terms | Yes | No |
 | Normalization | LayerNorm | LayerNorm1p |
 
-* Added the option to use RMSNorm normalization with GPT models. Can be configured by setting `model.normalization` to `rmsnorm`. Default is `layernorm1p`.
-* Added `fast` versions of SwiGLU, GeGLU and ReGLU. Can be configured by setting `model.activation=fast-swiglu`, `model.activation=fast-reglu` or `model.activation=fast-geglu`. Checkpoints trained with `fast` and regular versions of SwiGLU, GeGLU and ReGLU are *not* compatible with each other since the weight state dictionaries are different.
-
 **NeMo Framework 23.03**
 * Per micro-batch data loader for GPT and BERT
 * SquaredReLU and SwiGLU activation function support for GPT and T5
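The two bullets removed in the hunk above describe plain Hydra overrides for normalization and activation. A minimal sketch of how they might be applied, assuming the same hypothetical launcher entry point as before; only the `model.normalization` and `model.activation` keys and their values come from the notes themselves, and the `training.` nesting prefix is an assumption about where the override is issued:

```bash
# Hypothetical invocations; key nesting may differ depending on whether
# the override is passed to the launcher or to the training script itself.

# Use RMSNorm instead of the default layernorm1p:
python3 main.py training=gpt3/5b training.model.normalization=rmsnorm

# Use the fast SwiGLU variant; per the note above, checkpoints trained
# with fast-* and regular GLU variants are not interchangeable because
# their weight state dictionaries differ:
python3 main.py training=gpt3/5b training.model.activation=fast-swiglu
```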