Overview
Our nllb models use the standard fp16 floating-point format under the hood. This issue proposes exploring the benefits of bf16 ("brain floating point"), which may:
- improve accuracy
- reduce training time
- reduce memory requirements
Proposal
- understand what support nvidia GPUs have for bf16
- add an experimental flag that allows enabling bf16 (current default behavior won't change)
- for nllb, investigate what effect using bf16 instead of fp16 has in terms of:
  - model accuracy
  - training time
  - GPU memory usage
Investigation related to tf32 is out of scope.
In the future, if bf16 proves to be stable and generally better, we may make it the default, but that's out of scope for this issue.
Current settings
Currently we use this logic to set the preferred floating point:
```python
def _create_training_arguments(self) -> Seq2SeqTrainingArguments:
    ...
    merge_dict(
        args,
        {
            "fp16": self._mixed_precision and not self._is_t5,  # <------------
            "bf16": self._mixed_precision and self._is_t5,      # <------------
            "tf32": self._mixed_precision,
        },
    )
    ...
```
(For context, the `_is_t5` field is related to the google madlad model and will be `False` for nllb.)
For nllb, mixed precision is enabled by default, so usually the above will reduce to:
```python
{
    "fp16": True,
    "bf16": False,
    "tf32": True,
}
```
This means the models are using fp16 currently.
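A minimal sketch of how the experimental flag could slot into this logic. The standalone helper and the `use_bf16` parameter name are hypothetical (the real change would live inside `_create_training_arguments`), but it shows that defaulting the flag to off preserves current behavior:

```python
def resolve_precision_flags(mixed_precision: bool, is_t5: bool, use_bf16: bool = False) -> dict:
    """Hypothetical helper: use_bf16 defaults to False, so current behavior is unchanged."""
    return {
        "fp16": mixed_precision and not is_t5 and not use_bf16,
        "bf16": mixed_precision and (is_t5 or use_bf16),
        "tf32": mixed_precision,
    }

# Current nllb default: fp16 mixed precision
print(resolve_precision_flags(mixed_precision=True, is_t5=False))
# {'fp16': True, 'bf16': False, 'tf32': True}

# With the experimental flag enabled: bf16 replaces fp16
print(resolve_precision_flags(mixed_precision=True, is_t5=False, use_bf16=True))
# {'fp16': False, 'bf16': True, 'tf32': True}
```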
Notes on floating point formats
fp16
fp16 = "half precision floating point"
https://en.wikipedia.org/wiki/Half-precision_floating-point_format
It's the standard 16 bit floating point representation (see IEEE 754 standard)
The 16 bits are used like so:
```
 1    5        10
 s  eeeee  ffffffffff
    exp.   significand

s = sign bit
e = exponent bits
f = fractional bits
```
Because fp16 has only 5 exponent bits, the largest normal value that can be represented is 65504.
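This limit is easy to demonstrate with numpy's fp16 type (assuming numpy is available; this is just an illustration, not how the training code handles precision):

```python
import numpy as np

print(np.float16(65504.0))  # 65504.0, the largest finite fp16 value
print(np.float16(70000.0))  # inf: beyond the fp16 range, so it overflows
```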
fp32
fp32 = "single precision floating point"
https://en.wikipedia.org/wiki/Single-precision_floating-point_format
The big brother of fp16
```
       1     8              23
fp32   s  eeeeeeee  fffffffffffffffffffffff

s = sign bit
e = exponent bits
f = fractional bits
```
fp16, fp32 and mixed precision
32-bit floats are more accurate, but they halve the number of values that fit in a given amount of memory.
In the context of GPUs and model training, this can make training much slower.
That is why it can be helpful to use "mixed precision", where some computations are lowered to 16-bit values like fp16.
There is a trade-off of accuracy for speed. Using fp16 also introduces problems because it can't represent very large numbers or very small positive numbers. These issues can lower the accuracy of the model, and the conversions related to loss scaling add processing time and memory overhead.
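A toy illustration of why loss scaling is needed, again using numpy's fp16 type (the scale factor 1024 is arbitrary here; real mixed-precision trainers choose the scale dynamically):

```python
import numpy as np

grad = 1e-8  # a small gradient value, computed in fp32

# Cast directly to fp16: it underflows to zero and the update is lost
# (the smallest positive fp16 subnormal is ~6e-8)
print(np.float16(grad))  # 0.0

# Loss scaling: multiply before the cast so the value survives in fp16,
# then divide the scale back out in fp32 when applying the update
scale = 1024.0
scaled = np.float16(grad * scale)
recovered = float(scaled) / scale
print(scaled > 0, abs(recovered - grad) < 1e-9)  # True True
```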
By default our model training uses mixed precision, but it can be disabled with `--disable-mixed-precision`.
bf16
bf16 = "brain floating point"
Created by the Google Brain AI research group.
https://en.wikipedia.org/wiki/Bfloat16_floating-point_format
```
 1     8         7
 s  eeeeeeee  fffffff
    exp.      significand

s = sign bit
e = exponent bits
f = fractional bits
```
You can think of it as fp16 with 3 bits moved from the significand to the exponent. This means it can represent much larger numbers and much smaller positive numbers, but with fewer significant figures.
Another way to understand it is in relation to fp32:
```
bf16   s  eeeeeeee  fffffff                  (7)
fp32   s  eeeeeeee  fffffffffffffffffffffff  (23)
                           ^^^^^^^^^^^^^^^^
                           16 additional precision bits
```
It is really just fp32 with the last 16 precision bits "chopped off", and they have the same number of exponent bits.
Conversions between fp32 and bf16 are very efficient because of this bit layout:
- fp32 -> bf16: chop off the back 16 bits
- bf16 -> fp32: pad with 16x0 bits
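The bit-level relationship can be sketched in pure Python with `struct`. Note this truncates rather than rounds, which is the simplest possible fp32 -> bf16 conversion:

```python
import struct

def fp32_to_bf16_bits(x: float) -> int:
    """fp32 -> bf16: keep only the top 16 bits of the fp32 bit pattern."""
    (bits32,) = struct.unpack("<I", struct.pack("<f", x))
    return bits32 >> 16

def bf16_bits_to_fp32(bits16: int) -> float:
    """bf16 -> fp32: pad the bf16 bit pattern with 16 zero bits."""
    return struct.unpack("<f", struct.pack("<I", bits16 << 16))[0]

# Values with few significant bits round-trip exactly...
print(bf16_bits_to_fp32(fp32_to_bf16_bits(1.5)))         # 1.5
# ...while others lose their low mantissa bits
print(bf16_bits_to_fp32(fp32_to_bf16_bits(3.14159265)))  # 3.140625
```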
There's a nice summary in this post of the potential benefits for bf16 over fp16.
tf32
tf32 = "tensor float 32"
https://en.wikipedia.org/wiki/TensorFloat-32
It only uses 19 bits in total:
```
       1     8
bf16   s  eeeeeeee  fffffff                  (7)
fp32   s  eeeeeeee  fffffffffffffffffffffff  (23)
tf32   s  eeeeeeee  ffffffffff               (10)
```
My impression is that it's intended as a way to "reinterpret" existing fp32 values at lower precision to make processing them faster. I don't think it packs data into 19-bit chunks, so it's effectively a lazy, backwards-compatible fp32.
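That idea can be illustrated with the same bit-truncation trick (again truncating, where actual tensor cores round): zeroing the low 13 mantissa bits of an fp32 value leaves the 10-bit tf32 significand, while the value stays in a 32-bit container:

```python
import struct

def truncate_to_tf32(x: float) -> float:
    """Zero the low 13 mantissa bits, leaving the 10 significand bits tf32 keeps."""
    (bits,) = struct.unpack("<I", struct.pack("<f", x))
    return struct.unpack("<f", struct.pack("<I", bits & ~0x1FFF))[0]

print(truncate_to_tf32(1.0))  # 1.0 (exact: the mantissa is all zeros)
print(truncate_to_tf32(1.1))  # 1.099609375 (low mantissa bits dropped)
```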
Historical support
My impression is that bf16 was initially created by Google, with native support added to their custom TPUs.
This was back around 2018, but I haven't found an exact date.
At that point, the existing nvidia GPUs wouldn't have had native support for bf16.
However, since then other chip manufacturers have started adding native support, and I suspect that most nvidia hardware
used by the silnlp team for local development and in our clearml infra would support it.
For example, these GPUs should support bf16:
- 3000 series
- 4000 series
- A100
- H100
However, it's not clear to me yet what effect enabling bf16 would have in practice.