Overview
Our nllb models use the standard fp16 floating-point format under the hood. This issue proposes exploring the benefits of bf16 ("brain floating point"), which may:
- improve accuracy
- reduce training time
- reduce memory requirements
Proposal
- understand what support nvidia GPUs have for bf16
- add an experimental flag that allows enabling bf16 (current default behavior won't change)
- for nllb, investigate what effect using bf16 instead of fp16 has in terms of:
  - model accuracy
  - training time
  - GPU memory usage
Investigation related to tf32 is out of scope.
In the future, if bf16 proves to be stable and generally better, we may make it the default, but that's out of scope for this issue.
Current settings
Currently we use this logic to set the preferred floating point:
```python
def _create_training_arguments(self) -> Seq2SeqTrainingArguments:
    ...
    merge_dict(
        args,
        {
            "fp16": self._mixed_precision and not self._is_t5,  # <------------
            "bf16": self._mixed_precision and self._is_t5,      # <------------
            "tf32": self._mixed_precision,
        },
    )
    ...
```
(For context, the `_is_t5` field is related to the google madlad model and will be `False` for nllb.)
For nllb, mixed precision is enabled by default, so usually the above will reduce to:
```python
{
    "fp16": True,
    "bf16": False,
    "tf32": True,
}
```
This means the models are using fp16 currently.
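A minimal sketch of how the experimental flag could slot into this logic. The standalone helper and the `use_bf16` parameter name are hypothetical (the real change would live inside `_create_training_arguments`), but it shows that defaulting the flag to off preserves current behavior:

```python
def resolve_precision_flags(mixed_precision: bool, is_t5: bool, use_bf16: bool = False) -> dict:
    """Hypothetical helper: use_bf16 defaults to False, so current behavior is unchanged."""
    return {
        "fp16": mixed_precision and not is_t5 and not use_bf16,
        "bf16": mixed_precision and (is_t5 or use_bf16),
        "tf32": mixed_precision,
    }

# Current nllb default: fp16 mixed precision
print(resolve_precision_flags(mixed_precision=True, is_t5=False))
# {'fp16': True, 'bf16': False, 'tf32': True}

# With the experimental flag enabled: bf16 replaces fp16
print(resolve_precision_flags(mixed_precision=True, is_t5=False, use_bf16=True))
# {'fp16': False, 'bf16': True, 'tf32': True}
```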
Notes on floating point formats
fp16
fp16 = "half precision floating point"
https://en.wikipedia.org/wiki/Half-precision_floating-point_format
It's the standard 16 bit floating point representation (see IEEE 754 standard)
The 16 bits are used like so:
```
 1    5        10
 s  eeeee  ffffffffff
    exp.   significand

s = sign bit
e = exponent bits
f = fractional bits
```
Because fp16 has only 5 exponent bits, the largest normal value that can be represented is 65504.
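This limit is easy to demonstrate with numpy's fp16 type (assuming numpy is available; this is just an illustration, not how the training code handles precision):

```python
import numpy as np

print(np.float16(65504.0))  # 65504.0, the largest finite fp16 value
print(np.float16(70000.0))  # inf: beyond the fp16 range, so it overflows
```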
fp32
fp32 = "single precision floating point"
https://en.wikipedia.org/wiki/Single-precision_floating-point_format
The big brother of fp16
```
       1     8              23
fp32   s  eeeeeeee  fffffffffffffffffffffff

s = sign bit
e = exponent bits
f = fractional bits
```
fp16, fp32 and mixed precision
32-bit floats are more accurate, but they halve the number of values that fit in a given amount of memory.
In the context of GPUs and model training, this can make training much slower.
That is why it can be helpful to use "mixed precision", where some computations are lowered to 16-bit values like fp16.
There is a trade-off of accuracy for speed. Using fp16 also introduces problems because it can't represent very large numbers or very small positive numbers. These issues can lower the accuracy of the model, and the conversions related to loss scaling add processing time and memory overhead.
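A toy illustration of why loss scaling is needed, again using numpy's fp16 type (the scale factor 1024 is arbitrary here; real mixed-precision trainers choose the scale dynamically):

```python
import numpy as np

grad = 1e-8  # a small gradient value, computed in fp32

# Cast directly to fp16: it underflows to zero and the update is lost
# (the smallest positive fp16 subnormal is ~6e-8)
print(np.float16(grad))  # 0.0

# Loss scaling: multiply before the cast so the value survives in fp16,
# then divide the scale back out in fp32 when applying the update
scale = 1024.0
scaled = np.float16(grad * scale)
recovered = float(scaled) / scale
print(scaled > 0, abs(recovered - grad) < 1e-9)  # True True
```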
By default our model training uses mixed precision, but it can be disabled with `--disable-mixed-precision`.
bf16
bf16 = "brain floating point"
Created by the Google Brain AI research group.
https://en.wikipedia.org/wiki/Bfloat16_floating-point_format
```
 1     8         7
 s  eeeeeeee  fffffff
    exp.      significand

s = sign bit
e = exponent bits
f = fractional bits
```
You can think of it as fp16 with 3 bits moved from the significand to the exponent. This means it can represent much larger numbers and much smaller positive numbers, but with fewer significant figures.
Another way to understand it is in relation to fp32:
```
bf16   s  eeeeeeee  fffffff                  (7)
fp32   s  eeeeeeee  fffffffffffffffffffffff  (23)
                           ^^^^^^^^^^^^^^^^
                           16 additional precision bits
```
It is really just fp32 with the last 16 precision bits "chopped off", and they have the same number of exponent bits.
Conversions between fp32 and bf16 are very efficient because of this bit layout:
- fp32 -> bf16: chop off the back 16 bits
- bf16 -> fp32: pad with 16x0 bits
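The bit-level relationship can be sketched in pure Python with `struct`. Note this truncates rather than rounds, which is the simplest possible fp32 -> bf16 conversion:

```python
import struct

def fp32_to_bf16_bits(x: float) -> int:
    """fp32 -> bf16: keep only the top 16 bits of the fp32 bit pattern."""
    (bits32,) = struct.unpack("<I", struct.pack("<f", x))
    return bits32 >> 16

def bf16_bits_to_fp32(bits16: int) -> float:
    """bf16 -> fp32: pad the bf16 bit pattern with 16 zero bits."""
    return struct.unpack("<f", struct.pack("<I", bits16 << 16))[0]

# Values with few significant bits round-trip exactly...
print(bf16_bits_to_fp32(fp32_to_bf16_bits(1.5)))         # 1.5
# ...while others lose their low mantissa bits
print(bf16_bits_to_fp32(fp32_to_bf16_bits(3.14159265)))  # 3.140625
```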
There's a nice summary in this post of the potential benefits for bf16 over fp16.
tf32
tf32 = "tensor float 32"
https://en.wikipedia.org/wiki/TensorFloat-32
It only uses 19 bits in total:
```
       1     8
bf16   s  eeeeeeee  fffffff                  (7)
fp32   s  eeeeeeee  fffffffffffffffffffffff  (23)
tf32   s  eeeeeeee  ffffffffff               (10)
```
My impression is that it's intended as a way to "reinterpret" existing fp32 values at lower precision to make processing them faster. I don't think it packs data into 19-bit chunks, so it's effectively a lazy, backwards-compatible fp32.
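That idea can be illustrated with the same bit-truncation trick (again truncating, where actual tensor cores round): zeroing the low 13 mantissa bits of an fp32 value leaves the 10-bit tf32 significand, while the value stays in a 32-bit container:

```python
import struct

def truncate_to_tf32(x: float) -> float:
    """Zero the low 13 mantissa bits, leaving the 10 significand bits tf32 keeps."""
    (bits,) = struct.unpack("<I", struct.pack("<f", x))
    return struct.unpack("<f", struct.pack("<I", bits & ~0x1FFF))[0]

print(truncate_to_tf32(1.0))  # 1.0 (exact: the mantissa is all zeros)
print(truncate_to_tf32(1.1))  # 1.099609375 (low mantissa bits dropped)
```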
Historical support
My impression is that bf16 was initially created by Google, with native support added to their custom TPUs.
This was back around 2018, but I haven't found an exact date.
At that point, the existing nvidia GPUs wouldn't have had native support for bf16.
However, since then other chip manufacturers have started adding native support, and I suspect that most nvidia hardware
used by the silnlp team for local development and in our clearml infra would support it.
For example, these GPUs should support bf16:
- 3000 series
- 4000 series
- A100
- H100
However, it's not clear to me yet what effect enabling bf16 would have in practice.