ml_dtypes

ml_dtypes is a stand-alone implementation of several NumPy dtype extensions used in machine learning libraries, including:

bfloat16: an alternative to the standard float16 format
float8_*: several experimental 8-bit floating point representations including:
- float8_e4m3b11fnuz
- float8_e4m3fn
- float8_e4m3fnuz
- float8_e5m2
- float8_e5m2fnuz
int4 and uint4: low precision integer types.

See below for specifications of these number formats.

Installation

The ml_dtypes package is tested with Python versions 3.9-3.12, and can be installed with the following command:

pip install ml_dtypes

To test your installation, you can run the following:

pip install absl-py pytest
pytest --pyargs ml_dtypes

To build from source, clone the repository and run:

git submodule init
git submodule update
pip install .

Example Usage

>>> from ml_dtypes import bfloat16
>>> import numpy as np
>>> np.zeros(4, dtype=bfloat16)
array([0, 0, 0, 0], dtype=bfloat16)

Importing ml_dtypes also registers the data types with numpy, so that they may be referred to by their string name:

>>> np.dtype('bfloat16')
dtype(bfloat16)
>>> np.dtype('float8_e5m2')
dtype(float8_e5m2)

Specifications of implemented floating point formats

`bfloat16`

A bfloat16 number is a single-precision float truncated at 16 bits.

Exponent: 8, Mantissa: 7, exponent bias: 127. IEEE 754, with NaN and inf.

`float8_e4m3b11fnuz`

Exponent: 4, Mantissa: 3, bias: 11.

Extended range: no inf, NaN represented by 0b1000'0000.

`float8_e4m3fn`

Exponent: 4, Mantissa: 3, bias: 7.

Extended range: no inf, NaN represented by 0bS111'1111.

The fn suffix is for consistency with the corresponding LLVM/MLIR type, signaling this type is not consistent with IEEE-754. The f indicates it is finite values only. The n indicates it includes NaNs, but only at the outer range.

`float8_e4m3fnuz`

8-bit floating point with 3 bit mantissa.

An 8-bit floating point type with 1 sign bit, 4 bits exponent and 3 bits mantissa. The suffix fnuz is consistent with LLVM/MLIR naming and is derived from the differences to IEEE floating point conventions. F is for "finite" (no infinities), N for with special NaN encoding, UZ for unsigned zero.

This type has the following characteristics:

bit encoding: S1E4M3 - 0bSEEEEMMM
exponent bias: 8
infinities: Not supported
NaNs: Supported with sign bit set to 1, exponent bits and mantissa bits set to all 0s - 0b10000000
denormals when exponent is 0

`float8_e5m2`

Exponent: 5, Mantissa: 2, bias: 15. IEEE 754, with NaN and inf.

`float8_e5m2fnuz`

8-bit floating point with 2 bit mantissa.

An 8-bit floating point type with 1 sign bit, 5 bits exponent and 2 bits mantissa. The suffix fnuz is consistent with LLVM/MLIR naming and is derived from the differences to IEEE floating point conventions. F is for "finite" (no infinities), N for with special NaN encoding, UZ for unsigned zero.

This type has the following characteristics:

bit encoding: S1E5M2 - 0bSEEEEEMM
exponent bias: 16
infinities: Not supported
NaNs: Supported with sign bit set to 1, exponent bits and mantissa bits set to all 0s - 0b10000000
denormals when exponent is 0

`int4` and `uint4`

4-bit integer types, where each element is represented unpacked (i.e., padded up to a byte in memory).

NumPy does not support types smaller than a single byte. For example, the distance between adjacent elements in an array (.strides) is expressed in bytes. Relaxing this restriction would be a considerable engineering project. The int4 and uint4 types therefore use an unpacked representation, where each element of the array is padded up to a byte in memory. The lower four bits of each byte contain the representation of the number, whereas the upper four bits are ignored.

Quirks of low-precision Arithmetic

If you're exploring the use of low-precision dtypes in your code, you should be careful to anticipate when the precision loss might lead to surprising results. One example is the behavior of aggregations like sum; consider this bfloat16 summation in NumPy (run with version 1.24.2):

>>> from ml_dtypes import bfloat16
>>> import numpy as np
>>> rng = np.random.default_rng(seed=0)
>>> vals = rng.uniform(size=10000).astype(bfloat16)
>>> vals.sum()
256

The true sum should be close to 5000, but numpy returns exactly 256: this is because bfloat16 does not have the precision to increment 256 by values less than 1:

>>> bfloat16(256) + bfloat16(1)
256

After 256, the next representable value in bfloat16 is 258:

>>> np.nextafter(bfloat16(256), bfloat16(np.inf))
258

For better results you can specify that the accumulation should happen in a higher-precision type like float32:

>>> vals.sum(dtype='float32').astype(bfloat16)
4992

In contrast to NumPy, projects like JAX which support low-precision arithmetic more natively will often do these kinds of higher-precision accumulations automatically:

>>> import jax.numpy as jnp
>>> jnp.array(vals).sum()
Array(4992, dtype=bfloat16)

License

This is not an officially supported Google product.

The ml_dtypes source code is licensed under the Apache 2.0 license (see LICENSE). Pre-compiled wheels are built with the EIGEN project, which is released under the MPL 2.0 license (see LICENSE.eigen).

Name		Name	Last commit message	Last commit date
Latest commit History 150 Commits
.github/workflows		.github/workflows
.vscode		.vscode
ml_dtypes		ml_dtypes
third_party		third_party
.gitignore		.gitignore
.gitmodules		.gitmodules
.pre-commit-config.yaml		.pre-commit-config.yaml
.pylintrc		.pylintrc
AUTHORS		AUTHORS
CHANGELOG.md		CHANGELOG.md
CONTRIBUTING.md		CONTRIBUTING.md
LICENSE		LICENSE
LICENSE.eigen		LICENSE.eigen
MANIFEST.in		MANIFEST.in
README.md		README.md
RELEASING.md		RELEASING.md
pyproject.toml		pyproject.toml
pytest.ini		pytest.ini
setup.py		setup.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

ml_dtypes

Installation

Example Usage

Specifications of implemented floating point formats

`bfloat16`

`float8_e4m3b11fnuz`

`float8_e4m3fn`

`float8_e4m3fnuz`

`float8_e5m2`

`float8_e5m2fnuz`

`int4` and `uint4`

Quirks of low-precision Arithmetic

License

About

Licenses found

Uh oh!

Releases

Packages

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

ml_dtypes

Installation

Example Usage

Specifications of implemented floating point formats

bfloat16

float8_e4m3b11fnuz

float8_e4m3fn

float8_e4m3fnuz

float8_e5m2

float8_e5m2fnuz

int4 and uint4

Quirks of low-precision Arithmetic

License

About

Resources

License

Licenses found

Contributing

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

`bfloat16`

`float8_e4m3b11fnuz`

`float8_e4m3fn`

`float8_e4m3fnuz`

`float8_e5m2`

`float8_e5m2fnuz`

`int4` and `uint4`

Packages