Commit d790d3e

Merge pull request #582 from amd-jnovotny/rocfft-refactor-rocmrel64
Auto-submit by Jenkins
2 parents aa98e53 + ae83352 commit d790d3e

17 files changed

+650
-586
lines changed

LICENSE.md

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -1,4 +1,4 @@
1-
# LICENSE
1+
# License
22

33
Copyright (C) 2016 - 2025 Advanced Micro Devices, Inc. All rights reserved.
44

README.md

Lines changed: 2 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -70,14 +70,14 @@ You can install rocFFT using pre-built packages or building from source.
7070
7171
```bash
7272
mkdir build && cd build
73-
cmake -DCMAKE_CXX_COMPILER=amdclang++ -DCMAKE_C_COMPILER=amdclang_PREFIX_PATH=/path/to/rocFFT-lib ..
73+
cmake -DCMAKE_CXX_COMPILER=amdclang++ -DCMAKE_PREFIX_PATH=/path/to/rocFFT-lib ..
7474
make -j
7575
```
7676
7777
To install client dependencies on Ubuntu, run:
7878
7979
```bash
80-
sudo apt install libgtest-dev libfftw3-dev
80+
sudo apt install libgtest-dev libfftw3-dev libboost-dev
8181
```
8282
8383
We use version 1.11 of GoogleTest.
Lines changed: 39 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,39 @@
1+
.. meta::
2+
:description: rocFFT computations
3+
:keywords: rocFFT, ROCm, API, documentation, install, computation, fft
4+
5+
.. _fft-computation:
6+
7+
********************************************************************
8+
FFT computation
9+
********************************************************************
10+
11+
rocFFT is an implementation of the Discrete Fourier Transform (DFT) that makes use of symmetries in the DFT definition to
12+
reduce the computational complexity from :math:`O(N^2)` to :math:`O(N \log N)`.
13+
14+
How does the library compute DFTs?
15+
==================================
16+
17+
Here are the formulas for 1D, 2D, and 3D complex DFTs:
18+
19+
For a 1D complex DFT:
20+
21+
:math:`{\tilde{x}}_j = \sum_{k=0}^{n-1}x_k\exp\left({\pm i}{{2\pi jk}\over{n}}\right)\hbox{ for } j=0,1,\ldots,n-1`
22+
23+
Where :math:`x_k` is the complex data to be transformed, :math:`\tilde{x}_j` is the transformed data,
24+
and the sign :math:`\pm`
25+
determines the direction of the transform: :math:`-` for forward and :math:`+` for backward.
26+
27+
For a 2D complex DFT:
28+
29+
:math:`{\tilde{x}}_{jk} = \sum_{q=0}^{m-1}\sum_{r=0}^{n-1}x_{rq}\exp\left({\pm i} {{2\pi jr}\over{n}}\right)\exp\left({\pm i}{{2\pi kq}\over{m}}\right)`
30+
31+
For :math:`j=0,1,\ldots,n-1\hbox{ and } k=0,1,\ldots,m-1`, where :math:`x_{rq}` is the complex data to be transformed,
32+
:math:`\tilde{x}_{jk}` is the transformed data, and the sign :math:`\pm` determines the direction of the transform.
33+
34+
For a 3D complex DFT:
35+
36+
:math:`\tilde{x}_{jkl} = \sum_{s=0}^{p-1}\sum_{q=0}^{m-1}\sum_{r=0}^{n-1}x_{rqs}\exp\left({\pm i} {{2\pi jr}\over{n}}\right)\exp\left({\pm i}{{2\pi kq}\over{m}}\right)\exp\left({\pm i}{{2\pi ls}\over{p}}\right)`
37+
38+
For :math:`j=0,1,\ldots,n-1\hbox{ and } k=0,1,\ldots,m-1\hbox{ and } l=0,1,\ldots,p-1`, where :math:`x_{rqs}` is the complex data to
39+
be transformed, :math:`\tilde{x}_{jkl}` is the transformed data, and the sign :math:`\pm` determines the direction of the transform.
Lines changed: 33 additions & 30 deletions
Original file line numberDiff line numberDiff line change
@@ -1,70 +1,73 @@
1+
.. meta::
2+
:description: Distributed transforms in rocFFT
3+
:keywords: rocFFT, ROCm, API, documentation, distributed transform
4+
5+
16
.. _distributed-transforms:
27

38
********************************************************************
49
Distributed transforms
510
********************************************************************
611

712
rocFFT can optionally distribute FFTs across multiple devices in a
8-
single process, or across multiple MPI ranks. To perform distributed
9-
transforms, users must describe their input and output data layouts
13+
single process or across multiple Message Passing Interface (MPI) ranks. To perform distributed
14+
transforms, describe the input and output data layouts
1015
as :ref:`fields<input_output_fields>`.
1116

1217
Multiple devices in a single process
1318
====================================
1419

15-
A transform may be distributed across multiple devices in a single
20+
A transform can be distributed across multiple devices in a single
1621
process by passing distinct device IDs to
17-
:cpp:func:`rocfft_brick_create` for bricks in the input and output
22+
:cpp:func:`rocfft_brick_create` to create bricks in the input and output
1823
fields.
1924

2025
Support for single-process multi-device transforms was introduced in
2126
ROCm 6.0 with rocFFT 1.0.25.
2227

23-
Message Passing Interface (MPI)
28+
Message Passing Interface
2429
===============================
2530

26-
MPI allows for distributing the transform across multiple processes,
31+
MPI lets you distribute the transform across multiple processes,
2732
organized into MPI ranks.
2833

34+
To turn on rocFFT MPI support, enable the ``ROCFFT_MPI_ENABLE`` CMake option
35+
when building the library. By default, this option
36+
is off. To use Cray MPI, enable the ``ROCFFT_CRAY_MPI_ENABLE`` CMake option.
37+
38+
Additionally, rocFFT MPI support requires a GPU-aware MPI library
39+
that supports transferring data to and from HIP devices.
40+
2941
Support for MPI transforms was introduced in ROCm 6.3 with rocFFT
3042
1.0.29.
3143

3244
.. note::
3345

34-
rocFFT MPI support is only available when the library is built
35-
with the `ROCFFT_MPI_ENABLE` CMake option enabled. By default it
36-
is off.
37-
38-
Additionally, rocFFT MPI support requires a GPU-aware MPI library
39-
with support for transferring data to/from HIP devices.
40-
41-
Finally, rocFFT API calls made on different ranks may return
42-
different values. Users must take care to ensure that all ranks
46+
rocFFT API calls made on different ranks might return
47+
different values. Application developers must ensure that all ranks
4348
have successfully created their plans before attempting to execute
44-
a distributed transform, and it is possible for one rank to fail
45-
to create/execute a plan while the others succeed.
49+
a distributed transform. One rank can fail
50+
to create or execute a plan while the others succeed.
4651

47-
To perform a transform across multiple MPI ranks, additional steps
48-
are required to distribute the computation:
52+
To distribute a transform across multiple MPI ranks, the
53+
following additional steps are required:
4954

5055
#. Each rank calls :cpp:func:`rocfft_plan_description_set_comm` to
51-
add an MPI communicator to an allocated plan description. rocFFT
52-
will distribute the computation across all ranks in the
56+
add an MPI communicator to an allocated plan description. rocFFT
57+
distributes the computation across all ranks in the
5358
communicator.
5459

5560
#. Each rank allocates the same fields and calls
5661
:cpp:func:`rocfft_plan_description_add_infield` and
5762
:cpp:func:`rocfft_plan_description_add_outfield` on the plan
58-
description. However, each rank must only call
63+
description. However, each rank must only call
5964
:cpp:func:`rocfft_brick_create` and
6065
:cpp:func:`rocfft_field_add_brick` for bricks that reside on that
61-
rank.
62-
63-
A brick resides on exactly one rank, but each rank may have zero
64-
or more bricks on it.
66+
rank. A brick resides on exactly one rank. Each rank can have zero
67+
or more bricks associated with it.
6568

6669
#. Each rank in the communicator calls
67-
:cpp:func:`rocfft_plan_create`. At this time rocFFT will distribute
70+
:cpp:func:`rocfft_plan_create`. At this point, rocFFT distributes
6871
the supplied brick information between all of the ranks.
6972

7073
#. Each rank in the communicator calls :cpp:func:`rocfft_execute`.
@@ -73,8 +76,8 @@ are required to distribute the computation:
7376
of the current rank.
7477

7578
The pointers must be provided in the same order in which the bricks were
76-
added to the field (via calls to :cpp:func:`rocfft_field_add_brick`), and
77-
must point to memory on the device that was specified at that time.
79+
added to the field (using calls to :cpp:func:`rocfft_field_add_brick`) and
80+
must point to the memory on the device that was specified at that time.
7881

79-
For in-place transforms, only pass input pointers and pass an
82+
For in-place transforms, only pass the input pointers and use an
8083
empty array of output pointers.
Lines changed: 52 additions & 69 deletions
Original file line numberDiff line numberDiff line change
@@ -1,103 +1,86 @@
11
.. meta::
2-
:description: rocFFT documentation and API reference library
3-
:keywords: rocFFT, ROCm, API, documentation
2+
:description: How to use load and store callbacks in rocFFT
3+
:keywords: rocFFT, ROCm, API, documentation, callbacks
44

55
.. _load-store-callbacks:
66

77
********************************************************************
8-
Load and Store Callbacks
8+
Load and store callbacks
99
********************************************************************
1010

1111
rocFFT includes experimental functionality to call user-defined device functions
12-
when loading input from global memory at the start of a transform, or
13-
when storing output to global memory at the end of a transform.
12+
when loading input from global memory at the start of a transform or
13+
when storing output to global memory at the end of a transform.
1414

15-
These user-defined callback functions may be optionally supplied
15+
These optional user-defined callback functions can be supplied
1616
to the library using
1717
:cpp:func:`rocfft_execution_info_set_load_callback` and
1818
:cpp:func:`rocfft_execution_info_set_store_callback`.
1919

2020
Device functions supplied as callbacks must load and store element
21-
data types that are appropriate for the transform being performed.
22-
23-
+-------------------------+--------------------+----------------------+
24-
|Transform type | Load element type | Store element type |
25-
+=========================+====================+======================+
26-
|Complex-to-complex, | `_Float16_2` | `_Float16_2` |
27-
|half-precision | | |
28-
+-------------------------+--------------------+----------------------+
29-
|Complex-to-complex, | `float2` | `float2` |
30-
|single-precision | | |
31-
+-------------------------+--------------------+----------------------+
32-
|Complex-to-complex, | `double2` | `double2` |
33-
|double-precision | | |
34-
+-------------------------+--------------------+----------------------+
35-
|Real-to-complex, | `float` | `float2` |
36-
|single-precision | | |
37-
+-------------------------+--------------------+----------------------+
38-
|Real-to-complex, | `_Float16` | `_Float16_2` |
39-
|half-precision | | |
40-
+-------------------------+--------------------+----------------------+
41-
|Real-to-complex, | `double` | `double2` |
42-
|double-precision | | |
43-
+-------------------------+--------------------+----------------------+
44-
|Complex-to-real, | `_Float16_2` | `_Float16` |
45-
|half-precision | | |
46-
+-------------------------+--------------------+----------------------+
47-
|Complex-to-real, | `float2` | `float` |
48-
|single-precision | | |
49-
+-------------------------+--------------------+----------------------+
50-
|Complex-to-real, | `double2` | `double` |
51-
|double-precision | | |
52-
+-------------------------+--------------------+----------------------+
21+
data types appropriate for the transform being executed.
22+
23+
+-------------------------+----------------------+------------------------+
24+
|Transform type | Load element type | Store element type |
25+
+=========================+======================+========================+
26+
|Complex-to-complex, | ``_Float16_2`` | ``_Float16_2`` |
27+
|half-precision | | |
28+
+-------------------------+----------------------+------------------------+
29+
|Complex-to-complex, | ``float2`` | ``float2`` |
30+
|single-precision | | |
31+
+-------------------------+----------------------+------------------------+
32+
|Complex-to-complex, | ``double2`` | ``double2`` |
33+
|double-precision | | |
34+
+-------------------------+----------------------+------------------------+
35+
|Real-to-complex, | ``float`` | ``float2`` |
36+
|single-precision | | |
37+
+-------------------------+----------------------+------------------------+
38+
|Real-to-complex, | ``_Float16`` | ``_Float16_2`` |
39+
|half-precision | | |
40+
+-------------------------+----------------------+------------------------+
41+
|Real-to-complex, | ``double`` | ``double2`` |
42+
|double-precision | | |
43+
+-------------------------+----------------------+------------------------+
44+
|Complex-to-real, | ``_Float16_2`` | ``_Float16`` |
45+
|half-precision | | |
46+
+-------------------------+----------------------+------------------------+
47+
|Complex-to-real, | ``float2`` | ``float`` |
48+
|single-precision | | |
49+
+-------------------------+----------------------+------------------------+
50+
|Complex-to-real, | ``double2`` | ``double`` |
51+
|double-precision | | |
52+
+-------------------------+----------------------+------------------------+
5353

5454
The callback function signatures must match the specifications
5555
below.
5656

5757
.. code-block:: c
5858
59-
T load_callback(T* buffer, size_t offset, void* callback_data, void* shared_memory);
60-
void store_callback(T* buffer, size_t offset, T element, void* callback_data, void* shared_memory);
59+
Tdata load_callback(Tdata* buffer, size_t offset, void* callback_data, void* shared_memory);
60+
void store_callback(Tdata* buffer, size_t offset, Tdata element, void* callback_data, void* shared_memory);
6161
62-
The parameters for the functions are defined as:
62+
The parameters for the functions are as follows:
6363

64-
* `T`: The data type of each element being loaded or stored from the
64+
* ``Tdata``: The data type of each element being loaded from the
6565
input or stored to the output.
66-
* `buffer`: Pointer to the input (for load callbacks) or
66+
* ``buffer``: Pointer to the input (for load callbacks) or
6767
output (for store callbacks) in device memory that was passed to
6868
:cpp:func:`rocfft_execute`.
69-
* `offset`: The offset of the location being read from or written
70-
to. This counts in elements, from the `buffer` pointer.
71-
* `element`: For store callbacks only, the element to be stored.
72-
* `callback_data`: A pointer value accepted by
69+
* ``offset``: The offset of the location being read from or written
70+
to. The offset is counted in elements, starting from the ``buffer`` pointer.
71+
* ``element``: For store callbacks only, the element to be stored.
72+
* ``callback_data``: A pointer value accepted by
7373
:cpp:func:`rocfft_execution_info_set_load_callback` and
7474
:cpp:func:`rocfft_execution_info_set_store_callback` which is passed
7575
through to the callback function.
76-
* `shared_memory`: A pointer to an amount of shared memory requested
77-
when the callback is set. Shared memory is not supported,
78-
and this parameter is always null.
76+
* ``shared_memory``: A pointer to the shared memory requested
77+
when the callback is set. Shared memory is not supported,
78+
so this parameter is always null.
7979

8080
Callback functions are called exactly once for each element being
81-
loaded or stored in a transform. Note that multiple kernels may be
81+
loaded or stored in a transform. Multiple kernels can be
8282
launched to decompose a transform, which means that separate kernels
83-
may call the load and store callbacks for a transform if both are
83+
might call the load and store callbacks for a transform if both are
8484
specified.
8585

8686
Callback functions are supported only for transforms that do not use planar format for input or output.
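
To make the callback signatures concrete, the following host-only sketch in plain C mimics their shape for a single-precision complex-to-complex transform. It makes several assumptions: real callbacks are device functions compiled for the GPU, whereas this code runs on the CPU; ``float2_host``, ``scale_params``, ``scale_on_load``, ``negate_imag_on_store``, and ``copy_through_callbacks`` are hypothetical names that do not exist in rocFFT; and the loop only stands in for a kernel that invokes each callback once per element.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical host-side stand-in for the device float2 type. */
typedef struct { float x, y; } float2_host;

/* Hypothetical user data reached through the callback_data pointer. */
typedef struct { float scale; } scale_params;

/* Load callback shape with Tdata = float2_host: read one element at
 * `offset` (counted in elements) and scale it before it is used. */
static float2_host scale_on_load(float2_host *buffer, size_t offset,
                                 void *callback_data, void *shared_memory)
{
    (void)shared_memory; /* always null, per the parameter description */
    const scale_params *p = (const scale_params *)callback_data;
    float2_host v = buffer[offset];
    v.x *= p->scale;
    v.y *= p->scale;
    return v;
}

/* Store callback shape: conjugate the element, then write it at `offset`. */
static void negate_imag_on_store(float2_host *buffer, size_t offset,
                                 float2_host element, void *callback_data,
                                 void *shared_memory)
{
    (void)callback_data;
    (void)shared_memory;
    element.y = -element.y;
    buffer[offset] = element;
}

/* Host-only stand-in for a kernel: no transform is performed, but each
 * callback is called exactly once per element, as the text describes. */
static void copy_through_callbacks(float2_host *in, float2_host *out,
                                   size_t n, void *load_data)
{
    assert(in != NULL && out != NULL);
    for (size_t i = 0; i < n; ++i) {
        float2_host v = scale_on_load(in, i, load_data, NULL);
        negate_imag_on_store(out, i, v, NULL, NULL);
    }
}
```

In this sketch the load callback leaves the input buffer untouched and only changes the value handed onward, which is one way to apply a transformation to the data without an extra pass over memory.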
87-
88-
Runtime compilation
89-
===================
90-
91-
rocFFT includes many kernels for common FFT problems. Some plans may
92-
require additional kernels aside from what is built in to the
93-
library. In these cases, rocFFT will compile optimized kernels for
94-
the plan when the plan is created.
95-
96-
Compiled kernels are stored in memory by default and will be reused
97-
if they are required again for plans in the same process.
98-
99-
If the ``ROCFFT_RTC_CACHE_PATH`` environment variable is set to a
100-
writable file location, rocFFT will write compiled kernels to this
101-
location. rocFFT will read kernels from this location for plans in
102-
other processes that need runtime-compiled kernels. rocFFT will
103-
create the specified file if it does not already exist.
