Commit fe0c955

Merge branch 'main' into ad
2 parents 85db9c3 + c5a945a commit fe0c955

38 files changed: +2804 -1755 lines changed

Project.toml

Lines changed: 4 additions & 1 deletion
@@ -13,19 +13,22 @@ StaticArrays = "90137ffa-7385-5640-81b9-e52037218182"
 AMDGPU = "21141c5a-9bdb-4563-92ae-f87d6854732e"
 CUDA = "052768ef-5323-5732-b1bb-66c8b64840ba"
 Enzyme = "7da242da-08ed-463a-9acd-ee780be4f1d9"
+Metal = "dde4c033-4e86-420c-a63e-0dd931031962"
 Polyester = "f517fe37-dbe3-4b94-8317-1923a5111588"

 [extensions]
 ParallelStencil_AMDGPUExt = "AMDGPU"
 ParallelStencil_CUDAExt = "CUDA"
 ParallelStencil_EnzymeExt = "Enzyme"
+ParallelStencil_MetalExt = "Metal"

 [compat]
 AMDGPU = "0.6, 0.7, 0.8, 0.9, 1"
 CUDA = "3.12, 4, 5"
 CellArrays = "0.3"
 Enzyme = "0.12, 0.13"
 MacroTools = "0.5"
+Metal = "1.2"
 Polyester = "0.7"
 StaticArrays = "1"
 julia = "1.10" # Minimum version supporting Data module creation
@@ -35,4 +38,4 @@ TOML = "fa267f1f-6049-4f14-aa54-33bafae1ed76"
 Test = "8dfed614-e22c-5e08-85e1-65c5234f0b40"

 [targets]
-test = ["Test", "TOML", "AMDGPU", "CUDA", "Enzyme", "Polyester"]
+test = ["Test", "TOML", "AMDGPU", "CUDA", "Metal", "Enzyme", "Polyester"]
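
Review note: the Metal additions touch three tables that have to stay in sync (the UUID entry, `[extensions]`, and `[compat]`). The sketch below parses a reduced copy of those entries with the `TOML` standard library and reports any extension trigger that lacks a matching entry. It assumes the UUID line lives under `[weakdeps]`, which the hunk does not show, and `missing_entries` is a hypothetical helper, not part of ParallelStencil.

```julia
using TOML

# Reduced copy of the Metal-related entries from the diff.
# ASSUMPTION: the UUID entry belongs to [weakdeps]; that header is outside the hunk.
project = TOML.parse("""
[weakdeps]
Metal = "dde4c033-4e86-420c-a63e-0dd931031962"

[extensions]
ParallelStencil_MetalExt = "Metal"

[compat]
Metal = "1.2"
""")

# Hypothetical helper: list extension triggers without a [weakdeps] or [compat] entry.
function missing_entries(project)
    problems = String[]
    for (ext, trigger) in project["extensions"]
        # A trigger is either one package name or a list of names.
        for pkg in (trigger isa AbstractString ? [trigger] : trigger)
            haskey(project["weakdeps"], pkg) || push!(problems, "$ext: $pkg missing from [weakdeps]")
            haskey(project["compat"], pkg)   || push!(problems, "$ext: $pkg missing from [compat]")
        end
    end
    return problems
end

problems = missing_entries(project)
```

For the fragment above `problems` is empty. The compat entry `"1.2"` follows Pkg's default caret semantics, so it admits Metal versions from 1.2.0 up to, but not including, 2.0.0.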

README.md

Lines changed: 3 additions & 2 deletions
@@ -7,7 +7,7 @@ ParallelStencil empowers domain scientists to write architecture-agnostic high-l

 <a id="fig_teff">![Performance ParallelStencil Teff](docs/images/perf_ps2.png)</a>

-ParallelStencil relies on the native kernel programming capabilities of [CUDA.jl] and [AMDGPU.jl] and on [Base.Threads] for high-performance computations on GPUs and CPUs, respectively. It is seamlessly interoperable with [ImplicitGlobalGrid.jl], which renders the distributed parallelization of stencil-based GPU and CPU applications on a regular staggered grid almost trivial and enables close to ideal weak scaling of real-world applications on thousands of GPUs \[[1][JuliaCon20a], [2][JuliaCon20b], [3][JuliaCon19], [4][PASC19]\]. Moreover, ParallelStencil enables hiding communication behind computation with a simple macro call and without any particular restrictions on the package used for communication. ParallelStencil has been designed in conjunction with [ImplicitGlobalGrid.jl] for simplest possible usage by domain-scientists, rendering fast and interactive development of massively scalable high performance multi-GPU applications readily accessible to them. Furthermore, we have developed a self-contained approach for "Solving Nonlinear Multi-Physics on GPU Supercomputers with Julia" relying on ParallelStencil and [ImplicitGlobalGrid.jl] \[[1][JuliaCon20a]\]. ParallelStencil's feature to hide communication behind computation was showcased when a close to ideal weak scaling was demonstrated for a 3-D poro-hydro-mechanical real-world application on up to 1024 GPUs on the Piz Daint Supercomputer \[[1][JuliaCon20a]\]:
+ParallelStencil relies on the native kernel programming capabilities of [CUDA.jl], [AMDGPU.jl], [Metal.jl] and on [Base.Threads] for high-performance computations on GPUs and CPUs, respectively. It is seamlessly interoperable with [ImplicitGlobalGrid.jl], which renders the distributed parallelization of stencil-based GPU and CPU applications on a regular staggered grid almost trivial and enables close to ideal weak scaling of real-world applications on thousands of GPUs \[[1][JuliaCon20a], [2][JuliaCon20b], [3][JuliaCon19], [4][PASC19]\]. Moreover, ParallelStencil enables hiding communication behind computation with a simple macro call and without any particular restrictions on the package used for communication. ParallelStencil has been designed in conjunction with [ImplicitGlobalGrid.jl] for simplest possible usage by domain-scientists, rendering fast and interactive development of massively scalable high performance multi-GPU applications readily accessible to them. Furthermore, we have developed a self-contained approach for "Solving Nonlinear Multi-Physics on GPU Supercomputers with Julia" relying on ParallelStencil and [ImplicitGlobalGrid.jl] \[[1][JuliaCon20a]\]. ParallelStencil's feature to hide communication behind computation was showcased when a close to ideal weak scaling was demonstrated for a 3-D poro-hydro-mechanical real-world application on up to 1024 GPUs on the Piz Daint Supercomputer \[[1][JuliaCon20a]\]:

 ![Parallel efficiency of ParallelStencil with CUDA C backend](docs/images/par_eff_c_julia2.png)

@@ -33,7 +33,7 @@ Beyond traditional high-performance computing, ParallelStencil supports automati
 * [References](#references)

 ## Parallelization and optimization with one macro call
-A simple call to `@parallel` is enough to parallelize and optimize a function and to launch it. The package used underneath for parallelization is defined in a call to `@init_parallel_stencil` beforehand. Supported are [CUDA.jl] and [AMDGPU.jl] for running on GPU and [Base.Threads] for CPU. The following example outlines how to run parallel computations on a GPU using the native kernel programming capabilities of [CUDA.jl] underneath (omitted lines are represented with `#(...)`, omitted arguments with `...`):
+A simple call to `@parallel` is enough to parallelize and optimize a function and to launch it. The package used underneath for parallelization is defined in a call to `@init_parallel_stencil` beforehand. Supported are [CUDA.jl], [AMDGPU.jl] and [Metal.jl] for running on GPU and [Base.Threads] for CPU. The following example outlines how to run parallel computations on a GPU using the native kernel programming capabilities of [CUDA.jl] underneath (omitted lines are represented with `#(...)`, omitted arguments with `...`):
 ```julia
 #(...)
 @init_parallel_stencil(CUDA,...)
@@ -554,6 +554,7 @@ Please open an issue to discuss your idea for a contribution beforehand. Further
 [CellArrays.jl]: https://github.com/omlins/CellArrays.jl
 [CUDA.jl]: https://github.com/JuliaGPU/CUDA.jl
 [AMDGPU.jl]: https://github.com/JuliaGPU/AMDGPU.jl
+[Metal.jl]: https://github.com/JuliaGPU/Metal.jl
 [Enzyme.jl]: https://github.com/EnzymeAD/Enzyme.jl
 [MacroTools.jl]: https://github.com/FluxML/MacroTools.jl
 [StaticArrays.jl]: https://github.com/JuliaArrays/StaticArrays.jl

ext/ParallelStencil_MetalExt.jl

Lines changed: 4 additions & 0 deletions
@@ -0,0 +1,4 @@
+module ParallelStencil_MetalExt
+    include(joinpath(@__DIR__, "..", "src", "ParallelKernel", "MetalExt", "shared.jl"))
+    include(joinpath(@__DIR__, "..", "src", "ParallelKernel", "MetalExt", "allocators.jl"))
+end
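
Review note: the extension module is only a thin shell whose two `include` calls step out of `ext/` and pull the actual code from `src/ParallelKernel/MetalExt/`. The sketch below reproduces that path arithmetic with a hypothetical checkout directory standing in for `@__DIR__`; only the relative layout is taken from the diff.

```julia
# Hypothetical checkout location standing in for @__DIR__ inside ext/.
ext_dir = joinpath("ParallelStencil.jl", "ext")

# The two files named in the extension module.
files = ["shared.jl", "allocators.jl"]

# Same joinpath arguments as in the diff; normpath collapses the "ext/.." step.
resolved = [normpath(joinpath(ext_dir, "..", "src", "ParallelKernel", "MetalExt", f))
            for f in files]

foreach(println, resolved)
```

Both resolved paths land under `src/ParallelKernel/MetalExt/`, so the extension adds no code of its own beyond the module wrapper.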

src/FieldAllocators.jl

Lines changed: 15 additions & 0 deletions
@@ -28,6 +28,7 @@ To see a description of a macro type `?<macroname>` (including the `@`).
 """
 module FieldAllocators
 import ..ParallelKernel
+import ..ParallelStencil: check_initialized
 @doc replace(ParallelKernel.FieldAllocators.ALLOCATE_DOC, "@init_parallel_kernel" => "@init_parallel_stencil") macro allocate(args...) check_initialized(__module__); esc(:(ParallelStencil.ParallelKernel.FieldAllocators.@allocate($(args...)))); end
 @doc replace(ParallelKernel.FieldAllocators.FIELD_DOC, "@init_parallel_kernel" => "@init_parallel_stencil") macro Field(args...) check_initialized(__module__); esc(:(ParallelStencil.ParallelKernel.FieldAllocators.@Field($(args...)))); end
 @doc replace(ParallelKernel.FieldAllocators.VECTORFIELD_DOC, "@init_parallel_kernel" => "@init_parallel_stencil") macro VectorField(args...) check_initialized(__module__); esc(:(ParallelStencil.ParallelKernel.FieldAllocators.@VectorField($(args...)))); end
@@ -46,5 +47,19 @@ module FieldAllocators
 @doc replace(ParallelKernel.FieldAllocators.TENSORFIELD_COMP_DOC, "@init_parallel_kernel" => "@init_parallel_stencil") macro XZField(args...) check_initialized(__module__); esc(:(ParallelStencil.ParallelKernel.FieldAllocators.@XZField($(args...)))); end
 @doc replace(ParallelKernel.FieldAllocators.TENSORFIELD_COMP_DOC, "@init_parallel_kernel" => "@init_parallel_stencil") macro YZField(args...) check_initialized(__module__); esc(:(ParallelStencil.ParallelKernel.FieldAllocators.@YZField($(args...)))); end

+macro IField(args...) check_initialized(__module__); esc(:(ParallelStencil.ParallelKernel.FieldAllocators.@IField($(args...)))); end
+macro XXYField(args...) check_initialized(__module__); esc(:(ParallelStencil.ParallelKernel.FieldAllocators.@XXYField($(args...)))); end
+macro XYYField(args...) check_initialized(__module__); esc(:(ParallelStencil.ParallelKernel.FieldAllocators.@XYYField($(args...)))); end
+macro XYZField(args...) check_initialized(__module__); esc(:(ParallelStencil.ParallelKernel.FieldAllocators.@XYZField($(args...)))); end
+macro XXYZField(args...) check_initialized(__module__); esc(:(ParallelStencil.ParallelKernel.FieldAllocators.@XXYZField($(args...)))); end
+macro XYYZField(args...) check_initialized(__module__); esc(:(ParallelStencil.ParallelKernel.FieldAllocators.@XYYZField($(args...)))); end
+macro XYZZField(args...) check_initialized(__module__); esc(:(ParallelStencil.ParallelKernel.FieldAllocators.@XYZZField($(args...)))); end
+macro XXYYField(args...) check_initialized(__module__); esc(:(ParallelStencil.ParallelKernel.FieldAllocators.@XXYYField($(args...)))); end
+macro XXZZField(args...) check_initialized(__module__); esc(:(ParallelStencil.ParallelKernel.FieldAllocators.@XXZZField($(args...)))); end
+macro YYZZField(args...) check_initialized(__module__); esc(:(ParallelStencil.ParallelKernel.FieldAllocators.@YYZZField($(args...)))); end
+macro XXYYZField(args...) check_initialized(__module__); esc(:(ParallelStencil.ParallelKernel.FieldAllocators.@XXYYZField($(args...)))); end
+macro XYYZZField(args...) check_initialized(__module__); esc(:(ParallelStencil.ParallelKernel.FieldAllocators.@XYYZZField($(args...)))); end
+macro XXYZZField(args...) check_initialized(__module__); esc(:(ParallelStencil.ParallelKernel.FieldAllocators.@XXYZZField($(args...)))); end
+
 export @allocate, @Field, @VectorField, @BVectorField, @TensorField, @XField, @BXField, @YField, @BYField, @ZField, @BZField, @XXField, @YYField, @ZZField, @XYField, @XZField, @YZField
 end
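
Review note: every macro in this file follows the same forwarding pattern: check at expansion time that the calling module was initialized, then re-emit the call against the lower-level macro of the same name with the arguments untouched. The sketch below isolates that pattern in plain Julia. `Inner`, `Outer`, `init!` and the `zeros` body are hypothetical stand-ins, not the ParallelStencil implementation.

```julia
# Stand-in for ParallelKernel.FieldAllocators (hypothetical, simplified).
module Inner
    macro Field(args...)
        # The real macro allocates a backend array; here we just call zeros(...).
        esc(:(zeros($(args...))))
    end
end

# Stand-in for the ParallelStencil-level wrapper module (hypothetical).
module Outer
    const INITIALIZED = Set{Module}()            # modules that ran the init step
    init!(m::Module) = push!(INITIALIZED, m)
    check_initialized(m::Module) =
        m in INITIALIZED || error("call Outer.init! in module $m first")

    macro Field(args...)
        check_initialized(__module__)            # runs at macro-expansion time
        # Forward all arguments untouched; the escaped path resolves in the caller.
        esc(:(Inner.@Field($(args...))))
    end
end

Outer.init!(@__MODULE__)    # must be a separate top-level statement before first use
A = Outer.@Field(2, 3)      # expands to Inner.@Field(2, 3), then zeros(2, 3)
```

Because the forwarded expression is escaped, the qualified path is looked up in the caller's scope, which is why the wrappers in the diff spell out `ParallelStencil.ParallelKernel.FieldAllocators` in full. In the hunk as shown, the `export` line is unchanged context and the thirteen new macros carry no `@doc` string, so from this hunk alone they appear to be reachable only by qualified name.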
