Description
This issue has come up multiple times on Discourse:

- https://discourse.julialang.org/t/memory-usage-increasing-with-each-epoch/121798
- https://discourse.julialang.org/t/flux-memory-usage-high-in-srcnn/115174
- https://discourse.julialang.org/t/out-of-memory-using-flux-cnn-during-back-propagation-phase/24492
- https://discourse.julialang.org/t/flux-gpu-memory-problems/79783
It could be related to #828, #302, #736, and JuliaGPU/CUDA.jl#137.
This is a minimal example, involving only the forward pass, on Flux's master:
```julia
using Flux
using Statistics, Random
using CUDA

function train_mlp()
    d_in = 128
    d_out = 128
    batch_size = 128
    num_iters = 10

    device = gpu_device()
    model = Dense(d_in => d_out) |> device
    x = randn(Float32, d_in, batch_size) |> device

    for iter in 1:num_iters
        ŷ = model(x)
        @info iter
        # GC.gc(true)
        CUDA.pool_status()
    end
end

train_mlp()
# GC.gc(true)
# CUDA.reclaim()
```
with output:

```
[ Info: 1
Effective GPU memory usage: 2.98% (716.688 MiB/23.465 GiB)
Memory pool usage: 1.586 MiB (32.000 MiB reserved)
[ Info: 2
Effective GPU memory usage: 2.98% (716.688 MiB/23.465 GiB)
Memory pool usage: 2.091 MiB (32.000 MiB reserved)
[ Info: 3
Effective GPU memory usage: 2.98% (716.688 MiB/23.465 GiB)
Memory pool usage: 2.596 MiB (32.000 MiB reserved)
[ Info: 4
Effective GPU memory usage: 2.98% (716.688 MiB/23.465 GiB)
Memory pool usage: 3.101 MiB (32.000 MiB reserved)
[ Info: 5
Effective GPU memory usage: 2.98% (716.688 MiB/23.465 GiB)
Memory pool usage: 3.606 MiB (32.000 MiB reserved)
[ Info: 6
Effective GPU memory usage: 2.98% (716.688 MiB/23.465 GiB)
Memory pool usage: 4.110 MiB (32.000 MiB reserved)
[ Info: 7
Effective GPU memory usage: 2.98% (716.688 MiB/23.465 GiB)
Memory pool usage: 4.615 MiB (32.000 MiB reserved)
[ Info: 8
Effective GPU memory usage: 2.98% (716.688 MiB/23.465 GiB)
Memory pool usage: 5.120 MiB (32.000 MiB reserved)
[ Info: 9
Effective GPU memory usage: 2.98% (716.688 MiB/23.465 GiB)
Memory pool usage: 5.625 MiB (32.000 MiB reserved)
[ Info: 10
Effective GPU memory usage: 2.98% (716.688 MiB/23.465 GiB)
Memory pool usage: 6.130 MiB (32.000 MiB reserved)
```
Running `train_mlp()` multiple times, the memory usage keeps increasing and more and more memory is reserved.
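For instance, something like the following (a sketch reusing the `train_mlp` defined above; the run count is arbitrary) makes the cross-call growth visible:

```julia
# Sketch: each call to train_mlp() leaves the pool larger than before,
# so the reserved memory keeps growing across runs, not just within one run.
for run in 1:5
    @info "run $run"
    train_mlp()
end
CUDA.pool_status()
```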
Mitigation strategies are to set a memory limit, e.g.

```julia
ENV["JULIA_CUDA_HARD_MEMORY_LIMIT"] = "10%"
ENV["JULIA_CUDA_SOFT_MEMORY_LIMIT"] = "5%"
```
or to manually run the garbage collector with

```julia
GC.gc(true)
```

which slows training down a lot if done every iteration.
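One compromise, sketched below, is to collect only every few iterations rather than every iteration; `gc_every = 5` and the helper name are arbitrary choices for illustration, not part of the original report:

```julia
using Flux, CUDA

# Sketch: amortize the GC cost by running a full collection only every
# `gc_every` iterations instead of on every iteration.
function train_mlp_periodic_gc(; num_iters = 10, gc_every = 5)
    device = gpu_device()
    model = Dense(128 => 128) |> device
    x = randn(Float32, 128, 128) |> device

    for iter in 1:num_iters
        ŷ = model(x)
        if iter % gc_every == 0
            GC.gc(true)     # free CuArrays that are no longer referenced
            CUDA.reclaim()  # return freed pool memory to the driver
        end
        CUDA.pool_status()
    end
end
```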
This behavior is highly problematic because training runs quickly fill the GPU and prevent other GPU processes from running.
cc @maleadt