Description
This issue has come up multiple times on Discourse:

- https://discourse.julialang.org/t/memory-usage-increasing-with-each-epoch/121798
- https://discourse.julialang.org/t/flux-memory-usage-high-in-srcnn/115174
- https://discourse.julialang.org/t/out-of-memory-using-flux-cnn-during-back-propagation-phase/24492
- https://discourse.julialang.org/t/flux-gpu-memory-problems/79783
It could be related to #828, #302, #736, and JuliaGPU/CUDA.jl#137.
This is a minimal example, involving only the forward pass, on Flux's master:
```julia
using Flux
using Statistics, Random
using CUDA

function train_mlp()
    d_in = 128
    d_out = 128
    batch_size = 128
    num_iters = 10

    device = gpu_device()
    model = Dense(d_in => d_out) |> device
    x = randn(Float32, d_in, batch_size) |> device

    for iter in 1:num_iters
        ŷ = model(x)
        @info iter
        # GC.gc(true)
        CUDA.pool_status()
    end
end

train_mlp()
# GC.gc(true)
# CUDA.reclaim()
```
with output:

```
[ Info: 1
Effective GPU memory usage: 2.98% (716.688 MiB/23.465 GiB)
Memory pool usage: 1.586 MiB (32.000 MiB reserved)
[ Info: 2
Effective GPU memory usage: 2.98% (716.688 MiB/23.465 GiB)
Memory pool usage: 2.091 MiB (32.000 MiB reserved)
[ Info: 3
Effective GPU memory usage: 2.98% (716.688 MiB/23.465 GiB)
Memory pool usage: 2.596 MiB (32.000 MiB reserved)
[ Info: 4
Effective GPU memory usage: 2.98% (716.688 MiB/23.465 GiB)
Memory pool usage: 3.101 MiB (32.000 MiB reserved)
[ Info: 5
Effective GPU memory usage: 2.98% (716.688 MiB/23.465 GiB)
Memory pool usage: 3.606 MiB (32.000 MiB reserved)
[ Info: 6
Effective GPU memory usage: 2.98% (716.688 MiB/23.465 GiB)
Memory pool usage: 4.110 MiB (32.000 MiB reserved)
[ Info: 7
Effective GPU memory usage: 2.98% (716.688 MiB/23.465 GiB)
Memory pool usage: 4.615 MiB (32.000 MiB reserved)
[ Info: 8
Effective GPU memory usage: 2.98% (716.688 MiB/23.465 GiB)
Memory pool usage: 5.120 MiB (32.000 MiB reserved)
[ Info: 9
Effective GPU memory usage: 2.98% (716.688 MiB/23.465 GiB)
Memory pool usage: 5.625 MiB (32.000 MiB reserved)
[ Info: 10
Effective GPU memory usage: 2.98% (716.688 MiB/23.465 GiB)
Memory pool usage: 6.130 MiB (32.000 MiB reserved)
```
Running `train_mlp()` multiple times, the memory usage keeps increasing and more and more memory is reserved.
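For instance, something like the following (a sketch reusing the `train_mlp` defined above; the run count is arbitrary) makes the cross-call growth visible:

```julia
# Sketch: each call to train_mlp() leaves the pool larger than before,
# so the reserved memory keeps growing across runs, not just within one run.
for run in 1:5
    @info "run $run"
    train_mlp()
end
CUDA.pool_status()
```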
Mitigation strategies are to set a memory limit, e.g.

```julia
ENV["JULIA_CUDA_HARD_MEMORY_LIMIT"] = "10%"
ENV["JULIA_CUDA_SOFT_MEMORY_LIMIT"] = "5%"
```
or to manually run the garbage collector with

```julia
GC.gc(true)
```

which slows training down a lot if done every iteration.
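One compromise, sketched below, is to collect only every few iterations rather than every iteration; `gc_every = 5` and the helper name are arbitrary choices for illustration, not part of the original report:

```julia
using Flux, CUDA

# Sketch: amortize the GC cost by running a full collection only every
# `gc_every` iterations instead of on every iteration.
function train_mlp_periodic_gc(; num_iters = 10, gc_every = 5)
    device = gpu_device()
    model = Dense(128 => 128) |> device
    x = randn(Float32, 128, 128) |> device

    for iter in 1:num_iters
        ŷ = model(x)
        if iter % gc_every == 0
            GC.gc(true)     # free CuArrays that are no longer referenced
            CUDA.reclaim()  # return freed pool memory to the driver
        end
        CUDA.pool_status()
    end
end
```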
This behavior is highly problematic because training runs quickly fill the GPU and prevent other GPU processes from running.
cc @maleadt