Skip to content

cuda gpu memory usage increasing in time #2523

Open
@CarloLucibello

Description

@CarloLucibello

This issue has emerged multiple times on discord

https://discourse.julialang.org/t/memory-usage-increasing-with-each-epoch/121798
https://discourse.julialang.org/t/flux-memory-usage-high-in-srcnn/115174
https://discourse.julialang.org/t/out-of-memory-using-flux-cnn-during-back-propagation-phase/24492
https://discourse.julialang.org/t/flux-gpu-memory-problems/79783

and it could be related to #828 #302 #736 and JuliaGPU/CUDA.jl#137

This is a minimal example, involving only the forward pass, on Flux's master:

using Flux
using Statistics, Random

using CUDA

function train_mlp()
    d_in = 128
    d_out = 128
    batch_size = 128
    num_iters = 10
    device = gpu_device()
    
    model = Dense(d_in => d_out) |> device
    x = randn(Float32, d_in, batch_size) |> device
    for iter in 1:num_iters
        ŷ = model(x)
        @info iter
        # GC.gc(true)
        CUDA.pool_status()
    end
end

train_mlp()
# GC.gc(true)
# CUDA.raclaim()

with output

[ Info: 1
Effective GPU memory usage: 2.98% (716.688 MiB/23.465 GiB)
Memory pool usage: 1.586 MiB (32.000 MiB reserved)
[ Info: 2
Effective GPU memory usage: 2.98% (716.688 MiB/23.465 GiB)
Memory pool usage: 2.091 MiB (32.000 MiB reserved)
[ Info: 3
Effective GPU memory usage: 2.98% (716.688 MiB/23.465 GiB)
Memory pool usage: 2.596 MiB (32.000 MiB reserved)
[ Info: 4
Effective GPU memory usage: 2.98% (716.688 MiB/23.465 GiB)
Memory pool usage: 3.101 MiB (32.000 MiB reserved)
[ Info: 5
Effective GPU memory usage: 2.98% (716.688 MiB/23.465 GiB)
Memory pool usage: 3.606 MiB (32.000 MiB reserved)
[ Info: 6
Effective GPU memory usage: 2.98% (716.688 MiB/23.465 GiB)
Memory pool usage: 4.110 MiB (32.000 MiB reserved)
[ Info: 7
Effective GPU memory usage: 2.98% (716.688 MiB/23.465 GiB)
Memory pool usage: 4.615 MiB (32.000 MiB reserved)
[ Info: 8
Effective GPU memory usage: 2.98% (716.688 MiB/23.465 GiB)
Memory pool usage: 5.120 MiB (32.000 MiB reserved)
[ Info: 9
Effective GPU memory usage: 2.98% (716.688 MiB/23.465 GiB)
Memory pool usage: 5.625 MiB (32.000 MiB reserved)
[ Info: 10
Effective GPU memory usage: 2.98% (716.688 MiB/23.465 GiB)
Memory pool usage: 6.130 MiB (32.000 MiB reserved)

Running multiple times train_mlp() the memory usage keeps ever increasing and more and more memory is reserved.

Mitigation strategies are to set memory limit like

ENV["JULIA_CUDA_HARD_MEMORY_LIMIT"] = "10%"
ENV["JULIA_CUDA_SOFT_MEMORY_LIMIT"] = "5%"

or to manually run the garbage collector

GC.gc(true)

which slows done a lot if done every iteration.

This behavior is highly problematic because training runs quickly fill the gpu and one cannot run other gpu processes.

cc @maleadt

Metadata

Metadata

Assignees

No one assigned

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions