Closed
Labels: lifecycle/stale (Denotes an issue or PR has remained open with no activity and has become stale.)
Description
1. Issue or feature description
I am trying to figure out why my container cannot allocate a pinned-memory block larger than about 1 GB of RAM. One of our algorithms uses a fixed 2 GB pool of pinned memory that it pulls blocks from and returns them to (a simplified sketch follows). This code works natively on every system I have tested it on, but inside Docker it fails. I am using Docker Desktop for Windows with WSL2.
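For context, the pool behaves roughly like the sketch below. This is a simplified illustration I am adding here, not the production code; the names (PinnedPool, pull, push) and sizes are placeholders, and the real pool is 2 GB.
#include <cuda_runtime_api.h>
#include <cstddef>
#include <stdexcept>
#include <vector>
// Simplified fixed pinned-memory pool: one large cudaMallocHost up front,
// carved into equal blocks that callers pull and return.
class PinnedPool {
public:
    PinnedPool( size_t pool_bytes, size_t block_bytes )
    {
        // The single large pinned allocation -- the call that fails in the container.
        if ( cudaMallocHost( &base_, pool_bytes ) != cudaSuccess )
            throw std::runtime_error( "cudaMallocHost failed" );
        for ( size_t off = 0; off + block_bytes <= pool_bytes; off += block_bytes )
            free_list_.push_back( static_cast<char*>( base_ ) + off );
    }
    ~PinnedPool() { cudaFreeHost( base_ ); }
    void * pull()                                        // hand out one pinned block
    {
        if ( free_list_.empty() ) return nullptr;
        void * p = free_list_.back();
        free_list_.pop_back();
        return p;
    }
    void push( void * p ) { free_list_.push_back( p ); } // return a block to the pool
private:
    void * base_ = nullptr;
    std::vector<void*> free_list_;
};
int main()
{
    PinnedPool pool( 64ull << 20, 1ull << 20 );  // 64 MiB pool, 1 MiB blocks (illustrative)
    void * block = pool.pull();
    pool.push( block );
    return 0;
}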
Here is the code I am actually using. I made it as simple as possible to isolate the memory allocation issue:
#include <cuda_runtime_api.h>
#include <iostream>
#include <cstdio>
#include <stdio.h>
#include <stdexcept>
#include <string>
/**
* @def checkCudaError( cudaError )
* @brief Macro to call the cuda check function. Simply wrap this macro around any cuda runtime api library calls being made.
* @param cudaError result from a cuda runtime api function call.
* @ingroup CUDA
*/
#define checkCudaError( cudaError ) __checkCudaError( cudaError, __FILE__, __LINE__ )
/**
* @brief Checks the return value of any cuda function for errors.
* If an error occurs, the line number and error type are displayed for debugging purposes. Additionally,
* the application will be terminated when an error is encountered. This method is preferred to the old method of
* checking cuda errors through looking at the last error as it allows the error to be isolated to a single line
* and file.
* NOTE: This function is not explicitly called. It can only be called properly via the use of the macro checkCudaError
* above.
*
* This can be used with kernel launches as well, by calling cudaPeekAtLastError and cudaDeviceSynchronize and
* wrapping both functions with the macro above.
* @ingroup CUDA
* @param result_t result from a cuda runtime api library function
* @param file File that the cuda check was used in
* @param line Line number of the cuda check
*/
inline void __checkCudaError ( cudaError_t result_t, const char * file, const int line )
{
    std::string error_string;
    // Ignore both success and driver shutting down when throwing exceptions.
    if ( cudaSuccess != result_t && cudaErrorCudartUnloading != result_t )
    {
        fprintf ( stderr, "\x1B[31m CUDA error encountered in file '%s', line %d\n Error %d: %s\n Terminating FIRE!\n \x1B[0m", file, line, result_t,
                  cudaGetErrorString ( result_t ) );
        std::cerr << "CUDA error encountered: " + std::string( cudaGetErrorString ( result_t ) ) + ". Terminating application." << std::endl;
        throw std::runtime_error ( "checkCUDAError : ERROR: CUDA Error" );
    }
}
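// Example of the kernel-launch pattern described in the comment above
// (illustrative only; my_kernel, grid, block, and args are placeholders and
// this is not part of the reproducer):
//
//     my_kernel<<<grid, block>>>( args );
//     checkCudaError( cudaPeekAtLastError() );
//     checkCudaError( cudaDeviceSynchronize() );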
int main( int argc, char * argv[] )
{
    void * data;
    void * gpu;
    checkCudaError(cudaSetDevice(0));
    checkCudaError(cudaMalloc(&gpu, 2147483648ull));       // 2 GiB device allocation succeeds
    int attr;
    checkCudaError(cudaDeviceGetAttribute(&attr, cudaDevAttrHostRegisterSupported, 0));
    std::cout << "Host Register supported: " << attr << std::endl;
    checkCudaError(cudaFree(gpu));
    checkCudaError(cudaMallocHost(&data, 1024ull*1024ull*1024ull*2ull));   // 2 GiB pinned allocation: fails in the container
    checkCudaError(cudaFreeHost(data));
}

My Dockerfile:
FROM nvidia/cuda:11.8.0-devel-ubuntu22.04
ARG DEBIAN_FRONTEND=noninteractive
RUN apt-get update
ENV CMAKE_VERSION 3.18.4
ENV CMAKE_SH cmake-${CMAKE_VERSION}-Linux-x86_64.sh
ENV CMAKE_URL https://github.com/Kitware/CMake/releases/download/v$CMAKE_VERSION/$CMAKE_SH
RUN apt-get install -y wget
RUN mkdir /cmake && cd /cmake \
&& wget --no-check-certificate $CMAKE_URL \
&& chmod +x ${CMAKE_SH} \
&& ./${CMAKE_SH} --prefix=/usr/local --skip-license \
&& cmake --version
ADD . /test
RUN mkdir -p /test/build
RUN cd /test/build && \
cmake .. && make
WORKDIR /test/build/
ENTRYPOINT ["./test_cuda"]
CMake build script (CMakeLists.txt) used by the Dockerfile:
cmake_minimum_required(VERSION 3.18)  # assumed minimum; the original listing omitted this line
project(TEST LANGUAGES CUDA CXX)
find_package( CUDAToolkit REQUIRED)
add_executable( test_cuda test_cuda.cpp)
target_link_libraries( test_cuda PRIVATE CUDA::cudart)

I am using a T1000 GPU with 4 GB of dedicated VRAM. All code runs natively on the system in question without issue.
2. Steps to reproduce the issue
- Create a folder with the following files:
  a. Dockerfile - contents of the Dockerfile above.
  b. test_cuda.cpp - contents of the C++/CUDA source code above.
  c. CMakeLists.txt - contents of the CMake code above.
- Go to this folder in a terminal.
- Build the container:
docker build -t cuda_test:latest .
- Run the container. This is the command I am currently using:
docker run --gpus=all --ulimit memlock=-1 --rm cuda_test:latest
Resulting output:
CUDA error encountered in file '/test/test_cuda.cpp', line 60
Error 2: out of memory
Terminating FIRE!
CUDA error encountered: out of memory. Terminating application.
terminate called after throwing an instance of 'std::runtime_error'
what(): checkCUDAError : ERROR: CUDA Error
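To narrow down where the limit actually sits, a probe along the following lines can help (this is a diagnostic sketch I am adding, not part of the reproducer above): it prints the RLIMIT_MEMLOCK value the container actually received, then binary-searches the largest cudaMallocHost size that succeeds.
#include <cuda_runtime_api.h>
#include <sys/resource.h>
#include <cstdio>
int main()
{
    // Show the memlock limit as seen inside the container.
    rlimit rl{};
    getrlimit( RLIMIT_MEMLOCK, &rl );
    if ( rl.rlim_cur == RLIM_INFINITY )
        printf( "RLIMIT_MEMLOCK: unlimited\n" );
    else
        printf( "RLIMIT_MEMLOCK: %llu bytes\n", (unsigned long long) rl.rlim_cur );
    // Binary-search the largest pinned allocation that succeeds.
    size_t lo = 0, hi = 4ull << 30;       // search up to 4 GiB
    while ( hi - lo > (1ull << 20) )      // stop at 1 MiB resolution
    {
        size_t mid = lo + ( hi - lo ) / 2;
        void * p = nullptr;
        if ( cudaMallocHost( &p, mid ) == cudaSuccess )
        {
            cudaFreeHost( p );
            lo = mid;                     // mid bytes still pin successfully
        }
        else
        {
            cudaGetLastError();           // clear the error before retrying
            hi = mid;
        }
    }
    printf( "Largest successful pinned allocation: ~%zu MiB\n", lo >> 20 );
    return 0;
}
Inside the container this should pinpoint the threshold that the 2 GiB reproducer hits; natively I would expect the full size to succeed.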
3. Information to attach (optional if deemed irrelevant)
- Some nvidia-container information: nvidia-container-cli -k -d /dev/tty info
- Kernel version from uname -a
- Any relevant kernel output lines from dmesg
- Driver information from nvidia-smi -a
- Docker version from docker version
- NVIDIA packages version from dpkg -l '*nvidia*' or rpm -qa '*nvidia*'
- NVIDIA container library version from nvidia-container-cli -V
- NVIDIA container library logs (see troubleshooting)
- Docker command, image and tag used