Run a neural network without PyTorch, TensorRT, or Python — just pure CUDA.
This repository demonstrates a fully standalone C++/CUDA implementation of a multi-layer perceptron (MLP) using cuBLASLt and a few lightweight custom kernels.
It performs forward inference directly on the GPU — no frameworks, no Python, no hidden layers of abstraction.
The goal is to demonstrate how lean and transparent deep learning inference can be when we remove framework overhead and run "close to the metal."
AI frameworks like PyTorch and TensorRT are extremely capable — but they are also heavy, opaque, and difficult to control.
This project shows that even a familiar, well-optimized architecture such as an MLP can run efficiently without relying on those frameworks.
By compiling directly to CUDA/C++, all framework layers between the model and the GPU are removed.
Practical benefits:

- 🚀 High performance: Direct GPU access through optimized libraries or custom kernels, with minimal framework overhead.
- 💾 Lightweight footprint: One small binary; ideal for edge devices, restricted environments, or safety-critical systems. Unlike PyTorch, the statically compiled binary offers potential for auditability, reproducibility, and deterministic behavior.
- 🧠 No Python or PyTorch runtime: Minimal host-side overhead; the model runs directly on the GPU.
- 🔁 Stable and reproducible: No dependency hell or version mismatches; once compiled, it runs consistently on any compatible NVIDIA system. With minor adjustments, the same approach could be applied to other vendors (e.g., AMD) using HIP instead of CUDA.
- 🔍 Transparency: Every custom kernel and operation is visible, modifiable, and understandable (see the example kernel below). Vendor-provided libraries (cuDNN, cuBLAS) may remain opaque if used.
In short: you get PyTorch-level performance with better transparency and without framework complexity.
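To make the transparency point concrete, here is a minimal sketch of the kind of lightweight custom kernel this approach relies on: a fused bias-add and ReLU applied in place to a GEMM output. The kernel name and launch configuration are illustrative, not copied verbatim from this repository.

```cpp
// Illustrative sketch: fused bias-add + ReLU over a row-major [rows x cols]
// activation matrix, applied in place after a GEMM.
__global__ void bias_relu_kernel(float* __restrict__ y,
                                 const float* __restrict__ bias,
                                 int rows, int cols) {
    size_t idx = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < (size_t)rows * cols) {
        int col = idx % cols;              // which output feature
        float v = y[idx] + bias[col];      // add per-feature bias
        y[idx] = v > 0.0f ? v : 0.0f;      // ReLU in place
    }
}

// Launch: one thread per element.
// int threads = 256;
// int blocks  = (rows * cols + threads - 1) / threads;
// bias_relu_kernel<<<blocks, threads, 0, stream>>>(y, bias, rows, cols);
```

The entire epilogue fits in a dozen lines that can be read, audited, and modified directly.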
This example MLP was largely created through a specialized LLM-driven workflow — an early prototype of what an AI compiler could become.
The long-term goal: let AI researchers write models in PyTorch, and let an intelligent compiler automatically generate optimized CUDA/C++ code for direct deployment.
Such an LLM-based compiler could:
- Remove heavy dependencies on Python and large frameworks.
- Automatically tune kernels and memory layouts for specific GPUs.
- Provide readable, auditable source code instead of opaque binaries.
- Enable cross-vendor optimization, not limited to NVIDIA hardware.
While this repo does not include the full LLM workflow, it demonstrates the feasibility and performance of this idea.
- Pure C++/CUDA — no frameworks or Python required
- Uses cuBLASLt for Tensor Core acceleration (TF32); see the sketch below
- Performance comparable to (or faster than) PyTorch inference
- Transparent and modifiable implementation
- Educational foundation for future AI-compilers
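As a concrete illustration of the cuBLASLt path, here is a minimal sketch of how a single linear layer's GEMM can be issued with TF32 Tensor Cores enabled. Function and variable names are hypothetical and error checking is omitted; the actual implementation in this repository may organize the calls differently.

```cpp
#include <cublasLt.h>
#include <cuda_runtime.h>

// Minimal sketch: one FP32 GEMM (C = A * B, column-major, cuBLAS convention)
// with TF32 Tensor Cores enabled through the compute type. A real
// implementation should check every returned status and that `found > 0`.
void gemm_tf32(cublasLtHandle_t lt,
               const float* A, const float* B, float* C,
               int m, int n, int k,
               void* workspace, size_t workspaceSize,
               cudaStream_t stream)
{
    cublasLtMatmulDesc_t op;
    // CUBLAS_COMPUTE_32F_FAST_TF32 routes FP32 inputs through TF32 Tensor
    // Cores; CUBLAS_COMPUTE_32F would keep the math in plain FP32.
    cublasLtMatmulDescCreate(&op, CUBLAS_COMPUTE_32F_FAST_TF32, CUDA_R_32F);

    cublasLtMatrixLayout_t aDesc, bDesc, cDesc;
    cublasLtMatrixLayoutCreate(&aDesc, CUDA_R_32F, m, k, m);  // lda = m
    cublasLtMatrixLayoutCreate(&bDesc, CUDA_R_32F, k, n, k);  // ldb = k
    cublasLtMatrixLayoutCreate(&cDesc, CUDA_R_32F, m, n, m);  // ldc = m

    // Let cuBLASLt pick an algorithm within the given workspace budget.
    cublasLtMatmulPreference_t pref;
    cublasLtMatmulPreferenceCreate(&pref);
    cublasLtMatmulPreferenceSetAttribute(
        pref, CUBLASLT_MATMUL_PREF_MAX_WORKSPACE_BYTES,
        &workspaceSize, sizeof(workspaceSize));

    cublasLtMatmulHeuristicResult_t heur;
    int found = 0;
    cublasLtMatmulAlgoGetHeuristic(lt, op, aDesc, bDesc, cDesc, cDesc,
                                   pref, 1, &heur, &found);

    const float alpha = 1.0f, beta = 0.0f;
    cublasLtMatmul(lt, op, &alpha, A, aDesc, B, bDesc, &beta,
                   C, cDesc, C, cDesc, &heur.algo,
                   workspace, workspaceSize, stream);

    cublasLtMatmulPreferenceDestroy(pref);
    cublasLtMatrixLayoutDestroy(cDesc);
    cublasLtMatrixLayoutDestroy(bDesc);
    cublasLtMatrixLayoutDestroy(aDesc);
    cublasLtMatmulDescDestroy(op);
}
```

A full layer would then typically follow such a GEMM with a bias/activation kernel like the sketch shown earlier.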
Below is an example plot of PyTorch vs Custom CUDA forward runtime across batch sizes on an RTX 5090:
- The CUDA implementation consistently outperforms PyTorch.
- Runtime scales approximately linearly with batch size due to the tiled implementation.
- Peak memory usage is controlled via tiling, avoiding out-of-memory issues for large batches.
- 🔍 Performance note: Part of the speedup comes from TF32 (Tensor Cores), which PyTorch does not enable by default in FP32 mode. This highlights one of the advantages of a compiler-assisted or LLM-driven workflow: it can automatically expose hidden hardware performance knobs that most developers and standard frameworks don’t activate, while still ensuring numerical correctness.
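Concretely, the knob in question is a single compute-type argument on the cuBLASLt side (PyTorch exposes the equivalent switch through `torch.backends.cuda.matmul.allow_tf32`). A minimal, hypothetical illustration:

```cpp
#include <cublasLt.h>

// The only difference between the plain FP32 path and the TF32 Tensor Core
// path used in the sketch above is the compute type passed when the matmul
// descriptor is created.
int main() {
    cublasLtMatmulDesc_t op;

    // Plain FP32 math, comparable to PyTorch's float32 default:
    cublasLtMatmulDescCreate(&op, CUBLAS_COMPUTE_32F, CUDA_R_32F);
    cublasLtMatmulDescDestroy(op);

    // TF32 Tensor Core path (the "hidden knob" referred to above):
    cublasLtMatmulDescCreate(&op, CUBLAS_COMPUTE_32F_FAST_TF32, CUDA_R_32F);
    cublasLtMatmulDescDestroy(op);
    return 0;
}
```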
- GPU: NVIDIA RTX 5090 (tested)
- CUDA Toolkit: 13.0 installed on the system
- PyTorch CUDA version: 12.9 packages (compatible with CUDA 13 drivers)
- Python: 3.12
- Conda environment: packages installed from `requirements.txt`

Note: PyTorch currently provides precompiled binaries for CUDA 12.9 (`cu12`). These packages run correctly on systems with CUDA 13 drivers, so the environment is fully compatible with the hardware listed above.
Create a Conda environment to isolate dependencies.

```bash
# Create a conda environment
conda create -n mlp_env python=3.12 -y
conda activate mlp_env

# Install PyTorch with the appropriate CUDA version
pip3 install torch --index-url https://download.pytorch.org/whl/cu130

# Install Python dependencies
pip install -r requirements.txt
```

Then build the C++/CUDA code:

```bash
# Create build folder
mkdir -p build
cd build

# Run CMake and compile
cmake ..
make -j
```

This will compile the custom CUDA kernels.
Before running the evaluation, generate the synthetic weights using the provided script:

```bash
python scripts/create_synthetic_weights.py
```

This will create the weights in the `data/` folder, which `evaluate.py` reads automatically.
Run the evaluation:

```bash
python evaluate.py --B 16384 --tileB 16384
```

Where:

- `--B` is the batch size
- `--tileB` is the tile size used for tiled inference (see below)
To verify numerical correctness, the script computes absolute and relative differences between PyTorch and CUDA outputs.
Note: Relative differences can be misleading when some outputs are very close to zero while others are large; even a tiny absolute difference can then appear as a large relative error. Both absolute and relative differences are therefore reported to capture correctness comprehensively.
Additional context:
- Synthetic weights are scaled down to reduce floating-point accumulation errors.
- Comparisons use FP32 (PyTorch default) vs TF32 (CUDA) on very large batches (up to 262,144 samples).
- TF32 has fewer mantissa bits than FP32, so small numerical differences are expected, particularly for very small or very large values.
Differences remain within expected numerical tolerances, ensuring functional correctness.
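For illustration, the comparison boils down to something like the following hypothetical helper (the actual comparison in `evaluate.py` is done in Python on the two output tensors). The epsilon in the denominator marks exactly where near-zero reference values can inflate the relative error:

```cpp
#include <cmath>
#include <cstddef>
#include <algorithm>

// Hypothetical illustration of the reported metrics: maximum absolute and
// maximum relative difference between a reference output and the CUDA output.
struct DiffStats { float max_abs; float max_rel; };

DiffStats compare_outputs(const float* ref, const float* out, size_t n,
                          float eps = 1e-6f) {
    DiffStats s{0.0f, 0.0f};
    for (size_t i = 0; i < n; ++i) {
        float abs_diff = std::fabs(out[i] - ref[i]);
        // When |ref[i]| is near zero, rel_diff is dominated by eps and can
        // look large even though abs_diff is negligible.
        float rel_diff = abs_diff / (std::fabs(ref[i]) + eps);
        s.max_abs = std::max(s.max_abs, abs_diff);
        s.max_rel = std::max(s.max_rel, rel_diff);
    }
    return s;
}
```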
For very large batch sizes (e.g., B = 65536), allocating a single contiguous tensor would exceed the available VRAM (>8 GB for hidden activations) on my RTX 5090.
To avoid out-of-memory errors, tiled inference was used:
- The batch is split into smaller tiles of size `tileB`
- Each tile is processed sequentially, reducing peak memory usage
- The outputs are written into a pre-allocated output tensor
This allows efficient inference at extremely large batch sizes without exceeding GPU memory limits.
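A minimal sketch of what the host-side tiling loop amounts to (`mlp_forward_device` is a hypothetical placeholder for the repository's forward function):

```cpp
#include <cuda_runtime.h>
#include <algorithm>

// Hypothetical forward function operating on a tile of `rows` samples that
// already reside on the GPU; stands in for this repo's actual forward pass.
void mlp_forward_device(const float* d_in, float* d_out, int rows,
                        cudaStream_t stream);

// Tiled inference sketch: process the batch in chunks of tileB samples so
// that intermediate activations are only ever allocated for one tile at a time.
void run_tiled(const float* d_input, float* d_output,
               int B, int tileB, int in_dim, int out_dim,
               cudaStream_t stream)
{
    for (int start = 0; start < B; start += tileB) {
        int rows = std::min(tileB, B - start);
        // Each tile reads and writes its own slice of the pre-allocated
        // input/output buffers, so no batch-sized temporaries are needed.
        mlp_forward_device(d_input  + (size_t)start * in_dim,
                           d_output + (size_t)start * out_dim,
                           rows, stream);
    }
}
```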
MIT
