Run a neural network without PyTorch, TensorRT, or Python — just pure CUDA.
This repository demonstrates a fully standalone C++/CUDA implementation of a multi-layer perceptron (MLP) using cuBLASLt and a few lightweight custom kernels.
It performs forward inference directly on the GPU — no frameworks, no Python, no hidden layers of abstraction.
The goal is to demonstrate how lean and transparent deep learning inference can be when we remove framework overhead and run "close to the metal."
AI frameworks like PyTorch and TensorRT are extremely capable — but they are also heavy, opaque, and difficult to control.
This project shows that even a familiar, well-optimized architecture such as an MLP can run efficiently without relying on those frameworks.
By compiling directly to CUDA/C++, all framework layers between the model and the GPU are removed.
Practical benefits:

- 🚀 High performance: Direct GPU access through optimized libraries or custom kernels, with minimal framework overhead.
- 💾 Lightweight footprint: One small binary; ideal for edge devices, restricted environments, or safety-critical systems. Unlike PyTorch, the statically compiled binary offers potential for auditability, reproducibility, and deterministic behavior.
- 🧠 No Python or PyTorch runtime: Minimal host-side overhead; the model runs directly on the GPU.
- 🔁 Stable and reproducible: No dependency hell or version mismatches; once compiled, it runs consistently on any compatible NVIDIA system. With minor adjustments, the same approach could be applied to other vendors (e.g., AMD) using HIP instead of CUDA.
- 🔍 Transparency: Every custom kernel and operation is visible, modifiable, and understandable (see the example kernel below). Vendor-provided libraries (cuDNN, cuBLAS) may remain opaque if used.
In short: you get PyTorch-level performance with better transparency and without framework complexity.
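To make the transparency point concrete, here is a minimal sketch of the kind of lightweight custom kernel this approach relies on: a fused bias-add and ReLU applied in place to a GEMM output. The kernel name and launch configuration are illustrative, not copied verbatim from this repository.

```cpp
// Illustrative sketch: fused bias-add + ReLU over a row-major [rows x cols]
// activation matrix, applied in place after a GEMM.
__global__ void bias_relu_kernel(float* __restrict__ y,
                                 const float* __restrict__ bias,
                                 int rows, int cols) {
    size_t idx = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < (size_t)rows * cols) {
        int col = idx % cols;              // which output feature
        float v = y[idx] + bias[col];      // add per-feature bias
        y[idx] = v > 0.0f ? v : 0.0f;      // ReLU in place
    }
}

// Launch: one thread per element.
// int threads = 256;
// int blocks  = (rows * cols + threads - 1) / threads;
// bias_relu_kernel<<<blocks, threads, 0, stream>>>(y, bias, rows, cols);
```

The entire epilogue fits in a dozen lines that can be read, audited, and modified directly.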
This example MLP was largely created through a specialized LLM-driven workflow — an early prototype of what an AI compiler could become.
The long-term goal: let AI researchers write models in PyTorch, and let an intelligent compiler automatically generate optimized CUDA/C++ code for direct deployment.
Such an LLM-based compiler could:
- Remove heavy dependencies on Python and large frameworks.
- Automatically tune kernels and memory layouts for specific GPUs.
- Provide readable, auditable source code instead of opaque binaries.
- Enable cross-vendor optimization, not limited to NVIDIA hardware.
While this repo does not include the full LLM workflow, it demonstrates the feasibility and performance of this idea.
- Pure C++/CUDA — no frameworks or Python required
- Uses cuBLASLt for Tensor Core acceleration (TF32); see the sketch below
- Performance comparable to (or faster than) PyTorch inference
- Transparent and modifiable implementation
- Educational foundation for future AI-compilers
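As a concrete illustration of the cuBLASLt path, here is a minimal sketch of how a single linear layer's GEMM can be issued with TF32 Tensor Cores enabled. Function and variable names are hypothetical and error checking is omitted; the actual implementation in this repository may organize the calls differently.

```cpp
#include <cublasLt.h>
#include <cuda_runtime.h>

// Minimal sketch: one FP32 GEMM (C = A * B, column-major, cuBLAS convention)
// with TF32 Tensor Cores enabled through the compute type. A real
// implementation should check every returned status and that `found > 0`.
void gemm_tf32(cublasLtHandle_t lt,
               const float* A, const float* B, float* C,
               int m, int n, int k,
               void* workspace, size_t workspaceSize,
               cudaStream_t stream)
{
    cublasLtMatmulDesc_t op;
    // CUBLAS_COMPUTE_32F_FAST_TF32 routes FP32 inputs through TF32 Tensor
    // Cores; CUBLAS_COMPUTE_32F would keep the math in plain FP32.
    cublasLtMatmulDescCreate(&op, CUBLAS_COMPUTE_32F_FAST_TF32, CUDA_R_32F);

    cublasLtMatrixLayout_t aDesc, bDesc, cDesc;
    cublasLtMatrixLayoutCreate(&aDesc, CUDA_R_32F, m, k, m);  // lda = m
    cublasLtMatrixLayoutCreate(&bDesc, CUDA_R_32F, k, n, k);  // ldb = k
    cublasLtMatrixLayoutCreate(&cDesc, CUDA_R_32F, m, n, m);  // ldc = m

    // Let cuBLASLt pick an algorithm within the given workspace budget.
    cublasLtMatmulPreference_t pref;
    cublasLtMatmulPreferenceCreate(&pref);
    cublasLtMatmulPreferenceSetAttribute(
        pref, CUBLASLT_MATMUL_PREF_MAX_WORKSPACE_BYTES,
        &workspaceSize, sizeof(workspaceSize));

    cublasLtMatmulHeuristicResult_t heur;
    int found = 0;
    cublasLtMatmulAlgoGetHeuristic(lt, op, aDesc, bDesc, cDesc, cDesc,
                                   pref, 1, &heur, &found);

    const float alpha = 1.0f, beta = 0.0f;
    cublasLtMatmul(lt, op, &alpha, A, aDesc, B, bDesc, &beta,
                   C, cDesc, C, cDesc, &heur.algo,
                   workspace, workspaceSize, stream);

    cublasLtMatmulPreferenceDestroy(pref);
    cublasLtMatrixLayoutDestroy(cDesc);
    cublasLtMatrixLayoutDestroy(bDesc);
    cublasLtMatrixLayoutDestroy(aDesc);
    cublasLtMatmulDescDestroy(op);
}
```

A full layer would then typically follow such a GEMM with a bias/activation kernel like the sketch shown earlier.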
Below is an example plot of PyTorch vs Custom CUDA forward runtime across batch sizes on an RTX 5090:
- The CUDA implementation consistently outperforms PyTorch.
- Runtime scales approximately linearly with batch size due to the tiled implementation.
- Peak memory usage is controlled via tiling, avoiding out-of-memory issues for large batches.
- 🔍 Performance note: Part of the speedup comes from TF32 (Tensor Cores), which PyTorch does not enable by default in FP32 mode. This highlights one of the advantages of a compiler-assisted or LLM-driven workflow: it can automatically expose hidden hardware performance knobs that most developers and standard frameworks don’t activate, while still ensuring numerical correctness.
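Concretely, the knob in question is a single compute-type argument on the cuBLASLt side (PyTorch exposes the equivalent switch through `torch.backends.cuda.matmul.allow_tf32`). A minimal, hypothetical illustration:

```cpp
#include <cublasLt.h>

// The only difference between the plain FP32 path and the TF32 Tensor Core
// path used in the sketch above is the compute type passed when the matmul
// descriptor is created.
int main() {
    cublasLtMatmulDesc_t op;

    // Plain FP32 math, comparable to PyTorch's float32 default:
    cublasLtMatmulDescCreate(&op, CUBLAS_COMPUTE_32F, CUDA_R_32F);
    cublasLtMatmulDescDestroy(op);

    // TF32 Tensor Core path (the "hidden knob" referred to above):
    cublasLtMatmulDescCreate(&op, CUBLAS_COMPUTE_32F_FAST_TF32, CUDA_R_32F);
    cublasLtMatmulDescDestroy(op);
    return 0;
}
```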
- GPU: NVIDIA RTX 5090 (tested)
- CUDA Toolkit: 13.0 installed on the system
- PyTorch CUDA version: 12.9 packages (compatible with CUDA 13 drivers)
- Python: 3.12
- Conda environment: packages installed from `requirements.txt`

Note: PyTorch currently provides precompiled binaries for CUDA 12.9 (`cu12`). These packages run correctly on systems with CUDA 13 drivers, so the environment is fully compatible with the hardware listed above.
Create a Conda environment to isolate dependencies.

```bash
# Create a conda environment
conda create -n mlp_env python=3.12 -y
conda activate mlp_env

# Install PyTorch with the appropriate CUDA version
pip3 install torch --index-url https://download.pytorch.org/whl/cu130

# Install Python dependencies
pip install -r requirements.txt
```

Then build the C++/CUDA code:

```bash
# Create build folder
mkdir -p build
cd build

# Run CMake and compile
cmake ..
make -j
```

This will compile the custom CUDA kernels.
Before running the evaluation, generate the synthetic weights using the provided script:

```bash
python scripts/create_synthetic_weights.py
```

This will create the weights in the `data/` folder, which `evaluate.py` reads automatically.
Run the evaluation:

```bash
python evaluate.py --B 16384 --tileB 16384
```

Where:

- `--B` is the batch size
- `--tileB` is the tile size used for tiled inference (see below)
To verify numerical correctness, the script computes absolute and relative differences between PyTorch and CUDA outputs.
Note: Relative differences can be misleading when some outputs are very close to zero while others are large; even a tiny absolute difference can then appear as a large relative error. Both absolute and relative differences are therefore reported to capture correctness comprehensively.
Additional context:
- Synthetic weights are scaled down to reduce floating-point accumulation errors.
- Comparisons use FP32 (PyTorch default) vs TF32 (CUDA) on very large batches (up to 262,144 samples).
- TF32 has fewer mantissa bits than FP32, so small numerical differences are expected, particularly for very small or very large values.
Differences remain within expected numerical tolerances, ensuring functional correctness.
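For illustration, the comparison boils down to something like the following hypothetical helper (the actual comparison in `evaluate.py` is done in Python on the two output tensors). The epsilon in the denominator marks exactly where near-zero reference values can inflate the relative error:

```cpp
#include <cmath>
#include <cstddef>
#include <algorithm>

// Hypothetical illustration of the reported metrics: maximum absolute and
// maximum relative difference between a reference output and the CUDA output.
struct DiffStats { float max_abs; float max_rel; };

DiffStats compare_outputs(const float* ref, const float* out, size_t n,
                          float eps = 1e-6f) {
    DiffStats s{0.0f, 0.0f};
    for (size_t i = 0; i < n; ++i) {
        float abs_diff = std::fabs(out[i] - ref[i]);
        // When |ref[i]| is near zero, rel_diff is dominated by eps and can
        // look large even though abs_diff is negligible.
        float rel_diff = abs_diff / (std::fabs(ref[i]) + eps);
        s.max_abs = std::max(s.max_abs, abs_diff);
        s.max_rel = std::max(s.max_rel, rel_diff);
    }
    return s;
}
```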
For very large batch sizes (e.g., B = 65536), allocating a single contiguous tensor would exceed the available VRAM (>8 GB for hidden activations) on my RTX 5090.
To avoid out-of-memory errors, tiled inference was used:
- The batch is split into smaller tiles of size `tileB`
- Each tile is processed sequentially, reducing peak memory usage
- The outputs are written into a pre-allocated output tensor
This allows efficient inference at extremely large batch sizes without exceeding GPU memory limits.
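A minimal sketch of what the host-side tiling loop amounts to (`mlp_forward_device` is a hypothetical placeholder for the repository's forward function):

```cpp
#include <cuda_runtime.h>
#include <algorithm>

// Hypothetical forward function operating on a tile of `rows` samples that
// already reside on the GPU; stands in for this repo's actual forward pass.
void mlp_forward_device(const float* d_in, float* d_out, int rows,
                        cudaStream_t stream);

// Tiled inference sketch: process the batch in chunks of tileB samples so
// that intermediate activations are only ever allocated for one tile at a time.
void run_tiled(const float* d_input, float* d_output,
               int B, int tileB, int in_dim, int out_dim,
               cudaStream_t stream)
{
    for (int start = 0; start < B; start += tileB) {
        int rows = std::min(tileB, B - start);
        // Each tile reads and writes its own slice of the pre-allocated
        // input/output buffers, so no batch-sized temporaries are needed.
        mlp_forward_device(d_input  + (size_t)start * in_dim,
                           d_output + (size_t)start * out_dim,
                           rows, stream);
    }
}
```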
MIT
