Fast, reproducible, and portable software development environments (Dockerfile, updated Dec 8, 2021)
Remote development on HPC clusters with VSCode
Accelerate and optimize existing C/C++ CPU-only applications using the most essential CUDA tools and techniques.
Matrix multiplication example implemented with OpenMP, OpenACC, BLAS, cuBLAS, and CUDA
High-performance Sobel edge detection using CUDA with CPU vs GPU benchmarking, roofline analysis, and Nsight profiling.
CUDA Samples and Nsight Guided Profiling Samples
Repository for the Architecture of Computers and Parallel Systems course at VŠB
A simple and understandable CUDA kernel for the batch-matmul operation
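A batch-matmul kernel of this kind can be sketched as follows. This is an illustrative naive version (one thread per output element, one grid z-slice per batch entry; the name `batchMatmul` is my own), not the repository's actual code.

```cuda
// Naive batched matrix multiply: C[b] = A[b] * B[b] for b in [0, batch).
// A: (batch, M, K), B: (batch, K, N), C: (batch, M, N), all row-major.
// One thread computes one element of one output matrix.
__global__ void batchMatmul(const float* A, const float* B, float* C,
                            int batch, int M, int N, int K) {
    int b   = blockIdx.z;                             // batch index
    int row = blockIdx.y * blockDim.y + threadIdx.y;  // row in C[b]
    int col = blockIdx.x * blockDim.x + threadIdx.x;  // col in C[b]
    if (b >= batch || row >= M || col >= N) return;

    const float* Ab = A + (size_t)b * M * K;
    const float* Bb = B + (size_t)b * K * N;
    float acc = 0.0f;
    for (int k = 0; k < K; ++k)
        acc += Ab[row * K + k] * Bb[k * N + col];
    C[(size_t)b * M * N + row * N + col] = acc;
}
```

Launched with a 3D grid, e.g. `dim3 block(16, 16); dim3 grid((N+15)/16, (M+15)/16, batch);`, so the batch dimension maps directly onto `blockIdx.z`.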
The MNIST classification problem is a fundamental machine learning task that involves recognizing handwritten digits (0-9) from a dataset of 70,000 grayscale images (28x28 pixels each). It serves as a benchmark for evaluating machine learning models, particularly neural networks.
Custom PyTorch CUDA kernel implementing optimized ReLU activation with vectorization, performance profiling, and memory analysis on Tesla T4 GPU achieving 75% bandwidth efficiency.
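A vectorized ReLU of the kind described can be sketched with `float4` loads, which issue one 16-byte memory transaction per thread instead of four 4-byte ones. This is an illustrative sketch (the name `relu4` is my own, and it assumes a 16-byte-aligned buffer whose length is a multiple of 4), not the repository's kernel.

```cuda
// ReLU over n floats, processing four floats per thread via float4.
// Assumes data is 16-byte aligned and n is a multiple of 4.
__global__ void relu4(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // index in float4 units
    if (i * 4 >= n) return;
    float4 v = reinterpret_cast<float4*>(data)[i];
    v.x = fmaxf(v.x, 0.0f);
    v.y = fmaxf(v.y, 0.0f);
    v.z = fmaxf(v.z, 0.0f);
    v.w = fmaxf(v.w, 0.0f);
    reinterpret_cast<float4*>(data)[i] = v;
}
```

ReLU is memory-bound, so widening each access like this is the main lever for approaching peak bandwidth; a production version would also handle the unaligned tail elements separately.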
16-step CUDA optimization of FlashAttention-2 achieving 99.2% of official performance on A100 — Ampere architecture
C++23 benchmarking framework with 6 profiler backends, CUDA GPU support, statistical regression detection, cross-compilation for 5 architectures, and CLI tools for analysis and visualization.
High-Performance Computing (HPC) & Optimization studies using CUDA C++. Includes Grid-Stride Loops, Shared Memory tiling, and Nsight Compute profiling analysis.
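The grid-stride loop pattern mentioned above lets a fixed-size grid cover an array of any length: each thread starts at its global index and strides by the total thread count. A minimal sketch (the kernel name `saxpy_gridstride` is my own):

```cuda
// Grid-stride SAXPY: y = a*x + y over n elements.
// Each thread handles elements i, i+stride, i+2*stride, ...,
// so correctness does not depend on the launch configuration.
__global__ void saxpy_gridstride(int n, float a,
                                 const float* x, float* y) {
    int stride = blockDim.x * gridDim.x;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride)
        y[i] = a * x[i] + y[i];
}
```

Because the same kernel works for any grid size, the launch can be tuned to the device (e.g. a multiple of the SM count) rather than to the problem size.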
CUDA-accelerated kNN regression for rent estimation with CPU baseline, shared-memory optimization, and profiling
University Project for "Computer Architecture" course (MSc Computer Engineering @ University of Pisa). Implementation of a Parallelized Nearest Neighbor Upscaler using CUDA.
Open-source stencil-aware multi-GPU Conjugate Gradient solver on 8× A100 NVLink. 2.07× SpMV vs cuSPARSE · 1.44× above NVIDIA AmgX · 93.5% strong scaling efficiency. Profiled with Nsight Systems & Nsight Compute.
Quantum workload planning and profiler-backed architecture analysis for exact tensor-network execution.
🎬 Explore GPU training efficiency with FP32 vs FP16 in this modular lab, utilizing Tensor Core acceleration for deep learning insights.