High-Performance Computing & Scaling Large Models

A graduate-level course on High-Performance Computing (HPC) and the systems engineering required to train, fine-tune, and serve Large Language Models (LLMs) efficiently. The course integrates GPU architecture, CUDA kernel programming, memory-efficient attention, distributed training paradigms, and modern inference systems into a unified, hands-on curriculum.

Course Overview

The rapid growth in model parameter counts has shifted the central bottleneck in deep learning from algorithmic novelty to systems efficiency. Training a state-of-the-art language model is no longer a single-GPU task; it is a distributed-systems problem constrained by memory bandwidth, interconnect latency, and the arithmetic intensity of each workload. This course equips students with the theoretical foundations and practical engineering skills required to operate at this frontier.

Each week pairs a rigorous lecture on a core HPC topic with a practical laboratory implemented as a Jupyter notebook. Students will write low-level CUDA kernels, profile real models with NVIDIA Nsight, integrate production-grade inference engines such as vLLM, and scale training across multiple GPUs using ZeRO, FSDP, and 3D parallelism.

Syllabus

| Week | Lecture Topic | Practical Laboratory |
|------|---------------|----------------------|
| 1 | HPC Foundations, Hardware Architectures, and Profiling | Model analysis with PyTorch Profiler and NVIDIA Nsight |
| 2 | Advanced PyTorch Optimization and Introduction to CUDA | Writing a matrix multiplication (`Y = X · W`) CUDA kernel from scratch |
| 3 | Memory-Bound Bottlenecks: FlashAttention and Serving Systems | vLLM (PagedAttention) integration and inference benchmarking |
| 4 | Memory Optimization in Distributed Training (ZeRO & FSDP) | DeepSpeed ZeRO-3 and dataset prefetching/caching |
| 5 | Multi-Dimensional Parallelism (3D) and Large-Cluster Scaling | Tensor and Pipeline parallelism with Megatron-LM |
| 6 | Efficient Inference, Model Compression, and Fine-Tuning Technologies | Distributed fine-tuning with LoRA/QLoRA and quantization |

Repository Structure

high_performance_computing/
├── notebooks/                    # Weekly Jupyter notebooks (theory + practice)
│   ├── week1_hpc_foundations_profiling.ipynb
│   ├── week2_pytorch_optimization_cuda.ipynb
│   ├── week3_flashattention_vllm.ipynb
│   ├── week4_zero_fsdp_distributed.ipynb
│   ├── week5_3d_parallelism_megatron.ipynb
│   └── week6_lora_qlora_quantization.ipynb
├── src/                          # Reusable library code
│   ├── cuda/                     # Custom CUDA kernels (matmul, attention)
│   ├── distributed/              # Distributed training utilities
│   ├── inference/                # Inference and serving helpers
│   └── utils/                    # Profiling, benchmarking, logging
├── scripts/                      # Launcher scripts
│   └── slurm/                    # SLURM job submission templates
├── configs/                      # YAML configurations for training/inference
├── docker/                       # Reproducible Docker environment
├── tests/                        # Unit and integration tests
├── .github/workflows/            # Continuous Integration pipelines
├── assets/                       # Figures and diagrams
├── requirements.txt              # Python dependencies
├── environment.yml               # Conda environment specification
├── pyproject.toml                # Project metadata and tooling
├── CONTRIBUTING.md               # Contribution guidelines
├── CODE_OF_CONDUCT.md
├── CITATION.cff                  # Academic citation metadata
├── LICENSE
└── README.md

Learning Outcomes

By the end of this course, students will be able to:

Analyze the performance of a deep learning model using hardware-aware profiling tools (Nsight Systems, Nsight Compute, PyTorch Profiler) and identify whether a workload is compute-bound, memory-bound, or communication-bound.
Implement custom CUDA kernels for fundamental linear algebra operations and reason about occupancy, shared memory tiling, and warp-level primitives.
Deploy memory-efficient attention mechanisms (FlashAttention, PagedAttention) and explain their algorithmic and systems-level innovations.
Design distributed training strategies combining data parallelism, tensor parallelism, pipeline parallelism, and ZeRO sharding for models that exceed single-device memory.
Apply parameter-efficient fine-tuning (LoRA, QLoRA) and post-training quantization to deploy LLMs under realistic hardware budgets.
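
As a rough illustration of the first outcome, the sketch below uses the standard roofline model to label a kernel as compute-bound or memory-bound from its arithmetic intensity (FLOPs per byte moved). It is not taken from the course notebooks. The `Device` figures are hypothetical round numbers rather than the specification of any real GPU, and the cost model for `matmul_cost` assumes each matrix is read or written exactly once. Communication-bound workloads are outside the scope of this simple model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Device:
    """Peak throughput figures for a roofline estimate (hypothetical values)."""
    name: str
    peak_flops: float      # FLOP/s
    peak_bandwidth: float  # bytes/s


def ridge_point(device: Device) -> float:
    """Arithmetic intensity (FLOP/byte) at which the two roofs intersect."""
    return device.peak_flops / device.peak_bandwidth


def arithmetic_intensity(flops: float, bytes_moved: float) -> float:
    """FLOPs performed per byte transferred to or from device memory."""
    return flops / bytes_moved


def classify(flops: float, bytes_moved: float, device: Device) -> str:
    """Label a kernel by comparing its intensity with the device ridge point."""
    if arithmetic_intensity(flops, bytes_moved) >= ridge_point(device):
        return "compute-bound"
    return "memory-bound"


def matmul_cost(m: int, k: int, n: int, bytes_per_element: int = 2):
    """FLOPs and minimum bytes moved for Y = X @ W with X (m x k), W (k x n).

    Assumes every matrix crosses the memory bus exactly once (ideal caching).
    """
    flops = 2 * m * k * n
    bytes_moved = bytes_per_element * (m * k + k * n + m * n)
    return flops, bytes_moved


if __name__ == "__main__":
    # Hypothetical accelerator: 100 TFLOP/s and 1 TB/s, so the ridge is 100 FLOP/byte.
    dev = Device("hypothetical-gpu", peak_flops=100e12, peak_bandwidth=1e12)
    for shape in [(4096, 4096, 4096), (1, 4096, 4096)]:
        flops, moved = matmul_cost(*shape)
        print(shape, round(arithmetic_intensity(flops, moved), 2),
              classify(flops, moved, dev))
```

Under these assumptions a large square FP16 matrix multiplication lands well above the ridge point, while a matrix-vector product of the same width (typical of single-token decoding) lands far below it, which is why the two cases call for different optimizations.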

Quick Start

Prerequisites

Hardware: NVIDIA GPU with Compute Capability ≥ 7.0 (Volta or newer). Multi-GPU recommended for Weeks 4–6.
CUDA Toolkit: 12.1 or newer.
Python: 3.10 or newer.
Operating System: Linux (Ubuntu 22.04 LTS verified). WSL2 supported with caveats.
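
The version requirements above can be checked with a short script before installing anything. The following is a minimal sketch, not a file shipped in this repository: the function names are hypothetical, and `detect_gpu` assumes PyTorch is already installed. Note that `torch.version.cuda` reports the CUDA version PyTorch was built against, which may differ from the system CUDA Toolkit.

```python
import sys

MIN_PYTHON = (3, 10)
MIN_CUDA = (12, 1)
MIN_COMPUTE_CAPABILITY = (7, 0)  # Volta or newer


def parse_version(text):
    """Turn a string such as '12.1' into a comparable tuple (12, 1)."""
    return tuple(int(part) for part in text.split(".")[:2])


def check_prerequisites(python_version, cuda_version, compute_capability):
    """Return a list of human-readable problems; an empty list means all checks pass."""
    problems = []
    if tuple(python_version[:2]) < MIN_PYTHON:
        problems.append("Python 3.10 or newer is required")
    if tuple(cuda_version) < MIN_CUDA:
        problems.append("CUDA 12.1 or newer is required")
    if tuple(compute_capability) < MIN_COMPUTE_CAPABILITY:
        problems.append("GPU Compute Capability 7.0 or newer is required")
    return problems


def detect_gpu():
    """Return (cuda_version, compute_capability) via PyTorch, or None if unavailable."""
    try:
        import torch
    except ImportError:
        return None
    if not torch.cuda.is_available() or torch.version.cuda is None:
        return None
    # CUDA version of the PyTorch build, and capability of device 0 as (major, minor).
    return parse_version(torch.version.cuda), torch.cuda.get_device_capability(0)


if __name__ == "__main__":
    detected = detect_gpu()
    if detected is None:
        print("No CUDA-enabled PyTorch installation or GPU detected.")
    else:
        cuda_version, capability = detected
        issues = check_prerequisites(sys.version_info, cuda_version, capability)
        print("\n".join(issues) if issues else "All prerequisite checks passed.")
```

Keeping `check_prerequisites` free of any PyTorch dependency means the comparison logic can be unit-tested on a machine without a GPU.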

Installation

# Clone the repository
git clone https://github.com/HAYDARKILIC/high_performance_computing.git
cd high_performance_computing

# Option A: pip + virtualenv
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

# Option B: conda
conda env create -f environment.yml
conda activate high_performance_computing

# Option C: Docker
docker compose -f docker/docker-compose.yml up -d
docker exec -it high_performance_computing bash

Launching the Notebooks

jupyter lab notebooks/

For SLURM-managed clusters, see scripts/slurm/ for ready-to-use submission templates.

Recommended Hardware per Week

| Week | Minimum | Recommended |
|------|---------|-------------|
| 1 | 1× T4 (16 GB) | 1× A100 (40 GB) |
| 2 | 1× T4 | 1× A100 |
| 3 | 1× A10G (24 GB) | 1× A100 (80 GB) |
| 4 | 2× A100 (40 GB) | 4× A100 (80 GB) |
| 5 | 4× A100 (40 GB) | 8× A100 (80 GB) + NVLink |
| 6 | 1× A10G | 2× A100 |
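
To give a feel for why the distributed-training weeks call for several GPUs, the sketch below reproduces the model-state memory arithmetic from the ZeRO paper (Rajbhandari et al., 2020) for mixed-precision Adam training: 2 bytes per parameter for FP16 weights, 2 for FP16 gradients, and 12 for the FP32 optimizer states. The function name is hypothetical, and the 7-billion-parameter example is an illustrative assumption, not a model prescribed by the course. Activations, temporary buffers, and fragmentation are deliberately excluded, so real usage is higher.

```python
BYTES_FP16_PARAMS = 2   # half-precision weights
BYTES_FP16_GRADS = 2    # half-precision gradients
BYTES_OPTIMIZER = 12    # FP32 master weights + Adam momentum + Adam variance


def model_state_gb(num_params, num_gpus, stage):
    """Per-GPU memory (in GB, 1 GB = 1e9 bytes) for model states only.

    stage 0: plain data parallelism, everything replicated
    stage 1: optimizer states sharded across GPUs
    stage 2: optimizer states and gradients sharded
    stage 3: optimizer states, gradients, and parameters sharded
    """
    if stage not in (0, 1, 2, 3):
        raise ValueError("stage must be 0, 1, 2, or 3")
    params = BYTES_FP16_PARAMS
    grads = BYTES_FP16_GRADS
    optim = BYTES_OPTIMIZER
    if stage >= 1:
        optim /= num_gpus
    if stage >= 2:
        grads /= num_gpus
    if stage >= 3:
        params /= num_gpus
    return num_params * (params + grads + optim) / 1e9


if __name__ == "__main__":
    # Hypothetical 7B-parameter model on 4 GPUs (illustrative assumption).
    for stage in range(4):
        print(f"ZeRO stage {stage}: {model_state_gb(7e9, 4, stage):.1f} GB per GPU")
```

Under these assumptions, the replicated model states alone (112 GB) exceed an 80 GB device, while full ZeRO-3 sharding across four GPUs brings them down to 28 GB per GPU before activations are counted.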

References and Further Reading

Key references that anchor the course material:

Dao, T. et al. (2022). FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness. NeurIPS.
Kwon, W. et al. (2023). Efficient Memory Management for Large Language Model Serving with PagedAttention. SOSP.
Rajbhandari, S. et al. (2020). ZeRO: Memory Optimizations Toward Training Trillion Parameter Models. SC20.
Narayanan, D. et al. (2021). Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM. SC21.
Hu, E. J. et al. (2021). LoRA: Low-Rank Adaptation of Large Language Models. arXiv:2106.09685.
Dettmers, T. et al. (2023). QLoRA: Efficient Finetuning of Quantized LLMs. NeurIPS.

License

This course is released under the MIT License. Course materials and lecture notes are additionally distributed under CC BY 4.0.
