Forge-LM is a decoder-only transformer language model built entirely from first principles. Starting from raw bytes, it implements a byte-level BPE tokenizer, a modern transformer stack (RMSNorm, RoPE, SwiGLU), the training utilities needed to optimize it, and an end-to-end pipeline that pretrains the model and adapts it to multiple-choice question answering.
Nothing here is imported from a high-level modeling library. Every core component is written by hand and validated against reference snapshots.
- Byte-level BPE tokenizer trained directly from a text corpus
- Modern transformer architecture: RMSNorm, Rotary Position Embeddings (RoPE), and a SwiGLU feed-forward network
- Hand-written training primitives: numerically stable softmax, cross-entropy, gradient clipping, token accuracy, and perplexity
- Analytical FLOPs and memory estimators
- A full pretrain plus fine-tune pipeline that adapts the model to QA, with a zero-shot prompting baseline for comparison
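
As a rough illustration of what an analytical FLOPs estimate involves, the sketch below counts forward-pass matrix-multiply FLOPs per token. The function name `approx_flops_per_token` and its parameters are hypothetical and are not the repo's `count_flops_per_token()` interface. It assumes 2 FLOPs per multiply-add, a SwiGLU feed-forward block with three weight matrices, and an untied output projection. It ignores norms, softmax, and RoPE.

```python
def approx_flops_per_token(d_model, d_ff, n_layers, context_len, vocab_size):
    """Hypothetical forward-pass FLOPs estimate (matmuls only, 2 FLOPs per multiply-add)."""
    attn_proj = 4 * 2 * d_model * d_model      # Q, K, V, and output projections
    attn_mix = 2 * 2 * context_len * d_model   # QK^T scores plus the weighted sum over V
    ffn = 3 * 2 * d_model * d_ff               # SwiGLU: two up-projections and one down-projection
    lm_head = 2 * d_model * vocab_size         # final projection to vocabulary logits
    return n_layers * (attn_proj + attn_mix + ffn) + lm_head


if __name__ == "__main__":
    print(approx_flops_per_token(d_model=64, d_ff=128, n_layers=2,
                                 context_len=16, vocab_size=100))
```

A common rule of thumb puts the backward pass at roughly twice the forward cost, so a training-step estimate would be about three times this number.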
| Stage | Module | What it does |
|---|---|---|
| Part 1 | part1/ | Byte-level BPE tokenizer (train, encode, decode) |
| Part 2 | part2/model.py | Full transformer LM with RoPE and SwiGLU |
| Part 3 | part3/nn_utils.py | Training and evaluation utilities |
| Part 4 | part4/ | Pretraining, QA fine-tuning, and prompting |
- Python 3.10+
- A CUDA-capable GPU (recommended for Part 4, not required for Parts 1 to 3)
conda create -n cs288a2 python=3.10 -y
conda activate cs288a2
pip install -r requirements.txt

Run tests from within each part's directory:
# Part 1: Tokenization
cd part1
python -m pytest tests/ -v
# Part 2: Transformer Model
cd part2
python -m pytest tests/ -v
# Part 3: NN Utilities
cd part3
python -m pytest tests/ -v

Or run them all from the source directory:
cd source
python -m pytest part1/tests/ part2/tests/ part3/tests/ -v

After the core components are in place, you can train and evaluate the full model.
cd part4
python train_baseline.py

This will:
- Train a BPE tokenizer on TinyStories
- Pretrain a transformer language model
- Fine-tune on multiple-choice QA
- Evaluate using zero-shot prompting
- Save predictions to part4/outputs/
# Quick test run (smaller model, fewer steps)
python train_baseline.py --quick
# Medium configuration
python train_baseline.py --medium
# Full training (default)
python train_baseline.py

After training, prediction files are saved to part4/outputs/:
- finetuned_predictions.json: fine-tuned model predictions
- prompting_predictions.json: zero-shot prompting predictions
Part 1 (tokenizer):
- train_bpe(): train a BPE vocabulary from a text corpus
- Tokenizer._bpe(): apply BPE merges to a token
- Tokenizer._encode_chunk(): encode text to token IDs
- Tokenizer.decode(): decode token IDs back to text
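
The tokenizer pieces listed above can be pictured with a minimal byte-level BPE sketch. It is not the repo's `train_bpe()` or `Tokenizer._bpe()`. The names `train_bpe_sketch`, `merge_pair`, and `apply_merges` are hypothetical, and the sketch assumes no pre-tokenization, no special tokens, and ties broken in favor of the lexicographically larger pair.

```python
from collections import Counter


def merge_pair(seq, pair):
    """Replace every non-overlapping occurrence of `pair` in `seq`, scanning left to right."""
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(seq[i] + seq[i + 1])
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out


def train_bpe_sketch(text, num_merges):
    """Learn up to `num_merges` merges, starting from the raw UTF-8 bytes of `text`."""
    seq = [bytes([b]) for b in text.encode("utf-8")]
    merges = []
    for _ in range(num_merges):
        pairs = Counter(zip(seq, seq[1:]))
        if not pairs:
            break
        # Most frequent adjacent pair; ties go to the larger pair (an assumption).
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        seq = merge_pair(seq, best)
    return merges


def apply_merges(token, merges):
    """Encode one byte string by replaying the merges in the order they were learned."""
    seq = [bytes([b]) for b in token]
    for pair in merges:
        seq = merge_pair(seq, pair)
    return seq


if __name__ == "__main__":
    learned = train_bpe_sketch("aaabdaaabac", num_merges=2)
    print(learned)
    print(apply_merges(b"aaab", learned))
```

Decoding is the easy direction in this scheme: concatenate the byte strings for each token and decode the result as UTF-8.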
Part 2 (transformer model):
- Linear: linear transformation layer
- Embedding: token embedding layer
- RMSNorm: root mean square layer normalization
- softmax(): numerically stable softmax
- silu(): SiLU activation function
- SwiGLU: gated feed-forward network
- RotaryPositionEmbedding: RoPE positional encoding
- scaled_dot_product_attention(): attention mechanism
- MultiHeadSelfAttention: multi-head attention
- MultiHeadSelfAttentionWithRoPE: attention with RoPE
- TransformerBlock: a single transformer layer
- TransformerLM: the complete language model
- count_flops_per_token(): FLOPs estimation
- estimate_memory_bytes(): memory estimation
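
To make three of these components concrete, here is a dependency-free sketch of RMSNorm, SwiGLU, and RoPE operating on plain Python lists. These are illustrative stand-ins, not the repo's classes. The function names, the `W1`/`W2`/`W3` naming, the interleaved pairing of dimensions in `rope`, and the defaults `eps=1e-5` and `theta=10000.0` are all assumptions.

```python
import math


def rms_norm(x, weight, eps=1e-5):
    """Divide x by its root mean square, then apply a per-dimension gain."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [w * v / rms for v, w in zip(x, weight)]


def silu(v):
    """SiLU activation: v * sigmoid(v)."""
    return v / (1.0 + math.exp(-v))


def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def swiglu(x, W1, W2, W3):
    """Gated feed-forward: W2 @ (silu(W1 @ x) * (W3 @ x))."""
    gate = [silu(a) * b for a, b in zip(matvec(W1, x), matvec(W3, x))]
    return matvec(W2, gate)


def rope(x, position, theta=10000.0):
    """Rotate pairs (x[0], x[1]), (x[2], x[3]), ... by position-dependent angles.

    Assumes len(x) is even.
    """
    d = len(x)
    out = []
    for i in range(0, d, 2):
        angle = position * theta ** (-i / d)
        c, s = math.cos(angle), math.sin(angle)
        out += [x[i] * c - x[i + 1] * s, x[i] * s + x[i + 1] * c]
    return out


if __name__ == "__main__":
    q, k = [1.0, 2.0, 3.0, 4.0], [0.5, -1.0, 2.0, 0.25]
    dot = lambda a, b: sum(u * v for u, v in zip(a, b))
    # Same relative offset (2), so the two attention scores agree up to rounding.
    print(dot(rope(q, 5), rope(k, 3)), dot(rope(q, 7), rope(k, 5)))
```

The usage example shows the property that makes RoPE attractive for attention: the query-key dot product depends only on the distance between the two positions, not on their absolute values.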
Part 3 (training and evaluation utilities):
- softmax(): numerically stable softmax for training
- cross_entropy(): cross-entropy loss
- gradient_clipping(): gradient norm clipping
- token_accuracy(): token-level accuracy
- perplexity(): language model perplexity
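
The numerical ideas behind these utilities fit in a few lines of standard-library Python. The sketch below is not the code in part3/nn_utils.py, and its function names are hypothetical. It assumes a single token's logits as a list of floats, perplexity defined as the exponential of the mean per-token loss, and clipping by the global L2 norm with a small `eps` in the denominator.

```python
import math


def stable_softmax(logits):
    """Softmax with the max subtracted first, so exp() never overflows."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy_from_logits(logits, target):
    """Negative log-probability of `target`, computed with the log-sum-exp trick."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target]


def perplexity_from_losses(losses):
    """Exponential of the mean per-token cross-entropy."""
    return math.exp(sum(losses) / len(losses))


def token_accuracy_sketch(all_logits, targets):
    """Fraction of positions where the highest-scoring token equals the target."""
    hits = sum(max(range(len(l)), key=l.__getitem__) == t
               for l, t in zip(all_logits, targets))
    return hits / len(targets)


def clip_by_global_norm(grads, max_norm, eps=1e-6):
    """Rescale the gradient vector if its L2 norm exceeds `max_norm`."""
    norm = math.sqrt(sum(g * g for g in grads))
    scale = min(1.0, max_norm / (norm + eps))
    return [g * scale for g in grads]


if __name__ == "__main__":
    print(stable_softmax([1000.0, 1000.0]))        # a naive exp(1000.0) would overflow
    print(cross_entropy_from_logits([0.0, 0.0], 0))
    print(clip_by_global_norm([3.0, 4.0], max_norm=1.0))
```

Subtracting the maximum logit leaves the softmax output unchanged mathematically, because the shift cancels between numerator and denominator. It only changes which intermediate values the floating-point arithmetic has to represent.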
Create the submission archive:
bash create_submission.sh

- Do not modify function signatures or class interfaces
- Do not add dependencies beyond requirements.txt
- Ensure the code passes local tests before submitting
- The autograder runs additional hidden tests
- Use the provided fixtures for testing