Forge-LM: A Transformer Language Model from Bytes to QA

Forge-LM is a decoder-only transformer language model built entirely from first principles. Starting from raw bytes, it implements a byte-level BPE tokenizer, a modern transformer stack (RMSNorm, RoPE, SwiGLU), the training utilities needed to optimize it, and an end-to-end pipeline that pretrains the model and adapts it to multiple-choice question answering.

Nothing here is imported from a high-level modeling library. Every core component is written by hand and validated against reference snapshots.

Highlights

Byte-level BPE tokenizer trained directly from a text corpus
Modern transformer architecture: RMSNorm, Rotary Position Embeddings (RoPE), and a SwiGLU feed-forward network
Hand-written training primitives: numerically stable softmax, cross-entropy, gradient clipping, token accuracy, and perplexity
Analytical FLOPs and memory estimators
A full pretrain plus fine-tune pipeline that adapts the model to QA, with a zero-shot prompting baseline for comparison

Architecture at a Glance

Stage	Module	What it does
Part 1	`part1/`	Byte-level BPE tokenizer (train, encode, decode)
Part 2	`part2/model.py`	Full transformer LM with RoPE and SwiGLU
Part 3	`part3/nn_utils.py`	Training and evaluation utilities
Part 4	`part4/`	Pretraining, QA fine-tuning, and prompting

Installation

Prerequisites

Python 3.10+
A CUDA-capable GPU (recommended for Part 4, not required for Parts 1 to 3)

Setup

conda create -n cs288a2 python=3.10 -y
conda activate cs288a2
pip install -r requirements.txt

Running Tests

Run tests from within each part's directory. Each snippet below starts from the source directory:

# Part 1: Tokenization
cd part1
python -m pytest tests/ -v

# Part 2: Transformer Model
cd part2
python -m pytest tests/ -v

# Part 3: NN Utilities
cd part3
python -m pytest tests/ -v

Or run them all from the source directory:

cd source
python -m pytest part1/tests/ part2/tests/ part3/tests/ -v

Training and Evaluation

After the core components are in place, you can train and evaluate the full model.

Run the Training Pipeline

cd part4
python train_baseline.py

This will:

Train a BPE tokenizer on TinyStories
Pretrain a transformer language model
Fine-tune on multiple-choice QA
Evaluate a zero-shot prompting baseline for comparison
Save predictions to part4/outputs/
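
As a rough illustration of the zero-shot prompting step, the sketch below formats a multiple-choice question as a prompt and picks the option whose answer letter receives the highest score. The prompt template, the function names, and the `log_prob_fn` callable (assumed to return the model's log-probability of a continuation given a prompt) are all hypothetical and are not taken from part4/; the actual pipeline may score full answer strings or generate text instead.

```python
from typing import Callable, Sequence

LETTERS = "ABCDEFGH"  # supports up to eight options in this sketch


def format_prompt(question: str, choices: Sequence[str]) -> str:
    """Build a plain-text multiple-choice prompt (hypothetical template)."""
    lines = [f"Question: {question}"]
    lines += [f"{LETTERS[i]}. {choice}" for i, choice in enumerate(choices)]
    lines.append("Answer:")
    return "\n".join(lines)


def predict_choice(
    question: str,
    choices: Sequence[str],
    log_prob_fn: Callable[[str, str], float],
) -> int:
    """Return the index of the option whose letter scores highest."""
    prompt = format_prompt(question, choices)
    # Score each candidate continuation (" A", " B", ...) under the model.
    scores = [log_prob_fn(prompt, " " + LETTERS[i]) for i in range(len(choices))]
    return max(range(len(choices)), key=scores.__getitem__)
```

Passing the scorer in as a callable keeps the selection logic independent of any particular model class, so the same function can be exercised with a stub during testing.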

Configuration Options

# Quick test run (smaller model, fewer steps)
python train_baseline.py --quick

# Medium configuration
python train_baseline.py --medium

# Full training (default)
python train_baseline.py

Output Files

After training, prediction files are saved to part4/outputs/:

finetuned_predictions.json: fine-tuned model predictions
prompting_predictions.json: zero-shot prompting predictions

Component Reference

Part 1: Tokenization

train_bpe(): train a BPE vocabulary from a text corpus
Tokenizer._bpe(): apply BPE merges to a token
Tokenizer._encode_chunk(): encode text to token IDs
Tokenizer.decode(): decode token IDs back to text
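
To make the training step concrete, here is a minimal, framework-free sketch of byte-level BPE merge learning. It is not the repository's `train_bpe()`: the name `train_bpe_sketch`, its signature, the whitespace pre-tokenization, and the tie-breaking rule (prefer the lexicographically larger pair) are assumptions made for illustration.

```python
from collections import Counter


def merge_pair(word, pair):
    """Replace every adjacent occurrence of `pair` in `word` with the merged token."""
    out, i = [], 0
    while i < len(word):
        if i + 1 < len(word) and (word[i], word[i + 1]) == pair:
            out.append(word[i] + word[i + 1])
            i += 2
        else:
            out.append(word[i])
            i += 1
    return tuple(out)


def train_bpe_sketch(text, num_merges):
    """Toy byte-level BPE trainer; returns (vocab, merges)."""
    # The base vocabulary is the 256 single-byte tokens.
    vocab = {i: bytes([i]) for i in range(256)}
    merges = []
    # Assumption: split on whitespace, then view each word as a tuple of bytes.
    words = Counter(
        tuple(bytes([b]) for b in w.encode("utf-8")) for w in text.split()
    )
    for _ in range(num_merges):
        # Count adjacent token pairs, weighted by word frequency.
        pair_counts = Counter()
        for word, freq in words.items():
            for a, b in zip(word, word[1:]):
                pair_counts[(a, b)] += freq
        if not pair_counts:
            break
        # Most frequent pair wins; ties go to the lexicographically larger pair.
        best = max(pair_counts, key=lambda p: (pair_counts[p], p))
        merges.append(best)
        vocab[len(vocab)] = best[0] + best[1]
        # Apply the new merge to every word before the next round.
        new_words = Counter()
        for word, freq in words.items():
            new_words[merge_pair(word, best)] += freq
        words = new_words
    return vocab, merges
```

Encoding then amounts to replaying the learned merges, in order, over the bytes of new text, and decoding concatenates the byte strings stored in the vocabulary.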

Part 2: Model Components

Linear: linear transformation layer
Embedding: token embedding layer
RMSNorm: root mean square layer normalization
softmax(): numerically stable softmax
silu(): SiLU activation function
SwiGLU: gated feed-forward network
RotaryPositionEmbedding: RoPE positional encoding
scaled_dot_product_attention(): attention mechanism
MultiHeadSelfAttention: multi-head attention
MultiHeadSelfAttentionWithRoPE: attention with RoPE
TransformerBlock: a single transformer layer
TransformerLM: the complete language model
count_flops_per_token(): FLOPs estimation
estimate_memory_bytes(): memory estimation
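
For readers new to these components, the sketch below shows RMSNorm and the RoPE rotation on plain Python lists. It is an illustration only: the function names, the `eps` and `theta` defaults, and the interleaved pairing of dimensions (2i, 2i+1) are assumptions, and the classes in part2/model.py may use a different layout or operate on batched tensors.

```python
import math


def rms_norm(x, weight, eps=1e-5):
    """Scale x by the reciprocal of its root mean square, then by a learned gain."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [w * v / rms for v, w in zip(x, weight)]


def rope_rotate(x, position, theta=10000.0):
    """Rotate each pair (x[2i], x[2i+1]) by the angle position * theta^(-2i/d)."""
    d = len(x)
    out = list(x)
    for i in range(d // 2):
        angle = position * theta ** (-2 * i / d)
        c, s = math.cos(angle), math.sin(angle)
        a, b = x[2 * i], x[2 * i + 1]
        out[2 * i] = a * c - b * s
        out[2 * i + 1] = a * s + b * c
    return out
```

Because each pair is rotated by an angle proportional to its position, the dot product between a rotated query and a rotated key depends only on the difference of their positions, which is the property that lets attention see relative offsets.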

Part 3: Training Utilities

softmax(): numerically stable softmax for training
cross_entropy(): cross-entropy loss
gradient_clipping(): gradient norm clipping
token_accuracy(): token-level accuracy
perplexity(): language model perplexity
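
The utilities above can be sketched in a few lines of plain Python. This is a minimal illustration on lists rather than tensors; the signatures, the `clip_gradients` name, and the `eps` default are assumptions and do not necessarily match part3/nn_utils.py.

```python
import math


def softmax(logits):
    """Subtract the max before exponentiating so large logits do not overflow."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, target):
    """Negative log-probability of `target`, computed via log-sum-exp."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(v - m) for v in logits))
    return log_sum_exp - logits[target]


def perplexity(batch_logits, targets):
    """Exponential of the mean per-token cross-entropy."""
    losses = [cross_entropy(l, t) for l, t in zip(batch_logits, targets)]
    return math.exp(sum(losses) / len(losses))


def clip_gradients(grads, max_norm, eps=1e-6):
    """Rescale grads so their global L2 norm does not exceed max_norm."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    if total_norm > max_norm:
        scale = max_norm / (total_norm + eps)
        return [g * scale for g in grads]
    return list(grads)
```

A model that assigns uniform probability over four tokens has a cross-entropy of ln 4 and therefore a perplexity of 4, which makes a convenient sanity check.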

Submission

Create the submission archive:

bash create_submission.sh

Notes

Do not modify function signatures or class interfaces
Do not add dependencies beyond requirements.txt
Ensure the code passes local tests before submitting
The autograder runs additional hidden tests
Use the provided fixtures for testing
