llama.cpp Integration Roadmap
Goal: Port llama.cpp reference implementation to EMBODIOS kernel
Status: Not started (0%)
Last Updated: 14 December 2025
Strategy: Selective porting, not wholesale copy
llama.cpp is the de facto reference implementation for LLM inference:
- ✅ Production-tested (millions of users)
- ✅ Highly optimized (SIMD, quantization, KV cache)
- ✅ Compatible with GGUF format
- ✅ Clean C/C++ codebase (minimal dependencies)
- ✅ Battle-tested tokenizers (BPE, SentencePiece)
- ✅ Extensive model support (Llama, Mistral, Phi, etc.)
Why not just use llama.cpp directly?
- ❌ Requires userspace runtime (pthreads, C++ stdlib)
- ❌ Uses mmap() for model loading
- ❌ Relies on POSIX APIs
- ❌ Has C++ dependencies we can't use in kernel
Our approach: Selective porting
- Extract core algorithms (transformer, attention, quantization)
- Rewrite in pure C for kernel environment
- Adapt to EMBODIOS memory model
- Maintain API compatibility for testing
llama.cpp/
├── ggml.c # Tensor operations (CRITICAL)
│ ├── ggml_compute_forward → Port to embodios/ai/ops/
│ ├── ggml_mul_mat → matmul.c
│ ├── ggml_rope → rope.c
│ └── ggml_soft_max → softmax.c
│
├── llama.cpp # Model loading & inference (CRITICAL)
│ ├── llama_load_model → gguf_parser.c
│ ├── llama_decode → transformer.c
│ ├── llama_kv_cache → kv_cache.c
│ └── llama_sampling → sampling.c
│
├── ggml-quants.c # Quantization kernels (CRITICAL)
│ ├── dequantize_row_q4_K → gguf_quant.c
│ ├── dequantize_row_q5_K → gguf_quant.c
│ └── dequantize_row_q6_K → gguf_quant.c
│
├── unicode.cpp # UTF-8 handling (NEEDED)
│ └── codepoint handling → tokenizer/unicode.c
│
└── common/ # Utilities (SELECTIVE)
├── sampling → sampling.c
└── grammar (skip for v1.0)
llama.cpp/
├── examples/ # Skip (userspace CLI tools)
├── tests/ # Skip (use our own tests)
├── ggml-backend.c # Skip (GPU/Metal/CUDA)
├── ggml-alloc.c # Skip (use our heap)
└── llm.cpp # Skip (LLM-specific, use llama.cpp)
Goal: Port core tensor operations from ggml.c to EMBODIOS.
Source: ggml.c:ggml_compute_forward_mul_mat()
Port to: kernel/ai/ops/matmul.c
Key Changes:
// llama.cpp version (floating-point)
void ggml_compute_forward_mul_mat_f32(
const struct ggml_tensor * src0,
const struct ggml_tensor * src1,
struct ggml_tensor * dst) {
const float * src0_data = (float *) src0->data;
const float * src1_data = (float *) src1->data;
float * dst_data = (float *) dst->data;
// Matrix multiply logic...
}
// EMBODIOS version (fixed-point)
void embodios_matmul_fixed(
const fixed_t* a, // M x K
const fixed_t* b, // K x N
fixed_t* c, // M x N
int M, int K, int N) {
for (int i = 0; i < M; i++) {
for (int j = 0; j < N; j++) {
fixed_t sum = 0;
for (int k = 0; k < K; k++) {
sum += FIXED_MUL(a[i*K + k], b[k*N + j]);
}
c[i*N + j] = sum;
}
}
}
Deliverables:
- Port matmul (naive version)
- Add SIMD optimization (SSE2/AVX2)
- Unit tests: compare with llama.cpp output
- Performance: within 20% of llama.cpp
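The `fixed_t` type and `FIXED_MUL` macro used above are not defined in this roadmap. A minimal sketch, assuming a signed Q16.16 format and a 64-bit intermediate product; the format choice and the helper names `FIXED_FROM_INT` and `fixed_dot` are illustrative assumptions, not the EMBODIOS definitions:

```c
#include <stdint.h>

/* Assumed format: signed Q16.16 (real value = raw / 65536). */
typedef int32_t fixed_t;
#define FIXED_SHIFT 16
#define FIXED_ONE   ((fixed_t)1 << FIXED_SHIFT)

/* Widen to 64 bits before multiplying so the product cannot overflow,
 * then shift back down to Q16.16. Relies on arithmetic right shift of
 * negative values, which mainstream compilers provide. */
#define FIXED_MUL(a, b) ((fixed_t)(((int64_t)(a) * (int64_t)(b)) >> FIXED_SHIFT))

/* Hypothetical helper: convert a small integer to Q16.16. */
#define FIXED_FROM_INT(n) ((fixed_t)((n) * FIXED_ONE))

/* Dot product with a 64-bit accumulator: one shift at the end instead of
 * one per term, which loses less precision than summing FIXED_MUL results. */
static fixed_t fixed_dot(const fixed_t* a, const fixed_t* b, int n) {
    int64_t acc = 0;
    for (int i = 0; i < n; i++) {
        acc += (int64_t)a[i] * (int64_t)b[i];
    }
    return (fixed_t)(acc >> FIXED_SHIFT);
}
```

The inner loop of `embodios_matmul_fixed` could use the same 64-bit accumulator pattern as `fixed_dot`, which also makes intermediate overflow less likely for long rows.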
Source: ggml.c:ggml_compute_forward_rope()
Port to: kernel/ai/ops/rope.c
Key Insight: llama.cpp precomputes sin/cos tables. Copy this optimization.
// llama.cpp approach
struct ggml_tensor * ggml_rope_custom(
struct ggml_context * ctx,
struct ggml_tensor * a,
int n_dims,
int mode,
int n_ctx,
float freq_base,
float freq_scale,
...) {
// Precompute sin/cos
float * freq = malloc(n_dims/2 * sizeof(float));
for (int i = 0; i < n_dims/2; i++) {
freq[i] = 1.0f / powf(freq_base, (float)i*2 / n_dims);
}
// Apply rotation...
}
// EMBODIOS version
void embodios_rope(
fixed_t* q, // Query tensor
int pos, // Position
int head_dim, // Head dimension
int n_heads) {
// Use precomputed freq table (global)
for (int h = 0; h < n_heads; h++) {
for (int d = 0; d < head_dim/2; d++) {
int idx = h * head_dim + d;
fixed_t cos_val = rope_cos_table[pos * head_dim + d];
fixed_t sin_val = rope_sin_table[pos * head_dim + d];
fixed_t q0 = q[idx];
fixed_t q1 = q[idx + head_dim/2];
q[idx] = FIXED_MUL(q0, cos_val) - FIXED_MUL(q1, sin_val);
q[idx + head_dim/2] = FIXED_MUL(q0, sin_val) + FIXED_MUL(q1, cos_val);
}
}
}
Deliverables:
- Port RoPE from llama.cpp
- Precompute sin/cos tables at boot
- Unit test: verify rotations match llama.cpp
- Support head_dim 64, 128
Source: ggml.c:ggml_compute_forward_soft_max()
Port to: kernel/ai/ops/softmax.c
Critical insight from llama.cpp:
// llama.cpp uses max-subtraction for numerical stability
void ggml_compute_forward_soft_max_f32(
const struct ggml_tensor * src0,
struct ggml_tensor * dst) {
// Find max value
float max = -INFINITY;
for (int i = 0; i < ne0; i++) {
max = fmaxf(max, src[i]);
}
// Compute exp(x - max)
float sum = 0.0f;
for (int i = 0; i < ne0; i++) {
dst[i] = expf(src[i] - max);
sum += dst[i];
}
// Normalize
for (int i = 0; i < ne0; i++) {
dst[i] /= sum;
}
}
EMBODIOS version: Use a lookup table for exp() instead of expf().
Deliverables:
- Port softmax with max-subtraction
- Implement exp() lookup table
- Unit test: compare with llama.cpp
- Error: <0.1% deviation
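One way the exp() lookup table could work is sketched below. It assumes Q16.16 `fixed_t`, a table step of 1/16 over the range [-16, 0], and no interpolation; the names `exp_lut_init`, `exp_neg_lookup`, and `softmax_fixed` are hypothetical. Max-subtraction guarantees every argument to exp() is at most 0, so a one-sided table is enough.

```c
#include <stdint.h>

typedef int32_t fixed_t;              /* assumed Q16.16 */
#define FIXED_SHIFT 16
#define FIXED_ONE   ((fixed_t)1 << FIXED_SHIFT)

/* Table of exp(-k/16) for k = 0..256, i.e. inputs in [-16, 0].
 * exp(-16) is below Q16.16 resolution, so anything further out maps to 0. */
#define EXP_LUT_SIZE       257
#define EXP_LUT_STEP_SHIFT 12         /* 1/16 in Q16.16 is 1 << 12 */
static fixed_t exp_lut[EXP_LUT_SIZE];

/* Fill the table once at boot. Repeated multiplication by exp(-1/16)
 * avoids libm; this init step is the only place floats are used. */
void exp_lut_init(void) {
    const double step = 0.939413062813476;  /* exp(-1/16) */
    double v = 1.0;
    for (int k = 0; k < EXP_LUT_SIZE; k++) {
        exp_lut[k] = (fixed_t)(v * FIXED_ONE + 0.5);
        v *= step;
    }
}

/* exp(-d) for d >= 0 in Q16.16, truncated to the table step. */
static fixed_t exp_neg_lookup(int64_t d) {
    int64_t idx = d >> EXP_LUT_STEP_SHIFT;
    return (idx < EXP_LUT_SIZE) ? exp_lut[idx] : 0;
}

/* Softmax with max-subtraction; out[i] is a probability in Q16.16.
 * Requires n >= 1. The sum is never zero because the max element
 * always contributes exp(0) = 1.0. */
void softmax_fixed(const fixed_t* in, fixed_t* out, int n) {
    fixed_t max = in[0];
    for (int i = 1; i < n; i++) {
        if (in[i] > max) max = in[i];
    }
    int64_t sum = 0;
    for (int i = 0; i < n; i++) {
        out[i] = exp_neg_lookup((int64_t)max - in[i]);
        sum += out[i];
    }
    for (int i = 0; i < n; i++) {
        out[i] = (fixed_t)(((int64_t)out[i] << FIXED_SHIFT) / sum);
    }
}
```

Truncating to a 1/16 step gives up to roughly 6% relative error per term, so meeting the <0.1% target would need a finer step or linear interpolation between neighbouring entries.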
Source: ggml.c:ggml_compute_forward_rms_norm()
Port to: kernel/ai/rmsnorm.c
// llama.cpp version
void ggml_compute_forward_rms_norm_f32(
const struct ggml_tensor * src0,
struct ggml_tensor * dst) {
const float eps = 1e-6f;
// Compute mean of squares
float sum = 0.0f;
for (int i = 0; i < ne0; i++) {
sum += src0[i] * src0[i];
}
float mean = sum / ne0;
// Normalize
float scale = 1.0f / sqrtf(mean + eps);
for (int i = 0; i < ne0; i++) {
dst[i] = src0[i] * scale * weight[i];
}
}
EMBODIOS adaptation: Use fixed-point sqrt approximation.
Deliverables:
- Port RMSNorm from llama.cpp
- Implement fixed-point rsqrt (1/sqrt)
- Unit test: verify normalization
- Performance: <2ms for 2048-dim vector
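A possible shape for the fixed-point rsqrt and the RMSNorm built on it, assuming Q16.16 `fixed_t`. This sketch uses an exact integer square root rather than an approximation, and all function names are hypothetical. Note that the float epsilon of 1e-6 is below Q16.16 resolution, so the sketch uses the smallest representable step instead.

```c
#include <stdint.h>

typedef int32_t fixed_t;              /* assumed Q16.16 */
#define FIXED_SHIFT 16
#define FIXED_MUL(a, b) ((fixed_t)(((int64_t)(a) * (int64_t)(b)) >> FIXED_SHIFT))

/* Integer square root (floor) by the classic bit-by-bit method. */
static uint32_t isqrt64(uint64_t n) {
    uint64_t res = 0;
    uint64_t bit = (uint64_t)1 << 62;
    while (bit > n) bit >>= 2;
    while (bit != 0) {
        if (n >= res + bit) {
            n -= res + bit;
            res = (res >> 1) + bit;
        } else {
            res >>= 1;
        }
        bit >>= 2;
    }
    return (uint32_t)res;
}

/* 1/sqrt(x) in Q16.16. Requires x > 0 (callers add eps first). */
static fixed_t fixed_rsqrt(fixed_t x) {
    /* sqrt(raw << 16) is sqrt(x) expressed in Q16.16 */
    uint32_t root = isqrt64((uint64_t)x << FIXED_SHIFT);
    return (fixed_t)(((uint64_t)1 << 32) / root);
}

/* RMSNorm: out[i] = x[i] / sqrt(mean(x^2) + eps) * weight[i].
 * Assumes activations are small enough that the mean square fits Q16.16. */
void rmsnorm_fixed(fixed_t* out, const fixed_t* x, const fixed_t* weight, int n) {
    const fixed_t eps = 1;            /* smallest Q16.16 step, about 1.5e-5 */
    int64_t sum_sq = 0;               /* raw products carry 32 fractional bits */
    for (int i = 0; i < n; i++) {
        sum_sq += (int64_t)x[i] * x[i];
    }
    fixed_t mean = (fixed_t)((sum_sq / n) >> FIXED_SHIFT);
    fixed_t scale = fixed_rsqrt(mean + eps);
    for (int i = 0; i < n; i++) {
        out[i] = FIXED_MUL(FIXED_MUL(x[i], scale), weight[i]);
    }
}
```

The bit-by-bit square root runs at most 32 iterations per call and is called once per vector, so its cost is small next to the two passes over the data.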
Goal: Port quantization/dequantization kernels from ggml-quants.c.
Source: ggml-quants.c:dequantize_row_q4_K()
Port to: kernel/gguf/gguf_quant.c
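llama.cpp Q4_K structure (check field order and helper names against the pinned llama.cpp revision):
// From ggml-quants.h (newer revisions keep the block structs in ggml-common.h)
#define QK_K 256
#define K_SCALE_SIZE 12
typedef struct {
    ggml_fp16_t d;                // Super-block scale for the sub-block scales
    ggml_fp16_t dmin;             // Super-block scale for the sub-block mins
    uint8_t scales[K_SCALE_SIZE]; // 8 scales + 8 mins, 6 bits each, packed
    uint8_t qs[QK_K/2];           // 4-bit quantized values
} block_q4_K;                     // 144 bytes per 256 weights (4.5 bits/weight)
void dequantize_row_q4_K(const block_q4_K * x, float * y, int k) {
    for (int i = 0; i < k / QK_K; i++) {
        const uint8_t * q = x[i].qs;
        const float d    = GGML_FP16_TO_FP32(x[i].d);
        const float dmin = GGML_FP16_TO_FP32(x[i].dmin);
        int is = 0;
        uint8_t sc, m;
        // Each 64-weight chunk holds two sub-blocks: low nibbles, then high nibbles
        for (int j = 0; j < QK_K; j += 64) {
            get_scale_min_k4(is + 0, x[i].scales, &sc, &m);
            const float d1 = d * sc, m1 = dmin * m;
            get_scale_min_k4(is + 1, x[i].scales, &sc, &m);
            const float d2 = d * sc, m2 = dmin * m;
            for (int l = 0; l < 32; l++) *y++ = d1 * (q[l] & 0xF) - m1;
            for (int l = 0; l < 32; l++) *y++ = d2 * (q[l] >> 4) - m2;
            q += 32; is += 2;
        }
    }
}
EMBODIOS version: Output to fixed_t instead of float.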
Deliverables:
- Port Q4_K dequantization
- Convert to fixed-point output
- Unit test: compare with llama.cpp
- Performance: >1M weights/sec
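Producing fixed-point output means the FP16 scale factors stored in each block also have to be converted without floating-point instructions. A sketch of that conversion, assuming Q16.16 `fixed_t` and saturation on overflow; `fp16_to_fixed` is a hypothetical name, not an existing EMBODIOS or llama.cpp function:

```c
#include <stdint.h>

typedef int32_t fixed_t;              /* assumed Q16.16 */

/* Convert an IEEE 754 binary16 bit pattern to Q16.16 using integer ops only.
 * Values outside the Q16.16 range (including inf/NaN) saturate. */
fixed_t fp16_to_fixed(uint16_t h) {
    const int      sign = h >> 15;
    const int      exp  = (h >> 10) & 0x1F;
    const uint32_t mant = h & 0x3FF;
    int64_t mag;

    if (exp == 0) {
        /* Subnormal: mant * 2^-24, which is mant * 2^-8 in Q16.16 units */
        mag = mant >> 8;
    } else if (exp >= 30) {
        /* 2^15 and above (or inf/NaN) does not fit: saturate */
        mag = INT32_MAX;
    } else {
        /* Normal: (1024 + mant) * 2^(exp - 25); in Q16.16 units that is
         * (1024 + mant) shifted by (exp - 9) */
        const int shift = exp - 9;
        const int64_t sig = 1024 + (int64_t)mant;
        mag = (shift >= 0) ? (sig << shift) : (sig >> -shift);
    }
    return (fixed_t)(sign ? -mag : mag);
}
```

Scale factors in quantized models are usually small, so the low 16 fractional bits matter more than the saturation path; very small scales lose precision in Q16.16 and may call for a format with more fractional bits.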
Source: ggml-quants.c:dequantize_row_q5_K() and dequantize_row_q6_K()
Deliverables:
- Port Q5_K (5-bit quantization)
- Port Q6_K (6-bit quantization)
- Unit tests for both
- Document memory/speed tradeoffs
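For the memory side of the tradeoff, storage per tensor follows directly from the super-block size. The sketch below assumes the upstream k-quant layouts (144, 176, and 210 bytes per 256-weight super-block for Q4_K, Q5_K, and Q6_K); these sizes should be verified against the pinned llama.cpp revision, and the enum and function names are hypothetical.

```c
#include <stdint.h>

#define QK_K 256  /* weights per super-block */

enum kquant_type { KQ_Q4_K, KQ_Q5_K, KQ_Q6_K };

/* Bytes per super-block under the assumed layouts:
 *   Q4_K: d + dmin + 12 packed scale bytes + 4-bit quants
 *   Q5_K: as Q4_K plus one high bit per weight
 *   Q6_K: low 4 bits + high 2 bits + 16 int8 scales + d */
static uint32_t kquant_block_bytes(enum kquant_type t) {
    switch (t) {
        case KQ_Q4_K: return 2 + 2 + 12 + QK_K / 2;                /* 144 */
        case KQ_Q5_K: return 2 + 2 + 12 + QK_K / 8 + QK_K / 2;     /* 176 */
        case KQ_Q6_K: return QK_K / 2 + QK_K / 4 + QK_K / 16 + 2;  /* 210 */
    }
    return 0;
}

/* Storage for a tensor of n_weights weights (must be a multiple of QK_K). */
uint64_t kquant_tensor_bytes(enum kquant_type t, uint64_t n_weights) {
    return (n_weights / QK_K) * kquant_block_bytes(t);
}
```

Under these assumptions the formats cost 4.5, 5.5, and 6.5625 bits per weight respectively, so a 2048 x 2048 matrix takes 2,359,296 bytes in Q4_K.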
Source: ggml-quants.c (SSE/AVX2 versions)
llama.cpp has SIMD-optimized quantization kernels:
#if defined(__SSE2__)
void dequantize_row_q4_K_sse2(...) {
// 4x faster than scalar
}
#endif
#if defined(__AVX2__)
void dequantize_row_q4_K_avx2(...) {
// 8x faster than scalar
}
#endif
EMBODIOS: Port SSE2 version first, AVX2 later (optional).
Deliverables:
- Port SSE2 quantization kernels
- Measure speedup (target: 3-4x)
- Optional: Port AVX2 kernels
Goal: Port GGUF loading logic from llama.cpp:llama_load_model().
Source: llama.cpp:llama_model_load_internal()
Key insight: llama.cpp has robust error handling for GGUF parsing.
// llama.cpp approach
struct llama_model * llama_load_model_from_file(
const char * path,
struct llama_model_params params) {
// Memory-map file
void * data = mmap(...);
// Parse GGUF header
struct gguf_context * ctx = gguf_init_from_file(path, params);
// Extract hyperparameters
int n_vocab = gguf_get_int(ctx, "llama.vocab_size");
int n_embd = gguf_get_int(ctx, "llama.embedding_length");
int n_layer = gguf_get_int(ctx, "llama.block_count");
// Load tensors
for (int i = 0; i < gguf_get_n_tensors(ctx); i++) {
const char * name = gguf_get_tensor_name(ctx, i);
struct ggml_tensor * tensor = ggml_get_tensor(ctx->ctx_data, name);
// Store tensor...
}
}
EMBODIOS adaptation:
- Replace mmap with our file API
- Use our heap allocator instead of ggml_context
Deliverables:
- Port GGUF parsing logic
- Extract all hyperparameters
- Load all tensors to memory
- Unit test: load TinyLlama successfully
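Without mmap or ggml, parsing can start from a model image already in memory. The sketch below reads only the fixed-size GGUF header (magic, version, tensor count, metadata key/value count) with explicit bounds checks. The struct and function names are hypothetical, and the layout assumed is the GGUF v2+ header with 64-bit counts.

```c
#include <stddef.h>
#include <stdint.h>

struct embodios_gguf_header {
    uint32_t version;
    uint64_t tensor_count;
    uint64_t metadata_kv_count;
};

/* Little-endian readers that work on any host byte order and alignment. */
static uint32_t rd_u32(const uint8_t* p) {
    return (uint32_t)p[0] | (uint32_t)p[1] << 8 |
           (uint32_t)p[2] << 16 | (uint32_t)p[3] << 24;
}

static uint64_t rd_u64(const uint8_t* p) {
    return (uint64_t)rd_u32(p) | (uint64_t)rd_u32(p + 4) << 32;
}

/* Parse the 24-byte header from a buffer in memory.
 * Returns 0 on success, -1 on a short buffer, bad magic, or version < 2. */
int embodios_gguf_parse_header(const uint8_t* buf, size_t len,
                               struct embodios_gguf_header* out) {
    if (len < 24) return -1;
    if (buf[0] != 'G' || buf[1] != 'G' || buf[2] != 'U' || buf[3] != 'F') {
        return -1;
    }
    out->version = rd_u32(buf + 4);
    if (out->version < 2) return -1;   /* v1 used 32-bit counts */
    out->tensor_count      = rd_u64(buf + 8);
    out->metadata_kv_count = rd_u64(buf + 16);
    return 0;
}
```

The same pattern (check remaining length, then read little-endian) extends to the metadata key/value pairs and tensor descriptors that follow the header.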
Source: llama.cpp:llama_load_vocab()
Critical: llama.cpp supports multiple tokenizer types:
enum llama_vocab_type {
LLAMA_VOCAB_TYPE_SPM, // SentencePiece
LLAMA_VOCAB_TYPE_BPE, // Byte-Pair Encoding (GPT-2)
};
// llama.cpp loads tokenizer from GGUF
void llama_load_vocab(struct gguf_context * ctx, struct llama_vocab * vocab) {
const char * tokenizer_model = gguf_get_string(ctx, "tokenizer.ggml.model");
if (strcmp(tokenizer_model, "gpt2") == 0) {
vocab->type = LLAMA_VOCAB_TYPE_BPE;
load_bpe_vocab(ctx, vocab);
} else if (strcmp(tokenizer_model, "llama") == 0) {
vocab->type = LLAMA_VOCAB_TYPE_SPM;
load_spm_vocab(ctx, vocab);
}
// Load token strings
int n_vocab = gguf_get_int(ctx, "tokenizer.ggml.vocab_size");
for (int i = 0; i < n_vocab; i++) {
const char * token = gguf_get_string_array(ctx, "tokenizer.ggml.tokens", i);
float score = gguf_get_float_array(ctx, "tokenizer.ggml.scores", i);
// Store token...
}
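}
EMBODIOS: Support BPE for v1.0, SentencePiece later. Confirm the tokenizer.ggml.model value of the target TinyLlama file first: Llama-family GGUF files usually report "llama", which takes the LLAMA_VOCAB_TYPE_SPM branch above.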
Deliverables:
- Port tokenizer loading logic
- Support BPE tokenizer
- Load vocabulary from GGUF
- Unit test: tokenize "Once upon a time"
Source: llama.cpp:llama_model_load_internal()
Critical insight: llama.cpp organizes tensors hierarchically:
// GGUF tensor naming convention used by llama.cpp (per-layer tensors are blk.<layer>.<name>)
blk.0.attn_q.weight       // Query weights, layer 0
blk.0.attn_k.weight       // Key weights, layer 0
blk.0.attn_v.weight       // Value weights, layer 0
blk.0.attn_output.weight  // Attention output projection, layer 0
blk.0.ffn_gate.weight     // FFN gate projection, layer 0
blk.0.ffn_down.weight     // FFN down projection, layer 0
blk.0.ffn_up.weight       // FFN up projection, layer 0
EMBODIOS: Replicate this structure for compatibility.
struct llama_layer {
// Attention weights
fixed_t* wq; // Query
fixed_t* wk; // Key
fixed_t* wv; // Value
fixed_t* wo; // Output
// FFN weights
fixed_t* w1; // Gate
fixed_t* w2; // Down
fixed_t* w3; // Up
// Normalization weights
fixed_t* attn_norm;
fixed_t* ffn_norm;
};
struct llama_model {
int n_vocab;
int n_embd;
int n_layer;
int n_head;
int n_head_kv;
fixed_t* tok_embeddings;
struct llama_layer* layers;
fixed_t* output_norm;
fixed_t* output;
};
Deliverables:
- Define model structure
- Load all tensors into structure
- Verify tensor shapes match GGUF
- Document memory layout
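Loading tensors into the structure needs a mapping from tensor names to fields. GGUF files produced by llama.cpp's converters name per-layer tensors `blk.<layer>.<suffix>`; the sketch below parses such a name into a layer index and a slot. The enum, the function name, and the assignment of suffixes to `w1`/`w2`/`w3` (gate/down/up, following the struct comments above) are assumptions for illustration.

```c
#include <string.h>

enum layer_slot {
    SLOT_NONE = -1,
    SLOT_WQ, SLOT_WK, SLOT_WV, SLOT_WO,
    SLOT_W1, SLOT_W2, SLOT_W3,
    SLOT_ATTN_NORM, SLOT_FFN_NORM
};

/* Map "blk.<layer>.<suffix>" to a llama_layer field.
 * Returns SLOT_NONE for non-layer tensors such as "output.weight". */
enum layer_slot parse_layer_tensor_name(const char* name, int* layer_out) {
    static const struct { const char* suffix; enum layer_slot slot; } table[] = {
        { "attn_q.weight",      SLOT_WQ },
        { "attn_k.weight",      SLOT_WK },
        { "attn_v.weight",      SLOT_WV },
        { "attn_output.weight", SLOT_WO },
        { "ffn_gate.weight",    SLOT_W1 },
        { "ffn_down.weight",    SLOT_W2 },
        { "ffn_up.weight",      SLOT_W3 },
        { "attn_norm.weight",   SLOT_ATTN_NORM },
        { "ffn_norm.weight",    SLOT_FFN_NORM },
    };

    if (strncmp(name, "blk.", 4) != 0) return SLOT_NONE;

    /* Parse the decimal layer index by hand (no strtol in the kernel) */
    const char* p = name + 4;
    if (*p < '0' || *p > '9') return SLOT_NONE;
    int layer = 0;
    while (*p >= '0' && *p <= '9') {
        layer = layer * 10 + (*p++ - '0');
    }
    if (*p != '.') return SLOT_NONE;
    p++;

    for (size_t i = 0; i < sizeof table / sizeof table[0]; i++) {
        if (strcmp(p, table[i].suffix) == 0) {
            *layer_out = layer;
            return table[i].slot;
        }
    }
    return SLOT_NONE;
}
```

A loader could call this for every tensor in the file, reject layer indices at or above `n_layer`, and report any slot left unfilled, which doubles as the shape-verification pass.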
Goal: Port transformer inference from llama.cpp:llama_decode().
Source: llama.cpp:llm_build_llama()
llama.cpp uses computational graph approach:
static struct ggml_cgraph * llm_build_llama(
struct llama_context * lctx,
const llama_batch & batch) {
struct ggml_context * ctx = lctx->ctx_builder;
struct ggml_cgraph * gf = ggml_new_graph(ctx);
// 1. Token embedding
struct ggml_tensor * inpL = ggml_get_rows(ctx,
model.tok_embeddings, batch.token);
// 2. For each layer
for (int il = 0; il < n_layer; il++) {
// RMSNorm
struct ggml_tensor * cur = ggml_rms_norm(ctx, inpL, eps);
cur = ggml_mul(ctx, cur, model.layers[il].attn_norm);
// Attention
struct ggml_tensor * Qcur = ggml_mul_mat(ctx, model.layers[il].wq, cur);
struct ggml_tensor * Kcur = ggml_mul_mat(ctx, model.layers[il].wk, cur);
struct ggml_tensor * Vcur = ggml_mul_mat(ctx, model.layers[il].wv, cur);
// RoPE
Qcur = ggml_rope_custom(ctx, Qcur, ...);
Kcur = ggml_rope_custom(ctx, Kcur, ...);
// Attention scores
struct ggml_tensor * kq = ggml_mul_mat(ctx, Kcur, Qcur);
kq = ggml_soft_max(ctx, kq);
// Attention output
cur = ggml_mul_mat(ctx, Vcur, kq);
cur = ggml_mul_mat(ctx, model.layers[il].wo, cur);
// Residual
inpL = ggml_add(ctx, inpL, cur);
// FFN (similar structure)...
}
// Build and return graph
ggml_build_forward_expand(gf, cur);
return gf;
}
EMBODIOS approach: Imperative (no graph), direct computation.
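// Scratch buffers of n_embd elements each, allocated once at boot
static fixed_t* x;       // Working activations
static fixed_t* x_orig;  // Input of the current sub-layer, kept for the residual
void llama_forward(
        struct llama_model* model,
        int token,
        int pos,
        fixed_t* logits) {
    const size_t row_bytes = model->n_embd * sizeof(fixed_t);
    // 1. Embedding lookup (copy, so in-place ops never overwrite the weights)
    memcpy(x, &model->tok_embeddings[token * model->n_embd], row_bytes);
    // 2. For each layer
    for (int layer = 0; layer < model->n_layer; layer++) {
        struct llama_layer* l = &model->layers[layer];
        // Pre-attention RMSNorm (save the input for the residual first)
        memcpy(x_orig, x, row_bytes);
        rmsnorm(x, l->attn_norm, model->n_embd);
        // Attention
        attention(x, l, pos, model);
        // Residual: x += x_orig
        add_residual(x, x_orig, model->n_embd);
        // Pre-FFN RMSNorm (save the input for the residual first)
        memcpy(x_orig, x, row_bytes);
        rmsnorm(x, l->ffn_norm, model->n_embd);
        // FFN
        ffn_swiglu(x, l, model);
        // Residual: x += x_orig
        add_residual(x, x_orig, model->n_embd);
    }
    // 3. Final RMSNorm + output
    rmsnorm(x, model->output_norm, model->n_embd);
    matmul(x, model->output, logits, model->n_embd, model->n_vocab);
}
Deliverables: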
- Implement forward pass
- Match llama.cpp layer-by-layer
- Unit test: compare intermediate activations
- Integration test: compare final logits
Source: llama.cpp:llama_kv_cache
llama.cpp KV cache structure:
struct llama_kv_cache {
struct ggml_tensor * k; // All key tensors
struct ggml_tensor * v; // All value tensors
int n_ctx; // Max context length (e.g., 2048)
int n_layer; // Number of layers
};
void llama_kv_cache_update(
struct llama_kv_cache * cache,
const struct llama_batch * batch) {
// Store K, V for each position
for (int pos = 0; pos < batch->n_tokens; pos++) {
for (int layer = 0; layer < n_layer; layer++) {
// Copy K, V to cache
memcpy(&cache->k[layer][pos * n_embd], K_cur, n_embd * sizeof(float));
memcpy(&cache->v[layer][pos * n_embd], V_cur, n_embd * sizeof(float));
}
}
}
EMBODIOS: Similar structure, but use fixed_t.
Deliverables:
- Allocate KV cache (n_layer x n_ctx x n_embd)
- Store K, V during forward pass
- Reuse cached values for past tokens
- Measure: 2x speedup on autoregressive generation
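A sketch of what the fixed_t cache could look like with caller-provided memory and no allocation after boot. It assumes a flat `[layer][pos][kv_dim]` layout; the struct and function names are hypothetical. The row width is called `kv_dim` because with grouped-query attention (`n_head_kv < n_head`) K and V rows are narrower than `n_embd`.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef int32_t fixed_t;              /* assumed Q16.16 */

struct kv_cache {
    fixed_t* k;      /* n_layer * n_ctx * kv_dim elements */
    fixed_t* v;      /* same layout as k */
    int n_layer;
    int n_ctx;
    int kv_dim;      /* n_head_kv * head_dim; equals n_embd without GQA */
};

/* Elements needed for each of k and v; the caller reserves this at boot. */
size_t kv_cache_elems(int n_layer, int n_ctx, int kv_dim) {
    return (size_t)n_layer * (size_t)n_ctx * (size_t)kv_dim;
}

/* Start of the row for (layer, pos) inside k or v. */
static size_t kv_row_offset(const struct kv_cache* c, int layer, int pos) {
    return ((size_t)layer * (size_t)c->n_ctx + (size_t)pos) * (size_t)c->kv_dim;
}

/* Store the current token's K and V rows.
 * Returns 0 on success, -1 if layer or pos is out of range. */
int kv_cache_store(struct kv_cache* c, int layer, int pos,
                   const fixed_t* k_cur, const fixed_t* v_cur) {
    if (layer < 0 || layer >= c->n_layer || pos < 0 || pos >= c->n_ctx) {
        return -1;
    }
    const size_t off = kv_row_offset(c, layer, pos);
    memcpy(c->k + off, k_cur, (size_t)c->kv_dim * sizeof(fixed_t));
    memcpy(c->v + off, v_cur, (size_t)c->kv_dim * sizeof(fixed_t));
    return 0;
}
```

Total cache memory under this layout is `2 * n_layer * n_ctx * kv_dim * sizeof(fixed_t)` bytes, which can be computed and reserved once the GGUF hyperparameters are known.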
Source: llama.cpp:llama_sample_*() and common/sampling.cpp
llama.cpp sampling functions:
// Greedy sampling (argmax)
llama_token llama_sample_token_greedy(
struct llama_context * ctx,
llama_token_data_array * candidates);
// Temperature sampling
llama_token llama_sample_token(
struct llama_context * ctx,
llama_token_data_array * candidates,
float temp);
// Top-k sampling
void llama_sample_top_k(
struct llama_context * ctx,
llama_token_data_array * candidates,
int k);
// Top-p (nucleus) sampling
void llama_sample_top_p(
struct llama_context * ctx,
llama_token_data_array * candidates,
float p);
EMBODIOS: Port greedy and top-k for v1.0.
Deliverables:
- Port greedy sampling
- Port top-k sampling
- Port temperature scaling
- Unit test: compare sampled tokens with llama.cpp
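Greedy and top-k selection need only comparisons, so they carry over to fixed-point logits without change. A minimal sketch, assuming Q16.16 `fixed_t` logits and a caller-provided index buffer; these are simplified stand-ins with hypothetical names, not ports of the llama.cpp functions listed above.

```c
#include <stdint.h>

typedef int32_t fixed_t;              /* assumed Q16.16 */

/* Greedy sampling: index of the largest logit (first one wins on ties). */
int sample_greedy(const fixed_t* logits, int n_vocab) {
    int best = 0;
    for (int i = 1; i < n_vocab; i++) {
        if (logits[i] > logits[best]) best = i;
    }
    return best;
}

/* Top-k: write the indices of the k largest logits to out_ids in
 * descending logit order. Uses insertion into a small sorted array,
 * O(n_vocab * k) time and no heap. Returns the number of ids written. */
int select_top_k(const fixed_t* logits, int n_vocab, int k, int* out_ids) {
    if (k <= 0) return 0;
    int count = 0;
    for (int i = 0; i < n_vocab; i++) {
        int j = count;
        if (count == k) {
            /* Full: skip unless i beats the current smallest entry */
            if (logits[i] <= logits[out_ids[k - 1]]) continue;
            j = k - 1;
        } else {
            count++;
        }
        /* Shift smaller entries right, then insert i at its position */
        while (j > 0 && logits[out_ids[j - 1]] < logits[i]) {
            out_ids[j] = out_ids[j - 1];
            j--;
        }
        out_ids[j] = i;
    }
    return count;
}
```

With temperature 0 the generation loop would call `sample_greedy` directly; for top-k sampling, the selected ids would then go through a softmax over just those k logits and a random draw.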
Goal: Port BPE tokenizer from llama.cpp.
Source: unicode.cpp and llama.cpp:llama_tokenize_internal()
llama.cpp BPE algorithm:
std::vector<llama_vocab::id> llama_tokenize_internal(
const llama_vocab & vocab,
std::string text,
bool bos,
bool special) {
std::vector<llama_vocab::id> output;
// Add BOS token
if (bos) {
output.push_back(vocab.special_bos_id);
}
// Convert UTF-8 to codepoints
std::vector<uint32_t> codepoints = unicode_cpts_from_utf8(text);
// Apply BPE merges
for (auto & merge : vocab.bpe_merges) {
// Find and merge pairs...
}
// Map to token IDs
for (auto & token : tokens) {
output.push_back(vocab.token_to_id[token]);
}
return output;
}
EMBODIOS: Port to pure C.
Deliverables:
- Port UTF-8 → codepoint conversion
- Port BPE merge algorithm
- Load BPE merges from GGUF
- Unit test: "Once upon a time" → same token IDs as llama.cpp
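The UTF-8 to codepoint step is the most self-contained piece to port to pure C. A sketch of a decoder that works on a byte buffer and a caller-provided output array; the function names are hypothetical, and unlike a production decoder it does not reject overlong encodings or surrogate codepoints.

```c
#include <stddef.h>
#include <stdint.h>

/* Decode one UTF-8 sequence starting at s[*pos] and advance *pos past it.
 * Requires *pos < len. Returns the codepoint, or U+FFFD for malformed or
 * truncated input (in which case *pos advances by one byte). */
uint32_t utf8_next(const uint8_t* s, size_t len, size_t* pos) {
    const uint8_t b0 = s[*pos];
    uint32_t cp;
    size_t extra;   /* number of continuation bytes expected */

    if (b0 < 0x80)                { cp = b0;        extra = 0; }
    else if ((b0 & 0xE0) == 0xC0) { cp = b0 & 0x1F; extra = 1; }
    else if ((b0 & 0xF0) == 0xE0) { cp = b0 & 0x0F; extra = 2; }
    else if ((b0 & 0xF8) == 0xF0) { cp = b0 & 0x07; extra = 3; }
    else { (*pos)++; return 0xFFFD; }

    /* Truncated sequence at the end of the buffer */
    if (*pos + extra >= len) { (*pos)++; return 0xFFFD; }

    for (size_t i = 1; i <= extra; i++) {
        const uint8_t b = s[*pos + i];
        if ((b & 0xC0) != 0x80) { (*pos)++; return 0xFFFD; }
        cp = (cp << 6) | (b & 0x3F);
    }
    *pos += extra + 1;
    return cp;
}

/* Decode a whole buffer; returns the number of codepoints written (<= cap). */
size_t utf8_to_codepoints(const uint8_t* s, size_t len,
                          uint32_t* out, size_t cap) {
    size_t pos = 0, n = 0;
    while (pos < len && n < cap) {
        out[n++] = utf8_next(s, len, &pos);
    }
    return n;
}
```

Because the output array is caller-provided, the tokenizer can size it from the prompt length (at most one codepoint per input byte) and avoid any allocation.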
Source: llama.cpp:llama_detokenize()
llama.cpp decoding:
std::string llama_detokenize_bpe(
const llama_vocab & vocab,
const std::vector<llama_vocab::id> & tokens) {
std::string text;
for (auto id : tokens) {
// Skip special tokens
if (id == vocab.special_bos_id || id == vocab.special_eos_id) {
continue;
}
// Get token string
const std::string & token_str = vocab.id_to_token[id].text;
text += token_str;
}
return text;
}
EMBODIOS: Port to C.
Deliverables:
- Port token ID → string mapping
- Handle special tokens
- Unit test: roundtrip encoding/decoding
Goal: Ensure EMBODIOS output matches llama.cpp exactly.
Create comparison test:
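#!/usr/bin/env python3
"""
Compare EMBODIOS output with llama.cpp on the same prompt.
"""
import subprocess
import sys

def parse_tokens(raw_output):
    # Placeholder: split the decoded output on whitespace. Replace with a
    # parser for real token IDs once both tools can print them.
    return raw_output.decode("utf-8", errors="replace").split()

prompt = "Once upon a time"
# Run llama.cpp
llamacpp_output = subprocess.check_output([
    "./llama.cpp/main",
    "-m", "models/tinyllama-1.1b-q4km.gguf",
    "-p", prompt,
    "-n", "50",
    "--temp", "0.0"  # Greedy for deterministic output
])
# Run EMBODIOS
embodios_output = subprocess.check_output([
    "./embodios_cli",
    "--model", "models/tinyllama-1.1b-q4km.gguf",
    "--prompt", prompt,
    "--n-predict", "50",
    "--temp", "0.0"
])
# Compare token-by-token
llamacpp_tokens = parse_tokens(llamacpp_output)
embodios_tokens = parse_tokens(embodios_output)
for i, (l, e) in enumerate(zip(llamacpp_tokens, embodios_tokens)):
    if l != e:
        print(f"MISMATCH at token {i}: llama.cpp={l}, embodios={e}")
        sys.exit(1)
# zip() stops at the shorter list, so check the lengths as well
if len(llamacpp_tokens) != len(embodios_tokens):
    print(f"LENGTH MISMATCH: llama.cpp={len(llamacpp_tokens)}, embodios={len(embodios_tokens)}")
    sys.exit(1)
print("✅ ALL TOKENS MATCH!")
Deliverables: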
- Create comparison script
- Test 10 different prompts
- Achieve 100% token match on greedy sampling
- Document any discrepancies
Benchmark against llama.cpp:
| Metric | llama.cpp | EMBODIOS v1.0 | Target |
|---|---|---|---|
| Tokens/sec | 83-86 | TBD | 85+ |
| First token (ms) | ~50 | TBD | <20 |
| Memory (MB) | 160 | 120 | <150 |
| Latency jitter (ms) | ±5-10 | ±0.5 | <±1 |
Deliverables:
- Benchmark 1000-token generation
- Compare speed, memory, latency
- Identify performance gaps
- Create optimization plan
Note: Implementation follows a chapter-by-chapter approach. Each phase builds upon the previous one.
Phase 1: GGML Tensor Operations (Chapter 1)
├─ Matrix multiplication
├─ RoPE and Softmax
└─ RMSNorm and testing
Phase 2: Quantization Kernels (Chapter 2)
├─ Q4_K dequantization
├─ Q5_K and Q6_K support
└─ SIMD optimizations
Phase 3: Model Loading (Chapter 3)
├─ GGUF metadata parsing
├─ Tokenizer loading
└─ Weight tensor organization
Phase 4: Inference Pipeline
├─ Transformer forward pass
├─ KV cache implementation
└─ Sampling strategies
Phase 5: Tokenization
├─ BPE encoding
└─ BPE decoding
Phase 6: Validation
├─ Token-by-token comparison with llama.cpp
└─ Performance benchmarking
v1.0 llama.cpp integration is complete when:
- ✅ Correctness:
  - Generate same tokens as llama.cpp (greedy sampling)
  - Pass token-by-token comparison on 10 prompts
  - Tokenizer roundtrip matches llama.cpp
- ✅ Performance:
  - 85+ tokens/sec (match/exceed llama.cpp)
  - <20ms first token latency
  - ±0.5ms latency jitter
- ✅ Compatibility:
  - Load any GGUF model llama.cpp can load
  - Support Q4_K_M, Q5_K_M, Q6_K quantization
  - TinyLlama, Phi-2, Mistral-7B work
- ✅ Code Quality:
  - Pure C (no C++ dependencies)
  - Kernel-safe (no malloc/free after boot)
  - Well-commented with llama.cpp references
| llama.cpp | EMBODIOS | Status |
|---|---|---|
| ggml.c:ggml_mul_mat | kernel/ai/ops/matmul.c | TODO |
| ggml.c:ggml_rope | kernel/ai/ops/rope.c | TODO |
| ggml.c:ggml_soft_max | kernel/ai/ops/softmax.c | TODO |
| ggml.c:ggml_rms_norm | kernel/ai/rmsnorm.c | TODO |
| ggml-quants.c:dequantize_row_q4_K | kernel/gguf/gguf_quant.c | TODO |
| llama.cpp:llama_load_model | kernel/gguf/gguf_parser.c | TODO |
| llama.cpp:llama_decode | kernel/ai/transformer.c | TODO |
| llama.cpp:llama_kv_cache | kernel/ai/kv_cache.c | TODO |
| llama.cpp:llama_sample_* | kernel/ai/sampling.c | TODO |
| unicode.cpp | kernel/ai/tokenizer/unicode.c | TODO |
| llama.cpp:llama_tokenize | kernel/ai/tokenizer/bpe.c | TODO |
- LLAMA-001: Port ggml_mul_mat from llama.cpp
- LLAMA-002: Port ggml_rope from llama.cpp
- LLAMA-003: Port ggml_soft_max from llama.cpp
- LLAMA-004: Port ggml_rms_norm from llama.cpp
- LLAMA-005: Port Q4_K dequantization from llama.cpp
- LLAMA-006: Port Q5_K and Q6_K dequantization
- LLAMA-007: Port GGUF metadata parsing
- LLAMA-008: Port tokenizer loading logic
- LLAMA-009: Port weight tensor organization
- LLAMA-010: Port transformer forward pass
- LLAMA-011: Port KV cache implementation
- LLAMA-012: Port sampling strategies
- LLAMA-013: Port BPE encoding
- LLAMA-014: Port BPE decoding
- LLAMA-015: Create token-by-token comparison test
- LLAMA-016: Benchmark against llama.cpp
Total: 16 issues
#embodios #llamacpp #ggml #integration #transformer #inference #pillar-1