llama.cpp Integration Roadmap

Goal: Port the llama.cpp reference implementation to the EMBODIOS kernel
Status: Not started (0%)
Last Updated: 14 December 2025
Strategy: Selective porting, not wholesale copy

Why llama.cpp?

llama.cpp is the de facto reference implementation for LLM inference:

✅ Production-tested (millions of users)
✅ Highly optimized (SIMD, quantization, KV cache)
✅ Compatible with GGUF format
✅ Clean C/C++ codebase (minimal dependencies)
✅ Battle-tested tokenizers (BPE, SentencePiece)
✅ Extensive model support (Llama, Mistral, Phi, etc.)

Why not just use llama.cpp directly?

❌ Requires userspace runtime (pthreads, C++ stdlib)
❌ Uses mmap() for model loading
❌ Relies on POSIX APIs
❌ Has C++ dependencies we can't use in the kernel

Our approach: Selective porting

Extract core algorithms (transformer, attention, quantization)
Rewrite in pure C for kernel environment
Adapt to EMBODIOS memory model
Maintain API compatibility for testing

llama.cpp Architecture Analysis

Core Components We Need

llama.cpp/
├── ggml.c                    # Tensor operations (CRITICAL)
│   ├── ggml_compute_forward   → Port to kernel/ai/ops/
│   ├── ggml_mul_mat            → matmul.c
│   ├── ggml_rope               → rope.c
│   └── ggml_soft_max           → softmax.c
│
├── llama.cpp                 # Model loading & inference (CRITICAL)
│   ├── llama_load_model       → gguf_parser.c
│   ├── llama_decode           → transformer.c
│   ├── llama_kv_cache         → kv_cache.c
│   └── llama_sampling         → sampling.c
│
├── ggml-quants.c             # Quantization kernels (CRITICAL)
│   ├── dequantize_row_q4_K   → gguf_quant.c
│   ├── dequantize_row_q5_K   → gguf_quant.c
│   └── dequantize_row_q6_K   → gguf_quant.c
│
├── unicode.cpp               # UTF-8 handling (NEEDED)
│   └── codepoint handling     → tokenizer/unicode.c
│
└── common/                   # Utilities (SELECTIVE)
    ├── sampling               → sampling.c
    └── grammar (skip for v1.0)

Components We DON'T Need

llama.cpp/
├── examples/                 # Skip (userspace CLI tools)
├── tests/                    # Skip (use our own tests)
├── ggml-backend.c            # Skip (GPU/Metal/CUDA)
├── ggml-alloc.c              # Skip (use our heap)
└── llm.cpp                   # Skip (LLM-specific, use llama.cpp)

Integration Phases

Chapter 1: GGML Tensor Operations

Goal: Port core tensor operations from ggml.c to EMBODIOS.

Step 1: Matrix Multiplication

Source: ggml.c:ggml_compute_forward_mul_mat()

Port to: kernel/ai/ops/matmul.c

Key Changes:

// llama.cpp version (floating-point)
void ggml_compute_forward_mul_mat_f32(
    const struct ggml_tensor * src0,
    const struct ggml_tensor * src1,
    struct ggml_tensor * dst) {

    const float * src0_data = (float *) src0->data;
    const float * src1_data = (float *) src1->data;
    float * dst_data = (float *) dst->data;

    // Matrix multiply logic...
}

// EMBODIOS version (fixed-point)
void embodios_matmul_fixed(
    const fixed_t* a,      // M x K
    const fixed_t* b,      // K x N
    fixed_t* c,            // M x N
    int M, int K, int N) {

    for (int i = 0; i < M; i++) {
        for (int j = 0; j < N; j++) {
            fixed_t sum = 0;
            for (int k = 0; k < K; k++) {
                sum += FIXED_MUL(a[i*K + k], b[k*N + j]);
            }
            c[i*N + j] = sum;
        }
    }
}

Deliverables:

Port matmul (naive version)
Add SIMD optimization (SSE2/AVX2)
Unit tests: compare with llama.cpp output
Performance: within 20% of llama.cpp
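The fixed-point listing above depends on `fixed_t` and `FIXED_MUL`, which this page does not define. The sketch below assumes a Q16.16 layout in a 32-bit integer; that layout and the helper names (`float_to_fixed`, `fixed_to_float`, `matmul_max_error`) are illustrative assumptions, not the EMBODIOS definitions. It also shows one way a host-side unit test could measure the error of the fixed-point result against a float reference.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: Q16.16 fixed point in a signed 32-bit integer. */
typedef int32_t fixed_t;
#define FIXED_SHIFT 16
#define FIXED_ONE   ((fixed_t)1 << FIXED_SHIFT)

/* Widen to 64 bits so the intermediate product cannot overflow.
 * Relies on arithmetic right shift of negative values (true for GCC/Clang). */
#define FIXED_MUL(a, b) ((fixed_t)(((int64_t)(a) * (int64_t)(b)) >> FIXED_SHIFT))

/* Host-side test helpers only; kernel code would stay in fixed point. */
static fixed_t float_to_fixed(float v) {
    return (fixed_t)(v * (float)FIXED_ONE + (v >= 0.0f ? 0.5f : -0.5f));
}
static float fixed_to_float(fixed_t v) { return (float)v / (float)FIXED_ONE; }

/* Same loop structure as embodios_matmul_fixed: a is M x K, b is K x N. */
static void matmul_fixed(const fixed_t *a, const fixed_t *b, fixed_t *c,
                         int M, int K, int N) {
    for (int i = 0; i < M; i++) {
        for (int j = 0; j < N; j++) {
            fixed_t sum = 0;
            for (int k = 0; k < K; k++) {
                sum += FIXED_MUL(a[i * K + k], b[k * N + j]);
            }
            c[i * N + j] = sum;
        }
    }
}

#define TEST_MAX_ELEMS 64

/* Largest absolute difference between the fixed-point and float products. */
static float matmul_max_error(const float *a, const float *b, int M, int K, int N) {
    fixed_t fa[TEST_MAX_ELEMS], fb[TEST_MAX_ELEMS], fc[TEST_MAX_ELEMS];
    assert(M * K <= TEST_MAX_ELEMS && K * N <= TEST_MAX_ELEMS && M * N <= TEST_MAX_ELEMS);
    for (int i = 0; i < M * K; i++) fa[i] = float_to_fixed(a[i]);
    for (int i = 0; i < K * N; i++) fb[i] = float_to_fixed(b[i]);
    matmul_fixed(fa, fb, fc, M, K, N);

    float worst = 0.0f;
    for (int i = 0; i < M; i++) {
        for (int j = 0; j < N; j++) {
            float ref = 0.0f;
            for (int k = 0; k < K; k++) ref += a[i * K + k] * b[k * N + j];
            float err = ref - fixed_to_float(fc[i * N + j]);
            if (err < 0.0f) err = -err;
            if (err > worst) worst = err;
        }
    }
    return worst;
}
```

With Q16.16 the accumulator in `matmul_fixed` can still overflow for long rows of large values, so a real port would likely accumulate in 64 bits and narrow once per output element.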

Step 2: RoPE (Rotary Position Embedding)

Source: ggml.c:ggml_compute_forward_rope()

Port to: kernel/ai/ops/rope.c

Key Insight: llama.cpp precomputes sin/cos tables. Copy this optimization.

// llama.cpp approach
struct ggml_tensor * ggml_rope_custom(
    struct ggml_context * ctx,
    struct ggml_tensor * a,
    int n_dims,
    int mode,
    int n_ctx,
    float freq_base,
    float freq_scale,
    ...) {

    // Precompute the per-pair rotation frequencies
    float * freq = malloc(n_dims/2 * sizeof(float));
    for (int i = 0; i < n_dims/2; i++) {
        freq[i] = 1.0f / powf(freq_base, (float)i*2 / n_dims);
    }

    // Apply rotation using cosf/sinf of (pos * freq[i])...

    free(freq);
}

// EMBODIOS version
void embodios_rope(
    fixed_t* q,           // Query tensor
    int pos,              // Position
    int head_dim,         // Head dimension
    int n_heads) {

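    // Use the precomputed sin/cos tables (globals, filled at boot).
    // This rotates element d with element d + head_dim/2 (the NeoX-style
    // pairing); ggml's default RoPE mode rotates adjacent pairs (2d, 2d+1),
    // so confirm which layout the model's GGUF weights expect.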
    for (int h = 0; h < n_heads; h++) {
        for (int d = 0; d < head_dim/2; d++) {
            int idx = h * head_dim + d;

            fixed_t cos_val = rope_cos_table[pos * head_dim + d];
            fixed_t sin_val = rope_sin_table[pos * head_dim + d];

            fixed_t q0 = q[idx];
            fixed_t q1 = q[idx + head_dim/2];

            q[idx] = FIXED_MUL(q0, cos_val) - FIXED_MUL(q1, sin_val);
            q[idx + head_dim/2] = FIXED_MUL(q0, sin_val) + FIXED_MUL(q1, cos_val);
        }
    }
}

Deliverables:

Port RoPE from llama.cpp
Precompute sin/cos tables at boot
Unit test: verify rotations match llama.cpp
Support head_dim 64, 128

Step 3: Softmax

Source: ggml.c:ggml_compute_forward_soft_max()

Port to: kernel/ai/ops/softmax.c

Critical insight from llama.cpp:

// llama.cpp uses max-subtraction for numerical stability
void ggml_compute_forward_soft_max_f32(
    const struct ggml_tensor * src0,
    struct ggml_tensor * dst) {

    // Find max value
    float max = -INFINITY;
    for (int i = 0; i < ne0; i++) {
        max = fmaxf(max, src[i]);
    }

    // Compute exp(x - max)
    float sum = 0.0f;
    for (int i = 0; i < ne0; i++) {
        dst[i] = expf(src[i] - max);
        sum += dst[i];
    }

    // Normalize
    for (int i = 0; i < ne0; i++) {
        dst[i] /= sum;
    }
}

EMBODIOS version: Use lookup table for exp() instead of expf().

Deliverables:

Port softmax with max-subtraction
Implement exp() lookup table
Unit test: compare with llama.cpp
Error: <0.1% deviation
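A lookup-table softmax in fixed point might be sketched as follows. After max-subtraction every exponent argument is non-positive, so a single table of exp(-x) for x in [0, 16) is enough. The table size, the 1/16 step, the Q16.16 format, and all names (`exp_lut_init`, `exp_neg`, `softmax_fixed`) are assumptions for illustration. The table is built by repeated multiplication so no `expf()` is needed at boot; that choice trades some accuracy for independence from libm, and the error has not been checked against the 0.1% target.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: Q16.16 fixed point. */
typedef int32_t fixed_t;
#define FIXED_SHIFT 16
#define FIXED_ONE   ((fixed_t)1 << FIXED_SHIFT)
#define FIXED_MUL(a, b) ((fixed_t)(((int64_t)(a) * (int64_t)(b)) >> FIXED_SHIFT))

#define EXP_LUT_SIZE       256  /* covers x in [0, 16) */
#define EXP_LUT_STEP_SHIFT 12   /* step of 1/16 in Q16.16 is 1 << 12 */

static fixed_t exp_neg_lut[EXP_LUT_SIZE];

/* lut[i] = exp(-i/16), built as lut[i-1] * exp(-1/16). */
static void exp_lut_init(void) {
    const fixed_t r = 61565;  /* exp(-1/16) ~= 0.939413 in Q16.16 */
    exp_neg_lut[0] = FIXED_ONE;
    for (int i = 1; i < EXP_LUT_SIZE; i++) {
        exp_neg_lut[i] = FIXED_MUL(exp_neg_lut[i - 1], r);
    }
}

/* Returns exp(-x) for x >= 0 (Q16.16); anything past the table is treated as 0. */
static fixed_t exp_neg(int64_t x) {
    const int64_t idx = x >> EXP_LUT_STEP_SHIFT;
    return idx < EXP_LUT_SIZE ? exp_neg_lut[idx] : 0;
}

/* In-place softmax with max-subtraction. Requires exp_lut_init() first. */
static void softmax_fixed(fixed_t *v, int n) {
    assert(n > 0);
    fixed_t max = v[0];
    for (int i = 1; i < n; i++) {
        if (v[i] > max) max = v[i];
    }

    int64_t sum = 0;
    for (int i = 0; i < n; i++) {
        v[i] = exp_neg((int64_t)max - (int64_t)v[i]);  /* 64-bit difference avoids overflow */
        sum += v[i];
    }

    /* sum >= FIXED_ONE because the max element maps to exp(0) = 1. */
    for (int i = 0; i < n; i++) {
        v[i] = (fixed_t)(((int64_t)v[i] << FIXED_SHIFT) / sum);
    }
}
```

The table quantizes its input to steps of 1/16 without interpolation; linear interpolation between neighbouring entries is the usual next step if the measured deviation is too large.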

Step 4: RMS Normalization

Source: ggml.c:ggml_compute_forward_rms_norm()

Port to: kernel/ai/rmsnorm.c

// llama.cpp version
void ggml_compute_forward_rms_norm_f32(
    const struct ggml_tensor * src0,
    struct ggml_tensor * dst) {

    const float eps = 1e-6f;

    // Compute mean of squares
    float sum = 0.0f;
    for (int i = 0; i < ne0; i++) {
        sum += src0[i] * src0[i];
    }
    float mean = sum / ne0;

    // Normalize
    float scale = 1.0f / sqrtf(mean + eps);
    for (int i = 0; i < ne0; i++) {
        dst[i] = src0[i] * scale * weight[i];
    }
}

EMBODIOS adaptation: Use fixed-point sqrt approximation.

Deliverables:

Port RMSNorm from llama.cpp
Implement fixed-point rsqrt (1/sqrt)
Unit test: verify normalization
Performance: <2ms for 2048-dim vector
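One possible shape for the fixed-point rsqrt and the RMSNorm built on it is sketched below. It uses an exact integer square root followed by a division instead of an approximation, which is slower but easy to verify; a Newton-Raphson or table-based rsqrt could replace it later. The Q16.16 format and the names `isqrt64`, `fixed_rsqrt`, and `rmsnorm_fixed` are assumptions. Note that eps = 1e-6 is below the Q16.16 resolution (about 1.5e-5), so the sketch adds one least-significant bit instead.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: Q16.16 fixed point. */
typedef int32_t fixed_t;
#define FIXED_SHIFT 16
#define FIXED_MUL(a, b) ((fixed_t)(((int64_t)(a) * (int64_t)(b)) >> FIXED_SHIFT))

/* Digit-by-digit integer square root: floor(sqrt(v)). */
static uint32_t isqrt64(uint64_t v) {
    uint64_t res = 0;
    uint64_t bit = (uint64_t)1 << 62;
    while (bit > v) bit >>= 2;
    while (bit != 0) {
        if (v >= res + bit) {
            v -= res + bit;
            res = (res >> 1) + bit;
        } else {
            res >>= 1;
        }
        bit >>= 2;
    }
    return (uint32_t)res;
}

/* 1/sqrt(v) for v > 0, both in Q16.16. */
static fixed_t fixed_rsqrt(fixed_t v) {
    assert(v > 0);
    /* sqrt(V / 2^16) * 2^16 == sqrt(V * 2^16) */
    const uint32_t root = isqrt64((uint64_t)v << FIXED_SHIFT);
    return (fixed_t)(((uint64_t)1 << (2 * FIXED_SHIFT)) / root);
}

/* out[i] = x[i] / sqrt(mean(x^2) + eps) * weight[i] */
static void rmsnorm_fixed(fixed_t *out, const fixed_t *x,
                          const fixed_t *weight, int n) {
    assert(n > 0);
    int64_t sum_sq = 0;
    for (int i = 0; i < n; i++) {
        sum_sq += ((int64_t)x[i] * (int64_t)x[i]) >> FIXED_SHIFT;
    }
    /* One LSB stands in for eps; assumes the mean square fits in 32 bits. */
    const fixed_t mean = (fixed_t)(sum_sq / n) + 1;
    const fixed_t scale = fixed_rsqrt(mean);
    for (int i = 0; i < n; i++) {
        out[i] = FIXED_MUL(FIXED_MUL(x[i], scale), weight[i]);
    }
}
```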

Chapter 2: Quantization Kernels

Goal: Port quantization/dequantization kernels from ggml-quants.c.

Step 1: Q4_K Dequantization

Source: ggml-quants.c:dequantize_row_q4_K()

Port to: kernel/gguf/gguf_quant.c

llama.cpp Q4_K structure:

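// From ggml-quants.h (ggml-common.h in newer trees); union wrapper omitted
#define QK_K 256
#define K_SCALE_SIZE 12

typedef struct {
    ggml_fp16_t d;                 // Super-block scale for the quantized scales
    ggml_fp16_t dmin;              // Super-block scale for the quantized mins
    uint8_t scales[K_SCALE_SIZE];  // 8 scales + 8 mins, packed as 6-bit values
    uint8_t qs[QK_K/2];            // 4-bit quantized values
} block_q4_K;

// Unpack the 6-bit scale and min of sub-block j (0..7)
static inline void get_scale_min_k4(int j, const uint8_t * q, uint8_t * d, uint8_t * m) {
    if (j < 4) {
        *d = q[j] & 63;
        *m = q[j + 4] & 63;
    } else {
        *d = (q[j + 4] & 0xF) | ((q[j - 4] >> 6) << 4);
        *m = (q[j + 4] >>  4) | ((q[j    ] >> 6) << 4);
    }
}

void dequantize_row_q4_K(const block_q4_K * x, float * y, int k) {
    for (int i = 0; i < k / QK_K; i++) {
        const uint8_t * q = x[i].qs;
        const float d    = GGML_FP16_TO_FP32(x[i].d);
        const float dmin = GGML_FP16_TO_FP32(x[i].dmin);

        // Each 32-byte chunk holds two 32-value sub-blocks:
        // low nibbles first, then high nibbles
        int is = 0;
        uint8_t sc, m;
        for (int j = 0; j < QK_K; j += 64) {
            get_scale_min_k4(is + 0, x[i].scales, &sc, &m);
            const float d1 = d * sc, m1 = dmin * m;
            get_scale_min_k4(is + 1, x[i].scales, &sc, &m);
            const float d2 = d * sc, m2 = dmin * m;
            for (int l = 0; l < 32; l++) *y++ = d1 * (q[l] & 0xF) - m1;
            for (int l = 0; l < 32; l++) *y++ = d2 * (q[l] >> 4) - m2;
            q += 32;
            is += 2;
        }
    }
}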

EMBODIOS version: Output to fixed_t instead of float.

Deliverables:

Port Q4_K dequantization
Convert to fixed-point output
Unit test: compare with llama.cpp
Performance: >1M weights/sec
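Producing `fixed_t` output means the fp16 fields `d` and `dmin` have to be converted without going through `float`. A minimal integer-only sketch is shown below, assuming Q16.16 output and IEEE 754 binary16 input; the function name `fp16_to_fixed` and the choice to saturate on overflow, infinity, and NaN are assumptions.

```c
#include <stdint.h>

/* Assumption: Q16.16 fixed point. */
typedef int32_t fixed_t;
#define FIXED_SHIFT 16

/* Converts raw IEEE 754 half-precision bits to Q16.16 using integer ops only.
 * Values too large for Q16.16 (and Inf/NaN) saturate; values too small
 * truncate toward zero. */
static fixed_t fp16_to_fixed(uint16_t h) {
    const uint32_t sign = h >> 15;
    const uint32_t exp  = (h >> 10) & 0x1F;
    const uint32_t man  = h & 0x3FF;
    int64_t mag;

    if (exp == 0x1F) {
        mag = INT32_MAX;  /* Inf or NaN */
    } else {
        /* value = sig * 2^(e - 25); normals carry an implicit leading 1,
         * subnormals use e = 1 with no implicit bit. */
        const uint32_t sig = exp ? (man | 0x400) : man;
        const int e = exp ? (int)exp : 1;
        const int shift = e - 25 + FIXED_SHIFT;
        mag = shift >= 0 ? (int64_t)sig << shift : (int64_t)(sig >> -shift);
        if (mag > INT32_MAX) mag = INT32_MAX;
    }
    return (fixed_t)(sign ? -mag : mag);
}
```

Block scales are often much smaller than 1, where Q16.16 keeps only a few significant bits. A port may therefore want a format with more fractional bits for `d` and `dmin`, or to fold the scale into the integer product before narrowing to `fixed_t`.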

Step 2: Q5_K and Q6_K

Source: ggml-quants.c:dequantize_row_q5_K() and dequantize_row_q6_K()

Deliverables:

Port Q5_K (5-bit quantization)
Port Q6_K (6-bit quantization)
Unit tests for both
Document memory/speed tradeoffs

Step 3: SIMD Quantization Kernels

Source: ggml-quants.c (SSE/AVX2 versions)

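llama.cpp keeps its SIMD code paths behind compile-time feature checks. The function names and speedup figures below are illustrative, not actual llama.cpp symbols or measurements: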

#if defined(__SSE2__)
void dequantize_row_q4_K_sse2(...) {
    // 4x faster than scalar
}
#endif

#if defined(__AVX2__)
void dequantize_row_q4_K_avx2(...) {
    // 8x faster than scalar
}
#endif

EMBODIOS: Port SSE2 version first, AVX2 later (optional).

Deliverables:

Port SSE2 quantization kernels
Measure speedup (target: 3-4x)
Optional: Port AVX2 kernels
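As a starting point for the SSE2 work, the sketch below unpacks 16 packed bytes into 32 4-bit values, with a scalar fallback so the same source builds on targets without SSE2. The output order (16 low nibbles, then 16 high nibbles) and the function names are assumptions for illustration; the real kernel would follow whatever order the chosen quantization format prescribes. Only intrinsics from `<emmintrin.h>` are used.

```c
#include <stdint.h>

#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Scalar reference: 16 packed bytes -> 16 low nibbles, then 16 high nibbles. */
static void unpack_nibbles_scalar(const uint8_t *src, uint8_t *dst) {
    for (int i = 0; i < 16; i++) {
        dst[i]      = src[i] & 0x0F;
        dst[i + 16] = src[i] >> 4;
    }
}

/* Same result as the scalar version, using SSE2 when available. */
static void unpack_nibbles(const uint8_t *src, uint8_t *dst) {
#if defined(__SSE2__)
    const __m128i mask = _mm_set1_epi8(0x0F);
    const __m128i v    = _mm_loadu_si128((const __m128i *)src);
    const __m128i lo   = _mm_and_si128(v, mask);
    /* SSE2 has no 8-bit shift: shift 16-bit lanes, then mask off the bits
     * that crossed over from the neighbouring byte. */
    const __m128i hi   = _mm_and_si128(_mm_srli_epi16(v, 4), mask);
    _mm_storeu_si128((__m128i *)dst, lo);
    _mm_storeu_si128((__m128i *)(dst + 16), hi);
#else
    unpack_nibbles_scalar(src, dst);
#endif
}
```

Keeping the scalar function next to the SIMD one gives the unit test a reference to compare against, which is also a convenient way to measure the speedup target.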

Chapter 3: Model Loading

Goal: Port GGUF loading logic from llama.cpp:llama_load_model().

Step 1: GGUF Metadata Parsing

Source: llama.cpp:llama_model_load_internal()

Key insight: llama.cpp has robust error handling for GGUF parsing.

// llama.cpp approach
struct llama_model * llama_load_model_from_file(
    const char * path,
    struct llama_model_params params) {

    // Memory-map file
    void * data = mmap(...);

    // Parse GGUF header
    struct gguf_context * ctx = gguf_init_from_file(path, params);

    // Extract hyperparameters
    int n_vocab = gguf_get_int(ctx, "llama.vocab_size");
    int n_embd = gguf_get_int(ctx, "llama.embedding_length");
    int n_layer = gguf_get_int(ctx, "llama.block_count");

    // Load tensors
    for (int i = 0; i < gguf_get_n_tensors(ctx); i++) {
        const char * name = gguf_get_tensor_name(ctx, i);
        struct ggml_tensor * tensor = ggml_get_tensor(ctx->ctx_data, name);
        // Store tensor...
    }
}

EMBODIOS adaptation:

Replace mmap with our file API
Use our heap allocator instead of ggml_context

Deliverables:

Port GGUF parsing logic
Extract all hyperparameters
Load all tensors to memory
Unit test: load TinyLlama successfully
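Without `mmap()`, the parser works on a buffer that the EMBODIOS file API has already filled. A minimal sketch of the first step, validating the fixed-size GGUF header, is shown below. The field order (magic, version, tensor count, metadata key/value count) follows the GGUF version 2 and 3 layout; the struct name, function name, and error codes are assumptions.

```c
#include <stddef.h>
#include <stdint.h>

#define GGUF_MAGIC_LE 0x46554747u  /* bytes 'G','G','U','F' read as little-endian u32 */

/* Hypothetical container for the fixed-size part of the header. */
struct embodios_gguf_header {
    uint32_t version;
    uint64_t n_tensors;
    uint64_t n_kv;
};

/* Byte-wise little-endian reads: no alignment or host-endianness assumptions. */
static uint32_t rd_u32(const uint8_t *p) {
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

static uint64_t rd_u64(const uint8_t *p) {
    return (uint64_t)rd_u32(p) | ((uint64_t)rd_u32(p + 4) << 32);
}

/* Returns 0 on success, or a negative code:
 * -1 truncated buffer, -2 bad magic, -3 unsupported version. */
static int gguf_parse_header(const uint8_t *buf, size_t len,
                             struct embodios_gguf_header *out) {
    if (len < 24) return -1;
    if (rd_u32(buf) != GGUF_MAGIC_LE) return -2;

    out->version = rd_u32(buf + 4);
    if (out->version < 2) return -3;  /* version 1 used 32-bit counts */

    out->n_tensors = rd_u64(buf + 8);
    out->n_kv      = rd_u64(buf + 16);
    return 0;
}
```

Every later read (key strings, array lengths, tensor offsets) needs the same kind of bounds check against `len`, since a kernel parser cannot rely on a page fault to catch a corrupt file.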

Step 2: Tokenizer Loading

Source: llama.cpp:llama_load_vocab()

Critical: llama.cpp supports multiple tokenizer types:

enum llama_vocab_type {
    LLAMA_VOCAB_TYPE_SPM,  // SentencePiece
    LLAMA_VOCAB_TYPE_BPE,  // Byte-Pair Encoding (GPT-2)
};

// llama.cpp loads tokenizer from GGUF
void llama_load_vocab(struct gguf_context * ctx, struct llama_vocab * vocab) {
    const char * tokenizer_model = gguf_get_string(ctx, "tokenizer.ggml.model");

    if (strcmp(tokenizer_model, "gpt2") == 0) {
        vocab->type = LLAMA_VOCAB_TYPE_BPE;
        load_bpe_vocab(ctx, vocab);
    } else if (strcmp(tokenizer_model, "llama") == 0) {
        vocab->type = LLAMA_VOCAB_TYPE_SPM;
        load_spm_vocab(ctx, vocab);
    }

    // Load token strings
    int n_vocab = gguf_get_int(ctx, "tokenizer.ggml.vocab_size");
    for (int i = 0; i < n_vocab; i++) {
        const char * token = gguf_get_string_array(ctx, "tokenizer.ggml.tokens", i);
        float score = gguf_get_float_array(ctx, "tokenizer.ggml.scores", i);
        // Store token...
    }
}

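EMBODIOS: Support BPE for v1.0; SentencePiece later. Check the tokenizer.ggml.model value of the target file first: Llama-family models such as TinyLlama normally ship with the "llama" (SentencePiece) vocabulary, in which case that path would have to move into v1.0.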

Deliverables:

Port tokenizer loading logic
Support BPE tokenizer
Load vocabulary from GGUF
Unit test: tokenize "Once upon a time"

Step 3: Weight Tensor Organization

Source: llama.cpp:llama_model_load_internal()

Critical insight: llama.cpp organizes tensors per layer. GGUF files name each per-layer tensor blk.<layer>.<name>.weight:

// GGUF tensor naming convention (Llama architecture)
blk.0.attn_q.weight         // Query weights, layer 0
blk.0.attn_k.weight         // Key weights, layer 0
blk.0.attn_v.weight         // Value weights, layer 0
blk.0.attn_output.weight
blk.0.ffn_gate.weight
blk.0.ffn_down.weight
blk.0.ffn_up.weight
blk.0.attn_norm.weight
blk.0.ffn_norm.weight

EMBODIOS: Replicate this structure for compatibility.

struct llama_layer {
    // Attention weights
    fixed_t* wq;  // Query
    fixed_t* wk;  // Key
    fixed_t* wv;  // Value
    fixed_t* wo;  // Output

    // FFN weights
    fixed_t* w1;  // Gate
    fixed_t* w2;  // Down
    fixed_t* w3;  // Up

    // Normalization weights
    fixed_t* attn_norm;
    fixed_t* ffn_norm;
};

struct llama_model {
    int n_vocab;
    int n_embd;
    int n_layer;
    int n_head;
    int n_head_kv;

    fixed_t* tok_embeddings;
    struct llama_layer* layers;
    fixed_t* output_norm;
    fixed_t* output;
};

Deliverables:

Define model structure
Load all tensors into structure
Verify tensor shapes match GGUF
Document memory layout
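Loading tensors into `struct llama_layer` needs a mapping from tensor name to layer index and field. The sketch below assumes GGUF tensor names of the form `blk.<layer>.<name>.weight`, as llama.cpp writes for Llama-family models. The enum, the table, and `parse_layer_tensor` are hypothetical names; the slot order mirrors the fields of `struct llama_layer`. The layer number is parsed by hand because `sscanf` is unlikely to be available in the kernel.

```c
#include <string.h>

/* One slot per field of struct llama_layer, in declaration order. */
enum layer_slot {
    SLOT_WQ, SLOT_WK, SLOT_WV, SLOT_WO,
    SLOT_W1, SLOT_W2, SLOT_W3,
    SLOT_ATTN_NORM, SLOT_FFN_NORM,
    SLOT_COUNT
};

static const char *const slot_names[SLOT_COUNT] = {
    "attn_q.weight", "attn_k.weight", "attn_v.weight", "attn_output.weight",
    "ffn_gate.weight", "ffn_down.weight", "ffn_up.weight",
    "attn_norm.weight", "ffn_norm.weight",
};

/* Parses "blk.<layer>.<suffix>". On success stores the layer index in *layer
 * and returns the slot; returns -1 for anything that is not a known
 * per-layer tensor (for example the embedding or output tensors). */
static int parse_layer_tensor(const char *name, int *layer) {
    if (strncmp(name, "blk.", 4) != 0) return -1;

    const char *p = name + 4;
    if (*p < '0' || *p > '9') return -1;

    int n = 0;
    while (*p >= '0' && *p <= '9') {
        n = n * 10 + (*p - '0');
        p++;
    }
    if (*p != '.') return -1;
    p++;

    for (int s = 0; s < SLOT_COUNT; s++) {
        if (strcmp(p, slot_names[s]) == 0) {
            *layer = n;
            return s;
        }
    }
    return -1;
}
```

The caller would still check `*layer` against `n_layer` and verify the tensor's shape before storing the pointer, which covers the "verify tensor shapes match GGUF" deliverable.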

Phase 4: Inference Pipeline (Weeks 11-14)

Goal: Port transformer inference from llama.cpp:llama_decode().

4.1: Single-Token Forward Pass (Week 11-12)

Source: llama.cpp:llm_build_llama()

llama.cpp uses computational graph approach:

static struct ggml_cgraph * llm_build_llama(
    struct llama_context * lctx,
    const llama_batch & batch) {

    struct ggml_context * ctx = lctx->ctx_builder;
    struct ggml_cgraph * gf = ggml_new_graph(ctx);

    // 1. Token embedding
    struct ggml_tensor * inpL = ggml_get_rows(ctx,
        model.tok_embeddings, batch.token);

    // 2. For each layer
    for (int il = 0; il < n_layer; il++) {
        // RMSNorm
        struct ggml_tensor * cur = ggml_rms_norm(ctx, inpL, eps);
        cur = ggml_mul(ctx, cur, model.layers[il].attn_norm);

        // Attention
        struct ggml_tensor * Qcur = ggml_mul_mat(ctx, model.layers[il].wq, cur);
        struct ggml_tensor * Kcur = ggml_mul_mat(ctx, model.layers[il].wk, cur);
        struct ggml_tensor * Vcur = ggml_mul_mat(ctx, model.layers[il].wv, cur);

        // RoPE
        Qcur = ggml_rope_custom(ctx, Qcur, ...);
        Kcur = ggml_rope_custom(ctx, Kcur, ...);

        // Attention scores
        struct ggml_tensor * kq = ggml_mul_mat(ctx, Kcur, Qcur);
        kq = ggml_soft_max(ctx, kq);

        // Attention output
        cur = ggml_mul_mat(ctx, Vcur, kq);
        cur = ggml_mul_mat(ctx, model.layers[il].wo, cur);

        // Residual
        inpL = ggml_add(ctx, inpL, cur);

        // FFN (similar structure)...
    }

    // Build and return graph
    ggml_build_forward_expand(gf, cur);
    return gf;
}

EMBODIOS approach: Imperative (no graph), direct computation.

// Activation scratch buffers, n_embd elements each, allocated once at boot
static fixed_t* x;    // Residual stream
static fixed_t* xb;   // Output of the current sub-block

void llama_forward(
    struct llama_model* model,
    int token,
    int pos,
    fixed_t* logits) {

    const int n_embd = model->n_embd;

    // 1. Embedding lookup: copy the row so the weight table is never modified
    memcpy(x, &model->tok_embeddings[token * n_embd], n_embd * sizeof(fixed_t));

    // 2. For each layer
    for (int layer = 0; layer < model->n_layer; layer++) {
        struct llama_layer* l = &model->layers[layer];

        // Pre-attention RMSNorm: xb = norm(x); x keeps the residual
        rmsnorm(xb, x, l->attn_norm, n_embd);

        // Attention (in place on xb)
        attention(xb, l, pos, model);

        // Residual: x += xb
        add_residual(x, xb, n_embd);

        // Pre-FFN RMSNorm: xb = norm(x)
        rmsnorm(xb, x, l->ffn_norm, n_embd);

        // FFN (in place on xb)
        ffn_swiglu(xb, l, model);

        // Residual: x += xb
        add_residual(x, xb, n_embd);
    }

    // 3. Final RMSNorm + output projection
    rmsnorm(xb, x, model->output_norm, n_embd);
    matmul(xb, model->output, logits, n_embd, model->n_vocab);
}

Deliverables:

Implement forward pass
Match llama.cpp layer-by-layer
Unit test: compare intermediate activations
Integration test: compare final logits

4.2: KV Cache Implementation (Week 13)

Source: llama.cpp:llama_kv_cache

llama.cpp KV cache structure:

struct llama_kv_cache {
    struct ggml_tensor * k;  // All key tensors
    struct ggml_tensor * v;  // All value tensors

    int n_ctx;    // Max context length (e.g., 2048)
    int n_layer;  // Number of layers
};

void llama_kv_cache_update(
    struct llama_kv_cache * cache,
    const struct llama_batch * batch) {

    // Store K, V for each position
    for (int pos = 0; pos < batch->n_tokens; pos++) {
        for (int layer = 0; layer < n_layer; layer++) {
            // Copy K, V to cache
            memcpy(&cache->k[layer][pos * n_embd], K_cur, n_embd * sizeof(float));
            memcpy(&cache->v[layer][pos * n_embd], V_cur, n_embd * sizeof(float));
        }
    }
}

EMBODIOS: Similar structure, but use fixed_t.

Deliverables:

Allocate KV cache (n_layer x n_ctx x n_embd; with grouped-query attention, n_head_kv x head_dim replaces n_embd)
Store K, V during forward pass
Reuse cached values for past tokens
Measure: 2x speedup on autoregressive generation
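A flat, preallocated layout keeps the cache compatible with the "no malloc/free after boot" rule. The sketch below shows one way to index and fill it; the struct and function names are hypothetical, `fixed_t` is assumed to be a 32-bit integer, and `kv_dim` stands for the per-token K (or V) width, which is `n_head_kv * head_dim` under grouped-query attention.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Assumption: 32-bit fixed-point element type. */
typedef int32_t fixed_t;

/* Hypothetical cache descriptor; k and v are carved out of the heap at boot. */
struct kv_cache {
    fixed_t *k;     /* n_layer * n_ctx * kv_dim elements */
    fixed_t *v;     /* same shape as k */
    int n_layer;
    int n_ctx;
    int kv_dim;
};

/* Number of elements each of k and v needs. */
static size_t kv_cache_elems(int n_layer, int n_ctx, int kv_dim) {
    return (size_t)n_layer * (size_t)n_ctx * (size_t)kv_dim;
}

/* Element offset of the vector for (layer, pos). */
static size_t kv_offset(const struct kv_cache *c, int layer, int pos) {
    return ((size_t)layer * (size_t)c->n_ctx + (size_t)pos) * (size_t)c->kv_dim;
}

/* Copies the current token's K and V into the cache. Returns 0, or -1 if
 * layer or pos is out of range. */
static int kv_cache_store(struct kv_cache *c, int layer, int pos,
                          const fixed_t *k_cur, const fixed_t *v_cur) {
    if (layer < 0 || layer >= c->n_layer || pos < 0 || pos >= c->n_ctx) {
        return -1;
    }
    const size_t off = kv_offset(c, layer, pos);
    memcpy(c->k + off, k_cur, (size_t)c->kv_dim * sizeof(fixed_t));
    memcpy(c->v + off, v_cur, (size_t)c->kv_dim * sizeof(fixed_t));
    return 0;
}

/* Cached key vector for (layer, pos); the caller guarantees the range. */
static const fixed_t *kv_cache_key(const struct kv_cache *c, int layer, int pos) {
    return c->k + kv_offset(c, layer, pos);
}
```

Because positions within a layer are contiguous, attention for token `pos` can walk keys 0..pos of one layer as a single linear scan.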

4.3: Sampling Strategies (Week 14)

Source: llama.cpp:llama_sample_*() and common/sampling.cpp

llama.cpp sampling functions:

// Greedy sampling (argmax)
llama_token llama_sample_token_greedy(
    struct llama_context * ctx,
    llama_token_data_array * candidates);

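// Temperature scaling (applied to the candidate logits)
void llama_sample_temp(
    struct llama_context * ctx,
    llama_token_data_array * candidates,
    float temp);

// Random sampling from the candidate distribution
llama_token llama_sample_token(
    struct llama_context * ctx,
    llama_token_data_array * candidates);

// Top-k sampling
void llama_sample_top_k(
    struct llama_context * ctx,
    llama_token_data_array * candidates,
    int32_t k,
    size_t min_keep);

// Top-p (nucleus) sampling
void llama_sample_top_p(
    struct llama_context * ctx,
    llama_token_data_array * candidates,
    float p,
    size_t min_keep);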

EMBODIOS: Port greedy and top-k for v1.0.

Deliverables:

Port greedy sampling
Port top-k sampling
Port temperature scaling
Unit test: compare sampled tokens with llama.cpp
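Greedy and top-k selection need no floating point and no allocation, since both only compare logits. A minimal sketch on raw `fixed_t` logits follows; the function names are hypothetical, and the top-k routine uses a simple O(n_vocab x k) insertion into a caller-provided index array, which is an implementation choice here rather than what llama.cpp does.

```c
#include <stdint.h>

/* Assumption: 32-bit fixed-point logits. */
typedef int32_t fixed_t;

/* Index of the largest logit; the lowest index wins ties. */
static int sample_greedy(const fixed_t *logits, int n_vocab) {
    int best = 0;
    for (int i = 1; i < n_vocab; i++) {
        if (logits[i] > logits[best]) best = i;
    }
    return best;
}

/* Writes the indices of the k largest logits into out, sorted by descending
 * logit (ties keep ascending index order). Returns how many were written,
 * which is min(k, n_vocab), or 0 when k <= 0. */
static int top_k_indices(const fixed_t *logits, int n_vocab, int k, int *out) {
    int count = 0;
    if (k <= 0) return 0;

    for (int i = 0; i < n_vocab; i++) {
        /* Once the list is full, skip anything not above the current minimum. */
        if (count == k && logits[i] <= logits[out[k - 1]]) continue;

        /* Open a slot at the end (growing or overwriting the minimum),
         * then shift smaller entries down to keep the list sorted. */
        int j = (count < k) ? count++ : k - 1;
        while (j > 0 && logits[out[j - 1]] < logits[i]) {
            out[j] = out[j - 1];
            j--;
        }
        out[j] = i;
    }
    return count;
}
```

Sampling from the top-k set still needs softmax over the k survivors and a random number source, which this sketch leaves out.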

Phase 5: Tokenization (Weeks 15-16)

Goal: Port BPE tokenizer from llama.cpp.

5.1: BPE Encoding (Week 15)

Source: unicode.cpp and llama.cpp:llama_tokenize_internal()

llama.cpp BPE algorithm:

std::vector<llama_vocab::id> llama_tokenize_internal(
    const llama_vocab & vocab,
    std::string text,
    bool bos,
    bool special) {

    std::vector<llama_vocab::id> output;

    // Add BOS token
    if (bos) {
        output.push_back(vocab.special_bos_id);
    }

    // Convert UTF-8 to codepoints
    std::vector<uint32_t> codepoints = unicode_cpts_from_utf8(text);

    // Apply BPE merges
    for (auto & merge : vocab.bpe_merges) {
        // Find and merge pairs...
    }

    // Map to token IDs
    for (auto & token : tokens) {
        output.push_back(vocab.token_to_id[token]);
    }

    return output;
}

EMBODIOS: Port to pure C.

Deliverables:

Port UTF-8 → codepoint conversion
Port BPE merge algorithm
Load BPE merges from GGUF
Unit test: "Once upon a time" → same token IDs as llama.cpp
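The UTF-8 to code point step is the easiest part to port in isolation. The sketch below is a plain C decoder with caller-provided output storage; the names `utf8_decode` and `utf8_to_codepoints` are hypothetical, and rejecting malformed input with an error is a choice made here, not necessarily how llama.cpp's `unicode.cpp` handles it.

```c
#include <stddef.h>
#include <stdint.h>

/* Decodes one code point from s (len bytes available).
 * Returns the number of bytes consumed, or 0 on malformed input
 * (bad lead or continuation byte, truncation, overlong form,
 * surrogate, or value above U+10FFFF). */
static int utf8_decode(const uint8_t *s, size_t len, uint32_t *cp) {
    if (len == 0) return 0;

    const uint8_t b0 = s[0];
    int n;
    uint32_t c, min;

    if (b0 < 0x80) {
        *cp = b0;
        return 1;
    } else if ((b0 & 0xE0) == 0xC0) {
        n = 2; c = b0 & 0x1F; min = 0x80;
    } else if ((b0 & 0xF0) == 0xE0) {
        n = 3; c = b0 & 0x0F; min = 0x800;
    } else if ((b0 & 0xF8) == 0xF0) {
        n = 4; c = b0 & 0x07; min = 0x10000;
    } else {
        return 0;
    }
    if (len < (size_t)n) return 0;

    for (int i = 1; i < n; i++) {
        if ((s[i] & 0xC0) != 0x80) return 0;
        c = (c << 6) | (s[i] & 0x3F);
    }
    if (c < min || c > 0x10FFFF || (c >= 0xD800 && c <= 0xDFFF)) return 0;

    *cp = c;
    return n;
}

/* Decodes a whole buffer into out (capacity cap).
 * Returns the number of code points, or -1 on malformed input or overflow. */
static int utf8_to_codepoints(const uint8_t *s, size_t len,
                              uint32_t *out, size_t cap) {
    size_t n = 0;
    while (len > 0) {
        uint32_t cp;
        const int used = utf8_decode(s, len, &cp);
        if (used == 0 || n == cap) return -1;
        out[n++] = cp;
        s += used;
        len -= (size_t)used;
    }
    return (int)n;
}
```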

5.2: BPE Decoding (Week 16)

Source: llama.cpp:llama_detokenize()

llama.cpp decoding:

std::string llama_detokenize_bpe(
    const llama_vocab & vocab,
    const std::vector<llama_vocab::id> & tokens) {

    std::string text;

    for (auto id : tokens) {
        // Skip special tokens
        if (id == vocab.special_bos_id || id == vocab.special_eos_id) {
            continue;
        }

        // Get token string
        const std::string & token_str = vocab.id_to_token[id].text;
        text += token_str;
    }

    return text;
}

EMBODIOS: Port to C.

Deliverables:

Port token ID → string mapping
Handle special tokens
Unit test: roundtrip encoding/decoding
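In C the `std::string` accumulation becomes a copy into a caller-provided buffer with an explicit capacity check. The sketch below mirrors the listing above: it skips BOS and EOS and concatenates token text as-is. The function name and the flat `id_to_text` array are assumptions. Byte-level BPE vocabularies additionally need the byte-to-character mapping undone when decoding, which this sketch does not attempt.

```c
#include <stddef.h>
#include <string.h>

/* Concatenates the text of each token into out (capacity cap, including the
 * terminating NUL). BOS and EOS tokens are skipped. Returns the resulting
 * string length, or -1 on an invalid token ID or insufficient space. */
static int detokenize(const char *const *id_to_text, int n_vocab,
                      const int *tokens, int n_tokens,
                      int bos_id, int eos_id,
                      char *out, size_t cap) {
    size_t len = 0;
    if (cap == 0) return -1;

    for (int i = 0; i < n_tokens; i++) {
        const int id = tokens[i];
        if (id < 0 || id >= n_vocab) return -1;
        if (id == bos_id || id == eos_id) continue;

        const size_t n = strlen(id_to_text[id]);
        if (n >= cap - len) return -1;  /* keep room for the NUL */
        memcpy(out + len, id_to_text[id], n);
        len += n;
    }
    out[len] = '\0';
    return (int)len;
}
```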

Phase 6: Validation & Testing (Weeks 17-18)

Goal: Ensure EMBODIOS generates the same tokens as llama.cpp under greedy sampling.

6.1: Token-by-Token Comparison (Week 17)

Create comparison test:

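#!/usr/bin/env python3
"""
Compare EMBODIOS output with llama.cpp on the same prompt.
"""

import subprocess
import sys

prompt = "Once upon a time"


def parse_tokens(output: bytes) -> list[int]:
    # Placeholder (assumption): each tool is expected to emit the generated
    # token IDs as whitespace-separated integers. llama.cpp prints text by
    # default, so adapt this to the real output format of each tool.
    return [int(tok) for tok in output.decode("utf-8").split()]


# Run llama.cpp
llamacpp_output = subprocess.check_output([
    "./llama.cpp/main",
    "-m", "models/tinyllama-1.1b-q4km.gguf",
    "-p", prompt,
    "-n", "50",
    "--temp", "0.0"  # Greedy for deterministic output
])

# Run EMBODIOS
embodios_output = subprocess.check_output([
    "./embodios_cli",
    "--model", "models/tinyllama-1.1b-q4km.gguf",
    "--prompt", prompt,
    "--n-predict", "50",
    "--temp", "0.0"
])

# Compare token-by-token
llamacpp_tokens = parse_tokens(llamacpp_output)
embodios_tokens = parse_tokens(embodios_output)

for i, (l, e) in enumerate(zip(llamacpp_tokens, embodios_tokens)):
    if l != e:
        print(f"MISMATCH at token {i}: llama.cpp={l}, embodios={e}")
        sys.exit(1)

# zip() stops at the shorter sequence, so check the lengths separately
if len(llamacpp_tokens) != len(embodios_tokens):
    print(f"LENGTH MISMATCH: llama.cpp={len(llamacpp_tokens)}, "
          f"embodios={len(embodios_tokens)}")
    sys.exit(1)

print("✅ ALL TOKENS MATCH!")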

Deliverables:

Create comparison script
Test 10 different prompts
Achieve 100% token match on greedy sampling
Document any discrepancies

6.2: Performance Benchmarking (Week 18)

Benchmark against llama.cpp:

Metric	llama.cpp	EMBODIOS v1.0	Target
Tokens/sec	83-86	TBD	85+
First token (ms)	~50	TBD	<20
Memory (MB)	160	120	<150
Latency jitter (ms)	±5-10	±0.5	<±1

Deliverables:

Benchmark 1000-token generation
Compare speed, memory, latency
Identify performance gaps
Create optimization plan

Implementation Phases

Note: Implementation follows a chapter-by-chapter approach. Each phase builds upon the previous one.

Phase 1: GGML Tensor Operations (Chapter 1)
├─ Matrix multiplication
├─ RoPE and Softmax
└─ RMSNorm and testing

Phase 2: Quantization Kernels (Chapter 2)
├─ Q4_K dequantization
├─ Q5_K and Q6_K support
└─ SIMD optimizations

Phase 3: Model Loading (Chapter 3)
├─ GGUF metadata parsing
├─ Tokenizer loading
└─ Weight tensor organization

Phase 4: Inference Pipeline
├─ Transformer forward pass
├─ KV cache implementation
└─ Sampling strategies

Phase 5: Tokenization
├─ BPE encoding
└─ BPE decoding

Phase 6: Validation
├─ Token-by-token comparison with llama.cpp
└─ Performance benchmarking

Success Criteria

v1.0 llama.cpp integration is complete when:

✅ Correctness:
- Generate same tokens as llama.cpp (greedy sampling)
- Pass token-by-token comparison on 10 prompts
- Tokenizer roundtrip matches llama.cpp
✅ Performance:
- 85+ tokens/sec (match/exceed llama.cpp)
- <20ms first token latency
- ±0.5ms latency jitter
✅ Compatibility:
- Load any GGUF model llama.cpp can load
- Support Q4_K_M, Q5_K_M, Q6_K quantization
- TinyLlama, Phi-2, Mistral-7B work
✅ Code Quality:
- Pure C (no C++ dependencies)
- Kernel-safe (no malloc/free after boot)
- Well-commented with llama.cpp references

File Mapping: llama.cpp → EMBODIOS

llama.cpp	EMBODIOS	Status
`ggml.c:ggml_mul_mat`	`kernel/ai/ops/matmul.c`	TODO
`ggml.c:ggml_rope`	`kernel/ai/ops/rope.c`	TODO
`ggml.c:ggml_soft_max`	`kernel/ai/ops/softmax.c`	TODO
`ggml.c:ggml_rms_norm`	`kernel/ai/rmsnorm.c`	TODO
`ggml-quants.c:dequantize_row_q4_K`	`kernel/gguf/gguf_quant.c`	TODO
`llama.cpp:llama_load_model`	`kernel/gguf/gguf_parser.c`	TODO
`llama.cpp:llama_decode`	`kernel/ai/transformer.c`	TODO
`llama.cpp:llama_kv_cache`	`kernel/ai/kv_cache.c`	TODO
`llama.cpp:llama_sample_*`	`kernel/ai/sampling.c`	TODO
`unicode.cpp`	`kernel/ai/tokenizer/unicode.c`	TODO
`llama.cpp:llama_tokenize`	`kernel/ai/tokenizer/bpe.c`	TODO

GitHub Issues to Create

Critical Priority (Phase 1: GGML Tensor Operations)

LLAMA-001: Port ggml_mul_mat from llama.cpp
LLAMA-002: Port ggml_rope from llama.cpp
LLAMA-003: Port ggml_soft_max from llama.cpp
LLAMA-004: Port ggml_rms_norm from llama.cpp

High Priority (Phases 2-3: Quantization & Model Loading)

LLAMA-005: Port Q4_K dequantization from llama.cpp
LLAMA-006: Port Q5_K and Q6_K dequantization
LLAMA-007: Port GGUF metadata parsing
LLAMA-008: Port tokenizer loading logic
LLAMA-009: Port weight tensor organization

Medium Priority (Phases 4-5: Inference & Tokenization)

LLAMA-010: Port transformer forward pass
LLAMA-011: Port KV cache implementation
LLAMA-012: Port sampling strategies
LLAMA-013: Port BPE encoding
LLAMA-014: Port BPE decoding

Testing (Phase 6: Validation)

LLAMA-015: Create token-by-token comparison test
LLAMA-016: Benchmark against llama.cpp

Total: 16 issues

Links

External References

#embodios #llamacpp #ggml #integration #transformer #inference #pillar-1
