llama.cpp Integration Roadmap

Dmitry Dimcha edited this page Jan 3, 2026 · 2 revisions

Goal: Port llama.cpp reference implementation to EMBODIOS kernel
Status: Not started (0%)
Last Updated: 14 December 2025
Strategy: Selective porting, not wholesale copy


Why llama.cpp?

llama.cpp is the de facto reference implementation for LLM inference:

  • ✅ Production-tested (millions of users)
  • ✅ Highly optimized (SIMD, quantization, KV cache)
  • ✅ Compatible with GGUF format
  • ✅ Clean C/C++ codebase (minimal dependencies)
  • ✅ Battle-tested tokenizers (BPE, SentencePiece)
  • ✅ Extensive model support (Llama, Mistral, Phi, etc.)

Why not just use llama.cpp directly?

  • ❌ Requires userspace runtime (pthreads, C++ stdlib)
  • ❌ Uses mmap() for model loading
  • ❌ Relies on POSIX APIs
  • ❌ Has C++ dependencies we can't use in kernel

Our approach: Selective porting

  • Extract core algorithms (transformer, attention, quantization)
  • Rewrite in pure C for kernel environment
  • Adapt to EMBODIOS memory model
  • Maintain API compatibility for testing

llama.cpp Architecture Analysis

Core Components We Need

llama.cpp/
├── ggml.c                    # Tensor operations (CRITICAL)
│   ├── ggml_compute_forward   → Port to kernel/ai/ops/
│   ├── ggml_mul_mat            → matmul.c
│   ├── ggml_rope               → rope.c
│   └── ggml_soft_max           → softmax.c
│
├── llama.cpp                 # Model loading & inference (CRITICAL)
│   ├── llama_load_model       → gguf_parser.c
│   ├── llama_decode           → transformer.c
│   ├── llama_kv_cache         → kv_cache.c
│   └── llama_sampling         → sampling.c
│
├── ggml-quants.c             # Quantization kernels (CRITICAL)
│   ├── dequantize_row_q4_K   → gguf_quant.c
│   ├── dequantize_row_q5_K   → gguf_quant.c
│   └── dequantize_row_q6_K   → gguf_quant.c
│
├── unicode.cpp               # UTF-8 handling (NEEDED)
│   └── codepoint handling     → tokenizer/unicode.c
│
└── common/                   # Utilities (SELECTIVE)
    ├── sampling               → sampling.c
    └── grammar (skip for v1.0)

Components We DON'T Need

llama.cpp/
├── examples/                 # Skip (userspace CLI tools)
├── tests/                    # Skip (use our own tests)
├── ggml-backend.c            # Skip (GPU/Metal/CUDA)
├── ggml-alloc.c              # Skip (use our heap)
└── llm.cpp                   # Skip (LLM-specific, use llama.cpp)

Integration Phases

Chapter 1: GGML Tensor Operations

Goal: Port core tensor operations from ggml.c to EMBODIOS.

Step 1: Matrix Multiplication

Source: ggml.c:ggml_compute_forward_mul_mat()

Port to: kernel/ai/ops/matmul.c

Key Changes:

// llama.cpp version (floating-point)
void ggml_compute_forward_mul_mat_f32(
    const struct ggml_tensor * src0,
    const struct ggml_tensor * src1,
    struct ggml_tensor * dst) {

    const float * src0_data = (float *) src0->data;
    const float * src1_data = (float *) src1->data;
    float * dst_data = (float *) dst->data;

    // Matrix multiply logic...
}

// EMBODIOS version (fixed-point)
void embodios_matmul_fixed(
    const fixed_t* a,      // M x K
    const fixed_t* b,      // K x N
    fixed_t* c,            // M x N
    int M, int K, int N) {

    for (int i = 0; i < M; i++) {
        for (int j = 0; j < N; j++) {
            fixed_t sum = 0;
            for (int k = 0; k < K; k++) {
                sum += FIXED_MUL(a[i*K + k], b[k*N + j]);
            }
            c[i*N + j] = sum;
        }
    }
}

Deliverables:

  • Port matmul (naive version)
  • Add SIMD optimization (SSE2/AVX2)
  • Unit tests: compare with llama.cpp output
  • Performance: within 20% of llama.cpp
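The EMBODIOS snippet above uses fixed_t and FIXED_MUL without defining them. The sketch below shows one way they could look, assuming a signed Q16.16 format (that choice, and the name matmul_fixed_acc64, are assumptions, not taken from the EMBODIOS source). It also accumulates the dot product in 64 bits, so long rows do not overflow before the final shift.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed format: signed Q16.16 (16 integer bits, 16 fractional bits). */
typedef int32_t fixed_t;
#define FIXED_SHIFT 16
#define FIXED_ONE   ((fixed_t)1 << FIXED_SHIFT)

/* Multiply through a 64-bit intermediate so the raw product cannot overflow.
 * Relies on arithmetic right shift of negative values, which mainstream
 * compilers provide but ISO C leaves implementation-defined. */
#define FIXED_MUL(a, b) ((fixed_t)(((int64_t)(a) * (int64_t)(b)) >> FIXED_SHIFT))

/* c (M x N) = a (M x K) * b (K x N), all row-major. */
void matmul_fixed_acc64(const fixed_t *a, const fixed_t *b, fixed_t *c,
                        int M, int K, int N) {
    assert(a && b && c && M > 0 && K > 0 && N > 0);
    for (int i = 0; i < M; i++) {
        for (int j = 0; j < N; j++) {
            int64_t acc = 0;  /* sum of raw Q32.32 products */
            for (int k = 0; k < K; k++) {
                acc += (int64_t)a[i * K + k] * (int64_t)b[k * N + j];
            }
            c[i * N + j] = (fixed_t)(acc >> FIXED_SHIFT);  /* one shift at the end */
        }
    }
}
```

Shifting once after the sum also loses less precision than shifting every product, which may help the comparison against llama.cpp's float output.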

Step 2: RoPE (Rotary Position Embedding)

Source: ggml.c:ggml_compute_forward_rope()

Port to: kernel/ai/ops/rope.c

Key Insight: llama.cpp precomputes sin/cos tables. Copy this optimization.

// llama.cpp approach
struct ggml_tensor * ggml_rope_custom(
    struct ggml_context * ctx,
    struct ggml_tensor * a,
    int n_dims,
    int mode,
    int n_ctx,
    float freq_base,
    float freq_scale,
    ...) {

    // Precompute sin/cos
    float * freq = malloc(n_dims/2 * sizeof(float));
    for (int i = 0; i < n_dims/2; i++) {
        freq[i] = 1.0f / powf(freq_base, (float)i*2 / n_dims);
    }

    // Apply rotation...
}

// EMBODIOS version
void embodios_rope(
    fixed_t* q,           // Query tensor
    int pos,              // Position
    int head_dim,         // Head dimension
    int n_heads) {

    // Use precomputed freq table (global)
    for (int h = 0; h < n_heads; h++) {
        for (int d = 0; d < head_dim/2; d++) {
            int idx = h * head_dim + d;

            fixed_t cos_val = rope_cos_table[pos * head_dim + d];
            fixed_t sin_val = rope_sin_table[pos * head_dim + d];

            fixed_t q0 = q[idx];
            fixed_t q1 = q[idx + head_dim/2];

            q[idx] = FIXED_MUL(q0, cos_val) - FIXED_MUL(q1, sin_val);
            q[idx + head_dim/2] = FIXED_MUL(q0, sin_val) + FIXED_MUL(q1, cos_val);
        }
    }
}

Deliverables:

  • Port RoPE from llama.cpp
  • Precompute sin/cos tables at boot
  • Unit test: verify rotations match llama.cpp
  • Support head_dim 64, 128
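One detail worth testing early is which elements are rotated together. ggml's RoPE kernel supports more than one pairing convention: adjacent elements (2d, 2d+1) in its default mode, and split halves (d, d + head_dim/2) in its NeoX mode. The sketch below implements both behind a flag so a unit test can check which one reproduces llama.cpp for a given model file. The enum and function names are hypothetical, Q16.16 is assumed, and the cos/sin rows are assumed to come from the precomputed tables.

```c
#include <assert.h>
#include <stdint.h>

typedef int32_t fixed_t;                 /* assumed Q16.16 */
#define FIXED_MUL(a, b) ((fixed_t)(((int64_t)(a) * (int64_t)(b)) >> 16))

enum rope_pairing { ROPE_ADJACENT, ROPE_SPLIT_HALF };   /* hypothetical names */

/* Rotate one head in place.
 * cos_row / sin_row hold head_dim/2 entries for the current position. */
void rope_rotate_head(fixed_t *q, const fixed_t *cos_row, const fixed_t *sin_row,
                      int head_dim, enum rope_pairing pairing) {
    assert(q && cos_row && sin_row && head_dim > 0 && head_dim % 2 == 0);
    const int half = head_dim / 2;
    for (int d = 0; d < half; d++) {
        /* Pick the two elements that share rotation angle d. */
        const int i0 = (pairing == ROPE_ADJACENT) ? 2 * d     : d;
        const int i1 = (pairing == ROPE_ADJACENT) ? 2 * d + 1 : d + half;
        const fixed_t q0 = q[i0], q1 = q[i1];
        q[i0] = FIXED_MUL(q0, cos_row[d]) - FIXED_MUL(q1, sin_row[d]);
        q[i1] = FIXED_MUL(q0, sin_row[d]) + FIXED_MUL(q1, cos_row[d]);
    }
}
```

Which convention a model needs depends on how its weights were converted to GGUF, so the "verify rotations match llama.cpp" test is the place to settle it rather than assuming either one.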

Step 3: Softmax

Source: ggml.c:ggml_compute_forward_soft_max()

Port to: kernel/ai/ops/softmax.c

Critical insight from llama.cpp:

// llama.cpp uses max-subtraction for numerical stability
void ggml_compute_forward_soft_max_f32(
    const struct ggml_tensor * src0,
    struct ggml_tensor * dst) {

    // Find max value
    float max = -INFINITY;
    for (int i = 0; i < ne0; i++) {
        max = fmaxf(max, src[i]);
    }

    // Compute exp(x - max)
    float sum = 0.0f;
    for (int i = 0; i < ne0; i++) {
        dst[i] = expf(src[i] - max);
        sum += dst[i];
    }

    // Normalize
    for (int i = 0; i < ne0; i++) {
        dst[i] /= sum;
    }
}

EMBODIOS version: Use lookup table for exp() instead of expf().

Deliverables:

  • Port softmax with max-subtraction
  • Implement exp() lookup table
  • Unit test: compare with llama.cpp
  • Error: <0.1% deviation
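A minimal sketch of the lookup-table idea, assuming Q16.16 and a deliberately coarse table (128 entries, step 1/8). After max-subtraction every input is at most 0, so the table only has to cover exp(-x) for x >= 0. The table is built with integer math only, so no FPU is needed at boot. Names and table size are assumptions; a table this coarse will not meet the 0.1% target without a finer step or interpolation.

```c
#include <assert.h>
#include <stdint.h>

typedef int32_t fixed_t;                    /* assumed Q16.16 */
#define FIXED_SHIFT    16
#define FIXED_ONE      ((fixed_t)1 << FIXED_SHIFT)
#define EXP_LUT_SIZE   128                  /* covers x in [0, 16) at step 1/8 */
#define EXP_STEP_SHIFT (FIXED_SHIFT - 3)    /* fixed value >> 13 == floor(x * 8) */

static fixed_t exp_neg_lut[EXP_LUT_SIZE];   /* exp_neg_lut[i] ~ exp(-i/8) */

/* Build the table by repeated multiplication with exp(-1/8). */
void exp_lut_init(void) {
    const int64_t step = 57835;             /* round(exp(-0.125) * 65536) */
    exp_neg_lut[0] = FIXED_ONE;
    for (int i = 1; i < EXP_LUT_SIZE; i++) {
        exp_neg_lut[i] =
            (fixed_t)(((int64_t)exp_neg_lut[i - 1] * step) >> FIXED_SHIFT);
    }
}

/* Softmax with max-subtraction; call exp_lut_init() once beforehand. */
void softmax_fixed(const fixed_t *x, fixed_t *out, int n) {
    assert(x && out && n > 0);
    fixed_t max = x[0];
    for (int i = 1; i < n; i++) {
        if (x[i] > max) max = x[i];
    }

    int64_t sum = 0;
    for (int i = 0; i < n; i++) {
        /* max - x[i] is never negative, so the index is never negative. */
        int64_t idx = ((int64_t)max - x[i]) >> EXP_STEP_SHIFT;
        out[i] = (idx < EXP_LUT_SIZE) ? exp_neg_lut[idx] : 0;  /* exp(-16) ~ 0 */
        sum += out[i];
    }

    /* sum >= FIXED_ONE because the maximum element contributes exp(0). */
    for (int i = 0; i < n; i++) {
        out[i] = (fixed_t)(((int64_t)out[i] << FIXED_SHIFT) / sum);
    }
}
```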

Step 4: RMS Normalization

Source: ggml.c:ggml_compute_forward_rms_norm()

Port to: kernel/ai/rmsnorm.c

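// llama.cpp version (simplified: upstream reads eps from the op parameters
// and applies the weight in a separate ggml_mul, as in the Phase 4 graph code)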
void ggml_compute_forward_rms_norm_f32(
    const struct ggml_tensor * src0,
    struct ggml_tensor * dst) {

    const float eps = 1e-6f;

    // Compute mean of squares
    float sum = 0.0f;
    for (int i = 0; i < ne0; i++) {
        sum += src0[i] * src0[i];
    }
    float mean = sum / ne0;

    // Normalize
    float scale = 1.0f / sqrtf(mean + eps);
    for (int i = 0; i < ne0; i++) {
        dst[i] = src0[i] * scale * weight[i];
    }
}

EMBODIOS adaptation: Use fixed-point sqrt approximation.

Deliverables:

  • Port RMSNorm from llama.cpp
  • Implement fixed-point rsqrt (1/sqrt)
  • Unit test: verify normalization
  • Performance: <2ms for 2048-dim vector
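One possible shape for the fixed-point rsqrt, assuming Q16.16: take an integer square root of the value shifted up by 16 bits (which yields sqrt in Q16.16), then divide 2^32 by it. This is a sketch with hypothetical names, not a tuned kernel; a Newton-Raphson refinement or table lookup would likely be faster.

```c
#include <assert.h>
#include <stdint.h>

typedef int32_t fixed_t;                 /* assumed Q16.16 */
#define FIXED_SHIFT 16
#define FIXED_MUL(a, b) ((fixed_t)(((int64_t)(a) * (int64_t)(b)) >> FIXED_SHIFT))

/* Bitwise integer square root: floor(sqrt(n)). */
static uint64_t isqrt64(uint64_t n) {
    uint64_t res = 0;
    uint64_t bit = (uint64_t)1 << 62;
    while (bit > n) bit >>= 2;
    while (bit != 0) {
        if (n >= res + bit) {
            n -= res + bit;
            res = (res >> 1) + bit;
        } else {
            res >>= 1;
        }
        bit >>= 2;
    }
    return res;
}

/* 1/sqrt(v) for v > 0, input and output in Q16.16. */
fixed_t fixed_rsqrt(fixed_t v) {
    assert(v > 0);
    uint64_t s = isqrt64((uint64_t)v << FIXED_SHIFT);   /* sqrt(v) in Q16.16 */
    return (fixed_t)(((uint64_t)1 << 32) / s);
}

/* out[i] = x[i] / sqrt(mean(x^2) + eps) * weight[i] */
void rmsnorm_fixed(fixed_t *out, const fixed_t *x, const fixed_t *weight,
                   int n, fixed_t eps) {
    assert(out && x && weight && n > 0);
    int64_t sum = 0;
    for (int i = 0; i < n; i++) {
        sum += ((int64_t)x[i] * x[i]) >> FIXED_SHIFT;   /* squares in Q16.16 */
    }
    const fixed_t mean = (fixed_t)(sum / n);   /* assumes activations stay small */
    const fixed_t scale = fixed_rsqrt(mean + eps);
    for (int i = 0; i < n; i++) {
        out[i] = FIXED_MUL(FIXED_MUL(x[i], scale), weight[i]);
    }
}
```

Note that the smallest positive Q16.16 value is about 1.5e-5, so an epsilon of 1e-6 rounds to zero in this format; passing eps = 1 (one LSB) is the closest available substitute.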

Chapter 2: Quantization Kernels

Goal: Port quantization/dequantization kernels from ggml-quants.c.

Step 1: Q4_K Dequantization

Source: ggml-quants.c:dequantize_row_q4_K()

Port to: kernel/gguf/gguf_quant.c

llama.cpp Q4_K structure:

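// From ggml-quants.h
#define QK_K 256
#define K_SCALE_SIZE 12

typedef struct {
    ggml_fp16_t d;                 // Super-block scale for the quantized scales
    ggml_fp16_t dmin;              // Super-block scale for the quantized mins
    uint8_t scales[K_SCALE_SIZE];  // 8 scales + 8 mins, 6 bits each, packed
    uint8_t qs[QK_K/2];            // 4-bit quantized values
} block_q4_K;                      // 144 bytes per 256 weights (4.5 bits/weight)

// Unpack the 6-bit scale and min of sub-block j (0..7)
static inline void get_scale_min_k4(int j, const uint8_t * q, uint8_t * d, uint8_t * m) {
    if (j < 4) {
        *d = q[j] & 63;
        *m = q[j + 4] & 63;
    } else {
        *d = (q[j+4] & 0xF) | ((q[j-4] >> 6) << 4);
        *m = (q[j+4] >>  4) | ((q[j-0] >> 6) << 4);
    }
}

void dequantize_row_q4_K(const block_q4_K * x, float * y, int k) {
    for (int i = 0; i < k / QK_K; i++) {
        const uint8_t * q = x[i].qs;
        const float d   = GGML_FP16_TO_FP32(x[i].d);
        const float min = GGML_FP16_TO_FP32(x[i].dmin);

        // Each 64-weight chunk shares 32 bytes: low nibbles first, then high nibbles
        for (int j = 0, is = 0; j < QK_K; j += 64, is += 2, q += 32) {
            uint8_t sc, m;
            get_scale_min_k4(is + 0, x[i].scales, &sc, &m);
            const float d1 = d * sc, m1 = min * m;
            get_scale_min_k4(is + 1, x[i].scales, &sc, &m);
            const float d2 = d * sc, m2 = min * m;
            for (int l = 0; l < 32; l++) *y++ = d1 * (q[l] & 0xF) - m1;
            for (int l = 0; l < 32; l++) *y++ = d2 * (q[l] >>  4) - m2;
        }
    }
}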

EMBODIOS version: Output to fixed_t instead of float.

Deliverables:

  • Port Q4_K dequantization
  • Convert to fixed-point output
  • Unit test: compare with llama.cpp
  • Performance: >1M weights/sec
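A sketch of what the fixed-point output path could look like, assuming the upstream k-quant layout (two fp16 scale factors, 12 packed scale bytes, 128 quant bytes, so 144 bytes per 256 weights) and Q16.16 output. The fp16 values are converted with integer math only and kept in Q32.32 until the final shift, because typical scale factors are far smaller than 1 and would lose most of their precision in Q16.16. All names ending in _sketch or _fixed are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

typedef int32_t fixed_t;              /* assumed Q16.16 */
#define QK_K         256
#define K_SCALE_SIZE 12

/* Layout assumed to mirror upstream block_q4_K. */
typedef struct {
    uint16_t d;                       /* fp16 super-block scale */
    uint16_t dmin;                    /* fp16 super-block min scale */
    uint8_t  scales[K_SCALE_SIZE];    /* 8 x (6-bit scale, 6-bit min), packed */
    uint8_t  qs[QK_K / 2];            /* 4-bit quants */
} block_q4_K_sketch;

/* fp16 -> Q32.32 without an FPU. Inf/NaN are mapped to 0 in this sketch. */
static int64_t fp16_to_q32(uint16_t h) {
    const int64_t e = (h >> 10) & 0x1F;
    const int64_t m = h & 0x3FF;
    int64_t v;
    if (e == 0)       v = m << 8;                  /* subnormal: m * 2^-24 */
    else if (e == 31) v = 0;                       /* Inf/NaN: not expected */
    else              v = (1024 + m) << (e + 7);   /* (1024 + m) * 2^(e-25) */
    return (h & 0x8000) ? -v : v;
}

/* Unpack the 6-bit scale and min of sub-block j (0..7). */
static void get_scale_min(int j, const uint8_t *q, uint8_t *sc, uint8_t *mn) {
    if (j < 4) {
        *sc = q[j] & 63;
        *mn = q[j + 4] & 63;
    } else {
        *sc = (uint8_t)((q[j + 4] & 0x0F) | ((q[j - 4] >> 6) << 4));
        *mn = (uint8_t)((q[j + 4] >> 4)   | ((q[j]     >> 6) << 4));
    }
}

void dequantize_q4_K_fixed(const block_q4_K_sketch *x, fixed_t *y, int k) {
    assert(x && y && k % QK_K == 0);
    for (int i = 0; i < k / QK_K; i++) {
        const uint8_t *q = x[i].qs;
        const int64_t d    = fp16_to_q32(x[i].d);
        const int64_t dmin = fp16_to_q32(x[i].dmin);
        /* Each 64-weight chunk shares 32 bytes: low nibbles, then high. */
        for (int j = 0, is = 0; j < QK_K; j += 64, is += 2, q += 32) {
            uint8_t sc1, m1, sc2, m2;
            get_scale_min(is,     x[i].scales, &sc1, &m1);
            get_scale_min(is + 1, x[i].scales, &sc2, &m2);
            for (int l = 0; l < 32; l++)
                *y++ = (fixed_t)((d * sc1 * (q[l] & 0xF) - dmin * m1) >> 16);
            for (int l = 0; l < 32; l++)
                *y++ = (fixed_t)((d * sc2 * (q[l] >> 4) - dmin * m2) >> 16);
        }
    }
}
```

A unit test against llama.cpp can dequantize the same block with both implementations and compare the Q16.16 result to the float result within one or two LSBs.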

Step 2: Q5_K and Q6_K

Source: ggml-quants.c:dequantize_row_q5_K() and dequantize_row_q6_K()

Deliverables:

  • Port Q5_K (5-bit quantization)
  • Port Q6_K (6-bit quantization)
  • Unit tests for both
  • Document memory/speed tradeoffs
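For the memory side of that tradeoff, the per-block sizes can be worked out from the upstream struct layouts: 144 bytes for Q4_K, 176 for Q5_K and 210 for Q6_K, each covering 256 weights (4.5, 5.5 and about 6.56 bits per weight). The helper below turns that into a tensor size; the enum and function names are hypothetical, and the byte counts assume the upstream layouts have not changed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define QK_K 256

enum kq_type { KQ_Q4_K, KQ_Q5_K, KQ_Q6_K };   /* hypothetical names */

/* Bytes per 256-weight super-block. */
static size_t kq_block_bytes(enum kq_type t) {
    switch (t) {
    case KQ_Q4_K: return 2 + 2 + 12 + QK_K / 2;                /* 144 */
    case KQ_Q5_K: return 2 + 2 + 12 + QK_K / 8 + QK_K / 2;     /* 176 */
    case KQ_Q6_K: return QK_K / 2 + QK_K / 4 + QK_K / 16 + 2;  /* 210 */
    }
    return 0;
}

/* Storage needed for a tensor with n_weights elements. */
size_t kq_tensor_bytes(enum kq_type t, uint64_t n_weights) {
    assert(n_weights % QK_K == 0);   /* k-quants work on whole super-blocks */
    return (size_t)(n_weights / QK_K) * kq_block_bytes(t);
}
```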

Step 3: SIMD Quantization Kernels

Source: ggml-quants.c (SSE/AVX2 versions)

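llama.cpp has SIMD-optimized quantization kernels. Upstream, the SIMD paths mostly sit inside the quantized dot-product routines (for example ggml_vec_dot_q4_K_q8_K) as #if branches; the per-ISA function names below are illustrative: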

#if defined(__SSE2__)
void dequantize_row_q4_K_sse2(...) {
    // 4x faster than scalar
}
#endif

#if defined(__AVX2__)
void dequantize_row_q4_K_avx2(...) {
    // 8x faster than scalar
}
#endif

EMBODIOS: Port SSE2 version first, AVX2 later (optional).

Deliverables:

  • Port SSE2 quantization kernels
  • Measure speedup (target: 3-4x)
  • Optional: Port AVX2 kernels
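As a small illustration of what an SSE2 port involves, the sketch below unpacks 16 packed bytes into 32 four-bit values, the inner step of 4-bit dequantization. It uses only SSE2 intrinsics from emmintrin.h and falls back to scalar code on other targets. The function name is hypothetical. In kernel context the SSE registers would also have to be saved and restored around such code, which is not shown here.

```c
#include <assert.h>
#include <stdint.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Unpack 16 bytes into 32 values:
 * out[0..15] = low nibbles, out[16..31] = high nibbles. */
void unpack_nibbles_16(const uint8_t *in, uint8_t *out) {
    assert(in && out);
#if defined(__SSE2__)
    const __m128i mask = _mm_set1_epi8(0x0F);
    const __m128i v    = _mm_loadu_si128((const __m128i *)in);
    const __m128i lo   = _mm_and_si128(v, mask);
    /* There is no 8-bit shift in SSE2: shift 16-bit lanes, then mask off
     * the bits that crossed over from the neighbouring byte. */
    const __m128i hi   = _mm_and_si128(_mm_srli_epi16(v, 4), mask);
    _mm_storeu_si128((__m128i *)out, lo);
    _mm_storeu_si128((__m128i *)(out + 16), hi);
#else
    for (int i = 0; i < 16; i++) {
        out[i]      = in[i] & 0x0F;
        out[16 + i] = in[i] >> 4;
    }
#endif
}
```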

Chapter 3: Model Loading

Goal: Port GGUF loading logic from llama.cpp:llama_load_model().

Step 1: GGUF Metadata Parsing

Source: llama.cpp:llama_model_load_internal()

Key insight: llama.cpp has robust error handling for GGUF parsing.

// llama.cpp approach
struct llama_model * llama_load_model_from_file(
    const char * path,
    struct llama_model_params params) {

    // Memory-map file
    void * data = mmap(...);

    // Parse GGUF header
    struct gguf_context * ctx = gguf_init_from_file(path, params);

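    // Extract hyperparameters (look up the key index, then read the value;
    // gguf_find_key returns -1 when the key is missing)
    int n_vocab = gguf_get_val_u32(ctx, gguf_find_key(ctx, "llama.vocab_size"));
    int n_embd = gguf_get_val_u32(ctx, gguf_find_key(ctx, "llama.embedding_length"));
    int n_layer = gguf_get_val_u32(ctx, gguf_find_key(ctx, "llama.block_count"));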

    // Load tensors
    for (int i = 0; i < gguf_get_n_tensors(ctx); i++) {
        const char * name = gguf_get_tensor_name(ctx, i);
        struct ggml_tensor * tensor = ggml_get_tensor(ctx->ctx_data, name);
        // Store tensor...
    }
}

EMBODIOS adaptation:

  • Replace mmap with our file API
  • Use our heap allocator instead of ggml_context

Deliverables:

  • Port GGUF parsing logic
  • Extract all hyperparameters
  • Load all tensors to memory
  • Unit test: load TinyLlama successfully
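Without mmap or gguf_context, the first thing the kernel parser has to do is validate the fixed-size header from a raw buffer. The sketch below assumes the GGUF v2+ header layout (4-byte magic "GGUF", 32-bit version, 64-bit tensor count, 64-bit metadata count, all little-endian) and reads byte by byte so it does not depend on alignment or host endianness. The struct and function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct gguf_header_sketch {
    uint32_t version;
    uint64_t n_tensors;
    uint64_t n_kv;
};

static uint32_t rd_u32le(const uint8_t *p) {
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

static uint64_t rd_u64le(const uint8_t *p) {
    return (uint64_t)rd_u32le(p) | ((uint64_t)rd_u32le(p + 4) << 32);
}

/* Returns 0 on success, a negative code on error. */
int gguf_parse_header(const uint8_t *buf, size_t len,
                      struct gguf_header_sketch *out) {
    assert(out);
    if (!buf || len < 24) return -1;              /* 4 + 4 + 8 + 8 bytes */
    if (buf[0] != 'G' || buf[1] != 'G' || buf[2] != 'U' || buf[3] != 'F')
        return -2;                                /* bad magic */
    out->version = rd_u32le(buf + 4);
    if (out->version < 2) return -3;              /* v1 used 32-bit counts */
    out->n_tensors = rd_u64le(buf + 8);
    out->n_kv      = rd_u64le(buf + 16);
    return 0;
}
```

The same bounds-checked reader style can be carried through the metadata and tensor-info sections, so a truncated or corrupt file fails with an error code instead of reading past the buffer.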

Step 2: Tokenizer Loading

Source: llama.cpp:llama_load_vocab()

Critical: llama.cpp supports multiple tokenizer types:

enum llama_vocab_type {
    LLAMA_VOCAB_TYPE_SPM,  // SentencePiece
    LLAMA_VOCAB_TYPE_BPE,  // Byte-Pair Encoding (GPT-2)
};

// llama.cpp loads tokenizer from GGUF (simplified)
void llama_load_vocab(struct gguf_context * ctx, struct llama_vocab * vocab) {
    const char * tokenizer_model =
        gguf_get_val_str(ctx, gguf_find_key(ctx, "tokenizer.ggml.model"));

    if (strcmp(tokenizer_model, "gpt2") == 0) {
        vocab->type = LLAMA_VOCAB_TYPE_BPE;
        load_bpe_vocab(ctx, vocab);
    } else if (strcmp(tokenizer_model, "llama") == 0) {
        vocab->type = LLAMA_VOCAB_TYPE_SPM;
        load_spm_vocab(ctx, vocab);
    }

    // Load token strings; the vocabulary size is the length of the tokens array
    const int tok_key   = gguf_find_key(ctx, "tokenizer.ggml.tokens");
    const int score_key = gguf_find_key(ctx, "tokenizer.ggml.scores");  // -1 if absent
    const float * scores = score_key >= 0
        ? (const float *) gguf_get_arr_data(ctx, score_key) : NULL;
    int n_vocab = gguf_get_arr_n(ctx, tok_key);
    for (int i = 0; i < n_vocab; i++) {
        const char * token = gguf_get_arr_str(ctx, tok_key, i);
        float score = scores ? scores[i] : 0.0f;
        // Store token...
    }
}

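EMBODIOS: Support BPE for v1.0, SentencePiece later. Verify the target model first: TinyLlama inherits the Llama 2 tokenizer, and its GGUF conversions typically set tokenizer.ggml.model to "llama", which selects the SPM branch above rather than "gpt2".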

Deliverables:

  • Port tokenizer loading logic
  • Support BPE tokenizer
  • Load vocabulary from GGUF
  • Unit test: tokenize "Once upon a time"

Step 3: Weight Tensor Organization

Source: llama.cpp:llama_model_load_internal()

Critical insight: llama.cpp organizes tensors hierarchically:

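// GGUF tensor naming convention used by llama.cpp (blk.N = layer N)
token_embd.weight            // Token embeddings
blk.0.attn_norm.weight       // Pre-attention RMSNorm, layer 0
blk.0.attn_q.weight          // Query weights, layer 0
blk.0.attn_k.weight          // Key weights, layer 0
blk.0.attn_v.weight          // Value weights, layer 0
blk.0.attn_output.weight
blk.0.ffn_norm.weight        // Pre-FFN RMSNorm, layer 0
blk.0.ffn_gate.weight
blk.0.ffn_down.weight
blk.0.ffn_up.weight
output_norm.weight           // Final RMSNorm
output.weight                // Output projection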

EMBODIOS: Replicate this structure for compatibility.

struct llama_layer {
    // Attention weights
    fixed_t* wq;  // Query
    fixed_t* wk;  // Key
    fixed_t* wv;  // Value
    fixed_t* wo;  // Output

    // FFN weights
    fixed_t* w1;  // Gate
    fixed_t* w2;  // Down
    fixed_t* w3;  // Up

    // Normalization weights
    fixed_t* attn_norm;
    fixed_t* ffn_norm;
};

struct llama_model {
    int n_vocab;
    int n_embd;
    int n_layer;
    int n_head;
    int n_head_kv;

    fixed_t* tok_embeddings;
    struct llama_layer* layers;
    fixed_t* output_norm;
    fixed_t* output;
};

Deliverables:

  • Define model structure
  • Load all tensors into structure
  • Verify tensor shapes match GGUF
  • Document memory layout
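Loading tensors into the structure needs a mapping from each tensor name to a layer index and a field. The sketch below assumes the GGUF per-layer naming scheme "blk.N.suffix" (for example blk.3.attn_q.weight) and maps the suffix to an enum mirroring the fields of struct llama_layer. The enum and function names are hypothetical.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

enum layer_field {
    LF_WQ, LF_WK, LF_WV, LF_WO,
    LF_W1_GATE, LF_W2_DOWN, LF_W3_UP,
    LF_ATTN_NORM, LF_FFN_NORM,
    LF_UNKNOWN
};

/* Parses "blk.<N>.<suffix>". On a match, stores N in *layer and returns the
 * field; otherwise returns LF_UNKNOWN and leaves *layer untouched. */
enum layer_field tensor_name_to_field(const char *name, int *layer) {
    static const struct { const char *suffix; enum layer_field field; } map[] = {
        {"attn_q.weight",      LF_WQ},
        {"attn_k.weight",      LF_WK},
        {"attn_v.weight",      LF_WV},
        {"attn_output.weight", LF_WO},
        {"ffn_gate.weight",    LF_W1_GATE},
        {"ffn_down.weight",    LF_W2_DOWN},
        {"ffn_up.weight",      LF_W3_UP},
        {"attn_norm.weight",   LF_ATTN_NORM},
        {"ffn_norm.weight",    LF_FFN_NORM},
    };
    assert(name && layer);
    if (strncmp(name, "blk.", 4) != 0) return LF_UNKNOWN;
    if (name[4] < '0' || name[4] > '9') return LF_UNKNOWN;

    char *end;
    const long n = strtol(name + 4, &end, 10);
    if (*end != '.') return LF_UNKNOWN;

    for (size_t i = 0; i < sizeof map / sizeof map[0]; i++) {
        if (strcmp(end + 1, map[i].suffix) == 0) {
            *layer = (int)n;
            return map[i].field;
        }
    }
    return LF_UNKNOWN;
}
```

Names outside the per-layer scheme (token embeddings, final norm, output projection) return LF_UNKNOWN here and would be matched by a separate exact-string comparison.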

Phase 4: Inference Pipeline (Weeks 11-14)

Goal: Port transformer inference from llama.cpp:llama_decode().

4.1: Single-Token Forward Pass (Weeks 11-12)

Source: llama.cpp:llm_build_llama()

llama.cpp uses computational graph approach:

static struct ggml_cgraph * llm_build_llama(
    struct llama_context * lctx,
    const llama_batch & batch) {

    struct ggml_context * ctx = lctx->ctx_builder;
    struct ggml_cgraph * gf = ggml_new_graph(ctx);

    // 1. Token embedding
    struct ggml_tensor * inpL = ggml_get_rows(ctx,
        model.tok_embeddings, batch.token);

    // 2. For each layer
    for (int il = 0; il < n_layer; il++) {
        // RMSNorm
        struct ggml_tensor * cur = ggml_rms_norm(ctx, inpL, eps);
        cur = ggml_mul(ctx, cur, model.layers[il].attn_norm);

        // Attention
        struct ggml_tensor * Qcur = ggml_mul_mat(ctx, model.layers[il].wq, cur);
        struct ggml_tensor * Kcur = ggml_mul_mat(ctx, model.layers[il].wk, cur);
        struct ggml_tensor * Vcur = ggml_mul_mat(ctx, model.layers[il].wv, cur);

        // RoPE
        Qcur = ggml_rope_custom(ctx, Qcur, ...);
        Kcur = ggml_rope_custom(ctx, Kcur, ...);

        // Attention scores
        struct ggml_tensor * kq = ggml_mul_mat(ctx, Kcur, Qcur);
        kq = ggml_soft_max(ctx, kq);

        // Attention output
        cur = ggml_mul_mat(ctx, Vcur, kq);
        cur = ggml_mul_mat(ctx, model.layers[il].wo, cur);

        // Residual
        inpL = ggml_add(ctx, inpL, cur);

        // FFN (similar structure)...
    }

    // Build and return graph
    ggml_build_forward_expand(gf, cur);
    return gf;
}

EMBODIOS approach: Imperative (no graph), direct computation.

void llama_forward(
    struct llama_model* model,
    int token,
    int pos,
    fixed_t* x,        // Scratch, n_embd: residual stream
    fixed_t* xb,       // Scratch, n_embd: sublayer input/output
    fixed_t* logits) {

    const int n_embd = model->n_embd;

    // 1. Embedding lookup: copy the row so in-place ops never modify the weights
    memcpy(x, &model->tok_embeddings[token * n_embd], n_embd * sizeof(fixed_t));

    // 2. For each layer
    for (int layer = 0; layer < model->n_layer; layer++) {
        struct llama_layer* l = &model->layers[layer];

        // Pre-attention RMSNorm: xb = rmsnorm(x) * attn_norm
        rmsnorm(xb, x, l->attn_norm, n_embd);

        // Attention (in place on xb)
        attention(xb, l, pos, model);

        // Residual: x += xb
        add_residual(x, xb, n_embd);

        // Pre-FFN RMSNorm: xb = rmsnorm(x) * ffn_norm
        rmsnorm(xb, x, l->ffn_norm, n_embd);

        // FFN (in place on xb)
        ffn_swiglu(xb, l, model);

        // Residual: x += xb
        add_residual(x, xb, n_embd);
    }

    // 3. Final RMSNorm + output projection
    rmsnorm(xb, x, model->output_norm, n_embd);
    matmul(xb, model->output, logits, n_embd, model->n_vocab);
}

Deliverables:

  • Implement forward pass
  • Match llama.cpp layer-by-layer
  • Unit test: compare intermediate activations
  • Integration test: compare final logits
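For the activation and logit comparisons, the host-side test harness needs a way to score a fixed-point buffer against llama.cpp's float output. A minimal sketch, assuming Q16.16 and that the reference activations have been dumped to a float array (the function name is hypothetical; floats are fine here because this runs in the test harness, not in the kernel):

```c
#include <assert.h>
#include <stdint.h>

typedef int32_t fixed_t;       /* assumed Q16.16 */
#define FIXED_ONE 65536

/* Largest |got - ref| over n elements, expressed in float units. */
float max_abs_error(const fixed_t *got, const float *ref, int n) {
    assert(got && ref && n > 0);
    float worst = 0.0f;
    for (int i = 0; i < n; i++) {
        float diff = (float)got[i] / (float)FIXED_ONE - ref[i];
        if (diff < 0.0f) diff = -diff;
        if (diff > worst) worst = diff;
    }
    return worst;
}
```

Running this after every layer shows where the fixed-point path starts to drift, which is usually more informative than comparing only the final logits.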

4.2: KV Cache Implementation (Week 13)

Source: llama.cpp:llama_kv_cache

llama.cpp KV cache structure:

struct llama_kv_cache {
    struct ggml_tensor * k;  // All key tensors
    struct ggml_tensor * v;  // All value tensors

    int n_ctx;    // Max context length (e.g., 2048)
    int n_layer;  // Number of layers
};

void llama_kv_cache_update(
    struct llama_kv_cache * cache,
    const struct llama_batch * batch) {

    // Store K, V for each position
    for (int pos = 0; pos < batch->n_tokens; pos++) {
        for (int layer = 0; layer < n_layer; layer++) {
            // Copy K, V to cache
            memcpy(&cache->k[layer][pos * n_embd], K_cur, n_embd * sizeof(float));
            memcpy(&cache->v[layer][pos * n_embd], V_cur, n_embd * sizeof(float));
        }
    }
}

EMBODIOS: Similar structure, but use fixed_t.

Deliverables:

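  • Allocate KV cache (n_layer x n_ctx x n_head_kv x head_dim, once for K and once for V; this equals n_embd per position only when n_head_kv == n_head)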
  • Store K, V during forward pass
  • Reuse cached values for past tokens
  • Measure: 2x speedup on autoregressive generation
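A sketch of a flat, preallocated cache that fits the "no malloc after boot" rule: the caller supplies the two buffers, and the cache only computes offsets. The per-position width is n_head_kv x head_dim, which is smaller than n_embd for models using grouped-query attention. Struct and function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef int32_t fixed_t;              /* assumed Q16.16 */

struct kv_cache_sketch {
    fixed_t *k;                       /* n_layer * n_ctx * kv_dim elements */
    fixed_t *v;                       /* same size as k */
    int n_layer, n_ctx, kv_dim;       /* kv_dim = n_head_kv * head_dim */
};

/* Elements needed for K (and again for V). */
size_t kv_cache_elems(int n_layer, int n_ctx, int n_embd,
                      int n_head, int n_head_kv) {
    assert(n_head > 0 && n_embd % n_head == 0);
    const int kv_dim = (n_embd / n_head) * n_head_kv;
    return (size_t)n_layer * n_ctx * kv_dim;
}

static size_t kv_offset(const struct kv_cache_sketch *c, int layer, int pos) {
    return ((size_t)layer * c->n_ctx + pos) * c->kv_dim;
}

/* Copy the current token's K and V vectors into the cache. */
void kv_cache_store(struct kv_cache_sketch *c, int layer, int pos,
                    const fixed_t *k_cur, const fixed_t *v_cur) {
    assert(c && k_cur && v_cur);
    assert(layer >= 0 && layer < c->n_layer && pos >= 0 && pos < c->n_ctx);
    memcpy(c->k + kv_offset(c, layer, pos), k_cur, c->kv_dim * sizeof(fixed_t));
    memcpy(c->v + kv_offset(c, layer, pos), v_cur, c->kv_dim * sizeof(fixed_t));
}

/* Cached key vector for (layer, pos); kv_dim elements long. */
const fixed_t *kv_cache_key(const struct kv_cache_sketch *c, int layer, int pos) {
    assert(c && layer >= 0 && layer < c->n_layer && pos >= 0 && pos < c->n_ctx);
    return c->k + kv_offset(c, layer, pos);
}
```

With TinyLlama-like dimensions (assumed here: 22 layers, 2048 context, n_embd 2048, 32 heads, 4 KV heads), kv_cache_elems gives 11,534,336 elements per buffer, about 44 MiB each at 4 bytes per element.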

4.3: Sampling Strategies (Week 14)

Source: llama.cpp:llama_sample_*() and common/sampling.cpp

llama.cpp sampling functions:

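// Legacy llama_sample_* API (later releases moved to llama_sampler_*)

// Greedy sampling (argmax)
llama_token llama_sample_token_greedy(
    struct llama_context * ctx,
    llama_token_data_array * candidates);

// Temperature scaling (divides the candidate logits by temp, in place)
void llama_sample_temp(
    struct llama_context * ctx,
    llama_token_data_array * candidates,
    float temp);

// Random draw from the remaining candidates
llama_token llama_sample_token(
    struct llama_context * ctx,
    llama_token_data_array * candidates);

// Top-k sampling
void llama_sample_top_k(
    struct llama_context * ctx,
    llama_token_data_array * candidates,
    int32_t k,
    size_t min_keep);

// Top-p (nucleus) sampling
void llama_sample_top_p(
    struct llama_context * ctx,
    llama_token_data_array * candidates,
    float p,
    size_t min_keep);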

EMBODIOS: Port greedy, top-k, and temperature scaling for v1.0.

Deliverables:

  • Port greedy sampling
  • Port top-k sampling
  • Port temperature scaling
  • Unit test: compare sampled tokens with llama.cpp
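Greedy and top-k selection need no floating point at all, since they only compare logits. A minimal sketch over Q16.16 logits follows; the function names are hypothetical, and the top-k routine uses a simple O(n x k) insertion pass, which is reasonable for small k but would want a heap for large ones.

```c
#include <assert.h>
#include <stdint.h>

typedef int32_t fixed_t;      /* assumed Q16.16; only ordering matters here */

/* Index of the largest logit (first one wins on ties). */
int sample_greedy(const fixed_t *logits, int n_vocab) {
    assert(logits && n_vocab > 0);
    int best = 0;
    for (int i = 1; i < n_vocab; i++) {
        if (logits[i] > logits[best]) best = i;
    }
    return best;
}

/* Writes the indices of the k largest logits to out_ids, best first.
 * Returns how many were written (min(k, n_vocab)). */
int top_k_indices(const fixed_t *logits, int n_vocab, int k, int *out_ids) {
    assert(logits && out_ids && n_vocab > 0 && k > 0);
    int count = 0;
    for (int i = 0; i < n_vocab; i++) {
        int j;
        if (count == k) {
            /* List is full: skip unless i beats the current worst entry. */
            if (logits[i] <= logits[out_ids[k - 1]]) continue;
            j = k - 1;
        } else {
            j = count++;
        }
        /* Shift smaller entries down to keep the list sorted. */
        while (j > 0 && logits[out_ids[j - 1]] < logits[i]) {
            out_ids[j] = out_ids[j - 1];
            j--;
        }
        out_ids[j] = i;
    }
    return count;
}
```

Temperature scaling and the random draw over the surviving candidates can then reuse the fixed-point softmax from Chapter 1.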

Phase 5: Tokenization (Weeks 15-16)

Goal: Port BPE tokenizer from llama.cpp.

5.1: BPE Encoding (Week 15)

Source: unicode.cpp and llama.cpp:llama_tokenize_internal()

llama.cpp BPE algorithm:

std::vector<llama_vocab::id> llama_tokenize_internal(
    const llama_vocab & vocab,
    std::string text,
    bool bos,
    bool special) {

    std::vector<llama_vocab::id> output;

    // Add BOS token
    if (bos) {
        output.push_back(vocab.special_bos_id);
    }

    // Convert UTF-8 to codepoints
    std::vector<uint32_t> codepoints = unicode_cpts_from_utf8(text);

    // Apply BPE merges
    for (auto & merge : vocab.bpe_merges) {
        // Find and merge pairs...
    }

    // Map to token IDs
    for (auto & token : tokens) {
        output.push_back(vocab.token_to_id[token]);
    }

    return output;
}

EMBODIOS: Port to pure C.

Deliverables:

  • Port UTF-8 → codepoint conversion
  • Port BPE merge algorithm
  • Load BPE merges from GGUF
  • Unit test: "Once upon a time" → same token IDs as llama.cpp
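The UTF-8 to codepoint step is the part of unicode.cpp that ports most directly to C. The sketch below decodes one codepoint at a time from a length-bounded buffer and rejects malformed input (truncated sequences, bad continuation bytes, overlong encodings, surrogates, values above U+10FFFF). The function name and the return convention are assumptions, not llama.cpp's API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Decodes one codepoint from s[0..len).
 * Returns the number of bytes consumed, or 0 on malformed/truncated input. */
size_t utf8_decode(const unsigned char *s, size_t len, uint32_t *cp) {
    static const uint32_t min_cp[5] = {0, 0, 0x80, 0x800, 0x10000};
    size_t n;
    uint32_t c;

    assert(s && cp);
    if (len == 0) return 0;

    if (s[0] < 0x80) { *cp = s[0]; return 1; }           /* ASCII */
    else if ((s[0] & 0xE0) == 0xC0) { n = 2; c = s[0] & 0x1F; }
    else if ((s[0] & 0xF0) == 0xE0) { n = 3; c = s[0] & 0x0F; }
    else if ((s[0] & 0xF8) == 0xF0) { n = 4; c = s[0] & 0x07; }
    else return 0;                                        /* invalid lead byte */

    if (len < n) return 0;                                /* truncated */
    for (size_t i = 1; i < n; i++) {
        if ((s[i] & 0xC0) != 0x80) return 0;              /* bad continuation */
        c = (c << 6) | (s[i] & 0x3F);
    }

    /* Reject overlong forms, UTF-16 surrogates and out-of-range values. */
    if (c < min_cp[n] || c > 0x10FFFF || (c >= 0xD800 && c <= 0xDFFF)) return 0;

    *cp = c;
    return n;
}
```

A caller loops over the input, advancing by the returned byte count, and decides separately how to handle a 0 return (for example by emitting U+FFFD and skipping one byte).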

5.2: BPE Decoding (Week 16)

Source: llama.cpp:llama_detokenize()

llama.cpp decoding:

std::string llama_detokenize_bpe(
    const llama_vocab & vocab,
    const std::vector<llama_vocab::id> & tokens) {

    std::string text;

    for (auto id : tokens) {
        // Skip special tokens
        if (id == vocab.special_bos_id || id == vocab.special_eos_id) {
            continue;
        }

        // Get token string
        const std::string & token_str = vocab.id_to_token[id].text;
        text += token_str;
    }

    return text;
}

EMBODIOS: Port to C.

Deliverables:

  • Port token ID → string mapping
  • Handle special tokens
  • Unit test: roundtrip encoding/decoding

Phase 6: Validation & Testing (Weeks 17-18)

Goal: Ensure EMBODIOS output matches llama.cpp exactly.

6.1: Token-by-Token Comparison (Week 17)

Create comparison test:

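#!/usr/bin/env python3
"""
Compare EMBODIOS output with llama.cpp on the same prompt.
"""

import subprocess
import sys

prompt = "Once upon a time"
model = "models/tinyllama-1.1b-q4km.gguf"


def parse_tokens(raw: bytes) -> list[str]:
    # Placeholder: compares whitespace-separated pieces of the generated text.
    # Swap in a parser for real token IDs once both tools can print them.
    return raw.decode("utf-8", errors="replace").split()


# Run llama.cpp
llamacpp_output = subprocess.check_output([
    "./llama.cpp/main",
    "-m", model,
    "-p", prompt,
    "-n", "50",
    "--temp", "0.0"  # Greedy for deterministic output
])

# Run EMBODIOS
embodios_output = subprocess.check_output([
    "./embodios_cli",
    "--model", model,
    "--prompt", prompt,
    "--n-predict", "50",
    "--temp", "0.0"
])

# Compare token-by-token
llamacpp_tokens = parse_tokens(llamacpp_output)
embodios_tokens = parse_tokens(embodios_output)

for i, (l, e) in enumerate(zip(llamacpp_tokens, embodios_tokens)):
    if l != e:
        print(f"MISMATCH at token {i}: llama.cpp={l}, embodios={e}")
        sys.exit(1)

# zip() stops at the shorter list, so check the lengths as well
if len(llamacpp_tokens) != len(embodios_tokens):
    print(f"LENGTH MISMATCH: llama.cpp={len(llamacpp_tokens)}, "
          f"embodios={len(embodios_tokens)}")
    sys.exit(1)

print("✅ ALL TOKENS MATCH!")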

Deliverables:

  • Create comparison script
  • Test 10 different prompts
  • Achieve 100% token match on greedy sampling
  • Document any discrepancies

6.2: Performance Benchmarking (Week 18)

Benchmark against llama.cpp:

Metric              | llama.cpp | EMBODIOS v1.0 | Target
--------------------|-----------|---------------|-------
Tokens/sec          | 83-86     | TBD           | 85+
First token (ms)    | ~50       | TBD           | <20
Memory (MB)         | 160       | 120           | <150
Latency jitter (ms) | ±5-10     | ±0.5          | <±1

Deliverables:

  • Benchmark 1000-token generation
  • Compare speed, memory, latency
  • Identify performance gaps
  • Create optimization plan

Implementation Phases

Note: Implementation follows a chapter-by-chapter approach. Each phase builds upon the previous one.

Phase 1: GGML Tensor Operations (Chapter 1)
├─ Matrix multiplication
├─ RoPE and Softmax
└─ RMSNorm and testing

Phase 2: Quantization Kernels (Chapter 2)
├─ Q4_K dequantization
├─ Q5_K and Q6_K support
└─ SIMD optimizations

Phase 3: Model Loading (Chapter 3)
├─ GGUF metadata parsing
├─ Tokenizer loading
└─ Weight tensor organization

Phase 4: Inference Pipeline
├─ Transformer forward pass
├─ KV cache implementation
└─ Sampling strategies

Phase 5: Tokenization
├─ BPE encoding
└─ BPE decoding

Phase 6: Validation
├─ Token-by-token comparison with llama.cpp
└─ Performance benchmarking

Success Criteria

v1.0 llama.cpp integration is complete when:

  1. Correctness:

    • Generate same tokens as llama.cpp (greedy sampling)
    • Pass token-by-token comparison on 10 prompts
    • Tokenizer roundtrip matches llama.cpp
  2. Performance:

    • 85+ tokens/sec (match/exceed llama.cpp)
    • <20ms first token latency
    • ±0.5ms latency jitter
  3. Compatibility:

    • Load any GGUF model llama.cpp can load
    • Support Q4_K_M, Q5_K_M, Q6_K quantization
    • TinyLlama, Phi-2, Mistral-7B work
  4. Code Quality:

    • Pure C (no C++ dependencies)
    • Kernel-safe (no malloc/free after boot)
    • Well-commented with llama.cpp references

File Mapping: llama.cpp → EMBODIOS

llama.cpp                         | EMBODIOS                      | Status
----------------------------------|-------------------------------|-------
ggml.c:ggml_mul_mat               | kernel/ai/ops/matmul.c        | TODO
ggml.c:ggml_rope                  | kernel/ai/ops/rope.c          | TODO
ggml.c:ggml_soft_max              | kernel/ai/ops/softmax.c       | TODO
ggml.c:ggml_rms_norm              | kernel/ai/rmsnorm.c           | TODO
ggml-quants.c:dequantize_row_q4_K | kernel/gguf/gguf_quant.c      | TODO
llama.cpp:llama_load_model        | kernel/gguf/gguf_parser.c     | TODO
llama.cpp:llama_decode            | kernel/ai/transformer.c       | TODO
llama.cpp:llama_kv_cache          | kernel/ai/kv_cache.c          | TODO
llama.cpp:llama_sample_*          | kernel/ai/sampling.c          | TODO
unicode.cpp                       | kernel/ai/tokenizer/unicode.c | TODO
llama.cpp:llama_tokenize          | kernel/ai/tokenizer/bpe.c     | TODO

GitHub Issues to Create

Critical Priority (Phase 1: GGML Tensor Operations)

  • LLAMA-001: Port ggml_mul_mat from llama.cpp
  • LLAMA-002: Port ggml_rope from llama.cpp
  • LLAMA-003: Port ggml_soft_max from llama.cpp
  • LLAMA-004: Port ggml_rms_norm from llama.cpp

High Priority (Phases 2-3: Quantization & Model Loading)

  • LLAMA-005: Port Q4_K dequantization from llama.cpp
  • LLAMA-006: Port Q5_K and Q6_K dequantization
  • LLAMA-007: Port GGUF metadata parsing
  • LLAMA-008: Port tokenizer loading logic
  • LLAMA-009: Port weight tensor organization

Medium Priority (Phases 4-5: Inference & Tokenization)

  • LLAMA-010: Port transformer forward pass
  • LLAMA-011: Port KV cache implementation
  • LLAMA-012: Port sampling strategies
  • LLAMA-013: Port BPE encoding
  • LLAMA-014: Port BPE decoding

Testing (Phase 6: Validation)

  • LLAMA-015: Create token-by-token comparison test
  • LLAMA-016: Benchmark against llama.cpp

Total: 16 issues


#embodios #llamacpp #ggml #integration #transformer #inference #pillar-1
