Skip to content

perf: NEON SIMD matmul optimization (~4x speedup)#218

Merged
dddimcha merged 1 commit into
mainfrom
simd/neon-matmul-optimization
Feb 2, 2026
Merged

perf: NEON SIMD matmul optimization (~4x speedup)#218
dddimcha merged 1 commit into
mainfrom
simd/neon-matmul-optimization

Conversation

@dddimcha

@dddimcha dddimcha commented Feb 2, 2026

Copy link
Copy Markdown
Owner

Summary

Replace scalar tensor_gemm() with SIMD-accelerated matmul_neon() in tensor_dense_forward().

Changes

  • 1 file changed: kernel/ai/tensor_ops.c (+62, -8 lines)
  • Uses existing matmul_neon() from simd_ops.c
  • Converts float→Q16.16 fixed-point→SIMD→float
  • Graceful fallback to scalar on allocation failure

Benchmark Results (ARM64)

Matrix Size Scalar NEON Speedup
64×64 0.16ms 0.04ms 4.48x
128×128 1.52ms 0.38ms 4.03x
256×256 15.2ms 3.76ms 4.04x
512×512 144ms 36ms 4.00x
1024×1024 1278ms 319ms 4.01x

Testing

  • ✅ Compiles for x86_64
  • ✅ Compiles for aarch64 (cross-compile)
  • ✅ Benchmarked on Apple Silicon

Code Diff

// BEFORE (scalar):
tensor_gemm(in_data, weight_data, out_data, ...);

// AFTER (SIMD):
matmul_neon(in_fixed, weight_fixed, out_fixed, ...);

Replace scalar tensor_gemm() with SIMD-accelerated matmul_neon() for
~4x faster matrix multiplication on ARM64.

Changes:
- Add fixed_point.h include for Q16.16 format
- Convert float→fixed→SIMD→float in tensor_dense_forward()
- Graceful fallback to scalar on allocation failure

Benchmark results (Apple Silicon ARM64):
  64x64:   4.48x speedup
  256x256: 4.04x speedup
  512x512: 4.00x speedup
  1024x1024: 4.01x speedup

Tested on: macOS ARM64, cross-compiled for aarch64-elf
@dddimcha dddimcha merged commit a840eb8 into main Feb 2, 2026
2 of 4 checks passed
@dddimcha dddimcha deleted the simd/neon-matmul-optimization branch February 2, 2026 19:24
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant