Erasure Code: Extend ec_encode_data_avx2_gfni and ec_encode_data_update_avx2_gfni to support parity blocks k+1 through k+6 by OH195-C · Pull Request #423 · intel/isa-l

OH195-C · 2026-06-12T03:46:46Z

Erasure Code:
1) add 4~6 vector AVX2 dot product with GFNI implementation
2) add AVX2 6vect mad with GFNI implementation
3) ensure the encoding process does not modify the input mul_array pointer

Complete the implementations of ec_encode_data_avx2_gfni and ec_encode_data_update_avx2_gfni to support parity blocks k+1 through k+6, consistent with the implementations for other instruction sets (AVX512, AVX2, etc.). This avoids reading source data twice when computing parities k+4 to k+6, preventing memory bandwidth amplification.
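For readers unfamiliar with the kernels, here is a minimal scalar sketch of what a multi-vector dot product and a multiply-and-add ("mad") update compute. It is not the AVX2+GFNI assembly from this PR. The helper names (gf_mul_ref, dot_prod_single_pass, mad_update), the row-major coefficient layout, and the 0x11d field polynomial are assumptions made for illustration only. Each source byte is loaded once and applied to all p parity rows, and the coefficient pointer is const and indexed rather than advanced, which mirrors items 1) to 3) above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Scalar GF(2^8) multiply; the reduction polynomial 0x11d is assumed here. */
static uint8_t gf_mul_ref(uint8_t a, uint8_t b)
{
    uint8_t r = 0;
    while (b) {
        if (b & 1)
            r ^= a;
        a = (uint8_t)((a << 1) ^ ((a & 0x80) ? 0x1d : 0));
        b >>= 1;
    }
    return r;
}

/* Hypothetical helper: compute p parity buffers (p <= 6) in a single pass
 * over k source buffers. coef is row-major, p rows by k columns, and is
 * only indexed, never advanced or written. */
static void dot_prod_single_pass(size_t len, int k, int p, const uint8_t *coef,
                                 uint8_t **src, uint8_t **parity)
{
    assert(p >= 1 && p <= 6);
    for (size_t i = 0; i < len; i++) {
        uint8_t acc[6] = { 0 };
        for (int j = 0; j < k; j++) {
            uint8_t s = src[j][i]; /* each source byte is loaded once */
            for (int r = 0; r < p; r++)
                acc[r] ^= gf_mul_ref(coef[r * k + j], s);
        }
        for (int r = 0; r < p; r++)
            parity[r][i] = acc[r];
    }
}

/* Hypothetical helper: fold one source buffer (index src_idx) into p
 * existing parity buffers, i.e. parity[r] ^= coef[r][src_idx] * src. */
static void mad_update(size_t len, int k, int p, int src_idx,
                       const uint8_t *coef, const uint8_t *src,
                       uint8_t **parity)
{
    assert(p >= 1 && p <= 6);
    for (size_t i = 0; i < len; i++) {
        uint8_t s = src[i];
        for (int r = 0; r < p; r++)
            parity[r][i] ^= gf_mul_ref(coef[r * k + src_idx], s);
    }
}
```

Starting from zeroed parity buffers, calling mad_update once per source should produce the same parities as one call to dot_prod_single_pass, which is the relationship between the encode and update paths.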

pablodelara · 2026-06-12T08:31:09Z

Hi @OH195-C. I was looking for the question asking why there is no p=4,5,6 implementation for AVX2_gfni, since you are implementing it here, but I cannot find it...
Anyway, we haven't done it because going beyond p=3 showed no throughput benefit: p=3 was already saturating the execution port that runs these instructions. Have you seen a performance improvement yourself?

OH195-C · 2026-06-15T15:22:40Z

Hi @pablodelara. I closed issue #419 because this commit fixes it. (I will be using the OH195-C account for all future communication.)
I performed performance tests using the erasure_encode_perf benchmark on an Intel Xeon Platinum 8331C CPU.
Model name: Intel(R) Xeon(R) Platinum 8331C CPU @ 2.50GHz
CPU MHz: 2499.999
CPU max MHz: 2500.0000
CPU min MHz: 800.0000
L1d cache: 48K
L1i cache: 32K
L2 cache: 1280K
L3 cache: 36864K

Here are the results:

AVX2+GFNI (+1~+6) outperforms AVX2 because GFNI instructions reduce the number of instructions required to implement Galois Field multiplication.
For 10+4, AVX2+GFNI (+1~+6) outperforms AVX2+GFNI (+1~+3) because the new implementation for parity +4 uses a 64-byte loop stride and reads the data only once, incurring no read amplification.
For 10+5 and 10+6 at 1 KB, where the data fits in the L1 cache, AVX2+GFNI (+1~+3) outperforms AVX2+GFNI (+1~+6); the former processes 64 bytes per iteration, while the latter processes only 32 bytes. At 32 KB, both variants perform equally. However, at 1 MB, AVX2+GFNI (+1~+6) outperforms AVX2+GFNI (+1~+3), as the performance penalty from reading data blocks twice becomes increasingly significant.
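The read-amplification argument can be made concrete with a rough model. The sketch below assumes that a kernel producing at most max_vect parity buffers per call must re-read all k source blocks for every additional call; the function names are hypothetical and cache effects are ignored, so it only describes the large-buffer case where data streams from memory.

```c
#include <assert.h>
#include <stddef.h>

/* Passes over the source data needed to produce p parity buffers when one
 * kernel call can produce at most max_vect of them (assumed model). */
static int ec_passes(int p, int max_vect)
{
    assert(p > 0 && max_vect > 0);
    return (p + max_vect - 1) / max_vect;
}

/* Source bytes read for a k+p encode with len bytes per block. */
static size_t ec_source_bytes_read(int k, int p, size_t len, int max_vect)
{
    return (size_t)ec_passes(p, max_vect) * (size_t)k * len;
}
```

Under this model, a 10+4 encode of 1 MB blocks reads 20 MB of source data with max_vect = 3 (two passes) and 10 MB with max_vect = 6 (one pass). When the blocks fit in L1, the second pass is served from cache, which is consistent with the 1 KB results described above.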

During implementation, some YMM and general-purpose registers were reused to address register pressure. As demonstrated above, AVX2+GFNI (+1~+6) consistently outperforms AVX2. Crucially, it introduces no data read amplification, maintaining consistency with other SIMD implementations.

The performance results for ec_encode_data_update_single_src_simple_warm are shown below:

AVX2+GFNI (+1~+6) delivers the best overall performance with no data read amplification.

OH195-C · 2026-06-17T02:04:41Z

Hi @pablodelara, I noticed that some CI checks are failing. I reviewed the error logs, and the failures are strange because I haven't modified any of the affected files. Could you advise on how to resolve this?

pablodelara · 2026-06-17T07:15:30Z

> Hi @pablodelara, I noticed that some CI checks are failing. I reviewed the error logs, which is strange because I haven't modified any of the affected files. Could you advise on how to resolve this?

Can you rebase on top of latest master?

1) add 4~6 vector AVX2 dot product with GFNI implementation
2) add AVX2 6vect mad with GFNI implementation
3) ensuring encoding process not modify the input mul_array pointer

Complete the implementations of ec_encode_data_avx2_gfni and ec_encode_data_update_avx2_gfni to support parity blocks k+1 through k+6, consistent with the implementations for other instruction sets (AVX512, AVX2, etc.). This avoids reading source data twice when computing parities k+4 to k+6, preventing memory bandwidth amplification.

Signed-off-by: cl304641 <cl304641@alibaba-inc.com>

OH195-C · 2026-06-17T08:29:49Z

> > Hi @pablodelara, I noticed that some CI checks are failing. I reviewed the error logs, which is strange because I haven't modified any of the affected files. Could you advise on how to resolve this?
>
> Can you rebase on top of latest master?

Done. I've rebased onto the latest master.

OH195-C · 2026-06-18T09:44:12Z

Hi @pablodelara, could you please re-run the run_tests_linux-riscv64-v job? This looks like a self-hosted runner issue: the build process was killed by an external signal with no compilation errors, and all 8 other CI jobs passed.

OH195-C force-pushed the develop branch from d54ad96 to 3b2596f on June 17, 2026 08:24

OH195-C wants to merge 1 commit into intel:master from alibaba:develop
