
Erasure Code: Extend ec_encode_data_avx2_gfni and ec_encode_data_update_avx2_gfni to support parity blocks k+1 through k+6 #423

Open
OH195-C wants to merge 1 commit into intel:master from alibaba:develop


OH195-C commented Jun 12, 2026


Erasure Code:
1) Add 4- to 6-vector AVX2 dot product implementations with GFNI
2) Add an AVX2 6vect mad implementation with GFNI
3) Ensure the encoding process does not modify the input mul_array pointer

Complete the implementations of ec_encode_data_avx2_gfni and ec_encode_data_update_avx2_gfni to support parity blocks k+1 through k+6, consistent with the implementations for other instruction sets (AVX512, AVX2, etc.). This avoids reading source data twice when computing parities k+4 to k+6, preventing memory bandwidth amplification.
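
As a rough illustration of what "reading source data once" means, here is a minimal scalar sketch in C. It is not the ISA-L code and not the assembly in this PR: the names `gf256_mul`, `encode_single_pass`, and `MAX_PARITY`, the flat array layout, and the 0x11D reduction polynomial are all assumptions made for the example. Each source byte is loaded a single time and folded into every parity accumulator before moving on.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_PARITY 6 /* illustrative cap matching the k+1..k+6 range */

/* Multiply two GF(2^8) elements bit by bit.
 * Assumption: reduction polynomial x^8+x^4+x^3+x^2+1 (0x11D). */
static uint8_t gf256_mul(uint8_t a, uint8_t b)
{
    uint8_t r = 0;
    while (b) {
        if (b & 1)
            r ^= a;
        /* Double a; fold the overflow bit back in with 0x1D. */
        a = (uint8_t)((a << 1) ^ ((a & 0x80) ? 0x1D : 0));
        b >>= 1;
    }
    return r;
}

/* Hypothetical scalar model of a single-pass encode.
 * coef:   nparity x k coefficients, row-major
 * data:   k source blocks of len bytes, stored back to back
 * parity: nparity output blocks of len bytes, stored back to back */
void encode_single_pass(size_t len, size_t k, size_t nparity,
                        const uint8_t *coef, const uint8_t *data,
                        uint8_t *parity)
{
    assert(nparity >= 1 && nparity <= MAX_PARITY);
    for (size_t i = 0; i < len; i++) {
        uint8_t acc[MAX_PARITY] = { 0 };
        for (size_t s = 0; s < k; s++) {
            uint8_t d = data[s * len + i]; /* each source byte is loaded once */
            for (size_t j = 0; j < nparity; j++)
                acc[j] ^= gf256_mul(coef[j * k + s], d);
        }
        for (size_t j = 0; j < nparity; j++)
            parity[j * len + i] = acc[j];
    }
}
```

A version limited to three accumulators would have to run the outer loops twice to produce four to six parities, loading every source byte a second time.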

@pablodelara

Contributor

Hi @OH195-C. I was looking for the question about why there is no p=4,5,6 implementation for AVX2_gfni, since you are implementing it, but I cannot find it...
Anyway, we haven't done it because going beyond p=3 didn't show any throughput benefit: the code was already saturating the execution port that executes these instructions. Have you seen a performance improvement yourself?

OH195-C commented Jun 15, 2026

Author

Hi @pablodelara. I closed issue #419 because this commit fixes it. (I will be using the OH195-C account for all future communication.)
I performed performance tests using the erasure_encode_perf benchmark on an Intel Xeon Platinum 8331C CPU.
Model name: Intel(R) Xeon(R) Platinum 8331C CPU @ 2.50GHz
CPU MHz: 2499.999
CPU max MHz: 2500.0000
CPU min MHz: 800.0000
L1d cache: 48K
L1i cache: 32K
L2 cache: 1280K
L3 cache: 36864K

Here are the results:
[image: erasure_encode_perf benchmark results]

  1. AVX2+GFNI (+1~+6) outperforms AVX2 because GFNI instructions reduce the number of instructions required to implement Galois Field multiplication.
  2. For 10+4, AVX2+GFNI (+1~+6) outperforms AVX2+GFNI (+1~+3) because the new implementation for parity +4 uses a 64-byte loop stride and reads the data only once, incurring no read amplification.
  3. For 10+5 and 10+6 at 1 KB, where the data fits in the L1 cache, AVX2+GFNI (+1~+3) outperforms AVX2+GFNI (+1~+6); the former processes 64 bytes per iteration, while the latter processes only 32 bytes. At 32 KB, both variants perform equally. However, at 1 MB, AVX2+GFNI (+1~+6) outperforms AVX2+GFNI (+1~+3), as the performance penalty from reading data blocks twice becomes increasingly significant.

During implementation, some YMM and general-purpose registers were reused to address register pressure. As demonstrated above, AVX2+GFNI (+1~+6) consistently outperforms AVX2. Crucially, it introduces no data read amplification, maintaining consistency with other SIMD implementations.
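
The read amplification described above can be put into numbers with a small sketch. This is only a back-of-the-envelope model, not a measurement: `source_bytes_read` is a hypothetical helper, and it assumes that a variant producing at most `max_per_pass` parities per sweep must re-read all k source blocks on every extra sweep.

```c
#include <assert.h>
#include <stddef.h>

/* Source bytes loaded to produce nparity parities from k blocks of len bytes,
 * when one sweep over the sources can produce at most max_per_pass parities.
 * Hypothetical model for illustration only. */
size_t source_bytes_read(size_t len, size_t k, size_t nparity,
                         size_t max_per_pass)
{
    assert(max_per_pass > 0);
    /* Ceiling division: number of sweeps over the source data. */
    size_t passes = (nparity + max_per_pass - 1) / max_per_pass;
    return passes * k * len;
}
```

Under this model, 10+4 with 1 MB blocks loads 20 MB of source data when limited to three parities per sweep, against 10 MB when up to six parities are produced per sweep; for 10+1 through 10+3 the two variants load the same amount.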

The performance results for ec_encode_data_update_single_src_simple_warm are shown below:
[image: ec_encode_data_update_single_src_simple_warm benchmark results]
AVX2+GFNI (+1~+6) delivers the best overall performance with no data read amplification.

OH195-C commented Jun 17, 2026

Author

Hi @pablodelara, I noticed that some CI checks are failing. I reviewed the error logs, and the failures are strange because I haven't modified any of the affected files. Could you advise on how to resolve this?

@pablodelara

Contributor

> Hi @pablodelara, I noticed that some CI checks are failing. I reviewed the error logs, and the failures are strange because I haven't modified any of the affected files. Could you advise on how to resolve this?

Can you rebase on top of latest master?

    1) add 4~6 vector AVX2 dot product with GFNI implementation
    2) add AVX2 6vect mad with GFNI implementation
    3) ensuring encoding process not modify the input mul_array pointer

Complete the implementations of ec_encode_data_avx2_gfni and
ec_encode_data_update_avx2_gfni to support parity blocks k+1 through
k+6, consistent with the implementations for other instruction sets
(AVX512, AVX2, etc.). This avoids reading source data twice when
computing parities k+4 to k+6, preventing memory bandwidth
amplification.

Signed-off-by: cl304641 <cl304641@alibaba-inc.com>

OH195-C commented Jun 17, 2026

Author

> > Hi @pablodelara, I noticed that some CI checks are failing. I reviewed the error logs, and the failures are strange because I haven't modified any of the affected files. Could you advise on how to resolve this?
>
> Can you rebase on top of latest master?

Done. I've rebased onto the latest master.

OH195-C commented Jun 18, 2026

Author

Hi @pablodelara, could you please re-run the run_tests_linux-riscv64-v job? This looks like a self-hosted runner issue: the build process was killed by an external signal with no compilation errors, and the other 8 CI jobs all passed.
