AES: Add POWER ISA 2.07+ optimization #213

Open: runlevel5 wants to merge 2 commits into ip7z:main from runlevel5:ppc64

@runlevel5

What is it

Adds a POWER8 AES SIMD path to AesOpt.c and wires it into the runtime dispatch in Aes.c, using the in-core AES instructions introduced in Power ISA 2.07 (vcipher, vcipherlast, vncipher, vncipherlast). Both ppc64le and ppc64be are supported. The path is selected at runtime via CPU_IsSupported_VEC_CRYPTO() and falls back to the scalar implementation on hosts without the feature.
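The dispatch described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in this PR: every `Sketch_*` name is hypothetical, the function-pointer signature is simplified and does not match Aes.h, and reading `AT_HWCAP2` through `getauxval()` is one common way to detect the feature on Linux, which may differ from how `CPU_IsSupported_VEC_CRYPTO()` is actually implemented.

```c
#include <stddef.h>

#if defined(__linux__) && (defined(__powerpc64__) || defined(__PPC64__))
#include <sys/auxv.h>
#ifndef PPC_FEATURE2_VEC_CRYPTO
/* AT_HWCAP2 bit the Linux kernel sets when the ISA 2.07 crypto instructions exist. */
#define PPC_FEATURE2_VEC_CRYPTO 0x02000000
#endif
#endif

/* Hypothetical, simplified signature; the real AES functions take state and data. */
typedef const char *(*SketchAesFunc)(void);

/* Placeholders standing in for the scalar and vcipher-based code paths. */
static const char *Sketch_AesCbc_Enc_Scalar(void) { return "scalar"; }
static const char *Sketch_AesCbc_Enc_HW(void) { return "hw"; }

/* Assumed stand-in for CPU_IsSupported_VEC_CRYPTO(): nonzero only on
   Linux/ppc64 hosts whose kernel reports the vector crypto feature. */
static int Sketch_HasVecCrypto(void)
{
#if defined(__linux__) && (defined(__powerpc64__) || defined(__PPC64__))
  return (getauxval(AT_HWCAP2) & PPC_FEATURE2_VEC_CRYPTO) != 0;
#else
  return 0;
#endif
}

/* The scalar pointer is the default and is replaced only when the feature
   is present, so hosts without it see no behavioural change. */
static SketchAesFunc Sketch_SelectCbcEnc(int hasVecCrypto)
{
  SketchAesFunc f = Sketch_AesCbc_Enc_Scalar;
  if (hasVecCrypto)
    f = Sketch_AesCbc_Enc_HW;
  return f;
}
```

A caller would run `Sketch_SelectCbcEnc(Sketch_HasVecCrypto())` once at startup and keep the returned pointer for all later calls.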

AES throughput on POWER (16 MB buffer, best-of-8, MB/s)

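| Key size | Operation | Power9 LE | Power10 LE | Power9 BE |
|---|---|---|---|---|
| AES-128 | CBC encode HW | 765.0 | 1067.6 | 720.0 |
| AES-128 | CBC encode SW | 204.3 | 250.5 | 202.2 |
| AES-128 | CBC decode HW | 1000.9 | 3174.0 | 1302.7 |
| AES-128 | CTR HW | 1519.6 | 756.4 | 1185.4 |
| AES-128 | CTR SW | 222.0 | 255.0 | 218.4 |
| AES-192 | CBC encode HW | 638.7 | 861.8 | 607.4 |
| AES-192 | CBC encode SW | 175.7 | 223.0 | 175.6 |
| AES-192 | CBC decode HW | 1364.3 | 2701.4 | 1085.7 |
| AES-192 | CTR HW | 1259.1 | 641.0 | 996.2 |
| AES-192 | CTR SW | 189.4 | 219.9 | 186.8 |
| AES-256 | CBC encode HW | 557.7 | 772.5 | 517.3 |
| AES-256 | CBC encode SW | 147.3 | 188.1 | 147.2 |
| AES-256 | CBC decode HW | 1165.4 | 2153.9 | 926.4 |
| AES-256 | CTR HW | 1095.9 | 589.5 | 850.8 |
| AES-256 | CTR SW | 156.2 | 185.1 | 154.9 |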
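
The "16 MB buffer, best-of-8, MB/s" methodology suggests a measurement loop along the following lines. This is a minimal sketch, not the harness that produced the numbers above: all `Sketch_*` names are hypothetical, `clock()` measures CPU time rather than wall-clock time, and 1 MB is assumed to mean 2^20 bytes.

```c
#include <stddef.h>
#include <time.h>

/* Hypothetical callback type: processes `size` bytes of `buf` in place. */
typedef void (*SketchWorkFunc)(unsigned char *buf, size_t size);

/* Trivial stand-in workload; a real benchmark would call an AES routine here. */
static void Sketch_XorPass(unsigned char *buf, size_t size)
{
  size_t i;
  for (i = 0; i < size; i++)
    buf[i] ^= 0xFF;
}

/* Converts a byte count and elapsed seconds to MB/s (assumes 1 MB = 2^20 bytes). */
static double Sketch_MBps(size_t bytes, double seconds)
{
  return ((double)bytes / (1024.0 * 1024.0)) / seconds;
}

/* Runs `work` `runs` times and returns the shortest elapsed CPU time in
   seconds, or -1.0 when `runs` is 0. Keeping the minimum filters out runs
   disturbed by scheduling or cache noise. */
static double Sketch_BestOf(SketchWorkFunc work, unsigned char *buf,
                            size_t size, unsigned runs)
{
  double best = -1.0;
  unsigned i;
  for (i = 0; i < runs; i++)
  {
    clock_t start = clock();
    double elapsed;
    work(buf, size);
    elapsed = (double)(clock() - start) / (double)CLOCKS_PER_SEC;
    if (best < 0.0 || elapsed < best)
      best = elapsed;
  }
  return best;
}
```

With a 16 MB buffer, `Sketch_MBps(size, Sketch_BestOf(work, buf, size, 8))` would give one table cell. For very small buffers the best time can round to zero, so the division should be guarded in real use.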

HW vs SW speedup

| Key size | Op | Power9 LE | Power10 LE | Power9 BE |
|---|---|---|---|---|
| AES-128 | CBC enc | 3.7× | 4.3× | 3.6× |
| AES-128 | CTR | 6.8× | 3.0× | 5.4× |
| AES-192 | CBC enc | 3.6× | 3.9× | 3.5× |
| AES-192 | CTR | 6.6× | 2.9× | 5.3× |
| AES-256 | CBC enc | 3.8× | 4.1× | 3.5× |
| AES-256 | CTR | 7.0× | 3.2× | 5.5× |
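
Each speedup figure is the HW throughput divided by the SW throughput for the same operation and host, rounded to one decimal place. As a worked example, taking the AES-128 CBC encode numbers for Power9 LE from the throughput table:

```math
\text{speedup} = \frac{\text{HW throughput}}{\text{SW throughput}}, \qquad
\frac{765.0}{204.3} \approx 3.74 \;\Rightarrow\; 3.7\times
```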

CBC decode HW has no like-for-like SW figure on PPC: when the HW path is active, Aes_SetKey_Dec emits the encryption key schedule (see Aes.h), so a scalar decode run would need a separate schedule with IMC (InvMixColumns) applied.

Host details

| Host | CPU | Endian | VEC_CRYPTO | ARCH_3_00 | ARCH_3_1 |
|---|---|---|---|---|---|
| Power9 LE | POWER9 | LE | 1 | 1 | 0 |
| Power10 LE | POWER10 | LE | 1 | 1 | 1 |
| Power9 BE | POWER9 | BE | 1 | 1 | 0 |

I hope this is a good starting point for future optimizations (such as CRC, SHA, etc.).

Notes

  • Tested on Fedora Linux (ppc64le) and Debian Forks (ppc64be).
  • On hosts without the feature (e.g. POWER7), the dispatch leaves the scalar function pointers in place, so there is no behavioural change.
  • FYI: https://github.com/IBM/actionspz offers free GitHub Actions on POWER-based runners. It would be great if we could apply and get it set up properly.
  • The work was verified on real POWER9 and POWER10 hardware.

Credits

  • Timothy P. of Raptor Computing Systems for POWER9 hardware
  • Lance and Change from Oregon State University for assisting me with testing on real POWER10 hardware
