Silent miscompile of a software-emulated FP64 reduction at -O1/O2 (GPU without native FP64)
Summary
On an Intel GPU without native FP64 (Alder Lake-P iGPU, device 0xa7a0), where double is software-emulated by IGC, a oneapi::dpl::reduce over a small custom accumulator struct ({double, double, int}) returns a wrong result at -O2 while being correct at -O0 and correct on the OpenCL CPU device (which has native FP64). The reduced running_square_deviations field comes out 0 instead of 5.
This is a silent miscompilation — wrong data, no error, no crash.
Environment
|
|
| GPU |
Intel Alder Lake-P iGPU, device id 0xa7a0 (no native FP64) |
| Compiler |
Intel oneAPI DPC++/C++ Compiler 2026.0.0 |
| FP64 mode |
software emulation (OverrideDefaultFP64Settings=1 IGC_EnableDPEmulation=1) |
| OS |
Ubuntu 22.04.5 (WSL2), kernel 6.6 |
Reproduced on two independent stacks (so it is not fixed on current IGC):
- IGC 2.11.12 / compute-runtime (NEO) 25.18.33578 / Level-Zero GPU 1.6.33578
- IGC 2.34.4 / compute-runtime (NEO) 26.18.38308 / Level-Zero GPU 1.15.38308
Reproducer (self-contained, ~55 lines, SYCL + oneDPL)
#include <sycl/sycl.hpp>
#include <oneapi/dpl/execution>
#include <oneapi/dpl/numeric>
#include <oneapi/dpl/iterator>
#include <cstdio>
using size_type = int;
struct var_std { // numerically-stable parallel variance accumulator
double running_sum;
double running_square_deviations;
size_type count;
var_std(double t = 0, double s = 0, size_type n = 0)
: running_sum(t), running_square_deviations(s), count(n) {}
var_std operator+(var_std const& rhs) const {
auto m = count, n = rhs.count;
if (m == 0) return rhs;
if (n == 0) return *this;
double tm = running_sum, tn = rhs.running_sum;
double sm = running_square_deviations, sn = rhs.running_square_deviations;
double tmn = tm + tn;
double diff = ((double)n / m) * tm - tn;
double smn = sm + sn + ((double)m / n) / (m + n) * diff * diff;
return {tmn, smn, m + n};
}
};
struct to_var_std { var_std operator()(double const& v) const { return {v, 0.0, 1}; } };
struct plus_op { template <class T> T operator()(T const& a, T const& b) const { return a + b; } };
int main(int argc, char** argv) {
int N = argc > 1 ? atoi(argv[1]) : 4; // {1,2,3,4}
sycl::queue q{sycl::default_selector_v}; // honors ONEAPI_DEVICE_SELECTOR
std::printf("device: %s\n", q.get_device().get_info<sycl::info::device::name>().c_str());
auto counting = oneapi::dpl::counting_iterator<size_type>(0);
auto values = oneapi::dpl::make_transform_iterator(
counting, [](size_type i) { return (double)(i + 1); });
auto acc_it = oneapi::dpl::make_transform_iterator(values, to_var_std{});
var_std r = oneapi::dpl::reduce(
oneapi::dpl::execution::make_device_policy(q),
acc_it, acc_it + N, var_std{0, 0, 0}, plus_op{});
std::printf("sqdev = %g (expected 5)%s\n",
r.running_square_deviations,
r.running_square_deviations == 0 ? " <-- MISCOMPILE" : "");
return r.running_square_deviations == 0 ? 1 : 0;
}
Build / run
# FP64 software emulation must be enabled on an FP64-less iGPU:
export NEOReadDebugKeys=1 OverrideDefaultFP64Settings=1 IGC_EnableDPEmulation=1
# Cold compiler caches so IGC actually recompiles each run:
export NEO_CACHE_PERSISTENT=0 SYCL_CACHE_PERSISTENT=0
FLAGS="-fsycl -fsycl-default-sub-group-size=32 -std=gnu++17"
icpx $FLAGS -O0 repro.cpp -o repro_O0 && ./repro_O0
icpx $FLAGS -O1 repro.cpp -o repro_O1 && ./repro_O1
icpx $FLAGS -O2 repro.cpp -o repro_O2 && ./repro_O2
ONEAPI_DEVICE_SELECTOR=opencl:cpu ./repro_O2
Observed (IGC 2.11.12 / compute-runtime (NEO) 25.18.33578 / Level-Zero GPU 1.6.33578)
device: Intel(R) Graphics [0xa7a0]
GPU -O0: sqdev = 5 (expected 5)
GPU -O1: sqdev = 0 (expected 5) <-- MISCOMPILE
GPU -O2: sqdev = 0 (expected 5) <-- MISCOMPILE
CPU -O2: sqdev = 5 (expected 5)
Observed (IGC 2.34.4 / compute-runtime (NEO) 26.18.38308 / Level-Zero GPU 1.15.38308)
device: Intel(R) Graphics [0xa7a0]
GPU -O0: sqdev = 5 (expected 5)
GPU -O1: sqdev = -nan (expected 5) <-- MISCOMPILE
GPU -O2: sqdev = -nan (expected 5) <-- MISCOMPILE
CPU -O2: sqdev = 5 (expected 5)
Triage already done
- Correct at -O0, wrong at -O2 on the GPU; the wrong field is exactly
0.
- Same source is correct on the OpenCL CPU device at every optimization level → the device program is well-defined; the defect is in the GPU FP64-emulation + optimizer path.
- It is the FP64 emulation specifically: changing
double → float (native) makes -O2 correct; an accumulator that uses only a single emulated + is also correct. Only the emulation-heavy merge (several emulated double divides/muls) is miscompiled.
- From
IGC_ShaderDumpEnableAll dumps: the double ops survive the entire middle-end intact and are lowered to integer software emulation at CG_after_PreCompiledFuncImport. The wrong value appears after that, in the CodeGen memory-optimization stage:
IGC_DisableMemOpt=1 changes the result from 0 to -nan — i.e. the running_square_deviations field value never reaches the output store; MemOpt's load/store forwarding turns that uninitialized read into 0.
- Disabling individual passes does not restore correctness:
DisableCustomUnsafeOpt, DisableFastMathConstantHandling, OCLEnableReassociate=0, DisableMatchMad, DisableLoopUnroll, EarlyCSE off, SelectiveLoopUnrollForDPEmu=0, EmulationFunctionControl / InlinedEmulationThreshold. Only -O0 does.
So the trigger appears to be IGC's -O2 CodeGen memory optimization mishandling the store/load of an emulated-FP64 struct field produced by PreCompiledFuncImport.
Silent miscompile of a software-emulated FP64 reduction at -O1/O2 (GPU without native FP64)
Summary
On an Intel GPU without native FP64 (Alder Lake-P iGPU, device
0xa7a0), wheredoubleis software-emulated by IGC, aoneapi::dpl::reduceover a small custom accumulator struct ({double, double, int}) returns a wrong result at -O2 while being correct at -O0 and correct on the OpenCL CPU device (which has native FP64). The reducedrunning_square_deviationsfield comes out0instead of5.This is a silent miscompilation — wrong data, no error, no crash.
Environment
0xa7a0(no native FP64)OverrideDefaultFP64Settings=1 IGC_EnableDPEmulation=1)Reproduced on two independent stacks (so it is not fixed on current IGC):
Reproducer (self-contained, ~55 lines, SYCL + oneDPL)
Build / run
Observed (IGC 2.11.12 / compute-runtime (NEO) 25.18.33578 / Level-Zero GPU 1.6.33578)
Observed (IGC 2.34.4 / compute-runtime (NEO) 26.18.38308 / Level-Zero GPU 1.15.38308)
Triage already done
0.double→float(native) makes -O2 correct; an accumulator that uses only a single emulated+is also correct. Only the emulation-heavy merge (several emulateddoubledivides/muls) is miscompiled.IGC_ShaderDumpEnableAlldumps: thedoubleops survive the entire middle-end intact and are lowered to integer software emulation atCG_after_PreCompiledFuncImport. The wrong value appears after that, in the CodeGen memory-optimization stage:IGC_DisableMemOpt=1changes the result from0to-nan— i.e. therunning_square_deviationsfield value never reaches the output store; MemOpt's load/store forwarding turns that uninitialized read into0.DisableCustomUnsafeOpt,DisableFastMathConstantHandling,OCLEnableReassociate=0,DisableMatchMad,DisableLoopUnroll, EarlyCSE off,SelectiveLoopUnrollForDPEmu=0,EmulationFunctionControl/InlinedEmulationThreshold. Only-O0does.So the trigger appears to be IGC's -O2 CodeGen memory optimization mishandling the store/load of an emulated-FP64 struct field produced by
PreCompiledFuncImport.