Silent miscompile of a software-emulated FP64 reduction at -O1/O2

# Silent miscompile of a software-emulated FP64 reduction at -O1/O2 (GPU without native FP64)

## Summary

On an Intel GPU **without native FP64** (Alder Lake-P iGPU, device `0xa7a0`), where `double` is software-emulated by IGC, a `oneapi::dpl::reduce` over a small custom accumulator struct (`{double, double, int}`) **returns a wrong result at -O2** while being **correct at -O0** and **correct on the OpenCL CPU device** (which has native FP64). The reduced `running_square_deviations` field comes out `0` instead of `5`.

This is a **silent miscompilation** — wrong data, no error, no crash.

## Environment

| | |
|---|---|
| GPU | Intel Alder Lake-P iGPU, device id `0xa7a0` (**no native FP64**) |
| Compiler | Intel oneAPI DPC++/C++ Compiler 2026.0.0 |
| FP64 mode | software emulation (`OverrideDefaultFP64Settings=1 IGC_EnableDPEmulation=1`) |
| OS | Ubuntu 22.04.5 (WSL2), kernel 6.6 |

Reproduced on two independent stacks (so it is **not** fixed on current IGC):
- **IGC 2.11.12** / compute-runtime (NEO) 25.18.33578 / Level-Zero GPU 1.6.33578
- **IGC 2.34.4** / compute-runtime (NEO) 26.18.38308 / Level-Zero GPU 1.15.38308

## Reproducer (self-contained, ~55 lines, SYCL + oneDPL)

```cpp
#include <sycl/sycl.hpp>
#include <oneapi/dpl/execution>
#include <oneapi/dpl/numeric>
#include <oneapi/dpl/iterator>
#include <cstdio>
using size_type = int;
struct var_std {                 // numerically-stable parallel variance accumulator
  double running_sum;
  double running_square_deviations;
  size_type count;
  var_std(double t = 0, double s = 0, size_type n = 0)
      : running_sum(t), running_square_deviations(s), count(n) {}
  var_std operator+(var_std const& rhs) const {
    auto m = count, n = rhs.count;
    if (m == 0) return rhs;
    if (n == 0) return *this;
    double tm = running_sum, tn = rhs.running_sum;
    double sm = running_square_deviations, sn = rhs.running_square_deviations;
    double tmn  = tm + tn;
    double diff = ((double)n / m) * tm - tn;
    double smn  = sm + sn + ((double)m / n) / (m + n) * diff * diff;
    return {tmn, smn, m + n};
  }
};
struct to_var_std { var_std operator()(double const& v) const { return {v, 0.0, 1}; } };
struct plus_op   { template <class T> T operator()(T const& a, T const& b) const { return a + b; } };

int main(int argc, char** argv) {
  int N = argc > 1 ? atoi(argv[1]) : 4;                       // {1,2,3,4}
  sycl::queue q{sycl::default_selector_v};                    // honors ONEAPI_DEVICE_SELECTOR
  std::printf("device: %s\n", q.get_device().get_info<sycl::info::device::name>().c_str());
  auto counting = oneapi::dpl::counting_iterator<size_type>(0);
  auto values   = oneapi::dpl::make_transform_iterator(
      counting, [](size_type i) { return (double)(i + 1); });
  auto acc_it   = oneapi::dpl::make_transform_iterator(values, to_var_std{});
  var_std r = oneapi::dpl::reduce(
      oneapi::dpl::execution::make_device_policy(q),
      acc_it, acc_it + N, var_std{0, 0, 0}, plus_op{});
  std::printf("sqdev = %g   (expected 5)%s\n",
              r.running_square_deviations,
              r.running_square_deviations == 0 ? "   <-- MISCOMPILE" : "");
  return r.running_square_deviations == 0 ? 1 : 0;
}
```

## Build / run

```bash
# FP64 software emulation must be enabled on an FP64-less iGPU:
export NEOReadDebugKeys=1 OverrideDefaultFP64Settings=1 IGC_EnableDPEmulation=1
# Cold compiler caches so IGC actually recompiles each run:
export NEO_CACHE_PERSISTENT=0 SYCL_CACHE_PERSISTENT=0
FLAGS="-fsycl -fsycl-default-sub-group-size=32 -std=gnu++17"

icpx $FLAGS -O0 repro.cpp -o repro_O0 && ./repro_O0
icpx $FLAGS -O1 repro.cpp -o repro_O1 && ./repro_O1
icpx $FLAGS -O2 repro.cpp -o repro_O2 && ./repro_O2
ONEAPI_DEVICE_SELECTOR=opencl:cpu ./repro_O2
```

Observed (IGC 2.11.12 / compute-runtime (NEO) 25.18.33578 / Level-Zero GPU 1.6.33578)

```
device: Intel(R) Graphics [0xa7a0]
GPU -O0:  sqdev = 5   (expected 5)
GPU -O1:  sqdev = 0   (expected 5)   <-- MISCOMPILE
GPU -O2:  sqdev = 0   (expected 5)   <-- MISCOMPILE
CPU -O2:  sqdev = 5   (expected 5)
```

Observed (IGC 2.34.4 / compute-runtime (NEO) 26.18.38308 / Level-Zero GPU 1.15.38308)

```
device: Intel(R) Graphics [0xa7a0]
GPU -O0:  sqdev = 5   (expected 5)
GPU -O1:  sqdev = -nan   (expected 5)   <-- MISCOMPILE
GPU -O2:  sqdev = -nan   (expected 5)   <-- MISCOMPILE
CPU -O2:  sqdev = 5   (expected 5)
```

## Triage already done

- **Correct at -O0, wrong at -O2** on the GPU; the wrong field is exactly `0`.
- **Same source  is correct on the OpenCL CPU device** at every optimization level → the device program is well-defined; the defect is in the GPU **FP64-emulation + optimizer** path.
- **It is the FP64 emulation specifically**: changing `double` → `float` (native) makes -O2 correct; an accumulator that uses only a single emulated `+` is also correct. Only the emulation-heavy merge (several emulated `double` divides/muls) is miscompiled.
- From `IGC_ShaderDumpEnableAll` dumps: the `double` ops survive the entire middle-end intact and are lowered to integer software emulation at `CG_after_PreCompiledFuncImport`. The wrong value appears **after** that, in the CodeGen memory-optimization stage:
  - `IGC_DisableMemOpt=1` changes the result from `0` to **`-nan`** — i.e. the `running_square_deviations` field value never reaches the output store; MemOpt's load/store forwarding turns that uninitialized read into `0`.
  - Disabling individual passes does **not** restore correctness: `DisableCustomUnsafeOpt`, `DisableFastMathConstantHandling`, `OCLEnableReassociate=0`, `DisableMatchMad`, `DisableLoopUnroll`, EarlyCSE off, `SelectiveLoopUnrollForDPEmu=0`, `EmulationFunctionControl` / `InlinedEmulationThreshold`. Only `-O0` does.

So the trigger appears to be IGC's **-O2 CodeGen memory optimization mishandling the store/load of an emulated-FP64 struct field** produced by `PreCompiledFuncImport`.



Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Silent miscompile of a software-emulated FP64 reduction at -O1/O2 #413

Silent miscompile of a software-emulated FP64 reduction at -O1/O2 (GPU without native FP64)

Summary

Environment

Reproducer (self-contained, ~55 lines, SYCL + oneDPL)

Build / run

Triage already done

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development


GPU	Intel Alder Lake-P iGPU, device id `0xa7a0` (no native FP64)
Compiler	Intel oneAPI DPC++/C++ Compiler 2026.0.0
FP64 mode	software emulation (`OverrideDefaultFP64Settings=1 IGC_EnableDPEmulation=1`)
OS	Ubuntu 22.04.5 (WSL2), kernel 6.6

Silent miscompile of a software-emulated FP64 reduction at -O1/O2 #413

Description

Silent miscompile of a software-emulated FP64 reduction at -O1/O2 (GPU without native FP64)

Summary

Environment

Reproducer (self-contained, ~55 lines, SYCL + oneDPL)

Build / run

Triage already done

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions