adding atomic support with atomix #299

Merged: vchuravy merged 4 commits into JuliaGPU:master from leios:atomic_attempts_2 on May 31, 2022

leios commented May 25, 2022

Contributor

After some discussions on #282, we decided to use Atomix for atomic support in KA.

A few quick questions:

  1. Because Base and CUDA both have an @atomic macro, code that needs atomic operations has to specify that it is using the Atomix.@atomic macro. Should we overdub any @atomic macros in KA to specifically use Atomix?
  2. Should we add the tests from Atomic attempts #282?
  3. What about atomic primitives like atomic_add!(...) and atomic_sub!(...) from Atomic attempts #282? These come from either CUDA or Core.Intrinsics. Maybe it's a good idea to use Atomix on top of Atomic attempts #282? To be honest, I don't know how many people will use the primitives over the macro.

Note, this should not be merged until JuliaRegistries/General#61002 is automerged.

tkf left a comment

Contributor

KernelAbstractions.jl doesn't have to depend on UnsafeAtomicsLLVM.jl (and LLVM.jl)

Comment thread Project.toml Outdated
Comment thread Project.toml Outdated
Co-authored-by: Takafumi Arakaki <takafumi.a@gmail.com>

vchuravy left a comment

Member

Looks great!

Probably needs docs as well as AMDGPU support.

Comment thread lib/CUDAKernels/Project.toml Outdated
Comment thread src/KernelAbstractions.jl Outdated
Co-authored-by: Valentin Churavy <vchuravy@users.noreply.github.com>
Comment thread examples/histogram.jl Outdated
pxl-th commented May 31, 2022

Member

Hi!

Regarding question 3 ("What about atomic primitives like atomic_add!(...)"): I have several kernels that use atomic_add! specifically because it returns the old value after adding. I'm not sure whether this is achievable with macros.

Also I'm curious if it will support things like:

@atomic max(x[i], v)
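For reference, here is a minimal sketch of what the macro and function forms return, using only Base (Julia 1.7 or later) rather than Atomix. It assumes that Atomix's @atomic on array elements mirrors the Base semantics shown here; `Counter` is a hypothetical type introduced only for illustration.

```julia
# Base-only sketch; `Counter` is a hypothetical type, not part of KA or Atomix.
mutable struct Counter
    @atomic value::Int
end

c = Counter(5)

# Op-assignment form: returns the *new* value.
new_value = @atomic c.value += 1

# Call form: returns the pair `old => new`, so the old value is recoverable.
old_val, new_val = @atomic max(c.value, 10)

# Function-style primitive on Threads.Atomic: returns the *old* value.
a = Threads.Atomic{Int}(5)
previous = Threads.atomic_add!(a, 1)
```

In Base, the call form `@atomic op(x, v)` covers the "give me the old value" use case, while the op-assignment form does not.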

leios changed the title from "atting atomic support with atomix" to "adding atomic support with atomix" on May 31, 2022
vchuravy marked this pull request as ready for review on May 31, 2022 18:25
Co-authored-by: Takafumi Arakaki <takafumi.a@gmail.com>
leios commented May 31, 2022

Contributor Author

I don't mind reworking this PR and #282 so we get both the macro and better ordering support from Atomix and also the atomic_... functions from either Core.Intrinsics or CUDA. I have a branch locally that basically does this and it works fine for my purposes.

I figure most people will want to use the macro, but some people will prefer the atomic_... functions, so why not just do both?

@vchuravy

Member

Let's merge this for now and then you can open a second PR?

vchuravy merged commit 6374613 into JuliaGPU:master on May 31, 2022
leios commented May 31, 2022

Contributor Author

This one is not ready to be merged

@vchuravy

Member

Oops. I got excited that it passed tests :)

leios commented May 31, 2022

Contributor Author

It was missing docs and tests, at least... I will add them when I get the chance. To be fair, Atomix should have all the necessary tests; I just wanted to double-check here. The documentation does not need to be long, but a section on atomics with an example would go a long way.
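As a rough idea of what such an example could look like, here is a CPU-only histogram sketch using Base's Threads.Atomic counters. It is not the examples/histogram.jl file from this PR and does not use KernelAbstractions or Atomix; `threaded_histogram` is a hypothetical helper, and it assumes every sample is a valid bin index in 1:nbins.

```julia
# Hypothetical helper: count how often each bin index occurs in `samples`.
# Assumes every sample lies in 1:nbins.
function threaded_histogram(samples::Vector{Int}, nbins::Int)
    # One atomic counter per bin, so concurrent increments are not lost.
    bins = [Threads.Atomic{Int}(0) for _ in 1:nbins]
    Threads.@threads for i in eachindex(samples)
        Threads.atomic_add!(bins[samples[i]], 1)
    end
    # Unwrap the atomics into a plain vector of counts.
    return [b[] for b in bins]
end

samples = [1, 2, 2, 3, 3, 3]
counts = threaded_histogram(samples, 3)
```

Without the atomic increment, two threads hitting the same bin could both read the same count and one update would be lost.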

leios commented May 31, 2022

Contributor Author

I was just waiting to add docs until we settled the atomic "primitive" discussion.

claforte commented May 31, 2022

Thanks a lot, @leios! @pxl-th and a few others on my team are very much looking forward to this PR being merged for our Instant NeRF (3D reconstruction) Julia implementation. If you'd like a sneak preview, let me know and I can invite you to our private Discord and GitHub. :-)

pxl-th commented Jun 4, 2022

Member

I've tried this PR, and on CPU it appears to support only integer types.
On GPU, I get unsupported dynamic function invocation (call to modify!) for any type.
I'm on Julia 1.8.0-rc1, but the same errors occur on 1.7.2.

Error
ERROR: LoadError: InvalidIRError: compiling kernel #gpu_splat!(KernelAbstractions.CompilerMetadata{KernelAbstractions.NDIteration.DynamicSize, KernelAbstractions.NDIteration.DynamicCheck, Nothing, CartesianIndices{1, Tuple{Base.OneTo{Int64}}}, KernelAbstractions.NDIteration.NDRange{1, KernelAbstractions.NDIteration.DynamicSize, KernelAbstractions.NDIteration.StaticSize{(512,)}, CartesianIndices{1, Tuple{Base.OneTo{Int64}}}, Nothing}}, CuDeviceVector{Int64, 1}, CuDeviceVector{UInt32, 1}, CuDeviceVector{Int64, 1}) resulted in invalid LLVM IR
Reason: unsupported dynamic function invocation (call to modify!)
Stacktrace:
 [1] modify!
   @ ~/.julia/packages/Atomix/F9VIX/src/core.jl:33
 [2] macro expansion
   @ ~/code/a.jl:28
 [3] gpu_splat!
   @ ~/.julia/packages/KernelAbstractions/I3MhZ/src/macros.jl:80
 [4] gpu_splat!
   @ ./none:0
Hint: catch this exception as `err` and call `code_typed(err; interactive = true)` to introspect the erronous code
Stacktrace:
  [1] check_ir(job::GPUCompiler.CompilerJob{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams, GPUCompiler.FunctionSpec{typeof(gpu_splat!), Tuple{KernelAbstractions.CompilerMetadata{KernelAbstractions.NDIteration.DynamicSize, KernelAbstractions.NDIteration.DynamicCheck, Nothing, CartesianIndices{1, Tuple{Base.OneTo{Int64}}}, KernelAbstractions.NDIteration.NDRange{1, KernelAbstractions.NDIteration.DynamicSize, KernelAbstractions.NDIteration.StaticSize{(512,)}, CartesianIndices{1, Tuple{Base.OneTo{Int64}}}, Nothing}}, CuDeviceVector{Int64, 1}, CuDeviceVector{UInt32, 1}, CuDeviceVector{Int64, 1}}}}, args::LLVM.Module)
    @ GPUCompiler ~/.julia/packages/GPUCompiler/XyxTy/src/validation.jl:139
  [2] macro expansion
    @ ~/.julia/packages/GPUCompiler/XyxTy/src/driver.jl:409 [inlined]
  [3] macro expansion
    @ ~/.julia/packages/TimerOutputs/LDL7n/src/TimerOutput.jl:252 [inlined]
  [4] macro expansion
    @ ~/.julia/packages/GPUCompiler/XyxTy/src/driver.jl:407 [inlined]
  [5] emit_asm(job::GPUCompiler.CompilerJob, ir::LLVM.Module; strip::Bool, validate::Bool, format::LLVM.API.LLVMCodeGenFileType)
    @ GPUCompiler ~/.julia/packages/GPUCompiler/XyxTy/src/utils.jl:64
  [6] cufunction_compile(job::GPUCompiler.CompilerJob, ctx::LLVM.Context)
    @ CUDA ~/.julia/packages/CUDA/GGwVa/src/compiler/execution.jl:354
  [7] #224
    @ ~/.julia/packages/CUDA/GGwVa/src/compiler/execution.jl:347 [inlined]
  [8] JuliaContext(f::CUDA.var"#224#225"{GPUCompiler.CompilerJob{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams, GPUCompiler.FunctionSpec{typeof(gpu_splat!), Tuple{KernelAbstractions.CompilerMetadata{KernelAbstractions.NDIteration.DynamicSize, KernelAbstractions.NDIteration.DynamicCheck, Nothing, CartesianIndices{1, Tuple{Base.OneTo{Int64}}}, KernelAbstractions.NDIteration.NDRange{1, KernelAbstractions.NDIteration.DynamicSize, KernelAbstractions.NDIteration.StaticSize{(512,)}, CartesianIndices{1, Tuple{Base.OneTo{Int64}}}, Nothing}}, CuDeviceVector{Int64, 1}, CuDeviceVector{UInt32, 1}, CuDeviceVector{Int64, 1}}}}})
    @ GPUCompiler ~/.julia/packages/GPUCompiler/XyxTy/src/driver.jl:74
  [9] cufunction_compile(job::GPUCompiler.CompilerJob)
    @ CUDA ~/.julia/packages/CUDA/GGwVa/src/compiler/execution.jl:346
 [10] cached_compilation(cache::Dict{UInt64, Any}, job::GPUCompiler.CompilerJob, compiler::typeof(CUDA.cufunction_compile), linker::typeof(CUDA.cufunction_link))
    @ GPUCompiler ~/.julia/packages/GPUCompiler/XyxTy/src/cache.jl:90
 [11] cufunction(f::typeof(gpu_splat!), tt::Type{Tuple{KernelAbstractions.CompilerMetadata{KernelAbstractions.NDIteration.DynamicSize, KernelAbstractions.NDIteration.DynamicCheck, Nothing, CartesianIndices{1, Tuple{Base.OneTo{Int64}}}, KernelAbstractions.NDIteration.NDRange{1, KernelAbstractions.NDIteration.DynamicSize, KernelAbstractions.NDIteration.StaticSize{(512,)}, CartesianIndices{1, Tuple{Base.OneTo{Int64}}}, Nothing}}, CuDeviceVector{Int64, 1}, CuDeviceVector{UInt32, 1}, CuDeviceVector{Int64, 1}}}; name::Nothing, kwargs::Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})
    @ CUDA ~/.julia/packages/CUDA/GGwVa/src/compiler/execution.jl:299
 [12] cufunction(f::typeof(gpu_splat!), tt::Type{Tuple{KernelAbstractions.CompilerMetadata{KernelAbstractions.NDIteration.DynamicSize, KernelAbstractions.NDIteration.DynamicCheck, Nothing, CartesianIndices{1, Tuple{Base.OneTo{Int64}}}, KernelAbstractions.NDIteration.NDRange{1, KernelAbstractions.NDIteration.DynamicSize, KernelAbstractions.NDIteration.StaticSize{(512,)}, CartesianIndices{1, Tuple{Base.OneTo{Int64}}}, Nothing}}, CuDeviceVector{Int64, 1}, CuDeviceVector{UInt32, 1}, CuDeviceVector{Int64, 1}}})
    @ CUDA ~/.julia/packages/CUDA/GGwVa/src/compiler/execution.jl:293
 [13] macro expansion
    @ ~/.julia/packages/CUDA/GGwVa/src/compiler/execution.jl:102 [inlined]
 [14] (::KernelAbstractions.Kernel{CUDADevice, KernelAbstractions.NDIteration.StaticSize{(512,)}, KernelAbstractions.NDIteration.DynamicSize, typeof(gpu_splat!)})(::CuArray{Int64, 1, CUDA.Mem.DeviceBuffer}, ::Vararg{Any}; ndrange::Int64, dependencies::CUDAKernels.CudaEvent, workgroupsize::Nothing, progress::Function)
    @ CUDAKernels ~/.julia/packages/CUDAKernels/4VLF4/src/CUDAKernels.jl:272
 [15] main()
    @ Main ~/code/a.jl:40
 [16] top-level scope
    @ ~/code/a.jl:42
in expression starting at /home/pxl-th/code/a.jl:42

MWE:

Code
using CUDA
using CUDAKernels
using KernelAbstractions
using KernelAbstractions: @atomic

CUDA.allowscalar(false)

n_threads(::CPU) = Threads.nthreads()
n_threads(::CUDADevice) = 512

Base.rand(::CPU, T, shape) = rand(T, shape)
Base.rand(::CUDADevice, T, shape) = CUDA.rand(T, shape)

to_device(::CPU, x) = copy(x)
to_device(::CUDADevice, x) = CuArray(x)

@kernel function splat!(grid, @Const(indices), @Const(mlp_out))
    i = @index(Global)
    idx = indices[i]
    @atomic max(grid[idx], mlp_out[i])
end

function main()
    #device = CPU()
    device = CUDADevice()

    n = 16
    indices = to_device(device, UInt32.(collect(1:n)))
    mlp_out = rand(device, Int64, n) # errors on CPU with Float32
    grid = rand(device, Int64, n) # errors on CPU with Float32

    wait(splat!(device, n_threads(device), n)(grid, indices, mlp_out))
end
main()

leios commented Jun 4, 2022

Contributor Author

I was unable to replicate this error by running the provided code with 1.7.1 and 1.8.0-beta3 (just pulled from git). What OS are you using? Also, could you show the outputs of ] st?

pxl-th commented Jun 4, 2022

Member

I'm on Ubuntu 22.04
CPU: AMD Ryzen 7 5800HS
GPU: NVIDIA GeForce RTX 3060

]st:

(@v1.8) pkg> st
Status `~/.julia/environments/v1.8/Project.toml`
  [a9b6321e] Atomix v0.1.0
  [052768ef] CUDA v3.10.1
  [72cfdca4] CUDAKernels v0.4.1
  [5789e2e9] FileIO v1.14.0
  [a09fc81d] ImageCore v0.9.3
  [82e4d734] ImageIO v0.6.5
  [02fcd773] ImageTransformations v0.9.4
  [b835a17e] JpegTurbo v0.1.1
  [63c18a36] KernelAbstractions v0.8.1 `https://github.com/JuliaGPU/KernelAbstractions.jl.git#master`

pxl-th commented Jun 4, 2022

Member

I've just updated the MWE code; the earlier version contained code that does not error :)
You can also change the eltypes of grid and mlp_out to Float32 to see that it does not work with them.
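When a backend lacks a native atomic max for a given element type, one common fallback is a compare-and-swap loop. The sketch below illustrates that idea with Base's @atomicreplace on a struct field (Julia 1.7 or later). It is only an illustration of the technique, not what Atomix or UnsafeAtomicsLLVM does internally; `Cell` and `atomic_max_cas!` are hypothetical names.

```julia
# Hypothetical container with a single atomically accessed Float32 field.
mutable struct Cell
    @atomic value::Float32
end

# CAS-loop max: stores `v` if it is larger than the current value.
# Returns the last value observed before any update.
function atomic_max_cas!(cell::Cell, v::Float32)
    old = @atomic cell.value
    while v > old
        # Try to swap `old` for `v`; on failure, `old` is refreshed with the
        # value another thread stored in the meantime, and the loop re-checks.
        (old, success) = @atomicreplace cell.value old => v
        success && break
    end
    return old
end

cell = Cell(1.0f0)
first_old = atomic_max_cas!(cell, 3.0f0)   # updates the cell
second_old = atomic_max_cas!(cell, 2.0f0)  # leaves the cell unchanged
```

The loop retries only while the candidate is still larger than the stored value, so a losing thread exits as soon as someone else has stored something at least as large.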

leios commented Jun 4, 2022

Contributor Author

Right, I see the comments now, sorry!

try ]add CUDAKernels#master?

pxl-th commented Jun 4, 2022

Member

Yes, that works, thanks!

Although there is another issue, which is not critical for me, but might be worth mentioning:

MWE:

Code
using CUDA
using CUDAKernels
using KernelAbstractions
using KernelAbstractions: @atomic

CUDA.allowscalar(false)

const NERF_STEPS = UInt32(1024)
const MIN_CONE_STEPSIZE = 3f0 / NERF_STEPS

n_threads(::CPU) = Threads.nthreads()
n_threads(::CUDADevice) = 512

Base.rand(::CPU, T, shape) = rand(T, shape)
Base.rand(::CUDADevice, T, shape) = CUDA.rand(T, shape)

Base.zeros(::CPU, T, shape) = zeros(T, shape)
Base.zeros(::CUDADevice, T, shape) = CUDA.zeros(T, shape)

to_device(::CPU, x) = copy(x)
to_device(::CUDADevice, x) = CuArray(x)

@inline density_activation(x) = exp(x)

@kernel function splat!(grid, @Const(indices), @Const(mlp_out))
    i = @index(Global)
    idx = indices[i]
    old, new = @atomic max(grid[idx], mlp_out[i])
    @atomic grid[idx] = old
end

function main()
    # device = CPU()
    device = CUDADevice()

    n = 16
    indices = to_device(device, UInt32.(collect(1:n)))
    mlp_out = rand(device, Int64, n)
    grid = zeros(device, Int64, n)

    wait(splat!(device, n_threads(device), n)(grid, indices, mlp_out))
end
main()
Error
ERROR: LoadError: LLVM error: Cannot select: 0x77adc60: ch = AtomicStore<(store seq_cst (s64) into %ir.41, addrspace 1)> 0x4926e08:1, 0x6f629c8, 0x4926e08, /home/pxl-th/.julia/packages/LLVM/YSJ2s/src/interop/base.jl:40 @[ /home/pxl-th/.julia/packages/LLVM/YSJ2s/src/interop/atomics.jl:245 @[ /home/pxl-th/.julia/packages/LLVM/YSJ2s/src/interop/atomics.jl:201 @[ /home/pxl-th/.julia/packages/UnsafeAtomicsLLVM/i4GMj/src/internal.jl:11 @[ /home/pxl-th/.julia/packages/Atomix/F9VIX/src/core.jl:14 @[ /home/pxl-th/code/a.jl:29 @[ /home/pxl-th/.julia/packages/KernelAbstractions/I3MhZ/src/macros.jl:80 @[ none:0 ] ] ] ] ] ] ]
  0x6f629c8: i64 = add 0x6f626f0, Constant:i64<-8>, /home/pxl-th/.julia/packages/LLVM/YSJ2s/src/interop/pointer.jl:110 @[ /home/pxl-th/.julia/packages/CUDA/GGwVa/src/device/array.jl:80 @[ /home/pxl-th/.julia/packages/Atomix/F9VIX/src/references.jl:99 @[ /home/pxl-th/.julia/packages/Atomix/F9VIX/src/core.jl:30 @[ /home/pxl-th/code/a.jl:28 @[ /home/pxl-th/.julia/packages/KernelAbstractions/I3MhZ/src/macros.jl:80 @[ none:0 ] ] ] ] ] ]
    0x6f626f0: i64 = add 0x6f62ea8, 0x7195f28, /home/pxl-th/.julia/packages/LLVM/YSJ2s/src/interop/pointer.jl:110 @[ /home/pxl-th/.julia/packages/CUDA/GGwVa/src/device/array.jl:80 @[ /home/pxl-th/.julia/packages/Atomix/F9VIX/src/references.jl:99 @[ /home/pxl-th/.julia/packages/Atomix/F9VIX/src/core.jl:30 @[ /home/pxl-th/code/a.jl:28 @[ /home/pxl-th/.julia/packages/KernelAbstractions/I3MhZ/src/macros.jl:80 @[ none:0 ] ] ] ] ] ]
      0x6f62ea8: i64,ch = CopyFromReg 0x65cc278, Register:i64 %0, /home/pxl-th/.julia/packages/LLVM/YSJ2s/src/interop/pointer.jl:110 @[ /home/pxl-th/.julia/packages/CUDA/GGwVa/src/device/array.jl:80 @[ /home/pxl-th/.julia/packages/Atomix/F9VIX/src/references.jl:99 @[ /home/pxl-th/.julia/packages/Atomix/F9VIX/src/core.jl:30 @[ /home/pxl-th/code/a.jl:28 @[ /home/pxl-th/.julia/packages/KernelAbstractions/I3MhZ/src/macros.jl:80 @[ none:0 ] ] ] ] ] ]
        0x4926cd0: i64 = Register %0
      0x7195f28: i64 = shl nuw nsw 0x77ae5b8, Constant:i32<3>, int.jl:88 @[ abstractarray.jl:1189 @[ /home/pxl-th/.julia/packages/CUDA/GGwVa/src/device/array.jl:80 @[ /home/pxl-th/.julia/packages/Atomix/F9VIX/src/references.jl:99 @[ /home/pxl-th/.julia/packages/Atomix/F9VIX/src/core.jl:30 @[ /home/pxl-th/code/a.jl:28 @[ /home/pxl-th/.julia/packages/KernelAbstractions/I3MhZ/src/macros.jl:80 @[ none:0 ] ] ] ] ] ] ]
        0x77ae5b8: i64 = AssertZext 0x71ce208, ValueType:ch:i32, int.jl:88 @[ abstractarray.jl:1189 @[ /home/pxl-th/.julia/packages/CUDA/GGwVa/src/device/array.jl:80 @[ /home/pxl-th/.julia/packages/Atomix/F9VIX/src/references.jl:99 @[ /home/pxl-th/.julia/packages/Atomix/F9VIX/src/core.jl:30 @[ /home/pxl-th/code/a.jl:28 @[ /home/pxl-th/.julia/packages/KernelAbstractions/I3MhZ/src/macros.jl:80 @[ none:0 ] ] ] ] ] ] ]
          0x71ce208: i64,ch = CopyFromReg 0x65cc278, Register:i64 %9, int.jl:88 @[ abstractarray.jl:1189 @[ /home/pxl-th/.julia/packages/CUDA/GGwVa/src/device/array.jl:80 @[ /home/pxl-th/.julia/packages/Atomix/F9VIX/src/references.jl:99 @[ /home/pxl-th/.julia/packages/Atomix/F9VIX/src/core.jl:30 @[ /home/pxl-th/code/a.jl:28 @[ /home/pxl-th/.julia/packages/KernelAbstractions/I3MhZ/src/macros.jl:80 @[ none:0 ] ] ] ] ] ] ]
            0x66aca88: i64 = Register %9
        0x66ac0c8: i32 = Constant<3>
    0x4926720: i64 = Constant<-8>
  0x4926e08: i64,ch = AtomicLoadMax<(load store seq_cst (s64) on %ir.39, addrspace 1)> 0x7195c50:1, 0x6f629c8, 0x7195c50, /home/pxl-th/.julia/packages/LLVM/YSJ2s/src/interop/base.jl:40 @[ /home/pxl-th/.julia/packages/LLVM/YSJ2s/src/interop/atomics.jl:270 @[ /home/pxl-th/.julia/packages/LLVM/YSJ2s/src/interop/atomics.jl:270 @[ /home/pxl-th/.julia/packages/LLVM/YSJ2s/src/interop/atomics.jl:374 @[ /home/pxl-th/.julia/packages/UnsafeAtomicsLLVM/i4GMj/src/internal.jl:18 @[ /home/pxl-th/.julia/packages/Atomix/F9VIX/src/core.jl:33 @[ /home/pxl-th/code/a.jl:28 @[ /home/pxl-th/.julia/packages/KernelAbstractions/I3MhZ/src/macros.jl:80 @[ none:0 ] ] ] ] ] ] ] ]
    0x6f629c8: i64 = add 0x6f626f0, Constant:i64<-8>, /home/pxl-th/.julia/packages/LLVM/YSJ2s/src/interop/pointer.jl:110 @[ /home/pxl-th/.julia/packages/CUDA/GGwVa/src/device/array.jl:80 @[ /home/pxl-th/.julia/packages/Atomix/F9VIX/src/references.jl:99 @[ /home/pxl-th/.julia/packages/Atomix/F9VIX/src/core.jl:30 @[ /home/pxl-th/code/a.jl:28 @[ /home/pxl-th/.julia/packages/KernelAbstractions/I3MhZ/src/macros.jl:80 @[ none:0 ] ] ] ] ] ]
      0x6f626f0: i64 = add 0x6f62ea8, 0x7195f28, /home/pxl-th/.julia/packages/LLVM/YSJ2s/src/interop/pointer.jl:110 @[ /home/pxl-th/.julia/packages/CUDA/GGwVa/src/device/array.jl:80 @[ /home/pxl-th/.julia/packages/Atomix/F9VIX/src/references.jl:99 @[ /home/pxl-th/.julia/packages/Atomix/F9VIX/src/core.jl:30 @[ /home/pxl-th/code/a.jl:28 @[ /home/pxl-th/.julia/packages/KernelAbstractions/I3MhZ/src/macros.jl:80 @[ none:0 ] ] ] ] ] ]
        0x6f62ea8: i64,ch = CopyFromReg 0x65cc278, Register:i64 %0, /home/pxl-th/.julia/packages/LLVM/YSJ2s/src/interop/pointer.jl:110 @[ /home/pxl-th/.julia/packages/CUDA/GGwVa/src/device/array.jl:80 @[ /home/pxl-th/.julia/packages/Atomix/F9VIX/src/references.jl:99 @[ /home/pxl-th/.julia/packages/Atomix/F9VIX/src/core.jl:30 @[ /home/pxl-th/code/a.jl:28 @[ /home/pxl-th/.julia/packages/KernelAbstractions/I3MhZ/src/macros.jl:80 @[ none:0 ] ] ] ] ] ]
          0x4926cd0: i64 = Register %0
        0x7195f28: i64 = shl nuw nsw 0x77ae5b8, Constant:i32<3>, int.jl:88 @[ abstractarray.jl:1189 @[ /home/pxl-th/.julia/packages/CUDA/GGwVa/src/device/array.jl:80 @[ /home/pxl-th/.julia/packages/Atomix/F9VIX/src/references.jl:99 @[ /home/pxl-th/.julia/packages/Atomix/F9VIX/src/core.jl:30 @[ /home/pxl-th/code/a.jl:28 @[ /home/pxl-th/.julia/packages/KernelAbstractions/I3MhZ/src/macros.jl:80 @[ none:0 ] ] ] ] ] ] ]
          0x77ae5b8: i64 = AssertZext 0x71ce208, ValueType:ch:i32, int.jl:88 @[ abstractarray.jl:1189 @[ /home/pxl-th/.julia/packages/CUDA/GGwVa/src/device/array.jl:80 @[ /home/pxl-th/.julia/packages/Atomix/F9VIX/src/references.jl:99 @[ /home/pxl-th/.julia/packages/Atomix/F9VIX/src/core.jl:30 @[ /home/pxl-th/code/a.jl:28 @[ /home/pxl-th/.julia/packages/KernelAbstractions/I3MhZ/src/macros.jl:80 @[ none:0 ] ] ] ] ] ] ]
            0x71ce208: i64,ch = CopyFromReg 0x65cc278, Register:i64 %9, int.jl:88 @[ abstractarray.jl:1189 @[ /home/pxl-th/.julia/packages/CUDA/GGwVa/src/device/array.jl:80 @[ /home/pxl-th/.julia/packages/Atomix/F9VIX/src/references.jl:99 @[ /home/pxl-th/.julia/packages/Atomix/F9VIX/src/core.jl:30 @[ /home/pxl-th/code/a.jl:28 @[ /home/pxl-th/.julia/packages/KernelAbstractions/I3MhZ/src/macros.jl:80 @[ none:0 ] ] ] ] ] ] ]
              0x66aca88: i64 = Register %9
          0x66ac0c8: i32 = Constant<3>
      0x4926720: i64 = Constant<-8>
    0x7195c50: i64,ch = llvm.nvvm.ldg.global.i<(load (s64) from %ir.34, addrspace 1)> 0x65cc278, TargetConstant:i64<5104>, 0x6f62b68, Constant:i32<8>, /home/pxl-th/.julia/packages/LLVM/YSJ2s/src/interop/base.jl:40 @[ /home/pxl-th/.julia/packages/LLVM/YSJ2s/src/interop/pointer.jl:120 @[ /home/pxl-th/.julia/packages/LLVM/YSJ2s/src/interop/pointer.jl:120 @[ /home/pxl-th/.julia/packages/LLVM/YSJ2s/src/interop/pointer.jl:219 @[ /home/pxl-th/.julia/packages/CUDA/GGwVa/src/device/pointer.jl:40 @[ /home/pxl-th/.julia/packages/CUDA/GGwVa/src/device/pointer.jl:48 @[ /home/pxl-th/.julia/packages/CUDA/GGwVa/src/device/array.jl:184 @[ /home/pxl-th/.julia/packages/CUDA/GGwVa/src/device/array.jl:232 @[ /home/pxl-th/code/a.jl:28 @[ /home/pxl-th/.julia/packages/KernelAbstractions/I3MhZ/src/macros.jl:80 @[ none:0 ] ] ] ] ] ] ] ] ] ]
      0x49271b0: i64 = TargetConstant<5104>
      0x6f62b68: i64 = add 0x4926b98, Constant:i64<-8>, /home/pxl-th/.julia/packages/LLVM/YSJ2s/src/interop/pointer.jl:110 @[ /home/pxl-th/.julia/packages/CUDA/GGwVa/src/device/pointer.jl:39 @[ /home/pxl-th/.julia/packages/CUDA/GGwVa/src/device/pointer.jl:48 @[ /home/pxl-th/.julia/packages/CUDA/GGwVa/src/device/array.jl:184 @[ /home/pxl-th/.julia/packages/CUDA/GGwVa/src/device/array.jl:232 @[ /home/pxl-th/code/a.jl:28 @[ /home/pxl-th/.julia/packages/KernelAbstractions/I3MhZ/src/macros.jl:80 @[ none:0 ] ] ] ] ] ] ]
        0x4926b98: i64 = add 0x6f62278, 0x4926ed8, /home/pxl-th/.julia/packages/LLVM/YSJ2s/src/interop/pointer.jl:110 @[ /home/pxl-th/.julia/packages/CUDA/GGwVa/src/device/pointer.jl:39 @[ /home/pxl-th/.julia/packages/CUDA/GGwVa/src/device/pointer.jl:48 @[ /home/pxl-th/.julia/packages/CUDA/GGwVa/src/device/array.jl:184 @[ /home/pxl-th/.julia/packages/CUDA/GGwVa/src/device/array.jl:232 @[ /home/pxl-th/code/a.jl:28 @[ /home/pxl-th/.julia/packages/KernelAbstractions/I3MhZ/src/macros.jl:80 @[ none:0 ] ] ] ] ] ] ]
          0x6f62278: i64,ch = CopyFromReg 0x65cc278, Register:i64 %4, /home/pxl-th/.julia/packages/LLVM/YSJ2s/src/interop/pointer.jl:110 @[ /home/pxl-th/.julia/packages/CUDA/GGwVa/src/device/pointer.jl:39 @[ /home/pxl-th/.julia/packages/CUDA/GGwVa/src/device/pointer.jl:48 @[ /home/pxl-th/.julia/packages/CUDA/GGwVa/src/device/array.jl:184 @[ /home/pxl-th/.julia/packages/CUDA/GGwVa/src/device/array.jl:232 @[ /home/pxl-th/code/a.jl:28 @[ /home/pxl-th/.julia/packages/KernelAbstractions/I3MhZ/src/macros.jl:80 @[ none:0 ] ] ] ] ] ] ]
            0x7803a58: i64 = Register %4
          0x4926ed8: i64 = shl nuw nsw 0x71ce2d8, Constant:i32<3>, int.jl:88 @[ /home/pxl-th/.julia/packages/CUDA/GGwVa/src/device/pointer.jl:39 @[ /home/pxl-th/.julia/packages/CUDA/GGwVa/src/device/pointer.jl:48 @[ /home/pxl-th/.julia/packages/CUDA/GGwVa/src/device/array.jl:184 @[ /home/pxl-th/.julia/packages/CUDA/GGwVa/src/device/array.jl:232 @[ /home/pxl-th/code/a.jl:28 @[ /home/pxl-th/.julia/packages/KernelAbstractions/I3MhZ/src/macros.jl:80 @[ none:0 ] ] ] ] ] ] ]
            0x71ce2d8: i64,ch = CopyFromReg 0x65cc278, Register:i64 %8, int.jl:88 @[ /home/pxl-th/.julia/packages/CUDA/GGwVa/src/device/pointer.jl:39 @[ /home/pxl-th/.julia/packages/CUDA/GGwVa/src/device/pointer.jl:48 @[ /home/pxl-th/.julia/packages/CUDA/GGwVa/src/device/array.jl:184 @[ /home/pxl-th/.julia/packages/CUDA/GGwVa/src/device/array.jl:232 @[ /home/pxl-th/code/a.jl:28 @[ /home/pxl-th/.julia/packages/KernelAbstractions/I3MhZ/src/macros.jl:80 @[ none:0 ] ] ] ] ] ] ]
              0x77ae620: i64 = Register %8
            0x66ac0c8: i32 = Constant<3>
        0x4926720: i64 = Constant<-8>
      0x7196880: i32 = Constant<8>
In function: _Z21julia_gpu_splat__430516CompilerMetadataI10StaticSizeI5_16__E12DynamicCheckvv7NDRangeILi1ES0_I4_1__ES0_I6_512__EvvEE13CuDeviceArrayI5Int64Li1ELi1EES3_I6UInt32Li1ELi1EES3_IS4_Li1ELi1EE
Stacktrace:
  [1] handle_error(reason::Cstring)
    @ LLVM ~/.julia/packages/LLVM/YSJ2s/src/core/context.jl:105
  [2] LLVMTargetMachineEmitToMemoryBuffer
    @ ~/.julia/packages/LLVM/YSJ2s/lib/13/libLLVM_h.jl:947 [inlined]
  [3] emit(tm::LLVM.TargetMachine, mod::LLVM.Module, filetype::LLVM.API.LLVMCodeGenFileType)
    @ LLVM ~/.julia/packages/LLVM/YSJ2s/src/targetmachine.jl:45
  [4] mcgen(job::GPUCompiler.CompilerJob, mod::LLVM.Module, format::LLVM.API.LLVMCodeGenFileType)
    @ GPUCompiler ~/.julia/packages/GPUCompiler/XyxTy/src/mcgen.jl:74
  [5] macro expansion
    @ ~/.julia/packages/TimerOutputs/LDL7n/src/TimerOutput.jl:252 [inlined]
  [6] macro expansion
    @ ~/.julia/packages/GPUCompiler/XyxTy/src/driver.jl:421 [inlined]
  [7] macro expansion
    @ ~/.julia/packages/TimerOutputs/LDL7n/src/TimerOutput.jl:252 [inlined]
  [8] macro expansion
    @ ~/.julia/packages/GPUCompiler/XyxTy/src/driver.jl:418 [inlined]
  [9] emit_asm(job::GPUCompiler.CompilerJob, ir::LLVM.Module; strip::Bool, validate::Bool, format::LLVM.API.LLVMCodeGenFileType)
    @ GPUCompiler ~/.julia/packages/GPUCompiler/XyxTy/src/utils.jl:64
 [10] cufunction_compile(job::GPUCompiler.CompilerJob, ctx::LLVM.Context)
    @ CUDA ~/.julia/packages/CUDA/GGwVa/src/compiler/execution.jl:354
 [11] #224
    @ ~/.julia/packages/CUDA/GGwVa/src/compiler/execution.jl:347 [inlined]
 [12] JuliaContext(f::CUDA.var"#224#225"{GPUCompiler.CompilerJob{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams, GPUCompiler.FunctionSpec{typeof(gpu_splat!), Tuple{KernelAbstractions.CompilerMetadata{KernelAbstractions.NDIteration.StaticSize{(16,)}, KernelAbstractions.NDIteration.DynamicCheck, Nothing, Nothing, KernelAbstractions.NDIteration.NDRange{1, KernelAbstractions.NDIteration.StaticSize{(1,)}, KernelAbstractions.NDIteration.StaticSize{(512,)}, Nothing, Nothing}}, CuDeviceVector{Int64, 1}, CuDeviceVector{UInt32, 1}, CuDeviceVector{Int64, 1}}}}})
    @ GPUCompiler ~/.julia/packages/GPUCompiler/XyxTy/src/driver.jl:74
 [13] cufunction_compile(job::GPUCompiler.CompilerJob)
    @ CUDA ~/.julia/packages/CUDA/GGwVa/src/compiler/execution.jl:346
 [14] cached_compilation(cache::Dict{UInt64, Any}, job::GPUCompiler.CompilerJob, compiler::typeof(CUDA.cufunction_compile), linker::typeof(CUDA.cufunction_link))
    @ GPUCompiler ~/.julia/packages/GPUCompiler/XyxTy/src/cache.jl:90
 [15] cufunction(f::typeof(gpu_splat!), tt::Type{Tuple{KernelAbstractions.CompilerMetadata{KernelAbstractions.NDIteration.StaticSize{(16,)}, KernelAbstractions.NDIteration.DynamicCheck, Nothing, Nothing, KernelAbstractions.NDIteration.NDRange{1, KernelAbstractions.NDIteration.StaticSize{(1,)}, KernelAbstractions.NDIteration.StaticSize{(512,)}, Nothing, Nothing}}, CuDeviceVector{Int64, 1}, CuDeviceVector{UInt32, 1}, CuDeviceVector{Int64, 1}}}; name::Nothing, kwargs::Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})
    @ CUDA ~/.julia/packages/CUDA/GGwVa/src/compiler/execution.jl:299
 [16] cufunction(f::typeof(gpu_splat!), tt::Type{Tuple{KernelAbstractions.CompilerMetadata{KernelAbstractions.NDIteration.StaticSize{(16,)}, KernelAbstractions.NDIteration.DynamicCheck, Nothing, Nothing, KernelAbstractions.NDIteration.NDRange{1, KernelAbstractions.NDIteration.StaticSize{(1,)}, KernelAbstractions.NDIteration.StaticSize{(512,)}, Nothing, Nothing}}, CuDeviceVector{Int64, 1}, CuDeviceVector{UInt32, 1}, CuDeviceVector{Int64, 1}}})
    @ CUDA ~/.julia/packages/CUDA/GGwVa/src/compiler/execution.jl:292
 [17] macro expansion
    @ ~/.julia/packages/CUDA/GGwVa/src/compiler/execution.jl:102 [inlined]
 [18] (::KernelAbstractions.Kernel{CUDADevice, KernelAbstractions.NDIteration.StaticSize{(512,)}, KernelAbstractions.NDIteration.StaticSize{(16,)}, typeof(gpu_splat!)})(::CuArray{Int64, 1, CUDA.Mem.DeviceBuffer}, ::Vararg{Any}; ndrange::Nothing, dependencies::CUDAKernels.CudaEvent, workgroupsize::Nothing, progress::Function)
    @ CUDAKernels ~/.julia/packages/CUDAKernels/JJJ1U/src/CUDAKernels.jl:273
 [19] Kernel
    @ ~/.julia/packages/CUDAKernels/JJJ1U/src/CUDAKernels.jl:268 [inlined]
 [20] main()
    @ Main ~/code/a.jl:42
 [21] top-level scope
    @ ~/code/a.jl:45
in expression starting at /home/pxl-th/code/a.jl:45

leios commented Jun 4, 2022

Contributor Author

Ah, I can replicate this error, but I am not sure if it is an Atomix or KernelAbstractions issue. It seems like the CPU version works fine, so maybe it's with UnsafeAtomicsLLVM?

Would you be willing to open up a new issue either here or on Atomix (https://github.com/JuliaConcurrent/Atomix.jl) and ping @tkf?

tkf commented Jun 5, 2022

Contributor

It's an LLVM issue, but it can be worked around at the level of (e.g.) CUDA.jl. See: JuliaConcurrent/Atomix.jl#33

This was referenced Jun 8, 2022
leios commented Jun 14, 2022

Contributor Author

@pxl-th, if you are still having trouble with Atomix, I created a separate PR with the atomic support from Core.Intrinsics and CUDA directly in #306. I also added the pkg commands to load in the subdirectory of CUDAKernels in a comment so you can just use it for now if you need.

I've been struggling to get things to work as well, so I also added testing infrastructure for Atomix in #308. Hopefully we can iron out all the details there and get all this sorted. If you have run into any issues, please document them there!

leios mentioned this pull request Jun 15, 2022