
[R] Data race issue from R API requests in parallel region #50239

Description

@MichaelChirico

Describe the bug, including details regarding any error messages, version, and platform.

We recently enabled ThreadSanitizer tests for R packages, and it surfaced some errors in the {arrow} suite, e.g.

WARNING: ThreadSanitizer: data race (pid=9303)
Read of size 8 at 0x72900075c008 by thread T9:
   #0 INTEGER src/main/memory.c:4205:8
   #1 arrow::r::Converter_Int<arrow::UInt8Type>::Ingest_some_nulls(SEXPREC*, std::shared_ptr<arrow::Array> const&, long, long, unsigned long) const r/src/array_to_vector.cpp:191:19 (arrow.so+0xf0337)
   #2 operator() r/src/array_to_vector.cpp:88:24 (arrow.so+0xe00a9)
   #3 arrow::internal::FnOnce<arrow::Status ()>::FnImpl<arrow::r::Converter::ScheduleConvertTasks(arrow::r::RTasks&, std::shared_ptr<arrow::r::Converter>)::'lambda'()>::invoke() cpp/src/arrow/util/functional.h:153:42 (arrow.so+0xe00a9)
   #4 operator() cpp/src/arrow/util/functional.h:141:17
   #5 operator() cpp/src/arrow/util/task_group.cc:114:18
   #6 arrow::internal::FnOnce<void ()>::FnImpl<arrow::internal::(anonymous namespace)::ThreadedTaskGroup::AppendReal(arrow::internal::FnOnce<arrow::Status ()>)::'lambda'()>::invoke() cpp/src/arrow/util/functional.h:153:42
   #7 operator() cpp/src/arrow/util/functional.h:141:17
   #8 WorkerLoop cpp/src/arrow/util/thread_pool.cc:457:11
   #9 operator() cpp/src/arrow/util/thread_pool.cc:618:7
   #10 __invoke<(lambda at cpp/src/arrow/util/thread_pool.cc:616:23)>
   #11 __thread_execute
   #12 void* std::__thread_proxy

I don't really have the knowledge of internals required to completely debug the issue, but Gemini found a fix and summarized it. As usual with LLMs, it looks good at a high level, if a bit wordy. I can also propose Gemini's edit as a PR, if so requested. For now, I'm just including its description of the solution, rather than the actual diff.


[R] Data race on R vector headers during parallel Table to DataFrame conversion

Description

During the conversion of Arrow Tables to R DataFrames (specifically when use_threads = TRUE is enabled), a data race can occur between the R main thread and background worker threads. This race is detected by ThreadSanitizer (TSAN) when worker threads access R vector headers (via INTEGER(), REAL(), etc.) concurrently with the main thread registering the vectors into the output list.

Root Cause Analysis

In r/src/array_to_vector.cpp, to_data_frame loops over columns and calls Converter::LazyConvert to schedule conversion tasks:

  for (int i = 0; i < nc; i++) {
    names[i] = data->schema()->field(i)->name();
    tbl[i] = Converter::LazyConvert(to_chunks(data->column(i)), tasks);
  }

  1. Immediate Execution: RTasks::Append currently submits parallel tasks to the CPU thread pool as soon as they are scheduled. These background tasks start running right away and begin writing to the pre-allocated R vector (out) for their respective columns.
  2. Concurrent Access: To write data, the worker threads call R API accessors such as INTEGER(data) (or REAL, LOGICAL), which read the vector header to check the type via TYPEOF(x) (accessing x->sxpinfo.type).
  3. Main Thread Writes: Meanwhile, the main thread continues the loop. As soon as LazyConvert returns the allocated (but not yet fully populated) vector, the main thread assigns it to the list element: tbl[i] = out. This assignment calls SET_VECTOR_ELT, which modifies the header metadata of the assigned vector (e.g., updating reference counts or setting GC write barriers/generation flags).
  4. The Race: The worker threads read the vector's sxpinfo header while the main thread writes to it. Because sxpinfo is a bitfield whose fields share one memory word, a write to any field conflicts with a concurrent read of the type field, which is a data race.
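
To make step 4 concrete, here is a minimal single-threaded sketch of why a refcount update and a type read collide. MockHeader, its field names, and its field widths are hypothetical and do not match R's real sxpinfo layout; the sketch only assumes that the fields are adjacent bitfields packed into one 64-bit word. It deliberately does not start any threads, so it shows the shared storage rather than reproducing the race itself.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Hypothetical, simplified header. Names and widths are illustrative only
// and are NOT R's actual sxpinfo definition.
struct MockHeader {
  std::uint64_t type : 5;       // what a worker reads (TYPEOF-like check)
  std::uint64_t refcount : 16;  // what the main thread updates on assignment
  std::uint64_t other : 43;     // remaining flags, unused here
};
static_assert(sizeof(MockHeader) == sizeof(std::uint64_t),
              "sketch assumes all fields pack into one 64-bit word");

// Copy out the raw 64-bit word that backs every field of the header.
std::uint64_t RawWord(const MockHeader& h) {
  std::uint64_t word;
  std::memcpy(&word, &h, sizeof(word));
  return word;
}

// Worker-side access: reads only the type field.
unsigned ReadType(const MockHeader& h) { return static_cast<unsigned>(h.type); }

// Main-thread-side access: updates only the refcount field.
void BumpRefcount(MockHeader& h) { h.refcount = h.refcount + 1; }

// True if bumping the refcount rewrites the word that also holds the type,
// while leaving the type value itself unchanged.
bool RefcountWriteTouchesTypeWord() {
  MockHeader h{};
  h.type = 13;  // arbitrary type tag for the sketch
  const std::uint64_t before = RawWord(h);
  BumpRefcount(h);
  return RawWord(h) != before && ReadType(h) == 13;
}
```

Calling RefcountWriteTouchesTypeWord() returns true: the type value survives, but the word it lives in was rewritten. In the C++ memory model, adjacent non-zero-width bitfields form a single memory location, so doing the same read and write from two threads without synchronization is a data race even though the fields are logically independent.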

Proposed Solution

Modify RTasks (in r/src/r_task_group.h and r/src/RTasks.cpp) to delay the submission of parallel tasks, similar to how serial tasks are delayed:

  1. Introduce a std::vector<Task> delayed_parallel_tasks_ member to RTasks.
  2. In RTasks::Append, if the task is parallel, push it to delayed_parallel_tasks_ instead of appending it directly to the active TaskGroup.
  3. In RTasks::Finish(), first append all delayed_parallel_tasks_ to the parallel_tasks_ group, and then wait for them to finish using parallel_tasks_->Finish().

This guarantees that all R-side allocations and list assignments (which modify headers) are completed on the main thread before any background threads are allowed to start accessing those vectors, eliminating the data race.
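
The three steps above can be sketched roughly as follows. This is not Arrow's RTasks: DeferredTasks, pending(), RunCounts, and ScheduleThenFinish are hypothetical names for illustration, and plain std::thread stands in for Arrow's CPU thread pool and TaskGroup. Only the delayed_parallel_tasks_ member name is taken from the proposal.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <functional>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical stand-in for RTasks; models only the parallel-task path.
class DeferredTasks {
 public:
  using Task = std::function<void()>;

  // Step 2: record the task instead of submitting it right away.
  void Append(Task task) { delayed_parallel_tasks_.push_back(std::move(task)); }

  std::size_t pending() const { return delayed_parallel_tasks_.size(); }

  // Step 3: submit everything now, then wait for completion.
  // (A sketch: no error propagation or exception safety.)
  void Finish() {
    std::vector<std::thread> workers;
    workers.reserve(delayed_parallel_tasks_.size());
    for (auto& task : delayed_parallel_tasks_) {
      workers.emplace_back(std::move(task));
    }
    for (auto& worker : workers) worker.join();
    delayed_parallel_tasks_.clear();
  }

 private:
  std::vector<Task> delayed_parallel_tasks_;  // Step 1
};

struct RunCounts {
  int before_finish;  // tasks that had run when scheduling ended
  int after_finish;   // tasks that had run once Finish() returned
};

// Mirrors the shape of the to_data_frame loop: schedule one task per column,
// then finish. No task may run while the loop is still executing.
RunCounts ScheduleThenFinish(int n_columns) {
  std::atomic<int> ran{0};
  DeferredTasks tasks;
  for (int i = 0; i < n_columns; i++) {
    // In the real code, the main thread would allocate the column here and
    // assign it into the list before moving on to the next column.
    tasks.Append([&ran] { ran.fetch_add(1); });
  }
  assert(tasks.pending() == static_cast<std::size_t>(n_columns));
  const int before = ran.load();
  tasks.Finish();
  return {before, ran.load()};
}
```

With this shape, ScheduleThenFinish(4) reports zero tasks run before Finish() and four afterwards, which is the ordering the proposal relies on. The likely cost is that conversion work no longer overlaps with the scheduling loop, so whether that matters in practice would need measuring.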

Component(s)

R
