Describe the bug, including details regarding any error messages, version, and platform.
We recently enabled ThreadSanitizer tests for R packages, and it surfaced some errors in the {arrow} suite, e.g.
WARNING: ThreadSanitizer: data race (pid=9303)
Read of size 8 at 0x72900075c008 by thread T9:
#0 INTEGER src/main/memory.c:4205:8
#1 arrow::r::Converter_Int<arrow::UInt8Type>::Ingest_some_nulls(SEXPREC*, std::shared_ptr<arrow::Array> const&, long, long, unsigned long) const r/src/array_to_vector.cpp:191:19 (arrow.so+0xf0337)
#2 operator() r/src/array_to_vector.cpp:88:24 (arrow.so+0xe00a9)
#3 arrow::internal::FnOnce<arrow::Status ()>::FnImpl<arrow::r::Converter::ScheduleConvertTasks(arrow::r::RTasks&, std::shared_ptr<arrow::r::Converter>)::'lambda'()>::invoke() cpp/src/arrow/util/functional.h:153:42 (arrow.so+0xe00a9)
#4 operator() cpp/src/arrow/util/functional.h:141:17
#5 operator() cpp/src/arrow/util/task_group.cc:114:18
#6 arrow::internal::FnOnce<void ()>::FnImpl<arrow::internal::(anonymous namespace)::ThreadedTaskGroup::AppendReal(arrow::internal::FnOnce<arrow::Status ()>)::'lambda'()>::invoke() cpp/src/arrow/util/functional.h:153:42
#7 operator() cpp/src/arrow/util/functional.h:141:17
#8 WorkerLoop cpp/src/arrow/util/thread_pool.cc:457:11
#9 operator() cpp/src/arrow/util/thread_pool.cc:618:7
#10 __invoke<(lambda at cpp/src/arrow/util/thread_pool.cc:616:23)>
#11 __thread_execute
#12 void* std::__thread_proxy
I don't really have the knowledge of internals required to completely debug the issue, but Gemini found a fix and summarized it. As usual with LLMs, it looks good at a high level, if a bit wordy. I can also propose Gemini's edit as a PR, if so requested. For now, I'm just including its description of the solution, rather than the actual diff.
Description
During the conversion of Arrow Tables to R DataFrames (specifically when use_threads = TRUE is enabled), a data race can occur between the R main thread and background worker threads. This race is detected by ThreadSanitizer (TSAN) when worker threads access R vector headers (via INTEGER(), REAL(), etc.) concurrently with the main thread registering the vectors into the output list.
Root Cause Analysis
to_data_frame (r/src/array_to_vector.cpp, lines 1405 to 1408 at commit c75b82d) loops over columns and calls Converter::LazyConvert to schedule conversion tasks:
for (int i = 0; i < nc; i++) {
  names[i] = data->schema()->field(i)->name();
  tbl[i] = Converter::LazyConvert(to_chunks(data->column(i)), tasks);
}
- Immediate Execution: Currently, RTasks::Append immediately submits parallel tasks to the CPU thread pool when they are scheduled. These background tasks start running immediately and begin writing to the pre-allocated R vector (out) for their respective columns.
- Concurrent Access: To write data, the background worker threads call R API accessors like INTEGER(data) (or REAL, LOGICAL), which read the vector header to check the type via TYPEOF(x) (accessing x->sxpinfo.type).
- Main Thread Writes: Meanwhile, the main thread continues the loop. As soon as LazyConvert returns the allocated (but not yet fully populated) vector, the main thread assigns it to the list element: tbl[i] = out. This assignment calls SET_VECTOR_ELT, which modifies the vector's header metadata (e.g., updating reference counts or setting GC write barriers/generation flags).
- The Race: The worker threads are reading the vector's sxpinfo header concurrently with the main thread writing to the same sxpinfo header (which is a bitfield sharing the same memory word), leading to a data race.
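The "same memory word" point can be checked in isolation. In the C and C++ memory models, adjacent non-zero-width bit-fields form a single memory location, so a write to one field and a concurrent read of its neighbour is a data race even though the reader never looks at the bits that change. The sketch below is a minimal, single-threaded illustration of that sharing. ToyHeader, RawWord and RefcountWriteTouchesTypeWord are hypothetical names, and the field widths are assumptions rather than R's real sxpinfo layout.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Toy stand-in for an R object header. The field names and widths are
// assumptions for illustration, NOT R's real sxpinfo layout.
struct ToyHeader {
  unsigned int type : 5;       // the field a TYPEOF-style check reads
  unsigned int refcount : 16;  // a field a list assignment might update
};

static_assert(sizeof(ToyHeader) <= sizeof(std::uint32_t),
              "this sketch assumes both bit-fields pack into one 32-bit word");

// Copy the header's raw bytes into an integer so two snapshots can be compared.
std::uint32_t RawWord(const ToyHeader& header) {
  std::uint32_t word = 0;
  std::memcpy(&word, &header, sizeof(ToyHeader));
  return word;
}

// Returns true if bumping refcount changed the stored word while the value
// read back from `type` stayed the same. Single-threaded on purpose: doing
// the write and the read on two threads would be undefined behavior.
bool RefcountWriteTouchesTypeWord() {
  ToyHeader header;
  std::memset(&header, 0, sizeof(header));
  header.type = 13;
  const std::uint32_t before = RawWord(header);
  header.refcount = header.refcount + 1;  // what the "main thread" would do
  const std::uint32_t after = RawWord(header);
  return before != after && header.type == 13;
}
```

Calling RefcountWriteTouchesTypeWord() from a small main() should return true on mainstream compilers: the refcount update rewrites the word that also holds type, which is the access pattern TSAN flags when the two sides run on different threads.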
Here is a draft for the bug report, suitable for submitting to the upstream Apache Arrow project (e.g., on GitHub Issues). It uses the sanitized stack trace and focuses on the generic R/C++ interface issue.
[R] Data race on R vector headers during parallel Table to DataFrame conversion
Description
During the conversion of Arrow Tables to R DataFrames (specifically when use_threads = TRUE is enabled), a data race can occur between the R main thread and background worker threads. This race is detected by ThreadSanitizer (TSAN) when worker threads access R vector headers (via INTEGER(), REAL(), etc.) concurrently with the main thread registering the vectors into the output list.
Root Cause Analysis
In r/src/array_to_vector.cpp, to_data_frame loops over columns and calls Converter::LazyConvert to schedule conversion tasks:
for (int i = 0; i < nc; i++) {
  names[i] = data->schema()->field(i)->name();
  tbl[i] = Converter::LazyConvert(to_chunks(data->column(i)), tasks);
}
- Immediate Execution: Currently, RTasks::Append immediately submits parallel tasks to the CPU thread pool when they are scheduled. These background tasks start running immediately and begin writing to the pre-allocated R vector (out) for their respective columns.
- Concurrent Access: To write data, the background worker threads call R API accessors like INTEGER(data) (or REAL, LOGICAL), which read the vector header to check the type via TYPEOF(x) (accessing x->sxpinfo.type).
- Main Thread Writes: Meanwhile, the main thread continues the loop. As soon as LazyConvert returns the allocated (but not yet fully populated) vector, the main thread assigns it to the list element: tbl[i] = out. This assignment calls SET_VECTOR_ELT, which modifies the vector's header metadata (e.g., updating reference counts or setting GC write barriers/generation flags).
- The Race: The worker threads are reading the vector's sxpinfo header concurrently with the main thread writing to the same sxpinfo header (which is a bitfield sharing the same memory word), leading to a data race.
Proposed Solution
Modify RTasks (in r/src/r_task_group.h and r/src/RTasks.cpp) to delay the submission of parallel tasks similar to how serial tasks are delayed:
- Introduce a std::vector<Task> delayed_parallel_tasks_ member to RTasks.
- In RTasks::Append, if the task is parallel, push it to delayed_parallel_tasks_ instead of appending it directly to the active TaskGroup.
- In RTasks::Finish(), first append all delayed_parallel_tasks_ to the parallel_tasks_ group, and then wait for them to finish using parallel_tasks_->Finish().
This guarantees that all R-side allocations and list assignments (which modify headers) are completed on the main thread before any background threads are allowed to start accessing those vectors, eliminating the data race.
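The deferral idea can be sketched independently of Arrow. The following is a minimal illustration that assumes plain std::thread workers in place of Arrow's TaskGroup and CPU thread pool. DeferredTasks and DemoDeferredSubmission are hypothetical names, not the actual RTasks implementation or the proposed diff. Only the member name delayed_parallel_tasks_ is taken from the description above.

```cpp
#include <cassert>
#include <functional>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical stand-in for RTasks; not the arrow R package's real class.
class DeferredTasks {
 public:
  using Task = std::function<void()>;

  // Record the task only. Nothing runs yet, so the caller can keep doing
  // its own bookkeeping on shared objects without any concurrent access.
  void Append(Task task) { delayed_parallel_tasks_.push_back(std::move(task)); }

  // Launch every recorded task on its own thread, then wait for all of them.
  void Finish() {
    std::vector<std::thread> workers;
    workers.reserve(delayed_parallel_tasks_.size());
    for (auto& task : delayed_parallel_tasks_) {
      workers.emplace_back(std::move(task));
    }
    for (auto& worker : workers) worker.join();
    delayed_parallel_tasks_.clear();
  }

 private:
  std::vector<Task> delayed_parallel_tasks_;
};

// Mirrors the shape of the to_data_frame loop with toy data: schedule one
// task per column, register the column in the output, and only then run.
// Returns true if no task ran before Finish() and every task ran after it.
bool DemoDeferredSubmission() {
  const int nc = 4;
  std::vector<int> columns(nc, 0);       // stand-ins for pre-allocated vectors
  std::vector<int*> table(nc, nullptr);  // stand-in for the output list
  DeferredTasks tasks;
  for (int i = 0; i < nc; i++) {
    tasks.Append([&columns, i] { columns[i] = i + 1; });
    table[i] = &columns[i];  // main-thread registration, no workers alive yet
  }
  bool ok = true;
  for (int value : columns) ok = ok && (value == 0);  // nothing has run so far
  tasks.Finish();
  for (int i = 0; i < nc; i++) ok = ok && (*table[i] == i + 1);
  return ok;
}
```

Thread creation synchronizes with the start of the thread function, so everything the scheduling thread did before Finish() is visible to the workers, and join() makes their writes visible afterwards. The real change would presumably hand the deferred tasks to the existing thread pool rather than spawn threads, but the ordering guarantee it relies on is the same.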
Component(s)
R