
Add data dumps #2047

Merged
ArtOfCode- merged 75 commits into develop from art/data-dumps
May 21, 2026

Conversation

@ArtOfCode-
Member

ArtOfCode- commented May 12, 2026

Adds data dumps: a weekly export of the entire database, minus anything sensitive, uploaded to S3 and made available via a new page linked in the footer. Also adds an option for manually created dump records, intended for quarterly uploads to Archive.org.

Incorporates #1950 by cherry-pick.
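
As a rough illustration of "export minus anything sensitive" plus a stored checksum, here is a minimal plain-Ruby sketch. It is not the code in `app/jobs/data_dump_job.rb`, which is not shown in this conversation. The `SENSITIVE_COLUMNS` map, the column names, the JSON output format, and the choice of SHA-256 are all assumptions made for the example, and the S3 upload step is left out.

```ruby
require 'digest'
require 'json'

# Hypothetical map of table name => columns to drop from the export.
# These names are illustrative and are not taken from the PR.
SENSITIVE_COLUMNS = {
  'users' => %w[email encrypted_password]
}.freeze

# Returns copies of the rows with the sensitive columns for `table` removed.
def redact_rows(table, rows)
  blocked = SENSITIVE_COLUMNS.fetch(table, [])
  rows.map { |row| row.reject { |column, _value| blocked.include?(column) } }
end

# Serializes the redacted tables to JSON and pairs the payload with its
# SHA-256 digest (the digest algorithm is an assumption).
def build_dump(tables)
  redacted = tables.to_h { |table, rows| [table, redact_rows(table, rows)] }
  payload = JSON.generate(redacted)
  { payload: payload, checksum: Digest::SHA256.hexdigest(payload) }
end

dump = build_dump({
  'users' => [{ 'id' => 1, 'username' => 'example', 'email' => 'a@example.com' }],
  'posts' => [{ 'id' => 10, 'title' => 'Hello' }]
})
puts dump[:checksum]
```

Publishing the digest next to the download link would let anyone who fetches a dump confirm the file arrived intact.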

ArtOfCode- requested review from Oaphi and cellio May 14, 2026 10:07
ArtOfCode- changed the title from "Add some more data dump skeleton" to "Add data dumps" May 14, 2026
ArtOfCode- changed the base branch from 0valt/1918/data-dump to develop May 14, 2026 10:10
ArtOfCode- linked an issue May 14, 2026 that may be closed by this pull request
@codecov

codecov Bot commented May 14, 2026

Codecov Report

❌ Patch coverage is 54.60993% with 64 lines in your changes missing coverage. Please review.
✅ Project coverage is 80.30%. Comparing base (2023e74) to head (cfde135).

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| lib/database_changes_checker.rb | 0.00% | 56 Missing ⚠️ |
| app/jobs/data_dump_job.rb | 92.75% | 5 Missing ⚠️ |
| app/controllers/dumps_controller.rb | 60.00% | 2 Missing ⚠️ |
| app/models/dump.rb | 87.50% | 1 Missing ⚠️ |
Additional details and impacted files
| Components | Coverage Δ | |
| --- | --- | --- |
| controllers | 76.02% <60.00%> (-0.03%) | ⬇️ |
| helpers | 85.32% <100.00%> (+0.01%) | ⬆️ |
| jobs | 75.11% <92.95%> (+8.22%) | ⬆️ |
| models | 93.02% <87.50%> (-0.03%) | ⬇️ |
| tasks | 61.11% <ø> (ø) | |


@github-actions

github-actions Bot commented May 20, 2026

Data dump-affecting changes

This pull request changes DB schema and may affect what data is included in the data dump. Please review:

db/migrate/20260120032240_create_dumps.rb

create_table :dumps do |t|

db/migrate/20260513135320_add_defaults_for_data_dumps.rb

change_column_default :filters, :user_id, -1
change_column_default :flags, :escalated, false
change_column_default :users, :sign_in_count, 0
change_column_default :users, :failed_attempts, 0
change_column_default :users, :deleted, false

db/migrate/20260513160917_add_automatic_to_dumps.rb

add_column :dumps, :automatic, :boolean, null: false, default: false

db/migrate/20260513185013_add_link_to_dumps.rb

add_column :dumps, :link, :string

db/migrate/20260519211432_add_checksum_to_dumps.rb

add_column :dumps, :checksum, :string

db/migrate/20260520085943_add_more_default_values.rb

change_column_default :votes, :created_at, '2000-01-01T00:00:00.000000Z'
change_column_default :votes, :updated_at, '2000-01-01T00:00:00.000000Z'

db/migrate/20260520112652_add_filter_default_name.rb

change_column_default :filters, :name, ''

db/migrate/20260520175923_add_user_creation_defaults.rb

change_column_default :users, :created_at, '2000-01-01T00:00:00.000000Z'
change_column_default :users, :updated_at, '2000-01-01T00:00:00.000000Z'
change_column_default :community_users, :created_at, '2000-01-01T00:00:00.000000Z'
change_column_default :community_users, :updated_at, '2000-01-01T00:00:00.000000Z'
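
A report like the one above could be produced by scanning migration files for schema-changing statements. The sketch below is one possible way to do that in plain Ruby. It is not the contents of `lib/database_changes_checker.rb`, which is not shown here. The list of statements, the method name `schema_changes`, and the wrapper class in the sample migration are assumptions for illustration.

```ruby
# Statement names treated as schema-affecting; an assumed list for illustration.
SCHEMA_STATEMENTS = %w[create_table add_column change_column_default].freeze
SCHEMA_PATTERN = /\A(?:#{SCHEMA_STATEMENTS.join('|')})\b/

# Takes a hash of { file path => migration source } and returns, per file,
# the stripped lines that begin with one of the schema statements.
# Files with no matching lines are left out of the report.
def schema_changes(migrations)
  migrations.each_with_object({}) do |(path, source), report|
    hits = source.each_line.map(&:strip).select { |line| line.match?(SCHEMA_PATTERN) }
    report[path] = hits unless hits.empty?
  end
end

# Sample migration text; only the add_column line comes from the report above.
source = <<~MIGRATION
  class AddLinkToDumps
    def change
      add_column :dumps, :link, :string
    end
  end
MIGRATION

report = schema_changes({ 'db/migrate/20260513185013_add_link_to_dumps.rb' => source })
report.each do |path, lines|
  puts path
  lines.each { |line| puts "  #{line}" }
end
```

Matching on the start of each stripped line keeps the check simple, at the cost of missing statements that are built dynamically or split across lines.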

@cellio
Member

cellio commented May 20, 2026

I don't recall any details now, but I thought there might have been a case that resulted from timings (maybe a user pressing save on an answer after the question was deleted?), so unless we can rule that out we might get very rare new cases in future.

If that isn't already ruled out, would a potential solution be to have a job to delete such answers, and have it run just before the data dump each week?

Oh, I think you're right. A cleanup job sounds like a good idea (and is separable from this PR).
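
To make the cleanup idea concrete, here is a small in-memory sketch of finding answers whose parent question has been deleted and marking them deleted before a dump runs. Nothing like this is part of the PR. The `Post` struct, its field names, and the choice of soft deletion are assumptions for the example.

```ruby
# Minimal stand-in for a post record; the field names are assumptions.
Post = Struct.new(:id, :kind, :parent_id, :deleted, keyword_init: true)

# Returns answers that are not deleted themselves but whose parent question
# is deleted or missing.
def orphaned_answers(posts)
  live_question_ids = posts.select { |p| p.kind == :question && !p.deleted }.map(&:id)
  posts.select do |p|
    p.kind == :answer && !p.deleted && !live_question_ids.include?(p.parent_id)
  end
end

# Soft-deletes each orphaned answer and returns the ids that were changed.
def clean_orphans!(posts)
  orphaned_answers(posts).each { |answer| answer.deleted = true }.map(&:id)
end

posts = [
  Post.new(id: 1, kind: :question, parent_id: nil, deleted: true),
  Post.new(id: 2, kind: :answer, parent_id: 1, deleted: false),
  Post.new(id: 3, kind: :question, parent_id: nil, deleted: false),
  Post.new(id: 4, kind: :answer, parent_id: 3, deleted: false)
]

cleaned = clean_orphans!(posts)
puts cleaned.inspect # => [2]
```

Running such a step just before the weekly export would catch the rare timing case described above without needing to prevent it at save time.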

Member

@cellio cellio left a comment


Tested and works great for me. I read the code changes and understood most but not all of them, so treat this as a functionality review more than a code review.

Edit: Art walked me through the code I was having trouble understanding. LGTM!

Comment thread app/views/dumps/index.html.erb
@ArtOfCode-
Member Author

I have no idea why a sign-in rate limit test has suddenly started failing and have run out of time to work on this for a few days - if anyone else has time to look please feel free.

@cellio
Member

cellio commented May 21, 2026

> I have no idea why a sign-in rate limit test has suddenly started failing and have run out of time to work on this for a few days - if anyone else has time to look please feel free.

It's because of the new default creation/update times for users. Don't merge until we resolve.

Member

@cellio cellio left a comment


Retested with the latest changes and all looks good now: new users have the correct (real) timestamps, and data dumps show 1970 for users, community users, and votes. Changing the default in the dump DB only from within the dump job is clever!
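
The principle here is that placeholder timestamps should exist only in the exported copy, never in the live records. The PR does this by changing column defaults in the dump database from within the dump job. The sketch below is not that mechanism. It is an in-memory analogue, assuming rows are plain hashes, with a hypothetical `scrub_timestamps` helper and the Unix epoch as the placeholder to match the 1970 value mentioned above.

```ruby
require 'time'

# Placeholder written into exported rows: "1970-01-01T00:00:00Z".
PLACEHOLDER = Time.at(0).utc.iso8601.freeze

# Columns replaced in the export; an assumed list for illustration.
TIMESTAMP_COLUMNS = %w[created_at updated_at].freeze

# Returns a copy of the row with timestamp columns replaced by the placeholder.
# The original row is left untouched.
def scrub_timestamps(row)
  row.to_h { |column, value| [column, TIMESTAMP_COLUMNS.include?(column) ? PLACEHOLDER : value] }
end

live_user = { 'id' => 7, 'username' => 'example', 'created_at' => '2026-05-21T19:38:00Z' }
dumped_user = scrub_timestamps(live_user)

puts dumped_user['created_at'] # => 1970-01-01T00:00:00Z
puts live_user['created_at']   # => 2026-05-21T19:38:00Z
```

Because the live row keeps its real timestamp, logic that depends on account age, such as a sign-in rate limit, is unaffected by what the dump shows.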

ArtOfCode- merged commit d548a06 into develop May 21, 2026
13 of 14 checks passed
ArtOfCode- deleted the art/data-dumps branch May 21, 2026 19:38


Development

Successfully merging this pull request may close these issues.

Add a system for regular and/or on-demand data dumps

4 participants