Skip to content

nasbackup.sh: add timeout, cleanup trap, space check, quiesce support#12843

Open
jmsperu wants to merge 1 commit into
apache:mainfrom
jmsperu:fix/nasbackup-timeout-spacecheck-trap
Open

nasbackup.sh: add timeout, cleanup trap, space check, quiesce support#12843
jmsperu wants to merge 1 commit into
apache:mainfrom
jmsperu:fix/nasbackup-timeout-spacecheck-trap

Conversation

@jmsperu
Copy link
Copy Markdown
Contributor

@jmsperu jmsperu commented Mar 17, 2026

Summary

  • Add BACKUP_TIMEOUT (default 6 hours) to prevent indefinitely stuck backup jobs; aborts via domjobabort when exceeded
  • Add EXIT trap with cleanup() that resumes paused VMs, removes temp dirs, and unmounts NFS on any exit (error, signal, or normal) — prevents orphan mounts from accumulating
  • Add check_free_space() pre-flight check (default 1 GB minimum) before writing backup data
  • Add -q/--quiesce flag for optional fsfreeze/thaw via qemu-guest-agent for application-consistent backups
  • Use set -eo pipefail for stricter error handling
  • Fix mount_operation(): proper if mount instead of broken $? check after pipe
  • Quote all variable expansions to prevent word splitting
  • Remove manual umount/rmdir from functions (now handled by trap)

Motivation

The current nasbackup.sh has several reliability issues observed in production:

  1. No timeout — backup polling loop (until domjobinfo --completed) runs forever if backup stalls, blocking the agent
  2. No cleanup on failure — failed backups leave orphan NFS mounts (/tmp/csbackup.XXXXX) and paused VMs that never resume (related: KVM NAS backup: VM remains paused indefinitely when backup job fails #12821)
  3. No space check — backups silently fail or produce corrupt output when the NAS is full
  4. No quiesce — running VM backups are crash-consistent only, even when qemu-guest-agent is available
  5. Broken error handling$? after a pipe always returns the pipe's exit status, not the command's

Test plan

  • Backup a running VM with sufficient space — verify backup completes normally
  • Backup a running VM with BACKUP_TIMEOUT=30 — verify timeout and domjobabort
  • Backup with NAS at <1GB free — verify pre-flight rejection
  • Kill backup mid-run — verify cleanup resumes VM and unmounts NFS
  • Backup with -q flag and qemu-guest-agent installed — verify fsfreeze/thaw in logs
  • Backup with -q flag without qemu-guest-agent — verify graceful fallback
  • Delete backup — verify cleanup trap handles unmount
  • Backup a stopped VM — verify space check and error handling on disk convert

@codecov
Copy link
Copy Markdown

codecov Bot commented Mar 18, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 16.25%. Comparing base (61afb4c) to head (937e646).
⚠️ Report is 18 commits behind head on 4.20.

Additional details and impacted files
@@            Coverage Diff            @@
##               4.20   #12843   +/-   ##
=========================================
  Coverage     16.24%   16.25%           
- Complexity    13411    13412    +1     
=========================================
  Files          5664     5664           
  Lines        500463   500463           
  Branches      60779    60779           
=========================================
+ Hits          81308    81333   +25     
+ Misses       410059   410035   -24     
+ Partials       9096     9095    -1     
Flag Coverage Δ
uitests 4.15% <ø> (ø)
unittests 17.10% <ø> (+<0.01%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@abh1sar
Copy link
Copy Markdown
Contributor

abh1sar commented Mar 18, 2026

@jmsperu can you please check on 4.21?
There were lots of improvement done to the nasbackup.sh script
Quiesce support was also added (which can also be opt out of)

@jmsperu
Copy link
Copy Markdown
Contributor Author

jmsperu commented Mar 18, 2026

@jmsperu can you please check on 4.21? There were lots of improvement done to the nasbackup.sh script Quiesce support was also added (which can also be opt out of)

Thanks for the heads up! I see 4.22 already has quiesce and improved error handling. I'll rebase onto 4.22 and update the PR to only include what's still missing: backup timeout, trap handler with VM resume, and free space check. The other PRs (compression, verification, bandwidth throttle, encryption, parallel execution) are still new features on top of 4.22.

@jmsperu
Copy link
Copy Markdown
Contributor Author

jmsperu commented May 22, 2026

Belated reply @abh1sar — you were right, there's significant overlap with what's already landed on main. The quiesce support and EXIT_CLEANUP_FAILED handling from this PR have since been merged upstream (likely via the broader nasbackup.sh improvements in #12822 and others), so those bits are now redundant.

What's still uniquely unaddressed by what landed:

  • BACKUP_TIMEOUT env var (default 21600s = 6h) wrapping the long-running virsh backup-begin poll loop so a stuck backup eventually fails rather than holding the agent forever
  • MIN_FREE_SPACE check against the mounted NAS before qemu-img convert runs, with a fast-fail when the NAS doesn't have enough free space — currently we discover this mid-write
  • trap cleanup EXIT so cleanup() runs even on SIGTERM / shell death (today cleanup() only runs on graceful exits)

I'll rebase against main to drop the overlapping changes and keep only those three, then push again. If you'd rather close this and open a fresh narrower PR with just those bits, also happy to do that — your call.

…M resume

Three independent reliability fixes for the KVM NAS backup script, layered on
top of the existing quiesce + EXIT_CLEANUP_FAILED groundwork:

1. BACKUP_TIMEOUT env var (default 6h) bounds the libvirt domjobinfo wait
   loop in backup_running_vm. Today a stuck QEMU backup holds the agent's
   command slot until the orchestrator-level timeout fires. The new guard
   issues domjobabort and exits non-zero so the agent reclaims the slot
   promptly.

2. MIN_FREE_SPACE env var (default 1 GiB) + check_free_space() runs after
   mount and before any qemu-img convert in both backup_running_vm and
   backup_stopped_vm. Fail-fast on a near-full NAS instead of failing
   mid-write halfway through a multi-GiB convert.

3. trap cleanup EXIT replaces the six explicit cleanup() call sites as the
   primary cleanup mechanism so orphan NFS mounts no longer accumulate when
   the script dies to SIGTERM, SIGINT, or any uncaught set -e failure
   between the explicit call sites. cleanup() is now guarded by
   CLEANUP_DONE so the trap doesn't re-run an already-completed cleanup
   from an explicit call.

cleanup() additionally resumes the VM if it's still paused — backup-begin
holds the guest paused briefly and a failed backup mid-pause currently
leaves the guest stuck in 'paused' state until an operator intervenes.

Targets main; supersedes the 4.20-targeted version of this PR.
@jmsperu jmsperu force-pushed the fix/nasbackup-timeout-spacecheck-trap branch from 937e646 to 0deb4c4 Compare May 22, 2026 21:56
@jmsperu jmsperu changed the base branch from 4.20 to main May 22, 2026 21:56
@jmsperu
Copy link
Copy Markdown
Contributor Author

jmsperu commented May 22, 2026

Rebased. Force-pushed 0deb4c4 and changed the base from 4.20main.

What landed in main since this PR was opened (via other PRs):

  • QUIESCE arg + freeze/thaw
  • EXIT_CLEANUP_FAILED exit code
  • Basic cleanup() function (called explicitly at 6 sites)

So this PR is now reduced to just the three improvements those didn't cover:

  1. BACKUP_TIMEOUT env var (default 6h) bounds the domjobinfo wait loop in backup_running_vm so a stuck QEMU backup hits domjobabort and frees the agent's command slot rather than holding it open until the orchestrator-level timeout.

  2. MIN_FREE_SPACE env var (default 1 GiB) + check_free_space() runs after mount and before any qemu-img convert in both backup paths — fail-fast on a near-full NAS instead of failing mid-write.

  3. trap cleanup EXIT replaces the explicit calls as the primary cleanup mechanism so orphan NFS mounts no longer accumulate on SIGTERM/SIGINT or set -e failures between the explicit call sites. cleanup() is guarded by a CLEANUP_DONE flag so the trap doesn't re-run an already-completed cleanup from an explicit call. cleanup() also now resumes the VM if it's still paused — a backup that dies mid-pause currently leaves the guest stuck paused until an operator intervenes.

Total: 71 +/3 - on one file. Ready for another look @abh1sar.

@jmsperu
Copy link
Copy Markdown
Contributor Author

jmsperu commented May 22, 2026

Friendly nudge — the new push is sitting on GitHub Actions action_required, so no CI has actually run yet against the rebase. Could a committer (@abh1sar @DaanHoogland) click Approve and run when you have a moment? Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants