tiproxy: revive graceful scaling-in pods when scaling out TiProxy #6964

Draft: djshow832 wants to merge 19 commits into pingcap:main from djshow832:reuse_pod

@djshow832 djshow832 commented Jun 24, 2026

Background

Gracefully scaling in a TiProxy pod may take 24h. If the operator always creates a new pod when scaling out, a workload that auto-scales frequently accumulates many extra pods that are still draining, which wastes resources.
The goal is to revive pods that are gracefully scaling in when scaling out, on a best-effort basis.

Changes

Pods that are restarting or upgrading can't be revived, so their steps must differ from the steps for pods that are scaling in.

Graceful scale-in steps:

  1. Set spec.offline=true. Do not set deletionTimestamp, because the pod may be reused.
  2. Call the TiProxy API to mark the pod unhealthy.
  3. Record graceful-shutdown-begin-time on the pod.
  4. Delete the CR if the pod is not revived.
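
The scale-in steps above can be sketched roughly as follows. This is a minimal illustration, not the operator's actual code: the `instance` type, the `healthAPI` interface, `errUnsupported`, the bare annotation key, and the fixed `drainTimeout` used to decide when step 4 fires are all assumptions made for the example.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// beginTimeKey is the annotation named in the steps above; the real key
// may carry a domain prefix (assumption).
const beginTimeKey = "graceful-shutdown-begin-time"

// drainTimeout is an assumed upper bound for a graceful drain.
const drainTimeout = 24 * time.Hour

// demoStart is a fixed clock value used by the demo in main.
var demoStart = time.Date(2026, 6, 24, 0, 0, 0, 0, time.UTC)

// errUnsupported is a hypothetical error for TiProxy versions that lack
// the health override API.
var errUnsupported = errors.New("health override API unsupported")

// instance is a hypothetical, simplified view of a TiProxy CR and its pod.
type instance struct {
	Name        string
	Offline     bool              // mirrors spec.offline
	Annotations map[string]string // pod annotations
}

// healthAPI is a hypothetical client for the TiProxy health override API.
type healthAPI interface {
	SetUnhealthy(name string, unhealthy bool) error
}

// beginScaleIn performs steps 1-3. It never sets a deletion timestamp,
// so the instance stays revivable.
func beginScaleIn(in *instance, api healthAPI, now time.Time) error {
	in.Offline = true // step 1
	// Step 2: tolerate versions without the API (assumption).
	if err := api.SetUnhealthy(in.Name, true); err != nil && !errors.Is(err, errUnsupported) {
		return err
	}
	// Step 3: keep the first recorded time so retries do not extend the drain.
	if in.Annotations == nil {
		in.Annotations = map[string]string{}
	}
	if _, ok := in.Annotations[beginTimeKey]; !ok {
		in.Annotations[beginTimeKey] = now.UTC().Format(time.RFC3339)
	}
	return nil
}

// shouldDelete models step 4: delete only if the instance is still offline
// (not revived) once the assumed drain timeout has elapsed.
func shouldDelete(in *instance, now time.Time) bool {
	begin, err := time.Parse(time.RFC3339, in.Annotations[beginTimeKey])
	if err != nil || !in.Offline {
		return false
	}
	return now.Sub(begin) >= drainTimeout
}

// fakeAPI records calls for the demo below.
type fakeAPI struct{ unhealthy map[string]bool }

func (f *fakeAPI) SetUnhealthy(name string, unhealthy bool) error {
	f.unhealthy[name] = unhealthy
	return nil
}

func main() {
	api := &fakeAPI{unhealthy: map[string]bool{}}
	in := &instance{Name: "tiproxy-0"}
	if err := beginScaleIn(in, api, demoStart); err != nil {
		panic(err)
	}
	fmt.Println(in.Offline, api.unhealthy["tiproxy-0"], in.Annotations[beginTimeKey])
	fmt.Println(shouldDelete(in, demoStart.Add(drainTimeout/2))) // false
	fmt.Println(shouldDelete(in, demoStart.Add(drainTimeout)))   // true
}
```

Keeping the first recorded begin time makes the function safe to call again on a later reconcile without resetting the drain window.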

Scale-out steps:

  1. Choose a pod to revive.
  2. Set spec.offline=false and clear graceful-shutdown-begin-time.
  3. Call the TiProxy API to clear the unhealthy status.
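
A rough sketch of the scale-out path, under stated assumptions: the `instance` type, the `healthAPI` interface, and the selection policy are hypothetical. In particular, preferring the instance that started draining most recently is only one plausible choice; the PR does not say how the pod is chosen. The sketch also assumes begin times are stored as RFC 3339 strings in UTC, so they can be compared as strings.

```go
package main

import "fmt"

// beginTimeKey is the annotation named in the steps above; the real key
// may carry a domain prefix (assumption).
const beginTimeKey = "graceful-shutdown-begin-time"

// instance is a hypothetical, simplified view of a TiProxy CR and its pod.
type instance struct {
	Name        string
	Offline     bool // mirrors spec.offline
	Deleting    bool // true once deletionTimestamp is set
	Annotations map[string]string
}

// healthAPI is a hypothetical client for the TiProxy health override API.
type healthAPI interface {
	SetUnhealthy(name string, unhealthy bool) error
}

// chooseRevivable (step 1) returns an offline instance that is not being
// deleted, preferring the latest begin time (assumed policy), or nil.
func chooseRevivable(all []*instance) *instance {
	var best *instance
	for _, in := range all {
		if !in.Offline || in.Deleting {
			continue
		}
		if best == nil || in.Annotations[beginTimeKey] > best.Annotations[beginTimeKey] {
			best = in
		}
	}
	return best
}

// revive performs steps 2 and 3 on the chosen instance.
func revive(in *instance, api healthAPI) error {
	in.Offline = false
	delete(in.Annotations, beginTimeKey) // no-op on a nil map
	return api.SetUnhealthy(in.Name, false)
}

// scaleOut revives one instance if possible. Otherwise it reports that a
// new pod has to be created.
func scaleOut(all []*instance, api healthAPI) (string, bool, error) {
	in := chooseRevivable(all)
	if in == nil {
		return "", true, nil
	}
	if err := revive(in, api); err != nil {
		return "", false, err
	}
	return in.Name, false, nil
}

// fakeAPI records calls for the demo below.
type fakeAPI struct{ unhealthy map[string]bool }

func (f *fakeAPI) SetUnhealthy(name string, unhealthy bool) error {
	f.unhealthy[name] = unhealthy
	return nil
}

func main() {
	api := &fakeAPI{unhealthy: map[string]bool{"b": true, "c": true}}
	pods := []*instance{
		{Name: "a"},
		{Name: "b", Offline: true, Annotations: map[string]string{beginTimeKey: "2026-06-24T02:00:00Z"}},
		{Name: "c", Offline: true, Annotations: map[string]string{beginTimeKey: "2026-06-24T01:00:00Z"}},
	}
	for i := 0; i < 3; i++ {
		name, createNew, err := scaleOut(pods, api)
		fmt.Println(name, createNew, err)
	}
}
```

In this demo the first two scale-outs revive `b` and then `c`; the third finds nothing revivable and falls back to creating a new pod.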

Follow-ups in HPA:

  • Skip the pods with graceful-shutdown-begin-time when calculating the average CPU/memory/network.
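
As an illustration of this follow-up, the sketch below averages CPU only over pods that are not draining. The `podMetric` type and the millicore unit are assumptions for the example, and the real annotation key may differ from the bare name used here.

```go
package main

import "fmt"

// beginTimeKey is the annotation named above; the real key may carry a
// domain prefix (assumption).
const beginTimeKey = "graceful-shutdown-begin-time"

// podMetric is a hypothetical pairing of a pod with one sampled metric.
type podMetric struct {
	Name        string
	Annotations map[string]string
	CPUMilli    int64
}

// averageCPU returns the mean CPU over pods that are not draining.
// The second result is false when every pod is draining or the list is empty.
func averageCPU(pods []podMetric) (float64, bool) {
	var sum int64
	var n int
	for _, p := range pods {
		if _, draining := p.Annotations[beginTimeKey]; draining {
			continue // a draining pod serves little traffic and would skew the mean
		}
		sum += p.CPUMilli
		n++
	}
	if n == 0 {
		return 0, false
	}
	return float64(sum) / float64(n), true
}

func main() {
	pods := []podMetric{
		{Name: "tiproxy-0", CPUMilli: 800},
		{Name: "tiproxy-1", CPUMilli: 600},
		{Name: "tiproxy-2", CPUMilli: 50,
			Annotations: map[string]string{beginTimeKey: "2026-06-24T00:00:00Z"}},
	}
	avg, ok := averageCPU(pods)
	fmt.Println(avg, ok) // 700 true
}
```

Without the filter, the nearly idle draining pod would pull the average down to about 483 millicores and could trigger another scale-in.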

Notes:

  • Scale-out may race with pod deletion, so a pod that has already been deleted may be chosen for revival. Reconciliation resolves this by creating a new pod or reviving another pod.
  • When scaling in and upgrading happen at the same time, a TiProxy pod with an old revision may be marked offline (in ScaleInUpdate) and later revived by another scale-out. Reconciliation resolves this by upgrading the pod.
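
The two cases in these notes amount to a small decision made during reconciliation. The sketch below only illustrates that decision; the `revivedPod` type, the `action` values, and the `reconcileRevived` function are hypothetical names, not identifiers from the operator.

```go
package main

import "fmt"

// action is a hypothetical label for what reconciliation does next.
type action string

const (
	actionNone    action = "none"
	actionReplace action = "recreate-or-revive-another"
	actionUpgrade action = "upgrade"
)

// revivedPod is a hypothetical view of a pod that scale-out chose to revive.
type revivedPod struct {
	Name     string
	Deleted  bool   // the pod lost the race with deletion
	Revision string // revision the pod currently runs
}

// reconcileRevived picks the follow-up for a revived pod: replace it if it
// was deleted, upgrade it if it runs an old revision, otherwise do nothing.
func reconcileRevived(p revivedPod, desiredRevision string) action {
	if p.Deleted {
		return actionReplace
	}
	if p.Revision != desiredRevision {
		return actionUpgrade
	}
	return actionNone
}

func main() {
	fmt.Println(reconcileRevived(revivedPod{Name: "a", Deleted: true, Revision: "v2"}, "v2"))
	fmt.Println(reconcileRevived(revivedPod{Name: "b", Revision: "v1"}, "v2"))
	fmt.Println(reconcileRevived(revivedPod{Name: "c", Revision: "v2"}, "v2"))
}
```

The deletion check comes first because a deleted pod cannot be upgraded in place.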

Other refactors:

  • Rename IsStore() to SupportsStore().
  • Move some graceful restart logic into public files.

E2E tests

  • revive draining TiProxy pods on scale-out instead of creating new ones
  • create new TiProxy pods on scale-out when drained instances are no longer revivable
  • scale in and out TiProxy when health override API is unsupported

@ti-chi-bot ti-chi-bot Bot requested a review from howardlau1999 June 24, 2026 12:39
ti-chi-bot Bot commented Jun 24, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign liubog2008 for approval. For more information see the Code Review Process.
Please ensure that each of them provides their approval before proceeding.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@github-actions github-actions Bot added the v2 for operator v2 label Jun 24, 2026
@ti-chi-bot ti-chi-bot Bot added the size/XXL label Jun 24, 2026
codecov-commenter commented Jun 24, 2026

Codecov Report

❌ Patch coverage is 56.81818% with 114 lines in your changes missing coverage. Please review.
✅ Project coverage is 39.56%. Comparing base (28394a5) to head (051d879).
⚠️ Report is 1 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #6964      +/-   ##
==========================================
+ Coverage   39.53%   39.56%   +0.02%     
==========================================
  Files         430      433       +3     
  Lines       24378    24492     +114     
==========================================
+ Hits         9638     9690      +52     
- Misses      14740    14802      +62     
| Flag | Coverage Δ |
| unittest | 39.56% <56.81%> (+0.02%) ⬆️ |


@djshow832 djshow832 marked this pull request as draft June 25, 2026 03:51
@djshow832 djshow832 marked this pull request as ready for review June 30, 2026 11:54
@ti-chi-bot ti-chi-bot Bot requested a review from shonge June 30, 2026 11:54
@djshow832 djshow832 requested a review from liubog2008 July 3, 2026 03:00
@djshow832 djshow832 marked this pull request as draft July 3, 2026 08:52