
bridge: add guest-side reconnect loop for live migration #2698

Open

shreyanshjain7174 wants to merge 1 commit into microsoft:main from shreyanshjain7174:bridge-reconnect-v2

Conversation


@shreyanshjain7174 commented Apr 21, 2026

Fixes #2669

Problem

During live migration the vsock connection between the host and the GCS (Guest Compute Service) breaks when the UVM moves to the destination node. The bridge inside the GCS drops and cannot recover — ListenAndServe returns with an I/O error, and the GCS has no way to re-establish communication with the new host.

What this does

Wraps the bridge serve call in a reconnect loop in cmd/gcs/main.go. When the vsock connection drops, the GCS re-dials the host and calls ListenAndServe again on the same Bridge. ListenAndServe already creates fresh channels (responseChan, quitChan) on each call, so the Bridge can be reused across reconnections without resetting any state.

The Host (containers, processes, cgroups) persists across reconnections since it lives outside the Bridge.
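The reconnect loop described above can be sketched roughly as follows. This is a simplified stand-in, not the actual cmd/gcs/main.go: the lowercase `bridge` type and `listenAndServe` simulate `Bridge.ListenAndServe`, and the re-dial step is elided.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// bridge stands in for bridge.Bridge. listenAndServe simulates
// ListenAndServe returning an I/O error twice (the migration window)
// before a clean shutdown. In the real GCS, fresh channels are created
// on each call, so the same value can be reused across iterations.
type bridge struct{ serves int }

func (b *bridge) listenAndServe() error {
	b.serves++
	if b.serves < 3 {
		return errors.New("vsock connection dropped") // simulated migration
	}
	return nil // simulated clean shutdown request
}

func main() {
	b := &bridge{} // Bridge (and Mux) created once, outside the loop
	const retryInterval = 100 * time.Millisecond
	for {
		// In the real code, this is where the GCS re-dials the host vsock
		// before handing the new connection to ListenAndServe.
		err := b.listenAndServe()
		if err == nil {
			break // shutdown was requested; exit cleanly
		}
		fmt.Printf("bridge dropped (%v), retrying in %v\n", err, retryInterval)
		time.Sleep(retryInterval)
	}
	fmt.Println("serves:", b.serves)
}
```

The key structural point is that the loop owns reconnection policy while the bridge value persists, mirroring the PR's "Bridge+Mux created once outside the loop" design.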

A Publisher is added so that container wait goroutines — spawned during CreateContainer and blocked on c.Wait() — can route exit notifications through whichever bridge is currently active. During the reconnect gap the notification is dropped, which is safe because the host-side shim re-queries container state after reconnecting.
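A minimal sketch of the Publisher pattern described above, assuming a simplified notification type; the `notifier` interface, `fakeBridge`, and method names other than `Publish` are illustrative, not the actual internal/guest/bridge API.

```go
package main

import (
	"fmt"
	"sync"
)

// notification is a stand-in for the real GCS notification payload.
type notification struct{ containerID string }

// notifier abstracts "whichever bridge is currently active".
type notifier interface{ publishNotification(n *notification) }

// Publisher routes notifications through the active bridge. During the
// reconnect gap no bridge is set, and notifications are dropped; that is
// safe because the host-side shim re-queries state after reconnecting.
type Publisher struct {
	mu sync.Mutex
	b  notifier
}

// SetBridge swaps the active bridge reference under the mutex.
func (p *Publisher) SetBridge(b notifier) {
	p.mu.Lock()
	defer p.mu.Unlock()
	p.b = b
}

// Publish delivers n to the active bridge, or drops it with a warning.
func (p *Publisher) Publish(n *notification) {
	p.mu.Lock()
	defer p.mu.Unlock()
	if p.b == nil {
		fmt.Printf("dropping notification for %s: no active bridge\n", n.containerID)
		return
	}
	p.b.publishNotification(n)
}

// fakeBridge records what it received, standing in for a live bridge.
type fakeBridge struct{ got []string }

func (f *fakeBridge) publishNotification(n *notification) {
	f.got = append(f.got, n.containerID)
}

func main() {
	p := &Publisher{}
	p.Publish(&notification{"c1"}) // dropped: no bridge connected yet
	fb := &fakeBridge{}
	p.SetBridge(fb)
	p.Publish(&notification{"c2"}) // delivered through the active bridge
	fmt.Println(fb.got)
}
```

A container wait goroutine holds the Publisher, not a Bridge, so it survives any number of bridge swaps.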

Design

No mutating RPCs (CreateContainer, ExecProcess, etc.) are in-flight when migration starts — the LM orchestrator ensures all container setup is complete before initiating migration. The only long-lived handler goroutine during migration is waitOnProcessV2, which is blocked on select { case exitCode := <-exitCodeChan } and doesn't touch responseChan until the process exits (through Publisher). This means the Bridge can be safely reused across ListenAndServe calls without risk of handler goroutines racing on channel state.
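The shape of the long-lived wait goroutine described above can be sketched like this; `waitOnProcess` and the `publish` callback are illustrative stand-ins for waitOnProcessV2 and Publisher.Publish, not the real signatures.

```go
package main

import "fmt"

// waitOnProcess blocks on exitCodeChan for the life of the process and
// touches nothing bridge-related until the exit actually happens, so it
// is safe across ListenAndServe calls on the same Bridge. The eventual
// notification goes out through the publish callback (Publisher in the PR).
func waitOnProcess(exitCodeChan <-chan int, publish func(int)) {
	exitCode := <-exitCodeChan // blocks, possibly across a migration
	publish(exitCode)          // routed through whichever bridge is active
}

func main() {
	exitCodeChan := make(chan int, 1)
	done := make(chan int, 1)
	go waitOnProcess(exitCodeChan, func(code int) { done <- code })
	// ...migration, reconnect, and a new ListenAndServe call could all
	// happen here without the wait goroutine noticing...
	exitCodeChan <- 0 // process finally exits
	fmt.Println("exit code:", <-done)
}
```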

During live migration the VM is frozen and only wakes up when the destination host shim is ready, so the vsock port should be immediately available. The reconnect loop uses a tight 100ms retry interval rather than exponential backoff.

The defer ordering in ListenAndServe is fixed so quitChan closes before responseChan becomes invalid, and responseChan is buffered to prevent PublishNotification from blocking on a dead bridge.
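The quit-before-send guard described above can be sketched as a free function; the real PublishNotification is a Bridge method, and these channel names merely mirror the fields it uses.

```go
package main

import "fmt"

// publishNotification sketches the priority select guard: quitChan is
// checked first, so a notification racing with shutdown is dropped
// rather than sent toward a dead bridge. Because responseChan is
// buffered, the send on a live bridge does not block the caller.
func publishNotification(responseChan chan string, quitChan chan struct{}, msg string) bool {
	select {
	case <-quitChan:
		return false // shutdown already requested: drop the notification
	default:
	}
	select {
	case responseChan <- msg:
		return true
	case <-quitChan:
		return false // shutdown raced with the send: drop it
	}
}

func main() {
	responseChan := make(chan string, 1) // buffered, as in the PR
	quitChan := make(chan struct{})

	fmt.Println(publishNotification(responseChan, quitChan, "exit:c1")) // true
	close(quitChan)                                                    // shutdown begins
	fmt.Println(publishNotification(responseChan, quitChan, "exit:c2")) // false
}
```

Closing quitChan before responseChan becomes invalid is what makes the guard sound: any publisher observing the close gives up before touching the stale channel.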

Changes

File / Change
  • cmd/gcs/main.go: Reconnect loop with 100ms retry; Bridge+Mux created once outside the loop
  • internal/guest/bridge/bridge.go: Publisher field, ShutdownRequested(), fixed defer ordering, buffered responseChan, priority select guard in PublishNotification
  • internal/guest/bridge/bridge_v2.go: Container wait goroutine uses Publisher.Publish()
  • internal/guest/bridge/publisher.go: Mutex-guarded bridge reference swap (40 lines)
  • internal/guest/bridge/publisher_test.go: Tests for nil-bridge drop and bridge-set-publish

Testing

Tested on a two-node Hyper-V live migration setup using the TwoNodeInfra test module:

  • Invoke-FullLmTestCycle -Verbose — deploys LM agents, creates a UVM with an LCOW container on Node_1, migrates to Node_2, verifies 100% completion on both nodes. Container lcow-test migrated with pod sandbox intact.
  • Post-migration crictl exec — created an LCOW pod with our custom GCS (deployed via rootfs.vhd), started a container, exec'd cat /tmp/test.txt to verify bridge communication works after reconnect.
  • go build, go vet, gofmt clean.

@shreyanshjain7174 shreyanshjain7174 marked this pull request as ready for review April 21, 2026 17:28
@shreyanshjain7174 shreyanshjain7174 requested a review from a team as a code owner April 21, 2026 17:28
Comment thread cmd/gcs/main.go
}
const commandPort uint32 = 0x40000000

// Reconnect loop: on each iteration we create a fresh bridge+mux, dial the
Contributor
In general, an exponential backoff is the right answer. But in this case, the VM is frozen in time, and only wakes up when the host shim is ready. The connection should be immediately available. I think I'd rather see this as a very tight loop, personally.

Contributor Author
Agreed — the VM is frozen and wakes up with the host ready, so the vsock should be available right away. I'll switch to a tight fixed-interval retry (e.g. 100ms) instead of exponential backoff.

Comment thread cmd/gcs/main.go
logrus.Info("bridge connected, serving")
bo.Reset()

serveErr := b.ListenAndServe(bridgeIn, bridgeOut)
Contributor
Why can't you just reset the isQuitPending flag and call ListenAndServe again? Wouldn't that "just work"?

Contributor Author
It almost works, but there's a subtle issue with handler goroutines. The handler dispatch at line 356 spawns go func(r *Request) { ... b.responseChan <- br }(req) — this goroutine captures b and sends to b.responseChan, which is a struct field. If a handler is still in-flight when ListenAndServe returns (say a slow CreateContainer or ExecProcess), and we call ListenAndServe again on the same bridge, the new call overwrites b.responseChan = make(chan ...) while the old handler is about to send to it. That's a data race on the struct field — the old goroutine reads b.responseChan concurrently with the new ListenAndServe writing it.

In practice this window is very small (handlers finish fast), so it wouldn't show up in normal LM testing. But under load — say a CreateContainer request arrives right as the vsock drops during migration — the handler goroutine could be mid-flight when we re-enter ListenAndServe.

Recreating Bridge means the old handlers hold a reference to the old (now-dead) bridge with its own channels, and the new bridge has completely separate state. No shared mutable field.

That said, if you think the simplicity of reuse outweighs this edge case, we could make it work by not closing responseChan in the defers and adding a short drain period before re-entering. Happy to go either way.

Contributor Author
You're right — I looked at the host side and no mutating RPCs (CreateContainer, ExecProcess, etc.) are in-flight when migration starts. The only long-lived handler goroutine during migration is waitOnProcessV2, which is blocked on select { case exitCode := <-exitCodeChan } — it doesn't touch responseChan until the process actually exits, and by then the notification goes through Publisher.

Simplified to reuse the same Bridge. ListenAndServe already creates fresh channels on each call, so re-entering it on the same struct works. Also switched from exponential backoff to a tight 100ms retry as discussed. Pushed and tested with LM — both nodes 100%.

During live migration the vsock connection between the host and the GCS
breaks when the VM moves to the destination node. The GCS bridge drops
and cannot recover, leaving the guest unable to communicate with the
new host.

This adds a reconnect loop in cmd/gcs/main.go that re-dials the bridge
after a connection loss. The Bridge and Mux are created once outside the
loop and reused; ListenAndServe creates fresh channels on each call, and
the Host state (containers, processes) persists across reconnections.

A Publisher abstraction is added to bridge/publisher.go so that container
wait goroutines spawned during CreateContainer can route exit notifications
through the current bridge. When the bridge is down between reconnect
iterations, notifications are dropped with a warning — the host-side shim
re-queries container state after reconnecting.

The defer ordering in ListenAndServe is fixed so that quitChan closes
before responseChan becomes invalid, and responseChan is buffered to
prevent PublishNotification from panicking on a dead bridge.

Tested with Invoke-FullLmTestCycle on a two-node Hyper-V live migration
setup (Node_1 -> Node_2). Migration completes at 100% and container
exec works on the destination node after migration.

Signed-off-by: Shreyansh Sancheti <shsancheti@microsoft.com>


Development

Successfully merging this pull request may close these issues.

Adds guest-side GCS changes for V2 shim support
