Fix RPC hang on abrupt connection disconnect. #438

Merged
WeiXinChan merged 1 commit into oceanbase:master from suz-yang:fix_connection_disconnect on Jun 24, 2026
@suz-yang

Contributor

Notify pending invoke futures when the channel becomes inactive and return BOLT_SEND_FAILED for connection-closed responses, so in-flight requests fail immediately instead of waiting for RPC timeout. Skip suspect-server tracking when ObServerAddr is unset to avoid NPE in direct load reconnect paths.

Summary

Fix the issue where in-flight RPC requests hang until RPC timeout when the underlying connection is abruptly closed (network failure, server crash, etc.). Pending requests should fail immediately with a transport error instead of waiting for the timeout timer.

Solution Description

Root cause: When a connection drops unexpectedly, Netty only triggers channelInactive and does not go through Connection.close(). Bolt’s Connection.onClose() (which completes pending InvokeFutures) was therefore not called. Additionally, ObClientFuture and ObPacketFactory returned null from createConnectionClosedResponse, so BaseRemoting.invokeSync treated the result as a timeout (BOLT_TIMEOUT) rather than a connection-closed error.
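The failure mode described above can be modelled with plain JDK futures. This is a minimal sketch, not Bolt's or the client's actual code: `PendingInvokes`, `register`, `await`, and the nested `Response` class are hypothetical names, and the address passed to `onClose` is a placeholder. Only the two codes (-20003 and -20002) come from this PR. It shows that a pending future which nobody completes only ends when the wait times out, while one completed on close returns at once.

```java
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical model of a connection's pending-request table (not Bolt's real classes).
class PendingInvokes {
    static final int SEND_FAILED = -20003; // code this PR reports for "connection closed"
    static final int TIMEOUT = -20002;     // code this PR reports for RPC timeout

    static final class Response {
        final int code;
        final String message;
        Response(int code, String message) { this.code = code; this.message = message; }
    }

    private final Map<Integer, CompletableFuture<Response>> pending = new ConcurrentHashMap<>();

    CompletableFuture<Response> register(int requestId) {
        CompletableFuture<Response> future = new CompletableFuture<>();
        pending.put(requestId, future);
        return future;
    }

    // Models the fix: when the channel goes inactive, complete every waiter right away.
    void onClose(String addr) {
        for (Integer id : pending.keySet()) {
            CompletableFuture<Response> future = pending.remove(id);
            if (future != null) {
                future.complete(new Response(SEND_FAILED, "connection " + addr + " closed"));
            }
        }
    }

    // Synchronous wait. A future that nobody completes ends in the timeout branch.
    Response await(int requestId, CompletableFuture<Response> future, long timeoutMs) {
        try {
            return future.get(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            pending.remove(requestId);
            return new Response(TIMEOUT, "rpc timeout");
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e.getCause());
        }
    }

    public static void main(String[] args) {
        PendingInvokes conn = new PendingInvokes();
        CompletableFuture<Response> lost = conn.register(1);
        System.out.println(conn.await(1, lost, 100).code);      // nobody completes this one
        CompletableFuture<Response> notified = conn.register(2);
        conn.onClose("127.0.0.1:2882");                         // placeholder address
        System.out.println(conn.await(2, notified, 5000).code); // completed by onClose
    }
}
```

In this model the first wait runs for the full 100 ms before reporting the timeout code, whereas the second returns immediately with the connection-closed code even though its timeout is 5 seconds.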

Changes:

ObConnectionEventHandler — On channelInactive, call connection.onClose() so all pending invoke futures are completed immediately when the channel goes inactive.

ObClientFuture / ObPacketFactory — Implement createConnectionClosedResponse to return a transport error packet with code BOLT_SEND_FAILED and the message "connection {addr} closed", instead of returning null.

ObTable — Use ObConnectionEventHandler instead of the default ConnectionEventHandler.

Direct load reconnect NPE — Skip RouteTableRefresher.addIntoSuspectIPs when ObServerAddr is unset (direct load does not use route refresh). Add a null guard in addIntoSuspectIPs as a safety net.
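The null guard in the last item can be illustrated roughly as follows. This is a sketch under assumptions: `SuspectServers` is a hypothetical stand-in for the suspect-server bookkeeping, the address is reduced to a `String`, and the real `RouteTableRefresher.addIntoSuspectIPs` signature and storage may differ. The concurrent set is used here only because it rejects null elements, which makes the need for a guard visible in the model.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical, simplified stand-in for suspect-server tracking (not the client's real code).
class SuspectServers {
    // A ConcurrentHashMap-backed set throws NullPointerException on add(null).
    private final Set<String> suspects = ConcurrentHashMap.newKeySet();

    // Returns true only if the address was newly recorded.
    // A null address (for example, a connection with no server address set) is skipped.
    boolean addIntoSuspectIPs(String serverAddr) {
        if (serverAddr == null) {
            return false; // safety net: nothing to track, and no NPE
        }
        return suspects.add(serverAddr);
    }

    int size() {
        return suspects.size();
    }

    public static void main(String[] args) {
        SuspectServers tracker = new SuspectServers();
        System.out.println(tracker.addIntoSuspectIPs(null));       // skipped
        System.out.println(tracker.addIntoSuspectIPs("10.0.0.1")); // placeholder address
        System.out.println(tracker.size());
    }
}
```

Guarding inside the method as well as at the call site means a future caller that forgets the check still cannot trigger the exception.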

Expected behavior after fix: On disconnect, in-flight RPC fails within seconds with ObTableTransportException (transportCode: -20003, connection closed), not after RPC timeout (-20002). If reconnect succeeds on the next request, execution continues normally without throwing an exception.
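For callers who want to tell the two outcomes apart, the distinction can be expressed as a small check on the transport code. This is only an illustrative sketch: `TransportCodes` and `TransportFailure` are hypothetical names standing in for the client's exception type, and only the numeric codes are taken from the description above.

```java
// Hypothetical caller-side helper; TransportFailure is a stand-in, not the client's real exception.
class TransportCodes {
    static final int CONNECTION_CLOSED = -20003; // fast failure after this fix
    static final int RPC_TIMEOUT = -20002;       // what callers saw before the fix

    static final class TransportFailure extends RuntimeException {
        final int transportCode;

        TransportFailure(int transportCode, String message) {
            super(message);
            this.transportCode = transportCode;
        }
    }

    // Maps a transport code to a coarse category for logging or metrics.
    static String classify(TransportFailure e) {
        switch (e.transportCode) {
            case CONNECTION_CLOSED:
                return "connection-closed";
            case RPC_TIMEOUT:
                return "rpc-timeout";
            default:
                return "other";
        }
    }

    public static void main(String[] args) {
        System.out.println(classify(new TransportFailure(CONNECTION_CLOSED, "connection closed")));
        System.out.println(classify(new TransportFailure(RPC_TIMEOUT, "rpc timeout")));
    }
}
```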

@suz-yang force-pushed the fix_connection_disconnect branch from 3685cd0 to 8867c32 on June 24, 2026 06:16
}

@Override
public void channelInactive(ChannelHandlerContext ctx) throws Exception {

Contributor

The indentation here is 2 spaces, while most other files in the project use 4 spaces, so the style is inconsistent.

Contributor Author

PTAL

@suz-yang force-pushed the fix_connection_disconnect branch from 8867c32 to d6a3457 on June 24, 2026 07:20
@WeiXinChan

Contributor

LGTM

@WeiXinChan merged commit ee2b006 into oceanbase:master on Jun 24, 2026
3 checks passed