Add RequestSendType.ROUND_ROBIN for per-request IP round-robin#2202
Add RequestSendType.ROUND_ROBIN for per-request IP round-robin#2202pavel-ptashyts wants to merge 13 commits into
Conversation
Introduce a requestSendType config option (DEFAULT | ROUND_ROBIN) on DefaultAsyncHttpClientConfig. In ROUND_ROBIN mode, when a host resolves to several IPs, requests are spread strictly per request across all of them: - the resolved address list is rotated per host so each request targets the next IP first, keeping the remaining IPs for TCP failover; - connection reuse is pinned per IP via an IP-aware partition key, so pooled HTTP/1.1 connections and multiplexed HTTP/2 connections are kept per IP; - on failover the key is re-pinned to the IP actually connected to, so reuse stays correct; - the maxConnectionsPerHost semaphore stays per host (the permit is taken before the target IP is known). getRequestSendType() is a default interface method returning DEFAULT, so existing AsyncHttpClientConfig implementations keep compiling. DEFAULT mode behavior is unchanged.
|
This is great - this is essentially client-side load balancing, Let me think through. |
|
Good afternoon. I would be very grateful if you could review my PR, and I will gladly try to improve this solution if you find any shortcomings or issues in it. Client-side load balancing is actually something I sometimes critically miss in the client, as solving the load balancing problem outside the client either works unreliably or requires configuring additional load balancers, which increases system complexity. |
|
Agree - We will get this PR merged no worries. :) |
hyperxpro
left a comment
There was a problem hiding this comment.
Round 1 done - good work indeed!
| @@ -0,0 +1,77 @@ | |||
| /* | |||
| * Copyright (c) 2024 AsyncHttpClient Project. All rights reserved. | |||
There was a problem hiding this comment.
| * Copyright (c) 2024 AsyncHttpClient Project. All rights reserved. | |
| * Copyright (c) 2026 AsyncHttpClient Project. All rights reserved. |
| return address.getHostString(); | ||
| }); | ||
|
|
||
| private final ConcurrentHashMap<String, AtomicInteger> counters = new ConcurrentHashMap<>(); |
There was a problem hiding this comment.
counters is keyed by host and never evicted - unbounded growth for a client that touches many distinct hosts (crawler/gateway). Entries are tiny, but it's a structure with no equivalent in DEFAULT mode. Please bound/LRU-cap it.
There was a problem hiding this comment.
For full LRU support, it would be a good idea to integrate the Caffeine library. However, I don’t want to do that without your approval.
So for now, I implemented an approach where the maximum possible size is 4096, and if an overflow occurs, 10% of the oldest data is cleared.
I think that in 99% of cases, the cleanup will never even be triggered, since 4096 is a large value. And even if someone does need more, the delay caused by cleanup in a single request should not significantly affect the overall performance of the library.
Should we keep my solution, or would it be better to use the Caffeine library?
| * @return the same list instance when there is nothing to rotate (size {@code <= 1}), otherwise | ||
| * a new list whose first element is the round-robin-selected address | ||
| */ | ||
| public List<InetSocketAddress> rotate(String host, List<InetSocketAddress> resolved) { |
There was a problem hiding this comment.
The selector just rotates through the full resolved list and never learns which IPs are actually healthy. So if one of a host's IPs is down for a while, a steady fraction of requests will keep picking it. Each of those misses the pool, since the working connection is parked under a different IP's key after the repin, opens a fresh connection to the dead IP, fails, falls over to a good IP, and repins there. That repeats for as long as the IP stays down, and the pool ends up lopsided around whichever live IP keeps absorbing the failovers.
Nothing breaks here: requests still succeed and the max-connections limit still holds, so there's no leak. But during a partial outage you're paying a wasted connection attempt, worst case a full connect timeout of latency, on a slice of traffic indefinitely. Probably we can do some kind of health check and it should be configurable; normal TCP healthcheck or HTTP healthcheck?
Also maybe we should be keeping a short list of recently-failed IPs and skipping them for a little while would do it - like a temporary blacklist.
There was a problem hiding this comment.
Thanks for the detailed write-up — agreed on the mechanics. During a partial outage, round-robin will keep handing a temporarily-dead IP its share of traffic, paying a connect failure (worst case a full connect-timeout) before failing over to a healthy IP. As you said, nothing breaks: requests still succeed and the connection limits hold — it's wasted latency on a slice of traffic while the IP is down.
My take is that liveness of resolved IPs is fundamentally a DNS/resolver concern, not something the round-robin selector should own. The current behavior is consistent with DEFAULT mode: if DNS returns a dead IP, AHC doesn't health-check it there either — it just tries it and fails over. Round-robin doesn't change that in kind, only in frequency (it touches every IP rather than pinning the first). The cleanest place to keep dead IPs out of rotation is the resolver/DNS layer (short TTLs, the LB pulling unhealthy backends, or a health-aware resolver) so liveness logic lives in one place instead of being reimplemented inside the client.
For this PR I've documented the limitation on RequestSendType.ROUND_ROBIN. If you'd like client-side handling, I see two increments we could do (here or as a follow-up):
- Temporary failed-IP blacklist — on a failed connect attempt, mark that IP as "cooling down" for a short, configurable window and have the selector skip it until the window expires. Cheap, self-contained, no probing, recovers automatically.
- Active health checks — configurable TCP or HTTP probes per resolved IP with intervals/thresholds, removing unhealthy IPs from rotation proactively. More powerful, but a much larger feature (new config surface, a prober lifecycle, failure semantics).
My preference is to defer both to a follow-up issue and let DNS own liveness for now, but I'm happy to implement (1) in this PR if you want it included. Do you want either of these as part of this PR, or is documenting the limitation enough for now?
| } | ||
|
|
||
| List<InetSocketAddress> ordered = new ArrayList<>(resolved); | ||
| ordered.sort(STABLE_ORDER); |
There was a problem hiding this comment.
This copies the list and re-sorts it on every call. The list is short so it doesn't matter much, but for a busy multi-IP host you're doing that allocation and sort on every request. If you wanted to avoid it, you could cache the sorted order for a given set of resolved addresses and reuse it.
There was a problem hiding this comment.
Good point on the per-request allocation + sort. Rather than cache the sorted order, I've removed the client-side sort entirely and now rotate over the resolver's address list as-is. The ordering of a host's IPs is really the resolver's responsibility, so baking a sort into the round-robin selector was the wrong layer — it both duplicated work on every request and hid the fact that consistent rotation depends on a consistent resolution order.
I've updated the RequestSendType.ROUND_ROBIN javadoc to make the contract explicit:
- This mode takes the address order straight from the configured InetNameResolver and does not re-sort it, so for the rotation to map consistently across requests the resolver must return addresses in a stable order and not reorder them between resolutions — e.g. DnsNameResolver.
- It should not be paired with a resolver that intentionally rotates its results, such as RoundRobinInetAddressResolver — that one is meant for DEFAULT mode, where the client always targets the first address and the resolver provides the spreading instead.
Net effect: less per-request work (no copy+sort; we only allocate a rotated list when the selected index isn't already first), and the responsibility for IP ordering sits in the resolver where it belongs.
| /** | ||
| * Picks, per host and per request, which resolved IP a new connection should target first when | ||
| * {@link org.asynchttpclient.RequestSendType#ROUND_ROBIN} is enabled. | ||
| * | ||
| * <p>{@link #rotate(String, List)} returns the resolved addresses re-ordered so that the | ||
| * round-robin-selected address comes first; the remaining addresses follow (in a stable order) so | ||
| * the connector can still fail over to them. The addresses are sorted into a stable order before | ||
| * rotation so that the per-host counter maps consistently to the same address across requests, | ||
| * regardless of the order the resolver returns them in. | ||
| * | ||
| * <p>Thread-safe. | ||
| */ |
There was a problem hiding this comment.
This needs to be aligned after the below comments are addressed.
| return false; | ||
| } | ||
| Uri uri = request.getUri(); | ||
| return proxyServer == null || proxyServer.isIgnoredForHost(uri.getHost()) || !proxyServer.getProxyType().isHttp(); |
There was a problem hiding this comment.
This lets SOCKS proxies use round-robin too, which means we resolve the host ourselves, rotate the IPs, pin the pool per IP, and hand the chosen IP to the SOCKS server. With SOCKS5 people often expect the proxy to do the resolving and routing, so the combination is a bit odd. It isn't a regression, since we already resolve client-side for SOCKS today, but I'd either leave SOCKS out the way HTTP proxies are left out, or say plainly in the docs that round-robin resolves on the client even for SOCKS.
There was a problem hiding this comment.
Fixed. I left round robin logic only for direct connect without proxy. If need in the future this logic can be updated and extended
| Uri uri = request.getUri(); | ||
| String host = uri.getHost(); | ||
|
|
||
| resolveAddresses(request, proxyServer, newFuture, asyncHandler).addListener(new SimpleFutureListener<List<InetSocketAddress>>() { |
There was a problem hiding this comment.
Worth a short comment here so the ordering is obvious to the next person: in this mode we resolve before checking the pool and before taking the semaphore, so every eligible request resolves first, even ones that immediately reuse a pooled connection and even single-IP hosts.
On a pooled hit that also means the request timeout gets scheduled twice; the second call cancels the first so there's no leak, it's just a little redundant. It's all inherent to doing this per request and cheap with a caching resolver, but it's easy to miss reading the code.
There was a problem hiding this comment.
Good catch on both the ordering and the double timeout-schedule. I've added a comment at the resolveAddresses(...) call in sendRequestRoundRobin spelling out the non-obvious bits:
- Round-robin resolves up front — before the pool check and before the per-host semaphore — so every eligible request resolves first, including pooled-hit reuse and single-IP hosts. With a caching resolver that's cheap, but it's easy to miss when reading the code.
- On a pooled hit the request timeout is scheduled twice — once in resolveAddresses and again in sendRequestWithOpenChannel — and the second setTimeoutsHolder cancels the first, so it's redundant work, not a leak.
I kept it to a comment for now rather than suppressing the redundant schedule: scheduleRequestTimeout is on the shared resolve path used by every send mode, the extra schedule is harmless, and special-casing it risks a regression for negligible gain. That said — if you have ideas on how to do this better (e.g. a clean way to resolve-then-pool without the redundant schedule, or restructuring the ordering), I'm all ears.
| } | ||
| try { | ||
| return RequestSendType.valueOf(value.trim().toUpperCase(Locale.ROOT)); | ||
| } catch (IllegalArgumentException e) { |
There was a problem hiding this comment.
Quietly falling back to DEFAULT when the value doesn't parse will hide typos in someone's config. A warning log that includes the bad value would make that a lot easier to spot.
| @@ -0,0 +1,68 @@ | |||
| /* | |||
| * Copyright (c) 2024 AsyncHttpClient Project. All rights reserved. | |||
There was a problem hiding this comment.
| * Copyright (c) 2024 AsyncHttpClient Project. All rights reserved. | |
| * Copyright (c) 2026 AsyncHttpClient Project. All rights reserved. |
| @@ -0,0 +1,51 @@ | |||
| /* | |||
| * Copyright (c) 2024 AsyncHttpClient Project. All rights reserved. | |||
There was a problem hiding this comment.
| * Copyright (c) 2024 AsyncHttpClient Project. All rights reserved. | |
| * Copyright (c) 2026 AsyncHttpClient Project. All rights reserved. |
…and update Copyright to 2026
…ygemdev/async-http-client into requests-round-robin-per-host
Add RequestSendType.ROUND_ROBIN for per-request IP round-robin
Motivation
When a host resolves to several IP addresses (e.g. a service behind DNS that returns multiple A/AAAA records or multiple backend instances), AHC today effectively pins
all traffic to a single IP:
only).
So with keep-alive enabled, the first reachable IP receives essentially all requests, and there is no way to spread client load across a multi-IP host's addresses. This
change adds opt-in client-side round-robin so requests are distributed evenly across all of a host's IPs.
What changed
A new config option requestSendType on DefaultAsyncHttpClientConfig:
asyncHttpClient(config().setRequestSendType(RequestSendType.ROUND_ROBIN));
How ROUND_ROBIN works
the list so the connector can still fail over.
rather than per host. HTTP/2 reuse is governed by the same partition key, so the spread works there too.
Backward compatibility — does not change existing behavior
send/poll/connect/pool path is unchanged.
interface keep compiling (source- and binary-compatible).
explicit address, and proxied requests (the proxy host is resolved, not the target).
Testing
defaults / builder round-trip.