Model-, platform-, and payload-agnostic load testing and capacity planning for inference endpoints.
- Guaranteed client RPS – Customizes and wraps Locust to pace requests from the client so a target client requests-per-second (RPS) is sustained.
- Client-side bottleneck diagnostics – Detects when the client sending requests is the limiting factor.
- Any inference endpoint – Traditional ML, GenAI, or any other HTTPS endpoint.
- Defined in plain Python – Endpoint characteristics — invocation logic, payload generation, endpoint creation — are expressed as simple Python function definitions.
- Configurable server metrics – Latches to configurable server metrics to correlate hardware utilization (or any available metric) — to plan capacity management across multiple endpoints, autoscaling, and cost extrapolations.
- aiperf integration – Integrates with NVIDIA
aiperffor token usage metrics. - Comparison visualizations – Basic bar-plot reports for comparing configurations side by side.
- Clone the repo
git clone https://github.com/aws-samples/sample-InferenceBenchmarker.git && cd sample-InferenceBenchmarker
- Run the one-time setup
./benchmark --init
Use an existing virtual env? Enter its path, or press return to create a dedicated .venvIB: /path/to/your/venv → installs dependencies into it <return> → creates .venvIB and installs into it - Confirm it resolves
(Use
benchmark
./benchmarkfrom the repo root if you skipped adding to PATH.)
InferenceBenchmarker takes your invocation logic defined in Python functions in a file.
-
invoke_factory(endpoint_name)runs once per worker process its body is shared by all users on that worker, so put reusable code there. It returns a callableinvoke(payload), which runs per request.from typing import Any, Callable Payload = Any # whatever your endpoint accepts def invoke_factory(endpoint_name: str | None = None) -> Callable[[Payload], None]: # WORKER-LEVEL setup: this body runs ONCE per worker process and is shared by every # user on that worker. Put expensive, reusable client state here — e.g. the boto3 # client and its connection pool — so it isn't rebuilt per request. import json, boto3 client = boto3.client('sagemaker-runtime') def invoke(payload: Payload) -> None: # PER-REQUEST: called once for each request a user fires. resp = client.invoke_endpoint( EndpointName=endpoint_name or 'my-endpoint', ContentType='application/json', Body=json.dumps(payload).encode('utf-8'), ) resp['Body'].read() return invoke # the per-request callable
-
payload_factory()is called once per worker and has two modes.pre_computed=True— payload pre-built, at request time a payload is just picked from the list, so payload-build cost is not included. Use this mode if payload generation logic is compute heavy – causing CPU to be a client bottleneck. Might cause higher memory usage during benchmarking.def payload_factory() -> dict[str, bool | list[Payload]]: return { 'pre_computed': True, 'input': [{'instances': [[...]]}], # list[Payload] }
Requests cycle through the pre-computed inputs in order and wrap back to the start, so the list is never exhausted or cut off — fire more requests than there are inputs and it simply loops.
pre_computed=False— payload built per user request by returning a Python callable that is called on each inference request. payload-build cost is included. Use this mode to introduce dynamism in payload_generation and if maintaining a pre-computed payload is heavy on memory causing memory to be a client bottleneck. Might cause higher compute usage during benchmarking:def payload_factory() -> dict[str, bool | Callable[[], Payload]]: return { 'pre_computed': False, 'input': lambda: {'instances': [[...]]}, # Callable[[], Payload] }
benchmark \
--factories-file factories/sagemakerai_realtime/factories_cnn.py \
--endpoint-config server_capacity/server_metrics_configs/sagemakerai_realtime.json \
--client-rps 10 \
--obs-time 60 \
--workers 5RESULTS – InferenceBenchmarker
------------------------------
Total requests fired: 600
Duration: 59.9s
Server RPS: 10.0 req/s
Total requests: 599
Success rate: 99.8% (✓ passed, 95% target)
Detailed reports land under .tmp/<timestamp>_benchmark/.
--client-rps sets the rate (how many requests per second are fired). --obs-time and --num-requests
decide how long the wave runs — pass at least one:
| You pass | Wave ends when |
|---|---|
--obs-time S only |
S seconds elapse |
--num-requests N only |
N requests have fired & completed |
| both | S seconds elapse or N requests complete — whichever first |
Pass --endpoint-config <file.json> — currently supporting CloudWatch only. Add a lag if server publishes with a delay after wave ends. Metrics and Statistics are paired by position — the i-th metric uses the i-th list of statistics — and each (metric, statistic) is queried under the block's Namespace, Dimensions, Period, and Lag.
[
{
"stream": "cloudwatch",
"Namespace": "/aws/sagemaker/Endpoints",
"Dimensions": [{"Name": "EndpointName", "Value": "my-endpoint"},
{"Name": "VariantName", "Value": "AllTraffic"}],
"Period": 60,
"Lag": 30,
"Metrics": ["CPUUtilization", "MemoryUtilization", "GPUUtilization", "GPUMemoryUtilization"],
"Statistics": [["Average","Maximum"], ["Average","Maximum"], ["Average","Maximum"], ["Average","Maximum"]]
}
]The metrics for the wave window print after the wave results:
----------------------------------------
⏳ Fetching metrics from stream: cloudwatch (waiting 90s (longest lag: 30 + period: 60) for propagation)...
CPUUtilization: Average=312.4, Maximum=394.1
MemoryUtilization: Average=18.7, Maximum=21.3
GPUUtilization: Average=82.5, Maximum=97.0
GPUMemoryUtilization: Average=44.2, Maximum=51.8
benchmark --factories-file factories/sagemakerai_realtime/factories_llm_textgeneration.py \
--client-rps 10 \
--obs-time 60 \
--url https://my-endpoint/v1/chat/completions --api-key "$KEY" \
--aiperf--aiperfruns the Locust wave first, pauses to confirm the server is at a baseline (purposed for utilization metrics), then runs aiperf.--aiperf-onlyskips the InferenceBenchmarker wave via locust and runs aiperf directly.--aiperf-args '{"warmup-count": 50, "streaming": false}'overrides or adds any aiperf flag.
aiperf's input JSONL is auto-generated from the same payload_factory; to skip generation
and supply your own, pass it via --aiperf-args '{"input-file": "/path/to/inputs.jsonl"}'.
RESULTS — aiperf
----------------
Total requests fired: 599
Duration: 59.8s
Server RPS: 0.9 req/s
Total requests: 53
Success rate: 100.0% (✓ passed, 95% target)
Detailed token metrics land under .tmp/<timestamp>_benchmark/aiperf/.
benchmark --plot .tmp/<run1> .tmp/<run2>Label runs and attach hover info with --plot-metadata — a JSON object keyed by run-dir
basename, passed either inline or as a path to a .json file. For each run, legend renames
it in the shared legend; every other key/value is shown as a hover line on that run's bars:
benchmark --plot .tmp/<run1> .tmp/<run2> .tmp/<run3> \
--plot-metadata visualization/hover_configs/example.jsonThe sample plot above is rendered with this metadata.
--factories-file FILE Python file exposing invoke_factory / payload_factory (required for a wave)
--endpoint-config FILE enables server telemetry, purposed for hardware utilization
--client-rps R Target requests per second R to send from client
--obs-time S Run a wave with client-rps for S seconds
--num-requests N Run a wave with N requests, behavior with --obs-time refer Bounding a wave section
--workers N N Locust worker processes (default 1; ~1 per available core is recommend)
--success-threshold F Min acceptable success rate F, 0-1 (default 0.95)
--sample-client-hw Record client CPU/memory during the wave
--port P Locust primary worker port (default 5557)
--locust-file FILE Use a self-contained Locust file instead of factories (debugging)
--debug Verbose tracing (debugging)
--aiperf Run the wave, pause, then run aiperf (needs --url, --api-key)
--aiperf-only Skip the wave, run aiperf directly (needs --url, --api-key)
--url URL Endpoint URL for aiperf
--api-key KEY API key for aiperf URL
--aiperf-args JSON Override/add `aiperf profile` flags; e.g. '{"model": "Qwen/Qwen2.5-0.5B"}' to estimate token counts with that HF model's tokenizer
--plot DIR [DIR ...] Build a comparison report from existing run dirs (no wave); e.g. --plot <dir1> <dir2>
--plot-output-dir DIR Output dir for the report (default: first --plot dir; e.g. --plot <dir1> <dir2> -> <dir1>)
--plot-fields JSON Restrict plotted metrics per source; e.g. '{"locust": ["Latency (ms)"], "aiperf": ["Server RPS"]}'
--plot-metadata JSON|FILE Per-run legend rename + hover info, keyed by dir basename; inline JSON or a .json path (see Bar plots)
InferenceBenchmarker detects client bottleneck and alerts you (when you run the Locust wave) — after every wave it scans the Locust logs for CPU / heartbeat saturation and prints a warning if the client was overloaded:
⚠️ CLIENT BOTTLENECK detected in locust executions, test results might be unstable. Monitor client hardware. Pass --sample-client-hw to have InferenceBenchmarker benchmark client usage. Use diagnostic tools to find worker and rps saturation—worker_saturation.py/rps_saturation.py. Try pre-computed inputs in payload_factory if payload computation is a bottleneck. Use a client with higher cores and/or memory.
A simple manual check: correlate requests fired with the wave duration. If requests fired / duration falls short of your --client-rps, the client couldn't keep up — treat the run as
client-limited. --sample-client-hw records client CPU/memory during the wave to confirm. Or use own client telemetry tools to co-relate.
Diagnostic tools help you find where the client saturates:
find_worker_saturation(factories_file)— the max requests a single Locust worker can fire per second.find_rps_saturation(factories_file, saturation_users)— how many workers to run before total requests fired plateaus (adding workers stops helping). This is the client's ceiling: the--workerssetting beyond which you need a bigger / additional load-gen host.find_file_descriptors_limit()— reports the client's file-descriptor limits, which cap number of requests(client_rps).
See client_capacity/README.md for usage.
- endpoint_factory – Add endpoint creation code in factories — create a latch for Automatic RPS. tracking issues
- Hydrate w Examples – EKS, Hyperpod, EC2, OCP(on-prem) etc. examples. tracking issues
- Interactive CLI – Add traces while running benchmarks in the current dry benchmark tool. tracking issues
- Automatic RPS – Automate trial and error server rps supported at success threshold when --endpoint-config for hardware telemetry provided. tracking issue
- Plot metadata – Provide a JSON (inline or file) to set per-run legend names and hover info in plots.
See CONTRIBUTING for more information.
This library is licensed under the MIT-0 License. See the LICENSE file.