InferenceBenchmarker

Model-, platform-, and payload-agnostic load testing and capacity planning for inference endpoints.

Guaranteed client RPS – Customizes and wraps Locust to pace requests from the client so a target client requests-per-second (RPS) is sustained.
Client-side bottleneck diagnostics – Detects when the client sending requests is the limiting factor.
Any inference endpoint – Traditional ML, GenAI, or any other HTTPS endpoint.
Defined in plain Python – Endpoint characteristics — invocation logic, payload generation, endpoint creation — are expressed as simple Python function definitions.
Configurable server metrics – Hooks into configurable server metrics to correlate load with hardware utilization (or any other available metric), supporting capacity planning across multiple endpoints, autoscaling, and cost extrapolation.
aiperf integration – Integrates with NVIDIA aiperf for token usage metrics.
Comparison visualizations – Basic bar-plot reports for comparing configurations side by side.

Built With

Usage

Installation

Clone the repo

git clone https://github.com/aws-samples/sample-InferenceBenchmarker.git && cd sample-InferenceBenchmarker

Run the one-time setup

./benchmark --init

Use an existing virtual env? Enter its path, or press return to create a dedicated .venvIB:
  /path/to/your/venv   → installs dependencies into it
  <return>             → creates .venvIB and installs into it

Confirm it resolves
```
benchmark
```
(Use ./benchmark from the repo root if you skipped adding to PATH.)

1. Describe your endpoint

InferenceBenchmarker reads your invocation logic from Python functions defined in a single file, passed via --factories-file.

invoke_factory(endpoint_name) runs once per worker process. Its body is shared by all users on that worker, so put reusable setup there. It returns a callable, invoke(payload), which runs once per request.

from typing import Any, Callable

Payload = Any   # whatever your endpoint accepts

def invoke_factory(endpoint_name: str | None = None) -> Callable[[Payload], None]:
    # WORKER-LEVEL setup: this body runs ONCE per worker process and is shared by every
    # user on that worker. Put expensive, reusable client state here — e.g. the boto3
    # client and its connection pool — so it isn't rebuilt per request.
    import json, boto3
    client = boto3.client('sagemaker-runtime')

    def invoke(payload: Payload) -> None:
        # PER-REQUEST: called once for each request a user fires.
        resp = client.invoke_endpoint(
            EndpointName=endpoint_name or 'my-endpoint',
            ContentType='application/json',
            Body=json.dumps(payload).encode('utf-8'),
        )
        resp['Body'].read()

    return invoke            # the per-request callable

payload_factory() is called once per worker and has two modes.

pre_computed=True: payloads are built up front, and at request time one is simply picked from the list, so payload-build cost is not included. Use this mode when payload generation is compute-heavy enough to make client CPU a bottleneck. It may increase memory usage during benchmarking.
```
def payload_factory() -> dict[str, bool | list[Payload]]:
    return {
        'pre_computed': True,
        'input': [{'instances': [[...]]}],   # list[Payload]
    }
```
Requests cycle through the pre-computed inputs in order and wrap back to the start, so the list is never exhausted or cut off — fire more requests than there are inputs and it simply loops.
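
The wrap-around behavior can be pictured with itertools.cycle. This is a minimal sketch of the selection order only, not InferenceBenchmarker's own code; cycling_payloads is a hypothetical helper name.

```python
from itertools import cycle, islice
from typing import Any, Iterator

Payload = Any


def cycling_payloads(inputs: list[Payload]) -> Iterator[Payload]:
    """Return an iterator that walks inputs in order forever, wrapping to the start."""
    if not inputs:
        raise ValueError("pre-computed 'input' list must not be empty")
    return cycle(inputs)


# Three inputs, seven requests: the sequence wraps after the third payload.
picked = list(islice(cycling_payloads(["a", "b", "c"]), 7))
print(picked)  # ['a', 'b', 'c', 'a', 'b', 'c', 'a']
```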

pre_computed=False: payloads are built per request by returning a Python callable that is invoked on each inference request, so payload-build cost is included. Use this mode to make payloads dynamic, or when holding a pre-computed list is heavy enough on memory to make client memory a bottleneck. It may increase compute usage during benchmarking:
```
def payload_factory() -> dict[str, bool | Callable[[], Payload]]:
    return {
        'pre_computed': False,
        'input': lambda: {'instances': [[...]]},   # Callable[[], Payload]
    }
```
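
To see how one consumer could treat both modes uniformly, the sketch below turns either payload_factory() result into a zero-argument callable. make_next_payload is a hypothetical helper written for illustration; how InferenceBenchmarker dispatches internally is not shown in this README.

```python
from itertools import cycle
from typing import Any, Callable

Payload = Any


def make_next_payload(spec: dict[str, Any]) -> Callable[[], Payload]:
    """Turn a payload_factory() result into a 'give me the next payload' callable."""
    source = spec["input"]
    if spec["pre_computed"]:
        if not isinstance(source, list) or not source:
            raise TypeError("pre_computed=True expects a non-empty list in 'input'")
        return cycle(source).__next__  # pick the next pre-built payload, wrapping around
    if not callable(source):
        raise TypeError("pre_computed=False expects a callable in 'input'")
    return source  # build a fresh payload on every call


static = make_next_payload({"pre_computed": True, "input": [{"x": 1}, {"x": 2}]})
dynamic = make_next_payload({"pre_computed": False, "input": lambda: {"x": 42}})

first_three = [static() for _ in range(3)]
print(first_three)  # [{'x': 1}, {'x': 2}, {'x': 1}]
print(dynamic())    # {'x': 42}
```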

2. Run a wave

benchmark \
  --factories-file   factories/sagemakerai_realtime/factories_cnn.py \
  --endpoint-config  server_capacity/server_metrics_configs/sagemakerai_realtime.json \
  --client-rps       10 \
  --obs-time         60 \
  --workers          5

RESULTS – InferenceBenchmarker
------------------------------
   Total requests fired: 600

   Duration:         59.9s
   Server RPS:       10.0 req/s
   Total requests:   599
   Success rate:     99.8% (✓ passed, 95% target)

Detailed reports land under .tmp/<timestamp>_benchmark/.

Bounding a wave: `--obs-time` and `--num-requests`

--client-rps sets the rate (how many requests per second are fired). --obs-time and --num-requests decide how long the wave runs — pass at least one:

| You pass | Wave ends when |
| --- | --- |
| `--obs-time S` only | `S` seconds elapse |
| `--num-requests N` only | `N` requests have fired and completed |
| both | `S` seconds elapse or `N` requests complete, whichever comes first |
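
The stopping rule in the table reduces to a small predicate. This is a sketch of the logic as described, with hypothetical names (wave_finished, elapsed_s, completed); it is not taken from the tool's source.

```python
from typing import Optional


def wave_finished(
    elapsed_s: float,
    completed: int,
    obs_time: Optional[float] = None,
    num_requests: Optional[int] = None,
) -> bool:
    """Return True once either configured bound has been reached."""
    if obs_time is None and num_requests is None:
        raise ValueError("pass at least one of obs_time or num_requests")
    time_up = obs_time is not None and elapsed_s >= obs_time
    count_up = num_requests is not None and completed >= num_requests
    return time_up or count_up  # whichever bound is hit first ends the wave


print(wave_finished(30.0, 600, obs_time=60, num_requests=600))  # True: count reached first
print(wave_finished(60.0, 420, obs_time=60, num_requests=600))  # True: time reached first
print(wave_finished(30.0, 300, obs_time=60))                    # False: still running
```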

3. Add server-side metrics

Pass --endpoint-config <file.json>. Only CloudWatch is currently supported. Set Lag if the server publishes metrics with a delay after the wave ends. Metrics and Statistics are paired by position (the i-th metric uses the i-th list of statistics), and each (metric, statistic) pair is queried under the block's Namespace, Dimensions, Period, and Lag.

[
  {
    "stream": "cloudwatch",
    "Namespace": "/aws/sagemaker/Endpoints",
    "Dimensions": [{"Name": "EndpointName", "Value": "my-endpoint"},
                   {"Name": "VariantName", "Value": "AllTraffic"}],
    "Period": 60,
    "Lag": 30,
    "Metrics": ["CPUUtilization", "MemoryUtilization", "GPUUtilization", "GPUMemoryUtilization"],
    "Statistics": [["Average","Maximum"], ["Average","Maximum"], ["Average","Maximum"], ["Average","Maximum"]]
  }
]
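
The positional pairing can be checked before a run with a few lines of Python. The sketch below expands a trimmed copy of the config into one tuple per query; expand_queries is a hypothetical helper, and rejecting mismatched list lengths is an assumption about sensible validation, not documented tool behavior.

```python
import json

# A trimmed copy of the config above, with two metrics instead of four.
CONFIG_TEXT = """
[
  {
    "stream": "cloudwatch",
    "Namespace": "/aws/sagemaker/Endpoints",
    "Dimensions": [{"Name": "EndpointName", "Value": "my-endpoint"}],
    "Period": 60,
    "Lag": 30,
    "Metrics": ["CPUUtilization", "GPUUtilization"],
    "Statistics": [["Average", "Maximum"], ["Average"]]
  }
]
"""


def expand_queries(blocks: list[dict]) -> list[tuple[str, str, str]]:
    """Return one (namespace, metric, statistic) tuple per query."""
    queries = []
    for block in blocks:
        metrics, statistics = block["Metrics"], block["Statistics"]
        if len(metrics) != len(statistics):
            raise ValueError(
                f"{len(metrics)} metrics but {len(statistics)} statistics lists"
            )
        # The i-th metric is paired with the i-th list of statistics.
        for metric, stats_for_metric in zip(metrics, statistics):
            for statistic in stats_for_metric:
                queries.append((block["Namespace"], metric, statistic))
    return queries


queries = expand_queries(json.loads(CONFIG_TEXT))
for query in queries:
    print(query)
```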

The metrics for the wave window print after the wave results:

----------------------------------------

   ⏳ Fetching metrics from stream: cloudwatch (waiting 90s (longest lag: 30 + period: 60) for propagation)...

   CPUUtilization: Average=312.4, Maximum=394.1
   MemoryUtilization: Average=18.7, Maximum=21.3
   GPUUtilization: Average=82.5, Maximum=97.0
   GPUMemoryUtilization: Average=44.2, Maximum=51.8
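
The 90-second wait in the sample output is the longest lag plus the period. A sketch of that arithmetic follows; using the longest Period across blocks and defaulting a missing Lag to 0 are assumptions for the multi-block case, since the output only shows a single block.

```python
def propagation_wait_seconds(blocks: list[dict]) -> int:
    """Seconds to wait after a wave before querying metrics."""
    if not blocks:
        raise ValueError("at least one config block is required")
    longest_lag = max(block.get("Lag", 0) for block in blocks)  # ASSUMPTION: Lag defaults to 0
    longest_period = max(block["Period"] for block in blocks)   # ASSUMPTION: longest Period wins
    return longest_lag + longest_period


wait = propagation_wait_seconds([{"Period": 60, "Lag": 30}])
print(wait)  # 90
```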

4. Get Token metrics with aiperf

benchmark --factories-file factories/sagemakerai_realtime/factories_llm_textgeneration.py \
  --client-rps 10 \
  --obs-time 60 \
  --url https://my-endpoint/v1/chat/completions --api-key "$KEY" \
  --aiperf

--aiperf runs the Locust wave first, pauses to confirm the server has returned to a baseline (this matters for utilization metrics), then runs aiperf.
--aiperf-only skips the Locust-based InferenceBenchmarker wave and runs aiperf directly.
--aiperf-args '{"warmup-count": 50, "streaming": false}' overrides or adds any aiperf flag.

aiperf's input JSONL is auto-generated from the same payload_factory; to skip generation and supply your own, pass it via --aiperf-args '{"input-file": "/path/to/inputs.jsonl"}'.

RESULTS — aiperf
----------------
   Total requests fired: 599

   Duration:         59.8s
   Server RPS:       0.9 req/s
   Total requests:   53
   Success rate:     100.0% (✓ passed, 95% target)

Detailed token metrics land under .tmp/<timestamp>_benchmark/aiperf/.

5. Bar plots

benchmark --plot .tmp/<run1> .tmp/<run2>

sample plot

Label runs and attach hover info with --plot-metadata, a JSON object keyed by run-dir basename, passed either inline or as a path to a .json file. For each run, the legend key sets its name in the shared legend; every other key/value pair is shown as a hover line on that run's bars:

benchmark --plot .tmp/<run1> .tmp/<run2> .tmp/<run3> \
  --plot-metadata visualization/hover_configs/example.json

The sample plot above is rendered with this metadata.
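
A metadata object of this shape can be generated rather than hand-written. The sketch below keys each entry by run-dir basename; the run directories and the instance field are hypothetical, and only the legend key is named in the text above.

```python
import json
from pathlib import PurePosixPath


def build_plot_metadata(runs: dict[str, dict[str, str]]) -> str:
    """Re-key per-run metadata by run-dir basename and serialize it as JSON."""
    keyed = {PurePosixPath(run_dir).name: info for run_dir, info in runs.items()}
    if len(keyed) != len(runs):
        raise ValueError("two run dirs share a basename; keys would collide")
    return json.dumps(keyed, indent=2)


# Hypothetical run dirs and hover fields, for illustration only.
metadata_json = build_plot_metadata({
    ".tmp/run_a_benchmark": {"legend": "5 workers", "instance": "example-small"},
    ".tmp/run_b_benchmark/": {"legend": "10 workers", "instance": "example-large"},
})
print(metadata_json)
```

The resulting string can be written to a .json file and passed by path, or passed inline as the --plot-metadata value.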

All flags

--factories-file FILE     Python file exposing invoke_factory / payload_factory (required for a wave)
--endpoint-config FILE    Enable server telemetry, intended for hardware utilization metrics
--client-rps R            Target requests per second R to send from client
--obs-time S              Run a wave with client-rps for S seconds
--num-requests N          Run a wave of N requests; for behavior combined with --obs-time, see the Bounding a wave section
--workers N               N Locust worker processes (default 1; ~1 per available core is recommended)
--success-threshold F     Min acceptable success rate F, 0-1 (default 0.95)
--sample-client-hw        Record client CPU/memory during the wave
--port P                  Locust primary worker port (default 5557)
--locust-file FILE        Use a self-contained Locust file instead of factories (debugging)
--debug                   Verbose tracing (debugging)

--aiperf                  Run the wave, pause, then run aiperf            (needs --url, --api-key)
--aiperf-only             Skip the wave, run aiperf directly             (needs --url, --api-key)
--url URL                 Endpoint URL for aiperf
--api-key KEY             API key for aiperf URL
--aiperf-args JSON        Override/add `aiperf profile` flags; e.g. '{"model": "Qwen/Qwen2.5-0.5B"}' to estimate token counts with that HF model's tokenizer

--plot DIR [DIR ...]      Build a comparison report from existing run dirs (no wave); e.g. --plot <dir1> <dir2>
--plot-output-dir DIR     Output dir for the report (default: first --plot dir; e.g. --plot <dir1> <dir2> -> <dir1>)
--plot-fields JSON        Restrict plotted metrics per source; e.g. '{"locust": ["Latency (ms)"], "aiperf": ["Server RPS"]}'
--plot-metadata JSON|FILE Per-run legend rename + hover info, keyed by dir basename; inline JSON or a .json path (see Bar plots)

Client Diagnostics

InferenceBenchmarker detects client bottlenecks and alerts you whenever you run a Locust wave. After every wave it scans the Locust logs for CPU / heartbeat saturation and prints a warning if the client was overloaded:

   ⚠️ CLIENT BOTTLENECK detected in locust executions, test results might be unstable. Monitor client hardware. Pass --sample-client-hw to have InferenceBenchmarker benchmark client usage. Use diagnostic tools to find worker and rps saturation—worker_saturation.py/rps_saturation.py. Try pre-computed inputs in payload_factory if payload computation is a bottleneck. Use a client with higher cores and/or memory.

A simple manual check: compare requests fired with the wave duration. If requests fired / duration falls short of your --client-rps, the client couldn't keep up, so treat the run as client-limited. Pass --sample-client-hw to record client CPU/memory during the wave and confirm, or correlate with your own client telemetry tools.
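
The manual check is a single division. A minimal sketch, using the figures from the sample wave output; the tolerance parameter is an assumption added to absorb timing jitter and is not a tool setting.

```python
def achieved_rps(requests_fired: int, duration_s: float) -> float:
    """Requests actually fired per second over the wave."""
    if duration_s <= 0:
        raise ValueError("duration_s must be positive")
    return requests_fired / duration_s


def is_client_limited(
    requests_fired: int, duration_s: float, target_rps: float, tolerance: float = 0.02
) -> bool:
    """True when the achieved rate falls short of the target by more than tolerance."""
    return achieved_rps(requests_fired, duration_s) < target_rps * (1 - tolerance)


# Sample wave from this README: 600 requests fired in 59.9 s against --client-rps 10.
print(round(achieved_rps(600, 59.9), 2))   # 10.02
print(is_client_limited(600, 59.9, 10))    # False: the client kept up
print(is_client_limited(450, 60.0, 10))    # True: only 7.5 req/s were fired
```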

Diagnostic tools help you find where the client saturates:

find_worker_saturation(factories_file) – the max requests a single Locust worker can fire per second.
find_rps_saturation(factories_file, saturation_users) – how many workers to run before total requests fired plateaus (adding workers stops helping). This is the client's ceiling: the --workers setting beyond which you need a bigger / additional load-gen host.
find_file_descriptors_limit() – reports the client's file-descriptor limits, which cap the number of requests the client can sustain (--client-rps).
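
For a quick look at descriptor limits without the diagnostic tools, the standard library exposes them on Unix-like systems. This is an independent sketch, not the implementation of find_file_descriptors_limit(); the helper names are hypothetical.

```python
import resource  # Unix-only standard-library module


def file_descriptor_limits() -> tuple[int, int]:
    """Return the (soft, hard) limits on open file descriptors for this process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


def describe(limit: int) -> str:
    """Render a limit, showing RLIM_INFINITY as 'unlimited'."""
    return "unlimited" if limit == resource.RLIM_INFINITY else str(limit)


soft, hard = file_descriptor_limits()
print(f"soft limit: {describe(soft)}, hard limit: {describe(hard)}")
```

Each open socket consumes one descriptor, so a soft limit close to the number of concurrent connections is worth raising before a high-RPS wave.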

See client_capacity/README.md for usage.

Upcoming Improvements

endpoint_factory – Add endpoint creation code to the factories, providing a hook for Automatic RPS. tracking issues
Hydrate with examples – EKS, Hyperpod, EC2, OCP (on-prem), and other examples. tracking issues
Interactive CLI – Show traces while benchmarks run; the current tool prints little during a run. tracking issues
Automatic RPS – Automate the trial-and-error search for the server RPS sustainable at the success threshold when --endpoint-config is provided for hardware telemetry. tracking issue
Plot metadata – Provide a JSON (inline or file) to set per-run legend names and hover info in plots.

Security

See CONTRIBUTING for more information.

License

This library is licensed under the MIT-0 License. See the LICENSE file.
