feat: Bonsai Ternary 8B MIG deployment with LiteLLM proxy and Grafana… by markhembrow · Pull Request #978 · vllm-project/production-stack

markhembrow · 2026-06-20T12:24:57Z

… monitoring

Deploy Bonsai Ternary 1.58-bit 8B model on 3x MIG 1g.10gb partitions
Use Prism fork llama.cpp server with GPU acceleration
Configure 64K context window with q8_0 KV-cache quantization
Set --parallel 2 for optimal throughput (~89 tok/s aggregate)
Add LiteLLM proxy with Prometheus metrics and drop_params
Deploy PostgreSQL for cost tracking
Create Grafana dashboards (simple + provisioned)
Add performance test scripts and results documentation

Performance results:

4 instances: 89.33 tok/s aggregate (64K ctx, parallel=2)
3 instances: 48.59 tok/s aggregate
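
For a rough per-instance view of the aggregate figures above, the sketch below divides each aggregate by its instance count. It assumes load was spread evenly across instances, which the reported results do not confirm, and the function name is illustrative rather than part of the PR's test scripts.

```python
def per_instance_throughput(aggregate_tok_s: float, instances: int) -> float:
    """Average tokens/s per instance, assuming load is spread evenly."""
    if instances <= 0:
        raise ValueError("instances must be positive")
    return aggregate_tok_s / instances


# Aggregate figures as reported in this PR description.
reported = {4: 89.33, 3: 48.59}

for instances, aggregate in reported.items():
    average = per_instance_throughput(aggregate, instances)
    print(f"{instances} instances: {average:.2f} tok/s per instance (average)")
```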

FILL IN THE PR DESCRIPTION HERE

FIX #xxxx (link existing issues this PR will resolve)

BEFORE SUBMITTING, PLEASE READ THE CHECKLIST BELOW AND FILL IN THE DESCRIPTION ABOVE

Make sure the code changes pass the pre-commit checks.
Sign-off your commit by using -s when doing git commit
Try to classify PRs for easy understanding of the type of changes, such as [Bugfix], [Feat], and [CI].

Detailed Checklist (Click to Expand)

Thank you for your contribution to production-stack! Before submitting the pull request, please ensure the PR meets the following criteria. This helps us maintain the code quality and improve the efficiency of the review process.

PR Title and Classification

Please try to classify PRs for easy understanding of the type of changes. The PR title is prefixed appropriately to indicate the type of change. Please use one of the following:

[Bugfix] for bug fixes.
[CI/Build] for build or continuous integration improvements.
[Doc] for documentation fixes and improvements.
[Feat] for new features in the cluster (e.g., autoscaling, disaggregated prefill, etc.).
[Router] for changes to the vllm_router (e.g., routing algorithm, router observability, etc.).
[Misc] for PRs that do not fit the above categories. Please use this sparingly.

Note: If the PR spans more than one category, please include all relevant prefixes.

Code Quality

The PR needs to meet the following code quality standards:

Pass all linter checks. Please use pre-commit to format your code. See README.md for installation.
The code needs to be well documented so that future contributors can easily understand it.
Please include sufficient tests to ensure the change stays correct and robust. This includes both unit tests and integration tests.

DCO and Signed-off-by

When contributing changes to this project, you must agree to the DCO. Commits must include a Signed-off-by: header which certifies agreement with the terms of the DCO.

Using -s with git commit will automatically add this header.

What to Expect for the Reviews

We aim to address all PRs in a timely manner. If no one reviews your PR within 5 days, please @-mention one of YuhanLiu11, Shaoting-Feng or ApostaC.


gemini-code-assist

Code Review

This pull request introduces deployment manifests, Helm values, performance testing scripts, and documentation for deploying the Bonsai Ternary 8B model on NVIDIA MIG instances. Feedback on these changes highlights several key improvements: removing hostNetwork: true from the deployment to prevent port binding conflicts, updating both the deployment and Helm configurations to use the optimized settings (such as a 64K context size and quantized KV cache) recommended in the performance documentation, replacing a custom StopEvent class in the continuous performance test script with Python's standard threading.Event, and replacing absolute local file paths in the documentation with relative paths.

gemini-code-assist · 2026-06-20T12:26:14Z

+      hostNetwork: true
+      dnsPolicy: ClusterFirstWithHostNet
+      terminationGracePeriodSeconds: 30


Using hostNetwork: true on a deployment with replicas: 4 will cause port binding conflicts on port 8080 if multiple replicas are scheduled on the same Kubernetes node (which is the standard case for MIG partitions on a single GPU/node). Since a Kubernetes Service bonsai-ternary-mig is used to route traffic to the pods, hostNetwork: true is unnecessary and should be removed to allow proper scheduling and port isolation.

terminationGracePeriodSeconds: 30
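
To illustrate the conflict described above, here is a minimal sketch that models pods claiming ports on a node's network namespace. It is a simulation only and does not call the Kubernetes API. The `Pod` type, the pod names, and the node name `gpu-node-0` are hypothetical.

```python
from collections import defaultdict
from typing import NamedTuple


class Pod(NamedTuple):
    name: str
    node: str
    port: int
    host_network: bool


def host_port_conflicts(pods):
    """Return {(node, port): [pod names]} where host-network pods collide."""
    claims = defaultdict(list)
    for pod in pods:
        # Only host-network pods share the node's port space; other pods
        # get their own network namespace and cannot collide here.
        if pod.host_network:
            claims[(pod.node, pod.port)].append(pod.name)
    return {key: names for key, names in claims.items() if len(names) > 1}


# Hypothetical placement: four replicas on one MIG-enabled node, port 8080.
replicas = [Pod(f"bonsai-{i}", "gpu-node-0", 8080, True) for i in range(4)]
print(host_port_conflicts(replicas))

# The same placement without hostNetwork has no shared port space.
isolated = [pod._replace(host_network=False) for pod in replicas]
print(host_port_conflicts(isolated))
```

With `hostNetwork` removed, each replica listens on 8080 inside its own pod network namespace and the Service routes to the pod IPs.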

gemini-code-assist · 2026-06-20T12:26:14Z

+        - "--ctx-size"
+        - "2048"
+        - "--threads"
+        - "8"


The deployment manifest is configured with a context size of 2048 and is missing the optimized settings (--parallel 2, --cache-type-k q8_0, --cache-type-v q8_0) recommended in BONSAI_TERNARY_MIG_PERFORMANCE.md for optimal throughput. Let's update the command arguments to match the recommended configuration.

        - "--ctx-size"
        - "65536"
        - "--parallel"
        - "2"
        - "--cache-type-k"
        - "q8_0"
        - "--cache-type-v"
        - "q8_0"
        - "--threads"
        - "8"
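
As a sketch of how the recommended arguments fit together, the helper below assembles the same flag list and estimates the context available to each parallel slot. The helper names are hypothetical. The per-slot figure rests on an assumption that the server divides `--ctx-size` evenly across `--parallel` slots, as upstream llama.cpp has historically done; the Prism fork's behavior should be checked before relying on it.

```python
def llama_server_args(ctx_size=65536, parallel=2, cache_type="q8_0", threads=8):
    """Build the argument list recommended in the review comment."""
    if ctx_size <= 0 or parallel <= 0 or threads <= 0:
        raise ValueError("ctx_size, parallel and threads must be positive")
    return [
        "--ctx-size", str(ctx_size),
        "--parallel", str(parallel),
        "--cache-type-k", cache_type,
        "--cache-type-v", cache_type,
        "--threads", str(threads),
    ]


def context_per_slot(ctx_size, parallel):
    """Tokens of context per slot, ASSUMING an even split across slots."""
    return ctx_size // parallel


print(llama_server_args())
print(context_per_slot(65536, 2))
```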

gemini-code-assist · 2026-06-20T12:26:14Z

+      - "--ctx-size"
+      - "2048"
+      - "--threads"
+      - "8"


The Helm values file is configured with a context size of 2048 and is missing the optimized settings (--parallel 2, --cache-type-k q8_0, --cache-type-v q8_0) recommended in BONSAI_TERNARY_MIG_PERFORMANCE.md for optimal throughput. Let's update the command arguments to match the recommended configuration.

      - "--ctx-size"
      - "65536"
      - "--parallel"
      - "2"
      - "--cache-type-k"
      - "q8_0"
      - "--cache-type-v"
      - "q8_0"
      - "--threads"
      - "8"

gemini-code-assist · 2026-06-20T12:26:14Z

+class StopEvent:
+    def __init__(self):
+        self._stop = False
+    def is_set(self):
+        return self._stop
+    def set(self):
+        self._stop = True


Instead of implementing a custom StopEvent class, we should use the standard, thread-safe threading.Event from the Python standard library.

import threading

gemini-code-assist · 2026-06-20T12:26:14Z

+    global stop_event
+    stop_event = StopEvent()


Instantiate threading.Event() instead of the custom StopEvent() class.

Suggested change

-    global stop_event
-    stop_event = StopEvent()
+    global stop_event
+    stop_event = threading.Event()
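
A minimal sketch of the pattern the suggestion points toward: a worker loop that polls a `threading.Event` and stops once it is set. The `worker` function and the counter standing in for request traffic are illustrative and are not taken from `perf_test_continuous.py`.

```python
import threading
import time


def worker(stop_event: threading.Event, counter: list) -> None:
    """Loop until stop_event is set, counting iterations."""
    while not stop_event.is_set():
        counter[0] += 1  # stand-in for sending one request
        # wait() doubles as an interruptible sleep: it returns as soon
        # as set() is called instead of finishing the full delay.
        stop_event.wait(0.01)


stop_event = threading.Event()
counter = [0]
thread = threading.Thread(target=worker, args=(stop_event, counter))
thread.start()

time.sleep(0.05)   # let the worker run briefly
stop_event.set()   # signal shutdown
thread.join(timeout=1.0)

print(f"worker stopped after {counter[0]} iterations")
```

Besides `is_set()` and `set()`, which the custom class already mirrors, `threading.Event` provides `wait(timeout)`, so a worker can pause between requests and still react promptly to shutdown.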

gemini-code-assist · 2026-06-20T12:26:14Z

+- `/home/mctouch/code/production-stack/perf_test.py` — fixed-batch throughput test
+- `/home/mctouch/code/production-stack/perf_test_continuous.py` — continuous saturation test (not run to completion due to long per-request times)
+- `/tmp/bonsai-ternary-mig-deployment.yaml` — current deployment manifest
+- `/home/mctouch/code/production-stack/BONSAI_TERNARY_MIG_PERFORMANCE.md` — this file


The documentation contains absolute local file paths specific to a user's environment (e.g., /home/mctouch/... and /tmp/...). These should be updated to use relative paths relative to the repository root so that they are correct for all users and environments.
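
One way to derive the relative paths, sketched with `pathlib`. The function name is hypothetical, and the repository root is taken from the paths quoted above. Paths outside the repository, such as the `/tmp/...` manifest, cannot be made relative and are reported as `None` so they can be handled by hand.

```python
from pathlib import PurePosixPath


def to_repo_relative(path: str, repo_root: str):
    """Return path relative to repo_root, or None if it lies outside it."""
    try:
        return str(PurePosixPath(path).relative_to(PurePosixPath(repo_root)))
    except ValueError:
        # relative_to() raises ValueError when path is not under repo_root.
        return None


repo_root = "/home/mctouch/code/production-stack"
documented = [
    "/home/mctouch/code/production-stack/perf_test.py",
    "/home/mctouch/code/production-stack/perf_test_continuous.py",
    "/tmp/bonsai-ternary-mig-deployment.yaml",
    "/home/mctouch/code/production-stack/BONSAI_TERNARY_MIG_PERFORMANCE.md",
]

for path in documented:
    print(path, "->", to_repo_relative(path, repo_root))
```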

markhembrow wants to merge 1 commit into vllm-project:main from markhembrow:release/bonsai-ternary-mig
