feat: Bonsai Ternary 8B MIG deployment with LiteLLM proxy and Grafana…#978
markhembrow wants to merge 1 commit
Conversation
… monitoring

- Deploy Bonsai Ternary 1.58-bit 8B model on 3x MIG 1g.10gb partitions
- Use Prism fork llama.cpp server with GPU acceleration
- Configure 64K context window with q8_0 KV-cache quantization
- Set --parallel 2 for optimal throughput (~89 tok/s aggregate)
- Add LiteLLM proxy with Prometheus metrics and drop_params
- Deploy PostgreSQL for cost tracking
- Create Grafana dashboards (simple + provisioned)
- Add performance test scripts and results documentation

Performance results:

- 4 instances: 89.33 tok/s aggregate (64K ctx, parallel=2)
- 3 instances: 48.59 tok/s aggregate
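As a quick reading aid for the performance results, the aggregate figures can be divided by the instance count. This is only arithmetic on the two numbers quoted in the description; the description does not state the settings used for the 3-instance run, so the two rows may not be directly comparable. The helper name `per_instance` is made up for this sketch.

```python
# Aggregate throughput figures quoted in the PR description (tokens/second).
RESULTS = {4: 89.33, 3: 48.59}  # instance count -> aggregate tok/s


def per_instance(aggregate_tok_s: float, instances: int) -> float:
    """Average throughput contributed by each instance."""
    return aggregate_tok_s / instances


for count, aggregate in RESULTS.items():
    each = per_instance(aggregate, count)
    print(f"{count} instances: {aggregate:.2f} tok/s aggregate, {each:.2f} tok/s each")
```

The per-instance averages differ noticeably between the two rows, which is worth explaining in the results documentation.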
Code Review
This pull request introduces deployment manifests, Helm values, performance testing scripts, and documentation for deploying the Bonsai Ternary 8B model on NVIDIA MIG instances. Feedback on these changes highlights several key improvements: removing hostNetwork: true from the deployment to prevent port binding conflicts, updating both the deployment and Helm configurations to use the optimized settings (such as a 64K context size and quantized KV cache) recommended in the performance documentation, replacing a custom StopEvent class in the continuous performance test script with Python's standard threading.Event, and replacing absolute local file paths in the documentation with relative paths.
Lines under review:

    hostNetwork: true
    dnsPolicy: ClusterFirstWithHostNet
    terminationGracePeriodSeconds: 30
Using hostNetwork: true on a deployment with replicas: 4 will cause port binding conflicts on port 8080 if multiple replicas are scheduled on the same Kubernetes node (which is the standard case for MIG partitions on a single GPU/node). Since a Kubernetes Service bonsai-ternary-mig is used to route traffic to the pods, hostNetwork: true is unnecessary and should be removed to allow proper scheduling and port isolation.
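The conflict described here can be reproduced outside Kubernetes. The sketch below is an analogy, not the PR's code: two listeners in the same network namespace compete for one TCP port, which is what happens when several `hostNetwork: true` pods land on the same node. It uses the loopback address and an OS-assigned port rather than 8080, and `try_bind` is a hypothetical helper.

```python
import socket
from typing import Optional


def try_bind(port: int) -> Optional[socket.socket]:
    """Try to listen on 127.0.0.1:port; return the socket, or None if the port is taken."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        sock.bind(("127.0.0.1", port))
        sock.listen(1)
        return sock
    except OSError:
        # Typically EADDRINUSE: another listener already owns this port.
        sock.close()
        return None


first = try_bind(0)                # port 0 asks the OS for any free port
port = first.getsockname()[1]      # the port the first "replica" now owns
second = try_bind(port)            # a second "replica" in the same namespace
print("second bind succeeded:", second is not None)
first.close()
```

Without `hostNetwork`, each pod gets its own network namespace, so every replica can listen on 8080 and the Service routes between them.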
Suggested change:

    terminationGracePeriodSeconds: 30

Lines under review:

    - "--ctx-size"
    - "2048"
    - "--threads"
    - "8"
The deployment manifest is configured with a context size of 2048 and is missing the optimized settings (--parallel 2, --cache-type-k q8_0, --cache-type-v q8_0) recommended in BONSAI_TERNARY_MIG_PERFORMANCE.md for optimal throughput. Let's update the command arguments to match the recommended configuration.
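One way to keep the manifest and the performance document from drifting apart is a small check over the container arguments. The sketch below is hypothetical and not part of the PR: `RECOMMENDED` copies the four flag values named in this comment, and `find_mismatches` is an illustrative helper that assumes each flag is followed by its value as a separate list item, as in the manifest.

```python
from typing import Dict, List, Optional

# Flag values named in the review comment (assumed to be the documented recommendation).
RECOMMENDED = {
    "--ctx-size": "65536",
    "--parallel": "2",
    "--cache-type-k": "q8_0",
    "--cache-type-v": "q8_0",
}


def find_mismatches(args: List[str]) -> Dict[str, Optional[str]]:
    """Return {flag: actual value, or None if absent} for each flag that deviates."""
    actual: Dict[str, str] = {}
    for i, token in enumerate(args):
        if token in RECOMMENDED and i + 1 < len(args):
            actual[token] = args[i + 1]
    return {
        flag: actual.get(flag)
        for flag, wanted in RECOMMENDED.items()
        if actual.get(flag) != wanted
    }


current_args = ["--ctx-size", "2048", "--threads", "8"]
print(find_mismatches(current_args))
```

Run against the arguments currently in the manifest, the check reports the small context size and the three absent flags; run against the suggested arguments, it returns an empty dict.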
Suggested change:

    - "--ctx-size"
    - "65536"
    - "--parallel"
    - "2"
    - "--cache-type-k"
    - "q8_0"
    - "--cache-type-v"
    - "q8_0"
    - "--threads"
    - "8"

Lines under review:

    - "--ctx-size"
    - "2048"
    - "--threads"
    - "8"
The Helm values file is configured with a context size of 2048 and is missing the optimized settings (--parallel 2, --cache-type-k q8_0, --cache-type-v q8_0) recommended in BONSAI_TERNARY_MIG_PERFORMANCE.md for optimal throughput. Let's update the command arguments to match the recommended configuration.
Suggested change:

    - "--ctx-size"
    - "65536"
    - "--parallel"
    - "2"
    - "--cache-type-k"
    - "q8_0"
    - "--cache-type-v"
    - "q8_0"
    - "--threads"
    - "8"

Lines under review:

    class StopEvent:
        def __init__(self):
            self._stop = False
        def is_set(self):
            return self._stop
        def set(self):
            self._stop = True

Lines under review:

    global stop_event
    stop_event = StopEvent()
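The review summary recommends replacing this custom `StopEvent` class with Python's standard `threading.Event`, which offers the same `is_set()` and `set()` methods plus a blocking `wait()`. The following is a minimal sketch of that replacement; the `worker` loop and the `completed` list are stand-ins invented for illustration, not code from `perf_test_continuous.py`.

```python
import threading
import time

stop_event = threading.Event()   # drop-in for the custom StopEvent
completed = []                   # stand-in for per-request results


def worker(stop: threading.Event) -> None:
    """Simulate a request loop that runs until asked to stop."""
    while not stop.is_set():
        completed.append(1)      # placeholder for sending one request
        stop.wait(0.01)          # sleeps, but returns early once set() is called


thread = threading.Thread(target=worker, args=(stop_event,))
thread.start()
time.sleep(0.05)                 # let the loop run briefly
stop_event.set()                 # signal shutdown
thread.join(timeout=1.0)
print("iterations:", len(completed), "stopped:", not thread.is_alive())
```

Because `threading.Event` is thread-safe and can be passed to workers as an argument, the `global stop_event` declaration may no longer be needed.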
Lines under review:

- `/home/mctouch/code/production-stack/perf_test.py`: fixed-batch throughput test
- `/home/mctouch/code/production-stack/perf_test_continuous.py`: continuous saturation test (not run to completion due to long per-request times)
- `/tmp/bonsai-ternary-mig-deployment.yaml`: current deployment manifest
- `/home/mctouch/code/production-stack/BONSAI_TERNARY_MIG_PERFORMANCE.md`: this file
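The review summary asks for these absolute local paths to be replaced with relative ones. A minimal sketch of the conversion follows, assuming the repository root is `/home/mctouch/code/production-stack` (inferred from the paths quoted above); `to_repo_relative` is a hypothetical helper. Paths outside the repository, such as the manifest under `/tmp`, cannot be made repo-relative and are returned as `None` so they can be handled by hand.

```python
from pathlib import PurePosixPath
from typing import Optional

# Assumed repository root, inferred from the quoted paths.
REPO_ROOT = PurePosixPath("/home/mctouch/code/production-stack")


def to_repo_relative(path: str, root: PurePosixPath = REPO_ROOT) -> Optional[str]:
    """Return path relative to the repo root, or None if it lies outside the repo."""
    try:
        return str(PurePosixPath(path).relative_to(root))
    except ValueError:
        return None


for p in [
    "/home/mctouch/code/production-stack/perf_test.py",
    "/tmp/bonsai-ternary-mig-deployment.yaml",
]:
    print(p, "->", to_repo_relative(p))
```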
- Sign off your commits by using `-s` when doing `git commit`.
- Classify the PR with a title prefix such as `[Bugfix]`, `[Feat]`, and `[CI]`.

Detailed Checklist

Thank you for your contribution to production-stack! Before submitting the pull request, please ensure the PR meets the following criteria. This helps us maintain the code quality and improve the efficiency of the review process.

PR Title and Classification

Please try to classify PRs for easy understanding of the type of changes. The PR title is prefixed appropriately to indicate the type of change. Please use one of the following:

- `[Bugfix]` for bug fixes.
- `[CI/Build]` for build or continuous integration improvements.
- `[Doc]` for documentation fixes and improvements.
- `[Feat]` for new features in the cluster (e.g., autoscaling, disaggregated prefill, etc.).
- `[Router]` for changes to the `vllm_router` (e.g., routing algorithm, router observability, etc.).
- `[Misc]` for PRs that do not fit the above categories. Please use this sparingly.

Note: If the PR spans more than one category, please include all relevant prefixes.

Code Quality

The PR needs to meet the following code quality standards:

- Use `pre-commit` to format your code. See `README.md` for installation.

DCO and Signed-off-by

When contributing changes to this project, you must agree to the DCO. Commits must include a `Signed-off-by:` header which certifies agreement with the terms of the DCO.

Using `-s` with `git commit` will automatically add this header.

What to Expect for the Reviews

We aim to address all PRs in a timely manner. If no one reviews your PR within 5 days, please @-mention one of YuhanLiu11, Shaoting-Feng or ApostaC.