feat: Bonsai Ternary 8B MIG deployment with LiteLLM proxy and Grafana…#978

Open
markhembrow wants to merge 1 commit into vllm-project:main from markhembrow:release/bonsai-ternary-mig
@markhembrow

… monitoring

  • Deploy Bonsai Ternary 1.58-bit 8B model on 3x MIG 1g.10gb partitions
  • Use Prism fork llama.cpp server with GPU acceleration
  • Configure 64K context window with q8_0 KV-cache quantization
  • Set --parallel 2 for optimal throughput (~89 tok/s aggregate)
  • Add LiteLLM proxy with Prometheus metrics and drop_params
  • Deploy PostgreSQL for cost tracking
  • Create Grafana dashboards (simple + provisioned)
  • Add performance test scripts and results documentation

Performance results:

  • 4 instances: 89.33 tok/s aggregate (64K ctx, parallel=2)
  • 3 instances: 48.59 tok/s aggregate

FILL IN THE PR DESCRIPTION HERE

FIX #xxxx (link existing issues this PR will resolve)

BEFORE SUBMITTING, PLEASE READ THE CHECKLIST BELOW AND FILL IN THE DESCRIPTION ABOVE


  • Make sure the code changes pass the pre-commit checks.
  • Sign-off your commit by using -s when doing git commit
  • Try to classify PRs for easy understanding of the type of changes, such as [Bugfix], [Feat], and [CI].
Detailed Checklist (Click to Expand)

Thank you for your contribution to production-stack! Before submitting the pull request, please ensure the PR meets the following criteria. This helps us maintain the code quality and improve the efficiency of the review process.

PR Title and Classification

Please try to classify PRs for easy understanding of the type of changes. The PR title is prefixed appropriately to indicate the type of change. Please use one of the following:

  • [Bugfix] for bug fixes.
  • [CI/Build] for build or continuous integration improvements.
  • [Doc] for documentation fixes and improvements.
  • [Feat] for new features in the cluster (e.g., autoscaling, disaggregated prefill, etc.).
  • [Router] for changes to the vllm_router (e.g., routing algorithm, router observability, etc.).
  • [Misc] for PRs that do not fit the above categories. Please use this sparingly.

Note: If the PR spans more than one category, please include all relevant prefixes.

Code Quality

The PR needs to meet the following code quality standards:

  • Pass all linter checks. Please use pre-commit to format your code. See README.md for installation.
  • The code needs to be well-documented so that future contributors can easily understand it.
  • Please include sufficient tests to ensure the change stays correct and robust. This includes both unit tests and integration tests.

DCO and Signed-off-by

When contributing changes to this project, you must agree to the DCO. Commits must include a Signed-off-by: header which certifies agreement with the terms of the DCO.

Using -s with git commit will automatically add this header.

What to Expect for the Reviews

We aim to address all PRs in a timely manner. If no one reviews your PR within 5 days, please @-mention one of YuhanLiu11, Shaoting-Feng or ApostaC.

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request introduces deployment manifests, Helm values, performance testing scripts, and documentation for deploying the Bonsai Ternary 8B model on NVIDIA MIG instances. Feedback on these changes highlights several key improvements: removing hostNetwork: true from the deployment to prevent port binding conflicts, updating both the deployment and Helm configurations to use the optimized settings (such as a 64K context size and quantized KV cache) recommended in the performance documentation, replacing a custom StopEvent class in the continuous performance test script with Python's standard threading.Event, and replacing absolute local file paths in the documentation with relative paths.

Comment on lines +18 to +20
hostNetwork: true
dnsPolicy: ClusterFirstWithHostNet
terminationGracePeriodSeconds: 30

high

Using hostNetwork: true on a deployment with replicas: 4 will cause port binding conflicts on port 8080 if multiple replicas are scheduled on the same Kubernetes node (which is the standard case for MIG partitions on a single GPU/node). Since a Kubernetes Service bonsai-ternary-mig is used to route traffic to the pods, hostNetwork: true is unnecessary and should be removed to allow proper scheduling and port isolation.

      terminationGracePeriodSeconds: 30
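
The conflict described above is the ordinary "address already in use" failure: with hostNetwork: true, every replica on a node shares that node's network namespace, so only one process can listen on a given port. A minimal Python sketch of the same effect on a single host, using an OS-assigned port rather than 8080 (the function name is hypothetical and the snippet is only an illustration, not part of the PR):

```python
import socket


def second_bind_fails(host: str = "127.0.0.1") -> bool:
    """Return True if a second listener cannot bind a port that is already in use."""
    first = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    second = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        first.bind((host, 0))  # port 0: let the OS pick a free port
        first.listen(1)
        port = first.getsockname()[1]
        try:
            # Same namespace, same address, same port: this mirrors two
            # hostNetwork replicas landing on one node.
            second.bind((host, port))
            second.listen(1)
        except OSError:
            return True
        return False
    finally:
        first.close()
        second.close()


print(second_bind_fails())
```

Without hostNetwork, each pod gets its own network namespace, so every replica can listen on 8080 and the Service routes between them.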

Comment on lines +32 to +35
- "--ctx-size"
- "2048"
- "--threads"
- "8"

medium

The deployment manifest is configured with a context size of 2048 and is missing the optimized settings (--parallel 2, --cache-type-k q8_0, --cache-type-v q8_0) recommended in BONSAI_TERNARY_MIG_PERFORMANCE.md for optimal throughput. Let's update the command arguments to match the recommended configuration.

        - "--ctx-size"
        - "65536"
        - "--parallel"
        - "2"
        - "--cache-type-k"
        - "q8_0"
        - "--cache-type-v"
        - "q8_0"
        - "--threads"
        - "8"
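
To see why the KV-cache type matters at a 64K context, here is a rough size estimate. The layer count, KV-head count, and head dimension below are hypothetical placeholders (the PR does not state the model's architecture), so the absolute numbers are illustrative only. The q8_0 figure assumes the usual block layout of 32 int8 values plus one fp16 scale per block.

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   ctx_tokens: int, bytes_per_element: float) -> float:
    """Estimate K+V cache size: two tensors per layer, one vector per token per KV head."""
    return 2 * n_layers * n_kv_heads * head_dim * ctx_tokens * bytes_per_element


# Hypothetical architecture values for illustration only; not taken from the PR.
N_LAYERS, N_KV_HEADS, HEAD_DIM = 32, 8, 128
CTX = 65536  # --ctx-size from the suggestion above

F16_BYTES = 2.0
Q8_0_BYTES = 34 / 32  # 32 int8 values + one 2-byte scale = 34 bytes per 32 elements

f16_gib = kv_cache_bytes(N_LAYERS, N_KV_HEADS, HEAD_DIM, CTX, F16_BYTES) / 2**30
q8_gib = kv_cache_bytes(N_LAYERS, N_KV_HEADS, HEAD_DIM, CTX, Q8_0_BYTES) / 2**30
print(f"f16 KV cache:  {f16_gib:.2f} GiB")
print(f"q8_0 KV cache: {q8_gib:.2f} GiB")
```

Under these assumed dimensions the cache shrinks from 8.00 GiB to 4.25 GiB, which is the kind of saving that decides whether a 64K context fits alongside the weights on a 1g.10gb slice.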

Comment on lines +27 to +30
- "--ctx-size"
- "2048"
- "--threads"
- "8"

medium

The Helm values file is configured with a context size of 2048 and is missing the optimized settings (--parallel 2, --cache-type-k q8_0, --cache-type-v q8_0) recommended in BONSAI_TERNARY_MIG_PERFORMANCE.md for optimal throughput. Let's update the command arguments to match the recommended configuration.

      - "--ctx-size"
      - "65536"
      - "--parallel"
      - "2"
      - "--cache-type-k"
      - "q8_0"
      - "--cache-type-v"
      - "q8_0"
      - "--threads"
      - "8"
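
Since the same drift showed up in both the manifest and the Helm values, a small check could keep the two in line with the documented settings. The sketch below is an assumption about how such a check might look; the helper name and the RECOMMENDED table are hypothetical, with values copied from the suggestion above.

```python
from typing import Dict, List, Optional

# Flag values taken from the suggested configuration in this review.
RECOMMENDED: Dict[str, str] = {
    "--ctx-size": "65536",
    "--parallel": "2",
    "--cache-type-k": "q8_0",
    "--cache-type-v": "q8_0",
}


def arg_mismatches(args: List[str]) -> Dict[str, Optional[str]]:
    """Map each recommended flag to its current value when it is missing or different."""
    current: Dict[str, str] = {}
    for i, token in enumerate(args):
        # Treat every "--flag" followed by another token as a flag/value pair.
        if token.startswith("--") and i + 1 < len(args):
            current[token] = args[i + 1]
    return {flag: current.get(flag)
            for flag, want in RECOMMENDED.items()
            if current.get(flag) != want}


original = ["--ctx-size", "2048", "--threads", "8"]
suggested = ["--ctx-size", "65536", "--parallel", "2",
             "--cache-type-k", "q8_0", "--cache-type-v", "q8_0",
             "--threads", "8"]
print(arg_mismatches(original))
print(arg_mismatches(suggested))
```

A value of None in the result means the flag is absent; an empty dict means the argument list matches the recommendation.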

Comment thread perf_test_continuous.py
Comment on lines +44 to +50
class StopEvent:
    def __init__(self):
        self._stop = False
    def is_set(self):
        return self._stop
    def set(self):
        self._stop = True

medium

Instead of implementing a custom StopEvent class, we should use the standard, thread-safe threading.Event from the Python standard library.

import threading

Comment thread perf_test_continuous.py
Comment on lines +54 to +55
global stop_event
stop_event = StopEvent()

medium

Instantiate threading.Event() instead of the custom StopEvent() class.

Suggested change
  global stop_event
- stop_event = StopEvent()
+ stop_event = threading.Event()
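
For context, a minimal sketch of how threading.Event typically drives a stop signal across worker threads. The worker body is a stand-in and does not reproduce the request loop in perf_test_continuous.py; the names worker and completed are hypothetical.

```python
import threading
import time

stop_event = threading.Event()
completed = []


def worker(worker_id: int) -> None:
    """Loop until the shared event is set, standing in for a request loop."""
    count = 0
    while not stop_event.is_set():
        count += 1
        # wait() doubles as an interruptible sleep: it returns early once set() is called.
        stop_event.wait(timeout=0.01)
    completed.append((worker_id, count))


threads = [threading.Thread(target=worker, args=(i,)) for i in range(3)]
for t in threads:
    t.start()
time.sleep(0.05)   # let the workers run briefly
stop_event.set()   # signal every worker at once
for t in threads:
    t.join(timeout=1.0)
print(sorted(worker_id for worker_id, _ in completed))
```

Beyond the is_set() and set() methods that the custom class provides, Event.wait(timeout) lets a worker sleep and still react promptly to shutdown.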

Comment on lines +158 to +161
- `/home/mctouch/code/production-stack/perf_test.py` — fixed-batch throughput test
- `/home/mctouch/code/production-stack/perf_test_continuous.py` — continuous saturation test (not run to completion due to long per-request times)
- `/tmp/bonsai-ternary-mig-deployment.yaml` — current deployment manifest
- `/home/mctouch/code/production-stack/BONSAI_TERNARY_MIG_PERFORMANCE.md` — this file

medium

The documentation contains absolute local file paths specific to a user's environment (e.g., /home/mctouch/... and /tmp/...). These should be updated to use relative paths relative to the repository root so that they are correct for all users and environments.
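
One way to derive the relative form is sketched below. It assumes the repository root is /home/mctouch/code/production-stack, as the listed paths suggest; the helper name is hypothetical.

```python
from pathlib import PurePosixPath


def to_repo_relative(path: str, repo_root: str) -> str:
    """Return path relative to repo_root; raises ValueError if it lies outside the repo."""
    return str(PurePosixPath(path).relative_to(PurePosixPath(repo_root)))


# Assumed repository root, inferred from the paths quoted in this comment.
REPO_ROOT = "/home/mctouch/code/production-stack"

print(to_repo_relative(f"{REPO_ROOT}/perf_test.py", REPO_ROOT))

try:
    to_repo_relative("/tmp/bonsai-ternary-mig-deployment.yaml", REPO_ROOT)
except ValueError:
    # Files outside the repository have no relative form; they would need to
    # be committed and then referenced by their path inside the repo.
    print("outside repository")
```

The /tmp manifest is the case to watch: it cannot be rewritten as a relative path until the file itself lives in the repository.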
