Brev 9032/Use Nebius Capacity Advisor to determine GPU instance availability#123
kirtiip20 wants to merge 3 commits into
769f1d0 to d2a77f4
	return key, available, true
}

func buildResourceAdviceMapFromItems(items []*capacityv1.ResourceAdvice) map[string]uint32 {
Can we move this to the test file as this is only used there?
Sure, moved buildResourceAdviceMapFromItems from instancetype.go to the test file.
	return available > 0 && hasQuota
}

func resourceAdviceEntry(item *capacityv1.ResourceAdvice) (key string, available uint32, ok bool) {
Let's err on the side of no named returns (instead of (key string, available uint32, ok bool), just use (string, uint32, bool))
Updated resourceAdviceEntry to use unnamed return types (string, uint32, bool) instead of named returns.
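For reference, a minimal sketch of what the unnamed-return form could look like. The resourceAdvice struct is a hypothetical stand-in for capacityv1.ResourceAdvice (its real fields are not shown in this thread), and the "/" key separator, the adviceEntry name, and the sample region/platform/preset values are placeholders, not the actual implementation.

```go
package main

import "fmt"

// resourceAdvice is a hypothetical stand-in for capacityv1.ResourceAdvice;
// the real generated type and its field names are not shown in this thread.
type resourceAdvice struct {
	Region    string
	Platform  string
	Preset    string
	Available uint32
}

// adviceEntry uses unnamed results (string, uint32, bool), as requested in
// review. The results are, in order: the lookup key, the available count,
// and whether the item could be turned into an entry at all.
func adviceEntry(item *resourceAdvice) (string, uint32, bool) {
	if item == nil || item.Region == "" || item.Platform == "" || item.Preset == "" {
		return "", 0, false
	}
	// Assumption: the key is the three components joined with "/".
	key := item.Region + "/" + item.Platform + "/" + item.Preset
	return key, item.Available, true
}

func main() {
	// Placeholder values, chosen only for illustration.
	key, available, ok := adviceEntry(&resourceAdvice{
		Region: "region-a", Platform: "platform-a", Preset: "preset-a", Available: 3,
	})
	fmt.Println(key, available, ok)
}
```

With unnamed results the doc comment carries the meaning of each value, and there is no risk of a bare return silently returning zero values.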
	}
	isAvailable := c.resolvePresetAvailability(
		ctx, isCPUOnly, hasQuota,
		location.Name, platform.Metadata.Name, preset.Name,
Minor but it might be nice to just hand the capacity lookup key here directly, rather than the individual components that will only be used to build the key.
Updated. The call site now constructs the key once using capacityAdviceKey(location.Name, platform.Metadata.Name, preset.Name) and passes it directly. resolvePresetAvailability now accepts a capacityKey string instead of the individual region, platform, and preset components.
d2a77f4 to 29e8fe7
29e8fe7 to 35c8a4d
Problem
Nebius GPU instance types showed as available in the Brev UI based on tenant quota alone, even when Nebius had no on-demand capacity in that region. Users could select such a type (e.g. 8× H200, L40s) and the launch would then fail at provisioning time.
Root cause
Availability was computed only from tenant quota allowances, with no check against the provider's actual capacity. A tenant can hold quota in a region where Nebius currently has no capacity available, so the type was still marked available and then failed on launch.
Fix
Integrated the Nebius Capacity Advisor (ResourceAdvice) API so availability reflects both real-time on-demand capacity and tenant quota:
Remaining tenant quota.
Treated DATA_STATE_UNKNOWN and AVAILABILITY_LEVEL_LIMIT_REACHED as unavailable capacity.
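The combined rule can be sketched as follows. This is an illustration only: the dataState and availabilityLevel types are local stand-ins for the Capacity Advisor enums, the "known" and "available" values are hypothetical, and the advice struct does not reflect the real ResourceAdvice message layout.

```go
package main

import "fmt"

type dataState int
type availabilityLevel int

const (
	dataStateUnknown dataState = iota // stands in for DATA_STATE_UNKNOWN
	dataStateKnown                    // hypothetical: advice data is usable
)

const (
	levelLimitReached availabilityLevel = iota // stands in for AVAILABILITY_LEVEL_LIMIT_REACHED
	levelAvailable                             // hypothetical: capacity exists
)

// advice is a simplified, hypothetical view of one capacity advice item.
type advice struct {
	State     dataState
	Level     availabilityLevel
	Available uint32
}

// usableCapacity returns 0 when the data state is unknown or the limit has
// been reached, so both cases are treated as unavailable capacity.
func usableCapacity(a advice) uint32 {
	if a.State == dataStateUnknown || a.Level == levelLimitReached {
		return 0
	}
	return a.Available
}

// isAvailable requires both provider capacity and remaining tenant quota.
func isAvailable(a advice, hasQuota bool) bool {
	return usableCapacity(a) > 0 && hasQuota
}

func main() {
	a := advice{State: dataStateKnown, Level: levelAvailable, Available: 4}
	fmt.Println(isAvailable(a, true))  // capacity and quota
	fmt.Println(isAvailable(a, false)) // capacity, no quota
	a.Level = levelLimitReached
	fmt.Println(isAvailable(a, true)) // quota, but limit reached
}
```

Failing closed on unknown data means a type may occasionally be hidden when capacity exists, which is the safer side given the launch failures described in the problem statement.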