# Configuration Reference

## Values
| Key | Type | Default | Description |
|-----|------|---------|-------------|
| nameOverride | string | | Unique identifier of the SuperSONIC instance (equal to the release name by default) |
| triton.replicas | int | | Number of Triton server instances (if autoscaling is disabled) |
| triton.image | string | | Docker image for the Triton server |
| triton.command | list | | Command and arguments to run in the Triton container |
| triton.args[0] | string | | |
| triton.resources | object | | Resource requests and limits for each Triton instance. Add a GPU request here if needed. |
| triton.affinity | object | | Affinity rules for Triton pods; an alternative way to request GPUs |
| triton.modelRepository | object | | Model repository configuration |
| triton.modelRepository.mountPath | string | | Model repository mount path |
| triton.service.labels | object | | |
| triton.service.annotations | object | | |
| triton.service.ports | list | | Ports for communication with the Triton servers |
| triton.resetReadinessProbe | bool | | If true, custom readiness probe settings are ignored (not recommended when using the autoscaler) |
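As a sketch of how the Triton options above combine, the fragment below requests one GPU per Triton instance. The image tag, mount path, and resource figures are illustrative placeholders, not chart defaults:

```yaml
# values.yaml (illustrative) -- image tag, mount path, and resource
# figures are placeholders, not chart defaults.
triton:
  replicas: 2
  image: nvcr.io/nvidia/tritonserver:24.05-py3
  modelRepository:
    mountPath: /models
  resources:
    requests:
      cpu: "1"
      memory: 4Gi
      nvidia.com/gpu: 1   # one GPU per Triton instance
    limits:
      nvidia.com/gpu: 1
```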
| Key | Type | Default | Description |
|-----|------|---------|-------------|
| envoy.enabled | bool | | Enable Envoy Proxy |
| envoy.replicas | int | | Number of Envoy Proxy pods in the Deployment |
| envoy.image | string | | Envoy Proxy Docker image |
| envoy.args | list | | Arguments for Envoy |
| envoy.resources | object | | Resource requests and limits for Envoy Proxy. Note: an Envoy Proxy with too many connections may run out of CPU |
| envoy.service.type | string | | Type of the client-facing Service. To be able to connect to it, either enable ingress or use type: LoadBalancer |
| envoy.service.ports | list | | Envoy Service ports |
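For the client-facing endpoint, a minimal sketch (assuming no ingress) is to expose Envoy through a LoadBalancer Service; the replica count below is an example only:

```yaml
# Illustrative: expose Envoy directly via a LoadBalancer Service
# so clients can reach it without an ingress.
envoy:
  enabled: true
  replicas: 2
  service:
    type: LoadBalancer
```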
| Key | Type | Default | Description |
|-----|------|---------|-------------|
| envoy.rate_limiter.listener_level | object | | This rate limiter explicitly controls the number of client connections to the Envoy Proxy |
| envoy.rate_limiter.listener_level.enabled | bool | | Enable the listener-level rate limiter |
| envoy.rate_limiter.listener_level.max_tokens | int | | Maximum number of simultaneous connections to the Envoy Proxy. Each new connection takes a "token" from the "bucket", which initially contains max_tokens tokens |
| envoy.rate_limiter.listener_level.tokens_per_fill | int | | Number of tokens added back to the bucket at each fill interval |
| envoy.rate_limiter.listener_level.fill_interval | string | | Interval at which tokens are added back to the bucket. For example, adding a new token every 12 seconds allows 5 new connections per minute |
| envoy.rate_limiter.prometheus_based | object | | This rate limiter rejects new connections based on a metric extracted from Prometheus (e.g. inference queue latency). The metric is taken from the prometheus.serverLoadMetric parameter |
| envoy.rate_limiter.prometheus_based.enabled | bool | | Enable the Prometheus-based rate limiter |
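A minimal sketch of the listener-level token bucket, with illustrative numbers: at most 60 concurrent client connections, and 5 tokens returned to the bucket every minute:

```yaml
# Illustrative token-bucket settings -- the numbers are examples,
# not recommended defaults.
envoy:
  rate_limiter:
    listener_level:
      enabled: true
      max_tokens: 60        # up to 60 simultaneous connections
      tokens_per_fill: 5    # 5 tokens restored per fill
      fill_interval: 60s    # fill once per minute
    prometheus_based:
      enabled: false
```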
| Key | Type | Default | Description |
|-----|------|---------|-------------|
| envoy.loadBalancerPolicy | string | | Envoy load balancer policy. Options: ROUND_ROBIN, LEAST_REQUEST, RING_HASH, RANDOM, MAGLEV |
| envoy.auth.enabled | bool | | Enable authentication in the Envoy Proxy |
| envoy.auth.jwt_issuer | string | | |
| envoy.auth.jwt_remote_jwks_uri | string | | |
| envoy.auth.audiences | list | | |
| envoy.auth.url | string | | |
| envoy.auth.port | int | | |
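The JWT fields above are not documented individually; as a hypothetical sketch, they would typically point at your identity provider. The issuer, JWKS URI, and audience values below are placeholders:

```yaml
# Hypothetical JWT auth settings -- issuer, JWKS URI, and audiences
# are placeholders for your identity provider.
envoy:
  auth:
    enabled: true
    jwt_issuer: https://auth.example.org/
    jwt_remote_jwks_uri: https://auth.example.org/.well-known/jwks.json
    audiences:
      - supersonic
```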
| Key | Type | Default | Description |
|-----|------|---------|-------------|
| autoscaler.enabled | bool | | Enable autoscaling (requires Prometheus to also be enabled). Autoscaling is based on the metric defined in prometheus.serverLoadMetric |
| autoscaler.minReplicas | int | | Minimum number of Triton servers. Warning: if set to 0 and the desired Prometheus metric is empty, the first server will never start |
| autoscaler.maxReplicas | int | | Maximum number of Triton servers |
| autoscaler.zeroIdleReplicas | bool | | If set to true, all GPUs are released when the service is idle. Be careful: if the scaling metric is extracted from the Triton servers, it becomes unavailable at zero replicas, and scaling from 0 to 1 will never happen |
| autoscaler.scaleUp.window | int | | |
| autoscaler.scaleUp.period | int | | |
| autoscaler.scaleUp.stepsize | int | | |
| autoscaler.scaleDown.window | int | | |
| autoscaler.scaleDown.period | int | | |
| autoscaler.scaleDown.stepsize | int | | |
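A sketch of an autoscaler block under the caveats above, keeping minReplicas at 1 so a Triton-derived scaling metric never disappears. The scaleUp/scaleDown numbers are placeholders, not recommended defaults:

```yaml
# Illustrative autoscaler settings -- window/period/stepsize values
# are placeholders.
autoscaler:
  enabled: true
  minReplicas: 1        # keep >= 1 when the metric comes from Triton itself
  maxReplicas: 4
  zeroIdleReplicas: false
  scaleUp:
    window: 60
    period: 30
    stepsize: 1
  scaleDown:
    window: 300
    period: 60
    stepsize: 1
```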
| Key | Type | Default | Description |
|-----|------|---------|-------------|
| prometheus | object | | A connection to a Prometheus server is required for the KEDA autoscaler and Envoy's Prometheus-based rate limiter |
| prometheus.url | string | | Prometheus server URL and port number (see the documentation of the given cluster or ask its admins) |
| prometheus.scheme | string | | Whether the Prometheus endpoint is exposed as http or https |
| prometheus.serverLoadMetric | string | | Metric used by both the KEDA autoscaler and Envoy's Prometheus-based rate limiter. The default metric (inference queue latency) is defined in templates/_helpers.tpl |
| prometheus.serverLoadThreshold | int | | Threshold for the server load metric |
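For the Prometheus connection, a hypothetical example; the URL, scheme, and threshold depend entirely on your cluster's monitoring stack, and omitting serverLoadMetric falls back to the default from templates/_helpers.tpl:

```yaml
# Illustrative Prometheus connection -- all values are placeholders.
prometheus:
  url: prometheus.example.org:443
  scheme: https
  serverLoadThreshold: 100
```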
| Key | Type | Default | Description |
|-----|------|---------|-------------|
| ingress.enabled | bool | | |
| ingress.hostName | string | | |
| nodeSelector | object | | Node selector for all pods (Triton and Envoy) |
| tolerations | list | | Tolerations for all pods (Triton and Envoy) |
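Finally, nodeSelector and tolerations apply to every pod in the release. A hypothetical sketch for scheduling onto tainted GPU nodes, where the label and taint names are placeholders for your cluster's conventions:

```yaml
# Illustrative scheduling constraints for all pods (Triton and Envoy);
# the node label and taint below are placeholders.
nodeSelector:
  kubernetes.io/arch: amd64
tolerations:
  - key: nvidia.com/gpu
    operator: Exists
    effect: NoSchedule
```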