Virtual Appliance Scaling
Mode: Batch, Realtime
Deployments: Virtual Appliance
Multi-Threaded Workers
Batch Mode
Mode: Batch
The number of concurrent threads available to a job worker can depend on the length of the file being transcribed. A worker is assigned either a single thread or several, depending on the scaling_mode setting in the API. scaling_mode can take two values: simple, meaning each transcription job runs in a single thread, or adaptive, where the number of threads depends on the length of the audio.
Depending on the scaling mode and the transcription features requested by the job, workers reserve a specific amount of CPU and memory resources. On job creation, Kubernetes schedules the job if enough resources are available. If not, the job is marked as pending until resources are freed.
curl -L -u admin:$PWD -X 'POST' \
"http://${APPLIANCE_HOST}/v2/management/host/scaling" \
-d '{"scaling_mode": "simple"}'
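The request above sets simple mode, and the same endpoint accepts the other value, adaptive. As a rough sketch only, the helpers below pick a value from a core count and build the matching request body. The function names choose_scaling_mode and scaling_payload are hypothetical and not part of the appliance. The core count must be the node's, supplied by the operator. The 4-core threshold is the adaptive-mode requirement described in this section.

```shell
# Hypothetical helper (not part of the appliance): pick a scaling_mode value
# from the number of cores on the appliance node. Adaptive mode is only
# available with at least 4 cores, so fall back to simple otherwise.
choose_scaling_mode() {
  local cores="$1"
  if [ "$cores" -ge 4 ]; then
    echo "adaptive"
  else
    echo "simple"
  fi
}

# Hypothetical helper: build the JSON body sent to the scaling endpoint.
scaling_payload() {
  printf '{"scaling_mode": "%s"}' "$1"
}

# Example use, left commented out so that sourcing this file sends nothing:
#   mode="$(choose_scaling_mode 8)"
#   curl -L -u admin:$PWD -X 'POST' \
#     "http://${APPLIANCE_HOST}/v2/management/host/scaling" \
#     -d "$(scaling_payload "$mode")"
```

Keeping the mode decision separate from the request makes it easy to check the chosen value before anything is sent to the appliance.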
In adaptive mode, each job is processed with a number of parallel threads that depends on the length of its audio, up to a maximum of 4 threads. For this reason, adaptive mode is only available if the node has at least 4 cores.
Length in Seconds | Threads |
---|---|
0 < s <= 300 | 1 |
300 < s <= 600 | 2 |
600 < s <= 900 | 3 |
900 < s <= max | 4 |
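The table can be read as a simple lookup. The sketch below mirrors it for estimation purposes, for example when working out how many threads a batch of files is likely to request. The function name threads_for_length is hypothetical, the length is assumed to be a whole number of seconds, and the appliance's own logic may differ in details not covered by the table.

```shell
# Hypothetical helper mirroring the table above: map an audio length in
# whole seconds to the number of threads used in adaptive mode.
threads_for_length() {
  local seconds="$1"
  if [ "$seconds" -le 0 ]; then
    # The table starts above 0 seconds, so reject anything else.
    echo "length must be greater than 0 seconds" >&2
    return 1
  elif [ "$seconds" -le 300 ]; then
    echo 1
  elif [ "$seconds" -le 600 ]; then
    echo 2
  elif [ "$seconds" -le 900 ]; then
    echo 3
  else
    # Anything longer than 900 seconds is capped at the 4-thread maximum.
    echo 4
  fi
}
```

For instance, threads_for_length 450 falls in the 300 < s <= 600 row and yields 2 threads.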
Since adaptive jobs use multiple threads, they also apply a greater load to the GPU Inference Server (if enabled). As a result, the max_jobs configuration setting has been introduced to protect the Inference Server from being overwhelmed; see GPU Configuration.
Realtime Mode
Mode: Realtime
Realtime mode supports multi-threaded workers by default. The worker has a configurable number of streams (threads) that it can process at a given time; see Realtime GPU configuration for more details.