Virtual Appliance Scaling

Mode: Batch, Realtime   Deployments: Virtual Appliance

Multi Threaded Workers

Batch Mode

Mode: Batch

The number of concurrent threads available to a job worker depends on the scaling mode and, in adaptive mode, on the length of the audio being transcribed. The mode is controlled by the scaling_mode setting in the API, which takes one of two values: simple, where each transcription job runs in a single thread, or adaptive, where the number of threads depends on the length of the audio. Depending on the scaling mode and the transcription features requested by the job, workers reserve a specific amount of CPU and memory. If enough resources are available when a job is created, Kubernetes schedules it; otherwise the job is marked as pending until resources are freed.

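# Set the scaling mode to "simple"; send "adaptive" to enable length-based threading.
# ADMIN_PASSWORD holds the admin password (the variable name is a placeholder).
curl -L -u "admin:${ADMIN_PASSWORD}" -X 'POST' \
  "http://${APPLIANCE_HOST}/v2/management/host/scaling" \
  -d '{"scaling_mode": "simple"}'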

In adaptive mode, each job is processed with a number of threads that depends on the length of its audio, up to a maximum of 4 threads. For this reason, adaptive mode is only available if the node has at least 4 cores.

Length in seconds (s)    Threads
0 < s <= 300             1
300 < s <= 600           2
600 < s <= 900           3
900 < s <= max           4
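
The thresholds in the table above can be sketched as a small shell function, which may help when estimating how many threads a given file would use in adaptive mode. This is only an illustration: the function name threads_for_length is hypothetical, it assumes the length is given in whole seconds, and the appliance makes the actual decision internally.

```shell
# Hypothetical helper (not part of the appliance): maps an audio length
# in whole seconds to the adaptive-mode thread count from the table.
threads_for_length() {
  s="$1"
  # Reject empty or non-numeric input (this also rejects negative values).
  case "$s" in
    ''|*[!0-9]*)
      echo "length must be a positive whole number of seconds" >&2
      return 1
      ;;
  esac
  if [ "$s" -le 0 ]; then
    echo "length must be greater than 0" >&2
    return 1
  elif [ "$s" -le 300 ]; then
    echo 1
  elif [ "$s" -le 600 ]; then
    echo 2
  elif [ "$s" -le 900 ]; then
    echo 3
  else
    echo 4
  fi
}

# Example: a 750-second file falls in the 600 < s <= 900 band.
threads_for_length 750   # prints 3
```

Under these assumptions, a node running several long files in adaptive mode should be sized for 4 threads per job.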

Because adaptive jobs use multiple threads, they also place a greater load on the GPU Inference Server (if enabled). The max_jobs configuration setting protects the Inference Server from being overwhelmed; see GPU Configuration.

Realtime Mode

Mode: Realtime

Realtime mode supports multi-threaded workers by default. Each worker has a configurable number of streams (threads) that it can process at a given time; see Realtime GPU configuration for more details.