Translation GPU Inference Container
Prerequisites
- A license file or a license token
- The Inference Container itself does not need a license, but its client (a transcriber Container) must have a valid license with Translation enabled
- Access to our Docker repository
System Requirements
Note: System requirements for the Translation inference server are the same as for the GPU Inference Container for transcription, except for RAM and CPU requirements which are lower. The two servers cannot use the same GPU.
The system must have:
- Nvidia GPU(s) with at least 16GB of GPU memory
- Nvidia drivers (see below for supported versions)
- CUDA compute capability of 7.5-9.0 inclusive, which corresponds to the Turing, Ampere, Lovelace, and Hopper architectures. Cards with the Volta architecture or older cannot run the models
- 5GB RAM
- 4 vCPUs
- The nvidia-container-toolkit installed
- Docker version > 19.03
The raw Docker image size of the Translation Container is around 10GB.
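The GPU requirements above can be checked with a short script. The following is a minimal sketch rather than part of the product: it assumes nvidia-smi is on the PATH and that the installed version supports the compute_cap query field (older drivers may not). The helper names and the default memory threshold are illustrative assumptions; the threshold sits slightly below 16384 MiB because cards sold as 16GB can report a little less total memory.

```python
import subprocess

MIN_COMPUTE_CAP = (7, 5)  # Turing, lower bound from the requirements
MAX_COMPUTE_CAP = (9, 0)  # Hopper, upper bound from the requirements


def parse_gpu_line(line):
    """Parse one CSV line such as 'Tesla T4, 15360, 7.5'."""
    # Split from the right so a comma in the GPU name cannot break parsing.
    name, memory_mib, compute_cap = [f.strip() for f in line.rsplit(",", 2)]
    major, minor = compute_cap.split(".")
    return name, int(memory_mib), (int(major), int(minor))


def gpu_is_suitable(memory_mib, compute_cap, min_memory_mib=15000):
    """Return True if memory and compute capability are within bounds.

    min_memory_mib is an assumed threshold, not a documented value.
    """
    return (
        memory_mib >= min_memory_mib
        and MIN_COMPUTE_CAP <= compute_cap <= MAX_COMPUTE_CAP
    )


def query_gpus():
    """Ask nvidia-smi for name, total memory (MiB) and compute capability."""
    output = subprocess.run(
        [
            "nvidia-smi",
            "--query-gpu=name,memory.total,compute_cap",
            "--format=csv,noheader,nounits",
        ],
        capture_output=True,
        text=True,
        check=True,
    ).stdout
    return [parse_gpu_line(line) for line in output.splitlines() if line.strip()]
```

On a GPU host, calling query_gpus() and passing each result's memory and compute capability to gpu_is_suitable() gives a quick pass/fail per card. RAM, vCPU, and Docker version are not covered by this sketch.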
Nvidia Drivers
- The GPU Inference Container is based on CUDA 12.3.2, which requires NVIDIA Driver release 545 or later.
- If you are running on a data center GPU (e.g., a T4), you can use drivers 470.57 (or later R470), 525.85 (or later R525), 535.86 (or later R535), or 545.23 (or later R545).
Driver installation can be validated by running nvidia-smi. This command should return the Nvidia driver version and show additional information about the GPU(s).
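The driver rules in this section can be expressed as a small check. This is a sketch under a literal reading of the list above: helper names are hypothetical, and driver branches that are not listed (for example R530) are treated as unsupported, which is an assumption rather than a documented rule. The version string can be obtained with nvidia-smi --query-gpu=driver_version --format=csv,noheader.

```python
# Minimum minor version per driver branch for data center GPUs,
# copied from the list in this section.
DATA_CENTER_BRANCH_MINIMUMS = {470: 57, 525: 85, 535: 86, 545: 23}
DEFAULT_MINIMUM_BRANCH = 545  # CUDA 12.3.2 requires release 545 or later


def parse_driver_version(version):
    """Return (branch, minor) from a string such as '535.104.05'."""
    parts = version.strip().split(".")
    return int(parts[0]), int(parts[1])


def driver_is_supported(version, data_center_gpu=False):
    """Apply the driver rules from this section to a version string."""
    branch, minor = parse_driver_version(version)
    if branch > DEFAULT_MINIMUM_BRANCH:
        return True
    if data_center_gpu:
        # Assumption: unlisted branches are rejected.
        minimum = DATA_CENTER_BRANCH_MINIMUMS.get(branch)
        return minimum is not None and minor >= minimum
    return branch == DEFAULT_MINIMUM_BRANCH
```

For example, driver_is_supported("535.104.05", data_center_gpu=True) accepts an R535 driver on a T4, while the same version is rejected when data_center_gpu is left at its default.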
Azure Instances
The GPU node can be provisioned in the cloud. Our SaaS deployment uses Azure Standard_NC4as_T4_v3, but any NC or ND series with sufficient memory should work.
Running the Image
Currently, each Translation Container can only run on a single GPU.
If a system has more than one GPU, the device must be specified using CUDA_VISIBLE_DEVICES or by selecting the device using the --gpus argument. See Nvidia/CUDA documentation for details.
# The gRPC endpoint uses port 8001 in the container; it can be mapped to any host port
docker run --rm -it \
  --gpus '"device=0"' \
  -e CUDA_VISIBLE_DEVICES \
  -p 8001:8001 \
  speechmatics-docker-public.jfrog.io/sm-translation-inference-server:10.7.0
On startup you will see logs detailing available GPU memory. As set out in the requirements section, the system must have a minimum of 16GB of GPU memory, though extra GPU memory may be used if available.
Total GPU memory: 40960MiB
Approx. size models: 5GB
Available GPU memory after models loaded: 35GB
Sending Requests
Batch and Real-Time (RT) transcribers handle sending requests to the Translation Inference Server. To run a transcription job with Translation, follow the instructions for running the CPU Container and additionally:
- Set the environment variable SM_TRANSLATION_ENDPOINT in the transcriber to the gRPC endpoint of the running Translation Inference Server, in the form <server_ip_address>:<port>, where the port is the one bound to port 8001 of the Translation Docker Container (see Running the Image)
- Include a translation_config inside your job config
- Use a transcriber version 10.3.0 or newer
- Ensure you use a license which allows Translation
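Before starting a transcriber, it can help to confirm that the value intended for SM_TRANSLATION_ENDPOINT is well formed and that something is listening there. The sketch below is illustrative only: the helper names are hypothetical, and the reachability check opens a plain TCP connection, so it does not prove that the gRPC service itself is healthy.

```python
import socket


def format_endpoint(host, port=8001):
    """Build the '<server_ip_address>:<port>' value for SM_TRANSLATION_ENDPOINT."""
    port = int(port)
    if not 0 < port < 65536:
        raise ValueError(f"invalid port: {port}")
    return f"{host}:{port}"


def endpoint_is_reachable(endpoint, timeout=3.0):
    """Return True if a TCP connection to 'host:port' succeeds in time."""
    host, _, port = endpoint.rpartition(":")
    try:
        with socket.create_connection((host, int(port)), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution errors.
        return False
```

The port passed to format_endpoint() is the host port mapped to container port 8001, which is only 8001 itself if the default mapping was kept.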
Translation Language Pairs
The Translation Inference Container is not language specific, meaning that all 69 translation language pairs supported can run on a single Inference Container. The source language is defined by the language of the transcriber sending requests.
By default, a maximum of 5 target languages can be requested at once. This behaviour can be changed by setting the environment variable SM_TRANSLATION_MAX_TARGET_LANGUAGES in the transcriber. Setting this to 0 will disable the limit.
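A job config can be checked against this limit before it is submitted. The following is a client-side sketch that mirrors the documented behaviour (default of 5, 0 disables the limit); it is not the transcriber's own validation, and the function name is hypothetical.

```python
def check_target_languages(job_config, max_target_languages=5):
    """Return the requested target languages, or raise if over the limit.

    max_target_languages plays the role of
    SM_TRANSLATION_MAX_TARGET_LANGUAGES: 0 means no limit.
    """
    translation = job_config.get("translation_config", {})
    targets = translation.get("target_languages", [])
    if max_target_languages != 0 and len(targets) > max_target_languages:
        raise ValueError(
            f"{len(targets)} target languages requested, "
            f"limit is {max_target_languages}"
        )
    return targets
```

Passing max_target_languages=0 accepts any number of targets, matching a transcriber started with the limit disabled.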
Example of Running Translation
Assuming the following config file, where listing languages under target_languages enables translation:
{
  "type": "transcription",
  "transcription_config": {
    "operating_point": "enhanced",
    "language": "en"
  },
  "translation_config": {
    "target_languages": ["es", "de"]
  }
}
You can run Batch Transcription and Translation with:
cat ~/$AUDIO_FILE | docker run -i \
-v ~/$CONFIG_FILE:/config.json \
-e LICENSE_TOKEN=eyJhbGciOiJ... \
-e SM_TRANSLATION_ENDPOINT=<server>:<port> \
batch-asr-transcriber-en:10.7.0
Or start a Translation-enabled Real-Time Container with the following command, where SM_TRANSLATION_MAX_TARGET_LANGUAGES=10 raises the allowed number of target languages:
docker run -p 9000:9000 -e LICENSE_TOKEN=eyJhbGciOiJ... \
  -e SM_TRANSLATION_ENDPOINT=<server>:<port> \
  -e SM_TRANSLATION_MAX_TARGET_LANGUAGES=10 \
  rt-asr-transcriber-en:10.7.0
Monitoring the Server
The Inference Server is based on Nvidia's Triton architecture and as such can be monitored using Triton's inbuilt Prometheus metrics, or the gRPC/HTTP APIs. To expose these, configure an external mapping for port 8002 (Prometheus) or 8000 (HTTP).
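Once port 8002 is mapped, the metrics can be read with any Prometheus-compatible scraper or with a few lines of code. The sketch below assumes the container keeps Triton's default /metrics path and that samples carry no trailing timestamp; both are assumptions, and the function names are hypothetical.

```python
import urllib.request


def parse_prometheus_text(text):
    """Parse Prometheus text exposition into {metric_with_labels: value}."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and '# HELP' / '# TYPE' comment lines.
        if not line or line.startswith("#"):
            continue
        # The value is the last space-separated field (no timestamp assumed).
        key, _, value = line.rpartition(" ")
        samples[key] = float(value)
    return samples


def fetch_metrics(host, port=8002, timeout=5.0):
    """Fetch and parse metrics from http://<host>:<port>/metrics."""
    url = f"http://{host}:{port}/metrics"
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return parse_prometheus_text(response.read().decode("utf-8"))
```

Keys in the returned dictionary include the label set, so one metric name can appear several times, once per model or GPU.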
Docker-Compose Example
This docker-compose file will create a Speechmatics GPU translation server: