
Why is my self-hosted Engine instance running on the CPU?

When self-hosting Deepgram services, the Engine container is typically run on an instance with access to a GPU. A GPU runs inference orders of magnitude faster than a CPU. For the Engine to use the GPU, the host needs the necessary drivers, container runtime, and configuration settings. Below are some of the most common reasons why an Engine does not recognize or use the GPU.

Reasons why the Engine does not recognize a GPU

Compose File does not specify NVIDIA runtime

If using Docker or Podman, ensure that the Engine service in your Docker Compose or Podman Compose file uses the NVIDIA runtime (Docker) or requests the NVIDIA GPU device (Podman). If this is not explicitly specified, the Engine will not attempt to use a GPU.

Docker:

  # The speech engine service.
  engine:
    image: quay.io/deepgram/onprem-engine:release-<version>

    # Utilize a GPU, if available.
    runtime: nvidia

Podman:

  # The speech engine service.
  engine:
    image: quay.io/deepgram/onprem-engine:release-<version>

    # Utilize a GPU, if available.
    devices:
      - nvidia.com/gpu=all

If using Kubernetes, make sure the Engine Deployment manifest requests a GPU. The deepgram-self-hosted Helm chart applies this request automatically.
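For readers writing their own manifests rather than using the Helm chart, the following is a minimal sketch of what a GPU request can look like. It is not taken from the Deepgram chart: the Deployment name and labels are hypothetical, and it assumes the NVIDIA device plugin (for example, installed through the NVIDIA GPU Operator) is advertising the nvidia.com/gpu resource on your nodes.

```yaml
# Illustrative excerpt only; names and labels are hypothetical.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: deepgram-engine            # hypothetical name
spec:
  replicas: 1
  selector:
    matchLabels:
      app: deepgram-engine         # hypothetical label
  template:
    metadata:
      labels:
        app: deepgram-engine
    spec:
      containers:
        - name: engine
          image: quay.io/deepgram/onprem-engine:release-<version>
          resources:
            limits:
              # Extended resource exposed by the NVIDIA device plugin.
              # Without this request, the pod is not allocated a GPU.
              nvidia.com/gpu: 1
```

Kubernetes schedules the pod only onto a node with an unallocated GPU, so a pod stuck in Pending after adding this limit usually means no node is advertising the resource.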

CUDA version incompatibilities

The underlying system must have a compatible CUDA version installed. Since the October 2023 self-hosted release (quay.io/deepgram/onprem-engine:release-231026 or quay.io/deepgram/onprem-engine:3.59.2), the minimum CUDA version is 12.1.1.

No NVIDIA Container Runtime available

The NVIDIA Container Runtime is a container runtime which allows your containers to access the GPUs on the machine. It is typically installed with the NVIDIA Container Toolkit.

With Kubernetes, consider using the NVIDIA GPU Operator.

Missing drivers

If the NVIDIA runtime is enabled but the necessary drivers are not installed, this typically results in an error. See our self-hosted Drivers and Containerization Platforms documentation for information on installing the necessary drivers when using Docker/Podman. For Kubernetes, the deepgram-self-hosted Helm chart will automatically install the proper drivers for you.

Detecting Common NVIDIA Issues with Docker/Podman

Deepgram's public self-hosted resources include a diagnostic script that detects common NVIDIA issues when deploying with Docker/Podman. If you're encountering issues, you can download this script to the affected machine and run it to validate your setup.

Google Kubernetes Engine Bug from Oct 2023 - Jan 2024

There is a known library path issue with the following Deepgram self-hosted releases when run on Google Kubernetes Engine:

  • release-231026

  • release-231114

  • release-231207

  • release-240104

This prevents NVIDIA libraries from being detected, and the Engine containers will fall back to CPU mode even when a GPU is available.

The recommended workaround for these images is to upgrade to release-240228 or later.

How can I verify my self-hosted deployment is using the GPU?

On startup, the Engine container outputs logs indicating whether it is using the CPU or GPU(s).
