---
myst:
  html_meta:
    product-name: tt-inference-server, TT-QuietBox™, TT-LoudBox™, Wormhole™ Networked AI Processor, Blackhole™ Networked AI Processor, n150s, n150d, n300s, n300d, p100a, p150a, p150b
    technology-concepts: LLM, vLLM, Hugging Face, Llama 3, deployment
    document-type: how-to
---

# Deploying LLMs

This page demonstrates how to deploy large language models (LLMs) using the [tt-inference-server](https://github.com/tenstorrent/tt-inference-server) project. We currently use [vLLM](https://docs.vllm.ai/en/latest/) to serve LLMs for production applications, and deploying a model this way is also a convenient entry point into Tenstorrent's software ecosystem. You will learn how to prepare your Tenstorrent system, configure access to gated models on Hugging Face, and deploy a vLLM-powered API endpoint.

---

## **Before You Begin**

Before beginning this procedure, ensure that you have completed the base software installation. This process has specific system and hardware requirements.

:::{admonition} Important
:class: warning
* This guide assumes that you have already followed the [Installing the Tenstorrent Software Stack guide](./README.md).
* Deploying the recommended models requires a minimum of 360 GB of free disk space in your root partition.
:::
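
You can check the disk-space requirement above before you start. The following is a minimal sketch, not part of tt-inference-server: `check_free_space` is a hypothetical helper that reads the available space reported by `df` for a path and compares it with a threshold. It compares in GiB, which is slightly stricter than the 360 GB stated above.

```bash
# Hypothetical helper: succeed if <path> has at least <required_gb> GiB available.
check_free_space() {
  local required_gb path avail_kb avail_gb
  required_gb="$1"
  path="${2:-/}"
  # -P forces one line per filesystem; -k reports 1024-byte blocks.
  avail_kb=$(df -Pk "$path" | awk 'NR==2 {print $4}')
  avail_gb=$(( avail_kb / 1024 / 1024 ))
  if [ "$avail_gb" -ge "$required_gb" ]; then
    echo "OK: ${avail_gb} GiB free on ${path} (need ${required_gb})"
    return 0
  fi
  echo "Insufficient: ${avail_gb} GiB free on ${path} (need ${required_gb})" >&2
  return 1
}

check_free_space 360 / || echo "Free up disk space before continuing."
```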

### **(Wormhole™ only)**

:::{admonition} If you are using a TT-QuietBox™ or TT-LoudBox™, you must complete the following step
:class: warning

For these systems you must configure a system-level mesh topology between the Wormhole™ Networked AI Processors. Run the following script to install `tt-topology` and configure the mesh.

```bash
TMP_DIR=$(mktemp -d); (trap 'echo "---"; echo "Cleaning up..."; if type deactivate &>/dev/null; then deactivate; fi; echo "Removing temporary directory: $TMP_DIR"; rm -rf "$TMP_DIR"; cd; echo "Cleanup complete."' EXIT; trap 'echo -e "\033[0;31m!!! ERROR: Failed to configure mesh topology\033[0m"' ERR; set -e; cd "$TMP_DIR"; echo "Working in temporary directory: $TMP_DIR"; echo "---"; echo "Creating Python virtual environment..."; python3 -m venv tt-topology-venv; source tt-topology-venv/bin/activate; echo "Virtual environment activated."; echo "---"; echo "Installing tt-topology from git..."; pip install --quiet git+https://github.com/tenstorrent/tt-topology.git; echo "tt-topology installed."; echo "---"; echo "Running tt-topology command. This may take a moment..."; tt-topology -l mesh; echo "---"; echo "Script finished successfully.";)
```
:::

### **Installing Docker**
tt-inference-server requires [Docker](https://www.docker.com) to be installed. To install Docker, please follow the official [installation instructions](https://docs.docker.com/engine/install/ubuntu/).

Verify that the installation was successful by running the `hello-world` image:
```bash
sudo docker run hello-world
```

You must also follow the official [post-installation instructions](https://docs.docker.com/engine/install/linux-postinstall/) to run Docker without root permissions.

Verify that the post-installation was successful by running the `hello-world` image again, this time without root permissions:
```bash
docker run hello-world
```

:::{admonition} Important
:class: danger
Do not continue if you cannot run the `hello-world` image without root permissions. Make sure that you have followed all of the post-installation instructions.
:::
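
If the second `hello-world` run fails with a permission error, a common cause is that your user is not yet a member of the `docker` group in the current session. The sketch below assumes the standard `docker` group described in the official post-installation instructions. `in_docker_group` is a hypothetical helper name, not a Docker command.

```bash
# Hypothetical helper: succeed if the given group list (default: the current
# session's groups) contains the group "docker".
in_docker_group() {
  local groups_list
  groups_list="${1:-$(id -nG)}"
  case " $groups_list " in
    *" docker "*) return 0 ;;
    *) return 1 ;;
  esac
}

if in_docker_group; then
  echo "This session is in the docker group."
else
  echo "This session is not in the docker group."
fi
```

Group changes apply to new login sessions, so log out and back in (or run `newgrp docker`) after adding your user to the group.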

---

## Step 1: Getting Model Access on Hugging Face
The recommended large language models are gated and require a Hugging Face account.

### **1\. Request Access to the Model**

Visit the model's page on Hugging Face and follow the instructions to request access.

* For TT-QuietBox™ and TT-LoudBox™ systems, we recommend [meta-llama/Llama-3.3-70B-Instruct](https://huggingface.co/meta-llama/Llama-3.3-70B-Instruct)
* For add-in-card products (n-series, p-series), we recommend [meta-llama/Llama-3.1-8B-Instruct](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct)

For a full list of the currently available and tested models, please visit the [tt-inference-server GitHub page](https://github.com/tenstorrent/tt-inference-server).

:::{admonition} Important
:class: warning
Access is granted by the model owner and is not controlled by Tenstorrent. This process may take several days.
:::

### **2\. Create a Hugging Face Access Token**

Once you have access, [generate an access token](https://huggingface.co/docs/hub/en/security-tokens) with a minimum of **read** permissions. This token is required to download the model's weights from Hugging Face.

### **3\. Export the Token**

On the system where you will deploy the server, export your token as an environment variable.

```bash
export HF_TOKEN="<your-hugging-face-access-token>"
```
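
A quick sanity check can catch an unset variable or a placeholder that was pasted without being replaced. The sketch below is an offline check only. It assumes that Hugging Face user access tokens start with `hf_`, and `check_hf_token` is a hypothetical helper name. It never prints the token itself.

```bash
# Hypothetical helper: check that a token is set and has the assumed "hf_" prefix.
check_hf_token() {
  local token
  token="${1:-${HF_TOKEN:-}}"
  if [ -z "$token" ]; then
    echo "HF_TOKEN is not set in this shell." >&2
    return 1
  fi
  case "$token" in
    hf_?*)
      echo "HF_TOKEN looks plausible (${#token} characters)"
      return 0
      ;;
    *)
      echo "HF_TOKEN does not start with 'hf_'; was the placeholder replaced?" >&2
      return 1
      ;;
  esac
}

check_hf_token || echo "Export a valid token before continuing."
```

This does not confirm that the token is valid or that access has been granted.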

---

## Step 2: Configuring the vLLM Server

### **1\. Select Your Hardware**

Run the following script to specify your hardware. This script sets the required environment variables and selects the recommended model for your system.

```bash
select_device_and_model(){ echo -e "\nSelect a Tenstorrent system from the list below:"; PS3=$'\n#? '; options=("TT-QuietBox (Wormhole)" "TT-QuietBox (Blackhole)" "TT-LoudBox (Wormhole)" "TT-LoudBox (Blackhole)" "n150s" "n150d" "n300s" "n300d" "p100a" "p150a" "p150b" "Quit"); select opt in "${options[@]}"; do case "$opt" in "TT-QuietBox (Wormhole)") DEVICE="t3k"; MODEL="Llama-3.3-70B-Instruct";; "TT-QuietBox (Blackhole)") DEVICE="p150x4"; MODEL="Llama-3.3-70B-Instruct";; "TT-LoudBox (Wormhole)") DEVICE="t3k"; MODEL="Llama-3.3-70B-Instruct";; "TT-LoudBox (Blackhole)") DEVICE="p150x8"; MODEL="Llama-3.3-70B-Instruct";; "n150s"|"n150d") DEVICE="n150"; MODEL="Llama-3.1-8B-Instruct";; "n300s"|"n300d") DEVICE="n300"; MODEL="Llama-3.1-8B-Instruct";; "p100a") DEVICE="p100"; MODEL="Llama-3.1-8B-Instruct";; "p150a"|"p150b") DEVICE="p150"; MODEL="Llama-3.1-8B-Instruct";; "Quit") echo "❌ Exiting without setting any environment variables."; return;; *) echo "❌ Invalid option. Try again."; continue; esac; export DEVICE MODEL; echo -e "\n✅ DEVICE set to '$DEVICE'"; echo "✅ MODEL set to '$MODEL'"; break; done; }; select_device_and_model
```
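
The selection script exports `DEVICE` and `MODEL` only in the current shell session, so a new terminal starts without them. The following sketch confirms both are set before you continue. `check_selection` is a hypothetical helper name, and the `meta-llama/` prefix mirrors how the later steps on this page build the repository ID.

```bash
# Hypothetical helper: confirm that DEVICE and MODEL are set in this shell.
check_selection() {
  if [ -z "${DEVICE:-}" ] || [ -z "${MODEL:-}" ]; then
    echo "DEVICE or MODEL is not set; rerun the selection script in this shell." >&2
    return 1
  fi
  echo "Deploying meta-llama/${MODEL} on device '${DEVICE}'"
}

check_selection || echo "Fix the selection before launching the server."
```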

### **2\. Check Model Access**
Execute this script to confirm you can access the recommended model's weights:
```bash
check_hf_access() { [ -z "$MODEL" ] && { printf "✖ Error: MODEL is not set. Run the hardware selection script first.\n"; return 1; }; ! command -v curl &>/dev/null && { printf "✖ Error: curl is not installed.\n"; return 1; }; local REPO_ID="meta-llama/$MODEL"; local TOKEN=${HF_TOKEN:-$(cat "$HOME/.cache/huggingface/token" 2>/dev/null)}; [ -z "$TOKEN" ] && printf "ℹ️ Info: No Hugging Face token found.\n   You can only access public repositories.\n"; local AUTH_HEADER=""; [ -n "$TOKEN" ] && AUTH_HEADER="Authorization: Bearer $TOKEN"; printf "Checking access for: %s...\n" "$REPO_ID"; local URL="https://huggingface.co/$REPO_ID/resolve/main/config.json"; local HTTP_CODE=$(curl -s -L -o /dev/null -w "%{http_code}" -H "$AUTH_HEADER" "$URL"); case $HTTP_CODE in 200) printf "✔ Access granted.\n";; 401) printf "✖ Access denied (401 Unauthorized).\n  This is a private or gated repository.\n  Ensure your token is valid and has the correct permissions.\n";; 403) printf "✖ Access forbidden (403 Forbidden).\n  The repository is gated.\n  You need to visit the repository page on Hugging Face and request access.\n";; 404) printf "✖ Repository or 'config.json' not found (404 Not Found).\n  Please check if the repository ID '$REPO_ID' is correct.\n";; *) printf "✖ Failed to check access.\n  Received HTTP status code: %s\n" "$HTTP_CODE";; esac; }; HF_HUB_DISABLE_XET=1; check_hf_access;
```

If the script does not print `✔ Access granted.`, make sure that you have exported your Hugging Face token as described in [the above instructions](#3-export-the-token) and that your access request has been approved.

### **3\. Clone the Server Repository**

```bash
# 1. Clone the repository
git clone https://github.com/tenstorrent/tt-inference-server.git
cd tt-inference-server

# 2. Update all tags from the remote (standardizing the local list)
git fetch --tags

# 3. Checkout the highest Semantic Version tag starting with "v"
git checkout "$(git tag -l "v*" --sort=-v:refname | head -n 1)"
```
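
The checkout command relies on `--sort=-v:refname`, which orders tag names as version numbers rather than as plain strings. The sketch below illustrates why that matters, using GNU `sort -V` as a stand-in for git's version sort. The tag names are made up for the illustration and are not actual tt-inference-server releases.

```bash
# Hypothetical tag names, for illustration only.
tags='v0.2.1
v0.10.0
v0.9.0'

# A plain byte-wise sort compares "1" with "9" and ranks v0.9.0 highest.
lexical=$(printf '%s\n' "$tags" | LC_ALL=C sort | tail -n 1)

# A version sort compares 10 with 9 numerically and ranks v0.10.0 highest.
version=$(printf '%s\n' "$tags" | sort -V | tail -n 1)

echo "lexical sort picks: $lexical"
echo "version sort picks: $version"
```

After the checkout, `git describe --tags` shows which tag your working tree is on.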

---

## Step 3: Launching the vLLM Server

### **1\. Set the JWT Secret**

This string is the secret used to sign your server's API key, a JSON Web Token (JWT) that you generate in Step 4.

```bash
export JWT_SECRET="testing"
```
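
For anything beyond a local test, you may prefer a random secret over a fixed word. The following is one possible sketch, assuming `/dev/urandom` is available. `generate_secret` is a hypothetical helper name, and the 32-byte length is an arbitrary choice rather than a tt-inference-server requirement.

```bash
# Hypothetical helper: print 32 random bytes as 64 lowercase hex characters.
generate_secret() {
  head -c 32 /dev/urandom | od -An -tx1 | tr -d ' \n'
}

export JWT_SECRET="$(generate_secret)"
echo "JWT_SECRET set (${#JWT_SECRET} characters)"
```

Whichever value you choose, keep it exported in the same shell for the remaining steps, because the API key in Step 4 is derived from `$JWT_SECRET`.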

### **2\. Run the Deployment Script**

Execute the following command. The script prompts you for configuration details; in most cases, you may accept the default values.

```bash
python3 run.py --model "$MODEL" --device "$DEVICE" --workflow server --docker-server
```

:::{Important}
The first time you run this command, it will download the model's weights. This download can take more than 30 minutes.
:::

### **3\. Wait for the Server to Initialize**
After the download completes, the server starts initializing in a Docker container.

:::{Important}
The first time you start the server, the initialization process for a 70B model should take about 40 minutes. For an 8B model it should take about 10 minutes.
:::

---

## Step 4: Testing the Server Endpoint

### **1\. Check Server Health**

Use the following command to check if the server is ready.

```bash
check_server_health(){ code=$(curl -s -o /dev/null -w "%{http_code}" http://localhost:8000/health); exit_code=$?; if [[ $exit_code -ne 0 ]]; then echo "❌ Error: Unable to connect to server at localhost:8000"; elif [[ $code -eq 200 ]]; then echo "✅ Server is ready (HTTP 200)"; else echo "⚠️ Server responded with status: $code"; fi; }; check_server_health
```

Wait until the output indicates `✅ Server is ready (HTTP 200)`.
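
Because initialization can take tens of minutes, you may prefer to poll the endpoint instead of rerunning the check by hand. The sketch below wraps the same `/health` request in a loop with a timeout. `wait_for_health` is a hypothetical helper name, and the default timeout (3600 seconds) and polling interval (30 seconds) are assumptions you can adjust.

```bash
# Hypothetical helper: poll a health URL until it returns HTTP 200 or the timeout expires.
# Usage: wait_for_health [url] [timeout_seconds] [interval_seconds]
wait_for_health() {
  local url timeout_s interval_s deadline code
  url="${1:-http://localhost:8000/health}"
  timeout_s="${2:-3600}"
  interval_s="${3:-30}"
  deadline=$(( $(date +%s) + timeout_s ))
  while :; do
    # A failed connection is recorded as status 000.
    code=$(curl -s -o /dev/null -w "%{http_code}" --max-time 5 "$url" 2>/dev/null) || code="000"
    if [ "$code" = "200" ]; then
      echo "Server is ready (HTTP 200)"
      return 0
    fi
    if [ "$(date +%s)" -ge "$deadline" ]; then
      echo "Timed out after ${timeout_s}s (last status: ${code})" >&2
      return 1
    fi
    sleep "$interval_s"
  done
}
```

Run `wait_for_health` with no arguments to use the defaults. The function returns a non-zero status on timeout, so it can gate later commands in a script.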

### **2\. Generate an API Key**

Your `JWT_SECRET` is used to create an API key for authenticating requests.

```bash
python3 -m venv request-venv
source request-venv/bin/activate
pip3 install --upgrade pip
pip install pyjwt==2.7.0
export VLLM_API_KEY=$(python3 -c 'import os; import json; import jwt; json_payload = json.loads("{\"team_id\": \"tenstorrent\", \"token_id\": \"debug-test\"}"); encoded_jwt = jwt.encode(json_payload, os.environ["JWT_SECRET"], algorithm="HS256"); print(encoded_jwt)')
```
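
The resulting key is a JWT: three base64url-encoded segments separated by dots (header, payload, signature). As a sketch for inspecting what was encoded, the hypothetical helper `jwt_payload` below decodes the middle segment. It assumes GNU `base64` and does not verify the signature.

```bash
# Hypothetical helper: print the decoded payload (second segment) of a JWT.
jwt_payload() {
  local seg
  # Take the second dot-separated segment and map base64url to standard base64.
  seg=$(printf '%s' "$1" | cut -d. -f2 | tr '_-' '/+')
  # Restore the padding that base64url omits.
  case $(( ${#seg} % 4 )) in
    2) seg="${seg}==" ;;
    3) seg="${seg}=" ;;
  esac
  printf '%s' "$seg" | base64 -d
}

# Inspect the key generated above, if it is set in this shell.
if [ -n "${VLLM_API_KEY:-}" ]; then
  jwt_payload "$VLLM_API_KEY"
  echo
fi
```

For the key generated above, the output should contain the `team_id` and `token_id` fields from the payload.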

### **3\. Send an Example Request**

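The vLLM server exposes an OpenAI-compatible API. The first request is slow because the server performs a final warmup. The example pipes the response through `jq` for pretty-printing; install `jq` or drop the `| jq` suffix if it is not available.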

```bash
# Warmup request
curl -sS "http://localhost:8000/v1/completions" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $VLLM_API_KEY" \
  -d "{
    \"model\": \"meta-llama/$MODEL\",
    \"prompt\": \"San Francisco is a\",
    \"max_tokens\": 50,
    \"temperature\": 0
  }" | jq
```

Run the command again to observe the server respond at full speed.
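
If you only want the generated text rather than the full JSON response, you can extract it from the response body. The sketch below assumes the response follows the OpenAI completions format, with the text in `choices[0].text`. `completion_text` is a hypothetical helper name. It uses `jq` when available and falls back to `python3` otherwise.

```bash
# Hypothetical helper: read a completions response on stdin and print the generated text.
completion_text() {
  if command -v jq >/dev/null 2>&1; then
    jq -r '.choices[0].text'
  else
    python3 -c 'import json, sys; print(json.load(sys.stdin)["choices"][0]["text"])'
  fi
}

# Example with a canned response (illustrative text, not real model output).
printf '%s' '{"choices":[{"text":" example output"}]}' | completion_text
```

To use it with the request above, replace the trailing `| jq` with `| completion_text`.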

---

## **Need Additional Support?**
If you encounter any issues or have a question that isn't covered in the documentation, please [raise a support request](https://tenstorrent.atlassian.net/servicedesk/customer/portal/1). Our team will review your request and provide assistance.