---
title: vLLM
slug: guides/server/providers/vllm
---

<Tip>To use Letta with vLLM, set the environment variable `VLLM_API_BASE` to point to your vLLM ChatCompletions server.</Tip>

## Setting up vLLM
1. Download and install [vLLM](https://docs.vllm.ai/en/latest/getting_started/installation.html)
2. Launch a vLLM **OpenAI-compatible** API server by following [the official vLLM documentation](https://docs.vllm.ai/en/latest/getting_started/quickstart.html)

For example, to serve the model `dolphin-2.2.1-mistral-7b` from [Hugging Face](https://huggingface.co/ehartford/dolphin-2.2.1-mistral-7b), run:

```sh
# Start vLLM's OpenAI-compatible server (listens on port 8000 by default)
python -m vllm.entrypoints.openai.api_server \
  --model ehartford/dolphin-2.2.1-mistral-7b
```

vLLM automatically downloads the model if it is not already present and stores it in your [Hugging Face cache directory](https://huggingface.co/docs/datasets/cache).
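
Before pointing Letta at the server, it can help to confirm that the endpoint is reachable. The sketch below is not part of Letta or vLLM: `vllm_models_url` and `check_vllm` are hypothetical helper names, and it assumes that `curl` is installed and that the base URL carries no `/v1` suffix, as in the examples in this guide. It queries `/v1/models`, the model-listing route of vLLM's OpenAI-compatible server.

```sh
# Hypothetical helpers (not provided by Letta or vLLM).

# Build the model-listing URL from a base URL, tolerating one trailing slash.
# Falls back to VLLM_API_BASE, then to vLLM's default http://localhost:8000.
vllm_models_url() {
  base="${1:-${VLLM_API_BASE:-http://localhost:8000}}"
  printf '%s/v1/models\n' "${base%/}"
}

# Exit status 0 only if the server answers the request successfully.
check_vllm() {
  curl --silent --fail --max-time 5 "$(vllm_models_url "${1:-}")" > /dev/null
}
```

For example, `check_vllm http://localhost:8000 && echo "vLLM is up"` prints the message only once the server is accepting requests.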

## Enabling vLLM as a provider
To enable the vLLM provider, you must set the `VLLM_API_BASE` environment variable. When it is set, Letta will use the LLM and embedding models available on your vLLM server.

### Using the `docker run` server with vLLM

**macOS/Windows:**
Because vLLM runs on the host while Letta runs inside a container, use `host.docker.internal` instead of `localhost` to reach the vLLM server.
```bash
# replace `~/.letta/.persist/pgdata` with wherever you want to store your agent data
docker run \
  -v ~/.letta/.persist/pgdata:/var/lib/postgresql/data \
  -p 8283:8283 \
  -e VLLM_API_BASE="http://host.docker.internal:8000" \
  letta/letta:latest
```
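
**Linux:**
Run the container with `--network host` so that `localhost` inside the container refers to the host. Port publishing with `-p` is not needed in this mode: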
```bash
docker run \
  -v ~/.letta/.persist/pgdata:/var/lib/postgresql/data \
  --network host \
  -e VLLM_API_BASE="http://localhost:8000" \
  letta/letta:latest
```
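
The two cases above differ only in the hostname and the networking flag. As a rough sketch, a wrapper script could choose between them from the output of `uname -s`. `letta_vllm_base` is a hypothetical helper, not something Letta ships, and it assumes that vLLM listens on its default port 8000 and that any non-Linux host uses Docker Desktop. Linux hosts running Docker Desktop or WSL may need different handling.

```bash
# Hypothetical helper: print the VLLM_API_BASE value to pass to the container.
# Argument: kernel name (defaults to `uname -s`). Assumes vLLM on port 8000.
letta_vllm_base() {
  case "${1:-$(uname -s)}" in
    Linux) printf 'http://localhost:8000\n' ;;            # pair with --network host
    *)     printf 'http://host.docker.internal:8000\n' ;; # macOS/Windows
  esac
}
```

The result can be passed to the container as `-e VLLM_API_BASE="$(letta_vllm_base)"`. On Linux, `--network host` is still required, as in the command above.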

<Accordion icon="square-terminal" title="CLI (pypi only)">
### Using `letta run` and `letta server` with vLLM
To chat with an agent, run:
```bash
export VLLM_API_BASE="http://localhost:8000"
letta run
```
To run the Letta server, run:
```bash
export VLLM_API_BASE="http://localhost:8000"
letta server
```
To select the model used by the server, use the dropdown in the ADE or specify an `LLMConfig` object in the Python SDK.
</Accordion>