| title | excerpt | category |
|---|---|---|
| vLLM | Setting up MemGPT with vLLM | 6580da9a40bb410016b8b0c3 |
- Download and install vLLM
- Launch a vLLM OpenAI-compatible API server by following the official vLLM documentation
For example, if we want to use the model dolphin-2.2.1-mistral-7b from HuggingFace, we would run:

```sh
python -m vllm.entrypoints.openai.api_server \
    --model ehartford/dolphin-2.2.1-mistral-7b
```
vLLM will automatically download the model (if it's not already downloaded) and store it in your HuggingFace cache directory.
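Once the server is up, you can sanity-check that it speaks the OpenAI-compatible API before configuring MemGPT. The sketch below uses only the Python standard library and the example endpoint and model name from above (adjust both to your setup); it builds a `/v1/completions` request but does not send it, so it runs even without the server.

```python
import json
import urllib.request

# Example values from the steps above -- swap in your own endpoint and model.
ENDPOINT = "http://localhost:8000"
MODEL = "ehartford/dolphin-2.2.1-mistral-7b"

def build_completion_request(endpoint: str, model: str, prompt: str) -> urllib.request.Request:
    """Build an OpenAI-style /v1/completions request for a vLLM server."""
    payload = {"model": model, "prompt": prompt, "max_tokens": 32}
    return urllib.request.Request(
        url=endpoint.rstrip("/") + "/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = build_completion_request(ENDPOINT, MODEL, "Hello, world!")
# To actually send it (server must be running):
#   with urllib.request.urlopen(req) as resp:
#       print(json.load(resp))
```

If the server responds with a JSON completion, the endpoint is ready for MemGPT.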
In the terminal where you're running MemGPT, run `memgpt configure` to set the default backend for MemGPT to point at vLLM:
```text
# if you are running vLLM locally, the default IP address + port will be http://localhost:8000
? Select LLM inference provider: local
? Select LLM backend (select 'openai' if you have an OpenAI compatible proxy): vllm
? Enter default endpoint: http://localhost:8000
? Enter HuggingFace model tag (e.g. ehartford/dolphin-2.2.1-mistral-7b): ehartford/dolphin-2.2.1-mistral-7b
...
```
If you have an existing agent that you want to move to the vLLM backend, add extra flags to `memgpt run`:

```sh
memgpt run --agent your_agent --model-endpoint-type vllm --model-endpoint http://localhost:8000 --model ehartford/dolphin-2.2.1-mistral-7b
```
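If you launch agents from a script, the same flags can be assembled programmatically. A minimal sketch (the agent name `your_agent` is the placeholder from above; the helper function is illustrative, not part of MemGPT):

```python
import shlex

def memgpt_run_argv(agent: str, endpoint: str, model: str) -> list[str]:
    """Assemble the `memgpt run` command line for a vLLM backend."""
    return [
        "memgpt", "run",
        "--agent", agent,
        "--model-endpoint-type", "vllm",
        "--model-endpoint", endpoint,
        "--model", model,
    ]

argv = memgpt_run_argv(
    "your_agent",
    "http://localhost:8000",
    "ehartford/dolphin-2.2.1-mistral-7b",
)
# Print a shell-safe version of the command, e.g. for logging.
print(shlex.join(argv))
```

You could pass `argv` to `subprocess.run` to launch the agent from another process.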