diff --git a/README.md b/README.md index da9d1ece..2f070c21 100644 --- a/README.md +++ b/README.md @@ -13,9 +13,9 @@ -MemGPT makes it easy to build and deploy stateful LLM agents with support for: -* Long term memory/state management -* Connections to [external data sources](https://memgpt.readme.io/docs/data_sources) (e.g. PDF files) for RAG +MemGPT makes it easy to build and deploy stateful LLM agents with support for: +* Long term memory/state management +* Connections to [external data sources](https://memgpt.readme.io/docs/data_sources) (e.g. PDF files) for RAG * Defining and calling [custom tools](https://memgpt.readme.io/docs/functions) (e.g. [google search](https://github.com/cpacker/MemGPT/blob/main/examples/google_search.py)) You can also use MemGPT to deploy agents as a *service*. You can use a MemGPT server to run a multi-user, multi-agent application on top of supported LLM providers. @@ -28,17 +28,18 @@ Install MemGPT: ```sh pip install -U pymemgpt ``` + To use MemGPT with OpenAI, set the environment variable `OPENAI_API_KEY` to your OpenAI key, then run: ``` memgpt quickstart --backend openai ``` -To use MemGPT with a free hosted endpoint, you run run: +To use MemGPT with a free hosted endpoint, you can run: ``` memgpt quickstart --backend memgpt ``` -For more advanced configuration options or to use a different [LLM backend](https://memgpt.readme.io/docs/endpoints) or [local LLMs](https://memgpt.readme.io/docs/local_llm), run `memgpt configure`. +For more advanced configuration options or to use a different [LLM backend](https://memgpt.readme.io/docs/endpoints) or [local LLMs](https://memgpt.readme.io/docs/local_llm), run `memgpt configure`. -## Quickstart (CLI) +## Quickstart (CLI) You can create and chat with a MemGPT agent by running `memgpt run` in your CLI.
The `run` command supports the following optional flags (see the [CLI documentation](https://memgpt.readme.io/docs/quickstart) for the full list of flags): * `--agent`: (str) Name of agent to create or to resume chatting with. * `--first`: (str) Allow user to send the first message. @@ -63,13 +64,12 @@ MemGPT provides a developer portal that enables you to easily create, edit, moni **Option 2:** Run with the CLI: 1. Run `memgpt server` -2. Go to `localhost:8283` in the browser to view the developer portal +2. Go to `localhost:8283` in the browser to view the developer portal Once the server is running, you can use the [Python client](https://memgpt.readme.io/docs/admin-client) or [REST API](https://memgpt.readme.io/reference/api) to connect to `memgpt.localhost` (if you're running with docker compose) or `localhost:8283` (if you're running with the CLI) to create users, agents, and more. The service requires authentication with a MemGPT admin password, which can be set by running `export MEMGPT_SERVER_PASS=password`. - -## Supported Endpoints & Backends -MemGPT is designed to be model and provider agnostic. The following LLM and embedding endpoints are supported: +## Supported Endpoints & Backends +MemGPT is designed to be model and provider agnostic. The following LLM and embedding endpoints are supported: | Provider | LLM Endpoint | Embedding Endpoint | |---------------------|-----------------|--------------------| diff --git a/docs/local_llm.md b/docs/local_llm.md index a25f8750..55a37439 100644 --- a/docs/local_llm.md +++ b/docs/local_llm.md @@ -38,7 +38,7 @@ poetry install -E local ### Quick overview -1. Put your own LLM behind a web server API (e.g. [oobabooga web UI](https://github.com/oobabooga/text-generation-webui#starting-the-web-ui)) +1. Put your own LLM behind a web server API (e.g.
[llama.cpp server](https://github.com/ggerganov/llama.cpp/tree/master/examples/server#quick-start) or [oobabooga web UI](https://github.com/oobabooga/text-generation-webui#starting-the-web-ui)) 2. Run `memgpt configure` and when prompted select your backend/endpoint type and endpoint address (a default will be provided but you may have to override it) For example, if we are running web UI (which defaults to port 5000) on the same computer as MemGPT, running `memgpt configure` would look like this: @@ -47,8 +47,8 @@ ? Select LLM inference provider: local ? Select LLM backend (select 'openai' if you have an OpenAI compatible proxy): webui ? Enter default endpoint: http://localhost:5000 -? Select default model wrapper (recommended: airoboros-l2-70b-2.1): airoboros-l2-70b-2.1 -? Select your model's context window (for Mistral 7B models, this is probably 8k / 8192): 8192 +? Select default model wrapper (the optimal choice depends on the specific LLM; for Llama 3 we recommend llama3-grammar, for legacy LLMs airoboros-l2-70b-2.1): llama3-grammar +? Select your model's context window (for Mistral 7B models and Meta-Llama-3-8B-Instruct, this is probably 8k / 8192): 8192 ? Select embedding provider: local ? Select default preset: memgpt_chat ? Select default persona: sam_pov @@ -77,7 +77,7 @@ When you use local LLMs, you can specify a **model wrapper** that changes how th You can change the wrapper used with the `--model-wrapper` flag: ```sh -memgpt run --model-wrapper airoboros-l2-70b-2.1 +memgpt run --model-wrapper llama3-grammar ``` You can see the full selection of model wrappers by running `memgpt configure`: @@ -86,8 +86,11 @@ You can see the full selection of model wrappers by running `memgpt configure`: ? Select LLM inference provider: local ? Select LLM backend (select 'openai' if you have an OpenAI compatible proxy): webui ? Enter default endpoint: http://localhost:5000 -?
Select default model wrapper (recommended: airoboros-l2-70b-2.1): (Use arrow keys) - » airoboros-l2-70b-2.1 +? Select default model wrapper (recommended: llama3-grammar for Llama 3 models, airoboros-l2-70b-2.1 for legacy models): (Use arrow keys) + » llama3 + llama3-grammar + llama3-hints-grammar + airoboros-l2-70b-2.1 airoboros-l2-70b-2.1-grammar dolphin-2.1-mistral-7b dolphin-2.1-mistral-7b-grammar @@ -123,6 +126,12 @@ If you would like us to support a new backend, feel free to open an issue or pul > > To see a list of recommended LLMs to use with MemGPT, visit our [Discord server](https://discord.gg/9GEQrxmVyE) and check the #model-chat channel. +Currently, one of the best models to run locally is Meta's [Llama-3-8B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct) or a quantized version of it, such as [Meta-Llama-3-8B-Instruct-Q6_K.gguf](https://huggingface.co/bartowski/Meta-Llama-3-8B-Instruct-GGUF). +If you are experimenting with MemGPT and local LLMs for the first time, we recommend you try the Dolphin Mistral finetune (e.g. [ehartford/dolphin-2.2.1-mistral-7b](https://huggingface.co/ehartford/dolphin-2.2.1-mistral-7b) or a quantized variant such as [dolphin-2.2.1-mistral-7b.Q6_K.gguf](https://huggingface.co/TheBloke/dolphin-2.2.1-mistral-7B-GGUF)), and use the default `airoboros` wrapper. Generating MemGPT-compatible outputs is a harder task for an LLM than regular text output. For this reason **we strongly advise users to NOT use models below Q5 quantization** - as the model gets worse, the number of errors you will encounter while using MemGPT will dramatically increase (MemGPT will not send messages properly, edit memory properly, etc.).
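Generating a MemGPT-compatible output means the model must reply with a single JSON function call. As a minimal sketch (the `function`/`params` field names mirror the format MemGPT's local-LLM wrappers parse; the concrete values here are made up for illustration), this is what a well-formed reply looks like and how it unpacks:

```python
import json

# Illustrative example of the JSON function-call format that MemGPT's
# local-LLM wrappers expect the model to emit (values are made up).
raw_llm_output = """
{
  "function": "send_message",
  "params": {
    "inner_thoughts": "First interaction with the user, I should introduce myself.",
    "message": "Hello! How can I help you today?"
  }
}
"""

parsed = json.loads(raw_llm_output)
function_name = parsed["function"]    # which MemGPT function to call
function_params = parsed["params"]    # its arguments, plus inner_thoughts
# inner_thoughts is split out from the real function arguments
inner_thoughts = function_params.pop("inner_thoughts")

print(function_name)  # send_message
```

A lower-quality or heavily quantized model will often break this structure (trailing text, unbalanced braces), which is part of why the `-grammar` wrapper variants exist: they constrain decoding so the output stays parseable.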
+ +> 📘 Advanced LLMs / models +> +> Enthusiasts with high-VRAM GPUs (3090, 4090) or Apple Silicon Macs with >32GB of unified memory might find the [IQ2_XS quantization of Llama-3-70B](https://huggingface.co/MaziyarPanahi/Meta-Llama-3-70B-Instruct-GGUF) interesting, as it is currently the highest-performing open-source/open-weights model. You can run it in llama.cpp with a setup such as this: `./server -m Meta-Llama-3-70B-Instruct.IQ2_XS.gguf --n-gpu-layers 99 --no-mmap --ctx-size 8192 -ctk q8_0 --chat-template llama3 --host 0.0.0.0 --port 8888`. diff --git a/memgpt/local_llm/llm_chat_completion_wrappers/llama3.py b/memgpt/local_llm/llm_chat_completion_wrappers/llama3.py new file mode 100644 index 00000000..2141bdaa --- /dev/null +++ b/memgpt/local_llm/llm_chat_completion_wrappers/llama3.py @@ -0,0 +1,349 @@ +import json + +from memgpt.constants import JSON_ENSURE_ASCII, JSON_LOADS_STRICT +from memgpt.errors import LLMJSONParsingError +from memgpt.local_llm.json_parser import clean_json +from memgpt.local_llm.llm_chat_completion_wrappers.wrapper_base import ( + LLMChatCompletionWrapper, +) + +PREFIX_HINT = """# Reminders: +# Important information about yourself and the user is stored in (limited) core memory +# You can modify core memory with core_memory_replace +# You can add to core memory with core_memory_append +# Less important information is stored in (unlimited) archival memory +# You can add to archival memory with archival_memory_insert +# You can search archival memory with archival_memory_search +# You will always see the statistics of archival memory, so you know if there is content inside it +# If you receive new important information about the user (or yourself), you immediately update your memory with core_memory_replace, core_memory_append, or archival_memory_insert""" + +FIRST_PREFIX_HINT = """# Reminders: +# This is your first interaction with the user!
+# Initial information about them is provided in the core memory user block +# Make sure to introduce yourself to them +# Your inner thoughts should be private, interesting, and creative +# Do NOT use inner thoughts to communicate with the user +# Use send_message to communicate with the user""" +# Don't forget to use send_message, otherwise the user won't see your message""" + + +class LLaMA3InnerMonologueWrapper(LLMChatCompletionWrapper): + """ChatML-style prompt formatter, tested for use with https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct""" + + supports_first_message = True + + def __init__( + self, + json_indent=2, + # simplify_json_content=True, + simplify_json_content=False, + clean_function_args=True, + include_assistant_prefix=True, + assistant_prefix_extra='\n{\n "function":', + assistant_prefix_extra_first_message='\n{\n "function": "send_message",', + allow_custom_roles=True, # allow roles outside user/assistant + use_system_role_in_user=False, # use the system role on user messages that don't use "type: user_message" + # allow_function_role=True, # use function role for function replies? + allow_function_role=False, # use function role for function replies? + no_function_role_role="assistant", # if no function role, which role to use? + no_function_role_prefix="FUNCTION RETURN:\n", # if no function role, what prefix to use? 
# add a guiding hint + assistant_prefix_hint=False, + ): + self.simplify_json_content = simplify_json_content + self.clean_func_args = clean_function_args + self.include_assistant_prefix = include_assistant_prefix + self.assistant_prefix_extra = assistant_prefix_extra + self.assistant_prefix_extra_first_message = assistant_prefix_extra_first_message + self.assistant_prefix_hint = assistant_prefix_hint + + # role-based + self.allow_custom_roles = allow_custom_roles + self.use_system_role_in_user = use_system_role_in_user + self.allow_function_role = allow_function_role + # extras for when the function role is disallowed + self.no_function_role_role = no_function_role_role + self.no_function_role_prefix = no_function_role_prefix + + # how to set json in prompt + self.json_indent = json_indent + + def _compile_function_description(self, schema, add_inner_thoughts=True) -> str: + """Go from a JSON schema to a string description for a prompt""" + # airoboros style + func_str = "" + func_str += f"{schema['name']}:" + func_str += f"\n description: {schema['description']}" + func_str += "\n params:" + if add_inner_thoughts: + from memgpt.local_llm.constants import ( + INNER_THOUGHTS_KWARG, + INNER_THOUGHTS_KWARG_DESCRIPTION, + ) + + func_str += f"\n {INNER_THOUGHTS_KWARG}: {INNER_THOUGHTS_KWARG_DESCRIPTION}" + for param_k, param_v in schema["parameters"]["properties"].items(): + # TODO we're ignoring type + func_str += f"\n {param_k}: {param_v['description']}" + # TODO we're ignoring schema['parameters']['required'] + return func_str + + def _compile_function_block(self, functions) -> str: + """functions dict -> string describing function choices""" + prompt = "" + + # prompt += f"\nPlease select the most suitable function and parameters from the list of available functions below, based on the user's input. Provide your response in JSON format."
+ prompt += "Please select the most suitable function and parameters from the list of available functions below, based on the ongoing conversation. Provide your response in JSON format." + prompt += "\nAvailable functions:" + for function_dict in functions: + prompt += f"\n{self._compile_function_description(function_dict)}" + + return prompt + + # NOTE: BOS/EOS chatml tokens are NOT inserted here + def _compile_system_message(self, system_message, functions, function_documentation=None) -> str: + """system prompt + memory + functions -> string""" + prompt = "" + prompt += system_message + prompt += "\n" + if function_documentation is not None: + prompt += "Please select the most suitable function and parameters from the list of available functions below, based on the ongoing conversation. Provide your response in JSON format." + prompt += "\nAvailable functions:\n" + prompt += function_documentation + else: + prompt += self._compile_function_block(functions) + return prompt + + def _compile_function_call(self, function_call, inner_thoughts=None): + """Go from ChatCompletion to Airoboros style function trace (in prompt) + + ChatCompletion data (inside message['function_call']): + "function_call": { + "name": ... + "arguments": { + "arg1": val1, + ... + } + + Airoboros output: + { + "function": "send_message", + "params": { + "message": "Hello there! I am Sam, an AI developed by Liminal Corp. How can I assist you today?" 
+ } + } + """ + airo_func_call = { + "function": function_call["name"], + "params": { + "inner_thoughts": inner_thoughts, + **json.loads(function_call["arguments"], strict=JSON_LOADS_STRICT), + }, + } + return json.dumps(airo_func_call, indent=self.json_indent, ensure_ascii=JSON_ENSURE_ASCII) + + # NOTE: BOS/EOS chatml tokens are NOT inserted here + def _compile_assistant_message(self, message) -> str: + """assistant message -> string""" + prompt = "" + + # need to add the function call if there was one + inner_thoughts = message["content"] + if "function_call" in message and message["function_call"]: + prompt += f"\n{self._compile_function_call(message['function_call'], inner_thoughts=inner_thoughts)}" + elif "tool_calls" in message and message["tool_calls"]: + for tool_call in message["tool_calls"]: + prompt += f"\n{self._compile_function_call(tool_call['function'], inner_thoughts=inner_thoughts)}" + else: + # TODO should we format this into JSON somehow? + prompt += inner_thoughts + + return prompt + + # NOTE: BOS/EOS chatml tokens are NOT inserted here + def _compile_user_message(self, message) -> str: + """user message (should be JSON) -> string""" + prompt = "" + if self.simplify_json_content: + # Make user messages not JSON but plaintext instead + try: + user_msg_json = json.loads(message["content"], strict=JSON_LOADS_STRICT) + user_msg_str = user_msg_json["message"] + except: + user_msg_str = message["content"] + else: + # Otherwise just dump the full json + try: + user_msg_json = json.loads(message["content"], strict=JSON_LOADS_STRICT) + user_msg_str = json.dumps( + user_msg_json, + indent=self.json_indent, + ensure_ascii=JSON_ENSURE_ASCII, + ) + except: + user_msg_str = message["content"] + + prompt += user_msg_str + return prompt + + # NOTE: BOS/EOS chatml tokens are NOT inserted here + def _compile_function_response(self, message) -> str: + """function response message (should be JSON) -> string""" + # TODO we should clean up send_message returns to 
avoid cluttering the prompt + prompt = "" + try: + # indent the function replies + function_return_dict = json.loads(message["content"], strict=JSON_LOADS_STRICT) + function_return_str = json.dumps( + function_return_dict, + indent=self.json_indent, + ensure_ascii=JSON_ENSURE_ASCII, + ) + except: + function_return_str = message["content"] + + prompt += function_return_str + return prompt + + def chat_completion_to_prompt(self, messages, functions, first_message=False, function_documentation=None): + """chatml-style prompt formatting, with implied support for multi-role""" + prompt = "<|begin_of_text|>" + + # System instructions go first + assert messages[0]["role"] == "system" + system_block = self._compile_system_message( + system_message=messages[0]["content"], + functions=functions, + function_documentation=function_documentation, + ) + prompt += f"<|start_header_id|>system<|end_header_id|>\n\n{system_block.strip()}<|eot_id|>" + + # Last are the user/assistant messages + for message in messages[1:]: + assert message["role"] in ["user", "assistant", "function", "tool"], message + + if message["role"] == "user": + # Support for AutoGen naming of agents + role_str = message["name"].strip().lower() if (self.allow_custom_roles and "name" in message) else message["role"] + msg_str = self._compile_user_message(message) + + if self.use_system_role_in_user: + try: + msg_json = json.loads(message["content"], strict=JSON_LOADS_STRICT) + if msg_json["type"] != "user_message": + role_str = "system" + except: + pass + prompt += f"\n<|start_header_id|>{role_str}<|end_header_id|>\n\n{msg_str.strip()}<|eot_id|>" + + elif message["role"] == "assistant": + # Support for AutoGen naming of agents + role_str = message["name"].strip().lower() if (self.allow_custom_roles and "name" in message) else message["role"] + msg_str = self._compile_assistant_message(message) + + prompt += f"\n<|start_header_id|>{role_str}<|end_header_id|>\n\n{msg_str.strip()}<|eot_id|>" + + elif message["role"]
in ["tool", "function"]: + if self.allow_function_role: + role_str = message["role"] + msg_str = self._compile_function_response(message) + prompt += f"\n<|start_header_id|>{role_str}<|end_header_id|>\n\n{msg_str.strip()}<|eot_id|>" + else: + # TODO figure out what to do with functions if we disallow function role + role_str = self.no_function_role_role + msg_str = self._compile_function_response(message) + func_resp_prefix = self.no_function_role_prefix + # NOTE whatever the special prefix is, it should also be a stop token + prompt += f"\n<|start_header_id|>{role_str}<|end_header_id|>\n\n{func_resp_prefix}{msg_str.strip()}<|eot_id|>" + + else: + raise ValueError(message) + + if self.include_assistant_prefix: + prompt += "\n<|start_header_id|>assistant<|end_header_id|>\n\n" + if self.assistant_prefix_hint: + prompt += f"\n{FIRST_PREFIX_HINT if first_message else PREFIX_HINT}" + if self.supports_first_message and first_message: + if self.assistant_prefix_extra_first_message: + prompt += self.assistant_prefix_extra_first_message + else: + if self.assistant_prefix_extra: + # assistant_prefix_extra='\n{\n "function":', + prompt += self.assistant_prefix_extra + + return prompt + + def _clean_function_args(self, function_name, function_args): + """Some basic MemGPT-specific cleaning of function args""" + cleaned_function_name = function_name + cleaned_function_args = function_args.copy() if function_args is not None else {} + + if function_name == "send_message": + # strip request_heartbeat + cleaned_function_args.pop("request_heartbeat", None) + + inner_thoughts = None + if "inner_thoughts" in function_args: + inner_thoughts = cleaned_function_args.pop("inner_thoughts") + + # TODO more cleaning to fix errors LLM makes + return inner_thoughts, cleaned_function_name, cleaned_function_args + + def output_to_chat_completion_response(self, raw_llm_output, first_message=False): + """Turn raw LLM output into a ChatCompletion style response with: + "message" = { + "role": "assistant", + "content": ...,
"function_call": { + "name": ... + "arguments": { + "arg1": val1, + ... + } + } + } + """ + # if self.include_opening_brance_in_prefix and raw_llm_output[0] != "{": + # raw_llm_output = "{" + raw_llm_output + assistant_prefix = self.assistant_prefix_extra_first_message if first_message else self.assistant_prefix_extra + if assistant_prefix and raw_llm_output[: len(assistant_prefix)] != assistant_prefix: + # print(f"adding prefix back to llm, raw_llm_output=\n{raw_llm_output}") + raw_llm_output = assistant_prefix + raw_llm_output + # print(f"->\n{raw_llm_output}") + + try: + # cover llama.cpp server for now #TODO remove this when fixed + raw_llm_output = raw_llm_output.rstrip() + if raw_llm_output.endswith("<|eot_id|>"): + raw_llm_output = raw_llm_output[: -len("<|eot_id|>")] + function_json_output = clean_json(raw_llm_output) + except Exception as e: + raise Exception(f"Failed to decode JSON from LLM output:\n{raw_llm_output} - error\n{str(e)}") + try: + # NOTE: weird bug can happen where 'function' gets nested if the prefix in the prompt isn't abided by + if isinstance(function_json_output["function"], dict): + function_json_output = function_json_output["function"] + # regular unpacking + function_name = function_json_output["function"] + function_parameters = function_json_output["params"] + except KeyError as e: + raise LLMJSONParsingError( + f"Received valid JSON from LLM, but JSON was missing fields: {str(e)}. 
JSON result was:\n{function_json_output}" + ) + + if self.clean_func_args: + ( + inner_thoughts, + function_name, + function_parameters, + ) = self._clean_function_args(function_name, function_parameters) + + message = { + "role": "assistant", + "content": inner_thoughts, + "function_call": { + "name": function_name, + "arguments": json.dumps(function_parameters, ensure_ascii=JSON_ENSURE_ASCII), + }, + } + return message diff --git a/memgpt/local_llm/utils.py b/memgpt/local_llm/utils.py index c41e66b4..496b074e 100644 --- a/memgpt/local_llm/utils.py +++ b/memgpt/local_llm/utils.py @@ -8,6 +8,7 @@ import memgpt.local_llm.llm_chat_completion_wrappers.airoboros as airoboros import memgpt.local_llm.llm_chat_completion_wrappers.chatml as chatml import memgpt.local_llm.llm_chat_completion_wrappers.configurable_wrapper as configurable_wrapper import memgpt.local_llm.llm_chat_completion_wrappers.dolphin as dolphin +import memgpt.local_llm.llm_chat_completion_wrappers.llama3 as llama3 import memgpt.local_llm.llm_chat_completion_wrappers.zephyr as zephyr @@ -71,6 +72,7 @@ def load_grammar_file(grammar): return grammar_str +# TODO: support tokenizers/tokenizer apis available in local models def count_tokens(s: str, model: str = "gpt-4") -> int: encoding = tiktoken.encoding_for_model(model) return len(encoding.encode(s)) @@ -220,6 +222,9 @@ def num_tokens_from_messages(messages: List[dict], model: str = "gpt-4") -> int: def get_available_wrappers() -> dict: return { + "llama3": llama3.LLaMA3InnerMonologueWrapper(), + "llama3-grammar": llama3.LLaMA3InnerMonologueWrapper(), + "llama3-hints-grammar": llama3.LLaMA3InnerMonologueWrapper(assistant_prefix_hint=True), "experimental-wrapper-neural-chat-grammar-noforce": configurable_wrapper.ConfigurableJSONWrapper( post_prompt="### Assistant:", sys_prompt_start="### System:\n",