
Show HN: PromptTools – open-source tools for evaluating LLMs and vector DBs
37 points by krawfy | 5 comments on Hacker News.
Hey HN! We’re Kevin and Steve. We’re building PromptTools ( https://ift.tt/WSbC0fZ ): open-source, self-hostable tools for experimenting with, testing, and evaluating LLMs, vector databases, and prompts.

Evaluating prompts, LLMs, and vector databases is a painful, time-consuming, but necessary part of the product engineering process. Our tools let engineers do it in far less time. By “evaluating” we mean checking the quality of a model's response for a given use case, which is a combination of testing and benchmarking. As examples:

- For generated JSON, SQL, or Python, you can check that the output is actually valid JSON, valid SQL, or executable Python.
- For generated emails, you can use another model to assess the quality of the output against requirements, such as whether the email is written professionally.
- For a question-answering chatbot, you can check that the actual answer is semantically similar to an expected answer.

At Google, Steve worked with HuggingFace and Lightning to support running the newest open-source models on TPUs. He realized that while the open-source community was contributing incredibly powerful models, it wasn’t easy to discover and evaluate them. It wasn’t clear when you could use Llama or Falcon instead of GPT-4. We began looking for ways to simplify and scale this evaluation process.

With PromptTools, you can write a short Python script (as short as 5 lines) to run such checks across models, parameters, and prompts, and pass the results into an evaluation function to get scores. All of this can be executed on your local machine without sending data to third parties. Then we help you turn those experiments into unit tests and CI/CD that track your model’s performance over time.

Today we support all of the major model providers, including OpenAI, Anthropic, Google, HuggingFace, and even LlamaCpp, as well as vector databases like ChromaDB and Weaviate. You can evaluate responses via semantic similarity, auto-evaluation by a language model, or structured output validation for formats like JSON and Python. We even have a notebook UI for recording manual feedback.

Quickstart:

    pip install prompttools
    git clone https://ift.tt/gTtJGuh
    cd prompttools && jupyter notebook examples/notebooks/OpenAIChatExperiment.ipynb

For detailed instructions, see our documentation at https://ift.tt/SromYQ2 .

We also have a playground UI, built in Streamlit, which is currently in beta: https://ift.tt/WGkEOJR... . Launch it with:

    pip install prompttools
    git clone https://ift.tt/gTtJGuh
    cd prompttools && streamlit run prompttools/ui/playground.py

We’d love it if you tried our product out and let us know what you think! We just got started a month ago and we’re eager to get feedback and keep building.
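To make the first kind of check described above concrete, here is a minimal standalone sketch of structured-output validation: confirming that a model's output parses as JSON and that generated Python at least compiles. This is an illustration of the idea using only the standard library, not PromptTools' own validators.

    # Sketch of a structured-output check (illustrative, not PromptTools' code):
    # verify that model output parses as JSON and that generated Python compiles.
    import json

    def is_valid_json(output: str) -> bool:
        """Return True if the model output parses as JSON."""
        try:
            json.loads(output)
            return True
        except json.JSONDecodeError:
            return False

    def is_compilable_python(output: str) -> bool:
        """Return True if the model output is syntactically valid Python."""
        try:
            compile(output, "<llm-output>", "exec")
            return True
        except SyntaxError:
            return False

    print(is_valid_json('{"status": "ok"}'))        # True
    print(is_compilable_python("def f(:\n    pass"))  # False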
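The second kind of check, auto-evaluation by another model, can be sketched like this: send the generated email to a grading model along with a requirement and parse out a score. The grading prompt, model name, and scoring scale here are assumptions for the example, not PromptTools' built-in auto-eval.

    # Illustrative model-graded evaluation (assumed prompt and model, not
    # PromptTools' implementation): score an email against a requirement.
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    def grade_email(email: str, requirement: str = "written professionally") -> int:
        """Return a 1-5 score from a grading model for the given requirement."""
        response = client.chat.completions.create(
            model="gpt-4",
            temperature=0,
            messages=[{
                "role": "user",
                "content": (
                    f"Rate the following email from 1 to 5 on whether it is "
                    f"{requirement}. Reply with only the number.\n\n{email}"
                ),
            }],
        )
        return int(response.choices[0].message.content.strip())

    print(grade_email("hey, send the report asap thx"))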
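And the third check, semantic similarity against an expected answer, is commonly implemented by embedding both answers and comparing them with cosine similarity. The model choice and threshold below are assumptions for illustration; PromptTools may compute similarity differently under the hood. Requires `pip install sentence-transformers`.

    # Illustrative semantic-similarity check (assumed embedding model and
    # threshold): embed actual and expected answers, compare with cosine similarity.
    from sentence_transformers import SentenceTransformer, util

    model = SentenceTransformer("all-MiniLM-L6-v2")

    def is_semantically_similar(actual: str, expected: str, threshold: float = 0.8) -> bool:
        """Return True if the chatbot's answer is close enough to the expected answer."""
        embeddings = model.encode([actual, expected], convert_to_tensor=True)
        score = util.cos_sim(embeddings[0], embeddings[1]).item()
        return score >= threshold

    print(is_semantically_similar(
        "The capital of France is Paris.",
        "Paris is France's capital city.",
    ))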

August 1, 2023 at 11:23PM
