
Launch HN: Hamming (YC S24) – Automated Testing for Voice Agents
21 by sumanyusharma | 5 comments on Hacker News.
Hi HN! Sumanyu and Marius here from Hamming ( https://www.hamming.ai ). Hamming lets you automatically test your LLM voice agent. In our interactive demo, you play the role of the voice agent, and our agent will play the role of a difficult end user. We'll then score your performance on the call. Try it here: https://ift.tt/BjALpzs (no signup needed). In practice, our agents call your agent!

LLM voice agents currently require a lot of iteration and tuning. For example, one of our customers is building an LLM drive-through voice agent for fast food chains. Their KPI is order accuracy. It's crucial for their system to gracefully handle dietary restrictions like allergies and customers who get distracted or otherwise change their minds mid-order. Mistakes in this context could lead to unhappy customers, potential health risks, and financial losses.

How do you make sure that such a thing actually works? Most teams spend hours calling their voice agent to find bugs, change the prompt or function definitions, and then call their voice agent again to ensure they fixed the problem and didn't create regressions. This is slow, ad hoc, and feels like a waste of time. In other areas of software development, automated testing has already eliminated this kind of repetitive grunt work, so why not here, too?

We initially spent a few months helping users create evals for prompts & LLM pipelines, but noticed two things: 1) many of our friends were building LLM voice agents, and 2) they were spending too much time on manual testing. This gave us evidence that there will be more voice companies in the future, and they will need something to make the iteration process easier. We decided to build it!

Our solution involves four steps:

(1) Create diverse but realistic user personas and scenarios covering the expected conversation space. We create these ourselves for each of our customers. Getting LLMs to create diverse scenarios, even with high temperatures, is surprisingly tricky. We're learning a lot of tricks along the way from the folks at https://ift.tt/d9tJUmK for creating more randomness and more faithful role-play.

(2) Have our agents call your agent to test its ability to handle things like background noise, long silences, or interruptions. Or have us test just the LLM / logic layer (function calls, etc.) via an API hook.

(3) Score the outputs of each conversation using deterministic checks and LLM judges tailored to the specific problem domain (e.g., order accuracy, tone, friendliness). An LLM judge reviews the entire conversation transcript (including function calls and traces) against predefined success criteria, using examples of both good and bad transcripts as references. It then provides a classification output and detailed reasoning to justify its decisions. Building LLM judges that consistently align with human preferences is challenging, but we're improving with each judge we manually develop. A rough sketch of the shape of such a judge follows this list.

(4) Re-use the checks and judges above to score production traffic and track quality metrics in production (i.e., online evals).
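To make step (3) concrete, here's a rough sketch of a transcript-level judge combined with a deterministic check, using the drive-through example above. It's illustrative only: the transcript fields (submitted_order, confirmed_items), the success criteria, and the OpenAI-style client are stand-ins rather than our production code, and step (4) amounts to pointing the same scoring entry point at production transcripts.

    # Illustrative sketch only: the transcript schema, success criteria, and
    # OpenAI-style client below are stand-ins, not Hamming's production code.
    import json

    from openai import OpenAI  # pip install openai

    client = OpenAI()  # expects OPENAI_API_KEY in the environment

    JUDGE_SYSTEM_PROMPT = """You are grading a drive-through voice agent.
    Success criteria:
    - Every item the customer confirmed appears in the submitted order (order accuracy).
    - Stated allergies and dietary restrictions are respected.
    - The agent stays polite when the customer changes their mind mid-order.
    Respond with JSON: {"verdict": "pass" or "fail", "reasoning": "<short explanation>"}"""


    def deterministic_order_check(transcript: dict) -> bool:
        """Cheap pre-check: the submitted order must match what the customer confirmed."""
        submitted = {item.lower() for item in transcript["submitted_order"]}
        confirmed = {item.lower() for item in transcript["confirmed_items"]}
        return submitted == confirmed


    def llm_judge(transcript: dict, reference_examples: list[dict]) -> dict:
        """Classify a full conversation (turns plus function calls) against the criteria."""
        messages = [{"role": "system", "content": JUDGE_SYSTEM_PROMPT}]
        for example in reference_examples:  # labeled good and bad transcripts
            messages.append({"role": "user", "content": json.dumps(example["transcript"])})
            messages.append({"role": "assistant", "content": json.dumps(example["label"])})
        messages.append({"role": "user", "content": json.dumps(transcript)})

        response = client.chat.completions.create(
            model="gpt-4o",
            temperature=0,  # judges should be repeatable
            response_format={"type": "json_object"},  # force parseable output
            messages=messages,
        )
        return json.loads(response.choices[0].message.content)


    def score_conversation(transcript: dict, reference_examples: list[dict]) -> dict:
        """One entry point for both offline test runs and production traffic (step 4)."""
        if not deterministic_order_check(transcript):
            return {"verdict": "fail", "reasoning": "Submitted order does not match confirmed items."}
        return llm_judge(transcript, reference_examples)

The deterministic check runs first because it's cheap and unambiguous; the LLM judge only weighs in on the fuzzier criteria like tone and how gracefully mid-order changes were handled.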
We created a Loom recording showing our customers' logged-in experience. We cover how you store and manage scenarios, how you can trigger an experiment run, and how we score each transcript. See the video here: https://ift.tt/RS3OB61

We're inspired by our experiences at Tesla, where Sumanyu led growth initiatives as a data scientist, and Anduril, where Marius headed a data infrastructure team. At both companies, simulations were key to testing autonomous systems before deployment. A common challenge, however, was that simulations often fell short of capturing real-world complexity, resulting in outcomes that didn't always translate to reality. In voice testing, we're optimistic about overcoming this issue. With tools like PlayHT and ElevenLabs, we can generate highly realistic voice interactions, and by integrating LLMs that exhibit human-like reasoning, we hope our simulations will closely replicate how real users interact with voice agents.

For now, we're manually onboarding and activating each user. We're working hard to make it self-serve in the next few weeks. The demo at https://ift.tt/BjALpzs doesn't require any signup, though! Our current pricing is a mix of usage and the number of seats: https://ift.tt/w21pCvA . We don't use customer data for training purposes or to benefit other customers, and we don't sell any data. We use PostHog to track usage. We're in the process of getting HIPAA compliance, with SOC 2 next on the list.

Looking ahead, we're focused on making scenario generation and LLM judge creation more automated and self-serve. We also want to create personas based on real production conversations to make it easier to ‘replay’ a user on demand. A natural next step beyond testing is optimization: we're considering building a voice agent optimizer (like DSPy) that takes the scenarios that failed during testing and generates a new set of prompts or function call definitions to make them pass. We find the potential of self-play and self-improvement here super exciting.

We'd love to hear about your experiences with voice agents, whether as a user or as someone building them. If you're building in the voice or agentic space, we're curious about what is working well for you and what challenges you are encountering. We're eager to learn from your insights about setting up evals and simulation pipelines, or your thoughts on where this space is heading.

August 15, 2024 at 10:44PM
