Most AI voice products use the same handful of off-the-shelf components in roughly the same order. There's a speech-to-text service that turns the caller's audio into words. A large language model that decides what to say back. A text-to-speech service that turns the response into audio. Some glue code in between to handle the call routing and timing.
You can build something that works in about a weekend, if you don't care how it sounds.
We cared.
When Aidan and Harrison started asking me to help build the voice pipeline, the first thing I did was call a bunch of AI receptionist products and listen to them. Not as a developer evaluating tech, but as a fake patient trying to book a fake appointment. The good ones sounded okay. The bad ones sounded so robotic it made me wince. Almost all of them had some kind of latency problem, where you'd ask a question and there'd be a long enough pause that your brain registered something was off, even if you couldn't put your finger on why.
This post is about why we made the technical choices we did, and what we'd tell anyone evaluating an AI voice product to ask about.
The latency problem nobody talks about
A normal phone conversation has a natural rhythm. You say something. The other person responds within about 200 milliseconds, maybe 400 if they're thinking. Anything longer than that starts to feel weird. By 800 milliseconds, the conversation is broken. By 1500 milliseconds, the caller has either started talking again or assumed you're not there.
Most AI voice products run somewhere over a second end-to-end. The reason is the standard pipeline: audio in, transcribe via API, hit the LLM via API, generate response, send to TTS via API, get audio back, play it. Each one of those API calls is a network hop. They stack up.
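To make the stacking concrete, here's a minimal sketch of that standard shape in Python. The function names and latency figures are illustrative placeholders, not measurements of any particular vendor:

```python
import time

# A sketch of the standard pipeline described above. The three service calls
# are simulated with sleeps; the numbers are illustrative, not real benchmarks.

def transcribe_via_api(audio: bytes) -> str:
    time.sleep(0.30)   # pretend STT round trip
    return "what time are you open saturday"

def generate_reply_via_api(text: str) -> str:
    time.sleep(0.45)   # pretend LLM round trip
    return "We're open 9 to 3 on Saturdays."

def synthesize_via_api(text: str) -> bytes:
    time.sleep(0.35)   # pretend TTS round trip
    return b"\x00" * 16000

def handle_turn(audio: bytes) -> bytes:
    start = time.monotonic()
    text = transcribe_via_api(audio)        # network hop 1
    reply = generate_reply_via_api(text)    # network hop 2
    audio_out = synthesize_via_api(reply)   # network hop 3
    elapsed_ms = (time.monotonic() - start) * 1000
    print(f"end-to-end: {elapsed_ms:.0f} ms")  # the hops stack: ~1100 ms here
    return audio_out

handle_turn(b"caller audio")
```

Three sequential hops at a few hundred milliseconds each and you're already past the point where the pause feels wrong.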
We knew before writing a line of code that we'd lose on quality if we couldn't get latency under 600 milliseconds. Patients can tolerate a slight delay. They cannot tolerate the rhythm of talking to something that sounds like it's loading.
The first big architectural decision we made was running embeddings locally. Most products use OpenAI's embedding API to convert spoken text into vectors for retrieval against a knowledge base. That's a network call that costs you 400 to 2000 milliseconds depending on the day. We use a smaller open-source embedding model that runs in the same process as the rest of the pipeline. It's faster (2 to 15 milliseconds vs hundreds) and it's free.
There's a tradeoff. The local embeddings are slightly less accurate than OpenAI's. We had to tune our retrieval thresholds lower to account for it. But the latency win was worth more than the accuracy loss in our testing, and the cost savings are real once you're handling thousands of calls per day per client.
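For the curious, here's roughly what the in-process approach looks like. This is a sketch, not our production code; the model name, snippets, and threshold value are stand-ins:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

# Illustrative sketch of in-process embeddings for retrieval. The model loads
# once and stays in memory, so each query is a local matrix operation rather
# than a network call.
model = SentenceTransformer("all-MiniLM-L6-v2")

# Knowledge-base snippets would be embedded ahead of time and cached.
kb_snippets = [
    "Aftercare for chemical peels: avoid direct sun for 48 hours.",
    "We accept new patients Monday through Saturday.",
]
kb_vectors = model.encode(kb_snippets, normalize_embeddings=True)

def retrieve(question: str, threshold: float = 0.45):
    """Return the best-matching snippet, or None if nothing clears the threshold."""
    q = model.encode([question], normalize_embeddings=True)[0]
    scores = kb_vectors @ q                     # cosine similarity (vectors are normalized)
    best = int(np.argmax(scores))
    # Local embeddings score a bit lower than hosted ones on the same matches,
    # which is why the threshold sits lower than it otherwise would.
    return kb_snippets[best] if scores[best] >= threshold else None
```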
Bypassing RAG when you can
The other thing we noticed early is that most patient calls aren't actually retrieval problems. They're lookup problems.
When a patient asks "do you have any specials right now," that's a structured data question with a definite answer. It doesn't need a vector search across the knowledge base. It needs to look at a table of active specials and read it back.
Same with hours. "What time are you open Saturday?" doesn't need to retrieve anything. Look at the hours table for the relevant location. Done.
Same with pricing on a specific named treatment.
The mistake a lot of vendors make is sending every question through the same RAG pipeline, which adds latency for no benefit and sometimes returns weird results because the embeddings don't capture structured data well. We built specific bypass paths for specials, pricing, and hours. Those questions skip the vector search entirely. The AI gets a structured answer in milliseconds and reads it back accurately.
For everything else (general questions, treatment information, post-care protocols, business policies), we still use RAG. But the bypass paths handle a meaningful percentage of inbound calls without ever touching it.
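Here's a stripped-down sketch of the routing idea. The keyword matching and in-memory tables are stand-ins for the real classifier and data layer, but the shape is the same: structured questions answer from tables, everything else falls through to retrieval:

```python
from dataclasses import dataclass

@dataclass
class Answer:
    text: str
    source: str  # "structured" or "rag"

# Simplified stand-ins for the real tenant data layer.
HOURS = {"saturday": "9am to 3pm", "sunday": "closed"}
SPECIALS = ["20% off dermaplaning through the end of the month"]
PRICES = {"botox": "$12 per unit"}

def run_rag_pipeline(question: str) -> str:
    # Placeholder for the vector-search path used for general questions.
    return "retrieved answer"

def answer_question(question: str) -> Answer:
    q = question.lower()
    # Bypass path 1: specials.
    if "special" in q or "promotion" in q:
        return Answer("Current specials: " + "; ".join(SPECIALS), "structured")
    # Bypass path 2: hours.
    if "hours" in q or "open" in q:
        for day, hours in HOURS.items():
            if day in q:
                return Answer(f"We're open {hours} on {day.title()}.", "structured")
    # Bypass path 3: pricing on a named treatment.
    for treatment, price in PRICES.items():
        if treatment in q and ("price" in q or "cost" in q or "how much" in q):
            return Answer(f"{treatment.title()} is {price}.", "structured")
    # General questions, aftercare, policies, etc. still go through retrieval.
    return Answer(run_rag_pipeline(question), "rag")

print(answer_question("What time are you open Saturday?").text)
```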
Why we offer multiple voices
Most products use one TTS service. We tested a lot of them. Inworld came out on top for our use case for two reasons: the voices sound natural, and the latency is low because it's optimized for real-time applications. Some of the more popular TTS services in this space are designed for batch generation (audiobooks, advertising) and aren't tuned for the kind of interruption-heavy back-and-forth that phone calls require.
We picked five voices that consistently scored high in our testing for clarity, warmth, and natural intonation, and we made them selectable per tenant. Different med spas have different brand vibes. A solo skincare studio in a suburb might want a softer, friendlier voice. A high-end aesthetics practice in a city might want something more polished. Letting clients pick lets them get a voice that fits the experience they're trying to create.
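Under the hood that's just a tenant-level setting rather than a global one. A hypothetical config sketch, with made-up voice IDs and field names:

```python
# Hypothetical per-tenant voice configuration. The tenant names, voice IDs,
# and fields are invented for illustration; the point is that voice is chosen
# per client, not hard-coded for the whole product.
TENANT_VOICE_CONFIG = {
    "sunrise-skincare-studio": {"voice_id": "warm_female_1", "speaking_rate": 1.0},
    "downtown-aesthetics": {"voice_id": "polished_female_2", "speaking_rate": 0.95},
}

def voice_for_tenant(tenant_id: str) -> dict:
    # Fall back to a sensible default if a tenant hasn't picked a voice yet.
    return TENANT_VOICE_CONFIG.get(
        tenant_id, {"voice_id": "warm_female_1", "speaking_rate": 1.0}
    )
```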
When the AI shouldn't try
This is the part most vendors don't get right.
We talk about it a lot in posts on this blog: AI needs to know when to step back. The technical implementation of that is more interesting than people realize.
We have what we call a confidence threshold. When the AI generates a response, it has a measure of how well the patient's question matches what's in the knowledge base. If that match is below a certain score, the AI doesn't try to bluff. It tells the patient it's going to have someone follow up, takes a callback number, and the conversation gets logged for staff to handle.
We also have a patient-specific check. This is a separate piece of logic that runs on every utterance to detect personal medical questions, complaints, or anything emotionally charged. If it triggers, the AI escalates immediately, regardless of the confidence score on the answer.
The reason we built two separate systems for this is that confidence thresholds alone aren't enough. A patient could ask "is it normal that my filler is bruising?" and the AI's confidence on retrieving general filler aftercare info from the knowledge base might be high. But the question is personal medical advice, and the AI shouldn't answer it regardless of how confident it feels.
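A stripped-down sketch of how the two checks compose. In production the patient-specific check is its own piece of logic rather than a keyword list, and the threshold and markers here are placeholders:

```python
# Two independent escalation checks, run on every answer attempt.
CONFIDENCE_THRESHOLD = 0.45
PERSONAL_MEDICAL_MARKERS = ("my ", "i'm ", "i am ", "is it normal", "should i be worried")

def needs_escalation(question: str, retrieval_score: float) -> bool:
    # Check 1: retrieval confidence. Below threshold, don't bluff -- take a
    # callback number and log the conversation for staff.
    if retrieval_score < CONFIDENCE_THRESHOLD:
        return True
    # Check 2: patient-specific or emotionally charged content. This runs on
    # every utterance and overrides confidence entirely.
    q = question.lower()
    return any(marker in q for marker in PERSONAL_MEDICAL_MARKERS)

# "Is it normal that my filler is bruising?" retrieves general aftercare info
# with high confidence, but the second check still escalates it.
assert needs_escalation("Is it normal that my filler is bruising?", retrieval_score=0.82)
```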
These two systems together catch most of the cases where AI shouldn't be on the call. Could we do better? Probably. We're still tuning. But the principle is right.
What we'd tell anyone evaluating an AI vendor
Three questions worth asking, from a developer's perspective.
What's your end-to-end latency? Most vendors will quote you the LLM latency, which is the easy part. Ask for the full pipeline: audio in to audio out. If they can't tell you, or if the answer is anything over a second, listen carefully to a real call. You'll hear it.
How do you handle structured data? Specials, pricing, hours. If they tell you everything goes through RAG, that's a sign their architecture isn't tuned for this domain. The good products have specific paths for the things that don't need retrieval.
How does the AI know when to escalate? Confidence thresholds, patient-specific checks, emotional intensity detection. The vendors with weak answers here will give you a vague "we have escalation logic" and not be able to walk through the specifics.
What we got right and what we didn't
Honest take on our own work.
What we got right: latency is good. Running embeddings locally was the most important architectural decision we made early on. The bypass paths for structured data are working. The voice quality is high enough that most callers don't realize they're talking to AI for the first 30 seconds.
What we don't do yet: the AI captures qualified lead information when a patient wants to book, but it doesn't actually book the appointment for you. That's the team's job. We chose this on purpose for now. Booking integrations are deep, every booking platform works differently, and we'd rather hand a clean lead to a human than book the wrong slot in your calendar. We're working on integrations with the major med spa booking platforms (Boulevard, Mangomint, Vagaro, Zenoti) and they'll come, but the right answer for now is human-in-the-loop on the booking step.
This is the kind of thing I think is worth being honest about. We built something we're proud of. It's not finished. The vendors who claim their AI handles everything perfectly are either lying or haven't shipped to enough real users yet to know what's actually hard.
Where MedspAI fits
If you've read this far and you care about how the technical decisions show up in the patient experience, that's exactly the kind of customer we're built for. Aidan and Harrison can walk you through the product side. I can walk you through the engineering side if you want to go deeper.