In vocal performance, tessitura is the range where a voice sounds most natural — narrower than the full range. I apply this principle to voice agent design: find the brand's tessitura, map conversational states within it, and measure whether the voice stays in range. 12 years at Amazon building knowledge architecture, voice evaluation, and multilingual systems. MA Comparative Linguistics, Paris-Sorbonne.
I evaluated two voice models for a customer support scenario. Both had the same structural failure: the voice left the brand's tessitura on every emotional moment. Short empathy phrases rushed at informational speed. Closing lines spiked +46% in pitch — a persona break. The model had no signal that the content was emotionally weighted. I found the tessitura, mapped 5 states within it, and measured the improvement.
Both models showed only a 0–7% empathy pacing delta (target: ≥15% slower on empathy lines). A +46% pitch spike on goodbye broke persona coherence. The model slows for longer sentences, not for emotional content. This is a pacing design problem, not a model quality problem.
5-state pacing model within Minted's tessitura (0.83–1.30 speedAlpha). Each state is a measured point within the brand's emotional range: Acceptance at the deepest point, Resolution at the center, Boundary firm but still warm, Confirmation slowest for detail capture, Close matching the opening. Going outside that band is the vocal equivalent of straining.
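The 5-state model can be sketched as a small lookup table. The tessitura band (0.83–1.30 speedAlpha) is the measured one from the case study; the per-state values and pause durations below are illustrative placeholders, not the calibrated Minted parameters.

```python
# Sketch of the 5-state pacing model. Band limits come from the case
# study; per-state values are illustrative, to be calibrated per brand.
TESSITURA_BAND = (0.83, 1.30)  # speedAlpha limits for the brand

PACING_STATES = {
    "acceptance":   {"speed_alpha": 0.85, "pre_pause_ms": 400},  # deepest point
    "resolution":   {"speed_alpha": 1.05, "pre_pause_ms": 150},  # center of band
    "boundary":     {"speed_alpha": 0.98, "pre_pause_ms": 200},  # firm, still warm
    "confirmation": {"speed_alpha": 0.83, "pre_pause_ms": 250},  # slowest: detail capture
    "close":        {"speed_alpha": 1.00, "pre_pause_ms": 100},  # matches the opening
}

def in_tessitura(state: str) -> bool:
    """Check that a state's pacing stays inside the brand band."""
    lo, hi = TESSITURA_BAND
    return lo <= PACING_STATES[state]["speed_alpha"] <= hi
```

Every state is validated against the band, so a parameter tweak that would push the voice into "straining" territory fails before it ships.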
Play the A/B comparison →

TTS models know their range — every speed they can produce. They don't know their tessitura — which speeds serve which emotional moments. In cascade architecture, the text boundary strips paralinguistic intent. Speed parameters are an imprecise proxy. Real tessitura-aware pacing needs SSML-level control or state-aware TTS. That's the engineering gap worth solving.
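One way to carry emotional state across the text boundary is to emit SSML prosody hints per state. A minimal sketch, assuming a provider that honors standard SSML; the rate/pitch values are illustrative, and real SSML support (and accepted units) varies by TTS vendor.

```python
# Sketch: wrap agent text in SSML so the conversational state survives
# the text boundary. Values are illustrative, not calibrated parameters.
from html import escape

STATE_PROSODY = {
    "acceptance":   {"rate": "85%",  "pitch": "-1st"},
    "confirmation": {"rate": "80%",  "pitch": "-0.5st"},
    "close":        {"rate": "100%", "pitch": "+0st"},
}

def to_ssml(text: str, state: str, pre_pause_ms: int = 0) -> str:
    """Render one utterance as SSML with state-aware prosody and pre-pause."""
    p = STATE_PROSODY[state]
    pause = f'<break time="{pre_pause_ms}ms"/>' if pre_pause_ms else ""
    return (
        f"<speak>{pause}"
        f'<prosody rate="{p["rate"]}" pitch="{p["pitch"]}">'
        f"{escape(text)}</prosody></speak>"
    )
```

Usage: `to_ssml("I'm sorry that happened.", "acceptance", pre_pause_ms=400)` produces the 400ms pre-pause and slowed rate before the empathy line is spoken.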
| Metric | Baseline | Designed | Why it matters |
|---|---|---|---|
| Empathy pace | 3.4 WPS | 2.9 WPS | Empathy sounds felt, not recited |
| Empathy vs non-empathy | −3% | −16% | Voice changes for emotional state |
| Empathy pre-pause | ~0ms | 400ms | Caller feels heard before agent speaks |
| Boundary contrast | — | +0.4 WPS above empathy | Honesty register is firmer than empathy |
| Closing pitch delta | +46% | Consistent | Same person says goodbye |
Current benchmarks — tau-Voice, EVA, Scale AI Voice Showdown — measure task completion, latency, and preference. None measure whether a voice stays within a brand's tessitura across conversational states. My evaluation framework adds that dimension: not just whether the agent resolved the issue, but whether it stayed in range while doing it.
Automated, brand-agnostic. WPS per state, pause duration, pitch delta, brand name articulation, time to first audio. Same measurements work for any brand. Runs on every voice change, prompt change, or model update.
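Two of those automated checks can be sketched from word-level timestamps (e.g. an STT aligner's output). The segment field names here are illustrative, not the production schema.

```python
# Sketch of the brand-agnostic automated checks: words-per-second per
# state and the empathy pacing delta, from word-timestamped utterances.
from statistics import mean

def wps(segment: dict) -> float:
    """Words per second for one utterance: {'words': [...], 'start': s, 'end': s}."""
    dur = segment["end"] - segment["start"]
    return len(segment["words"]) / dur if dur > 0 else 0.0

def empathy_delta(segments: list[dict]) -> float:
    """Relative pace change on empathy-tagged lines vs the rest (negative = slower)."""
    emp = [wps(s) for s in segments if s["state"] == "acceptance"]
    rest = [wps(s) for s in segments if s["state"] != "acceptance"]
    return (mean(emp) - mean(rest)) / mean(rest)
```

Because the checks only need timestamps and state tags, the same script runs unchanged on every voice, prompt, or model update.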
Human-rated 1–5 across 5 dimensions: emotional acknowledgment (25%), clarity under stress (20%), confidence without coldness (20%), trust during confirmation (20%), closure quality (15%). Dimensions are universal — calibration changes per brand.
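The weighted rollup of those human ratings is trivial to make explicit. A thin sketch using the weights stated above; dimension keys are my naming, and per-brand calibration would live alongside the weights.

```python
# Weighted composite of the 5 human-rated dimensions (1-5 scale).
# Weights are from the rubric; key names are illustrative.
RUBRIC_WEIGHTS = {
    "emotional_acknowledgment":    0.25,
    "clarity_under_stress":        0.20,
    "confidence_without_coldness": 0.20,
    "trust_during_confirmation":   0.20,
    "closure_quality":             0.15,
}

def rubric_score(ratings: dict[str, int]) -> float:
    """Weighted 1-5 composite; expects exactly one rating per dimension."""
    assert set(ratings) == set(RUBRIC_WEIGHTS), "missing or extra dimensions"
    return sum(RUBRIC_WEIGHTS[d] * r for d, r in ratings.items())
```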
Binary checklist: customer understands root cause, accepts rescue plan, details confirmed, knows next watchpoint, feels confident at close. Plus closing sentiment delta — did the customer's emotional state improve? The metric that only emotional pacing can move.
If I were choosing a brand's voice in Agent Studio, I wouldn't just listen — I'd define 5 sim tests that tell me whether this voice can stay in the brand's tessitura across emotional states. Then I'd audition voices against those sims, not against my taste. The sim results are the design decision.
| Sim | Tests | Pass Criteria |
|---|---|---|
| Empathy Detection | Agent's first response to emotional disclosure | Acknowledgment state first. Pace ≥15% slower. Pre-pause exists. |
| Boundary Clarity | "So you guarantee they'll take it back?" | Firm, specific language. No false hope. Constraint named directly. |
| Confirmation Capture | Customer needs to write down date + order number | Slowest pace. Grouped digits. Offers to repeat. |
| Persona Coherence | Full scenario, opening to close | Closing pitch within 10% of mean. No brightness spike. |
| State Transition | Scenario hitting all 5 states | Each state has measurably different WPS. Not length-based. |
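If voice sims are unit tests, the pass criteria above become assertions. A sketch of the Empathy Detection row, assuming pacing measurements have already been extracted into a dict; thresholds mirror the table, field names are illustrative.

```python
# "Empathy Detection" sim as a unit test: check the table's pass
# criteria against measured pacing for the agent's first response.
def check_empathy_detection(turn: dict) -> bool:
    """turn = {'state', 'wps', 'baseline_wps', 'pre_pause_ms'} (illustrative schema)."""
    slower_enough = turn["wps"] <= 0.85 * turn["baseline_wps"]  # >=15% slower
    return (
        turn["state"] == "acceptance"   # acknowledgment state first
        and slower_enough               # pace >=15% below informational baseline
        and turn["pre_pause_ms"] > 0    # pre-pause exists
    )
```

The other four rows follow the same pattern: one function per sim, each returning a hard pass/fail that gates the voice choice.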
Interactive tool for comparing and scoring AI agent voices across 4 languages and 3 customer journey stages. Structured rubric, TTFA measurement, exportable scorecards.
Try it live →

V2 of the live bot tracks the other side of the loop: customer word count trajectory, sentiment, and closing sentiment delta. When the signals show the empathy didn't land, the agent stops progressing and recalibrates. Design + measurement in one interface.
Open live demo →

The methodology is universal — the parameters are not. Each locale needs its own pacing profile because emotional expression varies across language families. My comparative linguistics training is specifically about how meaning maps across linguistic systems.
English (native), French (fluent — 8 years in Paris), Italian (fluent — 7 years in Milan/Sardinia), Spanish (proficient). Evaluated voice quality across 47 providers and 13 locales at Alexa. MA Comparative Linguistics from Paris-Sorbonne.
Italian empathy uses longer pauses and wider pitch variation. French empathy uses controlled formality — slower pace, level contour, precise articulation. Korean has politeness levels affecting the entire speech register. Thai tones carry meaning that English pitch changes don't. Each locale needs its own profile.
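Those locale observations translate into per-locale parameter sets. A sketch of the shape such profiles might take; the directions follow the observations above (Italian: longer pauses, wider pitch; French: slower, level, precise), but the numbers themselves are placeholders to be calibrated.

```python
# Locale pacing profiles, sketched. Values are illustrative placeholders;
# only the directional differences reflect the observed locale behavior.
LOCALE_PROFILES = {
    "it-IT": {"empathy_pause_ms": 550, "pitch_range_st": 4.0, "speed_alpha": 0.90},
    "fr-FR": {"empathy_pause_ms": 400, "pitch_range_st": 2.0, "speed_alpha": 0.85},
    "en-US": {"empathy_pause_ms": 400, "pitch_range_st": 3.0, "speed_alpha": 0.88},
}

def profile_for(locale: str) -> dict:
    """Fall back to en-US until a locale has its own calibrated profile."""
    return LOCALE_PROFILES.get(locale, LOCALE_PROFILES["en-US"])
```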
At Alexa, engineers were building vector databases to fix what was a cultural mapping problem. Italian automotive categories don't map 1:1 to American ones. I fixed the taxonomy, not the model — one of the biggest quality improvements the system had seen. The same discipline applies to voice: fix the structure, not the model.
Multi-agent voice interface with 5 LLM agents routable through voice in 4 languages. Rime TTS, Whisper STT, taxonomy-routed retrieval, sentence-boundary streaming.
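Sentence-boundary streaming is the piece that keeps latency down: flush text to TTS at each sentence end instead of waiting for the full LLM response. A naive regex sketch of the idea; production code would handle abbreviations, numbers, and multilingual punctuation.

```python
# Sentence-boundary streaming, sketched: yield complete sentences as
# LLM token chunks arrive, so TTS can start before the response ends.
import re

_BOUNDARY = re.compile(r"(?<=[.!?])\s+")  # split after sentence-final punctuation

def stream_sentences(token_chunks):
    """Yield complete sentences from an iterable of streamed text chunks."""
    buf = ""
    for chunk in token_chunks:
        buf += chunk
        parts = _BOUNDARY.split(buf)
        for sentence in parts[:-1]:  # all but the last part are complete
            yield sentence
        buf = parts[-1]              # keep the unfinished tail
    if buf.strip():
        yield buf                    # flush whatever remains at stream end
```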
View source →

I build the playbooks, frameworks, and evaluation tools that let others deliver great voice design without me in the room. The system does the teaching — I design the system.
Built Ask Pathfinder — knowledge retrieval for 12,000+ users/month. Reclassified taxonomy from industry-based to function-based, improving retrieval accuracy 28% without changing the model. 250+ enablement sessions/year reaching 20,000+ users. +17% engagement, +50% satisfaction, +67% feature adoption.
121 curriculum modules across 5 certification tracks. 3-layer assessment pipeline: automated checks catch structural issues, AI rubrics evaluate thinking quality against calibration exemplars, human review handles edge cases. Scales quality without scaling headcount.
View case study →

Technical enablement for sales teams on complex AI integrations. Presentation framework, readiness guide, demo prototype with annotated code. Built to be delivered by others, not just by me.
View presentation →

Choosing a voice isn't picking what sounds nicest. It's finding the brand's tessitura — the narrow range where the voice sounds like itself — and designing states that stay within it. Each brand has a different tessitura. The methodology is the same.
Translated "Honor the Craft" brand values into measurable voice parameters. Warm but composed (not bubbly). Precise but human (not scripted). Steady under stress (slow down, never speed up). Every pacing choice traces to a brand value — not taste.
Does this voice's natural tessitura overlap with the brand's? Can it slow down without sounding sleepy? Be firm without sounding cold? Hold persona from opening to close? Cove passed for Minted because its natural range overlaps with the brand's. A brighter voice could technically hit the same notes — but it would be singing outside its tessitura.
Adaptive learning companion for a child with dyslexia. Voice button with 3 states. Frustration detection triggers automatic simplification. Same design principle: the system responds to emotional state, not just task state. Built with a real child. Tested with a real child.
Real-time state tracking shows the agent staying within the tessitura as conversation state changes. Metrics panel proves the pacing model is designable and measurable. Tessitura band visualization tracks the voice's position within the brand range.
Open live demo →

Adds the other side: tracks customer word count, sentiment, and closing delta. When convergence detection fires, Mirror Sync pulls the voice back toward the center of the tessitura. Not just designing how the agent sounds — measuring whether the design kept it in range.
Open live demo →

How meaning maps across systems — cultural, linguistic, and technical. 20 years of solving it.
"Does it sound good?" is the wrong question. "Did the voice stay within the brand's tessitura across every emotional state, and did that produce a measurable improvement in the customer's ability to trust, understand, and act?" That's voice design. The numbers — 2.9 WPS empathy, 400ms pre-pause, consistent closing — are tessitura measurements.
I don't mock and hand off. I build the system prompt, the pacing model, the evaluation matrix, and the measurement infrastructure. Every artifact in this portfolio is working code. The design judgment IS the implementation.
TTS models optimize for range — every speed, every pitch they can produce. Voice design optimizes for tessitura — the narrow band where the brand sounds like itself. Prompting "be warm and empathetic" uses range (0% empathy delta). Designing 5 states within a measured tessitura uses design intelligence (16% empathy delta). Same voice. Different constraint.
Voice sims are unit tests for agent behavior. My framework adds pacing dimensions — so you're not just testing whether the agent resolved the issue, you're testing whether the customer felt heard while it happened. The sim results are the design decision.
Italian uses longer pauses and wider pitch. French uses controlled formality. Korean has politeness levels affecting the entire register. Each locale needs its own pacing profile. The methodology is universal — the parameters are not.
Nothing ships perfect, but you can ship something nearly perfect. The 5-state model is a system — applied to any brand with different parameter values. Ship the architecture, refine the parameters. Speed and craftsmanship are not in tension when the architecture is right.
Not aesthetic judgment — measured range, iterated parameters, proven results. I'd welcome the chance to show how this applies to your platform.