Voice Agent Design — Finding the Brand's Tessitura

Minted × Mother's Day Gift Rescue · Case Study Demo

Core thesis: In vocal performance, tessitura is the range where a voice sounds most natural — narrower than the full range. Every brand has a tessitura too. This case study finds Minted's: a 5-state pacing model where each conversational state sits at a measured point within the brand's emotional range (0.83–1.30 speedAlpha). The improvement comes from constraining to tessitura, not from a different voice.
Sample A — Weaker Model Reference
Original test sample, outside the brand's tessitura. Higher bandwidth (32 kHz) but length-based pacing. The opening rushes, empathy is flat, and the goodbye carries a +46% pitch spike.
Sample B — Stronger Model Reference
Original test sample, outside the brand's tessitura. Better opening warmth and a cleaner brand name, but pacing is still length-based: empathy lines show a 0% rate delta from informational lines.
Baseline — Adjective-Prompted Only (Outside Tessitura)
Length-based pacing with no tessitura awareness. All states cluster at ~3.4 WPS (words per second). Prompted only with the adjectives "friendly, warm, professional, empathetic" and no pacing rules. Same voice (Rime Luna).
Designed — 5-State Pacing Within Tessitura (In Tessitura)
5-state pacing within Minted's tessitura (0.83–1.30 speedAlpha). Each state sits at a distinct position in the range. Same voice, different behavior: slower for empathy, moderate for options, firm for boundary, slowest for confirmation, warm for close.
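
The five-state model can be pictured as a lookup from conversational state to a speed setting, clamped to the brand's range. The sketch below is illustrative only. The case study reports the 0.83–1.30 range but not the individual per-state settings, so the numbers in the table are hypothetical placeholders chosen to respect the stated ordering. It also assumes that a larger speedAlpha means slower speech, which is consistent with the brand chart (playful-fast at 0.82, luxury-deliberate at 1.20).

```python
# Range reported in the case study for Minted's tessitura.
TESSITURA_MIN = 0.83
TESSITURA_MAX = 1.30

# HYPOTHETICAL per-state settings. The source gives only the ordering
# (confirmation slowest, empathy slow, options moderate, boundary firm,
# close warm), not these numbers. Larger value = slower (assumption).
STATE_SPEED_ALPHA = {
    "empathy": 1.20,
    "options": 1.05,
    "boundary": 1.00,
    "confirmation": 1.30,
    "close": 1.10,
}


def clamp_to_tessitura(value: float) -> float:
    """Keep a requested speedAlpha inside the brand's range."""
    return max(TESSITURA_MIN, min(TESSITURA_MAX, value))


def speed_alpha_for(state: str) -> float:
    """Look up the (hypothetical) setting for a state, clamped to range."""
    if state not in STATE_SPEED_ALPHA:
        raise KeyError(f"unknown conversational state: {state}")
    return clamp_to_tessitura(STATE_SPEED_ALPHA[state])
```

The clamp is the design point: whatever value a state requests, the voice never leaves the brand's range, so a new state can be added without risking an out-of-character delivery.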

Prosodic Measurements

| Metric | Baseline | Designed | Target | Verdict |
|---|---|---|---|---|
| Opening pace | 3.5 WPS | 3.4 WPS | 3.2–3.8 | ✓ In range |
| Empathy pace | 3.4 WPS | 2.9 WPS | Distinctly slower than other states | ✓ Slowest non-confirmation state |
| Empathy vs non-empathy avg | −3.0% | +16% | ≥15% | ✓ 2.9 vs 3.5 avg (+19pp swing) |
| Empathy pre-pause | ~0ms | 500ms | 250–450ms | Observed; intentionally stronger |
| Options structure | Stacked in one sentence | Separated into two options | One option per sentence | ✓ Easier to follow |
| Boundary contrast | | 3.3 WPS | Distinct from empathy | ✓ +0.4 WPS above empathy |
| Critical-detail isolation | Dense confirmation sentence | Date and watchpoint broken out | Stand-alone chunks | |
| Closing pitch continuity | +20–46% spike | Consistent | <10% | ✓ Same persona at close |
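
The pace figures above reduce to simple arithmetic. As a minimal sketch, assuming WPS means words per second and that the empathy delta is measured relative to the non-empathy average (the source does not state the exact formula), the calculation might look like this. The function names are illustrative.

```python
def words_per_second(transcript: str, duration_s: float) -> float:
    """Speaking pace: whitespace-separated words divided by seconds."""
    if duration_s <= 0:
        raise ValueError("duration_s must be positive")
    return len(transcript.split()) / duration_s


def slowdown_pct(state_wps: float, reference_wps: float) -> float:
    """How much slower a state is than a reference pace, in percent.

    Positive means slower than the reference. ASSUMPTION: the reference
    pace is the denominator; the case study does not say which it used.
    """
    return (reference_wps - state_wps) / reference_wps * 100.0


def swing_pp(before_pct: float, after_pct: float) -> float:
    """Change between two percentages, in percentage points."""
    return after_pct - before_pct


# Reported deltas: -3% at baseline, +16% in the designed version.
print(swing_pp(-3.0, 16.0))  # 19.0
```

With the rounded table values (2.9 and 3.5 WPS), `slowdown_pct` gives roughly 17%, slightly above the reported +16%. The gap presumably comes from rounding in the published figures.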

Key Finding: TTS Models Don't Know Their Tessitura

TTS models know their range: every speed they can produce. They don't know their tessitura. Rime's speed parameter gives directional control over pacing but not precise per-state control, because the engine's internal prosody model partially and non-deterministically overrides speed hints. True tessitura-aware pacing requires SSML-level control (pause tags, emphasis markers, rate-per-phrase) or a model that accepts emotional state as an input parameter, not just a global speed knob. The same limitation should apply to any TTS provider that exposes speed-only controls.
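
To make the SSML option concrete, here is a minimal sketch of per-phrase markup using the standard SSML `break` and `prosody` elements. It only builds a string. Whether a given TTS provider accepts SSML, and which rate values it honors, is an assumption to verify against that provider's documentation. The function names, the sample line, and the 85% rate are hypothetical; the 500 ms pause matches the empathy pre-pause reported in the measurements.

```python
from xml.sax.saxutils import escape


def ssml_phrase(text: str, rate_pct: int = 100, pre_pause_ms: int = 0) -> str:
    """Wrap one phrase in SSML with an optional leading pause and a rate."""
    if rate_pct <= 0:
        raise ValueError("rate_pct must be positive")
    if pre_pause_ms < 0:
        raise ValueError("pre_pause_ms must not be negative")
    parts = []
    if pre_pause_ms > 0:
        # Silence before the phrase, e.g. ahead of an empathy line.
        parts.append(f'<break time="{pre_pause_ms}ms"/>')
    # Escape &, <, > so the spoken text cannot break the markup.
    parts.append(f'<prosody rate="{rate_pct}%">{escape(text)}</prosody>')
    return "".join(parts)


def ssml_document(phrases: list[str]) -> str:
    """Join already-built phrases under a single <speak> root."""
    return "<speak>" + "".join(phrases) + "</speak>"


# Hypothetical empathy line: slower rate, 500 ms pre-pause.
empathy = ssml_phrase("I'm so sorry about the delay.", rate_pct=85, pre_pause_ms=500)
print(ssml_document([empathy]))
```

Setting rate and pause per phrase, rather than one global speed per utterance, is what would let each conversational state carry its own pacing.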

What This Proves

Even with limited TTS control, the design intervention produces measurable improvement. The empathy delta swung by 19 percentage points (from −3% to +16%). The boundary state creates a distinct register for honesty moments. Critical details are isolated instead of buried in one dense confirmation sentence. With SSML or state-aware TTS, these improvements would likely be larger and more reliable. The improvement comes from design, not from a different voice.

Brand Tessitura Comparison

Same methodology, different tessitura. Change the brand, change the range.

[Figure: brand tessitura comparison on a speedAlpha axis from 0.60 to 1.50. Minted (warm-premium): 1.00, proven. Discord (playful-fast): 0.82, designed. Ritz-Carlton (luxury-deliberate): 1.20, designed.]

Solid = proven (voiced and measured). Dashed = designed (tessitura mapped, not yet voiced).