A multi-stage discovery funnel with benchmark-driven recruitment, AI interviews, and an eval framework for ongoing quality assurance.
A research platform where benchmark data recruits participants, an AI agent conducts structured interviews, and a qualification pipeline routes the highest-value respondents to human follow-ups.
Customer discovery has a volume-quality tradeoff. This system handles high-volume screening and reserves expensive human time for the conversations that matter most.
Conditional tool availability, state-machine-driven conversations, and a regression-detecting eval framework that keeps the agent honest over time.
Customer discovery research has a volume-quality tradeoff that nobody talks about honestly.
Surveys scale, but they produce shallow data: multiple choice answers that confirm what you already suspect. Depth interviews produce rich insight, but they're expensive: recruiting participants takes weeks, each interview costs real money in compensation and time, and most of the people you talk to turn out not to have the pain you're investigating.
The result is that most solo operators and small teams skip real research entirely. They build based on signal proxies (Reddit posts, competitor reviews, Twitter threads) and hope they interpreted the market correctly.
We needed a system that could conduct the high-volume, lower-judgment parts of research at scale (screening, initial qualification, structured interviews) and route only the highest-value participants to expensive human conversations.
The expensive resource isn't the interview. It's knowing who's worth interviewing.
The first problem in research is recruitment. You can't learn from people you can't reach.
We solved this with a benchmark study, an instrument that creates value for the participant, not just the researcher. The survey collects structured data about eCommerce sellers' customer support operations: platform, revenue range, support setup. In return, each respondent receives personalized benchmark data showing how their operation compares to peers in their segment.
This inverts the research recruitment dynamic. Instead of asking people to give you their time for nothing, you're offering them something they actually want: a data-driven comparison that tells them whether their support operation is normal, above average, or falling behind.
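The comparison itself can be as simple as a percentile rank within the respondent's segment. A minimal sketch, where the metric (say, support hours per week) and the banding thresholds are illustrative assumptions, not the production logic:

```typescript
// Where does a respondent's metric sit among peers in the same segment?
// Simple percentile rank: the share of peers strictly below this value.
function percentileRank(peerValues: number[], value: number): number {
  const below = peerValues.filter((v) => v < value).length;
  return Math.round((below / peerValues.length) * 100);
}

// Map the percentile to the three bands named in the text. Whether a high
// percentile is good depends on the metric's direction; assumed here.
function band(percentile: number): "above average" | "normal" | "falling behind" {
  if (percentile >= 67) return "above average";
  if (percentile >= 33) return "normal";
  return "falling behind";
}
```

Each respondent's answers feed the peer pool, so the benchmark gets more precise as recruitment scales.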
After receiving their benchmark results, respondents enter a conversational interview conducted by an AI agent.
The agent isn't a chatbot with a script. It's a research interviewer with a defined persona (warm, curious, non-judgmental), a topic checklist tracked across turns, and a conversation state machine that governs pacing. The system prompt is assembled per-session, injecting the respondent's survey data so the agent's questions are specific to their situation.
The conversation follows a state machine:

- OPENING: the agent references the respondent's benchmark data and asks the first question.
- EXPLORING: the agent works through the topic checklist, following up on interesting threads rather than rigidly advancing.
- QUALIFYING: once 6+ topics are covered or 12 exchanges have occurred, an inline qualification check determines whether this respondent warrants a human follow-up.
- SCHEDULING or CLOSING, based on the qualification result.
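The transitions above can be sketched as a small transition function. The 6-topic and 12-exchange thresholds come from the text; the session shape is an assumption:

```typescript
// Conversation states: OPENING → EXPLORING → QUALIFYING → SCHEDULING | CLOSING.
type State = "OPENING" | "EXPLORING" | "QUALIFYING" | "SCHEDULING" | "CLOSING";

interface Session {
  state: State;
  topicsCovered: number; // distinct checklist topics touched so far
  exchanges: number;     // user/agent turn pairs so far
  qualified?: boolean;   // set by the inline qualification check
}

// Advance the state machine after each exchange.
function nextState(s: Session): State {
  switch (s.state) {
    case "OPENING":
      return "EXPLORING"; // first question asked, start working the checklist
    case "EXPLORING":
      return s.topicsCovered >= 6 || s.exchanges >= 12 ? "QUALIFYING" : "EXPLORING";
    case "QUALIFYING":
      return s.qualified ? "SCHEDULING" : "CLOSING";
    default:
      return s.state; // SCHEDULING and CLOSING are terminal here
  }
}
```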
The design constraint that matters most: the agent asks one question at a time, keeps responses under three sentences, and probes deeper on short answers before moving on. Research interviews fail when the interviewer talks too much.
Constraints that live in code are more reliable than constraints that live in prompts.
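One way to make those constraints structural is to validate each drafted reply in code and re-prompt the model on violation, rather than trusting the prompt alone. A hypothetical sketch (the rules mirror the text; the validator is an illustration, not the production code):

```typescript
// Hard rules checked against a drafted agent reply before it is sent.
interface Violation {
  rule: "one-question" | "max-three-sentences";
}

function validateReply(reply: string): Violation[] {
  const violations: Violation[] = [];

  // Rule 1: at most one question per turn.
  const questions = (reply.match(/\?/g) ?? []).length;
  if (questions > 1) violations.push({ rule: "one-question" });

  // Rule 2: at most three sentences. Rough count: terminal punctuation
  // followed by whitespace or end of string.
  const sentences = reply.split(/[.!?]+(?:\s|$)/).filter((s) => s.trim()).length;
  if (sentences > 3) violations.push({ rule: "max-three-sentences" });

  return violations;
}
```

A caller that gets a non-empty violation list can regenerate the reply with the violated rule appended to the prompt, so the constraint is enforced even when the model ignores its instructions.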
For qualified respondents, the transition from research interview to scheduling a paid human follow-up happens inside the same conversation: no page redirect, no separate booking flow, no drop-off point.
After 10+ exchanges, the agent has built rapport. "Would you be up for a call?" is a natural next step in the conversation, not a cold CTA on a new page.
The agent uses tool-calling to access scheduling: check_availability fetches open time slots, the agent presents 3-4 options conversationally, and book_slot confirms the booking and triggers confirmation emails to both parties.
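A dependency-free sketch of the two tools. In production they would be declared through the Vercel AI SDK's tool-calling API with schema-validated parameters; the in-memory slot store below is a stand-in:

```typescript
// Minimal stand-in for the scheduling backend.
interface Slot {
  id: string;
  startsAt: string; // ISO timestamp
  booked: boolean;
}

const slots: Slot[] = [
  { id: "a", startsAt: "2025-01-06T15:00:00Z", booked: false },
  { id: "b", startsAt: "2025-01-06T17:00:00Z", booked: false },
];

const tools = {
  // Returns open slots for the agent to present conversationally.
  check_availability: () => slots.filter((s) => !s.booked),

  // Marks a slot booked; a real handler would also trigger confirmation
  // emails to both parties (via Resend, per the stack list).
  book_slot: (slotId: string) => {
    const slot = slots.find((s) => s.id === slotId && !s.booked);
    if (!slot) return { ok: false as const };
    slot.booked = true;
    return { ok: true as const, startsAt: slot.startsAt };
  },
};
```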
After each interview completes, two async jobs run:
Qualification scoring uses Sonnet to evaluate the full transcript against five dimensions: pain severity, revenue scale, willingness to pay, articulateness, and decision-making authority. The output is a score, a set of signal flags, and a recommendation: schedule, skip, or maybe.
Insight extraction uses Haiku to pull structured data from the transcript: pain points with severity ratings and supporting quotes, current tools and satisfaction levels, workflow descriptions, willingness-to-pay signals, and quotable statements.
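The scoring output can be reduced to a recommendation with a simple aggregation. The five dimensions and the three recommendations come from the text; the 1-5 scale, the thresholds, and the unweighted average are illustrative assumptions:

```typescript
// The five dimensions scored from the transcript.
interface QualificationScores {
  painSeverity: number;
  revenueScale: number;
  willingnessToPay: number;
  articulateness: number;
  decisionAuthority: number;
}

type Recommendation = "schedule" | "maybe" | "skip";

// Hypothetical aggregation: unweighted mean on a 1–5 scale, with
// assumed cutoffs at 4.0 (schedule) and 3.0 (maybe).
function recommend(s: QualificationScores): Recommendation {
  const avg =
    (s.painSeverity +
      s.revenueScale +
      s.willingnessToPay +
      s.articulateness +
      s.decisionAuthority) / 5;
  if (avg >= 4) return "schedule";
  if (avg >= 3) return "maybe";
  return "skip";
}
```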
The model routing follows the same principle as the map art pipeline: use the expensive model only where judgment matters, and the cheap model for structured extraction where the task is well-defined. Total cost per interview: approximately $0.14.
Without evals, you assume the agent is working because the last conversation you read looked fine.
Deploying an AI agent isn't a one-time event. Models drift. Prompts that worked last month may not work after an API update. Conversation quality degrades in ways that aren't visible until you measure them.
The eval framework runs synthetic respondents through the full interview pipeline and scores the results across seven dimensions: topic coverage, question quality, conversation flow, insight extraction accuracy, qualification scoring accuracy, scheduling effectiveness, and safety compliance.
The fixtures represent the range of real users: a high-pain Shopify seller drowning in support, a low-pain Etsy seller who's doing fine, a multi-channel Amazon seller with existing tooling, a terse responder who gives one-word answers, an off-topic wanderer, and a hostile skeptic.
The scoring is concrete: topic coverage must exceed 75% across all fixtures. Average question quality must score 3.8+ on a 5-point rubric. Zero hallucinated quotes in insight extraction. Qualification scores must fall within expected ranges for known profiles.
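Those thresholds translate directly into a CI gate. A sketch, assuming a simple aggregated-results shape (the thresholds come from the text; the shape does not):

```typescript
// Aggregated eval results across all synthetic fixtures.
interface EvalResults {
  minTopicCoverage: number;   // worst fixture's topic coverage, 0–1
  avgQuestionQuality: number; // mean rubric score, 1–5
  hallucinatedQuotes: number; // extracted quotes absent from the transcript
}

// The CI gate: coverage must exceed 75%, question quality must average
// 3.8+, and hallucinated quotes must be exactly zero.
function passesGate(r: EvalResults): boolean {
  return (
    r.minTopicCoverage > 0.75 &&
    r.avgQuestionQuality >= 3.8 &&
    r.hallucinatedQuotes === 0
  );
}
```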
Benchmark-as-barter is a genuine acquisition strategy. Respondents aren't doing you a favor; they're getting something they want. This reframes research recruitment from a cold outreach problem to a value creation problem.
AI interviews produce different data than human interviews. The agent is more consistent (it never gets tired, never skips topics, never gets flustered by a hostile respondent), but less generative. The structured probing produces reliable, comparable data across respondents. The surprising insights still come from the human follow-ups. This is the right division of labor.
The eval framework changed how we think about the agent. Without evals, you assume it's working because the last conversation you read looked fine. With evals, you see the patterns: which question types consistently score low, which respondent profiles break the agent's probing strategy, where the conversation state machine transitions too early or too late.
Conditional tool availability is more powerful than prompt instructions. We tried telling the agent "don't schedule until you've covered enough topics." It sometimes jumped early anyway. Making scheduling tools architecturally unavailable until the state machine reaches the right state eliminated the failure mode entirely.
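In code, that gating is a filter on which tool definitions the model is shown for a given state: a tool the model never sees is a tool it cannot call early. A sketch (the scheduling tool names come from the text; the always-available `end_interview` tool is a hypothetical placeholder):

```typescript
type ConvState = "OPENING" | "EXPLORING" | "QUALIFYING" | "SCHEDULING" | "CLOSING";

// Hypothetical always-available tool, plus the two scheduling tools
// named in the text.
const baseTools = ["end_interview"];
const schedulingTools = ["check_availability", "book_slot"];

// The scheduling tools are only exposed once the state machine reaches
// SCHEDULING — the failure mode of premature booking becomes impossible.
function availableTools(state: ConvState): string[] {
  return state === "SCHEDULING" ? [...baseTools, ...schedulingTools] : baseTools;
}
```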
Next.js 14 · TypeScript · Tailwind CSS · shadcn/ui
Neon Postgres (serverless) · Drizzle ORM · Zod
Anthropic API — Sonnet (conversation, scoring) · Haiku (extraction, tracking)
Vercel AI SDK (streaming, tool calling) · Resend (email notifications)
Eval framework with synthetic fixtures · LLM-as-judge · CI gating
Have a hard problem in an overlooked industry?
Get in Touch