Benchmark release
Introducing GTM-Bench
A benchmark for agentic GTM work.
An open benchmark for evaluating whether AI agents can find the right buyers for the right seller, with evidence.
AI is rapidly changing go-to-market (GTM) work. Agents can now research companies, inspect data, generate lists, write outreach, and activate campaigns at a scale that was impossible only a short time ago.
But scale is not the same as quality.
In GTM, the central question is not whether an agent can generate more prospects. It is whether it can understand what a seller offers, infer who should buy it, retrieve the right accounts or contacts, and explain why each record is commercially relevant.
We built GTM-Bench to measure that.
GTM-Bench is a benchmark for evaluating buyer/seller coherence in agentic GTM workflows. It tests whether AI systems can complete realistic prospecting tasks end to end: infer the offer, define an actionable ideal customer profile (ICP), retrieve matching buyer records, and support each recommendation with evidence.
The first version includes 72 GTM tasks spanning 11 task types and 15 market categories, designed from a taxonomy of real prospecting behavior observed across 59,881 opening queries submitted to Bebop.ai.
We are releasing the task catalog, agent harness, evaluation code, benchmark calculation code, and leaderboard to help model builders, agent developers, and GTM teams measure progress on commercially useful AI agents.
Why GTM needs its own benchmark
Most AI benchmarks do not capture the shape of real GTM work.
Traditional LLM benchmarks test reasoning, coding, math, or question answering. Enterprise-agent benchmarks test tool use, CRM workflows, and business-system operations. These are valuable, but they do not isolate the GTM problem that matters most in outbound work:
Can the system match the right offer to the right account, at the right time?
This matters because poor matching has real costs. It wastes seller time and budget, and it creates more noise for buyers. As AI lowers the marginal cost of outbound generation, evaluation needs to reward relevance, not volume.
A benchmark that rewards agents for returning more rows would encourage the wrong behavior. GTM-Bench is designed to do the opposite.
How GTM-Bench works
Each task mirrors real GTM work.
Each task gives an agent a natural-language GTM instruction, a controlled data environment, and a standard operating environment.
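To complete the task, the agent must produce three artifacts: a statement of the seller's offer, an actionable ICP, and a set of matching buyer records, each supported by evidence.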
This mirrors how real GTM work is done. A useful system cannot jump straight to a lead list. It has to understand the seller, reason about the buyer, retrieve candidates, and decide which records are worth acting on.
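One way to picture the three artifacts is as a small typed structure. The sketch below is illustrative only: the class and field names (`Submission`, `BuyerRecord`, `Evidence`, `source_url`, and so on) are assumptions made for this example, not the schema used by the released harness.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Evidence:
    claim: str       # what the agent asserts about the buyer
    source_url: str  # where a reviewer could check the claim (hypothetical field)


@dataclass
class BuyerRecord:
    company_domain: str
    contact_title: Optional[str] = None  # set only for contact-level tasks
    evidence: List[Evidence] = field(default_factory=list)


@dataclass
class Submission:
    offer: str  # artifact 1: what the seller offers
    icp: str    # artifact 2: who should buy it
    records: List[BuyerRecord] = field(default_factory=list)  # artifact 3

    def unsupported_records(self) -> List[BuyerRecord]:
        """Return records that carry no evidence at all."""
        return [r for r in self.records if not r.evidence]
```

A check like `unsupported_records()` is the kind of guard an agent might run on its own output before submitting, since rows without evidence are unlikely to survive an audit.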
The tasks cover common GTM patterns including offer-grounded lead lists, named-domain offer extraction, persona and contact activation, intent and trigger evidence, technographic search, lookalikes, geographic and firmographic filtering, and limited-context prompts.
Scoring useful matches, not more rows
GTM-Bench uses a scoring design built around production utility.
Offer and ICP artifacts are scored independently for fidelity, specificity, commercial relevance, actionability, and concision. These scores act as task-level multipliers, so an agent is not rewarded for finding plausible rows if it misunderstood the seller or buyer.
Each returned record is then judged on two dimensions:
Match quality
Does the company or contact fit the offer and ICP? Is there a concrete reason they would need the product or service? Is the contact usable for activation?
Audit quality
Does the submitted record resolve to the right real-world entity? Are the claims supported? Is the row consistent with the underlying database and evidence?
Records are grouped into A-grade, B-grade, and below-B outcomes. A-grade records create positive utility. B-grade records are neutral. Below-B records create negative utility.
That means an agent cannot win by flooding the evaluator with weak leads. Unsupported claims, identity errors, irrelevant companies, and poor contact matches actively reduce the score.
This reflects real GTM work: a bad lead is not just “less good.” It wastes budget, damages trust, and creates spam.
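The scoring idea can be sketched in a few lines. This is a minimal illustration, not the released benchmark calculation code: the per-grade utilities of +1, 0, and -1, the assumption that multipliers lie between 0 and 1, and the choice to multiply the whole record total are all hypothetical. The post specifies only that A-grade records are positive, B-grade records are neutral, below-B records are negative, and that offer and ICP scores act as task-level multipliers.

```python
# Hypothetical utilities: only the signs come from the benchmark description.
GRADE_UTILITY = {"A": 1.0, "B": 0.0, "below_B": -1.0}


def task_score(grades, offer_multiplier, icp_multiplier):
    """Sum per-record utility, then scale by the offer and ICP multipliers."""
    for m in (offer_multiplier, icp_multiplier):
        if not 0.0 <= m <= 1.0:
            raise ValueError("multipliers are assumed to lie in [0, 1]")
    raw = sum(GRADE_UTILITY[g] for g in grades)
    return offer_multiplier * icp_multiplier * raw


def a_grade_rate(grades):
    """Share of returned records graded A (0.0 for an empty submission)."""
    return grades.count("A") / len(grades) if grades else 0.0


# Two submissions with the same number of A-grade rows.
focused = ["A"] * 40 + ["B"] * 10
flooded = ["A"] * 40 + ["B"] * 60 + ["below_B"] * 100

print(task_score(focused, 1.0, 0.5))  # 20.0
print(task_score(flooded, 1.0, 0.5))  # -30.0
```

Both submissions contain 40 A-grade rows, but the flooded one ends up negative because its 100 below-B rows outweigh them. Under these assumed weights, adding weak rows can only lower the score.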
Initial results
We evaluated six frontier generalist agent systems and one purpose-built GTM system.
The purpose-built Blackpearl RTSA system was the clear leader, achieving the highest net score (26,615.6) and the strongest useful-volume performance. Among generalist agents, OpenAI GPT-5.5 with Codex was the strongest overall (4,040.9) and the only generalist system with a large positive net score. Claude Sonnet 4.6 with Claude Code was the only other system to finish above zero (400.1).
| System | Net score | A-grade rate |
|---|---:|---:|
| Blackpearl RTSA | 26,615.6 | 40.9% |
| OpenAI GPT-5.5 / Codex | 4,040.9 | 37.7% |
| Claude Sonnet 4.6 / Claude Code | 400.1 | 27.3% |
| Claude Opus 4.7 / Claude Code | -2,476.6 | 31.7% |
| DeepSeek V4 Pro / Hermes | -3,398.0 | 21.8% |
| Gemini 3.5 Flash / Hermes | -10,671.9 | 13.6% |
| Kimi K2.6 / Hermes | -15,402.3 | 22.8% |
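The results show why volume-weighted scoring matters. Some systems returned many plausible-looking rows, but enough of those rows were weak, unsupported, or incorrectly matched that their total utility became negative. Claude Opus 4.7 with Claude Code, for example, had the third-highest A-grade rate (31.7%) but a net score of -2,476.6.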
RTSA and GPT-5.5 had similar A-grade rates (40.9% and 37.7%), but RTSA produced far more useful volume, and its net score was roughly 6.6 times higher. That difference is critical in production GTM, where the goal is not just precision in isolation, but enough high-quality, auditable prospects to activate.
What we learned from agent traces
The strongest runs filtered before they wrote final results.
Across 432 generalist-agent traces, the strongest runs followed a consistent pattern.
They first retrieved broad-enough candidate pools, then used structured filtering, website evidence, scripts, and row-level pruning before writing final results. The weakest runs usually failed in one of three ways: they stopped too early, overproduced weak rows, or made claims the evidence could not support.
This creates a clear lesson for GTM agents: retrieval alone is not enough. The hard part is deciding what to keep, what to reject, and when the evidence is strong enough to act.
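The filter-before-write pattern can be sketched as a pruning step that sits between retrieval and final output. The function name, the dictionary keys (`employees`, `evidence`), and the `min_evidence` threshold below are hypothetical, chosen only to illustrate the pattern. They are not taken from the agent harness or from any evaluated system.

```python
def prune_candidates(candidates, icp_filter, min_evidence=1):
    """Split a broad candidate pool into rows to keep and rows to reject.

    candidates   : list of dicts, one per retrieved company or contact
    icp_filter   : callable returning True when a row fits the ICP
    min_evidence : minimum number of evidence items a kept row must carry
    """
    kept, rejected = [], []
    for row in candidates:
        if not icp_filter(row):
            rejected.append((row, "fails ICP filter"))
        elif len(row.get("evidence", [])) < min_evidence:
            rejected.append((row, "insufficient evidence"))
        else:
            kept.append(row)
    return kept, rejected


# Example pool: retrieve broadly, then keep only rows that fit and are supported.
pool = [
    {"domain": "a.example", "employees": 250, "evidence": ["pricing page mentions SSO"]},
    {"domain": "b.example", "employees": 12, "evidence": ["careers page lists RevOps role"]},
    {"domain": "c.example", "employees": 900, "evidence": []},
]

kept, rejected = prune_candidates(pool, icp_filter=lambda r: r["employees"] >= 100)
print([r["domain"] for r in kept])  # ['a.example']
```

Recording a reason for each rejected row keeps the pruning step auditable, which mirrors the benchmark's emphasis on evidence over volume.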
What’s next
GTM-Bench is an early step toward better evaluation for GTM agents.
Future versions will expand task coverage, improve reproducibility, and incorporate richer buyer-affinity signals that go beyond likely fit to estimate whether a matched prospect is actually likely to buy.
We are releasing GTM-Bench to make progress in agentic GTM measurable. Our goal is to give model providers, agent builders, researchers, and GTM teams a shared way to evaluate systems that do more than generate lists.
The next generation of GTM AI should not create more noise.
It should help sellers find the right buyers, with evidence, and turn data into revenue.