Benchmark release

Introducing GTM-Bench

A benchmark for agentic GTM work.

An open benchmark for evaluating whether AI agents can find the right buyers for the right seller, with evidence.

June 8, 2026 | Buyer/seller coherence | Agentic GTM workflows

AI is rapidly changing go-to-market (GTM) work. Agents can now research companies, inspect data, generate lists, write outreach, and activate campaigns at a scale that was impossible only a short time ago.

But scale is not the same as quality.

In GTM, the central question is not whether an agent can generate more prospects. It is whether it can understand what a seller offers, infer who should buy it, retrieve the right accounts or contacts, and explain why each record is commercially relevant.

We built GTM-Bench to measure that.

GTM-Bench is a benchmark for evaluating buyer/seller coherence in agentic GTM workflows. It tests whether AI systems can complete realistic prospecting tasks end to end: infer the offer, define an actionable ideal customer profile (ICP), retrieve matching buyer records, and support each recommendation with evidence.

The first version includes 72 GTM tasks spanning 11 task types and 15 market categories, designed from a taxonomy of real prospecting behavior observed across 59,881 opening queries submitted to Bebop.ai.

We are releasing the task catalog, agent harness, evaluation code, benchmark calculation code, and leaderboard to help model builders, agent developers, and GTM teams measure progress on commercially useful AI agents.

Why GTM needs its own benchmark

Most AI benchmarks do not capture the shape of real GTM work.

Traditional LLM benchmarks test reasoning, coding, math, or question answering. Enterprise-agent benchmarks test tool use, CRM workflows, and business-system operations. These are valuable, but they do not isolate the GTM problem that matters most in outbound work:

Can the system match the right offer to the right account, at the right time?

This matters because poor matching has real costs. It wastes seller time and budget, and it creates more noise for buyers. As AI lowers the marginal cost of outbound generation, evaluation needs to reward relevance, not volume.

A benchmark that rewards agents for returning more rows would encourage the wrong behavior. GTM-Bench is designed to do the opposite.

How GTM-Bench works

Each task mirrors real GTM work.

Each task gives an agent a natural-language GTM instruction, a controlled data environment, and a standard operating environment.

To complete the task, the agent must produce three artifacts:

OFFER.md — what the seller appears to be offering

ICP.md — the ideal customer profile inferred from the task and offer

RESULTS.csv — a ranked list of accounts or contacts, with evidence and activation context

This mirrors how real GTM work is done. A useful system cannot jump straight to a lead list. It has to understand the seller, reason about the buyer, retrieve candidates, and decide which records are worth acting on.
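As a rough illustration of the three-artifact contract, the sketch below checks that a submission directory contains all three files and reads the ranked rows from RESULTS.csv. Only the three file names come from the benchmark description. The column names (`rank`, `company`, `evidence`) and both helper functions are hypothetical; the real schema is defined by the released harness.

```python
import csv
import io
from pathlib import Path

# The three artifact names come from the task description.
REQUIRED_ARTIFACTS = ("OFFER.md", "ICP.md", "RESULTS.csv")
# Hypothetical column names: the real RESULTS.csv schema is set by the release.
ASSUMED_COLUMNS = ("rank", "company", "evidence")


def missing_artifacts(submission_dir: Path) -> list[str]:
    """Return the required artifact names that are absent or empty."""
    missing = []
    for name in REQUIRED_ARTIFACTS:
        path = submission_dir / name
        if not path.is_file() or path.stat().st_size == 0:
            missing.append(name)
    return missing


def read_ranked_rows(csv_text: str) -> list[dict[str, str]]:
    """Parse RESULTS.csv text and return rows sorted by the assumed 'rank' column."""
    reader = csv.DictReader(io.StringIO(csv_text))
    absent = [c for c in ASSUMED_COLUMNS if c not in (reader.fieldnames or [])]
    if absent:
        raise ValueError(f"missing columns: {absent}")
    return sorted(reader, key=lambda row: int(row["rank"]))
```

A harness built along these lines would reject a submission that skips OFFER.md or ICP.md before looking at any rows, which matches the idea that a lead list alone is not a complete answer.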

The tasks cover common GTM patterns including offer-grounded lead lists, named-domain offer extraction, persona and contact activation, intent and trigger evidence, technographic search, lookalikes, geographic and firmographic filtering, and limited-context prompts.

Scoring useful matches, not more rows

GTM-Bench uses a scoring design built around production utility.

Offer and ICP artifacts are scored independently for fidelity, specificity, commercial relevance, actionability, and concision. These scores act as task-level multipliers, so an agent is not rewarded for finding plausible rows if it misunderstood the seller or buyer.

Each returned record is then judged on two dimensions:

Match quality

Does the company or contact fit the offer and ICP? Is there a concrete reason they would need the product or service? Is the contact usable for activation?

Audit quality

Does the submitted record resolve to the right real-world entity? Are the claims supported? Is the row consistent with the underlying database and evidence?

Records are grouped into A-grade, B-grade, and below-B outcomes. A-grade records create positive utility. B-grade records are neutral. Below-B records create negative utility.

That means an agent cannot win by flooding the evaluator with weak leads. Unsupported claims, identity errors, irrelevant companies, and poor contact matches actively reduce the score.

This reflects real GTM work: a bad lead is not just “less good.” It wastes budget, damages trust, and creates spam.
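To make the scoring logic concrete, here is a minimal sketch of how graded records and task-level multipliers could combine into a net score. It is not the released benchmark calculation. The text fixes only the signs (A-grade positive, B-grade neutral, below-B negative) and the fact that offer and ICP scores act as multipliers. The utility magnitudes, the 0-to-1 score scale, the grade thresholds, and the rule of grading on the weaker dimension are all assumptions made for illustration.

```python
from dataclasses import dataclass

# Hypothetical per-record utilities: only the signs come from the text,
# the magnitudes are placeholders.
UTILITY = {"A": 1.0, "B": 0.0, "below_B": -1.0}


@dataclass
class Record:
    match_quality: float  # assumed 0..1 judge score for fit with offer and ICP
    audit_quality: float  # assumed 0..1 judge score for identity and evidence checks


def grade(record: Record, a_cut: float = 0.8, b_cut: float = 0.5) -> str:
    """Grade on the weaker of the two dimensions (thresholds are assumptions)."""
    weakest = min(record.match_quality, record.audit_quality)
    if weakest >= a_cut:
        return "A"
    if weakest >= b_cut:
        return "B"
    return "below_B"


def task_net_score(records: list[Record], offer_score: float, icp_score: float) -> float:
    """Sum per-record utility, then scale by the offer and ICP multipliers."""
    raw = sum(UTILITY[grade(r)] for r in records)
    return offer_score * icp_score * raw


# Three well-supported rows versus the same rows padded with five rows
# that match well on paper but fail the audit.
focused = [Record(0.9, 0.9)] * 3
flooded = focused + [Record(0.9, 0.2)] * 5

print(task_net_score(focused, 1.0, 1.0))  # 3.0
print(task_net_score(flooded, 1.0, 1.0))  # -2.0
```

Under these assumed numbers, padding the list turns a positive task score into a negative one, which is the behavior the scoring design is meant to produce.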

Initial results

We evaluated six frontier generalist agent systems and one purpose-built GTM system.

The purpose-built Blackpearl RTSA system was the clear leader, achieving the highest net score and the strongest useful-volume performance. Among generalist agents, OpenAI GPT-5.5 with Codex was the strongest overall and the only generalist system with a large positive net score.

System	Net score	A-grade rate
Blackpearl RTSA	26,615.6	40.9%
OpenAI GPT-5.5 / Codex	4,040.9	37.7%
Claude Sonnet 4.6 / Claude Code	400.1	27.3%
Claude Opus 4.7 / Claude Code	-2,476.6	31.7%
DeepSeek V4 Pro / Hermes	-3,398.0	21.8%
Gemini 3.5 Flash / Hermes	-10,671.9	13.6%
Kimi K2.6 / Hermes	-15,402.3	22.8%

Figure: Net score by evaluated system.

The results show why volume-weighted scoring matters. Some systems returned many plausible-looking rows, but enough of those rows were weak, unsupported, or incorrectly matched that their total utility became negative.

RTSA and GPT-5.5 had similar A-grade rates (40.9% and 37.7%), but RTSA produced far more useful volume, and its net score was about 6.6 times higher (26,615.6 versus 4,040.9). That difference is critical in production GTM, where the goal is not just precision in isolation, but enough high-quality, auditable prospects to activate.
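One way to see the gap between precision and net utility is to rank the same seven systems twice, once by net score and once by A-grade rate. The short script below does that using only the figures from the results table. The variable names are illustrative.

```python
# (system, net score, A-grade rate in percent), copied from the results table.
RESULTS = [
    ("Blackpearl RTSA", 26615.6, 40.9),
    ("OpenAI GPT-5.5 / Codex", 4040.9, 37.7),
    ("Claude Sonnet 4.6 / Claude Code", 400.1, 27.3),
    ("Claude Opus 4.7 / Claude Code", -2476.6, 31.7),
    ("DeepSeek V4 Pro / Hermes", -3398.0, 21.8),
    ("Gemini 3.5 Flash / Hermes", -10671.9, 13.6),
    ("Kimi K2.6 / Hermes", -15402.3, 22.8),
]

by_net = [name for name, net, _ in sorted(RESULTS, key=lambda r: r[1], reverse=True)]
by_rate = [name for name, _, rate in sorted(RESULTS, key=lambda r: r[2], reverse=True)]

# Systems whose position differs between the two rankings.
moved = [name for name in by_net if by_net.index(name) != by_rate.index(name)]

for name in moved:
    print(f"{name}: net rank {by_net.index(name) + 1}, A-rate rank {by_rate.index(name) + 1}")
```

The top two positions are the same under both orderings, but the other five systems all change places. For example, Claude Opus 4.7 has a higher A-grade rate than Claude Sonnet 4.6 yet a lower net score, and Kimi K2.6 ranks fifth by A-grade rate but last by net score.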

What we learned from agent traces

The strongest runs filtered before they wrote final results.

Across 432 generalist-agent traces (six generalist systems on 72 tasks each), the strongest runs followed a consistent pattern.

They first retrieved broad-enough candidate pools, then used structured filtering, website evidence, scripts, and row-level pruning before writing final results. The weakest runs usually failed in one of three ways: they stopped too early, overproduced weak rows, or made claims the evidence could not support.

This creates a clear lesson for GTM agents: retrieval alone is not enough. The hard part is deciding what to keep, what to reject, and when the evidence is strong enough to act.
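The retrieve, filter, prune, then write pattern can be sketched in a few lines. This is a toy illustration of the ordering, not code from any evaluated agent or from the benchmark harness. The `Candidate` fields, the employee-count filter, and the "at least one piece of evidence" rule are all assumptions chosen to keep the example small.

```python
from dataclasses import dataclass, field


@dataclass
class Candidate:
    name: str
    employees: int
    # Hypothetical evidence store, e.g. website snippets that support the match.
    evidence: list[str] = field(default_factory=list)


def structured_filter(pool: list[Candidate], min_employees: int, max_employees: int) -> list[Candidate]:
    """Cheap firmographic filter applied to the broad pool first."""
    return [c for c in pool if min_employees <= c.employees <= max_employees]


def prune_unsupported(candidates: list[Candidate], min_evidence: int = 1) -> list[Candidate]:
    """Row-level pruning: drop rows that have too little supporting evidence."""
    return [c for c in candidates if len(c.evidence) >= min_evidence]


def build_results(pool: list[Candidate], min_employees: int, max_employees: int) -> list[Candidate]:
    """Filter and prune the retrieved pool, and only then produce final rows."""
    kept = prune_unsupported(structured_filter(pool, min_employees, max_employees))
    # Rank rows with more supporting evidence first (an assumed heuristic).
    return sorted(kept, key=lambda c: len(c.evidence), reverse=True)


pool = [
    Candidate("Acme", 200, ["careers page lists a RevOps opening"]),
    Candidate("Globex", 50_000, ["press release on a new sales org"]),  # outside the size band
    Candidate("Initech", 300),  # fits the band but has no evidence
]
final_rows = build_results(pool, min_employees=100, max_employees=1_000)
print([c.name for c in final_rows])  # ['Acme']
```

The point of the ordering is that rejection happens before anything is written: two of the three retrieved candidates never reach the final list, one for failing the filter and one for lacking evidence.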

What’s next

GTM-Bench is an early step toward better evaluation for GTM agents.

Future versions will expand task coverage, improve reproducibility, and incorporate richer buyer-affinity signals that go beyond likely fit to estimate whether a matched prospect is actually likely to buy.

We are releasing GTM-Bench to make progress in agentic GTM measurable. Our goal is to give model providers, agent builders, researchers, and GTM teams a shared way to evaluate systems that do more than generate lists.

The next generation of GTM AI should not create more noise.

It should help sellers find the right buyers, with evidence, and turn data into revenue.
