Benchmark release

GTM Bench

A benchmark for measuring whether AI systems can turn an offer into useful go-to-market work: clear ICPs, grounded account matches, reachable contact paths, and activation-ready outputs.

Paper Code

72 production tasks

11 task suites

15 market categories

4 data environments

GTM-Bench is an open benchmark that measures whether AI agents can do real go-to-market work: understanding what a seller offers, inferring who should buy it, and retrieving the right accounts with evidence to back every recommendation. AI has made it trivially cheap to generate more outbound, so we built GTM-Bench to measure whether agents reduce that noise or add to it. The benchmark rewards evidence-backed matches rather than raw volume.

Leaderboard

GTM Bench Leaderboard

Results are derived from the GTM-Bench paper. The table reports volume-weighted net score, active A-grade rate, efficiency metrics, and trace context for the evaluated purpose-built and generalist systems.

Production tasks

72

Best net score

26,615.6 (Blackpearl RTSA)

Best generalist A-grade rate

37.7% (GPT-5.5 (Codex))

Data source: GTM-Bench paper Tables 6, 8, 9, and 12.

Rank	Runner (ID)	Harness	Net Score	A-grade %	Net / $	Cost	Composite Index	Returned Records	Trace Tendency
1	Blackpearl RTSA (blackpearl-rtsa)	Blackpearl RTSA	26,615.6	40.9%	56.63	$470.03	+14.18	38,823	Purpose-built offer, ICP, retrieval, and activation system
2	GPT-5.5 (Codex) (openai_gpt-5.5)	codex-cli / openai:gpt-5.5	4,040.9	37.7%	27.44	$147.27	+16.63	1,753	Compact, script-driven, selective
3	Claude Sonnet 4.6 (anthropic_claude-sonnet-4.6)	claude-code-cli / sonnet-4.6	400.1	27.3%	2.51	$159.61	+0.53	5,678	High-recall SQL/tgrep explorer
4	Claude Opus 4.7 (anthropic_claude-opus-4.7)	claude-code-cli / opus-4.7	-2,476.6	31.7%	-9.17	$270.20	-3.64	4,904	Deliberative, evidence-heavy, costly
5	DeepSeek V4 Pro (openrouter_deepseek-v4-pro)	hermes-agent-cli / deepseek-v4-pro	-3,398.0	21.8%	-99.26	$34.23	-10.77	2,531	Low-cost, web-assisted, moderate recall
6	Gemini 3.5 Flash (google_gemini-3.5-flash)	hermes-agent-cli / gemini-3.5-flash	-10,671.9	13.6%	-148.35	$71.94	-42.36	2,015	Fast but schema/identity-inconsistent
7	Kimi K2.6 (openrouter_kimi-k2.6)	hermes-agent-cli / kimi-k2.6	-15,402.3	22.8%	-171.56	$89.78	-22.74	5,496	Slow, expansive, overproduces
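
The Net / $ column looks like net score divided by total cost. This page does not give the paper's exact definition, so the Python sketch below is only a consistency check under that assumption, using values copied from the table. The names `ROWS`, `net_per_dollar`, and `max_deviation` are illustrative and are not part of the benchmark harness.

```python
# Hypothetical consistency check. ASSUMPTION: Net / $ = net score / cost (USD).
# Each tuple: (runner, net score, cost in USD, reported Net / $) from the leaderboard.
ROWS = [
    ("Blackpearl RTSA", 26615.6, 470.03, 56.63),
    ("GPT-5.5 (Codex)", 4040.9, 147.27, 27.44),
    ("Claude Sonnet 4.6", 400.1, 159.61, 2.51),
    ("Claude Opus 4.7", -2476.6, 270.20, -9.17),
    ("DeepSeek V4 Pro", -3398.0, 34.23, -99.26),
    ("Gemini 3.5 Flash", -10671.9, 71.94, -148.35),
    ("Kimi K2.6", -15402.3, 89.78, -171.56),
]


def net_per_dollar(net_score: float, cost_usd: float) -> float:
    """Net score per dollar spent, under the assumed definition."""
    return net_score / cost_usd


def max_deviation(rows) -> float:
    """Largest absolute gap between recomputed and reported Net / $."""
    return max(abs(net_per_dollar(net, cost) - reported)
               for _, net, cost, reported in rows)


if __name__ == "__main__":
    for name, net, cost, reported in ROWS:
        computed = net_per_dollar(net, cost)
        print(f"{name:<20} computed={computed:9.2f} reported={reported:9.2f}")
```

Under this assumption, the recomputed values agree with the reported column to within about 0.01, a gap small enough to be explained by rounding in the published net scores and costs.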

Net score by evaluated system

Blackpearl RTSA: 26,615.6
GPT-5.5 (Codex): 4,040.9
Claude Sonnet 4.6: 400.1
Claude Opus 4.7: -2,476.6
DeepSeek V4 Pro: -3,398.0
Gemini 3.5 Flash: -10,671.9
Kimi K2.6: -15,402.3

Benchmark analysis

Production catalog composition

The paper reports a 72-task production catalog designed from a source query taxonomy, then stratified across task suites, market categories, and pressure tags.

Task suites

Offer-grounded lead lists: 8.3%
Named-domain offer extraction: 11.1%
Offer-to-lead list: 12.5%
Vertical / geo / firmographic search: 11.1%
Persona / contact activation: 12.5%
Intent / trigger evidence: 9.7%
Technographic search: 6.9%
Offer-grounded lookalikes: 4.2%
Market-to-lead list: 4.2%
Buyer search with contact details: 13.9%
Limited-context prompts: 5.6%
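
The task-suite shares can be converted back to approximate task counts from the 72-task catalog size. The Python sketch below does that arithmetic. The resulting counts are inferred from rounded percentages rather than reported on this page, and the names `SUITE_SHARES` and `implied_counts` are illustrative only.

```python
# Hypothetical helper: infer per-suite task counts from the published shares.
# ASSUMPTION: each share is (tasks in suite / 72), rounded to one decimal place.
TOTAL_TASKS = 72  # production catalog size stated on this page

SUITE_SHARES = {
    "Offer-grounded lead lists": 8.3,
    "Named-domain offer extraction": 11.1,
    "Offer-to-lead list": 12.5,
    "Vertical / geo / firmographic search": 11.1,
    "Persona / contact activation": 12.5,
    "Intent / trigger evidence": 9.7,
    "Technographic search": 6.9,
    "Offer-grounded lookalikes": 4.2,
    "Market-to-lead list": 4.2,
    "Buyer search with contact details": 13.9,
    "Limited-context prompts": 5.6,
}


def implied_counts(shares: dict, total: int) -> dict:
    """Map each suite to the nearest whole number of tasks its share implies."""
    return {name: round(pct / 100 * total) for name, pct in shares.items()}


if __name__ == "__main__":
    counts = implied_counts(SUITE_SHARES, TOTAL_TASKS)
    for name, n in counts.items():
        print(f"{n:>3}  {name}")
    print(f"{sum(counts.values()):>3}  total")
```

The inferred counts sum to 72, which is consistent with each production task belonging to exactly one suite. The pressure-tag shares sum to well over 100%, which suggests a single task can carry more than one tag.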

Pressure tags

Geo/firmographic constraints: 44.4%
Compliance or policy sensitivity: 38.9%
Persona/contact fit: 37.5%
Intent or recency evidence: 26.4%
End-to-end activation: 18.1%
Technographic evidence: 12.5%
ICP inference: 11.1%
Comparator/lookalike reasoning: 5.6%

Source query taxonomy

Domain-based lead search: 43.8%
Generic lead search: 17.6%
Persona search: 6.8%
Offer-to-ICP discovery: 6.2%
Geographic/local search: 5.1%
Firmographic filtering: 4.5%
Industry category search: 4.4%
Technographic search: 4.1%
Intent/trigger search: 3.6%
Lookalike search: 3.5%
Market research: 1.7%
Non-GTM / Ambiguous: 1.4%

GTM-Bench is the first step in a longer effort to define how AI should be evaluated for go-to-market work. To collaborate, open an issue on GitHub or email us at research@blackpearl.com. We'd love to build the next version with you.

Paper and release artifacts

GTM Bench: Evaluating AI Systems for Go-to-Market Workflows

The site includes the current paper PDF, the leaderboard, and the open-source release for the benchmark harness.
