GTM-Bench Leaderboard

Results are derived from the GTM-Bench paper. The table reports volume-weighted net score, active A-grade rate, efficiency metrics, and trace context for the evaluated purpose-built and generalist systems.

Production tasks: 72

Best net score: 26,615.6 (Blackpearl RTSA)

Best generalist A-grade rate: 37.7% (GPT-5.5 (Codex))

Data source: GTM-Bench paper Tables 6, 8, 9, and 12.

Rank	Runner	Runner ID	Harness	Net Score	A-grade %	Net / $	Cost	Prod. Utility	Returned Records	Trace Tendency
1	Blackpearl RTSA	ready-to-advertise	Blackpearl RTSA	26,615.6	40.9%	56.63	$470.03	-	-	Purpose-built offer, ICP, retrieval, and activation system
2	GPT-5.5 (Codex)	openai_gpt-5.5	codex-cli / openai:gpt-5.5	4,040.9	37.7%	27.44	$147.27	0.407	1,753	Compact, script-driven, selective
3	Claude Sonnet 4.6	anthropic_claude-sonnet-4.6	claude-code-cli / sonnet-4.6	400.1	27.3%	2.51	$159.61	0.367	5,678	High-recall SQL/tgrep explorer
4	Claude Opus 4.7	anthropic_claude-opus-4.7	claude-code-cli / opus-4.7	-2,476.6	31.7%	-9.17	$270.20	0.352	4,904	Deliberative, evidence-heavy, costly
5	DeepSeek V4 Pro	openrouter_deepseek-v4-pro	hermes-agent-cli / deepseek-v4-pro	-3,398.0	21.8%	-99.26	$34.23	0.346	2,531	Low-cost, web-assisted, moderate recall
6	Gemini 3.5 Flash	google_gemini-3.5-flash	hermes-agent-cli / gemini-3.5-flash	-10,671.9	13.6%	-148.35	$71.94	0.303	2,015	Fast but schema/identity-inconsistent
7	Kimi K2.6	openrouter_kimi-k2.6	hermes-agent-cli / kimi-k2.6	-15,402.3	22.8%	-171.56	$89.78	0.316	5,496	Slow, expansive, overproduces
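
The Net / $ column appears to be the net score divided by the run cost. The Python sketch below recomputes it from the table values under that assumption, which the page itself does not state. The names `ROWS` and `net_per_dollar` are illustrative only and do not come from the GTM-Bench paper.

```python
# Net score and cost (USD) copied from the leaderboard table.
ROWS = {
    "Blackpearl RTSA": (26615.6, 470.03),
    "GPT-5.5 (Codex)": (4040.9, 147.27),
    "Claude Sonnet 4.6": (400.1, 159.61),
    "Claude Opus 4.7": (-2476.6, 270.20),
    "DeepSeek V4 Pro": (-3398.0, 34.23),
    "Gemini 3.5 Flash": (-10671.9, 71.94),
    "Kimi K2.6": (-15402.3, 89.78),
}


def net_per_dollar(net_score: float, cost_usd: float) -> float:
    """Assumed definition: net score divided by total cost in USD."""
    return net_score / cost_usd


efficiency = {name: net_per_dollar(net, cost) for name, (net, cost) in ROWS.items()}

for name, value in efficiency.items():
    print(f"{name:20s} {value:9.2f}")
```

Under this assumption the recomputed values agree with the published column to within about 0.01, which is consistent with the inputs having been rounded before publication.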

Benchmark analysis

Production catalog composition

The paper reports a 72-task production catalog designed from a source query taxonomy, then stratified across task suites, market categories, and pressure tags.

Task suites

Offer-grounded lead lists 8.3%

Named-domain offer extraction 11.1%

Offer-to-lead list 12.5%

Vertical / geo / firmographic search 11.1%

Persona / contact activation 12.5%

Intent / trigger evidence 9.7%

Technographic search 6.9%

Offer-grounded lookalikes 4.2%

Market-to-lead list 4.2%

Buyer search with contact details 13.9%

Limited-context prompts 5.6%
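
The suite shares above can be converted back to whole task counts. The sketch below assumes each percentage is a share of the 72-task catalog rounded to one decimal place. The names `SUITE_SHARE_PCT` and `implied_count` are hypothetical helpers, not part of the benchmark.

```python
TOTAL_TASKS = 72  # catalog size reported in the paper

# Task-suite shares (percent) as listed above.
SUITE_SHARE_PCT = {
    "Offer-grounded lead lists": 8.3,
    "Named-domain offer extraction": 11.1,
    "Offer-to-lead list": 12.5,
    "Vertical / geo / firmographic search": 11.1,
    "Persona / contact activation": 12.5,
    "Intent / trigger evidence": 9.7,
    "Technographic search": 6.9,
    "Offer-grounded lookalikes": 4.2,
    "Market-to-lead list": 4.2,
    "Buyer search with contact details": 13.9,
    "Limited-context prompts": 5.6,
}


def implied_count(share_pct: float, total: int = TOTAL_TASKS) -> int:
    """Nearest whole number of tasks implied by a rounded percentage."""
    return round(share_pct / 100 * total)


suite_counts = {name: implied_count(pct) for name, pct in SUITE_SHARE_PCT.items()}

for name, count in suite_counts.items():
    print(f"{count:3d}  {name}")
print("total:", sum(suite_counts.values()))
```

Under this assumption the implied counts sum to exactly 72, which suggests that each task belongs to exactly one suite.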

Pressure tags

Geo/firmographic constraints 44.4%

Compliance or policy sensitivity 38.9%

Persona/contact fit 37.5%

Intent or recency evidence 26.4%

End-to-end activation 18.1%

Technographic evidence 12.5%

ICP inference 11.1%

Comparator/lookalike reasoning 5.6%
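
The pressure-tag shares sum to well over 100%, so a task can evidently carry more than one tag. The sketch below assumes each share is a fraction of the 72-task catalog and estimates the implied number of tag assignments and the mean tags per task. The variable names are illustrative, and the derived figures are estimates rather than values reported in the paper.

```python
TOTAL_TASKS = 72  # catalog size reported in the paper

# Pressure-tag shares (percent) as listed above.
TAG_SHARE_PCT = {
    "Geo/firmographic constraints": 44.4,
    "Compliance or policy sensitivity": 38.9,
    "Persona/contact fit": 37.5,
    "Intent or recency evidence": 26.4,
    "End-to-end activation": 18.1,
    "Technographic evidence": 12.5,
    "ICP inference": 11.1,
    "Comparator/lookalike reasoning": 5.6,
}

# Nearest whole number of tasks implied by each rounded share.
tag_counts = {name: round(pct / 100 * TOTAL_TASKS) for name, pct in TAG_SHARE_PCT.items()}

total_assignments = sum(tag_counts.values())
mean_tags_per_task = total_assignments / TOTAL_TASKS

print("tag assignments:", total_assignments)
print(f"mean tags per task: {mean_tags_per_task:.2f}")
```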

Source query taxonomy

Domain-based lead search 43.8%

Generic lead search 17.6%

Persona search 6.8%

Offer-to-ICP discovery 6.2%

Geographic/local search 5.1%

Firmographic filtering 4.5%

Industry category search 4.4%

Technographic search 4.1%

Intent/trigger search 3.6%

Lookalike search 3.5%

Market research 1.7%

Non-GTM / Ambiguous 1.4%