Production tasks: 72
GTM-Bench Leaderboard
Results are derived from the GTM-Bench paper. The table reports volume-weighted net score, active A-grade rate, efficiency metrics, and trace context for the evaluated purpose-built and generalist systems.
Best net score: 26,615.6
Best generalist A-grade rate: 37.7%
Showing 7 entries
Data source: GTM-Bench paper Tables 6, 8, 9, and 12.
| Rank | Runner | Harness | Net Score | A-grade % | Net Score / $ | Cost | Production Utility | Returned Records | Trace Tendency |
|---|---|---|---|---|---|---|---|---|---|
| 1 | Blackpearl RTSA ready-to-advertise | Blackpearl RTSA | 26,615.6 | 40.9% | 56.63 | $470.03 | - | - | Purpose-built offer, ICP, retrieval, and activation system |
| 2 | GPT-5.5 (Codex) openai_gpt-5.5 | codex-cli / openai:gpt-5.5 | 4,040.9 | 37.7% | 27.44 | $147.27 | 0.407 | 1,753 | Compact, script-driven, selective |
| 3 | Claude Sonnet 4.6 anthropic_claude-sonnet-4.6 | claude-code-cli / sonnet-4.6 | 400.1 | 27.3% | 2.51 | $159.61 | 0.367 | 5,678 | High-recall SQL/tgrep explorer |
| 4 | Claude Opus 4.7 anthropic_claude-opus-4.7 | claude-code-cli / opus-4.7 | -2,476.6 | 31.7% | -9.17 | $270.20 | 0.352 | 4,904 | Deliberative, evidence-heavy, costly |
| 5 | DeepSeek V4 Pro openrouter_deepseek-v4-pro | hermes-agent-cli / deepseek-v4-pro | -3,398.0 | 21.8% | -99.26 | $34.23 | 0.346 | 2,531 | Low-cost, web-assisted, moderate recall |
| 6 | Gemini 3.5 Flash google_gemini-3.5-flash | hermes-agent-cli / gemini-3.5-flash | -10,671.9 | 13.6% | -148.35 | $71.94 | 0.303 | 2,015 | Fast but schema/identity-inconsistent |
| 7 | Kimi K2.6 openrouter_kimi-k2.6 | hermes-agent-cli / kimi-k2.6 | -15,402.3 | 22.8% | -171.56 | $89.78 | 0.316 | 5,496 | Slow, expansive, overproduces |
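The net-per-dollar column in the table above appears to be the net score divided by the run cost. The sketch below checks that reading against the seven rows. The definition is an assumption inferred from the numbers, not something the page states, and the helper names `net_per_dollar` and `max_abs_deviation` are made up for illustration.

```python
# Rows copied from the leaderboard table:
# (runner, net score, cost in USD, reported net score per dollar).
ROWS = [
    ("Blackpearl RTSA", 26615.6, 470.03, 56.63),
    ("GPT-5.5 (Codex)", 4040.9, 147.27, 27.44),
    ("Claude Sonnet 4.6", 400.1, 159.61, 2.51),
    ("Claude Opus 4.7", -2476.6, 270.20, -9.17),
    ("DeepSeek V4 Pro", -3398.0, 34.23, -99.26),
    ("Gemini 3.5 Flash", -10671.9, 71.94, -148.35),
    ("Kimi K2.6", -15402.3, 89.78, -171.56),
]


def net_per_dollar(net_score, cost_usd):
    """Assumed definition: net score divided by cost in USD."""
    return net_score / cost_usd


def max_abs_deviation(rows):
    """Largest gap between the recomputed ratio and the reported value."""
    return max(abs(net_per_dollar(net, cost) - reported)
               for _, net, cost, reported in rows)


if __name__ == "__main__":
    for name, net, cost, reported in ROWS:
        computed = net_per_dollar(net, cost)
        print(f"{name:18s} computed={computed:9.2f} reported={reported:9.2f}")
    print(f"max deviation: {max_abs_deviation(ROWS):.4f}")
```

Under this assumption every row agrees with the reported value to within about 0.01, which is consistent with the published net scores and costs having been rounded before display.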
Benchmark analysis
Production catalog composition
The paper reports a 72-task production catalog designed from a source query taxonomy, then stratified across task suites, market categories, and pressure tags.
Task suites
Offer-grounded lead lists 8.3%
Named-domain offer extraction 11.1%
Offer-to-lead list 12.5%
Vertical / geo / firmographic search 11.1%
Persona / contact activation 12.5%
Intent / trigger evidence 9.7%
Technographic search 6.9%
Offer-grounded lookalikes 4.2%
Market-to-lead list 4.2%
Buyer search with contact details 13.9%
Limited-context prompts 5.6%
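The task-suite shares above can be converted back into task counts, assuming each share is a rounded fraction of the 72-task catalog. This is a minimal sketch of that arithmetic. The implied counts are derived here, not reported by the page, and the function name `implied_counts` is hypothetical.

```python
TOTAL_TASKS = 72  # size of the production catalog

# Task-suite shares in percent, copied from the list above.
TASK_SUITE_SHARES = {
    "Offer-grounded lead lists": 8.3,
    "Named-domain offer extraction": 11.1,
    "Offer-to-lead list": 12.5,
    "Vertical / geo / firmographic search": 11.1,
    "Persona / contact activation": 12.5,
    "Intent / trigger evidence": 9.7,
    "Technographic search": 6.9,
    "Offer-grounded lookalikes": 4.2,
    "Market-to-lead list": 4.2,
    "Buyer search with contact details": 13.9,
    "Limited-context prompts": 5.6,
}


def implied_counts(shares, total):
    """Nearest whole number of tasks for each percentage share."""
    return {name: round(pct / 100 * total) for name, pct in shares.items()}


if __name__ == "__main__":
    counts = implied_counts(TASK_SUITE_SHARES, TOTAL_TASKS)
    for name, n in counts.items():
        print(f"{n:3d}  {name}")
    print(f"{sum(counts.values()):3d}  total")
```

The implied counts add up to exactly 72, which suggests that each task belongs to a single suite.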
Pressure tags
Geo/firmographic constraints 44.4%
Compliance or policy sensitivity 38.9%
Persona/contact fit 37.5%
Intent or recency evidence 26.4%
End-to-end activation 18.1%
Technographic evidence 12.5%
ICP inference 11.1%
Comparator/lookalike reasoning 5.6%
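Unlike the task suites, the pressure-tag shares above sum to well over 100%, so a task can presumably carry more than one tag. The sketch below recovers implied tag counts and the average number of tags per task, again assuming each share is a rounded fraction of the 72 tasks. The derived figures and the helper name `tag_counts` are illustrative and are not reported by the page.

```python
TOTAL_TASKS = 72  # size of the production catalog

# Pressure-tag shares in percent, copied from the list above.
PRESSURE_TAG_SHARES = {
    "Geo/firmographic constraints": 44.4,
    "Compliance or policy sensitivity": 38.9,
    "Persona/contact fit": 37.5,
    "Intent or recency evidence": 26.4,
    "End-to-end activation": 18.1,
    "Technographic evidence": 12.5,
    "ICP inference": 11.1,
    "Comparator/lookalike reasoning": 5.6,
}


def tag_counts(shares, total):
    """Nearest whole number of tagged tasks for each percentage share."""
    return {name: round(pct / 100 * total) for name, pct in shares.items()}


if __name__ == "__main__":
    counts = tag_counts(PRESSURE_TAG_SHARES, TOTAL_TASKS)
    total_share = round(sum(PRESSURE_TAG_SHARES.values()), 1)
    total_tags = sum(counts.values())
    print(f"sum of shares: {total_share}%")
    print(f"implied tag assignments: {total_tags}")
    print(f"average tags per task: {total_tags / TOTAL_TASKS:.2f}")
```

Under these assumptions the catalog carries roughly two pressure tags per task on average.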
Source query taxonomy
Domain based lead search 43.8%
Generic lead search 17.6%
Persona search 6.8%
Offer to ICP discovery 6.2%
Geographic/local search 5.1%
Firmographic filtering 4.5%
Industry category search 4.4%
Technographic search 4.1%
Intent/trigger search 3.6%
Lookalike search 3.5%
Market research 1.7%
Non-GTM / Ambiguous 1.4%