Blackpearl GTM Bench

GTM Bench Leaderboard

Results are derived from the GTM-Bench paper. The table reports volume-weighted net score, active A-grade rate, efficiency metrics, and trace context for the evaluated purpose-built and generalist systems.

Production tasks: 72
Best net score: 26,615.6
Best generalist A-grade rate: 37.7%

Data source: GTM-Bench paper Tables 6, 8, 9, and 12.

| Rank | Runner | Harness | Net Score | A-grade % | Net / $ | Cost | Prod. Utility | Returned Records | Trace Tendency |
|---|---|---|---|---|---|---|---|---|---|
| 1 | Blackpearl RTSA ready-to-advertise | Blackpearl RTSA | 26,615.6 | 40.9% | 56.63 | $470.03 | - | - | Purpose-built offer, ICP, retrieval, and activation system |
| 2 | GPT-5.5 (Codex) openai_gpt-5.5 | codex-cli / openai:gpt-5.5 | 4,040.9 | 37.7% | 27.44 | $147.27 | 0.407 | 1,753 | Compact, script-driven, selective |
| 3 | Claude Sonnet 4.6 anthropic_claude-sonnet-4.6 | claude-code-cli / sonnet-4.6 | 400.1 | 27.3% | 2.51 | $159.61 | 0.367 | 5,678 | High-recall SQL/tgrep explorer |
| 4 | Claude Opus 4.7 anthropic_claude-opus-4.7 | claude-code-cli / opus-4.7 | -2,476.6 | 31.7% | -9.17 | $270.20 | 0.352 | 4,904 | Deliberative, evidence-heavy, costly |
| 5 | DeepSeek V4 Pro openrouter_deepseek-v4-pro | hermes-agent-cli / deepseek-v4-pro | -3,398.0 | 21.8% | -99.26 | $34.23 | 0.346 | 2,531 | Low-cost, web-assisted, moderate recall |
| 6 | Gemini 3.5 Flash google_gemini-3.5-flash | hermes-agent-cli / gemini-3.5-flash | -10,671.9 | 13.6% | -148.35 | $71.94 | 0.303 | 2,015 | Fast but schema/identity-inconsistent |
| 7 | Kimi K2.6 openrouter_kimi-k2.6 | hermes-agent-cli / kimi-k2.6 | -15,402.3 | 22.8% | -171.56 | $89.78 | 0.316 | 5,496 | Slow, expansive, overproduces |
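
The page does not define the Net / $ column. The reported values are consistent with net score divided by cost, so the following minimal sketch recomputes the column under that assumption. The function name `net_per_dollar` and the `ROWS` layout are illustrative, not taken from the paper. Small differences from the reported values are expected because the published inputs are themselves rounded.

```python
# Assumption: Net / $ = Net Score / Cost (not stated on the page).
# Values are copied from the leaderboard table: (runner, net score, cost in USD, reported Net / $).
ROWS = [
    ("Blackpearl RTSA", 26615.6, 470.03, 56.63),
    ("GPT-5.5 (Codex)", 4040.9, 147.27, 27.44),
    ("Claude Sonnet 4.6", 400.1, 159.61, 2.51),
    ("Claude Opus 4.7", -2476.6, 270.20, -9.17),
    ("DeepSeek V4 Pro", -3398.0, 34.23, -99.26),
    ("Gemini 3.5 Flash", -10671.9, 71.94, -148.35),
    ("Kimi K2.6", -15402.3, 89.78, -171.56),
]


def net_per_dollar(net_score: float, cost_usd: float) -> float:
    """Return net score per dollar spent (hypothetical helper)."""
    return net_score / cost_usd


# Largest absolute gap between the recomputed and the reported column.
max_gap = max(abs(net_per_dollar(net, cost) - reported) for _, net, cost, reported in ROWS)

for name, net, cost, reported in ROWS:
    computed = net_per_dollar(net, cost)
    print(f"{name:<18} computed={computed:9.2f} reported={reported:9.2f}")
```

Under this reading, a cheap run with a negative net score (for example DeepSeek V4 Pro at $34.23) shows a large negative Net / $ even though its absolute loss is smaller than that of more expensive runs.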

Benchmark analysis

Production catalog composition

The paper reports a 72-task production catalog designed from a source query taxonomy, then stratified across task suites, market categories, and pressure tags.

Task suites

Offer-grounded lead lists 8.3%
Named-domain offer extraction 11.1%
Offer-to-lead list 12.5%
Vertical / geo / firmographic search 11.1%
Persona / contact activation 12.5%
Intent / trigger evidence 9.7%
Technographic search 6.9%
Offer-grounded lookalikes 4.2%
Market-to-lead list 4.2%
Buyer search with contact details 13.9%
Limited-context prompts 5.6%
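
The suite shares can be converted back into task counts. This is a minimal sketch that assumes each percentage is a share of the 72-task production catalog, rounded to one decimal place; the names `SUITE_SHARES` and `share_to_count` are illustrative.

```python
TOTAL_TASKS = 72  # production catalog size reported on the page

# Suite shares in percent, copied from the list above.
SUITE_SHARES = {
    "Offer-grounded lead lists": 8.3,
    "Named-domain offer extraction": 11.1,
    "Offer-to-lead list": 12.5,
    "Vertical / geo / firmographic search": 11.1,
    "Persona / contact activation": 12.5,
    "Intent / trigger evidence": 9.7,
    "Technographic search": 6.9,
    "Offer-grounded lookalikes": 4.2,
    "Market-to-lead list": 4.2,
    "Buyer search with contact details": 13.9,
    "Limited-context prompts": 5.6,
}


def share_to_count(share_pct: float, total: int) -> int:
    """Convert a rounded percentage share into the nearest whole task count."""
    return round(share_pct / 100 * total)


suite_counts = {name: share_to_count(pct, TOTAL_TASKS) for name, pct in SUITE_SHARES.items()}

for name, count in suite_counts.items():
    print(f"{count:>2}  {name}")
print("total:", sum(suite_counts.values()))
```

If the assumption holds, the recovered counts should add up to the full catalog, which is a quick consistency check on the published shares.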

Pressure tags

Geo/firmographic constraints 44.4%
Compliance or policy sensitivity 38.9%
Persona/contact fit 37.5%
Intent or recency evidence 26.4%
End-to-end activation 18.1%
Technographic evidence 12.5%
ICP inference 11.1%
Comparator/lookalike reasoning 5.6%
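
Unlike the task suites, the pressure-tag shares add up to well over 100%, which suggests that a single task can carry several tags. The sketch below estimates the average number of tags per task, assuming each percentage is the share of the 72 production tasks that carry that tag. That interpretation, and the names `TAG_SHARES` and `tag_counts`, are assumptions rather than definitions from the paper.

```python
TOTAL_TASKS = 72  # production catalog size reported on the page

# Pressure-tag shares in percent, copied from the list above.
TAG_SHARES = {
    "Geo/firmographic constraints": 44.4,
    "Compliance or policy sensitivity": 38.9,
    "Persona/contact fit": 37.5,
    "Intent or recency evidence": 26.4,
    "End-to-end activation": 18.1,
    "Technographic evidence": 12.5,
    "ICP inference": 11.1,
    "Comparator/lookalike reasoning": 5.6,
}

# Assumption: each share is the fraction of tasks carrying the tag.
tag_counts = {name: round(pct / 100 * TOTAL_TASKS) for name, pct in TAG_SHARES.items()}

total_assignments = sum(tag_counts.values())
avg_tags_per_task = total_assignments / TOTAL_TASKS

print("tag assignments:", total_assignments)
print(f"average tags per task: {avg_tags_per_task:.2f}")
```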

Source query taxonomy

Domain-based lead search 43.8%
Generic lead search 17.6%
Persona search 6.8%
Offer-to-ICP discovery 6.2%
Geographic/local search 5.1%
Firmographic filtering 4.5%
Industry category search 4.4%
Technographic search 4.1%
Intent/trigger search 3.6%
Lookalike search 3.5%
Market research 1.7%
Non-GTM / Ambiguous 1.4%