Blackpearl GTM Bench

Blackpearl benchmark release

GTM Bench by Blackpearl

A benchmark for measuring whether AI systems can turn an offer into useful go-to-market (GTM) work: clear ideal customer profiles (ICPs), grounded account matches, reachable contact paths, and activation-ready outputs.

72 production tasks
11 task suites
15 market categories
4 data environments

Benchmark data

Results from the production catalog

The current paper reports a 72-task GTM-Bench catalog designed from a taxonomy of real Bebop.ai prospecting queries. Here's a preview of the paper-derived leaderboard and catalog composition.

| Rank | Runner | Runner ID | Harness | Net Score | A-grade % | Net / $ | Cost | Prod. Utility | Returned Records | Trace Tendency |
|------|--------|-----------|---------|-----------|-----------|---------|------|---------------|------------------|----------------|
| 1 | Blackpearl RTSA | ready-to-advertise | Blackpearl RTSA | 26,615.6 | 40.9% | 56.63 | $470.03 | - | - | Purpose-built offer, ICP, retrieval, and activation system |
| 2 | GPT-5.5 (Codex) | openai_gpt-5.5 | codex-cli / openai:gpt-5.5 | 4,040.9 | 37.7% | 27.44 | $147.27 | 0.407 | 1,753 | Compact, script-driven, selective |
| 3 | Claude Sonnet 4.6 | anthropic_claude-sonnet-4.6 | claude-code-cli / sonnet-4.6 | 400.1 | 27.3% | 2.51 | $159.61 | 0.367 | 5,678 | High-recall SQL/tgrep explorer |
| 4 | Claude Opus 4.7 | anthropic_claude-opus-4.7 | claude-code-cli / opus-4.7 | -2,476.6 | 31.7% | -9.17 | $270.20 | 0.352 | 4,904 | Deliberative, evidence-heavy, costly |
| 5 | DeepSeek V4 Pro | openrouter_deepseek-v4-pro | hermes-agent-cli / deepseek-v4-pro | -3,398.0 | 21.8% | -99.26 | $34.23 | 0.346 | 2,531 | Low-cost, web-assisted, moderate recall |

Showing the top 5 of 7 evaluated systems from the paper. The full table is on the leaderboard page.
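
The Net / $ column appears to be the net score divided by the run cost in dollars. That reading is inferred from the published figures, not stated on the page. A minimal Python sketch that recomputes the column for the five rows shown (the name `net_per_dollar` and the `ROWS` mapping are hypothetical, introduced only for illustration):

```python
# Net score and cost (USD) copied from the five preview rows.
ROWS = {
    "Blackpearl RTSA": (26615.6, 470.03),
    "GPT-5.5 (Codex)": (4040.9, 147.27),
    "Claude Sonnet 4.6": (400.1, 159.61),
    "Claude Opus 4.7": (-2476.6, 270.20),
    "DeepSeek V4 Pro": (-3398.0, 34.23),
}


def net_per_dollar(net_score: float, cost_usd: float) -> float:
    """Assumed definition: net score divided by cost in dollars."""
    if cost_usd <= 0:
        raise ValueError("cost must be positive")
    return net_score / cost_usd


if __name__ == "__main__":
    for runner, (net, cost) in ROWS.items():
        print(f"{runner}: {net_per_dollar(net, cost):.2f}")
```

Under this assumption the recomputed values agree with the published column to within about 0.01; the small residual presumably comes from rounding in the displayed inputs.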

Task suites

Offer-grounded lead lists 8.3%
Named-domain offer extraction 11.1%
Offer-to-lead list 12.5%
Vertical / geo / firmographic search 11.1%
Persona / contact activation 12.5%
Intent / trigger evidence 9.7%
Technographic search 6.9%
Offer-grounded lookalikes 4.2%
Market-to-lead list 4.2%
Buyer search with contact details 13.9%
Limited-context prompts 5.6%
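
The suite shares above are consistent with whole-task counts out of the 72-task catalog. The sketch below converts each percentage back to a count, assuming every share is a fraction of 72 rounded to one decimal place. That is an assumption: the page lists only percentages, and `implied_count` is a hypothetical helper.

```python
TOTAL_TASKS = 72

# Shares (percent) copied from the task-suite list.
SUITE_SHARES = {
    "Offer-grounded lead lists": 8.3,
    "Named-domain offer extraction": 11.1,
    "Offer-to-lead list": 12.5,
    "Vertical / geo / firmographic search": 11.1,
    "Persona / contact activation": 12.5,
    "Intent / trigger evidence": 9.7,
    "Technographic search": 6.9,
    "Offer-grounded lookalikes": 4.2,
    "Market-to-lead list": 4.2,
    "Buyer search with contact details": 13.9,
    "Limited-context prompts": 5.6,
}


def implied_count(share_pct: float, total: int = TOTAL_TASKS) -> int:
    """Nearest whole number of tasks for a percentage share of `total`."""
    return round(share_pct / 100 * total)


counts = {suite: implied_count(pct) for suite, pct in SUITE_SHARES.items()}

if __name__ == "__main__":
    for suite, n in counts.items():
        print(f"{n:2d}  {suite}")
    print("total:", sum(counts.values()))
```

Under that assumption the implied counts sum to 72, matching the catalog size, and each count divided by 72 rounds back to the listed percentage.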

Source query taxonomy

Domain based lead search 43.8%
Generic lead search 17.6%
Persona search 6.8%
Offer to ICP discovery 6.2%
Geographic/local search 5.1%
Firmographic filtering 4.5%
Industry category search 4.4%
Technographic search 4.1%
Intent/trigger search 3.6%
Lookalike search 3.5%
Market research 1.7%
Non-GTM / Ambiguous 1.4%

Market categories

Ecommerce, Retail & CPG 12.5%
Marketing & Revenue Growth 8.3%
Financial Services & Insurance 8.3%
Healthcare & Wellness 8.3%
Cybersecurity & IT 6.9%
Local & Field Services 6.9%
Industrial, Manufacturing & Maintenance 6.9%
Events, Hospitality & Media 6.9%
Recruiting & Talent 5.6%
Nonprofit, Education & Community 5.6%
RevOps, Data & Admin Services 5.6%
Sustainability & Energy 5.6%
Logistics & Fleet 4.2%
Public Sector 4.2%
Real Estate & Property 4.2%

Task registry

Representative production prompts

The production catalog includes concrete tasks with suite, archetype, market category, data access mode, expected schema, and scoring focus.

Q014 · web only · offer definition

Website-to-ICP inference

Find the best-fit customers for https://floralawn-and-landscaping.com/. First infer what the company appears to sell, then define the likely ICP before listing target account types.

Construction & Home Services
Offer extraction from sparse domain, ICP fit, and evidence groundedness.

Q061 · third party database connector · ranked contact list

Contact/persona extraction

Find 50 Owner, GM, Marketing Manager contacts at roofing, HVAC, plumbing, and remodeling companies in the United States that are likely buyers of local SEO and website redesign services.

Construction & Home Services
Persona fit, contactability, evidence quality, compliance posture, and schema adherence.

Q086 · third party database connector · ranked company list

Technographic evidence

Find companies using HubSpot and Salesforce that may need data integration services. Include technology evidence, source URL, confidence, and why the technology implies need.

Software, SaaS & IT Services
Technology evidence, false-positive control, and company fit.

Q107 · pearl engine access · activation ready lead table

Buyer search with contact details

We sell commercial funding and bookkeeping services. Build an end-to-end GTM target list for small businesses needing capital in the United States.

Financial Services & Insurance
Full workflow quality, evidence, contactability, compliance, schema, cost, and runtime.

Q124 · pearl engine access · activation ready lead table

Limited-context prompts

Find decision makers in Phoenix at companies with 51-200 employees showing intent for cybersecurity or cloud migration in the past 30 days.

Cybersecurity & IT
Hard-mode multi-constraint retrieval, source evidence, schema, contactability, and compliance.
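
Each card above carries the same fields: task ID, data access mode, expected schema, archetype, prompt, market category, and scoring focus. One way to hold them in code is a small record type. The sketch below is illustrative only: `TaskCard` and its field names are hypothetical and are not taken from the released harness.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TaskCard:
    """One catalog entry. Field names are illustrative, not the release schema."""

    task_id: str
    access_mode: str
    expected_schema: str
    archetype: str
    prompt: str
    market_category: str
    scoring_focus: str
    suite: Optional[str] = None  # the cards do not show the suite separately


# Values copied from the Q086 card.
q086 = TaskCard(
    task_id="Q086",
    access_mode="third party database connector",
    expected_schema="ranked company list",
    archetype="Technographic evidence",
    prompt=(
        "Find companies using HubSpot and Salesforce that may need data "
        "integration services. Include technology evidence, source URL, "
        "confidence, and why the technology implies need."
    ),
    market_category="Software, SaaS & IT Services",
    scoring_focus="Technology evidence, false-positive control, and company fit.",
)

if __name__ == "__main__":
    print(q086.task_id, "|", q086.access_mode, "|", q086.expected_schema)
```

Freezing the dataclass keeps a loaded catalog immutable, which makes it harder to alter a prompt by accident between runs.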

Evaluation method

Complete GTM systems, not isolated model prose

GTM-Bench evaluates model, harness, tools, data access, retrieval, validation, and final activation output as one auditable workflow.

01. Observed source corpus

Start from 59,881 real Bebop.ai opening queries submitted from 2025-02-07 through 2026-05-19.

02. Production catalog

Select 72 tasks across suites, source-query patterns, market categories, and benchmark pressure tags.

03. Three artifacts

Require OFFER.md, ICP.md, and RESULTS.csv so offer inference, ICP synthesis, and row ranking can be evaluated separately.

04. Signed row utility

Reward A-grade records, give no utility to B-grade rows, and penalize below-B rows that are unsupported or commercially weak.
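
The last two steps can be sketched in a few lines of Python. This is a rough illustration under stated assumptions, not the benchmark's scoring code: the function names are hypothetical, the grades are assumed to come from a separate judging step, and the numeric weights are placeholders. Only their signs follow the text (A rewarded, B neutral, below-B penalized).

```python
from pathlib import Path

# The three artifacts each run is required to produce.
REQUIRED_ARTIFACTS = ("OFFER.md", "ICP.md", "RESULTS.csv")

# Placeholder weights. Only the signs are taken from the description;
# the magnitudes are invented for illustration.
GRADE_UTILITY = {"A": 1.0, "B": 0.0}
BELOW_B_PENALTY = -1.0


def missing_artifacts(run_dir: Path) -> list[str]:
    """Return the required artifact names that are absent from run_dir."""
    return [name for name in REQUIRED_ARTIFACTS if not (run_dir / name).is_file()]


def signed_row_utility(grades: list[str]) -> float:
    """Sum per-row utility; any grade other than A or B counts as below-B."""
    return sum(
        GRADE_UTILITY.get(grade.strip().upper(), BELOW_B_PENALTY)
        for grade in grades
    )


if __name__ == "__main__":
    print(missing_artifacts(Path(".")))
    print(signed_row_utility(["A", "A", "B", "C"]))
```

With these placeholder weights, `signed_row_utility(["A", "A", "B", "C"])` returns 1.0: two rewarded rows, one neutral row, and one penalized row. A signed sum like this makes padding a result file with weak rows costly, which fits the negative net scores that some runners receive on the leaderboard.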

Paper and release artifacts

GTM Bench: Evaluating AI Systems for Go-to-Market Workflows

The site includes the current paper PDF, a web version of the paper, the leaderboard, and the open-source release for the benchmark harness.