Blackpearl benchmark release
GTM Bench by Blackpearl
A benchmark for measuring whether AI systems can turn an offer into useful go-to-market work: clear ICPs, grounded account matches, reachable contact paths, and activation-ready outputs.
Benchmark data
Results from the production catalog
The current paper reports a 72-task GTM-Bench catalog built from a taxonomy of real Bebop.ai prospecting queries. The leaderboard preview and catalog composition below are drawn from the paper.
| Rank | Runner | Harness | Net Score | A-grade % | Net / $ | Cost | Prod. Utility | Returned Records | Trace Tendency |
|---|---|---|---|---|---|---|---|---|---|
| 1 | Blackpearl RTSA (`ready-to-advertise`) | Blackpearl RTSA | 26,615.6 | 40.9% | 56.63 | $470.03 | - | - | Purpose-built offer, ICP, retrieval, and activation system |
| 2 | GPT-5.5 (Codex) (`openai_gpt-5.5`) | codex-cli / openai:gpt-5.5 | 4,040.9 | 37.7% | 27.44 | $147.27 | 0.407 | 1,753 | Compact, script-driven, selective |
| 3 | Claude Sonnet 4.6 (`anthropic_claude-sonnet-4.6`) | claude-code-cli / sonnet-4.6 | 400.1 | 27.3% | 2.51 | $159.61 | 0.367 | 5,678 | High-recall SQL/tgrep explorer |
| 4 | Claude Opus 4.7 (`anthropic_claude-opus-4.7`) | claude-code-cli / opus-4.7 | -2,476.6 | 31.7% | -9.17 | $270.20 | 0.352 | 4,904 | Deliberative, evidence-heavy, costly |
| 5 | DeepSeek V4 Pro (`openrouter_deepseek-v4-pro`) | hermes-agent-cli / deepseek-v4-pro | -3,398.0 | 21.8% | -99.26 | $34.23 | 0.346 | 2,531 | Low-cost, web-assisted, moderate recall |
Showing the top 5 of 7 evaluated systems from the paper. The full table is on the leaderboard page.
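The Net / $ column appears to be the net score divided by total cost. The page does not state the formula, so the sketch below treats that definition as an assumption and checks it against the five published rows. Recomputing from the rounded table values agrees with the published column to within 0.01, which is consistent with the published figures having been computed from unrounded inputs.

```python
# Published leaderboard values copied from the table:
# (runner, net score, cost in USD, published Net / $).
PUBLISHED = [
    ("Blackpearl RTSA", 26615.6, 470.03, 56.63),
    ("GPT-5.5 (Codex)", 4040.9, 147.27, 27.44),
    ("Claude Sonnet 4.6", 400.1, 159.61, 2.51),
    ("Claude Opus 4.7", -2476.6, 270.20, -9.17),
    ("DeepSeek V4 Pro", -3398.0, 34.23, -99.26),
]


def net_per_dollar(net_score: float, cost_usd: float) -> float:
    """Assumed definition: net score divided by total cost in USD."""
    if cost_usd <= 0:
        raise ValueError("cost_usd must be positive")
    return net_score / cost_usd


def max_rounding_gap(rows) -> float:
    """Largest absolute gap between the recomputed and published ratios."""
    return max(
        abs(round(net_per_dollar(net, cost), 2) - published)
        for _, net, cost, published in rows
    )


if __name__ == "__main__":
    for runner, net, cost, published in PUBLISHED:
        recomputed = net_per_dollar(net, cost)
        print(f"{runner:20s} recomputed={recomputed:8.2f} published={published:8.2f}")
```

Under this reading, a low-cost system with a negative net score gets the most negative ratio, which matches DeepSeek V4 Pro having the lowest cost and the lowest Net / $ among the five rows shown.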
Source query taxonomy
Market categories
Task registry
Representative production prompts
The production catalog includes concrete tasks with suite, archetype, market category, data access mode, expected schema, and scoring focus.
Website-to-ICP inference
Find the best-fit customers for https://floralawn-and-landscaping.com/. First infer what the company appears to sell, then define the likely ICP before listing target account types.
Scoring focus: offer extraction from a sparse domain, ICP fit, and evidence groundedness.
Contact/persona extraction
Find 50 Owner, GM, Marketing Manager contacts at roofing, HVAC, plumbing, and remodeling companies in the United States that are likely buyers of local SEO and website redesign services.
Scoring focus: persona fit, contactability, evidence quality, compliance posture, and schema adherence.
Technographic evidence
Find companies using HubSpot and Salesforce that may need data integration services. Include technology evidence, source URL, confidence, and why the technology implies need.
Scoring focus: technology evidence, false-positive control, and company fit.
Buyer search with contact details
We sell commercial funding and bookkeeping services. Build an end-to-end GTM target list for small businesses needing capital in the United States.
Scoring focus: full workflow quality, evidence, contactability, compliance, schema, cost, and runtime.
Limited-context prompts
Find decision makers in Phoenix at companies with 51-200 employees showing intent for cybersecurity or cloud migration in the past 30 days.
Scoring focus: hard-mode multi-constraint retrieval, source evidence, schema, contactability, and compliance.
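One way to picture a catalog entry is as a small record carrying the fields named earlier: suite, archetype, market category, data access mode, expected schema, and scoring focus. The sketch below is a hypothetical representation, not the benchmark's actual schema. The class name `CatalogTask` and its field names are assumptions, and any field whose value this page does not show is left as `None` instead of being guessed.

```python
from dataclasses import dataclass, asdict
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class CatalogTask:
    """Hypothetical record for one catalog task (field names are assumptions)."""

    prompt: str
    scoring_focus: Tuple[str, ...]
    suite: Optional[str] = None
    archetype: Optional[str] = None
    market_category: Optional[str] = None
    data_access_mode: Optional[str] = None
    expected_schema: Optional[Tuple[str, ...]] = None

    def missing_fields(self) -> List[str]:
        """Names of fields that have not been filled in yet."""
        return [name for name, value in asdict(self).items() if value is None]


# Prompt and scoring focus are copied from the technographic example above.
# The remaining fields are not shown on this page, so they stay None.
technographic = CatalogTask(
    prompt=(
        "Find companies using HubSpot and Salesforce that may need data "
        "integration services. Include technology evidence, source URL, "
        "confidence, and why the technology implies need."
    ),
    scoring_focus=(
        "technology evidence",
        "false-positive control",
        "company fit",
    ),
)

if __name__ == "__main__":
    print(technographic.missing_fields())
```

Keeping unknown fields explicit makes it easy to see which parts of a task definition still have to come from the released task registry.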
Evaluation method
Complete GTM systems, not isolated model prose
GTM-Bench evaluates model, harness, tools, data access, retrieval, validation, and final activation output as one auditable workflow.
Observed source corpus
Start from 59,881 real Bebop.ai opening queries submitted from 2025-02-07 through 2026-05-19.
Production catalog
Select 72 tasks across suites, source-query patterns, market categories, and benchmark pressure tags.
Three artifacts
Require OFFER.md, ICP.md, and RESULTS.csv so offer inference, ICP synthesis, and row ranking can be evaluated separately.
Signed row utility
Reward A-grade records, give no utility to B-grade rows, and penalize below-B rows that are unsupported or commercially weak.
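The signed row utility rule can be sketched as a per-row score that is positive for A-grade rows, zero for B-grade rows, and negative for anything below B. The actual weights are not given on this page, so the values `A_REWARD = 1.0` and `BELOW_B_PENALTY = -1.0` are placeholders, and the grade labels "C" and "D" in the example stand in for whatever below-B grades the benchmark uses.

```python
from typing import Dict, Iterable

# Placeholder weights: the page states only the signs, not the magnitudes.
A_REWARD = 1.0
B_UTILITY = 0.0
BELOW_B_PENALTY = -1.0


def row_utility(grade: str) -> float:
    """Signed utility for one graded row: reward A, ignore B, penalize the rest."""
    normalized = grade.strip().upper()
    if normalized == "A":
        return A_REWARD
    if normalized == "B":
        return B_UTILITY
    return BELOW_B_PENALTY


def summarize(grades: Iterable[str]) -> Dict[str, float]:
    """Net signed utility and share of A-grade rows for one submission."""
    graded = [g.strip().upper() for g in grades]
    net = sum(row_utility(g) for g in graded)
    a_share = graded.count("A") / len(graded) if graded else 0.0
    return {"net_utility": net, "a_grade_share": a_share}


if __name__ == "__main__":
    # Two A rows, one B row, and two below-B rows.
    print(summarize(["A", "A", "B", "C", "D"]))
```

Under a signed scheme like this, a submission can hold a sizable A-grade share and still end up with a zero or negative net score when below-B rows outweigh the A-grade ones, so returning more rows is not automatically better.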
Paper and release artifacts
GTM Bench: Evaluating AI Systems for Go-to-Market Workflows
The site includes the current paper PDF, a web version of the paper, the leaderboard, and the open-source release for the benchmark harness.