// methodology

How AI call tracking platforms were tested

The five AI call tracking scoring dimensions and the testing approach behind every review on ShadowVoices.


How platforms were tested

Each AI call tracking platform was evaluated through three channels.

  1. A self-serve account where available, with a test campaign and a benchmark call corpus of 820 calls (mixed accents, mixed verticals).
  2. A sales-led trial where self-serve was not on offer (Invoca, Convirza).
  3. Operator interviews with 14 working operators across lead-gen agencies, pay-per-call publishers, and mid-market marketing teams. The cohort spans operations running from 50 numbers up to 8,000 numbers.

The five scoring dimensions

Each platform was scored on five AI call tracking dimensions, equally weighted; a sketch of how the weighting rolls up into an overall score follows the list.

Transcription latency: 20%
Intent classification depth: 20%
Signal sync into ad platforms: 20%
Per-number economics: 20%
Self-serve onboarding: 20%
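
A minimal sketch of the equal-weight rollup. The dimension keys and example scores are illustrative placeholders, not real platform data.

```python
# Illustrative only: keys and example scores are placeholders, not published results.
WEIGHTS = {
    "transcription_latency": 0.20,
    "intent_classification_depth": 0.20,
    "signal_sync": 0.20,
    "per_number_economics": 0.20,
    "self_serve_onboarding": 0.20,
}

def overall_score(dimension_scores: dict[str, float]) -> float:
    """Weighted average of the five dimension scores (each on a 0-10 scale)."""
    return sum(WEIGHTS[d] * dimension_scores[d] for d in WEIGHTS)

# A hypothetical platform's dimension scores
print(overall_score({
    "transcription_latency": 8.0,
    "intent_classification_depth": 7.0,
    "signal_sync": 9.0,
    "per_number_economics": 5.0,
    "self_serve_onboarding": 6.0,
}))  # 7.0
```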

Transcription latency

How fast call audio turns into a usable transcript. We measured p50 and p95 latency on the 820-call benchmark corpus. Sub-300ms p50 cleared the bar, 800ms p50 was the acceptable ceiling for real-time workflows, and anything above 1.5s p50 got marked down.
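
A minimal sketch of how p50/p95 can be computed from per-call latency measurements with the standard library; the sample values below are made up, not drawn from the 820-call corpus.

```python
import statistics

def latency_percentiles(latencies_ms: list[float]) -> tuple[float, float]:
    """Return (p50, p95) latency in milliseconds from per-call measurements."""
    # quantiles() with n=100 returns the cut points p1..p99;
    # index 49 is the 50th percentile, index 94 is the 95th.
    q = statistics.quantiles(latencies_ms, n=100)
    return q[49], q[94]

# Small synthetic sample for illustration; the benchmark used 820 calls.
sample = [220, 250, 270, 310, 290, 260, 900, 1400, 240, 280]
p50, p95 = latency_percentiles(sample)
print(f"p50={p50:.0f}ms  p95={p95:.0f}ms")  # p50=275ms  p95=1125ms
```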

Intent classification depth

How well the platform's model labels calls. We measured F1 score on a held-out subset of the corpus, with ground-truth labels manually annotated by two reviewers. We also measured the granularity of the intent taxonomy. Generic models that ship 6 to 8 intent labels score lower than fine-tuned vertical models that ship 40 to 60.
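
A minimal sketch of a one-vs-rest F1 computation against hand-annotated ground truth. The intent labels and predictions below are hypothetical, and the review's exact averaging across labels (macro vs micro) is not specified here.

```python
def f1_score(true_labels: list[str], predicted_labels: list[str], positive: str) -> float:
    """One-vs-rest F1 for a single intent label against annotated ground truth."""
    tp = sum(1 for t, p in zip(true_labels, predicted_labels) if p == positive and t == positive)
    fp = sum(1 for t, p in zip(true_labels, predicted_labels) if p == positive and t != positive)
    fn = sum(1 for t, p in zip(true_labels, predicted_labels) if p != positive and t == positive)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Hypothetical labels for a handful of calls
truth = ["booking", "spam", "booking", "support", "booking"]
preds = ["booking", "booking", "booking", "support", "spam"]
print(round(f1_score(truth, preds, positive="booking"), 2))  # 0.67
```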

Signal sync into ad platforms

How fast and how deeply the AI signal flows back to Google Ads, Meta, and TikTok as conversion events. We measured event-fire latency from call-end to ad-platform pickup. We also measured the granularity of events supported (basic conversion event vs custom-event taxonomies). Reference for the integration patterns: Google's official call assets and conversion documentation.
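
A minimal sketch of how event-fire latency can be measured, assuming call-end timestamps can be exported from the tracking platform and conversion-event timestamps from the ad platform. All identifiers and timestamps below are hypothetical.

```python
from datetime import datetime, timezone

def sync_latencies_seconds(call_ends: dict[str, datetime],
                           ad_pickups: dict[str, datetime]) -> dict[str, float]:
    """Per-call latency from call-end to conversion-event pickup, for calls seen on both sides."""
    return {
        call_id: (ad_pickups[call_id] - ended_at).total_seconds()
        for call_id, ended_at in call_ends.items()
        if call_id in ad_pickups
    }

# Hypothetical timestamps: when a test call ended vs when the ad platform logged the conversion.
ends = {"call-001": datetime(2026, 1, 14, 10, 0, 5, tzinfo=timezone.utc)}
pickups = {"call-001": datetime(2026, 1, 14, 10, 0, 47, tzinfo=timezone.utc)}
print(sync_latencies_seconds(ends, pickups))  # {'call-001': 42.0}
```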

Per-number economics

The cost of provisioning and maintaining tracking numbers at network scale, which is the dominant variable for most cost-sensitive operators. We measured published rates where available and quoted rates where not. Hidden floors, minimum spend rules, and tier-only discounts were also captured.
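
A minimal sketch of how effective per-number cost can be modeled once minimums and tier discounts are factored in. The list rate, monthly minimum, and discount tier below are hypothetical examples, not any vendor's published pricing.

```python
from collections.abc import Sequence

def effective_per_number_cost(numbers: int, list_rate: float,
                              monthly_minimum: float = 0.0,
                              tier_discounts: Sequence[tuple[int, float]] = ()) -> float:
    """Effective monthly cost per number after minimum spend and volume-tier discounts.

    tier_discounts: (threshold, discount_fraction) pairs, e.g. (1000, 0.15)
    means 15% off the list rate once the account holds 1,000+ numbers.
    """
    discount = 0.0
    for threshold, frac in tier_discounts:
        if numbers >= threshold:
            discount = max(discount, frac)
    gross = numbers * list_rate * (1 - discount)
    billed = max(gross, monthly_minimum)  # hidden floors kick in here
    return billed / numbers

# Hypothetical vendor: $2.00 list rate, $500 monthly minimum, 15% off at 1,000+ numbers.
print(effective_per_number_cost(50, 2.00, monthly_minimum=500.0))    # 10.0 (minimum dominates)
print(effective_per_number_cost(8000, 2.00, monthly_minimum=500.0,
                                tier_discounts=[(1000, 0.15)]))      # 1.7
```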

Self-serve onboarding

Whether an operator can provision a tracking number and validate the workflow without talking to sales. Self-serve cleared the bar; sales-led got marked down. The trial-to-paid path also factored in: PAYG-style $0 trials counted as best-in-class, and annual-contract-only counted as worst.
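
A minimal sketch of how this criterion could be expressed as a rubric. Only the ordering (self-serve PAYG best, sales-led annual-contract-only worst) comes from the methodology; the specific point values and key names are placeholders.

```python
# Hypothetical rubric values; only the ordering is fixed by the methodology text.
ONBOARDING_RUBRIC = {
    ("self_serve", "payg_free_trial"): 10,
    ("self_serve", "paid_trial"): 8,
    ("sales_led", "paid_trial"): 4,
    ("sales_led", "annual_contract_only"): 1,
}

def onboarding_score(provisioning: str, trial_path: str) -> int:
    """Map a platform's provisioning model and trial-to-paid path to a 1-10 score."""
    return ONBOARDING_RUBRIC.get((provisioning, trial_path), 5)

print(onboarding_score("self_serve", "payg_free_trial"))  # 10
```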

What was not scored

Generic CRM integration count was not scored. Brand recognition was not scored. Number of "AI-powered features" listed on the marketing site was not scored. None of those correlate with operator-fit for the audience this site serves.

Refresh cadence

Annual full report with quarterly updates when major platform releases shift the rankings. Open-source transcription models keep moving fast, so the latency benchmarks especially get revisited each quarter. Wikipedia's speech analytics article covers the broader category history for context on where the technology came from.

Further reading: schema.org Review markup specification · Wikipedia entry on software review