// METHODOLOGY
How The Ad Bench scores an ad
The Ad Bench is not a checklist tool. It runs a multi-step analysis pipeline (input ingestion, frame extraction, transcription, and a structured rubric pass), then maps the result to six independent category scores. Here is what happens between “submit” and “score.”
// INPUT_PIPELINE
What gets analyzed
The analyzer accepts four input types: a public video URL (TikTok or YouTube Shorts, with limited support for Instagram Reels and Facebook), an uploaded video file (MP4/MOV/WebM and others), a still image (JPEG/PNG/WebP/HEIC), or a plain-text ad script. Each takes a different path through the pipeline but ends at the same rubric pass.
For video inputs, whether a URL or an upload, the pipeline extracts keyframes at regular intervals (up to 10 frames per video), transcribes the voiceover with Whisper if audio is present, and assembles the frames plus transcript into the model context. TikTok and YouTube Shorts URLs are fetched server-side so the actual video content is analyzed, not just a thumbnail or caption. Meta's CDN blocks server-side fetches of Instagram Reels and Facebook URLs, so those inputs are analyzed from the cover frame and caption via the public oEmbed API.
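As a rough illustration of "regular intervals, up to 10 frames," the sampling step could look like the sketch below. The function name and the one-frame-per-second target for short clips are assumptions for illustration only; the text above specifies just the 10-frame cap.

```python
def keyframe_timestamps(duration_s: float, max_frames: int = 10) -> list:
    """Return evenly spaced sample times in seconds, capped at max_frames."""
    if duration_s <= 0 or max_frames <= 0:
        return []
    # Assumed policy: about one frame per second for short clips,
    # never more than max_frames for longer ones.
    frame_count = min(max_frames, max(1, int(duration_s)))
    step = duration_s / frame_count
    # Sample at the start of each interval so the opening frame is always included.
    return [round(step * i, 3) for i in range(frame_count)]
```

Sampling from the start of each interval keeps the very first frame in the set, which matters because hook scoring leans on what the viewer sees at 0 seconds.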
Image inputs skip the frame pipeline and go directly to the rubric pass with the image as a single frame. Script inputs carry no visual signal; the rubric adjusts its scoring weights accordingly — hook and CTA are scored against the written copy; native-feel and pacing receive reduced weight with an explicit caveat in the output.
// AI_STACK
The AI model stack
The scoring engine is Claude (Anthropic), a large multimodal language model with both text and vision input capability. Frames are passed as vision inputs alongside the voiceover transcript. The model is given a structured tool schema (not a free-text prompt) that requires it to return JSON with 0-100 scores, a one-sentence justification per score, and timestamped evidence citations. Because every score must be tied to cited evidence, structured output reduces the risk of hallucinated scores that don't correspond to the video.
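To make the idea of a structured tool schema concrete, here is a minimal sketch. The tool name, field names, and category identifiers are hypothetical, not The Ad Bench's actual schema. The `name` / `description` / `input_schema` layout follows the general shape of tool definitions in Anthropic's Messages API, and the 0-100 range follows the rubric section. The validator is a plain-Python stand-in for whatever server-side check is actually used.

```python
CATEGORIES = ["hook", "native_feel", "clarity", "cta", "brand_fit", "pacing"]

# Hypothetical tool definition: the model must call this tool with JSON arguments.
SCORE_TOOL = {
    "name": "record_ad_scores",
    "description": "Record rubric scores for one ad creative.",
    "input_schema": {
        "type": "object",
        "properties": {
            "categories": {
                "type": "array",
                "items": {
                    "type": "object",
                    "properties": {
                        "category": {"type": "string", "enum": CATEGORIES},
                        "score": {"type": "integer", "minimum": 0, "maximum": 100},
                        "justification": {"type": "string"},
                        "evidence": {
                            "type": "array",
                            "items": {
                                "type": "object",
                                "properties": {
                                    "timestamp_s": {"type": "number"},
                                    "note": {"type": "string"},
                                },
                                "required": ["timestamp_s", "note"],
                            },
                        },
                    },
                    "required": ["category", "score", "justification", "evidence"],
                },
            }
        },
        "required": ["categories"],
    },
}


def validate_result(payload: dict) -> list:
    """Return a list of problems found in a tool-call payload (empty means OK)."""
    errors = []
    for item in payload.get("categories", []):
        name = item.get("category")
        if name not in CATEGORIES:
            errors.append(f"unknown category: {name!r}")
        score = item.get("score")
        if not isinstance(score, int) or not 0 <= score <= 100:
            errors.append(f"{name}: score out of range 0-100")
        if not item.get("justification"):
            errors.append(f"{name}: missing justification")
        if not item.get("evidence"):
            errors.append(f"{name}: missing evidence")
    return errors
```

A payload that arrives without evidence or with an out-of-range score can then be rejected or retried before it reaches a report.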
Agency plan accounts use a more capable model variant with stronger reasoning for brand-kit compliance and nuanced rubric calibration. Free and Pro accounts use the standard variant, which produces equivalent scores across the six core categories.
Voiceover transcription runs through OpenAI Whisper before the Claude pass. Whisper gives the rubric pass accurate word-level timing — critical for hook scoring, where the first spoken line landing at 0 vs 1.5 seconds is a meaningful difference.
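As an illustration of why word-level timing matters, the sketch below finds when the first spoken word lands. It assumes transcript segments shaped like the output of Whisper's word-timestamp mode (each segment carrying a `words` list with `word`, `start`, and `end`). The helper name is hypothetical.

```python
from typing import Optional


def first_spoken_word_start(segments: list) -> Optional[float]:
    """Return the start time (seconds) of the earliest non-empty word, or None."""
    starts = [
        word["start"]
        for segment in segments
        for word in segment.get("words", [])
        if word["word"].strip()  # skip empty or whitespace-only tokens
    ]
    return min(starts) if starts else None
```

With this value in hand, the rubric pass can tell an opener that speaks at 0.0 seconds from one that waits until 1.5 seconds.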
A lightweight classification pass runs first (Claude Haiku) to identify the medium type (short-form video, static display, print, OOH, email, and so on). This pre-classify step routes the input to the correct rubric variant before the main scoring pass — an email creative is not scored against TikTok hook-rate norms.
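The routing step can be pictured as a simple lookup from the classified medium to a rubric variant. The medium identifiers and variant names below are hypothetical labels, and raising on an unknown medium is one possible design choice rather than documented behavior.

```python
SHORT_FORM = "short_form_video"
# Mediums named in the text. The real classifier may recognize more.
OTHER_MEDIUMS = {"static_display", "print", "ooh", "email"}


def rubric_variant(medium: str) -> str:
    """Map a pre-classified medium to the rubric variant that should score it."""
    if medium == SHORT_FORM:
        return "short_form"
    if medium in OTHER_MEDIUMS:
        # Each non-short-form medium gets a variant tuned to its own norms.
        return f"medium:{medium}"
    raise ValueError(f"unclassified medium: {medium!r}")
```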
// RUBRIC
The 6-category rubric
Every Deep Dive scores six categories independently. Quick Check scores five (pacing is excluded from the abbreviated pass). Scores are 0–100 per category; the overall score is a weighted average with hook weighted heaviest — because a weak hook means the rest of the ad is never seen.
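A minimal sketch of the weighted average follows. The weights are illustrative placeholders, not The Ad Bench's actual values. The only property taken from the text is that hook carries the heaviest weight. When a category is absent (for example, pacing in a Quick Check), the remaining weights are renormalized.

```python
# Hypothetical weights: hook is heaviest, the rest are placeholders.
WEIGHTS = {
    "hook": 0.30,
    "native_feel": 0.15,
    "clarity": 0.15,
    "cta": 0.15,
    "brand_fit": 0.10,
    "pacing": 0.15,
}


def overall_score(scores: dict, weights: dict = WEIGHTS) -> float:
    """Weighted average over the categories present, renormalizing the weights."""
    used = {name: w for name, w in weights.items() if name in scores}
    total_weight = sum(used.values())
    if total_weight == 0:
        raise ValueError("no scored categories")
    weighted_sum = sum(scores[name] * w for name, w in used.items())
    return round(weighted_sum / total_weight, 1)
```

Under these placeholder weights, a Quick Check with a hook score of 90 and 70 in the other four categories comes out above the plain mean of 74, because the hook pulls the average toward itself.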
| Category | What it measures | Mode |
|---|---|---|
| Hook | Whether the first 3 seconds earn the next 3 — pattern interrupt, curiosity gap, or high-density payoff. Heaviest weight. | Quick + Deep |
| Native feel | How much the creative reads as organic feed content vs. a produced ad. Selfie-cam, natural audio, UGC aesthetic score higher. | Quick + Deep |
| Clarity | Whether the viewer knows what the ad is, who it's for, and what it's selling within the first 6 seconds. | Quick + Deep |
| CTA | Specificity, timing, and legibility of the call to action. 'Learn more' scores lower than a named, visible, timed ask. | Quick + Deep |
| Brand fit | Whether the creative tone, visual style, and claims align with the brand's known position (or the Brand Kit for Agency accounts). | Quick + Deep |
| Pacing | Cut rate, visual density, and whether the edit sustains attention from hook to CTA without dead air or over-cutting. | Deep only |
Each category returns a score, a one-sentence justification, and timestamped evidence (e.g., “hook lands at 0.8s, pattern interrupt via color contrast on frame 1”). The rubric documentation is at /learn/scoring-rubric.
// PLATFORM_CALIBRATION
Platform-specific calibration
The six categories are universal — the calibration inside each category shifts per platform. TikTok, Instagram Reels, and YouTube Shorts have meaningfully different audience behavior, and the rubric accounts for that:
- TikTok. ~85% of impressions are muted or low-volume. Sound-off legibility is required: burned-in captions from frame 1, high-contrast text. Hook scoring penalizes audio-dependent openers. Native-feel calibration is strictest: the For You feed is 90%+ organic UGC, and ad-coded production cues fire the skip reflex faster here than anywhere else.
- Instagram Reels. ~60% muted. Slightly higher tolerance for polished production than TikTok, but still rewards UGC aesthetic over studio grade. Reels discovery is heavily influenced by save rate, so CTAs that prompt saving score higher than generic 'link in bio' closes.
- YouTube Shorts. ~75% sound-on, the inverse of TikTok. Voiceover carries the message; captions are the redundancy layer. The loop mechanic is unique to Shorts: a clean closer-to-opener seam multiplies watch time at no production cost. Pacing scoring weights the loop seam quality.
Non-short-form inputs (static image, email, display, OOH, print) route to a separate rubric variant calibrated to the medium's own performance norms. A billboard is not scored against TikTok hook-rate benchmarks.
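The per-platform calibration for short-form video can be pictured as a small table of profiles. The structure and field names below are assumptions for illustration. The muted shares and the two flags restate figures and behaviors from the platform list, and the `sound_off_first` property is a derived convenience, not a documented rule.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PlatformProfile:
    muted_share: float       # approximate share of muted or low-volume impressions
    rewards_save_cta: bool   # CTAs that prompt saving score higher
    scores_loop_seam: bool   # pacing weighs closer-to-opener loop quality

    @property
    def sound_off_first(self) -> bool:
        # Assumed rule of thumb: design for sound-off when most impressions are muted.
        return self.muted_share > 0.5


PROFILES = {
    "tiktok": PlatformProfile(muted_share=0.85, rewards_save_cta=False, scores_loop_seam=False),
    "instagram_reels": PlatformProfile(muted_share=0.60, rewards_save_cta=True, scores_loop_seam=False),
    # ~75% sound-on is expressed here as ~25% muted.
    "youtube_shorts": PlatformProfile(muted_share=0.25, rewards_save_cta=False, scores_loop_seam=True),
}
```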
// SCORE_TO_OUTCOME
How scores map to real outcomes
The rubric scores are calibrated against observed ad performance data in The Ad Bench database. The correlations that drive thresholds:
- Hook score below 60 → hook rate (viewers past 3s) typically below 20% on TikTok. Above 75 → hook rate typically above 35%.
- Native-feel score below 65 → 30–40% underperformance on hook rate vs. equivalent native-feel ads in the same vertical.
- CTA score below 70 → CTR-to-conversion gap widens. Most common cause: CTA delivered after second 28, when 80%+ of viewers have already exited.
- Overall score above 75 across all categories → ad set exits learning phase faster and at lower CPA than sub-70 creative in the same account.
These are probabilistic correlations, not guarantees. An ad scoring 80 can underperform; an ad scoring 55 can catch a viral moment. The rubric predicts the likely distribution, not the individual outcome. Use it to filter out known losers before spend, not to predict exact ROAS.
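One way to use the thresholds above as a pre-spend filter is sketched below. The function name and message wording are hypothetical. The cutoffs (60, 65, 70) are the ones listed in this section, and the flags describe typical outcomes rather than predictions for a single ad.

```python
def risk_flags(scores: dict) -> list:
    """Return warnings for category scores that fall below the listed thresholds."""
    flags = []
    # A missing category defaults to 100, so it never raises a flag.
    if scores.get("hook", 100) < 60:
        flags.append("hook below 60: hook rate typically below 20% on TikTok")
    if scores.get("native_feel", 100) < 65:
        flags.append("native feel below 65: typical 30-40% hook-rate underperformance")
    if scores.get("cta", 100) < 70:
        flags.append("CTA below 70: CTR-to-conversion gap tends to widen")
    return flags
```

An empty list does not mean the ad will perform. It only means none of the known-loser thresholds were tripped.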
// BRAND_KIT
Brand Kit (Agency)
Agency accounts can configure a Brand Kit — a per-team voice document, banned-words list, compliance posture, and six rubric weight presets. When a Brand Kit is active, the scoring pass receives the kit as part of the model context, and brand-fit scoring is evaluated against the team's specific position rather than a generic inferred brand. Reports carry a visible “// AGENCY · BRAND KIT” disclosure so reviewers know the scoring was kit-informed.
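A Brand Kit could be modeled roughly as below. The field names, the preset structure, and the rendered context format are all assumptions for illustration. The text above states only what the kit contains and that it is passed to the scoring pass as model context.

```python
from dataclasses import dataclass, field


@dataclass
class BrandKit:
    voice_document: str
    banned_words: list = field(default_factory=list)
    compliance_posture: str = "standard"                 # hypothetical value
    weight_presets: dict = field(default_factory=dict)   # preset name -> category weights


def render_kit_context(kit: BrandKit) -> str:
    """Render the kit as a text block to prepend to the scoring context."""
    lines = [
        "// AGENCY · BRAND KIT",
        f"Voice: {kit.voice_document}",
        f"Banned words: {', '.join(kit.banned_words) or 'none'}",
        f"Compliance posture: {kit.compliance_posture}",
    ]
    return "\n".join(lines)
```

Keeping the kit as structured data and rendering it at scoring time means the same kit can feed both the model context and the disclosure shown on reports.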