How does The Ad Bench score an ad?

The Ad Bench uses a 6-category rubric (Hook, Native feel, Clarity, CTA, Brand fit, and Pacing), with each category scored 0–100 by Claude (Anthropic). The overall score is a weighted average in which Hook carries the most weight, because a weak hook means the rest of the ad is never seen. Quick Check scores five categories (Pacing is excluded); Deep Dive scores all six.

What AI does The Ad Bench use?

The Ad Bench uses Claude (Anthropic) as the scoring engine — a multimodal language model that receives video frames and voiceover transcript together. OpenAI Whisper handles voiceover transcription, and a lightweight Claude Haiku pass pre-classifies the creative's medium before the main scoring pass.

What is a good ad score?

75–100 (green) is strong — ship it. 50–74 (amber) is fixable — address the top issues and re-score. 0–49 (red) means an overhaul is needed. Every score includes a one-sentence justification; Deep Dive scores point at timestamped evidence in the frames and transcript.
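As a rough illustration, the three bands can be written as a small lookup. The function name `score_band` is hypothetical and is not part of The Ad Bench; only the thresholds come from the answer above.

```python
def score_band(score: int) -> str:
    """Map a 0-100 score to the colour band described above."""
    if not 0 <= score <= 100:
        raise ValueError("score must be between 0 and 100")
    if score >= 75:
        return "green"  # strong: ship it
    if score >= 50:
        return "amber"  # fixable: address the top issues and re-score
    return "red"        # overhaul needed
```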

Does The Ad Bench work for video ads?

Yes. Upload an MP4, MOV, or WebM file, or paste a TikTok or YouTube Shorts URL. The pipeline extracts up to 10 keyframes and transcribes the voiceover via OpenAI Whisper, then runs the full rubric. For uploaded files, the original video bytes never leave your browser; only JPEG keyframes are uploaded.

How does The Ad Bench calibrate scores for different platforms?

Each platform gets specific calibrations inside the six rubric categories. TikTok (~85% muted) penalizes audio-dependent hooks and applies the strictest native-feel standard. Instagram Reels (~60% muted) rewards saves-prompting CTAs. YouTube Shorts (~75% sound-on) weights voiceover clarity and loop-seam quality. Non-short-form inputs (images, email, display, OOH) route to a separate rubric variant calibrated to that medium.

// METHODOLOGY

How The Ad Bench scores an ad

The Ad Bench is not a checklist tool. It runs a multi-step analysis pipeline — input ingestion, frame extraction, transcript, and a structured rubric pass — then maps the result to six independent category scores. Here is what happens between “submit” and “score.”

// INPUT_PIPELINE

What gets analyzed

The analyzer accepts four input types: a public video URL (TikTok or YouTube Shorts), an uploaded video file (MP4/MOV/WebM and others), a still image (JPEG/PNG/WebP/HEIC), or a plain-text ad script. Each takes a different path through the pipeline but ends at the same rubric pass.

For video inputs, whether a URL or an upload, the pipeline extracts keyframes at regular intervals (up to 10 frames per video), transcribes the voiceover with Whisper if audio is present, and assembles the frames plus transcript into the model context. TikTok and YouTube Shorts URLs are fetched server-side so the actual video content is analyzed, not just a thumbnail or caption. For Instagram Reels and Facebook URLs, Meta's CDN blocks server-side fetches; those inputs are analyzed from the cover frame and caption via the public oEmbed API.
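One way to picture "regular intervals, up to 10 frames" is the sketch below. It is an illustration, not the production sampler: the function name, the one-second minimum gap between frames, and the choice to start at 0.0 seconds are all assumptions.

```python
def keyframe_timestamps(duration_s: float, max_frames: int = 10,
                        min_gap_s: float = 1.0) -> list[float]:
    """Return evenly spaced sample times in seconds, capped at max_frames."""
    if duration_s <= 0:
        return []
    # Assumption: never sample more densely than one frame per min_gap_s.
    count = max(1, min(max_frames, int(duration_s // min_gap_s)))
    step = duration_s / count
    # Start at 0.0 so the opening (hook) frame is always included.
    return [round(i * step, 3) for i in range(count)]
```

Under these assumptions, a 30-second video is sampled every 3 seconds, while a 4-second clip yields four frames rather than ten.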

Image inputs skip the frame pipeline and go directly to the rubric pass with the image as a single frame. Script inputs carry no visual signal, so the rubric adjusts its scoring weights: hook and CTA are scored against the written copy, while native feel and pacing receive reduced weight with an explicit caveat in the output.
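The input paths described in this section can be sketched as a simple router. This is a hedged illustration: the path labels, the function name `classify_input`, and the host and extension lists are assumptions based only on the formats and platforms named above.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse

# Assumed lists, taken from the formats and platforms named in the text.
VIDEO_EXTS = {".mp4", ".mov", ".webm"}
IMAGE_EXTS = {".jpg", ".jpeg", ".png", ".webp", ".heic"}
FULL_FETCH_HOSTS = ("tiktok.com", "youtube.com")       # fetched server-side
COVER_ONLY_HOSTS = ("instagram.com", "facebook.com")   # cover frame + caption


def _host_matches(host: str, domains: tuple) -> bool:
    return any(host == d or host.endswith("." + d) for d in domains)


def classify_input(value: str) -> str:
    """Return a hypothetical pipeline path label for one submitted input."""
    text = value.strip()
    parsed = urlparse(text)
    if parsed.scheme in ("http", "https") and parsed.hostname:
        host = parsed.hostname.lower()
        if _host_matches(host, FULL_FETCH_HOSTS):
            return "video_url"
        if _host_matches(host, COVER_ONLY_HOSTS):
            return "cover_frame_url"
        return "unsupported_url"
    suffix = PurePosixPath(text).suffix.lower()
    if suffix in VIDEO_EXTS:
        return "video_file"
    if suffix in IMAGE_EXTS:
        return "image"
    # Anything else is treated as a plain-text ad script.
    return "script"
```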

// AI_STACK

The AI model stack

The scoring engine is Claude (Anthropic), a large multimodal language model with both text and vision input capability. Frames are passed as vision inputs alongside the voiceover transcript. The model is given a structured tool schema (not a free-text prompt) that requires it to return JSON with 0–100 scores, one-sentence justifications per score, and timestamped evidence citations. Structured output prevents hallucinated scores that don't correspond to the video.
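To make the structured-output requirement concrete, here is a minimal validation sketch. The field names (`score`, `justification`, `evidence`), the category keys, and the function name are assumptions for illustration; the actual tool schema is not published in this document.

```python
import json

# Hypothetical category keys mirroring the six rubric categories.
CATEGORIES = ("hook", "native_feel", "clarity", "cta", "brand_fit", "pacing")


def validate_scoring_payload(raw: str) -> dict:
    """Parse a model response and reject anything outside the expected shape."""
    payload = json.loads(raw)  # malformed JSON raises a ValueError subclass
    if not isinstance(payload, dict) or not payload:
        raise ValueError("payload must be a non-empty JSON object")
    for name, entry in payload.items():
        if name not in CATEGORIES:
            raise ValueError(f"unknown category: {name}")
        if not isinstance(entry, dict):
            raise ValueError(f"{name}: entry must be an object")
        score = entry.get("score")
        # bool is a subclass of int in Python, so exclude it explicitly.
        if isinstance(score, bool) or not isinstance(score, int) or not 0 <= score <= 100:
            raise ValueError(f"{name}: score must be an integer from 0 to 100")
        if not str(entry.get("justification", "")).strip():
            raise ValueError(f"{name}: justification is required")
        if not isinstance(entry.get("evidence"), list):
            raise ValueError(f"{name}: evidence must be a list of citations")
    return payload
```

A check like this only guarantees the shape of the response; whether a score is well founded still depends on the cited evidence.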

Agency plan accounts use a more capable model variant with stronger reasoning for brand-kit compliance and nuanced rubric calibration. Free and Pro accounts use the standard variant, which produces equivalent scores across the six core categories.

Voiceover transcription runs through OpenAI Whisper before the Claude pass. Whisper gives the rubric pass accurate word-level timing — critical for hook scoring, where the first spoken line landing at 0 vs 1.5 seconds is a meaningful difference.
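As an illustration of why timing matters, the sketch below finds when the first spoken word begins. It assumes the transcript has already been flattened into a list of word entries with `word` and `start` keys (start time in seconds); that shape and the function name are assumptions, not a description of our internal format.

```python
from typing import Optional


def first_speech_onset(words: list) -> Optional[float]:
    """Return the start time (seconds) of the earliest non-empty word, if any."""
    starts = [w["start"] for w in words if str(w.get("word", "")).strip()]
    return min(starts) if starts else None
```

A result of 0.0 versus 1.5 is exactly the kind of difference the hook category is sensitive to.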

A lightweight classification pass runs first (Claude Haiku) to identify the medium type (short-form video, static display, print, OOH, email, and so on). This pre-classify step routes the input to the correct rubric variant before the main scoring pass — an email creative is not scored against TikTok hook-rate norms.
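The routing step can be pictured as a lookup from the pre-classified medium to a rubric variant. The variant names and the function `route_rubric` below are hypothetical; only the medium types come from the paragraph above.

```python
# Hypothetical mapping from pre-classified medium to rubric variant.
RUBRIC_BY_MEDIUM = {
    "short_form_video": "short_form",
    "static_display": "display",
    "print": "print",
    "ooh": "ooh",
    "email": "email",
}


def route_rubric(medium_label: str) -> str:
    """Normalise a classifier label and return the rubric variant to run."""
    key = medium_label.strip().lower().replace("-", "_").replace(" ", "_")
    try:
        return RUBRIC_BY_MEDIUM[key]
    except KeyError:
        # Assumption: unknown media are rejected rather than silently scored.
        raise ValueError(f"no rubric variant for medium: {medium_label!r}") from None
```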

// RUBRIC

The 6-category rubric

Every Deep Dive scores six categories independently. Quick Check scores five (pacing is excluded from the abbreviated pass). Scores are 0–100 per category; the overall score is a weighted average with hook weighted heaviest — because a weak hook means the rest of the ad is never seen.

Category	What it measures	Mode
Hook	Whether the first 3 seconds earn the next 3 — pattern interrupt, curiosity gap, or high-density payoff. Heaviest weight.	Quick + Deep
Native feel	How much the creative reads as organic feed content vs. a produced ad. Selfie-cam, natural audio, UGC aesthetic score higher.	Quick + Deep
Clarity	Whether the viewer knows what the ad is, who it's for, and what it's selling within the first 6 seconds.	Quick + Deep
CTA	Specificity, timing, and legibility of the call to action. 'Learn more' scores lower than a named, visible, timed ask.	Quick + Deep
Brand fit	Whether the creative tone, visual style, and claims align with the brand's known position (or the Brand Kit for Agency accounts).	Quick + Deep
Pacing	Cut rate, visual density, and whether the edit sustains attention from hook to CTA without dead air or over-cutting.	Deep only

Each category returns a score, a one-sentence justification, and timestamped evidence (e.g., “hook lands at 0.8s, pattern interrupt via color contrast on frame 1”). The rubric documentation is at /learn/scoring-rubric.
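The weighted average described in this section can be sketched as follows. The weight values are placeholders chosen only so that hook is heaviest; the real weights are not stated here. Renormalising over the categories present is also an assumption about how a five-category Quick Check could be handled.

```python
# Placeholder weights (assumed): only "hook is heaviest" comes from the text.
ASSUMED_WEIGHTS = {
    "hook": 0.30,
    "native_feel": 0.20,
    "clarity": 0.15,
    "cta": 0.15,
    "brand_fit": 0.10,
    "pacing": 0.10,
}


def overall_score(scores: dict, weights: dict = ASSUMED_WEIGHTS) -> float:
    """Weighted average over whichever categories were actually scored."""
    unknown = set(scores) - set(weights)
    if unknown:
        raise ValueError(f"unknown categories: {sorted(unknown)}")
    total_weight = sum(weights[c] for c in scores)
    if total_weight == 0:
        raise ValueError("no weighted categories to average")
    # Dividing by the weights in use renormalises when pacing is absent.
    return sum(scores[c] * weights[c] for c in scores) / total_weight
```

With these placeholder weights, a perfect hook paired with zeros elsewhere contributes three times as much as a perfect pacing score would.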

// PLATFORM_CALIBRATION

Platform-specific calibration

The six categories are universal — the calibration inside each category shifts per platform. TikTok, Instagram Reels, and YouTube Shorts have meaningfully different audience behavior, and the rubric accounts for that:

—TikTok. ~85% of impressions are muted or low-volume. Sound-off legibility is required: burned-in captions from frame 1, high-contrast text. Hook scoring penalizes audio-dependent openers. Native-feel calibration is strictest — the For You feed is 90%+ organic UGC, and ad-coded production cues fire the skip reflex faster here than anywhere else.
—Instagram Reels. ~60% muted. Slightly higher tolerance for polished production than TikTok, but still rewards UGC aesthetic over studio grade. Reels discovery is heavily influenced by save rate, so CTAs that prompt saving score higher than generic 'link in bio' closes.
—YouTube Shorts. ~75% sound-on — the inverse of TikTok. Voiceover carries the message; captions are the redundancy layer. The loop mechanic is unique to Shorts: a clean closer-to-opener seam multiplies watch time at no production cost. Pacing scoring weights the loop seam quality.

Non-short-form inputs (static image, email, display, OOH, print) route to a separate rubric variant calibrated to the medium's own performance norms. A billboard is not scored against TikTok hook-rate benchmarks.
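A compact way to hold the per-platform figures above is a small profile table. The structure, key names, and the `primary_channel` helper are illustrative assumptions; the muted shares and emphasis notes are taken from the list above (Shorts is stored as 0.25 muted, the complement of ~75% sound-on).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PlatformProfile:
    muted_share: float  # approximate share of muted or low-volume impressions
    emphasis: tuple     # what the calibration leans on, per the text


PROFILES = {
    "tiktok": PlatformProfile(0.85, ("sound-off legibility", "strictest native feel")),
    "instagram_reels": PlatformProfile(0.60, ("UGC aesthetic", "save-prompting CTA")),
    "youtube_shorts": PlatformProfile(0.25, ("voiceover clarity", "loop seam quality")),
}


def primary_channel(platform: str) -> str:
    """Assumed rule: when most impressions are muted, on-screen text leads."""
    return "captions" if PROFILES[platform].muted_share >= 0.5 else "voiceover"
```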

// SCORE_TO_OUTCOME

How scores map to real outcomes

The rubric scores are calibrated against industry benchmarks and our own internal targets. The correlations that drive the thresholds:

—Hook score below 60 → hook rate (viewers past 3s) typically below 20% on TikTok. Above 75 → hook rate typically above 35%.
—Native-feel score below 65 → 30–40% underperformance on hook rate vs. equivalent native-feel ads in the same vertical.
—CTA score below 70 → CTR-to-conversion gap widens. Most common cause: CTA delivered after second 28 when 80%+ of viewers have already exited.
—Overall score above 75 across all categories → ad set exits learning phase faster and at lower CPA than sub-70 creative in the same account.

These are probabilistic correlations, not guarantees. An ad scoring 80 can underperform; an ad scoring 55 can catch a viral moment. The rubric predicts the likely distribution, not the individual outcome. Use it to filter out known losers before spend, not to predict exact ROAS.
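Used as a pre-spend filter, the category thresholds listed above reduce to a simple check. The function and key names are hypothetical, and a flag signals elevated risk only, in line with the caveat above.

```python
# Per-category floors taken from the correlations listed above.
RISK_FLOORS = {"hook": 60, "native_feel": 65, "cta": 70}


def risk_flags(scores: dict) -> list:
    """Return the categories whose score falls below its risk floor."""
    return [
        category
        for category, floor in RISK_FLOORS.items()
        if category in scores and scores[category] < floor
    ]
```

For example, scores of 55 for hook, 70 for native feel, and 69 for CTA would flag hook and CTA but not native feel.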

// BRAND_KIT

Brand Kit (Agency)

Agency accounts can configure a Brand Kit — a per-team voice document, banned-words list, compliance posture, and six rubric weight presets. When a Brand Kit is active, the scoring pass receives the kit as part of the model context, and brand-fit scoring is evaluated against the team's specific position rather than a generic inferred brand. Reports carry a visible “// AGENCY · BRAND KIT” disclosure so reviewers know the scoring was kit-informed.

// SEE_ALSO