How Scoring Works

CalibRank scores argument quality, not correctness. A well-structured argument for the wrong side can outscore a sloppy argument for the right one.

What the AI evaluates

Our scoring model analyzes the structure of your argument — how well you reason, what evidence you provide, whether your perspective is original, and how clearly you communicate. It does not judge whether your position is “right” or “wrong.” Two people on opposite sides of a debate can both score 90+.

The Four Dimensions

Logic

30%

Is your reasoning internally consistent? Do your premises lead to your conclusion without logical fallacies or contradictions?

Build step-by-step reasoning. Avoid leaps of logic or circular arguments.

Evidence

30%

Did you support your claims with concrete examples, data, citations, or real-world references? Unsupported assertions score low.

Cite sources, reference studies, or provide specific examples. "Because I said so" scores poorly.

Originality

30%

Does your argument bring a fresh perspective? Restating a common talking point scores lower than a novel angle on the same position.

Find an angle others haven't covered. Combine ideas from different domains. Challenge assumptions.

Clarity

10%

Is your argument easy to follow? Good structure, clear language, and concise delivery all contribute. Rambling or confusing writing scores low.

Lead with your strongest point. Use short sentences. One idea per paragraph.

Scoring Formula

Score = Logic×0.30 + Evidence×0.30 + Originality×0.30 + Clarity×0.10

Each dimension is scored 0–100 independently, then combined using these weights.

Example: A High-Scoring Argument

"Should cities ban cars from downtown areas?"

Side A: Yes, ban cars

“Cities that have pedestrianized their centers — like Oslo, which banned cars from its core in 2019 — consistently report 10-15% increases in retail revenue within two years (Oslo Chamber of Commerce, 2021). The counterargument that businesses suffer is empirically false: foot traffic replaces car traffic at higher density. What most people miss is that car bans also function as an equity measure — lower-income residents who can't afford parking subsidize infrastructure they don't use. Redirecting road maintenance budgets toward transit creates a positive-sum outcome for 70%+ of urban residents.”

85/100Overall Score

Tap any dimension to see why it scored that way. This is the same Argument DNA breakdown every scored argument receives.

Score Distribution

Based on 270 scored arguments across all public debates.

Average

Median

77+

Top 25%

83+

Top 10%

Scoring 70+ puts you above the majority of arguments. 85+ is exceptional.

The Model

Arguments are scored by Google Gemini 2.5 Flash, with Gemini 2.5 Flash Lite as a hot fallback if the primary model is briefly unavailable. Both run on Google’s production infrastructure with consistent quality guarantees, not preview tiers.

The model receives your argument text with a structured prompt specifying exactly what to evaluate for each dimension. It returns four independent scores (0–100) plus highlighted excerpts showing which parts of your argument earned or lost points.

We do not fine-tune the model on CalibRank data. Every argument is scored against the same rubric with the same prompt. No user’s history, tier, or identity is included in the scoring context.

Safeguards

●Blind scoring — the AI never sees your username, tier, or past scores. Every argument is evaluated in isolation.
●Prompt injection hardening — the scoring prompt explicitly instructs the model to ignore attempts to manipulate scores through flattery, threats, or meta-instructions.
●No position bias — the AI is instructed to score argument quality regardless of which side the user chose. Both sides of any debate can score equally high.
●Transparent breakdown — every score comes with an Argument DNA radar chart and highlighted excerpts, so you can see exactly where points were earned or lost.

Limitations We Acknowledge

▲AI scoring is not perfect. On highly subjective or niche topics, the model may underweight domain-specific reasoning that a human expert would appreciate.
▲Language models inherit biases from their training data. We mitigate this by scoring structure over substance, but some residual bias may exist on culturally charged topics.
▲The model cannot verify factual claims in real-time. A well-cited argument with incorrect statistics may still score high on structure. CalibRank evaluates how you argue, not the truth of your premises.

Ready to test your argument?

Browse Debates