LLM Sarcasm Transfer
Methodology

Evaluation

Why we measure each metric, what its score means, and where it breaks down. Every model in the dashboard reports these seven numbers; this page is the Rosetta stone that explains them.

How to read these together

  • No metric is sufficient on its own. A model that scores best on flip rate often scores worst on meaning. Read the row, not the column.
  • Pair similarity with BLEU vs input. High similarity + high BLEU = copying. High similarity + low BLEU = genuine rewriting. The paraphrase score formalises this.
  • Treat flip rate as one signal, not ground truth. The classifiers disagree with each other by 33 pp and with humans by κ ≈ 0.10. Cross-check with the human eval.
The 7-Metric Pipeline
01

Sarcasm Flip Rate

RoBERTa-Twitter · DistilBERT-Reddit · RoBERTa-News

Question
Does the classifier think sarcasm was removed?
Why we use it
The most direct measure of task success — was the output reclassified as non-sarcastic? We run all three classifiers because they trained on different domains (Twitter, Reddit, News) and disagree by up to 33 percentage points on the same outputs.
Score interpretation
  • Higher: More outputs flagged non-sarcastic
Limitation
All three classifiers fail against human ground truth (Cohen's κ from −0.11 to +0.18 vs human κ > 0.8). They detect sarcasm presence in isolation, not removal between input and output. Treat as one signal, not as ground truth.
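As a minimal sketch, flip rate reduces to a fraction over classifier labels on the outputs. The classifier names and predictions below are illustrative, not real dashboard data; they only show how two classifiers can report very different rates on the same ten outputs.

```python
def flip_rate(pred_labels):
    """Fraction of model outputs a classifier labels non-sarcastic.

    pred_labels: classifier labels on the *outputs*,
    1 = sarcastic, 0 = non-sarcastic (inputs are all sarcastic by construction).
    """
    return sum(1 for y in pred_labels if y == 0) / len(pred_labels)

# Hypothetical per-classifier predictions on the same 10 outputs,
# illustrating the large cross-classifier disagreement noted above.
preds = {
    "roberta_twitter":   [0, 0, 0, 0, 0, 0, 0, 1, 1, 1],  # flip rate 0.7
    "distilbert_reddit": [0, 0, 0, 1, 1, 1, 1, 1, 1, 0],  # flip rate 0.4
}
rates = {name: flip_rate(p) for name, p in preds.items()}
```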
02

Semantic Similarity

sentence-transformers/all-MiniLM-L6-v2

Question
Is the core meaning preserved between input and output?
Why we use it
Sarcasm style transfer must keep the underlying claim. Cosine similarity on sentence embeddings is the standard cheap proxy for semantic preservation across rewrites.
Score interpretation
  • 0.95+: Nearly identical (possibly just paraphrased)
  • 0.85–0.95: Good meaning preservation
  • 0.70–0.85: Moderate drift
  • < 0.70: Significant meaning loss
Limitation
Doesn't detect copying — a model that just lowercases the input scores 0.99. Always pair with BLEU vs input.
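The underlying computation is plain cosine similarity between the two sentence embeddings. A dependency-free sketch (in the actual pipeline the vectors would come from encoding the input and output with all-MiniLM-L6-v2; plain lists stand in here):

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# In the real pipeline u and v come from something like
# SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2").encode(...)
# applied to the input and output headlines.
```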
03

Perplexity

GPT-2

Question
Is the output fluent, natural English?
Why we use it
Catches degenerate outputs (truncations, gibberish, broken syntax) that other metrics might miss. We use GPT-2 because it's a fixed reference LM that pre-dates our training data.
Score interpretation
  • < 300: Very fluent
  • 300–600: Normal
  • 600–1000: Somewhat disfluent
  • > 1000: Problematic
Limitation
Mean perplexity is dominated by long-tail outliers (a single broken sample drags the average up). The dashboard reports the mean — flag outliers rather than treat the absolute value as authoritative.
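Perplexity itself is just the exponentiated mean negative log-likelihood of the output tokens under the reference LM. A sketch assuming per-token log-probabilities have already been scored (by GPT-2 in this pipeline):

```python
import math

def perplexity(token_logprobs):
    """Perplexity from per-token natural-log probabilities.

    token_logprobs: log p(token_i | context) for each output token,
    as scored by the reference LM.
    """
    nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(nll)

# Uniform log p = -5.0 per token gives PPL = e^5 ≈ 148, inside
# the "very fluent" band; a single very unlikely token in a short
# output can pull the mean far above it.
```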
04

BLEU vs Input

sacrebleu

Question
How much n-gram overlap is there between output and input?
Why we use it
Detects whether the model is genuinely rewriting or just copying the input back. We compare against the input (not a reference) because we want to penalize models that take shortcuts.
Score interpretation
  • High BLEU + high sim: Paraphrasing (minimal real change)
  • Low BLEU + high sim: Genuine rewriting
  • Low BLEU + low sim: Meaning lost
Limitation
Only meaningful in combination with similarity. Low BLEU alone could mean either successful rewriting or content destruction.
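The joint reading of BLEU-vs-input and similarity can be sketched as a small decision function. The thresholds below are illustrative placeholders, not the dashboard's actual cutoffs:

```python
def rewrite_verdict(bleu_vs_input, similarity, bleu_hi=0.5, sim_hi=0.85):
    """Joint reading of BLEU-vs-input and semantic similarity.

    Thresholds are illustrative, not the dashboard's exact values.
    """
    if bleu_vs_input >= bleu_hi and similarity >= sim_hi:
        return "paraphrasing (minimal real change)"
    if bleu_vs_input < bleu_hi and similarity >= sim_hi:
        return "genuine rewriting"
    if bleu_vs_input < bleu_hi and similarity < sim_hi:
        return "meaning lost"
    # High n-gram overlap but low embedding similarity is rare and
    # usually signals a metric artifact worth inspecting by hand.
    return "inspect manually"
```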
05

Edit Distance

Word-level Levenshtein, normalized to [0, 1]

Question
How much was the text modified?
Why we use it
Complementary to BLEU — measures structural change, not just n-gram overlap. Useful for separating models that delete tokens (high edit distance, low BLEU) from models that paraphrase (moderate edit distance, moderate BLEU).
Score interpretation
  • 0.0–0.3: Minor edits (punctuation, casing)
  • 0.4–0.6: Moderate rewriting
  • 0.7–0.9: Significant rewriting
  • 0.9+: Complete rewrite
Limitation
Doesn't tell you whether the rewriting was good — just how much there was. Use alongside similarity.
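A minimal sketch of the metric as described: word-level Levenshtein distance, normalized by the longer sequence length so the result lands in [0, 1].

```python
def normalized_edit_distance(src, out):
    """Word-level Levenshtein distance, normalized to [0, 1]
    by the length of the longer word sequence."""
    a, b = src.split(), out.split()
    m, n = len(a), len(b)
    prev = list(range(n + 1))          # distances between a[:0] and b[:j]
    for i in range(1, m + 1):
        curr = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            curr[j] = min(prev[j] + 1,         # deletion
                          curr[j - 1] + 1,     # insertion
                          prev[j - 1] + cost)  # substitution
        prev = curr
    return prev[n] / max(m, n, 1)
```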
06

LLM-as-Judge

Gemini 2.5 Flash

Question
What does a strong LLM think of the output across three dimensions?
Why we use it
Captures what surface metrics miss. We score each output 1–5 on (a) sarcasm_removed — is the output non-sarcastic? (b) meaning_preserved — is the core claim intact? (c) fluency — is it natural English? Run on a 50-sample batch per model to keep cost bounded.
Score interpretation
  • 5.0: Strong agreement with human intent
  • 3.0–4.0: Mixed signal
  • < 3.0: Failed dimension
Limitation
Expensive to run at full scale and known to be biased (LLMs prefer LLM-style outputs). Used as a sample evaluation, not a primary metric.
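In outline, the judge step is a rubric prompt plus per-dimension aggregation over the batch. The prompt wording below is hypothetical (the exact prompt sent to Gemini 2.5 Flash is not reproduced here); only the three dimension names come from this page.

```python
import json

# Hypothetical rubric prompt; the exact wording used in the pipeline
# is not shown in this document.
JUDGE_PROMPT = """Rate the rewrite 1-5 on each dimension.
Input: {input}
Output: {output}
Respond as JSON: {{"sarcasm_removed": n, "meaning_preserved": n, "fluency": n}}"""

def aggregate_judgments(raw_responses):
    """Mean score per dimension over a judged batch (e.g. 50 samples)."""
    dims = ("sarcasm_removed", "meaning_preserved", "fluency")
    scores = [json.loads(r) for r in raw_responses]
    return {d: sum(s[d] for s in scores) / len(scores) for d in dims}
```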
07

Paraphrase Score

similarity × (1 − BLEU vs input)

Question
Is the model rewriting genuinely while still preserving meaning?
Why we use it
Existing metrics fail individually — high similarity alone doesn't catch copying, low BLEU alone doesn't distinguish rewriting from destruction. Paraphrase score multiplies them so a model has to score well on BOTH to win.
Score interpretation
  • > 0.20: Good (high similarity + low copying)
  • 0.10–0.20: Moderate
  • < 0.05: Either copying or meaning lost
Concrete example. Input: "Man Shocked By Obvious Fact". Output A: "man shocked by obvious fact" → similarity 0.99, BLEU 0.95 → paraphrase 0.05 (just copied). Output B: "A person was surprised to learn something widely known" → similarity 0.85, BLEU 0.08 → paraphrase 0.78 (genuine rewrite).
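The formula is a one-liner; the sketch below reproduces the worked example above.

```python
def paraphrase_score(similarity, bleu_vs_input):
    """similarity x (1 - BLEU vs input): high only when the output
    both preserves meaning AND avoids copying."""
    return similarity * (1.0 - bleu_vs_input)

# The concrete example above:
# Output A (lowercased copy):  0.99 * (1 - 0.95) ≈ 0.05
# Output B (genuine rewrite):  0.85 * (1 - 0.08) ≈ 0.78
```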
Section 02

Human Evaluation

Why we did it: every automated metric has known failure modes, and our multi-classifier audit showed the three flip-rate classifiers disagree with each other by up to 33 percentage points. We needed ground truth. Two annotators independently labeled 140 stratified samples per model on two binary questions: sarcasm removed? and meaning changed?

140
Samples per model
Stratified by subtype
3
Models annotated
T5-Joint, T5-Control, BART-RL
2
Independent annotators
Per model
κ > 0.8
Inter-annotator
Excellent agreement
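Both the inter-annotator agreement above and the classifier-vs-human numbers below are Cohen's κ. For binary labels it is short enough to sketch directly:

```python
def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two binary (0/1) label sequences."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    p_a1 = sum(labels_a) / n
    p_b1 = sum(labels_b) / n
    expected = p_a1 * p_b1 + (1 - p_a1) * (1 - p_b1)
    if expected == 1.0:   # degenerate: both raters constant and identical
        return 1.0
    return (observed - expected) / (1 - expected)

# Perfect agreement -> 1.0; chance-level agreement -> ~0.0;
# systematic disagreement -> negative kappa, which is what a
# classifier anti-correlating with humans looks like.
```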

What we found

  • All three classifiers fail. Cohen's κ vs human ranges from −0.11 (DistilBERT-Reddit on T5-Control) to +0.18 (RoBERTa-News on T5-Joint). Four of the nine model×classifier cells show negative κ — the classifier anti-correlates with humans.
  • T5-Joint is the best model overall. Strict success rate (sarcasm removed AND meaning preserved) is 43.6% — beating T5-Control (39.3%) and BART-RL (34.3%). The strategy prefix forces task decomposition before generation.
  • BART-RL destroys meaning. Meaning-change rate is 40.7% — more than double T5-Joint's 16.4%. The reward function rewards deletion of sarcastic tokens, not faithful rewriting.
  • Subtypes fail differently. Satire has the highest classifier miss rate (80%) because it mimics legitimate news format. Rhetorical questions encode sarcasm in implication, not lexical markers. Overstatement is a model failure: the model can't remove what defines the headline.

The full evidence

The Human Eval page has the receipts: per-model summary cards, the 9-cell classifier-vs-human accuracy table, per-subtype miss rates, and the multi-classifier comparison across all 14 models.

Open Human Eval →