LLM Sarcasm Transfer
Methodology

Evaluation

Why we measure each metric, what its score means, and where it breaks down. Every model in the dashboard reports these seven numbers; this page is the Rosetta stone that explains them.

How to read these together

  • No metric is sufficient on its own. A model that scores best on flip rate often scores worst on meaning. Read the row, not the column.
  • Pair similarity with BLEU vs input. High similarity + high BLEU = copying. High similarity + low BLEU = genuine rewriting. The paraphrase score formalises this.
  • Treat flip rate as one signal, not ground truth. The classifiers disagree with each other by 33 pp and with humans by κ ≈ 0.10. Cross-check with the human eval.
The 7-Metric Pipeline
01

Sarcasm Flip Rate

RoBERTa-Twitter · DistilBERT-Reddit · RoBERTa-News

Question
Does the classifier think sarcasm was removed?
Why we use it
The most direct measure of task success — was the output reclassified as non-sarcastic? We run all three classifiers because they trained on different domains (Twitter, Reddit, News) and disagree by up to 33 percentage points on the same outputs.
Score interpretation
  • Higher: More outputs flagged non-sarcastic
Limitation
All three classifiers fail against human ground truth (Cohen's κ from −0.11 to +0.18 vs human κ > 0.8). They detect sarcasm presence in isolation, not removal between input and output. Treat as one signal, not as ground truth.
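As a minimal sketch, flip rate reduces to a fraction over classifier labels on the outputs. The classifier names and predictions below are illustrative, not real dashboard data; they only show how two classifiers can report very different rates on the same ten outputs.

```python
def flip_rate(pred_labels):
    """Fraction of model outputs a classifier labels non-sarcastic.

    pred_labels: classifier labels on the *outputs*,
    1 = sarcastic, 0 = non-sarcastic (inputs are all sarcastic by construction).
    """
    return sum(1 for y in pred_labels if y == 0) / len(pred_labels)

# Hypothetical per-classifier predictions on the same 10 outputs,
# illustrating the large cross-classifier disagreement noted above.
preds = {
    "roberta_twitter":   [0, 0, 0, 0, 0, 0, 0, 1, 1, 1],  # flip rate 0.7
    "distilbert_reddit": [0, 0, 0, 1, 1, 1, 1, 1, 1, 0],  # flip rate 0.4
}
rates = {name: flip_rate(p) for name, p in preds.items()}
```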
02

Semantic Similarity

sentence-transformers/all-MiniLM-L6-v2

Question
Is the core meaning preserved between input and output?
Why we use it
Sarcasm style transfer must keep the underlying claim. Cosine similarity on sentence embeddings is the standard cheap proxy for semantic preservation across rewrites.
Score interpretation
  • 0.95+: Nearly identical (possibly just paraphrased)
  • 0.85–0.95: Good meaning preservation
  • 0.70–0.85: Moderate drift
  • < 0.70: Significant meaning loss
Limitation
Doesn't detect copying — a model that just lowercases the input scores 0.99. Always pair with BLEU vs input.
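The underlying computation is plain cosine similarity between the two sentence embeddings. A dependency-free sketch (in the actual pipeline the vectors would come from encoding the input and output with all-MiniLM-L6-v2; plain lists stand in here):

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# In the real pipeline u and v come from something like
# SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2").encode(...)
# applied to the input and output headlines.
```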
03

Perplexity

GPT-2

Question
Is the output fluent, natural English?
Why we use it
Catches degenerate outputs (truncations, gibberish, broken syntax) that other metrics might miss. We use GPT-2 because it's a fixed reference LM that pre-dates our training data.
Score interpretation
  • < 300: Very fluent
  • 300–600: Normal
  • 600–1000: Somewhat disfluent
  • > 1000: Problematic
Limitation
Mean perplexity is dominated by long-tail outliers (a single broken sample drags the average up). The dashboard reports the mean — flag outliers rather than treat the absolute value as authoritative.
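Perplexity itself is just the exponentiated mean negative log-likelihood of the output tokens under the reference LM. A sketch assuming per-token log-probabilities have already been scored (by GPT-2 in this pipeline):

```python
import math

def perplexity(token_logprobs):
    """Perplexity from per-token natural-log probabilities.

    token_logprobs: log p(token_i | context) for each output token,
    as scored by the reference LM.
    """
    nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(nll)

# Uniform log p = -5.0 per token gives PPL = e^5 ≈ 148, inside
# the "very fluent" band; a single very unlikely token in a short
# output can pull the mean far above it.
```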
04

BLEU vs Input

sacrebleu

Question
How much n-gram overlap is there between output and input?
Why we use it
Detects whether the model is genuinely rewriting or just copying the input back. We compare against the input (not a reference) because we want to penalize models that take shortcuts.
Score interpretation
  • High BLEU + high sim: Paraphrasing (minimal real change)
  • Low BLEU + high sim: Genuine rewriting
  • Low BLEU + low sim: Meaning lost
Limitation
Only meaningful in combination with similarity. Low BLEU alone could mean either successful rewriting or content destruction.
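The joint reading of BLEU-vs-input and similarity can be sketched as a small decision function. The thresholds below are illustrative placeholders, not the dashboard's actual cutoffs:

```python
def rewrite_verdict(bleu_vs_input, similarity, bleu_hi=0.5, sim_hi=0.85):
    """Joint reading of BLEU-vs-input and semantic similarity.

    Thresholds are illustrative, not the dashboard's exact values.
    """
    if bleu_vs_input >= bleu_hi and similarity >= sim_hi:
        return "paraphrasing (minimal real change)"
    if bleu_vs_input < bleu_hi and similarity >= sim_hi:
        return "genuine rewriting"
    if bleu_vs_input < bleu_hi and similarity < sim_hi:
        return "meaning lost"
    # High n-gram overlap but low embedding similarity is rare and
    # usually signals a metric artifact worth inspecting by hand.
    return "inspect manually"
```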
05

Edit Distance

Word-level Levenshtein, normalized to [0, 1]

Question
How much was the text modified?
Why we use it
Complementary to BLEU — measures structural change, not just n-gram overlap. Useful for separating models that delete tokens (high edit distance, low BLEU) from models that paraphrase (moderate edit distance, moderate BLEU).
Score interpretation
  • 0.0–0.3: Minor edits (punctuation, casing)
  • 0.4–0.6: Moderate rewriting
  • 0.7–0.9: Significant rewriting
  • 0.9+: Complete rewrite
Limitation
Doesn't tell you whether the rewriting was good — just how much there was. Use alongside similarity.
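A minimal sketch of the metric as described: word-level Levenshtein distance, normalized by the longer sequence length so the result lands in [0, 1].

```python
def normalized_edit_distance(src, out):
    """Word-level Levenshtein distance, normalized to [0, 1]
    by the length of the longer word sequence."""
    a, b = src.split(), out.split()
    m, n = len(a), len(b)
    prev = list(range(n + 1))          # distances between a[:0] and b[:j]
    for i in range(1, m + 1):
        curr = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            curr[j] = min(prev[j] + 1,         # deletion
                          curr[j - 1] + 1,     # insertion
                          prev[j - 1] + cost)  # substitution
        prev = curr
    return prev[n] / max(m, n, 1)
```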
06

LLM-as-Judge

Gemini 2.5 Flash

Question
What does a strong LLM think of the output across three dimensions?
Why we use it
Captures what surface metrics miss. We score each output 1–5 on (a) sarcasm_removed — is the output non-sarcastic? (b) meaning_preserved — is the core claim intact? (c) fluency — is it natural English? Run on a 50-sample batch per model to keep cost bounded.
Score interpretation
  • 5.0: Strong agreement with human intent
  • 3.0–4.0: Mixed signal
  • < 3.0: Failed dimension
Limitation
Expensive to run at full scale and known to be biased (LLMs prefer LLM-style outputs). Used as a sample evaluation, not a primary metric.
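In outline, the judge step is a rubric prompt plus per-dimension aggregation over the batch. The prompt wording below is hypothetical (the exact prompt sent to Gemini 2.5 Flash is not reproduced here); only the three dimension names come from this page.

```python
import json

# Hypothetical rubric prompt; the exact wording used in the pipeline
# is not shown in this document.
JUDGE_PROMPT = """Rate the rewrite 1-5 on each dimension.
Input: {input}
Output: {output}
Respond as JSON: {{"sarcasm_removed": n, "meaning_preserved": n, "fluency": n}}"""

def aggregate_judgments(raw_responses):
    """Mean score per dimension over a judged batch (e.g. 50 samples)."""
    dims = ("sarcasm_removed", "meaning_preserved", "fluency")
    scores = [json.loads(r) for r in raw_responses]
    return {d: sum(s[d] for s in scores) / len(scores) for d in dims}
```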
07

Paraphrase Score

similarity × (1 − BLEU vs input)

Question
Is the model rewriting genuinely while still preserving meaning?
Why we use it
Existing metrics fail individually — high similarity alone doesn't catch copying, low BLEU alone doesn't distinguish rewriting from destruction. Paraphrase score multiplies them so a model has to score well on BOTH to win.
Score interpretation
  • > 0.20: Good (high similarity + low copying)
  • 0.10–0.20: Moderate
  • < 0.05: Either copying or meaning lost
Concrete example. Input: "Man Shocked By Obvious Fact". Output A: "man shocked by obvious fact" → similarity 0.99, BLEU 0.95 → paraphrase 0.05 (just copied). Output B: "A person was surprised to learn something widely known" → similarity 0.85, BLEU 0.08 → paraphrase 0.78 (genuine rewrite).
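The formula is a one-liner; the sketch below reproduces the worked example above.

```python
def paraphrase_score(similarity, bleu_vs_input):
    """similarity x (1 - BLEU vs input): high only when the output
    both preserves meaning AND avoids copying."""
    return similarity * (1.0 - bleu_vs_input)

# The concrete example above:
# Output A (lowercased copy):  0.99 * (1 - 0.95) ≈ 0.05
# Output B (genuine rewrite):  0.85 * (1 - 0.08) ≈ 0.78
```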
Section 02

Human Evaluation

Why we did it: every automated metric has known failure modes, and our multi-classifier audit showed the three flip-rate classifiers disagree with each other by up to 33 percentage points. We needed ground truth. Two annotators independently labeled 140 stratified samples per model on two binary questions: sarcasm removed? and meaning changed?

140
Samples per model
Stratified by subtype
3
Models annotated
T5-Joint, T5-Control, BART-RL
2
Independent annotators
Per model
κ > 0.8
Inter-annotator
Excellent agreement
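Both the inter-annotator agreement above and the classifier-vs-human numbers below are Cohen's κ. For binary labels it is short enough to sketch directly:

```python
def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two binary (0/1) label sequences."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    p_a1 = sum(labels_a) / n
    p_b1 = sum(labels_b) / n
    expected = p_a1 * p_b1 + (1 - p_a1) * (1 - p_b1)
    if expected == 1.0:   # degenerate: both raters constant and identical
        return 1.0
    return (observed - expected) / (1 - expected)

# Perfect agreement -> 1.0; chance-level agreement -> ~0.0;
# systematic disagreement -> negative kappa, which is what a
# classifier anti-correlating with humans looks like.
```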

What we found

  • All three classifiers fail. Cohen's κ vs human ranges from −0.11 (DistilBERT-Reddit on T5-Control) to +0.18 (RoBERTa-News on T5-Joint). Four of the nine model×classifier cells show negative κ — the classifier anti-correlates with humans.
  • T5-Joint is the best model overall. Strict success rate (sarcasm removed AND meaning preserved) is 43.6% — beating T5-Control (39.3%) and BART-RL (34.3%). The strategy prefix forces task decomposition before generation.
  • BART-RL destroys meaning. Meaning-change rate is 40.7% — more than double T5-Joint's 16.4%. The reward function rewards deletion of sarcastic tokens, not faithful rewriting.
  • Subtypes fail differently. Satire has the highest classifier miss rate (80%) because it mimics legitimate news format. Rhetorical questions encode sarcasm in implication, not lexical markers. Overstatement is a model failure: the model can't remove what defines the headline.

The full evidence

The Human Eval page has the receipts: per-model summary cards, the 9-cell classifier-vs-human accuracy table, per-subtype miss rates, and the multi-classifier comparison across all 14 models.

Open Human Eval →