Question
Do the classifiers think sarcasm was removed?
Why we use it
The most direct measure of task success — was the output reclassified as non-sarcastic? We run all three classifiers because they were trained on different domains (Twitter, Reddit, News) and disagree by up to 33 percentage points on the same outputs.
Score interpretation
- Higher: More outputs flagged non-sarcastic
Limitation
All three classifiers fail against human ground truth (Cohen's κ from −0.11 to +0.18 vs human κ > 0.8). They detect sarcasm presence in isolation, not removal between input and output. Treat as one signal, not as ground truth.
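The per-classifier flag rates and the disagreement figure above can be tabulated with a small helper. This is a minimal sketch: the boolean verdicts and domain keys are illustrative, not the project's actual classifier wrappers.

```python
def nonsarcastic_rate(preds: list) -> float:
    """Fraction of outputs a classifier flagged as non-sarcastic.
    Each prediction is assumed boolean: True = flagged non-sarcastic."""
    return sum(preds) / len(preds)

def max_disagreement(per_classifier: dict) -> float:
    """Largest gap between any two classifiers' flag rates, in [0, 1]."""
    rates = [nonsarcastic_rate(p) for p in per_classifier.values()]
    return max(rates) - min(rates)

# Illustrative verdicts on the same four outputs (made-up data):
preds = {
    "twitter": [True, True, False, True],   # rate 0.75
    "reddit":  [True, False, False, True],  # rate 0.50
    "news":    [True, True, True, True],    # rate 1.00
}
```
Reporting the spread alongside the per-classifier rates makes the domain-transfer disagreement visible instead of hiding it behind an average.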
Question
Is the core meaning preserved between input and output?
Why we use it
Sarcasm style transfer must keep the underlying claim. Cosine similarity on sentence embeddings is the standard cheap proxy for semantic preservation across rewrites.
Score interpretation
- 0.95+: Nearly identical (possibly just paraphrased)
- 0.85–0.95: Good meaning preservation
- 0.70–0.85: Moderate drift
- < 0.70: Significant meaning loss
Limitation
Doesn't detect copying — a model that just lowercases the input scores 0.99. Always pair with BLEU vs input.
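The similarity itself is plain cosine over the two embedding vectors. A minimal sketch; the sentence encoder that produces the vectors is assumed (e.g. a sentence-transformers model) and not shown here:

```python
import numpy as np

def cosine_similarity(u, v) -> float:
    """Cosine similarity between two embedding vectors.
    Assumes the vectors come from the same sentence encoder."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Parallel vectors score 1.0; orthogonal vectors score 0.0.
```
Note this is why copying scores so high: the embedding of a lowercased input is nearly the same vector as the input's, so the cosine is ~0.99.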
Question
Is the output fluent, natural English?
Why we use it
Catches degenerate outputs (truncations, gibberish, broken syntax) that other metrics might miss. We use GPT-2 because it's a fixed reference LM that pre-dates our training data.
Score interpretation
- < 300: Very fluent
- 300–600: Normal
- 600–1000: Somewhat disfluent
- > 1000: Problematic
Limitation
Mean perplexity is dominated by long-tail outliers (a single broken sample drags the average up). The dashboard reports the mean, so flag outliers rather than treating the absolute value as authoritative.
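The outlier effect can be seen with toy numbers (illustrative per-sample perplexities, not real measurements):

```python
from statistics import mean, median

# Four ordinary outputs plus one broken sample (made-up values):
ppls = [220.0, 310.0, 280.0, 450.0, 9000.0]

print(mean(ppls))    # 2052.0 -- the single outlier pushes the mean
                     # into the "problematic" band
print(median(ppls))  # 310.0  -- robust; reflects the typical sample
```
Comparing the mean against the median is a quick way to spot when one broken sample, rather than general disfluency, is responsible for a high reading.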
Question
How much was the text modified?
Why we use it
Complementary to BLEU — measures structural change, not just n-gram overlap. Useful for separating models that delete tokens (high edit distance, low BLEU) from models that paraphrase (moderate edit distance, moderate BLEU).
Score interpretation
- 0.0–0.3: Minor edits (punctuation, casing)
- 0.4–0.6: Moderate rewriting
- 0.7–0.9: Significant rewriting
- 0.9+: Complete rewrite
Limitation
Doesn't tell you whether the rewriting was good — just how much there was. Use alongside similarity.
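A sketch of the underlying computation, assuming character-level Levenshtein distance normalized by the longer string's length (the doc doesn't state the exact granularity or normalization, so treat both as assumptions):

```python
def levenshtein(a: str, b: str) -> int:
    """Character-level edit distance via the classic two-row DP."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(
                prev[j] + 1,               # deletion
                curr[j - 1] + 1,           # insertion
                prev[j - 1] + (ca != cb),  # substitution
            ))
        prev = curr
    return prev[-1]

def normalized_edit_distance(src: str, out: str) -> float:
    """Edit distance scaled to [0, 1] by the longer string's length."""
    longest = max(len(src), len(out))
    return levenshtein(src, out) / longest if longest else 0.0
```
Normalizing by the longer string keeps the score in [0, 1], so wholesale deletion and wholesale rewriting both land near the top of the scale.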
06. LLM-as-Judge (Gemini 2.5 Flash)
Question
What does a strong LLM think of the output across three dimensions?
Why we use it
Captures what surface metrics miss. We score each output 1–5 on (a) sarcasm_removed — is the output non-sarcastic? (b) meaning_preserved — is the core claim intact? (c) fluency — is it natural English? Run on a 50-sample batch per model to keep cost bounded.
Score interpretation
- 5.0: Strong agreement with human intent
- 3.0–4.0: Mixed signal
- < 3.0: Failed dimension
Limitation
Expensive to run at full scale and known to be biased (LLMs prefer LLM-style outputs). Used as a sample evaluation, not a primary metric.
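Once the judge's JSON responses are parsed, aggregation per model is straightforward. A sketch of that step only (the API call to the judge and the response parsing are omitted; the dimension names come from the rubric above, the example scores are made up):

```python
from statistics import mean

DIMENSIONS = ("sarcasm_removed", "meaning_preserved", "fluency")

def aggregate_judgments(rows: list) -> dict:
    """Per-dimension mean over one model's judged batch.
    Each row is the judge's parsed output: a 1-5 score per dimension."""
    return {d: mean(r[d] for r in rows) for d in DIMENSIONS}

batch = [
    {"sarcasm_removed": 5, "meaning_preserved": 4, "fluency": 5},
    {"sarcasm_removed": 3, "meaning_preserved": 5, "fluency": 4},
]
```
Keeping the three dimensions separate (rather than averaging them into one number) preserves the "failed dimension" signal from the interpretation table.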
Question
Is the model genuinely rewriting while still preserving meaning?
Why we use it
Existing metrics fail individually — high similarity alone doesn't catch copying, and low BLEU alone doesn't distinguish rewriting from destruction. Paraphrase score multiplies semantic similarity by one minus BLEU vs input, so a model has to score well on BOTH to win.
Score interpretation
- > 0.20: Good — high similarity + low copying
- 0.10–0.20: Moderate
- < 0.05: Either copying or meaning lost
Concrete example. Input: "Man Shocked By Obvious Fact".
- Output A: "man shocked by obvious fact" → similarity 0.99, BLEU 0.95 → paraphrase 0.05 (just copied)
- Output B: "A person was surprised to learn something widely known" → similarity 0.85, BLEU 0.08 → paraphrase 0.78 (genuine rewrite)
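The reading of "multiplies them" that reproduces both example scores is similarity × (1 − BLEU vs input). A minimal sketch under that assumption:

```python
def paraphrase_score(similarity: float, bleu_vs_input: float) -> float:
    """Combined score: similarity rewards preserved meaning,
    (1 - BLEU) rewards surface change, so copying is penalized."""
    return similarity * (1.0 - bleu_vs_input)

print(paraphrase_score(0.99, 0.95))  # Output A: ~0.05, just copied
print(paraphrase_score(0.85, 0.08))  # Output B: ~0.78, genuine rewrite
```
Because the score is a product, either factor near zero (pure copy, or meaning destroyed) collapses the whole score — which is exactly the behavior the metric is after.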