LLMaoSarcasm Transfer
Methodology / Data

Data Pipeline

From 28,619 raw news headlines to 89,688 strategy-annotated training pairs — six stages of LLM generation, cross-validation, and stratified splitting designed to prevent leakage while capturing all six sarcasm strategies.

28,619 raw headlines · NHDSD v2
28,536 generated pairs · LLM opposites
89,688 augmented records · 6 strategies per source
4,076 suspected mislabels
80.19% NHDSD agreement · StepFun vs. original labels
Primary Pipeline — Non-sarcastic → Sarcastic
01

Raw Collection

reclassify_nhdsd_binary.py

Deduplicate headlines and normalize whitespace

in 28,619 · NHDSD (Misra 2019)
out 28,497 · nhdsd_cleaned.json

Source: TheOnion (13,634 sarcastic) + HuffPost (14,985 neutral) headlines with article links.
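The cleaning logic inside reclassify_nhdsd_binary.py is not reproduced here; a minimal sketch of the dedupe-and-whitespace pass over NHDSD-style records might look like this (`clean_headlines` is an illustrative name, not the script's actual function):

```python
def clean_headlines(records):
    """Deduplicate headlines and normalize whitespace.

    `records` is a list of dicts with at least a "headline" key,
    matching the NHDSD JSON schema. Order is preserved; the first
    occurrence of a duplicate wins.
    """
    seen = set()
    cleaned = []
    for rec in records:
        # Collapse runs of whitespace into single spaces.
        headline = " ".join(rec["headline"].split())
        if headline and headline not in seen:
            seen.add(headline)
            cleaned.append({**rec, "headline": headline})
    return cleaned
```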

02

Binary Reclassification

StepFun 3.5 Flash · temp=0.1

LLM re-labels every headline as sarcastic / non-sarcastic

in 28,497 · nhdsd_cleaned.json
out 28,497 · nhdsd_reclassified.jsonl

Agreement with original NHDSD labels: 80.19%. 5,644 headlines were flagged as disagreements for verification.
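The exact prompt and response format used with StepFun 3.5 Flash are not shown in this section; a hedged sketch of the response normalization such a low-temperature classification pass needs (`parse_binary_label` is hypothetical) could be:

```python
def parse_binary_label(response_text):
    """Map a free-form LLM reply to a binary label, or None if unparseable.

    Illustrative only: checks the "non-" prefix before "sarcastic" so
    that "non-sarcastic" is not swallowed by the positive branch.
    """
    text = response_text.strip().lower()
    if text.startswith("non-sarcastic") or text.startswith("not sarcastic"):
        return 0
    if text.startswith("sarcastic"):
        return 1
    return None  # flag for manual review instead of guessing
```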

03

Cross-Validation

Nemotron 3 Nano 30B · temp=0.1

A second independent LLM re-classifies only the disagreements

in 5,644 · label_disagreements.jsonl
out 5,644 · cross_validation_comparison.json

Of the 5,644 disagreements, 4,076 (72.2%) had StepFun + Nemotron agreeing against the original annotation. This step produces an audit report, not a corrected labels file — the main training pipeline still reads from the raw NHDSD dataset. The CV results are consumed only by the secondary sar→non filtering pipeline.
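The "two models against the original annotation" rule can be sketched as a small audit over parallel label lists (function name is illustrative):

```python
def audit_disagreements(original, model_a, model_b):
    """Collect indices where both re-labeling models agree with each
    other but disagree with the original annotation: the suspected
    mislabels. Inputs are parallel lists of 0/1 labels.
    """
    suspected = []
    for i, (orig, a, b) in enumerate(zip(original, model_a, model_b)):
        if a == b and a != orig:
            suspected.append(i)
    return suspected
```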

04

Pair Generation

StepFun 3.5 Flash · temp=0.7 · batch=30

For each headline, generate its opposite-style counterpart and tag the strategy

in 28,619 · Sarcasm_Headlines_Dataset_v2.json
out 28,536 · sarcasm_pairs_step35_clean.jsonl

Non-sarcastic headlines (14,948) become non→sar pairs. Sarcastic headlines (13,634) become sar→non pairs. 6 headlines hit content filters and were dropped.
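A sketch of the routing and batching described above, assuming the NHDSD `is_sarcastic` flag (both helper names are illustrative):

```python
from collections import Counter

def direction_counts(records):
    """Tally how many source headlines feed each transfer direction.

    Records follow the NHDSD schema ("is_sarcastic": 0/1).
    """
    return Counter(
        "sar_to_non" if r["is_sarcastic"] else "non_to_sar" for r in records
    )

def batches(items, size=30):
    """Yield fixed-size chunks, mirroring the batch=30 generation calls."""
    for i in range(0, len(items), size):
        yield items[i:i + size]
```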

05

Strategy Augmentation

StepFun 3.5 Flash · temp=0.8

For each source, generate 5 more variants covering the missing strategies (6 total per source)

in 14,948 · sarcasm_pairs_non_to_sarcastic.jsonl
out 89,688 · sarcasm_pairs_non_to_sarcastic_complete.jsonl

The higher temperature encourages variation across strategy types. Every source ends up with exactly 6 labeled variants — one for each sarcasm strategy.
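The "fill in the missing strategies" step amounts to set difference against the six-label taxonomy; a minimal sketch (the constant and function names are illustrative):

```python
STRATEGIES = [
    "sarcasm", "irony", "satire",
    "overstatement", "understatement", "rhetorical_question",
]

def missing_strategies(existing_variants):
    """Return the strategies still needed so a source ends up with all
    six. `existing_variants` lists the strategy tags already generated
    for one source (the first-pass pair contributes one of them).
    """
    have = set(existing_variants)
    return [s for s in STRATEGIES if s not in have]
```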

06

Stratified Splits

create_train_val_test_splits.py

Source-level stratified split (seed=42) so all 6 variants of a source stay in the same split

in 89,688 · sarcasm_pairs_non_to_sarcastic_complete.jsonl
out 71,730 / 8,952 / 9,006 · train / val / test

Source-level grouping prevents data leakage: a model can't memorize one variant of a headline and score well on another variant of the same source.
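The grouping idea can be sketched as assigning each unique source id to a split, so every variant of a source inherits the same assignment. The fractions and function name below are illustrative; the actual logic lives in create_train_val_test_splits.py:

```python
import random

def source_level_split(source_ids, seed=42, val_frac=0.1, test_frac=0.1):
    """Map each unique source id to "train"/"val"/"test".

    Splitting by source id (not by variant row) keeps all strategy
    variants of one headline in the same split.
    """
    uniq = sorted(set(source_ids))      # sort before shuffling for determinism
    rng = random.Random(seed)
    rng.shuffle(uniq)
    n_val = int(len(uniq) * val_frac)
    n_test = int(len(uniq) * test_frac)
    val = set(uniq[:n_val])
    test = set(uniq[n_val:n_val + n_test])
    return {
        sid: ("val" if sid in val else "test" if sid in test else "train")
        for sid in uniq
    }
```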

Six Sarcasm Strategies

Adapted from the iSarcasm taxonomy. Every source headline is expanded into six strategy-labeled variants so models see the full distribution during training.

sarcasm

Sarcasm

Contradicts the state of affairs with a critical tone toward the addressee.

"Great job breaking the build right before the demo."

irony

Irony

Contradicts the state of affairs without obvious blame or target.

"What a beautiful day for a three-hour traffic jam."

satire

Satire

Appears supportive but contains mockery that reveals absurdity.

"Senate Passes Landmark Bill To Study The Feasibility Of Passing Bills."

overstatement

Overstatement

Obviously exaggerated terms or impossible quantities.

"I've told you a million times to stop exaggerating."

understatement

Understatement

Severe minimization of the importance or severity of something.

"The Titanic experienced some minor hull damage."

rhetorical_question

Rhetorical Question

A question whose expected answer contradicts reality.

"Is the sky blue? Obviously congress isn't corrupt."

Secondary Datasets

Sarcastic → Non-sarcastic

Built by filtering to cross-validated sarcastic headlines and scraping the linked articles to drop meme-only entries.

Total
13,588
Train / Val / Test
10,868 / 1,356 / 1,364

Context-Enhanced

Non-sarcastic targets generated with the full article body as context, using a different model (Qwen 3.6 Plus) and an extended strategy set (13 labels).

Total
10,330
Train / Val / Test
8,258 / 1,029 / 1,043

Quality Controls

Cross-validation as a data audit

Every headline StepFun disagreed with was re-checked by Nemotron 3 Nano 30B. When both LLMs agreed against the NHDSD label (4,076 of 5,644 disagreements), we recorded it as a suspected mislabel. These corrections are an audit signal — only the secondary sar→non pipeline actually filters training data using them. The main non→sar pipeline reads the raw NHDSD labels directly.

Source-level stratification

Splits are done at the source headline level, not the variant level. All 6 strategy variants of a single source stay in the same split, preventing a model from memorizing one variant and scoring well on another.

Deterministic labeling, creative generation

Classification passes run at temperature 0.1 for stability, while pair and variant generation run at 0.7–0.8 so the model produces genuinely different rewrites per strategy.
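The per-pass settings described above can be summarized in one config table (the key names are hypothetical; model names and temperatures come from the pipeline stages):

```python
# Sampling settings per pipeline pass: deterministic labeling,
# creative generation.
GENERATION_CONFIG = {
    "binary_reclassification": {"model": "StepFun 3.5 Flash", "temperature": 0.1},
    "cross_validation": {"model": "Nemotron 3 Nano 30B", "temperature": 0.1},
    "pair_generation": {"model": "StepFun 3.5 Flash", "temperature": 0.7},
    "strategy_augmentation": {"model": "StepFun 3.5 Flash", "temperature": 0.8},
}
```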

Context-enhanced variant

A second smaller dataset (10,330 pairs) is generated with access to the full article body, producing more factually grounded non-sarcastic rewrites. Used to train the CE and CE+RL BART variants.
