Data Pipeline
From 28,619 raw news headlines to 89,688 strategy-annotated training pairs — six stages of LLM generation, cross-validation, and stratified splitting designed to prevent leakage while capturing all six sarcasm strategies.
Raw Collection
reclassify_nhdsd_binary.py · Clean duplicates, normalize whitespace
Source: TheOnion (13,634 sarcastic) + HuffPost (14,985 neutral) headlines with article links.
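The cleanup step can be sketched as follows. This is a minimal illustration, not the actual `reclassify_nhdsd_binary.py` code; the function name `clean_headlines` and the case-insensitive dedup policy are assumptions.

```python
import re

def clean_headlines(headlines):
    """Hypothetical sketch of the raw-collection cleanup: collapse runs of
    whitespace, drop empty strings, and remove duplicates (case-insensitive),
    keeping the first occurrence in order."""
    seen = set()
    cleaned = []
    for h in headlines:
        norm = re.sub(r"\s+", " ", h).strip()   # normalize whitespace
        if norm and norm.lower() not in seen:   # skip empties and repeats
            seen.add(norm.lower())
            cleaned.append(norm)
    return cleaned
```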
Binary Reclassification
StepFun 3.5 Flash · temp=0.1 · LLM re-labels every headline as sarcastic / non-sarcastic
Agreement with original NHDSD labels: 80.19%. 5,644 headlines were flagged as disagreements for verification.
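The agreement/disagreement bookkeeping amounts to a simple comparison between the two label lists. A minimal sketch (the helper name `audit_labels` is hypothetical, not from the pipeline):

```python
def audit_labels(original, relabels):
    """Compare LLM relabels against the original binary labels.

    Returns (agreement_rate, indices_of_disagreements); the disagreement
    indices are what get forwarded to the cross-validation stage."""
    assert len(original) == len(relabels)
    disagreements = [i for i, (a, b) in enumerate(zip(original, relabels))
                     if a != b]
    agreement = 1 - len(disagreements) / len(original)
    return agreement, disagreements
```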
Cross-Validation
Nemotron 3 Nano 30B · temp=0.1 · A second independent LLM re-classifies only the disagreements
Of the 5,644 disagreements, 4,076 (72.2%) had StepFun + Nemotron agreeing against the original annotation. This step produces an audit report, not a corrected labels file — the main training pipeline still reads from the raw NHDSD dataset. The CV results are consumed only by the secondary sar→non filtering pipeline.
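The "two LLMs agree against the original annotation" rule can be expressed in a few lines. A sketch under assumed names (`suspected_mislabels` and its list-based inputs are illustrative, not the pipeline's actual interfaces):

```python
def suspected_mislabels(original, stepfun, nemotron, disagreement_idx):
    """A headline is a suspected mislabel when, on a StepFun/NHDSD
    disagreement, Nemotron sides with StepFun rather than the original
    annotation. The result is an audit signal, not a corrected label set."""
    return [i for i in disagreement_idx
            if stepfun[i] == nemotron[i] and stepfun[i] != original[i]]
```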
Pair Generation
StepFun 3.5 Flash · temp=0.7 · batch=30 · For each headline, generate its opposite-style counterpart and tag the strategy
Non-sarcastic headlines (14,948) become non→sar pairs. Sarcastic headlines (13,634) become sar→non pairs. 6 headlines hit content filters and were dropped.
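The batching and direction-tagging logic looks roughly like this. Both helper names are hypothetical; the batch size of 30 is the only value taken from the stage description above.

```python
def make_batches(items, batch_size=30):
    """Split headlines into fixed-size batches, one LLM request per batch."""
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]

def pair_direction(label):
    """Sarcastic sources get a non-sarcastic rewrite, and vice versa."""
    return "sar->non" if label == "sarcastic" else "non->sar"
```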
Strategy Augmentation
StepFun 3.5 Flash · temp=0.8 · For each source, generate 5 more variants covering the missing strategies (6 total per source)
The higher temperature encourages variation across strategy types. Every source ends up with exactly 6 labeled variants — one for each sarcasm strategy.
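Guaranteeing "exactly 6 variants, one per strategy" reduces to computing which strategies a source still lacks before generation. A minimal sketch (function name and variant-dict shape are assumptions):

```python
STRATEGIES = ["sarcasm", "irony", "satire", "overstatement",
              "understatement", "rhetorical_question"]

def missing_strategies(existing_variants):
    """Given the strategy tags already produced for one source headline,
    return the strategies still to be generated, in canonical order.
    After the pair-generation stage each source has one tagged variant,
    so this typically returns the other five."""
    have = {v["strategy"] for v in existing_variants}
    return [s for s in STRATEGIES if s not in have]
```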
Stratified Splits
create_train_val_test_splits.py · Source-level stratified split (seed=42) so all 6 variants of a source stay in the same split
Source-level grouping prevents data leakage: a model can't memorize one variant of a headline and score well on another variant of the same source.
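A source-level split can be sketched by shuffling the unique source IDs (not the variants) and cutting the shuffled list. This is an illustration, not `create_train_val_test_splits.py` itself; the 80/10/10 ratios are an assumption, while seed=42 comes from the stage description.

```python
import random

def source_level_split(source_ids, seed=42, ratios=(0.8, 0.1, 0.1)):
    """Assign whole source headlines to train/val/test so that all six
    strategy variants of a source land in the same split."""
    ids = sorted(set(source_ids))      # dedupe variants down to sources
    rng = random.Random(seed)          # fixed seed for reproducibility
    rng.shuffle(ids)
    n_train = int(len(ids) * ratios[0])
    n_val = int(len(ids) * ratios[1])
    return (set(ids[:n_train]),
            set(ids[n_train:n_train + n_val]),
            set(ids[n_train + n_val:]))
```

Because membership is decided per source ID, a variant can never leak across the split boundary away from its siblings.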
Adapted from the iSarcasm taxonomy. Every source headline is expanded into six strategy-labeled variants so models see the full distribution during training.
Sarcasm
Contradicts the state of affairs with a critical tone toward the addressee.
"Great job breaking the build right before the demo."
Irony
Contradicts the state of affairs without obvious blame or target.
"What a beautiful day for a three-hour traffic jam."
Satire
Appears supportive but contains mockery that reveals absurdity.
"Senate Passes Landmark Bill To Study The Feasibility Of Passing Bills."
Overstatement
Obviously exaggerated terms or impossible quantities.
"I've told you a million times to stop exaggerating."
Understatement
Severe minimization of the importance or severity of something.
"The Titanic experienced some minor hull damage."
Rhetorical Question
A question whose expected answer contradicts reality.
"Is the sky blue? Obviously congress isn't corrupt."
Sarcastic → Non-sarcastic
Built after filtering to cross-validated sarcastic headlines and scraping their articles to drop meme-only entries.
Context-Enhanced
Non-sarcastic targets generated with full article body as context. Different model (Qwen 3.6 Plus), extended strategy set (13 labels).
Cross-validation as a data audit
Every headline StepFun disagreed with was re-checked by Nemotron 3 Nano 30B. When both LLMs agreed against the NHDSD label (4,076 of 5,644 disagreements), we recorded it as a suspected mislabel. These corrections are an audit signal — only the secondary sar→non pipeline actually filters training data using them. The main non→sar pipeline reads the raw NHDSD labels directly.
Source-level stratification
Splits are done at the source headline level, not the variant level. All 6 strategy variants of a single source stay in the same split, preventing a model from memorizing one variant and scoring well on another.
Deterministic labeling, creative generation
Classification passes run at temperature 0.1 for stability, while pair and variant generation run at 0.7–0.8 so the model produces genuinely different rewrites per strategy.
Context-enhanced variant
A second smaller dataset (10,330 pairs) is generated with access to the full article body, producing more factually grounded non-sarcastic rewrites. Used to train the CE and CE+RL BART variants.