Data Pipeline
From 28,619 raw news headlines to 89,688 strategy-annotated training pairs — six stages of LLM generation, cross-validation, and stratified splitting designed to prevent leakage while capturing all six sarcasm strategies.
Raw Collection
reclassify_nhdsd_binary.py · Clean duplicates, normalize whitespace
Source: TheOnion (13,634 sarcastic) + HuffPost (14,985 neutral) headlines with article links.
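The cleanup step can be sketched as follows. This is a minimal illustration, not the actual `reclassify_nhdsd_binary.py` code; the function name `clean_headlines` and the case-insensitive dedup policy are assumptions.

```python
import re

def clean_headlines(headlines):
    """Hypothetical sketch of the raw-collection cleanup: collapse runs of
    whitespace, drop empty strings, and remove duplicates (case-insensitive),
    keeping the first occurrence in order."""
    seen = set()
    cleaned = []
    for h in headlines:
        norm = re.sub(r"\s+", " ", h).strip()   # normalize whitespace
        if norm and norm.lower() not in seen:   # skip empties and repeats
            seen.add(norm.lower())
            cleaned.append(norm)
    return cleaned
```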
Binary Reclassification
StepFun 3.5 Flash · temp=0.1 · LLM re-labels every headline as sarcastic / non-sarcastic
Agreement with original NHDSD labels: 80.19%. 5,644 headlines were flagged as disagreements for verification.
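The agreement/disagreement bookkeeping amounts to a simple comparison between the two label lists. A minimal sketch (the helper name `audit_labels` is hypothetical, not from the pipeline):

```python
def audit_labels(original, relabels):
    """Compare LLM relabels against the original binary labels.

    Returns (agreement_rate, indices_of_disagreements); the disagreement
    indices are what get forwarded to the cross-validation stage."""
    assert len(original) == len(relabels)
    disagreements = [i for i, (a, b) in enumerate(zip(original, relabels))
                     if a != b]
    agreement = 1 - len(disagreements) / len(original)
    return agreement, disagreements
```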
Cross-Validation
Nemotron 3 Nano 30B · temp=0.1 · A second independent LLM re-classifies only the disagreements
Of the 5,644 disagreements, 4,076 (72.2%) had StepFun + Nemotron agreeing against the original annotation. This step produces an audit report, not a corrected labels file — the main training pipeline still reads from the raw NHDSD dataset. The CV results are consumed only by the secondary sar→non filtering pipeline.
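The "two LLMs agree against the original annotation" rule can be expressed in a few lines. A sketch under assumed names (`suspected_mislabels` and its list-based inputs are illustrative, not the pipeline's actual interfaces):

```python
def suspected_mislabels(original, stepfun, nemotron, disagreement_idx):
    """A headline is a suspected mislabel when, on a StepFun/NHDSD
    disagreement, Nemotron sides with StepFun rather than the original
    annotation. The result is an audit signal, not a corrected label set."""
    return [i for i in disagreement_idx
            if stepfun[i] == nemotron[i] and stepfun[i] != original[i]]
```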
Pair Generation
StepFun 3.5 Flash · temp=0.7 · batch=30 · For each headline, generate its opposite-style counterpart and tag the strategy
Non-sarcastic headlines (14,948) become non→sar pairs. Sarcastic headlines (13,634) become sar→non pairs. 6 headlines hit content filters and were dropped.
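The batching and direction-tagging logic looks roughly like this. Both helper names are hypothetical; the batch size of 30 is the only value taken from the stage description above.

```python
def make_batches(items, batch_size=30):
    """Split headlines into fixed-size batches, one LLM request per batch."""
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]

def pair_direction(label):
    """Sarcastic sources get a non-sarcastic rewrite, and vice versa."""
    return "sar->non" if label == "sarcastic" else "non->sar"
```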
Strategy Augmentation
StepFun 3.5 Flash · temp=0.8 · For each source, generate 5 more variants covering the missing strategies (6 total per source)
The higher temperature encourages variation across strategy types. Every source ends up with exactly 6 labeled variants — one for each sarcasm strategy.
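Guaranteeing "exactly 6 variants, one per strategy" reduces to computing which strategies a source still lacks before generation. A minimal sketch (function name and variant-dict shape are assumptions):

```python
STRATEGIES = ["sarcasm", "irony", "satire", "overstatement",
              "understatement", "rhetorical_question"]

def missing_strategies(existing_variants):
    """Given the strategy tags already produced for one source headline,
    return the strategies still to be generated, in canonical order.
    After the pair-generation stage each source has one tagged variant,
    so this typically returns the other five."""
    have = {v["strategy"] for v in existing_variants}
    return [s for s in STRATEGIES if s not in have]
```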
Stratified Splits
create_train_val_test_splits.py · Source-level stratified split (seed=42) so all 6 variants of a source stay in the same split
Source-level grouping prevents data leakage: a model can't memorize one variant of a headline and score well on another variant of the same source.
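A source-level split can be sketched by shuffling the unique source IDs (not the variants) and cutting the shuffled list. This is an illustration, not `create_train_val_test_splits.py` itself; the 80/10/10 ratios are an assumption, while seed=42 comes from the stage description.

```python
import random

def source_level_split(source_ids, seed=42, ratios=(0.8, 0.1, 0.1)):
    """Assign whole source headlines to train/val/test so that all six
    strategy variants of a source land in the same split."""
    ids = sorted(set(source_ids))      # dedupe variants down to sources
    rng = random.Random(seed)          # fixed seed for reproducibility
    rng.shuffle(ids)
    n_train = int(len(ids) * ratios[0])
    n_val = int(len(ids) * ratios[1])
    return (set(ids[:n_train]),
            set(ids[n_train:n_train + n_val]),
            set(ids[n_train + n_val:]))
```

Because membership is decided per source ID, a variant can never leak across the split boundary away from its siblings.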
Adapted from the iSarcasm taxonomy. Every source headline is expanded into six strategy-labeled variants so models see the full distribution during training.
Sarcasm
Contradicts the state of affairs with a critical tone toward the addressee.
"Great job breaking the build right before the demo."
Irony
Contradicts the state of affairs without obvious blame or target.
"What a beautiful day for a three-hour traffic jam."
Satire
Appears supportive but contains mockery that reveals absurdity.
"Senate Passes Landmark Bill To Study The Feasibility Of Passing Bills."
Overstatement
Obviously exaggerated terms or impossible quantities.
"I've told you a million times to stop exaggerating."
Understatement
Severe minimization of the importance or severity of something.
"The Titanic experienced some minor hull damage."
Rhetorical Question
A question whose expected answer contradicts reality.
"Is the sky blue? Obviously congress isn't corrupt."
Sarcastic → Non-sarcastic
Built after filtering to cross-validated sarcastic headlines and scraping their articles to drop meme-only entries.
Context-Enhanced
Non-sarcastic targets generated with full article body as context. Different model (Qwen 3.6 Plus), extended strategy set (13 labels).
Cross-validation as a data audit
Every headline StepFun disagreed with was re-checked by Nemotron 3 Nano 30B. When both LLMs agreed against the NHDSD label (4,076 of 5,644 disagreements), we recorded it as a suspected mislabel. These corrections are an audit signal — only the secondary sar→non pipeline actually filters training data using them. The main non→sar pipeline reads the raw NHDSD labels directly.
Source-level stratification
Splits are done at the source headline level, not the variant level. All 6 strategy variants of a single source stay in the same split, preventing a model from memorizing one variant and scoring well on another.
Deterministic labeling, creative generation
Classification passes run at temperature 0.1 for stability, while pair and variant generation run at 0.7–0.8 so the model produces genuinely different rewrites per strategy.
Context-enhanced variant
A second smaller dataset (10,330 pairs) is generated with access to the full article body, producing more factually grounded non-sarcastic rewrites. Used to train the CE and CE+RL BART variants.