Project LLMao
Sarcasm style transfer using fine-tuned language models. We train BART and LLaMA variants to rewrite sarcastic news headlines as neutral, factual equivalents while preserving meaning.
01 Data Pipeline
How 28,619 NHDSD headlines became 89,688 strategy-annotated training pairs through LLM generation and cross-validation.
02 Evaluation
What each of the 7 metrics measures, why we use it, where it breaks down. Read this before the dashboard.
03 Dashboard
Compare 14 models across 7 evaluation metrics with interactive charts and strategy breakdowns.
04 Sample Explorer
Browse 2,857 test samples with filtering, search, and side-by-side model comparison.
05 Playground
Type a sarcastic headline and watch our models rewrite it in real time via LMStudio.
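Under the hood, the Playground talks to a locally running LM Studio server. A minimal sketch of what such a call might look like, assuming LM Studio's default OpenAI-compatible endpoint on port 1234 (the model name and system prompt here are illustrative placeholders, not the project's actual prompt):

```python
import json
import urllib.request

# LM Studio's default OpenAI-compatible chat endpoint (assumption: default port)
LMSTUDIO_URL = "http://localhost:1234/v1/chat/completions"

def build_request(headline: str, model: str = "local-model") -> dict:
    """Build the chat-completion payload for a sarcasm-to-neutral rewrite."""
    return {
        "model": model,  # LM Studio serves whichever model is currently loaded
        "messages": [
            {"role": "system",
             "content": "Rewrite the sarcastic headline as a neutral, "
                        "factual equivalent. Preserve the meaning."},
            {"role": "user", "content": headline},
        ],
        "temperature": 0.3,
    }

def rewrite(headline: str) -> str:
    """Send the headline to the local server and return the rewritten text."""
    req = urllib.request.Request(
        LMSTUDIO_URL,
        data=json.dumps(build_request(headline)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]
```

Because LM Studio mimics the OpenAI API, the same payload works with any OpenAI-compatible client library.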
06 Human Evaluation
140 samples × 3 models × 2 annotators (κ > 0.8). Three sarcasm classifiers all disagree with humans (κ = −0.11 to +0.18) — receipts inside.
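The agreement figures above are Cohen's kappa, which corrects raw agreement for chance. A quick pure-Python sketch of how κ is computed from two annotators' label sequences (toy labels, not the project's data):

```python
from collections import Counter

def cohens_kappa(a: list, b: list) -> float:
    """Cohen's kappa between two annotators' labels on the same items."""
    assert len(a) == len(b) and len(a) > 0
    n = len(a)
    # Observed agreement: fraction of items where the annotators match
    po = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement: expected matches if each labeled independently
    ca, cb = Counter(a), Counter(b)
    pe = sum(ca[k] * cb[k] for k in set(a) | set(b)) / (n * n)
    return (po - pe) / (1 - pe)
```

Values near +1 mean strong agreement beyond chance, 0 means chance-level, and negative values (as with the classifiers above) mean systematic disagreement.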
07 Model Training
BART with context enhancement and REINFORCE + KL penalty. LLaMA 3.2 1B with LoRA fine-tuning. T5 baselines and ablation studies.
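The REINFORCE + KL objective can be sketched in simplified scalar form: a policy-gradient term weighted by the reward, plus a KL penalty keeping the fine-tuned model close to a frozen reference. This is an illustrative toy (β, the distributions, and the scalar shapes are assumptions; the actual training works on token-level logits):

```python
import math

def reinforce_kl_loss(logp_sample: float, reward: float,
                      probs: list, ref_probs: list,
                      beta: float = 0.1) -> float:
    """Toy REINFORCE loss with a KL penalty toward a frozen reference model.

    logp_sample: log-probability of the sampled rewrite under the policy
    reward:      scalar reward for that rewrite
    probs:       policy's output distribution over some vocabulary slice
    ref_probs:   reference model's distribution over the same slice
    beta:        KL penalty weight (illustrative value)
    """
    pg = -reward * logp_sample  # REINFORCE: push up log-prob of rewarded samples
    kl = sum(p * math.log(p / q) for p, q in zip(probs, ref_probs))  # KL(policy || ref)
    return pg + beta * kl
```

When the policy matches the reference, the KL term vanishes and the loss reduces to plain REINFORCE; as the policy drifts, the penalty grows, which is what keeps the rewrites fluent.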