Annotations · Multi-classifier audit

Human Evaluation

We hand-labeled 140 samples across 3 models with 2 independent annotators (κ > 0.8) and ran every output through 3 sarcasm classifiers. The classifiers all disagree with one another, and they all disagree with the human labels. This page is the receipts.
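For readers who want to sanity-check an agreement figure like the κ above, here is a minimal sketch of how such a score can be computed for two annotators. It assumes κ means Cohen's kappa, which this page does not state explicitly. The function name `cohen_kappa` and the toy labels are made up for illustration; they are not the audit data.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labeled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(labels_a)

    # Observed agreement: fraction of items where both annotators match.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Chance agreement, from each annotator's own label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)

    if p_e == 1.0:
        # Both annotators used one identical label for every item.
        return 1.0
    return (p_o - p_e) / (1 - p_e)


# Hypothetical toy labels (illustration only, NOT the 140 audited samples).
annotator_1 = ["sarcastic", "sarcastic", "literal", "literal", "sarcastic",
               "literal", "literal", "sarcastic", "literal", "literal"]
annotator_2 = ["sarcastic", "sarcastic", "literal", "literal", "sarcastic",
               "literal", "literal", "literal", "literal", "literal"]

kappa = cohen_kappa(annotator_1, annotator_2)
print(f"kappa = {kappa:.3f}")
```

The same function can be run pairwise on classifier outputs, or on a classifier against the human labels, to put a number on how much they disagree. Raw percent agreement would overstate things when one label dominates, which is why a chance-corrected score is used here.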