Benchmarks
Classer accuracy across 32 public NLP datasets, compared to GPT-5.4-mini.
| Dataset | Classer | GPT-5.4-mini | Task | Labels | Difficulty |
|---|---|---|---|---|---|
| DBpedia-14 | 98.0% | 98.0% | topic | 14 | medium |
| SMS Spam | 97.0% | 90.5% | moderation | 2 | easy |
| SST-2 | 96.5% | 95.0% | sentiment | 2 | easy |
| IMDB | 96.0% | 94.0% | sentiment | 2 | easy |
| TREC Question | 94.5% | 81.5% | intent | 6 | medium |
| Rotten Tomatoes | 89.0% | 86.0% | sentiment | 2 | easy |
| Amazon Counterfactual | 88.5% | 87.0% | pragmatics | 2 | hard |
| ETHOS | 87.0% | 89.0% | moderation | 2 | hard |
| AG News | 86.0% | 85.0% | topic | 4 | easy |
| RumourEval | 84.5% | 71.0% | misinformation | 4 | hard |
| Tweet Eval Emotion | 82.0% | 84.5% | emotion | 4 | medium |
| CLINC150 | 81.0% | 83.7% | intent | 26 | hard |
| MASSIVE | 79.8% | 74.2% | intent | 18 | hard |
| Clickbait Detection | 79.0% | 73.5% | moderation | 2 | medium |
| Banking77 | 78.0% | 73.0% | intent | 17 | hard |
| Sarcasm Detection | 76.0% | 58.0% | pragmatics | 2 | hard |
| ArXiv Classification | 73.0% | 75.5% | domain-specific | 11 | hard |
| ETHOS (multi-label) | 72.5% | 68.0% | moderation | 8 | hard |
| 20 Newsgroups | 71.7% | 77.3% | topic | 20 | hard |
| Tweet Eval Hate | 71.0% | 73.5% | moderation | 2 | hard |
| LexGLUE SCOTUS | 70.0% | 44.5% | legal | 13 | hard |
| Financial PhraseBank | 69.8% | 24.1% | sentiment | 3 | medium |
| Yelp Review Full | 69.5% | 65.5% | sentiment | 5 | medium |
| Hyperpartisan News | 66.5% | 66.5% | misinformation | 2 | hard |
| App Reviews | 65.0% | 40.5% | sentiment | 5 | medium |
| Medical Abstracts | 65.0% | 65.0% | medical | 5 | medium |
| LexGLUE ECtHR | 63.0% | 15.6% | legal | 10 | hard |
| HateXplain | 60.0% | 55.0% | moderation | 3 | hard |
| Yahoo Answers | 59.5% | 63.0% | topic | 10 | medium |
| Emotion | 58.5% | 56.5% | emotion | 6 | medium |
| LexGLUE Unfair-ToS | 58.5% | 24.9% | legal | 8 | hard |
| Jigsaw Toxicity | 39.0% | 28.4% | moderation | 6 | hard |
| Average | 75.8% | 67.7% | | | |
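
The Average row appears to be an unweighted mean over the 32 per-dataset accuracies. That is an assumption, since the page does not state how the average is computed, but the figures are consistent with it. A quick check in Python, using the values copied from the table (the variable and function names are illustrative only):

```python
# Per-dataset accuracies (%) copied from the table, in table order.
classer = [
    98.0, 97.0, 96.5, 96.0, 94.5, 89.0, 88.5, 87.0,
    86.0, 84.5, 82.0, 81.0, 79.8, 79.0, 78.0, 76.0,
    73.0, 72.5, 71.7, 71.0, 70.0, 69.8, 69.5, 66.5,
    65.0, 65.0, 63.0, 60.0, 59.5, 58.5, 58.5, 39.0,
]
gpt_mini = [
    98.0, 90.5, 95.0, 94.0, 81.5, 86.0, 87.0, 89.0,
    85.0, 71.0, 84.5, 83.7, 74.2, 73.5, 73.0, 58.0,
    75.5, 68.0, 77.3, 73.5, 44.5, 24.1, 65.5, 66.5,
    40.5, 65.0, 15.6, 55.0, 63.0, 56.5, 24.9, 28.4,
]


def mean_accuracy(scores):
    """Unweighted mean across datasets, rounded to one decimal place."""
    return round(sum(scores) / len(scores), 1)


print(len(classer), len(gpt_mini))   # 32 32
print(mean_accuracy(classer))        # 75.8
print(mean_accuracy(gpt_mini))       # 67.7
```

Because every dataset contributes the same number of samples, a sample-weighted mean would give the same result.
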
Methodology
- 200 stratified samples per dataset
- Zero-shot — no training examples provided
- All datasets from HuggingFace
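
The page does not say how the 200 stratified samples are drawn. As a minimal sketch, assuming stratification by class label with proportional allocation, the sampling step could look like this. The function name `stratified_sample`, the `(text, label)` input format, and the fixed seed are all hypothetical, not Classer's actual procedure:

```python
import random
from collections import defaultdict


def stratified_sample(examples, n=200, seed=0):
    """Draw n (text, label) pairs while preserving label proportions.

    Assumption: strata are the class labels, and each label receives a
    share of n proportional to its frequency (largest-remainder rounding).
    """
    if len(examples) <= n:
        return list(examples)

    rng = random.Random(seed)
    by_label = defaultdict(list)
    for text, label in examples:
        by_label[label].append((text, label))

    total = len(examples)
    # Ideal (fractional) share of the sample for each label.
    quotas = {lab: n * len(items) / total for lab, items in by_label.items()}
    counts = {lab: int(q) for lab, q in quotas.items()}

    # Hand out the slots lost to rounding, largest remainder first.
    leftover = n - sum(counts.values())
    by_remainder = sorted(quotas, key=lambda lab: quotas[lab] - counts[lab], reverse=True)
    for lab in by_remainder[:leftover]:
        counts[lab] += 1

    sample = []
    for lab, items in by_label.items():
        sample.extend(rng.sample(items, counts[lab]))
    rng.shuffle(sample)
    return sample


# Toy data: 700 / 200 / 100 examples across three labels.
data = [(f"text {i}", "a") for i in range(700)]
data += [(f"text {i}", "b") for i in range(200)]
data += [(f"text {i}", "c") for i in range(100)]
subset = stratified_sample(data, n=200)
```

With the toy data above, the 200-item subset keeps the 70/20/10 label split of the full set. Very rare labels can receive zero samples under proportional allocation, so a real evaluation might enforce a per-label minimum.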