Benchmarks

Classer accuracy across 32 public NLP datasets, compared to GPT-5.4-mini.

| Dataset | Task | Labels | Classer | GPT-5.4-mini |
|---|---|---:|---:|---:|
| DBpedia-14 | topic | 14 | 98.0% | 98.0% |
| SMS Spam | moderation | 2 | 97.0% | 90.5% |
| SST-2 | sentiment | 2 | 96.5% | 95.0% |
| IMDB | sentiment | 2 | 96.0% | 94.0% |
| TREC Question | intent | 6 | 94.5% | 81.5% |
| Rotten Tomatoes | sentiment | 2 | 89.0% | 86.0% |
| Amazon Counterfactual | pragmatics | 2 | 88.5% | 87.0% |
| ETHOS | moderation | 2 | 87.0% | 89.0% |
| AG News | topic | 4 | 86.0% | 85.0% |
| RumourEval | misinformation | 4 | 84.5% | 71.0% |
| Tweet Eval Emotion | emotion | 4 | 82.0% | 84.5% |
| CLINC150 | intent | 26 | 81.0% | 83.7% |
| MASSIVE | intent | 18 | 79.8% | 74.2% |
| Clickbait Detection | moderation | 2 | 79.0% | 73.5% |
| Banking77 | intent | 17 | 78.0% | 73.0% |
| Sarcasm Detection | pragmatics | 2 | 76.0% | 58.0% |
| ArXiv Classification | domain-specific | 11 | 73.0% | 75.5% |
| ETHOS (multi-label) | moderation | 8 | 72.5% | 68.0% |
| 20 Newsgroups | topic | 20 | 71.7% | 77.3% |
| Tweet Eval Hate | moderation | 2 | 71.0% | 73.5% |
| LexGLUE SCOTUS | legal | 13 | 70.0% | 44.5% |
| Financial PhraseBank | sentiment | 3 | 69.8% | 24.1% |
| Yelp Review Full | sentiment | 5 | 69.5% | 65.5% |
| Hyperpartisan News | misinformation | 2 | 66.5% | 66.5% |
| App Reviews | sentiment | 5 | 65.0% | 40.5% |
| Medical Abstracts | medical | 5 | 65.0% | 65.0% |
| LexGLUE ECtHR | legal | 10 | 63.0% | 15.6% |
| HateXplain | moderation | 3 | 60.0% | 55.0% |
| Yahoo Answers | topic | 10 | 59.5% | 63.0% |
| Emotion | emotion | 6 | 58.5% | 56.5% |
| LexGLUE Unfair-ToS | legal | 8 | 58.5% | 24.9% |
| Jigsaw Toxicity | moderation | 6 | 39.0% | 28.4% |
| **Average** | | | **75.8%** | **67.7%** |

Methodology

  • 200 stratified samples per dataset
  • Zero-shot: no training examples were provided to either model
  • All datasets were sourced from HuggingFace
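The stratified-sampling step above can be sketched as follows. This is a minimal illustration under stated assumptions, not the actual benchmark harness: the `stratified_sample` helper and its proportional-allocation strategy are hypothetical, and real runs would typically draw from a HuggingFace dataset split instead of a list of dicts.

```python
import random
from collections import defaultdict

def stratified_sample(examples, label_key, n, seed=0):
    """Draw n examples whose label distribution mirrors the full dataset's.

    `examples` is a list of dicts and `label_key` names the label field.
    Illustrative sketch only, not the exact evaluation pipeline.
    """
    rng = random.Random(seed)

    # Group examples by label.
    by_label = defaultdict(list)
    for ex in examples:
        by_label[ex[label_key]].append(ex)

    sample = []
    total = len(examples)
    for label, group in by_label.items():
        # Allocate slots proportionally; keep at least one per label.
        k = max(1, round(n * len(group) / total))
        sample.extend(rng.sample(group, min(k, len(group))))

    rng.shuffle(sample)
    return sample[:n]  # trim any rounding overshoot
```

For a dataset that is 80% label 0 and 20% label 1, a 200-item sample comes back with roughly 160 and 40 examples respectively, so per-class accuracy on rare labels is still measurable.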