Benchmarks

Classer accuracy across 32 public NLP datasets, compared to GPT-5.4-mini.

Dataset	Classer	GPT-5.4-mini	Category	Labels	Difficulty
DBpedia-14	98.0%	98.0%	topic	14	medium
SMS Spam	97.0%	90.5%	moderation	2	easy
SST-2	96.5%	95.0%	sentiment	2	easy
IMDB	96.0%	94.0%	sentiment	2	easy
TREC Question	94.5%	81.5%	intent	6	medium
Rotten Tomatoes	89.0%	86.0%	sentiment	2	easy
Amazon Counterfactual	88.5%	87.0%	pragmatics	2	hard
ETHOS	87.0%	89.0%	moderation	2	hard
AG News	86.0%	85.0%	topic	4	easy
RumourEval	84.5%	71.0%	misinformation	4	hard
Tweet Eval Emotion	82.0%	84.5%	emotion	4	medium
CLINC150	81.0%	83.7%	intent	26	hard
MASSIVE	79.8%	74.2%	intent	18	hard
Clickbait Detection	79.0%	73.5%	moderation	2	medium
Banking77	78.0%	73.0%	intent	17	hard
Sarcasm Detection	76.0%	58.0%	pragmatics	2	hard
ArXiv Classification	73.0%	75.5%	domain-specific	11	hard
ETHOS (multi-label)	72.5%	68.0%	moderation	8	hard
20 Newsgroups	71.7%	77.3%	topic	20	hard
Tweet Eval Hate	71.0%	73.5%	moderation	2	hard
LexGLUE SCOTUS	70.0%	44.5%	legal	13	hard
Financial PhraseBank	69.8%	24.1%	sentiment	3	medium
Yelp Review Full	69.5%	65.5%	sentiment	5	medium
Hyperpartisan News	66.5%	66.5%	misinformation	2	hard
App Reviews	65.0%	40.5%	sentiment	5	medium
Medical Abstracts	65.0%	65.0%	medical	5	medium
LexGLUE ECtHR	63.0%	15.6%	legal	10	hard
HateXplain	60.0%	55.0%	moderation	3	hard
Yahoo Answers	59.5%	63.0%	topic	10	medium
Emotion	58.5%	56.5%	emotion	6	medium
LexGLUE Unfair-ToS	58.5%	24.9%	legal	8	hard
Jigsaw Toxicity	39.0%	28.4%	moderation	6	hard
Average	75.8%	67.7%
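
The Average row appears to be the unweighted mean over the 32 datasets, so every dataset counts equally regardless of label count or difficulty. The short Python sketch below recomputes it from the table values as a sanity check; the variable names are illustrative only and are not part of Classer.

```python
# (Classer, GPT-5.4-mini) accuracy in percent, copied from the table in row order.
scores = [
    (98.0, 98.0), (97.0, 90.5), (96.5, 95.0), (96.0, 94.0),
    (94.5, 81.5), (89.0, 86.0), (88.5, 87.0), (87.0, 89.0),
    (86.0, 85.0), (84.5, 71.0), (82.0, 84.5), (81.0, 83.7),
    (79.8, 74.2), (79.0, 73.5), (78.0, 73.0), (76.0, 58.0),
    (73.0, 75.5), (72.5, 68.0), (71.7, 77.3), (71.0, 73.5),
    (70.0, 44.5), (69.8, 24.1), (69.5, 65.5), (66.5, 66.5),
    (65.0, 40.5), (65.0, 65.0), (63.0, 15.6), (60.0, 55.0),
    (59.5, 63.0), (58.5, 56.5), (58.5, 24.9), (39.0, 28.4),
]

# Unweighted (macro) mean: each dataset contributes equally.
classer_avg = sum(c for c, _ in scores) / len(scores)
gpt_avg = sum(g for _, g in scores) / len(scores)

print(f"Classer: {classer_avg:.1f}%  GPT-5.4-mini: {gpt_avg:.1f}%")
```

Rounded to one decimal place, the two means match the 75.8% and 67.7% shown in the Average row.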

Methodology

200 stratified samples per dataset
Zero-shot: no training examples provided
All datasets sourced from Hugging Face
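
The page does not say how the 200 stratified samples are drawn. As a minimal sketch, assuming "stratified" means each label keeps roughly its share of the full dataset, a sampler could look like the following. The function name `stratified_sample`, its parameters, and the proportional-allocation rule are all assumptions made for illustration, not the Classer evaluation code.

```python
import random
from collections import Counter, defaultdict


def stratified_sample(examples, n_total=200, seed=0):
    """Draw n_total (text, label) pairs, preserving label proportions.

    Hypothetical helper: assumes proportional allocation per label.
    """
    rng = random.Random(seed)

    # Group examples by label.
    by_label = defaultdict(list)
    for text, label in examples:
        by_label[label].append((text, label))

    total = len(examples)
    n_total = min(n_total, total)

    # Ideal (fractional) quota per label, then round down.
    quotas = {lab: n_total * len(items) / total for lab, items in by_label.items()}
    alloc = {lab: int(q) for lab, q in quotas.items()}

    # Hand out the leftover slots to the largest fractional remainders.
    leftover = n_total - sum(alloc.values())
    by_remainder = sorted(quotas, key=lambda lab: quotas[lab] - alloc[lab], reverse=True)
    for lab in by_remainder[:leftover]:
        alloc[lab] += 1

    # Sample without replacement inside each label, then shuffle.
    sample = []
    for lab, items in by_label.items():
        sample.extend(rng.sample(items, alloc[lab]))
    rng.shuffle(sample)
    return sample


# Toy data: 70% / 20% / 10% label split.
toy = (
    [(f"text {i}", "a") for i in range(700)]
    + [(f"text {i}", "b") for i in range(200)]
    + [(f"text {i}", "c") for i in range(100)]
)
picked = stratified_sample(toy, n_total=200, seed=1)
print(Counter(label for _, label in picked))
```

Fixing the seed keeps the evaluation subset reproducible across runs. An equal-count-per-label scheme would be another reasonable reading of "stratified" and would change only the quota step.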