EvasionBench

A Large-Scale Benchmark for Detecting Managerial Evasion in Earnings Call Q&A

📄 Paper (arXiv) đŸ’» GitHub đŸ€— Model 📊 Dataset

Abstract

We introduce EvasionBench, a large-scale benchmark for detecting managerial evasion in earnings call Q&A sessions. Drawing on 22.7 million Q&A pairs from S&P Capital IQ, we develop a three-level evasion taxonomy (direct, intermediate, fully evasive) and a Multi-Model Consensus (MMC) annotation framework built on frontier LLMs. The benchmark comprises 84K balanced training samples and a 1K gold-standard evaluation set with human validation (Cohen's Îș = 0.835). We also present Eva-4B, a fine-tuned Qwen3-4B model that achieves 84.9% Macro-F1, outperforming larger frontier models including Claude Opus 4.5 and GPT-5.2.
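The Cohen's Îș value above reflects human validation of the gold evaluation set. As a minimal sketch of how such agreement can be computed with scikit-learn (the label arrays below are illustrative, not the released data):

```python
# Illustrative agreement check between two label sources (toy data, not the
# released gold set). cohen_kappa_score corrects raw agreement for chance.
from sklearn.metrics import cohen_kappa_score

human = ["direct", "intermediate", "fully_evasive", "direct", "intermediate"]
model = ["direct", "intermediate", "fully_evasive", "direct", "direct"]

print(f"Cohen's kappa: {cohen_kappa_score(human, model):.3f}")
```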

Key Statistics

22.7M raw Q&A pairs
84K training samples
1K gold evaluation set
84.9% Eva-4B Macro-F1

Evasion Taxonomy

✅ Direct

The core question is answered explicitly, with a clear figure, a "Yes/No" stance, or a direct explanation.

Q: "What is the expected margin for Q4?"
A: "We expect it to be 32%."

⚠ Intermediate

The response provides related context but sidesteps the core question through hedging or by answering an adjacent topic.

Q: "What is the expected margin for Q4?"
A: "We expect margins to improve relative to Q3."

❌ Fully Evasive

The question is ignored or explicitly refused, or the response is entirely off-topic and provides no relevant information.

Q: "What is the expected margin for Q4?"
A: "We are focused on driving long-term shareholder value."

Multi-Model Consensus (MMC) Framework

Our MMC framework leverages multiple frontier LLMs for annotation, with a three-judge majority voting mechanism to resolve disagreements.

Figure 1: The Multi-Model Consensus (MMC) annotation pipeline.
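A minimal sketch of the three-judge majority vote, assuming each judge emits one taxonomy label per Q&A pair. How full three-way disagreements are handled is an assumption here (flagged for escalation), not necessarily the paper's exact protocol.

```python
# Majority voting over three judge labels. With three judges, any label that
# appears at least twice wins; a three-way split returns None for escalation
# (an assumption for this sketch, not necessarily the paper's tie-break rule).
from collections import Counter
from typing import Optional

def consensus_label(judge_labels: list[str]) -> Optional[str]:
    label, count = Counter(judge_labels).most_common(1)[0]
    return label if count >= 2 else None

print(consensus_label(["direct", "direct", "intermediate"]))         # -> direct
print(consensus_label(["direct", "intermediate", "fully_evasive"]))  # -> None
```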

Model Performance

Top 5 models on the EvasionBench 1K evaluation set. Eva-4B (Full) achieves the highest Macro-F1, outperforming frontier LLMs including Gemini 3 Flash and Claude Opus 4.5.

Figure: Top 5 model performance comparison (Macro-F1 %).

Leaderboard

Models are ranked by Macro-F1 on the 1K evaluation set.

| Rank | Model | Category | Accuracy | Macro-F1 |
|------|-------|----------|----------|----------|
| 1 | Eva-4B (Full) | Eva-4B | 84.8% | 84.9% |
| 2 | Gemini 3 Flash | Closed-Source | 84.6% | 84.64% |
| 3 | Claude Opus 4.5 | Closed-Source | 84.1% | 84.38% |
| 4 | GLM-4.7 | Open-Source | 83.1% | 82.91% |
| 5 | Eva-4B (Consensus) | Eva-4B | 81.0% | 81.37% |
| 6 | GPT-5.2 | Closed-Source | 80.8% | 80.90% |
| 7 | Eva-4B (Opus Only) | Eva-4B | 80.6% | 80.61% |
| 8 | Qwen3-Coder | Open-Source | 78.0% | 78.16% |
| 9 | MiniMax-M2.1 | Open-Source | 71.8% | 71.31% |
| 10 | DeepSeek-V3.2 | Open-Source | 66.7% | 66.88% |
| 11 | Kimi-K2 | Open-Source | 67.8% | 66.68% |
| 12 | Qwen3-4B (Base) | Base Model | 42.3% | 34.30% |
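Eva-4B is a fine-tuned Qwen3-4B, so a plain Hugging Face Transformers call should suffice for inference. A minimal sketch; the model ID, prompt wording, and expected output format below are placeholders, not the released configuration.

```python
# Hypothetical inference sketch for Eva-4B as a causal LM classifier.
# MODEL_ID and the instruction text are placeholders.
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "org-name/Eva-4B"  # placeholder, see the Model link above

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, device_map="auto")

prompt = (
    "Classify the management answer as direct, intermediate, or fully evasive.\n"
    "Q: What is the expected margin for Q4?\n"
    "A: We are focused on driving long-term shareholder value."
)
inputs = tokenizer.apply_chat_template(
    [{"role": "user", "content": prompt}],
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)

output = model.generate(inputs, max_new_tokens=16)
print(tokenizer.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True))
```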

Eva-4B Training & Ablation Study

Figure 2: Two-stage training pipeline for Eva-4B.

Figure 3: Ablation study comparing model variants.

Training Loss Curve

Figure 4: Training loss curves for two-stage fine-tuning.

Confusion Matrix

Figure 5: Eva-4B (Full) confusion matrix on 1K evaluation set.
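The accuracy, Macro-F1, and confusion matrix reported above can be reproduced from any model's predictions with scikit-learn. A minimal sketch with illustrative labels, not actual predictions from the evaluation set:

```python
# Evaluation-metric sketch with toy labels; replace gold/pred with real data.
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score

LABELS = ["direct", "intermediate", "fully_evasive"]  # fixes row/column order
gold = ["direct", "intermediate", "fully_evasive", "intermediate", "direct"]
pred = ["direct", "intermediate", "intermediate", "intermediate", "direct"]

print("Accuracy:", accuracy_score(gold, pred))
print("Macro-F1:", f1_score(gold, pred, average="macro", labels=LABELS))
print(confusion_matrix(gold, pred, labels=LABELS))
```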

LLM Judge Analysis

Judge Label Distribution

Figure 6: Label distribution across three LLM judges.

Position Bias Analysis

Figure 7: Position bias analysis in LLM-as-judge settings.
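One common way to probe position bias in an LLM-as-judge setup is to re-run each judgment with the presentation order of the candidate labels reversed and count how often the verdict flips. The sketch below assumes that protocol, which may differ from the paper's analysis.

```python
# Hypothetical position-bias probe: fraction of items whose judge label changes
# when the prompt's option order is reversed. Protocol and data are illustrative.
def flip_rate(labels_original: list[str], labels_reversed: list[str]) -> float:
    flips = sum(a != b for a, b in zip(labels_original, labels_reversed))
    return flips / len(labels_original)

print(flip_rate(["direct", "intermediate", "direct"],
                ["direct", "fully_evasive", "direct"]))  # -> 0.333...
```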

Citation

If you find EvasionBench useful, please cite our paper:

@article{evasionbench2026,
  title={EvasionBench: A Large-Scale Benchmark for Detecting Managerial Evasion in Earnings Call Q&A},
  author={...},
  journal={arXiv preprint arXiv:2602.xxxxx},
  year={2026}
}