A Large-Scale Benchmark for Detecting Managerial Evasion in Earnings Call Q&A
We introduce EvasionBench, a large-scale benchmark for detecting managerial evasion in earnings call Q&A sessions. Built from 22.7 million Q&A pairs drawn from S&P Capital IQ, the benchmark pairs a three-level evasion taxonomy (direct, intermediate, fully evasive) with a Multi-Model Consensus (MMC) annotation framework that uses frontier LLMs. It comprises 84K balanced training samples and a 1K gold-standard evaluation set with human validation (Cohen's κ = 0.835). We also introduce Eva-4B, a fine-tuned Qwen3-4B model that achieves 84.9% Macro-F1, outperforming larger frontier models including Claude Opus 4.5 and GPT-5.2.
The three taxonomy levels are defined as follows:

- **Direct:** The core question is directly and explicitly answered with clear figures, a "Yes/No" stance, or direct explanations.
- **Intermediate:** The response provides related context but sidesteps the core question through hedging or by answering adjacent topics.
- **Fully Evasive:** The question is ignored, explicitly refused, or the response is entirely off-topic with no relevant information.
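The taxonomy maps naturally onto an ordinal label set. A minimal sketch, assuming a 0/1/2 integer encoding (the encoding itself is our assumption, not specified above):

```python
from enum import IntEnum

class EvasionLabel(IntEnum):
    """EvasionBench's three-level evasion taxonomy.

    The integer encoding (0/1/2) is illustrative; any consistent
    mapping works as long as judges and metrics agree on it.
    """
    DIRECT = 0         # core question answered explicitly
    INTERMEDIATE = 1   # related context, but the core question is sidestepped
    FULLY_EVASIVE = 2  # ignored, refused, or entirely off-topic
```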
Our MMC framework leverages multiple frontier LLMs for annotation, with a three-judge majority voting mechanism to resolve disagreements.
Figure 1: The Multi-Model Consensus (MMC) annotation pipeline.
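A minimal sketch of the majority-vote step, assuming each of three judges emits one label per Q&A pair (the function name and tie handling are our assumptions; the actual pipeline may resolve three-way splits differently, e.g. by escalation or discarding):

```python
from collections import Counter

def mmc_label(judge_labels: list[int]) -> int | None:
    """Return the majority label from three LLM judges, or None on a
    three-way split (such pairs need a separate resolution step)."""
    (label, count), = Counter(judge_labels).most_common(1)
    return label if count >= 2 else None

# Example: two of three judges agree on INTERMEDIATE (1).
assert mmc_label([1, 1, 2]) == 1
assert mmc_label([0, 1, 2]) is None  # three-way disagreement
```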
Eva-4B (Full) achieves the highest Macro-F1 on the EvasionBench 1K evaluation set, outperforming frontier LLMs including Gemini 3 Flash and Claude Opus 4.5. The figure below compares the top 5 models; the table that follows lists the full leaderboard.
Figure: Top 5 model performance comparison (Macro-F1 %).
Table 1: Full leaderboard on the EvasionBench 1K evaluation set, ranked by Macro-F1.

| Rank | Model | Category | Accuracy | Macro-F1 |
|---|---|---|---|---|
| 1 | Eva-4B (Full) | Eva-4B | 84.8% | 84.90% |
| 2 | Gemini 3 Flash | Closed-Source | 84.6% | 84.64% |
| 3 | Claude Opus 4.5 | Closed-Source | 84.1% | 84.38% |
| 4 | GLM-4.7 | Open-Source | 83.1% | 82.91% |
| 5 | Eva-4B (Consensus) | Eva-4B | 81.0% | 81.37% |
| 6 | GPT-5.2 | Closed-Source | 80.8% | 80.90% |
| 7 | Eva-4B (Opus Only) | Eva-4B | 80.6% | 80.61% |
| 8 | Qwen3-Coder | Open-Source | 78.0% | 78.16% |
| 9 | MiniMax-M2.1 | Open-Source | 71.8% | 71.31% |
| 10 | DeepSeek-V3.2 | Open-Source | 66.7% | 66.88% |
| 11 | Kimi-K2 | Open-Source | 67.8% | 66.68% |
| 12 | Qwen3-4B (Base) | Base Model | 42.3% | 34.30% |
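Both metrics in the table can be reproduced from raw predictions with standard tooling. A minimal sketch using scikit-learn (the arrays here are stand-in data, not benchmark outputs):

```python
from sklearn.metrics import accuracy_score, f1_score

# Stand-in gold and predicted labels over the three classes (0/1/2);
# on the real benchmark these come from the 1K evaluation set.
y_true = [0, 1, 2, 1, 0, 2, 2, 1]
y_pred = [0, 1, 1, 1, 0, 2, 0, 1]

accuracy = accuracy_score(y_true, y_pred)
# Macro-F1: unweighted mean of per-class F1, so each evasion level
# counts equally regardless of class frequency.
macro_f1 = f1_score(y_true, y_pred, average="macro")
print(f"Accuracy: {accuracy:.1%}  Macro-F1: {macro_f1:.2%}")
```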
Figure 2: Two-stage training pipeline for Eva-4B.
Figure 3: Ablation study comparing model variants.
Figure 4: Training loss curves for two-stage fine-tuning.
Figure 5: Eva-4B (Full) confusion matrix on 1K evaluation set.
Figure 6: Label distribution across three LLM judges.
Figure 7: Position bias analysis in LLM-as-judge settings.
If you find EvasionBench useful, please cite our paper: