DramaBench: A Six-Dimensional Evaluation Framework for Drama Script Continuation

Ma, Shijian; Huang, Yunqi; Lin, Yan

What is DramaBench?

DramaBench is the first large-scale benchmark for evaluating drama script continuation across six independent dimensions. Our hybrid evaluation system combines rule-based analysis with LLM-based labeling to provide objective, reproducible assessments.

🎯

Six Dimensions

Comprehensive evaluation across Format Standards, Narrative Efficiency, Character Consistency, Emotional Depth, Logic Consistency, and Conflict Handling.

🔬

Rigorous Validation

252 statistical significance tests (65.9% significant), human validation on 188 scripts, and comprehensive ablation studies confirming dimension independence.

📊

Large-Scale Analysis

8,824 total evaluations across 8 state-of-the-art language models, with detailed error taxonomy and model-specific profiles.

Six Evaluation Dimensions

Each dimension captures distinct quality aspects with independent metrics

RULE-BASED

Format Standards

Metrics: Format Error Rate, Novelization Index, Dialogue-Action Ratio

100% reproducible screenplay format validation using Fountain syntax standards.
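As a rough illustration of what rule-based format analysis can look like, the sketch below labels lines with a heavily simplified subset of Fountain rules and derives a dialogue-to-action ratio. The function names, the simplified rules, and the ratio definition (dialogue lines divided by action lines, with parentheticals counted as dialogue) are assumptions made for this example, not DramaBench's actual implementation.

```python
import re

# Simplified Fountain rule (assumption): a scene heading starts with
# INT, EXT, EST, INT./EXT, or I/E followed by a dot or a space.
SCENE_HEADING = re.compile(r"^(INT\.?/EXT|INT|EXT|EST|I/E)[. ]", re.IGNORECASE)


def classify_lines(script: str) -> list[tuple[str, str]]:
    """Label each non-blank line as scene, character, dialogue, or action."""
    lines = script.splitlines()
    labels = []
    in_dialogue = False
    for i, raw in enumerate(lines):
        line = raw.strip()
        if not line:
            in_dialogue = False  # a blank line ends the current dialogue block
            continue
        prev_blank = i == 0 or not lines[i - 1].strip()
        next_filled = i + 1 < len(lines) and bool(lines[i + 1].strip())
        if in_dialogue:
            kind = "dialogue"  # includes parentheticals in this sketch
        elif SCENE_HEADING.match(line):
            kind = "scene"
        elif (prev_blank and next_filled and line == line.upper()
              and any(c.isalpha() for c in line)):
            kind = "character"  # uppercase cue followed directly by speech
            in_dialogue = True
        else:
            kind = "action"
        labels.append((kind, line))
    return labels


def dialogue_action_ratio(script: str) -> float:
    """Hypothetical metric: dialogue line count divided by action line count."""
    kinds = [kind for kind, _ in classify_lines(script)]
    dialogue, action = kinds.count("dialogue"), kinds.count("action")
    return dialogue / action if action else float("inf")


SAMPLE = "\n".join([
    "INT. KITCHEN - NIGHT",
    "",
    "Anna slams the door.",
    "",
    "ANNA",
    "You lied to me.",
    "",
    "BEN",
    "(quietly)",
    "I had no choice.",
])

print(dialogue_action_ratio(SAMPLE))  # 3.0
```

Because every decision is a fixed rule on the text, rerunning the check on the same script always yields the same labels, which is the property that makes this dimension reproducible.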

LLM-LABELED

Narrative Efficiency

Metrics: Effective Narrative Rate (ENR), Beats Per Page

Event-level annotation: each beat is classified as Driver, Static, or Redundant.
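One plausible reading of the Effective Narrative Rate is the share of annotated beats that move the story forward. The sketch below assumes ENR equals the number of Driver beats divided by the total number of beats. That formula and the function name are illustrative assumptions, not a definition taken from the paper.

```python
from collections import Counter


def effective_narrative_rate(beat_labels: list[str]) -> float:
    """Assumed ENR: fraction of beats labeled 'Driver' among all labeled beats."""
    counts = Counter(beat_labels)
    total = sum(counts.values())
    if total == 0:
        return 0.0  # no beats annotated, so nothing advanced the plot
    return counts["Driver"] / total


# Hypothetical beat labels for one continuation.
beats = ["Driver", "Static", "Driver", "Redundant"]
print(effective_narrative_rate(beats))  # 0.5
```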

LLM-LABELED

Character Consistency

Metrics: Out-of-Character Rate, Voice Distinctiveness

Dialogue-level annotation with character persona profiles.

LLM-LABELED

Emotional Depth

Metrics: Arc Score, Complexity Ratio

Scene-level emotional tracking with valence and arousal dimensions.

LLM-LABELED

Logic Consistency

Metrics: Logic Break Rate, Context Coherence

Fact-level atomic verification ensuring narrative coherence.

LLM-LABELED

Conflict Handling

Metrics: Conflict Score, Drop Rate

Global-level conflict classification: Escalation/Resolution/Maintained/Dropped.

Evaluated Models

8 state-of-the-art language models from leading AI research labs

GPT-5.2

OpenAI

Rank #1

Gemini 3 Pro

Google DeepMind

Rank #2

Qwen3-Max

Alibaba Cloud

Rank #3

Claude Opus 4.5

Anthropic

Rank #4

DeepSeek V3.2

DeepSeek

Rank #5

MiniMax M2

MiniMax

Rank #6

Kimi K2 Thinking

Moonshot AI

Rank #7

GLM-4.6

Zhipu AI

Rank #8

Key Results

Statistical Significance

166/252 comparisons significant

65.9% of pairwise comparisons remain statistically significant after FDR correction (p < 0.05).
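The page does not say which FDR procedure was applied. The sketch below assumes the widely used Benjamini-Hochberg step-up procedure and shows how a list of p-values from pairwise comparisons would be filtered at a 0.05 level. The function name and the example p-values are hypothetical.

```python
def benjamini_hochberg(p_values: list[float], alpha: float = 0.05) -> list[bool]:
    """Return one flag per p-value: True if rejected under Benjamini-Hochberg."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])

    # Find the largest rank k whose sorted p-value satisfies p_(k) <= (k / m) * alpha.
    max_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            max_rank = rank

    # Reject every hypothesis ranked at or below that k (step-up rule).
    rejected = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= max_rank:
            rejected[idx] = True
    return rejected


# Hypothetical p-values from four pairwise model comparisons.
flags = benjamini_hochberg([0.01, 0.04, 0.03, 0.5])
print(flags)  # [True, False, False, False]
print(round(166 / 252 * 100, 1))  # 65.9, the share reported above
```

Note the step-up behavior: a p-value may be rejected even if it misses its own threshold, as long as a larger-ranked p-value meets its threshold.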

Dimension Independence

Mean |r| = 0.020

The near-zero mean absolute correlation between dimensions indicates that each captures a distinct quality aspect.
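A mean |r| statistic of this kind can be computed by averaging the absolute Pearson correlation over every pair of score columns. The sketch below assumes one score list per dimension, aligned by script. With six dimensions that gives 15 pairs. The data layout and function names are assumptions, and the paper may compute the statistic over individual metrics rather than dimensions.

```python
from itertools import combinations
from math import sqrt


def pearson(x: list[float], y: list[float]) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)


def mean_abs_correlation(scores: dict[str, list[float]]) -> float:
    """Average |r| over all unordered pairs of score columns."""
    pairs = list(combinations(scores, 2))
    return sum(abs(pearson(scores[a], scores[b])) for a, b in pairs) / len(pairs)


# Hypothetical per-script scores for two dimensions that are uncorrelated.
example = {
    "logic": [1.0, 2.0, 3.0, 4.0],
    "conflict": [1.0, -1.0, -1.0, 1.0],
}
print(mean_abs_correlation(example))  # 0.0
```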

Human-LLM Agreement

Strong agreement on 3/5 dimensions

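Logic (r=0.48***), Emotional Depth (κ=0.53), and Conflict (κ=0.42) show the closest human-LLM alignment; under the common Landis-Koch reading, κ values between 0.41 and 0.60 indicate moderate agreement.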
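For categorical labels, κ is commonly Cohen's kappa, which corrects raw agreement for the agreement expected by chance. The sketch below assumes the unweighted form with one human label and one LLM label per script. Whether the paper uses this variant or a weighted one is not stated here, and the example labels are hypothetical.

```python
from collections import Counter


def cohens_kappa(rater_a: list[str], rater_b: list[str]) -> float:
    """Unweighted Cohen's kappa for two raters labeling the same items."""
    n = len(rater_a)
    # Observed agreement: fraction of items with identical labels.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of the raters' marginal label frequencies.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical conflict labels from a human annotator and an LLM.
human = ["Escalation", "Escalation", "Dropped", "Dropped"]
llm = ["Escalation", "Dropped", "Dropped", "Dropped"]
print(cohens_kappa(human, llm))  # 0.5
```

Here the raters agree on 3 of 4 items (0.75), chance agreement is 0.5, and kappa is (0.75 - 0.5) / (1 - 0.5) = 0.5.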

📖 Citation

If you use DramaBench in your research, please cite our paper:

@misc{ma2025dramabenchsixdimensionalevaluationframework,
  title={DramaBench: A Six-Dimensional Evaluation Framework for Drama Script Continuation},
  author={Shijian Ma and Yunqi Huang and Yan Lin},
  year={2025},
  eprint={2512.19012},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2512.19012}
}
