DramaBench

A Six-Dimensional Evaluation Framework for Drama Script Continuation

1,103 Unique Scripts
8,824 Evaluations
8 SOTA Models
6 Dimensions

What is DramaBench?

DramaBench is the first large-scale benchmark for evaluating drama script continuation across six independent dimensions. Our hybrid evaluation system combines rule-based analysis with LLM-based labeling to provide objective, reproducible assessments.

🎯
Six Dimensions
Comprehensive evaluation across Format Standards, Narrative Efficiency, Character Consistency, Emotional Depth, Logic Consistency, and Conflict Handling.
🔬
Rigorous Validation
252 statistical significance tests (65.9% significant), human validation on 188 scripts, and comprehensive ablation studies confirming dimension independence.
📊
Large-Scale Analysis
8,824 total evaluations across 8 state-of-the-art language models, with a detailed error taxonomy and model-specific profiles.

Six Evaluation Dimensions

Each dimension captures distinct quality aspects with independent metrics

RULE-BASED
Format Standards

Metrics: Format Error Rate, Novelization Index, Dialogue-Action Ratio

100% reproducible screenplay format validation against the Fountain syntax standard.
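
To make the rule-based idea concrete, here is a minimal sketch of how a Fountain-style check could compute a Format Error Rate. The regular expressions and the all-caps heuristic are illustrative assumptions, not DramaBench's actual rule set.

```python
import re

# Illustrative Fountain-style element patterns (assumed, not DramaBench's actual rules).
SCENE_HEADING = re.compile(r"^(INT|EXT|EST|I/E)[.\s]", re.IGNORECASE)
CHARACTER_CUE = re.compile(r"^[A-Z][A-Z0-9 .'\-]*(\(CONT'D\)|\(V\.O\.\)|\(O\.S\.\))?$")
PARENTHETICAL = re.compile(r"^\(.*\)$")

def format_error_rate(script_lines):
    """Share of non-blank lines flagged as malformed screenplay elements (sketch)."""
    checked, errors = 0, 0
    for line in script_lines:
        stripped = line.strip()
        if not stripped:
            continue
        checked += 1
        # Heuristic: an all-caps line should be a scene heading, character cue,
        # or parenthetical; anything else is counted as a format error.
        if stripped.isupper() and not (
            SCENE_HEADING.match(stripped)
            or CHARACTER_CUE.match(stripped)
            or PARENTHETICAL.match(stripped)
        ):
            errors += 1
    return errors / checked if checked else 0.0
```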

LLM-LABELED
Narrative Efficiency

Metrics: Effective Narrative Rate (ENR), Beats Per Page

Event-level annotation: Driver/Static/Redundant beat classification.
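
A minimal sketch of how the two metrics could be aggregated once beats carry Driver/Static/Redundant labels. The exact ENR formula (e.g., how Redundant beats are weighted) is an assumption, and the upstream LLM annotation step is not shown.

```python
from collections import Counter

def effective_narrative_rate(beat_labels):
    """ENR sketch: share of annotated beats labeled as plot-driving ("Driver")."""
    counts = Counter(beat_labels)  # e.g. ["Driver", "Static", "Redundant", ...]
    total = sum(counts.values())
    return counts["Driver"] / total if total else 0.0

def beats_per_page(beat_labels, page_count):
    """Density of annotated beats, normalized by script length in pages."""
    return len(beat_labels) / page_count if page_count else 0.0
```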

LLM-LABELED
Character Consistency

Metrics: Out-of-Character Rate, Voice Distinctiveness

Dialogue-level annotation with character persona profiles.

LLM-LABELED
Emotional Depth

Metrics: Arc Score, Complexity Ratio

Scene-level emotional tracking with valence and arousal dimensions.

LLM-LABELED
Logic Consistency

Metrics: Logic Break Rate, Context Coherence

Fact-level atomic verification ensuring narrative coherence.

LLM-LABELED
Conflict Handling

Metrics: Conflict Score, Drop Rate

Global-level conflict classification: Escalation/Resolution/Maintained/Dropped.

Evaluated Models

8 state-of-the-art language models from leading AI research labs

GPT-5.2
OpenAI
Rank #1
Gemini 3 Pro
Google DeepMind
Rank #2
Qwen3-Max
Alibaba Cloud
Rank #3
Claude Opus 4.5
Anthropic
Rank #4
DeepSeek V3.2
DeepSeek
Rank #5
MiniMax M2
MiniMax
Rank #6
Kimi K2 Thinking
Moonshot AI
Rank #7
GLM-4.6
Zhipu AI
Rank #8

Key Results

Statistical Significance

166/252 comparisons significant

65.9% of pairwise model comparisons remain statistically significant after FDR correction at the 0.05 level.
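
For reference, a minimal sketch of Benjamini-Hochberg FDR control over a vector of pairwise p-values; this illustrates the standard procedure, not necessarily DramaBench's exact statistical pipeline.

```python
import numpy as np

def benjamini_hochberg(p_values, alpha=0.05):
    """Boolean mask of which hypotheses survive Benjamini-Hochberg FDR control at `alpha`."""
    p = np.asarray(p_values, dtype=float)
    m = p.size
    order = np.argsort(p)
    ranked = p[order]
    # Largest k (1-indexed) with p_(k) <= (k / m) * alpha; reject the k smallest p-values.
    below = ranked <= (np.arange(1, m + 1) / m) * alpha
    significant = np.zeros(m, dtype=bool)
    if below.any():
        k = np.nonzero(below)[0].max()
        significant[order[: k + 1]] = True
    return significant

# e.g. 252 pairwise p-values -> count how many comparisons stay significant after correction:
# mask = benjamini_hochberg(pairwise_p_values)
# print(f"{mask.sum()}/{mask.size} comparisons significant")
```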

Dimension Independence

Mean |r| = 0.020

Extremely low correlation between dimensions confirms that each captures a distinct aspect of quality.
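
A minimal sketch of how a mean absolute pairwise correlation could be computed from per-script dimension scores; the DataFrame layout (one row per script, one column per dimension) is an assumption.

```python
import numpy as np
import pandas as pd

def mean_abs_dimension_correlation(scores: pd.DataFrame) -> float:
    """Mean |r| over all off-diagonal dimension pairs.

    `scores` has one row per evaluated script and one column per dimension
    (column names are placeholders for the six DramaBench dimensions).
    """
    corr = scores.corr(method="pearson").to_numpy()
    off_diag = corr[~np.eye(corr.shape[0], dtype=bool)]
    return float(np.mean(np.abs(off_diag)))
```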

Human-LLM Agreement

Strong agreement on 3/5 dimensions

Logic Consistency (r = 0.48***), Emotional Depth (κ = 0.53), and Conflict Handling (κ = 0.42) show substantial alignment.
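
A minimal sketch of the two agreement statistics cited here, assuming paired human and LLM annotations are available for the same items; label preprocessing is not shown.

```python
from scipy.stats import pearsonr
from sklearn.metrics import cohen_kappa_score

def agreement_report(human_scores, llm_scores, human_labels, llm_labels):
    """Pearson r for continuous scores, Cohen's kappa for categorical labels."""
    r, p_value = pearsonr(human_scores, llm_scores)       # e.g. Logic Consistency scores
    kappa = cohen_kappa_score(human_labels, llm_labels)   # e.g. Conflict Handling classes
    return {"pearson_r": r, "p_value": p_value, "cohen_kappa": kappa}
```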

Ready to Explore?

View detailed model rankings, explore case studies, and access the complete dataset