A Six-Dimensional Evaluation Framework for Drama Script Continuation
DramaBench is the first large-scale benchmark for evaluating drama script continuation across six independent dimensions. Our hybrid evaluation system combines rule-based analysis with LLM-based labeling to provide objective, reproducible assessments.
Each dimension captures distinct quality aspects with independent metrics
Metrics: Format Error Rate, Novelization Index, Dialogue-Action Ratio
100% reproducible screenplay format validation using Fountain syntax standards.
Metrics: Effective Narrative Rate (ENR), Beats Per Page
Event-level annotation: each beat is classified as Driver, Static, or Redundant.
Metrics: Out-of-Character Rate, Voice Distinctiveness
Dialogue-level annotation with character persona profiles.
Metrics: Arc Score, Complexity Ratio
Scene-level emotional tracking with valence and arousal dimensions.
Metrics: Logic Break Rate, Context Coherence
Fact-level atomic verification ensuring narrative coherence.
Metrics: Conflict Score, Drop Rate
Global-level conflict classification: Escalation/Resolution/Maintained/Dropped.
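As a rough illustration of how label-based metrics like these can be derived from annotations, the sketch below turns a list of labels into a rate. It assumes, without confirmation from the benchmark itself, that Effective Narrative Rate is the share of beats labeled Driver and that Drop Rate is the share of conflicts labeled Dropped. The helper name `label_rate` and the sample labels are hypothetical.

```python
from collections import Counter


def label_rate(labels, target):
    """Return the fraction of labels equal to `target` (0.0 for an empty list)."""
    if not labels:
        return 0.0
    return Counter(labels)[target] / len(labels)


# Hypothetical beat labels for one continuation (assumed label set from the page).
beats = ["Driver", "Static", "Driver", "Redundant"]
# ASSUMPTION: ENR = Driver beats / all beats. The page does not give the formula.
enr = label_rate(beats, "Driver")

# Hypothetical conflict labels across several continuations.
conflicts = ["Escalation", "Dropped", "Maintained", "Resolution"]
# ASSUMPTION: Drop Rate = Dropped conflicts / all conflicts.
drop_rate = label_rate(conflicts, "Dropped")

print(f"ENR = {enr:.2f}, Drop Rate = {drop_rate:.2f}")
```

Keeping the labeling step separate from the arithmetic means the rates are reproducible once the labels are fixed, whatever process produced them.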
8 state-of-the-art language models from leading AI research labs
166/252 comparisons significant
65.9% of pairwise model comparisons remain statistically significant after FDR correction (p < 0.05).
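The page does not say which FDR procedure was used. A minimal sketch of one common choice, the Benjamini-Hochberg step-up procedure, is shown below as an assumption; the function name and the sample p-values are illustrative only.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return a list of booleans: True where the hypothesis is rejected under BH.

    ASSUMPTION: Benjamini-Hochberg is used here for illustration; the benchmark
    may rely on a different FDR procedure.
    """
    m = len(p_values)
    # Indices of the p-values, ordered from smallest to largest p-value.
    order = sorted(range(m), key=lambda i: p_values[i])

    # Find the largest rank k such that p_(k) <= (k / m) * alpha.
    max_k = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            max_k = rank

    # Reject every hypothesis whose rank is at most max_k (step-up rule).
    rejected = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= max_k:
            rejected[idx] = True
    return rejected


# Hypothetical p-values from four pairwise model comparisons.
p_values = [0.02, 0.02, 0.03, 0.5]
significant = benjamini_hochberg(p_values)
share = sum(significant) / len(significant)
print(significant, share)
```

Note the step-up behavior: the smallest p-value here misses its own threshold (0.0125) but is still rejected because a later rank passes, which is what distinguishes this procedure from a per-test cutoff.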
Mean |r| = 0.020
The near-zero mean absolute correlation between dimensions indicates that each captures a distinct aspect of quality.
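One way a statistic like mean |r| could be computed is to take the Pearson correlation for every pair of dimensions and average the absolute values. The sketch below assumes Pearson's r and per-sample dimension scores; the dimension keys and score values are placeholders, not benchmark data.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (neither may be constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


def mean_abs_correlation(scores):
    """Average |r| over all unordered pairs of dimensions in `scores`."""
    pairs = list(combinations(scores, 2))
    return sum(abs(pearson(scores[a], scores[b])) for a, b in pairs) / len(pairs)


# Placeholder scores: one list per dimension, one entry per evaluated script.
scores = {
    "dim_a": [1.0, 2.0, 3.0],
    "dim_b": [2.0, 4.0, 6.0],  # perfectly correlated with dim_a
    "dim_c": [3.0, 2.0, 1.0],  # perfectly anti-correlated with dim_a
}
print(mean_abs_correlation(scores))
```

Taking absolute values before averaging matters: positive and negative correlations would otherwise cancel and understate how much the dimensions overlap.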
Strong agreement on 3/5 dimensions
Logic (r = 0.48***), Emotional Depth (κ = 0.53), and Conflict (κ = 0.42) show moderate alignment.
View detailed model rankings, explore case studies, and access the complete dataset