TY - JOUR
T1 - Automated Generation and Evaluation of Interactive-Fiction Serious Games with Open-Weight LLMs
AU - Rogosch, Finn
AU - Schrader, Andreas
N1 - Publisher Copyright: © 2026 by the authors.
PY - 2026/3/18
Y1 - 2026/3/18
N2 - This work investigates whether open-weight large language models can automatically generate runnable and educationally faithful serious games in a constrained, text-only interactive-fiction (IF) setting. The target games are station-based single-player serious games for knowledge assessment, implemented as IF in a structured, machine-readable text format, and used here as a first step towards later ambient scenarios. A fully automated pipeline called SINE (Serious Interactive Narrative Engine) is evaluated with four prompting strategies, grammar-guided decoding, deterministic validation, and a repair agent. Across a staged evaluation with 240 seeds and increasing complexity, finalist configurations reach success rates between roughly 68% and 86% on the joint criterion of compilation, playability, and learning-goal fidelity. Repair iterations proved central to robustness, whereas grammar masking on top of reasoning prompts did not consistently improve outcomes. The study provides a reproducible benchmark setup, open artifacts, and a constrained generation pipeline as a basis for later extensions toward broader serious game scenarios.
AB - This work investigates whether open-weight large language models can automatically generate runnable and educationally faithful serious games in a constrained, text-only interactive-fiction (IF) setting. The target games are station-based single-player serious games for knowledge assessment, implemented as IF in a structured, machine-readable text format, and used here as a first step towards later ambient scenarios. A fully automated pipeline called SINE (Serious Interactive Narrative Engine) is evaluated with four prompting strategies, grammar-guided decoding, deterministic validation, and a repair agent. Across a staged evaluation with 240 seeds and increasing complexity, finalist configurations reach success rates between roughly 68% and 86% on the joint criterion of compilation, playability, and learning-goal fidelity. Repair iterations proved central to robustness, whereas grammar masking on top of reasoning prompts did not consistently improve outcomes. The study provides a reproducible benchmark setup, open artifacts, and a constrained generation pipeline as a basis for later extensions toward broader serious game scenarios.
UR - https://www.scopus.com/pages/publications/105034479977
UR - https://www.mendeley.com/catalogue/72a3653b-8e63-3150-8fb1-e717d5ab9786/
U2 - 10.3390/app16062932
DO - 10.3390/app16062932
M3 - Journal articles
SN - 2076-3417
VL - 16
JO - Applied Sciences (Switzerland)
JF - Applied Sciences (Switzerland)
IS - 6
M1 - 2932
ER -