Evaluating Language Models’ Understanding of Temporal Dependencies in Procedural Texts

Researchers have developed CAT-BENCH, a benchmark for evaluating language models’ ability to reason about the order of steps in cooking recipes. The study found that current state-of-the-art models struggle with this task, even with advanced prompting techniques, revealing gaps in their comprehension of causal and temporal relationships within instructional texts.

CAT-BENCH poses questions about whether one recipe step must occur before or after another, and scores models on precision, recall, and F1. Among the models evaluated, GPT-4-turbo and GPT-3.5-turbo achieved the highest F1 scores in the zero-shot setting, and prompting models to produce explanations alongside their answers improved performance significantly. However, the models exhibited biases and inconsistencies in predicting step dependencies, and human evaluation of model-generated explanations revealed considerable room for improvement in their understanding of those dependencies.

Overall, the study highlights current limitations of language models for plan-based reasoning applications and underscores the need for more datasets focused on predicting and explaining temporal order constraints in instructional plans.
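As a rough illustration of the scoring described above, the sketch below computes precision, recall, and F1 over binary step-dependence judgments. The function name and the toy gold/predicted labels are illustrative assumptions, not taken from the CAT-BENCH release.

```python
# Hypothetical sketch: each benchmark question asks whether one recipe step
# must occur before (or after) another; the model's yes/no answers are
# compared against gold dependence labels. Data below is invented.

def precision_recall_f1(gold, pred):
    """Precision, recall, and F1 for binary dependence predictions."""
    tp = sum(1 for g, p in zip(gold, pred) if g and p)        # correctly predicted dependencies
    fp = sum(1 for g, p in zip(gold, pred) if not g and p)    # spurious dependencies
    fn = sum(1 for g, p in zip(gold, pred) if g and not p)    # missed dependencies
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1

# Toy example: True = "this step pair is order-dependent"
gold = [True, True, False, True, False]
pred = [True, False, False, True, True]
p, r, f = precision_recall_f1(gold, pred)
```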


Source link: https://www.marktechpost.com/2024/06/30/cat-bench-evaluating-language-models-understanding-of-temporal-dependencies-in-procedural-texts/?amp

