


20th BEA 2025: Vienna, Austria
- Ekaterina Kochmar, Bashar Alhafni, Marie Bexte, Jill Burstein, Andrea Horbach, Ronja Laarmann-Quante, Anaïs Tack, Victoria Yaneva, Zheng Yuan:
  Proceedings of the 20th Workshop on Innovative Use of NLP for Building Educational Applications, BEA 2025, Vienna, Austria, July 31 - August 1, 2025. Association for Computational Linguistics 2025, ISBN 979-8-89176-270-1
- Sankalan Pal Chowdhury, Nico Daheim, Ekaterina Kochmar, Jakub Macina, Donya Rooein, Mrinmaya Sachan, Shashank Sonkar:
  Large Language Models for Education: Understanding the Needs of Stakeholders, Current Capabilities and the Path Forward. 1-10
- Hakyung Sung, Karla Csürös, Min-Chang Sung:
  Comparing human and LLM proofreading in L2 writing: Impact on lexical and syntactic features. 11-23
- Adrian Marius Dumitran, Mihnea Buca, Theodor Moroianu:
  MateInfoUB: A Real-World Benchmark for Testing LLMs in Competitive, Multilingual, and Multimodal Educational Tasks. 24-37
- Felipe Urrutia, Cristian Buc, Roberto Araya, Valentin Barrière:
  Unsupervised Automatic Short Answer Grading and Essay Scoring: A Weakly Supervised Explainable Approach. 38-54
- Luca Benedetto, Shiva Taslimipoor, Paula Buttery:
  A Survey on Automated Distractor Evaluation in Multiple-Choice Tasks. 55-69
- Mina Almasi, Ross Deans Kristensen-McLachlan:
  Alignment Drift in CEFR-prompted LLMs for Interactive Spanish Tutoring. 70-88
- Stefan Dascalescu, Adrian Marius Dumitran, Mihai Alexandru Vasiluta:
  Leveraging Generative AI for Enhancing Automated Assessment in Programming Education Contests. 89-99
- Daria Martynova, Jakub Macina, Nico Daheim, Nilay Yalcin, Xiaoyu Zhang, Mrinmaya Sachan:
  Can LLMs Effectively Simulate Human Learners? Teachers' Insights from Tutoring LLM Students. 100-117
- Ryszard Staruch, Filip Gralinski, Daniel Dzienisiewicz:
  Adapting LLMs for Minimal-edit Grammatical Error Correction. 118-128
- Zhengyuan Liu, Stella Xin Yin, Dion Hoe-Lian Goh, Nancy F. Chen:
  COGENT: A Curriculum-oriented Framework for Generating Grade-appropriate Educational Content. 129-143
- Marie Bexte, Torsten Zesch:
  Is Lunch Free Yet? Overcoming the Cold-Start Problem in Supervised Content Scoring using Zero-Shot LLM-Generated Training Data. 144-159
- Lucy Skidmore, Mariano Felice, Karen Dunn:
  Transformer Architectures for Vocabulary Test Item Difficulty Prediction. 160-174
- Kordula De Kuthy, Leander Girrbach, Detmar Meurers:
  Automatic concept extraction for learning domain modeling: A weakly supervised approach using contextualized word embeddings. 175-185
- Tianyi Geng, David Alfter:
  Towards a Real-time Swedish Speech Analyzer for Language Learning Games: A Hybrid AI Approach to Language Assessment. 186-201
- Mengyang Qiu, Tran Minh Nguyen, Zihao Huang, Zelong Li, Yang Gu, Qingyu Gao, Siliang Liu, Jungyeul Park:
  Multilingual Grammatical Error Annotation: Combining Language-Agnostic Framework with Language-Specific Flexibility. 202-212
- Robert Östling, Murathan Kurfali, Andrew Caines:
  LLM-based post-editing as reference-free GEC evaluation. 213-224
- Marie Bexte, Yuning Ding, Andrea Horbach:
  Increasing the Generalizability of Similarity-Based Essay Scoring Through Cross-Prompt Training. 225-236
- Mihail Chifligarov, Jammila Laâguidi, Max Schellenberg, Alexander Dill, Anna Timukova, Anastasia Drackert, Ronja Laarmann-Quante:
  Automated Scoring of a German Written Elicited Imitation Test. 237-247
- Andrei Kucharavy, Cyril Vallez, Dimitri Percia David:
  LLMs Protégés: Tutoring LLMs with Knowledge Gaps Improves Student Learning Outcome. 248-257
- Karthika NJ, Krishnakant Bhatt, Ganesh Ramakrishnan, Preethi Jyothi:
  LEVOS: Leveraging Vocabulary Overlap with Sanskrit to Generate Technical Lexicons in Indian Languages. 258-265
- Andreas Säuberli, Diego Frassinelli, Barbara Plank:
  Do LLMs Give Psychometrically Plausible Responses in Educational Assessments? 266-278
- Aymeric de Chillaz, Anna Sotnikova, Patrick Jermann, Antoine Bosselut:
  Challenges for AI in Multimodal STEM Assessments: a Human-AI Comparison. 279-293
- Nisarg Parikh, Alexander Scarlatos, Nigel Fernandez, Simon Woodhead, Andrew Lan:
  LookAlike: Consistent Distractor Generation in Math MCQs. 294-311
- Jasper Degraeuwe:
  You Shall Know a Word's Difficulty by the Family It Keeps: Word Family Features in Personalised Word Difficulty Classifiers for L2 Spanish. 312-325
- David Alfter:
  The Need for Truly Graded Lexical Complexity Prediction. 326-333
- Louise Bloch, Johannes Rückert, Christoph M. Friedrich:
  Towards Automatic Formal Feedback on Scientific Documents. 334-344
- Nils-Jonathan Schaller, Yuning Ding, Thorben Jansen, Andrea Horbach:
  Don't Score too Early! Evaluating Argument Mining Models on Incomplete Essays. 345-355
- Sankalan Pal Chowdhury, Terry Jingchen Zhang, Donya Rooein, Dirk Hovy, Tanja Käser, Mrinmaya Sachan:
  Educators' Perceptions of Large Language Models as Tutors: Comparing Human and AI Tutors in a Blind Text-only Setting. 356-374
- Torsten Zesch, Dominic Gardner, Marie Bexte:
  Transformer-Based Real-Word Spelling Error Feedback with Configurable Confusion Sets. 375-383
- Aitor Arronte Alvarez, Naiyi Xie Fincham:
  Automated L2 Proficiency Scoring: Weak Supervision, Large Language Models, and Statistical Guarantees. 384-397
- Wanjing (Anya) Ma, Michael Flor, Zuowei Wang:
  Automatic Generation of Inference Making Questions for Reading Comprehension Assessments. 398-414
- Zahra Kolagar, Frank Zalkow, Alessandra Zarcone:
  Investigating Methods for Mapping Learning Objectives to Bloom's Revised Taxonomy in Course Descriptions for Higher Education. 415-445
- Mariana Shimabukuro, Deval Panchal, Christopher Collins:
  LangEye: Toward 'Anytime' Learner-Driven Vocabulary Learning From Real-World Objects. 446-459
- Syeda Sabrina Akter, Seth Hunter, David Woo, Antonios Anastasopoulos:
  Costs and Benefits of AI-Enabled Topic Modeling in P-20 Research: The Case of School Improvement Plans. 460-476
- Tania Amanda Nkoyo Frederick Eneye, Chukwuebuka Fortunate Ijezue, Ahmad Imam Amjad, Maaz Amjad, Sabur Butt, Gerardo Castañeda Garza:
  Advances in Auto-Grading with Large Language Models: A Cross-Disciplinary Survey. 477-498
- Rina Miyata, Toru Urakawa, Hideaki Tamori, Tomoyuki Kajiwara:
  Unsupervised Sentence Readability Estimation Based on Parallel Corpora for Text Simplification. 499-504
- Martina Galletti, Valeria Cesaroni:
  From End-Users to Co-Designers: Lessons from Teachers. 505-516
- Alexey Sorokin, Regina Nasyrova:
  LLMs in alliance with Edit-based models: advancing In-Context Learning for Grammatical Error Correction by Specific Example Selection. 517-534
- Michiel De Vrindt, Renske Bouwer, Wim Van Den Noortgate, Marije Lesterhuis, Anaïs Tack:
  Explaining Holistic Essay Scores in Comparative Judgment Assessments by Predicting Scores on Rubrics. 535-548
- Chatrine Qwaider, Bashar Alhafni, Kirill Chirkunov, Nizar Habash, Ted Briscoe:
  Enhancing Arabic Automated Essay Scoring with Synthetic Data and Error Injection. 549-563
- Charles Koutcheme, Nicola Dainese, Arto Hellas:
  Direct Repair Optimization: Training Small Language Models For Educational Program Repair Improves Feedback. 564-581
- Fatemeh Kazemi Vanhari, Christopher Anand, Charles Welch:
  Analyzing Interview Questions via Bloom's Taxonomy to Enhance the Design Thinking Process. 582-593
- Anisia Katinskaia, Anh-Duc Vu, Jue Hou, Ulla Vanhatalo, Yiheng Wu, Roman Yangarber:
  Estimation of Text Difficulty in the Context of Language Learning. 594-611
- Vansh Gupta, Sankalan Pal Chowdhury, Vilém Zouhar, Donya Rooein, Mrinmaya Sachan:
  Are Large Language Models for Education Reliable Across Languages? 612-631
- Stefano Bannò, Kate M. Knill, Mark J. F. Gales:
  Exploiting the English Vocabulary Profile for L2 word-level vocabulary assessment with LLMs. 632-646
- Bernardo Leite, Henrique Lopes Cardoso:
  Advancing Question Generation with Joint Narrative and Difficulty Control. 647-659
- Fabian Zehner, Hyo-Jeong Shin, Emily Kerzabi, Andrea Horbach, Sebastian Gombert, Frank Goldhammer, Torsten Zesch, Nico Andersen:
  Down the Cascades of Omethi: Hierarchical Automatic Scoring in Large-Scale Assessments. 660-671
- Mohamed Elaraby, Diane J. Litman:
  Lessons Learned in Assessing Student Reflections with LLMs. 672-686
- Haiyin Yang, Zoey Liu, Stefanie Wulff:
  Using NLI to Identify Potential Collocation Transfer in L2 English. 687-696
- Annabella Sakunkoo, Jonathan Sakunkoo:
  Name of Thrones: How Do LLMs Rank Student Names in Status Hierarchies Based on Race and Gender? 697-707
- Adriana Mirabella, Dominique Brunato:
  Exploring LLM-Based Assessment of Italian Middle School Writing: A Pilot Study. 708-715
- Yuya Asano, Beata Beigman Klebanov, Jamie N. Mikeska:
  Exploring task formulation strategies to evaluate the coherence of classroom discussions with GPT-4o. 716-736
- Anh-Duc Vu, Jue Hou, Anisia Katinskaia, Ching-Fan Sheu, Roman Yangarber:
  A Bayesian Approach to Inferring Prerequisite Structures and Topic Difficulty in Language Learning. 737-751
- Nhat Tran, Diane J. Litman, Benjamin Pierce, Richard Correnti, Lindsay Clare Matsumura:
  Improving In-context Learning Example Retrieval for Classroom Discussion Assessment with Re-ranking and Label Ratio Regulation. 752-764
- Fareya Ikram, Alexander Scarlatos, Andrew Lan:
  Exploring LLMs for Predicting Tutor Strategy and Student Outcomes in Dialogues. 765-779
- Madalina Chitez, Liviu P. Dinu, Marius Micluta-Câmpeanu, Ana-Maria Bucur, Roxana Rogobete:
  Assessing Critical Thinking Components in Romanian Secondary School Textbooks: A Data Mining Approach to the ROTEX Corpus. 780-793
- Jacek Marciniak, Marek Kubis, Michal Gulczynski, Adam Szpilkowski, Adam Wieczarek, Marcin Szczepanski:
  Improving AI assistants embedded in short e-learning courses with limited textual content. 794-804
- Junzhi Han, Jinho D. Choi:
  Beyond Linear Digital Reading: An LLM-Powered Concept Mapping Approach for Reducing Cognitive Load. 805-817
- Noah-Manuel Michael, Andrea Horbach:
  GermDetect: Verb Placement Error Detection Datasets for Learners of Germanic Languages. 818-829
- Sahar Yarmohammadtoosky, Yiyun Zhou, Victoria Yaneva, Peter Baldwin, Saed Rezayi, Brian Clauser, Polina Harik:
  Enhancing Security and Strengthening Defenses in Automated Short-Answer Grading Systems. 830-840
- Astha Singh, Mark Torrance, Evgeny Chukharev:
  EyeLLM: Using Lookback Fixations to Enhance Human-LLM Alignment for Text Completion. 841-849
- Phoebe Mulcaire, Nitin Madnani:
  Span Labeling with Large Language Models: Shell vs. Meat. 850-859
- Kseniia Petukhova, Ekaterina Kochmar:
  Intent Matters: Enhancing AI Tutoring with Fine-Grained Pedagogical Intent Annotation. 860-872
- Aayush Kucheria, Nitin Sawhney, Arto Hellas:
  Comparing Behavioral Patterns of LLM and Human Tutors: A Population-level Analysis with the CIMA Dataset. 873-881
- Zhenjiang Mao, Artem Bisliouk, Rohith Reddy Nama, Ivan Ruchkin:
  Temporalizing Confidence: Evaluation of Chain-of-Thought Reasoning with Signal Temporal Logic. 882-890
- Saed Rezayi, Le An Ha, Yiyun Zhou, Andrew Houriet, Angelo D'Addario, Peter Baldwin, Polina Harik, Ann King, Victoria Yaneva:
  Automated Scoring of Communication Skills in Physician-Patient Interaction: Balancing Performance and Scalability. 891-897
- Mayank Sharma, Jason Zhang:
  Decoding Actionability: A Computational Analysis of Teacher Observation Feedback. 898-907
- Ruishi Chen, Yiling Zhao:
  EduCSW: Building a Mandarin-English Code-Switched Generation Pipeline for Computer Science Learning. 908-919
- Euigyum Kim, Seewoo Li, Salah Khalil, Hyo-Jeong Shin:
  STAIR-AIG: Optimizing the Automated Item Generation Process through Human-AI Collaboration for Critical Thinking Assessment. 920-930
- Kevin Shi, Karttikeya Mangalam:
  UPSC2M: Benchmarking Adaptive Learning from Two Million MCQ Attempts. 931-936
- Veronica Schmalz, Anaïs Tack:
  Can GPTZero's AI Vocabulary Distinguish Between LLM-Generated and Student-Written Essays? 937-952
- Martin Vainikko, Taavi Kamarik, Karina Kert, Krista Liin, Silvia Maine, Kais Allkivi, Annekatrin Kaivapalu, Mark Fishel:
  Paragraph-level Error Correction and Explanation Generation: Case Study for Estonian. 953-967
- Kamel Nebhi, Amrita Panesar, Hans Bantilan:
  End-to-End Automated Item Generation and Scoring for Adaptive English Writing Assessment with Large Language Models. 968-977
- Luisa Ribeiro-Flucht, Xiaobin Chen, Detmar Meurers:
  A Framework for Proficiency-Aligned Grammar Practice in LLM-Based Dialogue Systems. 978-987
- KV Aditya Srivatsa, Kaushal Maurya, Ekaterina Kochmar:
  Can LLMs Reliably Simulate Real Students' Abilities in Mathematics and Reading Comprehension? 988-1001
- Leo Huovinen, Mika Hämäläinen:
  LLM-Assisted, Iterative Curriculum Writing: A Human-Centered AI Approach in Finnish Higher Education. 1002-1010
- Ekaterina Kochmar, Kaushal Maurya, Kseniia Petukhova, KV Aditya Srivatsa, Anaïs Tack, Justin Vasselli:
  Findings of the BEA 2025 Shared Task on Pedagogical Ability Assessment of AI-powered Tutors. 1011-1033
- Lei Chen:
  Jinan Smart Education at BEA 2025 Shared Task: Dual Encoder Architecture for Tutor Identification via Semantic Understanding of Pedagogical Conversations. 1034-1039
- Deliang Wang, Chao Yang, Gaowei Chen:
  Wonderland_EDU@HKU at BEA 2025 Shared Task: Fine-tuning Large Language Models to Evaluate the Pedagogical Ability of AI-powered Tutors. 1040-1048
- Jihyeon Roh, Jinhyun Bang:
  bea-jh at BEA 2025 Shared Task: Evaluating AI-powered Tutors through Pedagogically-Informed Reasoning. 1049-1059
- Zhihao Lyu:
  CU at BEA 2025 Shared Task: A BERT-Based Cross-Attention Approach for Evaluating Pedagogical Responses in Dialogue. 1060-1072
- Yuming Fan, Chuangchuang Tan, Wenyu Song:
  BJTU at BEA 2025 Shared Task: Task-Aware Prompt Tuning and Data Augmentation for Evaluating AI Math Tutors. 1073-1077
- Longfeng Chen, Zeyu Huang, Zheng Xiao, Yawen Zeng, Jin Xu:
  SYSUpporter Team at BEA 2025 Shared Task: Class Compensation and Assignment Optimization for LLM-generated Tutor Identification. 1078-1083
- Jiyuan An, Xiang Fu, Bo Liu, Xuquan Zong, Cunliang Kong, Shuliang Liu, Shuo Wang, Zhenghao Liu, Liner Yang, Hanghang Fan, Erhong Yang:
  BLCU-ICALL at BEA 2025 Shared Task: Multi-Strategy Evaluation of AI Tutors. 1084-1097
- Rajneesh Tiwari, Pranshu Rastogi:
  Phaedrus at BEA 2025 Shared Task: Assessment of Mathematical Tutoring Dialogues through Tutor Identity Classification and Actionability Evaluation. 1098-1107
- Raunak Jain, Srinivasan Rengarajan:
  Emergent Wisdom at BEA 2025 Shared Task: From Lexical Understanding to Reflective Reasoning for Pedagogical Ability Assessment. 1108-1120
- Mazen Yasser, Mariam Saeed, Hossam Elkordi, Ayman Khalafallah:
  Averroes at BEA 2025 Shared Task: Verifying Mistake Identification in Tutor, Student Dialogue. 1121-1126
- Md. Abdur Rahman, Md Al Amin, Sabik Aftahee, Muhammad Junayed, Md Ashiqur Rahman:
  SmolLab_SEU at BEA 2025 Shared Task: A Transformer-Based Framework for Multi-Track Pedagogical Evaluation of AI-Powered Tutors. 1127-1134
- Santiago Góngora, Ignacio Sastre, Santiago Robaina, Ignacio Remersaro, Luis Chiruzzo, Aiala Rosá:
  RETUYT-INCO at BEA 2025 Shared Task: How Far Can Lightweight Models Go in AI-powered Tutor Evaluation? 1135-1144
- Geon Park, Jiwoo Song, Gihyeon Choi, Juoh Sun, Harksoo Kim:
  K-NLPers at BEA 2025 Shared Task: Evaluating the Quality of AI Tutor Responses with GPT-4.1. 1145-1163
- Henry Pit:
  Henry at BEA 2025 Shared Task: Improving AI Tutor's Guidance Evaluation Through Context-Aware Distillation. 1164-1172
- Sebastian Gombert, Fabian Zehner, Hendrik Drachsler:
  TBA at BEA 2025 Shared Task: Transfer-Learning from DARE-TIES Merged Models for the Pedagogical Ability Assessment of LLM-Powered Math Tutors. 1173-1179
- Souvik Bhattacharyya, Billodal Roy, Niranjan M, Pranav Gupta:
  LexiLogic at BEA 2025 Shared Task: Fine-tuning Transformer Language Models for the Pedagogical Skill Evaluation of LLM-based tutors. 1180-1186
- Sofía Correa Busquets, Valentina Córdova Véliz, Jorge Baier:
  IALab UC at BEA 2025 Shared Task: LLM-Powered Expert Pedagogical Feature Extraction. 1187-1193
- Baraa Hikal, Mohmaed Basem, Islam Oshallah, Ali Hamdi:
  MSA at BEA 2025 Shared Task: Disagreement-Aware Instruction Tuning for Multi-Dimensional Evaluation of LLMs as Math Tutors. 1194-1202
- Fatima Dekmak, Christian Khairallah, Wissam Antoun:
  TutorMind at BEA 2025 Shared Task: Leveraging Fine-Tuned LLMs and Data Augmentation for Mistake Identification. 1203-1211
- Eduardus Tjitrahardja, Ikhlasul Akmal Hanif:
  Two Outliers at BEA 2025 Shared Task: Tutor Identity Classification using DiReC, a Two-Stage Disentangled Contrastive Representation. 1212-1223
- Ana Rosu, Jany-Gabriel Ispas, Sergiu Nisioi:
  Archaeology at BEA 2025 Shared Task: Are Simple Baselines Good Enough? 1224-1241
- Trishita Saha, Shrenik Ganguli, Maunendra Sankar Desarkar:
  NLIP at BEA 2025 Shared Task: Evaluation of Pedagogical Ability of AI Tutors. 1242-1253
- Numaan Naeem, Sarfraz Ahmad, Momina Ahsan, Hasan Iqbal:
  NeuralNexus at BEA 2025 Shared Task: Retrieval-Augmented Prompting for Mistake Identification in AI Tutors. 1254-1259
- Maria Monica Manlises, Mark Edward M. Gonzales, Lanz Lim:
  DLSU at BEA 2025 Shared Task: Towards Establishing Baseline Models for Pedagogical Response Evaluation Tasks. 1260-1265
- Shadman Rohan, Ishita Sur Apan, Muhtasim Ibteda Shochcho, Md Fahim, Mohammad Ashfaq Ur Rahman, A. K. M. Mahbubur Rahman, Amin Ali:
  BD at BEA 2025 Shared Task: MPNet Ensembles for Pedagogical Mistake Identification and Localization in AI Tutor Responses. 1266-1277
- Harsh Dadwal, Sparsh Rastogi, Jatin Bedi:
  Thapar Titan/s : Fine-Tuning Pretrained Language Models with Contextual Augmentation for Mistake Identification in Tutor-Student Dialogues. 1278-1282















