May I Ask a Question? MIA40K: A Large-Scale Educational Conversation Dataset and Generation Pipeline
Large Language Models (LLMs) have shown significant promise in educational applications, but their full potential is constrained by the limited availability of high-quality educational dialogue data, as traditional data collection methods rely heavily on human involvement. This paper presents a fully automated, highly scalable pipeline for generating educational conversation datasets. Our multi-step framework incorporates solution generation, verification, and dialogue synthesis with LLM-as-a-judge filtering to ensure quality control. Using this pipeline, we introduce MIA40K, a dataset of 39,526 teacher-student conversations focused on mathematics and science education. We evaluate our dataset’s conversational and educational quality through standard metrics and demonstrate its utility in educational dialogue tasks.
Recommended citation: Gamsız, A. F., Köksal, A., Korhonen, A., & Schütze, H. (2024). May I Ask a Question? MIA40K: A Large-Scale Educational Conversation Dataset and Generation Pipeline [Work in progress]