FalconCode: A Multiyear Dataset of Python Code Samples from an Introductory Computer Science Course
The lack of large and diverse datasets of student code samples limits some forms of computer science education research. To address this problem, we created FalconCode, a novel collection of over 1.5 million Python programs from over two thousand undergraduate students at [Institution Redacted]. FalconCode captures over five semesters’ worth of code samples from our introduction to computing course, which is taken by every student regardless of academic major. The dataset contains student code submissions for over 800 programming assignments, as well as additional metadata such as the prompt for each assignment, the test case(s) used to evaluate student submissions, and the specific skills needed to solve each problem. In this paper, we describe the methodology used to create FalconCode and the steps taken to protect students’ personally identifiable information (PII). We then describe FalconCode’s data schema and show how it can support a wide range of research, including studies that use machine learning (ML) and artificial intelligence (AI). FalconCode is provided free of charge and is available upon request.