
CSC 313 Educational data mining assignment

This is an open-ended programming assignment. You will be provided with a data-set of 62,223 submissions from students on 50 different small programming exercises, along with final exam scores for most of those students.

The core task here is a data mining challenge. Can you model student programming behaviour in a way that helps predict their final exam scores?

You will work on this in groups.

The dataset

This dataset contains 62,223 programming problem submissions (in Java) from students in a CS 1 course, along with their final grades in that course. Students were allowed to make as many submissions to the programming problems as they wanted.

Obtaining the data

Understanding the data

I have summarised the important information below. The dataset is organised as follows:


The important pieces are as follows.


Data/MainTable.csv contains programming process event data. Each row of the table represents an event: e.g. one program submission, one compilation result, or a compiler error if there was one. Each Run.Program event is initiated by a student submitting a solution to a given problem, while the Compile and Compile.Error events are generated by the grading system. The following attributes are defined for each row:

Primary Attributes:


This file contains a mapping from each CodeStateID in the MainTable to the source code it represents.


early.csv contains one row for each combination of SubjectID and ProblemID for the first 30 Problems (the first 3 Assignments). The name “early” comes from the fact that they were assigned early in the term.


late.csv has the same structure as early.csv, but covers the last 20 Problems (2 Assignments) in the data-set.


Data/LinkTables/Subjects.csv contains a mapping from each SubjectID in the MainTable to the final grade of that student in this class (X-Grade).
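Since final grades are only available for most students, any per-student measure you compute has to be joined against the grade mapping with care. The following is a minimal sketch of such a join using only the Python standard library. The rows are invented, the `submission_counts` feature is hypothetical, and treating X-Grade as a number is an assumption; only the SubjectID and X-Grade column names come from the description above.

```python
import csv
import io

# Toy stand-in for the SubjectID -> X-Grade mapping (rows are invented).
SUBJECTS_CSV = """SubjectID,X-Grade
s1,0.9
s2,0.6
"""

# Hypothetical per-student feature, e.g. a count of submissions per student.
submission_counts = {"s1": 12, "s2": 30, "s3": 7}

# Assumption: X-Grade can be parsed as a float.
grades = {
    row["SubjectID"]: float(row["X-Grade"])
    for row in csv.DictReader(io.StringIO(SUBJECTS_CSV))
}

# Keep only students who have both a feature value and a grade.
paired = [
    (subject, count, grades[subject])
    for subject, count in sorted(submission_counts.items())
    if subject in grades
]

# Students with activity but no recorded grade cannot be used for prediction.
ungraded = sorted(set(submission_counts) - set(grades))

print(paired)
print(ungraded)
```

With the real files, you would replace `io.StringIO(...)` with `open(path, newline="")`. Keeping the list of ungraded students explicit makes it easier to report how many students were dropped from a model.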

Core task

This assignment is split into two parts. The first is geared toward getting you familiar with the dataset. The second is more open-ended.

First task: exploratory data analysis

Answer the following questions from this dataset (see note 1 at the end). Use any data analytics toolset with which you’re comfortable (or take this opportunity to add a new tool to your toolbelt). Each question includes some information that should be helpful in answering it.

Question 1. How many unique students are represented in this data-set?

You will probably want to look at data in either Data/MainTable.csv or Data/LinkTables/Subjects.csv for this. Why do you suppose your answers would be different depending on which one you looked in?
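As a starting point, here is a minimal sketch of counting distinct students in two tables and comparing the results, using only the Python standard library. The two CSV strings are invented toy data standing in for the real files, and the layout beyond the SubjectID and X-Grade columns is an assumption.

```python
import csv
import io

# Toy stand-ins for Data/MainTable.csv and Data/LinkTables/Subjects.csv.
# The rows are invented; real files will have more columns.
MAIN_TABLE_CSV = """SubjectID,ProblemID
s1,p1
s1,p2
s2,p1
s3,p1
"""
SUBJECTS_CSV = """SubjectID,X-Grade
s1,0.9
s2,0.7
"""


def unique_subjects(csv_text):
    """Return the set of distinct SubjectID values in a CSV document."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row["SubjectID"] for row in reader}


in_main = unique_subjects(MAIN_TABLE_CSV)
in_subjects = unique_subjects(SUBJECTS_CSV)

print(len(in_main), len(in_subjects))
# Students who appear in one table but not the other.
print(sorted(in_main - in_subjects))
```

Working with sets rather than bare counts lets you see exactly which students account for any difference between the two files.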

Question 2. The data-set contains many Assignments, and each Assignment is made up of multiple Problems. Which Problem had the highest average number of attempts per student?

You’ll need to look in both early.csv and late.csv to answer this question.
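One way to approach this is to pool the rows from both files and average per problem. The sketch below does that with the Python standard library on invented toy data. It assumes each row carries a per-student attempt count in a column called `Attempts`; that column name is a guess, so check the actual headers first.

```python
import csv
import io
from collections import defaultdict

# Toy stand-ins for early.csv and late.csv (rows are invented).
# Assumption: a column named "Attempts" holds the attempt count
# for one SubjectID/ProblemID pair.
EARLY_CSV = """SubjectID,ProblemID,Attempts
s1,p1,2
s2,p1,4
s1,p2,1
"""
LATE_CSV = """SubjectID,ProblemID,Attempts
s1,p3,5
s2,p3,7
"""


def mean_attempts_by_problem(*csv_texts):
    """Average the Attempts column per ProblemID across several CSV documents."""
    totals = defaultdict(int)
    counts = defaultdict(int)
    for text in csv_texts:
        for row in csv.DictReader(io.StringIO(text)):
            totals[row["ProblemID"]] += int(row["Attempts"])
            counts[row["ProblemID"]] += 1
    return {problem: totals[problem] / counts[problem] for problem in totals}


means = mean_attempts_by_problem(EARLY_CSV, LATE_CSV)
most_attempted = max(means, key=means.get)

print(means)
print(most_attempted)
```

Note that this averages only over students who have a row for the problem. Whether students who never attempted a problem should count is a modelling decision worth stating in your answer.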

Question 3. On which Problem did students tend to face the most compiler errors?

First, how do you interpret this question? Am I asking you to:

Which one would be more meaningful? Why?

You’ll need to look at the events in Data/MainTable.csv to answer this.
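To see why the interpretation matters, the sketch below computes two possible readings on invented toy events: the total number of Compile.Error events per problem, and the number of Compile.Error events per student who worked on the problem. These are illustrative readings only and may not be the ones intended above. The `EventType` column name is an assumption; the event names come from the description of the MainTable.

```python
import csv
import io
from collections import Counter, defaultdict

# Toy stand-in for Data/MainTable.csv (rows are invented).
# Assumption: the kind of event is stored in a column named "EventType".
EVENTS_CSV = """SubjectID,ProblemID,EventType
s1,p1,Run.Program
s1,p1,Compile.Error
s2,p1,Run.Program
s2,p1,Compile.Error
s3,p1,Run.Program
s3,p1,Compile.Error
s1,p2,Run.Program
s1,p2,Compile.Error
s1,p2,Compile.Error
"""

errors = Counter()            # total Compile.Error events per problem
students = defaultdict(set)   # students with any event on each problem

for row in csv.DictReader(io.StringIO(EVENTS_CSV)):
    students[row["ProblemID"]].add(row["SubjectID"])
    if row["EventType"] == "Compile.Error":
        errors[row["ProblemID"]] += 1

# Normalise by the number of students who worked on the problem.
per_student = {
    problem: errors[problem] / len(subjects)
    for problem, subjects in students.items()
}

most_total = max(errors, key=errors.get)
most_per_student = max(per_student, key=per_student.get)

print(most_total, most_per_student)
```

In this toy data the two readings pick different problems, because one problem was attempted by more students. The same effect can appear in the real data-set.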

Second task: open-ended analysis

Obtain some insights or create something interesting using this data-set.

If it’s relevant to your analysis, this spreadsheet contains the programming problems, prompts, and estimates of which programming concepts were used in each problem.

For example:


Here are a number of resources that might help you analyse your data.


There are two deliverables:

  1. For the first task, enter into Canvas your answers to the questions, as well as an English-language description of your strategy for obtaining those answers.
  2. For the second task, turn in a zipped folder which, depending on what you do, should contain:
    1. If you wrote code, submit the code along with a README file indicating how to run it. Naturally, the code should be reasonably well-documented. (If you’re using Jupyter Notebooks, either export the code to a .py file or scrub the output from the .ipynb file that you turn in.)
    2. Also turn in a PDF report detailing what you did and any insights you uncovered. If you’re turning in visuals, include those as images in the report.
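If you choose to scrub a notebook rather than export it, note that a .ipynb file is a JSON document in which each code cell stores its results under `outputs` and `execution_count`. The following is a minimal sketch of clearing those fields, assuming a version 4 notebook; the function name `scrub_outputs` and the example notebook are made up for illustration.

```python
import json


def scrub_outputs(notebook):
    """Clear outputs and execution counts from every code cell, in place."""
    for cell in notebook.get("cells", []):
        if cell.get("cell_type") == "code":
            cell["outputs"] = []
            cell["execution_count"] = None
    return notebook


# A tiny in-memory notebook standing in for a real .ipynb file.
example = {
    "cells": [
        {"cell_type": "markdown", "metadata": {}, "source": ["# Analysis"]},
        {
            "cell_type": "code",
            "metadata": {},
            "execution_count": 3,
            "source": ["print(1 + 1)"],
            "outputs": [
                {"output_type": "stream", "name": "stdout", "text": ["2\n"]}
            ],
        },
    ],
    "metadata": {},
    "nbformat": 4,
    "nbformat_minor": 5,
}

scrubbed = scrub_outputs(example)
print(json.dumps(scrubbed["cells"][1], indent=1))
```

For a real file, you would read the notebook with `json.load`, call `scrub_outputs`, and write the result back with `json.dump`. Work on a copy until you have confirmed that the scrubbed notebook still opens.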


I will let you self-organise into groups of 3–4 students for this assignment. In Canvas, when you submit, make sure to choose your group before submitting, and make sure that all students’ names are included in the submission.


Task 1 (9 points)

Question 1

Question 2

Question 3

Task 2 (11 points)

This is more open-ended. Contributions will be assessed based on the following criteria.

  1. These are the required questions, but you are of course free to explore the data-set further. For example, I like to make barplots and boxplots to help me “eyeball” the data when I am starting to analyse a data-set.