
CSC 313 Educational data mining assignment

This is an open-ended programming assignment. You will be provided with a dataset of 62,223 submissions from students on 50 different small programming exercises, along with final exam scores for most of those students.

The core task here is a data mining challenge. Can you come up with ways to model student programming behaviour in a way that helps predict their final exam scores?

You will work on this in groups.

The dataset

This dataset contains 62,223 programming problem submissions (in Java) from students in a CS 1 course, along with their final grades in that course. Students were allowed to make as many submissions to the programming problems as they wanted.

Obtaining the data

Understanding the data

The dataset is organised as follows:

Root/
	Data/
		MainTable.csv
		Metadata.csv
		CodeStates/
			CodeState.csv
		LinkTables/
			Subject.csv
	early.csv
	late.csv

The important pieces are as follows.

Data/MainTable.csv

This table contains programming process event data. Each row of the table represents an event: e.g. one program submission, compilation result, and any resulting errors. A Run.Program event is initiated when a student submits a solution to a given problem, while the Compile and Compile.Error events are generated by the grading system. The following attributes are defined for each row:

Primary Attributes:

Data/CodeStates/CodeState.csv

This file contains a mapping from each CodeStateID in the MainTable to the source code it represents. Since these were small programming exercises, the source code is stored directly as entries in this table.
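Since every event carries a CodeStateID, a common first step is to join the source code onto the event log. A minimal pandas sketch; the column name `Code` for the source text is an assumption, so check the actual CSV headers first:

```python
import pandas as pd

def attach_source(main: pd.DataFrame, code_states: pd.DataFrame) -> pd.DataFrame:
    """Left-join each event to the source code it refers to via CodeStateID."""
    return main.merge(code_states, on="CodeStateID", how="left")
```

You would call this with something like `attach_source(pd.read_csv("Data/MainTable.csv"), pd.read_csv("Data/CodeStates/CodeState.csv"))`; a left join keeps every event even if a code state is missing.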

early.csv

This table contains one row for each combination of SubjectID and ProblemID for the first 30 Problems (the first 3 Assignments). The name “early” comes from the fact that they were assigned early in the term.

Note: Attempts, CorrectEventually, and Label are provided for convenience and transparency, but these values can also be derived from MainTable.csv.
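For instance, here is one hedged sketch of how Attempts and CorrectEventually might be re-derived from the event log. The EventType and Score column names, and the convention that a Score of 1 means a fully correct submission, are assumptions; verify them against MainTable.csv before relying on this:

```python
import pandas as pd

def derive_attempts(main: pd.DataFrame) -> pd.DataFrame:
    # Count only student-initiated submissions, not grader-generated events.
    runs = main[main["EventType"] == "Run.Program"]
    return (
        runs.groupby(["SubjectID", "ProblemID"])
        .agg(
            Attempts=("EventType", "size"),
            # Assumption: a Score of 1 marks a fully correct submission.
            CorrectEventually=("Score", lambda s: bool((s == 1).any())),
        )
        .reset_index()
    )
```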

late.csv

Same as early.csv but for the last 20 problems (2 assignments) in the dataset.

Data/LinkTables/Subject.csv

Contains a mapping from each SubjectID in the MainTable to that student's final grade in the class (X-Grade).
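A sketch of attaching grades to per-student features, assuming the SubjectID and X-Grade column names from the description above. An inner join is used here because grades are only available for most students, so ungraded students are dropped:

```python
import pandas as pd

def attach_grades(per_student: pd.DataFrame, subjects: pd.DataFrame) -> pd.DataFrame:
    # Inner join: students without a recorded X-Grade are dropped.
    return per_student.merge(
        subjects[["SubjectID", "X-Grade"]], on="SubjectID", how="inner"
    )
```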

Core task

This assignment is split into two parts. The first is aimed at getting you familiar with the dataset. The second is more open-ended.

First task: exploratory data analysis

Answer the following questions from this dataset.1 Use any data analytics toolset with which you’re comfortable (or take this opportunity to add a new tool to your toolbelt). Each question includes some information that should be helpful in answering it.

Question 1. How many unique students are represented in this dataset?

You will probably want to look at data in either Data/MainTable.csv or Data/LinkTables/Subject.csv for this. Why might your answer differ depending on which file you look in?
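One way to sketch the comparison in pandas, assuming both files share a SubjectID column:

```python
import pandas as pd

def count_unique_students(df: pd.DataFrame) -> int:
    """Number of distinct SubjectIDs in a table."""
    return df["SubjectID"].nunique()
```

Running this on both tables and comparing the two counts is a useful exercise: the difference itself is part of the answer.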

Question 2. The dataset contains many Assignments, and each Assignment is made up of multiple Problems. Which Problem had the highest average number of attempts per student?

You’ll need to look in both early.csv and late.csv to answer this question, or compute it directly from the MainTable.
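Using the pre-computed route, a sketch might look like this, assuming the Attempts and ProblemID columns described above and one row per (SubjectID, ProblemID) in each file:

```python
import pandas as pd

def highest_avg_attempts(attempts: pd.DataFrame):
    """ProblemID with the highest mean Attempts per student.

    `attempts` should have one row per (SubjectID, ProblemID),
    e.g. pd.concat([early, late]).
    """
    return attempts.groupby("ProblemID")["Attempts"].mean().idxmax()
```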

Question 3. On which Problem did students tend to face the most compiler errors?

First, how do you interpret this question? Am I asking you to:

Which one would be more meaningful? Why?

You’ll need to look at the events in Data/MainTable.csv to answer this.
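Under one reading of the question, a raw per-problem tally of Compile.Error events might be sketched as follows (the EventType and ProblemID column names are assumptions from the MainTable description; counting raw error events is only one of the possible interpretations):

```python
import pandas as pd

def compile_errors_per_problem(main: pd.DataFrame) -> pd.Series:
    """Total Compile.Error events observed on each problem, most first."""
    errors = main[main["EventType"] == "Compile.Error"]
    return errors.groupby("ProblemID").size().sort_values(ascending=False)
```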

For example, I like to start with tools that let me quickly make charts “on demand” to help me eyeball the data.

Second task: open-ended analysis

Obtain some insights or create something interesting using this dataset.

If it’s relevant to your analysis, this spreadsheet contains the programming problems, prompts, and estimates of which programming concepts were used in each problem.

For example:

Resources

Here are a number of resources that might help you analyse your data.

Deliverable

Turn in up to two items in Canvas.

1. A PDF report about your findings.

It should detail what you did, why you did it, and any insights you uncovered. The report should include text, graphics, summary tables, and statistics, as appropriate, to get your points across. In particular, I would appreciate reports that try to contextualize their data analyses for me. Don’t just give me results; also tell me why those results are important, and what follow-on questions might be fruitful.

If you use a notebook environment for your analysis, like Jupyter Notebooks, Google Colab, or Observable Notebooks, you may also export that to a PDF, instead of writing a separate report. However, note that your notebook will be held to the same presentation standards as others’ reports, so be sure to organise code well, use markdown cells for headings and prose, and don’t leave stale code and comments lying around.

Your report should also contain your answers to the questions from Part 1.

2. Any code or artifacts you created.

If you wrote code, also submit a zip file containing your code along with a README.md file telling me how to run it (e.g., dependencies needed and how to install them). The code should be reasonably well-documented. If you’re using a Jupyter or Colab notebook, scrub the output from the .ipynb file before turning it in. This is useful because I will often mess with parts of your submission if I get curious about something in your analysis.

Groups

Work on this project in your workshop groups.

Rubric

Task 1 (9 points)

Question 1

Question 2

Question 3

Task 2 (11 points)

This is more open-ended. Contributions will be assessed based on the following criteria.

  1. These are the required questions, but you’re obviously free to explore the dataset further.