CSC 313 Educational data mining assignment
This is an open-ended programming assignment. You will be provided with a data-set of 62,223 submissions from students on 50 different small programming exercises, along with final exam scores for most of those students.
The core task here is a data mining challenge. Can you come up with ways to model student programming behaviour in a way that helps predict their final exam scores?
You will work on this in groups.
The dataset
This dataset contains 62,223 programming problem submissions (in Java) from students in a CS 1 course, along with their final grades in that course. Students were allowed to make as many submissions to the programming problems as they wanted.
Obtaining the data
- First, go to DataShop (https://pslcdatashop.web.cmu.edu/login?Submit=Log+in) and create an account. (Note the institutional Cal Poly login doesn’t seem to work, so you’ll just have to login in with your email or Google/Github/etc.)
- Next, go to this webpage (https://pslcdatashop.web.cmu.edu/Files?datasetId=3458)
- Click on the link that says Files (6) — this will take you to a page where multiple data-sets are available. Download the dataset titled 2nd CSEDM Data Challenge - S19 All Data v1.0
Understanding the data
I have summarised the important information below. The dataset is organised as follows:
```
Root/
    early.csv
    late.csv
    Data/
        MainTable.csv
        Metadata.csv
        CodeStates/
            CodeState.csv
        LinkTables/
            Subject.csv
```
The important pieces are as follows.
Data/MainTable.csv
Contains programming process event data. Each row of the table represents an event: e.g. one program submission, a compilation result, and any errors. The Run.Program events are initiated by a student submitting a given problem, while the Compile and Compile.Error events are generated by the grading system. The following attributes are defined for each row:
Primary Attributes:
- SubjectID: A unique ID for each participant in the study.
- AssignmentID: An ID for the assignment a student is working on. One assignment may include multiple problems.
- ProblemID: An ID for the problem being worked on by the student.
- CodeStateID: An ID for the student's code at the time of the event. This corresponds to one row in the Data/CodeStates/CodeState.csv file, which contains the text of the code.
- EventType:
- Run.Program indicates that the student submitted the program and attempted to run the code and get feedback from test cases. The percentage score of the run is given in the Score column.
- Compile indicates that the program was compiled; whether the compilation succeeded or failed is shown in the Compile.Result column.
- Compile.Error indicates that the compilation failed; the error messages are available in the CompileMessageType and CompileMessageData columns.
- Score: The score the student got on the problem. Only Run.Program events will have scores.
- ServerTimestamp: The time of the event.
- ServerTimezone: The server's timezone offset, relative to US Eastern Time or UTC. Here it is always 0, i.e., UTC.
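If you want a quick first feel for this table, a pandas sketch along the following lines will show the mix of event types. It assumes you run it from the dataset root, so the relative path resolves, and that the column names match the description above.

```python
import pandas as pd

main = pd.read_csv("Data/MainTable.csv")

# How many events of each type are there?
print(main["EventType"].value_counts())

# Run.Program events carry a Score; peek at a few of them.
runs = main[main["EventType"] == "Run.Program"]
print(runs[["SubjectID", "ProblemID", "Score", "ServerTimestamp"]].head())
```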
Data/CodeStates/CodeState.csv
This file contains a mapping from each CodeStateID in the MainTable to the source code it represents.
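To attach the actual source code to each event, you can join the two tables on CodeStateID. A rough sketch follows; the column holding the source text is assumed here to be called Code, so check the file's actual header before relying on it.

```python
import pandas as pd

main = pd.read_csv("Data/MainTable.csv")
code_states = pd.read_csv("Data/CodeStates/CodeState.csv")

# Attach the submitted source code to each event via CodeStateID.
# NOTE: the source-text column name "Code" is an assumption.
events_with_code = main.merge(code_states, on="CodeStateID", how="left")
submissions = events_with_code[events_with_code["EventType"] == "Run.Program"]
print(submissions[["SubjectID", "ProblemID", "Code"]].head())
```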
early.csv
This table contains one row for each combination of SubjectID and ProblemID for the first 30 Problems (the first 3 Assignments). The name “early” comes from the fact that they were assigned early in the term.
- SubjectID: The unique ID of the student attempting the problem.
- AssignmentID / ProblemID: The IDs of the assignment and problem being attempted.
- Attempts: The number of attempts the student made on the problem before either getting it right for the first time, or giving up without getting it right.
- CorrectEventually: This will be TRUE if the student eventually got the problem fully correct (Score = 1), and FALSE if they never submitted a correct solution.
- Note: Attempts and CorrectEventually are provided for convenience and transparency, but these values can be calculated from MainTable.csv (a sketch of how to do this appears after this list).
- Label: Whether the student was successful (TRUE) on this problem or struggled (FALSE), i.e., took more attempts than most other students.
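As noted above, Attempts and CorrectEventually can be re-derived from the event log. One possible sketch is below; it assumes Run.Program scores are on a 0 to 1 scale and counts every run up to and including the first fully correct one, so verify the exact convention against early.csv itself.

```python
import pandas as pd

main = pd.read_csv("Data/MainTable.csv")
runs = (main[main["EventType"] == "Run.Program"]
        .sort_values("ServerTimestamp")
        .copy())

# Number each student's submissions to a given problem in time order.
runs["attempt_number"] = runs.groupby(["SubjectID", "ProblemID"]).cumcount() + 1
runs["is_correct"] = runs["Score"] == 1  # assumes Score is on a 0-1 scale

total_runs = runs.groupby(["SubjectID", "ProblemID"])["attempt_number"].max()
first_correct = (runs[runs["is_correct"]]
                 .groupby(["SubjectID", "ProblemID"])["attempt_number"].min())

derived = total_runs.to_frame("TotalRuns").join(first_correct.rename("FirstCorrect"))
derived["CorrectEventually"] = derived["FirstCorrect"].notna()
# If the student never got it right, every run counts as an attempt; otherwise
# count runs up to (and including) the first fully correct one. The exact
# counting convention should be checked against early.csv.
derived["Attempts"] = derived["FirstCorrect"].fillna(derived["TotalRuns"]).astype(int)
print(derived.reset_index().head())
```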
late.csv
Same as early.csv, but for the last 20 problems (2 assignments) in the data-set.
Data/LinkTables/Subject.csv
Contains a mapping from each SubjectID in the MainTable to the students' final grades in this class (X-Grade).
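A small sketch of joining this table to per-student features, assuming the grade column is literally named X-Grade as described:

```python
import pandas as pd

subjects = pd.read_csv("Data/LinkTables/Subject.csv")
early = pd.read_csv("early.csv")

# Average attempts per student on the early problems, joined to final grades.
per_student = (early.groupby("SubjectID")["Attempts"]
                    .mean()
                    .rename("MeanAttempts")
                    .reset_index())
merged = subjects.merge(per_student, on="SubjectID", how="left")
print(merged[["SubjectID", "X-Grade", "MeanAttempts"]].head())
```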
Core task
This assignment is split into two parts. The first is geared toward getting you familiar with the dataset. The second is more open-ended.
First task: exploratory data analysis
Answer the following questions from this dataset.1 Use any data analytics toolset with which you’re comfortable (or take this opportunity to add a new tool to your toolbelt). Each question includes some information that should be helpful in answering it.
Question 1. How many unique students are represented in this data-set?
You will probably want to look at data in either Data/MainTable.csv or Data/LinkTables/Subject.csv for this. Why do you suppose your answers would be different depending on which one you looked in?
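One way to start, assuming pandas and the paths shown in the directory layout above:

```python
import pandas as pd

main = pd.read_csv("Data/MainTable.csv")
subjects = pd.read_csv("Data/LinkTables/Subject.csv")

print("Unique SubjectIDs in MainTable:  ", main["SubjectID"].nunique())
print("Unique SubjectIDs in Subject.csv:", subjects["SubjectID"].nunique())
```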
Question 2. The data-set contains many Assignments, and each Assignment is made up of multiple Problems. Which Problem had the highest average number of attempts per student?
You’ll need to look in both early.csv and late.csv to answer this question.
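A possible sketch, assuming early.csv and late.csv share the schema described earlier:

```python
import pandas as pd

early = pd.read_csv("early.csv")
late = pd.read_csv("late.csv")
attempts = pd.concat([early, late], ignore_index=True)

# Average number of attempts per student for each problem, highest first.
avg_attempts = (attempts.groupby("ProblemID")["Attempts"]
                        .mean()
                        .sort_values(ascending=False))
print(avg_attempts.head())
```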
Question 3. On which Problem did students tend to face the most compiler errors?
First, how do you interpret this question? Am I asking you to:
- compute, for each problem, the average number of compiler errors across all students (using some measure of central tendency), and then report the maximum, or
- compute the total number of compiler errors faced on each problem, and report the maximum
Which one would be more meaningful? Why?
You’ll need to look at the events in Data/MainTable.csv to answer this.
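The sketch below computes both interpretations, assuming Compile.Error events are tagged exactly as described above. Note that the per-student average here only counts students who hit at least one error, which is itself an interpretive choice.

```python
import pandas as pd

main = pd.read_csv("Data/MainTable.csv")
errors = main[main["EventType"] == "Compile.Error"]

# Interpretation (b): total number of compiler errors on each problem.
total_per_problem = errors.groupby("ProblemID").size().sort_values(ascending=False)
print(total_per_problem.head())

# Interpretation (a): average number of compiler errors per student on each
# problem. As written this only counts students who hit at least one error;
# including zero-error students is a reasonable alternative.
per_student = errors.groupby(["ProblemID", "SubjectID"]).size()
avg_per_problem = per_student.groupby("ProblemID").mean().sort_values(ascending=False)
print(avg_per_problem.head())
```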
Second task: open-ended analysis
Obtain some insights or create something interesting using this data-set.
If it’s relevant to your analysis, this spreadsheet contains the programming problems, prompts, and estimates of which programming concepts were used in each problem.
For example:
- This may be obvious, but is average programming problem performance correlated with final exam score? If it’s not, then there’s a good chance that the problems and the exam are assessing different things, which may be a problem. Is the average number of attempts it takes to solve a problem correlated with final exam score? (A rough sketch of this kind of check appears after this list.)
- Can you use programming process data from the MainTable to develop a model to help predict final exam scores? This can potentially help identify students who are struggling. Even better, can you develop this model only using data from early in the term (i.e., only using problems that appear in early.csv)? If we can predict final exam performance from early assignments, we could help prevent course withdrawals.
- Can you use programming process data from the early problems to predict if the student will struggle in some late problem? (“Struggle” is the Label column in the early and late tables.)
- Can you cluster students based on their programming process? You can base this on the number of attempts they take to solve problems, the number or type of compiler errors they run into, their patterns of submission (i.e., do they tend to work at night or during the day?), etc.
- Can you create an insightful visualisation that tells me something about how students are performing and what problems they struggle on? Note, you don’t have to write code to do this; you can process the data however you like and use something like Tableau to help with this.
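As a concrete starting point for the first bullet above, here is a rough correlation sketch. It assumes the file layout described earlier, that X-Grade is the final exam score, and that the TRUE/FALSE values in CorrectEventually are parsed as booleans; treat it as a sketch, not a finished analysis.

```python
import pandas as pd
from scipy import stats

early = pd.read_csv("early.csv")
late = pd.read_csv("late.csv")
subjects = pd.read_csv("Data/LinkTables/Subject.csv")

problems = pd.concat([early, late], ignore_index=True)
per_student = problems.groupby("SubjectID").agg(
    mean_attempts=("Attempts", "mean"),
    frac_correct=("CorrectEventually", "mean"),  # assumes TRUE/FALSE parse as booleans
).reset_index()

# Keep only students who have a final exam score.
merged = per_student.merge(subjects, on="SubjectID", how="inner")
merged = merged.dropna(subset=["X-Grade", "mean_attempts", "frac_correct"])

for feature in ["mean_attempts", "frac_correct"]:
    r, p = stats.pearsonr(merged[feature], merged["X-Grade"])
    print(f"{feature}: r = {r:.2f}, p = {p:.3g}")
```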
Resources
Here are a number of resources that might help you analyse your data.
- The ProgSnap2 data format, a standardised data format for programming snapshot data developed by a group of computing education researchers.
- The specification provides a detailed description of the different required or optional fields.
- You’re not expected to read the whole document—rather, once you have the dataset, use it as a reference to help you make sense of what the different files and columns mean.
- R (a programming language for data analysis) and RStudio (the accompanying IDE).
- Python and accompanying packages for data analysis: pandas, numpy, and if you’re doing some statistical tests that are not available through pandas, scipy.stats
- Tableau for visual data analysis—this is a remarkably powerful tool that has a free version
- VSCode or Jupyter for interactive analysis with Python
- If you’re interested in analysing the actual contents of students’ code submissions, you can use something like javaparser (or javalang if you want to do it in Python) to parse the program snapshots.
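For instance, a javalang-based sketch along these lines could extract simple syntactic features from one snapshot. It assumes the source column in CodeState.csv is named Code, and note that snapshots that are not complete compilation units, or that do not compile, will fail to parse.

```python
import pandas as pd
import javalang  # pip install javalang

code_states = pd.read_csv("Data/CodeStates/CodeState.csv")
source = code_states["Code"].iloc[0]  # the "Code" column name is an assumption

try:
    tree = javalang.parse.parse(source)
    # Count a couple of simple syntactic features as an example.
    n_methods = sum(1 for _ in tree.filter(javalang.tree.MethodDeclaration))
    n_fors = sum(1 for _ in tree.filter(javalang.tree.ForStatement))
    print(f"methods: {n_methods}, for-loops: {n_fors}")
except (javalang.parser.JavaSyntaxError, javalang.tokenizer.LexerError):
    # Snapshots that do not compile, or that are not full compilation units,
    # will not parse.
    print("Snapshot could not be parsed.")
```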
Deliverable
There are two deliverables:
- For the first task, enter into Canvas your answers to the questions, as well as an English-language description of your strategy for obtaining those answers.
- For the second task, turn in a zipped folder, which depending on what you do should contain:
  - If you wrote code, submit the code along with a README file indicating how to run the code. Naturally, the code should be reasonably well-documented. (If you’re using Jupyter Notebooks, either export the code to a .py file or scrub the output from the .ipynb file that you turn in.)
  - Also turn in a PDF report detailing what you did and any insights you uncovered. If you’re turning in visuals, include those as images in the report.
Groups
I will let you self-organise into groups of 3–4 students for this assignment. In Canvas, when you submit, make sure to choose your group before submitting, and make sure that all students’ names are included in the submission.
Rubric
Task 1 (9 points)
Question 1
- Correct numerical answer (1 point)
- Explanation for why the answers might be different depending on the data-set one looks in (2 points)
Question 2
- Correct numerical answer (1 point)
- Strategy for obtaining the answer (2 points)
Question 3
- Which interpretation is more meaningful and why (2 points)
- Correct numerical answer (1 point)
Task 2 (11 points)
This is more open-ended. Contributions will be assessed based on the following criteria.
- (2 points) Contribution: Is there a clear contribution in the submission? This can be an important insight, a metric, a method of assessing some aspect of students’ programming process, etc. Is there a discussion or statement of why this submission is useful or novel?
- (3 points) Significance: Submissions can take many forms, but they ought to be substantial units of work. For example, a contribution might be a ProgSnap2-compatible implementation of a chosen programming process metric from the CS Ed literature. It may be a model to predict final grades or struggle on later programming problems. It may be a series of visualisations (a “dashboard”) geared toward helping an instructor understand a given student’s process, or where the whole class stands. Whatever it is, it should be substantial.
- (3 points) Evaluation: The submission ought to contain an evaluation of its main contributions. If it’s a model, how good is it? If it’s a research question that was answered, how soundly was it answered? Were there any limitations?
- (3 points) Presentation: Overall presentation of the accompanying report.
1. These are the required questions, but you’re obviously free to explore the data-set further. For example, I like to make barplots and boxplots to help me “eyeball” the data when I am starting to analyse a data-set. ↩