CSC 313 Educational data mining assignment
This is an open-ended programming assignment. You will be provided with a data-set of 62,223 submissions from students on 50 different small programming exercises, along with final exam scores for most of those students.
The core task here is a data mining challenge. Can you come up with ways to model student programming behaviour that help predict their final exam scores?
You will work on this in groups.
The dataset
This dataset contains 62,223 programming problem submissions (in Java) from students in a CS 1 course, along with their final grades in that course. Students were allowed to make as many submissions to the programming problems as they wanted.
Obtaining the data
- First, go to DataShop (https://pslcdatashop.web.cmu.edu/login?Submit=Log+in) and create an account. (Note that the institutional Cal Poly login doesn’t seem to work, so you’ll just have to log in with your email or Google/Github/etc.)
- Next, go to this webpage (https://pslcdatashop.web.cmu.edu/Files?datasetId=3458)
- Click on the link that says Files (6) — this will take you to a page where multiple data-sets are available. Download the dataset titled 2nd CSEDM Data Challenge - S19 All Data v1.0
Understanding the data
The dataset is organised as follows:
Root/
    Data/
        MainTable.csv
        Metadata.csv
        CodeStates/
            CodeState.csv
        LinkTables/
            Subject.csv
    early.csv
    late.csv
The important pieces are as follows.
Data/MainTable.csv
This table contains programming process event data.
Each row of the table represents a single event: e.g., a program submission, a compilation result, or a compilation error.
The Run.Program events are initiated by one student submitting a given problem, while the Compile and Compile.Error events are generated by the grading system. The following attributes are defined for each row:
Primary Attributes:
- SubjectID: A unique ID for each participant in the study.
- AssignmentID: An ID for the assignment a student is working on. One assignment may include multiple problems.
- ProblemID: An ID for the problem being worked on by the student.
- CodeStateID: An ID for the student’s code at the time of the event. This corresponds to one row in the Data/CodeStates/CodeState.csv file, which contains the text of the code.
- EventType: The type of event:
  - Run.Program indicates that the student submitted the program and attempted to run it against the test cases; the resulting percentage score of the run is shown in the Score column.
  - Compile indicates the program was compiled; whether the compilation succeeded or failed is shown in the Compile.Result column.
  - Compile.Error indicates that the compilation failed; the error messages are available in the CompileMessageType and CompileMessageData columns.
- Score: The score the student got on the problem. Only Run.Program events have scores.
- ServerTimestamp: The time of the event. Note that the timestamps in this dataset are in Eastern Time.
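As a minimal sketch of loading the event log and inspecting the attributes above (assuming pandas, which is listed under Resources; adjust paths to wherever you unzip the dataset):

```python
import pandas as pd

main = pd.read_csv("Data/MainTable.csv")

# How many events of each type are there?
print(main["EventType"].value_counts())

# Run.Program events carry a score; peek at a few of them.
runs = main[main["EventType"] == "Run.Program"]
print(runs[["SubjectID", "AssignmentID", "ProblemID", "Score", "ServerTimestamp"]].head())
```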
Data/CodeStates/CodeState.csv
This file contains a mapping from each CodeStateID in the MainTable to the source code it represents.
Since these were small programming exercises, the source code is stored directly as entries in this table.
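If you want to look at actual submissions, one way is to join the two tables on CodeStateID, roughly as sketched below (assuming pandas); the name of the source-code column in CodeState.csv is an assumption, so check the file header first.

```python
import pandas as pd

main = pd.read_csv("Data/MainTable.csv")
code_states = pd.read_csv("Data/CodeStates/CodeState.csv")

# Attach the submitted source to each event via CodeStateID.
events_with_code = main.merge(code_states, on="CodeStateID", how="left")

# "Code" is an assumed column name -- inspect code_states.columns to confirm.
print(events_with_code.iloc[0]["Code"])
```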
early.csv
This table contains one row for each combination of SubjectID and ProblemID for the first 30 Problems (the first 3 Assignments). The name “early” comes from the fact that they were assigned early in the term.
- SubjectID: The unique ID of the student attempting the problem.
- AssignmentID / ProblemID: The IDs of the assignment and problem being attempted.
- Attempts: The number of attempts the student made on the problem before either getting it right for the first time, or giving up without getting it right.
- CorrectEventually: TRUE if the student eventually got the problem fully correct (Score = 1), and FALSE if they never submitted a correct solution.
- Label: Whether the student was successful (TRUE) or struggled (FALSE) on this problem, i.e., whether they took more attempts than most other students.
Note: Attempts, CorrectEventually, and Label are provided for convenience and transparency, but these values can also be derived from the MainTable.csv.
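A rough sketch of how Attempts and CorrectEventually might be rederived from MainTable.csv, assuming pandas and counting each Run.Program event as one attempt (an assumption worth checking against the provided columns):

```python
import pandas as pd

main = pd.read_csv("Data/MainTable.csv")
runs = (main[main["EventType"] == "Run.Program"]
        .sort_values("ServerTimestamp"))

def summarise(group):
    correct = (group["Score"] == 1).to_numpy()
    if correct.any():
        # Submissions up to and including the first fully correct one.
        attempts = int(correct.argmax()) + 1
    else:
        attempts = len(group)
    return pd.Series({"Attempts": attempts, "CorrectEventually": bool(correct.any())})

derived = (runs.groupby(["SubjectID", "ProblemID"])
           .apply(summarise)
           .reset_index())
print(derived.head())
```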
late.csv
Same as early.csv but for the last 20 problems (2 assignments) in the data-set.
Data/LinkTables/Subject.csv
Contains a mapping from each SubjectID in the MainTable to that student’s final grade in the class (the X-Grade column).
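A short sketch of joining process data with the final grades, assuming pandas; here each student’s average Run.Program score is attached to their X-Grade as an illustration.

```python
import pandas as pd

main = pd.read_csv("Data/MainTable.csv")
subjects = pd.read_csv("Data/LinkTables/Subject.csv")

# Average Run.Program score per student, joined with their final grade.
avg_score = (main[main["EventType"] == "Run.Program"]
             .groupby("SubjectID")["Score"].mean()
             .rename("AvgScore")
             .reset_index())

merged = subjects.merge(avg_score, on="SubjectID", how="left")
print(merged[["SubjectID", "X-Grade", "AvgScore"]].head())
```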
Core task
This assignment is split into two parts. The first is aimed at getting you familiar with the dataset. The second is more open-ended.
First task: exploratory data analysis
Answer the following questions using this dataset.[1] Use any data analytics toolset with which you’re comfortable (or take this opportunity to add a new tool to your toolbelt). Each question includes some information that should be helpful in answering it.
Question 1. How many unique students are represented in this data-set?
You will probably want to look at data in either Data/MainTable.csv or Data/LinkTables/Subject.csv for this. Why do you suppose your answers would be different depending on which one you looked in?
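A minimal sketch for counting students in each file, assuming pandas; it doesn’t answer the “why are they different” part, which is for you to reason about.

```python
import pandas as pd

main = pd.read_csv("Data/MainTable.csv")
subjects = pd.read_csv("Data/LinkTables/Subject.csv")

print(main["SubjectID"].nunique())      # students with at least one logged event
print(subjects["SubjectID"].nunique())  # students with a recorded final grade
```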
Question 2. The data-set contains many Assignments, and each Assignment is made up of multiple Problems. Which Problem had the highest average number of attempts per student?
You’ll need to look in both early.csv and late.csv to answer this question, or compute it directly from the MainTable.
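One possible way to compute this, assuming pandas and that early.csv and late.csv sit at the dataset root (adjust the paths if not):

```python
import pandas as pd

early = pd.read_csv("early.csv")
late = pd.read_csv("late.csv")
both = pd.concat([early, late], ignore_index=True)

# Average attempts per student for each problem, highest first.
avg_attempts = (both.groupby("ProblemID")["Attempts"]
                .mean()
                .sort_values(ascending=False))
print(avg_attempts.head())
```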
Question 3. On which Problem did students tend to face the most compiler errors?
First, how do you interpret this question? Am I asking you to:
- compute, for each problem, the average number of compiler errors across all students (using some measure of central tendency), and then report the maximum, or
- compute the total number of compiler errors faced on each problem, and report the maximum
Which one would be more meaningful? Why?
You’ll need to look at the events in Data/MainTable.csv to answer this.
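A sketch of both interpretations, assuming pandas; note that interpretation (a) raises its own question of whether students with zero compiler errors on a problem should count toward the average.

```python
import pandas as pd

main = pd.read_csv("Data/MainTable.csv")
errors = main[main["EventType"] == "Compile.Error"]

# (a) Average number of compiler errors per student on each problem.
#     As written, this only averages over students who hit at least one error.
per_student = errors.groupby(["ProblemID", "SubjectID"]).size()
avg_per_problem = per_student.groupby("ProblemID").mean()
print(avg_per_problem.sort_values(ascending=False).head())

# (b) Total number of compiler errors on each problem.
total_per_problem = errors.groupby("ProblemID").size()
print(total_per_problem.sort_values(ascending=False).head())
```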
As a general tip, I like to start with tools that let me quickly make charts “on demand” to help me eyeball the data.
Second task: open-ended analysis
Obtain some insights or create something interesting using this data-set.
If it’s relevant to your analysis, this spreadsheet contains the programming problems, prompts, and estimates of which programming concepts were used in each problem.
For example:
- This may be obvious, but is average programming problem performance correlated with final exam score? If it’s not, then there’s a good chance that the problems and the exam are assessing different things, which may be a problem. Is the average number of attempts it takes to solve a problem correlated with final exam score?
- Can you use programming process data from the MainTable to develop a model to help predict final exam scores? This can potentially help identify students who are struggling. Even better, can you develop this model only using data from early in the term (i.e., only using problems that appear in early.csv)? If we can predict final exam performance from early assignments, we could help prevent course withdrawals. (A minimal starting-point sketch appears after this list.)
- Can you use programming process data from the early problems to predict if a student will struggle on some late problem? (“Struggle” is the Label column in the early and late tables.)
- Can you cluster students based on their programming process? You can base this on the number of attempts they take to solve problems, the number or type of compiler errors they run into, their patterns of submission (i.e., do they tend to work at night or during the day?), etc.
- Can you create an insightful visualisation that tells me something about how students are performing and which problems they struggle on? Note: you don’t have to write code to do this; you can process the data however you like and use something like Tableau to help with this.
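As a minimal starting-point sketch for the prediction idea above (assuming pandas and scipy, both listed under Resources): build a couple of per-student features from the early problems, join them to the final grades, and see how strongly a single feature relates to X-Grade. The feature names here are my own choices, and a real submission would go well beyond a single-variable regression.

```python
import pandas as pd
from scipy import stats

early = pd.read_csv("early.csv")
subjects = pd.read_csv("Data/LinkTables/Subject.csv")

# Per-student features computed from the early problems only.
features = early.groupby("SubjectID").agg(
    MeanAttempts=("Attempts", "mean"),
    FracCorrect=("CorrectEventually", "mean"),
).reset_index()

data = features.merge(subjects, on="SubjectID", how="inner").dropna()

# How well does mean attempts alone relate to the final grade?
result = stats.linregress(data["MeanAttempts"], data["X-Grade"])
print(result.slope, result.rvalue, result.pvalue)
```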
Resources
Here are a number of resources that might help you analyse your data.
- The ProgSnap2 data format, a standardised format for programming snapshot data developed by a group of computing education researchers.
- The specification provides a detailed description of the different required or optional fields.
- You’re not expected to read the whole document. But you can use it as a reference to help you make sense of what the different files and columns mean.
- R (a programming language for data analysis) and RStudio (the accompanying IDE).
- Python and accompanying packages for data analysis: pandas, numpy, and if you’re doing some statistical tests that are not available through pandas, scipy.stats
- Tableau for visual data analysis—this is a remarkably powerful tool that has a free version
- VSCode or Jupyter for interactive analysis with Python
- If you’re interested in analysing the actual contents of students’ code submissions, you can use something like javaparser (or javalang if you want to do it in Python) to parse the program snapshots.
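If you do go down the javalang route, a small sketch of parsing a single submission follows. The "Code" column name is an assumption about CodeState.csv, and if submissions turn out to be bare method bodies rather than full classes, you may need to wrap them in a dummy class before parsing.

```python
import javalang
import pandas as pd

code_states = pd.read_csv("Data/CodeStates/CodeState.csv")
source = code_states.iloc[0]["Code"]  # "Code" is an assumed column name

# Parse the submission and list the method names it declares.
tree = javalang.parse.parse(source)
for path, node in tree.filter(javalang.tree.MethodDeclaration):
    print(node.name)
```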
Deliverable
Turn in up to two items in Canvas.
1. A PDF report describing your findings.
It should detail what you did, why you did it, and any insights you uncovered. The report should include text, graphics, summary tables, and statistics, as appropriate, to get your points across. In particular, I would appreciate reports that try to contextualise their data analyses for me. Don’t just give me results; also tell me why those results are important, and what follow-on questions might be fruitful.
If you use a notebook environment for your analysis, like Jupyter Notebooks, Google Colab, or Observable Notebooks, you may export that to a PDF instead of writing a separate report. However, note that your notebook will be held to the same presentation standards as others’ reports, so be sure to organise code well, use markdown cells for headings and prose, and don’t leave stale code and comments lying around.
Your report should also contain your answers to the questions from Part 1.
2. Any code or artifacts you created.
If you wrote code, also submit a zip file containing your code along with a README.md file telling me how to run it (what dependencies are needed, how to install them, etc.).
The code should be reasonably well-documented.
If you’re using a Jupyter Notebook or Colab notebook, scrub the output from the .ipynb file before turning it in (a short sketch for doing this appears below).
This is useful because I will often mess with parts of your submission if I get curious about something in your analysis.
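One way to scrub notebook outputs, assuming the nbformat package that ships with Jupyter (the filename here is a placeholder); recent versions of nbconvert should also be able to do this with jupyter nbconvert --clear-output --inplace.

```python
import nbformat

nb = nbformat.read("analysis.ipynb", as_version=4)  # placeholder filename
for cell in nb.cells:
    if cell.cell_type == "code":
        cell.outputs = []
        cell.execution_count = None
nbformat.write(nb, "analysis.ipynb")
```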
Groups
Work on this project in your workshop groups.
Rubric
Task 1 (9 points)
Question 1
- Correct numerical answer (1 point)
- Explanation for why the answers might be different depending on the data-set one looks in (2 points)
Question 2
- Correct numerical answer (1 point)
- Strategy for obtaining the answer (2 points)
Question 3
- Which interpretation is more meaningful and why (2 points)
- Correct numerical answer (1 point)
Task 2 (11 points)
This is more open-ended. Contributions will be assessed based on the following criteria.
- (2 points) Contribution: Is there a clear contribution in the submission? This can be an important insight, a metric, a method of assessing some aspect of students’ programming process, etc. Is there a discussion or statement of why this submission is useful or novel?
- (3 points) Significance: Submissions can take many forms, but they ought to be substantial units of work. For example, a contribution might be a ProgSnap2-compatible implementation of a chosen programming process metric from the CS Ed literature. It may be a model to predict final grades or struggle on later programming problems. It may be a series of visualisations (a “dashboard”) geared toward helping an instructor understand a given student’s process, or where the whole class stands. Whatever it is, it should be substantial.
- (3 points) Evaluation: The submission ought to contain an evaluation of its main contributions. If it’s a model, how good is it? If it’s a research question that was answered, how soundly was it answered? Were there any limitations?
- (3 points) Presentation: Overall presentation of the accompanying report.
[1] These are the required questions, but you are of course free to explore the data-set further.