CSC 313 Educational data mining assignment
This is an open-ended programming assignment. You will be provided with a data-set of 62,223 submissions from students on 50 different small programming exercises, along with final exam scores for most of those students.
The core task here is a data mining challenge. Can you come up with ways to model student programming behaviour in a way that helps predict their final exam scores?
You will work on this in groups.
The dataset
This dataset contains 62,223 programming problem submissions (in Java) from students in a CS 1 course, along with their final grades in that course. Students were allowed to make as many submissions to the programming problems as they wanted.
Obtaining the data
- First, go to DataShop (https://pslcdatashop.web.cmu.edu/login?Submit=Log+in) and create an account. (Note the institutional Cal Poly login doesn’t seem to work, so you’ll just have to login in with your email or Google/Github/etc.)
- Next, go to this webpage (https://pslcdatashop.web.cmu.edu/Files?datasetId=3458)
- Click on the link that says Files (6) — this will take you to a page where multiple data-sets are available. Download the dataset titled 2nd CSEDM Data Challenge - S19 All Data v1.0
Understanding the data
I have summarised the important information below. The dataset is organised as follows:
```
Root/
    early.csv
    late.csv
    Data/
        MainTable.csv
        Metadata.csv
        CodeStates/
            CodeState.csv
        LinkTables/
            Subject.csv
```
The important pieces are as follows.
Data/MainTable.csv
Contains programming process event data. Each row of the table represents an event: e.g. one program submission, a compilation result, and any errors. The Run.Program events are initiated by a student submitting a given problem, while the Compile and Compile.Error events are generated by the grading system. The following attributes are defined for each row:
Primary Attributes:
- SubjectID: A unique ID for each participant in the study.
- AssignmentID: An ID for the assignment a student is working on. One assignment may include multiple problems.
- ProblemID: An ID for the problem being worked on by the student.
- CodeStateID: An ID for the student's code at the time of the event. This corresponds to one row in the Data/CodeStates/CodeState.csv file, which contains the text of the code.
- EventType:
- Run.Program indicates that the student submitted the program and attempted to run the code and get feedback from test cases. The percentage score of the run is given in the Score column.
- Compile indicates that the program was compiled; whether the compilation succeeded or failed is shown in the Compile.Result column.
- Compile.Error indicates that the compilation failed; the error messages are available in the CompileMessageType and CompileMessageData columns.
- Score: The score the student got on the problem. Only Run.Program events will have scores.
- ServerTimestamp: The time of the event.
- ServerTimezone: The server's timezone offset, relative to US Eastern Time or UTC. Here it is always 0, i.e., UTC.
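If you want a quick first feel for this table, a pandas sketch along the following lines will show the mix of event types. It assumes you run it from the dataset root, so the relative path resolves, and that the column names match the description above.

```python
import pandas as pd

main = pd.read_csv("Data/MainTable.csv")

# How many events of each type are there?
print(main["EventType"].value_counts())

# Run.Program events carry a Score; peek at a few of them.
runs = main[main["EventType"] == "Run.Program"]
print(runs[["SubjectID", "ProblemID", "Score", "ServerTimestamp"]].head())
```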
Data/CodeStates/CodeState.csv
This file contains a mapping from each CodeStateID in the MainTable to the source code it represents.
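To attach the actual source code to each event, you can join the two tables on CodeStateID. A rough sketch follows; the column holding the source text is assumed here to be called Code, so check the file's actual header before relying on it.

```python
import pandas as pd

main = pd.read_csv("Data/MainTable.csv")
code_states = pd.read_csv("Data/CodeStates/CodeState.csv")

# Attach the submitted source code to each event via CodeStateID.
# NOTE: the source-text column name "Code" is an assumption.
events_with_code = main.merge(code_states, on="CodeStateID", how="left")
submissions = events_with_code[events_with_code["EventType"] == "Run.Program"]
print(submissions[["SubjectID", "ProblemID", "Code"]].head())
```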
early.csv
This table contains one row for each combination of SubjectID and ProblemID for the first 30 Problems (the first 3 Assignments). The name “early” comes from the fact that they were assigned early in the term.
- SubjectID: The unique ID of the student attempting the problem.
- AssignmentID / ProblemID: The IDs of the assignment and problem being attempted.
- Attempts: The number of attempts the student made on the problem before either getting it right for the first time, or giving up without getting it right.
- CorrectEventually: This will be TRUE if the student eventually got the problem fully correct (Score = 1), and FALSE if they never submitted a correct solution.
- Note: Attempts and CorrectEventually are provided for convenience and transparency, but these values can be calculated from MainTable.csv (a sketch of how to do this appears after this list).
- Label: Whether the student was successful (TRUE) on this problem or struggled (FALSE), i.e., took more attempts than most other students.
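As noted above, Attempts and CorrectEventually can be re-derived from the event log. One possible sketch is below; it assumes Run.Program scores are on a 0 to 1 scale and counts every run up to and including the first fully correct one, so verify the exact convention against early.csv itself.

```python
import pandas as pd

main = pd.read_csv("Data/MainTable.csv")
runs = (main[main["EventType"] == "Run.Program"]
        .sort_values("ServerTimestamp")
        .copy())

# Number each student's submissions to a given problem in time order.
runs["attempt_number"] = runs.groupby(["SubjectID", "ProblemID"]).cumcount() + 1
runs["is_correct"] = runs["Score"] == 1  # assumes Score is on a 0-1 scale

total_runs = runs.groupby(["SubjectID", "ProblemID"])["attempt_number"].max()
first_correct = (runs[runs["is_correct"]]
                 .groupby(["SubjectID", "ProblemID"])["attempt_number"].min())

derived = total_runs.to_frame("TotalRuns").join(first_correct.rename("FirstCorrect"))
derived["CorrectEventually"] = derived["FirstCorrect"].notna()
# If the student never got it right, every run counts as an attempt; otherwise
# count runs up to (and including) the first fully correct one. The exact
# counting convention should be checked against early.csv.
derived["Attempts"] = derived["FirstCorrect"].fillna(derived["TotalRuns"]).astype(int)
print(derived.reset_index().head())
```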
late.csv
Same as early.csv, but for the last 20 problems (2 assignments) in the data-set.
Data/LinkTables/Subject.csv
Contains a mapping from each SubjectID in the MainTable to the students' final grades in this class (X-Grade).
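A small sketch of joining this table to per-student features, assuming the grade column is literally named X-Grade as described:

```python
import pandas as pd

subjects = pd.read_csv("Data/LinkTables/Subject.csv")
early = pd.read_csv("early.csv")

# Average attempts per student on the early problems, joined to final grades.
per_student = (early.groupby("SubjectID")["Attempts"]
                    .mean()
                    .rename("MeanAttempts")
                    .reset_index())
merged = subjects.merge(per_student, on="SubjectID", how="left")
print(merged[["SubjectID", "X-Grade", "MeanAttempts"]].head())
```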
Core task
This assignment is split into two parts. The first is geared toward getting you familiar with the dataset. The second is more open-ended.
First task: exploratory data analysis
Answer the following questions from this dataset.1 Use any data analytics toolset with which you’re comfortable (or take this opportunity to add a new tool to your toolbelt). Each question includes some information that should be helpful in answering it.
Question 1. How many unique students are represented in this data-set?
You will probably want to look at data in either Data/MainTable.csv or Data/LinkTables/Subject.csv for this. Why do you suppose your answers would be different depending on which one you looked in?
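One way to start, assuming pandas and the paths shown in the directory layout above:

```python
import pandas as pd

main = pd.read_csv("Data/MainTable.csv")
subjects = pd.read_csv("Data/LinkTables/Subject.csv")

print("Unique SubjectIDs in MainTable:  ", main["SubjectID"].nunique())
print("Unique SubjectIDs in Subject.csv:", subjects["SubjectID"].nunique())
```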
Question 2. The data-set contains many Assignments, and each Assignment is made up of multiple Problems. Which Problem had the highest average number of attempts per student?
You’ll need to look in both early.csv and late.csv to answer this question.
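A possible sketch, assuming early.csv and late.csv share the schema described earlier:

```python
import pandas as pd

early = pd.read_csv("early.csv")
late = pd.read_csv("late.csv")
attempts = pd.concat([early, late], ignore_index=True)

# Average number of attempts per student for each problem, highest first.
avg_attempts = (attempts.groupby("ProblemID")["Attempts"]
                        .mean()
                        .sort_values(ascending=False))
print(avg_attempts.head())
```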
Question 3. On which Problem did students tend to face the most compiler errors?
First, how do you interpret this question? Am I asking you to:
- compute, for each problem, the average number of compiler errors across all students (using some measure of central tendency), and then report the maximum, or
- compute the total number of compiler errors faced on each problem, and report the maximum
Which one would be more meaningful? Why?
You’ll need to look at the events in Data/MainTable.csv to answer this.
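The sketch below computes both interpretations, assuming Compile.Error events are tagged exactly as described above. Note that the per-student average here only counts students who hit at least one error, which is itself an interpretive choice.

```python
import pandas as pd

main = pd.read_csv("Data/MainTable.csv")
errors = main[main["EventType"] == "Compile.Error"]

# Interpretation (b): total number of compiler errors on each problem.
total_per_problem = errors.groupby("ProblemID").size().sort_values(ascending=False)
print(total_per_problem.head())

# Interpretation (a): average number of compiler errors per student on each
# problem. As written this only counts students who hit at least one error;
# including zero-error students is a reasonable alternative.
per_student = errors.groupby(["ProblemID", "SubjectID"]).size()
avg_per_problem = per_student.groupby("ProblemID").mean().sort_values(ascending=False)
print(avg_per_problem.head())
```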
Second task: open-ended analysis
Obtain some insights or create something interesting using this data-set.
If it’s relevant to your analysis, this spreadsheet contains the programming problems, prompts, and estimates of which programming concepts were used in each problem.
For example:
- This may be obvious, but is average programming problem performance correlated with final exam score? If it’s not, then there’s a good chance that the problems and the exam are assessing different things, which may be a problem. Is the average number of attempts it takes to solve a problem correlated with final exam score? (A rough sketch of this kind of check appears after this list.)
- Can you use programming process data from the MainTable to develop a model to help predict final exam scores? This can potentially help identify students who are struggling. Even better, can you develop this model only using data from early in the term (i.e., only using problems that appear in early.csv)? If we can predict final exam performance from early assignments, we could help prevent course withdrawals.
- Can you use programming process data from the early problems to predict if the student will struggle in some late problem? (“Struggle” is the Label column in the early and late tables.)
- Can you cluster students based on their programming process? You can base this on the number of attempts they take to solve problems, the number or type of compiler errors they run into, their patterns of submission (i.e., do they tend to work at night or during the day?), etc.
- Can you create an insightful visualisation that tells me something about how students are performing and what problems they struggle on? Note, you don’t have to write code to do this; you can process the data however you like and use something like Tableau to help with this.
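As a concrete starting point for the first bullet above, here is a rough correlation sketch. It assumes the file layout described earlier, that X-Grade is the final exam score, and that the TRUE/FALSE values in CorrectEventually are parsed as booleans; treat it as a sketch, not a finished analysis.

```python
import pandas as pd
from scipy import stats

early = pd.read_csv("early.csv")
late = pd.read_csv("late.csv")
subjects = pd.read_csv("Data/LinkTables/Subject.csv")

problems = pd.concat([early, late], ignore_index=True)
per_student = problems.groupby("SubjectID").agg(
    mean_attempts=("Attempts", "mean"),
    frac_correct=("CorrectEventually", "mean"),  # assumes TRUE/FALSE parse as booleans
).reset_index()

# Keep only students who have a final exam score.
merged = per_student.merge(subjects, on="SubjectID", how="inner")
merged = merged.dropna(subset=["X-Grade", "mean_attempts", "frac_correct"])

for feature in ["mean_attempts", "frac_correct"]:
    r, p = stats.pearsonr(merged[feature], merged["X-Grade"])
    print(f"{feature}: r = {r:.2f}, p = {p:.3g}")
```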
Resources
Here are a number of resources that might help you analyse your data.
- The ProgSnap2 data format, a standardised data format for programming snapshot data developed by a group of computing education researchers.
- The specification provides a detailed description of the different required or optional fields.
- You’re not expected to read the whole document—rather, once you have the dataset, use it as a reference to help you make sense of what the different files and columns mean.
- R (a programming language for data analysis) and RStudio (the accompanying IDE).
- Python and accompanying packages for data analysis: pandas, numpy, and if you’re doing some statistical tests that are not available through pandas, scipy.stats
- Tableau for visual data analysis—this is a remarkably powerful tool that has a free version
- VSCode or Jupyter for interactive analysis with Python
- If you’re interested in analysing the actual contents of students’ code submissions, you can use something like javaparser (or javalang if you want to do it in Python) to parse the program snapshots.
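For instance, a javalang-based sketch along these lines could extract simple syntactic features from one snapshot. It assumes the source column in CodeState.csv is named Code, and note that snapshots that are not complete compilation units, or that do not compile, will fail to parse.

```python
import pandas as pd
import javalang  # pip install javalang

code_states = pd.read_csv("Data/CodeStates/CodeState.csv")
source = code_states["Code"].iloc[0]  # the "Code" column name is an assumption

try:
    tree = javalang.parse.parse(source)
    # Count a couple of simple syntactic features as an example.
    n_methods = sum(1 for _ in tree.filter(javalang.tree.MethodDeclaration))
    n_fors = sum(1 for _ in tree.filter(javalang.tree.ForStatement))
    print(f"methods: {n_methods}, for-loops: {n_fors}")
except (javalang.parser.JavaSyntaxError, javalang.tokenizer.LexerError):
    # Snapshots that do not compile, or that are not full compilation units,
    # will not parse.
    print("Snapshot could not be parsed.")
```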
Deliverable
There are two deliverables:
- For the first task, enter into Canvas your answers to the questions, as well as an English-language description of your strategy for obtaining those answers.
- For the second task, turn in a zipped folder, which depending on what you do should contain:
  - If you wrote code, submit the code along with a README file indicating how to run the code. Naturally, the code should be reasonably well-documented. (If you’re using Jupyter Notebooks, either export the code to a .py file or scrub the output from the .ipynb file that you turn in.)
  - Also turn in a PDF report detailing what you did and any insights you uncovered. If you’re turning in visuals, include those as images in the report.
Groups
I will let you self-organise into groups of 3–4 students for this assignment. In Canvas, when you submit, make sure to choose your group before submitting, and make sure that all students’ names are included in the submission.
Rubric
Task 1 (9 points)
Question 1
- Correct numerical answer (1 point)
- Explanation for why the answers might be different depending on the data-set one looks in (2 points)
Question 2
- Correct numerical answer (1 point)
- Strategy for obtaining the answer (2 points)
Question 3
- Which interpretation is more meaningful and why (2 points)
- Correct numerical answer (1 point)
Task 2 (11 points)
This is more open-ended. Contributions will be assessed based on the following criteria.
- (2 points) Contribution: Is there a clear contribution in the submission? This can be an important insight, a metric, a method of assessing some aspect of students’ programming process, etc. Is there a discussion or statement of why this submission is useful or novel?
- (3 points) Significance: Submissions can take many forms, but they ought to be substantial units of work. For example, a contribution might be a ProgSnap2-compatible implementation of a chosen programming process metric from the CS Ed literature. It may be a model to predict final grades or struggle on later programming problems. It may be a series of visualisations (a “dashboard”) geared toward helping an instructor understand a given student’s process, or where the whole class stands. Whatever it is, it should be substantial.
- (3 points) Evaluation: The submission ought to contain an evaluation of its main contributions. If it’s a model, how good is it? If it’s a research question that was answered, how soundly was it answered? Were there any limitations?
- (3 points) Presentation: Overall presentation of the accompanying report.
1. These are the required questions, but you’re obviously free to explore the data-set further. For example, I like to make barplots and boxplots to help me “eyeball” the data when I am starting to analyse a data-set. ↩