Using Data Science to Help Kids Graduate

High school graduation rates are the final metric of school achievement. School districts can ensure their students feel safe and connected, but the ultimate measure of success is still whether the kid gets across that mortarboard-littered finish line. Timely graduation is associated with all sorts of positive outcomes, so this metric is a truly valuable one.

Because graduation is so crucial, school districts are understandably very interested in taking steps to ensure as many students as possible graduate on time. This often means that districts are trying to identify at-risk students early in order to help them before an on-time graduation becomes impossible.

Finding these students was the task taken on by Lakkaraju et al. in their 2015 paper "A Machine Learning Framework to Identify Students at Risk of Adverse Academic Outcomes." The researchers used standard machine learning algorithms to predict which students were most at risk, but the context of the situation required a few interesting tweaks to the typical data science process. I will summarize the work in this post.

The Basics

The core process is a familiar one in data science. The problem was a basic binary classification: "Will this student graduate on time?" The researchers took their data (provided by two school districts) and trained a variety of models. These models included a Decision Tree, Logistic Regression, AdaBoost, Random Forest, and a Support Vector Machine. Different cohorts within the same district were used as holdout sets for testing, in hopes of rewarding more generalizable models. The models were also created such that they could run with varied inputs, as each district collected slightly different data across different time periods. The researchers wanted a framework that could function regardless of the particular district's specific data. After training the models, the researchers evaluated their usefulness for the specific needs of educators.
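The setup above can be sketched with scikit-learn. This is a toy illustration, not the paper's actual pipeline: the features (GPA, absences), the cohort labels, and the data itself are all made up for the example. The key idea is that the train/test split runs along cohort lines rather than a random shuffle, so a model only scores well if it generalizes to a class of students it never saw.

```python
# Toy sketch of the training setup: several standard classifiers trained on
# one cohort and evaluated on a later cohort. Features and cohorts are
# illustrative stand-ins, not the paper's actual data.
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)

# Synthetic stand-in data: GPA, days absent, and a graduating-class cohort.
n = 400
X = np.column_stack([rng.uniform(0.0, 4.0, n),   # GPA
                     rng.integers(0, 40, n)])    # days absent
# Label: 1 = did not graduate on time (noisy function of GPA and absences).
y = (X[:, 0] - 0.05 * X[:, 1] + rng.normal(0, 0.5, n) < 1.5).astype(int)
cohort = rng.choice([2012, 2013], size=n)

# Train on one cohort, hold out a different cohort to reward generalization.
train, test = cohort == 2012, cohort == 2013

models = {
    "decision tree": DecisionTreeClassifier(max_depth=4),
    "logistic regression": LogisticRegression(),
    "adaboost": AdaBoostClassifier(),
    "random forest": RandomForestClassifier(n_estimators=100),
    "svm": SVC(probability=True),
}
for name, model in models.items():
    model.fit(X[train], y[train])
    acc = accuracy_score(y[test], model.predict(X[test]))
    print(f"{name}: held-out cohort accuracy = {acc:.2f}")
```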

This is where the work gets interesting.

The researchers worked with the school districts to determine what they needed from the models. Obviously the models could offer a straightforward yes/no prediction of whether a student would graduate on time, but how useful was that?

It turns out, in a real world with limited resources, the answer is "not very."

Fitting to the Context

The educators needed to be able to prioritize which students were most likely to struggle. If the district only had the resources to offer extra support to fifty students in a given year, a list of 200 kids predicted to fail was more demoralizing than helpful.

The researchers engineered their models to output the likelihood that each student would not graduate on time. Logistic Regression produces these probabilities natively, but the other models required a bit of extra work to extract comparable scores. The team then tested the models to see how effectively they worked with this added wrinkle.
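In scikit-learn terms, this amounts to ranking students by a risk score rather than a hard yes/no label. A minimal sketch, using a Random Forest on made-up data (the features, the fifty-student capacity, and the data are all illustrative assumptions):

```python
# Sketch of converting a classifier into a ranked risk list: score every
# student by their predicted probability of being off-track, then keep the
# top k, matching the capacity of the support program. Data is synthetic.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))                             # toy student features
y = (X[:, 0] + rng.normal(0, 0.5, 200) < 0).astype(int)   # 1 = off-track

model = RandomForestClassifier(n_estimators=50).fit(X, y)

risk = model.predict_proba(X)[:, 1]       # P(off-track) for each student
k = 50                                    # slots available for extra support
priority = np.argsort(risk)[::-1][:k]     # highest-risk students first
print("top 5 risk scores:", np.round(risk[priority[:5]], 3))
```

With a ranked list, the district can work down from the top until its resources run out, instead of facing an undifferentiated pile of predicted failures.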

The next issue was that educators needed a model that caught off-track students as early as possible. A model could likely predict graduation rates of second-semester seniors with incredible accuracy, but at that point it would be too late to address the issue. The researchers evaluated how well their models predicted graduation outcomes early in a student's academic career. Understandably, all the models improved as students got older, but some models were already fairly strong at lower grades.

The last big issue was interpretability. The districts already had early-warning systems in place, often based on attendance and GPA. These factors are easy to understand and logically connect to the likelihood of graduation. The educators were unlikely to prefer a mysterious algorithm with slightly better performance over a clear system that made intuitive sense. The school districts were also concerned about the types of mistakes a model might make. No predictor is perfect, but if the errors were consistently targeting a certain type of student, the model might seem or truly be discriminatory and therefore unacceptable.

The researchers broke out the feature importances for the various models and also inspected the most common errors. As one might expect, GPA and attendance factored highly in most models. Some models, however, relied heavily on gender or disability status. This does not automatically render a model bad, but it certainly might nudge a district to take a deeper look, potentially at both the model and their own practices.
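For tree-based models, this kind of inspection is straightforward: scikit-learn exposes a `feature_importances_` attribute on fitted ensembles. A small sketch on synthetic data (the feature names here are illustrative, chosen to mirror the post's examples):

```python
# Sketch of the interpretability check: rank the features a fitted model
# leans on. Heavy weight on sensitive attributes like gender or disability
# status would warrant a closer look. Data and feature names are synthetic.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(2)
features = ["gpa", "absences", "gender", "disability_status"]
X = rng.normal(size=(300, len(features)))
# Outcome driven by the first two columns; the rest are noise here.
y = (X[:, 0] - 0.5 * X[:, 1] + rng.normal(0, 0.3, 300) < 0).astype(int)

model = RandomForestClassifier(n_estimators=100).fit(X, y)

for name, imp in sorted(zip(features, model.feature_importances_),
                        key=lambda pair: -pair[1]):
    print(f"{name}: {imp:.3f}")
```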

Common errors occurred for students with high GPAs but also high absenteeism, as well as the opposite. As these errors make sense and are not directly related to sensitive features like race or socioeconomic status, they probably wouldn't raise too many red flags for educators. Knowing that these are the models' most frequent errors would likely assuage educators' concerns about moving to a less transparent system for identifying at-risk students.

So what?

For the two districts studied, the Random Forest model did the best on almost all of the metrics evaluated. The point of the article, however, was not to recommend a particular machine learning algorithm. The point was to recommend a process, and the chosen algorithm(s) could vary by school district.

This article showcases what data science does beyond statistics. The researchers learned about the particular context and the needs of the stakeholders. They then found ways to effectively communicate the usefulness of their model to those stakeholders.

A model can have 100% accuracy and still be useless if people don't trust it or understand it or have ways to take action based on its output.

In the setting of a school district seeking to find students most in need of intervention, building the model was just the first step. The researchers then needed to convert the predictions into actionable output and explain what was happening under the hood enough to engender confidence.

The researchers listened to the educators they were working with and produced a model that is successful by the standards of both the data scientists and the teachers. That is an excellent outcome.

And hopefully this collaboration yields more students graduating high school on time, which is the ultimate excellent outcome.

Photo by Jakob Rosen on Unsplash

