Multi-Class Machine Learning Approaches to Student Dropout
A Comparative Study of Classifiers and Data Balancing Techniques
Kareem D. Piper (Advisor: Dr. Shusen Pu)
2026-03-17
Background and Motivation
Introduction
- Student dropout is associated with lower earnings and reduced labor mobility
- Academic outcomes reflect interacting academic, financial, and demographic factors
- Traditional statistical summaries struggle with complex high-dimensional relationships
- Machine learning enables predictive early-warning systems
Why This Problem Matters
Institutions require tools that can:
- identify at-risk students earlier
- support proactive intervention
- move beyond descriptive reporting
A multi-class framework distinguishes:
- Dropout
- Enrolled
- Graduate
Study Purpose and Research Questions
This study aims to:
- compare machine learning classifiers
- evaluate sampling strategies for class imbalance
- identify models suitable for early warning systems
Research questions:
- Which classifier performs best on the UCI Student Dropout Dataset?
- How do sampling strategies affect performance?
- How do boosting models compare with traditional ensembles?
- Which variables are most associated with student outcomes?
Dataset Overview
Descriptive Statistics
- Source: UCI Machine Learning Repository
- Observations: 4,424 students
- Predictors: 36 features
Outcome classes:
- Dropout
- Enrolled
- Graduate
Descriptive statistics for selected numerical features.
Feature Structure
Predictor domains include:
- demographic variables
- academic progression indicators
- financial variables
- institutional characteristics
Mixed binary, ordinal, and continuous feature types.
Exploratory Data Analysis
Analysis and Results
EDA focused on:
- feature distributions
- subgroup outcome differences
- variable correlations
Academic progression variables emerged as key predictors.
Feature Correlations
Academic progression variables showed the strongest relationships with outcomes.
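A correlation check of this kind can be sketched with pandas; the column names below are illustrative stand-ins, not the dataset's exact field names.

```python
import pandas as pd

# Illustrative stand-in for the UCI data; column names are assumptions.
df = pd.DataFrame({
    "units_approved_sem1": [6, 0, 4, 5, 2],
    "units_approved_sem2": [6, 1, 3, 5, 1],
    "age_at_enrollment":   [18, 22, 35, 19, 27],
})

# Pairwise Pearson correlations among numeric features
corr = df.corr()
print(corr.round(2))
```

In this toy frame the two semester-progression columns correlate strongly with each other, mirroring the pattern the EDA found in the real data.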
Demographic Indicators
Gender and scholarship status showed moderate associations with student outcomes.
Debtor Status and Age Patterns
Debtor status and age showed clear differences across groups, especially for dropout outcomes.
Academic Progress Indicators
Students with stronger academic progress were significantly more likely to graduate.
Class Imbalance
Outcome classes were not evenly distributed, making accuracy alone an unreliable metric.
Evaluation emphasized:
- Macro F1
- Recall
- Precision
Analytical Framework
Modeling approach:
- multi-class classification
- tree-based ensemble algorithms
- explicit treatment of class imbalance
- evaluation using macro-level metrics
Modeling Workflow
- data preparation
- exploratory data analysis
- class imbalance assessment
- resampling
- model training
- model comparison
- best model selection
Preprocessing and Pipeline Setup
Preprocessing
Pipeline Setup
Feature Typing and Numeric Preprocessing
These code snippets show target encoding, identification of feature types, numerical preprocessing, and structured pipeline construction using scikit-learn's ColumnTransformer and related preprocessing components.
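A minimal sketch of such a pipeline is below; the column names and the tiny synthetic frame are invented for illustration, not the dataset's exact schema.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import LabelEncoder, OneHotEncoder, StandardScaler

# Tiny synthetic frame standing in for the UCI dataset; column names are illustrative.
df = pd.DataFrame({
    "age_at_enrollment": [18, 22, 35, 19],
    "units_approved": [6, 0, 4, 5],
    "scholarship_holder": ["yes", "no", "no", "yes"],
    "target": ["Graduate", "Dropout", "Enrolled", "Graduate"],
})

# Target encoding: map the three outcome labels to integers
# (LabelEncoder sorts alphabetically: Dropout=0, Enrolled=1, Graduate=2)
y = LabelEncoder().fit_transform(df["target"])

# Identify feature types, then scale numerics and one-hot encode categoricals
numeric_cols = ["age_at_enrollment", "units_approved"]
categorical_cols = ["scholarship_holder"]
preprocessor = ColumnTransformer([
    ("num", StandardScaler(), numeric_cols),
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
])

X = preprocessor.fit_transform(df.drop(columns="target"))
print(X.shape)  # 4 rows; 2 scaled numeric + 2 one-hot columns
```

Keeping the preprocessing inside a ColumnTransformer means the same transformations are refit on each training fold, avoiding leakage into validation data.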
Sampling Strategies
Resampling techniques tested:
- Random under-sampling
- Random over-sampling
- SMOTE
- SMOTEENN
- SMOTETomek
These methods rebalance the training data so minority classes are learned more reliably.
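The samplers above come from the imbalanced-learn package; to show the idea behind the simplest one, random over-sampling, here is a minimal numpy sketch (the function and data are invented for the example).

```python
import numpy as np

rng = np.random.default_rng(0)

def random_oversample(X, y):
    """Resample each class with replacement up to the majority-class count.

    A minimal stand-in for imblearn's RandomOverSampler (which keeps the
    original rows and adds duplicates, rather than resampling everything).
    """
    classes, counts = np.unique(y, return_counts=True)
    n_max = counts.max()
    idx = []
    for c in classes:
        c_idx = np.flatnonzero(y == c)
        idx.append(rng.choice(c_idx, size=n_max, replace=True))
    idx = np.concatenate(idx)
    return X[idx], y[idx]

X = np.arange(20).reshape(10, 2)
y = np.array([0] * 6 + [1] * 3 + [2] * 1)  # imbalanced: 6 / 3 / 1
Xr, yr = random_oversample(X, y)
print(np.bincount(yr))  # [6 6 6]
```

SMOTE and its hybrids (SMOTEENN, SMOTETomek) go further by synthesizing new minority points by interpolation and optionally cleaning noisy boundary samples, rather than duplicating rows.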
Models Evaluated
Algorithms tested:
- Random Forest
- XGBoost
- LightGBM
- CatBoost
Primary evaluation metric:
Macro F1 Score
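Macro F1 averages per-class F1 scores with equal weight, so minority classes count as much as the majority class. A quick sketch with invented labels:

```python
from sklearn.metrics import f1_score

# Invented 3-class labels for illustration
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 1, 0, 2, 1]

# Macro F1 = unweighted mean of the per-class F1 scores
macro_f1 = f1_score(y_true, y_pred, average="macro")
print(round(macro_f1, 4))  # 0.6556
```

Here the per-class F1 scores are 0.8, 0.5, and 0.667, so their plain average is about 0.656; a class-frequency-weighted average would have masked the weaker minority classes.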
Model Training and Comparison Code
Model Training
Model Comparison
These code snippets show how candidate classifiers were trained and compared across model and sampler combinations using consistent evaluation metrics.
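The comparison loop can be sketched as follows. XGBoost, LightGBM, and CatBoost each live in their own packages, so this self-contained version uses scikit-learn ensembles as stand-ins, with synthetic imbalanced data in place of the UCI dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic imbalanced 3-class problem standing in for the UCI data
X, y = make_classification(n_samples=300, n_classes=3, n_informative=5,
                           weights=[0.5, 0.3, 0.2], random_state=0)

# sklearn stand-ins for the study's Random Forest and boosting candidates
models = {
    "RandomForest": RandomForestClassifier(n_estimators=50, random_state=0),
    "GradientBoosting": GradientBoostingClassifier(random_state=0),
}

# Score every candidate with the same cross-validated macro F1 metric
results = {name: cross_val_score(m, X, y, cv=3, scoring="f1_macro").mean()
           for name, m in models.items()}
for name, score in results.items():
    print(f"{name}: macro F1 = {score:.3f}")
```

In the full study the loop also iterates over the five samplers, so each entry is a model-sampler pair scored under identical conditions.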
Best Model Selection and Final Training Code
Best Model Selection
Final Model Training
These steps identify the strongest model-sampler combination and then refit the selected pipeline for final validation-stage evaluation.
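In outline, selection and refitting can look like this; the scores and model-sampler names in `cv_results` are hypothetical placeholders, and a RandomForest on synthetic data stands in for the actual refit.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Hypothetical cross-validated macro-F1 results from the comparison stage
cv_results = {
    ("RandomForest", "SMOTE"): 0.71,
    ("XGBoost", "SMOTEENN"): 0.74,
    ("CatBoost", "RandomOverSampler"): 0.73,
}

# Pick the model-sampler pair with the best macro F1
best_model, best_sampler = max(cv_results, key=cv_results.get)
print(best_model, best_sampler)  # XGBoost SMOTEENN

# Refit the winning pipeline on the training split for validation-stage
# evaluation (RandomForest on synthetic data used here as a runnable stand-in)
X, y = make_classification(n_samples=200, n_classes=3, n_informative=5,
                           random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, stratify=y, random_state=0)
final_model = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)
print(round(final_model.score(X_val, y_val), 3))
```

Refitting only the single selected pipeline keeps the validation split untouched by the model-selection process.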
Evaluation Metrics Code
This code computes the final evaluation metrics, including accuracy, macro precision, macro recall, macro F1, and weighted F1 for the selected pipeline.
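These metrics can be sketched in a few lines with scikit-learn; the labels below are invented for illustration.

```python
from sklearn.metrics import (accuracy_score, f1_score,
                             precision_score, recall_score)

# Invented 3-class predictions for illustration
y_true = [0, 0, 1, 1, 2, 2, 0, 2]
y_pred = [0, 0, 1, 0, 2, 1, 0, 2]

# Macro averages weight every class equally; weighted F1 reflects class sizes
metrics = {
    "accuracy": accuracy_score(y_true, y_pred),
    "macro_precision": precision_score(y_true, y_pred, average="macro"),
    "macro_recall": recall_score(y_true, y_pred, average="macro"),
    "macro_f1": f1_score(y_true, y_pred, average="macro"),
    "weighted_f1": f1_score(y_true, y_pred, average="weighted"),
}
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Reporting both macro and weighted F1 shows whether a model's headline score is propped up by the majority class.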
Confusion Matrix
Final Model Evaluation
This code produces final predictions and visual diagnostics, including the confusion matrix used to assess class-level performance for Dropout, Enrolled, and Graduate outcomes.
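A confusion matrix of this shape can be produced directly from the label strings; the predictions below are invented for illustration.

```python
from sklearn.metrics import confusion_matrix

labels = ["Dropout", "Enrolled", "Graduate"]

# Invented predictions for illustration
y_true = ["Dropout", "Dropout", "Enrolled", "Enrolled", "Graduate", "Graduate"]
y_pred = ["Dropout", "Enrolled", "Enrolled", "Graduate", "Graduate", "Graduate"]

# Rows are true classes, columns are predicted classes,
# ordered Dropout / Enrolled / Graduate
cm = confusion_matrix(y_true, y_pred, labels=labels)
print(cm)
```

Reading down a column shows which true classes leak into each prediction; in the study, off-diagonal mass concentrated around the Enrolled class, consistent with it being the hardest to separate.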
Model Interpretation
Key observations:
- Graduate predicted most accurately
- Dropout predictions comparatively strong
- Enrolled class most difficult to classify
- Resampling improved minority-class learning
Key Findings
Conclusions
- Multi-class machine learning effectively modeled student outcomes
- Class imbalance required explicit treatment
- Academic progression variables were the strongest predictors
- Financial indicators also contributed meaningfully
- Boosting models outperformed traditional ensembles
Practical Implications and Limitations
Predictive models support:
- institutional early warning systems
- targeted student support interventions
- data-driven advising
Limitations:
- restricted to UCI dataset variables
- predictive relationships are not causal
- Enrolled class remains difficult to classify
Future Directions
Future work should explore:
- expanded feature engineering
- model interpretability (e.g., SHAP)
- external institutional validation
- deployment within student success systems
Final Takeaway
Student dropout prediction is a multi-factor, multi-class problem.
Machine learning provides a powerful framework for:
- identifying risk patterns
- improving early intervention
- supporting data-informed institutional decisions
Acknowledgment
Advisor: Dr. Shusen Pu
Capstone Projects in Data Science
University of West Florida
Questions
Thank you.
Kareem D. Piper