Multi-Class Machine Learning Approaches to Student Dropout

A Comparative Study of Classifiers and Data Balancing Techniques

Kareem D. Piper (Advisor: Dr. Shusen Pu)

2026-03-17

Background and Motivation

Introduction

  • Student dropout is associated with lower earnings and reduced labor mobility
  • Academic outcomes reflect interacting academic, financial, and demographic factors
  • Traditional statistical summaries struggle with complex high-dimensional relationships
  • Machine learning enables predictive early-warning systems

Why This Problem Matters

Institutions require tools that can:

  • identify at-risk students earlier
  • support proactive intervention
  • move beyond descriptive reporting

A multi-class framework distinguishes:

  • Dropout
  • Enrolled
  • Graduate

Study Purpose and Research Questions

This study aims to:

  • compare machine learning classifiers
  • evaluate sampling strategies for class imbalance
  • identify models suitable for early warning systems

Research questions:

  • Which classifier performs best on the UCI Student Dropout Dataset?
  • How do sampling strategies affect performance?
  • How do boosting models compare with traditional ensembles?
  • Which variables are most associated with student outcomes?

Dataset Overview

Descriptive Statistics

  • Source: UCI Machine Learning Repository
  • Observations: 4,424 students
  • Predictors: 36 features

Outcome classes:

  • Dropout
  • Enrolled
  • Graduate

(Table: descriptive statistics for selected numerical features.)

Feature Structure

Predictor domains include:

  • demographic variables
  • academic progression indicators
  • financial variables
  • institutional characteristics

Mixed binary, ordinal, and continuous feature types.

Exploratory Data Analysis

Analysis and Results

EDA focused on:

  • feature distributions
  • subgroup outcome differences
  • variable correlations

Academic progression variables emerged as key predictors.

Feature Correlations

Academic progression variables showed the strongest relationships with outcomes.

Demographic Indicators

Gender and scholarship status showed moderate associations with student outcomes.

Debtor Status and Age Patterns

Debtor status and age showed clear differences across groups, especially for dropout outcomes.

Academic Progress Indicators

Students with stronger academic progress were significantly more likely to graduate.

Class Imbalance

Outcome classes were unevenly distributed, so overall accuracy alone was an insufficient evaluation metric.

Evaluation emphasized:

  • Macro F1
  • Macro recall
  • Macro precision

Analytical Framework

Modeling approach:

  • multi-class classification
  • tree-based ensemble algorithms
  • explicit treatment of class imbalance
  • evaluation using macro-level metrics

Modeling Workflow

  1. data preparation
  2. exploratory data analysis
  3. class imbalance assessment
  4. resampling
  5. model training
  6. model comparison
  7. best model selection

Preprocessing and Pipeline Setup

Preprocessing

Pipeline Setup

Feature Typing and Numeric Preprocessing

These code snippets show encoding of the target variable, identification of feature types, numeric preprocessing, and structured pipeline construction using ColumnTransformer and related preprocessing components.
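The original code slides are not reproduced here; the following is a minimal sketch of that preprocessing setup, assuming scikit-learn, with illustrative column names standing in for the dataset's 36 predictors.

```python
# Hypothetical sketch of the preprocessing pipeline; column names are illustrative.
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import LabelEncoder, OneHotEncoder, StandardScaler

# Toy stand-in for the UCI student dataset (the real data has 36 predictors).
df = pd.DataFrame({
    "age_at_enrollment": [19, 23, 31, 20],
    "units_approved": [6, 0, 4, 5],
    "gender": ["F", "M", "F", "M"],
    "target": ["Graduate", "Dropout", "Enrolled", "Graduate"],
})

# Encode the three-class target as integers (sorted: Dropout=0, Enrolled=1, Graduate=2).
label_enc = LabelEncoder()
y = label_enc.fit_transform(df["target"])

# Identify feature types by dtype.
X = df.drop(columns="target")
numeric_cols = X.select_dtypes(include="number").columns.tolist()
categorical_cols = X.select_dtypes(exclude="number").columns.tolist()

# Scale numeric features; one-hot encode categoricals.
preprocessor = ColumnTransformer([
    ("num", StandardScaler(), numeric_cols),
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
])
X_prepared = preprocessor.fit_transform(X)
print(X_prepared.shape)
```

In a full pipeline, the ColumnTransformer would be chained with the classifier so the same transformations apply at train and prediction time.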

Sampling Strategies

Resampling techniques tested:

  • Random under-sampling
  • Random over-sampling
  • SMOTE
  • SMOTEENN
  • SMOTETomek

These methods rebalance the training data so minority classes are learned more effectively.

Sampling Formulas

Random Under-Sampling

Random Over-Sampling

SMOTE

SMOTEENN

SMOTETomek
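The headings above reference the sampler formulas shown on the slides. As one representative example, SMOTE generates each synthetic minority sample by linear interpolation between a minority point and one of its k nearest minority-class neighbors:

```latex
x_{\text{new}} = x_i + \lambda \, (x_{zi} - x_i), \qquad \lambda \sim \mathrm{U}(0, 1)
```

where $x_i$ is a minority-class sample and $x_{zi}$ is a randomly chosen minority neighbor. SMOTEENN combines SMOTE with Edited Nearest Neighbours cleaning, and SMOTETomek combines SMOTE with Tomek-link removal.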

Models Evaluated

Algorithms tested:

  • Random Forest
  • XGBoost
  • LightGBM
  • CatBoost

Primary evaluation metric:

Macro F1 Score
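The macro F1 score is the unweighted mean of per-class F1 scores, so each class counts equally regardless of size. A small illustrative computation with toy predictions:

```python
# Illustrative macro F1 computation on toy predictions (values are not study results).
from sklearn.metrics import f1_score

y_true = [0, 0, 1, 1, 2, 2]   # 0=Dropout, 1=Enrolled, 2=Graduate
y_pred = [0, 1, 1, 1, 2, 0]

# Per-class F1, then the unweighted mean across classes.
per_class = f1_score(y_true, y_pred, average=None)
macro = f1_score(y_true, y_pred, average="macro")
print(per_class, macro)
```

Because the mean is unweighted, a model that ignores a small class (such as Enrolled) is penalized even if overall accuracy stays high.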

Model Formulations

Random Forest

XGBoost

LightGBM

CatBoost

Model Training and Comparison Code

Model Training

Model Comparison

These code snippets show how candidate classifiers were trained and compared across model and sampler combinations using consistent evaluation metrics.

Model Performance

Sampling strategies improved macro-level performance.

Best Model Selection and Final Training Code

Best Model Selection

Final Model Training

These steps identify the strongest model-sampler combination and then refit the selected pipeline for final validation-stage evaluation.
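The selection step reduces to picking the combination with the highest mean macro F1 and refitting it. The scores below are placeholders for illustration, not the study's actual results:

```python
# Illustrative best-combination selection; scores are hypothetical placeholders.
results = {
    ("RandomForest", "SMOTE"): 0.71,
    ("XGBoost", "SMOTETomek"): 0.74,
    ("LightGBM", "SMOTEENN"): 0.72,
}

# Pick the (model, sampler) pair with the highest mean macro F1.
best_combo = max(results, key=results.get)
print("best:", best_combo, results[best_combo])

# The selected pipeline would then be refit on the full training split
# and evaluated once on the held-out validation set.
```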

Best Performing Model

Best performing approach:

XGBoost + SMOTETomek

Advantages:

  • highest macro F1 score
  • balanced class performance
  • improved minority-class predictions

Evaluation Metrics Code

This section computes the final evaluation metrics, including accuracy, macro precision, macro recall, macro F1, and weighted F1 for the selected pipeline.
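Those five metrics can be computed directly with scikit-learn; the predictions below are toy values used only to show the calls:

```python
# Sketch of the final metric computation; predictions here are toy values.
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_val = [0, 0, 1, 1, 2, 2, 2]   # held-out true labels (toy)
y_hat = [0, 2, 1, 1, 2, 2, 0]   # final model predictions (toy)

metrics = {
    "accuracy": accuracy_score(y_val, y_hat),
    "macro_precision": precision_score(y_val, y_hat, average="macro"),
    "macro_recall": recall_score(y_val, y_hat, average="macro"),
    "macro_f1": f1_score(y_val, y_hat, average="macro"),
    "weighted_f1": f1_score(y_val, y_hat, average="weighted"),
}
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Weighted F1 averages per-class F1 by class frequency, so comparing it against macro F1 reveals whether minority classes are dragging performance down.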

Confusion Matrix

Final Model Evaluation

This code produces final predictions and visual diagnostics, including the confusion matrix used to assess class-level performance for Dropout, Enrolled, and Graduate outcomes.
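A minimal confusion-matrix sketch for the three outcome classes, with toy predictions; rows are true classes and columns are predicted classes, in the order given by `labels`:

```python
# Minimal confusion-matrix sketch for the three outcome classes; toy predictions.
from sklearn.metrics import confusion_matrix

labels = ["Dropout", "Enrolled", "Graduate"]
y_val = ["Dropout", "Dropout", "Enrolled", "Graduate", "Graduate", "Enrolled"]
y_hat = ["Dropout", "Enrolled", "Enrolled", "Graduate", "Graduate", "Graduate"]

# Rows = true class, columns = predicted class, ordered by `labels`.
cm = confusion_matrix(y_val, y_hat, labels=labels)
print(cm)
```

For the visual diagnostic, `sklearn.metrics.ConfusionMatrixDisplay` can render this matrix as the heatmap shown on the slide.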

Model Interpretation

Key observations:

  • Graduate predicted most accurately
  • Dropout predictions comparatively strong
  • Enrolled class most difficult to classify
  • Resampling improved minority-class learning

Key Findings

Conclusions

  • Multi-class machine learning effectively modeled student outcomes
  • Class imbalance required explicit treatment
  • Academic progression variables were the strongest predictors
  • Financial indicators also contributed meaningfully
  • Boosting models outperformed traditional ensembles

Practical Implications and Limitations

Predictive models support:

  • institutional early warning systems
  • targeted student support interventions
  • data-driven advising

Limitations:

  • restricted to UCI dataset variables
  • predictive relationships are not causal
  • Enrolled class remains difficult to classify

Future Directions

Future work should explore:

  • expanded feature engineering
  • model interpretability (e.g., SHAP)
  • external institutional validation
  • deployment within student success systems

Final Takeaway

Student dropout prediction is a multi-factor, multi-class problem.

Machine learning provides a powerful framework for:

  • identifying risk patterns
  • improving early intervention
  • supporting data-informed institutional decisions

Acknowledgment

Advisor: Dr. Shusen Pu

Capstone Projects in Data Science
University of West Florida

Questions

Thank you.

Kareem D. Piper