Multi-Class Machine Learning Approaches to Student Dropout
A Comparative Study of Classifiers and Data Balancing Techniques
Kareem D. Piper (Advisor: Dr. Shusen Pu)
2026-03-17
Background and Motivation
Introduction
- Student dropout is associated with lower earnings and reduced labor mobility
- Academic outcomes reflect interacting academic, financial, and demographic factors
- Traditional statistical summaries struggle with complex high-dimensional relationships
- Machine learning enables predictive early-warning systems
Why This Problem Matters
Institutions require tools that can:
- identify at-risk students earlier
- support proactive intervention
- move beyond descriptive reporting
A multi-class framework distinguishes:
- Dropout
- Enrolled
- Graduate
Study Purpose and Research Questions
This study aims to:
- compare machine learning classifiers
- evaluate sampling strategies for class imbalance
- identify models suitable for early warning systems
Research questions:
- Which classifier performs best on the UCI Student Dropout Dataset?
- How do sampling strategies affect performance?
- How do boosting models compare with traditional ensembles?
- Which variables are most associated with student outcomes?
Dataset Overview
Descriptive Statistics
- Source: UCI Machine Learning Repository
- Observations: 4,424 students
- Predictors: 36 features
Outcome classes:
- Dropout
- Enrolled
- Graduate
Descriptive statistics for selected numerical features.
Feature Structure
Predictor domains include:
- demographic variables
- academic progression indicators
- financial variables
- institutional characteristics
Mixed binary, ordinal, and continuous feature types.
Exploratory Data Analysis
Analysis and Results
EDA focused on:
- feature distributions
- subgroup outcome differences
- variable correlations
Academic progression variables emerged as key predictors.
Feature Correlations
Academic progression variables showed the strongest relationships with outcomes.
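A correlation check of this kind can be sketched with pandas; the column names below are illustrative stand-ins, not the dataset's exact field names.

```python
import pandas as pd

# Illustrative stand-in for the UCI data; column names are assumptions.
df = pd.DataFrame({
    "units_approved_sem1": [6, 0, 4, 5, 2],
    "units_approved_sem2": [6, 1, 3, 5, 1],
    "age_at_enrollment":   [18, 22, 35, 19, 27],
})

# Pairwise Pearson correlations among numeric features
corr = df.corr()
print(corr.round(2))
```

In this toy frame the two semester-progression columns correlate strongly with each other, mirroring the pattern the EDA found in the real data.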
Demographic Indicators
Gender and scholarship status showed moderate associations with student outcomes.
Debtor Status and Age Patterns
Debtor status and age showed clear differences across groups, especially for dropout outcomes.
Academic Progress Indicators
Students with stronger academic progress were significantly more likely to graduate.
Class Imbalance
Outcome classes were not evenly distributed, making accuracy alone an unreliable metric.
Evaluation emphasized:
- Macro F1
- Recall
- Precision
Analytical Framework
Modeling approach:
- multi-class classification
- tree-based ensemble algorithms
- explicit treatment of class imbalance
- evaluation using macro-level metrics
Modeling Workflow
- data preparation
- exploratory data analysis
- class imbalance assessment
- resampling
- model training
- model comparison
- best model selection
Preprocessing and Pipeline Setup
Preprocessing
Pipeline Setup
Feature Typing and Numeric Preprocessing
These code snippets show target encoding, identification of feature types, numerical preprocessing, and structured pipeline construction using scikit-learn's ColumnTransformer and related preprocessing components.
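A minimal sketch of such a pipeline is below; the column names and the tiny synthetic frame are invented for illustration, not the dataset's exact schema.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import LabelEncoder, OneHotEncoder, StandardScaler

# Tiny synthetic frame standing in for the UCI dataset; column names are illustrative.
df = pd.DataFrame({
    "age_at_enrollment": [18, 22, 35, 19],
    "units_approved": [6, 0, 4, 5],
    "scholarship_holder": ["yes", "no", "no", "yes"],
    "target": ["Graduate", "Dropout", "Enrolled", "Graduate"],
})

# Target encoding: map the three outcome labels to integers
# (LabelEncoder sorts alphabetically: Dropout=0, Enrolled=1, Graduate=2)
y = LabelEncoder().fit_transform(df["target"])

# Identify feature types, then scale numerics and one-hot encode categoricals
numeric_cols = ["age_at_enrollment", "units_approved"]
categorical_cols = ["scholarship_holder"]
preprocessor = ColumnTransformer([
    ("num", StandardScaler(), numeric_cols),
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
])

X = preprocessor.fit_transform(df.drop(columns="target"))
print(X.shape)  # 4 rows; 2 scaled numeric + 2 one-hot columns
```

Keeping the preprocessing inside a ColumnTransformer means the same transformations are refit on each training fold, avoiding leakage into validation data.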
Sampling Strategies
Resampling techniques tested:
- Random under-sampling
- Random over-sampling
- SMOTE
- SMOTEENN
- SMOTETomek
These methods rebalance the training data so minority classes are learned more reliably.
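The samplers above come from the imbalanced-learn package; to show the idea behind the simplest one, random over-sampling, here is a minimal numpy sketch (the function and data are invented for the example).

```python
import numpy as np

rng = np.random.default_rng(0)

def random_oversample(X, y):
    """Resample each class with replacement up to the majority-class count.

    A minimal stand-in for imblearn's RandomOverSampler (which keeps the
    original rows and adds duplicates, rather than resampling everything).
    """
    classes, counts = np.unique(y, return_counts=True)
    n_max = counts.max()
    idx = []
    for c in classes:
        c_idx = np.flatnonzero(y == c)
        idx.append(rng.choice(c_idx, size=n_max, replace=True))
    idx = np.concatenate(idx)
    return X[idx], y[idx]

X = np.arange(20).reshape(10, 2)
y = np.array([0] * 6 + [1] * 3 + [2] * 1)  # imbalanced: 6 / 3 / 1
Xr, yr = random_oversample(X, y)
print(np.bincount(yr))  # [6 6 6]
```

SMOTE and its hybrids (SMOTEENN, SMOTETomek) go further by synthesizing new minority points by interpolation and optionally cleaning noisy boundary samples, rather than duplicating rows.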
Models Evaluated
Algorithms tested:
- Random Forest
- XGBoost
- LightGBM
- CatBoost
Primary evaluation metric:
Macro F1 Score
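Macro F1 averages per-class F1 scores with equal weight, so minority classes count as much as the majority class. A quick sketch with invented labels:

```python
from sklearn.metrics import f1_score

# Invented 3-class labels for illustration
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 1, 0, 2, 1]

# Macro F1 = unweighted mean of the per-class F1 scores
macro_f1 = f1_score(y_true, y_pred, average="macro")
print(round(macro_f1, 4))  # 0.6556
```

Here the per-class F1 scores are 0.8, 0.5, and 0.667, so their plain average is about 0.656; a class-frequency-weighted average would have masked the weaker minority classes.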
Model Training and Comparison Code
Model Training
Model Comparison
These code snippets show how candidate classifiers were trained and compared across model and sampler combinations using consistent evaluation metrics.
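The comparison loop can be sketched as follows. XGBoost, LightGBM, and CatBoost each live in their own packages, so this self-contained version uses scikit-learn ensembles as stand-ins, with synthetic imbalanced data in place of the UCI dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic imbalanced 3-class problem standing in for the UCI data
X, y = make_classification(n_samples=300, n_classes=3, n_informative=5,
                           weights=[0.5, 0.3, 0.2], random_state=0)

# sklearn stand-ins for the study's Random Forest and boosting candidates
models = {
    "RandomForest": RandomForestClassifier(n_estimators=50, random_state=0),
    "GradientBoosting": GradientBoostingClassifier(random_state=0),
}

# Score every candidate with the same cross-validated macro F1 metric
results = {name: cross_val_score(m, X, y, cv=3, scoring="f1_macro").mean()
           for name, m in models.items()}
for name, score in results.items():
    print(f"{name}: macro F1 = {score:.3f}")
```

In the full study the loop also iterates over the five samplers, so each entry is a model-sampler pair scored under identical conditions.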
Best Model Selection and Final Training Code
Best Model Selection
Final Model Training
These steps identify the strongest model-sampler combination and then refit the selected pipeline for final validation-stage evaluation.
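In outline, selection and refitting can look like this; the scores and model-sampler names in `cv_results` are hypothetical placeholders, and a RandomForest on synthetic data stands in for the actual refit.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Hypothetical cross-validated macro-F1 results from the comparison stage
cv_results = {
    ("RandomForest", "SMOTE"): 0.71,
    ("XGBoost", "SMOTEENN"): 0.74,
    ("CatBoost", "RandomOverSampler"): 0.73,
}

# Pick the model-sampler pair with the best macro F1
best_model, best_sampler = max(cv_results, key=cv_results.get)
print(best_model, best_sampler)  # XGBoost SMOTEENN

# Refit the winning pipeline on the training split for validation-stage
# evaluation (RandomForest on synthetic data used here as a runnable stand-in)
X, y = make_classification(n_samples=200, n_classes=3, n_informative=5,
                           random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, stratify=y, random_state=0)
final_model = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)
print(round(final_model.score(X_val, y_val), 3))
```

Refitting only the single selected pipeline keeps the validation split untouched by the model-selection process.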
Evaluation Metrics Code
This code computes the final evaluation metrics, including accuracy, macro precision, macro recall, macro F1, and weighted F1 for the selected pipeline.
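These metrics can be sketched in a few lines with scikit-learn; the labels below are invented for illustration.

```python
from sklearn.metrics import (accuracy_score, f1_score,
                             precision_score, recall_score)

# Invented 3-class predictions for illustration
y_true = [0, 0, 1, 1, 2, 2, 0, 2]
y_pred = [0, 0, 1, 0, 2, 1, 0, 2]

# Macro averages weight every class equally; weighted F1 reflects class sizes
metrics = {
    "accuracy": accuracy_score(y_true, y_pred),
    "macro_precision": precision_score(y_true, y_pred, average="macro"),
    "macro_recall": recall_score(y_true, y_pred, average="macro"),
    "macro_f1": f1_score(y_true, y_pred, average="macro"),
    "weighted_f1": f1_score(y_true, y_pred, average="weighted"),
}
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Reporting both macro and weighted F1 shows whether a model's headline score is propped up by the majority class.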
Confusion Matrix
Final Model Evaluation
This code produces final predictions and visual diagnostics, including the confusion matrix used to assess class-level performance for Dropout, Enrolled, and Graduate outcomes.
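A confusion matrix of this shape can be produced directly from the label strings; the predictions below are invented for illustration.

```python
from sklearn.metrics import confusion_matrix

labels = ["Dropout", "Enrolled", "Graduate"]

# Invented predictions for illustration
y_true = ["Dropout", "Dropout", "Enrolled", "Enrolled", "Graduate", "Graduate"]
y_pred = ["Dropout", "Enrolled", "Enrolled", "Graduate", "Graduate", "Graduate"]

# Rows are true classes, columns are predicted classes,
# ordered Dropout / Enrolled / Graduate
cm = confusion_matrix(y_true, y_pred, labels=labels)
print(cm)
```

Reading down a column shows which true classes leak into each prediction; in the study, off-diagonal mass concentrated around the Enrolled class, consistent with it being the hardest to separate.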
Model Interpretation
Key observations:
- Graduate predicted most accurately
- Dropout predictions comparatively strong
- Enrolled class most difficult to classify
- Resampling improved minority-class learning
Key Findings
Conclusions
- Multi-class machine learning effectively modeled student outcomes
- Class imbalance required explicit treatment
- Academic progression variables were the strongest predictors
- Financial indicators also contributed meaningfully
- Boosting models outperformed traditional ensembles
Practical Implications and Limitations
Predictive models support:
- institutional early warning systems
- targeted student support interventions
- data-driven advising
Limitations:
- restricted to UCI dataset variables
- predictive relationships are not causal
- Enrolled class remains difficult to classify
Future Directions
Future work should explore:
- expanded feature engineering
- model interpretability (e.g., SHAP)
- external institutional validation
- deployment within student success systems
Final Takeaway
Student dropout prediction is a multi-factor, multi-class problem.
Machine learning provides a powerful framework for:
- identifying risk patterns
- improving early intervention
- supporting data-informed institutional decisions
Acknowledgment
Advisor: Dr. Shusen Pu
Capstone Projects in Data Science
University of West Florida
Questions
Thank you.
Kareem D. Piper