Multi-Class Machine Learning Approaches to Student Dropout

A Comparative Study of Classifiers and Data Balancing Techniques

Author

Kareem D. Piper (Advisor: Dr. Shusen Pu)

Published

March 17, 2026

1 Introduction

The goal of this capstone project is to test, assess, and report the results of various multi-class machine learning prediction models on their ability to accurately predict student dropout in the higher education setting. According to Realinho et al. (2022), student attrition and academic failure not only negatively affect the education institutions that students attend, but also create greater societal issues. For example, when students drop out before completing a degree program, they become less competitive in the job market, which leads to economic hardship, which in turn can lead to inadequate health care access and further socio-economic problems that often disproportionately affect marginalized demographics (Realinho et al., 2022).

According to Aina et al. (2022), statistics released by the Organisation for Economic Co-operation and Development (OECD) show that although the proportion of students who enroll in and graduate from colleges and universities far exceeds the proportion who drop out, dropouts still represent a third of those students. Further, Aina et al. (2022) assert that thirty percent of students in the United States drop out, many of them early (i.e., in the first year of college). Student dropout is a complex issue, and many factors contribute to it (Aina et al., 2022). These factors span sociological, economic, and psychological domains; as a result, according to Aina et al. (2022), scholars approach the phenomenon from their respective fields of expertise. There is a need for a more holistic lens when researching the causes of student dropout to better understand the relationships among the factors that contribute to it, thereby gaining insight into how best to mitigate it through intervention (Aina et al., 2022).

Machine learning allows researchers to approach student dropout in a holistic manner, specifically because models can handle high-dimensional datasets (Raschka et al., 2022). Researchers (e.g., Realinho et al., 2022; Ridwan et al., 2024) have utilized machine learning methods to predict student dropout. However, according to Mduma (2023), the prediction of student dropout is a particularly challenging issue for education researchers, due in large part to class imbalance. According to Mduma (2023), many real-world datasets concerning student dropout are heavily skewed towards the retained or enrolled classes. Many scholars pay particular attention to feature engineering when using machine learning to predict student attrition. However, Mduma (2023) asserted that class imbalance limits model accuracy and generalization and thus framed their study around that issue.

This capstone project focuses on appropriate data preprocessing methods, feature engineering methods, and methods to adequately address class imbalance. Further, I specifically use machine learning models that have a proven track record in the literature (e.g., Realinho et al., 2022; Ridwan et al., 2024) concerning their ability to handle multi-class datasets, for example Random Forest (RF), Extreme Gradient Boosting (XGBoost), Light Gradient Boosting Machine (LightGBM), and Categorical Boosting (CatBoost).

2 Research Questions

The purpose of this capstone project is to assess and report the results of various multi-class machine learning prediction models on their ability to accurately predict student dropout in the higher education setting using the (UCI Student Dropout Dataset). To inform this capstone project’s purpose, I asked the following primary and sub-research questions.

Primary Research Question

RQ1: Which multi-class machine learning classifier achieves the highest predictive performance, as measured by accuracy, precision, and F1-score, when applied to the (UCI Student Dropout Dataset)?

RQ1a: How do methods of addressing class imbalance affect the predictive performance of multi-class machine learning classifiers?

RQ1b: How does the performance of gradient boosting models compare to traditional ensemble models in multi-class student attrition prediction?

3 Methods

The main research question of this capstone project centers on comparing multi-class machine learning classifiers and addressing class imbalance in order to accurately predict student dropout in a higher education setting. Thus, the methods I use correspond directly to those used in prior work (e.g., Mduma, 2023; Realinho et al., 2022). For example, like the dataset used by Realinho et al. (2022), the target in this capstone’s dataset has three categories (i.e., dropout, enrolled, and graduate), which requires the use of multi-class classifiers. As is the case with many real-world datasets, the classes in the (UCI Student Dropout Dataset) are imbalanced, with graduates forming the largest class and enrolled students the smallest. To address the issue of class imbalance, I used five data balancing techniques, each paired with a multi-class machine learning classifier, to determine which pairing most accurately predicted student dropout (Mduma, 2023).

As mentioned previously, the dataset I used in this study stems from the UC Irvine Machine Learning Repository and can be accessed here: (UCI Student Dropout Dataset) (Realinho et al., 2022). Realinho et al. (2022) characterize the dataset as tabular and state that it is meant for the social sciences, specifically for researchers working on classification-based projects. There are (n = 4,424) cases and (n = 36) features in the dataset. According to Realinho et al. (2022), the feature types consist of real, categorical, and integer features. According to Realinho et al. (2022), the dataset was created specifically to help researchers who are interested in studying which factors contribute to student dropout or academic failure in higher education using machine learning methods. The target has three categories (i.e., dropout, enrolled, and graduate). According to Realinho et al. (2022), the dataset was rigorously preprocessed and contains no null or missing values.

To better understand the dataset, I conducted exploratory data analysis (EDA) on the (UCI Student Dropout Dataset). My goal was to gain a deeper understanding of the dataset’s features and their relationships to each other and to the target classes. EDA is a fundamental step in the machine learning pipeline, as the data may present limitations for the chosen methods (Raschka et al., 2022). The full EDA code can be found in this (Jupyter Notebook).

Code

# --- Install (run once in Colab / fresh venv) ---
# !pip install -q catboost lightgbm imbalanced-learn xgboost seaborn

# --- Standard library ---
import os
from pathlib import Path
from math import ceil

# --- Core scientific stack ---
import numpy as np
import pandas as pd

# --- Visualization ---
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker
from matplotlib.ticker import MaxNLocator
import seaborn as sns

# --- Scikit-learn: model selection / preprocessing / pipelines ---
from sklearn.model_selection import train_test_split, StratifiedKFold, cross_val_score
from sklearn.preprocessing import OneHotEncoder, LabelEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer

# --- Scikit-learn: feature selection ---
from sklearn.feature_selection import mutual_info_classif

# --- Scikit-learn: models ---
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier

# --- Scikit-learn: metrics ---
from sklearn.metrics import (
    accuracy_score,
    precision_score,
    recall_score,
    f1_score,
    classification_report,
    confusion_matrix,
    make_scorer
)

# --- External models ---
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier
from catboost import CatBoostClassifier

# --- Imbalanced-learn: pipelines + resampling ---
from imblearn.pipeline import Pipeline as ImbPipeline
from imblearn.under_sampling import RandomUnderSampler
from imblearn.over_sampling import RandomOverSampler, SMOTE
from imblearn.combine import SMOTEENN, SMOTETomek

Code

from pathlib import Path
import pandas as pd

# HARD-SET your repo root (this is the path that contains index.qmd and the data/ folder)
repo_root = Path("/Users/kareempiper/Projects/IDC6940_DataPraxisAI")
data_dir = repo_root / "data"
train_path = data_dir / "train.csv"
test_path  = data_dir / "test.csv"

print("Repo root:", repo_root)
print("Train exists:", train_path.exists(), train_path)
print("Test exists:", test_path.exists(), test_path)

if not (data_dir.exists() and train_path.exists() and test_path.exists()):
    raise FileNotFoundError(
        f"Expected data files not found.\n"
        f"data_dir: {data_dir} (exists={data_dir.exists()})\n"
        f"train: {train_path} (exists={train_path.exists()})\n"
        f"test:  {test_path} (exists={test_path.exists()})"
    )

train_data = pd.read_csv(train_path)
test_data  = pd.read_csv(test_path)

print("Train shape:", train_data.shape)
print("Test shape:", test_data.shape)

Repo root: /Users/kareempiper/Projects/IDC6940_DataPraxisAI
Train exists: True /Users/kareempiper/Projects/IDC6940_DataPraxisAI/data/train.csv
Test exists: True /Users/kareempiper/Projects/IDC6940_DataPraxisAI/data/test.csv
Train shape: (3539, 37)
Test shape: (885, 37)

Code

# Show all columns
pd.set_option('display.max_columns', None)
train_data.head()

Code

train_data.describe()

3.0.1 Table 1: Descriptive Statistics for Selected Numerical Features

Statistic	Age at Enrollment	Admission Grade	Curricular Units 2nd Sem (Grade)	Unemployment Rate	Inflation Rate	GDP
Count	3539.0	3539.0	3539.0	3539.0	3539.0	3539.0
Mean	23.3	126.88	11.23	11.5	1.23	1.76
Std	6.13	14.53	2.94	2.62	1.63	2.33
Min	17.0	95.0	0.0	7.6	-0.8	-4.06
25%	19.0	117.8	10.33	9.4	0.3	0.58
50%	21.0	126.0	12.0	11.1	1.4	1.74
75%	25.0	134.9	13.0	13.9	2.6	3.26
Max	70.0	190.0	18.0	16.2	3.7	3.51

Code

train_data.info()

Code

test_data.info()

Code

tg_cnts = train_data.Target.value_counts()
tg_cnts_sum = tg_cnts.sum()
tg_cnts_pct = (tg_cnts / tg_cnts_sum) * 100

# Plot
colors = ['#00FFFF', '#FF00FF', '#00FF00']  # Neon blue, magenta, lime green

plt.bar(tg_cnts.index, tg_cnts_pct, color=colors)
plt.xlabel('Categories')
plt.ylabel('Percentage')
plt.title('Target Distribution')

# Percentages on bars
for i, pct in enumerate(tg_cnts_pct):
    plt.text(i, pct + 1, f'{pct:.1f}%', ha='center', va='bottom')

plt.show()

Code

# Feature Cardinality Check
feature_count = (
    train_data.drop(columns=['Target'])
    .nunique()
    .reset_index()
    .rename(columns={'index': 'feature', 0: 'unique_count'})
    .sort_values(by='unique_count')
)

feature_count

3.0.2 Table 2: Unique Value Counts per Feature (Top and Bottom)

Feature	Unique Count
Daytime/evening attendance	2
Displaced	2
Debtor	2
Educational special needs	2
International	2
Scholarship holder	2
Gender	2
Tuition fees up to date	2
Marital status	6
Application order	8
…	…
Age at enrollment	46
Previous qualification (grade)	99
Admission grade	591
Curricular units 2nd sem (grade)	674
Curricular units 1st sem (grade)	683

Code

# Target Distribution by Gender
crosstab_gender = pd.crosstab(train_data['Gender'], train_data['Target'])
crosstab_gender = crosstab_gender.div(crosstab_gender.sum(axis=1), axis=0) * 100

# Plot
colors = ['#00FFFF', '#FF00FF', '#00FF00']  # Neon blue, magenta, lime green
ax_gender = crosstab_gender.plot(kind='bar', stacked=True, color=colors)
plt.xlabel('Gender (0 = Female, 1 = Male)')
plt.ylabel('Percentage')
plt.title('Distribution of Target Based on Gender')
plt.legend(title='Target')
plt.xticks(rotation=0)

for p in ax_gender.patches:
    width = p.get_width()
    height = p.get_height()
    x, y = p.get_xy()
    ax_gender.annotate(
        f'{height:.1f}%',
        (x + width / 2, y + height / 2),
        ha='center',
        va='center',
        fontsize=10,
        color='white'
    )

plt.show()

3.0.3 Table 3: Target Distribution by Gender (Percentages)

Gender	Dropout	Enrolled	Graduate
0	25.28	16.80	57.92
1	44.80	20.06	35.13

Code

# Create a crosstab to count the occurrences of each Target value for each Scholarship status
crosstab_ss = pd.crosstab(train_data['Scholarship holder'], train_data['Target'])

# Normalize the crosstab counts to percentages
crosstab_ss = crosstab_ss.div(crosstab_ss.sum(axis=1), axis=0) * 100

# Plot the data
colors = ['#00FFFF', '#FF00FF', '#00FF00']  # Neon blue, magenta, lime green
ax_ss = crosstab_ss.plot(kind='bar', stacked=True, color=colors)
plt.xlabel('Scholarship (0 = No Scholarship, 1 = Scholarship Holder)')
plt.ylabel('Percentage')
plt.title('Distribution of Target Based on Scholarship Holder')
plt.legend(title='Target')
plt.xticks(rotation=0)

# Annotate the percentage labels on each bar
for p in ax_ss.patches:
    width = p.get_width()
    height = p.get_height()
    x, y = p.get_xy()
    ax_ss.annotate(
        f'{height:.1f}%',
        (x + width / 2, y + height / 2),
        ha='center',
        va='center',
        fontsize=10,
        color='white'
    )

plt.show()

3.0.4 Table 4: Target Distribution by Scholarship Holder Status (Percentages)

Scholarship Holder	Dropout	Enrolled	Graduate
0	38.64	20.08	41.29
1	12.86	11.63	75.50

Code

# Create a crosstab to count the occurrences of each Target value for each Debtor status
crosstab_debtor = pd.crosstab(train_data['Debtor'], train_data['Target'])

# Normalize the crosstab counts to percentages
crosstab_debtor = crosstab_debtor.div(crosstab_debtor.sum(axis=1), axis=0) * 100

# Plot the data
colors = ['#00FFFF', '#FF00FF', '#00FF00']  # Neon blue, magenta, lime green
ax_debtor = crosstab_debtor.plot(kind='bar', stacked=True, color=colors)
plt.xlabel('Debtor (0 = No Debt, 1 = Debtor)')
plt.ylabel('Percentage')
plt.title('Distribution of Target Based on Debt Holder')
plt.legend(title='Target')
plt.xticks(rotation=0)

# Annotate the percentage labels on each bar
for p in ax_debtor.patches:
    width = p.get_width()
    height = p.get_height()
    x, y = p.get_xy()
    ax_debtor.annotate(
        f'{height:.1f}%',
        (x + width / 2, y + height / 2),
        ha='center',
        va='center',
        fontsize=10,
        color='white'
    )

plt.show()

3.0.5 Table 5: Target Distribution by Debtor Status (Percentages)

Debtor	Dropout	Enrolled	Graduate
0	28.19	17.97	53.84
1	64.01	17.74	18.25

Code

### Age Distribution by Target (ECDF)

plt.figure(figsize=(10, 6))
colors = ['#00FFFF', '#FF00FF', '#00FF00']  # Neon blue, magenta, lime green
plot = sns.ecdfplot(data=train_data, x='Age at enrollment', hue='Target', palette=colors)
plt.xlabel('Age at enrollment')
plt.ylabel('Proportion')
plt.title('Cumulative Distribution of Age by Target')
plot.xaxis.set_major_locator(ticker.MaxNLocator(integer=True))

mean_value = train_data['Age at enrollment'].mean()
plt.axvline(mean_value, color='grey', linestyle='--', label=f'Mean: {mean_value:.1f}')

plt.legend(title='Target')
plt.show()

Code

### Approved 2nd Semester Units by Target (ECDF)

plt.figure(figsize=(10, 6))
colors = ['#00FFFF', '#FF00FF', '#00FF00']  # Neon blue, magenta, lime green
plot = sns.ecdfplot(
    data=train_data,
    x='Curricular units 2nd sem (approved)',
    hue='Target',
    palette=colors
)

plt.xlabel('Curricular units 2nd sem (approved)')
plt.ylabel('Proportion')
plt.title('Cumulative Distribution of Curricular Units 2nd Sem (approved) by Target')
plot.xaxis.set_major_locator(ticker.MaxNLocator(integer=True))

mean_value = train_data['Curricular units 2nd sem (approved)'].mean()
plt.axvline(mean_value, color='grey', linestyle='--', label=f'Mean: {mean_value:.1f}')

plt.legend(title='Target')
plt.show()

Code

### 2nd Semester Grade Distribution by Target (ECDF)

plt.figure(figsize=(10, 6))
colors = ['#00FFFF', '#FF00FF', '#00FF00']  # Neon blue, magenta, lime green
plot = sns.ecdfplot(
    data=train_data,
    x='Curricular units 2nd sem (grade)',
    hue='Target',
    palette=colors
)

plt.xlabel('Curricular units 2nd sem (grade)')
plt.ylabel('Proportion')
plt.title('Cumulative Distribution of Curricular Units 2nd Sem (grade) by Target')

mean_value = train_data['Curricular units 2nd sem (grade)'].mean()
plt.axvline(mean_value, color='grey', linestyle='--', label=f'Mean: {mean_value:.1f}')

plt.legend(title='Target')
plt.show()

Code

### Identification of Categorical Features

# Identify categorical columns based on low cardinality
cat_cols = [
    col for col in train_data.columns
    if col != 'Target' and train_data[col].nunique() <= 8
]

# Convert to categorical dtype (for CatBoost compatibility later)
for col in cat_cols:
    train_data[col] = train_data[col].astype('category')
    test_data[col] = test_data[col].astype('category')

print("Categorical columns:")
print(cat_cols)

Categorical columns:
['Marital status', 'Application order', 'Daytime/evening attendance', 'Displaced', 'Educational special needs', 'Debtor', 'Tuition fees up to date', 'Gender', 'Scholarship holder', 'International']

Code

# Categorical feature distributions
n_cols = 4
n_rows = int(np.ceil(len(cat_cols) / n_cols))

fig, axs = plt.subplots(n_rows, n_cols, figsize=(11, 2.2 * n_rows))
axs = np.array(axs).ravel()

# Define a cyberpunk color palette
cyberpunk_palette = sns.color_palette('magma', n_colors=8)

for ax, col in zip(axs, cat_cols):
    vc = train_data[col].value_counts(normalize=True).sort_index()
    # Slice the palette to the number of categories (at most 8 here, so no cycling is needed)
    ax.bar(vc.index.astype(str), vc, color=cyberpunk_palette[:len(vc)])
    ax.set_title(col, fontsize=10)
    ax.yaxis.set_major_formatter(plt.FuncFormatter(lambda x, _: f'{x:.0%}'))
    ax.tick_params(axis='x', labelrotation=45)

# Turn off any unused axes
for ax in axs[len(cat_cols):]:
    ax.set_visible(False)
plt.tight_layout()
plt.show()

Code

### Continuous Feature Distributions

# Float feature distributions
float_cols = [
    col for col in train_data.columns
    if col != 'Target' and train_data[col].dtype == 'float64'
]

n_cols = 3
n_rows = int(np.ceil(len(float_cols) / n_cols))

fig, axs = plt.subplots(n_rows, n_cols, figsize=(11, 2.5 * n_rows))
axs = np.array(axs).ravel()

for ax, col in zip(axs, float_cols):
    ax.hist(train_data[col], bins=50, density=True, color='#00FFFF') # Neon blue
    ax.set_title(col, fontsize=10)

# Turn off unused axes
for ax in axs[len(float_cols):]:
    ax.axis('off')

plt.suptitle('Float Variables Distribution', fontsize=13, y=1.02)
plt.tight_layout()
plt.show()

Code

# Integer feature distributions
int_cols = [
    col for col in train_data.columns
    if col != 'Target' and train_data[col].dtype == 'int64'
]

n_cols = 4
n_rows = int(np.ceil(len(int_cols) / n_cols))

fig, axs = plt.subplots(n_rows, n_cols, figsize=(11, 2.2 * n_rows))
axs = np.array(axs).ravel()

for ax, col in zip(axs, int_cols):
    vc = train_data[col].value_counts(normalize=True).sort_index()
    ax.bar(vc.index, vc, color='#00FF00') # Lime green
    ax.xaxis.set_major_locator(MaxNLocator(integer=True))
    ax.yaxis.set_major_formatter(plt.FuncFormatter(lambda x, _: f'{x:.0%}'))
    ax.set_title(col, fontsize=10)

# Turn off unused axes
for ax in axs[len(int_cols):]:
    ax.axis('off')

plt.suptitle('Integer Variables Distribution', fontsize=13, y=1.02)
plt.tight_layout()
plt.show()

Code

### Spearman Correlation Analysis

# Select numeric features only (exclude Target)
numeric_cols = train_data.select_dtypes(include=['int64', 'float64']).columns
numeric_cols = [col for col in numeric_cols if col != 'Target']

plt.figure(figsize=(20, 15))
sns.heatmap(
    train_data[numeric_cols].corr(method='spearman'),
    cmap='magma',
    annot=True,
    fmt='.1f',
    linewidths=0.5
)
plt.title('Spearman Correlation Matrix (Numeric Features)')
plt.show()

The (UCI Student Dropout Dataset) exhibits significant class imbalance, with the Graduate class representing 49.9% (n = 2,209) of the distribution, followed by the Dropout class at 32.1% (n = 1,421) and the Enrolled class at 17.9% (n = 794). I observed several meaningful relationships across the feature groups (i.e., demographic, financial, and academic). For example, demographic and financial features were clearly associated with the target. The crosstab on the gender feature showed that males were more likely to drop out (44.8% versus 25.3%) and females more likely to graduate (57.9% versus 35.1%). Students who held scholarships showed higher graduation rates than non-scholarship students (75.5% versus 41.3%) and were less likely to drop out. Further, students in financial distress, as indicated by being coded as debtors, had lower graduation rates than non-debtors (18.3% versus 53.8%). Concerning academic features, second-semester grades and approved curricular units showed clear separation between the target classes, which may indicate importance for class prediction.

As mentioned, the dataset displays significant class imbalance. If left unchecked, this imbalance can bias models towards the majority class (Mduma, 2023). I also noticed that many of the categorical features were encoded as integers, which may lead to false interpretation in correlation analysis if not addressed properly (Raschka et al., 2022). Another limitation of this data is its lack of temporal features. Because the data represents a snapshot in time, models may not capture the full complexity of student dropout, which unfolds dynamically over time. Finally, because this data stems from a single institution, the findings may have limited generalizability. I am less concerned with this limitation, however, because the purpose of this study is to test which pairing of model and data balancing technique achieves the highest accuracy. In the following paragraphs I explain the full machine learning pipeline that I employed to test and evaluate the multi-class classifiers as well as the class balancing techniques that accompanied them.

Following the EDA, I prepared the dataset for analysis in a manner consistent with prior research using machine learning to predict student dropout (e.g., Barramuno et al., 2022; Kok et al., 2024; Mun & Jo, 2023). First, I separated the feature matrix (X) from the target (Y). The target was cast as a string type to prevent ordinal interpretation and to ensure consistent label processing. Next, I used LabelEncoder to transform the categories of the target (i.e., “Dropout”, “Enrolled”, and “Graduate”) into numerical categories (i.e., ‘Dropout’: 0, ‘Enrolled’: 1, and ‘Graduate’: 2) (Raschka et al., 2022). The dataset was then partitioned into a training subset and a validation subset using an 80:20 train_test_split with stratification to preserve the original class distribution across both sets (Mduma, 2023; Pek et al., 2023). Researchers (e.g., Mduma, 2023; Pek et al., 2023) use stratification to keep the class proportions the same in each split, which is particularly important when predicting student dropout under class imbalance, because imbalance, if left unchecked, can lead to bias.
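The encoding and splitting steps can be sketched as follows. This is a minimal illustration on a synthetic stand-in for the target column, not the notebook's actual code; the class counts, the random features, and the `random_state` value are assumptions chosen only to make the example reproducible.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

# Toy stand-in for the target column (hypothetical data, not the UCI file);
# the proportions only roughly mimic the imbalance described above.
rng = np.random.default_rng(42)
y_raw = np.array(["Graduate"] * 500 + ["Dropout"] * 320 + ["Enrolled"] * 180)
X = rng.normal(size=(len(y_raw), 4))

# LabelEncoder sorts the class names alphabetically before assigning integers.
encoder = LabelEncoder()
y = encoder.fit_transform(y_raw.astype(str))
label_map = {str(name): int(code) for code, name in enumerate(encoder.classes_)}

# 80:20 split; stratify=y keeps the class proportions the same in both subsets.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.20, stratify=y, random_state=42
)

train_share = np.bincount(y_train) / len(y_train)
val_share = np.bincount(y_val) / len(y_val)
```

Comparing `train_share` with `val_share` is a quick check that stratification preserved the class distribution; it does not balance the classes, which is the job of the resampling techniques.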

Next, I established a structured machine learning pipeline using ColumnTransformer, which ensured reproducibility and prevented data leakage (Raschka et al., 2022). Categorical features, identified by the ‘object’ and ‘category’ dtypes as well as by low cardinality (i.e., ≤ 8 unique values), were preprocessed using SimpleImputer followed by OneHotEncoder, which converts each nominal variable into binary indicator columns to ensure compatibility with the classification algorithms (Andrade-Giron et al., 2023; Namoun & Alshanqiti, 2020). Although the dataset, according to Realinho et al. (2022), has no missing values, as a precaution and sanity check I addressed any potential missing values in the numerical features using the SimpleImputer median strategy. By embedding preprocessing into the pipeline, I ensured that all transformations were learnt from the training data, thus reducing data leakage and enhancing the potential for generalization.
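A preprocessing step of this kind might look like the sketch below. The mini data frame is hypothetical, the column lists are hard-coded for brevity, and the `most_frequent` strategy for categorical imputation is an assumption (the text only specifies the median strategy for numeric features).

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical mini-frame with two categorical and two numeric columns.
toy = pd.DataFrame({
    "Gender": [0, 1, 1, 0, 1, 0],
    "Marital status": [1, 2, 1, 1, 4, 2],
    "Admission grade": [120.5, 131.0, np.nan, 142.3, 110.0, 127.8],
    "Age at enrollment": [19, 22, 35, 18, 27, 20],
})
cat_cols = ["Gender", "Marital status"]
num_cols = ["Admission grade", "Age at enrollment"]

# Categorical branch: impute (assumed most_frequent), then one-hot encode.
categorical = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("onehot", OneHotEncoder(handle_unknown="ignore")),
])
# Numeric branch: median imputation as a precaution.
numeric = Pipeline([("impute", SimpleImputer(strategy="median"))])

preprocess = ColumnTransformer(
    [("cat", categorical, cat_cols), ("num", numeric, num_cols)],
    sparse_threshold=0,  # always return a dense array
)

# Fit on training rows only, then reuse the fitted transformer on held-out rows.
train_rows, val_rows = toy.iloc[:4], toy.iloc[4:]
X_train = preprocess.fit_transform(train_rows)
X_val = preprocess.transform(val_rows)
```

Because the medians and category lists are learnt from `train_rows` alone, the held-out rows cannot influence the transformation. A category that appears only in the validation rows (here, marital status 4) is encoded as all zeros rather than raising an error.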

This study employs a systematic experimental framework to evaluate the effects of class-imbalance mitigation techniques paired with multi-class classifiers within a machine learning pipeline. The framework compares multiple resampling techniques and gradient boosting architectures while using rigorous validation methods to support robustness, minority-class sensitivity, and generalization performance (e.g., Delen et al., 2024; Kok et al., 2024; Ridwan et al., 2024). This framework provided the foundation for all model analyses and interpretive model evaluations reported in the following Analysis and Results section (Mduma, 2023; Pek et al., 2023).

4 Analysis and Results

All model pipelines and experiments were run in Google Colab using a (Jupyter Notebook). As mentioned previously, I encoded the target into numeric class labels (i.e., ‘Dropout’: 0, ‘Enrolled’: 1, and ‘Graduate’: 2). Next, I split the dataset into training and validation sets using an 80:20 train_test_split with stratification (Raschka et al., 2022). All numeric and categorical features were preprocessed by imputation, and the categorical features were one-hot encoded (Raschka et al., 2022).

After the data was preprocessed, I implemented the class-balancing techniques in the pipeline on the training set: Random Under Sampling (RUS), Random Over Sampling (ROS), Synthetic Minority Over Sampling (SMOTE), Synthetic Minority Over Sampling with Edited Nearest Neighbors (SMOTE-ENN), and Synthetic Minority Over Sampling with Tomek Links (SMOTE-Tomek). Each data balancing technique was paired with a multi-class classifier: Random Forest (RF), Extreme Gradient Boosting (XGBoost), Light Gradient Boosting Machine (LightGBM), or Categorical Boosting (CatBoost). I explain in more detail what comprises each of these class-balancing techniques and multi-class classifiers in the formula sections below. All models were evaluated with multi-class metrics; however, F1-score was the primary metric (Mduma, 2023). The effects of all model-balancing combinations were compared, and once the best model pipeline was selected, I conducted a detailed evaluation using a confusion matrix and permutation feature importance (Mduma, 2023).

Permutation feature importance is calculated by observing changes in model error (i.e., decreases or increases) when a feature’s values are permuted (Realinho et al., 2022). If permuting a feature’s values causes a large change in the model error, the feature is important to the overall model (see formula below).

Permutation Feature Importance Formula

\[ PI_j = E_{\text{perm}(j)} - E_{\text{base}} \]

Where:

\(PI_j\) = the permutation importance score for feature \(j\)
\(E_{\text{base}}\) = the base model error (i.e., the performance metric) computed on the original dataset
\(E_{\text{perm}(j)}\) = the model error after random permutation of the values of feature \(j\)
\(j\) = the index of the feature being evaluated
\(E\) = the prediction error metric used for evaluation (e.g., mean squared error, log loss or error rate)
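The formula can be illustrated with a small NumPy sketch. The data, the stand-in `predict` function, and the use of the plain error rate as \(E\) are all assumptions made for illustration; scikit-learn also ships a ready-made `sklearn.inspection.permutation_importance` that works with any fitted estimator and scoring metric.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: feature 0 determines the label, feature 1 is pure noise.
X = rng.normal(size=(500, 2))
y = (X[:, 0] > 0).astype(int)

def predict(X):
    """Stand-in for a fitted model: it only looks at feature 0."""
    return (X[:, 0] > 0).astype(int)

def error_rate(y_true, y_pred):
    """E: the share of misclassified observations."""
    return float(np.mean(y_true != y_pred))

def permutation_importance(X, y, j, n_repeats=10):
    """PI_j = E_perm(j) - E_base, averaged over several random shuffles."""
    e_base = error_rate(y, predict(X))
    e_perm = []
    for _ in range(n_repeats):
        X_perm = X.copy()
        X_perm[:, j] = rng.permutation(X_perm[:, j])  # break the feature-target link
        e_perm.append(error_rate(y, predict(X_perm)))
    return float(np.mean(e_perm)) - e_base

pi_informative = permutation_importance(X, y, j=0)
pi_noise = permutation_importance(X, y, j=1)
```

Shuffling the informative feature raises the error substantially, while shuffling the noise feature leaves the predictions untouched, so its importance is exactly zero in this toy setup.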

To assess the error when conducting permutation, I used F1-score as the metric, which, according to Realinho et al. (2022), is the most adequate when dealing with an imbalanced dataset. The F1-score assesses how well a classification model predicts the positive class by balancing precision and recall. In the context of selecting “important” features, if permuting a feature causes a significant decrease in F1-score, the feature is deemed important (see formula below).

F1-Score Formula

\[ F_1 = \frac{2PR}{P + R} \]

Where:

\(F_1\) = The F1-score, which is the harmonic mean of precision and recall
\(P\) = Precision, which is the proportion of predicted positives that are true positives
\(R\) = Recall, which is the proportion of actual positives that are correctly predicted
\(2\) = The weighting factor that yields the harmonic mean and it penalizes extreme imbalances between \(P\) and \(R\) (Realinho et al., 2022).
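For a three-class target, precision and recall are computed one class at a time (treating that class as "positive") and the per-class F1-scores are then averaged. The sketch below uses a made-up set of labels and an unweighted (macro) average; the averaging scheme is an assumption, since the text does not state which one was used.

```python
import numpy as np

def per_class_f1(y_true, y_pred, n_classes):
    """Compute F1 = 2PR / (P + R) for each class, treating it as the positive class."""
    scores = []
    for c in range(n_classes):
        tp = np.sum((y_pred == c) & (y_true == c))
        fp = np.sum((y_pred == c) & (y_true != c))
        fn = np.sum((y_pred != c) & (y_true == c))
        precision = tp / (tp + fp) if (tp + fp) else 0.0
        recall = tp / (tp + fn) if (tp + fn) else 0.0
        denom = precision + recall
        scores.append(float(2 * precision * recall / denom) if denom else 0.0)
    return scores

# Hypothetical labels: 0 = Dropout, 1 = Enrolled, 2 = Graduate.
y_true = np.array([0, 0, 0, 1, 1, 2, 2, 2, 2, 2])
y_pred = np.array([0, 0, 1, 1, 2, 2, 2, 2, 2, 0])

f1_by_class = per_class_f1(y_true, y_pred, n_classes=3)
macro_f1 = float(np.mean(f1_by_class))  # unweighted mean over the three classes
```

The macro average gives the small Enrolled class the same weight as the large Graduate class, which is one reason F1 is often preferred over accuracy on imbalanced data.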

As mentioned previously, in this capstone project permutation feature importance was used strictly as an evaluation criterion on the final machine learning pipeline to assess which features were most important (Realinho et al., 2022). Specifically, permutation feature importance was conducted on the XGBoost and SMOTE-Tomek pipeline, which achieved the highest accuracy (i.e., 77.82%). Below I discuss in more detail each of the classifiers and data balancing techniques used in this project.

RF is an ensemble learning method that builds many decision trees and then aggregates their predictions, thus improving model accuracy while reducing overfitting (Ho, 1995).

Random Forest Formula

\[ \hat{y} = \frac{1}{B}\sum_{b=1}^{B} T_b(x) \]

Where:

\(\hat{y}\) = The final predicted output of the RF model
\(B\) = The total number of decision trees in the RF
\(T_b(x)\) = The prediction of \(b\)-th decision tree for input \(x\)
\(x\) = The feature vector (i.e., the feature observation)
\(\sum_{b=1}^{B}\) = The aggregation across all trees
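For a classification task, one common reading of this average is that each tree \(T_b(x)\) returns a vector of class probabilities, the vectors are averaged, and the class with the largest averaged probability is predicted. The sketch below follows that reading with made-up per-tree outputs for a single student; the numbers are purely illustrative.

```python
import numpy as np

# Hypothetical per-tree class-probability outputs T_b(x) for one student x,
# over the classes (Dropout, Enrolled, Graduate). Here B = 4 trees.
tree_outputs = np.array([
    [0.70, 0.20, 0.10],
    [0.40, 0.35, 0.25],
    [0.20, 0.30, 0.50],
    [0.60, 0.30, 0.10],
])

B = tree_outputs.shape[0]
y_hat = tree_outputs.sum(axis=0) / B       # (1/B) * sum over b of T_b(x)
predicted_class = int(np.argmax(y_hat))    # index of the largest averaged probability
```

One tree in this example favours Graduate, but the averaged vector still favours Dropout; averaging is what dampens the influence of any single overfitted tree.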

XGBoost operates on the premise of gradient boosting and builds trees sequentially, where each new tree corrects the errors of the ensemble built so far (Chen & Guestrin, 2016).

Extreme Gradient Boosting Formula

\[ \hat{y}_i = \sum_{k=1}^{K} f_k(x_i), \quad f_k \in \mathcal{F} \]

Where:

\(\hat{y}_i\) = The predicted output for the observation \(i\)
\(K\) = The number of trees (i.e., the number of boosting iterations)
\(f_k(x_i)\) = The prediction from the \(k\)-th decision tree
\(x_i\) = The feature vector for the observation \(i\)
\(\mathcal{F}\) = The space of all possible decision trees

LightGBM also uses gradient boosting, but it relies on a histogram-based tree-learning framework that allows it to scale to large datasets (Ke et al., 2017).

Light Gradient Boosting Machine Formula

\[ \hat{y}_i^{(t)} = \hat{y}_i^{(t-1)} + \eta f_t(x_i) \]

Where:

\(\hat{y}_i^{(t)}\) = The updated prediction for observation \(i\) at iteration \(t\)
\(\hat{y}_i^{(t-1)}\) = The prediction from the previous boosting iteration
\(\eta\) = The learning rate, which controls the contribution of each new tree
\(f_t(x_i)\) = The prediction of the tree added at iteration \(t\)
\(x_i\) = The feature vector for observation \(i\)
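The update rule can be made concrete with a deliberately simplified sketch: a one-dimensional regression problem with squared error, where each "tree" is a single-split stump fitted to the current residuals. This is an assumption-laden toy, not LightGBM itself (which uses histogram-based splits and, for this project, a multi-class objective); the data, `eta`, and the number of iterations are arbitrary.

```python
import numpy as np

def fit_stump(x, residual):
    """Find the single threshold split on x that best fits the residuals."""
    best = None
    for thr in np.unique(x)[:-1]:  # skip the largest value so both sides are non-empty
        left, right = residual[x <= thr], residual[x > thr]
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if best is None or sse < best[0]:
            best = (sse, thr, left.mean(), right.mean())
    _, thr, left_val, right_val = best
    return lambda z: np.where(z <= thr, left_val, right_val)

rng = np.random.default_rng(1)
x = np.sort(rng.uniform(0, 10, size=200))
y = np.sin(x) + rng.normal(scale=0.1, size=x.size)

eta = 0.1                    # learning rate
y_hat = np.zeros_like(y)     # initial prediction, y_hat^(0) = 0
mse_history = []
for t in range(100):
    f_t = fit_stump(x, y - y_hat)    # the new tree fits the current residuals
    y_hat = y_hat + eta * f_t(x)     # y_hat^(t) = y_hat^(t-1) + eta * f_t(x)
    mse_history.append(float(np.mean((y - y_hat) ** 2)))
```

A small `eta` means each tree contributes only a fraction of its fitted correction, so the training error falls gradually over many iterations instead of being driven down by the first few trees.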

CatBoost also uses a gradient boosting method but is specifically designed to handle categorical features (Prokhorenkova et al., 2018). Because of this distinct capability, I analyzed CatBoost separately rather than integrating it into the pipeline.

Categorical Boosting Formula

\[ \hat{y}_i = \sum_{t=1}^{T} \eta f_t(x_i) \]

Where:

\(\hat{y}_i\) = The predicted output for observation \(i\)
\(T\) = The total number of trees (i.e., boosting iterations)
\(\eta\) = The learning rate
\(f_t(x_i)\) = The prediction from the tree at iteration \(t\)
\(x_i\) = The input vector including the categorical variables
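The three boosting formulas describe the same additive structure. As a short worked step, unrolling the iterative update \(\hat{y}_i^{(t)} = \hat{y}_i^{(t-1)} + \eta f_t(x_i)\) over \(T\) iterations gives the summed form, assuming an initial prediction \(\hat{y}_i^{(0)}\):

```latex
\[
\hat{y}_i^{(T)}
  = \hat{y}_i^{(T-1)} + \eta f_T(x_i)
  = \hat{y}_i^{(T-2)} + \eta f_{T-1}(x_i) + \eta f_T(x_i)
  = \dots
  = \hat{y}_i^{(0)} + \sum_{t=1}^{T} \eta f_t(x_i)
\]
```

With \(\hat{y}_i^{(0)} = 0\) this is the summed form with an explicit learning rate, and absorbing \(\eta\) into each tree (i.e., writing \(f_k = \eta f_t\)) gives the plain sum over \(K\) trees. Under these assumptions, the libraries differ in how each tree is built rather than in the form of the final prediction.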

Data Balancing

As mentioned previously, to address the issue of class imbalance, I implemented five data balancing techniques on the dataset: (1) RUS, (2) ROS, (3) SMOTE, (4) SMOTE-ENN, and (5) SMOTE-Tomek (Mduma, 2023).

Concerning RUS, this technique works by randomly removing observations from the majority class until the classes are approximately equal.

Random Under Sampling Formula

\[ \alpha_{us} = \frac{N_{m}}{N_{rM}} \]

Where:

\(\alpha_{us}\) = The under sampling ratio
\(N_m\) = The number of samples in the minority class
\(N_{rM}\) = The number of majority class samples after under sampling
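A minimal NumPy sketch of random under-sampling is shown below. The helper `random_under_sample` and the class counts are hypothetical; in practice the `RandomUnderSampler` class from imbalanced-learn provides this behaviour.

```python
import numpy as np

def random_under_sample(X, y, rng):
    """Randomly drop rows from larger classes until every class matches the smallest."""
    classes, counts = np.unique(y, return_counts=True)
    n_min = counts.min()
    keep = np.concatenate([
        rng.choice(np.flatnonzero(y == c), size=n_min, replace=False)
        for c in classes
    ])
    keep.sort()  # preserve the original row order
    return X[keep], y[keep]

rng = np.random.default_rng(0)
# Hypothetical counts: 50 Graduate (2), 32 Dropout (0), 18 Enrolled (1).
y = np.array([2] * 50 + [0] * 32 + [1] * 18)
X = rng.normal(size=(y.size, 3))

X_rus, y_rus = random_under_sample(X, y, rng)

n_m = int(np.sum(y == 1))         # minority samples (unchanged by under-sampling)
n_rM = int(np.sum(y_rus == 2))    # majority samples after under-sampling
alpha_us = n_m / n_rM
```

A ratio of 1 means the majority class was cut all the way down to the minority size. The cost is discarded information: here 46 of the 100 rows are thrown away.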

ROS is the opposite of RUS: it works by randomly duplicating minority-class observations until the classes are approximately equal.

Random Over Sampling Formula

\[ \alpha_{os} = \frac{N_{r m}}{N_{m}} \]

Where:

\(\alpha_{os}\) = The oversampling ratio
\(N_m\) = The original number of minority class samples
\(N_{rm}\) = The number of minority class samples after oversampling
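The oversampling ratio can be illustrated with a small pure-Python sketch. The counts are hypothetical and `random_over_sample` is an illustrative helper, assuming every smaller class is duplicated (with replacement) up to the size of the largest class.

```python
import random
from collections import Counter

def random_over_sample(labels, seed=42):
    """Return indices in which smaller classes are duplicated up to the largest class size."""
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)
    n_max = max(len(idxs) for idxs in by_class.values())
    resampled = []
    for idxs in by_class.values():
        # Draw the missing samples with replacement from the class's own members
        extra = [rng.choice(idxs) for _ in range(n_max - len(idxs))]
        resampled.extend(idxs + extra)
    return resampled

# Hypothetical class counts, not the project's actual counts
labels = ["Graduate"] * 50 + ["Dropout"] * 30 + ["Enrolled"] * 20

resampled = random_over_sample(labels)
counts_after = Counter(labels[i] for i in resampled)

# alpha_os: minority count after oversampling over the original minority count
alpha_os = counts_after["Enrolled"] / labels.count("Enrolled")
print(dict(counts_after), alpha_os)
```

Here the minority class grows from 20 to 50 samples, so the ratio is 2.5; no new information is created, since the added rows are exact duplicates.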

SMOTE works by generating new synthetic minority class samples: each new sample is interpolated between a minority instance and one of its nearest minority neighbors, and this is repeated until the minority and majority classes match in size.

Synthetic Minority Oversampling Formula

\[ x_{\text{new}} = x_i + \lambda \left(x_{nn} - x_i\right) \]

Where:

\(x_{\text{new}}\) = The newly generated synthetic minority class sample
\(x_i\) = The original minority class instance
\(x_{nn}\) = One of the \(k\)-nearest neighbors of \(x_i\)
\(\lambda\) = The random interpolation factor, where \(0 \le \lambda \le 1\)
\(k\) = The total number of nearest neighbors considered
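The interpolation step can be sketched in a few lines of pure Python. The minority points are hypothetical two-dimensional values and `smote_sample` is an illustrative helper that generates a single synthetic point; it is not the imbalanced-learn implementation used in the pipeline.

```python
import math
import random

def smote_sample(minority, k=3, seed=42):
    """Generate one synthetic point by interpolating toward a random k-nearest neighbour."""
    rng = random.Random(seed)
    i = rng.randrange(len(minority))
    x_i = minority[i]
    # k nearest minority neighbours of x_i, excluding x_i itself
    others = [p for j, p in enumerate(minority) if j != i]
    neighbours = sorted(others, key=lambda p: math.dist(x_i, p))[:k]
    x_nn = rng.choice(neighbours)
    lam = rng.random()  # interpolation factor lambda, drawn from [0, 1)
    # x_new = x_i + lambda * (x_nn - x_i)
    x_new = tuple(a + lam * (b - a) for a, b in zip(x_i, x_nn))
    return x_i, x_nn, lam, x_new

# Hypothetical minority-class points with two numeric features
minority = [(1.0, 1.0), (1.5, 1.2), (2.0, 0.8), (1.2, 1.9), (3.0, 3.0)]

x_i, x_nn, lam, x_new = smote_sample(minority)
print(x_i, x_nn, round(lam, 3), x_new)
```

Because \(\lambda\) lies between 0 and 1, the synthetic point always falls on the line segment joining \(x_i\) and \(x_{nn}\).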

SMOTE ENN works the same as SMOTE but adds a cleaning step based on edited nearest neighbors (ENN), which removes samples whose class label disagrees with that of their nearest neighbors; this reduces noise and class overlap.

Synthetic Minority Oversampling with Edited Nearest Neighbors Formula

\[ D^{*} = \text{ENN}\bigl(\text{SMOTE}(D)\bigr) \]

Where:

\(D\) = The original imbalanced dataset
\(\text{SMOTE}(D)\) = The dataset after synthetic minority samples are generated
\(\text{ENN}(\cdot)\) = The edited nearest neighbors rule, which removes samples that are misclassified by their nearest neighbors or are otherwise ambiguous
\(D^{*}\) = The final cleaned and balanced dataset
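The ENN cleaning step can be sketched as follows. This is a minimal illustration on hypothetical points that assumes a majority-vote rule over the \(k\) nearest neighbors; library implementations may apply a stricter rule (for example, requiring all neighbors to agree), and `edited_nearest_neighbours` is an illustrative helper name.

```python
import math
from collections import Counter

def edited_nearest_neighbours(points, labels, k=3):
    """Keep only samples whose label matches the majority vote of their k nearest neighbours."""
    kept = []
    for i, (p, label) in enumerate(zip(points, labels)):
        neighbours = sorted(
            (j for j in range(len(points)) if j != i),
            key=lambda j: math.dist(p, points[j]),
        )[:k]
        vote = Counter(labels[j] for j in neighbours).most_common(1)[0][0]
        if vote == label:
            kept.append(i)
    return kept

# Hypothetical data: two well-separated clusters plus one "A" point
# sitting in the middle of the "B" cluster (index 8).
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0),
          (5.0, 5.0), (5.0, 6.0), (6.0, 5.0), (6.0, 6.0),
          (5.5, 5.5)]
labels = ["A", "A", "A", "A", "B", "B", "B", "B", "A"]

kept = edited_nearest_neighbours(points, labels, k=3)
print(kept)
```

Only the stray "A" point inside the "B" cluster is dropped, which is the kind of overlap the cleaning step is meant to remove after SMOTE has been applied.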

SMOTE Tomek also begins with SMOTE but then removes Tomek links, which are pairs of instances from opposite classes that are each other's nearest neighbors; removing them reduces overlap between the classes.

Synthetic Minority Oversampling with Tomek Links Formula

\[ D^{*} = D_{\text{SMOTE}} \setminus TL \]

Where:

\(D_{\text{SMOTE}}\) = The dataset after synthetic minority oversampling
\(TL\) = The set of instances that form Tomek links between the minority and majority classes
\(\setminus\) = The removal operator (i.e., set subtraction)
\(D^{*}\) = The final balanced dataset after the Tomek links are removed
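Detecting Tomek links can be sketched in pure Python. The points and labels are hypothetical and `tomek_links` is an illustrative helper; the sketch assumes both members of each link are removed, whereas some implementations remove only the majority-class member.

```python
import math

def tomek_links(points, labels):
    """Return index pairs (i, j) that are mutual nearest neighbours with different labels."""
    def nearest(i):
        return min(
            (j for j in range(len(points)) if j != i),
            key=lambda j: math.dist(points[i], points[j]),
        )

    links = set()
    for i in range(len(points)):
        j = nearest(i)
        if labels[i] != labels[j] and nearest(j) == i:
            links.add((min(i, j), max(i, j)))
    return sorted(links)

# Hypothetical data: indices 3 and 4 sit close together but belong to different classes
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (3.0, 3.0),
          (3.2, 3.0), (5.0, 5.0), (5.0, 6.0)]
labels = ["min", "min", "min", "min", "maj", "maj", "maj"]

links = tomek_links(points, labels)

# D* = D_SMOTE \ TL: drop every instance that takes part in a link
linked = {idx for pair in links for idx in pair}
cleaned = [i for i in range(len(points)) if i not in linked]
print(links, cleaned)
```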

By implementing the above data balancing techniques, I aimed to balance the class distribution and, in the case of SMOTE ENN and SMOTE Tomek, to clean the decision boundary, with the goal of reducing statistical noise and improving model performance (Mduma, 2023). The five data balancing techniques (i.e., ROS, RUS, SMOTE, SMOTE ENN, and SMOTE Tomek), in addition to the original unbalanced dataset, were paired with each of the multi-class classifiers (i.e., RF, XGBoost, and LightGBM) to assess which pair yielded the best results. In other words, I assessed performance in six instances: (1) on the original unbalanced dataset, (2) on the ROS balanced dataset, (3) on the RUS balanced dataset, (4) on the SMOTE balanced dataset, (5) on the SMOTE ENN balanced dataset, and (6) on the SMOTE Tomek balanced dataset.

Concerning model evaluation, specifically for measures of accuracy, I chose metrics that are particularly robust against unbalanced datasets. These included macro-averaged Precision, macro-averaged Recall, and macro-averaged F1-Score (\(F1_{\text{macro}}\)). The macro-averaged F1-Score, in particular, was the primary criterion used to identify the best performing model, as it treats each class equally. I also considered overall accuracy and the weighted-averaged F1-Score (\(F1_{\text{weighted}}\)) (Raschka et al., 2022). Macro-averaged metrics provide a balanced view of model performance across all classes, preventing excellent performance on majority classes from masking poor performance on minority classes (Raschka et al., 2022).

As previously mentioned, this project utilized the F1-Score, a specific instance of the F-measure, primarily its macro-averaged and weighted-averaged forms, rather than a single general F-measure (Raschka et al., 2022). The fundamental formula for an F1-Score is:

\[\text{F1-Score} = \frac{2PR}{P + R}\]

Where:

\(P\) = Precision for a given class
\(R\) = Recall for a given class

And for a specific class:

\[ P = \frac{TP}{TP + FP} \]

\[ R = \frac{TP}{TP + FN} \]

Where:

\(TP\) = True Positive
\(FN\) = False Negative
\(FP\) = False Positive

In this project, the macro-averaged F1-Score was calculated by computing the F1-Score independently for each class and then taking the unweighted mean. The weighted-averaged F1-Score was calculated by computing the F1-Score for each class and then averaging the scores, with each class weighted by its number of true instances (Raschka et al., 2022).
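As a worked check of the two averaging schemes, the sketch below recomputes both averages from the per-class scores and supports printed (rounded to two decimals) in the classification report for the best pipeline in Section 4.1. Because the inputs are rounded, the results are expected to differ slightly from the reported 0.7150 and 0.7710, which were presumably computed from unrounded per-class scores.

```python
# Per-class values as printed in the Section 4.1 classification report (rounded)
f1_per_class = {"Dropout": 0.80, "Enrolled": 0.49, "Graduate": 0.85}
support = {"Dropout": 227, "Enrolled": 127, "Graduate": 354}

# F1 for one class from its precision and recall: 2PR / (P + R)
p_enrolled, r_enrolled = 0.56, 0.44
f1_enrolled = 2 * p_enrolled * r_enrolled / (p_enrolled + r_enrolled)

# Macro average: unweighted mean of the per-class F1 scores
f1_macro = sum(f1_per_class.values()) / len(f1_per_class)

# Weighted average: each class's F1 weighted by its number of true instances
total = sum(support.values())
f1_weighted = sum(f1_per_class[c] * support[c] for c in f1_per_class) / total

print(round(f1_enrolled, 4), round(f1_macro, 4), round(f1_weighted, 4))
```

The macro average (about 0.713) sits below the weighted average (about 0.769) because the weak Enrolled class counts for a full third of the macro score but only 127 of 708 instances in the weighted score.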

4.1 Modeling and Results

Code

# Defining X,y, and Encoding Multiclass Target

# Separate features and target
X = train_data.drop(columns=['Target'])
y = train_data['Target'].astype(str)  # ensure string labels

# Encode target to integers for models that require numeric targets
label_encoder = LabelEncoder()
y_enc = label_encoder.fit_transform(y)

# Keep mapping for interpretation later
target_classes = list(label_encoder.classes_)
class_map = {cls: idx for idx, cls in enumerate(target_classes)}  # LabelEncoder assigns codes in sorted class order

print("Target classes:", target_classes)
print("Class mapping:", class_map)

Target classes: ['Dropout', 'Enrolled', 'Graduate']
Class mapping: {'Dropout': 0, 'Enrolled': 1, 'Graduate': 2}

Code

# Stratified Train/Validation Split

X_train, X_val, y_train, y_val = train_test_split(
    X, y_enc,
    test_size=0.20,
    random_state=42,
    stratify=y_enc
)

print("Train shape:", X_train.shape, "Val shape:", X_val.shape)

Train shape: (2831, 36) Val shape: (708, 36)

Code

# Identifying Feature Types and Building the Pipeline

# Identify categorical and numeric columns
cat_cols = X.select_dtypes(include=['object', 'category']).columns.tolist()
num_cols = X.select_dtypes(include=['int64', 'float64']).columns.tolist()

print("Categorical cols:", len(cat_cols))
print("Numeric cols:", len(num_cols))

# Preprocessing for numeric and categorical features
numeric_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="median"))
])

categorical_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("onehot", OneHotEncoder(handle_unknown="ignore"))
])

preprocessor = ColumnTransformer(
    transformers=[
        ("num", numeric_transformer, num_cols),
        ("cat", categorical_transformer, cat_cols)
    ],
    remainder="drop"
)

Categorical cols: 10
Numeric cols: 26

Code

from imblearn.under_sampling import RandomUnderSampler
from imblearn.over_sampling import RandomOverSampler, SMOTE
from imblearn.combine import SMOTEENN, SMOTETomek
from imblearn.pipeline import Pipeline as ImbPipeline

# Here I define samplers for imbalance experiments

samplers = {
    "None": None,
    "RUS": RandomUnderSampler(random_state=42),
    "ROS": RandomOverSampler(random_state=42),
    "SMOTE": SMOTE(random_state=42),
    "SMOTEENN": SMOTEENN(random_state=42),
    "SMOTETomek": SMOTETomek(random_state=42),
}

Code

# Here I define my models

models = {
    "RandomForest": RandomForestClassifier(
        n_estimators=400,
        random_state=42,
        n_jobs=-1
    ),
    "XGBoost": XGBClassifier(
        objective="multi:softprob",
        num_class=3,
        n_estimators=600,
        learning_rate=0.05,
        max_depth=6,
        subsample=0.9,
        colsample_bytree=0.9,
        reg_lambda=1.0,
        random_state=42,
        n_jobs=-1
    ),
    "LightGBM": LGBMClassifier(
        objective="multiclass",
        num_class=3,
        n_estimators=800,
        learning_rate=0.05,
        random_state=42,
        n_jobs=-1
    )
}

Code

# This is an evaluation helper for multiclass metrics

def evaluate_multiclass(y_true, y_pred, class_names):
    results = {
        "accuracy": accuracy_score(y_true, y_pred),
        "precision_macro": precision_score(y_true, y_pred, average="macro", zero_division=0),
        "recall_macro": recall_score(y_true, y_pred, average="macro", zero_division=0),
        "f1_macro": f1_score(y_true, y_pred, average="macro", zero_division=0),
        "f1_weighted": f1_score(y_true, y_pred, average="weighted", zero_division=0),
    }
    print(pd.Series(results).round(4))
    print("\nClassification Report (macro focus):\n")
    print(classification_report(y_true, y_pred, target_names=class_names, zero_division=0))
    return results

Code

# Here I train and compare my models and balancing methods

all_results = []

for sampler_name, sampler in samplers.items():
    for model_name, model in models.items():

        # Build imblearn pipeline (sampler must occur AFTER preprocessing)
        steps = [("preprocess", preprocessor)]
        if sampler is not None:
            steps.append(("sampler", sampler))
        steps.append(("model", model))

        pipe = ImbPipeline(steps=steps)

        # Fit and predict
        pipe.fit(X_train, y_train)
        y_pred = pipe.predict(X_val)

        # Collect metrics
        results = {
            "sampler": sampler_name,
            "model": model_name
        }
        results.update({
            "accuracy": accuracy_score(y_val, y_pred),
            "precision_macro": precision_score(y_val, y_pred, average="macro", zero_division=0),
            "recall_macro": recall_score(y_val, y_pred, average="macro", zero_division=0),
            "f1_macro": f1_score(y_val, y_pred, average="macro", zero_division=0),
            "f1_weighted": f1_score(y_val, y_pred, average="weighted", zero_division=0),
        })

        all_results.append(results)

results_df = pd.DataFrame(all_results).sort_values(by="f1_macro", ascending=False)
results_df.head(15)

4.1.1 Table 6: Comparative Performance of Models with Different Sampling Techniques (Macro F1-Score Sorted)

Sampler	Model	Accuracy	Precision (macro)	Recall (macro)	F1-Macro	F1-Weighted
SMOTETomek	XGBoost	0.7782	0.7273	0.7052	0.7131	0.7708
SMOTETomek	LightGBM	0.7768	0.7281	0.7037	0.7124	0.7692
SMOTETomek	RandomForest	0.7684	0.7152	0.7037	0.7086	0.7655
SMOTE	XGBoost	0.7768	0.7215	0.7008	0.7079	0.7686
None	XGBoost	0.7811	0.7270	0.6991	0.7070	0.7699

Code

# Now I will select the best pipeline and show confusion matrix

best_row = results_df.iloc[0]
best_sampler = best_row["sampler"]
best_model = best_row["model"]

print("Best sampler:", best_sampler)
print("Best model:", best_model)

# Rebuild the best pipeline
steps = [("preprocess", preprocessor)]
if best_sampler != "None":
    steps.append(("sampler", samplers[best_sampler]))
steps.append(("model", models[best_model]))

best_pipe = ImbPipeline(steps=steps)
best_pipe.fit(X_train, y_train)

y_pred_best = best_pipe.predict(X_val)

# Metrics
evaluate_multiclass(y_val, y_pred_best, target_classes)

# Confusion matrix
cm = confusion_matrix(y_val, y_pred_best)
plt.figure(figsize=(7, 6))
sns.heatmap(cm, annot=True, fmt="d", xticklabels=target_classes, yticklabels=target_classes)
plt.xlabel("Predicted")
plt.ylabel("Actual")
plt.title(f"Confusion Matrix: {best_model} + {best_sampler}")
plt.show()

Best sampler: SMOTETomek
Best model: XGBoost
accuracy           0.7782
precision_macro    0.7302
recall_macro       0.7063
f1_macro           0.7150
f1_weighted        0.7710
dtype: float64

Classification Report (macro focus):

              precision    recall  f1-score   support

     Dropout       0.82      0.78      0.80       227
    Enrolled       0.56      0.44      0.49       127
    Graduate       0.81      0.90      0.85       354

    accuracy                           0.78       708
   macro avg       0.73      0.71      0.72       708
weighted avg       0.77      0.78      0.77       708

Code

# CatBoost (Native Categorical Handling + Class Weights)
# CatBoost handles categorical features natively, so no one-hot encoding is applied

# Prepare CatBoost inputs (no one-hot encoding)
X_train_cb = X_train.copy()
X_val_cb = X_val.copy()

# Ensure object/category columns are treated as strings or category
for col in cat_cols:
    X_train_cb[col] = X_train_cb[col].astype(str)
    X_val_cb[col] = X_val_cb[col].astype(str)
    test_data[col] = test_data[col].astype(str)

# Cat feature indices (CatBoost needs indices, not names)
cat_feature_indices = [X_train_cb.columns.get_loc(c) for c in cat_cols]

# Compute class weights (inverse frequency)
class_counts = np.bincount(y_train)
class_weights = (class_counts.sum() / (len(class_counts) * class_counts)).tolist()

cat_model = CatBoostClassifier(
    loss_function="MultiClass",
    iterations=1200,
    learning_rate=0.05,
    depth=6,
    random_seed=42,
    verbose=0,
    class_weights=class_weights
)

cat_model.fit(X_train_cb, y_train, cat_features=cat_feature_indices)
y_pred_cat = cat_model.predict(X_val_cb).astype(int).ravel()

print("CatBoost Results:")
evaluate_multiclass(y_val, y_pred_cat, target_classes)

cm_cat = confusion_matrix(y_val, y_pred_cat)
plt.figure(figsize=(7, 6))
sns.heatmap(cm_cat, annot=True, fmt="d", xticklabels=target_classes, yticklabels=target_classes)
plt.xlabel("Predicted")
plt.ylabel("Actual")
plt.title("Confusion Matrix: CatBoost (class weights)")
plt.show()

CatBoost Results:
accuracy           0.7599
precision_macro    0.7008
recall_macro       0.6940
f1_macro           0.6969
f1_weighted        0.7568
dtype: float64

Classification Report (macro focus):

              precision    recall  f1-score   support

     Dropout       0.81      0.79      0.80       227
    Enrolled       0.47      0.43      0.45       127
    Graduate       0.82      0.86      0.84       354

    accuracy                           0.76       708
   macro avg       0.70      0.69      0.70       708
weighted avg       0.75      0.76      0.76       708
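The inverse-frequency class weights passed to CatBoost above follow \(w_c = N / (K \cdot n_c)\), where \(N\) is the total number of samples, \(K\) the number of classes, and \(n_c\) the count for class \(c\). The sketch below reproduces that arithmetic in pure Python; since the training class counts are not printed in this section, it uses the validation supports (227, 127, 354) as stand-in counts, so the weights shown are illustrative rather than the exact values used in training.

```python
# Stand-in class counts (validation supports); the actual training counts are assumed
# to be roughly proportional because the split was stratified.
class_names = ["Dropout", "Enrolled", "Graduate"]
class_counts = [227, 127, 354]

n_total = sum(class_counts)
n_classes = len(class_counts)

# w_c = N / (K * n_c): rarer classes receive larger weights
class_weights = [n_total / (n_classes * n_c) for n_c in class_counts]

for name, weight in zip(class_names, class_weights):
    print(f"{name}: {weight:.4f}")

# Each class contributes the same total weight, N / K, to the loss
weighted_totals = [w * n_c for w, n_c in zip(class_weights, class_counts)]
```

With these counts the Enrolled class receives the largest weight (about 1.86) and the Graduate class the smallest (about 0.67).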

Code

# Permutation feature importance scored by macro F1, following Realinho et al. (2022)

from sklearn.inspection import permutation_importance
from sklearn.metrics import make_scorer

f1_macro_scorer = make_scorer(f1_score, average="macro", zero_division=0)

perm = permutation_importance(
    best_pipe,
    X_val,
    y_val,
    scoring=f1_macro_scorer,
    n_repeats=10,
    random_state=42,
    n_jobs=-1
)

# The feature names for permutation importance should be the original column names
# because permutation_importance shuffles original features of X_val
feature_names = X_val.columns

importances = pd.DataFrame({
    "feature": feature_names,
    "importance_mean": perm.importances_mean,
    "importance_std": perm.importances_std
}).sort_values(by="importance_mean", ascending=False)

importances.head(20)

	feature	importance_mean	importance_std
30	Curricular units 2nd sem (approved)	0.187648	0.019643
16	Tuition fees up to date	0.031444	0.011680
24	Curricular units 1st sem (approved)	0.027742	0.006355
3	Course	0.025257	0.010252
10	Mother's occupation	0.021320	0.005959
31	Curricular units 2nd sem (grade)	0.015764	0.005662
19	Age at enrollment	0.014897	0.005074
28	Curricular units 2nd sem (enrolled)	0.014220	0.007198
29	Curricular units 2nd sem (evaluations)	0.010250	0.010510
1	Application mode	0.009378	0.004804
12	Admission grade	0.006683	0.006598
15	Debtor	0.006636	0.004866
22	Curricular units 1st sem (enrolled)	0.005902	0.004234
11	Father's occupation	0.004601	0.006005
18	Scholarship holder	0.004422	0.005459
9	Father's qualification	0.004266	0.007057
35	GDP	0.003711	0.005160
5	Previous qualification	0.003208	0.004559
34	Inflation rate	0.002330	0.003575
23	Curricular units 1st sem (evaluations)	0.002097	0.008426
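The idea behind these importance values can be sketched without any libraries: shuffle one feature column at a time and record how much the score drops. The example below is a simplified illustration that uses accuracy rather than macro F1, synthetic data, and a hypothetical already-fitted `model`; `permutation_importance_sketch` is an illustrative helper, not the scikit-learn function used above.

```python
import random

def accuracy(model, rows, targets):
    return sum(model(r) == t for r, t in zip(rows, targets)) / len(rows)

def permutation_importance_sketch(model, rows, targets, n_repeats=10, seed=42):
    """Mean drop in accuracy when each feature column is shuffled in turn."""
    rng = random.Random(seed)
    baseline = accuracy(model, rows, targets)
    importances = []
    for col in range(len(rows[0])):
        drops = []
        for _ in range(n_repeats):
            shuffled = [r[col] for r in rows]
            rng.shuffle(shuffled)
            # Rebuild each row with only the chosen column replaced
            permuted = [r[:col] + (v,) + r[col + 1:] for r, v in zip(rows, shuffled)]
            drops.append(baseline - accuracy(model, permuted, targets))
        importances.append(sum(drops) / n_repeats)
    return importances

# Synthetic data: the target depends only on feature 0; feature 1 is pure noise
data_rng = random.Random(0)
rows = [(data_rng.random(), data_rng.random()) for _ in range(200)]
targets = [int(r[0] > 0.5) for r in rows]

def model(row):
    """Hypothetical fitted model that ignores feature 1."""
    return int(row[0] > 0.5)

importances = permutation_importance_sketch(model, rows, targets)
print([round(v, 3) for v in importances])
```

Shuffling the informative feature lowers accuracy substantially, while shuffling the ignored feature leaves it unchanged, which mirrors why Curricular units 2nd sem (approved) dominates the table above.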

Code

# Final Model Evaluation (on the validation set)

# Fit best pipeline on training data only
best_pipe.fit(X_train, y_train)

# Predict on validation set
y_pred_best = best_pipe.predict(X_val)

print("Final Model Performance (Validation Set):")
evaluate_multiclass(y_val, y_pred_best, target_classes)

# Confusion matrix
cm = confusion_matrix(y_val, y_pred_best)
plt.figure(figsize=(7, 6))
sns.heatmap(cm, annot=True, fmt="d",
            xticklabels=target_classes,
            yticklabels=target_classes)
plt.xlabel("Predicted")
plt.ylabel("Actual")
plt.title(f"Confusion Matrix: {best_model} + {best_sampler}")
plt.show()

Final Model Performance (Validation Set):
accuracy           0.7782
precision_macro    0.7302
recall_macro       0.7063
f1_macro           0.7150
f1_weighted        0.7710
dtype: float64

Classification Report (macro focus):

              precision    recall  f1-score   support

     Dropout       0.82      0.78      0.80       227
    Enrolled       0.56      0.44      0.49       127
    Graduate       0.81      0.90      0.85       354

    accuracy                           0.78       708
   macro avg       0.73      0.71      0.72       708
weighted avg       0.77      0.78      0.77       708

5 Conclusion

The initial exploratory data analysis, together with the results of preliminary modeling, provided important insights into the dataset's predictive structure and the feasibility of using multi-class machine learning approaches. Of all of the features in the dataset, Curricular units 2nd semester (approved) emerged as the most informative predictor of the target, as indicated by the highest mutual information score. This finding indicates that students’ academic performance in the second semester is key to student outcomes, which aligns with prior research on the importance of academic progression indicators (e.g., Realinho et al., 2022; Ridwan et al., 2024).

The feature Curricular units 2nd semester (grade) showed a generally consistent distribution across the training and testing datasets (Raschka et al., 2022). This observation matters because it suggests that models trained on the training set can generalize to unseen data (Raschka et al., 2022). Further, this consistency reduces the risk of distribution shift and supports the cross-validation techniques used in this project (Raschka et al., 2022). The Spearman rank correlation heatmap provided a visual representation of the relationships between the features and the target, which helped identify potential multicollinearity (Raschka et al., 2022).

Cross-validation was conducted across all of the classifiers used in this project (i.e., RF, XGBoost, LightGBM, and CatBoost). Initially, there were data type issues; these were resolved through appropriate preprocessing, for example by label encoding all categorical features. The adjustments made in the preprocessing step allowed for consistent evaluation across all classifiers and comparable performance metrics (Raschka et al., 2022). The bar plot provided a visual summary of the mean cross-validation accuracy and gave insight into model performance.

In conclusion, based on these findings, future research should focus more on feature importance. In particular, future studies should examine the role of Curricular units 2nd semester (approved) in the model pipeline. Though beyond the scope of this study, a deeper analysis of cross-validation results could be used to select the top-performing features. Once features that are not top predictors are pruned, hyperparameter tuning can be conducted to improve prediction accuracy (Raschka et al., 2022). Hyperparameter tuning combined with the pipeline established in this study (i.e., XGBoost with SMOTE Tomek) could prove to be an effective strategy for accurately predicting student attrition.

6 References

Aina, C., Baici, E., Casalone, G., & Pastore, F. (2022). The determinants of university dropout: A review of the socio-economic literature. Socio-Economic Planning Sciences, 79, 101102. https://doi.org/10.1016/j.seps.2021.101102

Andrade-Giron, D. et al. (2023). Predicting student dropout based on machine learning and deep learning: A systematic review. ICST Transactions on Scalable Information Systems, 10(5), 1–11. https://doi.org/10.4108/eetsis.3586

Barramuno, M., Meza-Narvaez, C., & Galvez-Garcia, G. (2022). Prediction of student attrition risk using machine learning. Journal of Applied Research in Higher Education, 14(3), 974–986. https://doi.org/10.1108/JARHE-02-2021-0073

Chen, T., & Guestrin, C. (2016). XGBoost: A scalable tree boosting system. Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 785–794. https://doi.org/10.1145/2939672.2939785

Delen, D., Davazdahemami, B., & Rasouli Dezfouli, E. (2024). Predicting and mitigating freshman student attrition: A local-explainable machine learning framework. Information Systems Frontiers, 26(2), 641–662. https://doi.org/10.1007/s10796-023-10397-3

Ho, T. K. (1995). Random decision forests. Proceedings of the Third International Conference on Document Analysis and Recognition, 1, 278–282. https://doi.org/10.1109/ICDAR.1995.598994

Ke, G., Meng, Q., Finley, T., Wang, T., Chen, W., Ma, W., Ye, Q., & Liu, T.-Y. (2017). LightGBM: A highly efficient gradient boosting decision tree. Advances in Neural Information Processing Systems, 30, 3147–3155.

Kok, C. L., Ho, C. K., Chen, L., Koh, Y. Y., & Tian, B. (2024). A novel predictive modeling for student attrition utilizing machine learning and sustainable big data analytics. Applied Sciences, 14(21), 9633. https://doi.org/10.3390/app14219633

Mduma, N. (2023). Data balancing techniques for predicting student dropout using machine learning. Data, 8(3), 49. https://doi.org/10.3390/data8030049

Mun, J., & Jo, M. (2023). Applying machine learning-based models to prevent university student dropouts. Journal of Educational Evaluation, 36(2), 289–313. https://doi.org/10.31158/JEEV.2023.36.2.289

Namoun, A., & Alshanqiti, A. (2020). Predicting student performance using data mining and learning analytics techniques: A systematic literature review. Applied Sciences, 11(1), 237. https://doi.org/10.3390/app11010237

Pek, R. Z., Ozyer, S. T., Elhage, T., Ozyer, T., & Alhajj, R. (2023). The role of machine learning in identifying student at-risk and minimizing failure. IEEE Access, 11, 1224–1243. https://doi.org/10.1109/ACCESS.2022.3232984

Prokhorenkova, L., Gusev, G., Vorobev, A., Dorogush, A. V., & Gulin, A. (2018). CatBoost: Unbiased boosting with categorical features. Advances in Neural Information Processing Systems, 31, 6638–6648.

Raschka, S., Liu, Y. H., & Mirjalili, V. (2022). Machine learning with PyTorch and scikit-learn. Packt Publishing.

Realinho, V., Machado, J., Baptista, L., & Martins, M. V. (2022). Predicting student dropout and academic success. Data, 7(11), 146. https://doi.org/10.3390/data7110146

Ridwan, A., Priyatno, A. M., & Ningsih, L. (2024). Predict students’ dropout and academic success with XGBoost. Journal of Education and Computer Applications, 1(2), 1–8. https://doi.org/10.69693/jeca.v1i2.13