Introduction

I strongly believe that education is one of the most valuable resources we have in life. As a college student, I have found that beyond just instructional material, the people and experiences that I have come across in college have expanded my mind and opened up so many opportunities in my life. In order to set ourselves up for success in college and beyond, it is important to utilize the resources we have in high school. But what factors lead to academic success?

The purpose of this project is to develop a model that predicts a student's final grade in a class as a measure of academic success. By building a prediction model, we can investigate how much influence certain variables, such as a student's distance from school or their number of absences, have on class performance. This is important because doing well in class doesn't just open doors to higher education; it also requires students to develop good habits that are valuable far beyond the classroom. With the results of this model, we may be able to better understand how to support high school students in achieving success in school and beyond.

The Dataset

We will use the "Student Performance" dataset from the UCI Machine Learning Repository, which contains data on 662 secondary school students in Portugal. (Secondary school covers ages 15-18, equivalent to high school in the US.) The final grades reported in the data come from Mathematics and Portuguese Language classes.

Tidying the Data

Before we begin creating our predictive model, it is important to tidy the data so that it can be used more efficiently. In this section, we will read the data from two CSV files (one for the math subject and one for Portuguese), combine them, and then remove the 382 duplicated records, where the same student appears in both the math and Portuguese data.

library(tidymodels)   # modeling framework (also loads dplyr and ggplot2)
library(kableExtra)   # kbl(), kable_styling(), scroll_box()
library(corrplot)     # correlation matrix plot
library(vip)          # variable importance plot
# Read the two semicolon-separated course files and stack them
df1 <- read.csv("/Users/isabellasri/Downloads/student+performance/student/student-mat.csv", header = TRUE, sep = ";")
df2 <- read.csv("/Users/isabellasri/Downloads/student+performance/student/student-por.csv", header = TRUE, sep = ";")
df3 <- rbind(df1, df2)
# Keep one row per student, matching on the identifying attributes
student_df <- df3 %>% distinct(school, sex, age, address, famsize, Pstatus, Medu, Fedu, Mjob, Fjob, reason, nursery, internet, .keep_all = TRUE)
head(student_df) %>%
  kbl() %>% kable_styling("striped") %>% scroll_box(width = "100%")
school sex age address famsize Pstatus Medu Fedu Mjob Fjob reason guardian traveltime studytime failures schoolsup famsup paid activities nursery higher internet romantic famrel freetime goout Dalc Walc health absences G1 G2 G3
GP F 18 U GT3 A 4 4 at_home teacher course mother 2 2 0 yes no no no yes yes no no 4 3 4 1 1 3 6 5 6 6
GP F 17 U GT3 T 1 1 at_home other course father 1 2 0 no yes no no no yes yes no 5 3 3 1 1 3 4 5 5 6
GP F 15 U LE3 T 1 1 at_home other other mother 1 2 3 yes no yes no yes yes yes no 4 3 2 2 3 3 10 7 8 10
GP F 15 U GT3 T 4 2 health services home mother 1 3 0 no yes yes yes yes yes yes yes 3 2 2 1 1 5 2 15 14 15
GP F 16 U GT3 T 3 3 other other home father 1 2 0 no yes yes no yes yes no no 4 3 2 1 2 5 4 6 10 10
GP M 16 U LE3 T 4 3 services other reputation mother 1 2 0 no yes yes yes yes yes yes no 5 4 2 1 2 5 10 15 15 15

Missingness

sum(is.na(student_df))
## [1] 0

There is no missing data in our dataset, so there is no need for any imputation or cuts to the data.

Exploratory Data Analysis

Next, let's get to know our dataset. Visualizing the patterns and relationships between variables helps us understand how the data can best be represented in a predictive model.

hist(student_df$G3, main = "Histogram of G3 Scores", xlab = "G3", col = "lavender")

This histogram displays the distribution of the students' G3 scores (final course grade). It is roughly bell-shaped but not symmetric: scores above the center are tightly clustered, while the low end of the distribution is more spread out. The majority of students receive scores at or above the average G3 grade.
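To put a rough number on that impression, here is a quick check (a sketch using the student_df built above) of the share of students scoring at or above the mean G3:

# Share of students at or above the mean G3, plus summary statistics
mean(student_df$G3 >= mean(student_df$G3))
summary(student_df$G3)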

ggplot(student_df, aes(x = schoolsup, y = G3)) +
  geom_boxplot(fill = "pink", color = "black") +
  labs(title = "Box Plots of Extra Educational Support and G3 Scores",
       x = "Extra Educational Support", y = "G3 score") +
  theme(axis.text.x = element_text(angle = 0, hjust = 1)) +
  scale_y_continuous(limits = c(0, 20)) +
  coord_flip()

The box plots show the range of G3 scores for students who receive extra educational support (top box plot) and those who do not (bottom box plot). Interestingly, both the median G3 and the overall range of scores are lower for students who receive extra support. This is likely because extra support tends to be given to students who are already struggling, and it indicates that there might be other factors influencing a student's academic success beyond educational support alone.
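To read the exact medians and group sizes off the data rather than the plot, a small sketch using the dplyr verbs loaded above:

# Median G3 and group size for each level of schoolsup
student_df %>%
  group_by(schoolsup) %>%
  summarise(median_G3 = median(G3), n = n())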

student_df %>% 
  select(where(is.numeric)) %>% 
  cor() %>% 
  corrplot(type = "lower", diag = FALSE)

This correlation matrix visualizes the pairwise correlations between our numeric variables. From the dark blue circles between the G1, G2, and G3 variables, we can see that these grades are highly linearly correlated, meaning they carry largely the same information.

It makes sense that a student's period grades are indicative of their final grade: G3 is the final-year grade, while G1 and G2 are the first- and second-period grades leading up to it. Thus it is not useful to include G1 and G2 in the model, since their strong relationship with G3 would dominate the predictions and mask the influence of the other variables, which are the focus of this project.
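As a quick check of this claim, we can print the pairwise correlations among the three grade variables directly (a sketch using base R's cor()):

# Pairwise correlations among the period and final grades
cor(student_df[, c("G1", "G2", "G3")])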

There are also strong positive correlations between Medu and Fedu, and between Dalc and Walc. The Medu-Fedu correlation makes sense, since it indicates that a child's mother and father tend to have similar education levels. The strong positive correlation between Dalc and Walc also makes sense: someone who consumes a lot of alcohol during the week tends to do so on the weekend as well, and vice versa.

We can also see light red circles between the failures variable and G1, G2, and G3, indicating an apparent negative linear relationship: more past class failures are associated with lower G1, G2, and G3 scores. It makes sense that a student's past class failures can reflect a larger pattern of difficulty across all their classes.
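A small sketch to spot-check these pairwise correlations numerically, rather than reading them off the plot:

# Correlations for the pairs highlighted in the correlation matrix
with(student_df, c(Medu_Fedu = cor(Medu, Fedu),
                   Dalc_Walc = cor(Dalc, Walc),
                   failures_G3 = cor(failures, G3)))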

Setting Up the Models

Data Preparation

To prepare the data for model input, we are going to split the data into training and testing sets. To create a good model, it is essential to have a robust training set, so we are using 80% of the data for training and the remaining 20% for testing. We will stratify both sets on the outcome variable, G3. Then we will set up 5 folds for k-fold cross-validation, also stratified on G3. Cross-validation is an important step for getting reliable performance estimates, since it lets us evaluate each candidate model on multiple held-out folds during training.

set.seed(123)
# 80/20 train/test split, stratified on the outcome G3
student_split <- initial_split(student_df, prop = 0.8,
                               strata = G3)
student_train <- training(student_split)
student_test <- testing(student_split)

# 5 cross-validation folds, also stratified on G3
cv_folds <- vfold_cv(student_train, v = 5, strata = G3)

Recipe Formulation

Now let's set up a recipe to use in the models. Here, we identify G3 as the outcome variable, and all other variables as predictors, except for the variables identified above as linearly correlated: G1, G2, Medu, and Walc. We exclude these variables because collinearity makes individual importance estimates unreliable: the model can attribute shared signal to either of two correlated predictors. Then we dummy-encode the categorical predictors so that they can be represented numerically, and finally we center and scale all the predictors. We build the recipe on the training data so that the models are trained and cross-validated on data that is kept separate from the final test set.

student_recipe <- recipe(G3 ~ ., data = student_train)  %>%
                  step_rm(c(G1, G2, Medu, Walc)) %>%         # drop collinear variables
                  step_dummy(all_nominal_predictors())  %>%  # encode categorical predictors
                  step_normalize(all_predictors())           # center and scale
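As a sanity check (not part of the modeling pipeline), we could prep the recipe and bake it to inspect the processed training set, for example to count the columns (predictors plus the outcome) produced by the dummy encoding:

# Prep the recipe on the training data and bake it to see the processed columns
student_recipe %>% prep() %>% bake(new_data = NULL) %>% ncol()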

Regression Model

Now it's time to define our models. Since G3 is a numeric score (from 0 to 20), this is a regression problem. We will define linear regression, k-nearest-neighbors, elastic net, and random forest models in order to find the model that best fits our data and therefore has the greatest predictive power.

# Model specifications; tune() marks hyperparameters to be tuned later
lm_mod <- linear_reg() %>% set_mode("regression") %>% set_engine("lm")
knn_mod <- nearest_neighbor(neighbors = tune()) %>% set_mode("regression") %>% set_engine("kknn")
en_mod <- linear_reg(penalty = tune(), mixture = tune()) %>% set_mode("regression") %>% set_engine("glmnet")
rf_reg_spec <- rand_forest(mtry = tune(), trees = tune(), min_n = tune()) %>%
  set_engine("ranger", importance = "impurity") %>% set_mode("regression")

# Pair each model with the shared recipe in a workflow
lm_wkflow <- workflow() %>% add_model(lm_mod) %>% add_recipe(student_recipe)
knn_wkflow <- workflow() %>% add_model(knn_mod) %>% add_recipe(student_recipe)
en_wkflow <- workflow() %>% add_model(en_mod) %>% add_recipe(student_recipe)
rf_reg_wf <- workflow() %>% add_model(rf_reg_spec) %>% add_recipe(student_recipe)

Tune and Fit

We are going to fit the models, setting up tuning grids for the k-nearest-neighbors, elastic net, and random forest hyperparameters. This lets us evaluate a range of parameter values and find the combination that maximizes each model's performance. (The metric used to judge a model's success is defined in the next section.) The linear regression model has no hyperparameters to tune.

# Tuning grids: regular grids over each model's hyperparameter ranges
knn_grid <- grid_regular(neighbors(range = c(1, 10)), levels = 10)
en_grid <- grid_regular(penalty(), mixture(range = c(0, 1)), levels = 10)
rf_grid <- grid_regular(mtry(range = c(1, 6)), trees(range = c(200, 600)), min_n(range = c(10, 20)), levels = 5)

# Fit the linear model and tune the others across the cross-validation folds
lm_fit_res <- lm_wkflow %>% fit_resamples(resamples = cv_folds)
tune_knn_res <- tune_grid(object = knn_wkflow, resamples = cv_folds, grid = knn_grid)
tune_en_res <- tune_grid(en_wkflow, grid = en_grid, resamples = cv_folds)
tune_reg <- tune_grid(
  rf_reg_wf, 
  resamples = cv_folds, 
  grid = rf_grid
)

# Save the (slow) random forest tuning results so they can be
# reloaded later without re-running the grid search
save(tune_reg, file = "tune_reg.rda")
load("tune_reg.rda")

Evaluating the Models

To evaluate the models' performance, we will look at their root mean squared error (RMSE) values from cross-validation on the training data. RMSE is the square root of the average squared difference between the model's predictions and the actual values, and it is commonly used to evaluate regression models.
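For reference, here is RMSE written out by hand (a hypothetical helper, just for illustration; in practice we rely on yardstick's rmse() metric, as used later):

# RMSE: the square root of the mean squared prediction error
rmse_manual <- function(truth, estimate) {
  sqrt(mean((truth - estimate)^2))
}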

# Collect the cross-validated RMSE for each model and keep the best (lowest) value
rf_rmse <- collect_metrics(tune_reg) %>% filter(.metric == "rmse")
rf_best <- min(rf_rmse$mean)
knn_rmse <- collect_metrics(tune_knn_res) %>% filter(.metric == "rmse")
best_knn <- min(knn_rmse$mean)
elastic_rmse <- collect_metrics(tune_en_res) %>% filter(.metric == "rmse")
best_en <- min(elastic_rmse$mean)
lm_rmse <- collect_metrics(lm_fit_res) %>% filter(.metric == "rmse")
lm_rmse_mean <- lm_rmse$mean
# Creating a tibble of all the models and their RMSE
final_compare_tibble <- tibble(Model = c("Linear Regression", "K-Nearest-Neighbors", "Elastic Net", "Random Forest"),
                               RMSE = c(lm_rmse_mean, best_knn, best_en, rf_best))

# Arranging by lowest RMSE
final_compare_tibble <- final_compare_tibble %>% 
  arrange(RMSE)

final_compare_tibble %>%
  kbl() %>%
  kable_styling(bootstrap_options = "striped", full_width = F)
Model RMSE
Random Forest 3.701476
Elastic Net 3.849337
Linear Regression 3.942142
K-Nearest-Neighbors 4.210373

From the results above, we can see that the random forest model achieved the lowest RMSE value, meaning its predictions are, on average, closest to the actual values.

# Creating a bar plot of the RMSE values, reusing the comparison tibble from above
ggplot(final_compare_tibble, aes(x = Model, y = RMSE)) +
  geom_bar(stat = "identity", aes(fill = Model)) +
  scale_fill_manual(values = c("darkseagreen3", "darkseagreen4", "darkseagreen2", "darkseagreen1")) +
  theme(legend.position = "none") +
  labs(title = "Comparing RMSE by Model")

This bar plot visualizes the RMSE values for each model. While the scores are relatively close across models, the random forest model performs best.

load("tune_reg.rda")

autoplot(tune_reg) + theme_minimal() 

These plots visualize the cross-validated RMSE (top row of graphs) and R-squared (bottom row of graphs) for the random forest model across the candidate hyperparameter values. RMSE generally improves as the number of randomly selected predictors (mtry) increases. Interestingly, the number of trees that produces the best RMSE varies somewhat across the other parameter settings, though 600 trees performs consistently well.

best_rf_1 <- rf_rmse %>% filter(mean == rf_best)
best_rf_1 %>%
  kbl() %>%
  kable_styling()
mtry trees min_n .metric .estimator mean n std_err .config
6 600 12 rmse standard 3.701476 5 0.1103132 Preprocessor1_Model050

The random forest hyperparameters that achieved the lowest RMSE are mtry = 6, trees = 600, and min_n = 12 (the reported mean is averaged over the n = 5 cross-validation folds).
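As an alternative to filtering manually, tidymodels can pull the winning combination directly; this sketch should return the same row:

# Idiomatic way to select the best hyperparameters by RMSE
select_best(tune_reg, metric = "rmse")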

Final Fit with Best Model: Random Forest

Now that we know the random forest performs best on the training data, let's finalize the workflow with the best hyperparameters, fit it to the full training set, and see how it performs on the testing data. This gives us a better picture of the model's performance on data it has never seen before, which more accurately represents its predictive ability.

# Finalize the workflow with the best hyperparameters and fit on the full training set
final_rf_model <- finalize_workflow(rf_reg_wf, best_rf_1)
final_rf_model <- fit(final_rf_model, student_train)
# Attach the model's predictions to the testing data
final_rf_model_test <- augment(final_rf_model, student_test)
# Creating a tibble comparing the cross-validated training RMSE with the testing RMSE
compare <- tibble(Random_Forest_Model = c("Training Data", "Testing Data"),
                  RMSE = c(rf_best, rmse(final_rf_model_test, truth = G3, .pred)$.estimate))

# Arranging by lowest RMSE
compare <- compare %>% 
  arrange(RMSE)

compare %>%
  kbl() %>%
  kable_styling(bootstrap_options = "striped", full_width = F)
Random_Forest_Model RMSE
Testing Data 3.576502
Training Data 3.701476

From the results above, we can see that the final random forest model achieved a slightly lower RMSE on the testing data (about 3.58) than its cross-validated RMSE on the training data. This indicates that our model has been trained well and performs well on unseen data, which suggests there are real patterns within the data that positively or negatively influence a student's class grade.
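For a complementary view of test performance, we could also compute R-squared on the augmented test predictions (a sketch using yardstick's rsq()):

# Proportion of variance in G3 explained on the testing data
rsq(final_rf_model_test, truth = G3, estimate = .pred)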

final_rf_model_test %>% 
  ggplot(aes(x = G3, y = .pred)) +
  geom_point(alpha = 0.4) +
  geom_abline(lty = 2) +
  coord_obs_pred() +
  labs(title = "Predicted Values vs. Actual Values")

This scatter plot visualizes the model's G3 predictions against the actual values. The cluster of dark circles near the dashed identity line shows that many of the model's predictions are close in value to the actual G3 scores. The points off the line show a slight positive trend, but the predictions themselves remain centered around the mean, which shows that the model did not do well at predicting scores far from the mean.

Variable Significance

Now that we know the random forest model is doing well at picking up on certain factors in the data that positively or negatively impact a student’s success, let’s look inside our final model to see which variables it considers important when predicting G3.

final_rf_model %>% extract_fit_parsnip() %>% 
  vip(aesthetics = list(fill = "slateblue1", color="slateblue1")) + 
  theme_minimal()

This bar plot visualizes the importance of the 10 most useful variables out of the 38 total predictor variables. (The correlation matrix we visualized earlier shows whether each of these relationships is positive or negative.) 'failures' is by far the most important variable the model considers when predicting G3. This might mean that when a student does poorly in a class, it is most likely not an isolated event: rather than failing just one subject, they may be doing poorly in school overall. It is very interesting that 'absences' is more important than a variable like 'studytime', which indicates that being present in school is an important contributor to a student's grade. It is also striking that 'famrel' is among the most useful variables, suggesting that a student's academic success may be closely tied to the support they receive from their family.
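To read the exact importance scores rather than estimating them from the plot, vip's vi() should return the underlying table (a sketch on the fitted model above):

# Importance scores behind the VIP plot, as a tibble
final_rf_model %>% extract_fit_parsnip() %>% vi()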

Overall, the variables included in this plot highlight how academic success is influenced by a variety of factors, and it is important to recognize that not all of them are controllable. Showing up to school and devoting time to studying are important, but so are a student's parents' education levels and their relationships with their family.

Conclusion

By investigating the factors that influence a student's class grade, we ultimately found that patterns related to academic success do indeed exist. The random forest model we tuned achieved the lowest RMSE value and therefore captured these patterns best. While it would be ideal to use these findings to lay out a step-by-step guide to achieving great grades in school, the reality is not that simple. There are certainly variables that students can and should prioritize if they want to be successful in class, such as being physically present at school, setting aside time to study, and limiting time spent out with friends. However, uncontrollable factors like parental education levels, distance from school, and the quality of family relationships indicate that a student's school performance also depends on factors outside their control.

The takeaways from this project are important because they can give families a better understanding of their role in supporting their child's academic success. Although there are certain variables that are up to the student to control, working toward an enriching education is difficult to do alone. And lastly, perhaps the most important takeaway for students of any age is that the trends in our data show a student's grades can be predicted from many different factors, none of which are determined by how 'smart' the student is. Grades are important, but they do not determine our intelligence or who we are.

Sources

Cortez, Paulo (2014). Student Performance. UCI Machine Learning Repository. https://doi.org/10.24432/C5TG7T