Introduction

I strongly believe that education is one of the most valuable resources we have in life. As a college student, I have found that beyond just instructional material, the people and experiences that I have come across in college have expanded my mind and opened up so many opportunities in my life. In order to set ourselves up for success in college and beyond, it is important to utilize the resources we have in high school. But what factors lead to academic success?

The purpose of this project is to develop a model that predicts a student’s final grade in a class as a measure of academic success. By creating a prediction model, we can investigate how much certain variables, such as a student’s distance from school or their number of absences, matter for a student’s class performance. This is important because doing well in class doesn’t just open doors to higher education; it also requires students to develop good life habits, which are valuable beyond the realm of education. With the results of this model, we may be able to better understand how to support high school students in achieving success in school and beyond.

The Dataset

We will use the “Student Performance” dataset from the UCI Machine Learning Repository. After duplicates are removed, it contains data for 662 secondary school students in Portugal. (Secondary school students are typically 15 to 18 years old, roughly equivalent to high school students in the US.) The final grades reported in the data are from Mathematics and Portuguese Language classes.

Tidying the Data

Before we build our predictive model, we need to tidy the data so that it is easier to work with. In this section, we will read the data from two CSV files (one for math and one for Portuguese), combine them, and then remove the 382 duplicated records that arise when the same student appears in both the math and Portuguese files.

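library(dplyr)       # provides %>% and distinct()
library(kableExtra)  # provides kbl(), kable_styling(), scroll_box()

# Read the semicolon-delimited math and Portuguese files, then stack them
df1 <- read.csv("/Users/isabellasri/Downloads/student+performance/student/student-mat.csv", header = TRUE, sep = ";")
df2 <- read.csv("/Users/isabellasri/Downloads/student+performance/student/student-por.csv", header = TRUE, sep = ";")
df3 <- rbind(df1, df2)
# Keep one row per student, identified by these 13 attributes (the first occurrence is kept)
student_df <- df3 %>% distinct(school, sex, age, address, famsize, Pstatus, Medu, Fedu, Mjob, Fjob, reason, nursery, internet, .keep_all = TRUE)
head(student_df) %>%
  kbl() %>% kable_styling("striped") %>% scroll_box(width = "100%")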
school sex age address famsize Pstatus Medu Fedu Mjob Fjob reason guardian traveltime studytime failures schoolsup famsup paid activities nursery higher internet romantic famrel freetime goout Dalc Walc health absences G1 G2 G3
GP F 18 U GT3 A 4 4 at_home teacher course mother 2 2 0 yes no no no yes yes no no 4 3 4 1 1 3 6 5 6 6
GP F 17 U GT3 T 1 1 at_home other course father 1 2 0 no yes no no no yes yes no 5 3 3 1 1 3 4 5 5 6
GP F 15 U LE3 T 1 1 at_home other other mother 1 2 3 yes no yes no yes yes yes no 4 3 2 2 3 3 10 7 8 10
GP F 15 U GT3 T 4 2 health services home mother 1 3 0 no yes yes yes yes yes yes yes 3 2 2 1 1 5 2 15 14 15
GP F 16 U GT3 T 3 3 other other home father 1 2 0 no yes yes no yes yes no no 4 3 2 1 2 5 4 6 10 10
GP M 16 U LE3 T 4 3 services other reputation mother 1 2 0 no yes yes yes yes yes yes no 5 4 2 1 2 5 10 15 15 15
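
To make the de-duplication step concrete, here is a minimal sketch on a hypothetical toy data frame. The `math` and `por` rows and their values are invented for illustration, and only three key columns stand in for the thirteen used above. It uses base R's `duplicated()` rather than `distinct()`, but like `distinct(..., .keep_all = TRUE)` it keeps the first row for each key combination, so a student who appears in both files keeps the row from whichever file was stacked first.

```r
# Hypothetical toy data: two students take math, three take Portuguese.
math <- data.frame(school = c("GP", "GP"), sex = c("F", "M"), age = c(15, 16),
                   G3 = c(10, 15), stringsAsFactors = FALSE)
por  <- data.frame(school = c("GP", "GP", "MS"), sex = c("F", "M", "F"),
                   age = c(15, 16, 17), G3 = c(12, 14, 11),
                   stringsAsFactors = FALSE)

stacked  <- rbind(math, por)            # math rows first, as in rbind(df1, df2)
key_cols <- c("school", "sex", "age")   # stand-in for the 13 identifying columns

# duplicated() flags every repeat after the first occurrence of a key,
# so negating it keeps exactly one row per student.
deduped <- stacked[!duplicated(stacked[, key_cols]), ]

# Number of rows dropped as duplicates
n_removed <- nrow(stacked) - nrow(deduped)

print(deduped)
print(n_removed)
```

On the real data, `nrow(df3) - nrow(student_df)` gives the number of rows dropped, which can be compared against the 382 duplicates expected.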

Missingness

sum(is.na(student_df))
## [1] 0

There are no missing values in our dataset, so no imputation or removal of observations is needed.
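
A single total can hide where missing values sit, and `is.na()` does not flag blank strings. As a hedged sketch of a slightly more detailed check, the example below counts `NA` values and empty strings per column. The `toy` data frame and its values are hypothetical and exist only to show the output shape; the same two lines could be run on `student_df`.

```r
# Hypothetical toy data with one NA and one blank string.
toy <- data.frame(age  = c(15, NA, 17),
                  Mjob = c("teacher", "", "other"),
                  G3   = c(10, 12, 14),
                  stringsAsFactors = FALSE)

# NA count for each column
na_per_col <- colSums(is.na(toy))

# Empty-string count for each column; NAs are excluded so they are not
# counted twice and do not propagate into the sum.
blank_per_col <- sapply(toy, function(x) sum(!is.na(x) & x == ""))

print(na_per_col)
print(blank_per_col)
```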

Exploratory Data Analysis

Next, let’s get to know our dataset. Visualizing the patterns and relationships between variables helps us decide how the data can best be represented in a predictive model. We start with the distribution of the final grade, G3:

hist(student_df$G3, main = "Histogram of G3 Scores", xlab = "G3", col = "lavender")