Tutorial: Regression Models in Python – From Basics to Advanced
Introduction
Regression analysis is a powerful statistical method that allows you to examine the relationship between two or more variables. In this tutorial, we’ll explore several types of regression models using Python, including Linear Regression, Lasso, Ridge, Elastic Net, Bayesian Regression, and Decision Trees.
We’ll be using real-world data from the Our World in Data repository.
Part I: Setting Up Your Environment
To get started, we first need to install some necessary libraries. If you don’t have them installed already, you can do so by running the following commands:
!pip install pandas numpy scikit-learn matplotlib seaborn
pandas and numpy will help us handle the data, scikit-learn (imported as sklearn) will provide the regression models, and matplotlib and seaborn will help us visualize the data. Note that the package is installed as scikit-learn, even though it is imported as sklearn.
Part II: Importing the Data
We’re going to use Our World in Data’s CO2 dataset, which includes information about countries’ CO2 emissions and their GDP.
First, let’s import our libraries and load our data.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
# Load the data
url = "https://raw.githubusercontent.com/owid/co2-data/master/owid-co2-data.csv"
data = pd.read_csv(url)

# Check the data
data.head()
We use pandas’ read_csv function to load our data, and data.head() prints the first five rows of the dataframe.
Part III: Preprocessing the Data
Before we start modeling, we need to preprocess our data. We will deal with missing values and scale our data.
# Drop missing values
data = data[['year', 'co2', 'gdp']].dropna()

# Define our inputs and target
X = data[['year', 'gdp']]
y = data['co2']

# Split the data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Scale the data
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
We use dropna() to drop missing values, and train_test_split() to split our data into a training set and a test set. StandardScaler() standardizes each feature to have a mean of 0 and a standard deviation of 1. Note that the scaler is fitted on the training data only and then applied to the test data, so no information from the test set leaks into preprocessing.
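To make the fit-then-transform pattern concrete, here is a minimal sketch on a tiny made-up matrix (the values are invented for illustration, mimicking a 'year' column and a 'gdp' column on very different scales):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy training matrix: two features on very different scales.
X_train = np.array([[2000.0, 1.0e12],
                    [2005.0, 1.5e12],
                    [2010.0, 2.5e12],
                    [2015.0, 4.0e12]])
X_test = np.array([[2020.0, 5.0e12]])

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn mean/std from training data only
X_test_scaled = scaler.transform(X_test)        # reuse the training statistics

print(X_train_scaled.mean(axis=0))  # each column has mean ~0
print(X_train_scaled.std(axis=0))   # each column has std ~1
```

The scaled test row is generally not exactly mean 0 / std 1, because it is standardized with the training set’s statistics, which is exactly what we want.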
Part IV: Linear Regression
Linear regression is a basic and commonly used type of predictive analysis.
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
# Initialize the model
lr = LinearRegression()

# Fit the model
lr.fit(X_train, y_train)

# Predict the test set
y_pred = lr.predict(X_test)

# Evaluate the model
mse = mean_squared_error(y_test, y_pred)
print('MSE: ', mse)
We initialize our model with LinearRegression(), fit it to our training data with lr.fit(), and make predictions with lr.predict(). Finally, we evaluate our model by calculating the mean squared error.
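The mean squared error is simply the average of the squared differences between the true values and the predictions. A quick sketch with hypothetical values shows it matches the manual calculation:

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Hypothetical true values and predictions, just to show what MSE measures.
y_true = np.array([3.0, 5.0, 7.0])
y_hat = np.array([2.5, 5.0, 8.0])

mse_manual = np.mean((y_true - y_hat) ** 2)   # (0.25 + 0.0 + 1.0) / 3
mse_sklearn = mean_squared_error(y_true, y_hat)

print(mse_manual, mse_sklearn)
```

Lower values are better, and because the errors are squared, MSE penalizes large misses much more heavily than small ones.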
Part V: Lasso, Ridge and Elastic Net Regression
These are extensions of linear regression that add regularization penalties: Lasso uses an L1 penalty, Ridge an L2 penalty, and Elastic Net a mix of both.
from sklearn.linear_model import Lasso, Ridge, ElasticNet
# Lasso
lasso = Lasso(alpha=0.1)
lasso.fit(X_train, y_train)
y_pred = lasso.predict(X_test)
print('Lasso MSE: ', mean_squared_error(y_test, y_pred))

# Ridge
ridge = Ridge(alpha=0.1)
ridge.fit(X_train, y_train)
y_pred = ridge.predict(X_test)
print('Ridge MSE: ', mean_squared_error(y_test, y_pred))

# Elastic Net
elastic = ElasticNet(alpha=0.1, l1_ratio=0.5)
elastic.fit(X_train, y_train)
y_pred = elastic.predict(X_test)
print('ElasticNet MSE: ', mean_squared_error(y_test, y_pred))
For each model, we initialize it with the appropriate hyperparameters, fit it, make predictions, and evaluate it. The alpha hyperparameter controls the strength of the penalty, and l1_ratio controls the mix of L1 and L2 penalties for Elastic Net.
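A practical difference worth seeing: the L1 penalty can drive irrelevant coefficients to exactly zero, while the L2 penalty only shrinks them. Here is a small sketch on synthetic data (the data and alpha value are invented for illustration) where only the first two of five features actually matter:

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
# Only the first two features influence the synthetic target.
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=200)

lasso = Lasso(alpha=0.5).fit(X, y)
ridge = Ridge(alpha=0.5).fit(X, y)

# Lasso zeroes out the three irrelevant coefficients; Ridge keeps
# all five nonzero, merely shrinking the useless ones toward zero.
print('Lasso coefficients:', lasso.coef_)
print('Ridge coefficients:', ridge.coef_)
```

This is why Lasso is often used as a rough feature selector, at the cost of biasing the surviving coefficients toward zero.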
Part VI: Bayesian Regression and Decision Trees
Bayesian regression uses Bayes’ theorem to update the probability of a hypothesis as more evidence or information becomes available. Decision trees recursively split the data into branches and predict the average target value within each leaf.
from sklearn.tree import DecisionTreeRegressor
from sklearn.linear_model import BayesianRidge
# Bayesian Ridge Regression
bayesian = BayesianRidge()
bayesian.fit(X_train, y_train)
y_pred = bayesian.predict(X_test)
print('BayesianRidge MSE: ', mean_squared_error(y_test, y_pred))
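One advantage of the Bayesian approach is that predictions come with uncertainty estimates: BayesianRidge.predict accepts return_std=True and returns a standard deviation alongside each predicted mean. A minimal sketch on synthetic data (values invented for illustration):

```python
import numpy as np
from sklearn.linear_model import BayesianRidge

rng = np.random.default_rng(1)
X = rng.uniform(0, 10, size=(100, 1))
y = 2.0 * X[:, 0] + 1.0 + rng.normal(scale=0.5, size=100)

model = BayesianRidge().fit(X, y)

# return_std=True yields a per-prediction standard deviation,
# quantifying how uncertain the model is about each estimate.
mean, std = model.predict([[5.0]], return_std=True)
print(mean[0], std[0])
```

Plain LinearRegression offers no such uncertainty estimate, so this can be a reason to prefer the Bayesian variant when you need predictive error bars.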
# Decision Tree Regression
tree = DecisionTreeRegressor(max_depth=5)
tree.fit(X_train, y_train)
y_pred = tree.predict(X_test)
print('DecisionTree MSE: ', mean_squared_error(y_test, y_pred))
The procedure for Bayesian Ridge and Decision Tree regression is the same as before. For decision trees, the max_depth hyperparameter caps how many successive splits the tree can make: deeper trees fit the training data more closely but risk overfitting.
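Rather than guessing a depth like 5, we can let cross-validation choose it. Here is a hedged sketch using GridSearchCV on synthetic data (the candidate depths and the data itself are made up for illustration):

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(42)
X = rng.uniform(-3, 3, size=(300, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=300)

# Try several depths with 5-fold cross-validation, scoring by
# negative MSE (higher, i.e. closer to zero, is better).
grid = GridSearchCV(
    DecisionTreeRegressor(random_state=0),
    param_grid={'max_depth': [2, 3, 5, 8, None]},
    scoring='neg_mean_squared_error',
    cv=5,
)
grid.fit(X, y)
print(grid.best_params_)
```

The same pattern works for alpha in Lasso, Ridge, or Elastic Net: put the candidate values in param_grid and let the cross-validated score pick among them.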
Conclusion
In this tutorial, we have covered various types of regression models in Python, including linear regression, lasso, ridge, elastic net, Bayesian regression, and decision trees. We have also demonstrated how to import and preprocess real-world data, fit models, make predictions, and evaluate model performance. By understanding these fundamental concepts, you’re now well equipped to explore more complex predictive modeling techniques.