Blog
Getting started with linear regression
A common algorithm for finding the best-fitting line relating one or more input variables to a target

Introduction to linear regression
Linear regression is a simple yet powerful predictive model that aims to learn the relationship between input variables and a target variable. To fully understand what it is and how it works, we're going to have to go back in time a bit.
Remember this equation from high school?
y = mx + b
This simple equation is used to draw a line on a grid. Let's break it down further to understand each of its terms.
- y: This is the dependent variable which tells us how far up or down the line is. This value depends on what happens on the right side of the equation.
- x: This is the independent variable which tells us how far along the line we are.
- m: This is the slope of the line which tells us how steep the line actually is.
- b: This is the y-intercept which is the value of y when x is equal to 0.
What the linear regression algorithm does is try to learn the optimal values for m and b such that, for any value of x, we can accurately predict the value of y.
To understand this intuitively, say we want to be able to predict the price of a house based on the area of the house 🏠. The value we are trying to predict is house price, and this prediction is dependent on the area of the house. If the area of the house is correlated with price, then as the house area increases we should also see an increase in price. This will allow us to learn the relationship between these two variables so that we can predict the price of a house for any area.
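To see this concretely before we dive into the full implementation, here is a minimal sketch (not part of the original walkthrough) using NumPy's least-squares fit to recover m and b from points generated by a known line:

```python
import numpy as np

# Synthetic points generated from a known line: y = 2x + 1
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 2 * x + 1

# A degree-1 least-squares fit learns the slope (m) and intercept (b)
m, b = np.polyfit(x, y, 1)
print(m, b)  # slope close to 2.0, intercept close to 1.0
```

Because the points lie exactly on a line here, the fit recovers m and b almost perfectly; with real, noisy data the algorithm finds the values that minimize the overall error instead.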
Now that we’ve gotten that out of the way, let’s see how we can implement this algorithm in Python.
We'll start by importing some useful libraries that we'll make use of later on.
import pandas as pd
import seaborn as sns
import numpy as np
from math import sqrt
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_squared_error
from sklearn.model_selection import train_test_split
Dataset
Our dataset comes from Kaggle and contains information about the properties of a house as well as the price. You can download the dataset here.
df = pd.read_csv('Housing.csv')
Data preprocessing
To keep it simple, we are only going to use the area of the house to predict the price. This means we must first create a subset of our original dataframe.
regression_data = df[['area', 'price']]
Let's create a scatter plot that can give us a quick visualization of the relationship between these two variables.
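The scatter plot described here might be produced as follows (a sketch assuming the `regression_data` frame from the previous step; the small placeholder frame is only there so the snippet runs on its own):

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Placeholder for the `regression_data` subset built above
regression_data = pd.DataFrame({'area': [1500, 2400, 3000, 4200, 6000],
                                'price': [200000, 350000, 410000, 600000, 820000]})

ax = sns.scatterplot(data=regression_data, x='area', y='price')
ax.set_title('House price vs. area')
plt.show()
```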
There seems to be an upward trend in our data; however, there also appears to be some minor clumping in the lower-left corner of our scatter plot. This could indicate skewness in our data, so let's also visualize the distribution of each of our variables.
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 5))
fig.suptitle('Distribution of price and area', size = 20)
# Create histograms for each variable
sns.histplot(ax = ax1, data = regression_data['area']);
sns.histplot(ax = ax2, data = regression_data['price']);
ax1.set_xlabel('area (Sq ft)');
ax2.set_xlabel('price ($1M)');
Just as expected, our data is right-skewed. In most machine learning projects, skewness in your data is not ideal, but luckily there are ways to combat this. We will use a log transformation to shift our data towards a more normal distribution.
transformed = np.log(regression_data)
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 5))
fig.suptitle('Distribution of price and area', size = 20)
# Create histograms for each variable
sns.histplot(ax = ax1, data = transformed['area']);
sns.histplot(ax = ax2, data = transformed['price']);
ax1.set_xlabel('Ln of area (Sq ft)');
ax2.set_xlabel('Ln of price ($1M)');
plt.show();
Now we can see a much more obvious trend in our data, and the clumping from before is much less apparent.
With our data looking more symmetrical, we can split it into a train and test set and begin training our model.
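That final step might look like the following sketch, assuming the log-transformed `transformed` frame from above (the synthetic data here is only a stand-in so the snippet runs by itself; the split ratio and metrics are illustrative choices, not prescribed by the post):

```python
from math import sqrt

import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_squared_error
from sklearn.model_selection import train_test_split

# Stand-in for the log-transformed housing data built earlier
rng = np.random.default_rng(42)
log_area = rng.uniform(7, 9, size=100)                       # ln of area
log_price = 1.2 * log_area + rng.normal(0, 0.1, size=100)    # ln of price, with noise
transformed = pd.DataFrame({'area': log_area, 'price': log_price})

# Hold out 20% of the rows for evaluation
X = transformed[['area']]
y = transformed['price']
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Fit the model on the training set and score it on the test set
model = LinearRegression()
model.fit(X_train, y_train)
predictions = model.predict(X_test)

print('R^2: ', r2_score(y_test, predictions))
print('RMSE:', sqrt(mean_squared_error(y_test, predictions)))
```

Note that because the model was trained on log-transformed values, its predictions are log-prices; you would apply `np.exp` to map them back to dollars.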