Linear Regression is one of the most basic algorithms for building Machine Learning models, and it is of great importance in statistics too. There are several ways to train a linear regression model, such as the normal equation, batch gradient descent and stochastic gradient descent. In this blog post we will use the normal equation to find the weights of a linear regression model with the numpy library. Numpy is a Python library for numerical computation. We will train the model on an artificial dataset containing only one feature, which we will also create using numpy.
The Normal Equation
The weights of a linear regression model can be calculated directly using the normal equation. Mathematically, the normal equation can be written as:
θ = (XᵀX)⁻¹Xᵀy
Where θ is the vector containing the values θ₀ to θₙ that minimise the cost function (the mean squared error, or MSE). These are therefore the best values of the weights. The vector θ contains n+1 elements, one for each feature plus the bias term. X is the matrix of all m samples x₁ to xₘ, where each sample xᵢ has n+1 entries xᵢ₀ to xᵢₙ and xᵢ₀ is always equal to 1. The dimensions of this matrix are therefore m×(n+1). y is the vector of all target values y₁ to yₘ; it contains m elements, one for each sample.
Mean Squared Error
To put it simply, the Mean Squared Error, or MSE, is the mean of the squared errors (the differences between the actual and predicted values) over all the samples in the dataset. Mathematically, the Mean Squared Error can be calculated as:
Mean Squared Error = (1/m) ∑ᵢ₌₁ᵐ (θᵀxᵢ − yᵢ)²
Where θᵀxᵢ represents the predicted value and yᵢ represents the actual target value, so their difference gives the error. This error is squared and summed over all the instances in the dataset, from 1 to m, giving the sum of squared errors over the dataset. The Mean Squared Error is then obtained by dividing this sum by the number of instances in the dataset, i.e. m.
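As a quick illustration, here is a minimal numpy sketch of this calculation, assuming the feature matrix X already includes the column of 1's for the bias term and that theta and y are given (the function name is illustrative):

```python
import numpy as np

def mean_squared_error(X, y, theta):
    # predicted values: weighted sum of features for every sample
    predictions = X.dot(theta)
    # squared difference between predicted and actual values, averaged over the m samples
    return np.mean((predictions - y) ** 2)
```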
Now that we know what the normal equation is and how it gives us the weights that minimise the cost function, which in this case is the Mean Squared Error (MSE), let's jump to the coding part and find the optimal values for the weight vector θ.
Code
Creating an Artificial Dataset
We create an artificial dataset with only one feature. The dataset can be thought of as being generated from the line y = 7 + 1.5x with some random Gaussian noise added. We then store the feature values in the variable X and the corresponding target values in the variable y.
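The original code listing is not reproduced here, but a minimal sketch of how such a dataset could be built with numpy looks like this (the sample count, random seed and noise scale are illustrative choices):

```python
import numpy as np

np.random.seed(42)            # illustrative seed for reproducibility
m = 100                       # assumed number of samples
X = 2 * np.random.rand(m, 1)  # one feature per sample
y = 7 + 1.5 * X + np.random.randn(m, 1)  # the line 7 + 1.5x plus Gaussian noise
```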
Finding the value of θ Vector
Next we build the training matrix by adding an extra column to the original data; this additional column consists of all 1's (the bias term). We then find the value of theta using the normal equation. There are a few things in the second line of code that require explanation.
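The original listing is not shown here either; a minimal sketch of these two steps, assuming X and y are the arrays created above, could look like this:

```python
# first line: prepend a column of 1's (the bias entry x0) to every sample
X_train = np.c_[np.ones((m, 1)), X]

# second line: the normal equation, theta = (X^T X)^(-1) X^T y
theta = np.linalg.inv(X_train.T.dot(X_train)).dot(X_train.T).dot(y)

print(theta)  # should be close to [7, 1.5] for the dataset above
```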
Numpy’s T property can be applied on any matrix to get its transpose. So X_train.T returns the transpose of the matrix X_train.
Numpy’s dot() method returns the matrix product of two arrays. So, X_train.T.dot(X_train) returns the matrix product of X_train.T (the transpose of X_train) and X_train.
Numpy’s linalg.inv() method returns the inverse of a square matrix. So, np.linalg.inv(X_train.T.dot(X_train)) returns the inverse of the matrix X_train.T.dot(X_train).
We then print the value of the theta vector, i.e. θ₀ and θ₁.
Predicting Output over Test Set
Now that we have calculated the values of θ that minimise the Mean Squared Error (MSE), we can predict values for new instances by computing the weighted sum of their features. We create a new test set with 3 instances, with feature values 1, 3 and 5. As we did earlier with the training set, we add a column of all 1's to this data. We then predict the values for the test instances, store them in the variable y_predicted, and print the predicted values.
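A sketch of this prediction step, assuming theta was computed as above, might look like this:

```python
# three new instances with feature values 1, 3 and 5
X_test = np.array([[1.0], [3.0], [5.0]])

# add the bias column of 1's, just as we did for the training set
X_test_b = np.c_[np.ones((3, 1)), X_test]

# predictions are the weighted sum of features: X_test_b . theta
y_predicted = X_test_b.dot(theta)
print(y_predicted)
```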
Conclusion
In this blog post on linear regression using numpy, we first talked about the Normal Equation and how it can be used to calculate the weights, denoted by the weight vector theta. We then created an artificial dataset with a single feature using Python's numpy library, calculated the weight vector theta using the normal equation, and used it to predict the values of new instances. We used a dataset with a single feature for the sake of simplicity; in real datasets the number of features is usually much larger. Nevertheless, the same equation can be applied to datasets with any number of features.
In practice, however, the normal equation is rarely used; gradient descent and its variants are far more common. This is because, as the number of features in the training set grows, the time needed to compute the weight vector with this equation increases dramatically (inverting the (n+1)×(n+1) matrix XᵀX is roughly cubic in the number of features), and real-world datasets usually have a lot of features. Just as we applied linear regression using numpy and the normal equation, we can apply linear regression using other popular libraries such as TensorFlow and PyTorch. We will talk about that in other blog posts.
You can find the code for this blog on Github here.