Linear Regression Example with Scikit-learn

    Linear regression is one of the fundamental techniques in machine learning, widely used for predictive modeling and data analysis. The basic idea of linear regression is to find the best-fitting line that minimizes the difference between the observed values of the dependent variable and the values predicted by the model. This is typically done by estimating the coefficients (or weights) of the linear equation that describes the relationship between the variables.


    
In linear regression, we use a straight-line equation to model the relationship between a dependent variable y and an independent variable x.
The equation of a simple linear regression model can be expressed as: 

      y=mx+b

Where:

  • y is the dependent variable (target)
  • x is the independent variable (feature)
  • m is the slope of the line (coefficient)
  • b is the y-intercept (bias)
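
To make this concrete, here is a minimal sketch (an illustration added here, not part of the original tutorial) that estimates m and b with the ordinary least-squares formulas using NumPy; the data values are made up:

import numpy as np

# Made-up example data for illustration
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 4.3, 6.2, 8.0, 9.9])

# Ordinary least-squares estimates of the slope m and intercept b
m = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b = y.mean() - m * x.mean()
print(f"m: {m}, b: {b}")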

In multiple linear regression, the equation extends to accommodate multiple features: 

        y=b0+b1x1+b2x2+...+bnxn 

Where:

  • b0 is the intercept term
  • b1,b2,...,bn are the coefficients corresponding to each feature x1,x2,...,xn
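
The coefficients of a multiple linear regression can be estimated the same way. The sketch below (again an illustration, not part of the original tutorial) solves for [b0, b1, b2] with np.linalg.lstsq after prepending a column of ones for the intercept:

import numpy as np

# Made-up data with two features x1 and x2 (one row per sample)
X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0]])
y = np.array([9.0, 8.0, 19.0, 18.0])

# Prepend a column of ones so the first coefficient is the intercept b0
X_design = np.column_stack([np.ones(len(X)), X])

# Solve the least-squares problem; coeffs is [1, 2, 3] for this data
coeffs, *_ = np.linalg.lstsq(X_design, y, rcond=None)
print(coeffs)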

 

Linear regression with the scikit-learn library

    You can easily build a regression model using the scikit-learn library. In this part of the tutorial, we'll demonstrate how to implement linear regression with scikit-learn.
    We'll start by importing the required libraries.
 

 
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_regression
import sklearn.linear_model as model
import matplotlib.pyplot as plt
import numpy as np
 

    Next, we'll generate simple regression data using the make_regression() function. This creates a dataset with 100 samples, 1 feature, and a noise level of 20. The generated data is then split into training and testing sets using the train_test_split() function. 80% of the data is used for training, and 20% is used for testing. 

 
# Generate synthetic data for regression
x, y = make_regression(n_samples=100, n_features=1, noise=20, random_state=1)

# Split the data into training and testing sets
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2, random_state=12)
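
As a quick sanity check (not part of the original code), you can print the array shapes to confirm the 80/20 split:

print(x_train.shape, x_test.shape)  # (80, 1) (20, 1)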

    We create an instance of the linear regression model using LinearRegression() from the sklearn.linear_model module (imported above under the alias model). The model is then trained on the training data with the fit() method, and the trained model is used to make predictions on the test data with the predict() method.

 
# Create an instance of the Linear Regression model
linear_regression = model.LinearRegression()

# Train the Linear Regression model on the training data
linear_regression.fit(x_train, y_train)

# Use the trained model to make predictions on the test data
pred_y = linear_regression.predict(x_test)
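
Because the data has a single feature, the fitted model's parameters correspond directly to the slope m and intercept b from the equation above. If you want to inspect them, scikit-learn exposes them as the coef_ and intercept_ attributes (the exact values depend on the generated data):

# Inspect the learned slope and intercept
print(f"slope: {linear_regression.coef_}, intercept: {linear_regression.intercept_}")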

    Next, we define a function to evaluate the prediction accuracy. The function mse_rmse() calculates the Mean Squared Error (MSE) and its square root, the Root Mean Squared Error (RMSE), between the actual and predicted values.

 
# Define a function to calculate the MSE and RMSE
def mse_rmse(y, y_pred):
    # Calculate the mean squared error (MSE)
    mse = np.mean((y - y_pred) ** 2)
    # Calculate the root mean squared error (RMSE) by taking the square root of MSE
    rmse = np.sqrt(mse)
    # Return both MSE and RMSE
    return mse, rmse
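
As an alternative to writing the function by hand, scikit-learn provides a ready-made metric; a sketch using sklearn.metrics.mean_squared_error (not used in the tutorial itself) would be:

from sklearn.metrics import mean_squared_error

# mean_squared_error expects (y_true, y_pred)
mse = mean_squared_error(y_test, pred_y)
rmse = np.sqrt(mse)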
 

     Finally, we calculate and print the MSE and RMSE to evaluate the model's performance and visualize the result. A scatter plot of the actual test data points and a line plot of the predicted values are drawn with Matplotlib.

 
# Calculate and print the MSE and RMSE between the actual and predicted values
mse, rmse = mse_rmse(y_test, pred_y)
print(f"MSE: {mse}, RMSE: {rmse}")

# Plot the actual test data points and the predicted values
plt.scatter(x_test, y_test)
plt.plot(x_test, pred_y, color='red')
plt.show()
 

The result looks as follows. 

  
MSE: 378.9729780316333, RMSE: 19.46722830892044 
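
Note that the RMSE of roughly 19.5 is close to the noise standard deviation of 20 that make_regression() applied when generating the data, which is what we would expect: the remaining error is mostly the irreducible noise.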
Conclusion

    Linear regression is a powerful and versatile technique in machine learning, providing a simple yet effective method for predictive modeling and data analysis. By understanding its principles and applications, data scientists and analysts can leverage linear regression to gain valuable insights from their data and make informed decisions.

 

Source code listing

 
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_regression
import sklearn.linear_model as model
import matplotlib.pyplot as plt
import numpy as np

# Generate synthetic data for regression
x, y = make_regression(n_samples=100, n_features=1, noise=20, random_state=1)

# Split the data into training and testing sets
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2, random_state=12)

# Create an instance of the Linear Regression model
linear_regression = model.LinearRegression()

# Train the Linear Regression model on the training data
linear_regression.fit(x_train, y_train)

# Use the trained model to make predictions on the test data
pred_y = linear_regression.predict(x_test)

# Define a function to calculate the MSE and RMSE
def mse_rmse(y, y_pred):
    # Calculate the mean squared error (MSE)
    mse = np.mean((y - y_pred) ** 2)
    # Calculate the root mean squared error (RMSE) by taking the square root of MSE
    rmse = np.sqrt(mse)
    # Return both MSE and RMSE
    return mse, rmse

# Calculate and print the MSE and RMSE between the actual and predicted values
mse, rmse = mse_rmse(y_test, pred_y)
print(f"MSE: {mse}, RMSE: {rmse}")

# Plot the actual test data points and the predicted values
plt.scatter(x_test, y_test)
plt.plot(x_test, pred_y, color='red')
plt.show()
 
 
 
