Title

Exercise: A.3 - Linear Regression using sklearn

Description

The goal here is to use the sklearn package to fit a linear regression model on the previously used Advertising.csv dataset and produce a plot like the one given below.

Instructions:

We want to find the model that best fits the data. To do so, we are going to:

1) Use the train_test_split() function to split the dataset into training and testing sets.

2) Use the LinearRegression class to create a model.

3) Fit the model on the training set.

4) Predict on the testing set using the fitted model.

5) Estimate the fit of the model using the mean_squared_error function.

6) Plot the dataset along with the predictions to visualize the fit.
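Steps 1 to 5 above can be sketched end to end as follows. This is only an illustrative sketch on synthetic data: the arrays `x_demo` and `y_demo`, the coefficients used to generate them, and the `random_state` value are assumptions made for the example, not values from Advertising.csv. The plotting step is left out to keep the sketch short.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the advertising data (hypothetical, not the real file)
rng = np.random.default_rng(0)
x_demo = rng.uniform(0, 300, size=(200, 1))  # 2-D array: (n_samples, n_features)
y_demo = 7.0 + 0.05 * x_demo[:, 0] + rng.normal(0, 3, size=200)

# 1) Split into training and testing sets (random_state fixed for repeatability)
xd_train, xd_test, yd_train, yd_test = train_test_split(
    x_demo, y_demo, train_size=0.8, random_state=42
)

# 2) Create the model and 3) fit it on the training set
demo_model = LinearRegression()
demo_model.fit(xd_train, yd_train)

# 4) Predict on the testing set
yd_pred = demo_model.predict(xd_test)

# 5) Estimate the fit with the mean squared error
demo_mse = mean_squared_error(yd_test, yd_pred)
print(f"slope={demo_model.coef_[0]:.3f}, intercept={demo_model.intercept_:.3f}")
print(f"test MSE={demo_mse:.3f}")
```

Because the synthetic data were generated from a straight line plus noise, the fitted slope should land close to the generating value of 0.05.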

Hints:

pd.read_csv(filename) : Returns a pandas DataFrame containing the data read from the CSV file

sklearn.model_selection.train_test_split() : Splits the data into random train and test subsets

sklearn.linear_model.LinearRegression() : Creates an ordinary least squares linear regression model

LinearRegression.fit() : Fits the linear model to the training data

LinearRegression.predict() : Predicts using the fitted linear model

Note: This exercise is auto-graded, and you can make multiple attempts.

In [2]:

# import required libraries

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.metrics import r2_score
%matplotlib inline

In [3]:

# Read the 'Advertising.csv' dataset

data_filename = 'Advertising.csv'

# Read the data file using pandas

df = pd.read_csv(data_filename)

In [4]:

# Take a quick look at the data

df.head()

Out[4]:

	TV	Radio	Newspaper	sales
0	230.1	37.8	69.2	22.1
1	44.5	39.3	45.1	10.4
2	17.2	45.9	69.3	9.3
3	151.5	41.3	58.5	18.5
4	180.8	10.8	58.4	12.9

In [6]:

# Assign TV advertising as predictor variable 'x' and sales as response variable 'y'


x = df[["TV"]]
y = df["sales"]
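The predictor is selected with double brackets because scikit-learn estimators expect the features as a 2-D structure of shape (n_samples, n_features), while the response can stay 1-D. The sketch below illustrates the difference using a hand-built DataFrame, `demo_df`, that copies the five rows shown by `df.head()` above; the names `demo_df`, `x_2d`, `x_1d`, and `y_1d` are introduced only for this illustration.

```python
import pandas as pd

# The five rows displayed by df.head(), rebuilt by hand for illustration
demo_df = pd.DataFrame(
    {
        "TV": [230.1, 44.5, 17.2, 151.5, 180.8],
        "sales": [22.1, 10.4, 9.3, 18.5, 12.9],
    }
)

x_2d = demo_df[["TV"]]   # list of column names -> DataFrame (2-D)
x_1d = demo_df["TV"]     # single column name -> Series (1-D)
y_1d = demo_df["sales"]  # the response is fine as a 1-D Series

print(type(x_2d).__name__, x_2d.shape)
print(type(x_1d).__name__, x_1d.shape)
print(type(y_1d).__name__, y_1d.shape)
```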

In [13]:

# Divide the data into training and test sets (80% for training)

x_train, x_test, y_train, y_test = train_test_split(___,___,train_size=0.8)

In [14]:

# Use the sklearn class 'LinearRegression' to fit a model on the training set

model = LinearRegression()

model.fit(___, ___)

# Now predict on the test set

y_pred_test = model.predict(___)

In [15]:

### edTest(test_mse) ###
# Now compute the MSE with the predicted values and print it

mse = mean_squared_error(___, ___)
print(f'The test MSE is {___}')

The test MSE is 10.1024550349862
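For reference, the mean squared error is the average of the squared differences between the observed and predicted values, $MSE = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2$. The short check below computes it by hand with NumPy and compares the result with `mean_squared_error`. The arrays `y_true_demo` and `y_hat_demo` are hypothetical values chosen for illustration, not outputs of the model above.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Hypothetical observed values and predictions, for illustration only
y_true_demo = np.array([22.1, 10.4, 9.3, 18.5])
y_hat_demo = np.array([20.0, 11.0, 8.0, 19.0])

# MSE by definition: mean of the squared residuals
manual_mse = np.mean((y_true_demo - y_hat_demo) ** 2)

# The same quantity from scikit-learn
sk_mse = mean_squared_error(y_true_demo, y_hat_demo)

print(f"manual MSE  = {manual_mse:.4f}")
print(f"sklearn MSE = {sk_mse:.4f}")
```

Note that the MSE is expressed in squared units of the response, so its size depends on the scale of the sales values.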

In [16]:

# Make a plot of the data along with the predicted linear regression

fig, ax = plt.subplots()
ax.scatter(x,y,label='data points')
ax.plot(___,___,color='red',linewidth=2,label='model predictions')
ax.set_xlabel('Advertising')
ax.set_ylabel('Sales')
ax.legend()

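Out[16]:

[Figure: scatter plot of the data points with the model predictions drawn as a red line; x-axis 'Advertising', y-axis 'Sales']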

Mindchow

Rerun the code but this time change the training size to 60%.

Did your test $MSE$ improve or get worse?
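When comparing the two runs, keep in mind that train_test_split() shuffles the data randomly unless random_state is fixed, so the test MSE from a single split can change from run to run. One possible way to make the comparison less noisy is to average the test MSE over several splits. The sketch below does this on synthetic data; the helper `mean_test_mse`, the arrays `x_syn` and `y_syn`, and the number of repeats are assumptions made for illustration and do not use Advertising.csv.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic data (hypothetical): a straight line plus Gaussian noise
rng = np.random.default_rng(1)
x_syn = rng.uniform(0, 300, size=(200, 1))
y_syn = 7.0 + 0.05 * x_syn[:, 0] + rng.normal(0, 3, size=200)


def mean_test_mse(train_size, n_repeats=50):
    """Return the mean and standard deviation of the test MSE over several splits."""
    scores = []
    for seed in range(n_repeats):
        x_tr, x_te, y_tr, y_te = train_test_split(
            x_syn, y_syn, train_size=train_size, random_state=seed
        )
        fitted = LinearRegression().fit(x_tr, y_tr)
        scores.append(mean_squared_error(y_te, fitted.predict(x_te)))
    return np.mean(scores), np.std(scores)


for size in (0.8, 0.6):
    avg, spread = mean_test_mse(size)
    print(f"train_size={size}: mean test MSE={avg:.3f} (std {spread:.3f})")
```

If the spread across splits is larger than the gap between the two averages, a difference seen in a single rerun may be due to the random split rather than the training size.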

In [17]:

# your answer here
