Harvard CS109A | Lecture 3: Introduction to Regression kNN and Linear Regression

Key Word(s): kNN, kNN Regression, MSE, Data Plotting

s1_ex1a_challenge

Title :

Description :

The aim of this exercise is to create a scatter plot of TV advertising budget vs Sales from the Advertisement dataset. The result should look similar to the graph given below.

Data Description:

Instructions:

Read the Advertisement data and view the top rows of the dataframe to get an understanding of the data and the columns.
Select the first 7 observations and the columns TV and Sales to make a new dataframe.
Create a scatter plot of TV budget vs Sales from the new dataframe.

Hints:

pd.read_csv(filename) Returns a pandas dataframe containing the data and column labels read from the file

df.iloc[] Returns the subset of the dataframe selected by the integer row (and, optionally, column) positions passed as the argument

np.linspace() Returns evenly spaced numbers over a specified interval

df.head() Returns the first 5 rows of the dataframe with the column names

plt.scatter() Draws a scatter plot of y vs. x with varying marker size and/or color

plt.xlabel() Sets the text displayed as the label for the x-axis

plt.ylabel() Sets the text displayed as the label for the y-axis
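np.linspace() is listed in the hints but is not used in the cells of this exercise. As a minimal sketch of what it returns (the variable name grid is only for illustration):

```python
import numpy as np

# Five evenly spaced values from 0 to 1; both endpoints are included by default
grid = np.linspace(0, 1, 5)
print(grid)  # the values are 0, 0.25, 0.5, 0.75 and 1
```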

Note: This exercise is auto-graded and you may make multiple attempts.

In [1]:

# Import necessary libraries
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

Reading the Advertisement dataset

In [2]:

# "Advertising.csv" contains the data set used in this exercise
data_filename = 'Advertising.csv'

# Read the "Advertising.csv" file using the pandas library
df = pd.read_csv(data_filename)

In [3]:

# Get a quick statistical summary of the data
df.describe()

Out[3]:

	TV	Radio	Newspaper	Sales
count	200.000000	200.000000	200.000000	200.000000
mean	147.042500	23.264000	30.554000	14.022500
std	85.854236	14.846809	21.778621	5.217457
min	0.700000	0.000000	0.300000	1.600000
25%	74.375000	9.975000	12.750000	10.375000
50%	149.750000	22.900000	25.750000	12.900000
75%	218.825000	36.525000	45.100000	17.400000
max	296.400000	49.600000	114.000000	27.000000

In [4]:

### edTest(test_pandas) ###
# Create a new dataframe by selecting the first 7 rows of
# the current dataframe
df_new = df.head(7)

In [5]:

# Print your new dataframe to see if you have selected 7 rows correctly
print(df_new)

      TV  Radio  Newspaper  Sales
0  230.1   37.8       69.2   22.1
1   44.5   39.3       45.1   10.4
2   17.2   45.9       69.3    9.3
3  151.5   41.3       58.5   18.5
4  180.8   10.8       58.4   12.9
5    8.7   48.9       75.0    7.2
6   57.5   32.8       23.5   11.8
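The instructions also ask for only the TV and Sales columns, while the graded cell above keeps all four. One possible way to make that narrower selection is sketched below. It rebuilds the seven rows from the printout so that it runs without the CSV file, and the name df_tv_sales is hypothetical, not something the grader is known to expect.

```python
import pandas as pd

# The first seven rows of the Advertising data, copied from the printout above
df = pd.DataFrame({
    "TV": [230.1, 44.5, 17.2, 151.5, 180.8, 8.7, 57.5],
    "Radio": [37.8, 39.3, 45.9, 41.3, 10.8, 48.9, 32.8],
    "Newspaper": [69.2, 45.1, 69.3, 58.5, 58.4, 75.0, 23.5],
    "Sales": [22.1, 10.4, 9.3, 18.5, 12.9, 7.2, 11.8],
})

# Take the first 7 rows by position, then keep only the two columns of interest
# (df_tv_sales is an illustrative name, not part of the exercise)
df_tv_sales = df.iloc[:7][["TV", "Sales"]]
print(df_tv_sales)
```

The scatter plot in the next section comes out the same either way, since it only reads the TV and Sales columns.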

Plotting the graph

In [7]:

# Use a scatter plot for plotting a graph of TV vs Sales
plt.scatter(df_new.TV, df_new.Sales)

# Add axis labels for clarity (x : TV budget, y : Sales)
plt.xlabel("TV budget")
plt.ylabel("Sales")

Out[7]:

Text(0, 0.5, 'Sales')

Post-Exercise Question

Instead of plotting only seven points, try plotting all the points in the dataset.

In [8]:

# Your code here
plt.scatter(df.TV, df.Sales)

# Add axis labels for clarity (x : TV budget, y : Sales)
plt.xlabel("TV budget")
plt.ylabel("Sales")

Out[8]:

Text(0, 0.5, 'Sales')
