Key Word(s): AIC
An experiment to understand AIC
Based on McElreath, Statistical Rethinking, Chapter 6.
import numpy as np
%matplotlib inline
import matplotlib.pyplot as plt
We generate data from a Gaussian with standard deviation 1 and means given by:
$$\mu_i = 0.15 x_{1,i} - 0.4 x_{2,i}, \quad y_i \sim N(\mu_i, 1).$$
We use an interesting trick to generate this data: we directly use the regression coefficients as the correlations between the predictors and the response variable.
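Why does this work? If $y$ and the predictors are jointly standard normal, the predictors are mutually uncorrelated, and $\text{corr}(y, x_j) = \rho_j$, then the conditional mean of $y$ is

$$E[y \mid x] = \sum_j \rho_j x_j,$$

so the population regression coefficients are exactly the correlations we put into the joint correlation matrix. (Strictly speaking, the residual standard deviation is then $\sqrt{1 - \sum_j \rho_j^2} \approx 0.9$ rather than exactly 1, but this does not matter for the comparisons below.)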
def generate_data(N, k, rho=[0.15, -0.4]):
    # Build a correlation matrix whose first row/column holds the
    # correlations between the response and each predictor.
    n_dim = 1 + len(rho)
    if n_dim < k:
        n_dim = k
    Rho = np.eye(n_dim)
    for i, r in enumerate(rho):
        Rho[0, i+1] = r
    # symmetrize: copy the upper triangle into the lower triangle
    index_lower = np.tril_indices(n_dim, -1)
    Rho[index_lower] = Rho.T[index_lower]
    mean = n_dim * [0.]
    # draw train and test sets jointly, with the response in column 0
    Xtrain = np.random.multivariate_normal(mean, Rho, size=N)
    Xtest = np.random.multivariate_normal(mean, Rho, size=N)
    # peel off the response and replace column 0 with the intercept
    ytrain = Xtrain[:, 0].copy()
    Xtrain[:, 0] = 1.
    ytest = Xtest[:, 0].copy()
    Xtest[:, 0] = 1.
    # keep only the first k columns (intercept plus k-1 predictors)
    return Xtrain[:, :k], ytrain, Xtest[:, :k], ytest
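As a quick sanity check of the trick (a sketch, not part of the analysis proper; the variable names are just for this check), regressing $y$ on the predictors for a large sample should recover the correlations as coefficients:

Xck, yck, _, _ = generate_data(100000, 3)
coef, *_ = np.linalg.lstsq(Xck, yck, rcond=None)
print(coef)  # expect roughly [0., 0.15, -0.4]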
We want to generate data for 5 different cases: a one-parameter fit (intercept only), two parameters (intercept and $x_1$), three parameters (adding $x_2$), and four and five parameters (adding predictors uncorrelated with $y$). Here is what the data looks like for 2 parameters:
generate_data(20,2)
And for four parameters:
generate_data(20,4)
from scipy.stats import norm
import statsmodels.api as sm
Analysis, N=20
Here is the main loop of our analysis. We take the 5 models described above. For each model we run 10000 simulations, in each one generating an independent training set and testing set of the same size ($N=20$ each). We fit the regression on the training set and compute the deviance there. Notice how we have simply used the `logpdf` from `scipy.stats`; you can easily do the same for other distributions.
We then use the fitted coefficients to calculate the $\mu$'s on the test set, and compute the deviance there as well. Finally we take the mean and standard deviation of both deviances across the 10000 simulations.
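Concretely, with $\sigma$ held fixed at 1, the deviance we compute on a set of points with fitted means $\mu_i$ is

$$D = -2 \sum_{i=1}^{N} \log N(y_i \mid \mu_i, 1).$$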
reps = 10000
results_20 = {}
for k in range(1, 6):
    trdevs = np.zeros(reps)
    tedevs = np.zeros(reps)
    for r in range(reps):
        Xtr, ytr, Xte, yte = generate_data(20, k)
        ols = sm.OLS(ytr, Xtr).fit()
        # training deviance at the fitted means, sigma fixed at 1
        mutr = np.dot(Xtr, ols.params)
        devtr = -2*np.sum(norm.logpdf(ytr, mutr, 1))
        # test deviance, using the coefficients fit on the training set
        mute = np.dot(Xte, ols.params)
        devte = -2*np.sum(norm.logpdf(yte, mute, 1))
        trdevs[r] = devtr
        tedevs[r] = devte
    results_20[k] = (np.mean(trdevs), np.std(trdevs), np.mean(tedevs), np.std(tedevs))
import pandas as pd
df = pd.DataFrame(results_20).T
df = df.rename(columns = dict(zip(range(4), ['train', 'train_std', 'test', 'test_std'])))
df
import seaborn as sns  # seaborn.apionly was removed in newer seaborn versions
colors = sns.color_palette()
colors
We plot the training and testing deviances:
plt.plot(df.index, df.train, 'o', color=colors[0])
plt.errorbar(df.index, df.train, yerr=df.train_std, fmt='none', ecolor=colors[0])
plt.plot(df.index+0.2, df.test, 'o', color=colors[1])
plt.errorbar(df.index+0.2, df.test, yerr=df.test_std, fmt='none', ecolor=colors[1])
plt.xlabel("number of parameters")
plt.ylabel("deviance")
plt.title("N=20");
Let us look at the difference between the mean testing and training deviances. This is the amount by which the training deviance, biased downward by overfitting, underestimates the out-of-sample deviance.
df.test - df.train
Voila, this seems to be roughly twice the number of parameters. In other words, we might be able to get away without a test set if we "correct" the bias on the training set by $2 n_p$. This is the observation that motivates the AIC, which estimates the out-of-sample deviance as $D_{\text{train}} + 2 n_p$.
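We can check this correction directly against the table above (a quick sketch; `p` is just the per-model parameter count):

p = df.index.values              # number of parameters, 1 through 5
print(df.train + 2*p - df.test)  # should be close to zero if the correction works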
Analysis, N=100
reps = 10000
results_100 = {}
for k in range(1, 6):
    trdevs = np.zeros(reps)
    tedevs = np.zeros(reps)
    for r in range(reps):
        Xtr, ytr, Xte, yte = generate_data(100, k)
        ols = sm.OLS(ytr, Xtr).fit()
        mutr = np.dot(Xtr, ols.params)
        devtr = -2*np.sum(norm.logpdf(ytr, mutr, 1))
        mute = np.dot(Xte, ols.params)
        devte = -2*np.sum(norm.logpdf(yte, mute, 1))
        trdevs[r] = devtr
        tedevs[r] = devte
    results_100[k] = (np.mean(trdevs), np.std(trdevs), np.mean(tedevs), np.std(tedevs))
df100 = pd.DataFrame(results_100).T
df100 = df100.rename(columns = dict(zip(range(4), ['train', 'train_std', 'test', 'test_std'])))
df100
plt.plot(df100.index, df100.train, 'o', color=colors[0])
plt.errorbar(df100.index, df100.train, yerr=df100.train_std, fmt='none', ecolor=colors[0])
plt.plot(df100.index+0.2, df100.test, 'o', color=colors[1])
plt.errorbar(df100.index+0.2, df100.test, yerr=df100.test_std, fmt='none', ecolor=colors[1])
plt.xlabel("number of parameters")
plt.ylabel("deviance")
plt.title("N=100");
df100.test - df100.train
We get pretty much the same result at N=100: the gap between test and training deviance is again roughly twice the number of parameters, so the $2 n_p$ correction is not an artifact of the small sample size.
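For reference, statsmodels exposes an AIC on the fit result. Its log-likelihood estimates $\sigma$ from the data instead of fixing it at 1 as we did, so the numbers will not match our deviances exactly; this sketch just shows where to find it:

Xtr, ytr, _, _ = generate_data(100, 3)
fit = sm.OLS(ytr, Xtr).fit()
print(fit.aic)  # -2*loglikelihood + 2*(number of fitted coefficients)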