Key Word(s): Random Variable, PDF/PMF, Uniform, Binomial, Normal, Likelihood
Title :¶
Exercise: CS109A Olympics
Description :¶
Data Description:¶
Instructions:¶
- In this exercise, you will simulate the 100m sprint race discussed during the lecture.
- We have already defined for you a Sprinter() class which has two characteristics for each sprinter:
- Base time
- Performance variance
- Run the code cell that makes four instances of the Sprinter() class. You will work with those for the entire exercise.
- Call the time attribute of the helper class to get the time taken by a competitor in the actual race.
- First run the race simulation five times; you will do this by creating a dictionary with participant name as keys, and time taken in a simulated race as the values. You will sort this dictionary by values and determine the winner of the simulated race.
- Repeat the race simulation 10,000 times and count how many times each participant won. Based on this observation, you will then investigate why a particular participant won as often as they did.
- Repeat the simulation 10,000 times, but this time get the distribution of times for each participant over these runs.
- Calculate the mean race time, standard deviation of the race time and the confidence interval for each participant.
- Use the helper code to observe a plot similar to the one given below:
Hints:¶
Counter() Helps accumulate counts of objects in a certain data structure.
np.mean() Used to calculate the mean of an array.
sorted() Used to sort data.
np.std() Used to calculate the std deviation of an array.
np.percentile() Used to calculate the values below which given percentages (between 0 and 100) of the data fall. Frequently used for calculating confidence intervals.
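As a quick illustration of the hints above, here is a minimal sketch on made-up data. The names and numbers (toy_winners, toy_values) are placeholders chosen for illustration and are not results from the exercise.

```python
from collections import Counter
import numpy as np

# Made-up list of winners, for illustration only
toy_winners = ['A', 'B', 'A', 'C', 'A', 'B']
toy_counts = Counter(toy_winners)      # maps each distinct item to its count
print(toy_counts.most_common(1))       # the most frequent item and its count

# Made-up sample: the integers 0, 1, ..., 100
toy_values = np.arange(101)
toy_mean = np.mean(toy_values)
toy_std = np.std(toy_values)
# Passing a list of percentages returns one value per percentage
lower, upper = np.percentile(toy_values, [2.5, 97.5])
print(toy_mean, toy_std, lower, upper)
```

Counter accepts any iterable of hashable items, so a list of winner names can be passed to it directly.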
PyDS Olympics: 100m dash¶
We are going to have 4 of our team members compete against each other in the 100m dash.
# Importing libraries
import numpy as np
from time import sleep
import os
from IPython.display import clear_output
from collections import Counter
from helper import Sprinter
import matplotlib.pyplot as plt
from prettytable import PrettyTable
plt.xkcd(scale=0,randomness=4)
Taking a look at the competitors¶
Each participant has a characteristic assigned to them. The characteristic has 2 parts:
- Base time: This is the time they clocked in a non-competitive environment.
- Performance variance: Based on the mood, weather and other conditions, this measure determines how much a participant's time will vary.
# Name of sprinters
sprinters = ['Pavlos','Hargun','Joy','Hayden']
# Defining characteristics, ('Base pace','performance variance')
characteristics = [(13,0.25),(12.5,0.5),(12.25,1),(14.5,1)]
sprinters_dict = {}
for idx,sprinter in enumerate(sprinters):
    sprinters_dict[sprinter] = Sprinter(*characteristics[idx])
Running a race¶
sprinters_dict has the name of each participant as keys and an instance of the Sprinter class as values. The time attribute of the class is the time taken by that person to run a race.
- Call sprinters_dict['Pavlos'].time 10 different times.
# Call time attribute
___
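The implementation of the helper's Sprinter class is not shown in this exercise. As a rough sketch of how a time attribute can return a different value on every access, the hypothetical ToySprinter below uses a Python property that draws a fresh random number each time. The class name, the seed argument, and the choice of a normal distribution with the base time as mean are all assumptions made for illustration; the real helper may work differently.

```python
import numpy as np

class ToySprinter:
    """Hypothetical stand-in for illustration; NOT the Sprinter class from helper."""

    def __init__(self, base_time, variance, seed=None):
        self.base_time = base_time
        self.variance = variance
        self._rng = np.random.default_rng(seed)

    @property
    def time(self):
        # Assumption: a race time is a draw from Normal(base_time, variance),
        # so the standard deviation is the square root of the variance
        return self._rng.normal(self.base_time, np.sqrt(self.variance))

toy = ToySprinter(13, 0.25, seed=0)
# Each access to .time triggers a new random draw
toy_times = [toy.time for _ in range(10)]
print(toy_times)
```

Because time is a property rather than a stored number, reading it ten times gives ten different simulated race times.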
- Get the times for each participant by calling the time attribute.
- Create a dictionary called race, which has the name of the participant as key and the time taken by the participant to run the race as value.
- Sort race.items() according to time, get the item with the least time taken to finish, and assign it to winner.
Note: The time taken by a participant to finish the race is the value of the dictionary, so remember to sort by values.
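Sorting dictionary items by value can be sketched on a made-up dictionary as follows. The names and times in toy_race are placeholders, not output from the Sprinter class.

```python
# Made-up race times, for illustration only
toy_race = {'A': 12.9, 'B': 12.4, 'C': 13.1}

# Each item is a (name, time) tuple; key=... tells sorted() to compare
# the second element (the time) instead of the name
ranking = sorted(toy_race.items(), key=lambda item: item[1])
toy_winner = ranking[0]   # the tuple with the smallest time

print(ranking)
print(toy_winner)
```

Without the key argument, sorted() would compare the tuples by their first element and order the items alphabetically by name.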
# Get the times for each participant and make a dictionary
race = ___
# Then sort the items of the dictionary to get the winner
# Hint: Remember to sort by the values and not the keys
winner = ___
Race simulation¶
As you may have noticed, every time you make a new dictionary race, the results differ.
Redefine the race dictionary, and run the cell below for a simulation of the race!
# Again get the times for each participant and make a dictionary
race = ___
# Then sort the items of the dictionary to get the winner
winner = ___
# Execute the following code
for i in range(1,11):
    clear_output(wait=True)
    print("|START|"+"\n|START|".join(['----'*min(10,int((15*i)/race[runner]))+ ' '*(10-min(10,int((15*i)/race[runner])))+'|'+runner for runner in race.keys()]))
    sleep(0.5)
print(f'\nThe winner is {winner[0]} with a time of {winner[1]:.2f}s!')
Multiple simulations¶
That was just one race; we want to find out who performs better over multiple races. So let's run the race 5 times.
- Run a loop 5 times
- In each iteration generate the race dictionary as done earlier, and get the winner after sorting race.items()
- Append winners to the winner_list
Keep track of everyone's timings
# Run the simulation and append winners to the winner_list
winner_list = []
for simulation in range(5):
    race = ___
    winner = ___
    ___
winner_list
Even more simulations¶
We will run 10,000 simulations and use Counter to see who wins how many times.
Check the hints for how to use Counter().
# Run the simulation and append winners to the winner_list
___
# Get the counts for each person winning the race
wins = Counter(___)
print(wins)
# Execute the code
plt.bar(list(wins.keys()),list(wins.values()),alpha=0.5)
plt.xlabel('Sprinters')
plt.ylabel('Race wins',rotation=0,labelpad=30)
Why is Joy winning so much ?¶
Let us analyze why exactly Joy is winning so frequently in our simulations. But first, we will need to record the sprint timings for each sprinter in every simulation.
We will again run 10,000 simulations but this time record the individual sprint timings for each simulation instead.
- Make a new dictionary race_results with keys as the name of sprinters and the value as an empty list. We will append race results to this list after each simulation.
- Inside the simulation loop, loop through the items of the sprinters_dict dictionary, and for each participant:
  - Calculate time by calling .time
  - Append time to the list for the participant in race_results
# Run the earlier simulation and store all 10000 times given by a participant
# race_results has a list of times as values for a given key( i.e participant)
# So for a key it has a corresponding list of times for that participant.
race_results= {___:___ for ___ in sprinters_dict.___}
for simulation in range(10000):
    for sprinter,dash in sprinters_dict.items():
        sprint_timing = ___
        race_results[___].append(___)
Sample mean $\bar{x}$ and sample standard deviation $s$¶
Now we have a list of times for each participant. Since we have the full distribution of simulated times, let's calculate the mean, standard deviation and confidence interval.
As discussed in the lecture, given a sample, we can quickly compute the mean and standard deviation using np.mean() and np.std().
Let's begin with the race results for Pavlos.
# Using the race_results dictionary, find the mean
# and std for 'Pavlos'
pavlos_mean = np.mean(___)
pavlos_std = np.std(___)
print(f'The average pace of Pavlos is {pavlos_mean:.2f} and the sample std is {pavlos_std:.2f}')
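One detail worth keeping in mind about np.std(): by default it divides by n (ddof=0), while the sample standard deviation $s$ divides by n-1, which corresponds to ddof=1. The sketch below shows the difference on a small made-up array (toy_sample is a placeholder, not race data).

```python
import numpy as np

# Made-up sample of four values, for illustration only
toy_sample = np.array([12.0, 13.0, 14.0, 15.0])

std_ddof0 = np.std(toy_sample)           # default ddof=0: divides by n
std_ddof1 = np.std(toy_sample, ddof=1)   # ddof=1: divides by n - 1

print(f'ddof=0: {std_ddof0:.4f}, ddof=1: {std_ddof1:.4f}')
```

With 10,000 simulated times the two versions are nearly identical, so the default is unlikely to change the conclusions of this exercise.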
Sample mean $\bar{x}$ and sample standard deviation $s$ for all sprinters¶
For each sprinter in the race_results dictionary, find the mean and standard deviation of the 10,000 simulations using the np.mean() and np.std() functions.
Store your findings as a list in a new dictionary called race_stats. So the race_stats dictionary has a list of corresponding stats for each participant (key).
# loop through the keys of race_results
# calculate mean and std of each participant using np.mean() and np.std()
# Assign these stats to the key, as a list
race_stats = {}
for sprinter in race_results.keys():
    sprinter_mean = ___
    sprinter_std = ___
    race_stats[sprinter] = [___,___]
# Use the helper code below to print your findings
pt = PrettyTable()
pt.field_names = ["Sprinter", "Sample mean", "Sample std"]
for sprinter,stats in race_stats.items():
    pt.add_row([sprinter, round(stats[0],3),round(stats[1],3)])
print(pt)
Confidence Interval¶
A confidence interval is a range of values associated with a stated confidence level (most commonly 95%). The confidence interval represents values for the population parameter for which the difference between the parameter and the observed estimate is not significant at the 5% level.
- Use np.percentile() to calculate the 95% CI.
- Calculate np.percentile() at 2.5 and 97.5 to get the interval.
- Calculate and append these to the list of stats in the race_stats dictionary, for each participant.
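The percentile approach in the steps above can be sketched on synthetic data. The sample below is drawn from a normal distribution with made-up parameters (mean 12.5, standard deviation 0.5, fixed seed); it is a stand-in for a list of simulated race times and not data from the Sprinter class.

```python
import numpy as np

# Synthetic sample with made-up parameters, for illustration only
rng = np.random.default_rng(0)
toy_times = rng.normal(loc=12.5, scale=0.5, size=10_000)

# The 2.5th and 97.5th percentiles bracket the central 95% of the sample
lower, upper = np.percentile(toy_times, [2.5, 97.5])

# Fraction of the sample that falls inside the interval
inside = np.mean((toy_times >= lower) & (toy_times <= upper))

print(f'interval: ({lower:.2f}, {upper:.2f}), fraction inside: {inside:.3f}')
```

For a roughly normal sample, these bounds should sit close to the mean minus and plus 1.96 standard deviations.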
#By using the race_results dictionary defined above,
# Find the 2.5 and 97.5 percentile of Hargun's race runs.
CI = np.percentile(___,[___,___])
print(f'The 95% confidence interval for Hargun is {round(CI[0],2),round(CI[1],2)}')
Confidence intervals for all sprinters.¶
Let's repeat the above for each sprinter.
You will add this information to your race_stats dictionary.
We expect you to append the $2.5$ and the $97.5$ percentile values to the existing stats list for each sprinter.
For example, if for Pavlos we have mean=13.00, std=0.1, and CI as (12.8,13.2), your race_stats['Pavlos'] must look like: [13.00,0.1,12.8,13.2].
# Now let's repeat the same, but for every sprinter
# run through the race_results dictionary for each sprinter
# find the confidence interval, and add it to the race_stats dictionary
# defined above
# Hint: You can use the .extend() method to add it to the existing list of stats
for sprinter,runs in race_results.items():
    ci = np.percentile(___)
    race_stats[___].___
# Use the helper code below to print your findings
pt = PrettyTable()
pt.field_names = ["Sprinter", "Sample mean", "Sample std","95% CI"]
for sprinter,stats in race_stats.items():
    mean = round(stats[0],3)
    std = round(stats[1],3)
    confidence_interval = (round(stats[2],3),round(stats[3],3))
    pt.add_row([sprinter, mean,std,confidence_interval])
print(pt)
Histogram plot for each sprinter¶
Run the following cell to get a cool plot for distribution of times.
fig = plt.gcf()
fig.set_size_inches(10,6)
bins = np.linspace(10, 17, 50)
for sprinter,runs in race_results.items():
    height, bins, patches = plt.hist(runs, bins, alpha=0.5, \
        label=sprinter,density=True,edgecolor='k')
    plt.fill_betweenx([0, height.max()], race_stats[sprinter][2], race_stats[sprinter][3], alpha=0.2)
plt.legend(loc='upper left',fontsize=16)
plt.xlabel('Seconds')
plt.ylabel('Frequency',rotation=0,labelpad=25)
ax = plt.gca()
ax.spines['right'].set_visible(False)
ax.spines['top'].set_visible(False)
plt.show()
⏸ Take a look at the histograms for each participant and comment on why you think Joy is winning the most races.¶
A. Very consistent distribution
B. Low base time and not a very high spread
C. High base time but variation causes lower times to show more frequently
D. Joy is not winning the most races
### edTest(test_chow1) ###
# Submit an answer choice as a string below (eg. if you choose option A put 'A')
answer = '___'
⏸ What one parameter should Hargun change in order to win more races?¶
A. Reduce base time
B. Reduce consistency
C. Relax before the race
D. Increase consistency
### edTest(test_chow2) ###
# Submit an answer choice as a string below (eg. if you choose option A put 'A')
answer = '___'
👩🏻🎓 Bonus (Not graded)¶
Find out who among the sprinters would have the most podium finishes (top 3).
# Your code here