\"run all above\" is also very helpful to run many cells of the notebook at once."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# %load exercises/exercise1.py"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Exercise 1.2: Change the above function so it uses a list comprehension.\n",
"To load the exercise, simply delete the '#' in the code below and run the cell."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# %load exercises/exercise2.py"
]
},
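{
"cell_type": "markdown",
"metadata": {},
"source": [
"A generic reminder of the syntax, as a minimal sketch on made-up numbers (it is not the exercise solution): a list comprehension collapses a loop that appends to a list into one expression. The names `numbers`, `squares_loop`, `squares_comp` and `even_squares` are hypothetical."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Hypothetical toy data, unrelated to the exercise.\n",
"numbers = [1, 2, 3, 4, 5]\n",
"\n",
"# Loop version: build the list step by step.\n",
"squares_loop = []\n",
"for n in numbers:\n",
"    squares_loop.append(n ** 2)\n",
"\n",
"# Equivalent list comprehension.\n",
"squares_comp = [n ** 2 for n in numbers]\n",
"\n",
"# An optional condition filters items as they are collected.\n",
"even_squares = [n ** 2 for n in numbers if n % 2 == 0]\n",
"\n",
"print(squares_loop == squares_comp)\n",
"print(even_squares)"
]
},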
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Don't look at this cell until you've given the exercise a go! It loads the correct solution."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Exercise 1.2 solution (1.1 solution is contained herein as well)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# %load solutions/breakoutsol1.py"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%run ./solutions/breakoutsol1.py"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Let's create a Pandas dataframe"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now let's get all of the awards."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"awards = []\n",
"for award_node in soup.select('.by_year'):\n",
" recipients = get_recipients(award_node)\n",
" \n",
" #initialize the dictionary\n",
" award = {} #{key: value}\n",
" \n",
" award['title'] = get_award_title(award_node)\n",
" award['year'] = get_award_year(award_node)\n",
" award['recipients'] = recipients\n",
" award['num_recipients'] = len(recipients)\n",
" award['motivation'] = get_award_motivation(award_node) \n",
" awards.append(award)\n",
"awards[0:2]"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df_awards_raw = pd.DataFrame(awards)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#explain open brackets\n",
"df_awards_raw"
]
},
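{
"cell_type": "markdown",
"metadata": {},
"source": [
"To see what `pd.DataFrame` does with a list of dictionaries, here is a small sketch on hypothetical records (the `toy_` names and values are made up for illustration). Each dictionary becomes a row, each key becomes a column, and a key missing from a record shows up as `NaN`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"\n",
"# Hypothetical records; the keys become column names.\n",
"toy_awards = [\n",
"    {'title': 'Physics', 'year': 2001, 'recipients': ['A', 'B']},\n",
"    {'title': 'Peace', 'year': 2001, 'recipients': []},\n",
"    {'title': 'Physics', 'year': 2002},  # no 'recipients' key: NaN in that column\n",
"]\n",
"toy_df = pd.DataFrame(toy_awards)\n",
"toy_df"
]
},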
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Some quick EDA."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df_awards_raw.info()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df_awards_raw.year.min()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**What is going on with the recipients column?**"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df_awards_raw.head()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df_awards_raw.num_recipients.value_counts()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Now let's take a look at `num_recipients`.**"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df_awards_raw.num_recipients == 0"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df_awards_raw[df_awards_raw.num_recipients == 0]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"OK: the 2018 awards have no recipients because this is an archived 2018 version of the Nobel Prize webpage. Some past years lack awards because none were actually awarded that year. Let's keep only meaningful data:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df_awards_past = df_awards_raw[df_awards_raw.year != 2018]\n",
"df_awards_past.info()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Hm, `motivation` has a different number of items... why?"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df_awards_past[df_awards_past.motivation.isnull()]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Looks like it's fine that those motivations were missing."
]
},
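{
"cell_type": "markdown",
"metadata": {},
"source": [
"The filtering in the last few cells relies on boolean masks. As a rough sketch on a hypothetical toy table (`toy_df` and its values are made up), a comparison on a column yields a Series of `True`/`False`, and indexing the DataFrame with that Series keeps only the `True` rows."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"\n",
"# Hypothetical toy table, for illustration only.\n",
"toy_df = pd.DataFrame({\n",
"    'year': [2001, 2001, 2002],\n",
"    'num_recipients': [2, 0, 3],\n",
"    'motivation': ['for x', None, 'for y'],\n",
"})\n",
"\n",
"# A comparison on a column yields a boolean Series (a mask).\n",
"no_recipients = toy_df.num_recipients == 0\n",
"print(no_recipients.tolist())\n",
"\n",
"# Indexing with a mask keeps the rows where it is True.\n",
"print(toy_df[no_recipients])\n",
"\n",
"# isnull() builds a mask of missing values; masks combine with & and |.\n",
"print(toy_df[toy_df.motivation.isnull() & (toy_df.year == 2001)])"
]
},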
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Sort the awards by year.**"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df_awards_past.sort_values('year').head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### How many awards of each type were given?"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df_awards_past.title.value_counts()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"But wait, that includes the years the awards weren't offered."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df_awards_actually_offered = df_awards_past[df_awards_past.num_recipients > 0]\n",
"df_awards_actually_offered.title.value_counts()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### When was each award first given?"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df_awards_actually_offered.groupby('title').year"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df_awards_actually_offered.groupby('title').year.describe() # we will use this information later!"
]
},
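{
"cell_type": "markdown",
"metadata": {},
"source": [
"If only the first year is of interest, `describe()` can be swapped for a single reduction. A minimal sketch on hypothetical rows (the real answer comes from `df_awards_actually_offered`): `groupby` splits the rows by title and `min()` reduces each group's years to the earliest one."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"\n",
"# Hypothetical rows, for illustration only.\n",
"toy_df = pd.DataFrame({\n",
"    'title': ['Physics', 'Peace', 'Physics', 'Economics'],\n",
"    'year': [1901, 1901, 1902, 1969],\n",
"})\n",
"\n",
"# One group per title; min() keeps the earliest year of each group.\n",
"first_year = toy_df.groupby('title').year.min().sort_values()\n",
"print(first_year)"
]
},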
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## How many recipients per year?"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's include the years with missing awards; if we were to analyze further, we'd have to decide whether to include them."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"A good plot that clearly reveals patterns in the data is very important. Is this a good plot or not?"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df_awards_past.plot.scatter(x='year', y='num_recipients') #explain scatterplot"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"It's hard to see a trend when there are multiple observations per year (**why?**).\n",
"\n",
"Let's try looking at the *total* number of recipients by year."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's explore how important a good plot can be."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df_awards_past.groupby('year').num_recipients.sum()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"plt.figure(figsize=[16,6])\n",
"# Plot the yearly total so the curve matches the title and the table above.\n",
"plt.plot(df_awards_past.groupby('year').num_recipients.sum(), 'b', linewidth=1)\n",
"\n",
"plt.title('Total Nobel Prize recipients per year')\n",
"plt.xlabel('Year')\n",
"plt.ylabel('Total recipients')\n",
"plt.grid(True)\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Check out the years 1940-43. Any comments?"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Any trends in the last 25 years?"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"set(df_awards_past.title)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"plt.figure(figsize=[16,6])\n",
"i = 0\n",
"for award in set(df_awards_past.title):\n",
" i += 1\n",
" year = df_awards_past[df_awards_past['title']==award].year\n",
" recips = df_awards_past[df_awards_past['title']==award].num_recipients\n",
" index = year > 2020 - 25\n",
" years_filtered = year[index].values\n",
" recips_filtered = recips[index].values\n",
" \n",
" plt.subplot(2,3,i)\n",
" plt.bar(years_filtered, recips_filtered, color='b', alpha = 0.7)\n",
" plt.title(award)\n",
" plt.xlabel('Year')\n",
" plt.ylabel('Number of Recipients')\n",
" plt.ylim(0, 3)\n",
"plt.tight_layout()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## A cleaner way to iterate and keep tabs: the ***enumerate( )*** function"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 'How has the number of recipients per award changed over time?'"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# The enumerate function allows us to delete two lines of code \n",
"# The number of years shown is increased to 75 so we can see the trend.\n",
"plt.figure(figsize=[16,6])\n",
"\n",
"for i, award in enumerate(set(df_awards_past.title), 1): ################### <--- enumerate\n",
" year = df_awards_past[ df_awards_past['title'] == award].year\n",
" recips = df_awards_past[ df_awards_past['title'] == award].num_recipients\n",
" index = year > 2019 - 75 ########################### <--- extend the range\n",
" years_filtered = year[index].values\n",
" recips_filtered = recips[index].values\n",
" \n",
" #plot:\n",
" plt.subplot(2, 3, i) #arguments (nrows, ncols, index)\n",
" plt.bar(years_filtered, recips_filtered, color='b', alpha = 0.7)\n",
" plt.title(award)\n",
" plt.xlabel('Year')\n",
" plt.ylabel('Number of Recipients')\n",
" plt.ylim(0, 3)\n",
"\n",
"plt.tight_layout()"
]
},
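{
"cell_type": "markdown",
"metadata": {},
"source": [
"Stripped of the plotting, the difference between the two loops above can be sketched on a hypothetical list (`fields` is made up): `enumerate` yields `(index, item)` pairs, and its second argument sets the starting index, which is why the subplot index can start at 1."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Hypothetical list, for illustration only.\n",
"fields = ['Chemistry', 'Peace', 'Physics']\n",
"\n",
"# Manual counter: two extra lines of bookkeeping.\n",
"i = 0\n",
"for field in fields:\n",
"    i += 1\n",
"    print(i, field)\n",
"\n",
"# enumerate does the counting; the second argument is the start value.\n",
"for i, field in enumerate(fields, 1):\n",
"    print(i, field)"
]
},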
{
"cell_type": "markdown",
"metadata": {},
"source": [
"----------------\n",
"### End of Standard Section\n",
"---------------"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Break Out Room II: Dictionaries, dataframes, and Pyplot"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## **Exercise 2.1 (practice creating a dataframe): Build a dataframe of famous physicists from the following lists.**\n",
"Your dataframe should have the following columns: \"name\", \"year_prize_awarded\" and \"famous_for\"."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"famous_award_winners = [\"Marie Curie\", \"Albert Einstein\", \"James Chadwick\", \"Werner Karl Heisenberg\"]\n",
"nobel_prize_dates = [1903, 1921, 1935, 1932]  # years of their Nobel Prizes in Physics\n",
"famous_for = [\"spontaneous radioactivity\", \"general relativity\", \"discovery of the neutron\",\n",
"              \"uncertainty principle\"]"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#initialize dictionary\n",
"famous_physicists = {}\n",
"#TODO: build Pandas Dataframe"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## **Exercise 2.2:** Make a bar plot of the total number of Nobel prizes awarded per field. Make sure to use the `groupby` method to achieve this.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#create the figure:\n",
"plt.figure(figsize=[16,6])\n",
"#group by command:\n",
"#TODO"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Solutions:\n",
"## Exercise 2.1 Solutions"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# %load solutions/exercise2.1sol"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Exercise 2.2 Solutions"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# %load solutions/exercise2.2sol_vanilla"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# %load solutions/exercise2.2sol_improved"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"***Food for thought: Is the prize in Economics more collaborative, or just more modern?***"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Extra: Did anyone receive the Nobel Prize more than once (based upon scraped data)?"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Here's where it bites us that our original DataFrame isn't \"tidy\". Let's make a tidy one."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"A great scientific article describing tidy data by Hadley Wickham: https://vita.had.co.nz/papers/tidy-data.pdf"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"tidy_awards = []\n",
"for idx, row in df_awards_past.iterrows():\n",
"    for recipient in row['recipients']:\n",
"        tidy_awards.append(dict(\n",
"            recipient=recipient,\n",
"            year=row['year']))\n",
"tidy_awards_df = pd.DataFrame(tidy_awards)\n",
"tidy_awards_df"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now we can look at each recipient individually."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"tidy_awards_df.recipient.value_counts()"
]
},
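{
"cell_type": "markdown",
"metadata": {},
"source": [
"A possible alternative to the nested loop, sketched on a hypothetical toy table: `DataFrame.explode` (available in pandas 0.25 and newer) repeats each row once per list element. This is a swapped-in technique, not the approach used above, and it behaves differently for empty lists: they become a `NaN` row that may need `dropna()`, whereas the loop simply skips them."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"\n",
"# Hypothetical toy table, for illustration only.\n",
"toy_df = pd.DataFrame({\n",
"    'year': [2001, 2002],\n",
"    'recipients': [['A', 'B'], ['A']],\n",
"})\n",
"\n",
"# One output row per (year, recipient) pair.\n",
"toy_tidy = (\n",
"    toy_df.explode('recipients')\n",
"    .rename(columns={'recipients': 'recipient'})\n",
"    .reset_index(drop=True)\n",
")\n",
"print(toy_tidy)\n",
"\n",
"# Counting names now works as in the cell above.\n",
"print(toy_tidy.recipient.value_counts())"
]
},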
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## End of Normal Section"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Optional Further Readings\n",
"\n",
"Sean Eddy, a professor in the Molecular and Cellular Biology department at Harvard, teaches a great course called MCB-112: Biological Data Science. His course is difficult but a great complement to CS109a and is also taught in Python.\n",
"\n",
"Here are a couple of resources that he referenced early in his course that helped solidify my understanding of data science.\n",
"\n",
"50 Years of Data Science by Dave Donoho (2017)\n",
"\n",
"Tidy data by Hadley Wickham (2014)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Extra Material: Other structured data formats (JSON and CSV)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### CSV\n",
"CSV is a lowest-common-denominator format for tabular data."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df_awards_past.to_csv('../data/awards.csv', index=False)\n",
"with open('../data/awards.csv', 'r') as f:\n",
" print(f.read()[:1000])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"It loses some info, though: the recipients list became a plain string, and the reader needs to guess whether each column is numeric or not."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"pd.read_csv('../data/awards.csv').recipients.iloc[20]"
]
},
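{
"cell_type": "markdown",
"metadata": {},
"source": [
"If a list column has already been flattened to text by a CSV round trip, one way to recover it is `ast.literal_eval` from the standard library, which parses Python literals without running arbitrary code. A minimal sketch on a hypothetical one-row table, kept in memory with `io.StringIO` so that no file is written:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import ast\n",
"import io\n",
"\n",
"import pandas as pd\n",
"\n",
"# Hypothetical toy table, for illustration only.\n",
"toy_df = pd.DataFrame({'year': [2001], 'recipients': [['A', 'B']]})\n",
"\n",
"# Round-trip through CSV text held in memory instead of a file.\n",
"csv_text = toy_df.to_csv(index=False)\n",
"loaded = pd.read_csv(io.StringIO(csv_text))\n",
"print(type(loaded.recipients.iloc[0]))  # the list came back as a string\n",
"\n",
"# literal_eval turns the text of a Python literal back into an object.\n",
"loaded['recipients'] = loaded.recipients.apply(ast.literal_eval)\n",
"print(type(loaded.recipients.iloc[0]))  # a list again"
]
},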
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### JSON"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"JSON preserves structured data, but fewer data-science tools speak it."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df_awards_past.to_json('../data/awards.json', orient='records')\n",
"\n",
"with open('../data/awards.json', 'r') as f:\n",
" print(f.read()[:1000])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Lists and other basic data types are preserved. (Custom data types aren't preserved, but you'll get an error when saving.)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"pd.read_json('../data/awards.json').recipients.iloc[20]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Extra: Pickle: handy for storing data\n",
"For temporary data storage in a single version of Python, `pickle`s will preserve your data even more faithfully, even many custom data types. But don't count on it for exchanging data or long-term storage. (In fact, don't try to load untrusted `pickle`s -- they can run arbitrary code!)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df_awards_past.to_pickle('../data/awards.pkl')\n",
"with open('../data/awards.pkl', 'r', encoding='latin1') as f:\n",
" print(f.read()[:200])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Yup, lots of internal Python and Pandas stuff..."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"pd.read_pickle('../data/awards.pkl').recipients.iloc[20]"
]
},
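{
"cell_type": "markdown",
"metadata": {},
"source": [
"To illustrate the point about custom data types, here is a small sketch using the standard-library `pickle` module directly, in memory. The `Laureate` type and the `record` contents are hypothetical. A namedtuple and a set both survive the round trip, although neither has a direct JSON equivalent."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pickle\n",
"from collections import namedtuple\n",
"\n",
"# Hypothetical custom type and data, for illustration only.\n",
"Laureate = namedtuple('Laureate', ['name', 'year'])\n",
"record = {'laureates': [Laureate('A', 2001)], 'years': {2001, 2002}}\n",
"\n",
"# dumps() serializes to bytes; loads() rebuilds an equal object.\n",
"blob = pickle.dumps(record)\n",
"restored = pickle.loads(blob)\n",
"\n",
"print(type(blob))\n",
"print(restored == record)\n",
"print(type(restored['laureates'][0]).__name__)"
]
},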
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Extra: Formatted data output"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's make a textual table of Physics laureates by year, earliest first:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"for idx, row in df_awards_past.sort_values('year').iterrows():\n",
"    if 'Physics' in row['title']:\n",
"        print('{}: {}'.format(\n",
"            row['year'],\n",
"            ', '.join(row['recipients'])))\n"
]
},
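{
"cell_type": "markdown",
"metadata": {},
"source": [
"The same kind of table can also be written with f-strings (Python 3.6 and newer), which accept format specifiers for alignment. A minimal sketch on hypothetical rows; the `rows` list and the column width of 6 are made up for illustration."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Hypothetical (year, recipients) pairs, for illustration only.\n",
"rows = [(1901, ['A']), (1903, ['B', 'C', 'D'])]\n",
"\n",
"for year, recipients in rows:\n",
"    names = ', '.join(recipients)\n",
"    # {year:<6} left-aligns the year in a 6-character column.\n",
"    print(f'{year:<6}{names}')"
]
},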
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Extra: Parsing JSON to get the Wayback Machine URL"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We could go to http://archive.org, search for our URL, and get the URL for the archived version there. But since you'll often need to talk with APIs, let's take this opportunity to use the Wayback Machine's [API](https://archive.org/help/wayback_api.php). This will also give us a chance to practice working with JSON."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"url = \"https://www.nobelprize.org/prizes/lists/all-nobel-prizes/\"\n",
"# All 3 of these do the same thing. The third is my (KCA's) favorite new feature of Python 3.6.\n",
"wayback_query_url = 'http://archive.org/wayback/available?url={}'.format(url)\n",
"wayback_query_url = 'http://archive.org/wayback/available?url={url}'.format(url=url)\n",
"wayback_query_url = f'http://archive.org/wayback/available?url={url}'\n",
"r = requests.get(wayback_query_url)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We got some kind of response... what is it?"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"r.text"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Yay, [JSON](https://en.wikipedia.org/wiki/JSON)! It's usually pretty easy to work with JSON, once we parse it."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"json.loads(r.text)"
]
},
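{
"cell_type": "markdown",
"metadata": {},
"source": [
"To see how parsing maps JSON onto Python types without touching the network, here is a small sketch on a hypothetical JSON string (it is not an actual API response). JSON objects become dicts, arrays become lists, `true` becomes `True` and `null` becomes `None`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import json\n",
"\n",
"# Hypothetical JSON text, built with dumps() so it is guaranteed to be valid.\n",
"sample_text = json.dumps({'ok': True, 'count': 2, 'items': ['a', 'b'], 'extra': None})\n",
"print(sample_text)\n",
"\n",
"parsed = json.loads(sample_text)\n",
"print(type(parsed))        # JSON objects become dicts\n",
"print(parsed['items'][0])  # arrays become lists\n",
"\n",
"# .get() with a default avoids a KeyError when a key is absent.\n",
"print(parsed.get('missing', 'default'))"
]
},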
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Loading responses as JSON is so common that `requests` has a convenience method for it:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"response_json = r.json()\n",
"response_json"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**What kind of object is this?**\n",
"\n",
"A little Python syntax review: **How can we get the snapshot URL?**"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"snapshot_url = response_json['archived_snapshots']['closest']['url']\n",
"snapshot_url"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.3"
},
"varInspector": {
"cols": {
"lenName": 16,
"lenType": 16,
"lenVar": 40
},
"kernels_config": {
"python": {
"delete_cmd_postfix": "",
"delete_cmd_prefix": "del ",
"library": "var_list.py",
"varRefreshCmd": "print(var_dic_list())"
},
"r": {
"delete_cmd_postfix": ") ",
"delete_cmd_prefix": "rm(",
"library": "var_list.r",
"varRefreshCmd": "cat(var_dic_list()) "
}
},
"types_to_exclude": [
"module",
"function",
"builtin_function_or_method",
"instance",
"_Feature"
],
"window_display": false
}
},
"nbformat": 4,
"nbformat_minor": 4
}