{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# CS-109A Introduction to Data Science \n", "\n", "\n", "## Lab 1: Introduction to Python and its Numerical Stack\n", "\n", "**Harvard University**
\n", "**Fall 2019**
\n", "**Instructors:** Pavlos Protopapas, Kevin Rader, and Chris Tanner
\n", "**Lab Instructor:** Eleni Kaxiras
\n", "**Authors:** Rahul Dave, David Sondak, Will Claybaugh, Pavlos Protopapas, Chris Tanner, and Eleni Kaxiras\n", "\n", "\n", "---\n", "\n" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n" ], "text/plain": [ "" ] }, "execution_count": 1, "metadata": {}, "output_type": "execute_result" } ], "source": [ "## RUN THIS CELL TO GET THE RIGHT FORMATTING \n", "import requests\n", "from IPython.core.display import HTML\n", "styles = requests.get(\"https://raw.githubusercontent.com/Harvard-IACS/2018-CS109A/master/content/styles/cs109.css\").text\n", "HTML(styles)" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "PATHTOSOLUTIONS = '../solutions'" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Programming Expectations\n", "All assignments for this class will use Python and the browser-based iPython notebook format you are currently viewing. Programming at the level of CS 50 is a prerequisite for this course. If you have concerns about this, come speak with any of the instructors. \n", "\n", "We will refer to the Python 3 [documentation](https://docs.python.org/3/) in this lab and throughout the course. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Learning Goals \n", "This introductory lab is a condensed introduction to Python numerical programming. 
By the end of this lab, you will feel more comfortable:\n", "\n", "- Setting up your own Anaconda environment with the necessary dependencies.\n", "\n", "- Writing short Python code using functions, loops, lists, numpy arrays, and dictionaries.\n", "\n", "- Manipulating Python lists and numpy arrays and understanding the difference between them.\n", "\n", "- Using the stats libraries `scipy.stats` and `statsmodels`." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Part 1: Set up a Conda Python Environment and Clone the Class Repository \n", "\n", "### On Python installation packages\n", "\n", "There are two main package installers for Python, `conda` and `pip`. Pip is the Python Packaging Authority’s recommended tool for installing packages from the **Python Package Index (PyPI)**. `Conda` is a cross-platform package and environment manager that installs and manages conda packages from the **Anaconda repository** and **Anaconda Cloud**. Conda does not assume any specific configuration on your computer and will install the Python interpreter along with the other Python packages, whereas `pip` assumes that you have already installed the Python interpreter on your computer. Given that most operating systems include Python, this is rarely a problem. \n", "\n", "If we had to summarize their differences in one sentence, it would be that conda has the ability to create **isolated environments** that can contain different versions of Python and/or the packages installed in them. This can be extremely useful when working with data science tools, as different tools may have conflicting requirements which could prevent them from all being installed into a single environment. You can also have isolated environments with pip, but you would need an additional tool such as virtualenv or venv. 
You may use either; we recommend `conda` because in our experience it leads to fewer incompatibilities between packages and thus fewer broken environments.\n", "\n", "**Conclusion: Use Both.** Most often in our data science environments we want to combine pip with conda when one or more packages are only available to install via pip. Thousands of packages are available in the Anaconda repository, including the most popular data science, machine learning, and AI frameworks, but many more are available on PyPI. Even if you have your environment installed via `conda` you can use `pip` to install individual packages \n", "\n", "([source: anaconda site](https://www.anaconda.com/understanding-conda-and-pip/)) \n", "\n", "### Installing Conda \n", "\n", "#### - First check if you have conda \n", "\n", "In **MacOS** or **Linux** open a Terminal window and at the prompt type \n", "\n", "`conda -V` \n", "\n", "If you get the version number (e.g. `conda 4.6.14`) you are all set! If you get an error, that means you do not have Anaconda, and it would be a good idea to install it. \n", "\n", "#### - If you do not have it, you can install it by following the instructions:\n", "\n", "**Mac** : https://docs.anaconda.com/anaconda/install/mac-os/\n", "\n", "**Windows** : https://docs.anaconda.com/anaconda/install/windows (Note: #8 is important: DO NOT add to your path. The reason is that Windows contains paths that may include spaces and that clashes with the way `conda` understands paths.)\n", "\n", "#### - If you do have anaconda consider upgrading it so you get the latest version of the packages: \n", "\n", "`conda update conda`\n", "\n", "Conda allows you to work in 'computing sandboxes' called environments. 
You may have several environments installed on your computer, each with its own version of Python and its own libraries, to avoid conflicts between libraries, which can cause errors.\n", "\n", "---------------------------------------------------------------------\n", "\n", "### NOTE (Sept.6, 2019): \n", "\n", "If you are still having issues please check the Announcements and the Discussion Forum (Ed) via the [2019-CS109a Canvas site](https://canvas.harvard.edu/courses/61942)\n", "\n", "Also please check the latest version of the cs109a.yml file. We have edited it as of today.\n", "\n", "---------------------------------------------------------------------\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### What are environments and do I need them?\n", "\n", "Environments in Python are like sandboxes that have different versions of Python and/or packages installed in them. You can create, export, list, remove, and update environments. Switching or moving between environments is called activating the environment. When you are done with an environment you may deactivate it.\n", "\n", "For this class we want to have a bit more control over the packages that will be installed with the environment, so we will create an environment from a YAML file called `cs109a.yml`. Originally YAML was said to mean *Yet Another Markup Language*, referencing its purpose as a markup language with the yet another construct, but it was then repurposed as *YAML Ain't Markup Language* [source:wikipedia]. The `cs109a.yml` file is included in the Lab directory in the class git repository. \n", "\n", "#### Creating an environment from an environment.yml file\n", "\n", "Using your browser, visit the class git repository https://github.com/Harvard-IACS/2019-CS109A\n", "\n", "Go to `content` --> `labs/` --> `lab1` and look for the cs109a.yml file. 
Download it to a local directory on your computer.\n", "\n", "Then in the Terminal again type\n", "\n", "`conda env create -f {PATH-TO-FILE}/cs109a.yml`" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Activate the new environment: \n", "\n", "`source activate cs109a`\n", "\n", "With conda 4.4 or later you can use `conda activate cs109a` instead.\n", "\n", "You should see the name of the environment in parentheses at the start of your command prompt.\n", "\n", "#### Verify that the new environment was installed correctly:\n", "\n", "`conda list`\n", "\n", "This will give you a list of the packages installed in this environment. \n", " \n", "#### References\n", " \n", "[Manage conda environments](https://docs.conda.io/projects/conda/en/latest/user-guide/tasks/manage-environments.html)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Clone the class repository\n", "\n", "In the Terminal type: \n", "\n", "`git clone https://github.com/Harvard-IACS/2019-CS109A.git`" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Starting the Jupyter Notebook\n", "\n", "Once everything is installed, go to the Terminal and type\n", "\n", "`jupyter notebook` \n", "\n", "to start the jupyter notebook server. This will spawn a process that will be running in the Terminal window until you are done working with the notebook. When you are done, press `control-C` to stop it.\n", "\n", "Starting the notebook will bring up a browser window with your file structure. Look for the 2019-CS109A folder. It should be where you cloned it previously. When you visit this folder in the future, and while in the top folder of it, type\n", "\n", "`git pull`\n", "\n", "This will update the contents of the folder with whatever is new. 
Make sure you are in the top level of the folder by typing \n", "\n", "`pwd` \n", "\n", "which should print a path ending in `/2019-CS109A`\n", "\n", "**For more on using the Notebook see**: https://jupyter-notebook.readthedocs.io/en/latest/" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Part 2: Getting Started with Python" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Importing modules\n", "All notebooks should begin with code that imports *modules*, collections of reusable, commonly-used Python functions. Below we import the Numpy module, a fast numerical programming library for scientific computing. Future labs will require additional modules, which we'll import with the same syntax.\n", "\n", "`import MODULE_NAME as MODULE_NICKNAME` " ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": true }, "outputs": [], "source": [ "import numpy as np #imports a fast numerical programming library" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now that Numpy has been imported, we can access some useful functions. For example, we can use `mean` to calculate the mean of a set of numbers." ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "2.1666666666666665" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "my_list = [1.2, 2, 3.3]\n", "np.mean(my_list)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Calculations and variables" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(0.5, 0, 0.5, 9.600000000000001)" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# / is true division in Python 3; // is integer (floor) division\n", "1/2, 1//2, 1.0/2, 3*3.2" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The last line in a cell is returned as the output value, as above. 
For cells with multiple lines of results, we can display results using ``print``, as can be seen below." ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "4.0 \n", " 9 7\n" ] }, { "data": { "text/plain": [ "1.6666666666666667" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "print(1 + 3.0, \"\\n\", 9, 7)\n", "5/3" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can store integer or floating point values as variables. The other basic Python data types -- booleans, strings, lists -- can also be stored as variables. " ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "collapsed": true }, "outputs": [], "source": [ "a = 1\n", "b = 2.0" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A list can be stored in a variable in the same way:" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "a = [1, 2, 3]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Think of a variable as a label for a value, not a box in which you put the value\n", "\n", "![](../images/sticksnotboxes.png)\n", "\n", "(image: Fluent Python by Luciano Ramalho)" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[1, 2, 3]" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "b = a\n", "b" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This DOES NOT create a new copy of `a`. 
It merely puts a new label on the memory at a, as can be seen by the following code:" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "a [1, 2, 3]\n", "b [1, 2, 3]\n", "a after change [1, 7, 3]\n", "b after change [1, 7, 3]\n" ] } ], "source": [ "print(\"a\", a)\n", "print(\"b\", b)\n", "a[1] = 7\n", "print(\"a after change\", a)\n", "print(\"b after change\", b)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Tuples**\n", "\n", "Multiple items on one line in the interface are returned as a *tuple*, an immutable sequence of Python objects. See the end of this notebook for an interesting use of `tuples`." ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(2, -1.0, 4.0, 10)" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "a = 1\n", "b = 2.0\n", "a + a, a - b, b * b, 10*a" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### `type()`\n", "\n", "We can obtain the type of a variable, and use boolean comparisons to test these types. VERY USEFUL when things go wrong and you cannot understand why this method does not work on a specific variable!" 
] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "False" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "type(a) == float" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "True" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "type(a) == int" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "int" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "type(a)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For reference, below are common arithmetic and comparison operations.\n", "\n", "\"Drawing\"\n", "\n", "\"Drawing\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
EXERCISE 1: Create a tuple called `tup` with the following five objects:
\n", "\n", "- The first element is an integer of your choice\n", "- The second element is a float of your choice \n", "- The third element is the sum of the first two elements\n", "- The fourth element is the difference of the first two elements\n", "- The fifth element is the first element divided by the second element\n", "\n", "- Display the output of `tup`. What is the type of the variable `tup`? What happens if you try to change an item in the tuple? " ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "(1, 1.1, 2.1, -0.10000000000000009, 0.9090909090909091)\n", "\n" ] } ], "source": [ "# your code here\n", "tup = (1,1.1,1+1.1,1-1.1,1/1.1)\n", "print(tup)\n", "print(type(tup))" ] }, { "cell_type": "code", "execution_count": 73, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# TO RUN THE SOLUTIONS\n", "# 1. uncomment the first line of the cell below so you have just %load\n", "# 2. Run the cell AGAIN to execute the python code, it will not run when you execute the %load command!!" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(3, 4.0, 7.0, -1.0, 0.75)" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# %load ../solutions/exercise1.py\n", "a = 3\n", "b = 4.0\n", "c = a + b\n", "d = a - b\n", "e = a / b\n", "tup = (a, b, c, d, e)\n", "tup\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Lists\n", "\n", "Much of Python is based on the notion of a list. In Python, a list is a sequence of items separated by commas, all within square brackets. The items can be integers, floating points, or another type. Unlike in C arrays, items in a Python list can be different types, so Python lists are more versatile than traditional arrays in C or other languages. \n", "\n", "Let's start out by creating a few lists. 
" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[]\n", "[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]\n", "([1, 2.0, 3, 4.0, 5], [1.0, 3.0, 5.0, 4.0, 2.0])\n" ] } ], "source": [ "empty_list = []\n", "float_list = [1., 3., 5., 4., 2.]\n", "int_list = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]\n", "mixed_list = [1, 2., 3, 4., 5]\n", "print(empty_list)\n", "print(int_list)\n", "print(mixed_list, float_list)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Lists in Python are zero-indexed, as in C. The first entry of the list has index 0, the second has index 1, and so on." ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "1\n", "3.0\n" ] } ], "source": [ "print(int_list[0])\n", "print(float_list[1])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "What happens if we try to use an index that doesn't exist for that list? Python will complain!" 
] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "ename": "IndexError", "evalue": "list index out of range", "output_type": "error", "traceback": [ "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", "\u001b[0;31mIndexError\u001b[0m Traceback (most recent call last)", "\u001b[0;32m\u001b[0m in \u001b[0;36m\u001b[0;34m()\u001b[0m\n\u001b[0;32m----> 1\u001b[0;31m \u001b[0;32mprint\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mfloat_list\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;36m10\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m", "\u001b[0;31mIndexError\u001b[0m: list index out of range" ] } ], "source": [ "print(float_list[10])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can find the length of a list using the built-in function `len`:" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[1.0, 3.0, 5.0, 4.0, 2.0]\n" ] }, { "data": { "text/plain": [ "5" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "print(float_list)\n", "len(float_list)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Indexing on lists plus Slicing" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "And since Python is zero-indexed, the last element of `float_list` is" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "2.0" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "float_list[len(float_list)-1]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "It is more idiomatic in Python to use -1 for the last element, -2 for the second last, and so on" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "2.0" ] }, "execution_count": 21, 
"metadata": {}, "output_type": "execute_result" } ], "source": [ "float_list[-1]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can use the ``:`` operator to access a subset of the list. This is called **slicing.** " ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[3.0, 5.0, 4.0, 2.0]\n", "[1.0, 3.0]\n" ] } ], "source": [ "print(float_list[1:5])\n", "print(float_list[0:2])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Below is a summary of list slicing operations:\n", "\n", "\"Drawing\"" ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['hi', 7]" ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "lst = ['hi', 7, 'c', 'cat', 'hello', 8]\n", "lst[:2]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can slice \"backwards\" as well:" ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[1.0, 3.0, 5.0]" ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ "float_list[:-2] # up to second last" ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[1.0, 3.0, 5.0, 4.0]" ] }, "execution_count": 26, "metadata": {}, "output_type": "execute_result" } ], "source": [ "float_list[:4] # up to but not including 5th element" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can also slice with a stride:" ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[1.0, 5.0]" ] }, "execution_count": 27, "metadata": {}, "output_type": "execute_result" } ], "source": [ "float_list[:4:2] # above but skipping every second element" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can iterate through a list using a loop. 
Here's a **for loop.**" ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "1.0\n", "3.0\n", "5.0\n", "4.0\n", "2.0\n" ] } ], "source": [ "for ele in float_list:\n", " print(ele)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "What if you wanted the index as well?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Use the built-in Python function `enumerate`, which yields one tuple per item, each of the form `(index, value)`. " ] }, { "cell_type": "code", "execution_count": 29, "metadata": { "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0 1.0\n", "1 3.0\n", "2 5.0\n", "3 4.0\n", "4 2.0\n" ] } ], "source": [ "for i, ele in enumerate(float_list):\n", " print(i, ele)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Appending and deleting" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can also add items to the end of a list, either with the `+` operator, which returns a new list and leaves the original unchanged, or with `append`, which modifies the list in place." ] }, { "cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[1.0, 3.0, 5.0, 4.0, 2.0, 0.333]" ] }, "execution_count": 30, "metadata": {}, "output_type": "execute_result" } ], "source": [ "float_list + [.333]" ] }, { "cell_type": "code", "execution_count": 31, "metadata": { "collapsed": true }, "outputs": [], "source": [ "float_list.append(.444)" ] }, { "cell_type": "code", "execution_count": 32, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[1.0, 3.0, 5.0, 4.0, 2.0, 0.444]\n" ] }, { "data": { "text/plain": [ "6" ] }, "execution_count": 32, "metadata": {}, "output_type": "execute_result" } ], "source": [ "print(float_list)\n", "len(float_list)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now, run the cell with `float_list.append()` a second time. Then run the subsequent cell. 
What happens? \n", "\n", "To remove an item from the list, use `del.`" ] }, { "cell_type": "code", "execution_count": 33, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[1.0, 3.0, 4.0, 2.0, 0.444]\n" ] } ], "source": [ "del(float_list[2])\n", "print(float_list)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You may also add an element (elem) in a specific position (index) in the list" ] }, { "cell_type": "code", "execution_count": 34, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[1.0, '3.14', 3.0, 4.0, 2.0, 0.444]" ] }, "execution_count": 34, "metadata": {}, "output_type": "execute_result" } ], "source": [ "elem = '3.14'\n", "index = 1\n", "float_list.insert(index, elem)\n", "float_list" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### List Comprehensions\n", "\n", "Lists can be constructed in a compact way using a *list comprehension*. Here's a simple example." ] }, { "cell_type": "code", "execution_count": 35, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[1, 4, 9, 16, 25, 36, 49, 64, 81, 100]" ] }, "execution_count": 35, "metadata": {}, "output_type": "execute_result" } ], "source": [ "squaredlist = [i*i for i in int_list]\n", "squaredlist" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "And here's a more complicated one, requiring a conditional." 
] }, { "cell_type": "code", "execution_count": 36, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[8, 32, 72, 128, 200]\n" ] } ], "source": [ "comp_list1 = [2*i for i in squaredlist if i % 2 == 0]\n", "print(comp_list1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This is entirely equivalent to creating `comp_list1` using a loop with a conditional, as below:" ] }, { "cell_type": "code", "execution_count": 37, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[8, 32, 72, 128, 200]\n" ] } ], "source": [ "comp_list2 = []\n", "for i in squaredlist:\n", " if i % 2 == 0:\n", " comp_list2.append(2*i) \n", "print(comp_list2)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The list comprehension syntax\n", "\n", "```\n", "[expression for item in list if conditional]\n", "\n", "```\n", "\n", "is equivalent to the syntax\n", "\n", "```\n", "for item in list:\n", " if conditional:\n", " expression\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
Exercise 2: (do at home) Build a list that contains every prime number between 1 and 100, in two different ways:
\n", " \n", "- 2.1 Using for loops and conditional if statements.\n", "- 2.2 **(Stretch Goal)** Using a list comprehension. You should be able to do this in one line of code. **Hint:** it might help to look up the function `all()` in the documentation." ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[2,\n", " 3,\n", " 5,\n", " 7,\n", " 11,\n", " 13,\n", " 17,\n", " 19,\n", " 23,\n", " 29,\n", " 31,\n", " 37,\n", " 41,\n", " 43,\n", " 47,\n", " 53,\n", " 59,\n", " 61,\n", " 67,\n", " 71,\n", " 73,\n", " 79,\n", " 83,\n", " 89,\n", " 97]" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "primes = []\n", "for i in range(1,101):\n", " if sum([(i % p) == 0 for p in primes]) > 0:\n", " continue\n", " if i != 1:\n", " primes.append(i)\n", "primes" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[2,\n", " 3,\n", " 5,\n", " 7,\n", " 11,\n", " 13,\n", " 17,\n", " 19,\n", " 23,\n", " 29,\n", " 31,\n", " 37,\n", " 41,\n", " 43,\n", " 47,\n", " 53,\n", " 59,\n", " 61,\n", " 67,\n", " 71,\n", " 73,\n", " 79,\n", " 83,\n", " 89,\n", " 97]" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "[i for i in range(2,101) if all(i % j != 0 for j in range(2,i))]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# %load ../solutions/exercise2_1.py\n", "N = 100;\n", "\n", "# using loops and if statements\n", "primes = [];\n", "for j in range(2, N):\n", " count = 0;\n", " for i in range(2,j):\n", " if j % i == 0:\n", " count = count + 1;\n", " if count == 0:\n", " primes.append(j)\n", "print(primes)\n" ] }, { "cell_type": "code", "execution_count": 40, "metadata": { "collapsed": true }, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# %load 
../solutions/exercise2_2.py\n", "primes_lc = [j for j in range(2, N) if all(j % i != 0 for i in range(2, j))]\n", "\n", "print(primes)\n", "print(primes_lc)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Simple Functions\n", "\n", "A *function* object is a reusable block of code that does a specific task. Functions are commonplace in Python, either on their own or as they belong to other objects. To invoke a function `func`, you call it as `func(arguments)`.\n", "\n", "We've seen built-in Python functions and methods (details below). For example, `len()` and `print()` are built-in Python functions. And at the beginning, you called `np.mean()` to calculate the mean of three numbers, where `mean()` is a function in the numpy module and numpy was abbreviated as `np`. This syntax allows us to have multiple \"mean\" functions in different modules; calling this one as `np.mean()` guarantees that we will execute numpy's mean function, as opposed to a mean function from a different module." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### User-defined functions\n", "\n", "We'll now learn to write our own user-defined functions. Below is the syntax for defining a basic function with one input argument and one output. You can also define functions with no input or output arguments, or multiple input or output arguments.\n", "\n", "```\n", "def name_of_function(arg):\n", " ...\n", " return(output)\n", "```\n", "\n", "We can write functions with one input and one output argument. Here are two such functions." 
] }, { "cell_type": "code", "execution_count": 42, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(25, 125)" ] }, "execution_count": 42, "metadata": {}, "output_type": "execute_result" } ], "source": [ "def square(x):\n", " x_sqr = x*x\n", " return(x_sqr)\n", "\n", "def cube(x):\n", " x_cub = x*x*x\n", " return(x_cub)\n", "\n", "square(5),cube(5)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "What if you want to return two variables at a time? The usual way is to return a tuple:" ] }, { "cell_type": "code", "execution_count": 43, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(25, 125)" ] }, "execution_count": 43, "metadata": {}, "output_type": "execute_result" } ], "source": [ "def square_and_cube(x):\n", " x_cub = x*x*x\n", " x_sqr = x*x\n", " return(x_sqr, x_cub)\n", "\n", "square_and_cube(5)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Lambda functions\n", "\n", "Often we quickly define mathematical functions with a one-line function called a *lambda* function. Lambda functions are great because they enable us to write functions without having to name them, i.e., they're *anonymous*. \n", "No return statement is needed. \n" ] }, { "cell_type": "code", "execution_count": 44, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "9\n" ] }, { "data": { "text/plain": [ "5.0" ] }, "execution_count": 44, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# create an anonymous function and assign it to the variable square\n", "square = lambda x: x*x\n", "print(square(3))\n", "\n", "# length of the hypotenuse of a right triangle with legs x and y\n", "hypotenuse = lambda x, y: (x*x + y*y)**0.5\n", "\n", "## Same as\n", "# def hypotenuse(x, y):\n", "# return((x*x + y*y)**0.5)\n", "\n", "hypotenuse(3,4)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Methods\n", "A function that belongs to an object is called a *method*. 
By \"object,\" we mean an \"instance\" of a class (e.g., list, integer, or floating point variable).\n", "\n", "For example, when we invoke `append()` on an existing list, `append()` is a method.\n", "\n", "In other words, a *method* is a function on a specific *instance* of a class (i.e., *object*). In this example, our class is a list. `float_list` is an instance of a list (thus, an object), and the `append()` function is technically a *method* since it pertains to the specific instance `float_list`." ] }, { "cell_type": "code", "execution_count": 45, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[1.0, 2.09, 4.0, 2.0, 0.444]\n" ] }, { "data": { "text/plain": [ "[1.0, 2.09, 4.0, 2.0, 0.444, 56.7]" ] }, "execution_count": 45, "metadata": {}, "output_type": "execute_result" } ], "source": [ "float_list = [1.0, 2.09, 4.0, 2.0, 0.444]\n", "print(float_list)\n", "float_list.append(56.7) \n", "float_list" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
Exercise 3: (do at home) Write an `isprime()` function and use it to list the prime numbers below 100
\n", " \n", "In Exercise 2, above, you wrote code that generated a list of the prime numbers between 1 and 100. Now, write a function called `isprime()` that takes in a positive integer $N$, and determines whether or not it is prime. Return `True` if it's prime and return `False` if it isn't. Then, using a list comprehension and `isprime()`, create a list `myprimes` that contains all the prime numbers less than 100. " ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [], "source": [ "# your code here\n", "def isprime(n):\n", " return all([n % i != 0 for i in range(2,n)])" ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[2,\n", " 3,\n", " 5,\n", " 7,\n", " 11,\n", " 13,\n", " 17,\n", " 19,\n", " 23,\n", " 29,\n", " 31,\n", " 37,\n", " 41,\n", " 43,\n", " 47,\n", " 53,\n", " 59,\n", " 61,\n", " 67,\n", " 71,\n", " 73,\n", " 79,\n", " 83,\n", " 89,\n", " 97]" ] }, "execution_count": 26, "metadata": {}, "output_type": "execute_result" } ], "source": [ "[n for n in range(2,100) if isprime(n)]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# %load ../solutions/exercise3.py\n", "def isprime(N):\n", " count = 0;\n", " if not isinstance(N, int):\n", " return False\n", " if N <= 1:\n", " return False\n", " for i in range(2, N):\n", " if N % i == 0:\n", " count = count + 1;\n", " if count == 0:\n", " return(True)\n", " else:\n", " return(False)\n", " \n", "print(isprime(3.0), isprime(\"pavlos\"), isprime(0), isprime(-1), isprime(1), isprime(2), isprime(93), isprime(97)) \n", "myprimes = [j for j in range(1, 100) if isprime(j)]\n", "print(myprimes)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Introduction to Numpy\n", "Scientific Python code uses a fast array structure, called the numpy array. Those who have programmed in Matlab will find this very natural. 
For reference, the numpy documentation can be found [here](https://docs.scipy.org/doc/numpy/reference/). \n", "\n", "Let's make a numpy array:" ] }, { "cell_type": "code", "execution_count": 48, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([1, 2, 3, 4])" ] }, "execution_count": 48, "metadata": {}, "output_type": "execute_result" } ], "source": [ "my_array = np.array([1, 2, 3, 4])\n", "my_array" ] }, { "cell_type": "code", "execution_count": 49, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "4" ] }, "execution_count": 49, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# works as it would with a standard list\n", "len(my_array)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `shape` attribute of an array is very useful (we'll see more of it later when we talk about 2D arrays -- matrices -- and higher-dimensional arrays)." ] }, { "cell_type": "code", "execution_count": 50, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(4,)" ] }, "execution_count": 50, "metadata": {}, "output_type": "execute_result" } ], "source": [ "my_array.shape" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Numpy arrays are **typed**. This means that all the elements of an array are assumed to be of the same type (e.g., integer, float, string)." ] }, { "cell_type": "code", "execution_count": 51, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "dtype('int64')" ] }, "execution_count": 51, "metadata": {}, "output_type": "execute_result" } ], "source": [ "my_array.dtype" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Numpy arrays offer much of the same functionality as lists! Below, we compute the length, slice the array, and iterate through it (one could do exactly the same with a list)."
] }, { "cell_type": "code", "execution_count": 52, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "4\n", "[3 4]\n", "1\n", "2\n", "3\n", "4\n" ] } ], "source": [ "print(len(my_array))\n", "print(my_array[2:4])\n", "for ele in my_array:\n", "    print(ele)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "There are two ways to manipulate numpy arrays: a) by calling the array's own methods (e.g., `my_array.mean()`), or b) by applying a numpy function such as `np.mean()` with the numpy array as an argument." ] }, { "cell_type": "code", "execution_count": 53, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "2.5\n", "2.5\n" ] } ], "source": [ "print(my_array.mean())\n", "print(np.mean(my_array))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A ``constructor`` is a general programming term that refers to the mechanism for creating a new object (e.g., list, array, string).\n", "\n", "There are many other efficient ways to construct numpy arrays. Here are some commonly used numpy array constructors. Read more details in the numpy documentation." ] }, { "cell_type": "code", "execution_count": 54, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([ 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.])" ] }, "execution_count": 54, "metadata": {}, "output_type": "execute_result" } ], "source": [ "np.ones(10) # generates 10 floating point ones" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Numpy gains a lot of its efficiency from being typed. That is, all elements in the array have the same type, such as integer or floating point. The default type, as can be seen above, is a float. (By default each float is a 64-bit, i.e., 8-byte, double-precision number, `float64`, regardless of whether the code is running on a 32-bit or 64-bit machine.)"
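, "\n", "\n", "As a minimal sketch of what being typed means in practice (the names `mixed`, `small`, and `big` are illustrative assumptions, not part of the lab), the example here shows numpy choosing a single type for mixed input, and how the chosen type determines memory use:\n", "\n", "
```python
import numpy as np

# Mixing integers and a float: numpy picks one type that can hold them all.
mixed = np.array([1, 2, 3.5])
print(mixed.dtype)                    # float64

# Explicitly requesting 32-bit floats halves the memory per element.
small = np.ones(10, dtype=np.float32)
big = np.ones(10)                     # default dtype is float64
print(small.itemsize, big.itemsize)   # bytes per element
print(small.nbytes, big.nbytes)       # bytes for the whole array
```
"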
] }, { "cell_type": "code", "execution_count": 55, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "8" ] }, "execution_count": 55, "metadata": {}, "output_type": "execute_result" } ], "source": [ "np.dtype(float).itemsize # in bytes (remember, 1 byte = 8 bits)" ] }, { "cell_type": "code", "execution_count": 56, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1])" ] }, "execution_count": 56, "metadata": {}, "output_type": "execute_result" } ], "source": [ "np.ones(10, dtype='int') # generates 10 integer ones" ] }, { "cell_type": "code", "execution_count": 57, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([ 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.])" ] }, "execution_count": 57, "metadata": {}, "output_type": "execute_result" } ], "source": [ "np.zeros(10)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Often, you will want random numbers. Use the `np.random.random` constructor!" ] }, { "cell_type": "code", "execution_count": 58, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([ 0.85115672, 0.37346821, 0.3298871 , 0.47496563, 0.69940192,\n", " 0.97207796, 0.91488615, 0.36063927, 0.81240722, 0.16128617])" ] }, "execution_count": 58, "metadata": {}, "output_type": "execute_result" } ], "source": [ "np.random.random(10) # uniform from [0, 1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can generate random numbers from a normal distribution with mean 0 and variance 1:" ] }, { "cell_type": "code", "execution_count": 59, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "The sample mean and standard deviation are 0.025195 and 1.026880, respectively.\n" ] } ], "source": [ "normal_array = np.random.randn(1000)\n", "print(\"The sample mean and standard deviation are %f and %f, respectively.\" %(np.mean(normal_array), np.std(normal_array)))" ] }, { "cell_type": "code", "execution_count": 60, "metadata": {}, "outputs": [ { "data": { 
"text/plain": [ "1000" ] }, "execution_count": 60, "metadata": {}, "output_type": "execute_result" } ], "source": [ "len(normal_array)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can sample with and without replacement from an array. Let's first construct a list with evenly-spaced values:" ] }, { "cell_type": "code", "execution_count": 61, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([ 0. , 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1. ])" ] }, "execution_count": 61, "metadata": {}, "output_type": "execute_result" } ], "source": [ "grid = np.arange(0., 1.01, 0.1)\n", "grid" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Without replacement" ] }, { "cell_type": "code", "execution_count": 62, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([ 0.3, 0.8, 0.7, 1. , 0. ])" ] }, "execution_count": 62, "metadata": {}, "output_type": "execute_result" } ], "source": [ "np.random.choice(grid, 5, replace=False)" ] }, { "cell_type": "code", "execution_count": 63, "metadata": {}, "outputs": [ { "ename": "ValueError", "evalue": "Cannot take a larger sample than population when 'replace=False'", "output_type": "error", "traceback": [ "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", "\u001b[0;31mValueError\u001b[0m Traceback (most recent call last)", "\u001b[0;32m\u001b[0m in \u001b[0;36m\u001b[0;34m()\u001b[0m\n\u001b[0;32m----> 1\u001b[0;31m \u001b[0mnp\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mrandom\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mchoice\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mgrid\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;36m20\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mreplace\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mFalse\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m", "\u001b[0;32mmtrand.pyx\u001b[0m in \u001b[0;36mmtrand.RandomState.choice\u001b[0;34m()\u001b[0m\n", "\u001b[0;31mValueError\u001b[0m: Cannot take a larger sample 
than population when 'replace=False'" ] } ], "source": [ "np.random.choice(grid, 20, replace=False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "With replacement:" ] }, { "cell_type": "code", "execution_count": 64, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([ 0. , 1. , 0.3, 1. , 0. , 0.9, 1. , 0.7, 0.2, 0.7, 0.4,\n", " 0.6, 0.1, 0.6, 0.4, 0.3, 0.6, 0.3, 0.8, 1. ])" ] }, "execution_count": 64, "metadata": {}, "output_type": "execute_result" } ], "source": [ "np.random.choice(grid, 20, replace=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Tensors\n", "\n", "We can think of *tensor* as a general name for multidimensional arrays of numerical values. While tensors first emerged in mathematics and physics, they have since been applied to numerous other disciplines, including machine learning. In this class you will only be using **scalars**, **vectors**, and **2D arrays**, so you do not need to worry about the name 'tensor'.\n", "\n", "We will use the following naming conventions:\n", "\n", "- scalar = just a number = rank 0 tensor ( $a ∈ F$ )\n", "

\n", "- vector = 1D array = rank 1 tensor ( $x = (x_1,...,x_n)^\\top ∈ F^n$ )\n", "

\n", "- matrix = 2D array = rank 2 tensor ( $\\mathbf{X} = [x_{ij}] ∈ F^{m×n}$ )\n", "

\n", "- 3D array = rank 3 tensor ( $\\mathscr{X} =[t_{i,j,k}]∈F^{m×n×l}$ )\n", "

\n", "- $\\mathscr{N}$D array = rank $\\mathscr{N}$ tensor ( $\\mathscr{T} =[t_{i_1,...,i_\\mathscr{N}}]∈F^{n_1×...×n_\\mathscr{N}}$ ) \n", "\n", "\n", "### Slicing a 2D array" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "[Figure: slicing a 2D array]\n", "\n", "[source:oreilly](https://www.oreilly.com/library/view/python-for-data/9781449323592/ch04.html)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# how do we get just the second row of the above array?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Numpy supports vector operations\n", "\n", "What does this mean? It means that instead of looping over two arrays and adding them element by element, you can just say: add the two arrays. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "first = np.ones(5)\n", "second = np.ones(5)\n", "first + second # element-wise addition; returns a new array" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note that this behavior is very different from Python lists, where `+` performs concatenation." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "first_list = [1., 1., 1., 1., 1.]\n", "second_list = [1., 1., 1., 1., 1.]\n", "first_list + second_list # concatenation" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "On some computer chips, this numpy addition actually happens in parallel and can yield significant increases in speed. But even on regular chips, the advantage of greater readability is important." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Broadcasting\n", "\n", "Numpy supports a concept known as *broadcasting*, which dictates how arrays of different sizes are combined together. There are too many rules to list here, but importantly, multiplying an array by a number multiplies each element by the number. Adding a number adds the number to each element."
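, "\n", "\n", "A minimal sketch of one further broadcasting rule, assuming the illustrative arrays `matrix`, `row`, and `column` (they are not part of the lab): when shapes differ, numpy lines them up starting from the last axis and stretches any axis of length 1 to match.\n", "\n", "
```python
import numpy as np

# A 2D array with shape (2, 3) and a 1D array with shape (3,).
matrix = np.array([[1., 2., 3.],
                   [4., 5., 6.]])
row = np.array([10., 20., 30.])

# The 1D array is added to every row of the matrix.
shifted = matrix + row
print(shifted)

# A column with shape (2, 1) is stretched across the columns instead.
column = np.array([[100.], [200.]])
stretched = matrix + column
print(stretched)

# Shapes that cannot be lined up, here (2, 3) and (2,), raise a ValueError.
try:
    matrix + np.ones(2)
except ValueError as err:
    print('Incompatible shapes:', err)
```
"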
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "first + 1" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "first*5" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This means that if you wanted samples from a normal distribution with mean 5 and standard deviation 7, you could do:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "normal_5_7 = 5 + 7*normal_array\n", "np.mean(normal_5_7), np.std(normal_5_7)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Multiplying two arrays multiplies them element-by-element:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "(first +1) * (first*5)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You might have wanted to compute the dot product instead:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "np.dot((first +1) , (first*5))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Probability Distributions from `scipy.stats` and `statsmodels`\n", "\n", "Two useful statistics libraries in Python are `scipy` and `statsmodels`.\n", "\n", "For example, to load the z-test for proportions:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "import statsmodels\n", "from statsmodels.stats.proportion import proportions_ztest" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "x = np.array([74,100])\n", "n = np.array([152,266])\n", "\n", "zstat, pvalue = proportions_ztest(x, n) \n", "print(\"Two-sided z-test for proportions: \\n\",\"z =\",zstat,\", pvalue =\",pvalue)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# `%matplotlib inline` ensures that plots are rendered inline in the notebook.\n", 
"%matplotlib inline\n", "import matplotlib.pyplot as plt" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's get the normal distribution namespace from `scipy.stats`. See here for [Documentation](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.norm.html)." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "from scipy.stats import norm" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's create 1,000 points between -10 and 10:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "x = np.linspace(-10, 10, 1000) # linspace() returns evenly-spaced numbers over a specified interval\n", "x[0:10], x[-10:]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's get the pdf of a normal distribution with a mean of 1 and standard deviation 3, and plot it using the grid points computed before:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "pdf_x = norm.pdf(x, 1, 3)\n", "plt.plot(x, pdf_x);" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can also draw random samples from the distribution using the `rvs` method (e.g., `norm.rvs(loc=1, scale=3, size=10)`)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### References\n", "\n", "A useful book by Jake Vanderplas: [PythonDataScienceHandbook](https://jakevdp.github.io/PythonDataScienceHandbook/).\n", "\n", "You may also benefit from using [Chris Albon's web site](https://chrisalbon.com) as a reference. It contains lots of useful information." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Dictionaries\n", "A dictionary is another data structure (aka storage container) -- arguably the most powerful. Like a list, a dictionary is a collection of items. Unlike a list, its items are accessed with keys rather than integer positions. (Since Python 3.7, dictionaries preserve the order in which items were inserted.) 
\n", "\n", "Dictionaries are the closest data structure we have to a database.\n", "\n", "Let's make a dictionary with a few Harvard courses and their corresponding enrollment numbers." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "enroll2017_dict = {'CS50': 692, 'CS109A / Stat 121A / AC 209A': 352, 'Econ1011a': 95, 'AM21a': 153, 'Stat110': 485}\n", "enroll2017_dict" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "One can obtain the value corresponding to a key via:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "enroll2017_dict['CS50']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If you try to access a key that isn't present, your code will yield an error:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "enroll2017_dict['CS630']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Alternatively, the `.get()` function allows one to gracefully handle these situations by providing a default value if the key isn't found:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "enroll2017_dict.get('CS630', 5)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note, this does not _store_ a new value for the key; it only provides a value to return if the key isn't found." 
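, "\n", "\n", "A minimal sketch of the difference, assuming a small stand-in dictionary named `enrollment` (a hypothetical name; only two of the courses above are copied into it): `.get()` leaves the dictionary unchanged, indexing a missing key raises a `KeyError`, and only an assignment actually stores a value.\n", "\n", "
```python
# A small stand-in for the enrollment dictionary used in this lab.
enrollment = {'CS50': 692, 'Stat110': 485}

# .get() returns the default but leaves the dictionary unchanged.
print(enrollment.get('CS630', 5))
print('CS630' in enrollment)

# Indexing a missing key raises KeyError, which we can catch.
try:
    enrollment['CS630']
except KeyError as err:
    print('Missing key:', err)

# To actually store a value, assign to the key.
enrollment['CS630'] = 5
print(enrollment['CS630'])
```
"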
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "enroll2017_dict['CS630'] # still a KeyError: .get() did not add the key" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "enroll2017_dict.get('C730', None)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "All sorts of iterations are supported:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "enroll2017_dict.values()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "enroll2017_dict.items()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can iterate over the tuples obtained above:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "for key, value in enroll2017_dict.items():\n", "    print(\"%s: %d\" %(key, value))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Simply iterating over a dictionary gives us the keys. This is useful when we want to do something with each item:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "second_dict={}\n", "for key in enroll2017_dict:\n", "    second_dict[key] = enroll2017_dict[key]\n", "second_dict" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The loop above builds an actual __copy__ of `enroll2017_dict`: `second_dict` is a separate dictionary with the same keys and values. In contrast, `second_dict = enroll2017_dict` would have made both variables refer to the same dictionary in memory." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In the previous dictionary example, the keys were strings corresponding to course names. 
Keys don't have to be strings, though; they can be other _immutable_ data types such as numbers or tuples (not lists, as lists are mutable).\n", "\n", "### Dictionary comprehension: \"Do not try this at home\"\n", "\n", "You can construct dictionaries using a *dictionary comprehension*, which is similar to a list comprehension. Notice the curly braces {} and the use of `zip` (see the `zip` section below for more on `zip`)." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "float_list = [1., 3., 5., 4., 2.]\n", "int_list = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]\n", "\n", "my_dict = {k:v for (k, v) in zip(int_list, float_list)}\n", "my_dict" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Creating tuples with `zip`\n", "\n", "`zip` is a Python built-in function that returns an iterator that aggregates elements from each of the iterables. This is an iterator of tuples, where the i-th tuple contains the i-th element from each of the argument sequences or iterables. The iterator stops when the shortest input iterable is exhausted. The `set()` built-in function returns a `set` object, optionally with elements taken from another iterable. By wrapping `zip` in `set()` you can display its contents (note that a set does not preserve the original order). In the example below, the iterables are the two lists, `float_list` and `int_list`. We can have more than two iterables."
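, "\n", "\n", "A minimal sketch of these rules, assuming a hypothetical third list `letters` that is not part of the lab; `list()` is used here instead of `set()` so that the tuples keep their original order:\n", "\n", "
```python
float_list = [1., 3., 5., 4., 2.]
int_list = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
letters = ['a', 'b', 'c']   # hypothetical third iterable

# list() consumes the iterator and keeps the pairs in order.
pairs = list(zip(int_list, float_list))
print(pairs)

# zip stops at the shortest input: 5 pairs, not 10.
print(len(pairs))

# Three iterables produce 3-tuples; the shortest input has 3 elements.
triples = list(zip(int_list, float_list, letters))
print(triples)
```
"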
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "float_list = [1., 3., 5., 4., 2.]\n", "int_list = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]\n", "\n", "viz_zip = set(zip(int_list, float_list))\n", "viz_zip" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "type(viz_zip)" ] } ], "metadata": { "anaconda-cloud": {}, "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.3" } }, "nbformat": 4, "nbformat_minor": 2 }