Syllabus

Draft Syllabus Subject to Change

Advanced Topics in Data Science (Spring 2022)

CS 109b, AC 209b, Stat 121b, or CSCI E-109b

Instructors

Pavlos Protopapas (SEAS) & Mark Glickman (Statistics)

Lectures: Mon & Wed 9:45‐11am SEC 1.321
Labs: Fri 9:45-11am SEC 1.321
Advanced Sections: Wed 2:15-3:30pm SEC 1.321 (starting 2/2; see schedule for specific dates)
Office Hours: See Ed Post

Prerequisites: CS 109a, AC 209a, Stat 121a, or CSCI E-109a or the equivalent.

Course description
Tentative Course Topics
Course Objectives
Course Components
Course Resources
Course Policies and Expectations

Course Description

Advanced Topics in Data Science (CS109b) is the second half of a one-year introduction to data science. Building upon the material in Introduction to Data Science, the course introduces advanced methods for data wrangling, data visualization, statistical modeling, and prediction. Topics include big data, multiple deep learning architectures such as CNNs, RNNs, language models, transformers, autoencoders, and generative models as well as Bayesian modeling, sampling methods, and unsupervised learning.

The programming language will be Python.

Tentative Course Topics

Unsupervised Learning, Clustering
Bayesian Inference
Hierarchical Bayesian Modeling
Fully Connected Neural Networks
Convolutional Neural Networks
Autoencoders
Recurrent Neural Networks
NLP / Text Analysis
Transformers
Variational AutoEncoders & Generative Models
Generative Adversarial Networks

Course Objectives

Upon successful completion of this course, you should feel comfortable with the material mentioned above, and you will have gained experience working with others on real-world problems. The content knowledge, the project, and teamwork will prepare you for the professional world or further studies.

Course Components

Lectures, labs, and advanced sections will be live-streamed for Extension School students and can be accessed through the Zoom section on Canvas. Video recordings of the live stream will be made available to all students within 24 hours after the event, and will be accessible from the Lecture Video section on Canvas.

Lectures

The class meets for lectures twice a week (M & W). Attending and participating in lectures is a crucial component of learning the material presented in this course. Students may be asked to complete short readings before certain lectures. Some lectures will also include real-time coding exercises which we will complete as a class.

Labs

Lab will be held every Friday. Labs will present guided, hands-on coding challenges to prepare students for successfully completing the homework assignments.

Advanced Sections

The course will include advanced sections for 209b students and will cover a different topic per week. These 75 min sessions will cover advanced topics like the mathematical underpinnings of the methods seen in the main course lectures and lab as well as extensions of those methods. The material covered in the advanced sections is required for all AC209b students. Tentative topics are:

Gaussian Mixture Models
Particle Filters/Sequential Monte Carlo
NN as Universal Approximator
Solvers
Segmentation Techniques, YOLO, Unet and M-RCNN
Variational Autoencoders
Word2Vec
BERT
GANS, Cycle GANS, etc.

Note: Advanced Section are not held every week. Consult the course calendar for exact dates.

Midterm

There will be a midterm exam on Friday, March 25th from 9:45-11am (regular lab time). The exam will likely consist of multiple choice questions with a take-home coding component. More information to follow.

Projects

Beginning the last week of classes (4/25), students will join groups of 3-4 and be divided into break-out, thematic sections to study an open problem in one of various domains. The domains are tentative at the moment but may include medicine, law, astronomy, e-commerce, government, and areas in the humanities. Each section will include lectures by Harvard faculty who are experts in the field. Project work will continue on through reading period and conclude with final submissions on 5/6. The final submission will consist of a written report, a Jupyter notebook with all relevant code, and a 6-minute, pre-recorded presentation video.

Homework Assignments

There will be 7 graded homework assignments. Some of them will be due one week after being assigned and some will be due two weeks after being assigned. For 5 assignments, you have the option to work and submit in pairs, the 2 remaining are to be completed individually.

Standard assignments are graded out of 5 points.

AC209b students will have additional homework content for most assignments worth 1 point.

Course Resources

Online Materials

All course materials, including lecture notes, lab notes, and section notes will be published on Ed, the course GitHub repo, and the public site's 'Materials' section.

Note: Lecture content for lectures 1-7 will only be accessible to registered students.

Assignments will only be posted on Canvas.

Working Environment

You will be working in Jupyter Notebooks which you can run in your own machine or in the SEAS JupyterHub.

ISLR: An Introduction to Statistical Learning by James, Witten, Hastie, Tibshirani (Springer: New York, 2013)
BDA3: Bayesian Data Analysis by Andrew Gelman, John B. Carlin, Hal S. Stern, David B. Dunson, Aki Vehtari, Donald B. Rubin (CRC Press: New York, 2013)
DL: Deep Learning by Goodfellow, Bengio and Courville. (The MIT Press: Cambridge, 2016)
Glassner: Deep Learning, Vol. 1 & 2 by Andrew Glassner
SLP Speech and Language Processing by Jurafsky and Martin (3rd Edition Draft)
INLP Introduction to Natural Language Processing by Jacob Eisenstein (The MIT Press: Cambridge, 2019) Free electronic versions are available (ISLR, DL, SLP, INLP) or hard copy through Amazon (ISLR, DL, Glassner, SLP, INLP).

Articles & Excerpts

Unsupervised learning:
- Basics: James, G., Witten, D., Hastie, T., & Tibshirani, R. (2021). An introduction to statistical learning (2nd ed.). New York: Springer. Chapter 12 https://hastie.su.domains/ISLR2/ISLRv2_website.pdf
- Silhouette plots: https://scikit-learn.org/stable/auto_examples/cluster/plot_kmeans_silhouette_analysis.html
- Gap statistic: Tibshirani, R., Walther, G., & Hastie, T. (2001). Estimating the number of clusters in a data set via the gap statistic. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 63(2), 411-423. https://hastie.su.domains/Papers/gap.pdf
- DBSCAN: Schubert, E., Sander, J., Ester, M., Kriegel, H. P., & Xu, X. (2017). DBSCAN revisited, revisited: why and how you should (still) use DBSCAN. ACM Transactions on Database Systems (TODS), 42(3), 1-21. https://dl.acm.org/doi/pdf/10.1145/3068335?casa_token=_P479lYnlpsAAAAA:PckzU6ZiTt3yMNzFrXyzESZ3N_pp904kN0N2QEwIoq6CxtfPCxnL9bNTGtjhuiNtzSfKyXoM-QI
Bayesian material
- Basics: Glickman, Mark E. and Van Dyk, David A. (2007) "Basic Bayesian Methods" In Topics in Biostatistics (Methods in Molecular Biology). Edited by Walter Ambrosius. The Humana Press Inc., Totowa, NJ. ISBN 1-58829-531-1. pp 319-338. Chapter accessible from http://www.glicko.net/research/glickman-vandyk.pdf
- Importance sampling, rejection sampling, MCMC, Metropolis, Gibbs sampler: Andrieu, C., De Freitas, N., Doucet, A., & Jordan, M. I. (2003). An introduction to MCMC for machine learning. Machine learning, 50(1), 5-43. Article accessible from: https://www.cs.ubc.ca/~arnaud/andrieu_defreitas_doucet_jordan_intromontecarlomachinelearning.pdf
- Bayesian examples - regression, hierarchical modeling: Gelman, A., Carlin, J.B., Stern, H.S., Dunson, D.B., Vehtari, A., & Rubin, D.B. (2013). Bayesian Data Analysis (3rd ed.). Chapman and Hall/CRC. https://doi.org/10.1201/b16018 http://www.stat.columbia.edu/~gelman/book/BDA3.pdf Chapter 14: Introduction to regression models Chapter 15: Hierarchical linear models Chapter 16: Generalized linear models (includes logistic regression)

Getting Help

For questions about homework, course content, package installation, the process is:

try to troubleshoot yourself by reading the lecture, lab, and section notes, and looking up online resources.
go to office hours this is the best way to get help.
post on the class Ed forum; we want you and your peers to engage in helping each other. TFs also monitor Ed and will respond within 24 hours. Note that Ed questions are visible to everyone. If you are citing homework solution code you must post privately so that only the staff sees your message.
watch for official announcements via Ed. These announcements will also be sent to the email address associated with your Canvas account so make sure you have it set appropriately.
send an email to the Helpline cs109b2022@gmail.com for administrative issues, regrade requests, and non-content specific questions.
for personal matters that you do not feel comfortable sharing with the TFs, you may send an email to either or both of the instructors.

Course Policies and Expectations

Grading for CS109b, STAT121b, and CS209b:

Your final score for the course will be computed using the following weights:

Assignment	Final Grade Weight
Paired Homework (5)	45%
Individual Homework (2)	23%
Midterm	12%
Project	20%
Total	100%

Note: Regular homework (for everyone) counts as 5 points. 209b extra homework counts as 1 point.

Collaboration Policy

We expect you to adhere to the Harvard Honor Code at all times. Failure to adhere to the honor code and our policies may result in serious penalties, up to and including automatic failure in the course and reference to the ad board. If you work with a partner on an assignment make sure both parties solve all the problems. Do not divide and conquer. You are expected to be intellectually honest and give credit where credit is due. In particular:

if you work with a fellow student but decide to submit different papers, include the name of each other in the designated area of the submission paper.
if you work with a fellow student and want to submit the same paper you need to form a group prior to the submission. Details in the assignment. Not all assignments will permit group submissions.
you need to write your solutions entirely on your own or with your collaborator
you are welcome to take ideas from code presented in labs, lecture, or sections but you need to change it, adapt it to your style, and ultimately write your own. We do not want to see code copied verbatim from the above sources.
if you use code found on the internet, books, or other sources you need to cite those sources.
you should not view any written materials or code created by other students for the same assignment;
you may not provide or make available solutions to individuals who take or may take this course in the future.
if the assignment allows it you may use third-party libraries and example code, so long as the material is available to all students in the class and you give proper attribution. Do not remove any original copyright notices and headers.

Late or Incorrectly Submitted Assignments

Each student is allowed up to 3 late days over the semester with at most 1 day applied to any single homework. Outside of these allotted late days, late homework will not be accepted unless there is a medical (if accompanied by a doctor's note) or other official University-excused reasons. There is no need to ask before using one of your late days.

If you forgot to join a Group with your peer and are asking for the same grade we will accept this with no penalty up to HW3. For homeworks beyond that we feel that you should be familiar with the process of joining groups. After that there will be a penalty of -1 point for both members of the group provided the submission was on time.

Grading Guidelines

Homework will be graded based on:

How correct your code is (the Notebook cells should run, we are not troubleshooting code)
How you have interpreted the results — we want text not just code. It should be a report.
How well you present the results.

The scale is 0 to 5 for each assignment and 0 to 1 for the additional 209 assignments.

Re-grade Requests

Our graders and instructors make every effort in grading accurately and in giving you a lot of feedback.

If you discover that your answer to a homework problem was correct but it was marked as incorrect, send an email to the Helpline with a description of the error. Please do not submit regrade requests based on what you perceive is overly harsh grading. The points we take off are based on a grading rubric that is being applied uniformly to all assignments.

If you decide to send a regrade request, send an email to the Helpline with subject line "Regrade HW1: Grader=johnsmith" replacing 'HW1' with the current assignment and 'johnsmith' with the name of the grader within 48 hours of the grade release.

Auditing the Class

You are welcome to audit this course. To request access, send an email to cs109b2022@gmail.com with you HUID (required) and a statement of agreement to the terms below.

All auditors must agree to abide by the following rules:

Auditors must attend class in person. This is a Harvard policy. Auditors who do not confirm their presence with the head TF during the first week of in-class instruction will lose course access.
Auditors are held to the same standard of academic honesty as our registered students. Please do not share homeworks or solutions with anyone. Violations will be reported to the Harvard Administrative Board.
Auditors are not permitted to take the course for credit in the future.
Audiors should not submit HWs or participate in projects.
Auditors should refrain from using any course and TF resources that are designed for our registered students like Ed, FASOnDemand, JupyterHub, and office hours.

Academic Integrity

Ethical behavior is an important trait of a Data Scientist, from ethically handling data to attribution of code and work of others. Thus, in CS109b we give a strong emphasis to Academic Honesty. As a student your best guidelines are to be reasonable and fair. We encourage teamwork for problem sets, but you should not split the homework and you should work on all the problems together. For more detailed expectations, please refer to the Collaborations section above.

You are responsible for understanding Harvard Extension School policies on academic integrity https://www.extension.harvard.edu/resources-policies/student-conduct/academic-integrity and how to use sources responsibly. Stated most broadly, academic integrity means that all course work submitted, whether a draft or a final version of a paper, project, take-home exam, online exam, computer program, oral presentation, or lab report, must be your own words and ideas, or the sources must be clearly acknowledged. The potential outcomes for violations of academic integrity are serious and ordinarily include all of the following: required withdrawal (RQ), which means a failing grade in the course (with no refund), the suspension of registration privileges, and a notation on your transcript.

Using sources responsibly https://www.extension.harvard.edu/resources-policies/resources/avoiding-plagiarism is an essential part of your Harvard education. We provide additional information about our expectations regarding academic integrity on our website. We invite you to review that information and to check your understanding of academic citation rules by completing two free online 15-minute tutorials that are also available on our site. (The tutorials are anonymous open-learning tools.)

Accommodations for students with disabilities

Harvard students needing academic adjustments or accommodations because of a documented disability must present their Faculty Letter from the Accessible Education Office (AEO) and speak with the professor by the end of the second week of the term, (fill in specific date). Failure to do so may result in the Course Head's inability to respond in a timely manner. All discussions will remain confidential, although Faculty are invited to contact AEO to discuss appropriate implementation.

Harvard Extension School is committed to providing an inclusive, accessible academic community for students with disabilities and chronic health conditions. The Accessibility Services Office (ASO) https://www.extension.harvard.edu/resources-policies/accessibility-services-office-aso offers accommodations and supports to students with documented disabilities. If you have a need for accommodations or adjustments in your course, please contact the Accessibility Services Office by email at accessibility@extension.harvard.edu or by phone at 617-998-9640.

Diversity and Inclusion Statement

Data Science and Computer Science have historically been representative of only a small sliver of the population. This is despite the contributions of a diverse group of early pioneers - see Ada Lovelace, Dorothy Vaughan, and Grace Hopper for just a few examples.

As educators, we aim to build a diverse, inclusive, and representative community offering opportunities in data science to those who have been historically marginalized. We will encourage learning that advances ethical data science, exposes bias in the way data science is used, and advances research into fair and responsible data science.

We need your help to create a learning environment that supports a diversity of thoughts, perspectives, and experiences, and honors your identities (including but not limited to race, gender, class, sexuality, religion, ability, etc.) To help accomplish this:

If you have a name and/or set of pronouns that differ from those in your official Harvard records, please let us know!
If you feel like your performance in the class is being impacted by your experiences outside of class, please do not hesitate to come and talk with us. We want to be a resource for you. Remember that you can also submit anonymous feedback (which will lead to us making a general announcement to the class, if necessary, to address your concerns). If you prefer to speak with someone outside of the course, you may find helpful resources at the Harvard Office of Diversity and Inclusion.
We (like many people) are still learning about diverse perspectives and identities. If something was said in class (by anyone) that made you feel uncomfortable, please talk to us about it.
As a participant in course discussions, you are expected to respect your classmates’ diverse backgrounds and perspectives.

Our course will discuss diversity, inclusion, and ethics in data science. Please contact us (in person or electronically) or submit anonymous feedback if you have any suggestions for how we can improve.