I keep saying that the sexy job in the next 10 years will be statisticians. And I'm not kidding.
Me: My primary area of expertise is psychology and economics.
This class is totally, unapologetically a work in progress.
Material is a mish-mash of stuff from:
Stanford University (graduate course)
Harvard (graduate course)
...so, yeah, it will be challenging. Hopefully, you'll find it fun!
My research: occasionally touches the topics in the course, but mostly utilizes things in the course as tools.
New phone who dis? Please write down your
You must spend 5 minutes telling me a little bit about your interests before the end of the week.
The syllabus is posted on the course website.
I'll walk through highlights now, but read it later -- it's long.
But eventually, please read it. It is "required."
Grade is composed of problem sets, exams, and a written assignment.
Although exams are given a relatively low weight, you must attempt both exams to pass the course.
Labs consist of a practical implementation of something we've covered in the course (e.g., code your own Recommender System).
Grading: come to class.
If you are the type of student who doesn't generally enjoy coming to class, this is not the course for you.
I suspect the exams will be much like the exams in my other course. Students have described those exams as ``painfully difficult''. You are only entitled to the rubric on the previous slide.
If you complete all assignments and attend every class session, I will use the following curve for grading:
4.0 Came to class regularly, contributed substantive comments to discussions, did modestly well on exams, turned in all assignments.
3.5 Came to class regularly, said some stuff (sometimes interesting), did modestly poorly on exams, turned in all assignments.
3.0 Came to class regularly and said some stuff (mostly uninteresting), did very poorly on exams, turned in all assignments.
< 3.0 Didn't come to class regularly or didn't turn in all assignments.
There are sort of three texts for this course and sort of zero.
The main text is free and available online (see syllabus or Google it). The secondary text is substantially more difficult, but also free online. The third text costs about $25.
Please please please please please: Ask questions during class.
Return of the Please: If there is some topic that you really want to learn about, ask. If you are uncomfortable asking in front of the whole group, please see me during office hours.
Because this is a new course:
Some of the lectures will be way too long or too short.
Some (most?) of the lectures won't make sense.
Some of the time I'll forget what I intended to say and awkwardly stare at you for a few moments (sorry).
Comment throughout the course, not just at the end.
The material will improve with time and feedback.
I encourage measured feedback and thoughtful responses to questions. If I call on you and you don't know immediately, don't freak out. If you don't know, it's totally okay to say you don't know.
I teach using ``math''.
...Don't be afraid. The math won't hurt you.
I fundamentally believe that true knowledge of how we learn from data depends on a basic understanding of the underlying mathematics.
Good news: no black boxes.
Bad news: notation-heavy slides and reading.
Finally: I cannot address field-specific questions in areas outside economics to any satisfying degree.
Good news: I'm good at knowing what I don't know and have a very small ego, which means that I'm much less likely to blow smoke up your ass than other professors.
Bad news: I can't help with certain types of questions.
This course should be applicable broadly, but many of the examples will lean on my personal expertise (sorry).
Your "assignment": read syllabus and Lab 0.
Things to stress from syllabus and Lab 0:
Despite my hard-assness in these intro slides: I'm here to help and I am not in the business of giving bad grades for no reason.
How do you define "data analytics"? (Not a rhetorical question!)
Some "data analytics" topics we will cover:
Better utilizing existing data can improve our predictive power whilst providing interpretable outputs for making policies.
[I.] Theoretical Underpinnings of Statistical Learning
[A.] Setup and a "Case Study"
[B.] The Learning Problem
[C.] Linear Regression
[D.] Bias versus Variance
[E.] Training versus Testing
[F.] The VC Dimension
[G.] Bias versus Variance
[II.] Parametric Models in Statistical Learning
[IIa.] Models of Classification
[IIb.] Linear Model Selection
[III.] Non-Parametric Models in Statistical Learning
[IIIa.] Tree-Based Methods
[IIIb.] Neural Networks
[IV.] Unsupervised Learning
Suppose you are a researcher and you want to teach a computer to recognize images of a tree.
Note: this is an ``easy'' problem. If you show pictures to a 3-year-old, that child will probably be able to tell you if there is a tree in the picture.
Computer scientists spent about 20 years on this problem because they thought about the problem like nerds and tried to write down a series of rules.
Rules are difficult to form, and simply writing rules misses the key insight: the data can tell you something.
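A minimal sketch of letting the data pick the rule instead of writing it by hand. Everything here is made up for illustration: the single feature (fraction of green pixels), the eight labeled examples, and the function names. Real image recognition uses far richer features and models; the point is only that the cutoff is learned, not typed in.

```python
# Hypothetical toy data: (fraction of green pixels, 1 if the image has a tree).
# The feature and the numbers are invented for illustration only.
examples = [
    (0.05, 0), (0.10, 0), (0.20, 0), (0.30, 0),
    (0.45, 1), (0.55, 1), (0.70, 1), (0.90, 1),
]


def predict(green_fraction, threshold):
    """Classify as 'tree' (1) when the feature reaches the threshold."""
    return 1 if green_fraction >= threshold else 0


def training_errors(threshold, data):
    """Count how many labeled examples the threshold rule gets wrong."""
    return sum(predict(x, threshold) != y for x, y in data)


def learn_threshold(data):
    """Return the candidate threshold with the fewest training errors.

    Candidates are the midpoints between neighboring feature values.
    """
    xs = sorted(x for x, _ in data)
    candidates = [(a + b) / 2 for a, b in zip(xs, xs[1:])]
    return min(candidates, key=lambda t: training_errors(t, data))


learned = learn_threshold(examples)
print("learned threshold:", learned)
print("training errors:", training_errors(learned, examples))
```

Nobody wrote down ``a tree is anything more than X percent green''; the labeled examples determined where the cutoff falls.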
Suppose you are a researcher and you want to know whether prisons reduce crime.
from ``A Call for a Moratorium on Prison Building'' (1976)
Prison Capacity | Crime Rate
X causes Y (causality)
Y causes X (reverse causality)
Z causes X and Y (common cause)
X causes Y and Y causes X (simultaneous equations)
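The common-cause case is easy to simulate. This is a sketch under stated assumptions: Z, X, and Y are standard normal draws invented for illustration, and X never appears in the equation for Y (or vice versa). The two still end up correlated because both depend on Z.

```python
import random


def correlation(a, b):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))
    var_a = sum((u - mean_a) ** 2 for u in a)
    var_b = sum((v - mean_b) ** 2 for v in b)
    return cov / (var_a * var_b) ** 0.5


rng = random.Random(0)  # fixed seed so the run is reproducible
n = 5000

# Common cause: Z drives both X and Y; X and Y do not affect each other.
z = [rng.gauss(0, 1) for _ in range(n)]
x = [zi + rng.gauss(0, 1) for zi in z]
y = [zi + rng.gauss(0, 1) for zi in z]
r_common = correlation(x, y)

# Comparison: two series with nothing in common.
x_alone = [rng.gauss(0, 1) for _ in range(n)]
y_alone = [rng.gauss(0, 1) for _ in range(n)]
r_unrelated = correlation(x_alone, y_alone)

print("correlation with a common cause:", round(r_common, 3))
print("correlation with no link at all:", round(r_unrelated, 3))
```

In this setup the theoretical correlation between X and Y is 0.5 (covariance 1, each variance 2), even though neither causes the other. The correlation alone cannot tell you which of the four stories above is the true one.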
We will start in this course by examining situations where we do not care about why something has happened, but instead we care about our ability to predict its occurrence from existing data.
(But of course, keep in the back of your mind that if you are making policy, you must care about why something happened.)
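A small sketch of what ``predict from existing data'' looks like in practice: fit on data you have, then check the predictions on data you held back. The data-generating process (y = 2 + 3x + noise), the 150/50 split, and the function names are assumptions made up for this example. Least squares for one predictor stands in for whatever model you would actually use.

```python
import random


def fit_line(xs, ys):
    """Least-squares intercept and slope for a single predictor."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def mean_squared_error(xs, ys, intercept, slope):
    """Average squared gap between predictions and observed outcomes."""
    return sum((y - (intercept + slope * x)) ** 2
               for x, y in zip(xs, ys)) / len(xs)


rng = random.Random(1)  # fixed seed so the run is reproducible

# Made-up "existing data": y depends on x plus noise.
xs = [rng.uniform(0, 10) for _ in range(200)]
ys = [2.0 + 3.0 * x + rng.gauss(0, 1) for x in xs]

# Fit on the first 150 observations, evaluate on the 50 held back.
train_x, test_x = xs[:150], xs[150:]
train_y, test_y = ys[:150], ys[150:]

intercept, slope = fit_line(train_x, train_y)
test_mse = mean_squared_error(test_x, test_y, intercept, slope)

print("fitted intercept and slope:", round(intercept, 2), round(slope, 2))
print("mean squared error on held-out data:", round(test_mse, 2))
```

Nothing in the code says why y moves with x. It only measures how well the fitted line predicts cases it has not seen.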
We will also borrow a few other ideas from CS:
Example: a firm wishes to predict user behavior based on previous purchases or interactions.
Small margins, huge payoffs: $1 million (the Netflix Prize).
Not obvious why this was true for Netflix; quite obvious why this is true in financial markets.
Machine learning arose as a subfield of Artificial Intelligence.
Statistical learning arose as a subfield of Statistics.
There is much overlap; however, a few points of distinction:
Machine learning has a greater emphasis on large scale applications and prediction accuracy.
Statistical learning emphasizes models and their interpretability, as well as precision and uncertainty.
Obviously true: machine learning has the upper hand in marketing.
The following are the basic requirements for statistical learning: