Hoeffding's Inequality states, loosely, that the sample frequency $\nu$ cannot be too far from the true frequency $\mu$.
$\mathbb{P} \left [ | \nu - \mu | > \epsilon \right ] \leq 2 e^{-2\epsilon^2 n}$
The statement "$\nu \approx \mu$" is probably approximately correct (hence the name PAC learning).
Example: $n = 1{,}000$; draw a sample and observe $\nu$.
What does this mean?
If I repeatedly pick a sample of size 1,000, observe $\nu$, and claim that
$\mu \in [\nu - 0.05, \nu + 0.05]$ (i.e., an error bar of $\pm 0.05$), I will be right at least $98.6\%$ of the time,
since $2 e^{-2 \epsilon^2 n} = 2 e^{-2 (0.05)^2 \cdot 1000} = 2 e^{-5} \approx 0.0135$.
On any particular sample you may be wrong, but not often.
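The claim above can be checked with a minimal simulation (a sketch, not part of the original notes; the true mean $\mu = 0.6$ and the number of repeated samples are arbitrary assumptions):

```python
import math
import random

# Repeatedly draw samples of size n = 1000 from a population with true
# mean mu, and count how often the interval [nu - 0.05, nu + 0.05] misses mu.
random.seed(0)
mu, n, eps, trials = 0.6, 1000, 0.05, 2000

misses = 0
for _ in range(trials):
    nu = sum(random.random() < mu for _ in range(n)) / n  # sample frequency
    if abs(nu - mu) > eps:  # error bar of +-0.05 failed to capture mu
        misses += 1

bound = 2 * math.exp(-2 * eps**2 * n)  # Hoeffding: 2e^{-5}, about 0.0135
print(f"missed mu in {misses} of {trials} samples; Hoeffding bound: {bound:.4f}")
```

The empirical miss rate comes out well below the bound, which is expected: Hoeffding holds for any distribution, so for a specific one it is typically loose.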
Suppose a bag is filled with nine $\clubsuit$ and one $\diamondsuit$.
Calculate the probability of drawing 3 $\diamondsuit$ in a row (with replacement).
Provide the formula for $n$ draws in a row of $\diamondsuit$.
Show that Hoeffding's Inequality holds for six consecutive draws of $\diamondsuit$.
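A worked sketch of these three exercises (interpreting "odds" as probability; $\mu = 0.1$ is the fraction of $\diamondsuit$ in the bag, and draws are with replacement):

```python
import math

mu = 0.1  # probability of drawing a diamond on any single draw

# 3 diamonds in a row, and the general formula mu**n for n in a row:
print("P[3 diamonds in a row] =", mu**3)   # (1/10)^3 = 0.001
print("P[n diamonds in a row] = (1/10)^n")

# Hoeffding check for n = 6 consecutive diamonds: there nu = 1, so
# |nu - mu| = 0.9. For eps just below 0.9 (say 0.89), the only sample of
# size 6 with |nu - mu| > eps is six diamonds, so the left side is mu**6.
n, eps = 6, 0.89
lhs = mu**n                              # P[|nu - mu| > eps] = 1e-6
rhs = 2 * math.exp(-2 * eps**2 * n)      # roughly 1.5e-4
print(lhs, "<=", rhs, "->", lhs <= rhs)
```

So even this extreme sample is far rarer than the Hoeffding bound requires: the inequality holds comfortably.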
Critical requirement: samples must be independent.
If the sample is constructed in some arbitrary fashion, then indeed we cannot say anything.
Even with independence, $\nu$ can take on arbitrary values on a particular sample; the bound only says that large deviations are unlikely.
The key player in the bound $2 e^{-2\epsilon^2 n}$ is $n$.
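The role of $n$ is easy to see by tabulating the bound (a quick sketch; the choice $\epsilon = 0.05$ and the grid of $n$ values are arbitrary):

```python
import math

# How the Hoeffding bound 2*exp(-2*eps^2*n) shrinks as n grows, at fixed eps.
eps = 0.05
for n in (100, 500, 1000, 5000, 10000):
    print(f"n = {n:>6}   bound = {2 * math.exp(-2 * eps**2 * n):.2e}")
```

For small $n$ the bound exceeds 1 and says nothing; it decays exponentially once $n$ is large enough.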
In learning, the unknown object is an entire function $f$; in the bag it was a single number $\mu$.
White area in second figure: $h(x) = f(x)$
Green area in second figure: $h(x) \neq f(x)$
Define the following notion:
$\mathbb{E}(h) = \mathbb{P}_x \left [ h(x) \neq f(x) \right ]$
That is, this is the "size" of the green region.
We can re-frame $\mu, \nu$ in terms of $\mathbb{E}(h) = \mathbb{P}_x \left [ h(x) \neq f(x) \right ]$
Out-of-sample error: $\mathbb{E}_\text{out}(h) = \mathbb{P}_x \left [ h(x) \neq f(x) \right ]$
In-sample error: $\mathbb{E}_\text{in}(h) = \frac{1}{n} \sum_{i=1}^n \mathbb{I} \left [ h(x_i) \neq f(x_i) \right ]$
$\mathbb{P} \left [ |\mathbb{E}_\text{in}(h) - \mathbb{E}_\text{out}(h) | > \epsilon \right ] \leq 2 e^{-2\epsilon^2 n}$
Victory! If we just minimize in-sample error, we are likely to be right out of sample!
...right?
The entire previous argument assumed a FIXED $h$ and then came the data.
Given $h \in \mathcal{H}$, a sample can verify whether or not $h$ is good (w.r.t. $f$).
In this (artificial) verification setting we have no control over $\mathbb{E}_\text{in}$: $h$ was fixed before the data arrived.
In learning, you actually try to fit the data!
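This failure mode can be sketched with the standard multiple-coin analogue (the setup is an assumption of this sketch, not taken from the text above): each fair coin stands in for a hypothesis with $\mathbb{E}_\text{out} = 0.5$, and picking the coin with the smallest observed $\nu$ mimics minimizing $\mathbb{E}_\text{in}$ over $\mathcal{H}$.

```python
import random

# 1000 coins, 10 flips each. For any FIXED coin, nu tracks mu = 0.5 as
# Hoeffding promises. But if we select the coin with the smallest nu
# AFTER seeing the flips -- as learning does with E_in -- the selected
# coin's nu is far from 0.5, and the single-h bound no longer applies.
random.seed(1)
n, n_coins = 10, 1000

nus = [sum(random.random() < 0.5 for _ in range(n)) / n for _ in range(n_coins)]

print("fixed coin:    nu =", nus[0])    # typically near 0.5
print("selected coin: nu =", min(nus))  # typically near 0 -- far from mu
```

The fixed coin behaves; the data-dependent choice does not. Handling this selection effect over all of $\mathcal{H}$ is exactly what comes next.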