Hoeffding's Inequality

Hoeffding's Inequality states, loosely, that the sample frequency $\nu$ cannot be too far from the true bag frequency $\mu$.

Theorem (Hoeffding's Inequality)

$$\mathbb{P}\left[\, |\nu - \mu| > \epsilon \,\right] \leq 2 e^{-2\epsilon^2 n}$$

$\nu \approx \mu$ is called probably approximately correct (PAC learning).
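As a quick empirical check (a sketch of my own, not part of the lecture), the simulation below draws many samples of size $n$ from a bag with a known $\mu$ and compares the observed frequency of large deviations to the bound; $\mu = 0.3$, $\epsilon = 0.05$, and the trial count are illustrative choices.

```python
# Empirical check of Hoeffding's bound: draw many samples of size n from a
# bag whose true "diamond" fraction is mu, and count how often the sample
# frequency nu deviates from mu by more than eps.
import math
import random

mu, n, eps, trials = 0.3, 1000, 0.05, 20000   # illustrative values
bound = 2 * math.exp(-2 * eps**2 * n)

deviations = 0
for _ in range(trials):
    nu = sum(random.random() < mu for _ in range(n)) / n   # sample frequency
    if abs(nu - mu) > eps:
        deviations += 1

print(f"empirical P[|nu - mu| > {eps}]: {deviations / trials:.4f}")
print(f"Hoeffding bound 2e^(-2 eps^2 n): {bound:.4f}")
```

The empirical frequency typically comes out well below the bound, which is deliberately loose: as noted later, it holds for every $\mu$ and every bag.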

Hoeffding's Inequality: Example

Example: $n = 1{,}000$; draw a sample and observe $\nu$.

  • Roughly 99% of the time, $\mu - 0.05 \leq \nu \leq \mu + 0.05$ (the bound guarantees at least $\approx 98.7\%$; see the arithmetic below)
  • (This follows from setting $\epsilon = 0.05$ and using the given $n$)
  • 99.9999996% of the time, $\mu - 0.10 \leq \nu \leq \mu + 0.10$
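The arithmetic behind both figures, plugging $n = 1{,}000$ into the bound:

$$2 e^{-2(0.05)^2(1000)} = 2 e^{-5} \approx 0.0135 \;\Rightarrow\; \mathbb{P}\left[\,|\nu - \mu| \leq 0.05\,\right] \geq 98.65\%$$

$$2 e^{-2(0.10)^2(1000)} = 2 e^{-20} \approx 4.1 \times 10^{-9} \;\Rightarrow\; \mathbb{P}\left[\,|\nu - \mu| \leq 0.10\,\right] \geq 99.9999996\%$$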

What does this mean?

If I repeatedly pick a sample of size 1,000, observe $\nu$, and claim that
$\mu \in [\nu - 0.05, \nu + 0.05]$ (or that the error bar is $\pm 0.05$), I will be right about 99% of the time.

On any particular sample you may be wrong, but not often.

In-Class Exercise

  1. Suppose a bag is filled with 9 $\clubsuit$ marbles and 1 $\diamondsuit$ marble.

  2. Calculate the probability of drawing 3 $\diamondsuit$ in a row (with replacement, obviously).

  3. Provide the formula for $n$ draws of $\diamondsuit$ in a row.

  4. Show that Hoeffding's Inequality holds for six consecutive draws of $\diamondsuit$. (A sketch of the arithmetic follows this list.)
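A sketch of the arithmetic (one reading of the exercise, taking each draw to be $\diamondsuit$ with probability $\mu = 0.1$):

$$\mathbb{P}[3 \text{ consecutive } \diamondsuit] = (0.1)^3 = 10^{-3}, \qquad \mathbb{P}[n \text{ consecutive } \diamondsuit] = (0.1)^n$$

Six consecutive $\diamondsuit$ give $\nu = 1$, a deviation of $0.9$ from $\mu = 0.1$. The probability of such a sample is $(0.1)^6 = 10^{-6}$, while the Hoeffding bound at any $\epsilon$ just below $0.9$ (with $n = 6$) is about $2 e^{-2(0.9)^2 \cdot 6} \approx 1.2 \times 10^{-4}$, so the inequality is comfortably satisfied.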

Extending the Example

  • Critical requirement: samples must be independent.

  • If the sample is constructed in some arbitrary fashion, then indeed we cannot say anything.

Even with independence, $\nu$ can take on arbitrary values.

  • Some values are far more likely than others. This is what allows us to learn something: it is likely that $\nu \approx \mu$.
  • The bound $2 e^{-2\epsilon^2 n}$ does not depend on $\mu$ or on the size of the bag.
  • The bag can be infinite.
  • It is fortunate that the bound does not depend on $\mu$, because $\mu$ is unknown.

The key player in the bound $2 e^{-2\epsilon^2 n}$ is $n$.

  • As $n \to \infty$, $\nu \approx \mu$ with very high probability, but never for sure (see the quick computation below).
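A quick computation of how fast the bound falls with $n$ (the tolerance $\epsilon = 0.05$ is an illustrative choice):

```python
# How the Hoeffding bound 2*exp(-2*eps^2*n) shrinks as the sample size grows.
import math

eps = 0.05  # illustrative tolerance
for n in (100, 500, 1000, 5000, 10000):
    print(f"n = {n:>6}: bound = {2 * math.exp(-2 * eps**2 * n):.2e}")
```

For $n = 100$ the bound exceeds 1 and says nothing at all; by $n = 10{,}000$ it is astronomically small, yet never exactly zero.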

Learning a Target Function

In learning, the unknown object is an entire function $f$; in the bag it was a single number $\mu$.

[Figures Targfunc, Targfunc2, Targfunc3: regions of the input space where $h(x) = f(x)$ (white) and where $h(x) \neq f(x)$ (green)]

  1. White area in the second figure: $h(x) = f(x)$

  2. Green area in the second figure: $h(x) \neq f(x)$

Define the following notion:
$$\mathbb{E}(h) = \mathbb{P}_x \left[ h(x) \neq f(x) \right]$$
That is, $\mathbb{E}(h)$ is the "size" of the green region.

  • $\clubsuit$ "marble": $h(x) = f(x)$
  • $\diamondsuit$ "marble": $h(x) \neq f(x)$.

We can re-frame $\mu$ and $\nu$ in terms of $\mathbb{E}(h) = \mathbb{P}_x \left[ h(x) \neq f(x) \right]$.

Closing the Metaphor

Out-of-sample error: $\mathbb{E}_\text{out}(h) = \mathbb{P}_x \left[ h(x) \neq f(x) \right]$

In-sample error: $\mathbb{E}_\text{in}(h) = \frac{1}{n} \sum_{i=1}^n \mathbb{I} \left[ h(x_i) \neq f(x_i) \right]$

Hoeffding's Inequality, restated:

$$\mathbb{P}\left[\, |\mathbb{E}_\text{in}(h) - \mathbb{E}_\text{out}(h)| > \epsilon \,\right] \leq 2 e^{-2\epsilon^2 n}$$
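A minimal sketch of this correspondence in code, assuming a toy input space and hand-picked $f$ and $h$ (all choices below are illustrative, not from the lecture): $\mathbb{E}_\text{out}(h)$ is computed exactly by enumeration and compared to $\mathbb{E}_\text{in}(h)$ on a random sample.

```python
# A single, fixed hypothesis h checked against a target f.
# Toy setup: inputs are the integers 0..999, drawn uniformly at random.
import random

def f(x):            # the (normally unknown) target function
    return x % 2 == 0

def h(x):            # a fixed hypothesis, chosen before seeing any data
    return x % 4 == 0

inputs = range(1000)
n = 200

# E_out(h): exact probability of disagreement under the uniform distribution.
e_out = sum(h(x) != f(x) for x in inputs) / len(inputs)

# E_in(h): disagreement frequency on a random sample of n points.
sample = [random.choice(inputs) for _ in range(n)]
e_in = sum(h(x) != f(x) for x in sample) / n

print(f"E_out(h) = {e_out:.3f},  E_in(h) = {e_in:.3f}")
```

Here $\mathbb{E}_\text{out}(h) = 0.25$ exactly, and with $n = 200$ the inequality says $\mathbb{E}_\text{in}$ lands within $0.1$ of it except with probability at most $2 e^{-2(0.1)^2 \cdot 200} = 2 e^{-4} \approx 0.037$.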

Victory! If we just minimize in-sample error, we are likely to be right out of sample!

...right?

Verification vs Learning

The entire previous argument assumed a FIXED $h$; only then came the data.

Given $h \in \mathcal{H}$, a sample can verify whether or not it is good (w.r.t. $f$):

  • if $\mathbb{E}_\text{in}$ is small, $h$ is good, with high confidence.
  • if $\mathbb{E}_\text{in}$ is large, $h$ is bad, with high confidence.

In this (artificial) example world, we have no control over $\mathbb{E}_\text{in}$.

In learning, you actually try to fit the data!

  • e.g., the perceptron model $g$ results from searching an entire hypothesis set $\mathcal{H}$ for a hypothesis with small $\mathbb{E}_\text{in}$ (a toy illustration of why this matters follows below).
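To see why the fixed-$h$ assumption matters, here is a toy illustration (my own sketch, not the perceptron): every one of the $M$ candidate hypotheses below is useless, with $\mathbb{E}_\text{out} = 0.5$, and for simplicity each hypothesis's error indicators are modeled as independent fair coin flips. Selecting the hypothesis with the smallest $\mathbb{E}_\text{in}$ still makes the in-sample error look impressive.

```python
# Searching a hypothesis set: M useless hypotheses, each with true error 0.5.
# For simplicity each hypothesis's n error indicators are independent fair
# coin flips (the classic coin-flipping analogy).
import math
import random

n, M, eps = 20, 1000, 0.25
single_h_bound = 2 * math.exp(-2 * eps**2 * n)   # bound for ONE fixed h

e_in = [sum(random.random() < 0.5 for _ in range(n)) / n for _ in range(M)]
g_e_in = min(e_in)   # the hypothesis we would report after the search

print("E_out of every hypothesis: 0.50")
print(f"smallest E_in over {M} hypotheses: {g_e_in:.2f}")
print(f"single-h bound on P[|E_in - E_out| > {eps}]: {single_h_bound:.3f}")
```

Run this a few times: the selected hypothesis almost always shows a deviation larger than $\epsilon = 0.25$, even though for any single fixed hypothesis the bound caps that probability at about $0.16$. That gap is exactly why the fixed-$h$ argument alone does not cover learning.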