In PART 1 we have already seen different paradigms of learning with respect to how the training data is provided. We also saw some online learning algorithms and provided mistake bounds for them. However, the problem with mistake bounds is that we don't know when the mistakes will occur. A more general strategy for guaranteeing learning success is to bound the generalization performance in terms of the number of training samples. Such guarantees are provided by the PAC learning framework.
PAC Learning
It refers to the situation where the goal is to Probably learn an Approximately Correct hypothesis. Here correctness is defined with respect to an error measure: the expected error under the assumption that both our training and future samples are drawn from a distribution $\mathcal{D}$,
\[
\text{Error}_{\mathcal{D}} = \mathbb{E}_{x\sim\mathcal{D}}[\mathbf{1}[f(x)\neq h(x)]] = P_{x\sim\mathcal{D}}[f(x)\neq h(x)]
\]
where $f$ is the target function and $h$ is the chosen hypothesis.
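For intuition, this error can be estimated empirically by sampling from $\mathcal{D}$. A minimal sketch, where the target $f$, hypothesis $h$, and sampler are hypothetical placeholders:

```python
import random

def estimate_error(f, h, sample, num_samples=100_000):
    """Monte Carlo estimate of Error_D = P_D[f(x) != h(x)]."""
    mistakes = sum(f(x) != h(x) for x in (sample() for _ in range(num_samples)))
    return mistakes / num_samples

# Toy setup: D is uniform over {0,1}^3, f is the conjunction x0 AND x1,
# h is the single literal x0; they disagree exactly when x0=1, x1=0 (prob 1/4).
sample = lambda: tuple(random.randint(0, 1) for _ in range(3))
f = lambda x: x[0] == 1 and x[1] == 1
h = lambda x: x[0] == 1
print(estimate_error(f, h, sample))  # ~0.25
```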
Defining PAC Learnability
A concept class $\mathcal{C}$ defined over an instance space $\mathcal{X}$ of dimensionality $n$ is said to be PAC learnable by a learner $\mathcal{L}$ using the hypothesis space $\mathcal{H}$, if for any given $0 < \epsilon, \; \delta < 1$
- $\forall \: f \in \mathcal{C}$
- $\forall \: \mathcal{D} \: \text{over} \: \mathcal{X}$
$\exists \: m$ polynomial in $1/\epsilon, 1/\delta, n$ and $|\mathcal{H}|$ such that, after seeing $m$ examples,
\[
P(\text{Error}_{\mathcal{D}} > \epsilon) < \delta
\]
This is a restriction on the sample complexity. One can also ask about time complexity, i.e., whether such an approximation can be found in polynomial time.
PAC Analysis for Learning Conjunctions
Recall that in order to learn conjunctions we eliminated falsified literals on every positive example and did nothing on negative examples. In doing so, we make mistakes only on positive examples. Such a mistake occurs when a literal $z$, which is not part of the target conjunction, has survived all previous positive examples. This can happen only if $z$ was satisfied (had value 1) in every positive example seen so far. The PAC learning framework argues that since we haven't seen a positive example that falsifies $z$ (one with $z=0$), such examples should be unlikely in the future as well. Let $p(z)$ be the probability of drawing a positive example that falsifies $z$; we then assume this probability is small,
\[p(z) < \epsilon/n
\]
Then the union bound, summing over the at most $n$ bad literals that could survive in $h$, implies
\[\text{Error}_{\mathcal{D}} \leq \sum_{z\in h} p(z) < n \cdot \frac{\epsilon}{n} = \epsilon
\]
The other possibility is that there actually are bad literals. In this case PAC argues that, by definition, a bad literal is one which does not occur in the target function but has a high probability of being falsified by a positive example ($p(z) > \epsilon/n$), and so the chance of it surviving the training data is very low. To see this, consider the probability that $z$ survives $m$ training examples:
\begin{align}
Pr(z \: \text{survives one example})^{m} &= (1-Pr(z \: \text{gets eliminated in an example}))^{m} \\
&= (1-p(z))^{m} \\
&< (1-\epsilon/n)^{m} \\
&< e^{-m\epsilon/n}
\end{align}
In the worst case there could be as many as $n$ bad literals, and a mistake could be made if any of them survives, so by the union bound the probability that the error exceeds $\epsilon$ is at most
\begin{align}Pr(\text{any bad literal survives}) = n(1-\epsilon/n)^m < ne^{-m\epsilon/n}
\end{align}
Bounding this by $\delta$ gives us a lower bound on the number of examples that need to be seen
\begin{align}ne^{-m\epsilon/n} &< \delta \\
\implies m &> \frac{n}{\epsilon}[\ln{n} + \ln(1/\delta)]
\end{align}
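Plugging in concrete numbers (values chosen purely for illustration):

```python
from math import ceil, log

n, eps, delta = 20, 0.1, 0.05
m = (n / eps) * (log(n) + log(1 / delta))  # the bound derived above
print(ceil(m))  # 1199 examples suffice for n=20, eps=0.1, delta=0.05
```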
Occam's Razor
It says that for any hypothesis $h\in H$ which is consistent with $m$ examples, the probability that its generalization error satisfies $Error_D(h)>\epsilon$ is less than $|H|(1-\epsilon)^m$. This follows because any fixed hypothesis with error greater than $\epsilon$ is consistent with $m$ i.i.d. examples with probability at most $(1-\epsilon)^m$, and the union bound is then taken over all of $H$. Note that this is often interpreted as "The simplest explanation is the best explanation", where a smaller $|H|$ encodes simplicity.
Corollary
If we want $P(Error_D(h)>\epsilon) < \delta$ then $|H|(1-\epsilon)^m < \delta$ which implies
\begin{equation}
m > \frac{1}{\epsilon}\left\{\ln(|H|)+\ln\left(\frac{1}{\delta}\right) \right\}
\end{equation}
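As a worked example, for conjunctions over $n$ boolean variables we have $|H| = 3^n$ (each variable can appear positively, appear negated, or be absent), so the corollary requires
\begin{equation}
m > \frac{1}{\epsilon}\left\{n\ln 3 + \ln\left(\frac{1}{\delta}\right)\right\}
\end{equation}
examples, which is linear in $n$, in contrast to the $n\ln n$ dependence of the direct analysis above.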
Infinite Hypothesis Spaces
If the hypothesis space is infinitely large, then Occam's Razor is not of much use. In such cases the VC dimension comes to the rescue.
VC Dimension
Defined as the cardinality of the largest set of points that can be shattered by the hypothesis class $H$. Note that this is a property of the hypothesis class. A set of points is said to be shattered by $H$ if for every partition (or labelling) of the points into positive and negative sets, there exists a hypothesis $h\in H$ consistent with that labelling.
- If some set of $d$ points can be shattered by $H$, then $VC(H) \geq d$
- If no set of $d+1$ points can be shattered by $H$, then $VC(H) < d+1$
- Together they can be used to show $VC(H) = d$, as illustrated in the sketch below
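As an illustration of this recipe, here is a brute-force shattering check for the class of intervals $[a, b]$ on the real line (label $+1$ inside, $0$ outside); the helper name is purely illustrative:

```python
from itertools import product

def interval_shatters(points):
    """Check whether intervals [a, b] shatter the given 1-D points."""
    for labels in product([0, 1], repeat=len(points)):
        positives = [p for p, y in zip(points, labels) if y == 1]
        if not positives:
            continue  # an empty interval realizes the all-negative labelling
        # Any consistent interval must contain [min, max] of the positives.
        a, b = min(positives), max(positives)
        realized = tuple(1 if a <= p <= b else 0 for p in points)
        if realized != labels:
            return False
    return True

print(interval_shatters([1.0, 2.0]))       # True: a pair of points is shattered
print(interval_shatters([1.0, 2.0, 3.0]))  # False: labelling (1, 0, 1) fails
```

Since some pair of points is shattered but no triple of distinct points can be (the alternating labelling $(+,-,+)$ always fails), $VC(\text{intervals}) = 2$.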
Relation between $VC(H)$ and $|H|$
In order to shatter $VC(H)$ points, for every labelling there must be at least one $h\in H$ consistent with it. Hence $|H| \geq 2^{VC(H)} \implies VC(H) \leq \log_2(|H|)$. A result similar to Occam's Razor exists for infinite hypothesis spaces:
\begin{equation}
m > \frac{1}{\epsilon}\left\{8VC(H)\log(13/\epsilon) + 4\log(2/\delta)\right\}
\end{equation}
Structural Risk Minimization
The training error provides a bound on the generalization error which holds with probability $1-\eta$. If $h\in H$ is the hypothesis learned by the learning algorithm, then
\begin{equation}
Error_D(h) < Error_{TR}(h) + \sqrt{\frac{VC(H)(\log(2m/VC(H)) + 1) - \log(\eta/4)}{m}}
\end{equation}
Structural risk minimization chooses the hypothesis (or hypothesis class) minimizing this bound, trading training error against the complexity penalty.
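A minimal sketch of how this bound can drive model selection; the (training error, VC dimension) pairs below are invented purely for illustration:

```python
from math import log, sqrt

def srm_bound(train_error, vc, m, eta=0.05):
    """Generalization bound: training error plus the VC complexity penalty."""
    penalty = sqrt((vc * (log(2 * m / vc) + 1) - log(eta / 4)) / m)
    return train_error + penalty

m = 10_000
candidates = [(0.30, 5), (0.10, 50), (0.08, 500)]  # (train err, VC dim)
for err, vc in candidates:
    print(f"VC={vc:4d}  train={err:.2f}  bound={srm_bound(err, vc, m):.3f}")
best = min(candidates, key=lambda c: srm_bound(*c, m))
print("SRM picks VC dimension:", best[1])  # picks 50: the middle complexity wins
```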
Overfitting
A hypothesis $h$ is said to overfit the training data if there exists another hypothesis $h'$ which has a higher training error but a lower test error. Overfitting occurs when:
- Training data is not representative of instance space
- Noisy features or labels in the training data
Boosting
A class of ensemble algorithms with strong theoretical guarantees. Boosting fell out of a theoretical question: does weak learnability imply strong learnability?
Strong and Weak PAC Algorithms
An algorithm is said to be a strong PAC algorithm if for any distribution, $\forall \; \epsilon, \; \delta$, given polynomially many random samples, the algorithm finds a hypothesis with generalization error $\leq \epsilon$ with probability $\geq 1-\delta$. An algorithm is a weak PAC algorithm if the above holds only for $\epsilon = 1/2 - \gamma$ for some fixed $\gamma > 0$, i.e., it need only do slightly better than random guessing.
Adaboost
- Construct a distribution $D_1$ (initialized to uniform) over training examples
- In each iteration
- Learn a weak hypothesis $h_t$ with error $\epsilon_t = Error_{D_t} = P_{D_t}(h_t(x)\neq y) < 1/2$
- Modify the distribution by increasing the weight of misclassified examples and decreasing the weight of correctly classified ones, using $D_{t+1}(x_i) = D_t(x_i) e^{-\alpha_t y_ih_t(x_i)}/Z_t$ where $\alpha_t = \ln\left((1-\epsilon_t)/\epsilon_t\right)/2$ and $Z_t$ is a normalization constant
- Produce the final hypothesis as $h(x) = \text{sign}\left(\sum_t \alpha_t h_t(x)\right)$
If $\epsilon_t = 1/2 - \gamma_t$, then the training error of the final hypothesis is bounded by $\exp(-2\sum_t\gamma_t^2)$.
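A minimal sketch of AdaBoost with brute-force decision stumps as the weak learners (numpy; labels are assumed to be numpy arrays with values in $\{-1,+1\}$):

```python
import numpy as np

def best_stump(X, y, D):
    """Weak learner: threshold stump minimizing weighted error under D."""
    best = None
    for j in range(X.shape[1]):
        for thr in np.unique(X[:, j]):
            for sign in (1, -1):
                pred = sign * np.where(X[:, j] <= thr, 1, -1)
                err = D[pred != y].sum()
                if best is None or err < best[0]:
                    best = (err, j, thr, sign)
    return best

def adaboost(X, y, T=20):
    m = len(y)
    D = np.full(m, 1 / m)                  # D_1: uniform over training examples
    stumps, alphas = [], []
    for _ in range(T):
        eps, j, thr, sign = best_stump(X, y, D)
        eps = max(eps, 1e-10)              # guard against a perfect stump
        alpha = 0.5 * np.log((1 - eps) / eps)
        pred = sign * np.where(X[:, j] <= thr, 1, -1)
        D = D * np.exp(-alpha * y * pred)  # up-weight mistakes, down-weight correct
        D /= D.sum()                       # normalization constant Z_t
        stumps.append((j, thr, sign))
        alphas.append(alpha)
    def h(Xq):                             # final hypothesis: sign of weighted vote
        votes = sum(a * s * np.where(Xq[:, j] <= t, 1, -1)
                    for a, (j, t, s) in zip(alphas, stumps))
        return np.sign(votes)
    return h
```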
Bagging
Another ensemble algorithm, involving the creation of multiple bootstrap replicates of the learning set and aggregation of the predictors learnt on each of them, e.g., bagged decision trees.
Random Forests (bagged trees ++) go further: not only is the data sampled for creating each tree, but at each node the candidates for splitting are chosen from a randomly sampled subset of the available attributes.
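A minimal sketch of bagging, where `base_learner` is assumed to take `(X, y)` and return a prediction function, and labels are in $\{-1,+1\}$:

```python
import numpy as np

def bagging(X, y, base_learner, B=50, seed=0):
    """Train B predictors on bootstrap replicates, aggregate by majority vote."""
    rng = np.random.default_rng(seed)
    m = len(y)
    models = []
    for _ in range(B):
        idx = rng.integers(0, m, size=m)  # bootstrap: m draws with replacement
        models.append(base_learner(X[idx], y[idx]))
    def predict(Xq):
        votes = sum(model(Xq) for model in models)
        return np.sign(votes)             # majority vote over the ensemble
    return predict
```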