Monday, 19 October 2015

Machine Learning (Notes from Prof. Dan Roth's Lectures at UIUC): PART 2

In PART 1 we have already seen different paradigms of learning with respect to how the training data is provided. We also saw some online learning algorithms and provided mistake bounds for them. However, the problem with mistake bounds is that we don't know when the mistakes would occur. So a more general strategy to provide some guarantee on learning success is to bound the generalization performance by the number of training samples. Such guarantees are provided by the PAC learning framework.

PAC Learning

It refers to the situation where the goal is to Probably learn an Approximately Correct hypothesis. Here correctness is defined with respect to an error measure which is the expected error given that both our training and future samples belong to a distribution $\mathcal{D}$
$\text{Error}_{\mathcal{D}} = \mathbb{E}_{\mathcal{D}}[f(x)\neq h(x)] = P_{\mathcal{D}}[f(x)\neq h(x)]$
where $f$ is the target function and $h$ is the chosen hypothesis.

Defining PAC Learnability

A concept class $\mathcal{C}$ defined over and instance space $\mathcal{X}$ of dimensionality $n$, is said to PAC learnable by $\mathcal{L}$ using the hypothesis space $\mathcal{H}$, if for a given $0 < \epsilon, \; \delta < 1$
• $\forall \: f \in \mathcal{C}$
• $\forall \: \mathcal{D} \: \text{over} \: \mathcal{X}$
$\exists \: m$ polynomial in $1/\epsilon, 1/\delta, n$ and $|\mathcal{H}|$ such that
$P(\text{Error}_{\mathcal{D}} > \epsilon) < \delta$
This is a restriction on the sample complexity. One can also think about time complexity, i.e, whether such an approximation can be found in polynomial time.

PAC Analysis for Learning Conjunctions

Recall that in order to learn conjunction we eliminated non-active literals on every positive example and did nothing on negative examples. In doing so, we shall make mistakes only on positive examples. Such mistakes occur when a literal $z$, which is not part of the target conjunction, has survived all previous positive examples. This can happen only if its value is 1 in all examples seen so far. The PAC learning framework argues that since we haven't seen a positive example with $z=1$ we won't be seeing it in future training samples as well. Thus we assume the probability of this occurrence to be small
$p(z) < \epsilon/n$
Then union bound implies
$\text{Error}_{\mathcal{D}} \leq \sum_{z\in h} p(z) = \epsilon$

Another possibility is that there actually are bad literals. In this case PAC argues that by definition ad bad literal is one which has does not occur in the target function but has high probability of occurring as a negative variable in a positive example ($p(z) > \epsilon/n$)and so the chances of not seeing it in training data is very low. To show this consider the probability that $z$ survives $m$ training examples
\begin{align}
Pr(z \: \text{survives one example})^{m} &= (1-Pr(z \: \text{gets eliminated in an example}))^{m} \\
&= (1-p(z))^{m} \\
&< (1-\epsilon/n)^{m} \\
&< e^{-m\epsilon/n}
\end{align}

In the worst case since there could be at most $n$  bad literals, and mistake could be made if any of them survives, the probability of the cases where the error could be greater than $\epsilon$ is
\begin{align}
Pr(\text{any bad literal survives}) = n(1-\epsilon/n)^m < ne^{-m\epsilon/n}
\end{align}
Bounding this by $\delta$ gives us a lower bound on the number of examples that need to be seen
\begin{align}
ne^{-m\epsilon/n} &< \delta \\
\implies m &> \frac{n}{\epsilon}[\ln{n} + \ln(1/\delta)]
\end{align}

Occam's Razor

It says that for any hypothesis $h\in H$ which is consistent with $m$ examples, the probability that it has a generalization error, $Error_D(h)>\epsilon$ is less than $|H|(1-\epsilon)^m$. Note that this is often interpreted as "The simplest explanation is the best explanation", where $|H|$ encodes simplicity.

Corollary

If we want $P(Error_D(h)>\epsilon) < \delta$ then $|H|(1-\epsilon)^m < \delta$ which implies

m > \frac{1}{\epsilon}\left\{ln(|H|)+ln(\frac{1}{\delta}) \right\}

Infinite Hypothesis Spaces

If the hypothesis space is infinitely large then Occam's Razor is not of much use. In such cases VC dimension comes to the rescue.

VC Dimension

Defined as the maximum cardinality set of points which can be shattered by the hypothesis class $H$. Note that this is a property of the hypothesis class. A set of points is set to be shattered by $H$ if for any partition (or labelling) of points into positive and negative sets, there exists a hypothesis $h\in H$ consistent with this labelling.
• If any set of $d$ points can be shattered by $H$, then $VC(H) \geq d$
• If no set of $d+1$ points can be shattered by $H$, then $VC(H) < d+1$
• Together they can be used to show $VC(H) = d$

Relation between $VC(H)$ and $|H|$

In order to shatter $VC(H)$ points, for every labelling there must be at least one $h\in H$ consistent with it. Hence $|H| \geq 2^{VC(H)} \implies VC(H) \leq log(|H|)$. A result similar to Occam's Razor exists for infinite hypothesis spaces

m > \frac{1}{\epsilon}\{8VC(H)log(13/\epsilon) + 4log(2/\delta)\}

Structural Risk Minimization

The training error provides a bound on the generalization error which holds with probability $1-\eta$. If $h\in H$ is hypothesis learned by the learning algorithm then

Error_D(h) < Error_{TR}(h) + \sqrt{\frac{VC(H)(log(2m/VC(H)) + 1) - log(\eta/4)}{m}}

Overfitting

A hypothesis $h$ is said to overfit the training data if there exists another hypothesis $h'$ which has a higher training error but a lower test error. Overfitting occurs when:
• Training data is not representative of instance space
• Noisy features or labels in the training data

Boosting

A class of ensemble algorithms with strong guarantees. Fall out of a theoretical question - Does weak learnability imply strong learnability?

Strong and Weak PAC Algorithms

An algorithm is said to be Strong PAC Algorithm if for any distribution, $\forall \; \epsilon, \; \delta$, given polynomially many random samples, the algorithm finds a hypothesis with generalization error $\leq \epsilon$ with probability $\geq 1-\delta$. An algorithm is Weak PAC Algorithm if the above holds for $\epsilon \leq 1/2 - \gamma$.

1. Construct a distribution $D_1$ (initialized to uniform) over training examples
2. In each iteration
• Learn a weak hypothesis $h_t$ with $Error_{D_t} = P_{D_t}(h_t(x)\neq y) < 1/2$
• Modify the distribution by increasing the weight of misclassified examples and vice-versa using $D_{t+1}(x_i) = D_t(x_i) e^{-\alpha_t y_ih_t(x_i)}/Z_t$ where $\alpha_t = ln\left((1-\epsilon)/\epsilon\right)/2$
3. Produce the final hypothesis as $h(x) = \sum_t \alpha_t h_t(x)$
If $\epsilon_t = 1/2 - \gamma_t$, then training error of the final hypothesis is bounded by $\exp(-2\sum_t\gamma_t^2)$.

Bagging

Another ensemble algorithm involving creation of multiple bootstrap replicates of the learning set and aggregating the predictors learnt on each of these. Eg - Bagged Decision Trees

Random Forests (Bagged Trees ++) where not only is the data sample for creating each tree but also for each node candidates for splitting are chosen from a randomly sampled subset of available attributes.

Learning Protocols

1. Protocol I: Active learning, where learner queries an oracle
2. Protocol II: Teacher who knows the true function provides training data
3. Protocol III: Random samples from nature are the training data
Example:
An example that illustrates the difference in requirement of training data in each of these protocols in that of a monotone conjunction.
1. Protocol I: $N$ examples where $N$ is the number of variables
2. Protocol II: $M+1$ examples where $M$ is the number of variables in the true conjunction
3. Protocol III: With limited training data we might end up with an approximation to the true conjunction.

Online Learning

Useful in the following cases:
1. If batch training has either high time or memory complexity
2. If the true function changes with time

Mistake bound algorithms

A class of online algorithms $A$ is said to be mistake bound for the concept class $C$, if in the worst case of the target function $f$ and input sequence $S$, the number of mistakes the algorithm makes is polynomial in $n$, the complexity parameter of the target function (dimensionality).

$M_A(C) = \underset{f \in C, S}{\text{max}}\;\; M_A(f,S) = \mathcal{O}(poly(n))$

Generic mistake bound algorithms

CON Algorithm

• At step $i$, let the concept class consistent with previous inputs be $C_i$
• Randomly choose $f_i \in C_i$ to make the current prediction
• $C_{i+1}\subseteq C_i$ and $|C_{i+1}| = |C_i|-1$ if it makes a mistake
• The algorithm can make at most $|C|$ mistakes

The Halving Algorithm

• At the $i^{th}$ step, evaluate make a prediction using every $f_j \in C_i$.
• Let the final prediction given by the majority
• Then at every mistake $|C_{i+1}| \leq |C_{i}|/2$
• The algorithm can make at most $log|C|$ mistakes
While we have bounds on the mistakes, it may not be possible to efficiently learn the function.

Perceptron

The Algorithm

Iterate over examples and if a mistake is made update the weight using
$w = w + \eta y x$

Perceptron Convergence Theorem
The algorithm will converge if the data is linearly separable.

Perceptron Cycling Theorem

If data is not linearly separable, the algorithm will repeat the same set of weights and enter an infinite loop.

Mistake bound for Perceptron (General)

If the data is linearly separable and $\exists \; u,\gamma \ni \|u\|=1$ and $\gamma > 0$ with
$y_i u \cdot x_i \geq \gamma \; \forall i$, then the maximum number of mistakes made by perceptron is $R^2/\gamma^2$ where $R$ is the upper bound on the norm of input features for the given training data.

Mistake bound for Perceptron with Margin (General)

Given margin $\gamma$ define $\xi_i = \max(0,\gamma - y_i w \cdot x_i)$ and $\mathcal{D}_2 = (\sum_i \xi_i^2)^{1/2}$. Then the number of mistakes made by perceptron is bounded by $(R+\mathcal{D}_2)^/\gamma^2$

Question: What is the mistake bound of Perceptron algorithms for k-disjunction? We will compare it to that of Winnow later.

The update rules for perceptron with and without margin can be derived by recognizing them as gradient descent steps where the cost function at the $i^{th}$ step is $max(0,\gamma - y_i w \cdot x_i)$. In the case of no margin $\gamma = 0$. Given this cost function the update to the weight vector is given by
$\Delta w = -\eta \frac{\partial{\xi_i}}{\partial{w}}$
Clearly, this derivative is 0 when $y_i w \cdot x_i > \gamma$ and its $-y_i w_i$ when the point falls inside or on the wrong side of current margin.

Winnow

Similar to perceptron but with multiplicative update rule:
• Initialize $w = 1$ and $\theta=n$ ($\theta$ never changes)
• If $y_i=1$ but $w\cdot x_i < \theta$, for all $i \ni x_i=1$ set $w_i\leftarrow 2w_i$ (promotion)
• If $y_i=0$ but $w_i\cdot x_i> \theta$, for all $i \ni x_i=1$ set $w_i \leftarrow w_i/2$ (demotion)
Update rule can be slightly modified for certain concept classes. For instance, in learning monotone disjunctions demotion can be replaced by elimination.

Mistake bound for Winnow (k-disjunction)

Let $u$ and $v$ be the number of mistakes made on positive and negative examples respectively.

Observe that for k-disjunction, the weight of any variable that is in the disjunction (good variable) never decreases but doubles on every mistake on a positive example. Winnow will stop making mistakes on positive examples when the weights corresponding to good variables reaches $n$ which will take $\log{n}$ time for each variable and hence $u = k\log{n}$.

For mistakes on negative examples, we will first need to bound the total weight $TW(t)$by some linear function of $u$ and $v$. Then since we know that $TW(t) > 0$, we will have a bound on $v$ in terms of $u$. So observe that
• For mistakes on positive examples $TW(t+1) < TW(t) + n$
• For mistakes on negative examples $TW(t+1) < TW(t) - n/2$
Now we see that $0<TW(t) < n + un - vn/2 \implies v < 2(u+1)$.
The total number of mistakes is bounded by $u+v = 3u+2 = \mathcal{O}(k\log{n})$.

Limitation of Winnow

Winnow works only for monotone functions. For the general case, duplicate variables and learn monotone function over 2n variables.

Is there a downside to this?
If the weight of $x$ goes up then we also know that the weight of $\neg x$ should go down but winnow update rules will only change one of them at a time. A better way is to keep 2 weights for each of the $n$ variables such that the effective weight is their difference.

• If $y=1$ but $(w^+ - w^-)\cdot x \leq \theta$, $w_i^+ = 2w_i^+$ and $w_i^- = w_i^-/2$ where $x_i=1$
• If $y=0$ but $(w^+ - w^-)\cdot x \geq \theta$, $w_i^+ = w_i^+/2$ and $w_i^- = 2w_i^-$ where $x_i=1$

Winnow - R

Version of winnow robust to moving target (Adaptation). Simply do not let weights go below $1/2$. This has a mistake bound of $5u+4$ as compared to $3u+2$ but in case of moving target makes $\mathcal{O(c\log{n})})$ where $c$ is the cost to the adversary which in each iteration is allowed to add or drop variables from the disjunction.

Loss Functions and Regularizations

Let a training sample be denoted by $z_i = (x_i, y_i)$. Then any convex differentiable functions $Q(z,w)$ and $R(w)$ can be minimized over $w$ to learn a classifier. Here $Q$ drives the learning and $R$ ensure the simplicity of the solution.
1. LMS: $Q(z,w) = (y-w\cdot z)^2$
• Ridge Regression: $R(w) = \|w\|_2^2$
• LASSO: $R(w) = \|w\|_1$
2. Hinge Loss: $Q(z,w) = max(0,1-yw\cdot x)$
• SVM: $R(w) = \|w\|_2^2$
3. Logistic Loss: $Q(z,w) = \log(1+exp(-yw\cdot x))$
• Logistic Regression: $R(w) = \|w\|_2^2$

Curves

Parametrization

Curve is an embedding of a line into higher dimensional space

\begin{align}
&\vec{x}: t \rightarrow \mathbb{R}^n \\
&\vec{x}(t) = (x(t), y(t), z(t), \cdots)
\end{align}

Special Case: Arc Length Parametrization

Usually denoted by $s$. In arc length parametrization, a small step $ds$ in the parameter space amounts to moving the same distance along the curve. For any other parametrization, $t$, a step $dt$ in the parameter space leads to covering a distance

\begin{align}
\|d\vec{x}\| = \left(\sqrt{\left(\frac{dx_1}{dt}\right)^2 + \left(\frac{dx_2}{dt}\right)^2 + \cdots} \right) \cdot dt
\end{align}

By comparison for arc length parametrization
\begin{align}
\sqrt{\left(\frac{dx_1}{ds}\right)^2 + \left(\frac{dx_2}{ds}\right)^2 + \cdots} &= 1 \\
\implies \|\frac{d\vec{x}}{ds}\| &= 1
\end{align}

Tangent to the curve at a point

Tangent to a curve at $t_0$ is defined by
\begin{align}
\frac{d\vec{x}}{dt}|_{t=t_0}
\end{align}

This can be thought of as a vector in the direction of a ray originating at $\vec{x}(t_0)$ which is in 2 point contact with the curve in the limit of the 2 points being arbitrarily close together.

A unit tangent is usually denoted by $\vec{T}$ and is given by
\begin{align}
\vec{T} = \frac{d\vec{x}/dt}{\|d\vec{x}/dt\|}
\end{align}

For arc length parametrization this reduces to
\begin{align}
\vec{T} = \frac{d\vec{x}/ds}{\|d\vec{x}/ds\|} = \frac{d\vec{x}}{ds}
\end{align}

Curvature at a point on the curve

For a curve parametrized by $s$, rate of change of tangent is along the normal direction and its magnitude is  governed by curvature, $\kappa(s)$. This relation is given by

\begin{align}
\frac{d\vec{T}}{ds} = \kappa(s)\vec{N}(s)
\end{align}

where $\vec{N}(s)$ is a unit normal.

Binormal and Torsion

For plane curves $\vec{T}$ and $\vec{N}$ define a moving coordinate system. For space curves there is an entire family of directions orthogonal to $\vec{T}$ and hence a binormal $\vec{B}$ is also defined. At a point, the space curve is locally planar and $\vec{N}$ is chosen to lie in this plane such that $\vec{N}.\vec{T}=0$ and then $\vec{B} = \vec{T} \times \vec{N}$. For such a coordinate system, for an infinitesimal change in $s$, $\vec{B}$ swings in the direction of the $\vec{N}$ and the magnitude of the rate of this change is governed by torsion, $\tau(s)$. The following equation summarizes the relation between the rate of change of each of these unit vectors

\begin{align}
\frac{d}{ds} \left[
\begin{array}{c}
T \\
N\\
B\\
\end{array}
\right] =
\left[

\begin{array}{ccc}
0 & \kappa & 0\\
-\kappa & 0 & \tau\\
0 & -\tau & 0\\
\end{array}
\right]
\left[
\begin{array}{c}
T \\
N\\
B\\
\end{array}
\right]
\end{align}

Surfaces

A surface is a mapping  $\vec{x} : \mathbb{R}^2 \rightarrow \mathbb{R}^3$.

Gauss Map

Its a mapping from a point on the surface to a point on the sphere which has the same normal. Since the normal at a point $\vec{z}$ on the unit sphere is given by the $\vec{z}$ itself, Gauss map just tells us the normal at every point $\vec{x}$ on the surface $\vec{N}(\vec{x})$.

Gaussian Curvature

Defined as
\begin{align}
\kappa = \underset{radius\rightarrow 0} \lim \frac{\text{Area on Gauss Map}}{\text{Area on Surface}}
\end{align}

Based on sign of $\kappa$ surfaces can be classified as:
1. Hyperbolic: $\kappa < 0$
2. Parabolic: $\kappa = 0$
3. Elliptic: $\kappa > 0$

Note:

• Bending does not change $\kappa$. Area need to be added or subtracted in order to change it.
• At each point on the surface there are 2 orthogonal directions in which directional curvature is extremal. These are known as principal curvatures, $\kappa_1, \kappa_2$
• Given the principal curvatures a surface can locally be written as \begin{align}
\vec{x}(s,t) = \left(s,t,\frac{1}{2}(\kappa_1s^2 + \kappa_2s^2)+O(3)\right)
\end{align}
• Gaussian curvature $= \kappa_1 \cdot \kappa_2$
• Mean curvature $= \frac{\kappa_1 + \kappa_2}{2}$

Tangents to the surface at a point

There is a vector subspace of tangents at a point with a basis defined by
\begin{align}
\{ \frac{\partial \vec{x}}{\partial s}, \frac{\partial \vec{x}}{\partial t} \} = \{x_s, x_t\}
\end{align}

The First Fundamental Form

It is an operator defined at every point on the surface that can be used to measure dot product between tangents at that point. Any tangent vector $\vec{u}$ can be written in terms of its basis
$\vec{u} = a\vec{x}_s + b\vec{x}_t$

Similarly we can write any other tangent vector $\vec{u}$ as
$\vec{v} = c\vec{x}_s + d\vec{x}_t$

Then their dot product is given by
$I(\vec{u}, \vec{v}) = \left[ \begin{array}{cc} a & b \\ \end{array} \right] \left[ \begin{array}{cc} \vec{x}_s \cdot \vec{x}_s & \vec{x}_s \cdot \vec{x}_t \\ \vec{x}_s \cdot \vec{x}_s & \vec{x}_t \cdot \vec{x}_t \\ \end{array} \right] \left[ \begin{array}{c} c \\ d \\ \end{array} \right]$

The Second Fundamental Form

If we think of $\vec{N}(\vec{x})$ as a map from a point on the surface to a point on the sphere (the Gauss Map), then it would have a directional derivative $d\vec{N}(\vec{u})$  which is a map from the tangent plane to the surface to the tangent plane to the sphere. Some properties:
1. $\vec{N}_s\cdot \vec{N} = 0$ and $\vec{N}_t\cdot \vec{N} = 0$ (follows by differentiating $\vec{N} \cdot \vec{N} = 1$)
2. $d\vec{N}(\vec{u}) = a\vec{N}_s + b\vec{N}_t$
3. It follows from above 2 properties that $d\vec{N}(u)$ lies in the tangent plane to the surface
Now the second fundamental form is defined as
$II(\vec{u}, \vec{v}) = -I(d\vec{N}(\vec{u}),\vec{v})$

which turns out to be

$II(\vec{u}, \vec{v}) = \left[ \begin{array}{cc} a & b \\ \end{array} \right] \left[ \begin{array}{cc} \vec{N} \cdot \vec{x}_{ss}& \vec{N} \cdot \vec{x}_{st} \\ \vec{N} \cdot \vec{x}_{st} & \vec{N} \cdot \vec{x}_{tt} \\ \end{array} \right] \left[ \begin{array}{c} c \\ d \\ \end{array} \right]$

Key results

1. Gaussian Curvature, $\kappa = det(-I^{-1}II)$
2. Mean curvature, $H = trace(-I^{-1}II)$
3. These are local geometric properties of the surface which are invariant under
• Rigid Motion (because $I$ and $II$ do not change)
• Reparametrization $\vec{x}(s(u,v), t(u,v))$ (non-trivially compute $I$ and $II$ with respect to $(u,v)$ and use the above equations)

Isometry

A transformation $\psi$ that maps one surface $S_1$ to another $S_2$ is called an isometry iff lengths along the surface (and hence the angles) are preserved.

Intrinsic Properties

Geometric properties invariant under an isometry are called Intrinsic properties.

Lemma:

Given an isometry $\psi: S_1 \rightarrow S_2$, there is a parametrization of $S_2$ such that $I$ does not change.

Corollary:

Any property that can be expressed in terms of elements of $I$ (or their derivatives) is intrinsic.

Theorem Egregium:

Gaussian curvature is intrinsic.

Corollary:

Isometric surfaces have same gaussian curvature at corresponding points.

Contour generator

A contour generator on S is a curve where rays from a chosen point $P$ are tangent to the surface. Related but different:
1. Outline: Projection of a contour generator (including invisible regions) into an image plane
2. Silhouette: Visible exterior points of the outline

Contour generator on an implicit surface as a locus

For an implicit surface $\phi(\vec{x}) = 0$, contour generator w.r.t external point $\vec{p}$,  is the locus of points which satisfy the following conditions:
1. $\phi(\vec{x}_1) = 0$
2. $\nabla\phi^T \vec{p} - \nabla \phi^T \vec{x} = 0$ (follows from the fact that tangent plane at a point $\vec{x}_0$ is given by $\nabla\phi^T \vec{x} - \nabla \phi^T \vec{x}_0 = 0$ and the point $\vec{p}$ lies on this plane for every point on the C.G)