Wednesday, 14 October 2015

Machine Learning (Notes from Prof. Dan Roth's Lectures at UIUC): PART 1

Learning Protocols

  1. Protocol I: Active learning, where learner queries an oracle
  2. Protocol II: Teacher who knows the true function provides training data
  3. Protocol III: Random samples from nature are the training data
An example that illustrates the difference in requirement of training data in each of these protocols in that of a monotone conjunction. 
  1. Protocol I: $N$ examples where $N$ is the number of variables
  2. Protocol II: $M+1$ examples where $M$ is the number of variables in the true conjunction
  3. Protocol III: With limited training data we might end up with an approximation to the true conjunction. 

Online Learning

Useful in the following cases:
  1. If batch training has either high time or memory complexity
  2. If the true function changes with time

Mistake bound algorithms

A class of online algorithms $A$ is said to be mistake bound for the concept class $C$, if in the worst case of the target function $f$ and input sequence $S$, the number of mistakes the algorithm makes is polynomial in $n$, the complexity parameter of the target function (dimensionality).

M_A(C) = \underset{f \in C, S}{\text{max}}\;\; M_A(f,S) = \mathcal{O}(poly(n))

Generic mistake bound algorithms

CON Algorithm

  • At step $i$, let the concept class consistent with previous inputs be $C_i$
  • Randomly choose $f_i \in C_i$ to make the current prediction
  • $C_{i+1}\subseteq C_i$ and $|C_{i+1}| = |C_i|-1$ if it makes a mistake
  • The algorithm can make at most $|C|$ mistakes

The Halving Algorithm

  • At the $i^{th}$ step, evaluate make a prediction using every $f_j \in C_i$.
  • Let the final prediction given by the majority
  • Then at every mistake $|C_{i+1}| \leq |C_{i}|/2$
  • The algorithm can make at most $log|C|$ mistakes
While we have bounds on the mistakes, it may not be possible to efficiently learn the function.


The Algorithm

Iterate over examples and if a mistake is made update the weight using
w = w + \eta y x

Perceptron Convergence Theorem
The algorithm will converge if the data is linearly separable.

Perceptron Cycling Theorem

If data is not linearly separable, the algorithm will repeat the same set of weights and enter an infinite loop. 

Mistake bound for Perceptron (General)

If the data is linearly separable and $\exists \; u,\gamma \ni \|u\|=1$ and $\gamma > 0$ with 
$y_i u \cdot x_i \geq \gamma \; \forall i$, then the maximum number of mistakes made by perceptron is $R^2/\gamma^2$ where $R$ is the upper bound on the norm of input features for the given training data.

Mistake bound for Perceptron with Margin (General)

Given margin $\gamma$ define $\xi_i = \max(0,\gamma - y_i w \cdot x_i)$ and $\mathcal{D}_2 = (\sum_i \xi_i^2)^{1/2}$. Then the number of mistakes made by perceptron is bounded by $(R+\mathcal{D}_2)^/\gamma^2$

Question: What is the mistake bound of Perceptron algorithms for k-disjunction? We will compare it to that of Winnow later. 

Perceptron is Stochastic Gradient Descent

The update rules for perceptron with and without margin can be derived by recognizing them as gradient descent steps where the cost function at the $i^{th}$ step is $max(0,\gamma - y_i w \cdot x_i)$. In the case of no margin $\gamma = 0$. Given this cost function the update to the weight vector is given by
\Delta w = -\eta \frac{\partial{\xi_i}}{\partial{w}}
Clearly, this derivative is 0 when $y_i w \cdot x_i > \gamma$ and its $-y_i w_i$ when the point falls inside or on the wrong side of current margin.


Similar to perceptron but with multiplicative update rule:
  • Initialize $w = 1$ and $\theta=n$ ($\theta$ never changes)
  • If $y_i=1$ but $w\cdot x_i < \theta$, for all $i \ni x_i=1$ set $w_i\leftarrow 2w_i$ (promotion)
  • If $y_i=0$ but $w_i\cdot x_i> \theta$, for all $i \ni x_i=1$ set $w_i \leftarrow w_i/2$ (demotion)
Update rule can be slightly modified for certain concept classes. For instance, in learning monotone disjunctions demotion can be replaced by elimination. 

Mistake bound for Winnow (k-disjunction)

Let $u$ and $v$ be the number of mistakes made on positive and negative examples respectively. 

Observe that for k-disjunction, the weight of any variable that is in the disjunction (good variable) never decreases but doubles on every mistake on a positive example. Winnow will stop making mistakes on positive examples when the weights corresponding to good variables reaches $n$ which will take $\log{n}$ time for each variable and hence $u = k\log{n}$.

For mistakes on negative examples, we will first need to bound the total weight $TW(t)$by some linear function of $u$ and $v$. Then since we know that $TW(t) > 0$, we will have a bound on $v$ in terms of $u$. So observe that 
  • For mistakes on positive examples $TW(t+1) < TW(t) + n$
  • For mistakes on negative examples $TW(t+1) < TW(t) - n/2$ 
Now we see that $0<TW(t) < n + un - vn/2 \implies v < 2(u+1)$.
The total number of mistakes is bounded by $u+v = 3u+2 = \mathcal{O}(k\log{n})$. 

Limitation of Winnow

Winnow works only for monotone functions. For the general case, duplicate variables and learn monotone function over 2n variables. 

Is there a downside to this?
If the weight of $x$ goes up then we also know that the weight of $\neg x$ should go down but winnow update rules will only change one of them at a time. A better way is to keep 2 weights for each of the $n$ variables such that the effective weight is their difference.

  • If $y=1$ but $(w^+ - w^-)\cdot x \leq \theta$, $w_i^+ = 2w_i^+$ and $w_i^- = w_i^-/2$ where $x_i=1$
  • If $y=0$ but $(w^+ - w^-)\cdot x \geq \theta$, $w_i^+ = w_i^+/2$ and $w_i^- = 2w_i^-$ where $x_i=1$

Winnow - R

Version of winnow robust to moving target (Adaptation). Simply do not let weights go below $1/2$. This has a mistake bound of $5u+4$ as compared to $3u+2$ but in case of moving target makes $\mathcal{O(c\log{n})})$ where $c$ is the cost to the adversary which in each iteration is allowed to add or drop variables from the disjunction. 

Loss Functions and Regularizations 

Let a training sample be denoted by $z_i = (x_i, y_i)$. Then any convex differentiable functions $Q(z,w)$ and $R(w)$ can be minimized over $w$ to learn a classifier. Here $Q$ drives the learning and $R$ ensure the simplicity of the solution. 
  1. LMS: $Q(z,w) = (y-w\cdot z)^2$
    • Ridge Regression: $R(w) = \|w\|_2^2$
    • LASSO: $R(w) = \|w\|_1$
  2. Hinge Loss: $Q(z,w) = max(0,1-yw\cdot x)$
    • SVM: $R(w) = \|w\|_2^2$
  3. Logistic Loss: $Q(z,w) = \log(1+exp(-yw\cdot x))$
    • Logistic Regression: $R(w) = \|w\|_2^2$