## Learning Protocols

- Protocol I: Active learning, where learner queries an oracle
- Protocol II: Teacher who knows the true function provides training data
- Protocol III: Random samples from nature are the training data

Example:

An example that illustrates the difference in requirement of training data in each of these protocols in that of a monotone conjunction.

- Protocol I: $N$ examples where $N$ is the number of variables
- Protocol II: $M+1$ examples where $M$ is the number of variables in the true conjunction
- Protocol III: With limited training data we might end up with an approximation to the true conjunction.

## Online Learning

Useful in the following cases:

- If batch training has either high time or memory complexity
- If the true function changes with time

### Mistake bound algorithms

A class of online algorithms $A$ is said to be mistake bound for the concept class $C$, if in the worst case of the target function $f$ and input sequence $S$, the number of mistakes the algorithm makes is polynomial in $n$, the complexity parameter of the target function (dimensionality).

\[

M_A(C) = \underset{f \in C, S}{\text{max}}\;\; M_A(f,S) = \mathcal{O}(poly(n))

\]

### Generic mistake bound algorithms

#### CON Algorithm

- At step $i$, let the concept class consistent with previous inputs be $C_i$
- Randomly choose $f_i \in C_i$ to make the current prediction
- $C_{i+1}\subseteq C_i$ and $|C_{i+1}| = |C_i|-1$ if it makes a mistake
- The algorithm can make at most $|C|$ mistakes

#### The Halving Algorithm

- At the $i^{th}$ step, evaluate make a prediction using every $f_j \in C_i$.
- Let the final prediction given by the majority
- Then at every mistake $|C_{i+1}| \leq |C_{i}|/2$
- The algorithm can make at most $log|C|$ mistakes

While we have bounds on the mistakes, it may not be possible to efficiently learn the function.

## Perceptron

#### The Algorithm

Iterate over examples and if a mistake is made update the weight using

w = w + \eta y x

\]

Perceptron Convergence Theorem

The algorithm will converge if the data is linearly separable.

#### Perceptron Cycling Theorem

If data is not linearly separable, the algorithm will repeat the same set of weights and enter an infinite loop.

### Mistake bound for Perceptron (General)

If the data is linearly separable and $\exists \; u,\gamma \ni \|u\|=1$ and $\gamma > 0$ with

$y_i u \cdot x_i \geq \gamma \; \forall i$, then the maximum number of mistakes made by perceptron is $R^2/\gamma^2$ where $R$ is the upper bound on the norm of input features for the given training data.

### Mistake bound for Perceptron with Margin (General)

Given margin $\gamma$ define $\xi_i = \max(0,\gamma - y_i w \cdot x_i)$ and $\mathcal{D}_2 = (\sum_i \xi_i^2)^{1/2}$. Then the number of mistakes made by perceptron is bounded by $(R+\mathcal{D}_2)^/\gamma^2$

#### Question: What is the mistake bound of Perceptron algorithms for k-disjunction? We will compare it to that of Winnow later.

#### Perceptron is Stochastic Gradient Descent

The update rules for perceptron with and without margin can be derived by recognizing them as gradient descent steps where the cost function at the $i^{th}$ step is $max(0,\gamma - y_i w \cdot x_i)$. In the case of no margin $\gamma = 0$. Given this cost function the update to the weight vector is given by

\Delta w = -\eta \frac{\partial{\xi_i}}{\partial{w}}

\]

Clearly, this derivative is 0 when $y_i w \cdot x_i > \gamma$ and its $-y_i w_i$ when the point falls inside or on the wrong side of current margin.

## Winnow

Similar to perceptron but with multiplicative update rule:

- Initialize $w = 1$ and $\theta=n$ ($\theta$ never changes)
- If $y_i=1$ but $w\cdot x_i < \theta$, for all $i \ni x_i=1$ set $w_i\leftarrow 2w_i$ (
*promotion*) - If $y_i=0$ but $w_i\cdot x_i> \theta$, for all $i \ni x_i=1$ set $w_i \leftarrow w_i/2$ (
*demotion*)

Update rule can be slightly modified for certain concept classes. For instance, in learning monotone disjunctions demotion can be replaced by elimination.

### Mistake bound for Winnow (k-disjunction)

Let $u$ and $v$ be the number of mistakes made on positive and negative examples respectively.

Observe that for k-disjunction, the weight of any variable that is in the disjunction (good variable) never decreases but doubles on every mistake on a positive example. Winnow will stop making mistakes on positive examples when the weights corresponding to good variables reaches $n$ which will take $\log{n}$ time for each variable and hence $u = k\log{n}$.

For mistakes on negative examples, we will first need to bound the total weight $TW(t)$by some linear function of $u$ and $v$. Then since we know that $TW(t) > 0$, we will have a bound on $v$ in terms of $u$. So observe that

- For mistakes on positive examples $TW(t+1) < TW(t) + n$
- For mistakes on negative examples $TW(t+1) < TW(t) - n/2$

Now we see that $0<TW(t) < n + un - vn/2 \implies v < 2(u+1)$.

The total number of mistakes is bounded by $u+v = 3u+2 = \mathcal{O}(k\log{n})$.

#### Limitation of Winnow

Winnow works only for monotone functions. For the general case, duplicate variables and learn monotone function over 2n variables.

**Is there a downside to this?**

If the weight of $x$ goes up then we also know that the weight of $\neg x$ should go down but winnow update rules will only change one of them at a time. A better way is to keep 2 weights for each of the $n$ variables such that the effective weight is their difference.

- If $y=1$ but $(w^+ - w^-)\cdot x \leq \theta$, $w_i^+ = 2w_i^+$ and $w_i^- = w_i^-/2$ where $x_i=1$
- If $y=0$ but $(w^+ - w^-)\cdot x \geq \theta$, $w_i^+ = w_i^+/2$ and $w_i^- = 2w_i^-$ where $x_i=1$

#### Winnow - R

Version of winnow robust to moving target (Adaptation). Simply do not let weights go below $1/2$. This has a mistake bound of $5u+4$ as compared to $3u+2$ but in case of moving target makes $\mathcal{O(c\log{n})})$ where $c$ is the cost to the adversary which in each iteration is allowed to add or drop variables from the disjunction.

## Loss Functions and Regularizations

Let a training sample be denoted by $z_i = (x_i, y_i)$. Then any convex differentiable functions $Q(z,w)$ and $R(w)$ can be minimized over $w$ to learn a classifier. Here $Q$ drives the learning and $R$ ensure the

*simplicity*of the solution.**LMS:**$Q(z,w) = (y-w\cdot z)^2$**Ridge Regression:**$R(w) = \|w\|_2^2$**LASSO:**$R(w) = \|w\|_1$**Hinge Loss:**$Q(z,w) = max(0,1-yw\cdot x)$**SVM:**$R(w) = \|w\|_2^2$**Logistic Loss:**$Q(z,w) = \log(1+exp(-yw\cdot x))$**Logistic Regression:**$R(w) = \|w\|_2^2$