Notes on Computer Science: 2015

Sunday, 29 November 2015

Differential Geometry (Notes from Prof. David Forsyth's lectures at UIUC): PART 2

Reimannian Geometry

Extrinsic vs Intrinsic Geometry

Prior to Riemann, geometry was studied from an extrinsic point of view, i.e, lines, curves and surfaces were considered to be embedded in a Euclidean space which adhered to the usual notions of distances, angles, translation and rotations. But Gauss showed through his Theorema Egregium, that Gaussian curvature is intrinsic, that is it can be computed by measuring angles, distances and their rates on the surface without considering the embedding itself (recall that $\kappa=\det(-I^{-1}II)$). That means that the embedding can often be ignored while studying certain geometric properties which are intrinsic.

But this means, that properties like Gaussian curvature are invariant to isometry, i.e. deformations that bend the surface without stretching or contraction since local geometric properties are preserved. This in turn implies that a sphere could never be unwrapped onto a plane surface without stretching or contraction of the surface because a plane has a 0 Gaussian curvature whereas a sphere has a Gaussian curvature of 1. This basically says you cannot map all of earth in a single chart without messing up distances between far away cities or areas of continents? So what do we do? We build an atlas consisting of multiple charts where each chart provides a good approximation of local distances and angles in a specific location on the earth. The motivation behind Reimannian geometry and the concept of Manifold is similar.

Now let us define a few terms in a top down fashion - earlier definitions use terms or details that are defined later on.

Riemannian Geometry

Branch of differential geometry that deals with Riemannian Manifolds.

Riemannian Manifold

A smooth manifold $M$ equipped with an inner product $g_p$ (Riemannian metric such as $I$) on tangent space (vector space of tangents at a point) $T_pM$ at each point $p$ that varies smoothly from point to point. Meaning that if $X$ and $Y$ are vector fields on $M$, then $g_p(X(p),Y(p))$ is a smooth function.

Manifold

A topological space that locally resembles a Euclidean space. Meaning that each point of the $n$-dimensional manifold is homeomorphic to a Euclidean space of the same dimension.

Homeomorphism

A function $\phi$ between topological spaces $U$ and $\Omega$ in $\mathbb{R}^n$ such that

$\phi$ is one-to-one and onto
$\phi$ is continuous
$f^{-1}$ is continuous

$C^p$ Atlas

A $C^p$ atlas on $M$ is given by

An open cover $\{U_i | i \in I\}$ of $M$
A family of homeomorphisms $\phi_i : U_i \rightarrow \Omega_i$
For all $i,j\in I$, $\phi_j \circ \phi_i^{-1}$ is a $C^p$ diffeomorphism (differentiable homeomorphism) from $\phi_i(U_i \cap U_j)$ to $\phi_j(U_i \cap U_j)$

$(U_i, \phi_i)$ together define a $chart$.

Wednesday, 11 November 2015

Search Algorithms in AI

Search in AI

Many problems in AI can be formulated as finding a sequence of actions that lead us from the current state to one of the goal states, for instance finding the shortest path through a maze. Such problems can be categorized as follows:

Based on the nature of search space: Discrete vs Continuous
Based on the type of percept or information available: Uninformed vs Informed
Based on the type of solution desired: Global vs Local
Based on ability to find optimal solution: Complete vs Incomplete
Based on the environment: Deterministic vs Non-deterministic
Miscellaneous: Adversarial

Searching as traversing a tree

When the search space is discrete it is possible to enumerate all possible sequence of states that can be reached from a given state. This is often done in the form of a tree where the nodes represent states and branches represent actions that can be taken at a particular state. Leaves of such a tree are reached as a result of taking a finite sequence of actions that either end in a goal state or a dead-end. Goal states might also occur at non-leaf nodes. Any path from the root node to the goal node is a valid solution sequence but often we are looking for shortest sequence that gets us to the goal. If at any state the maximum number of possible actions/branches is $b$ and the maximum length of action sequences is $l$, then searching for the goal can take upto $\mathcal{O}(b^l)$ time. Some brute force search algorithms are:

Breadth-First Search: Expand the node at the shallowest depth
Depth-First Search: Expand the node at the deepest depth
Uniform-Cost Search (Dijkstra): Expand the node with the smallest cost of reaching it from the root
Iterative Deepening Search: Repeatedly run DFS with increasing depth limits
Bidirectional Search: Search from both start and goal until the paths meet

In cases where no additional information is available about the states these are the only options. However, when additional information is available to guide the search other possibilities exist:

A*: Expand the node $n$, with the lowest $C(n)+h(n)$ where $C(n)$ is the cost of reaching that node and $h(n)$ is an estimate of closeness to goal. This algorithm is complete iff $h(.)$ is admissible, meaning that it never overestimates the distance to goal, for instance straight line distance to the target on a map.
Branch & Bound: Perform an inorder traversal of tree while updating the bounds on the optimal solution based on the information available at the node each time it is visited. If the bound at node $n$ does not have an intersection with its parent then prune the subtree.

Searching as exploring the energy landscape

In discrete or combinatorial optimization problems with some kind of structure and an available energy function to measure the goodness of the solution (more energy meaning a better solution), one can explore the local neighborhood of a current state to incrementally improve the solution. Some algorithms in this domain are:

Hill Climbing: Move the neighboring state which results in maximum increase in energy until no neighbor can increase the energy
Simulated Annealing: Randomly choose a neighboring state and if it the change in energy $\nabla E$ is positive then accept the move. Else accept the move with probability $e^{\nabla E/T}$ where T decreases with each iteration
Local Beam Search: Keep track of $k$ best states rather than just $1$. In each iteration generate all $k$ successors of all $k$ states and select the $k$ best
Genetic Algorithms: Begin with a population. Assign a fitness score to each member. Choose pairs for cross over with probability proportional to the fitness score. Merge/Cross over the members in each pair. Mutate the result and iterate

Searching in continuous spaces

Two simplest algorithms are:

Gradient Ascent: $x \leftarrow x + \alpha\nabla f$
Newton Raphson: Used to find zeros of a function. Maxima occurs when the gradient is $0$ leading to $x = x - H^{-1}(x)\nabla f(x)$

Searching with non-deterministic actions

If actions can lead to stochastic transition behaviour, then contingencies need to be planned for. This usually amounts to constructing an AND-OR search tree and modified version of DFS algorithm to look for goal while handling cycles appropriately. Some of the paths from the root lead to loops while others lead to goal state. One needs to find the path which avoids the loop.

Searching in presence of adversary

Searching in presence of adversary amounts to playing a two player, zero sum game (utility values at the end of the game are always equal and opposite). Optimal decision can be taken by using the minimax algorithm. First a state space search tree is created. The minimax value (utility of being in a state assuming both players play optimally from there to the end of the game) at each node is then defined as being one of the following:

Terminal state: Predefined for instance $\{0, 1, 0.5\}$ for outcomes {Checkmate, Win, Draw} in chess
Our (max's) turn: Choose the action that leads to a state of maximum minimax
Adversary's turn: Choose the action that lead to a state of minimum minimax

However, sometimes the search tree might be exponentially large to enumerate all possible state sequences. Alpha-Beta Pruning can be used to prune out subtrees early on avoiding further search along those paths. It is a form of Branch & Bound algorithm where an inorder-DFS-traversal is performed in the tree and whenever a node is revisited the bound on the minimax value is updated based on the children that have been observed so far. If the bound on a node has a null intersection with its parent then the subtree situated at that node is pruned out.

Monday, 19 October 2015

Machine Learning (Notes from Prof. Dan Roth's Lectures at UIUC): PART 2

In PART 1 we have already seen different paradigms of learning with respect to how the training data is provided. We also saw some online learning algorithms and provided mistake bounds for them. However, the problem with mistake bounds is that we don't know when the mistakes would occur. So a more general strategy to provide some guarantee on learning success is to bound the generalization performance by the number of training samples. Such guarantees are provided by the PAC learning framework.

PAC Learning

It refers to the situation where the goal is to Probably learn an Approximately Correct hypothesis. Here correctness is defined with respect to an error measure which is the expected error given that both our training and future samples belong to a distribution $\mathcal{D}$

\[
\text{Error}_{\mathcal{D}} = \mathbb{E}_{\mathcal{D}}[f(x)\neq h(x)] = P_{\mathcal{D}}[f(x)\neq h(x)]
\]

where $f$ is the target function and $h$ is the chosen hypothesis.

Defining PAC Learnability

A concept class $\mathcal{C}$ defined over and instance space $\mathcal{X}$ of dimensionality $n$, is said to PAC learnable by $\mathcal{L}$ using the hypothesis space $\mathcal{H}$, if for a given $0 < \epsilon, \; \delta < 1$

$\forall \: f \in \mathcal{C}$
$\forall \: \mathcal{D} \: \text{over} \: \mathcal{X}$

$\exists \: m$ polynomial in $1/\epsilon, 1/\delta, n$ and $|\mathcal{H}|$ such that

\[
P(\text{Error}_{\mathcal{D}} > \epsilon) < \delta
\]

This is a restriction on the sample complexity. One can also think about time complexity, i.e, whether such an approximation can be found in polynomial time.

PAC Analysis for Learning Conjunctions

Recall that in order to learn conjunction we eliminated non-active literals on every positive example and did nothing on negative examples. In doing so, we shall make mistakes only on positive examples. Such mistakes occur when a literal $z$, which is not part of the target conjunction, has survived all previous positive examples. This can happen only if its value is 1 in all examples seen so far. The PAC learning framework argues that since we haven't seen a positive example with $z=1$ we won't be seeing it in future training samples as well. Thus we assume the probability of this occurrence to be small

\[
p(z) < \epsilon/n
\]

Then union bound implies

\[
\text{Error}_{\mathcal{D}} \leq \sum_{z\in h} p(z) = \epsilon
\]

Another possibility is that there actually are bad literals. In this case PAC argues that by definition ad bad literal is one which has does not occur in the target function but has high probability of occurring as a negative variable in a positive example ($p(z) > \epsilon/n$)and so the chances of not seeing it in training data is very low. To show this consider the probability that $z$ survives $m$ training examples

\begin{align}
Pr(z \: \text{survives one example})^{m} &= (1-Pr(z \: \text{gets eliminated in an example}))^{m} \\
&= (1-p(z))^{m} \\
&< (1-\epsilon/n)^{m} \\
&< e^{-m\epsilon/n}
\end{align}

In the worst case since there could be at most $n$ bad literals, and mistake could be made if any of them survives, the probability of the cases where the error could be greater than $\epsilon$ is

\begin{align}
Pr(\text{any bad literal survives}) = n(1-\epsilon/n)^m < ne^{-m\epsilon/n}
\end{align}

Bounding this by $\delta$ gives us a lower bound on the number of examples that need to be seen

\begin{align}
ne^{-m\epsilon/n} &< \delta \\
\implies m &> \frac{n}{\epsilon}[\ln{n} + \ln(1/\delta)]
\end{align}

Occam's Razor

It says that for any hypothesis $h\in H$ which is consistent with $m$ examples, the probability that it has a generalization error, $Error_D(h)>\epsilon$ is less than $|H|(1-\epsilon)^m$. Note that this is often interpreted as "The simplest explanation is the best explanation", where $|H|$ encodes simplicity.

Corollary

If we want $P(Error_D(h)>\epsilon) < \delta$ then $|H|(1-\epsilon)^m < \delta$ which implies

\begin{equation}

m > \frac{1}{\epsilon}\left\{ln(|H|)+ln(\frac{1}{\delta}) \right\}

\end{equation}

Infinite Hypothesis Spaces

If the hypothesis space is infinitely large then Occam's Razor is not of much use. In such cases VC dimension comes to the rescue.

VC Dimension

Defined as the maximum cardinality set of points which can be shattered by the hypothesis class $H$. Note that this is a property of the hypothesis class. A set of points is set to be shattered by $H$ if for any partition (or labelling) of points into positive and negative sets, there exists a hypothesis $h\in H$ consistent with this labelling.

If any set of $d$ points can be shattered by $H$, then $VC(H) \geq d$
If no set of $d+1$ points can be shattered by $H$, then $VC(H) < d+1$
Together they can be used to show $VC(H) = d$

Relation between $VC(H)$ and $|H|$

In order to shatter $VC(H)$ points, for every labelling there must be at least one $h\in H$ consistent with it. Hence $|H| \geq 2^{VC(H)} \implies VC(H) \leq log(|H|)$. A result similar to Occam's Razor exists for infinite hypothesis spaces

\begin{equation}
m > \frac{1}{\epsilon}\{8VC(H)log(13/\epsilon) + 4log(2/\delta)\}
\end{equation}

Structural Risk Minimization

The training error provides a bound on the generalization error which holds with probability $1-\eta$. If $h\in H$ is hypothesis learned by the learning algorithm then

\begin{equation}
Error_D(h) < Error_{TR}(h) + \sqrt{\frac{VC(H)(log(2m/VC(H)) + 1) - log(\eta/4)}{m}}
\end{equation}

Overfitting

A hypothesis $h$ is said to overfit the training data if there exists another hypothesis $h'$ which has a higher training error but a lower test error. Overfitting occurs when:

Training data is not representative of instance space
Noisy features or labels in the training data

Boosting

A class of ensemble algorithms with strong guarantees. Fall out of a theoretical question - Does weak learnability imply strong learnability?

Strong and Weak PAC Algorithms

An algorithm is said to be Strong PAC Algorithm if for any distribution, $\forall \; \epsilon, \; \delta$, given polynomially many random samples, the algorithm finds a hypothesis with generalization error $\leq \epsilon$ with probability $\geq 1-\delta$. An algorithm is Weak PAC Algorithm if the above holds for $\epsilon \leq 1/2 - \gamma$.

Adaboost

Construct a distribution $D_1$ (initialized to uniform) over training examples
In each iteration

Learn a weak hypothesis $h_t$ with $Error_{D_t} = P_{D_t}(h_t(x)\neq y) < 1/2$
Modify the distribution by increasing the weight of misclassified examples and vice-versa using $D_{t+1}(x_i) = D_t(x_i) e^{-\alpha_t y_ih_t(x_i)}/Z_t$ where $\alpha_t = ln\left((1-\epsilon)/\epsilon\right)/2$

Produce the final hypothesis as $h(x) = \sum_t \alpha_t h_t(x)$

If $\epsilon_t = 1/2 - \gamma_t$, then training error of the final hypothesis is bounded by $\exp(-2\sum_t\gamma_t^2)$.

Bagging

Another ensemble algorithm involving creation of multiple bootstrap replicates of the learning set and aggregating the predictors learnt on each of these. Eg - Bagged Decision Trees

Random Forests (Bagged Trees ++) where not only is the data sample for creating each tree but also for each node candidates for splitting are chosen from a randomly sampled subset of available attributes.

Wednesday, 14 October 2015

Machine Learning (Notes from Prof. Dan Roth's Lectures at UIUC): PART 1

Learning Protocols

Protocol I: Active learning, where learner queries an oracle
Protocol II: Teacher who knows the true function provides training data
Protocol III: Random samples from nature are the training data

Example:

An example that illustrates the difference in requirement of training data in each of these protocols in that of a monotone conjunction.

Protocol I: $N$ examples where $N$ is the number of variables
Protocol II: $M+1$ examples where $M$ is the number of variables in the true conjunction
Protocol III: With limited training data we might end up with an approximation to the true conjunction.

Online Learning

Useful in the following cases:

If batch training has either high time or memory complexity
If the true function changes with time

Mistake bound algorithms

A class of online algorithms $A$ is said to be mistake bound for the concept class $C$, if in the worst case of the target function $f$ and input sequence $S$, the number of mistakes the algorithm makes is polynomial in $n$, the complexity parameter of the target function (dimensionality).

\[
M_A(C) = \underset{f \in C, S}{\text{max}}\;\; M_A(f,S) = \mathcal{O}(poly(n))
\]

Generic mistake bound algorithms

CON Algorithm

At step $i$, let the concept class consistent with previous inputs be $C_i$
Randomly choose $f_i \in C_i$ to make the current prediction
$C_{i+1}\subseteq C_i$ and $|C_{i+1}| = |C_i|-1$ if it makes a mistake
The algorithm can make at most $|C|$ mistakes

The Halving Algorithm

At the $i^{th}$ step, evaluate make a prediction using every $f_j \in C_i$.
Let the final prediction given by the majority
Then at every mistake $|C_{i+1}| \leq |C_{i}|/2$
The algorithm can make at most $log|C|$ mistakes

While we have bounds on the mistakes, it may not be possible to efficiently learn the function.

Perceptron

The Algorithm

Iterate over examples and if a mistake is made update the weight using

\[
w = w + \eta y x
\]

Perceptron Convergence Theorem

The algorithm will converge if the data is linearly separable.

Perceptron Cycling Theorem

If data is not linearly separable, the algorithm will repeat the same set of weights and enter an infinite loop.

Mistake bound for Perceptron (General)

If the data is linearly separable and $\exists \; u,\gamma \ni \|u\|=1$ and $\gamma > 0$ with

$y_i u \cdot x_i \geq \gamma \; \forall i$, then the maximum number of mistakes made by perceptron is $R^2/\gamma^2$ where $R$ is the upper bound on the norm of input features for the given training data.

Mistake bound for Perceptron with Margin (General)

Given margin $\gamma$ define $\xi_i = \max(0,\gamma - y_i w \cdot x_i)$ and $\mathcal{D}_2 = (\sum_i \xi_i^2)^{1/2}$. Then the number of mistakes made by perceptron is bounded by $(R+\mathcal{D}_2)^/\gamma^2$

Question: What is the mistake bound of Perceptron algorithms for k-disjunction? We will compare it to that of Winnow later.

Perceptron is Stochastic Gradient Descent

The update rules for perceptron with and without margin can be derived by recognizing them as gradient descent steps where the cost function at the $i^{th}$ step is $max(0,\gamma - y_i w \cdot x_i)$. In the case of no margin $\gamma = 0$. Given this cost function the update to the weight vector is given by

\[
\Delta w = -\eta \frac{\partial{\xi_i}}{\partial{w}}
\]

Clearly, this derivative is 0 when $y_i w \cdot x_i > \gamma$ and its $-y_i w_i$ when the point falls inside or on the wrong side of current margin.

Winnow

Similar to perceptron but with multiplicative update rule:

Initialize $w = 1$ and $\theta=n$ ($\theta$ never changes)
If $y_i=1$ but $w\cdot x_i < \theta$, for all $i \ni x_i=1$ set $w_i\leftarrow 2w_i$ (promotion)
If $y_i=0$ but $w_i\cdot x_i> \theta$, for all $i \ni x_i=1$ set $w_i \leftarrow w_i/2$ (demotion)

Update rule can be slightly modified for certain concept classes. For instance, in learning monotone disjunctions demotion can be replaced by elimination.

Mistake bound for Winnow (k-disjunction)

Let $u$ and $v$ be the number of mistakes made on positive and negative examples respectively.

Observe that for k-disjunction, the weight of any variable that is in the disjunction (good variable) never decreases but doubles on every mistake on a positive example. Winnow will stop making mistakes on positive examples when the weights corresponding to good variables reaches $n$ which will take $\log{n}$ time for each variable and hence $u = k\log{n}$.

For mistakes on negative examples, we will first need to bound the total weight $TW(t)$by some linear function of $u$ and $v$. Then since we know that $TW(t) > 0$, we will have a bound on $v$ in terms of $u$. So observe that

For mistakes on positive examples $TW(t+1) < TW(t) + n$
For mistakes on negative examples $TW(t+1) < TW(t) - n/2$

Now we see that $0<TW(t) < n + un - vn/2 \implies v < 2(u+1)$.

The total number of mistakes is bounded by $u+v = 3u+2 = \mathcal{O}(k\log{n})$.

Limitation of Winnow

Winnow works only for monotone functions. For the general case, duplicate variables and learn monotone function over 2n variables.

Is there a downside to this?

If the weight of $x$ goes up then we also know that the weight of $\neg x$ should go down but winnow update rules will only change one of them at a time. A better way is to keep 2 weights for each of the $n$ variables such that the effective weight is their difference.

If $y=1$ but $(w^+ - w^-)\cdot x \leq \theta$, $w_i^+ = 2w_i^+$ and $w_i^- = w_i^-/2$ where $x_i=1$
If $y=0$ but $(w^+ - w^-)\cdot x \geq \theta$, $w_i^+ = w_i^+/2$ and $w_i^- = 2w_i^-$ where $x_i=1$

Winnow - R

Version of winnow robust to moving target (Adaptation). Simply do not let weights go below $1/2$. This has a mistake bound of $5u+4$ as compared to $3u+2$ but in case of moving target makes $\mathcal{O(c\log{n})})$ where $c$ is the cost to the adversary which in each iteration is allowed to add or drop variables from the disjunction.

Loss Functions and Regularizations

Let a training sample be denoted by $z_i = (x_i, y_i)$. Then any convex differentiable functions $Q(z,w)$ and $R(w)$ can be minimized over $w$ to learn a classifier. Here $Q$ drives the learning and $R$ ensure the simplicity of the solution.

LMS: $Q(z,w) = (y-w\cdot z)^2$

Ridge Regression: $R(w) = \|w\|_2^2$
LASSO: $R(w) = \|w\|_1$

Hinge Loss: $Q(z,w) = max(0,1-yw\cdot x)$

SVM: $R(w) = \|w\|_2^2$

Logistic Loss: $Q(z,w) = \log(1+exp(-yw\cdot x))$

Logistic Regression: $R(w) = \|w\|_2^2$

Saturday, 3 October 2015

Differential Geometry (Notes from Prof. David Forsyth's lectures at UIUC): PART 1

Curves

Parametrization

Curve is an embedding of a line into higher dimensional space

\begin{align}
&\vec{x}: t \rightarrow \mathbb{R}^n \\
&\vec{x}(t) = (x(t), y(t), z(t), \cdots)
\end{align}

Special Case: Arc Length Parametrization

Usually denoted by $s$. In arc length parametrization, a small step $ds$ in the parameter space amounts to moving the same distance along the curve. For any other parametrization, $t$, a step $dt$ in the parameter space leads to covering a distance

\begin{align}
\|d\vec{x}\| = \left(\sqrt{\left(\frac{dx_1}{dt}\right)^2 + \left(\frac{dx_2}{dt}\right)^2 + \cdots} \right) \cdot dt
\end{align}

By comparison for arc length parametrization

\begin{align}
\sqrt{\left(\frac{dx_1}{ds}\right)^2 + \left(\frac{dx_2}{ds}\right)^2 + \cdots} &= 1 \\
\implies \|\frac{d\vec{x}}{ds}\| &= 1
\end{align}

Tangent to the curve at a point

Tangent to a curve at $t_0$ is defined by

\begin{align}
\frac{d\vec{x}}{dt}|_{t=t_0}
\end{align}

This can be thought of as a vector in the direction of a ray originating at $\vec{x}(t_0)$ which is in 2 point contact with the curve in the limit of the 2 points being arbitrarily close together.

A unit tangent is usually denoted by $\vec{T}$ and is given by

\begin{align}
\vec{T} = \frac{d\vec{x}/dt}{\|d\vec{x}/dt\|}
\end{align}

For arc length parametrization this reduces to

\begin{align}
\vec{T} = \frac{d\vec{x}/ds}{\|d\vec{x}/ds\|} = \frac{d\vec{x}}{ds}
\end{align}

Curvature at a point on the curve

For a curve parametrized by $s$, rate of change of tangent is along the normal direction and its magnitude is governed by curvature, $\kappa(s)$. This relation is given by

\begin{align}
\frac{d\vec{T}}{ds} = \kappa(s)\vec{N}(s)
\end{align}

where $\vec{N}(s)$ is a unit normal.

Binormal and Torsion

For plane curves $\vec{T}$ and $\vec{N}$ define a moving coordinate system. For space curves there is an entire family of directions orthogonal to $\vec{T}$ and hence a binormal $\vec{B}$ is also defined. At a point, the space curve is locally planar and $\vec{N}$ is chosen to lie in this plane such that $\vec{N}.\vec{T}=0$ and then $\vec{B} = \vec{T} \times \vec{N}$. For such a coordinate system, for an infinitesimal change in $s$, $\vec{B}$ swings in the direction of the $\vec{N}$ and the magnitude of the rate of this change is governed by torsion, $\tau(s)$. The following equation summarizes the relation between the rate of change of each of these unit vectors

\begin{align}
\frac{d}{ds} \left[
\begin{array}{c}
T \\
N\\
B\\
\end{array}
\right] =
\left[

\begin{array}{ccc}
0 & \kappa & 0\\
-\kappa & 0 & \tau\\
0 & -\tau & 0\\
\end{array}
\right]
\left[
\begin{array}{c}
T \\
N\\
B\\
\end{array}
\right]
\end{align}

Surfaces

A surface is a mapping $\vec{x} : \mathbb{R}^2 \rightarrow \mathbb{R}^3$.

Gauss Map

Its a mapping from a point on the surface to a point on the sphere which has the same normal. Since the normal at a point $\vec{z}$ on the unit sphere is given by the $\vec{z}$ itself, Gauss map just tells us the normal at every point $\vec{x}$ on the surface $\vec{N}(\vec{x})$.

Gaussian Curvature

Defined as

\begin{align}
\kappa = \underset{radius\rightarrow 0} \lim \frac{\text{Area on Gauss Map}}{\text{Area on Surface}}
\end{align}

Based on sign of $\kappa$ surfaces can be classified as:

Hyperbolic: $\kappa < 0$
Parabolic: $\kappa = 0$
Elliptic: $\kappa > 0$

Note:

Bending does not change $\kappa$. Area need to be added or subtracted in order to change it.
At each point on the surface there are 2 orthogonal directions in which directional curvature is extremal. These are known as principal curvatures, $\kappa_1, \kappa_2$
Given the principal curvatures a surface can locally be written as \begin{align}
\vec{x}(s,t) = \left(s,t,\frac{1}{2}(\kappa_1s^2 + \kappa_2s^2)+O(3)\right)
\end{align}
Gaussian curvature $= \kappa_1 \cdot \kappa_2$
Mean curvature $= \frac{\kappa_1 + \kappa_2}{2}$

Tangents to the surface at a point

There is a vector subspace of tangents at a point with a basis defined by

\begin{align}
\{ \frac{\partial \vec{x}}{\partial s}, \frac{\partial \vec{x}}{\partial t} \} = \{x_s, x_t\}
\end{align}

The First Fundamental Form

It is an operator defined at every point on the surface that can be used to measure dot product between tangents at that point. Any tangent vector $\vec{u}$ can be written in terms of its basis

\[
\vec{u} = a\vec{x}_s + b\vec{x}_t
\]

Similarly we can write any other tangent vector $\vec{u}$ as

\[
\vec{v} = c\vec{x}_s + d\vec{x}_t
\]

Then their dot product is given by

\[
I(\vec{u}, \vec{v}) =
\left[
\begin{array}{cc}
a & b \\
\end{array}
\right]
\left[
\begin{array}{cc}
\vec{x}_s \cdot \vec{x}_s & \vec{x}_s \cdot \vec{x}_t \\
\vec{x}_s \cdot \vec{x}_s & \vec{x}_t \cdot \vec{x}_t \\
\end{array}
\right]
\left[
\begin{array}{c}
c \\
d \\
\end{array}
\right]
\]

The Second Fundamental Form

If we think of $\vec{N}(\vec{x})$ as a map from a point on the surface to a point on the sphere (the Gauss Map), then it would have a directional derivative $d\vec{N}(\vec{u})$ which is a map from the tangent plane to the surface to the tangent plane to the sphere. Some properties:

$\vec{N}_s\cdot \vec{N} = 0$ and $\vec{N}_t\cdot \vec{N} = 0$ (follows by differentiating $\vec{N} \cdot \vec{N} = 1$)
$d\vec{N}(\vec{u}) = a\vec{N}_s + b\vec{N}_t$
It follows from above 2 properties that $d\vec{N}(u)$ lies in the tangent plane to the surface

Now the second fundamental form is defined as

\[
II(\vec{u}, \vec{v}) = -I(d\vec{N}(\vec{u}),\vec{v})
\]

which turns out to be

\[
II(\vec{u}, \vec{v}) =
\left[
\begin{array}{cc}
a & b \\
\end{array}
\right]
\left[
\begin{array}{cc}
\vec{N} \cdot \vec{x}_{ss}& \vec{N} \cdot \vec{x}_{st} \\
\vec{N} \cdot \vec{x}_{st} & \vec{N} \cdot \vec{x}_{tt} \\
\end{array}
\right]
\left[
\begin{array}{c}
c \\
d \\
\end{array}
\right]
\]

Key results

Gaussian Curvature, $\kappa = det(-I^{-1}II)$
Mean curvature, $H = trace(-I^{-1}II)$
These are local geometric properties of the surface which are invariant under

Rigid Motion (because $I$ and $II$ do not change)
Reparametrization $\vec{x}(s(u,v), t(u,v))$ (non-trivially compute $I$ and $II$ with respect to $(u,v)$ and use the above equations)

Isometry

A transformation $\psi$ that maps one surface $S_1$ to another $S_2$ is called an isometry iff lengths along the surface (and hence the angles) are preserved.

Intrinsic Properties

Geometric properties invariant under an isometry are called Intrinsic properties.

Lemma:

Given an isometry $\psi: S_1 \rightarrow S_2$, there is a parametrization of $S_2$ such that $I$ does not change.

Corollary:

Any property that can be expressed in terms of elements of $I$ (or their derivatives) is intrinsic.

Theorem Egregium:

Gaussian curvature is intrinsic.

Corollary:

Isometric surfaces have same gaussian curvature at corresponding points.

Contour generator

A contour generator on S is a curve where rays from a chosen point $P$ are tangent to the surface. Related but different:

Outline: Projection of a contour generator (including invisible regions) into an image plane
Silhouette: Visible exterior points of the outline

Contour generator on an implicit surface as a locus

For an implicit surface $\phi(\vec{x}) = 0$, contour generator w.r.t external point $\vec{p}$, is the locus of points which satisfy the following conditions:

$\phi(\vec{x}_1) = 0$
$\nabla\phi^T \vec{p} - \nabla \phi^T \vec{x} = 0$ (follows from the fact that tangent plane at a point $\vec{x}_0$ is given by $\nabla\phi^T \vec{x} - \nabla \phi^T \vec{x}_0 = 0$ and the point $\vec{p}$ lies on this plane for every point on the C.G)