## Learning Paradigms

Learning paradigms differ in their input variables and the task to be learned. To explain the main learning categories, suppose we have a set of input variables which represent some measurements of an experiment or some data obtained by sensors or a set of features that describes an object.

- Supervised learning:
  - In addition to the input variables, we also have a set of corresponding labels. The task to be learned is to correctly label a new (previously unseen) set of input variables. For example, consider the task of learning to distinguish images of oranges from images of apples. We are given a set of features representing the color, the texture, and the shape, together with the label associated with each set of features, i.e. orange or apple. Given a new set of features, the learner should correctly decide whether these features represent an orange or an apple.
  - Classical problems: classification and regression

- Unsupervised learning:
  - Here we have only the input variables, with no labels. The task is to discover structure in the data itself.
  - Classical problems: clustering, dimensionality reduction and density modeling
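As a concrete illustration of clustering, the sketch below runs a minimal k-means on unlabeled two-dimensional patterns. The data, the number of clusters, and the initialization are all hypothetical choices for this example:

```python
import numpy as np

rng = np.random.default_rng(0)

# Unlabeled patterns drawn from two hypothetical groups (no labels are used below).
X = np.vstack([rng.normal(0.0, 0.3, (100, 2)),
               rng.normal(3.0, 0.3, (100, 2))])

# Minimal k-means with k = 2: alternate nearest-center assignment and centroid update.
centers = X[[0, 100]].copy()  # deterministic seeds, one from each region of the data
for _ in range(20):
    d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)  # point-to-center distances
    assign = d.argmin(axis=1)                                        # nearest center per point
    centers = np.array([X[assign == j].mean(axis=0) for j in range(2)])
```

After convergence the two centers sit near the means of the two groups, even though the algorithm never saw a label.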

## Short Course on Pattern Recognition

[based on example from Demuth et al. 1996]

Suppose that you are hired by a produce warehouse to develop a computer program that distinguishes between apples and oranges based on their digital images. In other words, the input to your program is an image and the output is a single digit: 1 represents an apple and -1 represents an orange. In addition, your manager gives you 6000 example images, each containing one orange or one apple, along with a set of digits representing the labels of these example images. To approach this problem, we take advantage of the set of labeled examples and follow these steps:

- First of all, we would like to find a higher-level description of the object in an image. An object can be an orange or an apple. Earlier, we learned about segmentation. Once we perform segmentation on the data set, we separate the object from the background. Then, we can describe each object in terms of its color and shape. Color and shape are example **features** or **descriptors** of an object. A color feature could be the RGB or the HSV values. For simplicity, we represent the object's shape as 1 if it is almost round or -1 if it is roughly elliptical. These two features form a **pattern**.

- Since we have a large set of examples, we can try to find a function that correctly separates the majority of the two **classes**: oranges and apples. Let's call this function a **classifier**. The input to the classifier is a pattern and the output is a label. We have to keep in mind that these examples are not perfect: we should expect some oranges to have a color or shape very similar to apples, and vice versa. Algorithms such as K-Nearest Neighbors, Least Squares Methods, Artificial Neural Networks and Support Vector Machines can learn a classifier from a training set of patterns and their associated labels. This stage is called **training**.

- One final step: we should make sure that our classifier will correctly label a new apple or a new orange that is not included in the set we used to train the classifier on. Therefore, we divide our data into two sets: a **training set** and a **validation set**. We use the training set to train the classifier and the validation set to test its accuracy.
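The steps above can be sketched end to end. The pattern values below are simulated stand-ins for features that would come from segmented images ("hue" and "roundness" are hypothetical), and the classifier is a simple least-squares linear model:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated stand-ins for the 6000 labeled examples: labels are 1 (apple) or
# -1 (orange); "hue" and "roundness" are hypothetical features that would come
# from the segmented images in practice.
n = 6000
labels = rng.choice([1.0, -1.0], size=n)
hue = np.where(labels == 1, rng.normal(0.8, 0.1, n), rng.normal(0.2, 0.1, n))
roundness = labels + rng.normal(0.0, 0.3, n)
X = np.column_stack([hue, roundness])

# Divide the data into a training set and a validation set.
idx = rng.permutation(n)
train, val = idx[:5000], idx[5000:]

# Training: fit a least-squares linear classifier on the training set.
Xb_train = np.column_stack([np.ones(len(train)), X[train]])
w, *_ = np.linalg.lstsq(Xb_train, labels[train], rcond=None)

# Validation: label = sign of the linear output; measure accuracy on held-out data.
Xb_val = np.column_stack([np.ones(len(val)), X[val]])
accuracy = (np.sign(Xb_val @ w) == labels[val]).mean()
```

Because accuracy is measured on patterns the classifier never saw during training, it estimates how the classifier will behave on new apples and oranges.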

## Definitions

- Pattern: “a pattern is an arrangement of descriptors [Gonzalez et al. 2002].” Descriptors represent the color, the texture, the shape and other features of an object.
- Classifier: a classifier is a function that maps a set of patterns to a set of labels.

## Generative and Discriminative Models

A generative model is a “probabilistic model” that explains the distribution that generated the data, including the input and the output variables together, i.e. the joint distribution of the input and the output variables p(X, y). In contrast, a discriminative model is only concerned with the distribution of the output variables conditioned on the input variables, i.e. p(y|X). As their names suggest, we can use generative models to generate samples of the data, whereas a discriminative model can only discriminate the value of the output variable based on the values of the input variables. A generative model can also be used for discriminative purposes.
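As a toy illustration of the difference, the sketch below conditions a hypothetical discrete joint distribution p(x, y) to recover p(y|x); all probability values are made up for the example:

```python
import numpy as np

# Hypothetical joint distribution p(x, y): rows index the input x (round vs.
# elliptical shape), columns index the output y (class 1 vs. class -1).
p_joint = np.array([[0.40, 0.05],
                    [0.10, 0.45]])

# A generative model carries the full joint, so it can sample (x, y) pairs.
rng = np.random.default_rng(0)
samples = rng.choice(4, size=1000, p=p_joint.ravel())

# A discriminative model only needs p(y | x), obtained here by conditioning
# the joint on x (divide each row by p(x)).
p_y_given_x = p_joint / p_joint.sum(axis=1, keepdims=True)
```

The conditional table is enough to label a pattern, but it cannot reproduce the joint it came from, so it cannot generate data.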

### Discriminative Models

#### Linear Models [Hastie et al. 2001]

Linear models are simple and easily interpreted. We assume that the model output (the predicted label) is a linear combination of the elements of the pattern vector. The predicted label <math> \hat{y} </math> of a p-dimensional pattern vector is

<math> \hat{y} = w_0 + \sum_{i=1}^{p} x_i w_i </math>

To simplify the notation, we prepend a constant 1 to each pattern vector and include <math> w_0 </math> in the weights vector W. The matrix form of the previous equation is

<math> \hat{Y} = X W </math>

Please note that X is a matrix of size <math> n \times (p+1) </math>, where n is the number of patterns in the training set; each row of X is a pattern of p elements prepended with the constant 1. Y is a column vector of size <math> n \times 1 </math> whose elements are the corresponding labels, and W is a column vector of size <math> (p+1) \times 1 </math>.

The task of training a classifier is just the task of finding a set of weights that allows the classifier to separate the classes. One way to do so is to minimize the residual sum of squares r.

<math> r = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = (Y - XW)^{T} (Y - XW) </math>

Since r is quadratic in W, it has a minimum. To obtain the vector W that minimizes r, we differentiate r with respect to W and set the derivative to zero:

<math> X^{T} (Y - XW) = 0 </math>

<math> W = (X^{T}X)^{-1} X^{T}Y </math>
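The normal-equations solution above can be checked numerically. The patterns and labels below are hypothetical, and the result is compared against NumPy's least-squares solver:

```python
import numpy as np

# Toy training set: n = 4 patterns with p = 2 features each (hypothetical values).
X = np.array([[1.0, 2.0],
              [2.0, 0.5],
              [3.0, 1.0],
              [4.0, 2.5]])
Y = np.array([1.0, -1.0, -1.0, 1.0])

# Prepend the constant 1 to each pattern so w_0 is absorbed into W.
Xb = np.hstack([np.ones((X.shape[0], 1)), X])

# Normal equations: W = (X^T X)^{-1} X^T Y.
W = np.linalg.inv(Xb.T @ Xb) @ Xb.T @ Y

# Cross-check with NumPy's least-squares solver.
W_ls, *_ = np.linalg.lstsq(Xb, Y, rcond=None)
```

In practice one would call the least-squares solver directly rather than forming the inverse, which is numerically fragile when <math> X^{T}X </math> is ill-conditioned.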

#### The K Nearest Neighbors Algorithm [Hastie et al. 2001]

In the previous section we assumed that a label can be expressed as a linear combination of the elements of the pattern vector. In this section, we express the label of a given pattern as a combination of the labels of its neighboring patterns. The number of neighbors k needs to be determined by experimentation on the training set. We choose neighboring patterns based on a specific metric; one of the most commonly used metrics is the Euclidean distance. Let's start by expressing a label as the average of the neighbors' labels.

<math> \hat{y} = \frac{1}{k} \sum_{x_i \in Ne(x)} y_i </math>

<math> Ne(x) </math> is the set of the k patterns nearest to a pattern x. So far we have treated all neighbors equally, regardless of their distance to the pattern of interest x. However, neighbors closer to x should contribute more than neighbors farther from x. To include this idea in our algorithm, let's define <math> w_i </math> as the inverse of the distance between pattern <math> x_i </math> and pattern x.

<math> w_i = \frac{1}{d(x,x_i)}</math>

Now we define y as a linear combination of the neighbors’ labels.

<math> \hat{y} = \frac{\sum_{x_i \in Ne(x)} w_i y_i }{\sum_{x_i \in Ne(x)} w_i } </math>

The denominator is a normalization term that ensures the weights sum to one.
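The distance-weighted rule above can be sketched as follows. The patterns and labels are hypothetical, and the small `eps` guarding against a zero distance is an added implementation detail:

```python
import numpy as np

def knn_predict(X_train, y_train, x, k=3, eps=1e-12):
    """Distance-weighted k-NN label for pattern x (the formula above)."""
    d = np.linalg.norm(X_train - x, axis=1)          # Euclidean distances to x
    nearest = np.argsort(d)[:k]                      # indices of the k neighbors
    w = 1.0 / (d[nearest] + eps)                     # w_i = 1 / d(x, x_i)
    return np.sum(w * y_train[nearest]) / np.sum(w)  # normalized weighted average

# Hypothetical patterns and labels.
X_train = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [5.0, 5.0]])
y_train = np.array([1.0, 1.0, 1.0, -1.0])
y_hat = knn_predict(X_train, y_train, np.array([0.2, 0.2]), k=3)
```

A query near the cluster of +1 patterns yields a prediction close to 1, while a query near (5, 5) is dominated by the nearby -1 neighbor despite two +1 patterns also being among its k neighbors.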

#### Other Discriminative Models

### Generative Models

#### Naive Bayes Algorithm

- Generative and Discriminative Classifiers: Naïve Bayes and Logistic Regression, Tom Mitchell, pages 1-7.
- Generative and Discriminative Classifiers: Naïve Bayes and Logistic Regression, Tom Mitchell, Slides

#### Other Generative Models

## Applications

### Object recognition

#### Resources

### Medicine

## Exercises

## Resources

[1] Demuth, HB and Beale, M and Hagan, MT. Neural network design. PWS Publishing Company, 1996.

[2] Gonzalez, R and Woods, R. Digital Image Processing. Prentice Hall, 2002.

[3] Hastie, T and Tibshirani, R and Friedman, JH. The elements of statistical learning: data mining, inference, and prediction. Springer New York, 2001.