**Greedy Algorithms** build up the answer through **Locally Optimal** choices, and are therefore quick to arrive at an answer. However, they do not guarantee that the final answer is **Globally Optimal**. The problem, of course, with finding the Globally Optimal answer is that it is prohibitively expensive to compute: finding the Global Optimum is **Inherently Exponential**.

Let us now look at another class of Optimization Problems. One of the most important aspects of Computer Science is the idea of **Machine Learning**. Superficially, we can define Machine Learning as building programs that learn. However, it suffices to say that almost all programs learn in some form or another.

Machine Learning is a scientific discipline concerned with the design and development of algorithms that allow computers to evolve behaviors based on empirical data. We can get a better idea of Machine Learning by looking at the Machine Learning Wikipedia page.

A major focus of Machine Learning is to allow a computer to learn and recognize complex patterns in order to make decisions based on empirical data. This whole process is known as **Inductive Inference**. The basic idea is for a computer to observe and record examples from a subset of a statistical phenomenon, and generate a model that summarizes some statistical properties of the data in order to predict the future.

There are two distinctive approaches to Machine Learning:

1) **Supervised Learning**

2) **Unsupervised Learning**

In **Supervised Learning**, we associate a **Label** with each example in a **Training Set**. If the Label is **Discrete**, we call it a **Classification Problem**. For example, we can classify whether a certain transaction on a credit card was done by the owner or not. If the Label is **Real**, we call it a **Regression Problem**. When we were doing the Curve Fitting in the previous articles, we were doing Machine Learning, and handling a Regression Problem.

Based on the examples in the Training Set, the goal is to create a program in such a way that it is able to predict the answer for other cases before they are explicitly observed. In order to do this, we need to **Generalize** the statistical properties of the Training Set to be able to make predictions about things we haven't seen.

Let us look at the following Training Set. Suppose that we've obtained some data plotted on axes X and Y, and the Colors (Red and Green) and Shapes (Grass and Leaf) are the Labels of the data.

Upon obtaining a set of data, we must ask ourselves a few questions:

1) Are the Labels accurate?

Labels are subject to machine and human errors.

2) Is the past representative of the future?

If the past results are generally not representative of the future, then it would be extremely difficult to generate a sufficiently accurate model for use.

3) Do we have enough data to generalize?

If the Training Set is relatively small, we should take care not to place too much confidence in the predictions.

4) What features are to be extracted?

In order to find out something about the data, we must have the relevant features of the data. For example, in order to find out the composition of haze particles, it is useful to know the types of trees that are being burnt down.

5) How tight should the fit be?

How do we classify the data? Should we fit the data onto the curve, or fit the curve to the data?

It is easy to tell from the Training Data that a certain line y=c separates the labelled Green data from the labelled Red data within the given boundaries.

However, the more difficult part is deciding how we should go about classifying the Shape labels. Here are two of the many possible ways we can classify it:

Do we classify it by grouping the leaves together like this?

This is great because it minimizes the Training Error in the data. However, is this what we really want? How do we know if future data would really fit like this?

Or do we classify it like that, and assume that the leaf with the grass is a labelling error or an outlier? This is a good generalization but there is one Training Error.

That's where we need more data plotted in the vicinity of the stray leaf to find out if it's a Training Error or an Outlier.

Now let's talk about **Unsupervised Learning**. The big difference here is that we have Training Data, but we don't have Labels. The program gets a bunch of points, but doesn't know what shape or color each one is. What can the program learn?

Typically, what the program learns in Unsupervised Learning is the **Regularities of Data**. From the initial data set, Unsupervised Learning can discover the structures in the data.

The dominant form of Unsupervised Learning is **Clustering**, which is what we just did above: finding the Clusters in the data. Clustering organizes the data into groups whose members are similar to each other. However, how do we define "similar"? Suppose that the data plots the heights and weights of people, where the x Axis represents weight and the y Axis represents height. We would then see distinct clusters of people.

Walmart in fact used Clustering to find out how to optimally structure their shelves. They found out through Clustering that there is a strong correlation between people who bought beer and people who bought diapers. Therefore they decided to place their beer and diapers departments side by side, which worked surprisingly well.

Similarly, Amazon used Clustering to find people who like similar books based on their profile and buying habits. Netflix uses the same to recommend movies. Biologists use clustering a lot to classify plants and animals. They also use it a lot in genetics to find genes that look and function alike.

What properties does a good Clustering have? It should have:

1) **Low Intra-Cluster Dissimilarities**

All the points in the same cluster should have as little dissimilarity as possible.

2) **High Inter-Cluster Dissimilarities**

Points in different clusters should be highly dissimilar.

How do we measure the Intra- and Inter-Cluster Dissimilarities? Well, of course, if you thought of Variance, you are absolutely right!

**Intra-Cluster Variance** looks at the Variance of the points in a cluster with respect to the Cluster's Mean:

variance(C) = Σᵢ (mean(C) − cᵢ)², where the sum runs over all points cᵢ in cluster C.

**Inter-Cluster Variance** looks at the Variance between the Means of the different Clusters.
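As a minimal sketch of these two measures in Python (for 2D points; the function names are mine, not from the article):

```python
def mean(points):
    """Component-wise mean of a list of (x, y) points."""
    n = len(points)
    return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)

def squared_distance(a, b):
    return (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2

def intra_cluster_variance(cluster):
    """variance(C): sum of squared distances from each point to the cluster mean."""
    m = mean(cluster)
    return sum(squared_distance(m, c) for c in cluster)

def inter_cluster_variance(clusters):
    """Variance between the means of the different clusters."""
    means = [mean(c) for c in clusters]
    overall = mean(means)
    return sum(squared_distance(overall, m) for m in means)
```

A tight cluster such as `[(0, 0), (0, 1), (1, 0), (1, 1)]` has a small intra-cluster variance, while two clusters far apart have a large inter-cluster variance.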

Clustering can be formulated as an Optimization Problem. The Optimization Problem of Clustering is: Find the set of clusters, C, such that Intra-Cluster Variance is minimized, and the Inter-Cluster Variance is maximized.

However, such a definition is not enough, because if that were all, we could just put each point in its own cluster. Every Intra-Cluster Variance would then be 0 (minimized) and the Inter-Cluster Variance would be maximized.

What we stated is just the Objective Function. In order to have a proper Optimization Problem, we need to spell out the Constraints. The Constraints can be things like:

1) We can only have a maximum of x clusters

2) There should be a minimum distance between two clusters.

Finding the absolute best solution to this problem is Computationally Prohibitive (as discussed above, finding the Global Optimum is Inherently Exponential). Therefore, people typically resort to some form of Greedy Algorithm. The two most common Greedy Algorithms used to solve Clustering are:

1) Hierarchical Clustering

2) k-means

Let's go through an example of Hierarchical Clustering. Suppose that we have N items and an accompanying N×N matrix which we can look up to find the distance between any two points.

The algorithm is as follows:

1) Assign each item to its own cluster. Therefore, if we have N items, we have N clusters.

2) Find the most similar pair of clusters and merge them.

3) Continue the process until the items are grouped into the desired number, x, of clusters.
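The three steps above can be sketched in Python. This is a minimal sketch, assuming the distance look-up table is a dict of dicts and using Single-Linkage (discussed below) as the measure of similarity between clusters:

```python
def single_linkage(c1, c2, dist):
    """Cluster distance = shortest distance between any cross-cluster pair."""
    return min(dist[a][b] for a in c1 for b in c2)

def agglomerate(items, dist, target):
    """Agglomerative clustering: merge clusters until `target` remain."""
    # Step 1: assign each item to its own cluster (N items -> N clusters).
    clusters = [[item] for item in items]
    # Steps 2-3: repeatedly merge the most similar pair of clusters.
    while len(clusters) > target:
        i, j = min(
            ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
            key=lambda p: single_linkage(clusters[p[0]], clusters[p[1]], dist),
        )
        clusters[i] += clusters[j]
        del clusters[j]
    return clusters
```

For example, with four items where `a`-`b` and `c`-`d` are close but the pairs are far apart, asking for two clusters yields `[a, b]` and `[c, d]`.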

This type of Clustering is also known as Agglomerative Clustering. Step 2 of the algorithm is the most difficult part: what do we compare once a cluster contains more than one item? What we need to look at is the Linkage Criterion.

There are various Linkage Criteria which we can look at. These are:

1) Single-Linkage

The shortest distance between any pair of points across the two clusters.

2) Complete-Linkage

The distance between the two furthest points across the two clusters. This takes care of the worst case.

3) Average-Linkage

The distance between the means of each cluster.
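As a rough sketch, the criteria can be written as small functions over the same distance look-up table. Note that for Average-Linkage the mean of all cross-cluster pairwise distances is used here as a stand-in, since a plain distance table has no coordinates from which to compute cluster means:

```python
def single_linkage(c1, c2, dist):
    # Shortest distance between any cross-cluster pair.
    return min(dist[a][b] for a in c1 for b in c2)

def complete_linkage(c1, c2, dist):
    # Distance between the two furthest cross-cluster points (worst case).
    return max(dist[a][b] for a in c1 for b in c2)

def average_linkage(c1, c2, dist):
    # Mean of all cross-cluster pairwise distances (a common variant of
    # the "distance between the means" idea described above).
    pairs = [dist[a][b] for a in c1 for b in c2]
    return sum(pairs) / len(pairs)
```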

Of course, like any Greedy Algorithm, no Linkage Criterion is best for every case. This algorithm is at minimum on the order of n², and there is no guarantee that the resulting Clustering is Globally Optimal.

The most important issue in Machine Learning is Feature Selection in order to know what to compare. To Cluster countries together, we need to have something known as a Feature Vector containing all the features that are to be compared. An example of a Feature Vector of a country is a list containing the Population, Coordinates, Size, and so on.

Let's do an example of Hierarchical Clustering. Suppose that we have the coordinates of these MRT stations:

```
Dhoby Ghaut,1.298593,103.845909
Buona Vista,1.307412,103.789696
Paya Lebar,1.318089,103.892982
Mountbatten,1.306306,103.882531
Jurong East,1.334308,103.741958
Tampines,1.353092,103.945229
Tanah Merah,1.327257,103.946579
Bishan,1.350772,103.848183
Yio Chu Kang,1.381905,103.844818
Choa Chu Kang,1.385482,103.74424
```
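The conversion from latitude/longitude pairs to distances can be sketched with the haversine formula. This is an assumption on my part (the article does not say how it converted the coordinates), so the resulting values may differ from the table by a few metres:

```python
from math import asin, cos, radians, sin, sqrt

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two (lat, lon) points,
    assuming a spherical Earth of radius 6371 km."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371000 * asin(sqrt(a))

# Dhoby Ghaut to Paya Lebar comes out at roughly 5.66 km.
d = haversine_m(1.298593, 103.845909, 1.318089, 103.892982)
```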

If we convert the Latitude and Longitude data into actual distances (here in metres), we would end up with the following table:

| Location | Dhoby Ghaut | Paya Lebar | Tampines | Tanah Merah | Buona Vista | Yio Chu Kang | Bishan | Jurong East | Mountbatten | Choa Chu Kang |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Dhoby Ghaut | 0 | 5662 | 12590 | 11632 | 6324 | 9260 | 5804 | 12215 | 4159 | 14863 |
| Paya Lebar | 5662 | 0 | 6989 | 6043 | 11540 | 8885 | 6163 | 16880 | 1750 | 18148 |
| Tampines | 12590 | 6989 | 0 | 2875 | 18015 | 11609 | 10788 | 22686 | 8694 | 22625 |
| Tanah Merah | 11632 | 6043 | 2875 | 0 | 17574 | 12837 | 11243 | 22754 | 7489 | 23399 |
| Buona Vista | 6324 | 11540 | 18015 | 17574 | 0 | 10299 | 8091 | 6089 | 10318 | 10040 |
| Yio Chu Kang | 9260 | 8885 | 11609 | 12837 | 10299 | 0 | 3480 | 12596 | 9389 | 11185 |
| Bishan | 5804 | 6163 | 10788 | 11243 | 8091 | 3480 | 0 | 11946 | 6244 | 12179 |
| Jurong East | 12215 | 16880 | 22686 | 22754 | 6089 | 12596 | 11946 | 0 | 15929 | 5693 |
| Mountbatten | 4159 | 1750 | 8694 | 7489 | 10318 | 9389 | 6244 | 15929 | 0 | 17709 |
| Choa Chu Kang | 14863 | 18148 | 22625 | 23399 | 10040 | 11185 | 12179 | 5693 | 17709 | 0 |

Let's go through this step by step, using Single-Linkage:

**10 Clusters**

Dhoby Ghaut, Paya Lebar, Tampines, Tanah Merah, Buona Vista, Yio Chu Kang, Bishan, Jurong East, Mountbatten, Choa Chu Kang

**9 Clusters**

Dhoby Ghaut, [Paya Lebar, Mountbatten], Tampines, Tanah Merah, Buona Vista, Yio Chu Kang, Bishan, Jurong East, Choa Chu Kang

**8 Clusters**

Dhoby Ghaut, [Paya Lebar, Mountbatten], [Tampines, Tanah Merah], Buona Vista, Yio Chu Kang, Bishan, Jurong East, Choa Chu Kang

**7 Clusters**

Dhoby Ghaut, [Paya Lebar, Mountbatten], [Tampines, Tanah Merah], Buona Vista, [Yio Chu Kang, Bishan], Jurong East, Choa Chu Kang

**6 Clusters**

[Dhoby Ghaut, Paya Lebar, Mountbatten], [Tampines, Tanah Merah], Buona Vista, [Yio Chu Kang, Bishan], Jurong East, Choa Chu Kang

**5 Clusters**

[Dhoby Ghaut, Paya Lebar, Mountbatten], [Tampines, Tanah Merah], Buona Vista, [Yio Chu Kang, Bishan], [Jurong East, Choa Chu Kang]

**4 Clusters**

[Dhoby Ghaut, Paya Lebar, Mountbatten, Yio Chu Kang, Bishan], [Tampines, Tanah Merah], Buona Vista, [Jurong East, Choa Chu Kang]

**3 Clusters**

[Dhoby Ghaut, Paya Lebar, Mountbatten, Yio Chu Kang, Bishan, Tampines, Tanah Merah], Buona Vista, [Jurong East, Choa Chu Kang]

**2 Clusters**

[Dhoby Ghaut, Paya Lebar, Mountbatten, Yio Chu Kang, Bishan, Tampines, Tanah Merah], [Buona Vista, Jurong East, Choa Chu Kang]

**1 Cluster**

[Dhoby Ghaut, Paya Lebar, Mountbatten, Yio Chu Kang, Bishan, Tampines, Tanah Merah, Buona Vista, Jurong East, Choa Chu Kang]
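As a sanity check, the merge order can be recomputed directly from the distance table with the same Single-Linkage rule (a sketch; the matrix values are copied verbatim from the table above):

```python
names = ["Dhoby Ghaut", "Paya Lebar", "Tampines", "Tanah Merah", "Buona Vista",
         "Yio Chu Kang", "Bishan", "Jurong East", "Mountbatten", "Choa Chu Kang"]
D = [
    [0, 5662, 12590, 11632, 6324, 9260, 5804, 12215, 4159, 14863],
    [5662, 0, 6989, 6043, 11540, 8885, 6163, 16880, 1750, 18148],
    [12590, 6989, 0, 2875, 18015, 11609, 10788, 22686, 8694, 22625],
    [11632, 6043, 2875, 0, 17574, 12837, 11243, 22754, 7489, 23399],
    [6324, 11540, 18015, 17574, 0, 10299, 8091, 6089, 10318, 10040],
    [9260, 8885, 11609, 12837, 10299, 0, 3480, 12596, 9389, 11185],
    [5804, 6163, 10788, 11243, 8091, 3480, 0, 11946, 6244, 12179],
    [12215, 16880, 22686, 22754, 6089, 12596, 11946, 0, 15929, 5693],
    [4159, 1750, 8694, 7489, 10318, 9389, 6244, 15929, 0, 17709],
    [14863, 18148, 22625, 23399, 10040, 11185, 12179, 5693, 17709, 0],
]

def merge_order():
    """Return the list of cluster pairs merged, closest pair first."""
    clusters = [[i] for i in range(len(names))]
    merges = []
    while len(clusters) > 1:
        i, j = min(
            ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
            key=lambda p: min(D[a][b] for a in clusters[p[0]] for b in clusters[p[1]]),
        )
        merges.append(({names[k] for k in clusters[i]}, {names[k] for k in clusters[j]}))
        clusters[i] += clusters[j]
        del clusters[j]
    return merges

# The first merges are Paya Lebar-Mountbatten (1750 m),
# Tampines-Tanah Merah (2875 m), and Yio Chu Kang-Bishan (3480 m).
merges = merge_order()
```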

As you can tell, at the point where there are 4 Clusters left, it's already looking pretty good. In the next article we will go through the actual implementation of this algorithm!
