
Thursday, June 20, 2013

Computer Science 16

A Greedy Algorithm builds up its answer from Locally Optimal choices, and is therefore quick at arriving at an answer. However, it does not guarantee that the final answer is Globally Optimal. The problem, of course, with finding the Globally Optimal solution is that it is prohibitively expensive to compute: finding the Global Optimum is Inherently Exponential.

Let us now look at another class of Optimization Problems. One of the most important areas of Computer Science is Machine Learning. Superficially, we can define Machine Learning as building programs that learn. By that definition, however, almost all programs learn in some form or another.

Machine Learning is a scientific discipline that is concerned with the design and development of algorithms that allow computers to evolve behaviors based on empirical data. We can have a better idea of Machine Learning by looking at the Machine Learning Wikipedia page.

A major focus of Machine Learning is to allow a computer to learn and recognize complex patterns in order to make decisions based on empirical data. This whole process is known as Inductive Inference. The basic idea is for a computer to observe and record examples from a subset of a statistical phenomenon, and generate a model that summarizes some statistical properties of the data in order to make predictions about the future.

There are two distinctive approaches to Machine Learning:
1) Supervised Learning
2) Unsupervised Learning

In Supervised Learning, we associate a Label with each example in a Training Set. If the Label is Discrete, we call it a Classification Problem. For example, we can classify whether a certain transaction on a credit card was made by the owner or not. If the Label is Real-valued, we call it a Regression Problem. When we were doing the Curve Fitting in the previous articles, we were doing Machine Learning, and handling a Regression Problem.
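To make the distinction concrete, here is a minimal sketch in Python (the feature values and labels below are made up purely for illustration): a Classification Problem attaches a discrete Label to each example, while a Regression Problem attaches a real number, which we can then fit with a curve just as in the earlier articles.

import numpy as np

# Classification: each example (transaction amount, hour of day) carries a discrete label.
transactions = [(24.90, 14), (18.50, 11), (980.00, 3), (8.75, 9)]
labels = ['owner', 'owner', 'fraud', 'owner']

# Regression: each example carries a real-valued label, so we fit a curve instead.
xs = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
ys = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
slope, intercept = np.polyfit(xs, ys, 1)  # least-squares straight-line fit
print(slope, intercept)                   # roughly 1.96 and 0.14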

Based on the examples in the Training Set, the goal is to create a program that is able to predict the answer for other cases before they are explicitly observed. In order to do this, we need to Generalize the statistical properties of the Training Set to be able to make predictions about things we haven't seen.

Let us look at the following Training Set. Suppose that we've obtained some data plotted into axes X and Y, and the Colors (Red and Green) and Shapes (Grass and Leaf) are the Labels of the data.



Upon obtaining a set of data, we must ask ourselves a few questions:
1) Are the Labels accurate?
Labels are subject to machine and human errors.

2) Is the past representative of the future?
If the past results are generally not representative of the future, then it would be extremely difficult to generate a sufficiently accurate model for use.

3) Do we have enough data to generalize?
If the Training Set is relatively small, we should take care not to place too much confidence in the predictions.

4) What features are to be extracted?
In order to find out something about the data, we must have the relevant features of the data. For example, in order to find out the composition of haze particles, it is useful to know the types of trees that are being burnt down.

5) How tight should the fit be?
How do we classify the data? Should we fit the data onto the curve, or fit the curve to the data?

It is easy to tell from the Training Data that a certain line y=c separates the labelled Green data from the labelled Red data within the given boundaries.
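As a rough sketch (the threshold c and which side counts as Green are assumptions on my part, since they depend on the actual plot), such a separating line amounts to a one-line classification rule:

c = 0.5  # hypothetical threshold read off the plot

def color_label(point):
    # A point (x, y) is labelled Green if it lies above the line y = c, Red otherwise.
    x, y = point
    return 'Green' if y > c else 'Red'

print(color_label((0.2, 0.8)))  # Green
print(color_label((0.7, 0.1)))  # Red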



However, the more difficult part is deciding how we should go about classifying the Shape labels. Here are two of the many possible ways we can classify them:

Do we classify it by grouping the leaves together like this?



This is great because it minimizes the Training Error in the data. However, is this what we really want? How do we know if future data would really fit like this?

Or do we classify it like this, and assume that the leaf among the grass is a labelling error or an outlier? This is a good generalization, but it incurs one Training Error.



That's where we need more data plotted in the vicinity of the stray leaf to find out if it's a Training Error or an Outlier.

Now let's talk about Unsupervised Learning. The big difference here is that we have Training Data, but we don't have Labels. The program gets a bunch of points, but doesn't know what shape or color each one is. What can the program learn?

Typically, what the program learns in Unsupervised Learning is the Regularities of Data. From the initial data set, Unsupervised Learning can discover the structures in the data.



The dominant form of Unsupervised Learning is Clustering. What we just did above is to find the Clusters in the data. Clustering is to organize the data into groups whose members are similar to each other. However, how do we describe "similar"? Suppose that the data plots the heights and weights of people, and the x Axis represents weight, while the y Axis represents height. We now have distinct clusters of people.

Walmart in fact used Clustering to find out how to optimally structure their shelves. They found out through Clustering that there is a strong correlation between people who bought beer and people who bought diapers. Therefore they decided to place their beer and diapers departments side by side, which worked surprisingly well.

Similarly, Amazon used Clustering to find people who like similar books based on their profile and buying habits. Netflix uses the same to recommend movies. Biologists use clustering a lot to classify plants and animals. They also use it a lot in genetics to find genes that look and function alike.

What properties does a good Clustering have? It should have:
1) Low Intra-Cluster Dissimilarity
Points within the same cluster should be as similar to each other as possible.

2) High Inter-Cluster Dissimilarity
Points in different clusters should be as dissimilar as possible.

How do we measure the Intra- and Inter-Cluster Dissimilarities? Well, of course, if you thought of Variance, you are absolutely right!

Intra-Cluster Variance looks at the Variance of the Intra-Cluster Points with respect to the Cluster's Mean.

variance(C) = Σᵢ (mean(C) − cᵢ)², where the sum runs over every point cᵢ in the cluster C.

Inter-Cluster Variance looks at the Variance between the Means of the different Clusters.
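As a minimal sketch (the function names are my own, and the points are plain one-dimensional numbers for simplicity), both quantities can be computed directly from their definitions:

def variance(cluster):
    # Intra-Cluster Variance: squared deviation of each point from the cluster's mean.
    mean = sum(cluster) / len(cluster)
    return sum((mean - c) ** 2 for c in cluster)

def inter_cluster_variance(clusters):
    # Inter-Cluster Variance: variance of the cluster means around their overall mean.
    means = [sum(c) / len(c) for c in clusters]
    overall = sum(means) / len(means)
    return sum((overall - m) ** 2 for m in means)

print(variance([1.0, 2.0, 3.0]))                         # 2.0
print(inter_cluster_variance([[1.0, 2.0], [7.0, 9.0]]))  # 21.125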

Clustering can be formulated as an Optimization Problem. The Optimization Problem of Clustering is: Find the set of clusters, C, such that Intra-Cluster Variance is minimized, and the Inter-Cluster Variance is maximized.

However, such a definition is not enough, because if that's the case, then we can just put each point in its own cluster. The Intra-Cluster Variance would then be 0 (minimized) and the Inter-Cluster Variance would be maximized.

What we stated is just the Objective Function. In order to have a proper Optimization Problem, we need to spell out the Constraints. The Constraints can be things like:
1) We can only have a maximum of x clusters
2) There should be a minimum distance between two clusters.

Finding the absolute best solution to this problem is Computationally Prohibitive, so people typically resort to some form of Greedy Algorithm (which is still at minimum on the order of n²). The two most common Greedy Algorithms used to solve Clustering are:
1) Hierarchical Clustering
2) k-means

Let's go through an example of Hierarchical Clustering. Suppose that we have N items, and we have an accompanying N×N matrix which we can look up to find the distance between any two items.

The algorithm is as follows:
1) Assign each item to its own cluster. Therefore, if we have N items, we have N clusters.
2) Find the most similar pair of clusters and merge them.
3) Continue the process until all items are in x number of clusters.

This type of Clustering is also known as Agglomerative Clustering. Step 2 of the algorithm is the most difficult: once a cluster contains more than one item, what exactly do we compare between two clusters? What we need to look at is the Linkage Criterion.

There are various Linkage Criteria which we can look at. These are:
1) Single-Linkage
The distance between the two closest points, one from each cluster.

2) Complete-Linkage
The distance between the two furthest points, one from each cluster. This takes care of the worst case.

3) Average-Linkage
The distance between the means of the two clusters.

Of course, as with any Greedy Algorithm, there is no Linkage Criterion that is always the best. This algorithm is at minimum of order n², and there is no guarantee that the resulting Clustering is Globally Optimal.
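As a rough sketch of the algorithm above using Single-Linkage (the function names and the nested-dictionary distance table are my own choices, not the implementation from the next article):

def single_linkage(cluster_a, cluster_b, dist):
    # Single-Linkage: distance between the two closest items, one from each cluster.
    return min(dist[a][b] for a in cluster_a for b in cluster_b)

def agglomerative_clustering(items, dist, num_clusters):
    # Step 1: assign each item to its own cluster.
    clusters = [[item] for item in items]
    # Step 3: keep going until only num_clusters clusters remain.
    while len(clusters) > num_clusters:
        # Step 2: find the most similar pair of clusters and merge them.
        best_pair, best_dist = None, float('inf')
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = single_linkage(clusters[i], clusters[j], dist)
                if d < best_dist:
                    best_pair, best_dist = (i, j), d
        i, j = best_pair
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

# Toy usage: three items whose pairwise distances are given in a nested dictionary.
dist = {'a': {'b': 1, 'c': 5}, 'b': {'a': 1, 'c': 4}, 'c': {'a': 5, 'b': 4}}
print(agglomerative_clustering(['a', 'b', 'c'], dist, 2))  # [['a', 'b'], ['c']]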

The most important issue in Machine Learning is Feature Selection: knowing what to compare. To cluster countries together, for example, we need something known as a Feature Vector containing all the features that are to be compared. An example of a Feature Vector for a country is a list containing its Population, Coordinates, Size, and so on.

Let's do an example of Hierarchical Clustering. Suppose that we have the latitude and longitude coordinates of these MRT stations:

Dhoby Ghaut,1.298593,103.845909
Buona Vista,1.307412,103.789696
Paya Lebar,1.318089,103.892982
Mountbatten,1.306306,103.882531
Jurong East,1.334308,103.741958
Tampines,1.353092,103.945229
Tanah Merah,1.327257,103.946579
Bishan,1.350772,103.848183
Yio Chu Kang,1.381905,103.844818
Choa Chu Kang,1.385482,103.74424
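As a small sketch of how such (latitude, longitude) pairs can be turned into ground distances (this uses a simple equirectangular approximation; the exact figures in the table below depend on which distance formula is used):

from math import cos, radians, sqrt

EARTH_RADIUS_M = 6371000  # mean Earth radius in metres

def distance_m(p1, p2):
    # Approximate ground distance in metres between two (latitude, longitude) points.
    lat1, lon1 = map(radians, p1)
    lat2, lon2 = map(radians, p2)
    x = (lon2 - lon1) * cos((lat1 + lat2) / 2)
    y = lat2 - lat1
    return EARTH_RADIUS_M * sqrt(x * x + y * y)

# Dhoby Ghaut to Paya Lebar: roughly 5.66 km, close to the 5662 m in the table below.
print(round(distance_m((1.298593, 103.845909), (1.318089, 103.892982))))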


If we convert the Latitude and Longitude data into actual distances (in metres), we end up with the following table:

Location | Dhoby Ghaut | Paya Lebar | Tampines | Tanah Merah | Buona Vista | Yio Chu Kang | Bishan | Jurong East | Mountbatten | Choa Chu Kang
Dhoby Ghaut | 0 | 5662 | 12590 | 11632 | 6324 | 9260 | 5804 | 12215 | 4159 | 14863
Paya Lebar | 5662 | 0 | 6989 | 6043 | 11540 | 8885 | 6163 | 16880 | 1750 | 18148
Tampines | 12590 | 6989 | 0 | 2875 | 18015 | 11609 | 10788 | 22686 | 8694 | 22625
Tanah Merah | 11632 | 6043 | 2875 | 0 | 17574 | 12837 | 11243 | 22754 | 7489 | 23399
Buona Vista | 6324 | 11540 | 18015 | 17574 | 0 | 10299 | 8091 | 6089 | 10318 | 10040
Yio Chu Kang | 9260 | 8885 | 11609 | 12837 | 10299 | 0 | 3480 | 12596 | 9389 | 11185
Bishan | 5804 | 6163 | 10788 | 11243 | 8091 | 3480 | 0 | 11946 | 6244 | 12179
Jurong East | 12215 | 16880 | 22686 | 22754 | 6089 | 12596 | 11946 | 0 | 15929 | 5693
Mountbatten | 4159 | 1750 | 8694 | 7489 | 10318 | 9389 | 6244 | 15929 | 0 | 17709
Choa Chu Kang | 14863 | 18148 | 22625 | 23399 | 10040 | 11185 | 12179 | 5693 | 17709 | 0
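Before walking through the merges, here is a small sketch (using only a few cells copied from the table above) of how the Single-Linkage distance between two clusters is read off such a table. It is why, in the walkthrough below, [Dhoby Ghaut, Paya Lebar, Mountbatten] merges with [Yio Chu Kang, Bishan] (5804 m) before Buona Vista joins [Jurong East, Choa Chu Kang] (6089 m).

# Distances in metres, copied from a few cells of the table above.
dist = {
    ('Dhoby Ghaut', 'Yio Chu Kang'): 9260,
    ('Dhoby Ghaut', 'Bishan'): 5804,
    ('Paya Lebar', 'Yio Chu Kang'): 8885,
    ('Paya Lebar', 'Bishan'): 6163,
    ('Mountbatten', 'Yio Chu Kang'): 9389,
    ('Mountbatten', 'Bishan'): 6244,
    ('Buona Vista', 'Jurong East'): 6089,
    ('Buona Vista', 'Choa Chu Kang'): 10040,
}

def lookup(a, b):
    # The table is symmetric, so accept the pair in either order.
    return dist[(a, b)] if (a, b) in dist else dist[(b, a)]

def single_link(cluster_a, cluster_b):
    # Single-Linkage: the closest pair of stations across the two clusters.
    return min(lookup(a, b) for a in cluster_a for b in cluster_b)

print(single_link(['Dhoby Ghaut', 'Paya Lebar', 'Mountbatten'], ['Yio Chu Kang', 'Bishan']))  # 5804
print(single_link(['Buona Vista'], ['Jurong East', 'Choa Chu Kang']))                         # 6089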

Let's go through this step by step, using Single-Linkage:

10 Clusters
Dhoby Ghaut, Paya Lebar, Tampines, Tanah Merah, Buona Vista, Yio Chu Kang, Bishan, Jurong East, Mountbatten, Choa Chu Kang

9 Clusters (closest pair: Paya Lebar and Mountbatten, 1750 m)
Dhoby Ghaut, [Paya Lebar, Mountbatten], Tampines, Tanah Merah, Buona Vista, Yio Chu Kang, Bishan, Jurong East, Choa Chu Kang

8 Clusters (closest pair: Tampines and Tanah Merah, 2875 m)
Dhoby Ghaut, [Paya Lebar, Mountbatten], [Tampines, Tanah Merah], Buona Vista, Yio Chu Kang, Bishan, Jurong East, Choa Chu Kang

7 Clusters (closest pair: Yio Chu Kang and Bishan, 3480 m)
Dhoby Ghaut, [Paya Lebar, Mountbatten], [Tampines, Tanah Merah], Buona Vista, [Yio Chu Kang, Bishan], Jurong East, Choa Chu Kang

6 Clusters (closest pair: Dhoby Ghaut and Mountbatten, 4159 m)
[Dhoby Ghaut, Paya Lebar, Mountbatten], [Tampines, Tanah Merah], Buona Vista, [Yio Chu Kang, Bishan], Jurong East, Choa Chu Kang

5 Clusters (closest pair: Jurong East and Choa Chu Kang, 5693 m)
[Dhoby Ghaut, Paya Lebar, Mountbatten], [Tampines, Tanah Merah], Buona Vista, [Yio Chu Kang, Bishan], [Jurong East, Choa Chu Kang]

4 Clusters (closest pair: Dhoby Ghaut and Bishan, 5804 m)
[Dhoby Ghaut, Paya Lebar, Mountbatten, Yio Chu Kang, Bishan], [Tampines, Tanah Merah], Buona Vista, [Jurong East, Choa Chu Kang]

3 Clusters (closest pair: Paya Lebar and Tanah Merah, 6043 m)
[Dhoby Ghaut, Paya Lebar, Mountbatten, Yio Chu Kang, Bishan, Tampines, Tanah Merah], Buona Vista, [Jurong East, Choa Chu Kang]

2 Clusters (closest pair: Buona Vista and Jurong East, 6089 m)
[Dhoby Ghaut, Paya Lebar, Mountbatten, Yio Chu Kang, Bishan, Tampines, Tanah Merah], [Buona Vista, Jurong East, Choa Chu Kang]

1 Cluster (closest pair: Dhoby Ghaut and Buona Vista, 6324 m)
[Dhoby Ghaut, Paya Lebar, Mountbatten, Yio Chu Kang, Bishan, Tampines, Tanah Merah, Buona Vista, Jurong East, Choa Chu Kang]


As you can tell, by the time we are down to around 5 or 4 Clusters, the groupings already look pretty good. In the next article we will go through the actual implementation of this algorithm!
