Wednesday, June 26, 2013

Computer Science 17

Noise, in photography, arises from the inability of your camera's sensor to accurately sample and reproduce each pixel of a given exposure. The fact that your picture is still discernible is due to a high Signal-to-Noise Ratio.

Signal-to-Noise Ratio (SNR) is a measurement of how much of a given piece of information is correct versus how much of it is simply noise. For a typical picture, the SNR is actually extremely high, which means that most of the information is correct.

Noise, like most physical phenomena, is Normally Distributed. In the case of a picture, for each pixel, the accuracy of the color representation is normally distributed across three axes: Red, Green and Blue, with the Mean of the distribution being the most accurate color representation of the pixel that a camera can produce. If the color inaccuracy exceeds a certain threshold, we will call it "Noise".

If we measure the Noise of a single image, we count the number of correctly produced pixels versus the pixels that are incorrectly produced. We must, of course, define what Noise means in an image. In our example, we'll say that pixels whose colors are below 80% accuracy are considered Noise. If for every 100 pixels there is 1 pixel of Noise, we have an SNR of 100:1. This also means that 1% of the total pixels are Noise.
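As a sketch of this counting (the accuracy values and the 80% threshold here are illustrative, not from a real camera):

```python
def imageSNR(accuracies, threshold=0.8):
    """Count signal pixels vs noise pixels, given per-pixel accuracies."""
    noisy = sum(1 for a in accuracies if a < threshold)
    signal = len(accuracies) - noisy
    return signal, noisy

# A hypothetical 100-pixel image with exactly one noisy pixel:
accuracies = [0.95]*99 + [0.5]
print(imageSNR(accuracies))  # (99, 1)
```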

In the context of an image, I'll refer to noise as Image Noise. Image Noise of an image is relatively stable. The percentage of pixels that are incorrectly produced for a given exposure is roughly constant and does not fluctuate greatly. When considering Image Noise, an exposure can be 100:1 SNR, the next exposure can be 105:1, and the following can be 95:1.

However, if we look at the noise of each pixel, we measure how far the color strays from the actual value. For a 100:1 SNR image, 99% of the pixels are sufficiently correctly reproduced, while 1% strayed too far from the Mean.

In the context of individual pixels, I'll refer to the noise as Pixel Noise. Pixel Noise can fluctuate greatly: a pixel's RGB accuracy can come out at 100:1, 5:1, or even 1:1. It is this fluctuation that causes pixels to contribute to Image Noise.

In order to fix this, we use Image Averaging. If we average every single pixel across 100 exposures of the same angle, we would have pixels that are very close to their actual colors. Averaged this way, few pixels would be 100% accurate, but pixels below 80% accuracy would become extremely rare. The overall Noise is unchanged, but by evening out and spreading the Pixel Noise across all pixels, we effectively reduce the Image Noise of the image.
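Numerically, Image Averaging can be sketched with numpy, assuming the exposures are already loaded as equally-sized arrays (synthetic noisy frames are used here in place of real photos):

```python
import numpy as np

def averageImages(frames):
    """Average a stack of equally-sized images, pixel by pixel."""
    stack = np.stack([np.asarray(f, dtype=float) for f in frames])
    return stack.mean(axis=0)

# 100 noisy 'exposures' of a uniform grey patch (true value 100):
rng = np.random.default_rng(0)
frames = [100 + rng.normal(0, 10, size=(4, 4)) for _ in range(100)]
averaged = averageImages(frames)
# Each averaged pixel lands far closer to 100 than in any single exposure.
```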

Let's go through an example of Image Averaging. You can do this with any series of images taken from the same angle. For our purpose, we'll use frames from a video. Here's the set of frames we have, downsized from 720p:

This is the full image of a single frame. Click to open it in a separate page to further inspect it:

After averaging, this is what we got:

Not bad for a zoomed video at 60x huh?

Let's look at another example, at 30x zoom with some contrasting colors:

Notice that you can also average images where the lighting changes slightly. It all evens out nicely.

(There's been some error with the image hosting. I'll host the images once the server is online)

Here's a single shot in full resolution:

And here's the shot after averaging the pixels across 10 shots:

As you can see, the focal blur is better represented and so on.

This concludes this article. I may write another article on how we can do the averaging, especially across hundreds of images.

Thursday, June 20, 2013

Computer Science 16

Greedy Algorithms build up an answer from Locally Optimal choices, and therefore arrive at an answer quickly. However, they do not guarantee that the final answer is Globally Optimal. The problem, of course, with finding the Globally Optimal solution is that it is prohibitively expensive to compute: finding the Global Optimum is Inherently Exponential.

Let us now look at another class of Optimization Problems. One of the most important aspects of Computer Science is the idea of Machine Learning. Superficially, we can define Machine Learning as building programs that learn. By that definition, though, almost all programs learn in some form or another.

Machine Learning is a scientific discipline that is concerned with the design and development of algorithms that allow computers to evolve behaviors based on empirical data. We can have a better idea of Machine Learning by looking at the Machine Learning Wikipedia page.

A major focus of Machine Learning is to allow a computer to learn and recognize complex patterns in order to make decisions based on empirical data. This whole process is known as Inductive Inference. The basic idea is for a computer to observe and record examples from a subset of a statistical phenomenon, and generate a model that summarizes some statistical properties of the data in order to predict the future.

There are two distinctive approaches to Machine Learning:
1) Supervised Learning
2) Unsupervised Learning

In Supervised Learning, we associate a Label with each example in a Training Set. If the Label is Discrete, we call it a Classification Problem. For example, we can classify whether a certain credit card transaction was made by the card's owner or not. If the Label is Real-valued, we call it a Regression Problem. When we were doing the Curve Fitting in the previous articles, we were doing Machine Learning, handling a Regression Problem.

Based on the examples in the Training Set, the goal is to create a program that is able to predict the answer for other cases before they are explicitly observed. In order to do this, we need to Generalize the statistical properties of the Training Set to be able to make predictions about things we haven't seen.

Let us look at the following Training Set. Suppose that we've obtained some data plotted into axes X and Y, and the Colors (Red and Green) and Shapes (Grass and Leaf) are the Labels of the data.

Upon obtaining a set of data, we must ask ourselves a few questions:
1) Are the Labels accurate?
Labels are subject to machine and human errors.

2) Is the past representative of the future?
If the past results are generally not representative of the future, then it would be extremely difficult to generate a sufficiently accurate model for use.

3) Do we have enough data to generalize?
If the Training Set is relatively small, we should take care not to place too much confidence in the predictions.

4) What features are to be extracted?
In order to find out something about the data, we must have the relevant features of the data. For example, in order to find out the composition of haze particles, it is useful to know the types of trees that are being burnt down.

5) How tight should the fit be?
How do we classify the data? Should we fit the data onto the curve, or fit the curve to the data?

It is easy to tell from the Training Data that a certain line y=c separates the labelled Green data from the labelled Red data within the given boundaries.

However, the more difficult part is deciding how we should go about classifying the Shape labels. Here are two of the many possible ways we can classify it:

Do we classify it by grouping the leaves together like this?

This is great because it minimizes the Training Error in the data. However, is this what we really want? How do we know if future data would really fit like this?

Or do we classify it like that, and assume that the leaf with the grass is a labelling error or an outlier? This is a good generalization but there is one Training Error.

That's where we need more data plotted in the vicinity of the stray leaf to find out if it's a Training Error or an Outlier.

Now let's talk about Unsupervised Learning. The big difference here is that we have Training Data, but we don't have Labels. The program gets a bunch of points, but doesn't know what shape or color each one is. What can the program learn?

Typically, what the program learns in Unsupervised Learning is the Regularities of Data. From the initial data set, Unsupervised Learning can discover the structures in the data.

The dominant form of Unsupervised Learning is Clustering. What we just did above is to find the Clusters in the data. Clustering is to organize the data into groups whose members are similar to each other. However, how do we describe "similar"? Suppose that the data plots the heights and weights of people, and the x Axis represents weight, while the y Axis represents height. We now have distinct clusters of people.

Walmart in fact used Clustering to find out how to optimally structure their shelves. They found out through Clustering that there is a strong correlation between people who bought beer and people who bought diapers. Therefore they decided to place their beer and diapers departments side by side, which worked surprisingly well.

Similarly, Amazon used Clustering to find people who like similar books based on their profile and buying habits. Netflix uses the same to recommend movies. Biologists use clustering a lot to classify plants and animals. They also use it a lot in genetics to find genes that look and function alike.

What properties does a good Clustering have? It should have:
1) Low Intra-Cluster Dissimilarity
All the points in the same cluster should be as similar to each other as possible.

2) High Inter-Cluster Dissimilarity
Points in different clusters should be as dissimilar as possible.

How do we measure the Intra- and Inter-Cluster Dissimilarities? Well, of course, if you thought of Variance, you are absolutely right!

Intra-Cluster Variance looks at the Variance of the Intra-Cluster Points with respect to the Cluster's Mean.

variance(C) = Σ (mean(C) - c_i)^2, summing over all points c_i in C

Inter-Cluster Variance looks at the Variance between the Means of the different Clusters.
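These two measures can be sketched as functions over plain lists of numbers (1-D points, for simplicity):

```python
def mean(points):
    return sum(points) / len(points)

def intraClusterVariance(cluster):
    """Variance of a cluster's points with respect to the cluster's mean."""
    m = mean(cluster)
    return sum((m - c)**2 for c in cluster)

def interClusterVariance(clusters):
    """Variance between the means of the different clusters."""
    means = [mean(c) for c in clusters]
    overall = mean(means)
    return sum((overall - m)**2 for m in means)

print(intraClusterVariance([1.0, 2.0, 3.0]))  # 2.0
```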

Clustering can be formulated as an Optimization Problem. The Optimization Problem of Clustering is: Find the set of clusters, C, such that Intra-Cluster Variance is minimized, and the Inter-Cluster Variance is maximized.

However, such a definition is not enough, because if that's the case, then we can just put each point in its own cluster. Every cluster's Intra-Cluster Variance would then be 0, trivially satisfying the objective.

What we stated is just the Objective Function. In order to have a proper Optimization Problem, we need to spell out the Constraints. The Constraints can be things like:
1) We can only have a maximum of x clusters
2) There should be a minimum distance between two clusters.

Finding the absolute best solution to this problem is Computationally Prohibitive. Therefore people typically resort to some form of Greedy Algorithm (which, as we'll see, is still at minimum on the order of n^2). The two most common Greedy Algorithms used to solve Clustering are:
1) Hierarchical Clustering
2) k-means
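As a rough sketch of the k-means idea (1-D points here; pick k random starting centroids, assign each point to the nearest centroid, recompute the centroids, repeat):

```python
import random

def kMeans(points, k, iterations=10):
    """A minimal 1-D k-means sketch over a list of numbers."""
    centroids = random.sample(points, k)
    for _ in range(iterations):
        # Assign each point to its nearest centroid...
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # ...then move each centroid to the mean of its cluster.
        centroids = [sum(c)/len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return clusters

random.seed(1)
print(kMeans([1.0, 1.1, 0.9, 10.0, 10.2, 9.8], 2))
```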

Let's go through an example in Hierarchical Clustering. Suppose that we have N items, and we have an accompanying N×N matrix, which we can look up to find the distance between any two points.

The algorithm is as follows:
1) Assign each item to its own cluster. Therefore, if we have N items, we have N clusters.
2) Find the most similar pair of clusters and merge them.
3) Continue the process until all items are in x number of clusters.
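These steps can be sketched compactly (plain 1-D points, with the smallest pairwise distance between clusters as the similarity measure):

```python
def singleLinkage(c1, c2):
    """Smallest distance between any pair of points, one from each cluster."""
    return min(abs(a - b) for a in c1 for b in c2)

def agglomerate(points, targetClusters):
    # 1) Assign each item to its own cluster.
    clusters = [[p] for p in points]
    # 3) Continue until only targetClusters clusters remain.
    while len(clusters) > targetClusters:
        # 2) Find the most similar pair of clusters and merge them.
        pairs = [(i, j) for i in range(len(clusters))
                 for j in range(i + 1, len(clusters))]
        i, j = min(pairs, key=lambda p: singleLinkage(clusters[p[0]], clusters[p[1]]))
        clusters[i] += clusters.pop(j)
    return clusters

print(agglomerate([1.0, 1.2, 5.0, 5.1, 9.0], 3))  # [[1.0, 1.2], [5.0, 5.1], [9.0]]
```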

This type of Clustering is also known as Agglomerative Clustering. Step 2 of the algorithm is the most difficult part: what do we compare once there is more than one item in a cluster? What we need is a Linkage Criterion.

There are various Linkage Criteria we can use. These are:
1) Single-Linkage
The distance between the closest pair of points, one from each cluster.

2) Complete-Linkage
The distance between the furthest pair of points, one from each cluster. This takes care of the worst case.

3) Average-Linkage
The distance between the means of the two clusters.

Of course, as with any Greedy Algorithm, there is no Linkage Criterion that is always best. This algorithm is at minimum an Order of n^2, and there is no guarantee that the resulting Clustering is Globally Optimal.

The most important issue in Machine Learning is Feature Selection: knowing what to compare. To Cluster countries together, we need something known as a Feature Vector containing all the features that are to be compared. An example of a Feature Vector for a country is a list containing its Population, Coordinates, Size, and so on.

Let's do an example of Hierarchical Clustering. Suppose that we have these coordinates of MRT stations:

Dhoby Ghaut,1.298593,103.845909
Buona Vista,1.307412,103.789696
Paya Lebar,1.318089,103.892982
Jurong East,1.334308,103.741958
Tanah Merah,1.327257,103.946579
Yio Chu Kang,1.381905,103.844818
Choa Chu Kang,1.385482,103.74424

If we convert the Longitudinal and Latitudinal data into actual distances, we would end up with the following table:

Location      | Dhoby Ghaut | Paya Lebar | Tampines | Tanah Merah | Buona Vista | Yio Chu Kang | Bishan | Jurong East | Mountbatten | Choa Chu Kang
Dhoby Ghaut   | 0     | 5662  | 12590 | 11632 | 6324  | 9260  | 5804  | 12215 | 4159  | 14863
Paya Lebar    | 5662  | 0     | 6989  | 6043  | 11540 | 8885  | 6163  | 16880 | 1750  | 18148
Tanah Merah   | 11632 | 6043  | 2875  | 0     | 17574 | 12837 | 11243 | 22754 | 7489  | 23399
Buona Vista   | 6324  | 11540 | 18015 | 17574 | 0     | 10299 | 8091  | 6089  | 10318 | 10040
Yio Chu Kang  | 9260  | 8885  | 11609 | 12837 | 10299 | 0     | 3480  | 12596 | 9389  | 11185
Jurong East   | 12215 | 16880 | 22686 | 22754 | 6089  | 12596 | 11946 | 0     | 15929 | 5693
Choa Chu Kang | 14863 | 18148 | 22625 | 23399 | 10040 | 11185 | 12179 | 5693  | 17709 | 0
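The latitude/longitude to distance conversion can be sketched with the haversine formula (great-circle distance, assuming a mean Earth radius of 6371km; coordinates taken from the list above):

```python
from math import radians, sin, cos, asin, sqrt

def haversine(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two lat/long points."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = sin((lat2-lat1)/2)**2 + cos(lat1)*cos(lat2)*sin((lon2-lon1)/2)**2
    return 2 * 6371000 * asin(sqrt(a))  # 6371000m = mean Earth radius

# Dhoby Ghaut to Paya Lebar, using the coordinates above:
print(round(haversine(1.298593, 103.845909, 1.318089, 103.892982)))
# close to the 5662 in the table above
```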

Let's go through this by hand, using the Single-Linkage criterion:

10 Clusters
Dhoby Ghaut, Paya Lebar, Tampines, Tanah Merah, Buona Vista, Yio Chu Kang, Bishan, Jurong East, Mountbatten, Choa Chu Kang

9 Clusters
Dhoby Ghaut, [Paya Lebar, Mountbatten], Tampines, Tanah Merah, Buona Vista, Yio Chu Kang, Bishan, Jurong East, Choa Chu Kang

8 Clusters
Dhoby Ghaut, [Paya Lebar, Mountbatten], [Tampines, Tanah Merah], Buona Vista, Yio Chu Kang, Bishan, Jurong East, Choa Chu Kang

7 Clusters
Dhoby Ghaut, [Paya Lebar, Mountbatten], [Tampines, Tanah Merah], Buona Vista, [Yio Chu Kang, Bishan], Jurong East, Choa Chu Kang

6 Clusters
[Dhoby Ghaut, Paya Lebar, Mountbatten], [Tampines, Tanah Merah], Buona Vista, [Yio Chu Kang, Bishan], Jurong East, Choa Chu Kang

5 Clusters
[Dhoby Ghaut, Paya Lebar, Mountbatten], [Tampines, Tanah Merah], Buona Vista, [Yio Chu Kang, Bishan], [Jurong East, Choa Chu Kang]

4 Clusters
[Dhoby Ghaut, Paya Lebar, Mountbatten], [Tampines, Tanah Merah], [Yio Chu Kang, Bishan], [Jurong East, Choa Chu Kang, Buona Vista]

3 Clusters
[Dhoby Ghaut, Paya Lebar, Mountbatten, Yio Chu Kang, Bishan], [Tampines, Tanah Merah], [Jurong East, Choa Chu Kang, Buona Vista]

2 Clusters
[Dhoby Ghaut, Paya Lebar, Mountbatten, Yio Chu Kang, Bishan, Jurong East, Choa Chu Kang, Buona Vista], [Tampines, Tanah Merah]

1 Cluster
[Dhoby Ghaut, Paya Lebar, Mountbatten, Yio Chu Kang, Bishan, Jurong East, Choa Chu Kang, Buona Vista, Tampines, Tanah Merah]

As you can tell, by the time there are 4 Clusters left, it's already looking pretty good. In the next article we will go through the actual implementation of this algorithm!

Wednesday, June 19, 2013

Computer Science 15

There are many ways we can create models to solve problems. The generalized process is as shown:
1) Start with an experiment in which we gather data.
2) Use computation to find and evaluate a model.
3) Use theory, analysis and computation to derive a consequence of the model.

Optimization Problems involve writing programs to optimize real-life scenarios. Each Optimization Problem consists of an Objective Function and a set of Constraints that have to be satisfied. The Objective Function is used to compare and find the optimal results, while Constraints are rules that must be obeyed (for example, in the case of navigation, Constraints can be in the form of limiting to only pedestrian routes).

Once we come up with these things, we can use computation to attack the problem. A way to solve seemingly new problems is to attempt to map these problems to classic problems that we already know how to solve. This is known as Problem Reduction.

Optimization Problems typically take a long time to solve. Often, there is no computationally efficient way to solve them. Therefore, it is common that we see "Best Effort" results.

A classic Optimization Problem is the Knapsack Problem. The problem is typically discussed in the context of a burglar. One of the problems a burglar must deal with is deciding what to steal. There is usually far more to steal than you can carry away. The Objective Function must optimize the value of what we steal, with the Constraint of the maximum weight or volume we can carry in the knapsack.

Suppose that the burglar has done his research on the house and finds out that these are the objects of interest:

Item | Value | Weight
Desktop | $1500 | 10Kg
Laptop | $1200 | 1.5Kg
LED TV | $1000 | 10Kg
Smartphone | $750 | 0.25Kg
Ring | $1000 | 0.01Kg
Piano | ? | ?
HiFi Set | $500 | 5Kg
PSP | $300 | 0.5Kg

(Here I'm saying Weight in a general sense, don't tell me it should be Mass instead. I know :p)

One of the easiest solutions is to implement a Greedy Algorithm. A Greedy Algorithm is iterative, and at each step, the locally optimal choice is picked. What we need to decide is what counts as "locally optimal". It can be the best price/weight ratio, or the highest raw value that can still fit into the bag, or the item with the least weight. There is no guarantee that any of these is best. Therefore, the biggest problem with the Greedy Algorithm is that there is no single best "locally optimal" choice that works most of the time for the 0/1 Knapsack Problem.

The term 0/1 refers to the fact that we can't take half an item: it's either the full item, or we don't take it at all. A Greedy Algorithm, however, works perfectly for the Continuous Knapsack Problem. Suppose we break into a smith's shop and find barrels of gold, silver and other precious metals. We can fill up with gold first, then silver, then the other metals, and when we finally run out of space, we can take a partial barrel. Most problems in life, however, are 0/1 Knapsack Problems.
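The Continuous Knapsack greedy can be sketched like this (the barrels and their dollar values are made up for illustration):

```python
def continuousKnapsack(barrels, capacity):
    """Greedy fractional knapsack: take the densest material first,
    and a partial barrel when space runs out."""
    taken = 0.0
    # Sort by value/weight ratio, best first.
    for value, weight in sorted(barrels, key=lambda b: b[0]/b[1], reverse=True):
        if weight <= capacity:
            taken += value
            capacity -= weight
        else:
            taken += value * capacity / weight  # partial barrel
            break
    return taken

# Gold ($60000/30Kg), silver ($10000/20Kg), copper ($1200/40Kg), 60Kg capacity:
print(continuousKnapsack([(60000, 30), (10000, 20), (1200, 40)], 60))  # 70300.0
```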

Let's create a quick object that allows us to define what an item is:

class Item(object):
    def __init__(self, name, value, weight):
        self.name = name
        self.value = float(value)
        self.weight = float(weight)

    def getName(self):
        return self.name

    def getValue(self):
        return self.value

    def getWeight(self):
        return self.weight

    def __str__(self):
        return str(self.name)+", $"+str(self.value)+", "+str(self.weight)+"Kg"

We then instantiate the items based on the above table:

def instantiateItems():
    names=["Desktop","Laptop","LED TV","Smartphone","Ring","Piano","HiFi Set","PSP"]
    values=[1500,1200,1000,750,1000,2500,500,300]  # the Piano's $2500 is a placeholder
    weights=[10,1.5,10,0.25,0.01,150,5,0.5]        # and so is its 150Kg; it never fits
    items=[]
    for i in range(len(names)):
        items.append(Item(names[i],values[i],weights[i]))
    return items

At this point we should have a list of items ready for us to work with. We then come up with the greedy algorithm that does all these. The Greedy Algorithm should accept a list of items, the maximum weight that a person can carry, and the function used for comparison.

Here is the Greedy Algorithm:

def greedy(items, maxWeight, keyFunction):
    itemsSorted = sorted(items, key=keyFunction, reverse=True)
    knapsack = []
    totalValue, totalWeight = 0.0, 0.0
    for item in itemsSorted:
        if totalWeight+item.getWeight()<maxWeight:
            knapsack.append(item)
            totalValue += item.getValue()
            totalWeight += item.getWeight()
    return knapsack, totalValue, totalWeight

def value(item):
    return item.getValue()

def weight(item):
    # Inverted so that sorting in reverse puts the lightest items first,
    # matching the output below.
    return 1.0/item.getWeight()

def vwRatio(item):
    return item.getValue()/item.getWeight()

A little explanation here about the formal parameter keyFunction. The Key Function is a function of 1 formal parameter that the sorted() function would use to compare the items in the list. value(), weight() and vwRatio() are examples of Key Functions. Using value() as the keyFunction, for example, would compare whatever is returned by the item.getValue() of all the items.
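As a quick illustration of key functions with plain numbers rather than Items:

```python
# sorted() ranks elements by whatever the key function returns;
# here abs() makes -7 the "largest" element.
data = [3, -7, 2, -1]
print(sorted(data, key=abs, reverse=True))  # [-7, 3, 2, -1]
```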

Let's test the Greedy Algorithm now:

def printResults(knapsack, totalValue, totalWeight):
    for item in knapsack:
        print(item)
    print("Total Value: $"+str(round(totalValue,2)))
    print("Total Weight: "+str(round(totalWeight,2))+"Kg")

def testGreedy():
    items = instantiateItems()
    print("Using value as the key function:")
    printResults(*greedy(items, 22.5, value))
    print("Using weight as the key function:")
    printResults(*greedy(items, 22.5, weight))
    print("Using vwRatio as the key function:")
    printResults(*greedy(items, 22.5, vwRatio))

Assuming our burglar has a knapsack that can carry only up to 22.5Kg of things, here's the results:

Using value as the key function:
Desktop, $1500.0, 10.0Kg
Laptop, $1200.0, 1.5Kg
LED TV, $1000.0, 10.0Kg
Ring, $1000.0, 0.01Kg
Smartphone, $750.0, 0.25Kg
PSP, $300.0, 0.5Kg
Total Value: $5750.0
Total Weight: 22.26Kg

Using weight as the key function:
Ring, $1000.0, 0.01Kg
Smartphone, $750.0, 0.25Kg
PSP, $300.0, 0.5Kg
Laptop, $1200.0, 1.5Kg
HiFi Set, $500.0, 5.0Kg
Desktop, $1500.0, 10.0Kg
Total Value: $5250.0
Total Weight: 17.26Kg

Using vwRatio as the key function:
Ring, $1000.0, 0.01Kg
Smartphone, $750.0, 0.25Kg
Laptop, $1200.0, 1.5Kg
PSP, $300.0, 0.5Kg
Desktop, $1500.0, 10.0Kg
LED TV, $1000.0, 10.0Kg
Total Value: $5750.0
Total Weight: 22.26Kg

As you can see, there is no single best Key Function that we can use. In the end it depends on the situation, and we must look at every one to know which is best. Since this is an algorithm, there is an Order of Complexity associated with it.

The implemented algorithm is divided into two parts: Sorting, then Comparing.

We can assume that the sorting is some variant of the Merge Sort, of which we have:
O(len(items)*log(len(items))) or otherwise O(n log n)

The comparison part is just a for loop that runs a maximum of len(items) times. Therefore we have:
O(len(items)) or otherwise O(n)

Taking the dominant term, the Order of Complexity of the algorithm is O(n log n).

However, it is important to note that what we have may not be the absolute optimal we can have.

If we formulate a problem well enough, it would definitely be apparent that it can be solved in a bruteforce way. The biggest trouble is whether it is feasible or not. Fortunately, bruteforce is feasible for small 0/1 Knapsack Problems.

Suppose that we have these 8 items. The number of possible combinations of items we can take is:
2^8 = 256

We can then run a bruteforce algorithm to go through every single possible combination to find out what is the maximum [Sum of Value] where [Sum of Weight] is below maximum weight. In other words:
Objective Function - Maximum [Sum of Value]
Constraint - [Sum of Weight] Below or Equal to Maximum Weight

The Order of Complexity for such an algorithm is:
O(2**len(items)) or otherwise O(2^n)

Of course, if we're doing just 8 items, we have 256 different combinations. However, what if we have a list of 50 items? That would be 2^50 = 1125899906842624 combinations. Even if the computer could generate a combination every microsecond, it would take about 36 years to complete.

Just for kicks, here's the bruteforce algorithm to solve this:

def brange(integer):
    # Positions of the 1-bits in the integer's binary representation.
    binary = bin(integer)[2:]
    return [i for i in range(len(binary)) if binary[len(binary)-i-1]=='1']

def bruteForceKnapsack(items, maxWeight):
    combinations, values, weights = [], [], []
    maxValue = 0.0
    for combination in range(2**len(items)):
        currentValue = sum(items[i].getValue() for i in brange(combination))
        currentWeight = sum(items[i].getWeight() for i in brange(combination))
        if currentWeight <= maxWeight:
            if maxValue < currentValue:
                maxValue = currentValue
                combinations, values, weights = [combination], [currentValue], [currentWeight]
            elif maxValue == currentValue:
                combinations.append(combination)
                values.append(currentValue)
                weights.append(currentWeight)
    return combinations, values, weights

def testBruteForceKnapsack():
    items = instantiateItems()
    combinations, values, weights = bruteForceKnapsack(items, 22.5)
    print("For a maxweight of 22.5:")
    for i in range(len(combinations)):
        for j in brange(combinations[i]):
            print(items[j])
        print("Total Value: $"+str(round(values[i],2)))
        print("Total Weight: "+str(round(weights[i],2))+"Kg")

The brange() function returns the positions of the 1's in an integer's binary representation. For example, 10 in binary is 1010, so it would return 1 and 3, the positions of the 1's. The algorithm goes through all possible combinations. If there's a better combination than the one it currently knows, it replaces all previously recorded combinations. If there's a combination that gives the same value as the current best, it appends it to the list. Here's some output:

For a maxweight of 22.5:
Desktop, $1500.0, 10.0Kg
Laptop, $1200.0, 1.5Kg
LED TV, $1000.0, 10.0Kg
Smartphone, $750.0, 0.25Kg
Ring, $1000.0, 0.01Kg
PSP, $300.0, 0.5Kg
Total Value: $5750.0
Total Weight: 22.26Kg

Here we see that the Greedy Algorithm got lucky: its Best Effort result was correct! With a different set of items the Greedy Algorithm's results may not be as optimal, but Brute Force will definitely give you the best result, or a set of equally good results.

Brute Force, of course, is unscalable. In the next article we will look at other ways we can find such results.

Here's graphically how much we can steal from the house:

As you can see, a capacity of about 28Kg is enough to steal almost everything. You would have to add more than 150Kg to your maximum load in order to start stealing more.

Tuesday, June 18, 2013

Computer Science 14

Before believing the result of any simulation, we have to have confidence that our conceptual model is correct, and that we have correctly implemented the model.

Before, we looked at Statistical Tests. These tests give us Statistical conclusions which tell us things about the simulation, about how it converges to a certain result. However, we still have to perform Sanity Checks.

One of the ways we can do this is to test the results against reality. For example, in the previous article, the value of Pi obtained from the simulation can be tested with the formula to know if we got the right answer or not.

Let us look at an interplay between Physical Reality, Theoretical Models and Computational Models. The Physical Reality is based on the Physical Situation. An example of a Physical Situation can be the price of a Big Mac across countries (which we know as the Big Mac Index), or a stock market, or viruses in a body. We use a Theoretical Model to gain some deeper insights into it, and when it gets too complicated or is unable to show us everything, we use a Computational Model.

At times when we conduct an actual experiment, we may not get results that match the theory exactly. To work with experimental errors, we have to assume that there is some sort of random perturbation applied to the results. Gauss's work suggests that we can model experimental errors as Normally Distributed.
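A tiny simulation of that assumption (the true value and error spread here are made up): repeated measurements perturbed by normally distributed error scatter around the truth, and their mean converges to it.

```python
import random

random.seed(42)
trueValue = 0.2  # a hypothetical true stretch, in metres
# Each measurement is perturbed by normally distributed error.
measurements = [random.gauss(trueValue, 0.01) for _ in range(1000)]
mean = sum(measurements) / len(measurements)
print(round(mean, 3))  # close to 0.2
```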

Hooke's Law, F=kx, was described by Robert Hooke in 1678. In plain English, Hooke's Law states that the Force (F) exerted by a spring is linearly proportional to the Distance (x) by which the spring has been compressed or stretched, with the constant of proportionality being the Stiffness (k) of the spring. A stiff spring has a large k value. This law holds for a wide variety of materials and systems. However, of course, it does not hold for a force that causes the spring to exceed its elastic limit.

We can find the k value of a spring by suspending the spring and hanging known masses on it, measuring the distance it stretches for each mass. Suppose that a 1Kg mass stretches the spring by 0.2m.
Using the rules:
F = kx
F = ma
ma = kx
k = ma/x

We derive:
k = 1Kg * 9.81m/s2 / 0.2m
= 49.05N/m

We can do a sanity check to make sure this is correct (using 0.4m as the distance, the mass should evaluate to 2Kg):
m = kx / a
= (49.05N/m * 0.4m) / 9.81m/s2 = 2Kg

What we are assuming here is that 49.05N/m is the correct k value with just one experiment. However, remember that experimental errors are normally distributed. If we actually put a 2Kg weight there, we may or may not get a stretching of exactly 0.4m. People would typically repeat the same experiment many times with different weights in order to get enough data to compute an accurate value of k.

Suppose that we have the following data recorded by a fellow scientist:

0.0 0.0
0.05 0.00968
0.1 0.02087
0.15 0.02871
0.2 0.03849
0.25 0.05371
0.3 0.05917
0.35 0.07066
0.4 0.08041
0.45 0.08863
0.5 0.10219
0.55 0.10251
0.6 0.09903
0.65 0.10078
0.7 0.10321
0.75 0.09605
0.8 0.09730
0.85 0.09986
0.9 0.09876
0.95 0.09825
1.0 0.09986

The data comes in a file called spring_data.txt describing weights (in kg) and stretch (in m). In order to manipulate this, we can either manually copy the data into our program, or import and parse it from within Python. In this example, we are going to import and parse the data:

The following code imports the data, and then returns two lists:

def importData(fileName):
    masses, distances = [], []
    for line in open(fileName):
        mass, distance = line.split()
        masses.append(float(mass))
        distances.append(float(distance))
    return masses, distances


Now that we have the two lists, we can then plot the data using:

import pylab

masses, distances = importData("spring_data.txt")

pylab.plot(masses, distances, 'o')
pylab.xlabel("Mass (kg)")
pylab.ylabel("Distance (m)")
pylab.title("Spring Experiment")
pylab.show()

As you can see, the distances are somewhat linear at first (with some level of normally distributed error), then level off at the end:

In Numpy, there is a type called the "array". This array has nothing to do with the array we have in Java; it is closer to a vector or matrix. Converting data to an array allows:
1) Mathematical operations to be performed on every element at once (i.e. array*3 will multiply every element by 3)
2) Element-wise products between two arrays (array1*array2)

Lets look at some application of arrays:

import pylab

masses, distances = importData("spring_data.txt")
forces = pylab.array(masses)*9.81

pylab.plot(forces, distances, 'o')
pylab.xlabel("Force (N)")
pylab.ylabel("Distance (m)")
pylab.title("Spring Experiment")
pylab.show()

In this example, the masses list is changed into an array and then multiplied by 9.81. What we get is a list of Force (N):

From here, we can calculate k. But before that, we must know whether our data is sensible. In our Theoretical Model, the data should fall on a line F=kx. If we can draw a good line here, we can find k very easily.

If we only have two points, we can easily join them to create a line. If we have a bunch of points scattered around, we must find the line that is closest to all the points. This is known as Fitting. In order to find the line, we need a measure of how good the fit is. We need an Objective Function to compare fits. The standard Objective Function is the Least Squares Fit.

Suppose we have two lists, a list of predictions according to the line, and a list of observations. The Least Squares Fit is described as below:

def leastSquaresFit(observed, predicted):
    total = 0.0
    for i in range(len(observed)):
        total += (observed[i] - predicted[i])**2
    return total

When comparing two Lines of Best Fit (or Polynomials of Best Fit), the lower the number, the better.
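For example (made-up observations and two hypothetical candidate lines), the fit with the smaller sum of squared differences wins:

```python
def leastSquaresFit(observed, predicted):
    total = 0.0
    for i in range(len(observed)):
        total += (observed[i] - predicted[i])**2
    return total

observed = [0.0, 0.11, 0.19, 0.32]   # made-up measurements
lineA = [0.0, 0.1, 0.2, 0.3]         # predictions from candidate line A
lineB = [0.05, 0.15, 0.25, 0.35]     # predictions from candidate line B
# Line A hugs the observations more closely, so its score is lower.
print(leastSquaresFit(observed, lineA) < leastSquaresFit(observed, lineB))  # True
```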

A good model is created when, for each independent variable Mass (kg), we can predict a good dependent variable Displacement (m).

To create the Line of Best Fit for any graph, we use Linear Regression. Pylab does Linear Regression using the function polyfit(observedX, observedY, degreeOfPolynomial). Recall that a line is y=mx+c. polyfit() returns the values of m and c for a degree-1 fit:

import pylab

masses, distances = importData("spring_data.txt")
forcesArray = pylab.array(masses)*9.81
distancesArray = pylab.array(distances)

m,c = pylab.polyfit(forcesArray,distancesArray,1)
distancesFitArray = m*forcesArray+c

pylab.plot(forcesArray, distancesArray, 'o')
pylab.plot(forcesArray, distancesFitArray)
pylab.xlabel("Force (N)")
pylab.ylabel("Distance (m)")
pylab.title("Spring Experiment, k="+str(1/m))
pylab.show()

We have the following output:

We can compute k through this reasoning:
Distance = m * Force + c
Hooke's Law gives Distance = (1/k) * Force, so the slope m plays the role of 1/k:
k = 1/m

So is the k value correctly shown? If we look at the picture, we'll realize that the line is not a very good fit. Is it better to use a cubic fit to model our data? Let's look at the example below:

import pylab

masses, distances = importData("spring_data.txt")
forcesArray = pylab.array(masses)*9.81
distancesArray = pylab.array(distances)

m,c = pylab.polyfit(forcesArray,distancesArray,1)
distancesFitArray = m*forcesArray+c
a,b,c,d = pylab.polyfit(forcesArray,distancesArray,3)
distancesFitCubicArray = a*forcesArray**3 + b*forcesArray**2 + c*forcesArray**1 + d

pylab.plot(forcesArray, distancesArray, 'o')
pylab.plot(forcesArray, distancesFitArray)
pylab.plot(forcesArray, distancesFitCubicArray)
pylab.legend(["Observed","Linear Fit","Cubic Fit"])
pylab.xlabel("Force (N)")
pylab.ylabel("Distance (m)")
pylab.title("Spring Experiment, k="+str(1/m))
pylab.show()

It seems as though the Cubic Fit is much better than the Linear Fit in describing the spring?

The useful thing about a model is its ability to predict data. Let's look at how the Cubic Fit would predict our data. We want to see, for example, what would happen if an approximately 20N force is placed on the spring. We do this by adding another point in the Forces array:

import pylab

# masses and distances are the lists of measurements from the experiment
forcesArray = pylab.array(masses)*9.81
forcesArrayExtended = pylab.array(masses+[2])*9.81
distancesArray = pylab.array(distances)

m,c = pylab.polyfit(forcesArray,distancesArray,1)
distancesFitArray = m*forcesArray+c
a,b,c2,d = pylab.polyfit(forcesArray,distancesArray,3)   # c2 avoids overwriting c
distancesFitCubicArray = a*forcesArrayExtended**3 + b*forcesArrayExtended**2 + c2*forcesArrayExtended + d

pylab.plot(forcesArray,distancesArray,'bo')
pylab.plot(forcesArray,distancesFitArray)
pylab.plot(forcesArrayExtended,distancesFitCubicArray)
pylab.legend(["Observed","Linear Fit","Cubic Fit"])
pylab.xlabel("Force (N)")
pylab.ylabel("Distance (m)")
pylab.title("Spring Experiment, k="+str(1/m))
pylab.show()

Now observe the output:

The Cubic Fit seems to suggest that if we place 20 N of force on the spring, the spring would retract back higher than it originally was. Perhaps that happens in another universe with different physics, but it doesn't happen on Earth.

Remember that Hooke's Law is a first order linear approximation that will fail as the spring approaches its limit of elasticity. Just because the data is not plotted into a straight line does not mean that we should attempt to fit an arbitrary curve onto it, because a polynomial of high enough degree can be made to fit any data set. What we want, instead, is to look at the portion where Hooke's Law remains valid - that is, where the spring hasn't hit its limit of elasticity.

Let us repeat the experiment, discarding the last 10 results:

import pylab

# masses and distances are the lists of measurements from the experiment;
# the last 10 measurements (beyond the limit of elasticity) are discarded
forcesArray = pylab.array(masses)*9.81
forcesArrayLimited = pylab.array(masses[:-10])*9.81
distancesArray = pylab.array(distances)
distancesArrayLimited = pylab.array(distances[:-10])

m,c = pylab.polyfit(forcesArrayLimited,distancesArrayLimited,1)
distancesFitArray = m*forcesArray+c

pylab.plot(forcesArray,distancesArray,'bo')
pylab.plot(forcesArray,distancesFitArray)
pylab.legend(["Observed","Linear Fit"])
pylab.xlabel("Force (N)")
pylab.ylabel("Distance (m)")
pylab.title("Spring Experiment, k="+str(1/m))
pylab.show()

Now notice that we've obtained a much more believable fit, and with it a better approximation of the spring's k value, which was originally rated at 49.05 by the manufacturer.

To know how good a fit is to a particular set of results, we use the Coefficient of Determination. The Coefficient of Determination is:
R² = 1 - (Estimated Error / Variance of Measured Data)

Computationally, it is expressed as such:

def rSquare(predicted,measured):
    estimatedError = ((predicted-measured)**2).sum()
    measuredMean = measured.sum()/float(len(measured))
    measuredVariance = ((measured-measuredMean)**2).sum()
    return 1-estimatedError/measuredVariance

Thanks to Numpy's arrays, we performed each array operation in a single line. Do you see how useful arrays are!?

We can compare the goodness of each fit by comparing it to the measured data. The closer it is to 1, the better. A CoD of 1 means that the fit maps directly onto the measured data.
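To make this concrete, here's a small sketch of the rSquare idea on made-up numbers (the arrays are arbitrary, just for illustration): a fit that maps directly onto the measured data scores exactly 1, while a worse fit scores lower.

```python
import numpy as np

def r_square(predicted, measured):
    estimated_error = ((predicted - measured)**2).sum()
    measured_mean = measured.sum() / float(len(measured))
    measured_variance = ((measured - measured_mean)**2).sum()
    return 1 - estimated_error / measured_variance

measured = np.array([1.0, 2.0, 3.0, 4.0])
perfect = np.array([1.0, 2.0, 3.0, 4.0])   # predicts every point exactly
poor = np.array([1.5, 1.5, 3.5, 3.5])      # misses every point by 0.5

print(r_square(perfect, measured))  # 1.0
print(r_square(poor, measured))     # 0.8
```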

Sunday, June 16, 2013

Computer Science 13

How do we construct computational models that will help us understand the real world?

A Gaussian Distribution, commonly known as the Normal Distribution or the Bell Curve, can be described with just its Mean and Standard Deviation.

Whenever possible, we would model distributions as Gaussian because they are extremely easy to deal with. We have things like the Empirical Rule to help us draw conclusions. However, if a certain distribution is not Gaussian and we treat it that way, the results could be extremely misleading.

Take for example the results of rolling a die. The outcome of die-rolling is not normally distributed. In the case of a fair die, each face is equally likely to appear. Plotting the results on a histogram would not yield a Bell Curve.

In a fair lottery system, the probability of anything coming up is the same. Most people study past results in order to predict what comes up next. However, illustrating the Gambler's Fallacy again, the lottery has no memory, and future draws are not influenced by past results.

In case of the Lottery and Die Rolling, we have a Uniform Distribution. Uniform Distributions describe distributions where each result is equally plausible. The only thing required to describe a Uniform Distribution is the Range of the results.

The Uniform Distribution does not occur much in nature. It is typically observed in fair chance games devised by humans. The Uniform Distribution is not as useful in modelling complex systems as the Gaussian Distribution.

Another type of distribution that occurs quite frequently is the Exponential Distribution. Exponential Distribution is the only Continuous Distribution that is memory-less. The concentration of drug in the bloodstream can be modelled with Exponential Distribution.

Assume that each time-step, each molecule has a probability p of being cleared by the body. The system is memory-less, in the sense that the probability of each molecule being cleared is not dependent on what happened to the previous molecules in the previous steps.

If the probability of being cleared at each step is p, then at time T=1, the probability of the molecule still being in the body is 1-p. Since the process is memory-less, the probability of it surviving the second step is also 1-p. Therefore, the probability of it still being in the body at T=2 is (1-p)².

We can see that the probability of the molecule still being in the body at time t is (1-p)^t.
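As a quick check of this formula (a minimal sketch, using the same numbers as the example below: 1000 molecules and p = 0.01):

```python
n = 1000    # starting number of molecules
p = 0.01    # probability of a molecule being cleared each time-step

def expected_remaining(t):
    # expected number of molecules still in the body after t steps
    return n * (1 - p)**t

print(expected_remaining(0))    # 1000.0 - nothing cleared yet
```

After 100 steps, around 366 molecules should remain, since 0.99^100 ≈ 0.366.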

Suppose there are 1000 molecules, and at each time-step, each molecule has a chance of 0.01 to be cleared by the body. We would end up with the following graphs:

Of course, what we are seeing is the mathematical graph of what we would expect. This graph is produced by the following code:

import pylab

def mathematicalMolecules(n, clearProbability, steps):
    moleculesRemaining = []
    for i in range(steps):
        # expected molecules remaining after i steps: n * (1-p)^i
        moleculesRemaining.append(n*((1-clearProbability)**i))
    return moleculesRemaining

moleculeList = mathematicalMolecules(1000,0.01,1000)
pylab.plot(moleculeList)
pylab.xlabel("Time (T)")
pylab.ylabel("Molecules Remaining")
pylab.title("Exponential Decay of Molecules")
pylab.show()

Now, notice that the number of molecules cleared per step decreases over time: the fewer molecules there are, the fewer molecules are lost. If we plot the graph with semilogy, the curve becomes a straight line, confirming that the decay is indeed exponential:

We can actually create a Monte Carlo simulation to mimic the process described, instead of simply using the mathematical formula. We make use of the following code to implement this:

import random
import pylab

def simulationMolecules(n, clearProbability, steps):
    moleculesRemaining = []
    for i in range(steps):
        cleared = 0
        for j in range(n):
            if random.random()<clearProbability:
                cleared += 1
        n -= cleared
        moleculesRemaining.append(n)
    return moleculesRemaining

moleculeList = simulationMolecules(1000,0.01,1000)
pylab.plot(moleculeList)
pylab.xlabel("Time (T)")
pylab.title("Exponential Decay of Molecules (Simulation)")
pylab.show()

Notice that the output is not as smooth as we want it to be, but it matches the mathematical diagram almost exactly:

If we look at this graph on a logarithmic scale, this is what we'll get:

The biggest difference is that our simulation did not allow for fractions of a molecule, which is truer to the physical process we are modelling.

Here's a direct mapping of the two graphs:

The Mathematical Model is also known as the Analytic Model. Here, we've illustrated the difference between Analytic and Simulation models for Exponential Decay. There is no right or wrong answer here. When looking at the models, we look at their Fidelity and Credibility. Fidelity refers to whether the outcome is accurate. Credibility is whether the outcome can be believed.

Then there is Utility. Which model can be applied to what kind of questions? There is an additional Utility offered by the Simulation Model, because we can easily change the model to be slightly different in ways that are usually harder for an Analytic Model.

For example, if we wish to calculate the rate of which the body clears bacteria, but also factor in things like the rate at which the bacteria can regenerate themselves, it is easier to model with a Simulation rather than an Analytic Model.
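As a sketch of that flexibility, here's a minimal variation of the molecule simulation with a regrowth step bolted on (the growthProbability parameter and its value are pure assumptions for illustration, not part of the model above):

```python
import random

random.seed(42)  # fixed seed so the run below is reproducible

def simulateWithRegrowth(n, clearProbability, growthProbability, steps):
    # Hypothetical extension: each time-step, every molecule may be
    # cleared, and every molecule may also replicate. This change would
    # be painful to redo analytically, but is trivial in a simulation.
    remaining = [n]
    for step in range(steps):
        cleared = sum(1 for j in range(n) if random.random() < clearProbability)
        grown = sum(1 for j in range(n) if random.random() < growthProbability)
        n = n - cleared + grown
        remaining.append(n)
    return remaining

history = simulateWithRegrowth(1000, 0.01, 0.005, 500)
```

With clearance outpacing regrowth, the population still decays, just more slowly than before.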

Let us look at the Monty Hall problem. In the Monty Hall problem, we start with 3 doors. Behind one of the doors, there is a great prize. Behind each of the other doors, there is a booby prize. A person is asked to choose one of the doors. The host then reveals one of the other two doors with the booby prize. There is then a question posed to the person: Do you want to switch?

The question is: Which is better? Does it matter whether we switch?

Let us look at this first in an Analytical perspective:
The chance that the prize lies in your door is 1/3
The chance that the prize lies in one of the other 2 doors is 2/3

After choosing the first door, one of the two other doors is eliminated. Now, what you have is:
The chance that the prize lies in your door is 1/3
The chance that the prize lies in the other (2 - 1) door is 2/3

Therefore, making the switch doubles your chance from 1/3 to 2/3.

The following maps the problem out in a computational way:

import random
import pylab

def montyHall(n):
    won = 0
    for i in range(n):
        correctDoor = random.randint(1,3)
        chosenDoor = random.randint(1,3)
        if correctDoor==chosenDoor:
            won += 1
    return won

def montyHallSwitch(n):
    won = 0
    for i in range(n):
        correctDoor = random.randint(1,3)
        chosenDoor = random.randint(1,3)
        if correctDoor!=chosenDoor:
            won += 1
    return won

def plotMontyHall(trialsExp):
    standFractions = []
    switchFractions = []
    for i in range(trialsExp):
        trials = 2**i
        standFractions.append(montyHall(trials)/float(trials))
    for i in range(trialsExp):
        trials = 2**i
        switchFractions.append(montyHallSwitch(trials)/float(trials))
    pylab.plot(standFractions)
    pylab.plot(switchFractions)
    pylab.legend(["Stand","Switch"])
    pylab.ylabel("Fraction of Wins")
    pylab.title("Monty Hall: Stand vs Switch")


This is based upon the following:
1) If guessed door is prize door, stand will give a win
2) If guessed door is NOT prize door, switch will give a win

This is the resulting graph. As we can see, the winning fraction for Switch is roughly twice that of Stand: about 1/3 of Stand participants and 2/3 of Switch participants are winners:

We can use randomized algorithms to solve problems in which randomness plays no role. This is surprising, but it is an extremely important tool.

Let us consider Pi (π). Pi is a constant that people have known about for thousands of years. The earliest estimate of Pi was 4*(8/9)² ≈ 3.16 by the Egyptians in 1650 B.C. Around 650 B.C., when Solomon was creating his Basin, he made use of Pi, which was taken to be roughly 3. The best estimate of Pi in ancient times was by Archimedes. He didn't give a single value of Pi. Instead, he built a polygon of 96 sides, from which he concluded Pi to be between 223/71 and 22/7.

Later on, Buffon and Laplace came up with a Stochastic Simulation to solve for Pi. This is the Buffon and Laplace solution to finding Pi:

Suppose that we have a circle of radius 1 inside a square of sides 2.

We know for sure that the area of the circle is πr², and the area of the square is (2r)² = 4r²

Buffon and Laplace proposed to randomly drop needles into such a set-up on the ground, then use the number of needles in the circle to estimate the area of the circle and the total number of needles to estimate the area of the square.

We can then use the following to obtain π:
Area of Circle / Area of Square
= πr²/(2r)²
= πr²/4r²
= π/4

π = 4 (Area of Circle / Area of Square)

Since the needles land uniformly, the ratio of needle counts estimates the ratio of areas:

π ≈ 4 (Needles in Circle / Needles in Square)

Because Buffon and Laplace didn't have the proper physical setup to conduct such a test, we can help them through the use of a simulation:

import math
import random

def estimatePi(n):
    circle = 0
    for i in range(n):
        x = random.random()
        y = random.random()
        if math.hypot(x-0.5,y-0.5)<0.5:
            circle += 1
    return 4*circle/float(n)

print (estimatePi(50000000))

In my example, I used a circle of radius 0.5 inside a unit square instead. The formula is the same. We drop 50 million pins into the setup, count the pins in the circle, and use the formula to give us the answer. The pins are just a means of estimating area.

As you can see, as the number of pins used increased, the results got closer to Pi and their variation decreased. Here is the code that generated these graphs:

import pylab

def standardDeviation(X):
    mean = sum(X)/float(len(X))
    total = 0.0
    for x in X:
        total += (x-mean)**2
    return (total/len(X))**0.5

def mean(X):
    return sum(X)/float(len(X))

def plotEstimatePi(pinsExp, trials):
    # uses estimatePi from above; each set of trials uses 2^i pins
    pis = []
    stdDev = []
    for i in range(pinsExp):
        estimates = []
        for j in range(trials):
            estimates.append(estimatePi(2**i))
        pis.append(mean(estimates))
        stdDev.append(standardDeviation(estimates))
    pylab.plot(pis)
    pylab.xlabel("Pins Used")
    pylab.ylabel("Estimated Pi")
    pylab.title("Pi Estimation, Final Value = "+str(pis[len(pis)-1]))
    pylab.figure()
    pylab.plot(stdDev)
    pylab.xlabel("Pins Used")
    pylab.ylabel("Standard Deviation")
    pylab.title("Std Dev, Final Value = "+str(stdDev[len(stdDev)-1]))

Note that we can actually come up with more accurate representations of Pi through other means, like a binary search. However, this example illustrates the usefulness of randomness in solving non-random problems. This method is extremely useful for estimating integrals, because the integral of a function is the area under its graph.
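To illustrate that last point, here is a minimal sketch of Monte Carlo integration (the target function and sample count are arbitrary choices for the example): we estimate the integral of x² from 0 to 1, which is exactly 1/3.

```python
import random

random.seed(17)  # fixed seed for a reproducible run

def monteCarloIntegrate(f, a, b, n):
    # Average f over n random points in [a, b]; the average height
    # times the interval width estimates the area under the graph.
    total = 0.0
    for i in range(n):
        total += f(random.uniform(a, b))
    return (b - a) * total / n

estimate = monteCarloIntegrate(lambda x: x**2, 0.0, 1.0, 100000)
```

The estimate lands within about 0.01 of 1/3, and like the Pi example, it tightens as the sample count grows.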

Saturday, June 15, 2013

Computer Science 12

How can we know when it's safe to assume that the average result of a finite number of trials is representative of what we would get if we ran the test many more times?

The question we want to ask is: How many samples do we need, to finally believe the answer? How many trials do we have to conduct to have confidence in the result?

At the root of it all is the notion of Variance. Variance is a measure of how much spread there is in the possible outcomes, or rather, how much the outcomes vary.

It is more desirable to run multiple trials than a single, large trial, because running multiple trials allows us to have a measure of Variance. It is through this measure that we can get a sense of how trustworthy the result is.

The Variance is formalized through the concept of Standard Deviation. Informally, the Standard Deviation measures the fraction of values which are close to the Mean. If many values are close to the Mean, the SD is small. If many values are far from the Mean, the SD is large. If all values are the same, then SD is 0.

Here is Standard Deviation expressed in computational form:

def standardDeviation(X):
    mean = sum(X)/float(len(X))
    total = 0.0
    for x in X:
        total += (x-mean)**2
    return (total/len(X))**0.5

Now, we can use the SD to look at the relationship between the number of samples and the amount of confidence we have in the results. We can plot the Standard Deviations of our trials to demonstrate that the accuracy of the result increases as the number of trials increases.

Let us look back again at the previous coin flipping example. We'll make use of the following functions:

import random

def flipCoin():
    return random.randint(0,1)

def flipCoinList(n):
    x = []
    for i in range(n):
        x.append(flipCoin())
    return x

def mean(X):
    return sum(X)/float(len(X))

flipCoin() returns 0 or 1, flipCoinList(n) returns a list of n flips, and mean() tells us the mean of a list.

We can observe Variance using this example:

import pylab

def flipCoinTrial(flipsExp,trials):
    means, stdDevs = [], []
    for i in range(flipsExp):
        # each set does 2^i flips, repeated 'trials' times
        trialMeans = [mean(flipCoinList(2**i)) for j in range(trials)]
        means.append(mean(trialMeans))
        stdDevs.append(standardDeviation(trialMeans))
    pylab.plot(means)
    pylab.xlabel("Number of Flips")
    pylab.ylabel("Obtained Mean")
    pylab.title("Mean of Flips")
    pylab.figure()
    pylab.plot(stdDevs)
    pylab.xlabel(str(trials)+" Trials of Flips")
    pylab.ylabel("Standard Deviation")
    pylab.title("Standard Deviation of Trials")

This flips coins a number of times, and performs it in sets of trials. The SD is then measured for each set, before a new set of trials begins with more flips of the coins. Here's the result of 2^0 to 2^20 flips, each done 10 times:

Beautiful, isn't it? As you can see, as the number of flips increases, the Variance in the sets of output decreases.

The size of the Standard Deviation compared to the Mean shows us how reliable our results are. The smaller the Standard Deviation, the better. The Coefficient of Variation allows us to represent this ratio. It is simply:
σ(x) / ( (1/n) · ∑(i=1..n) xᵢ )

Where n is the length of the set.

Or more simply:
Standard Deviation / Mean

(Doesn't mathematical notation just freak you out sometimes?)

The smaller the Coefficient of Variation, the more precise the result. Generally, a CoV below 1 is considered Low Variance. However, the CoV cannot be used to construct Confidence Intervals.
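Here's a small sketch of the Coefficient of Variation on two made-up lists with the same Mean (the numbers are arbitrary): the tightly clustered list gets a much smaller CoV.

```python
def mean(X):
    return sum(X) / float(len(X))

def standardDeviation(X):
    m = mean(X)
    total = 0.0
    for x in X:
        total += (x - m)**2
    return (total / len(X))**0.5

def coefficientOfVariation(X):
    return standardDeviation(X) / mean(X)

tight = [9.9, 10.0, 10.1]    # clustered around the mean of 10
spread = [1.0, 10.0, 19.0]   # same mean of 10, far more spread out
```

Both lists have a Mean of 10, so dividing by the Mean makes their spreads directly comparable.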

Let's look at some of the plotting functions available in pylab. One way to observe distribution is through the use of a histogram. The following function allows plotting of histograms:

def flipCoinHistogramTrial(flips,trials):
    means = []
    for i in range(trials):
        means.append(mean(flipCoinList(flips)))
    pylab.hist(means)
    pylab.xlabel("Mean of Flips")
    pylab.title("Histogram of "+str(trials)+" Trials of "+str(flips)+" flips")

The code above flips coins "flips" number of times, for "trials" trials. It then plots the results on a histogram. As you can see, the diagram depicts a Normal Distribution.

As we talk about the Standard Deviation and Coefficient of Variation, we are not only interested in the mean but also in the distribution of the results. A perfect Normal Distribution peaks at the mean and falls off symmetrically on both sides. The Normal Distribution is what we commonly call the Bell Curve.

The Normal Distribution is frequently used to construct probabilistic models for two reasons:
1) It has good mathematical properties and it is easy to reason about
2) It has many naturally occurring instances in the world, so many things map onto it pretty well

Mathematically, the Normal Distribution can be characterized by two parameters:
1) Mean
2) Standard Deviation

If we have a Normal Distribution, the Mean and Standard Deviation can be used to compute Confidence Intervals.

A Confidence Interval allows us to estimate an unknown parameter by providing a range that is likely to contain the unknown value and a Confidence Level  that the unknown value lies within that range.

An example of a Confidence Interval is:
1 ± 0.05

If the Confidence Level is not stated, it is assumed to be 95%. This means that 95% of the time, the result will lie between 0.95 and 1.05.

The Empirical Rule gives us a handy way to estimate the Confidence Interval given the Mean and the Standard Deviation. For a true Normal Distribution:
68% of the data will be within 1 SD of the Mean
95% of the data will be within 2 SD of the Mean
99.7% of the data will be within 3 SD of the Mean
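We can check the Empirical Rule ourselves by drawing samples with random.gauss (a sketch; the seed and sample size are arbitrary choices):

```python
import random

random.seed(0)  # fixed seed so the fractions below are reproducible

# Draw 100,000 samples from a Normal Distribution with Mean 0 and SD 1
samples = [random.gauss(0, 1) for i in range(100000)]

withinOneSD = sum(1 for s in samples if abs(s) <= 1) / float(len(samples))
withinTwoSD = sum(1 for s in samples if abs(s) <= 2) / float(len(samples))
withinThreeSD = sum(1 for s in samples if abs(s) <= 3) / float(len(samples))
```

The three fractions come out near 0.68, 0.95 and 0.997, just as the rule predicts.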

A trick used to estimate the Standard Deviation is the Standard Error. The Standard Error is an estimate of the SD with the assumption that the errors are normally distributed and that the Sampled Population is small relative to the Actual Population. This is usable because the Normal Distribution is often an accurate model of reality.

If p = % of the sample that met a certain condition, and n = sample size, we can say:
Standard Error = (p*(100-p)/n)^(1/2)

Say for example, out of 1000 voters, 60% of the voters voted for Kelvin. We can say that the Standard Error is:
(60*(100-60)/1000)^(1/2) ≈ 1.55%
This means that with a 95% confidence level, the true percentage of votes Kelvin would get is within 2 SE of 60%. In other words, the Confidence Interval would be:
60% ± 3.10%
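Here's the arithmetic behind that interval, as a quick check of the Standard Error formula with p = 60 and n = 1000:

```python
# From the example: 60% of a 1000-voter sample voted for Kelvin
p = 60.0    # percentage of the sample that met the condition
n = 1000    # sample size

standardError = (p * (100 - p) / n)**0.5   # about 1.55 percentage points
intervalWidth = 2 * standardError          # about 3.10, at the 95% level
```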

We can test this result by creating a poll ourselves:

import random
import pylab

def poll(voters,percentage):
    votes = 0.0
    for i in range(voters):
        if random.random() < percentage/100.0:
            votes += 1.0
    return votes

def testStandardError(n,p,trials):
    results = []
    for i in range(trials):
        results.append(poll(n,p))
    pylab.hist(results)
    pylab.xlabel("Votes Gathered")
    pylab.ylabel("Matching Trials")
    pylab.title("Standard Deviation is "+str(standardDeviation(results)/n*100))


As you can see, the Standard Deviation gathered from an actual test is quite close to what we might have gotten from the SE formula.

This shows how good the Standard Error is as an estimation to the actual Standard Deviation.

Monday, June 10, 2013

Computer Science 11

We'll start today's article with an installation of matplotlib from here:

Let's go through an example in graph plotting. The example below plots a graph that compounds 5% annual interest on a principal of $100,000 for 50 years:

import pylab

principal = 100000
interestAnnual = 0.05
years = 50
values = []
for i in range(years+1):
    values.append(principal)
    principal += principal*interestAnnual
pylab.plot(values)
pylab.show()

You should see a graph that bends upwards. Notice that even though we know exactly what we're looking at, because it's a simple graph and we know what the x and y axes represent, it may not be meaningful to other people we show it to.

Therefore it is important that we include labelled axes. To label the graph, we use the title(), xlabel() and ylabel() functions as shown:

import pylab

principal = 100000.0
interestAnnual = 0.05
years = 50
pylab.title(str(interestAnnual)+" Growth of $"+str(principal)+" Compounded Annually over "+str(years)+" years")
pylab.xlabel("Years of Compounding")
pylab.ylabel("Value of Principal (SGD)")
values = []
for i in range(years+1):
    values.append(principal)
    principal += principal*interestAnnual
pylab.plot(values)
pylab.show()

Randomness is an integral part of problem solving, because many problems in the real world are based on random (or seemingly random) variables.

Let's look at one of the simplest forms of probability: the die. If a die is thrown 10 times, what is the probability of not getting a single 1 across ALL the throws?

On the first throw, the probability of rolling a 1 is 1/6, and NOT rolling a 1 on the first throw is 5/6.

The probability of not getting 1's on two rolls is (5/6)^2.

Therefore, the probability of not getting 1's at all in 10 rolls is (5/6)^10.

We can use the 1 minus rule to find the probability of getting AT LEAST ONE 1 in those 10 rolls. All we have to do is compute 1-(5/6)^10.
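The arithmetic is quick to verify:

```python
# Probability of seeing no 1's at all in ten rolls of a fair die
pNoOnes = (5.0/6.0)**10          # about 0.1615

# The "1 minus rule": probability of AT LEAST ONE 1 in those ten rolls
pAtLeastOneOne = 1 - pNoOnes     # about 0.8385
```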

Blaise Pascal, the same person who made important contributions to the study of fluids, vacuum and pressure, also helped found probability theory. His friend asked him a question: Rolling a pair of dice 24 times, how likely is it that we'll get a pair of 6's?

Let's first look at the probability of rolling a pair of 6's in a single throw. Since the probability of getting a 6 on the first die is 1/6 and the probability of getting it on the second die is 1/6, there should be a 1/36 chance of rolling a pair of 6's on a single throw.

Most people would go ahead and calculate (1/36)^24. However, what we're actually getting there is the probability of ALL 24 rolls being pairs of 6's.

To calculate the probability of getting AT LEAST ONE pair of 6's, we use the method found in the previous example. We first look at the probability of NOT rolling any pairs of 6's.

On a single throw, this would be 1-(1/36), which is 35/36.

In 24 throws, the chance of NOT encountering a pair of 6's is (35/36)^24, which equates to ~0.5086.

If we invert it, we have a chance of 1-(35/36)^24 to roll AT LEAST ONE pair of 6's: a ~0.4914 chance.
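Again, the arithmetic checks out:

```python
# Probability of no pair of 6's in 24 throws of a pair of dice
pNoPairOfSixes = (35.0/36.0)**24     # about 0.5086

# Probability of AT LEAST ONE pair of 6's across the 24 throws
pAtLeastOnePair = 1 - pNoPairOfSixes # about 0.4914
```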

Let's write a quick simulation to calculate this probability:

import random

def rollDie():
    return random.randint(1,6)

def rollGame():
    for i in range(24):
        if rollDie()+rollDie()==12:
            return True
    return False

def simulateGame(n):
    wins = 0.0
    for i in range(n):
        if rollGame():
            wins += 1.0
    return wins/n


This code essentially has a rollDie() function which simulates the rolling of 1 die.

The rollGame() function rolls a pair of dice 24 times and evaluates to True when a pair of 6's occurs and False otherwise.

Finally, simulateGame() runs n games and finds out in what fraction of those games we actually got at least ONE pair of 6's.

In this case, it seems counter-intuitive to have to write a simulation to solve this, since it is pretty easy to find the probability through simple multiplication. However, simulation is important for solving really complex problems. It is also good to have at least 2 methods to check our calculations.

The above simulation we've used is also known as a Monte Carlo Simulation, which is named after a casino. This method is known as an Inferential Statistics method. It is based upon the principle that a random sample tends to exhibit the same properties as the population from which it was drawn. In simpler terms, if I draw a sample of people from across the country, the average opinion of this sample would be roughly the same as the average of every single person in the country.

In the Monte Carlo Simulation, the more trials you do, the more stable the value gets. In a sense, there is less jittery, and it gets closer to the actual value of the calculated probability. This is the Bernoulli's Law, or the Law of Large Numbers. This law applies when the tests are independent and does not depend on the results of the previous tests. As the number of trials tends to infinity, the output would converge to the value of the actual probability.

There is a term known as the Gambler's Fallacy, where in a game of coin flipping people tend to bet on heads if tails came up many times in a row. However, since coins have no memory and each flip is independent of the others, a streak of tails will not be "evened out" as most people think it might.

Here's a demonstration of Bernoulli's Law:

import random
import pylab

def flipCoin():
    return bool(random.randint(0,1))

def exponentialTest(n):
    fractions = []
    for i in range(n):
        heads = 0
        for j in range(2**i):
            if flipCoin():
                heads += 1
        fractions.append(heads/float(2**i))
    pylab.plot(fractions)
    pylab.title("Demonstration of Bernoulli's Law")
    pylab.xlabel("Number of Flips")
    pylab.ylabel("Percentage of Heads")


This flips a coin 2^0 up to 2^(n-1) times and plots the results. The output demonstrates that as the number of flips increases, the results converge to the actual probability: