Cauchy’s idea of probability

December 16, 2013

We get used to the idea that as our sample size increases, our model becomes more reliable.

We all ‘know’ that the sample average of most distributions is asymptotically Normal (by the central limit theorem) and that the sample average gets closer to the population mean and such.

This corresponds to a specific type of randomness – let us call it ‘mild’ randomness, as in a sense nothing wild is going on: although a random (stochastic) process underlies everything, with more data comes convergence and more reliable claims. Our usual models come from this – linear regression and so on. They are not exact and they are openly approximations, but they still do a decent job.

However, what if they were totally wrong?

Cauchy had a different idea of probability and, as a result, a different idea of randomness – call it ‘wild’ randomness.

Cauchy’s idea

Cauchy’s idea is as follows.

Consider an archer who is blindfolded and has a bow and arrow. He is to shoot at a target located on a wall that is infinite in height and length. We are to measure how far his arrow lands from the target. We assume that he always hits the wall, somewhere.

For example, if he hits the target, we record 0, as he is exactly 0 units away from the target.

We can formulate a probability distribution based on this example; the assumptions above are made without loss of generality.

Deriving Cauchy’s distribution

We can represent the idea about the archer above as a right angled triangle, as seen below.

[Figure: right-angled triangle formed by the archer, the target, and the actual shot]

This makes sense when we look at it: we (labelled archer) stand some distance away from the target (labelled target), and this distance is the side adjacent to the angle \theta. Our actual shot (labelled actual shot) is then some distance along the wall from the target. Wherever the arrow lands – above the target, to the right, to the left, below it – we can, without loss of generality, represent the situation as a right-angled triangle, assuming that our shot hits the wall.

Say we are interested in how far the shot is from the target (the line segment actual shot – target) and how far we ourselves (the archer) are from the target (the line segment archer – target). Trigonometry relates these two lengths through the angle \theta, which can then be computed. We have

\tan \theta = \frac{\text{line segment between actual shot and target}}{\text{line segment between archer and target}}.

Let the ratio of the line segment between actual shot and target to the line segment between archer and target be called y, which is fine as it is just a real number. Then we have

\tan \theta = y.

To measure how far his arrow lands from the target, we are interested in how \theta varies with the ratio y – that is, we express \theta as a function of y. Taking the inverse tangent gives

\theta = \arctan y.

Varying \theta with respect to y corresponds to our problem – how the angle changes as the ratio changes answers the question of how far away we are from the target. This is just the derivative! The derivative of \theta with respect to y is

\frac{d}{dy} \, \theta = \frac{1}{1+y^2}.

This derivative defines our distribution, up to a constant – the term 1/(1+y^2) is proportional to the probability density function of the random variable Y that measures how far away we are from the target, with y taking any real value, which corresponds to us asking: are we y away from the target?

We then get the probability density function by integrating

\frac{d}{dy} \, \theta = \frac{1}{1+y^2}

over all real values of y (this integral equals \pi) and choosing a constant that makes the total probability equal to 1. The constant is 1/\pi, and the probability density function is

f_Y(y) = \frac{1}{\pi}\,\frac{1}{1+y^2},

where y is any real number.

This is the Cauchy distribution.
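To see that this construction really does describe the blindfolded archer, here is a minimal simulation sketch in Python (using NumPy; the sample size and the band around the target are illustrative choices, not part of the derivation). It draws the shooting angle uniformly, takes the tangent to get the distance ratio, and checks the empirical probability against the density above.

```python
import numpy as np

rng = np.random.default_rng(0)

# The blindfolded archer: the shooting angle theta is equally likely to be
# anything between -pi/2 and pi/2 (every such angle hits the infinite wall).
n_shots = 100_000
theta = rng.uniform(-np.pi / 2, np.pi / 2, size=n_shots)

# The (signed) distance ratio from the target is y = tan(theta).
y = np.tan(theta)

# Compare the empirical proportion of shots landing in a band around the target
# with the probability obtained from the density (1/pi) * 1/(1 + y^2).
a, b = -0.5, 0.5
empirical = np.mean((y > a) & (y < b))
theoretical = (np.arctan(b) - np.arctan(a)) / np.pi
print(f"P({a} < Y < {b}): empirical {empirical:.4f}, theoretical {theoretical:.4f}")
```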

‘Wild’ randomness

Consider a real life process that conforms to ‘mild’ randomness – the heights of humans, for example.

If I collect the heights of, say, five humans, their average may not be close to the population average. As I collect more heights I should get closer to it, assuming that I am picking people randomly and not based on geographical location or other factors.

I get an expected value \mu of what the height should be. I also get an idea of how far away from the mean I expect to be – this is the variance \sigma ^2.

Do these ideas hold for the blindfolded archer? Well… not really.

We can have a sequence of shots that are close to the target, but if the archer’s next shot is miles away, all that ‘work’ is wiped out in the sense that the average of the previous shots, once this new shot is included, will be totally different.

Although I have a rough idea of what I expect to get – 0 units away from the target – a shot can land anywhere. This type of randomness is far more wild – I am not building on my earlier, smaller samples. I do not have any expectation of how far away my shots will be, nor do I know how much I fluctuate around that expectation (which I do not know in the first place). This corresponds to not knowing \mu or \sigma ^2.

Formally, the expectation of Y does not exist, nor is the variance of Y finite.

The Central Limit Theorem and the Law of Large Numbers do not hold here. Taking the sample average and using it to infer information about this distribution is useless, because the next shot can change all of what we are working with. With the Normal distribution, this is not the case.
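A quick way to see this failure is to track the running sample mean. The sketch below assumes a standard Normal and a standard Cauchy, both centred at 0; the sample size and seed are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000

# Normal sample means settle down near the true mean (0) as n grows;
# Cauchy sample means keep jumping no matter how much data we collect.
normal_draws = rng.normal(0.0, 1.0, size=n)
cauchy_draws = rng.standard_cauchy(size=n)

normal_running_mean = np.cumsum(normal_draws) / np.arange(1, n + 1)
cauchy_running_mean = np.cumsum(cauchy_draws) / np.arange(1, n + 1)

for k in (100, 1_000, 10_000, 100_000):
    print(f"n={k:>6}: Normal mean {normal_running_mean[k - 1]:+.4f}, "
          f"Cauchy mean {cauchy_running_mean[k - 1]:+.4f}")
```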

Difference between ‘mild’ and ‘wild’ randomness

Perhaps the difference between these types of randomness (mild and wild) can be seen in the plots. Consider the plot below of the probability density function of the Cauchy distribution for y between -10 and 10 in intervals of 0.00001 (which is fine enough to get a sense of what the distribution looks like).

[Plot: probability density function of the Cauchy distribution]

It does not look so different from the plot of the probability density function of a (standard) Normal distribution, shown below.

[Plot: probability density function of the standard Normal distribution]

The Cauchy distribution has heavier tails – they do not dip as quickly as they do for the Normal distribution. This corresponds to an arrow landing extremely far from the target being far more likely under Cauchy’s distribution than under the Normal – this also makes sense. Yet the Cauchy distribution has no (finite) expected value, no (finite) variance, and much of our intuition about it fails – say goodbye to the Central Limit Theorem and the Law of Large Numbers.
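The heavier tails can be made concrete by comparing tail probabilities directly. This sketch uses the closed-form tail formulas for the standard Cauchy and standard Normal distributions; the thresholds are arbitrary.

```python
import math

# Tail probabilities P(X > t): how much mass sits far from the centre.
# Standard Cauchy: P(X > t) = 1/2 - arctan(t)/pi.
# Standard Normal: P(X > t) = erfc(t / sqrt(2)) / 2.
for t in (2, 4, 10):
    cauchy_tail = 0.5 - math.atan(t) / math.pi
    normal_tail = 0.5 * math.erfc(t / math.sqrt(2))
    print(f"t = {t:>2}: Cauchy {cauchy_tail:.2e}, Normal {normal_tail:.2e}")
```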


An Introduction to Bayesian Statistics

October 3, 2013

Consider a coin toss.

We “know” that the probability of getting a heads (and tails) is \frac{1}{2}. We know that coin tosses are independent of each other.

In the language of probability, a coin toss is a Bernoulli random variable with parameter p=\frac{1}{2} of getting heads (or tails).

The probability of a heads (or tails) is very simple. Let 1 be the outcome for heads and 0 be the outcome for tails. Then the coin toss is a random variable X with probability (mass) function

\mathbb{P}(X=x) = \left(\frac{1}{2}\right)^x\left(1-\frac{1}{2}\right)^{1-x} = \frac{1}{2}, \quad x \in \lbrace 0,1 \rbrace.

Then suppose the probability of getting a heads or tails is no longer symmetric (i.e. the coin is no longer fair); we have \frac{1}{2} \mapsto \rho. The probability mass function is now

\mathbb{P}(X=x) = \rho^x(1-\rho)^{1-x}, \quad x \in \lbrace 0,1 \rbrace.
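As a small sketch, the mass function above can be written out directly in Python (the function name and the example values of \rho are just for illustration):

```python
def bernoulli_pmf(x: int, rho: float) -> float:
    """P(X = x) for a coin whose probability of heads is rho, with x in {0, 1}."""
    return rho ** x * (1 - rho) ** (1 - x)

# A fair coin gives 1/2 for both outcomes; a biased coin does not.
print(bernoulli_pmf(1, 0.5), bernoulli_pmf(0, 0.5))  # 0.5 0.5
print(bernoulli_pmf(1, 0.7), bernoulli_pmf(0, 0.7))  # roughly 0.7 and 0.3
```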

It seems that we are finished.

Actually we have seen everything from a frequentist (statistician’s) view. A Bayesian statistician looks at this very differently. It is the difference between someone who views probability with objectivity and someone who views probability with subjectivity.

How do we know the probability of getting heads is \rho?

Instead of accepting the probability mass function as it is, we attach another probability to it: the probability of the probability of getting a heads, say \mathbb{P}(\rho).

This answers the question: why do we have to assume the probability of getting a heads is \frac{1}{2}?

We no longer do. We also no longer assume it is some fixed value \rho. Our probability mass function now becomes

\mathbb{P}(X=x) = \rho^x(1-\rho)^{1-x}\,\mathbb{P}(\rho), \quad x \in \lbrace 0,1 \rbrace.

What value can \mathbb{P}(\rho) take? This is the difference in our thinking. The parameter \rho is no longer taken as a constant, but is assumed to have a distribution. We say this is the prior (before) distribution.

The distribution of \rho after observing the data X is then, by definition, the posterior (after) distribution.

We present the connection between this inference and our usual (frequentist) inference.

Distribution of \mathbb{P}(\rho)

Assume the probability density \mathbb{P}(\rho) is equal to one for every value of \rho. This means the probability of heads being equal to \frac{1}{2} is just as likely as it being equal to 0 or 1 or \frac{3}{4} or any other value between zero and one.

We are assuming that the probability of getting heads follows the continuous uniform distribution on the unit interval. We have

\mathbb{P}(\rho) = 1, for 0 \leq \rho \leq 1.

The mass function is just as before; with no inference performed, we have the same Bernoulli distribution. Challenge: what happens as we change the prior distribution \mathbb{P}(\rho)?
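One way to explore this challenge is a grid approximation: multiply the prior by the likelihood and normalise. The sketch below assumes a uniform prior and an illustrative data set of 7 heads in 10 flips; neither is prescribed by the text.

```python
import numpy as np

# Grid approximation of the posterior for rho: prior times likelihood, normalised.
rho_grid = np.linspace(0.0, 1.0, 1001)
prior = np.ones_like(rho_grid)  # the uniform prior P(rho) = 1 on [0, 1]

heads, flips = 7, 10  # illustrative data, not from the text
likelihood = rho_grid ** heads * (1 - rho_grid) ** (flips - heads)

unnormalised = prior * likelihood
posterior = unnormalised / np.trapz(unnormalised, rho_grid)  # make it integrate to 1

print("posterior mean of rho:", np.trapz(rho_grid * posterior, rho_grid))
# With a uniform prior this matches the Beta(heads + 1, tails + 1) posterior,
# whose mean is (heads + 1) / (flips + 2) = 8/12, about 0.667.
```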
