Some Oddities of the Principle of Insufficient Reason

David W.C. Telson · 2019/07/12 · 9 minute read

We are keenly interested in reasoning with incomplete or uncertain information, i.e. how we should make a “best guess” given that we have imperfect knowledge. One of the classical tools in our tool belt is the so-called “principle of insufficient reason”, also known as the “principle of indifference”. More on the history and extensions of the principle later; for now I want to focus on some oddities that occur when trying to reason with the principle.

The principle of insufficient reason essentially states that if we have \(n > 1\) mutually exclusive and collectively exhaustive propositions, and we have no reason to suspect that any subset of them is more probable than another, then we should assign a probability of \(\frac{1}{n}\) to each proposition.

In symbols we could rephrase this as follows: suppose \(a = \{a_1,a_2,...,a_n\}\) is a collection of propositions such that \(\bigoplus_{i=1}^{n}a_i\). Here \(\oplus\) is the “exclusive disjunction” i.e. “exclusive or” i.e. “XOR” operator, meaning that one and only one of the propositions is true. If we are unable to distinguish between the propositions except for their label \(i\), then the principle tells us to assign a probability of \(\frac{1}{n}\) to each i.e. \(\forall a_i \in a\), \(p_i = P(a_i) = \frac{1}{n}\).

Even more abstractly, we could say that the principle of indifference is a function \(P\), defined as: \[ P: a \rightarrow \mathbb{R}^+ \\ a_i \mapsto \frac{1}{|a|} = \frac{1}{n} \]
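As a quick sanity check, here is a small R sketch of that assignment (the helper name `indifference` is purely illustrative, not from any library): it gives each of \(n\) labelled propositions the probability \(\frac{1}{n}\).

# Toy principle of indifference: assign probability 1/n to each of n propositions.
indifference <- function(propositions) {
  n <- length(propositions)
  setNames(rep(1 / n, n), propositions)
}

indifference(c('a1', 'a2', 'a3', 'a4'))  # each proposition gets 0.25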

With finite sets of propositions, the principle satisfies Kolmogorov’s axioms of probability. For a refresher, they are:

  1. The probability of an event \(\omega\) is a non-negative real number, i.e. \(P(\omega) \ge 0\).
  2. The probability of any event occurring in the set of all possible events is 1, i.e. \(P(\Omega) = 1\).
  3. The probability of the union of a countable collection of mutually exclusive events is equal to the sum of the probabilities of those events, i.e. \(P(\bigcup_{i=1}^{\infty}\omega_i) = \sum_{i=1}^{\infty}P(\omega_i)\).

The first two axioms are easily understood, but the last takes some thought. On finite sets, it is a triviality, but its utility is in finding meaningful probability statements within infinite sets, as we shall soon see.

The principle of indifference on a non-empty finite set of propositions easily meets the requirement of the first axiom, as any such set will have a cardinality \(0 < |a| = n < \infty\) with \(n \in \mathbb{N}\), so that \(\frac{1}{n}\) is always a positive real number.

The principle of indifference on a non-empty finite set of propositions also meets the requirement of the second axiom. Specifically, if we assume \(\bigoplus_{i=1}^{n}a_i\), then we have:

\[P(a) = P(\bigcup_{i=1}^{n}a_i) = \sum_{i=1}^{n}P(a_i)=\sum_{i=1}^n\frac{1}{n} = \frac{n}{n}=1\] Note: technically this is a case of “abuse of notation” as we stated \(a_i\) were propositions, rather than sets, but we will let this go for now.

Finally, let’s partition our set of propositions into two mutually exclusive subsets \(b\) and \(c\), where \(|b| = m < n\) and \(|c| = n - m\). For the first subset we have:

\[P(b) = P(\bigcup_{a_i\in b}a_i) = \sum_{a_i\in b}P(a_i)=\sum_{i=1}^m\frac{1}{n} = \frac{m}{n}\lt1\] The second subset will have a probability equal to \(\frac{n-m}{n} = 1 - \frac{m}{n}\). These statements are not so much proofs that the principle of indifference satisfies the axioms of probability as they are examples meant to build an intuition for how the two relate.
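To make this concrete, here is a quick numeric check of the partition example (the particular values \(n = 10\) and \(m = 3\) are chosen arbitrarily for illustration):

# Numeric check of the partition example with n = 10 propositions and |b| = m = 3.
n <- 10
m <- 3
p <- rep(1 / n, n)     # principle of indifference over all n propositions
sum(p)                 # 1, consistent with the second axiom
sum(p[1:m])            # m / n = 0.3, the probability of subset b
sum(p[(m + 1):n])      # (n - m) / n = 0.7, the probability of subset c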

Now all of this is perfectly fine w.r.t. non-empty finite sets of mutually exclusive propositions, but what happens when we start to increase the number of propositions towards infinity? In other words, can our principle of indifference be applied to countably infinite sets? Let’s approach this from a different angle, and let’s proceed cautiously by defining the limit of a sequence.

Definition 1 The limit \(r\) of a sequence \(f(n) = r_n\) exists if for any \(\epsilon > 0\), there exists an \(m \in \mathbb{N}\) such that if \(n \ge m\) then \(|r_n - r|<\epsilon\). We write the limit \(r\) of \(f(n)\) as \(\lim_{n \to \infty} f(n) = r\).

So given a sequence (simply a function \(f: \mathbb{N} \to \mathbb{R}\)), we can determine its limit given the definition. Let’s create a second definition for our purposes:

Definition 2 A constant function is a function \(f\) which maps any element \(x\) in its domain \(X\) to the only element \(c\) of singleton set \(C\) i.e. \(f\) is a constant function if for all \(x \in X\), \(f(x) = c\).

When looking at the principle of indifference, we see that if \(a\) is finite with cardinality \(|a| = n\), then \(\sum_{i=1}^{n}P(a_i) = \sum_{i=1}^{n}\frac{1}{n} = \frac{n}{n} = 1\). We can interpret this as a sequence that takes a number of propositions and returns the sum of the probabilities assigned by the principle of indifference, i.e. \(f(n) = 1\). We can easily see that this is a constant function, and that for any \(n \in \mathbb{N}\), the principle of indifference yields probability assignments over all of the propositions that sum to 1, satisfying the second axiom of probability. What happens as \(n \to \infty\)?

Theorem 1 If \(f(n) = c\) is a constant sequence, then its limit is \(c\).

To prove this theorem, we must show that \(\lim_{n \to \infty} f(n) = c\) meets the criteria of Definition 1. Choose any \(\epsilon > 0\) and let \(m = 1\). By Definition 2, for every \(n \ge m\) we have \(f(n) = c\), so \(|f(n) - c| = |c-c| = 0 < \epsilon\), as required. Thus \(\lim_{n \to \infty} f(n) = c\).

So the sum of the probabilities assigned via the principle of indifference converges to 1 as \(n\) approaches \(\infty\). This is good news, as it tells us that the second axiom is satisfied by the probability assignments in the limit. Does the principle of indifference satisfy the remaining axioms as the number of propositions approaches infinity?

Theorem 2 \(\lim_{n \to \infty} \frac{1}{n} = 0\).

The proof is rather simple. Given \(\epsilon > 0\), choose \(m > \frac{1}{\epsilon}\); then for any \(n \ge m\), \(n > \frac{1}{\epsilon}\), which implies \(\frac{1}{n} < \epsilon\), which satisfies \(|\frac{1}{n}-0|< \epsilon\) as required. Thus \(\lim_{n \to \infty} \frac{1}{n} = 0\). Zero is a legitimate probability assignment to a single proposition (we only require \(P(a_i) \ge 0\)), however it creates a problem in our prior reasoning: if every proposition in a countably infinite set is assigned this uniform probability, then \(P(a_i) = 0\) for all \(a_i \in a\), so \(\sum_{i=1}^{\infty}P(a_i)=0\), which A) violates the second axiom, and B) contradicts our previous result! How could this have happened?

We are tackling this problem with two different limiting processes, that is:

\[\lim_{n \to \infty}\sum_{i=1}^{n}\frac{1}{n} = 1 \not= \sum_{i=1}^{\infty}\lim_{n \to \infty}\frac{1}{n} = \sum_{i=1}^{\infty}0=0\]

We are resigned to the fact that the two limiting processes cannot be interchanged (the limit of the sums is not the sum of the limits). We could have encountered a similar result from a different perspective: if \(u\) is a probability assigned uniformly to every proposition in a countably infinite set, then \(\sum_{i=1}^{\infty}u\) is either infinite (if \(u > 0\)) or zero (if \(u = 0\)), neither of which satisfies the second axiom. These results are directly related to the fact that a computer cannot sample uniformly from a countably infinite set, but can easily sample from an arbitrarily large finite one.
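A small base-R sketch of the mismatch (only base functions, nothing from the rest of this post): each individual term \(\frac{1}{n}\) shrinks toward zero while the sum of \(n\) copies of \(\frac{1}{n}\) stays at exactly 1, and sample() happily draws uniformly from a large finite set even though there is no analogous call for all of \(\mathbb{N}\).

# Each term 1/n shrinks toward 0, but the sum of n copies of 1/n stays at 1.
for (n in 10^(1:6)) {
  cat('n =', n, ' term =', 1 / n, ' sum =', n * (1 / n), '\n')
}

# Sampling uniformly from an arbitrarily large finite set is easy...
sample(1e6, size = 5)   # 5 draws, uniform over {1, ..., 1e6}
# ...but there is no analogous way to sample uniformly from all of the naturals.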

One interesting question to ponder: if we cannot sample uniformly from a countably infinite set like the one described using a computer, how come we can sample from an uncountably infinite one such as a continuous uniform distribution defined on \([0,1]\)? If we can sample from a continuous uniform on \([0,1]\), can we sample just the rational numbers in this range? I have two conjectures that I want to explore in the future:

  1. If a set of numbers is bounded, then we can sample from a uniform distribution defined on it.

  2. If a set is a bounded interval (e.g., \([a,b]\) with \(a, b \in \mathbb{R}\)), then we can sample from a uniform distribution defined on it.

The reason I think 1. and 2. are plausible is that I believe having a cumulative distribution function (CDF) makes it possible to sample using a computer. Since I can sample a number uniformly on \([0,1]\) on a machine, if I have a CDF, I can feed the uniform sample into the inverse CDF to draw a sample from the corresponding distribution.
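As a minimal sketch of that inverse-CDF idea (the exponential distribution is used here purely as an example because its inverse CDF has a simple closed form; it is not tied to anything above):

# Inverse-CDF sampling sketch: push uniform draws on [0, 1] through an inverse CDF.
# Example: Exponential(rate) has CDF F(x) = 1 - exp(-rate * x),
# so its inverse CDF is F^{-1}(u) = -log(1 - u) / rate.
u <- runif(10000)          # uniform samples on [0, 1]
rate <- 2
x <- -log(1 - u) / rate    # exponential samples via the inverse CDF
mean(x)                    # should be close to 1 / rate = 0.5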

The reason I doubt 1., but think 2. is more plausible, is that the set \(\{1, \frac{1}{2}, \frac{1}{3}, \frac{1}{4},...\}\) is bounded below by \(0\) and above by \(1\), but is countably infinite; in fact it is just the set of reciprocals of the natural numbers. Perhaps it’s the space between the points that makes it impossible to sample, i.e. finiteness or bounded continuity is a requisite for sampling.

A quick visual of \(\{1, \frac{1}{2}, \frac{1}{3}, \frac{1}{4},...\}\):

require(tidyverse)
require(scales)

# Strip plot of the first 100 reciprocals 1/n
qplot(x = 1/(1:100), y = '') + 
  theme_minimal() + 
  labs(x = '1/n') +
  scale_y_discrete(name = NULL, labels = NULL, breaks = NULL)

The reason I think 2. might be too restrictive is the following set construction: start with \(\{\frac{0}{1},\frac{1}{1}\}\) and, at each iteration, insert between every pair of adjacent fractions \(\frac{p}{q}\) and \(\frac{r}{s}\) their mediant \(\frac{p+r}{q+s}\).

The first few iterations look like this:

  1. \(\{\frac{0}{1},\frac{1}{1}\}\)

  2. \(\{\frac{0}{1},\frac{1}{2},\frac{1}{1}\}\)

  3. \(\{\frac{0}{1},\frac{1}{3},\frac{1}{2},\frac{2}{3},\frac{1}{1}\}\)

  4. \(\{\frac{0}{1},\frac{1}{4},\frac{1}{3},\frac{2}{5},\frac{1}{2},\frac{3}{5},\frac{2}{3},\frac{3}{4},\frac{1}{1}\}\)

Let’s write a quick function to illustrate this set.

f <- function(n){
  # Start with the endpoints 0/1 and 1/1.
  numerators <- c(0, 1)
  denominators <- c(1, 1)
  new_numerators <- numeric(0)
  new_denominators <- numeric(0)
  
  for(i in 1:n){
    # Insert the mediant (p + r) / (q + s) between each pair of adjacent fractions.
    for(j in 1:(length(numerators) - 1)){
      new_numerators <- c(new_numerators, numerators[j] + numerators[j + 1])
      new_denominators <- c(new_denominators, denominators[j] + denominators[j + 1])
    }
    
    # Merge the new fractions with the old ones and sort by value.
    new_order <- order(c(numerators, new_numerators) / c(denominators, new_denominators))
    numerators <- c(numerators, new_numerators)[new_order]
    denominators <- c(denominators, new_denominators)[new_order]
    
    # Reset for the next iteration.
    new_numerators <- numeric(0)
    new_denominators <- numeric(0)
  }
  
  return(numerators / denominators)
}

# Strip plot of the set after 10 iterations of the construction
qplot(x = f(10), y = '') + 
  theme_minimal() + 
  labs(x = 'x') +
  scale_y_discrete(name = NULL, labels = NULL, breaks = NULL)

We can see an approximate CDF of this set if we run the construction for 10 iterations:

tibble(x = f(10), y = seq_along(x)/length(x)) %>%   # equal weight on each point
  ggplot() +
  geom_line(aes(x = x, y = y)) +
  theme_minimal() +
  scale_y_continuous(labels = percent)

We could also see how the approximate CDF changes with each iteration:

map_df(1:12, function(n){
  
  f(n) %>%
    tibble(n = n, x = .)
  
}) %>%
  group_by(n) %>%
  mutate(y = seq_along(x)/length(x)) %>%   # equal weight on each point within each iteration
  ggplot() +
  geom_line(aes(x = x, y = y)) +
  facet_wrap(~n) +
  theme_minimal() +
  scale_y_continuous(labels = percent)

Here we can see that a CDF, or at least an approximate CDF, exists. Again, if we assume that having a CDF is what we need in order to sample, then this looks like a candidate we can sample from that is neither finite nor a bounded continuous interval.
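Here is a rough sketch of what sampling from this candidate might look like. It assumes the f() defined above, puts equal weight on each point of f(10), and uses approxfun() to linearly interpolate the inverse of the approximate CDF, so it only approximates the idea rather than giving an exact sampler for the set.

# Approximate inverse-CDF sampling from the set f(10).
x  <- f(10)
Fx <- seq_along(x) / length(x)          # equal weight on each point
inv_cdf <- approxfun(Fx, x, rule = 2)   # linearly interpolated inverse CDF
draws <- inv_cdf(runif(1000))
qplot(draws, bins = 40) + theme_minimal() + labs(x = 'draws')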

That’s all for now. More on this later, also expect this post to get revised and refined!