Markov Chains, Mixing Times and Couplings

Say you’re shuffling a deck of cards. It’s poker night and your friends are watching with gimlet eyes to make sure you shuffle properly. Under pressure, you keep shuffling until your hands are tired. But how much is enough, really? How long does it really take to get a deck of cards from “fully ordered” to “random enough”?

One way to rigorously think about this question is to model the deck of cards as a Markov chain and try to find its mixing time.

A finite Markov chain consists of a set of states S and a set of transition probabilities defining the probability of jumping from one state to another. What state you jump to depends only on the current state and not anything that happened before. In this post, I’m going to skip quickly past the details of how Markov chains work, so you might find https://brilliant.org/wiki/markov-chains/ and https://brilliant.org/wiki/stationary-distributions/ useful if you haven’t come across them before.

In our deck of fifty-two cards, the set of states is the 52! different arrangements the cards might be in. Let’s say the way we shuffle is that we pick one of the cards at random, pull it out, and place it on top of the deck. Then, given a particular state, the transition probabilities would be 1/52 for each of the 52 possible states we could move to (counting remaining in the same state – ‘pick the top card and leave it on top’ – as one of them), and zero for all other possibilities (for a single step – using many steps we can of course get to all possible arrangements).

Intuitively, if you keep doing this shuffling move over and over, after a long time the deck is roughly equally likely to be in any of its 52! possible arrangements. This idea is formally known as convergence to a stationary distribution in the theory of Markov chains. The stationary distribution is defined by the property that if the chain starts in the stationary distribution, it remains in the stationary distribution for every timestep after. Given some technical constraints (satisfied in our case), a Markov chain has a unique stationary distribution, and it converges to it as time goes to infinity.

So far so good. But “it works with infinite shuffles” isn’t all that helpful. We need to know what the rate of convergence is like. If we started with a deck that was sorted by suit, you might worry that even if we did thousands of shuffles, some arrangements might still be a lot more likely than others. So the question really is – what kinds of deviations from perfect randomness are we willing to accept and how long will it take to get to a level we find acceptable?

Total Variation Distance

After a certain amount of shuffles, the deck will be in some probability distribution over the 52! different states. We would like to quantify how different this distribution is from the uniform distribution. A useful metric for defining distance between probability distributions is the total variation distance (TVD).

Let \pi_1 and \pi_2 be two probability distributions over a finite set S. The total variation distance between them is

|| \pi_1 - \pi_2 ||_{TV} : = \frac{1}{2} \sum_{s \in S} |\pi_1(s) - \pi_2(s) |

The factor of 1/2 is thrown in there because it gives us a nice alternative characterisation of TVD:

Let A denote an event, or a subset of the sample space S. Then we denote the probability of A under distribution \pi by \pi(A) = \sum_{s \in A} \pi(s). Then if we pick an event with maximum difference in probability between \pi_1 and \pi_2, that difference is exactly equal to TVD! (It’s pretty easy to prove this by noticing that this event must be the union of all elementary events s \in S with \pi_1(s) \geq \pi_2(s) or its complement)

|| \pi_1 - \pi_2 ||_{TV}  = \max_{A \subseteq S} |\pi_1(A) - \pi_2(A) |


Suppose we left the ace of spades on the bottom of the deck and shuffled the rest properly. How different is this distribution from a fully shuffled deck? In the first case, the event “Ace of Spades on bottom” (ie we count all such arrangements) has probability 1 and in the second case it has probability 1/52. So the TVD between these is at least 51/52, which is quite large! So we can be sure that if the TVD between the distribution when we stop and the uniform distribution is small, no such funny stuff can happen.

Mixing Times

We now need to introduce a lot of (sadly very necessary) notation. Bear with me for a moment.

Let (X_t ) := X_0, X_1, X_2, \dots be a Markov chain with finite state space S, transition matrix P and a unique stationary distribution \pi.
Let P^t_{x \rightarrow y} = \Pr ( X_t = y | X_0 = x), in other words the probability that when starting in state x, after t time steps (or shuffle moves), we are in state y.

Based on this let us use P^t_{x \rightarrow (\cdot)} to denote the probability distribution of the chain at time t given that we started in state x.

We define d_{TV_x} (t) = || P^t_{x \rightarrow (\cdot)} - \pi ||_{TV}. Now given a tolerance level \epsilon \in (0,1), the mixing time from state x is defined to be \tau_x(\epsilon) = \min\{ t | d_{TV_x} (t) \leq \epsilon \}. One can prove that d_{TV_x} (t) decreases monotonically, so if you run the chain for at least \tau_x(\epsilon) steps starting from x, you are guaranteed that the distribution is \epsilon-close to the stationary distribution.

Finally we define the overall mixing time of the chain \tau (\epsilon) := \max_{x \in S} \tau_x(\epsilon), since we don’t want to have to worry about where we started from.

Now that we have our formalism, we know the number of shuffles we need to get within say 1/100 (in TVD) of a fully randomized deck is \tau(1/100). But how do we actually compute this mixing time ?!

Couplings

One extremely cool way of finding mixing times is known as the coupling method. Couplings were invented in 1936 by the 21-year-old mathematician Wolfgang Doeblin, a man I am shocked I’d never heard of before writing this post. Wolfgang was born in Germany, but his Jewish family moved to France in 1933. Wolfgang joined the French army in 1938, the year his paper describing couplings was finally published.

1940 June 21. Four days before the suspension of arms [the Franco-German Armistice of 22 June 1940, which came into effect on 25 June], Doeblin loses
contact with his regiment on a mission to the small village Housseras in the
Vosges. Because he had grown up in Berlin, was a Jew, the son of Alfred
Doblin, and spoke French with a thick accent, Doeblin decided to die by his
own hand rather than give himself up to the Nazi troops that were just a few
minutes away.
Doeblin was decorated with two medals: la croix de guerre avec palme and
la medaille militaire. He is buried in Housseras.

W. Doeblin, 1915-1940
Torgny Lindvall

The Annals of Probability

A coupling of a Markov chain (M_t) with finite state space S and transition
matrix P is a Markov chain (Z_t) = (X_t , Y_t ) with state space S \times S such that:

\Pr(X_{t+1} = x' | X_t = x \wedge Y_t = y ) = P_{x,x'}

\Pr(Y_{t+1} = y' | X_t = x \wedge Y_t = y ) = P_{y,y'}

and X_t = Y_t implies X_{t+1} = Y_{t+1} .

(here P_{x,x'} is the original Markov chain’s probability of transitioning from x to x'.)

So the coupling contains two copies (X_t) and (Y_t) of the Markov chain (M_t), but these
do not necessarily evolve independently. They do however have the property (described in the equations above) that if you ignore one of them, the other looks exactly like the original Markov chain. Once they “couple”, in the sense that X_t = Y_t , they evolve
together, following the transition rules of (M_t). Note that (X_t ) and (Y_t ) are individually still faithful copies of (M_t ) even after they couple and start moving in synchrony.
One obvious coupling is to consider two copies of (M_t) operating independently until
coupling, so that for distinct states x and y, \Pr((X_{t+1} , Y_{t+1} ) = (x' , y' ) | (X_t , Y_t ) = (x, y)) = P_{x,x'} P_{y,y'}. In our card shuffling case, this would correspond to getting a second deck and doing one shuffling move to each deck per time step. This coupling would take a very long time to couple – there’s nothing forcing the two decks to have the same state.

However the whole point is to find a coupling that couples as quickly as possible. The reason this is useful is the following amazing result.

The Coupling Inequality

Irrespective of the state where each half of a coupling starts, the total variation distance between their probability distributions at any time is less than the probability they haven’t coupled by then !!

Let’s prove this. The proof is strikingly simple, you just have to slice up some probabilities.

Using the notation we defined above, we want to show that for any x and y (initial states) and for any time t,

|| P^t_{x \rightarrow (\cdot)} -  P^t_{y \rightarrow (\cdot)} ||_{TV} \leq \Pr( X_t \neq Y_t | X_0 = x \wedge Y_0 = y)

Let A be any subset of our sample space.

P^t_{x \rightarrow (\cdot)} (A) -  P^t_{y \rightarrow (\cdot)} (A) = \Pr( X_t \in A) - \Pr( Y_t \in A)

(Technically we should include ‘given that X_0 = x … ‘to those probabilities but I’m not going to write it since it doesn’t affect the argument.)

Now we consider two cases, either the chains have coupled by time t or not.

\Pr( X_t \in A) - \Pr( Y_t \in A) = \Pr( X_t \in A \wedge X_t = Y_t)  + \Pr( X_t \in A \wedge X_t \neq Y_t)  - \Pr( Y_t \in A \wedge X_t = Y_t) - \Pr( Y_t \in A \wedge X_t \neq Y_t)

Now observe that if X_t = Y_t, then X_t \in A if and only if Y_t \in A. So those probabilities are equal and they cancel out.

\Pr( X_t \in A) - \Pr( Y_t \in A) = \Pr( X_t \in A \wedge X_t \neq Y_t)  - \Pr( Y_t \in A \wedge X_t \neq Y_t)

\Pr( X_t \in A) - \Pr( Y_t \in A) \leq \Pr( X_t \in A \wedge X_t \neq Y_t) (we drop the negative term)

\implies \Pr( X_t \in A) - \Pr( Y_t \in A) \leq \Pr(  X_t \neq Y_t)!

So we have P^t_{x \rightarrow (\cdot)} (A) -  P^t_{y \rightarrow (\cdot)} (A) \leq  \Pr(  X_t \neq Y_t). By symmetry we can run the same argument to get P^t_{y \rightarrow (\cdot)} (A) -  P^t_{x \rightarrow (\cdot)} (A) \leq  \Pr(  X_t \neq Y_t), which means || P^t_{x \rightarrow (\cdot)} (A) -  P^t_{y \rightarrow (\cdot)} (A) || \leq  \Pr(  X_t \neq Y_t). Since this works for any subset A, even one of maximum difference in probabilities, by our alternative definition of total variation, this shows that

|| P^t_{x \rightarrow (\cdot)} -  P^t_{y \rightarrow (\cdot)} ||_{TV} \leq \Pr( X_t \neq Y_t | X_0 = x \wedge Y_0 = y), which is exactly what we wanted.

A nice modification is to pick y (the starting state of the second copy) from the stationary distribution \pi, which means P^t_{y \rightarrow (\cdot)} = \pi always, by the definition of the stationary distribution. Then our result is || P^t_{x \rightarrow (\cdot)} - \pi ||_{TV} \leq \Pr( X_t \neq Y_t | X_0 = x ), and since this works for every x, this means the mixing time \tau(\epsilon) \leq \min \{t | \Pr(X_t \neq Y_t) \leq \epsilon \}

Finding a good coupling for our decks

A natural idea to make our two decks reach the same arrangement quickly is to pick the same position j in each deck to bring to the top. Unfortunately, a little thought shows that if the two decks didn’t start in the same state, they will never couple this way!

Here’s a better idea: Let (X_t, Y_t) denote the states of the two decks.

  • Choose position j uniformly at random from {1, . . . , 52} and obtain X_{t+1} from X_t by moving the j’th card to the top. Call this card C.
  • Obtain Y_{t+1} from Y_t by moving card C to the top.

To see that it is a valid coupling, we have to show that (Y_t ) is a faithful copy of the
original chain. So consider any position k — let’s calculate the probability that the card at this position
gets moved to the top by the transition from Y_t to Y_{t+1} . Let C_0 be the card in position k in
Y_t . Let j be the position of C_0 in X_t. The probability that j is chosen is 1/52, so this is the
probability that position k is chosen in the transition from Y_t to Y_{t+1}, as required.

Now let’s see how long it takes to couple. Note that when a card C is moved to the top
of the deck in (X t ) and (Y t) it will always occupy the same position in both decks afterwards. So it is enough to make sure each card in the deck gets picked at least once. This type of ‘gotta catch them all’ problem is known as a Coupon Collector Problem.

Let’s generalize to n cards for a moment. The probability that a given card C has not
been chosen in t steps is at most (1-1/n)^t, which is always less than \leq \exp(-t/n). Adding up these probabilities across all n cards, the probability
that the chain has not coupled in t steps is at most n \exp(-t/n) = \exp(-(t/n)+\log n).

Setting T: = n (\log(1/\epsilon) + \log(n)) means that the probability of not coupling by time T is at most \epsilon.

Using the coupling inequality, this means that the mixing time \tau(\epsilon) is at most n(\log(1/\epsilon) + \log(n)).

The Final Answer

Let plug in some values. Suppose we decide being 1% off from random is okay. Plugging in \epsilon = 1/100 and n = 52 gives

\tau(1/100)  \leq  52(\log(100) + \log(52))  \approx 450 shuffles!

So now we know all that shuffling was worth it.

Better Shuffles

450 shuffles isn’t too bad, but it isn’t something I want to do every day. The shuffling method in this post was chosen for ease of analysis, not stylishness or speed. Can we analyse better shuffles? One interesting direction is riffle shuffles which interleave the cards.

This paper on Riffles by Persi Diaconis has the following intriguing quote

People used to think that cards were suitably well mixed after three, four or five
[riffle] shuffles. Like many things people believe, this is simply not true.

and looks very interesting (I haven’t read it completely).

Another interesting direction is the following simple extension of our basic shuffling method – which is what I actually use in practice

  • Pick j_1 < j_2 uniformly at random in {1…52}
  • Pull out the block of cards from j_1 to j_2 and place it on top.

Can you figure out if this method is actually faster than the one-card-at-a-time shuffle ? It definitely feels like it, but I don’t know how to prove it.

(Credits: Much of this post was drawn from Leslie Ann Goldberg’s excellent notes for Marc Roth’s Probability and Computing course)

Leave a Reply

Your email address will not be published. Required fields are marked *