# Markov Chains, Mixing Times and Couplings

Say you’re shuffling a deck of cards. It’s poker night and your friends are watching with gimlet eyes to make sure you shuffle properly. Under pressure, you keep shuffling until your hands are tired. But how much is enough, really? How long does it really take to get a deck of cards from “fully ordered” to “random enough”?

One way to rigorously think about this question is to model the deck of cards as a Markov chain and try to find its mixing time.

A finite Markov chain consists of a set of states and a set of transition probabilities defining the probability of jumping from one state to another. What state you jump to depends only on the current state and not anything that happened before. In this post, I’m going to skip quickly past the details of how Markov chains work, so you might find https://brilliant.org/wiki/markov-chains/ and https://brilliant.org/wiki/stationary-distributions/ useful if you haven’t come across them before.

In our deck of fifty-two cards, the set of states is the 52! different arrangements the cards might be in. Let’s say the way we shuffle is that we pick one of the cards at random, pull it out, and place it on top of the deck. Then, given a particular state, the transition probabilities would be 1/52 for each of the 52 possible states we could move to (counting remaining in the same state – ‘pick the top card and leave it on top’ – as one of them), and zero for all other possibilities (for a single step – using many steps we can of course get to all possible arrangements).

Intuitively, if you keep doing this shuffling move over and over, after a long time the deck is roughly equally likely to be in any of its 52! possible arrangements. This idea is formally known as convergence to a stationary distribution in the theory of Markov chains. The stationary distribution is defined by the property that if the chain starts in the stationary distribution, it remains in the stationary distribution for every timestep after. Given some technical constraints (satisfied in our case), a Markov chain has a unique stationary distribution, and it converges to it as time goes to infinity.

So far so good. But “it works with infinite shuffles” isn’t all that helpful. We need to know what the rate of convergence is like. If we started with a deck that was sorted by suit, you might worry that even if we did thousands of shuffles, some arrangements might still be a lot more likely than others. So the question really is – what kinds of deviations from perfect randomness are we willing to accept and how long will it take to get to a level we find acceptable?

### Total Variation Distance

After a certain amount of shuffles, the deck will be in some probability distribution over the 52! different states. We would like to quantify how different this distribution is from the uniform distribution. A useful metric for defining distance between probability distributions is the total variation distance (TVD).

Let and be two probability distributions over a finite set . The total variation distance between them is

The factor of 1/2 is thrown in there because it gives us a nice alternative characterisation of TVD:

Let denote an event, or a subset of the sample space . Then we denote the probability of under distribution by . Then if we pick an event with maximum difference in probability between and , that difference is exactly equal to TVD! (It’s pretty easy to prove this by noticing that this event must be the union of all elementary events with or its complement)

Suppose we left the ace of spades on the bottom of the deck and shuffled the rest properly. How different is this distribution from a fully shuffled deck? In the first case, the event “Ace of Spades on bottom” (ie we count all such arrangements) has probability 1 and in the second case it has probability 1/52. So the TVD between these is at least 51/52, which is quite large! So we can be sure that if the TVD between the distribution when we stop and the uniform distribution is small, no such funny stuff can happen.

### Mixing Times

We now need to introduce a lot of (sadly very necessary) notation. Bear with me for a moment.

Let be a Markov chain with finite state space , transition matrix and a unique stationary distribution .
Let , in other words the probability that when starting in state , after time steps (or shuffle moves), we are in state .

Based on this let us use to denote the probability distribution of the chain at time given that we started in state .

We define . Now given a tolerance level , the mixing time from state is defined to be . One can prove that decreases monotonically, so if you run the chain for at least steps starting from , you are guaranteed that the distribution is -close to the stationary distribution.

Finally we define the overall mixing time of the chain , since we don’t want to have to worry about where we started from.

Now that we have our formalism, we know the number of shuffles we need to get within say 1/100 (in TVD) of a fully randomized deck is . But how do we actually compute this mixing time ?!

### Couplings

One extremely cool way of finding mixing times is known as the coupling method. Couplings were invented in 1936 by the 21-year-old mathematician Wolfgang Doeblin, a man I am shocked I’d never heard of before writing this post. Wolfgang was born in Germany, but his Jewish family moved to France in 1933. Wolfgang joined the French army in 1938, the year his paper describing couplings was finally published.

1940 June 21. Four days before the suspension of arms [the Franco-German Armistice of 22 June 1940, which came into effect on 25 June], Doeblin loses
contact with his regiment on a mission to the small village Housseras in the
Vosges. Because he had grown up in Berlin, was a Jew, the son of Alfred
Doblin, and spoke French with a thick accent, Doeblin decided to die by his
own hand rather than give himself up to the Nazi troops that were just a few
minutes away.
Doeblin was decorated with two medals: la croix de guerre avec palme and
la medaille militaire. He is buried in Housseras.

W. Doeblin, 1915-1940
Torgny Lindvall

The Annals of Probability

A coupling of a Markov chain with finite state space and transition
matrix is a Markov chain with state space such that:

and implies .

(here is the original Markov chain’s probability of transitioning from to .)

So the coupling contains two copies and of the Markov chain , but these
do not necessarily evolve independently. They do however have the property (described in the equations above) that if you ignore one of them, the other looks exactly like the original Markov chain. Once they “couple”, in the sense that , they evolve
together, following the transition rules of . Note that and are individually still faithful copies of even after they couple and start moving in synchrony.
One obvious coupling is to consider two copies of operating independently until
coupling, so that for distinct states and , . In our card shuffling case, this would correspond to getting a second deck and doing one shuffling move to each deck per time step. This coupling would take a very long time to couple – there’s nothing forcing the two decks to have the same state.

However the whole point is to find a coupling that couples as quickly as possible. The reason this is useful is the following amazing result.

### The Coupling Inequality

Irrespective of the state where each half of a coupling starts, the total variation distance between their probability distributions at any time is less than the probability they haven’t coupled by then !!

Let’s prove this. The proof is strikingly simple, you just have to slice up some probabilities.

Using the notation we defined above, we want to show that for any and (initial states) and for any time ,

Let be any subset of our sample space.

(Technically we should include ‘given that … ‘to those probabilities but I’m not going to write it since it doesn’t affect the argument.)

Now we consider two cases, either the chains have coupled by time or not.

Now observe that if , then if and only if . So those probabilities are equal and they cancel out.

(we drop the negative term)

!

So we have . By symmetry we can run the same argument to get , which means . Since this works for any subset , even one of maximum difference in probabilities, by our alternative definition of total variation, this shows that

, which is exactly what we wanted.

A nice modification is to pick (the starting state of the second copy) from the stationary distribution , which means always, by the definition of the stationary distribution. Then our result is , and since this works for every , this means the mixing time

### Finding a good coupling for our decks

A natural idea to make our two decks reach the same arrangement quickly is to pick the same position in each deck to bring to the top. Unfortunately, a little thought shows that if the two decks didn’t start in the same state, they will never couple this way!

Here’s a better idea: Let denote the states of the two decks.

• Choose position uniformly at random from {1, . . . , 52} and obtain from by moving the j’th card to the top. Call this card C.
• Obtain from by moving card C to the top.

To see that it is a valid coupling, we have to show that is a faithful copy of the
original chain. So consider any position — let’s calculate the probability that the card at this position
gets moved to the top by the transition from to . Let be the card in position in
. Let be the position of in . The probability that is chosen is 1/52, so this is the
probability that position is chosen in the transition from to , as required.

Now let’s see how long it takes to couple. Note that when a card C is moved to the top
of the deck in and it will always occupy the same position in both decks afterwards. So it is enough to make sure each card in the deck gets picked at least once. This type of ‘gotta catch them all’ problem is known as a Coupon Collector Problem.

Let’s generalize to cards for a moment. The probability that a given card C has not
been chosen in steps is at most , which is always less than . Adding up these probabilities across all cards, the probability
that the chain has not coupled in steps is at most .

Setting means that the probability of not coupling by time is at most .

Using the coupling inequality, this means that the mixing time is at most .

Let plug in some values. Suppose we decide being 1% off from random is okay. Plugging in and gives

shuffles!

So now we know all that shuffling was worth it.

### Better Shuffles

450 shuffles isn’t too bad, but it isn’t something I want to do every day. The shuffling method in this post was chosen for ease of analysis, not stylishness or speed. Can we analyse better shuffles? One interesting direction is riffle shuffles which interleave the cards.

This paper on Riffles by Persi Diaconis has the following intriguing quote

People used to think that cards were suitably well mixed after three, four or five
[riffle] shuffles. Like many things people believe, this is simply not true.

and looks very interesting (I haven’t read it completely).

Another interesting direction is the following simple extension of our basic shuffling method – which is what I actually use in practice

• Pick uniformly at random in {1…52}
• Pull out the block of cards from to and place it on top.

Can you figure out if this method is actually faster than the one-card-at-a-time shuffle ? It definitely feels like it, but I don’t know how to prove it.

(Credits: Much of this post was drawn from Leslie Ann Goldberg’s excellent notes for Marc Roth’s Probability and Computing course)