Musings on missing mass

In quantum computing, a significant part of solving any given task is extracting the information from your quantum computer. This is because you don’t get to read out all of the information that is (theoretically) encoded in the quantum computer; rather, you can only read a stochastic output whose statistics are determined by the state of the quantum computer. Annoyingly, doing so also destroys the internal state of the computer, so you must repeat the procedure to extract more statistical information.1

This connects, conceptually, to the notion of entropy (of the probability kind): a distribution that’s very “flat” has high entropy, whereas a distribution that always gives the same outcome has minimal (zero) entropy. So, there is an outcome probability distribution associated with reading from a quantum state, that distribution has some entropy, and that entropy tells me (in some sense, but not absolutely) how informative my measurements are.

This is to say I’ve been thinking about entropy and distribution-free statistics.

Generally, if you measure \(n\) qubits (in, say, the computational basis), there are \(2^n\) possible outcomes, so performing \(t\) measurements and recording a histogram of which outcomes you observed (and how many times you observed them) gives you a multinomially distributed quantity. I.e., after \(t\) measurements, your histogram is itself a random variable: \[ X = (X_1, X_2, \ldots, X_{2^n}) \thicksim \mathrm{Multinomial}[t, p(1), \ldots, p({2^n})]. \]

(Above, \(p(i)\) denotes the probability of observing outcome \(i\).) Each \(X_i\) is the number of times you observed outcome \(i\) (for \(i=1,\ldots,2^n\)), so the counts are themselves random variables, but correlated ones: \(\sum_i X_i = t\). Despite this correlation, each count marginally behaves as a binomial random variable, which makes sense because, at each draw, you either did or didn’t see outcome \(i\).
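For concreteness, here is a small simulation of that setup. It’s only a sketch in Python/NumPy, and the toy distribution `p` (a Dirichlet sample over \(2^3\) outcomes) is a made-up stand-in for whatever your quantum computer actually outputs.

```python
import numpy as np

rng = np.random.default_rng(0)

n = 3                              # number of "qubits": 2**n possible outcomes
t = 50                             # number of measurements (shots)
p = rng.dirichlet(np.ones(2**n))   # some arbitrary outcome distribution

# A single histogram X = (X_1, ..., X_{2^n}) is one multinomial draw;
# the counts are correlated, since they must sum to t.
X = rng.multinomial(t, p)
assert X.sum() == t

# Marginally, each X_i behaves like a Binomial(t, p_i) variable:
trials = 20_000
X_many = rng.multinomial(t, p, size=trials)
print("empirical mean/var of X_1:", X_many[:, 0].mean(), X_many[:, 0].var())
print("Binomial(t, p_1) mean/var:", t * p[0], t * p[0] * (1 - p[0]))
```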

So, now, consider the question: what is the expected “amount of probability” that you will not observe? After a number of draws, you are likely to have seen the most probable outcomes. But there are also outcomes you’re unlikely to have seen, especially if your number of draws is significantly smaller than the number of possible outcomes.2 In other words, if one draws \(X\) from a multinomial distribution, as defined above, there is some “missing mass”:3 \[ M_0(t) = \sum_{i : X_i = 0} p(i). \]
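In code (again only a sketch, reusing the same kind of toy distribution), the missing mass of a single histogram is just the total probability of the outcomes with zero counts:

```python
import numpy as np

rng = np.random.default_rng(1)

n, t = 3, 10
p = rng.dirichlet(np.ones(2**n))   # toy outcome distribution

X = rng.multinomial(t, p)          # histogram after t draws
M0 = p[X == 0].sum()               # total probability of the unseen outcomes
print(f"unseen outcomes: {(X == 0).sum()} of {2**n}, missing mass M0(t) = {M0:.3f}")
```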

This missing mass \(M_0\) has some expected value with respect to your draw \(X\); you can use linearity of expectation to show that \(\mathbb{E}M_0(t) = \sum_{i=1}^{2^n} p(i) (1-p(i))^t\), but going beyond that point typically requires some assumption on the underlying distribution \(p\).
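A quick Monte Carlo check of that identity (a sketch; the particular `p`, `n` and `t` are arbitrary picks):

```python
import numpy as np

rng = np.random.default_rng(2)

n, t = 3, 10
p = rng.dirichlet(np.ones(2**n))

# Closed form from linearity of expectation: E[M0(t)] = sum_i p(i) (1 - p(i))^t
closed_form = np.sum(p * (1 - p) ** t)

# Monte Carlo: average the missing mass over many independent histograms
trials = 50_000
X = rng.multinomial(t, p, size=trials)              # shape (trials, 2**n)
M0_samples = np.where(X == 0, p, 0.0).sum(axis=1)   # missing mass of each draw
print(f"closed form : {closed_form:.4f}")
print(f"Monte Carlo : {M0_samples.mean():.4f} "
      f"(+/- {M0_samples.std(ddof=1) / np.sqrt(trials):.4f})")
```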

The “natural” estimators of \(M_0\) concentrate about \(M_0\) (up to some bias), and \(M_0\) itself concentrates about its expected value (McAllester and Ortiz 2003; Ben-Hamou, Boucheron, and Ohannessian 2017).4 So, one may reason about estimating the missing mass of a distribution by first reasoning about its expected missing mass.
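The estimator I have in mind here is the Good–Turing estimator, which estimates the missing mass by the number of outcomes seen exactly once (the singletons), divided by the number of draws. A toy comparison against the true missing mass (a sketch only, and certainly not a substitute for the concentration results in those papers):

```python
import numpy as np

rng = np.random.default_rng(3)

n, t = 6, 40
p = rng.dirichlet(np.ones(2**n))

trials = 5_000
X = rng.multinomial(t, p, size=trials)

true_M0 = np.where(X == 0, p, 0.0).sum(axis=1)   # actual missing mass of each draw
good_turing = (X == 1).sum(axis=1) / t           # number of singletons over t

print(f"true missing mass: {true_M0.mean():.3f} (std {true_M0.std():.3f})")
print(f"Good-Turing      : {good_turing.mean():.3f} (std {good_turing.std():.3f})")
```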

At the same time, conceptually and as I hinted at in the beginning of the post, it’s expected that the missing mass should connect with the entropy somehow. Berend and Kontorovich provided such a connection, back in 2017. Namely, they showed that \[ \mathbb{E} M_0(t) \leq \frac{H(p)}{\log t} \tag{♣️} \] where \(H(p) = - \sum_{i=1}^{2^n} p(i) \log p(i)\) is the entropy of \(p\). Furthermore, they claimed that this bound is “very tight”, in the sense that, for any constant choice of \(\alpha>1\), you can always find a probability distribution \(p\) that satisfies \[\mathbb{E}M_0(t) > \frac{H(p)}{\alpha \log t}.\]
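It’s straightforward to sanity-check (♣️) numerically on any particular example (a sketch, using natural logarithms throughout; this obviously verifies nothing in general):

```python
import numpy as np

rng = np.random.default_rng(4)

n = 5
p = rng.dirichlet(np.ones(2**n))   # arbitrary distribution over 2**n outcomes

H = -np.sum(p * np.log(p))         # entropy of p (natural log)

for t in (2, 10, 100, 1_000):
    expected_M0 = np.sum(p * (1 - p) ** t)   # E[M0(t)], exact
    print(f"t = {t:5d}:  E[M0] = {expected_M0:.4f}  <=  H/log t = {H / np.log(t):.4f}")
```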

If you think about these results a little, they are somewhat surprising, because they seem to contradict the intuition given at the beginning of the post. We know from other work (again Berend and Kontorovich, this time from 2012) that, in the worst case, the expected missing mass shrinks exponentially in \(t/{2^n}\) while \(t\) is smaller than \(2^n\), and then only as \(1/t\) after that: \[ \mathbb{E} M_0(t) \leq \begin{cases} e^{-t/{2^n}}, & t \leq 2^n, \\ \frac{2^n}{et}, & t \geq 2^n. \end{cases} \tag{♦️} \]
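As a quick sanity check on the first branch, take the uniform distribution over \(2^n\) outcomes (a natural suspect for the worst case). Plugging it into the exact expression for the expectation from before gives \[ \mathbb{E}M_0(t) = \sum_{i=1}^{2^n} 2^{-n}\left(1 - 2^{-n}\right)^t = \left(1 - 2^{-n}\right)^t \leq e^{-t/2^n}, \] using \(1 - x \leq e^{-x}\).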

This should mean (by the concentration arguments from before) that \[ t \approx \max[2^n \log \epsilon^{-1}, 2^n(e\epsilon)^{-1} ] \tag{♠️}\] samples are guaranteed to reduce the missing mass to (at most) \(\epsilon\). But, using (♣️), one finds that \[ t \approx e^{H(p)/\epsilon} \] samples are guaranteed to reduce the missing mass of a distribution with entropy \(H(p)\) to \(\epsilon\)! This means that both for the flattest possible distributions (\(H(p) \sim \log 2^n = n\)) and for the most peaked distributions (constant \(H(p)\)), the entropy-based inequality gives us a bound on the number of samples that is exponential in \(1/\epsilon\) (\(t \sim e^{1/\epsilon}\)). Compare this with (♠️), which is only linear in \(1/\epsilon\). Of course, this does not mean that the two bounds are incompatible; for example, only (♠️) depends on the size of the support (\(2^n\)). But one should still wonder why (♣️) is a tight bound, in the sense described before.

This is the surprising bit; the implication is that the distribution saturating the (♣️) bound is not the maximal-entropy distribution, but a distribution whose entropy shrinks asymptotically. This can be seen as follows: consider \(t \geq 2^n\). Then, from (♣️), there should be a distribution \(p^*\) with \[ \mathbb{E}M_0^*(t) \sim \frac{H(p^*)}{\log t}, \] while it is also true, from (♦️), that this value must satisfy \[ \mathbb{E}M_0^*(t) \leq \frac{2^n}{et}. \] Since \(t \geq 2^n\) and \(\log t / t\) is decreasing in this range, for any such \(t\) we can write \(\log t / t = an/2^n\) for some \(a \in (0,1]\). Therefore the saturating distribution has entropy \[ H(p^*) \lesssim a\frac{n}{e} \qquad \text{where} \qquad a \overset{t \to \infty}{\to} 0. \]
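To put (made-up) numbers on the comparison between (♠️) and (♣️), take \(n = 10\) qubits and \(\epsilon = 0.01\), with natural logarithms throughout (a throwaway calculation, nothing more):

```python
import numpy as np

n, eps = 10, 0.01                 # 10 "qubits", target missing mass 0.01
N = 2**n                          # support size

t_support = N / (np.e * eps)                  # (spade) bound, small-epsilon regime
t_entropy = np.exp(n * np.log(2) / eps)       # (club) bound with H(p) = log(2**n)

print(f"support size 2^n          : {N}")
print(f"samples via support bound : {t_support:.2e}")
print(f"samples via entropy bound : {t_entropy:.2e}")
```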

My interpretation of all this is that the most adversarial distribution is just peaked enough that you are likely to get the same outcome more than once, but flat enough that getting a repeated outcome doesn’t really tell you much, because the distribution is supported over most of its range. This makes for a particularly “bad” distribution, where you’d often draw the same outcome without learning much from it, thereby maximizing the number of samples needed to estimate the missing mass.5

I’d like to close this post with an attempt at a nerd snipe.6 I recently published a paper where I argued that \[ \lVert{\mathbf{\alpha}}\rVert_1 \equiv \sum_{i=1}^{2^n} \sqrt{p(i)} \] is the right figure-of-merit to be looking at when discussing missing mass questions (especially in a quantum context). The existence of bounds of the form of (♣️) shows that it is not unreasonable to look into the expected missing mass under a constraint on \(\lVert{\mathbf{\alpha}}\rVert_1\). And such bounds would have very immediate and interesting applications in the context of hybrid (classical+quantum agents) algorithms. Here’s some homework for any reader who arrived late:

Assume a distribution-free setting, and a probability distribution \(p\) with support of size \(n \in \mathbb{N}\). Take \(t \in \mathbb{N}\), and \(a \in [1, \sqrt{n}]\).

If \(p\) is promised to satisfy the constraint \[ \sum_{i=1,\ldots,n} \sqrt{p(i)} = a,\] is there a non-trivial bound on \[\sum_{i=1,\ldots,n} p(i) (1-p(i))^t?\]
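If you’d like to build some intuition before (or instead of) attacking the question analytically, here is one crude numerical probe I’d start with (a sketch of mine, scanning an ad-hoc one-parameter family of distributions; it does not answer the question):

```python
import numpy as np

n, t = 64, 32

# Ad-hoc family: one atom of mass q, the rest spread uniformly over the other n-1 outcomes.
for q in (0.999, 0.9, 0.5, 0.1, 1.0 / n):
    p = np.full(n, (1.0 - q) / (n - 1))
    p[0] = q
    a = np.sum(np.sqrt(p))                   # the constrained quantity, sum_i sqrt(p(i))
    expected_M0 = np.sum(p * (1 - p) ** t)   # E[M0(t)] = sum_i p(i) (1 - p(i))^t
    print(f"a = {a:6.3f}   E[M0(t)] = {expected_M0:.4f}")
```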


  1. In reality, the consequences of read-out are more subtle, since a suitable choice of read-out may preserve some internal state. See, e.g., gentle measurements.

  2. As is usually the case when your number of possible outcomes scales exponentially. 

  3. “Missing mass” is the actual technical term. 

  4. At least, I think this is the case. It has certainly been known for a long time that estimators of the missing mass will concentrate below the missing mass. Arguing that they also concentrate above the missing mass (thus, that they concentrate about the missing mass) is apparently less obvious. I believe that Ben-Hamou et al. (2017) show that the estimators should indeed concentrate about the expectation of the missing mass (give or take some bias). But already much earlier, in 2012, Berend and Kontorovich were stating that “both the missing mass and its estimate are tightly concentrated about their expectations”. (The expectation of the estimator is the expectation of the missing mass, give or take some bias.) So: consider reading the primary sources and drawing your own conclusions if you’re going to rely on this fact.

  5. But your guess is as good as mine, and I’d love to hear your take on it. Email me at [email protected]

  6. And also the reason why I looked into these questions in the first place.