As part of some ongoing work, my colleague Duarte Magano and I were interested in calculating the quantum query complexity lower bounds for total Boolean functions, in the context of being allowed to make some number of queries in parallel. We believed this to be a well-known bound, but it turns out that, while the bound seems to be generally accepted in the community, the proposed proof we could find did not hold up: using a particularly elegant (and well-known) method of Andris Ambainis1, the “adversary method”, we could easily generate bounds — but not the proposed one. This led me down a week-long rabbit hole of trying to write a computer program that would brute force the bounds generated by the adversary method in useful time.
First, though, here’s a short glossary in case you’re unfamiliar with some of these concepts:
Quantum query complexity: the term is actually very self-explanatory, because it precisely refers to the complexity of a problem, as measured by how many queries you must make to its input (equivalently, how much of the input you must read), when you’re allowed to use a quantum computer. Query complexity is typically an “easier” measure of complexity to study, because it allows you to forgo any considerations of how much work your intermediate calculations take, and instead focus on whether you can get the answer to your problem with fewer reads of the input.
Boolean functions: these are just any functions that take a number of Boolean values — \(N\), for our purposes here; “bigger” problems will correspond to functions that take more inputs — and produce a Boolean output. “Total Boolean functions”, then, are a particular kind of Boolean function, namely those that accept any combination of inputs — because, note, this isn’t necessarily a given. Consider, for example, a function that outputs whether the given inputs were all-false or all-true; it makes no sense to give this function a mixture of true and false inputs. So, a total Boolean function is a Boolean function that accepts any combination of input values. If you’re mathematically inclined, you might write these as \(f:\{0,1\}^N\to\{0,1\}\). Symmetric Boolean functions are special total Boolean functions for which the order of the inputs doesn’t matter; it follows that they must be functions of how many “true” inputs there are.
Lower bounds (I expect you to be well acquainted with the notion if you’re reading this, but just in case you are simply very curious): whenever you find an algorithm that solves a particular problem with a particular scaling — say, for \(N\) inputs you need to read \(\log N\) of them to get your answer — you’ve shown that you can solve the problem at least that efficiently. But you haven’t shown that you can’t do any better. For that reason, such a bound is known as an upper bound on the (in this case, query) complexity of the problem. It turns out that you may also be able to prove, mathematically, the complementary statement, i.e., that it’s not possible to do better than some limit.2 This second situation is, then, described as a lower bound on your computational complexity.
As mentioned above, Andris Ambainis gave, in the year 2000, a method for calculating lower bounds on the quantum query complexity of total Boolean functions.3 This bound was later improved into what is known as the “general adversary method”,4 while many other adversary methods appeared in the following years. These were all eventually proven to be equivalent by Spalek and Szegedy,5 so we may concern ourselves with just Ambainis’s method. For simplicity, we can furthermore look only at the simple adversary method, which is much easier to state, and follows as a simplification of the general method (at the cost of possibly underestimating the bound6):
Consider a subset of the inputs producing output “false”, and a subset of the inputs producing output “true”: \(A\) and \(B\), respectively. Consider also some relation between these two sets, i.e., a set of pairs \((x\in A,y\in B)\). Then, calculate the following quantities:
- The minimum number of relations an element in \(A\) has (call it \(m\)),
- The minimum number of relations an element in \(B\) has (call it \(m'\)),
- A quantity \(l_\mbox{max}\), related to the input pair \((x\in A,y\in B)\) that maximizes the product of the number of relations each has with inputs differing by at least one element; \(l_\mbox{max}\) is this product.
Then, any quantum algorithm reads at least \(\sqrt{m\,m'/l_\mbox{max}}\) inputs to calculate the output to the problem.
The idea of this method is that it somewhat relates to the notion of inputs that are similar, but that produce different outputs: if you can find such a pair of inputs, you’ll have to (worst-case scenario) almost fully characterize the input to distinguish between the two and produce the right output.7
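To see the numbers in action, here’s the classic sanity check (textbook material, not specific to our problem): let \(f\) be the \(N\)-bit OR function. Take \(A\) to contain only the all-false input, \(B\) to contain the inputs with exactly one “true” entry, and relate the all-false input to every element of \(B\). The single element of \(A\) then has \(N\) relations, every element of \(B\) has exactly one, and any related pair differs in exactly one input, at which each of the two has exactly one relation; so

\[
m = N,\qquad m' = 1,\qquad l_\mbox{max} = 1
\qquad\Longrightarrow\qquad
\sqrt{\frac{m\,m'}{l_\mbox{max}}} = \sqrt{N},
\]

matching the scaling of Grover’s search.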
The situation where you are allowed to make \(p\) queries in parallel has a surprisingly similar statement, due to Stacey Jeffery, Frédéric Magniez, and Ronald de Wolf:8 9
Consider a subset of the inputs producing output “false”, and a subset of the inputs producing output “true”: \(A\) and \(B\), respectively. Consider also some relation between these two sets, i.e., a set of pairs \((x\in A,y\in B)\). Then, calculate the following quantities:
- The minimum number of relations an element in \(A\) has (call it \(m\)),
- The minimum number of relations an element in \(B\) has (call it \(m'\)),
- A quantity \(l_\mbox{max}\), related to the input pair \((x\in A,y\in B)\) that maximizes the product of the number of relations each has with inputs differing by at least \(p\) elements; \(l_\mbox{max}\) is this product.
Then, any quantum algorithm allowed to make \(p\) queries in parallel reads at least \(\sqrt{m\,m'/l_\mbox{max}}\) inputs to calculate the output to the problem.
Given the simplicity of these statements, it would seem straightforward to brute-force search for the inputs that yield \(m\), \(m'\), and \(l_\mbox{max}\). The problem is how many inputs (and pairs of inputs) you need to look at. If we’re considering functions that take \(N\) Boolean values, that means there are \(2^N\) possible inputs — and since you need to examine all pairs of inputs, there are actually \(2^{2N}\) pairs to look at. And if you don’t want your memory usage to also grow exponentially, you’ll have to iterate over these pairs twice, which leaves you with a cool \(2^{2N+1}\) iterations to run over.10
Why would it be interesting to brute-force these bounds? Essentially to test possible relations, or to proof-check a proposed lower bound. Here’s a practical example: consider the \(k\)-threshold Boolean function, which evaluates to “true” if at least \(k\) of the inputs are “true”. By choosing \(A=\{x:{|x|=k}\}\)11, \(B=\{y:{|y|=k+1}\}\), and the relation “\(x\) relates to \(y\) if they differ only by one input”, you might think that \(l_\mbox{max}=p\),12 but brute forcing the problem would yield an instance where \(l_\mbox{max}=p(p-1)/4\).
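For the sequential case (\(p=1\)), this choice gives numbers you can check by hand: every \(x\in A\) relates to \(N-k\) elements of \(B\) (flip any of its \(N-k\) “false” inputs), every \(y\in B\) relates to \(k+1\) elements of \(A\) (flip any of its \(k+1\) “true” inputs), and any single differing input contributes exactly one relation on each side, so

\[
m = N-k,\qquad m' = k+1,\qquad l_\mbox{max} = 1
\qquad\Longrightarrow\qquad
\sqrt{\frac{m\,m'}{l_\mbox{max}}} = \sqrt{(N-k)(k+1)}.
\]

It’s the parallel case, where \(l_\mbox{max}\) stops being trivial, that calls for the brute force.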
I’ll be completely honest with you: I didn’t do this, because I was too excited to try my hand at some high-performance computing.13 But, had I been less imprudent, the first thing to do would have been to try the search in Python. Taking the example from the previous section, that would have been something like the following:
from math import inf

# `windows` yields every n-bit mask with exactly p bits set
from exercise_for_the_reader_smiley_face import windows

n = ...
k = ...
p = ...

def in_set_a(x):
    # int.bit_count() requires Python >= 3.10; use bin(x).count("1") otherwise
    return x.bit_count() == k

def in_set_b(y):
    return y.bit_count() == k + 1

def related(x, y):
    return (x ^ y).bit_count() == 1

m = inf
m_prime = inf
l_max = 0

for window in windows(of_size=n, with_weight=p):
    l = 0
    l_prime = 0
    # Look at all the `x`s, find the `x` with the fewest relations, and
    # the `x` with the most relations under the window condition
    for x in range(2**n):
        if not in_set_a(x):
            continue
        relations_count = 0
        l_candidate = 0
        for y in range(2**n):
            if not in_set_b(y):
                continue
            if not related(x, y):
                continue
            relations_count += 1
            if (x & window) != (y & window):
                l_candidate += 1
        # This `x` has fewer relations than any `x` we've looked at so far
        if relations_count < m:
            m = relations_count
        # This `x` has the most relations under the window condition so far
        if l_candidate > l:
            l = l_candidate
    # Now the same, with the roles of `x` and `y` swapped
    for y in range(2**n):
        if not in_set_b(y):
            continue
        relations_count = 0
        l_prime_candidate = 0
        for x in range(2**n):
            if not in_set_a(x):
                continue
            if not related(x, y):
                continue
            relations_count += 1
            if (x & window) != (y & window):
                l_prime_candidate += 1
        # This `y` has fewer relations than any `y` we've looked at so far
        if relations_count < m_prime:
            m_prime = relations_count
        if l_prime_candidate > l_prime:
            l_prime = l_prime_candidate
    # Finally, see if this window yields a larger `l_max` than
    # what we've seen so far
    l_max_candidate = l * l_prime
    if l_max_candidate > l_max:
        l_max = l_max_candidate
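In case you’d rather not do the smiley-faced exercise, here’s one way to enumerate those windows (all n-bit masks with exactly p bits set), as a sketch in Rust, since that’s where we’re headed anyway; the function is hypothetical, mirroring the Python helper above, and uses the classic “Gosper’s hack”:

// Enumerate all n-bit masks with exactly p bits set, in increasing order,
// using Gosper's hack. (A sketch; assumes 1 <= p <= n < 64.)
fn windows(n: u32, p: u32) -> impl Iterator<Item = u64> {
    let mut mask = (1_u64 << p) - 1; // the smallest mask with p bits set
    let limit = 1_u64 << n;
    std::iter::from_fn(move || {
        if mask >= limit {
            return None;
        }
        let current = mask;
        let c = mask & mask.wrapping_neg(); // lowest set bit
        let r = mask + c; // ripple the lowest run of 1s upwards
        mask = (((r ^ current) >> 2) / c) | r; // shuffle the remainder back down
        Some(current)
    })
}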
We can immediately optimize some of this away by focusing on symmetric Boolean functions: for these, you can focus on a single window, selecting the first \(p\) entries of the inputs.14 This lets us get rid of the outer loop, and reduces the number of iterations from \(2^{2N+1}\binom{N}{p}\) to the promised \(2^{2N+1}\), which is a very big improvement.
Then we pull up hyperfine to time this for \(N=16\), \(k=6\), \(p=2\), rather conservatively small parameters, and…
Benchmark 1: python pure.py
Time (mean ± σ): 291.036 s ± 10.656 s [User: 290.916 s, System: 0.021 s]
Range (min … max): 281.132 s … 318.006 s 10 runs
…which isn’t that bad at all, when considering that there are about \(10^9\) iterations, plus some work for each one to be dispatched! But this will quickly get out of control if we wish to try larger \(N\), due to the exponential nature of the problem.15 16
Alright, time to make this go fast.
I don’t know a lot about high-performance computing, seeing as I’m a physicist, so I follow these guidelines:

1. Write the performance-critical code in a language that compiles down to native instructions.
2. Keep data contiguous and close together in memory, rather than chasing pointers around.
3. Don’t allocate memory in hot loops; allocate what you need up front, and re-use it.
4. Keep your working memory compact, so that it plays nicely with the CPU’s caches.
Following guideline 1, I decided to use Rust (which is what I’m comfortable with in terms of low-level languages). To avoid, however, spending unnecessary time and mental effort on things that didn’t need to go fast, I decided to write the brute-force search code as a Python module. Using PyO3, this is an absolute breeze, amounting to a few annotations and some extra types to interface with the Python layer.
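For reference, the Cargo side of this wiring is minimal. Something like the following, where the version number is merely indicative (check the PyO3 documentation for your setup):

# Cargo.toml (sketch)
[lib]
name = "adversary"
crate-type = ["cdylib"] # build a shared library that Python can import

[dependencies]
pyo3 = { version = "0.19", features = ["extension-module"] }

The module definition itself then amounts to the following skeleton.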
#[pymodule]
fn adversary(_py: Python, m: &PyModule) -> PyResult<()> {
m.add_class::<BoundsResult>()?;
m.add_class::<Prover>()?;
Ok(())
}
#[pyclass]
struct Prover {
/* TODO */
}
#[pyclass]
struct BoundsResult {
/* TODO */
}
#[pymethods]
impl Prover {
    #[new]
    pub fn py_new() -> PyResult<Self> {
        todo!() // will eventually return Ok(Prover { ... })
    }

    fn find_bounds(&self, n: u32, k: u32, p: u32) -> PyResult<BoundsResult> {
        todo!() // will eventually return Ok(BoundsResult { ... })
    }
}
Now, technically, we could write an almost direct analogue of the Python code in Rust, but this would require re-compiling the program for every different choice of the sets \(A\) and \(B\), or of the relation between the two, which isn’t ideal, and goes against the flexibility I’m going for when developing this as a Python module. On the other hand, you could pass functions in_set_a, in_set_b, and related (as in the Python example) as arguments to the Rust call, but yielding the Rust execution to the Python layer, especially as often as once per iteration, would be unnecessarily (and excruciatingly) slow.17 The fact is that we don’t need all the expressibility of Python to define our sets and relations, and by exploiting those constraints we can remain flexible as to their definition, without sacrificing the near-metal speed.
One way to guarantee these constraints (so that we may then make use of them) is by defining a micro programming language18 that embodies them. Keeping in mind that we want to be able to write, essentially, Boolean formulas and arithmetic expressions, I defined the following parsing expression grammar for my language:19
boolean_expression = { "(" ~ (binary_boolean_conjunction | unary_boolean_conjunction | comparison_expression) ~ ")" | binary_boolean_conjunction | unary_boolean_conjunction | comparison_expression}
binary_boolean_conjunction = { binary_boolean_operator ~ boolean_expression ~ boolean_expression }
binary_boolean_operator = { "and" | "or" | "xor" }
unary_boolean_conjunction = { boolean_unary_operator ~ boolean_expression }
boolean_unary_operator = { "not" }
comparison_expression = { comparison_operator ~ arithmetic_operand ~ arithmetic_operand }
comparison_operator = { ">=" | "<=" | "!=" | ">" | "<" | "=" }
arithmetic_operand = { arithmetic_expression | number_literal }
arithmetic_expression = { "(" ~ (binary_arithmetic_conjunction | unary_arithmetic_conjunction | variable) ~ ")" | binary_arithmetic_conjunction | unary_arithmetic_conjunction | variable }
binary_arithmetic_conjunction = { binary_arithmetic_operator ~ arithmetic_operand ~ arithmetic_operand }
binary_arithmetic_operator = { "*" | "/" | "+" | "-" | "^" | "pow" }
unary_arithmetic_conjunction = { unary_arithmetic_operator ~ arithmetic_operand }
unary_arithmetic_operator = { "neg" | "ham" | "sqrt" }
variable = { "x" | "y" | "n" | "p" | "k" }
number_literal = { ('0'..'9')+ }
This allows us to write something like < (ham x) (+ ham y 2) to denote \(|x|<(|y|+2)\), or, to keep in line with the example we’ve been using, the set \(A\) as = (ham x) k, the set \(B\) as = (ham y) (+ k 1), and the relation between the two as = ham (^ x y) 1.20
To parse an expression in this grammar into something the computer can understand (and that we can work with programmatically), we need a parser; in my case I used pest, whose parsing expression grammar notation you already saw above. Given such a grammar, this crate produces the necessary code to provide a parse function that transforms an incoming string into a series of “pairs”, denoting which string slices relate to which grammar elements, or, in case the input is malformed, diagnostics on the errors, indicating where parsing failed and what was expected or unexpected in the input. With a little more work, I further processed this result to define the following function…
pub fn parse_relation(expression: &str) -> Result<ast::BooleanExpression, String> {
    let pest_result = ExpressionParser::parse(Rule::relation, expression);
    let parse_result =
        process_pest(pest_result).map_err(|e| parse_error_to_human(expression, e))?;
    Ok(parse_boolean_expression_ast(parse_result))
}
…which either returns the root element of an abstract syntax tree21, or a human-friendly message, pointing out where the error in the expression is.
For example…
match parse_relation("< ham x cheese") {
Err(e) => {
println!("{}", e);
}
Ok(ast) => {
println!("{:?}", ast);
}
}
…prints…
Invalid expression:
In l.1: < ham x cheese
^
Expected arithmetic value.
…but…
match parse_relation("< ham x 3") {
Err(e) => {
println!("{}", e);
}
Ok(ast) => {
println!("{:?}", ast)
}
}
…prints…
ComparisonConjunction(
ComparisonConjunction {
operator: LessThan,
left_operand: Expression(
UnaryArithmeticConjunction(
UnaryArithmeticConjunction {
operator: Ham,
operand: Expression(
Variable(
X,
),
),
},
),
),
right_operand: Literal(
3,
),
},
)
Now, we need a way to crunch these ASTs into numbers in our loops.
We could evaluate the AST directly once we have specific values for \(N\), \(p\), and \(k\), but, per guideline 2, this would be slow, as each element of the AST lives sort of wherever in memory, and you’d need to skip around.22 So, the next well-established step in evaluating your ASTs as fast as possible23 (barring compilation) is producing instructions for a virtual machine, and then executing those instructions sequentially whenever you want to evaluate the program.24
enum Bytecode {
UnaryBooleanOperator(ast::UnaryBooleanOperator),
BinaryArithmeticOperator(ast::BinaryArithmeticOperator),
ComparisonOperator(ast::ComparisonOperator),
UnaryArithmeticOperator(ast::UnaryArithmeticOperator),
BinaryBooleanOperator(ast::BinaryBooleanOperator),
Variable(ast::Variable),
Literal(i64),
}
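The compiler itself isn’t shown in full here, but it can be as simple as a pre-order walk of the AST that flattens it into this bytecode, which is presumably what the compile_boolean call in the test further down amounts to. A sketch, where the exact ast variant names are my assumption (based on the debug output above), and I assume the small operator enums derive Copy:

// Flatten the AST into Polish-notation bytecode with a pre-order walk:
// emit the operator first, then each of its operands, recursively.
fn compile_boolean(expr: ast::BooleanExpression) -> VmCode {
    let mut code = Vec::new();
    emit_boolean(&expr, &mut code);
    VmCode { code } // the VmCode wrapper is introduced just below
}

fn emit_boolean(expr: &ast::BooleanExpression, code: &mut Vec<Bytecode>) {
    match expr {
        ast::BooleanExpression::ComparisonConjunction(c) => {
            code.push(Bytecode::ComparisonOperator(c.operator));
            emit_arithmetic(&c.left_operand, code);
            emit_arithmetic(&c.right_operand, code);
        }
        ast::BooleanExpression::BinaryBooleanConjunction(c) => {
            code.push(Bytecode::BinaryBooleanOperator(c.operator));
            emit_boolean(&c.left_operand, code);
            emit_boolean(&c.right_operand, code);
        }
        _ => { /* unary conjunctions follow the same pattern */ }
    }
}

fn emit_arithmetic(operand: &ast::ArithmeticOperand, code: &mut Vec<Bytecode>) {
    match operand {
        ast::ArithmeticOperand::Literal(l) => code.push(Bytecode::Literal(*l)),
        _ => { /* variables and nested arithmetic expressions, same pre-order pattern */ }
    }
}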
Since our grammar just defines Boolean and arithmetic expressions, bytecode laid out in Polish notation is straightforwardly evaluated (as you’ll see below). So, for example, in this virtual machine, the program and (and (< x 1) (!= 1 y)) (< ham x k) compiles nicely into the short bytecode that follows.
BinaryBooleanOperator(And)
BinaryBooleanOperator(And)
ComparisonOperator(LessThan)
Variable(X)
Literal(1)
ComparisonOperator(NotEqual)
Literal(1)
Variable(Y)
ComparisonOperator(LessThan)
UnaryArithmeticOperator(Ham)
Variable(X)
Variable(K)
Now to construct the actual virtual machine: we’ll be dealing with Boolean and arithmetic (integer and floating-point) values, so let’s go ahead and give our VM a stack for each type, as well as registers to hold the particular values of \(x\), \(y\), \(N\), \(k\), and \(p\).
enum ArithmeticValue {
Integer(i64),
Floating(f64),
}
struct VmStack {
boolean_stack: Vec<bool>,
arithmetic_stack: Vec<ArithmeticValue>,
}
struct Registers {
x: i64,
y: i64,
n: i64,
p: i64,
k: i64,
}
We have, therefore:
struct VmCode {
code: Vec<Bytecode>,
}
struct Vm<'code> {
code: &'code VmCode,
registers: Registers,
stack: &'code mut VmStack,
}
Note how Vm doesn’t own its stack, but rather maintains a mutable reference to it. This is to follow guideline 3, via a simple observation: each Bytecode instruction can put at most one value on the stack. Since we’ll be running the same code for multiple settings of the Registers, we can just allocate a stack as big as the number of instructions (at initialization), and clear it before each run. Likewise, we keep a reference to the bytecode, because the same memory can be re-used between iterations.
impl VmStack {
pub fn from_code(code: &VmCode) -> Self {
let stack_size = code.code.len();
VmStack {
boolean_stack: Vec::with_capacity(stack_size),
arithmetic_stack: Vec::with_capacity(stack_size),
}
}
fn reset(&mut self) {
self.boolean_stack.clear();
self.arithmetic_stack.clear();
}
}
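While we’re at it, the load constructors used from here on are just the obvious wiring; their signatures are my inference from the call sites:

// Plain constructors. (A sketch, inferred from how they're called below.)
impl Registers {
    fn load(x: i64, y: i64, n: i64, p: i64, k: i64) -> Self {
        Registers { x, y, n, p, k }
    }
}

impl<'code> Vm<'code> {
    fn load(code: &'code VmCode, registers: Registers, stack: &'code mut VmStack) -> Self {
        Vm { code, registers, stack }
    }
}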
It remains to actually run the bytecode in our virtual machine, but, as hinted before, this is easy: if our VM iterates over the code back-to-front, pushing and popping values as it moves along, by the end it will hold the final output in the corresponding stack (because you’re essentially evaluating a reverse Polish notation expression).
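Before the code, here’s a tiny worked trace (my own example) of that back-to-front evaluation, for the program < (ham x) 3 with x = 0b101:

// Compiled (Polish order):  [LessThan, Ham, Variable(X), Literal(3)]
// Iterating back-to-front:
//   Literal(3)   -> arithmetic stack: [3]
//   Variable(X)  -> arithmetic stack: [3, 5]      (x = 0b101 = 5)
//   Ham          -> pop 5, push popcount(5) = 2   -> [3, 2]
//   LessThan     -> pop 2 (left), pop 3 (right), push (2 < 3)
//                -> boolean stack: [true]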
impl<'code> Vm<'code> {
    fn run(&mut self) -> &mut Self {
        self.stack.reset();
        let code = self.code;
        // 👇 back-to-front!
        for opcode in code.code.iter().rev() {
            match opcode {
                Bytecode::BinaryBooleanOperator(op) => {
                    let left_operand = self.stack.boolean_stack.pop().unwrap();
                    let right_operand = self.stack.boolean_stack.pop().unwrap();
                    let value = match op {
                        ast::BinaryBooleanOperator::And => left_operand & right_operand,
                        ast::BinaryBooleanOperator::Or => left_operand | right_operand,
                        ast::BinaryBooleanOperator::Xor => left_operand ^ right_operand,
                    };
                    self.stack.boolean_stack.push(value);
                }
                Bytecode::Variable(_v) => { /* Push the value of the variable onto the arithmetic stack */ }
                Bytecode::Literal(_l) => { /* Push the literal onto the arithmetic stack */ }
                _ => { /* Etc. for all other operations */ }
            }
        }
        self
    }

    fn output_bool(&mut self) -> bool {
        self.stack.boolean_stack.pop().unwrap()
    }
}
Let’s put our virtual machine to the test, to check all the moving pieces; hopefully the following also gives you an idea on how to reuse the compiled bytecode and the stack:
#[test]
fn test_vm() {
let program = "and (= (ham (^ x y)) (ham (+ 3 x))) (> (* x y) 5)";
let expression = parsing::parse_relation(program);
if let Err(e) = expression {
println!("{}", e);
panic!();
}
let code = compile_boolean(expression.unwrap());
let mut stack = VmStack::from_code(&code);
let as_rust = |x: u64, y: u64| {
((x + 3).count_ones() == (x ^ y).count_ones()) && (x * y > 5)
};
for x in 0..(1 << 4) {
for y in 0..(1 << 4) {
let mut vm = Vm::load(
&code,
Registers::load(x, y, 2, 6, 1), /* 2 6 1 are random */
&mut stack,
);
let output = vm.run().output_bool();
let expected = as_rust(x, y);
assert_eq!(output, expected);
}
}
}
…
running 1 test
test vm::test_vm ... ok
Cool.
Having completed our virtual machine detour, we can go back to our find_bounds function. Here’s a proposal, based on the Python program, but using the virtual machine:
impl Prover {
    #[new]
    fn py_new(a: String, b: String, relation: String) -> PyResult<Self> {
        /* ------------------------ 8< -------------------------
        ...
        Compile and save the bytecodes for a, b, and relation,
        reporting errors to the user if necessary.
        Save the compiled bytecode to the `Prover` struct.
        ...
        ------------------------ 8< ------------------------- */
    }

    fn find_bounds(&self, n: u32, k: u32, p: u32) -> PyResult<BoundsResult> {
        let mut vm_stack = vm::VmStack::from_code(&self.relation);
        let mut m = u64::MAX;
        let mut m_prime = u64::MAX;
        let mut l_x = 0_u64;
        let mut l_y = 0_u64;
        // l_max will be l_x * l_y
        let window = (1_u64 << p) - 1;

        // First, look at every x and how many relations each one has...
        for x in 0..(1_u64 << n) {
            let in_set_a = vm::Vm::load(
                &self.a_bytecode,
                vm::Registers::load(x, 0, n, p, k),
                &mut vm_stack,
            )
            .run()
            .output_bool();
            if !in_set_a {
                continue;
            }
            let mut m_candidate = 0_u64;
            let mut l_x_candidate = 0_u64;
            for y in 0..(1_u64 << n) {
                let in_set_b = vm::Vm::load(
                    &self.b_bytecode,
                    vm::Registers::load(0, y, n, p, k),
                    &mut vm_stack,
                )
                .run()
                .output_bool();
                if !in_set_b {
                    continue;
                }
                let related = vm::Vm::load(
                    &self.relation,
                    vm::Registers::load(x, y, n, p, k),
                    &mut vm_stack,
                )
                .run()
                .output_bool();
                if related {
                    m_candidate += 1;
                    if (x & window) != (y & window) {
                        l_x_candidate += 1;
                    }
                }
            }
            if m_candidate < m {
                m = m_candidate;
            }
            if l_x_candidate > l_x {
                l_x = l_x_candidate;
            }
        }

        // ...Then, look at every y and how many relations each one has.
        for y in 0..(1_u64 << n) {
            let in_set_b = vm::Vm::load(
                &self.b_bytecode,
                vm::Registers::load(0, y, n, p, k),
                &mut vm_stack,
            )
            .run()
            .output_bool();
            if !in_set_b {
                continue;
            }
            let mut m_prime_candidate = 0_u64;
            let mut l_y_candidate = 0_u64;
            for x in 0..(1_u64 << n) {
                let in_set_a = vm::Vm::load(
                    &self.a_bytecode,
                    vm::Registers::load(x, 0, n, p, k),
                    &mut vm_stack,
                )
                .run()
                .output_bool();
                if !in_set_a {
                    continue;
                }
                let related = vm::Vm::load(
                    &self.relation,
                    vm::Registers::load(x, y, n, p, k),
                    &mut vm_stack,
                )
                .run()
                .output_bool();
                if related {
                    m_prime_candidate += 1;
                    if (x & window) != (y & window) {
                        l_y_candidate += 1;
                    }
                }
            }
            if m_prime_candidate < m_prime {
                m_prime = m_prime_candidate;
            }
            if l_y_candidate > l_y {
                l_y = l_y_candidate;
            }
        }

        let l_max = l_x as u128 * l_y as u128;
        /* Return the results... */
    } // find_bounds
} // impl Prover
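As for BoundsResult, one possible shape follows; the field names are my assumption, mirroring the quantities in the method statement:

// A possible shape for the result type. (A sketch; field names are mine.)
#[pyclass]
struct BoundsResult {
    #[pyo3(get)]
    m: u64,
    #[pyo3(get)]
    m_prime: u64,
    #[pyo3(get)]
    l_max: u128,
}

The bound itself is then \(\sqrt{m\,m'/l_\mbox{max}}\), which can just as well be computed on the Python side, as (m * m_prime / l_max) ** 0.5.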
The code duplication here is pretty annoying; we’re checking multiple times whether a given x belongs to \(A\) and a given y belongs to \(B\), and after the first outer iteration over x, we’ve already looked at every y in \(B\). Ideally we’d remember these, but, as noted, that would mean memory growing at-worst with \(2^{N}\). Still, this doesn’t mean we shouldn’t remember anything at all;25 we can introduce a cache that remembers elements of \(A\) up to a given size limit, and re-calculates them only from that point on. This plays well with guideline 4, as long as we keep this cache compact in memory. In my case, I chose this cache to take a maximum of 10,000,000 64-bit values, meaning a comfortable 80 megabytes of RAM used per cache — one for \(A\), one for \(B\) — therefore 160 megabytes of RAM, total.26 On the first double iteration I save as many values as I can within the cache size (defaulting afterwards to filtering them on the fly), preventing re-calculation of these values the second time around.
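Here’s a sketch of what such a cache can look like; the names and the exact policy are mine, under the ten-million-value cap discussed above:

// A bounded cache of set elements: remember the members seen during the
// first pass, up to CAPACITY; past that, fall back to re-evaluating
// membership on the fly. (A sketch; names and policy are illustrative.)
const CAPACITY: usize = 10_000_000;

struct SetCache {
    elements: Vec<u64>, // members of the set, in the order they were seen
    complete: bool,     // whether every member fit within CAPACITY
}

impl SetCache {
    fn new() -> Self {
        SetCache {
            // Per guideline 3: one allocation, up front.
            elements: Vec::with_capacity(CAPACITY),
            complete: true,
        }
    }

    fn record(&mut self, value: u64) {
        if self.elements.len() < CAPACITY {
            self.elements.push(value);
        } else {
            self.complete = false;
        }
    }
}

On the second pass, you iterate over elements directly, and only go back to evaluating the membership bytecode (for the values past the last cached one) if complete came out false.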
But does it go fast now? We package our newly developed Python extension with maturin develop --release, and write a simple shim in Python…
import adversary

n = 16
k = 6
p = 2

prover = adversary.Prover(
    a="= (ham x) k",
    b="= (ham y) (+ k 1)",
    relation="= ham (^ x y) 1",
)
bounds = prover.find_bounds(n, k, p)
…and then once again profiling it with hyperfine…
Benchmark 1: python rust.py
Time (mean ± σ): 3.941 s ± 0.335 s [User: 3.867 s, System: 0.012 s]
Range (min … max): 3.776 s … 4.881 s 10 runs
A neat 7200% speed improvement.27 😎
By performing some extra benchmarking, we can get an idea of how much time we’re spending per iteration, on average, for this choice of \(A\), \(B\), and relation; below I’ve plotted the times for varying parameter \(N\), and fit them to the curve \(o+t\times2^{2N+1}\):
You can see that \(t\) comes out to \(t=7.13\times10^{-10}\) seconds (\(\pm0.65\%\)), which, give or take a couple of orders of magnitude due to our lack of statistics, is still an extremely short amount of time per iteration, in the same order of magnitude as the time the CPU28 takes to evaluate one instruction.
My approach is, in some sense, quite old fashioned, as I’m not making use of two important aspects of modern HPC: multithreading and GPUs. Nonetheless, we’ve gone pretty fast — close to bare-metal speed on average, if our fit is to be trusted. But, regardless of how much we optimize and high-performance compute, it’s clear that there’s just not much to do about the exponential nature of the problem, which eventually (violently) dominates. That’s the thing about asymptotic scaling — sometimes, and at scale, scaling matters, and when it does you’d better optimize on paper before writing code.
Much of Ambainis’s work is characterized by conceptual clarity and elegance. ↩
Usually implying “or Mathematics would break”, or sometimes “or Computational Complexity Theory as we believe it to be would break”. ↩
“Quantum lower bounds by quantum arguments”, A. Ambainis, arXiv:quant-ph/0002066 ↩
Or “spectral adversary method”, depending on who you ask. ↩
“All quantum adversary methods are equivalent”, R. Spalek, M. Szegedy, Theory of Computing, 2(1):1-18, 2006 (or arXiv:quant-ph/0409116 if you’re looking to actually read it). ↩
Though there seems to be some consensus that, in practice, this simplified method is often good enough for total symmetric Boolean functions. ↩
The name “adversary method” coming from the idea that an adversary chooses these worst possible query answers on-the-fly. ↩
Stacey Jeffery, Frédéric Magniez, and Ronald de Wolf. Optimal parallel quantum query algorithms. Algorithmica, 79(2):509–529, 2017 (or arXiv:1309.6116 if you’re looking to actually read it). ↩
Again, this statement may be an underestimation of the lower bound, at the benefit of simplicity. ↩
When writing this article I realized that allowing memory to grow exponentially might actually have been more tractable than offloading everything to iterations — provided you allowed your memory to grow to \(2^N\) but not to \(2^{2N}\). For reference, for \(N=20\) and 64-bit values, this is the difference between requiring about eight megabytes of RAM versus requiring 8796 gigabytes of RAM. Exponential scales are wild. ↩
\(|x|\) denotes how many elements of \(x\) are “true”. ↩
As far as I can tell, that’s what Paul Burchard claims in arXiv:1910.04555. ↩
You know when sometimes you think “I could totally do that”? That. ↩
Not sure how to convince you this is true, except that it has to do with the fact that, for any choice of window, you could consider the inputs permuted so as to map that window onto the first \(p\) entries — which, for a symmetric function, changes nothing. ↩
Not exactly on a slow machine, mind you. The data given was obtained using a beefy AMD Ryzen 5 PRO 4650G. ↩
The choice of the parameters will also be relevant here, because for some combinations of \(N\) and \(k\) the continue statements will do a lot of the heavy lifting, and iterating over a range is surprisingly fast. The problem really is the function calls. You can get a sense of the worst-case scenario by choosing \(k\sim{N/2}\) — for me, choosing \(N=16\), \(k=8\) brought the runtime up to about 380 seconds. In this case, the particular relation and sets chosen are also more forgiving than a relation like “\(x\) and \(y\) relate if they differ by a number of inputs less than or equal to \(p\)”. ↩
Relatively speaking, and without data here to back it up. But I’m confident in saying that a lot of the benefit would be lost, from experience. ↩
Or Domain-Specific Language if you’re feeling less cool. ↩
This kind of looks like the Backus-Naur Form format, but not quite. It’s the format that pest, the PEG parser generator I used, takes in. ~ denotes “and then”, | denotes “or try to match”, and { } are just delimiting tokens for a rule. ↩
I used Polish notation (where the operators appear before the operands) because it’s easier to parse. ↩
The expression represented as a bunch of enums and structs pointing to each other. ↩
In particular because Rust doesn’t allow for infinitely sized objects, so an expression can never directly contain an expression — but for, say, + (+ 1 1) 1 you require an arithmetic expression that has an arithmetic expression as an operand. Therefore you’re instead forced to have your arithmetic expressions somewhere else in memory, and allow, at most, an arithmetic expression to point to another expression. ↩
In technical lingo, ASTASAP. ↩
As far as I know, that’s what Python is doing as well, and the reason why you get those pesky __pycache__ folders in your projects’ directories. ↩
Especially as we may have a lot less than \(2^{2N}\) elements in our sets \(A\) and \(B\). ↩
Per guideline 3, this memory is allocated only once, at the start of the program, regardless of how much of it does end up being used. ↩
Did you know you can add a couple of zeroes if you write a number in percentage? ↩
Assuming 5GHz. ↩