Update: An extended version of this post will appear in the upcoming book Quantum Machine Learning: What Quantum Computing Means to Data Mining.
A fascinating paper recently appeared on arXiv that proves an exponential speedup for least-squares support vector machines (SVMs) on a quantum computer (Rebentrost et al., 2013). While the five-page eprint is fairly accessible to readers who are not well-versed in quantum information theory, I had a hard time understanding some of the steps. This write-up is a diluted interpretation of the content, bringing it down to my level.
The paper is all the more interesting because it is little more than a frame: it connects a handful of recent results with a particular formulation of SVMs. Least squares SVMs translate the optimization problem into a set of linear equations. The linear equations require the quick calculation of the kernel matrix -- this is one source of the speedup. The other source is the efficient solution of the linear equations on quantum hardware.
Least Squares Support Vector Machines
A support vector machine (SVM) is a supervised learning algorithm which learns a given independent and identically distributed training example set $\{(\mathbf{x}_1, y_1), \ldots, (\mathbf{x}_M, y_M)\}$, where $\mathbf{x}_i \in \mathbb{R}^N$ and $y_i \in \{-1, +1\}$.
In its simplest, linear form, a support vector machine is a hyperplane that separates a set of positive examples from a set of negative examples with maximum margin. The formula for the output of a linear support vector machine is $y_i = \operatorname{sgn}(\mathbf{w}\cdot\mathbf{x}_i + b)$, where $\mathbf{x}_i$ is the $i$th training example, $y_i$ is the output of the support vector machine for the $i$th training example, $\mathbf{w}$ is the normal vector to the hyperplane, and the parameter $b$ helps determine the offset of the hyperplane from the origin. The margin is defined by the distance of the hyperplane to the nearest of the positive and negative examples. Support vectors are the training data that lie on the margin.
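As a point of reference, here is a minimal classical sketch of this decision function; the vector $\mathbf{w}$, the offset $b$, and the query points are made up for illustration.

```python
import numpy as np

# Toy parameters of a trained linear SVM (made-up values for illustration).
w = np.array([2.0, -1.0])   # normal vector of the separating hyperplane
b = -0.5                    # offset of the hyperplane from the origin

def classify(x):
    """Linear SVM output: sign(w . x + b), returning +1 or -1."""
    return 1 if np.dot(w, x) + b >= 0 else -1

print(classify(np.array([1.0, 0.5])))   # +1, lies on the positive side
print(classify(np.array([-1.0, 2.0])))  # -1, lies on the negative side
```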
SVMs allow a nonlinear kernel mapping that takes the training examples from the input space into a feature space. This is important for problems that are not linearly separable. SVMs with an embedding function $\phi$ require the solution of the following optimization problem:
$$\min_{\mathbf{w}, b, \boldsymbol\xi}\ \frac{1}{2}\|\mathbf{w}\|^2 + C\sum_{i=1}^M \xi_i$$
subject to
$$y_i\big(\mathbf{w}\cdot\phi(\mathbf{x}_i) + b\big) \geq 1 - \xi_i, \qquad \xi_i \geq 0.$$
Here the $\xi_i$ are slack variables that allow some examples to violate the margin, and the constant $C$ controls the trade-off between the width of the margin and the training error.
In most practical applications, we solve the dual formulation of the optimization problem. To obtain the dual problem, we introduce Lagrange multipliers to accommodate the constraints in the minimization problem. The partial derivatives in $\mathbf{w}$, $b$, and $\boldsymbol\xi$ define a saddle point of the Lagrangian, with which the dual formulation becomes the following quadratic programming problem:
$$\max_{\boldsymbol\alpha}\ \sum_{i=1}^M \alpha_i - \frac{1}{2}\sum_{i,j=1}^M \alpha_i\alpha_j y_i y_j K(\mathbf{x}_i, \mathbf{x}_j)$$
subject to
$$\sum_{i=1}^M \alpha_i y_i = 0, \qquad 0 \leq \alpha_i \leq C.$$
The function $K(\mathbf{x}_i, \mathbf{x}_j) = \phi(\mathbf{x}_i)\cdot\phi(\mathbf{x}_j)$ is the kernel function, the dot product in the embedding space. The kernel function bypasses calculating the embedding itself; we use it directly to find the optimum and subsequently to classify new instances. The decision function for a binary decision problem becomes
$$y(\mathbf{x}) = \operatorname{sgn}\!\left(\sum_{i=1}^M \alpha_i y_i K(\mathbf{x}_i, \mathbf{x}) + b\right).$$
On classical hardware, the quadratic programming problem is typically solved by sequential minimal optimization.
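To make the dual form concrete, the sketch below evaluates the kernel decision function for a toy polynomial kernel. The multipliers $\alpha_i$ and the bias are placeholder values of my own, not the output of an actual optimizer such as an SMO solver.

```python
import numpy as np

# Toy training set: two positive and two negative examples (made-up data).
X = np.array([[1.0, 1.0], [2.0, 1.5], [-1.0, -1.0], [-2.0, -0.5]])
y = np.array([1, 1, -1, -1])

# Lagrange multipliers and bias as a quadratic-programming solver would return them
# (placeholder values for illustration only).
alpha = np.array([0.3, 0.0, 0.3, 0.0])
b = 0.1

def kernel(u, v, degree=2):
    """Polynomial kernel: the dot product taken in the embedding space."""
    return (np.dot(u, v) + 1.0) ** degree

def decide(x):
    """Dual decision function: sign of sum_i alpha_i y_i K(x_i, x) + b."""
    s = sum(a * yi * kernel(xi, x) for a, yi, xi in zip(alpha, y, X)) + b
    return np.sign(s)

print(decide(np.array([1.5, 1.0])))   # +1 for this toy setup
```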
Least squares SVMs modify the objective function of the primal problem by using the $L_2$ norm of the error terms in the penalty (Suykens and Vandewalle, 1999):
$$\min_{\mathbf{w}, b, \mathbf{e}}\ \frac{1}{2}\|\mathbf{w}\|^2 + \frac{\gamma}{2}\sum_{i=1}^M e_i^2$$
subject to the equality constraints
$$y_i\big(\mathbf{w}\cdot\phi(\mathbf{x}_i) + b\big) = 1 - e_i.$$
The parameter $\gamma$ plays a role similar to that of $C$ above. Since the constraints are now equalities, taking the Lagrangian and eliminating $\mathbf{w}$ and $\mathbf{e}$ leaves a system of linear equations instead of a quadratic program:
$$\begin{pmatrix} 0 & \mathbf{1}^T \\ \mathbf{1} & K + \gamma^{-1}\mathbb{1} \end{pmatrix}\begin{pmatrix} b \\ \boldsymbol\alpha \end{pmatrix} = \begin{pmatrix} 0 \\ \mathbf{y} \end{pmatrix}, \qquad (1)$$
where $K_{ij} = K(\mathbf{x}_i, \mathbf{x}_j)$ is the kernel matrix, $\mathbf{1} = (1, \ldots, 1)^T$, and $\mathbf{y} = (y_1, \ldots, y_M)^T$.
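Classically, system (1) is just a dense linear solve. Here is a minimal numpy sketch; the toy data, the value of $\gamma$, and the choice of a linear kernel are all mine, for illustration only.

```python
import numpy as np

# Toy training data (made-up) and labels.
X = np.array([[1.0, 1.0], [2.0, 0.5], [-1.0, -1.5], [-2.0, -0.5]])
y = np.array([1.0, 1.0, -1.0, -1.0])
M = len(y)
gamma = 2.0                     # regularization parameter of the LS-SVM

K = X @ X.T                     # linear kernel matrix, K_ij = x_i . x_j

# Assemble F = [[0, 1^T], [1, K + gamma^{-1} I]] and the right-hand side (0, y).
F = np.zeros((M + 1, M + 1))
F[0, 1:] = 1.0
F[1:, 0] = 1.0
F[1:, 1:] = K + np.eye(M) / gamma
rhs = np.concatenate(([0.0], y))

b, *alpha = np.linalg.solve(F, rhs)   # solution (b, alpha_1, ..., alpha_M)
print("bias:", b)
print("alphas:", alpha)
```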
Given a linear or polynomial kernel, the calculation of an entry of the kernel matrix takes $O(N)$ time, thus calculating the whole kernel matrix has $O(M^2 N)$ time complexity. Solving the quadratic dual problem or the least squares formulation has $O(M^3)$ complexity. Combining the two steps, the classical SVM algorithm has at least $O(M^2(M+N))$ complexity. The quantum formulation yields an exponential speedup in both steps, leading to an overall complexity of $O(\log(MN))$.
Calculating the Kernel Matrix
In the quantum algorithm, the training instances are represented as quantum states $|\mathbf{x}_i\rangle = \frac{1}{|\mathbf{x}_i|}\sum_{k=1}^N (\mathbf{x}_i)_k |k\rangle$. We do not require the training instances to be normalized, but the normalization must be stored separately. To reconstruct a state from quantum random access memory (QRAM, Giovannetti et al., 2008), we need to query the QRAM $O(\log N)$ times.
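As a classical analogy of this amplitude encoding, the sketch below splits a training vector into the normalized amplitudes of $|\mathbf{x}_i\rangle$ and the separately stored norm $|\mathbf{x}_i|$. This only illustrates the data layout; it is not a QRAM simulation, and the helper name is mine.

```python
import numpy as np

def amplitude_encode(x):
    """Split a classical vector into (norm, unit vector of amplitudes).

    The unit vector holds the amplitudes of |x>; the norm is the
    normalization that has to be stored separately.
    """
    norm = np.linalg.norm(x)
    return norm, x / norm

x = np.array([3.0, 0.0, 4.0, 0.0])      # toy training instance (N = 4, two qubits)
norm, amplitudes = amplitude_encode(x)
print(norm)                              # 5.0
print(amplitudes)                        # [0.6, 0.0, 0.8, 0.0]; squared amplitudes sum to 1
```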
To evaluate the dot product of two training instances, we need to do the following (Lloyd et al., 2013):
- Generate two states, $|\psi\rangle$ and $|\varphi\rangle$, with an ancilla variable;
- Estimate the parameter $Z = |\mathbf{x}_i|^2 + |\mathbf{x}_j|^2$ -- the sum of the squared norms of the two instances;
- Perform a projective measurement on the ancilla alone, comparing the two states.
$2Z$ times the probability of the success of the measurement yields the square of the Euclidean distance between the two training instances: $|\mathbf{x}_i - \mathbf{x}_j|^2$. We calculate the dot product in the linear kernel as $\mathbf{x}_i\cdot\mathbf{x}_j = \frac{1}{2}\big(Z - |\mathbf{x}_i - \mathbf{x}_j|^2\big)$.
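A quick classical sanity check of this identity, with toy vectors of my own choosing; here the squared distance is computed directly, whereas in the algorithm it is what the quantum routine estimates.

```python
import numpy as np

x_i = np.array([1.0, 2.0, 2.0])
x_j = np.array([2.0, 0.0, 1.0])

Z = np.dot(x_i, x_i) + np.dot(x_j, x_j)          # sum of squared norms
dist_sq = np.dot(x_i - x_j, x_i - x_j)           # |x_i - x_j|^2, estimated quantumly

dot_from_distance = 0.5 * (Z - dist_sq)          # linear-kernel entry recovered
print(dot_from_distance, np.dot(x_i, x_j))       # both print 4.0
```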
The state $|\psi\rangle = \frac{1}{\sqrt{2}}\big(|0\rangle|\mathbf{x}_i\rangle + |1\rangle|\mathbf{x}_j\rangle\big)$ is easy to construct by querying the QRAM. We estimate the other state, $|\varphi\rangle = \frac{1}{\sqrt{Z}}\big(|\mathbf{x}_i|\,|0\rangle - |\mathbf{x}_j|\,|1\rangle\big)$, and the parameter, $Z$, together. We evolve the state $\frac{1}{\sqrt{2}}\big(|0\rangle - |1\rangle\big)\otimes|0\rangle$ with the Hamiltonian
$$H = \big(|\mathbf{x}_i|\,|0\rangle\langle 0| + |\mathbf{x}_j|\,|1\rangle\langle 1|\big)\otimes\sigma_x.$$
The resulting state is
$$\frac{1}{\sqrt{2}}\Big(\cos(|\mathbf{x}_i|t)\,|0\rangle - \cos(|\mathbf{x}_j|t)\,|1\rangle\Big)\otimes|0\rangle - \frac{i}{\sqrt{2}}\Big(\sin(|\mathbf{x}_i|t)\,|0\rangle - \sin(|\mathbf{x}_j|t)\,|1\rangle\Big)\otimes|1\rangle.$$
By an appropriate choice of $t$ ($|\mathbf{x}_i|t, |\mathbf{x}_j|t \ll 1$), and by measuring the ancilla bit, we get the state $|\varphi\rangle$ with a probability proportional to $Zt^2$, which in turn allows the estimation of $Z$. If the desired accuracy of the estimation is $\epsilon$, then the complexity of constructing $|\psi\rangle$ and $|\varphi\rangle$ is $O(\epsilon^{-1}\log N)$.
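To convince myself of this step, here is a small numerical check that simulates the two-qubit evolution directly. The norms $|\mathbf{x}_i| = 3$, $|\mathbf{x}_j| = 4$ and the evolution time are made up, and scipy's matrix exponential stands in for the physical evolution; with this choice of initial state the success probability comes out as $Zt^2/2$.

```python
import numpy as np
from scipy.linalg import expm

norm_i, norm_j = 3.0, 4.0                      # |x_i| and |x_j| of two toy instances
Z = norm_i**2 + norm_j**2

# Basis ordering: |state register> (x) |ancilla>; the ancilla is the qubit sigma_x acts on.
sx = np.array([[0, 1], [1, 0]], dtype=complex)
P0 = np.diag([1.0, 0.0]).astype(complex)       # |0><0| on the state register
P1 = np.diag([0.0, 1.0]).astype(complex)       # |1><1| on the state register
H = norm_i * np.kron(P0, sx) + norm_j * np.kron(P1, sx)

t = 0.01                                        # |x_i| t, |x_j| t << 1
initial = np.kron(np.array([1, -1]) / np.sqrt(2), np.array([1, 0])).astype(complex)
state = expm(-1j * H * t) @ initial

# Probability that the ancilla is measured in |1>: components |0>|1> and |1>|1>.
p_one = abs(state[1])**2 + abs(state[3])**2
print(p_one, Z * t**2 / 2)                      # both are approximately 0.00125
```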
If we have $|\psi\rangle$ and $|\varphi\rangle$, we perform a swap test on the ancilla alone. A swap test is a sequence of a Fredkin gate (a controlled-SWAP gate) and a Hadamard gate which checks the equivalence of two states $|a\rangle$ and $|b\rangle$. With an ancilla state $|c\rangle$, the schematic diagram is as follows:

[Figure: a swap test -- the ancilla $|c\rangle$ controls a SWAP of $|a\rangle$ and $|b\rangle$, then a Hadamard gate and a measurement are applied to the ancilla.]
The first transformation swaps $|a\rangle$ and $|b\rangle$ if the ancilla bit is $|1\rangle$. The Hadamard transformation is a one-qubit rotation mapping the qubit states $|0\rangle$ and $|1\rangle$ to two superposition states. With basic algebraic manipulation, we conclude that if $|a\rangle$ and $|b\rangle$ are equal, then the measurement at the end will always give us zero.
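The algebra can be checked numerically: for two pure states the probability of measuring the ancilla in $|0\rangle$ is $\frac{1}{2} + \frac{1}{2}|\langle a|b\rangle|^2$, which equals one exactly when the states coincide. Below is a minimal sketch that works with plain state vectors rather than a gate-level circuit simulation.

```python
import numpy as np

def swap_test_p0(a, b):
    """Probability of measuring the ancilla in |0> after a swap test on states a and b."""
    overlap = np.vdot(a, b)
    return 0.5 + 0.5 * abs(overlap)**2

a = np.array([1.0, 1.0]) / np.sqrt(2)
b = np.array([1.0, -1.0]) / np.sqrt(2)

print(swap_test_p0(a, a))   # 1.0  -- identical states always give outcome 0
print(swap_test_p0(a, b))   # 0.5  -- orthogonal states give 0 or 1 with equal probability
```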
With the QRAM accesses and the estimations, the overall complexity of evaluating a single dot product $\mathbf{x}_i\cdot\mathbf{x}_j$ is $O(\epsilon^{-1}\log N)$.
Obtaining the Optimum of the Least Squares Dual Formulation
If you only calculate the kernel matrix by quantum means, you have a complexity of $O(M^2\epsilon^{-1}\log N)$. There is more to gain: an exponential speedup in the number of training examples is also possible.
The algorithm hinges on three ideas:
- Quantum matrix inversion is fast (Harrow et al., 2009);
- Simulation of sparse matrices is efficient (Berry et al., 2007);
- Non-sparse density matrices reveal their eigenstructure exponentially faster than classical algorithms do (Lloyd et al., 2013).
To solve the linear equation (1), we need to invert
$$F = \begin{pmatrix} 0 & \mathbf{1}^T \\ \mathbf{1} & K + \gamma^{-1}\mathbb{1} \end{pmatrix}.$$
The matrix inversion algorithm needs to simulate the matrix exponential of $F$. We split $F$ as $F = J + K_\gamma$, where
$$J = \begin{pmatrix} 0 & \mathbf{1}^T \\ \mathbf{1} & 0 \end{pmatrix}$$
-- the adjacency matrix of a star graph -- and
$$K_\gamma = \begin{pmatrix} 0 & 0 \\ 0 & K + \gamma^{-1}\mathbb{1} \end{pmatrix}.$$
We normalize $F$ with its trace: $\hat{F} = F/\operatorname{tr}F$. By using the Lie product formula, or, more precisely, the Baker-Campbell-Hausdorff formula, we get the exponential by
$$e^{-i\hat{F}\Delta t} = e^{-iJ\Delta t/\operatorname{tr}F}\, e^{-i\gamma^{-1}\mathbb{1}\Delta t/\operatorname{tr}F}\, e^{-iK\Delta t/\operatorname{tr}F} + O(\Delta t^2).$$
Of these exponentials, the sparse matrix $J$ and the constant multiple of the identity matrix are easy to simulate (Berry et al., 2007).
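The Lie product formula itself is easy to check numerically for small matrices. The sketch below compares $e^{-i(A+B)\Delta t}$ with the product of the individual exponentials for a made-up pair of Hermitian matrices standing in for the split parts of $\hat{F}$.

```python
import numpy as np
from scipy.linalg import expm

rng = np.random.default_rng(0)

# Two random Hermitian matrices standing in for the normalized split parts of F.
A = rng.standard_normal((4, 4)); A = (A + A.T) / 2
B = rng.standard_normal((4, 4)); B = (B + B.T) / 2

dt = 0.01
exact = expm(-1j * (A + B) * dt)
split = expm(-1j * A * dt) @ expm(-1j * B * dt)   # Lie product approximation

# The error of a single step is O(dt^2).
print(np.linalg.norm(exact - split))
```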
On the other hand, the kernel matrix $K$ is not sparse. This is where quantum self analysis helps: given multiple copies of a density matrix $\rho$, it is possible to perform $e^{-i\rho t}$ (Lloyd et al., 2013). It is self analysis because a state plays an active role in its own measurement: by exponentiation, it functions as a Hamiltonian. With the quantum calculation of the dot product and access to the QRAM, we obtain the normalized kernel matrix $\hat{K} = K/\operatorname{tr}K$. The matrix $\hat{K}$ is a normalized Hermitian matrix, which makes it a prime candidate for quantum self analysis. The exponentiation is done in $O(\log MN)$ steps. We obtain $e^{-i\hat{K}\Delta t}$ by
$$e^{-i\hat{K}\Delta t}\,\sigma\,e^{i\hat{K}\Delta t} \approx \operatorname{tr}_1\!\left\{ e^{-iS\Delta t}\,\big(\hat{K}\otimes\sigma\big)\,e^{iS\Delta t} \right\} = \sigma - i\Delta t\,[\hat{K},\sigma] + O(\Delta t^2), \qquad (2)$$
where $S$ is the swap operator on the two registers and $\sigma$ is the state the exponential acts on.
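Equation (2) can be verified with a direct classical simulation: take a normalized Hermitian matrix as $\hat{K}$, an arbitrary density matrix $\sigma$, conjugate $\hat{K}\otimes\sigma$ with the swap exponential, and trace out the first register. The matrices below are made up, and the simulation is exponentially expensive classically -- the point of the quantum routine is that it is not.

```python
import numpy as np
from scipy.linalg import expm

d = 2                                             # dimension of each register
rng = np.random.default_rng(1)

# A normalized Hermitian matrix playing the role of K_hat (positive definite, trace one).
K = rng.standard_normal((d, d)); K = (K + K.T) / 2
K_hat = K @ K.T + np.eye(d)
K_hat = K_hat / np.trace(K_hat)

sigma = np.diag([0.7, 0.3]).astype(complex)       # an arbitrary density matrix

# Swap operator S on C^d (x) C^d.
S = np.zeros((d * d, d * d))
for i in range(d):
    for j in range(d):
        S[i * d + j, j * d + i] = 1.0

dt = 0.01
U = expm(-1j * S * dt)
evolved = U @ np.kron(K_hat, sigma) @ U.conj().T

# Partial trace over the first register (the one holding K_hat).
reduced = evolved.reshape(d, d, d, d).trace(axis1=0, axis2=2)

target = expm(-1j * K_hat * dt) @ sigma @ expm(1j * K_hat * dt)
print(np.linalg.norm(reduced - target))            # discrepancy is O(dt^2), small compared with dt
```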
We use (2) to perform quantum phase estimation to get the eigenvalues and eigenvectors. If the desired accuracy is $\epsilon$, we need $O(\epsilon^{-3})$ copies of the state. Then we express the right-hand side $(0, \mathbf{y}^T)^T$ in (1) in the eigenbasis, and invert the eigenvalues to obtain the solution $(b, \boldsymbol\alpha^T)^T$ of the linear equation (Harrow et al., 2009).
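The spirit of this inversion, expressed classically: diagonalize the matrix, express the right-hand side in the eigenbasis, and divide by the eigenvalues. The sketch below does this with numpy's eigendecomposition for the same toy system as before; the quantum algorithm, of course, never writes the matrix out explicitly.

```python
import numpy as np

# Reuse the toy LS-SVM system: F (b, alpha)^T = (0, y)^T.
X = np.array([[1.0, 1.0], [2.0, 0.5], [-1.0, -1.5], [-2.0, -0.5]])
y = np.array([1.0, 1.0, -1.0, -1.0])
M = len(y)
gamma = 2.0

F = np.zeros((M + 1, M + 1))
F[0, 1:] = 1.0
F[1:, 0] = 1.0
F[1:, 1:] = X @ X.T + np.eye(M) / gamma
y_hat = np.concatenate(([0.0], y))

# Eigendecomposition of the Hermitian matrix F.
eigvals, eigvecs = np.linalg.eigh(F)

# Express y_hat in the eigenbasis, invert the eigenvalues, transform back.
coeffs = eigvecs.T @ y_hat
solution = eigvecs @ (coeffs / eigvals)

print(np.allclose(solution, np.linalg.solve(F, y_hat)))   # True
```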
With this inversion algorithm, the overall time complexity of training the support vector parameters is $O(\epsilon^{-3}\log(MN))$.
Limitations and Open Questions
The approach only works for dense training vectors, for obvious reasons. In many classical algorithms, execution speed -- and not complexity -- is improved by exploiting sparse data structures. Such structures are common in applications, such as text mining and recommendation engines. It is unclear how a sparse vector maps to a quantum state.
The kernel function is restricted. Linear and polynomial kernels are easy to implement; in fact, the inner product is evaluated directly in the embedding space. Radial basis function kernels are much harder to imagine in this framework, and they are the most commonly used kernels.
Non-symmetric kernels are out of reach for the time being. One restriction comes from the linear or polynomial kernel itself. The other constraint is the matrix inversion, although the generic method extends to non-Hermitian matrices, so this problem is easier to sort out.
Copies of the trained state $|b, \boldsymbol\alpha\rangle$ are required to perform classification, which is both good and bad. It is good because it compresses the kernel exponentially. It is bad because the trained model is not sparse. A pivotal point in the success of SVMs is structural risk minimization: the learned model should not overfit the data, otherwise its generalization performance will be poor. The quantum SVM massively overfits the data: every single data instance becomes a support vector. None of the $\alpha_i$-s will be zero. It would be interesting to see how this translates to actual classification performance on real-world data sets.
References
Berry, D. W.; Ahokas, G.; Cleve, R. & Sanders, B. C. Efficient quantum algorithms for simulating sparse Hamiltonians. Communications in Mathematical Physics, 2007, 270, 359-371.
Giovannetti, V.; Lloyd, S. & Maccone, L. Quantum random access memory. Physical Review Letters, 2008, 100, 160501.
Harrow, A. W.; Hassidim, A. & Lloyd, S. Quantum algorithm for linear systems of equations. Physical Review Letters, 2009, 103, 150502.
Lloyd, S.; Mohseni, M. & Rebentrost, P. Quantum self analysis. arXiv:1307.0401, 2013.
Lloyd, S.; Mohseni, M. & Rebentrost, P. Quantum algorithms for supervised and unsupervised machine learning. arXiv:1307.0411, 2013.
Rebentrost, P.; Mohseni, M. & Lloyd, S. Quantum support vector machine for big feature and big data classification. arXiv:1307.0471, 2013.
Suykens, J. A. & Vandewalle, J. Least squares support vector machine classifiers. Neural Processing Letters, 1999, 9, 293-300.