A fascinating paper recently appeared on arXiv that proves an exponential speedup of least squares support vector machines (SVMs) using quantum computing (Rebentrost et al., 2013). While the five-page eprint is fairly accessible to readers who are not well-versed in quantum information theory, I had a hard time understanding some steps. This write-up is a diluted interpretation of the content, bringing it down to my level.
The paper is all the more interesting because it is largely a synthesis: it connects several recent results with a particular formulation of SVMs. Least squares SVMs translate the optimization problem to a set of linear equations. The linear equations require the quick calculation of the kernel matrix -- this is one source of the speedup. The other source of the speedup is the efficient solution of the linear equations on quantum hardware.
Least Squares Support Vector Machines
A support vector machine (SVM) is a supervised learning algorithm which learns from a given independent and identically distributed training example set
$$\{(\mathbf{x}_1, y_1), \ldots, (\mathbf{x}_M, y_M)\},$$
where $\mathbf{x}_i \in \mathbb{R}^N$ are the data points, and $y_i \in \{-1, +1\}$ are the binary classes to which the data points belong. Here $N$ is the number of features that represent a single data instance, and $M$ is the number of training instances.
In its simplest, linear form, a support vector machine is a hyperplane that separates a set of positive examples from a set of negative examples with maximum margin. The formula for the output of a linear support vector machine is $u_i = \mathbf{w} \cdot \mathbf{x}_i + b$, where $\mathbf{x}_i$ is the $i$th training example, $u_i$ is the output of the support vector machine for the $i$th training example, $\mathbf{w}$ is the normal vector to the hyperplane, and the parameter $b$ helps determine the offset of the hyperplane from the origin. The margin is defined by the distance of the hyperplane to the nearest of the positive and negative examples. Support vectors are the training data points that lie on the margin.
SVMs allow a nonlinear kernel mapping that maps the training examples from an input space into a feature space. This is important for problems that are not linearly separable. SVMs with an embedding function $\phi$ require the solution of the following optimization problem:
$$\min_{\mathbf{w}, b, \xi} \ \frac{1}{2}\|\mathbf{w}\|^2 + C \sum_{i=1}^M \xi_i \quad \text{subject to} \quad y_i(\mathbf{w} \cdot \phi(\mathbf{x}_i) + b) \geq 1 - \xi_i, \quad \xi_i \geq 0.$$
Here $\xi_i$ stands for the classification error: this formulation is a soft-margin classifier, allowing classes to mingle. The cost parameter $C$ acts as a form of regularization: if the cost of misclassification is higher, a more accurate model is sought at the price of increased complexity, that is, a higher number of support vectors. The decision function becomes
$$f(\mathbf{x}) = \mathrm{sign}(\mathbf{w} \cdot \phi(\mathbf{x}) + b).$$
In most practical applications, we solve the dual formulation of the optimization problem. To obtain the dual problem, we introduce Lagrange multipliers $\alpha_i$ to accommodate the constraints of the minimization problem. The partial derivatives in $\mathbf{w}$, $b$, and $\xi_i$ define a saddle point of the Lagrangian, with which the dual formulation becomes the following quadratic programming problem:
$$\max_{\alpha} \ \sum_{i=1}^M \alpha_i - \frac{1}{2}\sum_{i,j=1}^M \alpha_i \alpha_j y_i y_j K(\mathbf{x}_i, \mathbf{x}_j) \quad \text{subject to} \quad \sum_{i=1}^M \alpha_i y_i = 0, \quad 0 \leq \alpha_i \leq C.$$
The function $K(\mathbf{x}_i, \mathbf{x}_j) = \phi(\mathbf{x}_i) \cdot \phi(\mathbf{x}_j)$ is the kernel function, the dot product of the embedding space. The kernel function bypasses calculating the embedding itself: we use it directly to find the optimum, and subsequently to classify new instances. The decision function for a binary decision problem becomes
$$f(\mathbf{x}) = \mathrm{sign}\left(\sum_{i=1}^M \alpha_i y_i K(\mathbf{x}_i, \mathbf{x}) + b\right).$$
On classical hardware, the quadratic problem is solved by sequential minimal optimization.
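As an aside, the kernel trick is easy to check numerically. The sketch below (plain NumPy; the degree-2 polynomial kernel and the explicit feature map `phi` are my own illustrative choices) confirms that the kernel function evaluates the dot product of the embedding space without ever constructing the embedding:

```python
import numpy as np

def poly_kernel(x, y, degree=2):
    """Polynomial kernel: evaluates the embedding-space dot product directly."""
    return (np.dot(x, y) + 1.0) ** degree

def phi(x):
    """Explicit degree-2 feature map for a 2-dimensional input,
    chosen so that phi(x) . phi(y) == (x . y + 1)**2."""
    x1, x2 = x
    return np.array([x1**2, x2**2,
                     np.sqrt(2.0) * x1 * x2,
                     np.sqrt(2.0) * x1,
                     np.sqrt(2.0) * x2,
                     1.0])

x = np.array([1.0, 2.0])
y = np.array([3.0, -1.0])
# Both expressions compute the same value; the kernel avoids the 6-dimensional embedding.
assert np.isclose(poly_kernel(x, y), np.dot(phi(x), phi(y)))
```

For a radial basis function kernel the embedding space is infinite-dimensional, which is why evaluating the kernel directly, rather than the embedding, is essential.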
Least squares SVMs modify the goal function of the primal problem by using the $L_2$ norm in the regularization term (Suykens and Vandewalle, 1999):
$$\min_{\mathbf{w}, b, e} \ \frac{1}{2}\|\mathbf{w}\|^2 + \frac{\gamma}{2}\sum_{i=1}^M e_i^2,$$
subject to the equality constraints
$$y_i(\mathbf{w} \cdot \phi(\mathbf{x}_i) + b) = 1 - e_i, \quad i = 1, \ldots, M.$$
The parameter $\gamma$ plays the same role as the cost parameter $C$. Seeking the saddle point of the corresponding Lagrangian, we obtain the following least squares problem:
$$\begin{pmatrix} 0 & \mathbf{1}^T \\ \mathbf{1} & K + \gamma^{-1} I \end{pmatrix} \begin{pmatrix} b \\ \boldsymbol{\alpha} \end{pmatrix} = \begin{pmatrix} 0 \\ \mathbf{y} \end{pmatrix}, \quad (1)$$
where $K_{ij} = K(\mathbf{x}_i, \mathbf{x}_j)$ is the kernel matrix. The least squares SVM trades off zero $\alpha_i$-s for nonzero error terms $e_i$-s, leading to increased model complexity.
Given a linear or polynomial kernel, the calculation of an entry $K_{ij}$ in the kernel matrix takes $O(N)$ time, thus calculating the whole kernel matrix has $O(M^2 N)$ time complexity. Solving the quadratic dual problem or the least squares formulation has $O(M^3)$ complexity. Combining the two steps, the classical SVM algorithm has at least $O(M^2(N + M))$ complexity. The quantum formulation yields an exponential speedup in these two steps, leading to an overall complexity of $O(\log(NM))$.
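For reference, the classical baseline can be written in a few lines. The sketch below is my own illustrative implementation (the function names and the toy data are not from the paper): it builds the linear kernel matrix in $O(M^2 N)$ and solves the $(M+1) \times (M+1)$ linear system of equation (1) in $O(M^3)$:

```python
import numpy as np

def train_ls_svm(X, y, gamma=1.0):
    """Classical least squares SVM with a linear kernel.
    Builds the kernel matrix in O(M^2 N), solves the system in O(M^3)."""
    M = X.shape[0]
    K = X @ X.T                          # linear kernel matrix
    F = np.zeros((M + 1, M + 1))
    F[0, 1:] = 1.0                       # constraint row: sum of alphas = 0
    F[1:, 0] = 1.0
    F[1:, 1:] = K + np.eye(M) / gamma    # K + gamma^{-1} I
    rhs = np.concatenate(([0.0], y))
    sol = np.linalg.solve(F, rhs)        # the O(M^3) step
    return sol[0], sol[1:]               # offset b, multipliers alpha

def predict(X_train, b, alpha, x):
    """Decision function f(x) = sign(sum_i alpha_i (x_i . x) + b)."""
    return np.sign(alpha @ (X_train @ x) + b)

# Toy separable data: two points per class.
X = np.array([[1.0, 1.0], [1.2, 0.8], [-1.0, -1.0], [-0.8, -1.2]])
y = np.array([1.0, 1.0, -1.0, -1.0])
b, alpha = train_ls_svm(X, y)
```

Both cubic steps, the kernel matrix construction and the linear solve, are exactly the ones the quantum algorithm attacks.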
Calculating the Kernel Matrix
In the quantum algorithm, the training instances are presented as quantum states $|\mathbf{x}_i\rangle = \frac{1}{|\mathbf{x}_i|}\sum_{k=1}^N (\mathbf{x}_i)_k |k\rangle$. We do not require the training instances to be normalized, but the normalization must be given separately. To reconstruct a state from quantum random access memory (QRAM, Giovannetti et al., 2008), we need to query the QRAM $O(\log N)$ times.
To evaluate the dot product of two training instances, we need to do the following (Lloyd et al., 2013):
- Generate two states, $|\psi\rangle$ and $|\phi\rangle$, with an ancilla variable;
- Estimate the parameter $Z = |\mathbf{x}_i|^2 + |\mathbf{x}_j|^2$ -- the sum of the squared norms of the two instances;
- Perform a projective measurement on the ancilla alone, comparing the two states.
$2Z$ times the probability $p$ of the success of the measurement yields the square of the Euclidean distance between the two training instances: $|\mathbf{x}_i - \mathbf{x}_j|^2 = 2Zp$. We calculate the dot product in the linear kernel as $\mathbf{x}_i \cdot \mathbf{x}_j = \frac{1}{2}\left(|\mathbf{x}_i|^2 + |\mathbf{x}_j|^2 - |\mathbf{x}_i - \mathbf{x}_j|^2\right)$.
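Classically, the identity behind the last step is trivial to verify; the quantum subroutine's only job is to estimate $Z$ and the squared distance, after which the dot product follows. A purely illustrative NumPy sketch:

```python
import numpy as np

def dot_from_distance(x, y):
    """Recover the linear-kernel dot product from the two quantities the
    quantum subroutine estimates: Z (the sum of squared norms) and the
    squared Euclidean distance."""
    Z = np.dot(x, x) + np.dot(y, y)      # sum of squared norms
    d2 = np.dot(x - y, x - y)            # |x - y|^2, from the measurement
    return (Z - d2) / 2.0

x = np.array([1.0, 2.0, 3.0])
y = np.array([-1.0, 0.5, 2.0])
assert np.isclose(dot_from_distance(x, y), np.dot(x, y))
```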
The state $|\psi\rangle = \frac{1}{\sqrt{2}}(|0\rangle|\mathbf{x}_i\rangle + |1\rangle|\mathbf{x}_j\rangle)$ is easy to construct by querying the QRAM. We estimate the other state,
$$|\phi\rangle = \frac{1}{\sqrt{Z}}\left(|\mathbf{x}_i| |0\rangle - |\mathbf{x}_j| |1\rangle\right),$$
and the parameter,
$$Z = |\mathbf{x}_i|^2 + |\mathbf{x}_j|^2,$$
together. We evolve the state
$$\frac{1}{\sqrt{2}}\left(|0\rangle - |1\rangle\right)|0\rangle$$
with the Hamiltonian
$$H = \left(|\mathbf{x}_i| |0\rangle\langle 0| + |\mathbf{x}_j| |1\rangle\langle 1|\right) \otimes \sigma_x.$$
The resulting state is
$$\frac{1}{\sqrt{2}}\left(\cos(|\mathbf{x}_i|t)|0\rangle - \cos(|\mathbf{x}_j|t)|1\rangle\right)|0\rangle - \frac{i}{\sqrt{2}}\left(\sin(|\mathbf{x}_i|t)|0\rangle - \sin(|\mathbf{x}_j|t)|1\rangle\right)|1\rangle.$$
By an appropriate choice of $t$ ($|\mathbf{x}_i| t, |\mathbf{x}_j| t \ll 1$), and by measuring the ancilla bit, we get the state $|\phi\rangle$ with probability approximately $\frac{1}{2} Z t^2$, which in turn allows the estimation of $Z$. If the desired accuracy of the estimation is $\epsilon$, then the complexity of constructing $|\phi\rangle$ and $Z$ is $O(\epsilon^{-1})$.
Once we have $|\psi\rangle$ and $|\phi\rangle$, we perform a swap test on the ancilla alone. A swap test is a sequence of a Hadamard gate, a Fredkin gate (CSWAP gate), and a second Hadamard gate, which checks the equivalence of two states $|\psi\rangle$ and $|\phi\rangle$ with the help of an ancilla qubit prepared in $|0\rangle$.
The controlled-SWAP transformation swaps $|\psi\rangle$ and $|\phi\rangle$ if the ancilla bit is $|1\rangle$. The Hadamard transformation is a one-qubit rotation mapping the qubit states $|0\rangle$ and $|1\rangle$ to two superposition states. With basic algebraic manipulation, we conclude that the final measurement of the ancilla gives zero with probability $\frac{1}{2}(1 + |\langle\psi|\phi\rangle|^2)$; in particular, if $|\psi\rangle$ and $|\phi\rangle$ are equal, the measurement will always give us zero.
With the QRAM accesses and the estimations, the overall complexity of evaluating a single dot product is $O(\epsilon^{-1}\log N)$.
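The swap test itself can be simulated with small matrices. The sketch below is my own construction (not from the paper): it builds the Hadamard, controlled-SWAP, Hadamard circuit explicitly and returns the probability of measuring the ancilla in $|0\rangle$, which equals $\frac{1}{2}(1 + |\langle\psi|\phi\rangle|^2)$:

```python
import numpy as np

def swap_test_p0(psi, phi):
    """Simulate a swap test: Hadamard on the ancilla, controlled-SWAP of
    the two registers, Hadamard again, then measure the ancilla.
    Returns P(ancilla = 0) = (1 + |<psi|phi>|^2) / 2."""
    d = len(psi)
    # Full state: ancilla (most significant) x register 1 x register 2.
    state = np.kron([1.0, 0.0], np.kron(psi, phi)).astype(complex)
    H = np.array([[1.0, 1.0], [1.0, -1.0]]) / np.sqrt(2.0)
    I = np.eye(d * d)
    # SWAP exchanges the two d-dimensional registers.
    SWAP = np.zeros((d * d, d * d))
    for i in range(d):
        for j in range(d):
            SWAP[i * d + j, j * d + i] = 1.0
    # Controlled-SWAP: identity when the ancilla is |0>, SWAP when it is |1>.
    CSWAP = np.block([[I, np.zeros_like(I)], [np.zeros_like(I), SWAP]])
    out = np.kron(H, I) @ CSWAP @ np.kron(H, I) @ state
    return float(np.sum(np.abs(out[: d * d]) ** 2))

# Equal states: always measure 0. Orthogonal states: measure 0 half the time.
p_equal = swap_test_p0(np.array([1.0, 0.0]), np.array([1.0, 0.0]))
p_orth = swap_test_p0(np.array([1.0, 0.0]), np.array([0.0, 1.0]))
```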
Obtaining the Optimum of the Least Squares Dual Formulation
If you only calculate the kernel matrix by quantum means, you have a complexity of $O(M^2 \epsilon^{-1}\log N)$. There is more to gain: an exponential speedup in the number of training examples is also possible.
The algorithm hinges on three ideas:
- Quantum matrix inversion is fast (Harrow et al., 2009);
- Simulation of sparse matrices is efficient (Berry et al., 2007);
- Non-sparse density matrices reveal their eigenstructure exponentially faster than in classical algorithms (Lloyd et al., 2013).
To solve the linear equation (1), we need to invert
$$F = \begin{pmatrix} 0 & \mathbf{1}^T \\ \mathbf{1} & K + \gamma^{-1} I \end{pmatrix}.$$
The matrix inversion algorithm needs to simulate the matrix exponential of $F$. We split $F$ into
$$J = \begin{pmatrix} 0 & \mathbf{1}^T \\ \mathbf{1} & 0 \end{pmatrix}$$
-- the adjacency matrix of a star graph, and
$$K_\gamma = \begin{pmatrix} 0 & 0 \\ 0 & K + \gamma^{-1} I \end{pmatrix},$$
and normalize $F$ with its trace:
$$\hat{F} = \frac{F}{\mathrm{tr}\, F} = \frac{J + K_\gamma}{\mathrm{tr}\, F}.$$
By using the Lie product formula, or, more precisely, the Baker-Campbell-Hausdorff formula, we get the exponential as
$$e^{-i\hat{F}\Delta t} = e^{-iJ\Delta t/\mathrm{tr}\,F}\, e^{-i\gamma^{-1}I\Delta t/\mathrm{tr}\,F}\, e^{-iK\Delta t/\mathrm{tr}\,F} + O(\Delta t^2).$$
Of these exponentials, the sparse matrix $J$ and the constant multiple of the identity matrix are easy to simulate (Berry et al., 2007).
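The split and the Lie product formula are easy to check numerically on a classical machine. The sketch below (NumPy only; the toy kernel matrix, $\gamma$, and the step count are my own arbitrary choices) compares the Trotterized product against the exact exponential of $\hat{F}$:

```python
import numpy as np

def expm_herm(A, t):
    """exp(-i t A) for a Hermitian matrix A, via eigendecomposition."""
    w, V = np.linalg.eigh(A)
    return (V * np.exp(-1j * t * w)) @ V.conj().T

# Toy kernel matrix; sizes and values are arbitrary.
rng = np.random.default_rng(0)
Xd = rng.normal(size=(3, 2))
K = Xd @ Xd.T
gamma = 2.0

M = K.shape[0]
J = np.zeros((M + 1, M + 1)); J[0, 1:] = 1.0; J[1:, 0] = 1.0  # star graph
Kg = np.zeros((M + 1, M + 1)); Kg[1:, 1:] = K + np.eye(M) / gamma
F = J + Kg
trF = np.trace(F)

# Trotterize exp(-i F-hat t) as (exp(-i J dt/trF) exp(-i Kg dt/trF))^n;
# the two-factor split already shows the O(dt) Lie product error.
t, n = 1.0, 1000
dt = t / n
step = expm_herm(J / trF, dt) @ expm_herm(Kg / trF, dt)
approx = np.linalg.matrix_power(step, n)
exact = expm_herm(F / trF, t)
err = np.linalg.norm(approx - exact)
```

The error shrinks as the number of Trotter steps grows, which is what the quantum simulation exploits.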
On the other hand, the kernel matrix is not sparse. This is where quantum self analysis helps: given multiple copies of a density matrix $\rho$, it is possible to perform $e^{-i\rho t}$ (Lloyd et al., 2013). It is called self analysis because a state plays an active role in its own measurement: through exponentiation, it functions as a Hamiltonian. With the quantum calculation of the dot product and access to the QRAM, we obtain the normalized kernel matrix $\hat{K} = K/\mathrm{tr}\,K$, a normalized Hermitian matrix, which makes it a prime candidate for quantum self analysis. The exponentiation is done in $O(\log N)$ steps. We obtain $e^{-i\hat{K}\Delta t}$ by
$$e^{-i\hat{K}\Delta t}\,\rho\, e^{i\hat{K}\Delta t} \approx \mathrm{tr}_1\left\{e^{-iS\Delta t}\left(\hat{K}\otimes\rho\right)e^{iS\Delta t}\right\}, \quad (2)$$
where $S = \sum_{ij} |i\rangle\langle j| \otimes |j\rangle\langle i|$ is the swap operator.
We use (2) to perform quantum phase estimation to get the eigenvalues and eigenvectors of $\hat{F}$. If the desired accuracy is $\epsilon$, we need $O(\epsilon^{-3})$ copies of the state. Then we express the vector $(0, \mathbf{y})^T$ of (1) in the eigenbasis, and invert the eigenvalues to obtain the solution of the linear equation (Harrow et al., 2009).
With this inversion algorithm, the overall time complexity of training the support vector parameters is $O(\log(NM))$.
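The classical analogue of the inversion step is easy to write down. The sketch below (illustrative, with a random well-conditioned symmetric matrix of my own choosing) expresses the right-hand side in the eigenbasis and divides by the eigenvalues -- which is what the quantum algorithm does in superposition:

```python
import numpy as np

def invert_via_eigenbasis(F, rhs):
    """Solve F x = rhs for a Hermitian F: express rhs in the eigenbasis
    of F, divide each component by its eigenvalue, transform back."""
    w, V = np.linalg.eigh(F)
    coeffs = V.conj().T @ rhs        # rhs in the eigenbasis
    return V @ (coeffs / w)          # invert the eigenvalues

rng = np.random.default_rng(1)
A = rng.normal(size=(4, 4))
F = A + A.T + 10.0 * np.eye(4)       # well-conditioned symmetric matrix
rhs = rng.normal(size=4)
x = invert_via_eigenbasis(F, rhs)
```

Classically this costs $O(M^3)$; the quantum algorithm performs the eigenvalue inversion on the phase-estimated state without ever writing out the eigendecomposition.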
Limitations and Open Questions
The approach only works for dense training vectors, for obvious reasons. In many classical algorithms, execution speed -- and not complexity -- is improved by exploiting sparse data structures. Such structures are common in applications such as text mining and recommendation engines. It is unclear how a sparse vector maps to a quantum state.
The choice of kernel function is restricted. Linear and polynomial kernels are easy to implement; in fact, the inner product is evaluated directly in the embedding space. Radial basis function kernels are much harder to imagine in this framework, and these are the most common kernels.
Non-symmetric kernels are out of reach for the time being. One restriction is the linear or polynomial kernel. The other constraint is the matrix inversion, although the generic method extends to non-Hermitian matrices, so this problem is easier to sort out.
Quantum states are required to perform classification, which is both good and bad. It is good because the kernel is compressed exponentially. It is bad because the trained model is not sparse. A pivotal point in the success of SVMs is structural risk minimization: the learned model should not overfit the data, otherwise its generalization performance will be poor. The quantum SVM massively overfits the data: every single data instance becomes a support vector, and none of the $\alpha_i$-s will be zero. It would be interesting to see how this translates to actual classification performance on real-world data sets.
Berry, D. W.; Ahokas, G.; Cleve, R. & Sanders, B. C. Efficient quantum algorithms for simulating sparse Hamiltonians. Communications in Mathematical Physics, 2007, 270, 359-371.
Giovannetti, V.; Lloyd, S. & Maccone, L. Quantum random access memory. Physical Review Letters, 2008, 100, 160501.
Harrow, A. W.; Hassidim, A. & Lloyd, S. Quantum algorithm for linear systems of equations. Physical Review Letters, 2009, 103, 150502.
Lloyd, S.; Mohseni, M. & Rebentrost, P. Quantum self analysis. arXiv:1307.0401, 2013.
Lloyd, S.; Mohseni, M. & Rebentrost, P. Quantum algorithms for supervised and unsupervised machine learning. arXiv:1307.0411, 2013.
Rebentrost, P.; Mohseni, M. & Lloyd, S. Quantum support vector machine for big feature and big data classification. arXiv:1307.0471, 2013.
Suykens, J. A. & Vandewalle, J. Least squares support vector machine classifiers. Neural Processing Letters, 1999, 9, 293--300.