Accelerating Large-Scale Inference with Anisotropic Vector Quantization

Ruiqi Guo*, Philip Sun*, Erik Lindgren*, Quan Geng, David Simcha, Felix Chern, Sanjiv Kumar Google Research
[email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected]

$^{*}$Equal contributions.

Abstract

Quantization based techniques are the current state-of-the-art for scaling maximum inner product search to massive databases. Traditional approaches to quantization aim to minimize the reconstruction error of the database points. Based on the observation that for a given query, the database points that have the largest inner products are more relevant, we develop a family of anisotropic quantization loss functions. Under natural statistical assumptions, we show that quantization with these loss functions leads to a new variant of vector quantization that more greatly penalizes the parallel component of a datapoint's residual relative to its orthogonal component. The proposed approach, whose implementation is open-source, achieves state-of-the-art results on the public benchmarks available at ann-benchmarks.com.

Executive Summary: In today's data-driven world, large-scale machine learning systems, such as recommendation engines and classifiers handling millions of categories, rely on finding the maximum inner product between a user query and database items—essentially identifying the best matches quickly. This process, known as maximum inner product search (MIPS), is vital for applications like personalized suggestions on streaming platforms or rapid predictions in AI models with vast output spaces. However, with databases containing billions of high-dimensional vectors, exhaustive searches are computationally prohibitive and slow, leading to high costs and delays. Traditional compression techniques, called quantization, shrink these vectors to speed up searches but often prioritize overall accuracy over the most relevant matches, resulting in suboptimal performance for real-world priorities.

This document introduces a new quantization approach to better tackle MIPS by focusing on the most important matches. Specifically, it aims to evaluate and demonstrate a "score-aware" loss function that weights quantization errors based on the strength of inner products, ensuring higher-scoring pairs (likely top results) are approximated more precisely. The goal is to boost search accuracy and speed without increasing resource demands.

The authors developed this method through theoretical analysis and practical adaptations to common quantization techniques. They assume queries are uniformly distributed on a unit sphere—a reasonable starting point for many embedding-based systems—and derive a loss function that decomposes errors into parallel (aligned with data points) and orthogonal (perpendicular) components, penalizing parallel errors more heavily to preserve key similarities. This "anisotropic" weighting is applied to vector quantization (grouping vectors into codebooks) and product quantization (splitting vectors into subspaces for finer compression), using iterative algorithms similar to k-means clustering but optimized for the new loss. Experiments tested this on standard benchmarks like Glove-1.2M (1.2 million 100-dimensional word embeddings) and Amazon-670k (extreme classification data), comparing against baselines over periods from 2014-2019 datasets, with sample sizes up to millions of points and codebook sizes yielding 64-2048 bits per vector. Key assumptions include equal vector norms in some cases, handled by normalization.

The core findings highlight substantial improvements. First, the score-aware loss increased recall— the rate at which the true top match is found in the top results—by 4-13% at bitrates from 256 to 2048, outperforming traditional reconstruction-error methods across low to high compression levels. Second, it reduced relative errors in estimated inner product values by 20-50% for top matches, crucial for tasks like probability computations in AI models. Third, at fixed bitrates of 100 and 200, it surpassed state-of-the-art competitors like QUIPS and LSQ by achieving 5-15% higher recall at all ranks (e.g., top-1 to top-100). Fourth, in end-to-end speed-recall tests on public benchmarks, the method was the fastest at high recall rates (above 90%), processing queries 1.5-2 times quicker than 11 rivals including FAISS and HNSWlib on a standard CPU. Finally, adaptations to binary quantization yielded 2-3x recall gains on image datasets like SIFT1M.

These results mean more accurate and efficient MIPS in practice, cutting computation time and memory use by 50-90% through compression while maintaining or exceeding retrieval quality—directly impacting costs in cloud-based AI (potentially millions in savings for billion-scale systems) and enabling real-time performance in user-facing apps. Unlike prior work focused on uniform error reduction, this prioritizes decision-critical matches, differing from expectations by showing orthogonal errors can be tolerated more, which matters for scaling without accuracy trade-offs in neural embedding tasks.

Leaders should integrate this open-source method into existing MIPS libraries like Google's ScaNN or FAISS to upgrade recommendation and classification pipelines, starting with pilot tests on production datasets. For systems with non-standard query distributions, estimate and adjust the weighting parameter (e.g., threshold T=0.2 for typical cases). Trade-offs include slightly longer training times (2-5x for codebooks) versus gains in inference speed; if embeddings vary widely in norms, full normalization is advised. Further work could validate on ultra-high-dimensional (e.g., 10,000+) or non-embedding data via pilots.

While robust on neural-derived benchmarks, limitations include reliance on isotropic query assumptions—real queries might cluster, requiring empirical tweaks—and less testing on non-word or low-redundancy datasets, where gains could be smaller. Overall confidence is high for standard large-scale retrieval (90%+ on validated metrics), but caution is warranted for specialized domains until more data confirms universality.

1. Introduction

Section Summary: Maximum inner product search, or MIPS, is a technique used in large-scale tasks like recommending items or classifying data among millions of categories by finding the vector in a database that best matches a user's query through simple dot-product math. While exact searches are too slow for huge datasets, approximate methods like quantization speed things up but traditionally focus on minimizing overall errors in recreating the data, which isn't ideal for MIPS since mistakes in high-matching items hurt results more. To fix this, the authors introduce a new "score-aware" approach that prioritizes errors based on how well vectors align with queries, leading to better performance and more accurate matches on standard benchmarks.

Maximum inner product search (MIPS) has become a popular paradigm for solving large scale classification and retrieval tasks. For example, in recommendation systems, user queries and documents are embedded into a dense vector space of the same dimensionality and MIPS is used to find the most relevant documents given a user query ([1]). Similarly, in extreme classification tasks ([2]), MIPS is used to predict the class label when a large number of classes, often on the order of millions or even billions are involved. Lately, MIPS has also been applied to training tasks such as scalable gradient computation in large output spaces ([3]), efficient sampling for speeding up softmax computation ([4]) and sparse updates in end-to-end trainable memory systems ([5]).

To formally define the Maximum Inner Product Search (MIPS) problem, consider a database $X = {x_i}_{i=1, 2, \dots, n}$ with $n$ datapoints, where each datapoint $x_i \in {\mathbb{R}}^d$ in a $d$-dimensional vector space. In the MIPS setup, given a query $q \in {\mathbb{R}}^d$, we would like to find the datapoint $x \in X$ that has the highest inner product with $q$, i.e., we would like to identify

$ \begin{align*} x_i^* := \operatorname{arg, max}_{x_i \in X} \langle q, x_i \rangle. \end{align*} $

Exhaustively computing the exact inner product between $q$ and $n$ datapoints is often expensive and sometimes infeasible. Several techniques have been proposed in the literature based on hashing, graph search, or quantization to solve the approximate maximum inner product search problem efficiently, and the quantization based techniques have shown strong performance ([6, 7, 8]).

In most traditional quantization works, the objective in the quantization procedures is to minimize the reconstruction error for the database points. We show this is a suboptimal loss function for MIPS. This is because for a given query, quantization error for database points that score higher, or have larger inner products, is more important. Using this intuition, we propose a new family of score-aware quantization loss functions and apply it to multiple quantization techniques. Our contributions are as follows:

We propose the score-aware quantization loss function. The proposed loss can work under any weighting function of the inner product and regardless of whether the datapoints vary in norm.
Under natural statistical assumptions, we show that the score-aware quantization loss can be efficiently calculated. The loss function leads to an anisotropic weighting that more greatly penalizes error parallel with the datapoint than error orthogonal to the datapoint.
The proposed loss is generally applicable to many quantization methods. We demonstrate the codebook learning and quantization procedures for product quantization and vector quantization can be efficiently adapted to the proposed loss function.
We show that anisotropic quantization leads to large MIPS performance gains over reconstruction loss-based techniques. Our method achieves state-of-the-art performance on standard large-scale benchmarks such as Glove-1.2M. In addition to recall gains, anisotropic quantization gives significantly more accurate inner product value approximations.

2. Background and Related Works

Section Summary: In machine learning systems like recommender engines and information retrieval, a key challenge is efficiently finding the best matches between user queries and items by computing maximum inner products in a shared vector space, where items are represented as embeddings. To speed this up, techniques divide the search space into buckets to evaluate fewer items, using methods like tree searches, hashing, or graph navigation, while quantization compresses data to enable faster calculations, reduce memory use, and save storage. This work builds on various quantization approaches by introducing a new loss function tailored to improve the overall inner product search performance, rather than just minimizing data reconstruction errors.

2.1 Inference as Maximum Inner Product Search

Efficient maximum inner product search (MIPS) is necessary for many large-scale machine learning systems. One popular approach to information retrieval systems and recommender systems uses representation learning in the embedding space. In this framework, we learn embedding functions to map items to be retrieved in a common vector space, where the items can be words, images, users, audio, products, web pages, graph nodes, or anything of interest ([1, 9, 10, 11, 12]).

In recommender systems, two networks are jointly trained to generate query (user) vectors and item vectors, such that embedding vectors of queries and relevant items have high inner product when computed in the embedding space. To perform inference, we first pre-compute a database of embedding vectors for items to be recommended. When a query arrives, we compute the query embedding then return the items with the highest inner product. In extreme classification, a neural network classifier is trained, where each row of the weight matrix of the classification layer corresponds to the embedding of a class label ([2, 13]). In both settings, the computationally expensive operation is finding the item embedding that has the largest inner product with the query embedding, which can be efficiently solved by Maximum Inner Product Search (MIPS).

2.2 Methods for accelerating MIPS

There is a large body of similarity search literature on max inner product and nearest neighbor search. We refer readers to ([14, 15]) for a comprehensive survey. We include a brief summary here.

There are two main tasks required to develop an efficient MIPS system. One task is to reduce the number of items that are scored to identify the top result. This is typically done with a space partitioning method. The other task is improving the rate at which items are scored. This is typically done with quantization, and is where the main contribution of our work lies. Successful implementation of MIPS systems requires good performance in both tasks.

Many researchers have developed high quality implementations of libraries for nearest neighbor search, such as SPTAG [16], FAISS [8], and hnswlib [17]. We compare with the ones available on ANN-Benchmarks in Section 5.

2.2.1 Reducing the Number of Evaluations

One class of approaches to reducing the number of items scored is space partitioning. These approaches partition the space into different buckets. To perform MIPS in this setting, we find the relevant buckets for a given query and score only the items in these buckets.

Examples of this approach include tree search methods and locality sensitive hashing. Tree search methods such as ([18, 19]) partition the space recursively, forming a tree. Locality sensitive hashing ([20, 21, 22, 23]) partitions the space using a similarity-preserving hash function. There is also a class of approaches based on graph search ([17, 24]). These methods work by navigating a graph by greedily selecting the neighbor with the highest dot product.

2.2.2 Quantization

Quantization is an important technique for building state-of-the-art MIPS systems in large scale settings. Below we describe the several ways that quantization improves performance.

Efficient dot product computations: We can calculate the dot product of a $d$ dimensional query vector with $n$ quantized points in time $O(dk + mn)$ using look up tables, where $k$ is the size of each quantization codebook and $m$ is the number of codebooks. For typical choices of $k$ and $m$ this is faster than the $O(n d)$ complexity required for exact computation.
Memory bandwidth: modern processors need workloads with a high amount of computation per memory read in order to fully utilize their resources. Quantization compresses datapoints, resulting in less memory bandwidth usage and higher processor utilization.
Storage: quantized datapoints take up less space in memory or on disk. For large-scale datasets, this allows more datapoints to be stored on a single machine.

One approach to quantization is with random projections ([25, 26, 27]). One issue with random projections is that quantization is oblivious to the data, and it may be more efficient to use a quantization method that is able to exploit structure in the data. Quantization methods of this form are available for binary quantization ([28, 29, 30]), product quantization ([31, 32, 33, 34]), additive quantization ([7, 35]), and ternary quantization ([36]). We discuss product quantization in more detail in Section 4. There are also lines of work that focus on learning transformations before quantization ([37, 6, 38]). Learning quantization from the observed data distribution also has been studied in [39, 40, 41].

Our work differs from the above methods as they essentially focus on minimizing reconstruction error as a loss function, while we develop an approach in the following section where we minimize a novel loss function that is designed to improve the downstream MIPS objective.

We also highlight the work [42], where they consider quantization objectives for word embeddings that improve the downstream performance of training models for natural language processing tasks.

3. Problem Formulation

Section Summary: Traditional quantization methods aim to reduce errors in approximating data points, which helps minimize mistakes in calculating similarities between queries and database items, assuming queries are evenly distributed in direction. However, not all similarities are equally critical—those with high values, which likely rank highest in searches, deserve more accurate approximation to avoid skewing results. To address this, the authors introduce a score-aware quantization loss that prioritizes errors in high-similarity pairs by weighting them according to their true similarity score, breaking down the error into components aligned and perpendicular to the data point for better control.

Common quantization techniques focus on minimizing the reconstruction error (sum of squared error) when $x$ is quantized to $\tilde{x}$. It can be shown that minimizing the reconstruction errors is equivalent to minimizing the expected inner product quantization error under a mild condition on the query distribution without assumption on the database point distribution. Indeed, consider the quantization objective of minimizing the expected total inner product quantization errors over the query distribution:

$ \begin{align} \mathbb{E}q\sum{i=1}^n (\langle q, x_i \rangle - \langle q, \tilde{x_i} \rangle)^2= \mathbb{E}q \sum{i=1}^n \langle q, x_i - \tilde{x_i} \rangle^2. \end{align}\tag{1} $

Under the assumption that $q$ is isotropic, i.e., $\mathbb{E} [q q^T] = c I$, where $I$ is the identity matrix and $c \in {\mathbb{R}}^{+}$, the objective function becomes

$ \begin{align*} \sum_{i=1}^n \mathbb{E}q \langle q, x_i - \tilde{x_i} \rangle^2 &= \sum{i=1}^n \mathbb{E}q (x_i - \tilde{x_i})^T q q^T (x_i - \tilde{x_i}) \ &= c \sum{i=1}^n |x_i - \tilde{x_i} |^2 \end{align*} $

Therefore, the objective becomes minimizing the reconstruction errors of the database points $\sum_{i=1}^n |x_i-\tilde{x_i} |^2$, and this has been considered extensively in the literature.

One key observation about the above objective function Equation 1 is that it takes expectation over all possible combinations of datapoints $x$ and queries $q$. However, it is easy to see that not all pairs of $(x, q)$ are equally important. The approximation error on the pairs which have a high inner product is far more important since they are likely to be among the top ranked pairs and can greatly affect the search result, while for the pairs whose inner product is low the approximation error matters much less. In other words, for a given datapoint $x$, we should quantize it with a bigger focus on its error with those queries which have high inner product with $x$. See Figure 1 for the illustration.

Following this key observation, we propose the score-aware quantization loss. This is a new loss function for quantization that weighs the inner product approximation error by $w$, an arbitrary function of our choice that returns a weight based on the value of the true inner product. Specifically, we define the loss function as the following:

########## {caption="Definition 1"}

Given a datapoint $x_i$, its quantization $\tilde{x_i}$, and a weight function $w: \mathbb{R} \mapsto \mathbb{R}^+$ of the inner product score, the score-aware quantization loss with respect to a query distribution $\mathcal{Q}$ is defined as

$ \ell(x_i, \tilde{x_i}, w) = \mathbb{E}_{q \sim \mathcal{Q}}[w(\langle q, x_i \rangle) \langle q, x_i-\tilde{x_i} \rangle^2].\tag{2} $

$**Figure 1:** (a) Not all pairs of $q$ and $x$ are equally important: for $x$, it is more important to accurately quantize the inner product of $\langle q_1, x \rangle$ than $\langle q_2, x \rangle$ or $\langle q_3, x \rangle$, because $\langle q_1, x \rangle$ has a higher inner product and thus is more likely to be the maximum; (b) Quantization error of $x$ given one of its quantizer $c_2$ can be decomposed to a parallel component $r_{\parallel}$ and an orthogonal component $r_{\perp}$. (c) Graphical illustration of the intuition behind Equation 2. Even if $c_3$ is closer to $x$ in terms of Euclidean distance, $c_2$ is a better quantizer than $c_3$ in terms of inner product approximation error of $\langle q_1, x-c \rangle$. Notice that $c_3$ incur more parallel loss ($r_{\parallel}$), while $c_2$ incur more orthogonal loss ($r_{\perp}$).$

Since the norm of $q$ does not matter to the ranking result, we can assume $||q||=1$ without loss of generality. Similarly, assuming we have no prior knowledge of the query distribution $\mathcal{Q}$, we trivially assume $q$ is uniformly spherically distributed. The expectation can be recomputed if $\mathcal{Q}$ is known or estimated empirically.

3.1 Analyzing Score-Aware Quantization Loss

We show that regardless of the choice of $w$, a score-aware quantization loss $\ell(x_i, \tilde{x_i}, w)$ always decomposes into an anisotropic weighted sum of the magnitudes of the parallel and orthogonal residual errors. These two errors are defined as follows: first, define the residual error of a quantization $\tilde{x_i}$ as $x_i-\tilde{x_i}$. The parallel residual error is the component of the residual error parallel to the datapoint $x_i$; it can be computed as

$ r_{\parallel}(x_i, \tilde{x_i})=\dfrac{\langle (x_i-\tilde{x_i}), x_i \rangle x_i}{||x_i||^2}. $

Orthogonal residual error is defined analogously, and can be computed as

$ r_\perp(x_i, \tilde{x}_i) = (x_i - \tilde{x}i) - r\parallel(x_i, \tilde{x}_i). $

These two components are illustrated in Figure 1b. The relative weights of these two error components in contributing to the score-aware loss are determined by the choice of $w$.

########## {caption="Theorem 2"}

Suppose we are given a datapoint $x_i$, its quantization $\tilde{x_i}$, and a weight function $w$. Assuming that query $q$ is uniformly distributed in the $d$-dimensional unit sphere, the score-aware quantization loss equals

$ \begin{aligned} \ell(x_i, \tilde{x_i}, w) &= h_{\parallel}(w, ||x_i||)||r_{\parallel}(x_i, \tilde{x_i})||^2 \ &+ h_{\perp}(w, ||x_i||)||r_{\perp}(x_i, \tilde{x_i})||^2 \end{aligned} $

with $h_{\parallel}$ and $h_{\perp}$ defined as follows:

$ \begin{aligned} h_{\parallel} &:= (d-1)\int_0^\pi w(||x_i||\cos\theta)(\sin^{d-2}\theta-\sin^{d}\theta) d\theta \ h_{\perp} &:= \int_0^\pi w(||x_i||\cos\theta)\sin^d\theta d\theta. \end{aligned} $

Proof: See Section 7.1.

Any weight function would work for the above proposed loss. For the MIPS problem, it is intuitive to choose $w$ so that it puts greater weight on larger inner products. For such $w$, we show that parallel quantization error is weighted more heavily than orthogonal quantization error. This is formalized below and illustrated in Figure 1.

########## {caption="Theorem 3"}

For any $w$ such that $w(t)=0$ for $t<0$ and $w(t)$ is monotonically non-decreasing for $t\ge0$,

$ h_{\parallel}(w, ||x_i||)\ge h_{\perp}(w, ||x_i||) $

with equality if and only if $w(t)$ is constant for $t\in[-||x_i||, ||x_i||].$

Proof: See Section 7.2.

3.2 Special case of $w(t)=\mathbf{I}(t\ge T)$

One particular $w$ of interest is the function $w(t)=\mathbf{I}(t\ge T)$. This weight function only considers quantization loss when the dot product is above a threshold $T$. Since $\mathbf{I}(t\ge T)$ satisfies the conditions for Theorem 3, it effectively penalizes parallel quantization error more greatly than orthogonal error. With this weight function, our expressions for $h_{\parallel}$ and $h_{\perp}$ simplify to:

$ \begin{aligned} h_{\parallel} &= (d-1)\int_0^{\arccos (T/||x_i||)} \sin^{d-2}\theta-\sin^{d}\theta d\theta \ h_{\perp} &= \int_0^{\arccos (T/||x_i||)} \sin^d\theta d\theta \end{aligned} $

With $w(t)=\mathbf{I}(t\ge T)$, we have

$ \begin{align*} \ell(x_i, \tilde{x}i, w) =;& h{\parallel}(w, ||x_i||)||r_{\parallel}(x_i, \tilde{x}i)||^2 + \ & h{\perp}(w, ||x_i||) ||r_{\perp}(x_i, \tilde{x}i)||^2 \ \propto; &\eta(w, ||x_i||)||r{\parallel}(x_i, \tilde{x}i)||^2 + ||r{\perp}(x_i, \tilde{x}_i)||^2 \end{align*} $

where $\eta(w, ||x_i||) := \dfrac{h_{\parallel}(w, ||x_i||)}{h_{\perp}(w, ||x_i||)}$.

$**Figure 2:** The ratio $\eta(\mathbf{I}(t\ge T=0.2), \|x\|=1)/(d-1)$ in Equation 3 computed analytically as a function of $d$ quickly approaches the limit defined in Equation 3.$

We can recursively compute $\eta(w=\mathbf{I}(t\ge T), ||x_i||)$ as a function of $d$ analytically. Furthermore we can prove that $\frac{\eta}{d-1}$ has an limit as $d\to\infty$, as demonstrated empirically in Figure 2. We can use this limit, which is easy to evaluate, as a proxy of $\eta$ in computing the proposed loss.

########## {caption="Theorem 4"}

$ \lim_{d\to\infty}\dfrac{\eta(\mathbf{I}(t\ge T), ||x_i||)}{d-1}=\dfrac{(T/||x_i||)^2}{1-(T/||x_i||)^2}\tag{3} $

Proof: See Section 7.3.

As special cases, when $T=0$, $\eta(\mathbf{I}(t\ge 0), ||x_i||)=1$ which implies parallel and orthogonal errors are weighted equally. When $T=||x_i||$, we have $\eta(\mathbf{I}(t\ge ||x_i||), ||x_i||)=\infty$ which indicates we should only consider parallel error.

Theorem 2 shows that the weight of each datapoint's parallel and orthogonal quantization errors are dependent on $||x_i||$. However, when the database has constant norm, i.e. $||x_i|| = c$, we can use the following simplified form:

$ \begin{align*} &\sum_{i=1}^{n} \ell(x_i, \tilde{x}i, \mathbf{I}(t\ge T)) \ &\propto \eta(w, c) \sum{i=1}^n ||r_{\parallel}(x_i, \tilde{x}i)||^2 + \sum{i=1}^n ||r_{\perp}(x_i, \tilde{x}_i)||^2 \end{align*} $

4. Application to Quantization Techniques

Section Summary: This section explores how a special weighted loss function, which balances errors along and perpendicular to data directions, can improve vector quantization for faster database searches. It describes building a dictionary of codewords by optimizing this loss through an iterative process similar to k-means clustering, where points are assigned to the nearest codeword and codewords are updated using a precise formula that accounts for the weights, leading to quicker computations of similarities between queries and stored data. The approach also extends to product quantization, which divides data into subspaces and quantizes them separately to handle much larger effective dictionaries without slowing down performance.

In this section we consider the codebook learning and quantization procedure for our proposed anisotropic loss function. In the previous sections, we established that the loss function, $\ell(x_i, \tilde{x}_i, w)$ leads to a weighted combination of parallel quantization error and orthogonal quantization error. In practice, we can choose a fixed $\eta$ according to the choice of $w$ such as the one suggested in Section 3.2.

In vector quantization, we first construct a dictionary $C = {c_1, c_2, \ldots, c_k}$. To quantize a vector $x$ we replace $x$ with one of the codewords. Typically, the quantized vector $\tilde{x}$ minimizes some loss function: $\tilde{x} = \operatorname{arg, min}_{c_1, c_2, \ldots, c_k} L(x_i, c_i)$.

After we quantize a database of $n$ points, we can calculate the dot product of a query vector $q$ with all quantized points in $O(k d + n)$ time. This is much better than the $O(n d)$ time required for the original unquantized database. We achieve the $O(kd+n)$ runtime by computing a lookup table containing the inner product of the $q$ with each of the $k$ codewords in $O(k d)$ time. We then do a table lookup for each of the $n$ datapoints to get their corresponding inner products.

In order to construct the dictionary $C$, we need to optimize the choice of codewords over the loss function. For $\ell_2$-reconstruction loss, the optimization problem becomes

$ \min_{c_1, c_2, \ldots, c_k \in \mathbb{R}^d}\sum_{x_i}\min_{\tilde{x}_i \in {c_1, c_2, \ldots, c_k}}|x_i - \tilde{x}_i|^2. $

This is exactly the well-studied $k$-means clustering objective, which is often solved using Lloyd's algorithm.

If, as in the previous section, we have our loss function $\ell(x, \tilde{x}) = h_{i, \parallel} |r_\parallel(x_i, \tilde{x_i})|^2 + h_{i, \perp}|r_\perp(x_i, \tilde{x_i})|^2$ for appropriate scaling parameters $h_{i, \parallel}$, $h_{i, \perp}$, we obtain a new objective function we call the anisotropic vector quantization problem.

########## {caption="Definition"}

Given a dataset $x_1, x_2, \ldots, x_n$ of points in $\mathbb{R}^d$, scaling parameters $h_{i, \parallel}$, $h_{i, \perp}$ for every datapoint $x_i$, and $k$ codewords, the anisotropic vector quantization problem is finding the $k$ codewords that minimize the objective function

$ \begin{align*} \min_{c_1, \ldots, c_k}\sum_{x_i}\min_{\tilde{x}i \in {c_1, \ldots, c_k}} &h{i, \parallel}|r_\parallel(x_i, \tilde{x}_i)|^2 \

&h_{i, \perp}|r_\perp(x_i, \tilde{x}_i)|^2. \end{align*} $

Next we develop an iterative algorithm to optimize the anisotropic vector quantization problem. Similar to Lloyd's algorithm ([43]), our algorithm iterate between partition assignment step and codebook update step:

(Initialization Step) Initialize codewords $c_1, c_2, \ldots, c_k$ to be random datapoints sampled from $x_1 \ldots x_n$.
(Partition Assignment Step) For each datapoint $x_i$ find its codeword $\tilde{x_i} = \operatorname{arg, min}_{\tilde{x}_i \in {c_1, \ldots, c_k}} \ell(x_i, \tilde{x}_i)$. This can be done by enumerating all $k$ possile choices of codewords.
(Codebook Update Step) For every codeword $c_j$, let $X_j$ be all datapoints $x_i$ such that $\tilde{x}_i = c_j$. Update $c_j$ by

$ c_j \gets \operatorname{arg, min}{c \in \mathbb{R}^d}\sum{x_i \in X_j} \ell(x_i, c). $

Repeat Step 2 and Step 3 until convergence to a fixed point or maximum number of iteration is reached.

In each iteration, we need perform update step for each of the codeword. Given a partition of the datapoints $X_j$, we can find the optimal value of the codeword $c_j$ that minimizes the following objective:

$ c_j = \operatorname{arg, min}{c \in \mathbb{R}^d}\sum{x \in X_j} h_{i, \parallel} |r_\parallel(x_i, c)|^2 + h_{i, \perp}|r_\perp(x_i, c)|^2.\tag{4} $

By setting gradient respect to $c_j$ to zero, we obtain the following update rule:

########## {caption="Theorem 5"}

Optimal codeword $c_j$ can be obtained in closed form by solving the optimization problem in Equation 4 for a partition $X_j$. The update rule for the codebook is

$ \begin{aligned} c_j^* = \Bigg(I&\sum_{x_i\in X_j}h_{i, \perp} + \ &\sum_{x_i \in X_j} \dfrac{h_{i, \parallel}-h_{i, \perp}}{||x_i||^2}x_ix_i^T\Bigg)^{-1}\sum_{x_i \in X_j}h_{i, \parallel} x_i \end{aligned} $

Proof: See Section 7.4 of the Appendix for the proof.

As expected, we see that when all $h_{i, \parallel} = h_{i, \perp}$, our codeword update is equivalent to finding the weighted average of the partition. Furthermore, if $w(t)=1$, the update rule becomes finding the average of datapoints in the partition, same as standard $k$-means update rule. Additionally, since there are only a finite number of partitions and at every iteration the loss function decreases or stays constant, our solution will eventually converge to a fixed point.

4.1 Product Quantization

In vector quantization with a dictionary of size $k$, we quantize each datapoint into one of $k$ possible codewords. We can think of this as encoding each datapoint with one dimension with $k$ possible states.

With product quantization we encode each datapoint into an $M$ dimensional codeword, each with $k$ possible states. This allows us to represent $k^M$ possible codewords, which would not be scalable with vector quantization. To do this, we split each datapoint $x$ into $M$ subspaces each of dimension $d / M$: $x = (x^{(1)}, x^{(2)}, \ldots, x^{(m)})$. We then create $M$ dictionaries $C^{(1)}, C^{(2)}, \ldots, C^{(m)}$, each with $k$ codewords of dimension $d / M$. Each datapoint would then be encoded with $M$ dimensions, with every dimension taking one of $k$ states.

To calculate distances with product quantization, for every dictionary $C^{(m)}$ we calculate the partial dot product of the relevant subspace of the query with every codeword in the dictionary. The final dot product is obtain by sum up all $M$ partial dot product. We can then calculate the dot product with $m$ quantized datapoints in time $O(kd + m n)$.

Using our anisotropic loss function $\ell(x_i, \tilde{x}i) = h{i, \parallel} | r_\parallel(x_i, \tilde{x}i)|^2 + h{i, \perp}|r_\perp(x_i, \tilde{x}_i)|^2$ we obtain a new objective function for product quantization we call the anisotropic product quantization problem.

########## {caption="Definition"}

Given a dataset $x_1, x_2, \ldots, x_n$ of points in $\mathbb{R}^d$, a scaling parameter $\eta$, a number $M$ of dictionaries each with elements of size $d / M$ and $k$ codewords in each dictionary, the anisotropic product quantization problem is to find the $M$ dictionaries that minimizes

$ \begin{align*} \min_{\substack{C^{(m)} \subseteq \mathbb{R}^{d / M} \ \vert C^{(m)} \vert = k}}\sum_{x_i}\min_{\tilde{x}i \in \prod_m C^{(m)}} &h{i, \parallel} |r_\parallel(x_i, \tilde{x}i)|^2 \ &+ h{i, \perp}|r_\perp(x_i, \tilde{x}_i)|^2. \end{align*} $

We again consider an iterative algorithm for the problem. We first initialize all quantized datapoints with some element from every dictionary. We then consider the following iterative procedure:

(Initialization Step) Select a dictionary $C^{(m)}$ by sampling from ${x^{(m)}_1, \ldots x^{(m)}_n}$.
(Partition Assignment Step) For each datapoint $x_i$, update $\tilde{x}_i$ by using the value of $c \in C^{(m)}$ that minimizes the anisotropic loss of $\tilde{x}_i$.
(Codebook Update Step) Optimize the loss function over all codewords in all dictionaries while keeping every dictionaries partitions constant.
Repeat Step 2 and Step 3 until convergence to a fixed point or maximum number of iteration is reached.

We can perform the update step efficiently since once the partitions are fixed the update step minimizes a convex loss, similar to that of vector quantization. We include details in Section 7.5 of the Appendix. Additionally, since there are a finite number of partition assignment and at every step the loss function decreases or stays constant, our solution will eventually converge to a fixed point. We note that we can also optionally initialize the codebook by first training the codebook under regular $\ell_2$-reconstruction loss, which speed up training process.

5. Experiments

Section Summary: This section describes experiments testing a new quantization technique for maximum inner product search, a method used in fast database searches for similar items. Researchers compare their score-aware loss function to the traditional reconstruction loss and find it delivers better retrieval accuracy and more precise estimates of top similarities on datasets like Glove1.2M word embeddings. They also show their approach outperforms leading methods like QUIPS and LSQ in recall tests at fixed data compression levels, and in speed-recall benchmarks against 11 competitors, it achieves top performance as the fastest option for high-accuracy searches.

In this section, we show our proposed quantization objective leads to improved performance on maximum inner product search. First, we fix the quantization mechanism and compare traditional reconstruction loss with our proposed loss to show that score-aware loss leads to better retrieval performance and more accurate estimation of maximum inner product values. Next, we compare in fixed-bit-rate settings against QUIPS and LSQ, which are the current state-of-the-art for many MIPS tasks. Finally, we analyze the end-to-end MIPS retrieval performance of our algorithm in terms of its speed-recall trade-off in a standardized hardware environment. We used the benchmark setup from ann-benchmarks.com, which provides 11 competitive baselines with pre-tuned parameters. We plot each algorithm's speed-recall curve and show ours achieves the state-of-the-art.

$**Figure 3:** (a) The retrieval Recall1@10 for different values of the threshold $T$. We see that for $T = 0.2$ (corresponding to $\eta = 4.125$) our proposed score-aware quantization loss achieves significantly better Recall than traditional reconstruction loss. (b) The relative error of inner product estimation for true Top-1 on `Glove1.2M` dataset across multiple number of bits settings. We see that our proposed score-aware quantization loss reduces the relative error compared to reconstruction loss.$

5.1 Direct comparison with reconstruction loss

We compare our proposed score-aware quantization loss with the traditional reconstruction loss by fixing all parameters other than the loss function in the following experiments.

We use Glove1.2M which is a collection of 1.2 million 100-dimensional word embeddings trained as described in [45]. See Section 7.8 of the Appendix for our rationale for choosing this dataset. For all experiments we choose $w(t)=\mathbf{I}(t\ge T)$. The Glove dataset is meant to be used with a cosine distance similarity metric, while our algorithm is designed for the more general MIPS task. MIPS is equivalent to cosine similarity search when all datapoints are equal-norm, so we adopt our technique to cosine similarity search by unit-normalizing all datapoints at training time.

We first compare the two losses by their Recall1@10 when used for product quantization on Glove1.2M, as shown in Figure. Figure 3a. We learn a dictionary by optimizing product quantization with reconstruction loss. We then quantize datapoints two ways, first by minimizing reconstruction loss and then by minimizing score-aware loss. We see that score-aware quantization loss achieves significant recall gains as long as $T$ is chosen reasonably. For all subsequent experiments, we set $T = 0.2$, which by the limit in Equation 3 corresponds to a value of $\eta = 4.125$.

Next we look at the accuracy of the estimated top-1 inner product as measured by relative error: $|\frac{\langle q, x \rangle - \langle q, \tilde{x} \rangle}{\langle q, x \rangle}|$. This is important in application scenarios where an accurate estimate of $\langle q, x \rangle$ is needed, such as softmax approximation, where the inner product values are often logits later used to compute probabilities. One direct consequence of score-aware loss functions is that the objective weighs pairs by their importance and thus leads to lower estimation error on top-ranking pairs. We see in Figure. Figure 3b that our score-aware loss leads to smaller relative error over all bitrate settings.

Datasets other than Glove demonstrate similar performance gains from score-aware quantization loss. See Section 7.6 of the Appendix for results on the Amazon-670k extreme classification dataset.

5.2 Maximum inner product search retrieval

Next, we compare our MIPS retrieval performance against other quantization techniques at equal bitrate. We compare to LSQ [35] and all three variants of QUIPS [32]. In Figure 4 we measure the performance at fixed bitrates of 100 and 200 bits per datapoint. Our metric is Recall 1@N, which corresponds to the proportion of queries where the top $N$ retrieved results contain the true top-1 datapoint. Our algorithm using score-aware loss outperforms other algorithms at both bitrates and all ranges of $N$.

Other quantization methods may also benefit from using score-aware quantization loss. For example, binary quantization techniques such as [30] use reconstruction loss in their original paper, but can be easily adapted to the proposed loss by a one line change to the loss objective. We show results which illustrate the improvement of such a change in Section 7.7 of Appendix.

5.3 Recall-Speed benchmark

Fixed-bit-rate experiments mostly compare asymptotic behavior and often overlook preprocessing overhead such as learned rotation or lookup table computation, which can be substantial. To evaluate effectiveness of MIPS algorithms in a realistic setting, it is important to perform end-to-end benchmarks and compare speed-recall curves. We adopted the methodology of public benchmark ANN-Benchmarks [44], which plots a comprehensive set of 11 algorithms for comparison, including faiss [8] and hnswlib [17].

Our benchmarks are all conducted on an Intel Xeon W-2135 with a single CPU thread, and followed the benchmark's protocol. Our implementation builds on product quantization with the proposed quantization and SIMD based ADC [32] for distance computation. This is further combined with a vector quantization based tree [34]. Our implementation is open-source and available at https://github.com/google-research/google-research/tree/master/scann and furthermore the exact configurations used to produce our benchmark numbers are part of the ANN-Benchmarks GitHub repository. Figure 4 shows our performance on Glove1.2M significantly outperforms competing methods in the high-recall region.

6. Conclusion

Section Summary: This paper introduces a new way to measure errors in compressing data for inner product searches, focusing on similarity scores rather than just recreating the original data. The method weights the errors higher for pairs of data points that are more similar, and it's backed by math while working with common compression techniques like product and binary quantization. Tests on public datasets reveal it delivers better accuracy in finding matches and estimating similarities, and it even surpasses top existing methods in balancing speed and performance.

In this paper, we propose a new quantization loss function for inner product search, which replaces traditional reconstruction error. The new loss function is weighted based on the inner product values, giving more weight to the pairs of query and database points with higher inner product values. The proposed loss function is theoretically proven and can be applied to a wide range of quantization methods, for example product and binary quantization. Our experiments show superior performance on retrieval recall and inner product value estimation compared to methods that use reconstruction error. The speed-recall benchmark on public datasets further indicates that the proposed method outperforms state-of-the-art baselines which are known to be hard to beat.

7. Appendix

Section Summary: This appendix contains mathematical proofs for two theorems related to data quantization and loss functions in high-dimensional spaces. The first proof establishes a lemma that breaks down the expected squared inner product between a random direction and the quantization error into components parallel and perpendicular to the original data point, then uses this to express the overall loss as a weighted sum of those error components. The second proof shows that the ratio of certain integrals involving the weight function is at least 1, achieving equality only when the weights are constant, ensuring the parallel error is at least as significant as the perpendicular one under the model's assumptions.

7.1 Proof of

Theorem 2

We first prove the following lemma:

########## {caption="Lemma"}

Suppose we are given a datapoint $x$ and its quantization $\tilde{x}$. If $q$ is uniformly spherically distributed, then

$ \mathbb{E}q[\langle q, x-\tilde{x} \rangle^2 | \langle q, x \rangle=t] = \dfrac{t^2}{||x||^2}||r\parallel(x, \tilde{x})||^2 + \dfrac{1-\frac{t^2}{||x||^2}}{d-1}||r_\perp(x, \tilde{x})||^2 $

with $r_\parallel$ and $r_\perp$ defined as in Section 3.1.

Proof: First, we can decompose $q :=q_{\parallel}+q_{\perp}$ with $q_{\parallel} :=\langle q, x \rangle \cdot \frac{x}{||x||}$ and $q_{\perp} :=q - q_{\parallel}$ where $q_{\parallel}$ is parallel to $x$ and $q_{\perp}$ is orthogonal to $x$. Then, we have

$ \begin{align} \mathbb{E}{q} [\langle q, x - \tilde{x} \rangle ^2 | \langle q, x \rangle=t] &= \mathbb{E}q [\langle q{\parallel}+q{\perp}, r_{\parallel}(x, \tilde{x})+r_{\perp}(x, \tilde{x}) \rangle ^2 | \langle q, x \rangle=t] \nonumber \ &= \mathbb{E}q [(\langle q{\parallel}, r_{\parallel}(x, \tilde{x}) \rangle + \langle q_{\perp}, r_{\perp}(x, \tilde{x}) \rangle)^2 | \langle q, x \rangle=t] \nonumber \ &= \mathbb{E}q [\langle q{\parallel}, r_{\parallel}(x, \tilde{x}) \rangle^2 | \langle q, x \rangle=t] + \mathbb{E}q [\langle q{\perp}, r_{\perp}(x, \tilde{x}) \rangle^2 | \langle q, x \rangle=t], \end{align}\tag{5} $

The last step uses the fact that $\mathbb{E}q [\langle q{\parallel}, r_{\parallel}(x, \tilde{x}) \rangle \langle q_{\perp}, r_{\perp}(x, \tilde{x}) \rangle | \langle q, x \rangle=t] = 0$ due to symmetry. The first term of Equation 5, $\mathbb{E}q [\langle q{\parallel}, r_{\parallel}(x, \tilde{x}) \rangle^2 | \langle q, x \rangle=t] = |r_{\parallel}(x, \tilde{x})|^2 \mathbb{E}q [|q{\parallel}|^2 | \langle q, x \rangle=t] = \frac{|r_{\parallel}|^2 t^2}{|x|^2}$. For the second term, since $q_{\perp}$ is uniformly distributed in the $(d-1)$ dimensional subspace orthogonal to $x$ with the norm $\sqrt{1-\frac{t^2}{|x|^2}}$, we have $\mathbb{E}q [\langle q{\perp}, r_{\perp}(x, \tilde{x}) \rangle^2 | \langle q, x \rangle=t] = \frac{1-\frac{t^2}{|x|^2}}{d-1} ||r_{\perp}(x, \tilde{x})||^2$. Therefore

$ \mathbb{E}{q} [\langle q, r(x, \tilde{x}) \rangle ^2 | \langle q, x \rangle=t] = \frac{t^2}{|x|^2} ||r{\parallel}(x, \tilde{x})||^2+\frac{1-\frac{t^2}{|x|^2}}{d-1} ||r_{\perp}(x, \tilde{x})||^2. $

Proof of Theorem 2: We can expand $\ell(x_i, \tilde{x}_i, w)$ as

$ \int_{-||x_i||}^{||x_i||} w(t)\mathbb{E}_q[\langle q, x_i-\tilde{x}_i \rangle^2 | \langle q, x_i \rangle=t] d\text{P}(\langle q, x_i \rangle\le t) $

Let $\theta :=\arccos\dfrac{t}{||x_i||}$ so $t=||x_i||\cos\theta$. Because we are assuming $q$ is uniformly spherically distributed, $\frac{d\text{P}(\langle q, x \rangle \le t)}{d t}$ is proportional to the surface area of $(d-1)$-dimensional hypersphere with a radius of $\sin \theta$. Thus we have $\frac{d\text{P}(\langle q, x \rangle=t)}{dt} \propto S_{d-1} \sin^{d-2}\theta$, where $S_{d-1}$ is the surface area of $(d-1)$-sphere with unit radius. Our integral can therefore be written as:

$ \int_0^\pi w(||x_i||\cos\theta)\mathbb{E}_q[\langle q, x_i-\tilde{x}_i \rangle^2 | \langle q, x_i \rangle=||x_i||\cos\theta] \sin^{d-2}\theta d\theta. $

Using our above lemma this simplifies to

$ \int_0^\pi w(||x_i||\cos\theta)\left(\cos^2\theta ||r_{\parallel}(x, \tilde{x})||^2+\frac{\sin^2\theta}{d-1} ||r_{\perp}(x, \tilde{x})||^2\right) \sin^{d-2}\theta d\theta. $

From here we can clearly see that

$ \begin{aligned} \ell(x_i, \tilde{x}i, w) &= h\parallel(w, ||x_i||)||r_\parallel(x_i, \tilde{x}i)||^2+h\perp(w, ||x_i||)||r_\perp(x_i, \tilde{x}i)||^2, \ h\parallel &:=\int_0^\pi w(||x_i||\cos\theta)(\sin^{d-2}\theta-\sin^d\theta)d\theta, \ h_\perp &:=\dfrac{1}{d-1}\int_0^\pi w(||x_i||\cos\theta)\sin^d\theta d\theta \end{aligned} $

as desired.

7.2 Proof of

Theorem 3

Proof of Theorem 3: Note that $h_\parallel$ and $h_\perp$ equal zero if and only if $w(t)=0$ for $t\in[-||x_i||, ||x_i||]$. Otherwise both quantities are strictly positive so it is equivalent to prove that $\dfrac{h_\parallel(w, ||x_i||)}{h_\perp(w, ||x_i||)}\ge1$ with equality if and only if $w$ is constant.

$ \begin{aligned} \dfrac{h_\parallel(w, ||x_i||)}{h_\perp(w, ||x_i||)} &= \dfrac{\displaystyle\int_0^\pi w(||x_i||\cos\theta)(\sin^{d-2}\theta-\sin^d\theta)d\theta}{\dfrac{1}{d-1}\displaystyle\int_0^\pi w(||x_i||\cos\theta)\sin^d\theta d\theta} \ &= (d-1)\left(\dfrac{\int_0^\pi w(||x_i||\cos\theta)\sin^{d-2}\theta d\theta}{\int_0^\pi w(||x_i||\cos\theta)\sin^d\theta d\theta}-1\right) \end{aligned} $

Define $I_d:=\int_0^\pi w(||x_i||\cos\theta)\sin^d\theta d\theta$. Our objective is to prove $(d-1)\left(\dfrac{I_{d-2}}{I_d}-1\right)\ge1$ or equivalently $\dfrac{I_{d-2}}{I_d}\ge\dfrac{d}{d-1}$. To do this we use integration by parts on $I_d$:

$ \begin{aligned} I_d=&-w(||x_i||\cos\theta)\cos\theta\sin^{d-1}\theta\Big|0^\pi+\ &\int_0^\pi \cos\theta\left[w(||x_i||\cos\theta)(d-1)\sin^{d-2}\theta\cos\theta-w'(||x_i||\cos\theta)||x_i||\sin^d\theta\right] d\theta\ =&(d-1)\int_0^\pi w(||x_i||\cos\theta)\cos^2\theta\sin^{d-2}\theta-||x_i||\int_0^\pi w'(||x_i||\cos\theta)\cos\theta\sin^d\theta d\theta\ =&(d-1)I{d-2}-(d-1)I_d-||x_i||\int_0^\pi w'(||x_i||\cos\theta)\cos\theta\sin^d\theta d\theta \end{aligned}\tag{6} $

We now show that $\int_0^\pi w'(||x_i||\cos\theta)\cos\theta\sin^d\theta d\theta\ge0$ with equality if and only if $w$ is constant. As a prerequisite for this theorem $w(t)=0$ for $t<0$ so our integral simplifies to $\int_0^{\pi/2}w'(||x_i||\cos\theta)\cos\theta\sin^d\theta d\theta\ge0$. From 0 to $\pi/2$ both sine and cosine are non-negative. Since $w$ is non-decreasing in this range, $w'\ge0$ and therefore our integral is non-negative. The integral equals zero if and only if $w'=0$ over the entire range of $t$ which implies $w$ is constant.

Applying our inequality to equation 6 we get $\dfrac{I_{d-2}}{I_d}\ge\dfrac{d}{d-1}$ as desired.

7.3 Proof of Results for $w(t)=\mathbf{I}(t\ge T)$

Proof of Equation 3: Let $\alpha:=\arccos(T/||x_i||)$. If we do the same analysis as Section 7.2 but specialized for $w(t)=\mathbf{I}(t\ge T)$ we find that $I_d=\int_0^\alpha \sin^d\theta d\theta$ and

$ dI_d=(d-1)I_{d-2}-\cos\alpha\sin^{d-1}\alpha.\tag{7} $

From the Cauchy–Schwarz inequality for integrals, we have

$ \Big(\int_0^\alpha{\sin^{\frac{d+2}{2}}\theta \sin^{\frac{d-2}{2}}\theta d\theta}\Big)^2 \le \int_0^\alpha{\sin^{d+2}}\theta d\theta \int_0^\alpha{\sin^{d-2}\theta d\theta} $

Rearranging this we have $\frac{I_d}{I_{d+2}} \le \frac{I_{d-2}}{I_d}$, which proves that $\frac{I_{d-2}}{I_d}$ is monotonically non-increasing as $d$ increases. From Section 7.2 we already have a lower bound $\frac{I_{d-2}}{I_d}>1$. Since the ratio is monotonically non-increasing, $\lim_{d\to\infty}\frac{I_d}{I_{d+2}}$ exists.

Dividing both sides of equation 7 by $dI_d$, we have

$ 1=\frac{-\cos \alpha \sin^{d-1} \alpha}{d I_d} + \frac{(d-1) I_{d-2}}{d I_d} $

Using our above analysis we know that $\displaystyle\lim_{d\to\infty}\frac{(d-1)I_{d-2}}{dI_d}$ exists so therefore $\displaystyle \lim_{d\to \infty} \frac{\cos \alpha \sin^{d-1} \alpha}{d I_d} > 0$ also exists. Furthermore,

$ \lim_{d\to \infty} \frac{\frac{\cos \alpha \sin^{d-1} \alpha}{d I_d}}{\frac{\cos \alpha \sin^{d-3} \alpha}{(d-2) I_{d-2}}}=1 \Rightarrow \lim_{d\to \infty} \frac{(d-2) I_{d-2}}{dI_d} = \frac{1}{\sin^2 \alpha} $

Finally we have $\displaystyle \lim_{d \to \infty} \frac{\eta(\mathbf{I}(t\ge T), ||x_i||)}{d-1}=\frac{1}{\sin^2 \alpha} - 1=\frac{(T/||x_i||)^2}{1-(T/||x_i||)^2}$, and this proves equation 3.

7.4 Proof of

Theorem 5

Proof of Theorem 5: Consider a single point $x_i$ with $r_\parallel := r_\parallel(x_i, \tilde{x}i) = \frac{1}{|x|^2}x_i x_i^T (x_i - \tilde{x}i)$ and $r\perp := r\perp(x_i, \tilde{x}_i) = x_i - \tilde{x}i - r\parallel$. We have that

$ \begin{align} |r_\perp|^2 &= (x_i - \tilde{x}i - r\parallel)^T(x_i - \tilde{x}i - r\parallel) \notag \ &= |x_i|^2 + |\tilde{x}_i|^2 - 2 x_i^T \tilde{x}i - 2 r\parallel^T(x_i - \tilde{x}i) + |r\parallel|^2 \notag\ &= |x_i|^2 + |\tilde{x}_i|^2 - 2 x_i^T \tilde{x}i - |r\parallel|^2, \end{align}\tag{8} $

where we use the fact that $x_i - \tilde{x}i = r\parallel + r_\perp$ and $r_\parallel$ is orthogonal to $r_\perp$.

We also have

$ \begin{align} |r_\parallel|^2 &= \frac{1}{|x_i|^4} \left (x_i(x - \tilde{x}_i)^T x_i \right)^T \left (x_i(x - \tilde{x}_i)^T x_i \right) \notag \ &= \frac{1}{|x_i|^4} x_i^T (x_i - \tilde{x}_i)x_i^T x_i (x_i - \tilde{x}_i)^T x_i \notag\ &= \frac{1}{|x_i|^2} x_i^T (x_i - \tilde{x}_i)(x_i - \tilde{x}_i)^T x_i \notag \ &= |x_i|^2 + \frac{\tilde{x}_i^T x_i x_i^T \tilde{x}_i}{|x_i|^2} - 2 x_i^T \tilde{x}_i. \end{align}\tag{9} $

Combining Equations 8 and 9, we have that

$ h_{i, \parallel} |r_\parallel|^2 + h_{i, \perp} |r_\perp|^2 = \tilde{x}i^T\left((h{i, \parallel} - h_{i, \perp})\frac{x_i x_i^T}{|x_i|^2} + h_{i, \perp} I\right)\tilde{x}i - 2 h{i, \parallel}x_i^T \tilde{x}i + h{i, \parallel}|x_i|^2. $

Ignoring the constant term $h_{i, \parallel}|x_i|^2$ and summing over all datapoints $x_i$ that have $\tilde{x}$ as a center, we have that the total loss is equivalent to

$ \tilde{x}^T \left (\sum_i (h_{i, \parallel} - h_{i, \perp})\frac{x_i x_i^T}{|x_i|^2} + h_{i, \perp} I\right)\tilde{x} - 2 \left(\sum_i h_{i, \parallel}x_i \right)^T \tilde{x}.\tag{10} $

Since we established in Theorem 3 that $h_{i, \parallel} \geq h_{i, \perp}$, we have that the loss function is a convex quadratic function and thus we can calculate the optimal value of $\tilde{x}$ as

$ \tilde{x} = \left (\sum_i (h_{i, \parallel} - h_{i, \perp})\frac{x_i x_i^T}{|x_i|^2} + h_{i, \perp} I\right)^{-1}\left(\sum_i h_{i, \parallel}x_i \right). $

7.5 Codebook Optimization in Product Quantization

Let $c$ be a vector with all dictionary codewords. We can get a quantized point $\tilde{x}_i$ by calculating $B c$, where $B$ is a ${0, 1}$-matrix with dimensions $d \times dk$ that selects the relevant codewords.

For example, suppose ${(-1, -1), (1, 1)}$ are our codewords for the first two dimensions and ${(-2, -2), (2, 2)}$ are our codewords for the next two dimensions. We have our vectorized dictionary $c = (-1, -1, 1, 1, -2, -2, 2, 2)$. If we want to represent $(-1, -1, 2, 2)$, we set $B$ to be

$ \begin{pmatrix} 1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \ 0 & 1 & 0 & 0 & 0 & 0 & 0 & 0 \ 0 & 0 & 0 & 0 & 0 & 0 & 1 & 0 \ 0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 \end{pmatrix}. $

Similarly, if we want to represent $(1, 1, -2, -2)$ we set $B$ to be

$ \begin{pmatrix} 0 & 0 & 1 & 0 & 0 & 0 & 0 & 0 \ 0 & 0 & 0 & 1 & 0 & 0 & 0 & 0 \ 0 & 0 & 0 & 0 & 1 & 0 & 0 & 0 \ 0 & 0 & 0 & 0 & 0 & 1 & 0 & 0 \end{pmatrix}. $

We can now write $\tilde{x}_i$ as $\tilde{x}_i = B_i c$ for some matrix $B_i$.

To minimize our loss function over $c$, we start by summing over Equation 10 and ignoring all constant terms to get

$ c^T \left (\sum_i B_i^T\left ((h_{i, \parallel} - h_{i, \perp})\frac{x_i x_i^T}{|x_i|^2} + h_{i, \perp} I\right) B_i\right)c - 2 \left (\sum_i h_{i, \parallel}B_ix_i \right)c. $

This is again a convex quadratic minimization problem over $c$ and can be solved efficiently. Specifically the matrix

$ \sum_i B_i^T\left ((h_{i, \parallel} - h_{i, \perp})\frac{x_i x_i^T}{|x_i|^2} + h_{i, \perp} I\right)B_i $

will be full rank if we observe every codeword at least once. We can then find the optimal value of $c$ with

$ c = \left (\sum_i B_i^T\left ((h_{i, \parallel} - h_{i, \perp})\frac{x_i x_i^T}{|x_i|^2} + h_{i, \perp} I\right)B_i\right)^{-1}\left (\sum_i h_{i, \parallel}B_ix_i \right). $

7.6 Results on the `Amazon-670k` Extreme Classification Dataset

Extreme classification with a large number of classes requires evaluating the last layer (classification layer) with all possible classes. When there are $\mathcal{O}(M)$ classes, this becomes a major computation bottleneck as it involves a huge matrix multiplication followed by Top-K. Thus this is often solved using Maximum Inner Product Search to accelerate inference. We evaluate our methods on extreme classification using the Amazon-670k dataset [46]. An MLP classifier is trained over 670, 091 classes, where the last layer has a dimensionality of 1, 024. The retrieval performance of product quantization with traditional reconstruction loss and with score-aware quantization loss are compared in Table 1.

: Table 1: Amazon-670k extreme classification performance. The benefits of anisotropic vector quantization on Recall 1@ $N$ are especially evident at lower bitrates and lower $N$.

Bitrate	1@1	1@10	1@100	Bitrate	1@1	1@10	1@100
256 bits, PQ	0.652	0.995	0.999	512 bits, PQ	0.737	0.998	1.000
256 bits, Ours	0.656	0.996	1.000	512 bits, Ours	0.744	0.997	1.000
1024 bits, PQ	0.778	1.000	1.000	2048 bits, PQ	0.782	1.000	1.000
1024 bits, Ours	0.812	1.000	1.000	2048 bits, Ours	0.875	1.000	1.000

7.7 Results on Binary Quantization

Another popular family of quantization function is binary quantization. In such a setting, a function $h(x): {\mathbb{R}}^d \rightarrow {0, 1}^{h}$ is learned to quantize datapoints into binary codes, which saves storage space and can speed up distance computation. There are many possible ways to design such a binary quantization function, and some [47, 30] uses reconstruction loss.

We can apply our score-aware quantization loss to these approaches. We follow the setting of Stochastic Generative Hashing (SGH) [30], which explicitly minimizes reconstruction loss and has been shown to outperform earlier baselines. In their paper, a binary auto-encoder is learned to quantize and dequantize binary codes:

$ \tilde{x} = g(h(x)); \text{where~} h(x) \in {0, 1}^{h} $

where $h(\cdot)$ is the "encoder" part which binarizes original datapoint into binary space and $g(\cdot)$ is the "decoder" part which reconstructs the datapoints given the binary codes. The authors of the paper uses $h(x) = sign(W_h^T x+ b_h)$ as the encoder function and $g(h) = W_g^T h$ as the decoder functions. The learning objective is to minimize the reconstruction error of $||x-\tilde{x}||^2$, and the weights in the encoder and decoder are optimized end-to-end using standard stochastic gradient descent. We can instead use our score-aware quantization loss. We show below the results of SGH and SGH with our score-aware quantization loss in Table 2 on the SIFT1M dataset ([31]). We see that adding our score-aware quantization loss greatly improves performance.

: Table 2: We compare Stochastic Generative Hashing ([30]) trained with reconstruction loss (SGH) and Stochastic Generative Hashing trained with our score-aware quantization loss (SGH-score-aware) on the SIFT1M dataset. We see that using our score-aware loss greatly improves the recall of Stochastic Generative Hashing.

Recall $k@k$	1@1	1@10	10@10	10@100
64 bits, SGH	0.028	0.096	0.053	0.220
64 bits, SGH-score-aware	0.071	0.185	0.093	0.327
128 bits, SGH	0.073	0.195	0.105	0.376
128 bits, SGH-score-aware	0.196	0.406	0.209	0.574
256 bits, SGH	0.142	0.331	0.172	0.539
256 bits, SGH-score-aware	0.362	0.662	0.363	0.820

7.8 Dataset Selection for MIPS evaluation

In this section we consider dataset choices for benchmarking MIPS systems. In modern large-scale settings, the vectors in the database are often created with neural network embeddings learned by minimizing some training task. This typically leads to the following nice properties:

Low correlation across dimensions.
Equal variance in each dimension.

Since our target application is retrieval in such settings, we want our benchmarking dataset to have these properties. This will allow our metrics to better inform how our approach will work in practice.

Datasets that have been widely used for evaluating MIPS systems include SIFT1M/1B, GIST1M, Glove1.2M, Movielens, and Netflix. We see in Figure 5 that only Glove1.2M has the properties we want in a benchmarking dataset.

SIFT1M, SIFT1B, and GIST1M are introduced by [31] to illustrate the use of product quantization. SIFT is a keypoint descriptor while GIST is image-level descriptor which have been hand-crafted for image retrieval. These vectors have a high correlation between dimensions and have a high degree of redundancy. Thus the intrinsic dimensions of SIFT1M and GIST are much lower than its dimensionality.

Movielens and Netflix dataset are formed from the SVD of the rating matrix of Movielens and Netflix websites, respectively. This is introduced by [20] for MIPS retrieval evaluation. Following SVD of $X = (U \Lambda^{1/2T})(\Lambda^{1/2}V)$, the dimension of these two datasets correspond to the eigenvalues of $X$. Thus the variance of dimensions are sorted by eigenvalues, and the first few dimensions are much more important than later ones. Additionally, the datasets are 10k - 20k in size and thus should not be considered large-scale.

Glove1.2M is a word embeddings dataset similar to word2vec, which use neural-network style training with a bottleneck layer. This datasets exhibits less data distribution problems. It is our general observation that bottleneck layers lead to independent dimensions with similar entropy, making them good datasets for benchmarking for our target retrieval tasks.

References

Section Summary: This section lists numerous academic references, primarily papers from conferences and journals, focusing on advanced topics in machine learning such as recommender systems, efficient similarity searches, and data compression techniques like hashing and quantization. The citations span from 1998 to 2019, drawing from prestigious venues like ACM, IEEE, ICML, and NeurIPS, along with some arXiv preprints and open-source libraries. These works explore practical methods for handling large-scale data in areas like computer vision, natural language processing, and artificial intelligence.

[1] Cremonesi et al. (2010). Performance of Recommender Algorithms on Top-n Recommendation Tasks. In Proceedings of the Fourth ACM Conference on Recommender Systems. pp. 39–46.

[2] Thomas Dean et al. (2013). Fast, Accurate Detection of 100,000 Object Classes on a Single Machine: Technical Supplement. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition.

[3] Yen et al. (2018). Loss Decomposition for Fast Learning in Large Output Spaces. In Proceedings of the 35th International Conference on Machine Learning. pp. 5640–5649.

[4] Stephen Mussmann and Stefano Ermon (2016). Learning and Inference via Maximum Inner Product Search. In Proceedings of The 33rd International Conference on Machine Learning. pp. 2587–2596.

[5] Alexander Pritzel et al. (2017). Neural Episodic Control. In Proceedings of the 34th International Conference on Machine Learning. pp. 2827–2836.

[6] T. Ge et al. (2014). Optimized Product Quantization. IEEE Transactions on Pattern Analysis and Machine Intelligence. 36(4). pp. 744-755. doi:10.1109/TPAMI.2013.240.

[7] Babenko, Artem and Lempitsky, Victor (2014). Additive quantization for extreme vector compression. In Computer Vision and Pattern Recognition (CVPR), 2014 IEEE Conference on. pp. 931–938.

[8] Johnson et al. (2017). Billion-scale similarity search with GPUs. arXiv preprint arXiv:1702.08734.

[9] Weston et al. (2010). Large scale image annotation: learning to rank with joint word-image embeddings. Machine learning. 81(1). pp. 21–35.

[10] Guo et al. (2016). A deep relevance matching model for ad-hoc retrieval. In Proceedings of the 25th ACM International on Conference on Information and Knowledge Management. pp. 55–64.

[11] Gillick et al. (2019). Learning Dense Representations for Entity Retrieval. In Proceedings of the 23rd Conference on Computational Natural Language Learning (CoNLL). pp. 528–537.

[12] Wu et al. (2017). StarSpace: Embed All The Things!. arXiv preprint arXiv:1709.03856.

[13] Reddi et al. (2019). Stochastic Negative Mining for Learning with Large Output Spaces. In The 22nd International Conference on Artificial Intelligence and Statistics. pp. 1940–1949.

[14] Wang et al. (2014). Hashing for similarity search: A survey. arXiv preprint arXiv:1408.2927.

[15] Wang et al. (2016). Learning to hash for indexing big data survey. Proceedings of the IEEE. 104(1). pp. 34–57.

[16] Qi Chen et al. (2018). SPTAG: A library for fast approximate nearest neighbor search. https://github.com/Microsoft/SPTAG.

[17] Yury A. Malkov and D. A. Yashunin (2016). Efficient and robust approximate nearest neighbor search using Hierarchical Navigable Small World graphs. CoRR. abs/1603.09320. http://arxiv.org/abs/1603.09320. arXiv:1603.09320.

[18] Muja, Marius and Lowe, David G (2014). Scalable nearest neighbor algorithms for high dimensional data. IEEE Transactions on Pattern Analysis and Machine Intelligence. 36(11). pp. 2227–2240.

[19] Dasgupta, Sanjoy and Freund, Yoav (2008). Random projection trees and low dimensional manifolds. In Proceedings of the fortieth annual ACM symposium on Theory of computing. pp. 537–546.

[20] Shrivastava, Anshumali and Li, Ping (2014). Asymmetric LSH (ALSH) for sublinear time maximum inner product search (MIPS). In Advances in Neural Information Processing Systems. pp. 2321–2329.

[21] Neyshabur, Behnam and Srebro, Nathan (2015). On symmetric and asymmetric lshs for inner product search. In International Conference on Machine Learning.

[22] Indyk, Piotr and Motwani, Rajeev (1998). Approximate nearest neighbors: towards removing the curse of dimensionality. In Proceedings of the thirtieth annual ACM symposium on Theory of computing. pp. 604–613.

[23] Andoni et al. (2015). Practical and optimal LSH for angular distance. In Advances in Neural Information Processing Systems. pp. 1225–1233.

[24] Ben Harwood and Tom Drummond (2016). FANNG: Fast Approximate Nearest Neighbour Graphs. In Computer Vision and Pattern Recognition (CVPR), 2016 IEEE Conference on. pp. 5713-5722.

[25] Charikar, Moses S (2002). Similarity estimation techniques from rounding algorithms. In Proceedings of the thiry-fourth annual ACM symposium on Theory of computing. pp. 380–388.

[26] Vempala, Santosh S (2005). The random projection method. American Mathematical Soc..

[27] Li, Xiaoyun and Li, Ping (2019). Random projections with asymmetric quantization. In Advances in Neural Information Processing Systems. pp. 10857–10866.

[28] He et al. (2013). K-means hashing: An affinity-preserving quantization method for learning binary compact codes. In Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 2938–2945.

[29] Liong et al. (2015). Deep Hashing for Compact Binary Codes Learning. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[30] Dai et al. (2017). Stochastic generative hashing. In Proceedings of the 34th International Conference on Machine Learning-Volume 70. pp. 913–922.

[31] Jegou et al. (2011). Product quantization for nearest neighbor search. IEEE transactions on pattern analysis and machine intelligence. 33(1). pp. 117–128.

[32] Ruiqi Guo et al. (2016). Quantization based Fast Inner Product Search. In Proceedings of the 19th International Conference on Artificial Intelligence and Statistics, AISTATS 2016, Cadiz, Spain, May 9-11, 2016. pp. 482–490. http://jmlr.org/proceedings/papers/v51/guo16a.html.

[33] Zhang et al. (2014). Composite Quantization for Approximate Nearest Neighbor Search.. In ICML. pp. 3.

[34] Wu et al. (2017). Multiscale Quantization for Fast Similarity Search.

[35] Martinez et al. (2018). LSQ++: Lower running time and higher recall in multi-codebook quantization. In Proceedings of the European Conference on Computer Vision (ECCV). pp. 491–506.

[36] Zhu et al. (2016). Trained ternary quantization. arXiv preprint arXiv:1612.01064.

[37] Gong et al. (2013). Iterative quantization: A procrustean approach to learning binary codes for large-scale image retrieval. IEEE Transactions on Pattern Analysis and Machine Intelligence. 35(12). pp. 2916–2929.

[38] Alexandre Sablayrolles et al. (2019). Spreading vectors for similarity search. In International Conference on Learning Representations. https://openreview.net/forum?id=SkGuG2R5tm.

[39] E. Marcheret et al. (2009). Optimal quantization and bit allocation for compressing large discriminative feature space transforms. In 2009 IEEE Workshop on Automatic Speech Recognition Understanding. pp. 64-69.

[40] Morozov, Stanislav and Babenko, Artem (2019). Unsupervised Neural Quantization for Compressed-Domain Similarity Search. In The IEEE International Conference on Computer Vision (ICCV).

[41] Babenko et al. (2016). Pairwise Quantization. arXiv preprint arXiv:1606.01550.

[42] May et al. (2019). On the downstream performance of compressed word embeddings. In Advances in neural information processing systems. pp. 11782–11793.

[43] Lloyd, Stuart (1982). Least squares quantization in PCM. IEEE transactions on information theory. 28(2). pp. 129–137.

[44] Aumüller et al. (2019). ANN-benchmarks: A benchmarking tool for approximate nearest neighbor algorithms. Information Systems.

[45] Jeffrey Pennington et al. (2014). GloVe: Global Vectors for Word Representation. In Empirical Methods in Natural Language Processing (EMNLP). pp. 1532–1543.

[46] Bhatia et al. (2015). Sparse local embeddings for extreme multi-label classification. In Advances in neural information processing systems. pp. 730–738.

[47] Carreira-Perpinán, Miguel A and Raziperchikolaei, Ramin (2015). Hashing with binary autoencoders. In Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 557–566.