Approximate Nearest Neighbors#
In the realm of big data, exact methods like k-NN (k-nearest neighbors) falter because every query must be compared against every item in the dataset. The solution?
Approximate Nearest Neighbors (ANN) algorithms, which settle for close-enough results most of the time, trading a little precision for a large gain in performance.
ANNOY#
Understanding ANNOY#
ANNOY (Approximate Nearest Neighbors Oh Yeah) is designed to efficiently search through large datasets for items that are similar, though not necessarily the single closest match. It’s especially useful when processing time and storage are key considerations.
Key Features#
Distance Metrics Supported: Euclidean, Manhattan, Cosine, Hamming, and Dot Product.
Hamming Distance: Efficiently handles binary data by operating on 64-bit integers.
Dot Product Distance: Transforms vectors into a cosine space for better query performance, based on work from Microsoft Research.
Optimized for Lower Dimensions: Best under 100 dimensions but effective up to 1,000.
Efficiency: Minimal memory usage, supports memory sharing across processes.
Flexibility: Separate index creation and lookup, with fixed indexing post-creation.
Scalability: Can build indexes on disk for datasets too large for memory.
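The on-disk build mentioned under Scalability looks roughly like this in practice. This is a minimal sketch assuming the annoy package is installed; the filename and the random vectors are placeholders for illustration only:
# sketch: build an Annoy index directly on disk instead of in RAM
from annoy import AnnoyIndex
import numpy as np

d = 100
data = np.random.randn(1000, d)          # placeholder vectors for the sketch

index = AnnoyIndex(d, 'angular')
index.on_disk_build('big_index.ann')     # must be called before adding items
for i, v in enumerate(data):
    index.add_item(i, v.tolist())
index.build(10)                          # the index is written to 'big_index.ann' as it is built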
Configuration Parameters#
n_trees: Influences build time and index size. Higher values improve accuracy but produce a larger index.
search_k: Adjusts search performance at query time. Higher values offer more accurate results but are slower. Defaults to n * n_trees (where n is the number of neighbors requested) if unspecified.
By adjusting parameters like n_trees and search_k, and choosing the appropriate distance metric, ANNOY can be finely tuned to balance between accuracy and efficiency, making it a powerful tool for handling large-scale datasets.
Trade-offs: You can opt for slower searches to reduce load times, memory usage, and disk IO. The index can be loaded in memory upfront or read from disk on-demand, depending on your system’s resources and needs.
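To see where these two knobs plug in, here is a compact sketch; the specific values below are arbitrary illustrations, not recommendations:
from annoy import AnnoyIndex
import numpy as np

d = 100
t = AnnoyIndex(d, 'angular')
for i, v in enumerate(np.random.randn(1000, d)):
    t.add_item(i, v.tolist())

t.build(50)                                    # n_trees = 50: bigger index, better accuracy
ids = t.get_nns_by_item(0, 10, search_k=5000)  # larger search_k: slower but more accurate query
ids = t.get_nns_by_item(0, 10)                 # search_k omitted: defaults to n * n_trees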
How ANNOY Works#
ANNOY uses random projections and tree structures to navigate the search space.
It selects a random hyperplane at each node, dividing the space into two.
This process repeats, creating a “forest” of trees, with the number of trees (n_trees) tailored to your precision and performance requirements.
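The splitting step can be sketched with plain NumPy. This is only a toy illustration of recursive random-hyperplane splits, not Annoy’s actual implementation:
import numpy as np

rng = np.random.default_rng(0)
points = rng.standard_normal((1000, 100))          # toy dataset

def split(indices, leaf_size=50):
    # Recursively partition a set of point indices with random hyperplanes.
    if len(indices) <= leaf_size:
        return indices                             # leaf: a small bucket of candidate neighbors
    # pick two random points; the splitting hyperplane lies halfway between them
    a, b = points[rng.choice(indices, 2, replace=False)]
    normal = a - b
    offset = normal @ ((a + b) / 2)
    side = points[indices] @ normal > offset
    if side.all() or (~side).all():
        return indices                             # degenerate split; stop here
    return (split(indices[side], leaf_size), split(indices[~side], leaf_size))

tree = split(np.arange(len(points)))               # one tree; Annoy builds a whole forest of these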
# generate a random dataset
import pandas as pd
import numpy as np
d = 100 # number of dimensions
np.random.seed(1234)
# 10000 rows and 100 columns
data = np.random.randn(10000, d)
df = pd.DataFrame(data)
df.head()
|  | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 90 | 91 | 92 | 93 | 94 | 95 | 96 | 97 | 98 | 99 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0.471435 | -1.190976 | 1.432707 | -0.312652 | -0.720589 | 0.887163 | 0.859588 | -0.636524 | 0.015696 | -2.242685 | ... | 0.079842 | -0.399965 | -1.027851 | -0.584718 | 0.816594 | -0.081947 | -0.344766 | 0.528288 | -1.068989 | -0.511881 |
1 | 0.291205 | 0.566534 | 0.503592 | 0.285296 | 0.484288 | 1.363482 | -0.781105 | -0.468018 | 1.224574 | -1.281108 | ... | 0.209395 | -0.592886 | -1.473116 | -0.896581 | 1.104352 | -0.431550 | -0.161137 | 0.889157 | 0.288377 | -1.051539 |
2 | -0.319561 | -0.619993 | 0.156998 | -0.571455 | 1.057633 | -0.791489 | -0.524627 | 0.071878 | 1.910759 | 0.787965 | ... | 0.386254 | 0.822775 | -0.683790 | 1.057203 | 0.031880 | 1.343182 | -0.050540 | -0.364010 | -1.553342 | -0.319298 |
3 | 0.527046 | 0.711112 | -0.217545 | 2.637791 | -1.742138 | -0.094435 | 1.431184 | 0.592758 | 0.170297 | -1.751706 | ... | 0.393892 | -0.950026 | 0.332507 | 0.528944 | -1.120521 | 0.048264 | 0.061988 | -1.027516 | -0.238335 | 1.932178 |
4 | -0.226632 | -0.923831 | 0.355839 | -1.270063 | -0.195472 | -0.463419 | 0.989415 | 1.388647 | 1.087714 | 0.438801 | ... | 0.725714 | 0.916976 | -0.563890 | -1.522180 | -0.014279 | -0.246721 | -0.165329 | 0.119114 | -2.074980 | -1.002755 |
5 rows × 100 columns
from annoy import AnnoyIndex
import random
print(d) # Length of item vector that will be indexed
# create an Annoy index
t = AnnoyIndex(d, 'angular')
An Annoy index t is created for 100-dimensional vectors (d = 100), using the ‘angular’ distance metric. The angular distance is useful for measuring similarity based on the angle between vectors (often used for text embeddings and other normalized data).
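For reference, the ‘angular’ distance Annoy reports is the Euclidean distance between the unit-normalized vectors, i.e. sqrt(2 * (1 - cos(u, v))). A quick NumPy check (my own illustration, not Annoy code):
import numpy as np

a = np.random.randn(100)
b = np.random.randn(100)

cos_sim = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
angular = np.sqrt(2 * (1 - cos_sim))
normed = np.linalg.norm(a / np.linalg.norm(a) - b / np.linalg.norm(b))
print(angular, normed)  # the two values agree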
# Add items to the index
for i, row in df.iterrows():
    t.add_item(i, row.tolist())
t.build(10) # 10 trees
True
The build method constructs the index using 10 trees. More trees can give more accurate results, but also take more memory and make querying slower.
t.save("my_index.ann")
True
The index is saved to disk with the filename ‘my_index.ann’. This allows the index to be reloaded later without needing to rebuild it.
# Loading the Index
u = AnnoyIndex(d, 'angular')
u.load('my_index.ann')
True
A new Annoy index u is created and loaded from the saved file ‘my_index.ann’. Loading is typically very fast because it uses memory mapping (mmap), which maps the file directly into memory.
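If you would rather pay the read cost up front instead of paging lazily, load also accepts a prefault flag (in recent Annoy versions) that pre-reads the whole file into memory:
u = AnnoyIndex(d, 'angular')
u.load('my_index.ann', prefault=True)  # pre-read the file instead of relying on lazy mmap paging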
print(u.get_nns_by_item(0, 10)) # will find the 10 nearest neighbors
[0, 9505, 1842, 9822, 8443, 4612, 9341, 7156, 8157, 8829]
This retrieves the 10 nearest neighbors to the item with ID 0 in the index. This is useful for finding items similar to a given item in terms of their vector representation.
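If you also need the distances, the same call accepts include_distances=True and then returns a pair of lists:
ids, dists = u.get_nns_by_item(0, 10, include_distances=True)
print(ids)    # neighbor indices, nearest first (item 0 itself is included at distance 0.0)
print(dists)  # corresponding angular distances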
new_df = pd.DataFrame(np.random.randn(5, d))
print(new_df)
0 1 2 3 4 5 6 \
0 -1.135752 -0.392021 -0.009388 -1.701116 0.902751 -1.273385 0.557219
1 0.062232 1.074329 -1.487609 0.719791 2.051713 2.573426 0.637057
2 -2.539508 1.180998 -0.195334 0.016965 -0.486517 0.404461 1.063968
3 0.596636 -0.037158 0.770473 -0.131670 -0.060697 -1.282795 -1.418144
4 0.875012 -1.511678 -0.880919 1.462521 1.197618 0.585693 0.608432
7 8 9 ... 90 91 92 93 \
0 -0.592336 0.150354 0.812610 ... 0.987659 -0.688899 -0.533054 0.517042
1 -0.392337 1.431712 -2.386010 ... -1.522066 -0.124569 -1.359972 -1.633190
2 -0.719671 -0.733332 -0.945896 ... 0.205499 -0.015161 -0.811451 -0.751192
3 1.328655 -1.042352 0.325995 ... -0.565471 -1.220053 2.048062 0.413671
4 -0.560415 -0.608675 -0.897332 ... 0.545073 0.155986 1.964137 1.114200
94 95 96 97 98 99
0 0.334228 -2.308518 0.373696 -0.014302 -0.752266 0.328937
1 1.471845 1.503514 -0.978254 0.310408 0.783144 -0.053862
2 -0.230587 0.807796 -0.636896 -1.010718 0.482161 0.350891
3 1.589411 -1.054384 1.458149 -0.128367 0.056496 1.598106
4 0.030172 -0.405211 0.192811 0.759990 0.116835 0.543080
[5 rows x 100 columns]
# Number of nearest neighbors to find
num_neighbors = 5
# Iterate over each new vector in new_df and find its nearest neighbors
for index, row in new_df.iterrows():
    vector = row.tolist()
    # print(vector)
    neighbors = u.get_nns_by_vector(vector, num_neighbors, include_distances=False)
    print(neighbors)
    # print(f"Neighbors for new observation {index}: {neighbors}")
[1425, 8873, 9243, 631, 3727]
[5502, 8966, 5267, 4805, 8387]
[2884, 2980, 3972, 1967, 1322]
[8226, 6752, 4102, 7047, 2271]
[2205, 5225, 9304, 1505, 4193]
FAISS#
Introduction#
Developed by Facebook AI, Faiss represents a major advance in similarity search, particularly for multimedia documents. Traditional search engines and database systems struggle with the complexity and scale involved in finding similar high-dimensional vectors, which is crucial for processing and understanding multimedia content. Faiss offers a solution that is both faster and more memory-efficient than previous approaches, with Facebook reporting an 8.5x speed improvement over the prior state of the art.
Challenges:
Similarity search involves identifying multimedia documents that resemble each other within huge datasets, which traditional databases, designed for structured, symbolic information, fail to handle effectively.
AI-driven tools like word embeddings and CNN descriptors have made high-dimensional vectors a powerful means for representing multimedia content. However, efficiently querying these vectors for similarity remains a significant challenge, given the sheer volume of data and the computational complexity involved.
Key Features#
Faiss addresses these challenges head-on by providing an efficient and scalable library tailored for similarity search across billion-scale datasets. It offers several advantages:
Speed and Efficiency: Faiss is optimized to deliver unparalleled search speeds, making it possible to process queries against billions of vectors in a fraction of the time previously required.
Memory Optimization: The library is designed to be light on memory usage, facilitating faster access and processing of large datasets without compromising performance.
GPU Acceleration: With its state-of-the-art GPU implementation, Faiss leverages the power of modern hardware to further enhance search speeds and efficiency.
Software Innovation and Applications#
Beyond its core functionality, Faiss represents a significant advancement in software engineering for AI applications. It offers:
Flexibility: With various similarity search methods, Faiss caters to a wide range of use cases and datasets, providing users with the tools to tailor their search operations according to specific needs.
Scalability: Designed to handle databases of billions of vectors, Faiss breaks new ground in scalability, enabling applications that were previously unimaginable due to computational constraints.
Specifically,
CPU Optimizations:
Multi-threading is used to exploit multiple CPU cores and to run parallel searches across multiple GPUs.
BLAS Libraries are essential for exact distance computations, enabling efficient brute-force implementations via matrix multiplication.
SIMD Vectorization and Popcount techniques accelerate distance computations for individual vectors.
GPU Enhancements:
K-Selection Algorithm: A significant advancement in GPU implementations is the development of a highly efficient k-selection algorithm for finding the k-minimum or maximum elements, crucial for similarity searches. This algorithm operates close to peak GPU memory bandwidth efficiency and is designed to work in a single pass, keeping all intermediate states in registers, which allows for integration with other kernels for faster search operations.
Efficient Tiling and Kernel Implementation: The library focuses on effective tiling strategies and kernel functions for approximate search, optimizing performance.
Multi-GPU Support: Faiss allows for data sharding or replication across multiple GPUs, not limiting operations to the memory capacity of a single GPU.
Half-Precision Floating-Point Support: It includes float16 support for both computation and storage, which enhances speed with minimal accuracy loss on supported GPU architectures.
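These GPU capabilities are exposed through the same Python API once a GPU build of the library (e.g. the faiss-gpu package) is installed. A minimal sketch, assuming a CUDA-capable machine; the CPU walkthrough below does not require it:
import numpy as np
import faiss

d = 100
xb = np.random.randn(10000, d).astype('float32')

cpu_index = faiss.IndexFlatL2(d)
res = faiss.StandardGpuResources()                     # allocate GPU resources
gpu_index = faiss.index_cpu_to_gpu(res, 0, cpu_index)  # move the index to GPU 0
# to shard or replicate across every visible GPU instead:
# gpu_index = faiss.index_cpu_to_all_gpus(cpu_index)

gpu_index.add(xb)
D, I = gpu_index.search(xb[:5], 4)                     # same search API as on the CPU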
import faiss
# Convert DataFrame to numpy array
xb = df.to_numpy(dtype='float32') # FAISS needs float32 data
# Normalize the vectors (optional)
norm = np.linalg.norm(xb, axis=1, keepdims=True)
xb = xb / norm # unit-normalize each row (assumes no all-zero vectors)
# Create a FAISS index - using L2 distance for simplicity
index = faiss.IndexFlatL2(d) # d is the dimensionality of the vectors
# Add vectors to the index
index.add(xb)
# Perform a search
k = 4 # Number of nearest neighbors to find
xq = xb[:5] # Query the first 5 vectors from the dataset itself as an example
D, I = index.search(xq, k) # D is the distance matrix, I is the index matrix
# Display the results
print("Indices of Nearest Neighbors:")
print(I)
print("Distances to Nearest Neighbors:")
print(D)
Indices of Nearest Neighbors:
[[ 0 2240 2975 9327]
[ 1 8826 4435 1533]
[ 2 4493 6811 9874]
[ 3 3371 4861 2566]
[ 4 9884 4025 3933]]
Distances to Nearest Neighbors:
[[0. 1.1695111 1.2628028 1.2849629]
[0. 1.3047677 1.347751 1.3648174]
[0. 1.2633741 1.2682862 1.2734882]
[0. 1.2850516 1.3241501 1.3330061]
[0. 1.2195594 1.2292215 1.2797881]]
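Because the vectors were unit-normalized before indexing, the L2 ranking above is equivalent to a cosine-similarity ranking. If you prefer to work with similarities directly, faiss also provides an inner-product index (a sketch reusing the normalized xb from above):
# inner product on unit-length vectors equals cosine similarity
index_ip = faiss.IndexFlatIP(d)
index_ip.add(xb)
S, I_ip = index_ip.search(xb[:5], k)  # S holds similarities, so larger means closer
print(I_ip)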
You can also run the same search against new query vectors:
# Convert new DataFrame to numpy array
xq = new_df.to_numpy(dtype='float32')
# Normalize the vectors (only if the original data was normalized)
norm = np.linalg.norm(xq, axis=1, keepdims=True)
xq = xq / norm # unit-normalize each query (assumes no all-zero vectors)
# Perform a search
k = 4 # Number of nearest neighbors to find
D, I = index.search(xq, k) # D is the distance matrix, I is the index matrix
# Display the results
print("Indices of Nearest Neighbors:")
print(I)
print("Distances to Nearest Neighbors:")
print(D)
Indices of Nearest Neighbors:
[[6278 492 9733 4285]
[2629 1507 9788 9428]
[3151 572 5659 9585]
[8226 6498 5378 7976]
[5881 8343 80 546]]
Distances to Nearest Neighbors:
[[1.240731 1.2674764 1.3170264 1.3596915]
[1.3235697 1.3448924 1.3456169 1.3456538]
[1.1796645 1.1832618 1.2379076 1.2531517]
[1.2839512 1.2922028 1.2952569 1.306025 ]
[1.274986 1.3125253 1.327332 1.3359702]]
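One caveat: IndexFlatL2 performs an exact, brute-force search and serves as a baseline; the approximate indexes are where Faiss earns its speed at billion scale. A minimal sketch of an inverted-file index on the same data (the parameter values are illustrative, not tuned):
nlist = 100                                   # number of coarse clusters (Voronoi cells)
quantizer = faiss.IndexFlatL2(d)              # assigns vectors to cells
ivf_index = faiss.IndexIVFFlat(quantizer, d, nlist)

ivf_index.train(xb)                           # learn the cluster centroids
ivf_index.add(xb)

ivf_index.nprobe = 10                         # cells visited per query: higher is more accurate but slower
D_ivf, I_ivf = ivf_index.search(xq, k)        # same query vectors as above
print(I_ivf)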