From Wikipedia:
> The Hierarchical navigable small world (HNSW) algorithm is a graph-based approximate nearest neighbor search technique used in many vector databases.[1] Nearest neighbor search without an index involves computing the distance from the query to each point in the database, which for large datasets is computationally prohibitive. For high-dimensional data, tree-based exact vector search techniques such as the k-d tree and R-tree do not perform well enough because of the curse of dimensionality.
We're all suffering from the curse of dimensionality.
vp-trees
At very high scale, graphs are used less, or clustering is layered on top of graphs.
Graphs can be complex to build and rebalance. Graph-like data structures, a node here with a pointer out to another node there, aren't that cache friendly.
Add to that, people almost always want to *filter* vector search results, and this is a huge blind spot for consumers and providers. It's where the ugly performance surprises come from. Filtered HNSW isn't straightforward: it requires you to just keep traversing the graph looking for results that satisfy your filter.
HNSW came out of a benchmark regime where we just indexed some vectors and tried to maximize recall versus query latency. It doesn't take into account the filtering / indexing almost everyone wants.
Turbopuffer, for example, doesn't use graphs at all, it uses SPFresh. And they recently got 200ms latency on 100B vectors.
https://turbopuffer.com/docs/vector
There is an entire section of the post about that. I believe it's more the illusion of a problem, caused by product design issues, than a real challenge, since far results that match the filter are totally useless.
Doesn't this depend on your data to a large extent? In a very dense graph "far" results (in terms of the effort spent searching) that match the filters might actually be quite similar?
The "far" here means "with vectors having a very low cosine similarity / very high distance". So in vector use cases where you want near vectors matching a given set of filters, far vectors matching a set of filters are useless. So in Redis Vector Sets you have another "EF" (effort) parameter just for filters, and you can decide in case not enough results are collected so far how much efforts you want to do. If you want to scan all the graph, that's fine, but Redis by default will do the sane thing and early stop when the vectors anyway are already far.
I'm facing the problem you describe daily. It's especially bad because it's very difficult for me to predict if the set of filters will reduce the dataset by ~1% (in which case following the original vector index is fine) or by 99.99% (in which case you just want to brute force the remaining vectors).
Tried a million different things, but haven't heard of Turbopuffer yet. Any references on how they perform with such additional filters?
Lucene and ES implement a shortcut for filters that are restrictive enough. Since the engine is already optimized for figuring out whether something falls into your filter set, you first determine the size of that set. You then traverse the HNSW normally, and if you have traversed more nodes than your filter set's cardinality, you switch to brute-forcing distance comparisons over the filter set. So the worst case is roughly 2x your filter set's size in vector distance operations. Quite neat.
Oh that's nice! Any references on this shortcut? How do you activate that behavior? I was playing around with ES, but the only suggestion I found was to use `count` on filters before deciding (manually) which path to take.
Here you go https://github.com/apache/lucene/pull/656 - no need to do anything from the user side to trigger it as far as I know.
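Roughly, the mechanism described above could be sketched like this. Illustrative Python; `graph_search_filtered` is a made-up helper standing in for the graph traversal, not Lucene's actual API:

```python
# Sketch of the shortcut: traverse the graph normally, but cap the number of
# visited nodes at the filter set's cardinality; if that doesn't yield k
# results, brute-force the filter set directly. Worst case is ~2x the filter
# set size in distance computations.
import numpy as np

def search_with_fallback(graph_search_filtered, vectors, query, k, filter_ids):
    budget = len(filter_ids)          # cardinality of the pre-computed filter set
    hits = graph_search_filtered(query, k, filter_ids, max_visited=budget)
    if len(hits) >= k:
        return hits
    # Fallback: exact scan restricted to the filter set.
    scored = [(float(np.linalg.norm(vectors[i] - query)), i) for i in filter_ids]
    return sorted(scored)[:k]
```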
Just look up how vespa.ai does it; it's open source.
Hybrid search with vector similarity and filtering has, I think, mostly been solved by Vespa, and not even recently.
https://blog.vespa.ai/vespa-hybrid-billion-scale-vector-sear...
Not really. The thing Vespa solved was taking existing ANN methods and fixing a disk/RAM tradeoff (and some other niceties). That's nowhere close to adequate when:
1. As softwaredoug mentioned, you might want to filter results, potentially with a high filtration rate.
2. ANN isn't good enough. Suppose you need bounded accuracy with meaningfully sublinear time on a high-dimensional dataset. You're hosed.
Point (1) is just a repeat of a general observation that composing nice data structures doesn't usually give you a nice data structure, even if it technically works. Creating a thing that does what you want without costing both arms and the president's leg requires actually understanding DS&A and applying it in your solution from the ground up.
Point (2) might seem irrelevant (after all, people are "building" stuff with RAG and whatnot nowadays aren't they?), but it's crucial to a lot of applications. Imagine, e.g., that there exists one correct result in your database. The guarantees provided by SOTA ANN solutions (on high-dimensional data) have a steep compute/correctness tradeoff, giving you an abysmal chance of finding your document without searching an eye-watering fraction of your database. I usually work around that with the relaxation that the result needs to be the best one but that its information can be fuzzy (admitting solutions which merge a bunch of low-dimensional queries, corrected via some symmetry in a later step), but if you actually need good KNN results then you're kind of hosed.
For sure. But it's "solved" differently by every vector database. You have to pay attention to how it's solved.
Just curious: what is the state of the art for filtered vector search? I took a quick look at the SPFresh paper and didn't see it specifically address filtering.
Full text search has this same issue.
This is well worth reading in full. The section about threading is particularly interesting: most of Redis is single-threaded, but antirez decided to use threads for the HNSW implementation and explains why.
Thanks! Appreciate your words.
two great points here: (1) quantization is how you speed up vector indexes, and (2) how you build your graph matters much, much less*
These are the insights behind DiskANN, which has replaced HNSW in most production systems.
Past that, well, you should really go read the DiskANN paper instead of this article; product quantization is way, way more effective than simple int8 or binary quant.
here's my writeup from a year and a half ago: https://dev.to/datastax/why-vector-compression-matters-64l
and if you want to skip forward several years to the cutting edge, check out https://arxiv.org/abs/2509.18471 and the references list for further reading
* but it still matters more than a lot of people thought circa 2020
Hi! I worked with product quantization in the past, in the context of a library I released to read LLMs stored in llama.cpp format (GGUF). However, in the context of in-memory HNSWs, I found it to make only a small difference: the recall is already almost perfect with int8. Of course it is very different when you are quantizing an actual neural network with, for instance, 4-bit quants; there it makes a huge difference. But in my use case I picked what would be the fastest, given that both performed equally well. What could potentially be done with PQ in the case of Redis Vector Sets is to make 4-bit quants work decently (though not as well as int8 anyway), but given how fat the data structure nodes are per se, I don't think this is a great tradeoff.
All this to say: the blog post mostly presents the conclusions, but to reach that design many things were tried, including things that looked cooler but in practice were not the best fit. It's not by chance that Redis HNSWs can easily do 50k full queries/sec on decent hardware.
if you're getting near-perfect recall with int8 and no reranking then you're either testing an unusual dataset or a tiny one, but if it works for you then great!
Near-perfect recall vs fp32, not in absolute terms. TLDR: it's not int8 that ruins it, at least if the int8 quants are computed per-vector and not with global centroids. Also, recall is a very illusory metric, but that's an argument for another blog post. In short, what really matters is that the best candidates are collected: the long tail is full of elements that are far enough away or practically equivalent anyway. And all of this rests on the assumption that the embedding model already captures the similarity our application demands, which is itself an illusion, so if the 60th result shows up as the 72nd, it normally does not matter. The reranking that really matters (if you have the ability to do it) is the LLM picking / reranking: that, yes, makes all the difference.
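For the curious, per-vector int8 quantization of the kind mentioned here could look roughly like this. A sketch, not the actual Redis code:

```python
# Per-vector int8 quantization: each vector carries its own scale, so the
# quantization error doesn't depend on global statistics or shared centroids.
import numpy as np

def quantize_int8(v: np.ndarray):
    scale = float(np.abs(v).max()) / 127.0
    if scale == 0.0:
        scale = 1.0                      # all-zero vector
    q = np.round(v / scale).astype(np.int8)
    return q, scale

def dot_int8(qa, sa, qb, sb):
    # Accumulate in int32 to avoid overflow, rescale once at the end.
    return float(np.dot(qa.astype(np.int32), qb.astype(np.int32))) * sa * sb
```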
> Similarly, there are, right now, efforts in order to really check if the “H” in the HNSWs is really needed, and if instead a flat data structure with just one layer would perform more or less the same (I hope I’ll cover more about this in the future: my feeling is that the truth is in the middle, and that it makes sense to modify the level selection function to just have levels greater than a given threshold).
Small world networks have been my obsession for almost a decade, from a biological and sociological evolution angle. Love this stuff.
This is the paper he's referring to, which I was just reading again yesterday :)
Down with the Hierarchy: The 'H' in HNSW Stands for "Hubs" (2025) https://arxiv.org/abs/2412.01940
Also:
Hubs in Space: Popular Nearest Neighbors in High-Dimensional Data (2010) https://www.jmlr.org/papers/v11/radovanovic10a.html
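On the quoted idea of tweaking the level selection function: the standard HNSW assignment draws levels from a geometric distribution, and one purely hypothetical reading of the proposal is to collapse the lowest layers, keeping only levels above a threshold. A sketch of that interpretation, not anything from the post itself:

```python
# Classic HNSW level assignment plus a hypothetical "thresholded" variant that
# collapses the lowest layers, leaving a flat base layer and a few hub levels.
import math
import random

def hnsw_level(m: int) -> int:
    # Standard HNSW: level ~ geometric distribution with m_L = 1 / ln(M).
    ml = 1.0 / math.log(m)
    return int(-math.log(1.0 - random.random()) * ml)

def thresholded_level(m: int, threshold: int = 2) -> int:
    # Hypothetical variant: a node either stays on layer 0 or gets a level
    # at or above `threshold`, so the "hierarchy" is only a thin hub layer.
    level = hnsw_level(m)
    return level if level >= threshold else 0
```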
As someone also interested in not just repair of the social fabric, but the evolution of gendered social dynamics in social species (biologically and culturally), it's interesting to me that there might be some property of networks that:
1. Makes any random point as close as possible to any other random point (all solutions are equally likely to be close to the unknown solution somewhere in the network)
2. Requires only a minimal number of edges to be maintained (valuable in social networks, where edges are relationships that have a maintenance cost)
3. Involves some negotiation between hierarchy (aka entrepreneurial networks, with opportunity for value extraction through rent-seeking traffic) and hubness (which dissolves capture points in networks by "weaving" them out of existence). All the short paths in a small world pass through hubs, which necessarily maintain many more, weaker connections using much more "gossip"-like comms protocols
Am working on this stuff related to collective intelligence and democracy work, if anyone wants to be in touch https://linkedin.com/in/patcon-
Fascinating stuff about other applications of small worlds, very far from what I do, that I wasn't aware of. Thanks.
> many programmers are smart, and if instead of creating a magic system they have no access to, you show them the data structure, the tradeoffs, they can build more things, and model their use cases in specific ways. And your system will be simpler, too.
Basically my entire full-time job is spent prosecuting this argument. It is indeed true that many programmers are smart, but it is equally true that many programmers _are not_ smart, and those programmers have to contribute too. More hands is usually better than simpler systems for reasons that have nothing to do with technical proficiency.
>> More hands is usually better than simpler systems for reasons that have nothing to do with technical proficiency.
If you are working on open source databases or something close to the metal, I agree with antirez; if you are working at some established tech business (e.g. a very old ecommerce site), I agree with you.
To be clear, I'm not disagreeing with antirez at all. I feel his argument in my bones. I am a smart programmer. I want simple, powerful systems that leave the kid gloves in the drawer.
The unfortunate reality is that a large cadre of people cannot handle such tools, and those people still have extremely valuable contributions to make.
I say this as a full-time research engineer at a top-10 university. We are not short on talent, new problems, or funding. There is ample opportunity to make our systems as simple/"pure" as possible, and I make that case vigorously. The fact remains that intentionally limiting scope for the sake of the many is often better than cultivating an elite few.
I'm really curious about the shape of your problem. I was an early hire at a tech unicorn and helped build many high scale systems and definitely found that our later stage hires really had a harder time dealing with the complexity tradeoff than earlier hires (though our earlier hires were limited by a demand to execute fast or lose business which added other, bad constraints.) I'm curious what your iteration of the problem looks like. We managed it by only trusting the bedrocks of our systems to engineers who demonstrated enough restraint to architect those.
What the blog post calls Vector Quantization is really just scalar quantization of the individual vector components. Vector quantization proper typically uses K-means to quantize the entire vector, or parts of the vector in product quantization, into K-means centroid indices.
https://en.wikipedia.org/wiki/Vector_quantization
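For contrast with per-component (scalar) quantization, here is a toy product quantizer in the K-means sense. A sketch for illustration; production implementations (e.g. FAISS) are far more careful about training and asymmetric distance computation:

```python
# Toy product quantization: split each vector into m sub-vectors and encode
# each sub-vector as the index of its nearest K-means centroid.
import numpy as np
from sklearn.cluster import KMeans

def train_pq(data: np.ndarray, m: int = 8, k: int = 256):
    d = data.shape[1] // m
    codebooks = []
    for i in range(m):
        sub = data[:, i * d:(i + 1) * d]
        codebooks.append(KMeans(n_clusters=k, n_init=10).fit(sub).cluster_centers_)
    return codebooks

def encode_pq(v: np.ndarray, codebooks):
    d = len(v) // len(codebooks)
    codes = []
    for i, cb in enumerate(codebooks):
        sub = v[i * d:(i + 1) * d]
        codes.append(int(np.argmin(np.linalg.norm(cb - sub, axis=1))))
    return codes   # one small integer per sub-vector instead of d floats
```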
> Actually, you can implement HNSWs on disk, even if there are better data structures from the point of view of disk access latencies
I built a KV-backed HNSW in Rust using LMDB to address this exact use case (https://github.com/nnethercott/hannoy). The trickiest part for sure is being clever with your IO ops, since HNSW search requires tons of random reads to find the nearest neighbours. Especially if you're not on NVMe (think EBS-based HNSW), you really start to feel the cost of those reads, even with all the fancy SIMD.
I too believe there are ways to make HNSWs work well enough on disk.
Two resources that helped me understand HNSWs: https://blog.wilsonl.in/graph-vector-search/ and https://github.com/brtholomy/hnsw
> long, sad story about MacOS and bad habits – I hadn’t lost something like that since the 90s, during blackouts
would love to hear this story as well now!
Well TLDR I'm an idiot :D
I write blog posts in TextEdit, just a white page to fill with text. Normally this is fine: I finish writing and publish the blog post. This time, after writing it, I left it there for weeks; I wanted to refine it a bit, the post didn't feel "ready". Then I had to reboot the computer for an upgrade, and I somehow quit the application by hitting Cancel when it asked to save the document :-|
However, rewriting it was fast, and the second version was better. So, it's fine. But from now on I'll use the (to me) less pleasant Google Docs.
Google Docs with "pageless" mode and the markdown shortcuts is pretty close to my ideal writing experience - especially if you're using the enterprise edition that unlocks code blocks (please bring that to the free version Google - could help lots of college students take notes)
Sublime Text has a fullscreen mode, and an autosave mode if you need something local. https://lucybain.com/resources/setting-up-sublime-autosave/
why not vim?
I code with vim, but to write prose, I want a white big window without anything else.
How does redis deal with the case of an in-memory HNSW growing faster than it can comfortably flush pages to disk?
Redis does not flush pages to disk. When the RDB / AOF are generated by the saving child, it is a point-in-time operation, not bound by the speed at which data is added in the parent.
I may not remember this correctly, but normalizing to -127..127 may be counterproductive. When vectors are normalized to unit length (components in -1..1), Euclidean, dot product, and cosine similarities all produce the same ranking, allowing you to compute similarity without having to calculate cosines.
Redis stores the magnitude of the vector and stores the vector itself in normalized form, so indeed we just do the dot product :) Yet we can still reconstruct the original vector. I totally forgot to mention this in the blog post!
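A tiny sketch of that trick, assuming nothing about the real storage format: keep the unit vector plus its magnitude, use a plain dot product for similarity, and reconstruct the original on demand.

```python
# Store unit-length vectors plus their magnitude: cosine similarity becomes a
# plain dot product, and the original vector can still be reconstructed.
import numpy as np

def store(v: np.ndarray):
    norm = float(np.linalg.norm(v))
    unit = v / norm if norm > 0.0 else v
    return unit, norm

def similarity(unit_a: np.ndarray, unit_b: np.ndarray) -> float:
    return float(np.dot(unit_a, unit_b))     # equals cosine for unit vectors

def reconstruct(unit: np.ndarray, norm: float) -> np.ndarray:
    return unit * norm

v = np.array([3.0, 4.0])
unit, norm = store(v)
assert np.allclose(reconstruct(unit, norm), v)
```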