I tried to find some use cases, this paper has listed some, although I think it's not obvious to me what makes the trees uniquely useful compared to other schemes (https://users.dcc.uchile.cl/~gnavarro/ps/cpm12.pdf seems to be the same Navarro as referenced in the article).
The use cases listed in that pdf are revolving around compression, e.g. graph adjacency list is listed as one. I myself found the last use case listed as smelling interesting (colored range queries), but I would need to dig into the references on that one to see what's actually going on with that one and is it truly anything interesting.
I would be interested in things like what's the unique advantage wavelets trees have compared to e.g. stuffing roaring bitmaps or other kinds of bitmaps into a tree. The RRR has rank-and-select queries which I think roaring bitmap won't do, so that might tie into something. Maybe a problem where the wavelet tree is the only known efficient way to solve it, or maybe it is uniquely really easy to throw at some types of problems or something else.
Anyone know real-world examples of wavelet trees used somewhere? I got interested enough to dig a bit deeper but on the spot as I'm writing this comment, I'm not smart enough to immediately see do these things have killer applications in any niches.
Succinct data structures such as wavelet trees are widely used in bioinformatics. There you often have strings that cannot be tokenized or parsed meaningfully, so you just have to deal with sequences of symbols. And because the strings are often very long and/or there can be a huge number of them, the data structures have to be space-efficient.
A wavelet tree is best seen as an intermediate data structure. It doesn't do anything particularly interesting on its own, but it can be used as a building block for higher-level data structures. For example, you can create an FM-index by storing the Burrows–Wheeler transform in a wavelet tree. (Though there are better options when the alphabet is small.) And then you can use the FM-index to find exact matches of any length between the pattern and the indexed strings.
People working with succinct data structures often talk about bitvectors rather than bitmaps. The difference is that bitmaps tend to focus on set operations, while bitvectors are more about random access with rank, select, and related queries. Then you could see wavelet trees as a generalization of bitvectors from a binary alphabet to larger alphabets. And then you have people talking about wavelet trees, when they really mean a wider class of conceptually and functionally similar data structures, regardless of the actual implementation.
I think what's changed since this was posted in 2011 is the emergence of embeddings and the need to take advantage of its higher dimensional space. While embeddings expose more underlying structure that can be used for tensor math, ranking systems often are still good ol' trees. This project to me points at a new major "hinge" of information architecture.
(Sorry for being a dick if im wrong) - was this an AI generated comment that got confused by the domain specific meaning of the word "rank" in this context?
That is a pretty badly overloaded word for it, and I didn't even pay much attention to the notion of "rank" anyways. I'm mainly interested in how the text is represented with bit vectors. It's very reminiscent of how vector math plays out in other ML domains, but I would bet that many people working with text have never heard of it.
Can you explain how this is useful for those problems though? I'm struggling to come up with a way to use rank queries on embeddings in order to get back useful information.
Interestingly, this algo has absolutely nothing whatsoever to do with "Wavelets" or even waves. The name "wavelet" stuck to it mostly only because it uses a recursive decomposition approach which happens to be something that Wavelet does in actual wave processing. It got collectively labeled "Wavelet" when what was really meant was just "Recursive".
The name derives from an analogy with the wavelet transform for signals, which recursively decomposes a signal into low-frequency and high-frequency components (Wikipedia).
"Wavelet tree" is not just a collective label but the name explicitly given by the authors of the paper where the data structure was first described in. At least Vitter had worked in image/video compression, where wavelet transforms and similar techniques are common. I believe the original idea was adapting those techniques for representing strings, and the wavelet tree data structure was the final outcome.
I tried to find some use cases, this paper has listed some, although I think it's not obvious to me what makes the trees uniquely useful compared to other schemes (https://users.dcc.uchile.cl/~gnavarro/ps/cpm12.pdf seems to be the same Navarro as referenced in the article).
The use cases listed in that pdf are revolving around compression, e.g. graph adjacency list is listed as one. I myself found the last use case listed as smelling interesting (colored range queries), but I would need to dig into the references on that one to see what's actually going on with that one and is it truly anything interesting.
I would be interested in things like what's the unique advantage wavelets trees have compared to e.g. stuffing roaring bitmaps or other kinds of bitmaps into a tree. The RRR has rank-and-select queries which I think roaring bitmap won't do, so that might tie into something. Maybe a problem where the wavelet tree is the only known efficient way to solve it, or maybe it is uniquely really easy to throw at some types of problems or something else.
Anyone know real-world examples of wavelet trees used somewhere? I got interested enough to dig a bit deeper but on the spot as I'm writing this comment, I'm not smart enough to immediately see do these things have killer applications in any niches.
Succinct data structures such as wavelet trees are widely used in bioinformatics. There you often have strings that cannot be tokenized or parsed meaningfully, so you just have to deal with sequences of symbols. And because the strings are often very long and/or there can be a huge number of them, the data structures have to be space-efficient.
A wavelet tree is best seen as an intermediate data structure. It doesn't do anything particularly interesting on its own, but it can be used as a building block for higher-level data structures. For example, you can create an FM-index by storing the Burrows–Wheeler transform in a wavelet tree. (Though there are better options when the alphabet is small.) And then you can use the FM-index to find exact matches of any length between the pattern and the indexed strings.
People working with succinct data structures often talk about bitvectors rather than bitmaps. The difference is that bitmaps tend to focus on set operations, while bitvectors are more about random access with rank, select, and related queries. Then you could see wavelet trees as a generalization of bitvectors from a binary alphabet to larger alphabets. And then you have people talking about wavelet trees, when they really mean a wider class of conceptually and functionally similar data structures, regardless of the actual implementation.
I think what's changed since this was posted in 2011 is the emergence of embeddings and the need to take advantage of its higher dimensional space. While embeddings expose more underlying structure that can be used for tensor math, ranking systems often are still good ol' trees. This project to me points at a new major "hinge" of information architecture.
What does this have to do with wavelet trees?
(Sorry for being a dick if im wrong) - was this an AI generated comment that got confused by the domain specific meaning of the word "rank" in this context?
That is a pretty badly overloaded word for it, and I didn't even pay much attention to the notion of "rank" anyways. I'm mainly interested in how the text is represented with bit vectors. It's very reminiscent of how vector math plays out in other ML domains, but I would bet that many people working with text have never heard of it.
Can you explain how this is useful for those problems though? I'm struggling to come up with a way to use rank queries on embeddings in order to get back useful information.
> It's very reminiscent of how vector math plays out in other ML domains
How so?
You weren't wrong. Wavelet Trees have no relationship whatsoever to vector embeddings.
Discussed here 12 years ago. https://news.ycombinator.com/item?id=5526991 (7 comments)
I think the site just went down.
Interestingly, this algo has absolutely nothing whatsoever to do with "Wavelets" or even waves. The name "wavelet" stuck to it mostly only because it uses a recursive decomposition approach which happens to be something that Wavelet does in actual wave processing. It got collectively labeled "Wavelet" when what was really meant was just "Recursive".
The name derives from an analogy with the wavelet transform for signals, which recursively decomposes a signal into low-frequency and high-frequency components (Wikipedia).
"Wavelet tree" is not just a collective label but the name explicitly given by the authors of the paper where the data structure was first described in. At least Vitter had worked in image/video compression, where wavelet transforms and similar techniques are common. I believe the original idea was adapting those techniques for representing strings, and the wavelet tree data structure was the final outcome.
You're seriously nit picking what "collective label" means? It means that name was accepted by the community.
Doesn't really seem like a nitpick to me. Your description of the situation feels a bit misleading.
Sounds like you haven't quite found a mistake yet. Keep thinking. Maybe you'll think of something.