Language Support for Marginalia Search

176 points by Bogdanp 2 days ago

ofalkaed 2 days ago

Surprisingly informative for what is pretty much a press release, learned a good deal about search engines.

marginalia_nu 2 days ago

(author)
I'm kinda allergic to writing "I did the thing" posts, so I can't help but tryhard and attempt to make them compelling somehow.
Writing in this manner is also very helpful in making sense of the work for myself. Takes a better understanding of the subject to thoroughly explain what you've built than to merely build it. Sometimes I've gone back and read through one of these updates to just get a refresher on what my thinking was when I built something.
- ofalkaed 2 days ago
  
  In my experience, that is pretty much what marginalia search is. I rarely get what I expect but I always get something very interesting that makes me understand my expectations better which is very helpful in accomplishing my goals. Thanks for your work, marginalia is probably my favorite little corner of the web.
- LTom a day ago
  
  A quick question: are you looking for feedback on search results in other languages (as in, what I expect vs. what I get), or is it too early for that?
  - marginalia_nu a day ago
    
    Yeah it's definitely helpful to have those types of reports.

mariusor 2 days ago

Off topic, but would there be a way to integrate marginalia with a specific website? Similarly to how people use google search for their forums or how HN uses algolia?

I'm asking this as one of my projects is a link aggregator similar to old reddit (and HN to some extent) and I would like to be able to present to users a search box, but without having to implement document indexing and search. (I assume ad principio that the website is already aligned ethically and technologically with what Marginalia stands for :D)

marginalia_nu 2 days ago

Should be soon-ish. I'm working right now on laying the groundwork for ad-hoc domain filters. That's technically already possible, but it comes at such a big performance cost that it degrades the search results.
When it works, one of the things I have in mind is making site-search-style functionality available, as well as exposing it via the public API so that it can be white-labeled.
- mariusor 2 days ago
  
  Nice. Is there a way to track the work you're doing there (and in general actually)?
  - marginalia_nu 2 days ago
    
    Best is probably the search-engine tag on my blog[1]. It's the closest you get to release notes for the project.
    [1] https://www.marginalia.nu/tags/search-engine/

smoghat a day ago

I’m a little confused by Marginalia. I looked to find out what its purpose was, but couldn’t find it. My bad, I guess, but then again I’m not a search engine. It is pretty cool for a DIY project but the results were really off, especially for searches for individuals. Like take Ezra Klein as an example. Sure there is a link to his show from castbox, a service I have never heard of, and then a bunch of anti Ezra Klein articles. Wikipedia shows up, the last link of the first page is to Abundance. But no NYT? That seems like a big problem. I thought I’d look up Daring Fireball and the only link to his site was a ways down and was to a list of links in 2008. These are just two random searches. I did others, starting with myself, and my results were similar.

Likely I am totally not understanding what this search engine is for. I see this a lot on submissions here. I find something interesting sounding but I don’t understand the context. Maybe it’s just me, but it’s confusing.

marginalia_nu a day ago

The point of Marginalia Search, as far as there is one, is mostly to complement the bigger search engines by providing tools to find obscure stuff that's drowned out elsewhere, mostly by offering a bunch of filters.
It's not a google replacement, and if you already know what you're looking for then it's probably not the right tool.
If you're looking for mechanical keyboard discussions, then a search for "mechanical keyboard" in the Blogs[1] or Forums[2] filters will probably provide results you are into.
It's also pretty good at unearthing weird stuff. Say you want to read up on Jack Parsons[3], that Jet Propulsion Lab guy who dabbled in occultism, fell in with Aleister Crowley, then got scammed out of his wealth by L. Ron Hubbard, and finally blew himself up. That is the sort of topic Marginalia Search generally excels at.
[1] https://marginalia-search.com/search?query=mechanical+keyboa...
[2] https://marginalia-search.com/search?query=mechanical+keyboa...
[3] https://marginalia-search.com/search?query=Jack+Parsons&prof...

iamnothere a day ago

It’s for finding results that are less common or more unlikely to appear on other engines, so your results make sense. Why would you need yet another link to an NYT article? That space is crowded. Every engine will find it.
Where it particularly shines is finding highly specific results that get buried in other search engines. Some topics (particularly topics of high commercial interest) have become impossible to research on mainstream search engines. Marginalia will actually find informative articles about these topics rather than page after page of product results and spam.
It may not be useful to you if you’re not a researcher, writer, or someone who often needs to dig deeply into subjects beyond the level of common knowledge.

FabCH a day ago

It's a one-man Search engine developed and hosted in the EU.
If you read his about page, it is basically an anti-centralization, anti-ad, anti-spyware attempt at web search. It also says: "The project is independent in that it has no loans, no investors looking for a payday, no strings attached anywhere to pressure it into doing anything than providing as much and as good internet search as it is capable of."
It not indexing NYT seems precisely on brand.
- marginalia_nu a day ago
  
  It does index bits of NYT, but coverage is pretty spotty outside of their archives. They put a lot of crawler countermeasures up on their main site (which I guess is fair, they have a business to run), but author biographies are generally accessible, including Ezra's[1].
  Though since the search engine doesn't really apply much in terms of domain authority, this doesn't rank very highly; the websites that talk about Ezra Klein rank higher.
  [1] https://marginalia-search.com/search?query=site%3Anytimes.co...
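
For readers wondering what a site-restricted query looks like in practice, here is a minimal sketch. It only builds a URL in the same shape as the links quoted in this thread (a `query` parameter containing a `site:` prefix). The function name and the `example.com` domain are made up for illustration, and this is not the planned site-search feature or the public API.

```python
from urllib.parse import urlencode

# Base URL as it appears in the links quoted in this thread.
SEARCH_URL = "https://marginalia-search.com/search"


def site_search_url(domain: str, terms: str) -> str:
    """Build a search URL restricted to one domain via the site: prefix.

    Hypothetical helper: only the 'query' parameter is set, nothing else.
    """
    query = f"site:{domain} {terms}".strip()
    return f"{SEARCH_URL}?{urlencode({'query': query})}"


print(site_search_url("example.com", "mechanical keyboard"))
```

A link aggregator could point its search box at a URL like this as a stopgap, at the cost of sending users off-site.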

atombender 15 hours ago

> Thankfully the BM-25 model used in ranking is robust to this, as it relies on live data from the index itself.

I'm confused by this. TF-IDF incorporates the document frequency of each term (the IDF part), which search engines precompute for the index as a whole. But so does BM25; its IDF formula is slightly different, but it also relies on those frequencies. What's the difference?

marginalia_nu 7 hours ago

The index has the most up-to-date term frequency information, but it is logistically inaccessible, and it's not really practical to interrogate it when extracting keywords (as you need this information for 100 billion terms), so a somewhat stale version is kept in memory instead and used in that process.
When searching, doing BM25, it is a lot more accessible, as you already fetch that information indirectly as part of looking up the document lists, and this is typically only done up to about a dozen times per query.
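
To make the point about query-time statistics concrete, here is a rough sketch of BM25 scoring over a toy in-memory index. It is not Marginalia's code: the data layout, the function names, and the `k1`/`b` defaults are assumptions, and the IDF shown is one common textbook variant. The thing to notice is that the document frequency is simply the length of the postings list that the query had to fetch anyway.

```python
import math


def bm25_idf(doc_freq: int, num_docs: int) -> float:
    # A common "plus one" BM25 IDF variant; it never goes negative.
    return math.log(1.0 + (num_docs - doc_freq + 0.5) / (doc_freq + 0.5))


def bm25_score(query_terms, doc_id, postings, doc_len, k1=1.2, b=0.75):
    """Score one document.

    postings: term -> {doc_id: occurrences of the term in that document}
    doc_len:  doc_id -> document length in tokens
    """
    num_docs = len(doc_len)
    avg_len = sum(doc_len.values()) / num_docs
    score = 0.0
    for term in query_terms:
        plist = postings.get(term, {})
        tf = plist.get(doc_id, 0)
        if tf == 0:
            continue
        # Document frequency comes for free from the postings list size.
        idf = bm25_idf(len(plist), num_docs)
        norm = tf + k1 * (1 - b + b * doc_len[doc_id] / avg_len)
        score += idf * tf * (k1 + 1) / norm
    return score


# Toy index, purely illustrative.
postings = {"keyboard": {1: 3, 2: 1}, "mechanical": {1: 2}}
doc_len = {1: 100, 2: 80, 3: 120}
query = ["mechanical", "keyboard"]
ranked = sorted(doc_len, key=lambda d: bm25_score(query, d, postings, doc_len), reverse=True)
print(ranked)
```

With only a handful of query terms, this lookup happens a few times per query, which fits the "about a dozen times" remark above. Keyword extraction would need the same statistic for every term of every document, hence the stale in-memory copy.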

vintermann a day ago

This is never going to work. The author is apparently against AI in search in favor of "simplicity", but this sort of thing

> Sentences are stemmed and POS-tagged. Sentences, with stemming and POS-tag data is fed into keyword extraction algorithms

IS AI, it's just old fashioned and bad AI. What he's trying will never work well, for the same reason rule-based machine translation never worked well: there are just too many rules and exceptions. Simplicity is great when you can have it, but with human language, simplicity was never on the table.

He's going to have to bite the bullet and use document embedding models sooner or later.

marginalia_nu a day ago

This code is just for helping identify document topics, it literally doesn't need to be perfect. Embedding a billion documents with a server that has no GPU is neither practical nor something that yields good results.
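
As an illustration of how POS tags can feed topic identification without any embedding model, here is a toy sketch. It assumes the tokens have already been tagged with Penn Treebank-style tags by some tagger, skips stemming entirely, and uses a made-up rule (runs of adjectives and nouns that end in a noun). It is not the extraction algorithm the search engine actually uses.

```python
from collections import Counter


def extract_keywords(tagged, top_n=3):
    """Return the most frequent adjective/noun runs that end in a noun.

    tagged: list of (word, tag) pairs; 'JJ' and 'NN*' tags are assumed.
    """
    counts = Counter()
    run = []
    # The sentinel pair at the end flushes the final run.
    for word, tag in list(tagged) + [("", "END")]:
        if tag == "JJ" or tag.startswith("NN"):
            run.append((word.lower(), tag))
            continue
        # Drop trailing adjectives so every phrase ends in a noun.
        while run and run[-1][1] == "JJ":
            run.pop()
        if run:
            counts[" ".join(w for w, _ in run)] += 1
        run = []
    return [phrase for phrase, _ in counts.most_common(top_n)]


# Hand-tagged sample input, purely illustrative.
sample = [
    ("Mechanical", "JJ"), ("keyboards", "NNS"), ("are", "VBP"), ("loud", "JJ"), (".", "."),
    ("Mechanical", "JJ"), ("keyboards", "NNS"), ("use", "VBP"), ("switches", "NNS"), (".", "."),
]
print(extract_keywords(sample))
```

A rule this crude will mis-handle plenty of sentences, which is consistent with the point above: for guessing what a document is about, an occasional wrong tag or missed phrase is tolerable.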

reedf1 2 days ago

Took me too long to realize this wasn't a tool to search for marginalia in scanned manuscripts.

iamnothere 2 days ago

Hey, at least it isn’t named after a very large number, an excited exclamation, or a sound effect. Surely no product with one of those names would ever succeed.
- marginalia_nu 2 days ago
  
  I probably should have named it cartoon-trombone.wav in retrospect.
  - reedf1 a day ago
    
    It's a fine name! I had marginalia on the mind - I am reading The Name of the Rose.
    
    iamnothere a day ago
    
    That makes sense. I am perhaps overly sensitive to the drive by “name haters” who seem to show up in every FOSS or indie project thread.
    
    reedf1 a day ago
    
    I feel a bit bad it was interpreted that way.
    Some fun context: I was trying to find a scanned copy of the first 'correct' book on optics (written by https://en.wikipedia.org/wiki/Ibn_al-Haytham), possibly the first person to really use the scientific method, circa 1000 CE (!!). And I found this (https://cudl.lib.cam.ac.uk/view/MS-PETERHOUSE-00209/103), filled with interesting optical diagrams like something out of my high school physics notebooks. Anyway, I was also thinking about how they might index interesting doodles in the margins. So it was on my mind.

internet_points 2 days ago

What tools/data do you use for pos-tagging? I'm guessing it has to be fast, to run without a google data center :)

marginalia_nu 2 days ago

I'm using RDRPOSTagger[1], though I've optimized the code a bit so that it's not just algorithmically efficient, but also uses the language in a way that is fast. It isn't perfect, but it's good enough to be useful.
Language detection and sentence splitting are the other two slow bits of processing.
[1] https://github.com/datquocnguyen/RDRPOSTagger

juliend2 2 days ago

I remember asking you for this, so thank you so much! It works quite well from what I can see.

Small UI issue: on desktop, the left sidebar should be scrollable, because right now on Firefox I can't reach the "Language" menu item in the search results view unless I zoom out.