This appears to do no chunking. It just shoves the entire document (entire book, in my case) into the embedding request to Ollama. So it's only helpful if all your documents are small (i.e. no books).
The embedding model (bge-m3 in this case) has a sequence length of 8192 tokens, so when rlama tries to embed the whole book, Ollama can only fit the first few pages into the embedding request.
Then, at query time, it retrieves the entire document instead of the relevant passage (because there is no chunking), but truncates it to the first 1000 characters, i.e. the first half-page of the Table of Contents.
As a result, when queried, the model says: "There is no direct mention of the Buddha in the provided documents." (The word Buddha appears 44,121 times in the documents I indexed.)
A better solution (and, as far as I can tell, what every other RAG does) is to split the document into chunks that can actually fit the context of the embedding model, and then retrieve those chunks -- ideally with metadata about which part of the document it's from.
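For reference, fixed-size chunking with overlap is only a dozen lines of Go. A minimal sketch (the chunk size and overlap are made-up placeholders, not anything rlama actually uses):

```go
// Minimal sketch of fixed-size chunking with overlap; the chunk size and
// overlap used below are arbitrary placeholders, not rlama's parameters.
package main

import "fmt"

// chunkText splits text into chunks of at most chunkSize runes, repeating the
// last `overlap` runes of each chunk at the start of the next one so that a
// relevant passage is never lost exactly on a chunk boundary.
func chunkText(text string, chunkSize, overlap int) []string {
	if overlap >= chunkSize {
		overlap = 0 // avoid a non-advancing loop
	}
	runes := []rune(text)
	step := chunkSize - overlap
	var chunks []string
	for start := 0; start < len(runes); start += step {
		end := start + chunkSize
		if end > len(runes) {
			end = len(runes)
		}
		chunks = append(chunks, string(runes[start:end]))
		if end == len(runes) {
			break
		}
	}
	return chunks
}

func main() {
	doc := "pretend this string is an entire book"
	fmt.Printf("%d chunks\n", len(chunkText(doc, 2000, 200)))
}
```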
---
I'd also recommend showing the search results to the user (I think just having a vector search engine is already an extremely useful feature, even without the AI summary / question answering), and altering the prompt to provide references (e.g. based on chunk metadata like the page number).
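For the references part, even something as simple as labelling each retrieved chunk in the prompt goes a long way. A rough sketch (the Chunk struct and its fields here are made up for illustration, not rlama's data model):

```go
// Rough sketch of building a prompt that lets the model cite sources.
// The Chunk struct and its fields are invented for illustration.
package main

import (
	"fmt"
	"strings"
)

type Chunk struct {
	Source  string // e.g. file name
	Page    int
	Content string
}

// buildPrompt labels each retrieved chunk so the model can cite it.
func buildPrompt(question string, chunks []Chunk) string {
	var b strings.Builder
	b.WriteString("Answer using only the excerpts below and cite them as [source, page].\n\n")
	for _, c := range chunks {
		fmt.Fprintf(&b, "[%s, p.%d]\n%s\n\n", c.Source, c.Page, c.Content)
	}
	fmt.Fprintf(&b, "Question: %s\n", question)
	return b.String()
}

func main() {
	p := buildPrompt("Who is the Buddha?", []Chunk{
		{Source: "book.pdf", Page: 12, Content: "The Buddha was born..."},
	})
	fmt.Println(p)
}
```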
I have just implemented chunking with overlap for larger documents, so texts are split into smaller chunks and all of your documentation stays accessible to the RAG. It's currently in the testing phase, and I'd like to experiment with different models to optimize the process. Once I confirm that everything is working correctly, I'll merge the PR into the main branch, and you'll just need to update Rlama with `rlama update`.
Sadly, the hardest part of running local models with tools like Ollama appears to be longer context prompts.
Models that respond really quickly to a short sentence prompt need vastly more RAM and CPU/GPU time for significantly longer inputs. I'm finding this really damages their utility for me.
> A better solution (and, as far as I can tell, what every other RAG does) is to split the document into chunks that can actually fit the context of the embedding model, and then retrieve those chunks -- ideally with metadata about which part of the document it's from.
Books have author-provided logical chunking in the form of chapters. You can further split/summarize smaller sections and then do a hierarchical search (naive chunking kind of sucks, in my experience).
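Something along these lines: split on chapter headings first, and only fall back to naive fixed-size chunks inside chapters that are still too long for the embedding model (the heading regexp and the size limit are illustrative guesses, not a spec):

```go
// Sketch of chapter-first splitting with a naive fallback for long chapters.
// The heading regexp and size limit are illustrative guesses, not a spec.
package main

import (
	"fmt"
	"regexp"
)

var chapterRe = regexp.MustCompile(`(?m)^Chapter\s+\w+.*$`)

func splitByChapter(text string, maxRunes int) []string {
	// Byte offsets of chapter headings; everything before the first heading
	// (front matter, table of contents, ...) becomes its own section.
	bounds := []int{0}
	for _, loc := range chapterRe.FindAllStringIndex(text, -1) {
		if loc[0] != 0 {
			bounds = append(bounds, loc[0])
		}
	}
	bounds = append(bounds, len(text))

	var chunks []string
	for i := 0; i+1 < len(bounds); i++ {
		section := text[bounds[i]:bounds[i+1]]
		runes := []rune(section)
		if len(runes) <= maxRunes {
			chunks = append(chunks, section)
			continue
		}
		// Chapter is too long for the embedding model: sub-split it naively.
		for s := 0; s < len(runes); s += maxRunes {
			e := s + maxRunes
			if e > len(runes) {
				e = len(runes)
			}
			chunks = append(chunks, string(runes[s:e]))
		}
	}
	return chunks
}

func main() {
	book := "Preface...\nChapter 1 The Buddha\n...\nChapter 2 The Dharma\n..."
	for i, c := range splitByChapter(book, 4000) {
		fmt.Printf("section %d: %d runes\n", i, len([]rune(c)))
	}
}
```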
What's the gold standard paid offering that does this?
Not a paid solution, but great for testing models yourself: AWS Bedrock.
Wonky documentation (definitely released too early), but imo the best model-agnostic DIY solution out there.
yeah, chunking seems to be the key for any decent RAG implementation... it's interesting how much the retrieval strategy impacts the final answer quality. i've seen some community members mention that even with chunking, things like chunk overlap and smart metadata can significantly improve results. also, presenting search results to the user alongside the AI summary is a great point.
This is my next step. Currently, I've built an MVP to test the features and integrations and see how far I can go with rlama. I'm already developing a RAG on my end by chunking the data, adding overlap, and using metadata to retrieve the best possible context. This should be deployed soon. The version on GitHub has been up for a few days now and was only meant to showcase the features. I can't wait to improve it and make it useful for everyone!
Really nice project, congrats & great work! Quick notes:
- as an end user, some primary concerns re apps using the file system:
  - who will be able to read it? does the app share data?
  - I'm not thinking about a privacy policy, but a hard block that would not allow any internet access for the binary/app. Would rlama still work correctly?
  - is the app able to modify/delete files?
  - it should be ensured that there is no "full file system" access, i.e. just read permission
- code note: surprised that .ts (TypeScript) is not listed
- really crisp website: did you code it from scratch or is it template-based?
I put Ollama in a Docker container, at first with no internet access, and then used OpenSnitch to keep an eye on it. You can probably put rlama in another container and do the same thing.
Note that there are threat profiles for which this is not enough security.
What is the architecture/tech stack used in building this? I couldn't find this info in the GitHub README or on the website.
I like the fact that it is written in Go and small enough to skim over a weekend, but after repeatedly burning my time on dozens of LLM ecosystem tools, I'm careful about even exploring the code myself without seeing these basic disclosures upfront. I'm sure you'd see more people adopting your tool if you provided a high-level overview of the project's architecture (ideally in a visual manner).
Hey! Yes, that's something I was planning to do: complete documentation of the code, its architecture, and the entire stack, so that others can develop alongside me. I just deployed a functional version, and soon the website will have documentation covering the architecture and a visualization of the entire codebase.
But for now, here is the stack used:
- Core Language: Go (chosen for performance, cross-platform compatibility, and single binary distribution)
- CLI Framework: Cobra (for command-line interface structure)
- LLM Integration: Ollama API (for embeddings and completions)
- Storage: Local filesystem-based storage (JSON files for simplicity and portability)
- Vector Search: Custom implementation of cosine similarity for embedding retrieval
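For anyone wondering what that last bullet boils down to in practice, brute-force cosine-similarity retrieval over stored embeddings is only a handful of lines. A generic sketch, not rlama's actual code:

```go
// Generic sketch of brute-force cosine-similarity retrieval over stored
// embeddings; not rlama's actual implementation.
package main

import (
	"fmt"
	"math"
	"sort"
)

// cosine assumes a and b have the same dimensionality.
func cosine(a, b []float64) float64 {
	var dot, na, nb float64
	for i := range a {
		dot += a[i] * b[i]
		na += a[i] * a[i]
		nb += b[i] * b[i]
	}
	if na == 0 || nb == 0 {
		return 0
	}
	return dot / (math.Sqrt(na) * math.Sqrt(nb))
}

type scored struct {
	id    string
	score float64
}

// topK ranks every stored chunk embedding against the query embedding.
func topK(query []float64, store map[string][]float64, k int) []scored {
	var results []scored
	for id, emb := range store {
		results = append(results, scored{id, cosine(query, emb)})
	}
	sort.Slice(results, func(i, j int) bool { return results[i].score > results[j].score })
	if len(results) > k {
		results = results[:k]
	}
	return results
}

func main() {
	store := map[string][]float64{
		"chunk-1": {0.1, 0.9, 0.0},
		"chunk-2": {0.8, 0.1, 0.1},
	}
	fmt.Println(topK([]float64{0.05, 0.95, 0.0}, store, 1))
}
```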
Hi, if you want to keep using a Go embedded/in-process vector store, but with some additional features, you can check out my project https://github.com/philippgille/chromem-go
Why not use an established open source vector db like pg_vector etc? I imagine your implementation is not going to be as performant
I recommend using this hybrid vector/full text search engine that works across many runtimes: https://github.com/oramasearch/orama
Defeats the point of the single binary installation if you have to set up dependencies.
rlama requires a python install (and several dependencies via pip) to extract text.
https://github.com/DonTizi/rlama/blob/main/internal/service/...
I feel very doubtful about the usefulness of these tools because of hallucinations. How reliable is this one compared with others like it? How well does it cite the source?
To me, getting my data back from my notes correctly is what matters most. I use AI tools for coding occasionally (which I can easily verify on my own); for anything else I can never bring myself to fully trust the output.
> How well does it cite the source?
I don't know about the OP's tool, but Open WebUI has its own document database which you can integrate with LLMs, and when answering questions it always cites the source with a link for you to verify.
Could this work with llama.cpp, since it’s the engine behind Ollama?
I usually build llama.cpp from source and download quantized (GGUF) models from Huggingface, haven’t used Ollama this far.
No, for now I've only made it work with Ollama, but supporting llama.cpp directly would be ideal. Thank you, I'll take note of it.
That would be great. Llama.cpp's built-in server offers HTTP embedding endpoints.
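Roughly like this, assuming llama-server is running locally with embeddings enabled (e.g. the --embeddings flag) and exposing the OpenAI-compatible /v1/embeddings route; the URL, port, model name, and response shape below are assumptions to illustrate the idea, not tested against any particular build:

```go
// Sketch of fetching an embedding from llama.cpp's built-in server via its
// OpenAI-compatible endpoint. The URL, port, and model name are placeholders.
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
)

type embeddingRequest struct {
	Model string `json:"model"`
	Input string `json:"input"`
}

type embeddingResponse struct {
	Data []struct {
		Embedding []float64 `json:"embedding"`
	} `json:"data"`
}

func embed(text string) ([]float64, error) {
	body, err := json.Marshal(embeddingRequest{Model: "bge-m3", Input: text})
	if err != nil {
		return nil, err
	}
	resp, err := http.Post("http://localhost:8080/v1/embeddings", "application/json", bytes.NewReader(body))
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()

	var out embeddingResponse
	if err := json.NewDecoder(resp.Body).Decode(&out); err != nil {
		return nil, err
	}
	if len(out.Data) == 0 {
		return nil, fmt.Errorf("no embedding returned")
	}
	return out.Data[0].Embedding, nil
}

func main() {
	vec, err := embed("hello world")
	if err != nil {
		panic(err)
	}
	fmt.Println(len(vec))
}
```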
This is great! It would be nice if there were an API interface for integrating it into other systems.
I thought about adding an API interface for it. It is on my to-do list of things that could be good to add. For now, I'm gathering feedback to see what people like about it or not.
Just do your own RAG. It's very easy, and Ollama actually has a quick-start tutorial on their page. Then you can also fine-tune the process to your needs.
Cool project. What license is this released under? Not seeing it documented.
Just added an Apache License
Say that I'm an amateur historian. I go into an archive and scan a bunch of documents (letters, diagrams, maps, etc). They're saved as JPG files.
What's the best way to make sense of that corpus of knowledge? Is it Rlama or something else?
I'm currently trying to homebrew this with Gemini, but am not sure if there's something that gets me out of building a RAG system from scratch.
This is cool and pretty much what I was wondering about. I mean, obviously it was possible, but it's cool to see it implemented. Looking forward to having a play.
I am building a tool purely with AI and have been working on specs and designs. It is clear that Claude and Grok can't really keep up with the way we humans jump around all over the place in context. Being able to build this local documentation repo and run Q&A against it will be neat.
How good is the RAG though? Just throwing a vector db at it doesn't make it useful...
Nice work. Any plans to somehow integrate into mdbooks (https://rust-lang.github.io/mdBook/) ?
Or a general web user-interface?
Would be interesting to see an example session of a user interacting with rlama. Maybe a Q&A about its own Go code.
I've already made some examples, even with my own codebase, to see how it can be used to understand projects, and I want to show how it can be used with documentation or studies. I will publish them next week.