Hey HN! I built this because I was tired of waiting 10 seconds for Ollama's 680MB binary to start just to run a 4GB model locally.
Quick demo - working VSCode + local AI in 30 seconds:

  curl -L https://github.com/Michael-A-Kuykendall/shimmy/releases/late...
  ./shimmy serve
  # Point VSCode/Cursor to localhost:11435
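Most OpenAI-compatible tools just need the base URL pointed at Shimmy; anything built on the stock OpenAI SDKs will pick it up from the usual environment variables (the /v1 prefix and the dummy key below are assumptions - adjust to whatever your client expects):

  # Point OpenAI-SDK-based tools at shimmy instead of api.openai.com.
  # The /v1 path and the placeholder key are assumptions, not documented behavior.
  export OPENAI_BASE_URL="http://localhost:11435/v1"
  export OPENAI_API_KEY="local-dummy-key"   # some clients refuse to start without a non-empty key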
The technical achievement: Got it down to 5.1MB by stripping everything except pure inference. Written in Rust, uses llama.cpp's engine.
One feature I'm excited about: You can use LoRA adapters directly without converting them. Just point to your .gguf base model and .gguf LoRA - it handles the merge at runtime. Makes iterating on fine-tuned models much faster since there's no conversion step.
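Roughly, the workflow is: hand it the base model and the adapter, both as GGUF, and the adapter is applied at load time. The flags below are illustrative only, not the documented CLI - check shimmy --help for the real option names:

  # Illustrative syntax - flag names here are assumptions, not Shimmy's documented CLI.
  # The point: one GGUF base + one GGUF LoRA, applied at load time, no offline merge step.
  ./shimmy serve --model ./models/base-model.gguf --lora ./models/my-adapter.gguf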
Your data never leaves your machine. No telemetry. No accounts. Just a tiny binary that makes GGUF models work with your AI coding tools.
Would love feedback on the auto-discovery feature - it finds your models automatically so you don't need any configuration.
What's your local LLM setup? Are you using LoRA adapters for anything specific?
You may have noticed already, but the link to the binary is throwing a 404.
This should be fixed now!
How do I use it with Ollama models?
To use Shimmy (instead of Ollama):

1. Install Shimmy:

  cargo install shimmy

2. Get GGUF models (same models you'd use with Ollama):

  # Download to ./models/ directory
  huggingface-cli download microsoft/Phi-3-mini-4k-instruct-gguf --local-dir ./models/
  # Or use existing Ollama models from ~/.ollama/models/

3. Start serving:

  ./shimmy serve

4. Use with any OpenAI-compatible client at http://localhost:11435
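A quick smoke test once it's serving - this assumes Shimmy exposes the standard OpenAI-style chat completions route and that the model name matches something auto-discovery actually found on your machine:

  # Standard OpenAI-compatible request; swap in a model name shimmy discovered.
  curl http://localhost:11435/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "phi-3-mini-4k-instruct", "messages": [{"role": "user", "content": "Say hello from a local model."}]}'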
I am trying to use ~/.ollama/models/, even linked it to ~/models. I don’t have phi-3, so it may be that none of my models are supported. It acts as though it sees nothing.
How do I know for sure it is checking ~/.ollama/models/ (if linking isn’t the right approach)?
I didn't have that path set to autodiscover; pull the newest version - this is fixed now!
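(Context for anyone hitting the same thing: Ollama doesn't keep loose .gguf files - it stores weights as content-addressed blobs plus manifests - so a plain *.gguf scan of ~/.ollama/models comes up empty even when models are installed. The directory names below reflect recent Ollama versions:)

  # Ollama's on-disk layout: manifests map model names to blobs,
  # and the blobs (sha256-*) are the actual GGUF weight data.
  ls ~/.ollama/models/manifests/
  ls ~/.ollama/models/blobs/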
Nice, a Rust tool wrapping llama.cpp.
how does it differ from llama-server?
and from llama-swap?
Shimmy is designed to be "invisible infrastructure" - the simplest possible way to get local inference working with your existing AI tools. llama-server gives you more control, llama-swap gives you multi-model management.

Key differences:

- Architecture: llama-swap = proxy + multiple servers, Shimmy = single server
- Resource usage: llama-swap runs multiple processes, Shimmy = one 50MB process
- Use case: llama-swap for managing many models, Shimmy for simplicity
Shimmy is for when you want the absolute minimum footprint - CI/CD pipelines, quick local testing, or systems where you can't install 680MB of dependencies.
Windows Defender tripped this for me, calling it out as Bearfoos trojan. Most likely a false positive, but jfyi.
Try cargo install or intentionally exclude it; unsigned Rust binaries will do this.
Looks cool, ty! Really great project, will try this out.