mdp2021 17 hours ago
Also see
# Smarter Local LLMs, Lower VRAM Costs – All Without Sacrificing Quality, Thanks to Google’s New [Quantization-Aware Training] "QAT" Optimization
https://www.hardware-corner.net/smarter-local-llm-lower-vram...
> According to Google, they’ve «reduced the perplexity drop by 54% (using llama.cpp perplexity evaluation) when quantizing down to Q4_0.»
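
To unpack what the quoted claim is measuring, here is a minimal NumPy sketch of Q4_0-style blockwise 4-bit quantization (one scale per 32-weight block, signed 4-bit levels), followed by the arithmetic behind "reduced the perplexity drop by 54%". It is an illustration only: the rounding scheme is a simplification of the real llama.cpp Q4_0 format, and the perplexity numbers are invented.

```python
import numpy as np

def quantize_q4_0_like(weights: np.ndarray, block_size: int = 32) -> np.ndarray:
    """Rough sketch of Q4_0-style quantization: one scale per block of 32
    weights, each weight rounded to a signed 4-bit level, then dequantized.
    (Not the exact llama.cpp kernel, which packs nibbles and stores fp16 scales.)"""
    flat = weights.astype(np.float32).ravel()
    deq = np.empty_like(flat)
    for start in range(0, flat.size, block_size):
        block = flat[start:start + block_size]
        max_abs = float(np.max(np.abs(block)))
        scale = max_abs / 7.0 if max_abs > 0 else 1.0  # signed 4-bit range ~ [-8, 7]
        q = np.clip(np.round(block / scale), -8, 7)    # quantize to 16 levels
        deq[start:start + block_size] = q * scale      # what int4 inference effectively sees
    return deq.reshape(weights.shape)

# Rounding error introduced by the 4-bit representation on random weights:
w = np.random.randn(4, 32).astype(np.float32)
print("mean abs rounding error:", float(np.mean(np.abs(w - quantize_q4_0_like(w)))))

# The quoted claim, in arithmetic, with made-up (hypothetical) perplexity numbers:
ppl_fp16   = 8.00                      # full-precision baseline perplexity
ppl_ptq_q4 = 8.60                      # naive post-training Q4_0 perplexity
drop_ptq   = ppl_ptq_q4 - ppl_fp16     # perplexity lost to quantization: 0.60
drop_qat   = drop_ptq * (1.0 - 0.54)   # "reduced the perplexity drop by 54%"
print(f"PTQ drop {drop_ptq:.2f} -> QAT drop {drop_qat:.2f} "
      f"(QAT Q4_0 perplexity ~ {ppl_fp16 + drop_qat:.2f})")
```

The 54% figure is relative to the perplexity gap that naive Q4_0 quantization opens up, so the absolute quality of the QAT release still depends on how large that gap was for the full-precision baseline.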

philipkglass 17 hours ago
Are there comparisons between int4 QAT versions of these models and the more common GGUF Q4_K_M quantizations generated post-training? The QAT models appear to be slightly larger:
https://ollama.com/library/gemma3/tags
I presume the QAT versions are better, but I don't see how much better.

mdp2021 17 hours ago
> I presume the QAT versions are better, but I don't see how much better
Not data for Google's Gemma specifically, but some numbers are here: https://aclanthology.org/2024.findings-acl.26/ ( https://aclanthology.org/2024.findings-acl.26.pdf )
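
Absent a published head-to-head, one way to get comparable numbers is to run llama.cpp's perplexity evaluation over the same text with both GGUF files. A minimal sketch, assuming a llama.cpp build that ships the llama-perplexity tool with -m (model) and -f (evaluation text) options; every path and model filename below is hypothetical.

```python
import subprocess
from pathlib import Path

# Hypothetical local paths; substitute your own GGUF files and evaluation text
# (e.g. the wikitext-2 test split saved as plain text).
LLAMA_PERPLEXITY = Path("./llama-perplexity")  # binary from a llama.cpp build
EVAL_TEXT = Path("wiki.test.raw")
MODELS = {
    "gemma3-qat-q4_0":   Path("gemma-3-12b-it-qat-q4_0.gguf"),   # hypothetical filename
    "gemma3-ptq-q4_k_m": Path("gemma-3-12b-it-Q4_K_M.gguf"),     # hypothetical filename
}

def run_perplexity(model_path: Path) -> str:
    """Invoke llama.cpp's perplexity tool and return its combined output."""
    result = subprocess.run(
        [str(LLAMA_PERPLEXITY), "-m", str(model_path), "-f", str(EVAL_TEXT)],
        capture_output=True, text=True, check=True,
    )
    return result.stdout + result.stderr

if __name__ == "__main__":
    for name, path in MODELS.items():
        print(f"=== {name} ===")
        # Print only lines mentioning the perplexity estimate; the exact log format
        # varies between llama.cpp versions.
        for line in run_perplexity(path).splitlines():
            if "PPL" in line or "perplexity" in line.lower():
                print(line)
```

Perplexity on a single corpus is only a rough proxy for quality, but it at least puts the QAT Q4_0 and post-training Q4_K_M files on the same scale.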