Running large language models locally used to mean either renting an A100 or watching your laptop melt. Modern LLM quantization changes the math: a 70B model that needed 140GB of VRAM at FP16 can now run on a pair of 24GB consumer GPUs with only a modest quality loss. But the moment you go to download a model, you hit an alphabet soup — GGUF, AWQ, GPTQ, EXL2, bitsandbytes — and the wrong choice will cost you 3x in throughput or a noticeable accuracy hit.
This guide cuts through the jargon. You will learn what each format actually does, when to pick which one, and the real-world numbers behind the marketing claims. By the end you will know exactly which file to grab from Hugging Face for your hardware in 2026.
What Is LLM Quantization (and Why It Matters in 2026)
Quantization compresses a model’s weights from 16-bit floats down to 8-, 4-, or even 2-bit integers. A Llama-3 70B model in FP16 weighs about 140GB. Quantized to 4-bit, the same model drops to roughly 35GB — small enough to run on two RTX 3090s or 4090s, or on a single 24GB card with partial CPU offload.
The trade-off is precision. Crush the weights too aggressively and the model starts hallucinating, repeating itself, or losing its grasp of code syntax. The art of modern LLM quantization is choosing what to compress hard and what to preserve. Each format below answers that question differently.
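A quick back-of-the-envelope check makes the arithmetic concrete. The sketch below is a weight-only estimate; it ignores the KV cache and runtime overhead, which add several gigabytes on top.

```python
def weight_memory_gb(n_params_billion: float, bits_per_weight: float) -> float:
    """Rough weight-only memory footprint: params * bits / 8, in GB.

    Ignores KV cache, activations, and runtime overhead.
    """
    bytes_total = n_params_billion * 1e9 * bits_per_weight / 8
    return bytes_total / 1e9

# Llama-3 70B at FP16 vs 4-bit:
print(weight_memory_gb(70, 16))  # ~140 GB
print(weight_memory_gb(70, 4))   # ~35 GB
```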

GGUF: The Universal Default
GGUF (GPT-Generated Unified Format) is the file format used by llama.cpp and downstream tools like Ollama, LM Studio, and Jan. It is a single self-contained file that bundles weights, tokenizer, and metadata.
- Best for: CPU inference, Apple Silicon (Metal), mixed CPU+GPU offload, hobbyists.
- Quality at 4-bit (Q4_K_M): Roughly 92% of FP16 quality on standard benchmarks.
- Speed: Excellent on Apple M-series chips and CPU-heavy rigs; slower than AWQ or GPTQ on pure CUDA.
- Quantization time: 5–15 minutes for a 7B model.
GGUF wins on portability. The same file runs on a MacBook, a gaming PC, and a Linux server with zero conversion. If you are using Ollama or LM Studio, you are using GGUF whether you realize it or not. Look for tags like Q4_K_M (balanced), Q5_K_M (higher quality, larger file), or Q8_0 (near lossless, biggest file).
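If you want to load a GGUF file programmatically rather than through Ollama, the llama-cpp-python bindings are the usual route. Here is a minimal sketch; the model path is a placeholder, and the flags shown are common defaults rather than the only sensible values.

```python
from llama_cpp import Llama

# Load a local GGUF file. n_gpu_layers=-1 offloads every layer to the GPU
# (or Metal on Apple Silicon); 0 keeps everything on the CPU.
llm = Llama(
    model_path="./llama-3-8b-instruct.Q4_K_M.gguf",  # placeholder path
    n_ctx=8192,       # context window
    n_gpu_layers=-1,
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Explain GGUF in one sentence."}]
)
print(out["choices"][0]["message"]["content"])
```

The same script runs unchanged on a Mac and a CUDA box; only the backend llama.cpp was compiled against differs.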
GPTQ: The Mature GPU Workhorse
GPTQ was the first method to push LLMs to 4-bit while keeping them coherent. It uses second-order Hessian information to figure out which weights tolerate aggressive rounding and which need protection. Three years on, GPTQ has the largest ecosystem of pre-quantized models on Hugging Face.
- Best for: NVIDIA GPU inference where throughput matters and you want a model that “just works” in vLLM, TGI, or Transformers.
- Quality at 4-bit: Around 90–91% of FP16 — slightly behind AWQ on reasoning tasks.
- Speed: Up to 5x faster than GGUF on a pure GPU pipeline with the Marlin kernel.
- Quantization time: 2–4 hours for a 7B model on an A100.
GPTQ is the safe pick when you need predictable behavior on a single NVIDIA card and don’t want to chase the latest kernel optimizations. It pairs especially well with high-throughput serving stacks like vLLM.
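Serving a pre-quantized GPTQ checkpoint in vLLM takes only a few lines. A minimal sketch follows; the model ID is illustrative, and vLLM can usually auto-detect the quantization from the checkpoint config, so the explicit flag is mostly documentation.

```python
from vllm import LLM, SamplingParams

# vLLM reads the quantization method from the checkpoint's config;
# passing quantization="gptq" just makes the choice explicit.
llm = LLM(model="TheBloke/Llama-2-7B-GPTQ", quantization="gptq")

params = SamplingParams(temperature=0.7, max_tokens=128)
outputs = llm.generate(["Summarize GPTQ in two sentences."], params)
print(outputs[0].outputs[0].text)
```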
AWQ: The Quality-First Choice
AWQ (Activation-aware Weight Quantization) takes a different angle. Instead of analyzing weights in isolation, it watches which weights actually fire during inference on calibration data. The roughly 1% of weights that matter most get protected; the rest are aggressively crushed to INT4.
- Best for: Production NVIDIA inference where output quality is non-negotiable.
- Quality at 4-bit: Roughly 95% of FP16 — the strongest of the popular 4-bit formats.
- Speed: Up to 1.6x faster than non-Marlin kernels; vLLM benchmarks report around 741 tokens/sec for a 7B model.
- Quantization time: 10–30 minutes for a 7B model, 1–3 hours for 70B.
AWQ has become the default choice for teams shipping coding assistants, RAG systems, and anything where a small accuracy regression shows up immediately in user complaints. The faster quantization time is a real bonus when you are iterating on fine-tunes.
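If you are producing your own AWQ checkpoints from a fine-tune, the AutoAWQ library is the common tooling. A rough sketch, assuming the usual 4-bit, group-size-128 config; the model ID and output paths are placeholders.

```python
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "meta-llama/Meta-Llama-3-8B-Instruct"  # illustrative model ID
quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}

model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)

# Calibration runs a small activation pass to find the ~1% of salient
# weights that get protected before the rest are crushed to INT4.
model.quantize(tokenizer, quant_config=quant_config)

model.save_quantized("llama-3-8b-awq")
tokenizer.save_pretrained("llama-3-8b-awq")
```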
EXL2 and the Honorable Mentions
EXL2 is the format used by ExLlamaV2, a CUDA-optimized runtime popular with power users. Its trick is mixed-precision per layer: critical layers get more bits, everything else gets less. The result is better quality at a given file size than uniform quantization.
If you want maximum tokens per second on a single NVIDIA card and don’t mind a smaller model selection, EXL2 is hard to beat. bitsandbytes, meanwhile, is the only option here that supports training (it powers QLoRA fine-tuning) but gives up some inference speed in exchange.
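For the fine-tuning path, loading the base model in 4-bit NF4 via bitsandbytes is the first step of the standard QLoRA recipe. A minimal sketch using the transformers integration (the model ID is illustrative):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# NF4 storage with bf16 compute and double quantization is the
# standard QLoRA loading recipe.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B",  # illustrative model ID
    quantization_config=bnb_config,
    device_map="auto",
)
```

From here you would attach LoRA adapters with peft and train; the base weights stay frozen in 4-bit while the adapters train in higher precision.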
Head-to-Head: Which LLM Quantization Format Should You Pick?
Here is the short version that actually matters when you are staring at a Hugging Face model page:
- Apple Silicon Mac, CPU, or hobbyist setup: Pick GGUF Q4_K_M. It just works.
- Single NVIDIA GPU, quality matters most: Pick AWQ.
- NVIDIA GPU, broadest model selection, predictable behavior: Pick GPTQ.
- Maximum tokens/sec on a single NVIDIA card: Pick EXL2.
- You need to fine-tune, not just infer: Pick bitsandbytes with QLoRA.
For most production teams in 2026, the practical answer is AWQ on vLLM. For most individual developers, it is GGUF on Ollama. Everyone else is optimizing for an edge case — which is fine, but make sure you have actually measured the bottleneck first.
Real-World Quality Trade-offs
Benchmarks are great, but the differences between 4-bit formats often disappear in everyday use. A few patterns worth knowing:
- Code generation is the most sensitive workload — this is where AWQ’s edge over GPTQ shows up clearly.
- Long-context retrieval degrades faster under aggressive quantization. Bump up to 5-bit or 6-bit if you are running a 32k+ context RAG pipeline.
- Multilingual tasks can lose more quality than English-only ones because tokenizers and rare-token weights get hit harder.
If you are deploying to production, run your own evals on a held-out task set before committing. The official Hugging Face quantization docs are a solid starting point for setup.
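That eval does not need to be elaborate to be useful. Here is a bare-bones sketch of the idea; `generate` stands in for whatever inference call you actually use (vLLM, llama-cpp-python, transformers), and the one-item eval set is obviously a placeholder.

```python
def exact_match_rate(generate, eval_set):
    """Score a model on (prompt, reference) pairs with exact match.

    `generate` is assumed to be a callable taking a prompt string and
    returning the model's answer as a string.
    """
    hits = 0
    for prompt, reference in eval_set:
        answer = generate(prompt).strip().lower()
        hits += answer == reference.strip().lower()
    return hits / len(eval_set)

eval_set = [("What is 2 + 2?", "4")]  # replace with your real held-out tasks
# print(exact_match_rate(quantized_generate, eval_set))
# print(exact_match_rate(fp16_generate, eval_set))
```

Run the same set against the FP16 reference and the quantized candidate; the delta between the two numbers is what you actually care about.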

FAQ
Is GGUF slower than GPTQ on GPU?
Yes, on pure NVIDIA GPU inference GGUF is typically 2–5x slower than GPTQ or AWQ with optimized kernels. GGUF’s strength is portability and CPU/Apple Silicon support, not raw GPU throughput.
Can I fine-tune an AWQ or GPTQ model?
Not directly. AWQ and GPTQ are inference-only formats. To fine-tune in 4-bit, use bitsandbytes with QLoRA, then merge and re-quantize the result to AWQ or GPTQ for production serving.
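The merge step is where people trip up: reload the base model at full precision, fold in the adapter, and save merged weights before re-quantizing. A sketch using peft, where the model ID and adapter path are placeholders:

```python
import torch
from transformers import AutoModelForCausalLM
from peft import PeftModel

# Reload the base model in fp16 (not 4-bit) so the merge produces
# full-precision weights you can re-quantize afterwards.
base = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B", torch_dtype=torch.float16  # illustrative ID
)
model = PeftModel.from_pretrained(base, "./qlora-adapter")  # placeholder path
merged = model.merge_and_unload()  # folds the LoRA deltas into the base weights
merged.save_pretrained("./merged-fp16")  # re-quantize to AWQ/GPTQ from here
```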
What does Q4_K_M mean in GGUF?
Q4_K_M is a 4-bit quantization variant that uses a mix of 4-bit and 6-bit blocks for the most sensitive weights. It is the most popular GGUF preset because it balances size and quality well. Q5_K_M is higher quality at the cost of file size; Q3 variants are smaller but visibly degraded.
Does quantization work for vision-language models?
Yes, but the vision encoder is more sensitive to quantization than the language model. Most VLM quantizations keep the vision tower in FP16 and quantize only the LLM portion. Look for model cards that specify which components are quantized before downloading.
Conclusion: Pick Your LLM Quantization Format and Move On
The right LLM quantization format is the one that matches your hardware and runtime — not the one with the highest benchmark on a leaderboard. GGUF for portability, AWQ for production quality, GPTQ for ecosystem maturity, EXL2 for raw speed. Pick one, ship it, and revisit only if you can measure a real bottleneck.
Want more practical AI engineering guides? Browse our Technology section for deep dives on vLLM, RAG architectures, agent frameworks, and more.

