LLAMA-QUANTIZE(1)                 User Commands                LLAMA-QUANTIZE(1)
NAME
       llama-quantize - quantize GGUF model files to lower-precision formats

DESCRIPTION
       usage: llama-quantize [--help] [--allow-requantize]
              [--leave-output-tensor] [--pure] [--imatrix] [--include-weights]
              [--exclude-weights] [--output-tensor-type]
              [--token-embedding-type] [--tensor-type] [--tensor-type-file]
              [--prune-layers] [--keep-split] [--override-kv] [--dry-run]
              model-f32.gguf [model-quant.gguf] type [nthreads]
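
       For example, a typical invocation (file names here are hypothetical)
       quantizes an f32 GGUF model to Q4_K_M using 8 threads:

              llama-quantize model-f32.gguf model-q4_k_m.gguf Q4_K_M 8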

       --allow-requantize
              allow requantizing tensors that have already been quantized.
              WARNING: this can severely reduce quality compared to
              quantizing from 16-bit or 32-bit!

       --leave-output-tensor
              leave output.weight un(re)quantized. This increases model size
              but may also increase quality, especially when requantizing.

       --pure
              disable k-quant mixtures and quantize all tensors to the same
              type

       --imatrix file_name
              use data in file_name as importance matrix for quant
              optimizations
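
              example (imatrix.dat is a hypothetical file name; importance
              matrices are typically produced with the separate
              llama-imatrix tool):

                     llama-quantize --imatrix imatrix.dat model-f32.gguf IQ2_M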

       --include-weights tensor_name
              use the importance matrix for this/these tensor(s)

       --exclude-weights tensor_name
              do not use the importance matrix for this/these tensor(s)

       --output-tensor-type ggml_type
              use this ggml_type for the output.weight tensor

       --token-embedding-type ggml_type
              use this ggml_type for the token embeddings tensor
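
              example (a common pattern, not a requirement: keep the
              embeddings and output tensors at higher precision alongside a
              low-bit base type):

                     llama-quantize --token-embedding-type q8_0 --output-tensor-type q8_0 model-f32.gguf IQ2_XS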

       --tensor-type tensor_name=ggml_type
              quantize this tensor to this ggml_type. This is an advanced
              option to selectively quantize tensors; it may be specified
              multiple times. example: --tensor-type attn_q=q8_0

       --tensor-type-file tensor_types.txt
              read a list of tensors to quantize to specific ggml_types.
              This is an advanced option for selectively quantizing a long
              list of tensors; the file uses the same tensor_name=ggml_type
              format as --tensor-type, with entries separated by spaces or
              newlines.
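
              example (hypothetical tensor_types.txt contents):

                     attn_q=q8_0 attn_k=q8_0
                     ffn_down=q6_K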

       --prune-layers L0,L1,L2...
              comma-separated list of layer numbers to prune from the model.
              WARNING: this is an advanced option, use with care.
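
              example (layer numbers are illustrative):

                     llama-quantize --prune-layers 20,21,22 model-f32.gguf Q4_K_M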

       --keep-split
              generate the quantized model in the same shards as the input

       --override-kv KEY=TYPE:VALUE
              override model metadata by key in the quantized model. May be
              specified multiple times. WARNING: this is an advanced option,
              use with care.
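
              example (the key shown is illustrative; supported TYPE values
              include int, float, bool and str):

                     llama-quantize --override-kv tokenizer.ggml.add_bos_token=bool:false model-f32.gguf Q8_0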

       --dry-run
              calculate and show the final quantization size without
              performing the quantization. example: llama-quantize --dry-run
              model-f32.gguf Q4_K

       note: --include-weights and --exclude-weights cannot be used together

   Allowed quantization types:
             40  or  Q1_0       :  1.125 bpw quantization
              2  or  Q4_0       :  4.34G, +0.4685 ppl @ Llama-3-8B
              3  or  Q4_1       :  4.78G, +0.4511 ppl @ Llama-3-8B
             38  or  MXFP4_MOE  :  MXFP4 MoE
              8  or  Q5_0       :  5.21G, +0.1316 ppl @ Llama-3-8B
              9  or  Q5_1       :  5.65G, +0.1062 ppl @ Llama-3-8B
             19  or  IQ2_XXS    :  2.06 bpw quantization
             20  or  IQ2_XS     :  2.31 bpw quantization
             28  or  IQ2_S      :  2.5 bpw quantization
             29  or  IQ2_M      :  2.7 bpw quantization
             24  or  IQ1_S      :  1.56 bpw quantization
             31  or  IQ1_M      :  1.75 bpw quantization
             36  or  TQ1_0      :  1.69 bpw ternarization
             37  or  TQ2_0      :  2.06 bpw ternarization
             10  or  Q2_K       :  2.96G, +3.5199 ppl @ Llama-3-8B
             21  or  Q2_K_S     :  2.96G, +3.1836 ppl @ Llama-3-8B
             23  or  IQ3_XXS    :  3.06 bpw quantization
             26  or  IQ3_S      :  3.44 bpw quantization
             27  or  IQ3_M      :  3.66 bpw quantization mix
             12  or  Q3_K       :  alias for Q3_K_M
             22  or  IQ3_XS     :  3.3 bpw quantization
             11  or  Q3_K_S     :  3.41G, +1.6321 ppl @ Llama-3-8B
             12  or  Q3_K_M     :  3.74G, +0.6569 ppl @ Llama-3-8B
             13  or  Q3_K_L     :  4.03G, +0.5562 ppl @ Llama-3-8B
             25  or  IQ4_NL     :  4.50 bpw non-linear quantization
             30  or  IQ4_XS     :  4.25 bpw non-linear quantization
             15  or  Q4_K       :  alias for Q4_K_M
             14  or  Q4_K_S     :  4.37G, +0.2689 ppl @ Llama-3-8B
             15  or  Q4_K_M     :  4.58G, +0.1754 ppl @ Llama-3-8B
             17  or  Q5_K       :  alias for Q5_K_M
             16  or  Q5_K_S     :  5.21G, +0.1049 ppl @ Llama-3-8B
             17  or  Q5_K_M     :  5.33G, +0.0569 ppl @ Llama-3-8B
             18  or  Q6_K       :  6.14G, +0.0217 ppl @ Llama-3-8B
              7  or  Q8_0       :  7.96G, +0.0026 ppl @ Llama-3-8B
              1  or  F16        :  14.00G, +0.0020 ppl @ Mistral-7B
             32  or  BF16       :  14.00G, -0.0050 ppl @ Mistral-7B
              0  or  F32        :  26.00G @ 7B
                     COPY       :  only copy tensors, no quantizing
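
       As a rough guide (an approximation, not an exact rule), multiplying
       bits per weight by the parameter count estimates the quantized tensor
       data size: an 8B-parameter model at 2.31 bpw (IQ2_XS) comes to about
       8e9 * 2.31 / 8 = 2.31e9 bytes, i.e. roughly 2.3 GB, before metadata
       and any tensors kept at higher precision.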

debian                             May 2026                    LLAMA-QUANTIZE(1)