Quantization: 1.58-bit for all decoder layers; 4-bit for embedding and lm_head
Model Bit-Widths
| Mixed-Precision Recipe | Bit-Width | This Repo |
|---|---|---|
| 100% 4-bit + 0% 1.58-bit | 4 | |
| 50% 4-bit + 50% 1.58-bit | 2.79 | |
| 12.5% 4-bit + 87.5% 1.58-bit | 1.88 | |
| 0% 4-bit + 100% 1.58-bit | 1.58 | ✓ |
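The bit-widths listed for each recipe are consistent with a weighted average of the two precisions. The sketch below reproduces them under that assumption; `average_bit_width` and its parameters are hypothetical names introduced for illustration, not part of EdgeRazor.

```python
def average_bit_width(frac_4bit: float, high_bits: float = 4.0, low_bits: float = 1.58) -> float:
    """Weighted mean bit-width for a mix of 4-bit and 1.58-bit weights.

    Assumption: the recipe percentages weight each precision proportionally.
    """
    if not 0.0 <= frac_4bit <= 1.0:
        raise ValueError("frac_4bit must be between 0 and 1")
    return frac_4bit * high_bits + (1.0 - frac_4bit) * low_bits


# Fractions of 4-bit weights for the four recipes listed in Model Bit-Widths.
recipes = [1.0, 0.5, 0.125, 0.0]
bit_widths = [round(average_bit_width(f), 2) for f in recipes]
print(bit_widths)  # [4.0, 2.79, 1.88, 1.58]
```

For example, the 12.5% / 87.5% recipe gives 0.125 × 4 + 0.875 × 1.58 = 1.8825, which rounds to the listed 1.88.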
Model Performance
| Models | W-A-KV | ARC-e | ARC-c | HellaS. | BoolQ | PIQA | WinoG. | SIQA | OBQA | Tr.QA2 | Ethics | MMLU | IFEval | GSM8K | HumanE. | Average (↑) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Qwen3-0.6B | 16-16-16 | 56.02 | 34.04 | 47.23 | 64.04 | 67.36 | 56.04 | 39.20 | 31.20 | 42.84 | 47.70 | 40.12 | 58.41 | 41.54 | 37.20 | 47.35 |
| EdgeRazor | 4-16-16 | 58.54 | 33.45 | 45.04 | 68.01 | 68.34 | 55.72 | 40.07 | 33.40 | 43.69 | 54.36 | 39.37 | 53.42 | 42.00 | 34.15 | 47.83 |
| EdgeRazor | 2.79-16-16 | 51.77 | 28.33 | 37.47 | 70.70 | 63.71 | 54.06 | 40.33 | 28.20 | 42.72 | 55.08 | 36.85 | 51.39 | 26.69 | 31.10 | 44.17 |
| EdgeRazor | 1.88-16-16 | 51.22 | 27.73 | 34.21 | 66.91 | 63.66 | 53.35 | 38.43 | 27.60 | 43.80 | 55.92 | 28.78 | 42.51 | 25.09 | 23.17 | 41.60 |
| EdgeRazor | 1.58-16-16 | 45.75 | 25.77 | 33.89 | 66.64 | 60.72 | 52.33 | 38.23 | 29.80 | 44.40 | 51.70 | 32.85 | 37.34 | 14.25 | 23.17 | 39.77 |
| EdgeRazor | 4-8-8 | 57.79 | 33.70 | 45.00 | 67.49 | 67.85 | 55.88 | 40.17 | 33.80 | 43.53 | 54.09 | 39.73 | 53.42 | 42.00 | 34.76 | 47.80 |
| EdgeRazor | 2.79-8-8 | 52.10 | 28.50 | 37.36 | 70.58 | 63.92 | 53.12 | 40.12 | 28.60 | 42.82 | 54.97 | 36.44 | 49.54 | 26.99 | 32.32 | 44.10 |
| EdgeRazor | 1.88-8-8 | 51.47 | 27.99 | 34.22 | 66.85 | 63.49 | 53.04 | 38.02 | 27.40 | 43.88 | 55.92 | 29.56 | 44.55 | 25.09 | 23.17 | 41.76 |
| EdgeRazor | 1.58-8-8 | 44.87 | 26.11 | 33.88 | 66.73 | 60.55 | 51.30 | 38.28 | 31.00 | 44.72 | 50.76 | 33.09 | 38.45 | 15.01 | 22.56 | 39.81 |
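The Average column appears to be the unweighted mean of the 14 benchmark scores in each row. The sketch below checks this for the Qwen3-0.6B 16-16-16 baseline and the EdgeRazor 1.58-16-16 row, then derives how much of the baseline average the 1.58-bit model retains. The scores are copied from the Model Performance results; the variable and function names are illustrative only.

```python
from statistics import fmean

# Scores in column order: ARC-e, ARC-c, HellaS., BoolQ, PIQA, WinoG., SIQA,
# OBQA, Tr.QA2, Ethics, MMLU, IFEval, GSM8K, HumanE.
baseline_16_16_16 = [56.02, 34.04, 47.23, 64.04, 67.36, 56.04, 39.20,
                     31.20, 42.84, 47.70, 40.12, 58.41, 41.54, 37.20]
edgerazor_1p58_16_16 = [45.75, 25.77, 33.89, 66.64, 60.72, 52.33, 38.23,
                        29.80, 44.40, 51.70, 32.85, 37.34, 14.25, 23.17]


def table_average(scores):
    """Unweighted mean rounded to two decimals, as reported in the table."""
    return round(fmean(scores), 2)


baseline_avg = table_average(baseline_16_16_16)
quantized_avg = table_average(edgerazor_1p58_16_16)
retention = quantized_avg / baseline_avg  # fraction of the baseline average kept

print(baseline_avg, quantized_avg)  # 47.35 39.77
print(f"{retention:.1%}")           # 84.0%
```

Note that the average hides large per-task differences: BoolQ and Ethics improve over the baseline at 1.58 bits, while GSM8K drops from 41.54 to 14.25.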
Quickstart
Install EdgeRazor beforehand if you want weight-activation quantization. The provided weights are already quantized and stored as quantized_weights*scaling_bf16; to enable activation and KV cache quantization, pass trust_remote_code=True when loading the model.
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "zhangsq-nju/Qwen3-0.6B-EdgeRazor-1.58bit"

# load the tokenizer and the model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto",
    trust_remote_code=True,  # enables activation and KV cache quantization
)

# prepare the model input
prompt = "Give me a short introduction to large language models."
messages = [
    {"role": "user", "content": prompt}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=False,  # EdgeRazor-nbit models are trained in instruct mode only.
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

# conduct text completion
generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=32768,
)
# keep only the newly generated tokens (drop the prompt)
output_ids = generated_ids[0][len(model_inputs.input_ids[0]):].tolist()

# parse thinking content
try:
    # rindex: find the last occurrence of token 151668 (</think>)
    index = len(output_ids) - output_ids[::-1].index(151668)
except ValueError:
    # no </think> token in the output, as expected with enable_thinking=False
    index = 0

thinking_content = tokenizer.decode(output_ids[:index], skip_special_tokens=True).strip("\n")
content = tokenizer.decode(output_ids[index:], skip_special_tokens=True).strip("\n")

print("thinking content:", thinking_content)
print("content:", content)
```
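The note that the released weights are stored as quantized_weights*scaling_bf16 can be illustrated with a toy example. This is a minimal sketch under two assumptions that this page does not confirm: that "1.58-bit" means ternary weights in {-1, 0, +1} (the usual meaning, since log2(3) ≈ 1.585), and that a single per-tensor scale is computed as the mean absolute weight, in the style of absmean ternary quantization. EdgeRazor's actual quantizer, scale granularity, and storage layout may differ; `ternary_quantize` and `dequantize` are hypothetical helpers.

```python
import math


def ternary_quantize(weights):
    """Map floats to {-1, 0, +1} with one shared scale (absmean assumption)."""
    scale = sum(abs(w) for w in weights) / len(weights)
    if scale == 0.0:
        return [0] * len(weights), 0.0
    # Round to the nearest integer, then clip into the ternary range.
    q = [max(-1, min(1, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Effective weight = quantized value * scale."""
    return [qi * scale for qi in q]


weights = [0.9, -0.05, -1.2, 0.3]
q, scale = ternary_quantize(weights)
restored = dequantize(q, scale)

print(q)              # [1, 0, -1, 0]
print(restored)       # two weights snap to zero, the others to +/- scale
print(math.log2(3))   # bits needed to encode three states, about 1.585
```

Each stored value carries only its sign or zero, and the shared scale restores the magnitude, which is why the product form can be loaded as ordinary bf16 weights.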
Citation
If you find our project useful in your research, please consider citing our paper:
```text
@article{zhangsh-edgerazor,
  title={{EdgeRazor}: A Lightweight Framework for Large Language Models via Mixed-Precision Quantization-Aware Distillation},
  author={Shu-Hao Zhang and Le-Tong Huang and Xiang-Sheng Deng and Xin-Yi Zou and Chen Wu and Nan Li and Shao-Qun Zhang},
  year={2026},
}
```