qwen2-72b-instruct

by qwen
Model ID: hf-model--qwen--qwen2-72b-instruct
FNI 7.5
Top 85%

"Qwen2 is the new series of Qwen large language models. For Qwen2, we release a number of base language models and instruction-tuned language models ranging from 0.5 to 72 billion parameters, including a Mixture-of-Experts model. This repo con......"

Audited 7.5 FNI Score
72.71B Params
4k Context
22.8K Downloads
H100+ ~57GB Est. VRAM

Quick Commands

🦙 Ollama Run
ollama run qwen2-72b-instruct
🤗 HF Download
huggingface-cli download qwen/qwen2-72b-instruct
📦 Install Lib
pip install -U transformers

Engineering Specs

Hardware

Parameters
72.71B
Architecture
Qwen2ForCausalLM
Context Length
4K
Model Size
135.4GB

🧠 Lifecycle

Library
-
Precision
float16
Tokenizer
-

🌐 Identity

Source
HuggingFace
License
Open Access

Est. VRAM Benchmark

~57GB

* Technical estimation for FP16/Q4 weights. Does not include OS overhead or long-context batching. For Technical Reference Only.

Hardware Compatibility

Multi-Tier Validation Matrix

Tier          GPU                   VRAM    Status
Entry         RTX 3060 / 4060 Ti    8GB     Compatible
Mid           RTX 4070 Super        12GB    Compatible
High          RTX 4080 / Mac M3     16GB    Compatible
Pro           RTX 3090 / 4090       24GB    Compatible
Workstation   RTX 6000 Ada          48GB    Compatible
Datacenter    A100 / H100           80GB    Compatible

Pro Tip: Compatibility is estimated for 4-bit quantization (Q4). High-precision (FP16) or ultra-long context windows will significantly increase VRAM requirements.

README

Qwen2-72B-Instruct

Introduction

Qwen2 is the new series of Qwen large language models. For Qwen2, we release a number of base language models and instruction-tuned language models ranging from 0.5 to 72 billion parameters, including a Mixture-of-Experts model. This repo contains the instruction-tuned 72B Qwen2 model.

Compared with state-of-the-art open-source language models, including the previously released Qwen1.5, Qwen2 has generally surpassed most open-source models and demonstrated competitiveness against proprietary models across a series of benchmarks targeting language understanding, language generation, multilingual capability, coding, mathematics, reasoning, etc.

Qwen2-72B-Instruct supports a context length of up to 131,072 tokens, enabling the processing of extensive inputs. See the Processing Long Texts section below for detailed instructions on deploying Qwen2 for long inputs.

For more details, please refer to our blog, GitHub, and Documentation.

Model Details

Qwen2 is a language model series that includes decoder language models of different sizes. For each size, we release the base language model and the aligned chat model. The models are based on the Transformer architecture with SwiGLU activation, attention QKV bias, grouped-query attention, etc. Additionally, we have an improved tokenizer adapted to multiple natural languages and code.

Training details

We pretrained the models with a large amount of data, and we post-trained the models with both supervised finetuning and direct preference optimization.

Requirements

Qwen2 support is included in recent releases of Hugging Face transformers. We advise installing transformers>=4.37.0; with older versions you may encounter the following error:

KeyError: 'qwen2'
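
A quick check of the installed version before loading the model can be sketched as follows. The helper names `meets_minimum` and `check_transformers` are hypothetical, and the parser compares only the leading numeric components of each version string. That is an assumption that suffices for versions such as 4.37.0 but is not a full PEP 440 implementation (a pre-release such as 4.37.0rc1 would be treated as 4.37.0).

```python
import re
from importlib import metadata


def meets_minimum(installed: str, minimum: str = "4.37.0") -> bool:
    """Compare the leading numeric parts of two version strings."""
    def numeric_parts(version: str) -> tuple:
        parts = []
        for piece in version.split(".")[:3]:
            match = re.match(r"\d+", piece)
            if match is None:
                break
            parts.append(int(match.group()))
        # Pad so that "4.37" compares equal to "4.37.0".
        while len(parts) < 3:
            parts.append(0)
        return tuple(parts)

    return numeric_parts(installed) >= numeric_parts(minimum)


def check_transformers() -> str:
    """Return a short status message about the installed transformers package."""
    try:
        installed = metadata.version("transformers")
    except metadata.PackageNotFoundError:
        return "transformers is not installed"
    if meets_minimum(installed):
        return f"transformers {installed} is recent enough for Qwen2"
    return f"transformers {installed} is older than 4.37.0; upgrade to avoid KeyError: 'qwen2'"
```

Calling `print(check_transformers())` in the target environment reports which of the three cases applies.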

Quickstart

The following snippet uses apply_chat_template to show how to load the tokenizer and model and how to generate content.

from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the model; device_map="auto" places the weights on the available device(s)
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2-72B-Instruct",
    torch_dtype="auto",
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2-72B-Instruct")

prompt = "Give me a short introduction to large language model."
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": prompt}
]
# Render the conversation with the chat template, ending with the assistant turn marker
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

# Passing the full encoding supplies the attention mask along with the input IDs
generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=512
)
# generate() returns prompt + continuation; keep only the newly generated tokens
generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]

response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response)
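
The list comprehension near the end of the snippet is needed because, for decoder-only models, `generate` returns each prompt followed by its continuation. A minimal sketch of that trimming step on plain Python lists, using made-up token IDs and a hypothetical helper name `strip_prompt_tokens`:

```python
def strip_prompt_tokens(input_ids, output_ids):
    """Drop the echoed prompt from each generated sequence."""
    return [out[len(inp):] for inp, out in zip(input_ids, output_ids)]


# Made-up token IDs: the prompt is [101, 7, 8]; the model appended [42, 43].
prompt_batch = [[101, 7, 8]]
generated_batch = [[101, 7, 8, 42, 43]]
new_tokens = strip_prompt_tokens(prompt_batch, generated_batch)
print(new_tokens)  # [[42, 43]]
```

Only the trimmed IDs are passed to `batch_decode`, so the decoded response does not repeat the prompt.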

Processing Long Texts

To handle extensive inputs exceeding 32,768 tokens, we utilize YaRN, a technique for enhancing model length extrapolation, ensuring optimal performance on lengthy texts.

For deployment, we recommend using vLLM. You can enable the long-context capabilities by following these steps:

  1. Install vLLM: You can install vLLM by running the following command.
pip install "vllm>=0.4.3"

Or you can install vLLM from source.

  2. Configure Model Settings: After downloading the model weights, modify the config.json file by adding the snippet below:

        {
            "architectures": [
                "Qwen2ForCausalLM"
            ],
            // ...
            "vocab_size": 152064,

            // add the following snippet
            "rope_scaling": {
                "factor": 4.0,
                "original_max_position_embeddings": 32768,
                "type": "yarn"
            }
        }

    This snippet enables YaRN to support longer contexts.

  3. Model Deployment: Use vLLM to deploy your model. For instance, you can set up an OpenAI-compatible server with the command:

    python -m vllm.entrypoints.openai.api_server --served-model-name Qwen2-72B-Instruct --model path/to/weights

    Then you can access the Chat API by:

    curl http://localhost:8000/v1/chat/completions \
        -H "Content-Type: application/json" \
        -d '{
        "model": "Qwen2-72B-Instruct",
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "Your Long Input Here."}
        ]
        }'

    For further usage instructions for vLLM, please refer to our GitHub.
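
The curl call in the deployment step can also be issued from Python. The sketch below uses only the standard library and assumes the server from that step is listening on localhost:8000. The helper names `build_chat_request` and `chat` are hypothetical, and the `choices[0].message.content` path assumes the response follows the OpenAI chat-completions format.

```python
import json
import urllib.request


def build_chat_request(user_text, model="Qwen2-72B-Instruct",
                       base_url="http://localhost:8000"):
    """Build a POST request equivalent to the curl example."""
    payload = {
        "model": model,
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": user_text},
        ],
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )


def chat(user_text, **kwargs):
    """Send the request and return the assistant's reply text (needs a running server)."""
    with urllib.request.urlopen(build_chat_request(user_text, **kwargs)) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

With the server running, `print(chat("Your Long Input Here."))` would print the model's reply. The `model` argument must match the value given to `--served-model-name`.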

Note: Presently, vLLM only supports static YaRN, which means the scaling factor remains constant regardless of input length, potentially impacting performance on shorter texts. We advise adding the rope_scaling configuration only when processing long contexts is required.
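
Because the rope_scaling block is recommended only for long-context workloads, it can be convenient to add and remove it with a script rather than by hand. The following is a minimal sketch under stated assumptions: the helper names `enable_yarn` and `disable_yarn` are hypothetical, and the config.json on disk is assumed to be plain JSON without comments. The returned value multiplies the scaling factor by the original window, 4.0 × 32,768 = 131,072, which matches the context length quoted in the introduction.

```python
import json
from pathlib import Path

# Values copied from the config snippet in the steps above.
YARN_SCALING = {
    "factor": 4.0,
    "original_max_position_embeddings": 32768,
    "type": "yarn",
}


def enable_yarn(config_path):
    """Add rope_scaling to config.json and return the extended context length."""
    path = Path(config_path)
    config = json.loads(path.read_text(encoding="utf-8"))
    config["rope_scaling"] = dict(YARN_SCALING)
    path.write_text(json.dumps(config, indent=2) + "\n", encoding="utf-8")
    return int(YARN_SCALING["factor"] * YARN_SCALING["original_max_position_embeddings"])


def disable_yarn(config_path):
    """Remove rope_scaling again, e.g. before serving short-context workloads."""
    path = Path(config_path)
    config = json.loads(path.read_text(encoding="utf-8"))
    config.pop("rope_scaling", None)
    path.write_text(json.dumps(config, indent=2) + "\n", encoding="utf-8")
```

For example, `enable_yarn("path/to/weights/config.json")` would patch the downloaded config in place. Keeping a backup copy of the original file is advisable.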

Evaluation

We briefly compare Qwen2-72B-Instruct with similar-sized instruction-tuned LLMs, including our previous Qwen1.5-72B-Chat. The results are shown as follows:

Datasets                     Llama-3-70B-Instruct  Qwen1.5-72B-Chat  Qwen2-72B-Instruct

English
MMLU                         82.0                  75.6              82.3
MMLU-Pro                     56.2                  51.7              64.4
GPQA                         41.9                  39.4              42.4
TheoremQA                    42.5                  28.8              44.4
MT-Bench                     8.95                  8.61              9.12
Arena-Hard                   41.1                  36.1              48.1
IFEval (Prompt Strict-Acc.)  77.3                  55.8              77.6

Coding
HumanEval                    81.7                  71.3              86.0
MBPP                         82.3                  71.9              80.2
MultiPL-E                    63.4                  48.1              69.2
EvalPlus                     75.2                  66.9              79.0
LiveCodeBench                29.3                  17.9              35.7

Mathematics
GSM8K                        93.0                  82.7              91.1
MATH                         50.4                  42.5              59.7

Chinese
C-Eval                       61.6                  76.1              83.8
AlignBench                   7.42                  7.28              8.27

Citation

If you find our work helpful, feel free to cite us.

@article{qwen2,
  title={Qwen2 Technical Report},
  year={2024}
}

📝 Limitations & Considerations

  • Benchmark scores may vary based on evaluation methodology and hardware configuration.
  • VRAM requirements are estimates; actual usage depends on quantization and batch size.
  • FNI scores are relative rankings and may change as new models are added.
  • License Unknown: Verify licensing terms before commercial use.
  • Source: HuggingFace

Cite this model

Academic & Research Attribution

BibTeX
@misc{hf_model__qwen__qwen2_72b_instruct,
  author = {qwen},
  title = {Qwen2-72B-Instruct},
  year = {2026},
  howpublished = {\url{https://huggingface.co/qwen/qwen2-72b-instruct}},
  note = {Accessed via Free2AITools Knowledge Fortress}
}
APA Style
qwen. (2026). Qwen2-72B-Instruct [Model]. Free2AITools. https://huggingface.co/qwen/qwen2-72b-instruct

AI Summary: Based on Hugging Face metadata. Not a recommendation.

🛡️ Model Transparency Report

Verified data manifest for traceability and transparency.

100% Data Disclosure Active

🆔 Identity & Source

id
hf-model--qwen--qwen2-72b-instruct
author
qwen
tags
transformers, safetensors, qwen2, text-generation, chat, conversational, en, arxiv:2309.00071, base_model:qwen/qwen2-72b, base_model:finetune:qwen/qwen2-72b, license:other, text-generation-inference, endpoints_compatible, deploy:azure, region:us

⚙️ Technical Specs

architecture
Qwen2ForCausalLM
params billions
72.71
context length
4,096
vram gb
57
vram is estimated
true
vram formula
VRAM ≈ (params * 0.75) + 2GB (KV) + 0.5GB (OS)
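
As a worked check of the formula above, the sketch below plugs in the listed parameter count. It assumes `params` is expressed in billions and the result in GB, which the page does not state explicitly; the function names are hypothetical. The second helper computes the size of the FP16 weights at 2 bytes per parameter, which appears consistent with the 135.4GB model size listed earlier if that figure is read as GiB.

```python
def estimate_vram_gb(params_billions, gb_per_billion=0.75, kv_cache_gb=2.0, os_gb=0.5):
    """Apply the page's formula: params * 0.75 + 2GB (KV) + 0.5GB (OS)."""
    return params_billions * gb_per_billion + kv_cache_gb + os_gb


def fp16_weights_gib(params_billions):
    """Size of the weights alone at 2 bytes per parameter, in GiB."""
    return params_billions * 1e9 * 2 / 2**30


print(round(estimate_vram_gb(72.71), 2))   # 57.03
print(round(fp16_weights_gib(72.71), 1))   # 135.4
```

The first result rounds to the 57 GB shown in the vram gb field. The gap between the two numbers reflects that the estimate assumes quantized weights rather than FP16.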

📊 Engagement & Metrics

likes
718
downloads
22,774

Free2AITools Constitutional Data Pipeline: Curated disclosure mode active. (V15.x Standard)