VRAM Requirements Guide

Understanding VRAM needs is essential for running LLMs locally.

VRAM (Video RAM) is the dedicated memory on your graphics card (GPU). For Large Language Models, VRAM is the primary bottleneck. If a model doesn’t fit in your VRAM, it will either run extremely slowly (using system RAM) or fail to load entirely.

Why VRAM Matters

Unlike standard software, LLMs need to keep their entire set of “weights” in memory to generate text quickly. A model’s size is determined by its parameter count (e.g., 7B, 70B) and its precision (e.g., 16-bit, 4-bit).

VRAM Calculation Formula

A rough rule of thumb:

VRAM (GB) ≈ (parameters in billions × bits per weight) / 8 × 1.2

Dividing by 8 converts bits to bytes, and the 1.2 factor accounts for overhead such as the KV cache that backs the context window.
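For readers who prefer code, here is a minimal Python sketch of that rule of thumb. The function name and the default 1.2 overhead factor are just this example's choices, not a standard API.

```python
def estimate_vram_gb(params_billion: float, bits_per_weight: float, overhead: float = 1.2) -> float:
    """Rule-of-thumb VRAM estimate: weight memory in GB times an overhead factor.

    Dividing the bit width by 8 converts bits to bytes, so billions of
    parameters times bytes per weight gives gigabytes of weight memory.
    """
    weight_gb = params_billion * bits_per_weight / 8
    return weight_gb * overhead


# Worked examples (roughly matching the table in the next section):
print(f"{estimate_vram_gb(8, 4):.1f} GB")    # 8B model at 4-bit  -> ~4.8 GB
print(f"{estimate_vram_gb(70, 4):.1f} GB")   # 70B model at 4-bit -> ~42.0 GB
```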

Requirements by Model Size

| Model Size | 4-bit (Standard) | 8-bit (High Quality) | 16-bit (Pro) |
|------------|------------------|----------------------|--------------|
| 1B - 3B    | 1.5 - 2 GB       | 3 - 4 GB             | 6 - 8 GB     |
| 7B - 8B    | 5 - 6 GB         | 8 - 10 GB            | 14 - 16 GB   |
| 11B - 14B  | 8 - 10 GB        | 14 - 16 GB           | 22 - 28 GB   |
| 30B - 34B  | 18 - 20 GB       | 32 - 35 GB           | 60 - 70 GB   |
| 70B - 72B  | 40 - 45 GB       | 70 - 75 GB           | 130 - 140 GB |

Consumer Level (Mid-Range)

  • RTX 3060 (12GB): Best budget choice for 7B/8B models at 8-bit precision.
  • RTX 4060 Ti (16GB): Good entry point for 14B models.

Enthusiast Level (High-End)

  • RTX 3090 / 4090 (24GB): The “Gold Standard” for local LLMs. Runs 30B models comfortably or 70B models at 2.5-bit.
  • Dual RTX 3090 (48GB total): Best value for running 70B models (Llama 3) at high quality.

Professional / Mac

  • Mac Studio (64GB - 192GB Unified Memory): Best for massive models (70B+) as the M-series chips share memory between CPU and GPU.

Context Length & KV Cache

Loading the model is only part of the story. As you chat, the “KV Cache” grows.

  • 8K Context: Adds ~0.5 - 1GB VRAM.
  • 32K Context: Adds ~2 - 4GB VRAM.
  • 128K Context: Can add 10GB+ VRAM depending on the architecture.
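The growth is easy to derive: every layer caches one key and one value vector per token. The sketch below assumes a classic 7B Llama-style architecture (32 layers, 32 KV heads, head dimension 128, FP16 cache); these numbers are illustrative assumptions, and models that use grouped-query attention keep far fewer KV heads, which is why published estimates (including the ranges above) vary so much.

```python
def kv_cache_gb(context_tokens: int,
                n_layers: int = 32,       # assumed: classic 7B Llama-style depth
                n_kv_heads: int = 32,     # full multi-head attention; GQA models use e.g. 8
                head_dim: int = 128,
                bytes_per_value: int = 2  # FP16 cache; use 1 for an 8-bit cache
                ) -> float:
    """K and V tensors cached for every token in the context, in GB."""
    per_token_bytes = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value  # 2 = one K + one V
    return context_tokens * per_token_bytes / 1e9


print(f"{kv_cache_gb(8_192):.1f} GB")                 # ~4.3 GB at 8K with full MHA
print(f"{kv_cache_gb(8_192, n_kv_heads=8):.1f} GB")   # ~1.1 GB with grouped-query attention
```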

How to Reduce VRAM Usage

  1. Use Quantization: Switching weights from 16-bit to 4-bit cuts their memory footprint by roughly 75% (see the sketch after this list).
  2. KV Cache Quantization: Some tools (like vLLM or llama.cpp) can compress the cache to 4-bit/8-bit.
  3. Context Scaling: Limit the maximum context length in your settings.
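As a rough sketch combining the two estimates above, here is what the three techniques buy you together for an 8B Llama-style model. All architecture numbers and the 1.2 overhead factor are assumptions carried over from the earlier sketches, not measurements.

```python
def total_vram_gb(params_b: float, weight_bits: float, ctx: int, cache_bytes: int,
                  n_layers: int = 32, n_kv_heads: int = 8, head_dim: int = 128) -> float:
    weights = params_b * weight_bits / 8 * 1.2                        # quantized weights + overhead
    kv_cache = ctx * 2 * n_layers * n_kv_heads * head_dim * cache_bytes / 1e9
    return weights + kv_cache


baseline = total_vram_gb(8, 16, 32_768, 2)  # FP16 weights, FP16 cache, 32K context -> ~23.5 GB
tuned = total_vram_gb(8, 4, 8_192, 1)       # 4-bit weights, 8-bit cache, 8K context -> ~5.3 GB
print(f"baseline ~{baseline:.1f} GB, tuned ~{tuned:.1f} GB")
```

With these illustrative numbers, the same model drops from needing a 24 GB card to fitting comfortably on an 8 GB one.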

Quick Reference

| Model Size | FP16   | INT8  | 4-bit |
|------------|--------|-------|-------|
| 7B         | 14 GB  | 7 GB  | 4 GB  |
| 13B        | 26 GB  | 13 GB | 7 GB  |
| 34B        | 68 GB  | 34 GB | 18 GB |
| 70B        | 140 GB | 70 GB | 35 GB |

Context Length Impact

KV cache grows with context:

| Context | Additional VRAM (7B) |
|---------|----------------------|
| 2K      | +0.5 GB              |
| 8K      | +2 GB                |
| 32K     | +8 GB                |
| 128K    | +32 GB               |

Consumer GPU VRAM

| GPU          | VRAM  | Max Model (4-bit) |
|--------------|-------|-------------------|
| RTX 3060     | 12 GB | ~20B              |
| RTX 4070     | 12 GB | ~20B              |
| RTX 4090     | 24 GB | ~45B              |
| Apple M2 Pro | 16 GB | ~25B              |
| Apple M3 Max | 64 GB | ~100B             |
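To go the other way (start from a card and ask what fits), the same rule of thumb can be inverted. This is only a sketch under the same assumed 1.2 overhead factor; the GPU entries are taken from the table above, and real limits also depend on context length and quantization format.

```python
def max_params_billion(vram_gb: float, bits_per_weight: float = 4, overhead: float = 1.2) -> float:
    """Invert the rule of thumb: billions of parameters that fit in a VRAM budget."""
    return vram_gb / overhead * 8 / bits_per_weight


for name, vram in [("RTX 3060", 12), ("RTX 4090", 24), ("Apple M3 Max", 64)]:
    print(f"{name}: ~{max_params_billion(vram):.0f}B at 4-bit")
# RTX 3060: ~20B, RTX 4090: ~40B, Apple M3 Max: ~107B -- close to the table above
```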
