Local Inference

Run LLMs on your own hardware for privacy, cost savings, and offline access.

Tool            Platform        Best For
Ollama          Mac/Linux/Win   Easy setup
llama.cpp       All             Performance
LM Studio       Mac/Win         GUI
vLLM            Linux           Serving
text-gen-webui  All             Features
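llama.cpp, the performance-oriented option in the table, is driven directly from the command line once you have a GGUF model file. A minimal sketch, assuming a recent build that provides the llama-cli binary; the model path is a placeholder:

# Run a one-off prompt against a local GGUF file (path is a placeholder)
./llama-cli -m ./models/model-q4_k_m.gguf \
  -p "Explain quantization in one sentence." -n 128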

Quick Start: Ollama

# Install (the script targets Linux; macOS and Windows use the installer from ollama.com)
curl -fsSL https://ollama.com/install.sh | sh

# Run a model
ollama run llama3.2

# List models
ollama list
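Besides the CLI, Ollama exposes a local HTTP API (port 11434 by default) that editors and other tools can call. A minimal sketch of a single non-streaming request; the prompt is just an example:

# Query the local Ollama API
curl http://localhost:11434/api/generate -d '{
  "model": "llama3.2",
  "prompt": "Why run models locally?",
  "stream": false
}'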

Hardware Requirements

Minimum (7B models)

  • 8 GB RAM
  • 4-core CPU
  • (Optional) 8+ GB VRAM GPU

Recommended (13B models)

  • 16 GB RAM
  • 8-core CPU
  • 12+ GB VRAM GPU

Power User (70B+ models)

  • 32 GB RAM
  • High-end GPU (24+ GB) or
  • Apple Silicon (32+ GB unified)
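To check whether a loaded model actually fits on the GPU or is partly spilling over to the CPU and system RAM, you can inspect it while it is running. A rough sketch using Ollama's status command and, on NVIDIA systems, nvidia-smi; the 2-second polling interval is arbitrary:

# Show loaded models and the GPU/CPU split Ollama is using
ollama ps

# On NVIDIA GPUs, poll VRAM usage while generating
nvidia-smi --query-gpu=memory.used,memory.total --format=csv -l 2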

Performance Tips

  1. Use quantized models (Q4_K_M is a good balance of quality and size)
  2. Offload layers to the GPU when possible (see the sketch after this list)
  3. Limit context length if you are RAM-constrained
  4. Use MoE (mixture-of-experts) models, which activate only a fraction of their parameters per token, for better speed at a given quality
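Taken together, tips 1 through 3 map onto llama.cpp's runtime flags. A rough sketch; the model file name is a placeholder and the layer count should be tuned to your available VRAM:

# Quantized (Q4_K_M) model, partial GPU offload, capped context.
# -ngl = number of layers offloaded to the GPU, -c = context length in tokens.
./llama-cli -m ./models/model-q4_k_m.gguf -ngl 24 -c 4096 \
  -p "Summarize the benefits of local inference."

With Ollama, the equivalent knobs are the quantization tag of the model you pull and the num_ctx option set in a Modelfile or per API request.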