pg2grasp
| Entity Passport | |
|---|---|
| Registry ID | hf-model--nagaaato--pg2grasp |
| License | Apache-2.0 |
| Provider | huggingface |
Cite this model
Academic & Research Attribution
@misc{hf_model__nagaaato__pg2grasp,
  author = {nagaaato},
  title = {pg2grasp Model},
  year = {2026},
  howpublished = {\url{https://huggingface.co/nagaaato/pg2grasp}},
  note = {Accessed via Free2AITools Knowledge Fortress}
}
Quick Commands
huggingface-cli download nagaaato/pg2grasp
Nexus Index V2.0
Index Insight
FNI V2.0 for pg2grasp: Semantic (S:50), Authority (A:0), Popularity (P:10), Recency (R:96), Quality (Q:65).
Technical Deep Dive
PG2-Grasp: Text-Grounded Robot Grasp Prediction
Fine-tuned from google/paligemma2-10b-mix-224 to predict robot grasp centers from natural language prompts and RGB images.
This repo contains bf16 safetensors weights (fine-tuned, not base model).
Usage
BF16 (unquantized)
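from transformers import PaliGemmaForConditionalGeneration, AutoProcessor
from PIL import Image, ImageOps
import torch

model_id = "nagaaato/pg2grasp"
processor = AutoProcessor.from_pretrained(model_id)
model = PaliGemmaForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    attn_implementation="sdpa",
).to("cuda").eval()

# Prepare image (square-pad to max(w, h))
image = Image.open("scene.png").convert("RGB")
w, h = image.size
max_dim = max(w, h)
pad_left = (max_dim - w) // 2  # horizontal offset of the image inside the square
pad_top = (max_dim - h) // 2   # vertical offset of the image inside the square
padded = ImageOps.pad(image, (max_dim, max_dim), color=0)

# Prompt format: "Pick the <object>."
# Inference and decoding follow the same steps as the INT8 example below.
inputs = processor(images=padded, text="Pick the apple.", return_tensors="pt").to("cuda")
with torch.no_grad():
    outputs = model(**inputs)
# Keep only the 1024 LOC-token logits at the last position
loc_logits = outputs.logits[0, -1, 256000:257024]
pred_token = loc_logits.argmax().item()

LOC_GRID = 32
row = pred_token // LOC_GRID
col = pred_token % LOC_GRID
cx_px = (col + 0.5) / LOC_GRID * max_dim - pad_left
cy_px = (row + 0.5) / LOC_GRID * max_dim - pad_top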
VRAM: ~20 GB (bf16, batch_size=1)
INT8 Quantization
Requires bitsandbytes and accelerate.
from transformers import PaliGemmaForConditionalGeneration, AutoProcessor, BitsAndBytesConfig
from PIL import Image, ImageOps
import torch

model_id = "nagaaato/pg2grasp"
processor = AutoProcessor.from_pretrained(model_id)
model = PaliGemmaForConditionalGeneration.from_pretrained(
    model_id,
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
    device_map="cuda",
).eval()

# Square-pad the image and record the offsets needed for decoding
image = Image.open("scene.png").convert("RGB")
w, h = image.size
max_dim = max(w, h)
pad_left = (max_dim - w) // 2
pad_top = (max_dim - h) // 2
padded = ImageOps.pad(image, (max_dim, max_dim), color=0)

inputs = processor(
    images=padded,
    text="Pick the apple.",
    return_tensors="pt",
).to("cuda")
with torch.no_grad():
    outputs = model(**inputs)
# Restrict the last-position logits to the 1024 LOC tokens
loc_logits = outputs.logits[0, -1, 256000:257024]
pred_token = loc_logits.argmax().item()

# Same decoding as BF16: take the cell center, rescale to the padded square,
# then subtract the padding offset to return to original image coordinates
LOC_GRID = 32
row = pred_token // LOC_GRID
col = pred_token % LOC_GRID
cx_px = (col + 0.5) / LOC_GRID * max_dim - pad_left
cy_px = (row + 0.5) / LOC_GRID * max_dim - pad_top
VRAM: ~10-11 GB (INT8 quantized)
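The decoding arithmetic in the snippet above can be isolated from the model call and checked on its own. The following is a minimal sketch in pure Python that reuses the same formulas; the function name `decode_loc_token` is hypothetical and not part of the model card.

```python
LOC_GRID = 32  # 32 x 32 grid, as in the usage snippet


def decode_loc_token(pred_token: int, width: int, height: int) -> tuple[float, float]:
    """Map a LOC index in [0, 1024) to (x, y) pixels in the original image."""
    if not 0 <= pred_token < LOC_GRID * LOC_GRID:
        raise ValueError(f"pred_token must be in [0, {LOC_GRID * LOC_GRID}), got {pred_token}")
    # Same square-padding geometry as the usage snippet
    max_dim = max(width, height)
    pad_left = (max_dim - width) // 2
    pad_top = (max_dim - height) // 2
    row, col = divmod(pred_token, LOC_GRID)
    # Cell center in the padded square, shifted back by the padding offset
    cx_px = (col + 0.5) / LOC_GRID * max_dim - pad_left
    cy_px = (row + 0.5) / LOC_GRID * max_dim - pad_top
    return cx_px, cy_px


# Cell (row 16, col 16) in a 640x480 image
print(decode_loc_token(16 * LOC_GRID + 16, 640, 480))  # (330.0, 250.0)
```

A cell that lies in the black border decodes to a point outside the original image (for example, token 0 in a 640x480 image gives a negative y), so callers may want to bounds-check the result.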
INT4 NF4 Quantization
from transformers import PaliGemmaForConditionalGeneration, AutoProcessor, BitsAndBytesConfig
import torch

model_id = "nagaaato/pg2grasp"
processor = AutoProcessor.from_pretrained(model_id)
model = PaliGemmaForConditionalGeneration.from_pretrained(
    model_id,
    quantization_config=BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.bfloat16,
        bnb_4bit_use_double_quant=True,
    ),
    device_map="cuda",
).eval()
# Same usage as above
VRAM: ~5-6 GB (INT4 NF4 quantized)
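The three VRAM figures can be compared against a weights-only estimate of parameters times bits per parameter. The sketch below assumes the 9.66B parameter count quoted under Model Details and 1 GB = 1e9 bytes; `weight_gb` and `pick_precision` are hypothetical helpers, and the thresholds are simply the upper ends of the figures quoted in this section.

```python
from typing import Optional

PARAMS = 9.66e9  # parameter count quoted under Model Details

# Upper ends of the VRAM figures quoted in the Usage section (GB, batch_size=1)
QUOTED_VRAM_GB = {"bf16": 20.0, "int8": 11.0, "nf4": 6.0}
BITS_PER_PARAM = {"bf16": 16, "int8": 8, "nf4": 4}


def weight_gb(params: float, bits_per_param: int) -> float:
    """Weights-only footprint in GB (1 GB = 1e9 bytes)."""
    return params * bits_per_param / 8 / 1e9


def pick_precision(free_gb: float) -> Optional[str]:
    """Return the least-quantized mode whose quoted VRAM fits, or None."""
    for mode in ("bf16", "int8", "nf4"):
        if free_gb >= QUOTED_VRAM_GB[mode]:
            return mode
    return None


for mode, bits in BITS_PER_PARAM.items():
    print(f"{mode}: weights ~{weight_gb(PARAMS, bits):.2f} GB, quoted ~{QUOTED_VRAM_GB[mode]:.0f} GB")
print(pick_precision(16.0))  # int8
```

The gap between the weights-only estimate and the quoted figure covers activations, quantization metadata, and any modules that stay unquantized, so the quoted numbers remain the safer guide.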
Model Details
- Base: PaliGemma2-10B (9.66B params)
- SigLIP-SO400M vision encoder (robot-adapted from piRECAP05)
- Gemma2-9B language model (42 decoder layers)
- Fine-tuning: selective unfreeze on robotics grasp datasets
- Training: FSDP on 8×A100-80GB
- Differential LR: base layers frozen, top layers + vision backbone trainable
- Output: single LOC token (1024-way softmax over 32×32 grid)
- Input resolution: 224×224 (square-padded with black borders)
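Because the output is a categorical distribution over 1024 grid cells, more than one candidate cell can be read from the same logits. The following is a minimal sketch, assuming the LOC logits have already been converted to a plain list of 1024 floats (for example with `loc_logits.float().tolist()`); `top_k_cells` is a hypothetical helper, not something the model card provides.

```python
import math

LOC_GRID = 32  # 32 x 32 grid, 1024 cells


def top_k_cells(loc_logits: list[float], k: int = 3) -> list[tuple[int, int, float]]:
    """Return the k most probable cells as (row, col, probability)."""
    if len(loc_logits) != LOC_GRID * LOC_GRID:
        raise ValueError(f"expected {LOC_GRID * LOC_GRID} logits, got {len(loc_logits)}")
    # Numerically stable softmax: subtract the max before exponentiating
    peak = max(loc_logits)
    exps = [math.exp(x - peak) for x in loc_logits]
    total = sum(exps)
    ranked = sorted(range(len(exps)), key=exps.__getitem__, reverse=True)[:k]
    return [(i // LOC_GRID, i % LOC_GRID, exps[i] / total) for i in ranked]


# Toy logits: two neighbouring cells stand out from a flat background
logits = [0.0] * (LOC_GRID * LOC_GRID)
logits[16 * LOC_GRID + 16] = 5.0
logits[16 * LOC_GRID + 17] = 4.0
for row, col, prob in top_k_cells(logits, k=2):
    print(row, col, round(prob, 3))
```

Each returned cell can be converted to pixel coordinates with the same decoding used in the Usage section.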
Eval Results (validation set)
| Metric | Score |
|---|---|
| canonical_median | 45.1 px |
| best_grasp_median | 14.1 px |
| W100 (within 100px) | 80.0% |
| grounding margin | 9.04 |
| grounding top1 | 89.8% |
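The card does not define these metrics, so the following is only a sketch of how pixel-distance metrics of this kind are commonly computed. It assumes that a median score is the median Euclidean distance between predicted and annotated grasp centers, that a "best grasp" variant takes the nearest of several annotated grasps, and that W100 is the fraction of predictions within 100 px. All function names and the toy data are hypothetical.

```python
import math
from statistics import median

Point = tuple[float, float]


def best_grasp_error(pred: Point, targets: list[Point]) -> float:
    """Distance in pixels from the prediction to the nearest annotated grasp."""
    return min(math.dist(pred, target) for target in targets)


def summarize(errors: list[float], radius: float = 100.0) -> dict[str, float]:
    """Median error and the fraction of errors within the given radius."""
    return {
        "median_px": median(errors),
        "within_radius": sum(e <= radius for e in errors) / len(errors),
    }


# Toy data, not taken from the model's validation set
preds = [(100.0, 100.0), (200.0, 50.0), (10.0, 10.0)]
annotations = [
    [(103.0, 104.0), (300.0, 300.0)],
    [(200.0, 170.0)],
    [(10.0, 40.0), (50.0, 10.0)],
]
errors = [best_grasp_error(p, t) for p, t in zip(preds, annotations)]
print(errors)  # [5.0, 120.0, 30.0]
print(summarize(errors))
```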
Notes
- The repository contains bf16 weights only. Quantization happens at load time via BitsAndBytesConfig; the weights remain bf16 on disk.
- Prompt format is strict: "<image>Pick the <object>." with the trailing period.
- Image padding is required: square-pad with black, then compute the offsets to decode back to the original image space.
- LOC tokens: indices 256000 to 257023 (1024 LOC tokens for the 32×32 grid).
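The notes above can be turned into two small helpers: one that formats the strict prompt and one that maps a pixel in the original image to a LOC vocabulary index, which is the inverse of the decoding shown in Usage. This is a sketch under assumptions. `build_prompt` and `encode_grasp` are hypothetical names, and whether the training labels were built this way is not stated in the card. The usage snippets pass the text without the `<image>` prefix, so the prefix is optional here.

```python
LOC_GRID = 32
LOC_TOKEN_OFFSET = 256000  # first LOC index, per the note above


def build_prompt(obj: str, include_image_token: bool = True) -> str:
    """Format 'Pick the <object>.' with exactly one trailing period."""
    name = obj.strip().rstrip(".").strip()
    if not name:
        raise ValueError("object name must not be empty")
    prefix = "<image>" if include_image_token else ""
    return f"{prefix}Pick the {name}."


def encode_grasp(cx_px: float, cy_px: float, width: int, height: int) -> int:
    """Map an original-image pixel to the LOC vocabulary index of its cell."""
    max_dim = max(width, height)
    pad_left = (max_dim - width) // 2
    pad_top = (max_dim - height) // 2
    # Shift into the padded square, scale to the grid, clamp to valid cells
    col = min(LOC_GRID - 1, max(0, int((cx_px + pad_left) / max_dim * LOC_GRID)))
    row = min(LOC_GRID - 1, max(0, int((cy_px + pad_top) / max_dim * LOC_GRID)))
    return LOC_TOKEN_OFFSET + row * LOC_GRID + col


print(build_prompt("apple"))                   # <image>Pick the apple.
print(encode_grasp(330.0, 250.0, 640, 480))    # 256528
```

Encoding and then decoding a point moves it to the center of its cell, so the round trip shifts each coordinate by at most half a cell, that is max_dim / 64 pixels.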
Incomplete Data
Some information about this model is not available. Use with caution: verify details from the original source before relying on this data.
Limitations & Considerations
- Benchmark scores may vary based on evaluation methodology and hardware configuration.
- VRAM requirements are estimates; actual usage depends on quantization and batch size.
- FNI scores are relative rankings and may change as new models are added.
- License: listed as Apache-2.0 in the upstream metadata. Verify licensing terms before commercial use.
AI Summary: Based on Hugging Face metadata. Not a recommendation.
Model Transparency Report
Technical metadata sourced from upstream repositories.
Identity & Source
- id: hf-model--nagaaato--pg2grasp
- slug: nagaaato--pg2grasp
- source: huggingface
- author: nagaaato
- license: Apache-2.0
- tags: safetensors, paligemma, robotics, grasp-prediction, vision-language-model, image-to-text, en, base_model:google/paligemma2-10b-mix-224, base_model:finetune:google/paligemma2-10b-mix-224, license:apache-2.0, region:us
Technical Specs
- architecture: null
- params billions: null
- context length: null
- pipeline tag: image-to-text
Engagement & Metrics
- downloads: 130
- stars: 0
- forks: null
Data indexed from public sources. Updated daily.