Deception Probes Activations
| Entity Passport | |
| Registry ID | hf-dataset--xycoord--deception-probes-activations |
| License | Other |
| Provider | huggingface |
Cite this dataset
Academic & Research Attribution
@misc{hf_dataset__xycoord__deception_probes_activations,
author = {xycoord},
title = {Deception Probes Activations Dataset},
year = {2026},
howpublished = {\url{https://huggingface.co/datasets/xycoord/deception-probes-activations}},
note = {Accessed via Free2AITools Knowledge Fortress}
}
Free2AITools Nexus Index V2.0
FNI V2.0 for Deception Probes Activations: Semantic (S:50), Authority (A:61), Popularity (P:51), Recency (R:99), Quality (Q:50).
Dataset Specification
Deception Probes Activations
Pre-extracted residual-stream activations for training and evaluating deception detection probes on LLMs. Each example contains per-token hidden states from a specific transformer layer, saved in bfloat16 safetensors format.
License
This dataset contains activations derived from multiple sources with different licenses. See the LICENSE file for full details.
| Component | Source | License |
|---|---|---|
| Apollo Probe Pairs (statements) | Azaria & Mitchell (2023) | CC BY-NC-ND 4.0 |
| Controlled Taxonomy | Custom prompts + Azaria & Mitchell facts | CC BY-NC-ND 4.0 |
| Liar's Bench - Convincing Game | Cadenza Labs | CC BY 4.0 |
| Liar's Bench - Instructed Deception | Cadenza Labs | Academic fair use (see LICENSE) |
| Liar's Bench - Insider Trading | Cadenza Labs | CC BY 4.0 |
| Liar's Bench - Alpaca | Cadenza Labs (from Stanford Alpaca) | MIT |
| Liar's Bench - Harm-Pressure Choice | Cadenza Labs | CC BY 4.0 |
| Liar's Bench - Harm-Pressure Knowledge | Cadenza Labs | CC BY 4.0 |
Due to the CC BY-NC-ND 4.0 license on the Azaria & Mitchell data (used in Apollo Probe Pairs and Controlled Taxonomy), this dataset as a whole should be treated as non-commercial use only.
Models & Layers
| Model | HF ID | Layers Available | Hidden Dim | Data |
|---|---|---|---|---|
| Gemma 3 27B IT | google/gemma-3-27b-it | 31 | 5376 | train, eval |
| Llama 3.3 70B Instruct | meta-llama/Llama-3.3-70B-Instruct | 20 | 8192 | train, val, eval |
Datasets
Training Data (`train/`)
Apollo Probe Pairs
Contrastive activations following the Apollo Research methodology: 12 prompt pairs with honest/dishonest system instructions are applied to 306 factual statements from the Azaria & Mitchell dataset. Activations cover the statement tokens only (the factual claim, not the system prompt).
| Model | Layer | Examples | Truthful | Deceptive |
|---|---|---|---|---|
| Gemma 3 27B | 31 | 7,344 | 3,672 | 3,672 |
| Llama 3.3 70B | 20 | 7,344 | 3,672 | 3,672 |
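The counts in the table follow from the figures quoted above. A quick check, assuming every one of the 12 prompt pairs is applied to each of the 306 statements and contributes one truthful and one deceptive example:

```python
# Figures quoted in the Apollo Probe Pairs description.
n_prompt_pairs = 12   # honest/dishonest system-prompt pairs
n_statements = 306    # factual statements from Azaria & Mitchell

# Assumption: each pair yields one truthful and one deceptive example per statement.
per_label = n_prompt_pairs * n_statements
total = 2 * per_label

print(per_label, total)  # 3672 7344
```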
Controlled Taxonomy
Confound-controlled training data covering 16 deception types × 2 conditions (honest/dishonest). Facts are distributed round-robin across prompt variants so that each fact appears exactly once per deception type, which removes content as a confound. Activations cover the statement tokens only, with the statement placed in the pre-filled assistant turn.
| Model | Layer | Split | Examples |
|---|---|---|---|
| Llama 3.3 70B | 20 | train | ~9,760 |
| Llama 3.3 70B | 20 | val | ~4,896 |
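One possible reading of the round-robin assignment described above, as a minimal sketch rather than the dataset author's code: within each deception type, consecutive facts cycle through that type's prompt variants. The function name `assign_round_robin`, the type and variant labels, and the per-type offset are all illustrative assumptions.

```python
from collections import Counter

def assign_round_robin(facts, deception_types, variants_per_type):
    """Return (fact, deception_type, variant) triples.

    Within each deception type, consecutive facts cycle through that
    type's prompt variants, so every fact is used exactly once per type.
    """
    assignments = []
    for t_idx, dtype in enumerate(deception_types):
        variants = variants_per_type[dtype]
        for f_idx, fact in enumerate(facts):
            # Offsetting by the type index is an illustrative choice so that
            # a given fact does not always land on the first variant.
            variant = variants[(f_idx + t_idx) % len(variants)]
            assignments.append((fact, dtype, variant))
    return assignments

# Hypothetical inputs, not taken from the dataset.
facts = [f"fact_{i}" for i in range(6)]
types = ["type_a", "type_b"]
variants = {"type_a": ["v0", "v1", "v2"], "type_b": ["v0", "v1"]}

rows = assign_round_robin(facts, types, variants)

# Every (fact, deception type) combination occurs exactly once.
per_fact_type = Counter((fact, dtype) for fact, dtype, _ in rows)
# Variants are used evenly within a type.
per_variant = Counter((dtype, variant) for _, dtype, variant in rows)
```

Because every fact is used under every deception type, a probe cannot separate types by memorising which facts go with which type.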
Evaluation Data (`eval/`)
Activations from Liar's Bench (Cadenza Labs). Each subset uses on-policy completions only (filtered by model). Activations cover the response tokens only (the model's reply, not the prompt).
| Subset | Description | Llama Examples | Gemma Examples |
|---|---|---|---|
| Convincing Game | Social deception: convince an interrogator | 888 | 621 |
| Instructed Deception | Explicitly told to lie or tell the truth | 5,494 | 5,196 |
| Insider Trading | Strategic deception in a financial scenario | 1,080 | 3,557 |
| Alpaca | Non-deceptive calibration data (all neutral) | 2,000 | 2,000 |
| Harm-Pressure Choice | Deception under pressure | 2,134 | n/a |
| Harm-Pressure Knowledge | Deception under pressure | 2,139 | n/a |
Deprecated Data (`deprecated/`)
Collections with a known system prompt bug. Preserved for reproducibility.
See deprecated/README.md for details.
Directory Structure
├── train/                          # Probe training data
│   ├── apollo_probe_pairs/
│   │   ├── gemma-3-27b-it/layer_31/
│   │   └── llama-3.3-70b-instruct/layer_20/
│   └── controlled_taxonomy/
│       └── llama-3.3-70b-instruct/layer_20/
│
├── val/                            # Validation data
│   └── controlled_taxonomy/
│       └── llama-3.3-70b-instruct/layer_20/
│
├── eval/                           # Evaluation / test data
│   ├── liars_bench_convincing/
│   │   ├── gemma-3-27b-it/layer_31/
│   │   └── llama-3.3-70b-instruct/layer_20/
│   ├── liars_bench_instructed/
│   │   ├── gemma-3-27b-it/layer_31/
│   │   └── llama-3.3-70b-instruct/layer_20/
│   ├── liars_bench_insider_trading/
│   │   ├── gemma-3-27b-it/layer_31/
│   │   └── llama-3.3-70b-instruct/layer_20/
│   ├── liars_bench_alpaca/
│   │   ├── gemma-3-27b-it/layer_31/
│   │   └── llama-3.3-70b-instruct/layer_20/
│   ├── liars_bench_harm_pressure_choice/
│   │   └── llama-3.3-70b-instruct/layer_20/
│   └── liars_bench_harm_pressure_knowledge/
│       └── llama-3.3-70b-instruct/layer_20/
│
└── deprecated/                     # Buggy collections (preserved)
    ├── v0_gemma_l31_liars_bench/
    ├── v0_llama_l20_apollo/
    ├── v0_llama_l20_liars_bench/
    ├── v0_llama_l22_apollo/
    └── v0_llama_l22_liars_bench/
Path Pattern
{split}/{dataset_name}/{model}/{layer_N}/activations/*.safetensors
{split}/{dataset_name}/{model}/{layer_N}/metadata.jsonl
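The pattern above can be turned into small path helpers. This is a minimal sketch; the function names are illustrative, and the activation filename has to come from the repository listing or from the `activation_file` metadata field.

```python
from pathlib import PurePosixPath

def layer_dir(split, dataset_name, model, layer):
    """Directory holding metadata.jsonl and activations/ for one collection."""
    return PurePosixPath(split) / dataset_name / model / f"layer_{layer}"

def metadata_path(split, dataset_name, model, layer):
    """Repo-relative path of the metadata file for one collection."""
    return str(layer_dir(split, dataset_name, model, layer) / "metadata.jsonl")

def activation_path(split, dataset_name, model, layer, filename):
    """Repo-relative path of one safetensors file (filename supplied by the caller)."""
    return str(layer_dir(split, dataset_name, model, layer) / "activations" / filename)

print(metadata_path("eval", "liars_bench_instructed", "gemma-3-27b-it", 31))
# eval/liars_bench_instructed/gemma-3-27b-it/layer_31/metadata.jsonl
```

`PurePosixPath` keeps forward slashes on every platform, which is what repository paths use.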
File Format
Safetensors
Each safetensors file contains multiple examples, keyed by example_id.
Each tensor has shape (n_tokens, hidden_dim) in bfloat16.
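The shape and dtype give a rough storage estimate per example: bfloat16 uses 2 bytes per value, so a tensor takes n_tokens × hidden_dim × 2 bytes. A small sketch using the hidden dimensions from the Models & Layers table (the safetensors header adds some overhead that is not counted here):

```python
BYTES_PER_BF16 = 2  # bfloat16 is a 16-bit format

# Hidden dimensions from the Models & Layers table.
HIDDEN_DIM = {
    "gemma-3-27b-it": 5376,
    "llama-3.3-70b-instruct": 8192,
}

def tensor_bytes(n_tokens, model):
    """Raw payload size of one (n_tokens, hidden_dim) bfloat16 tensor."""
    return n_tokens * HIDDEN_DIM[model] * BYTES_PER_BF16

print(tensor_bytes(1, "gemma-3-27b-it"))           # 10752
print(tensor_bytes(40, "llama-3.3-70b-instruct"))  # 655360
```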
Metadata (JSONL)
One JSON object per example with fields:
| Field | Description |
|---|---|
| `dataset` | Dataset name (e.g. "apollo_probe_pairs", "liars_bench_instructed") |
| `model` | Model short name ("gemma-3-27b-it" or "llama-3.3-70b-instruct") |
| `layer` | Layer index |
| `split` | "train", "val", or "test" |
| `example_id` | Unique ID, also the tensor key in the safetensors file |
| `label` | "truthful", "deceptive", or "neutral" |
| `text` | The input text (statement or model response) |
| `token_info` | {"type": "statement_tokens" or "response_tokens", "n_tokens": int, "hidden_dim": int} |
| `activation_file` | Relative path to the safetensors file containing this example |
Apollo examples also include pair_key, side, and system_prompt.
Controlled taxonomy examples also include deception_type and condition.
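Since one safetensors file holds many examples, it can help to index the metadata by `activation_file` so that each file is opened once. The sketch below uses only the documented fields; the three sample records and their values are made up for illustration.

```python
import json
from collections import Counter, defaultdict

# Made-up metadata lines that use the documented field names.
sample_jsonl = """\
{"example_id": "ex_0", "label": "truthful", "activation_file": "activations/a.safetensors"}
{"example_id": "ex_1", "label": "deceptive", "activation_file": "activations/a.safetensors"}
{"example_id": "ex_2", "label": "neutral", "activation_file": "activations/b.safetensors"}
"""

def index_metadata(lines):
    """Count labels and group example IDs by the file that stores them."""
    label_counts = Counter()
    ids_by_file = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        record = json.loads(line)
        label_counts[record["label"]] += 1
        ids_by_file[record["activation_file"]].append(record["example_id"])
    return label_counts, dict(ids_by_file)

label_counts, ids_by_file = index_metadata(sample_jsonl.splitlines())
```

With a real `metadata.jsonl`, pass the open file object instead of `sample_jsonl.splitlines()`.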
Quick Start
import json

from safetensors.torch import load_file
from huggingface_hub import hf_hub_download

repo_id = "xycoord/deception-probes-activations"

# Download metadata for Llama Apollo training data
meta_path = hf_hub_download(
    repo_id,
    "train/apollo_probe_pairs/llama-3.3-70b-instruct/layer_20/metadata.jsonl",
    repo_type="dataset",
)
with open(meta_path) as f:
    examples = [json.loads(line) for line in f]

# Download and load activations
act_path = hf_hub_download(
    repo_id,
    "train/apollo_probe_pairs/llama-3.3-70b-instruct/layer_20/activations/baseline_apollo_0_honest.safetensors",
    repo_type="dataset",
)
tensors = load_file(act_path)
# tensors["baseline_apollo_0_honest_0"].shape == (n_tokens, 8192)
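To go from the `examples` list and the `tensors` dict to labelled training data, the metadata can be joined to the tensors on `example_id`. This is a minimal sketch: the 0/1 target encoding and the choice to skip "neutral" examples are assumptions, and the stand-in data lets it run without downloading anything.

```python
# Assumed encoding for a binary probe; "neutral" examples are skipped here.
LABEL_TO_TARGET = {"truthful": 0, "deceptive": 1}

def labelled_tensors(examples, tensors):
    """Yield (example_id, tensor, target) for metadata rows stored in this file."""
    for ex in examples:
        key = ex["example_id"]
        if key in tensors and ex["label"] in LABEL_TO_TARGET:
            yield key, tensors[key], LABEL_TO_TARGET[ex["label"]]

# Stand-ins shaped like the Quick Start objects (values are illustrative).
examples_demo = [
    {"example_id": "ex_0", "label": "truthful"},
    {"example_id": "ex_1", "label": "deceptive"},
    {"example_id": "ex_2", "label": "neutral"},    # skipped: no binary target
    {"example_id": "ex_3", "label": "deceptive"},  # skipped: not in this file
]
tensors_demo = {"ex_0": [[0.0, 1.0]], "ex_1": [[1.0, 0.0]], "ex_2": [[0.5, 0.5]]}

pairs = list(labelled_tensors(examples_demo, tensors_demo))
```

With the real objects, call `labelled_tensors(examples, tensors)`. Metadata rows whose tensors live in other safetensors files are skipped by the membership check.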
Structured Schema (Zero-Fabrication)
| Feature Key | Data Type |
|---|---|
| dataset | string |
| example_id | string |
| label | string |
| pair_key | string |
| side | string |
| system_prompt | string |
| text | string |
| token_info | unknown |
| activation_file | string |
| model | string |
| layer | int64 |
| split | string |
Estimated Rows: 675,648
Dataset Transparency Report
Technical metadata sourced from upstream repositories.
Identity & Source
- id: hf-dataset--xycoord--deception-probes-activations
- slug: xycoord--deception-probes-activations
- source: huggingface
- author: xycoord
- license: Other
- tags: task_categories:text-classification, language:en, license:other, size_categories:1m<n<10m, format:json, modality:text, library:datasets, library:dask, library:polars, library:mlcroissant, arxiv:2304.13734, arxiv:2407.15285, region:us, deception, mechanistic-interpretability, activations, probing, safety, alignment
Technical Specs
- architecture: null
- params billions: null
- context length: null
- pipeline tag:
Engagement & Metrics
- downloads: 28,499
- stars: 0
- forks: null