LLM Election Data 2024
| Entity Passport | |
|---|---|
| Registry ID | hf-dataset--sarahcen--llm-election-data-2024 |
| License | MIT |
| Provider | huggingface |
Cite this dataset
Academic & Research Attribution
@misc{hf_dataset__sarahcen__llm_election_data_2024,
  author = {sarahcen},
  title = {LLM Election Data 2024 Dataset},
  year = {2026},
  howpublished = {\url{https://huggingface.co/datasets/sarahcen/llm-election-data-2024}},
  note = {Accessed via Free2AITools Knowledge Fortress}
}
Technical Deep Dive
Dataset Specification
Data Release for Large-Scale, Longitudinal Survey of Large Language Models (LLMs) During the 2024 US Elections
Overview
This repository contains the questions asked of and responses given by LLMs during the 2024 US elections, collected for a longitudinal survey conducted from July 23, 2024 to November 12, 2024. The study is described in detail in the paper "Large-Scale, Longitudinal Study of Large Language Models During the 2024 US Election Season" by Sarah H. Cen, Andrew Ilyas, Hedi Driss, Charlotte Park, Aspen K. Hopkins, Chara Podimata, and Aleksander Madry.
As described below and in detail in our paper, 12 LLMs (some of which were equipped with internet access) were queried daily (with the exception of Claude 3 Opus, which was queried weekly) on a fixed set of questions over approximately 4 months. In addition, the same questions were asked of Google search as a baseline. The questions varied by type and topic, and each question was asked multiple times with different prompt variations (e.g., different steering and different instructions). This repository contains the questions, responses, and other relevant documentation as well as sample code.
Usage and Quick Start
The file sample_code.py contains an example of how to load and filter the data.
Summary of Data Collection Procedure
Details about the data collection procedure can be found in the full paper. A brief description is provided below.
Models
We queried 12 models:
- We used model APIs to directly query 9 LLMs (all "offline," i.e., without internet access, except Perplexity, which is "online").
- For the remaining 3 models/systems, we simulated querying models equipped with internet access by combining model APIs with Google search (using Serper) via LangChain.
- We directly queried Google search using SerpAPI as a baseline.
The models are given in reference_jsons/models_and_questions.json.
All models were queried daily, with the exception of Claude 3 Opus, which was queried weekly.
Note that Perplexity used llama-3.1-sonar-large-32k-online before Sept. 3 and llama-3.1-sonar-large-128k-online thereafter.
Question Taxonomy and Generation
All questions were hand-crafted according to the taxonomy and procedure given below.
The final questions can be found in reference_jsons/all_questions_flattened.json.
Taxonomy. All questions are first divided by type (exo, endo, or baseline), then category, and then subcategory. See reference_jsons/election_questions_taxonomy.json for the full question taxonomy.
Question generation procedure. The final questions were generated in three steps:
1. Each question is associated with a base question template, such as "What is {candidate}'s position on {issue} as a political issue in the 2024 US presidential election?", where candidate and issue are placeholders.
2. The placeholders are substituted with all combinations of placeholder values, as specified in reference_jsons/election_questions_taxonomy.json. This creates what we refer to as the pre-prompt questions.
3. Each pre-prompt question results in 22 final questions, which are generated by modifying each pre-prompt question using the prompt variations given in reference_jsons/prompt_variations.json. (Note that the default prompt variation "{}" is referred to as none in each CSV's prompt_type column.)
In summary, the question generation process is [question template] -> replace placeholders -> [pre-prompt question] -> add prompt variation -> [final question]. The full set of unique questions is in reference_jsons/all_questions_flattened.json.
Baseline questions.
We added 32 baseline questions that are not related to the election. The baseline questions are taken from well-known benchmark datasets and are given in reference_jsons/stable_baselines_taxonomy.json.
Dataset
The collected data is given in CSVs in the raw_data folder, in which there are nested folders with the following structure:
raw_data/
└── [model]/
    └── [question type]/
        └── [question category]/
            └── [pre-prompt question hash]/
                └── [prompt hash].csv
Explanation:
Each CSV file contains the responses of a specific LLM (as given by the outermost folder, model) to a specific question. Each question is uniquely specified by the pre-prompt question hash and the prompt hash, which correspond to the [pre-prompt question hash] folder and the [prompt hash] filename, respectively. The nested folders [question type] and [question category] subdivide the questions/CSVs for navigability.
- [model] - Name of model queried. Full list of models given in reference_jsons/models_and_questions.json. Specific endpoints/snapshots are described in the full paper.
- [question type] - Type of question (endo, exo, or baseline). See reference_jsons/models_and_questions.json.
- [question category] - Category of question (e.g., election issues). See reference_jsons/models_and_questions.json.
- [pre-prompt question hash] - Unique hash identifier for each unique pre-prompt question (i.e., the question before prompt variation). The hash generation code is given below.
- [prompt hash] - Unique hash identifier for each prompt variation/type. Unique prompt variations are in reference_jsons/prompt_variations.json, and the hash generation code is given below.
Hash Generation
The following code shows how the pre-prompt question hashes, prompt type hashes, and base question template hashes were generated (which can be found in reference_jsons/pre_prompt_q_hash_mapping.json, reference_jsons/prompt_type_hash_mapping.json, and reference_jsons/base_q_template_hash_mapping.json). It also shows how to obtain the dictionaries mapping hash to original string and vice versa.
import hashlib

import pandas as pd  # needed to load the raw_data CSVs into the DataFrame df below

def hash_string(input_string):
    """Hashes a string using SHA-256 and returns the hexadecimal digest."""
    hash_object = hashlib.sha256(input_string.encode())
    return hash_object.hexdigest()

def create_hash_mapping(df, column_name):
    """Creates a dictionary mapping hashes to original strings from a DataFrame column."""
    hash_dict = {hash_string(value): value for value in df[column_name].unique()}
    return hash_dict

def load_rev_mapping(mapping):
    """Builds the reverse mapping (original string -> hash) from a hash mapping."""
    return {v: k for k, v in mapping.items()}

# df is assumed to be a DataFrame loaded from one of the raw_data CSVs, e.g.:
# df = pd.read_csv("raw_data/<model>/<question type>/<question category>/<pre-prompt question hash>/<prompt hash>.csv")

# How hashes for base_q_template, pre_prompt_q, and prompt_type were generated
df.loc[:, "base_q_template_hash"] = df["base_q_template"].apply(hash_string)
df.loc[:, "pre_prompt_q_hash"] = df["pre_prompt_q"].apply(hash_string)
df.loc[:, "prompt_type_hash"] = df["prompt_type"].apply(hash_string)

# How to get dictionaries mapping base_q_template, pre_prompt_q, and prompt_type
# from hash to original string for all unique strings in the DataFrame
base_q_template_hash_dict = create_hash_mapping(df, "base_q_template")
pre_prompt_q_hash_dict = create_hash_mapping(df, "pre_prompt_q")
prompt_type_hash_dict = create_hash_mapping(df, "prompt_type")

# How to get the inverse mapping (from original string to hash) from the hash mapping
inv_base_q_template_hash = load_rev_mapping(base_q_template_hash_dict)
inv_pre_prompt_q_hash = load_rev_mapping(pre_prompt_q_hash_dict)
inv_prompt_type_hash = load_rev_mapping(prompt_type_hash_dict)
Data Format
Each CSV file in raw_data contains the following columns:
| Column Name | Description |
|---------------------------|-------------|
| model | Name of the model queried |
| date | Date of query |
| type | Question type (endo, exo, or baseline) |
| category | Question category (e.g., election issues) |
| subcategory | Question subcategory (often omitted) |
| prompt_type | Type of prompt variation applied to the pre_prompt_q |
| frequency | How often this question was asked (daily or weekly) |
| base_q_template | Base template from which the question was derived |
| placeholders | Dictionary of any placeholders in the base template and the placeholders' values |
| pre_prompt_q | Question after placeholders have been applied to base_q_template (before prompt variation) |
| question | Full query/question text (after placeholders replaced in template and prompt variation applied) |
| response | Model's response |
| timestamp | Timestamp when model was queried |
| answer | For baseline questions only, the correct answer |
| base_q_template_hash | Hash of the base question template base_q_template |
| pre_prompt_q_hash | Hash of the pre-prompt question pre_prompt_q |
| prompt_type_hash | Hash corresponding to the prompt variation prompt_type |
| query_type | Type of question phrasing (this is not utilized in our analyses) |
| variation_type | Whether template has any placeholders (this is not utilized in our analyses) |
Citation
If you use this dataset in your research, please cite:
@inproceedings{largescale2025cen,
  author = {Cen, Sarah H. and Ilyas, Andrew and Driss, Hedi and Park, Charlotte and Hopkins, Aspen and Podimata, Chara and Madry, Aleksander},
  title = {Large-Scale, Longitudinal Study of Large Language Models During the 2024 {US} Election Season},
  year = {2025}
}
Contact
For questions or issues, please reach out to Sarah Cen at [email protected]
Last Updated: September 16, 2025
Dataset Transparency Report
Technical metadata sourced from upstream repositories.
Identity & Source
- id: hf-dataset--sarahcen--llm-election-data-2024
- slug: sarahcen--llm-election-data-2024
- source: huggingface
- author: sarahcen
- license: MIT
- tags: task_categories:text-generation, task_categories:question-answering, language:en, license:mit, arxiv:2509.18446, region:us, elections, politics, llm-evaluation, longitudinal-study, us-election-2024
Engagement & Metrics
- downloads: 63,542
- stars: 3
- forks: 0
Data indexed from public sources. Updated daily.