MultiPL-E
| Entity Passport | |
|---|---|
| Registry ID | hf-dataset--nuprl--multipl-e |
| License | mit |
| Provider | huggingface |
Cite this dataset
Academic & Research Attribution
@misc{hf_dataset__nuprl__multipl_e,
author = {nuprl},
title = {MultiPL-E Dataset},
year = {2026},
howpublished = {\url{https://huggingface.co/datasets/nuprl/multipl-e}},
note = {Accessed via Free2AITools Knowledge Fortress}
}
Dataset Specification
Dataset Card for MultiPL-E
Dataset Description
- Repository: https://github.com/nuprl/MultiPL-E
- Paper: https://ieeexplore.ieee.org/abstract/document/10103177
- Point of Contact: [email protected], [email protected], [email protected]
Dataset Summary
MultiPL-E is a dataset for evaluating large language models for code generation that supports 22 programming languages. It takes the OpenAI HumanEval and the Mostly Basic Python Programs (MBPP) benchmarks and uses little compilers to translate them to other languages. It is easy to add support for new languages and benchmarks.
The dataset is divided into several configurations named SRCDATA-LANG, where SRCDATA is either "humaneval" or "mbpp" and LANG is one of the supported languages. We use the canonical file extension for each language to identify the language, e.g., "cpp" for C++, "lua" for Lua, "clj" for Clojure, and so on.
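As a quick, hedged illustration (not taken from the dataset card itself), a single configuration can be loaded with the Hugging Face datasets library; the config name "humaneval-cpp" below follows the SRCDATA-LANG scheme described above, and a "test" split is assumed.

```python
# Minimal sketch: load one SRCDATA-LANG configuration with the `datasets` library.
# The config name and the "test" split are assumptions based on the naming scheme
# described above, not quoted from the card.
from datasets import load_dataset

problems = load_dataset("nuprl/MultiPL-E", "humaneval-cpp", split="test")

example = problems[0]
print(example["name"])    # problem identifier
print(example["prompt"])  # the C++ prompt a model is asked to complete
```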
Using MultiPL-E
MultiPL-E is part of the BigCode Code Generation LM Harness. This is the easiest way to use MultiPL-E.
MultiPL-E has its own evaluation framework that supports proprietary models, the prompt ablations, more source benchmarks, and more recently added programming languages. See the MultiPL-E tutorial on how to use this framework directly.
The MultiPL-E Ablations
The MultiPL-E paper presented several ablations of the prompt for the original
set of programming languages. We do not include them in the current version of
MultiPL-E, but they are still available in this repository from revision
d23b094 or earlier. (You can optionally pass the revision to
datasets.load_dataset.)
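A hedged sketch of what that looks like, assuming the abbreviated revision string from the text (d23b094) is accepted as-is; the ablation config names themselves are listed next.

```python
# Sketch: pin the older revision that still contains the prompt ablations.
# The revision hash comes from the text above; the ablation config name
# (e.g. "humaneval-cpp-keep") follows the SRCDATA-LANG-keep pattern described below.
from datasets import load_dataset

ablation = load_dataset(
    "nuprl/MultiPL-E",
    "humaneval-cpp-keep",   # an ablation config, per the list below
    revision="d23b094",     # older revision that still includes the ablations
    split="test",
)
```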
These are the prompt variations:
- SRCDATA-LANG-keep is the same as SRCDATA-LANG, but the text of the prompt is totally unchanged. If the original prompt had Python doctests, they remain as Python instead of being translated to LANG. If the original prompt had Python-specific terminology, e.g., "list", it remains "list", instead of being translated, e.g., to "vector" for C++.
- SRCDATA-LANG-transform transforms the doctests to LANG but leaves the natural language text of the prompt unchanged.
- SRCDATA-LANG-removed removes the doctests from the prompt.
Note that MBPP does not have any doctests, so the "removed" and "transform" variations are not available for MBPP.
Changelog
Version 3.3
This update fixes a Lua bug. We had a spurious stop token that would have negatively impacted all Lua results. Re-evaluating models on Lua with this fix should produce results that are identical or slightly higher. See Issue 165 for more information.
Version 3.2
MultiPL-E now supports Ada, thanks to Rowan Walshe. Rowan identified some issues that likely have a small negative impact on the benchmark scores for existing languages. We have not updated the prompts for those languages at this time. See the discussions PR 162 and PR 163.
Version 3.1.1
This version fixes a bug that affected some TypeScript problems, thanks to Niels Mündler. The issue impacts MBPP-based problems. The fix also changes whitespace in a few HumanEval-based problems, which should be insignificant. These are the relevant changes:
=== mbpp-ts_prompt_mbpp_253_count_integer.diff ===
- function count_integer(list1: number| string| number[]): number {
+ function count_integer(list1: (number | string | number)[]): number {
=== mbpp-ts_prompt_mbpp_278_count_first_elements.diff ===
- function count_first_elements(test_tup: number| [number, number][]): number {
+ function count_first_elements(test_tup: (number | [number, number])[]): number {
=== mbpp-ts_prompt_mbpp_294_max_val.diff ===
- function max_val(listval: string| number[]): number {
+ function max_val(listval: (string | number)[]): number {
=== mbpp-ts_prompt_mbpp_297_flatten_list.diff ===
- function flatten_list(list1: number| number[][]): number[] {
+ function flatten_list(list1: (number | number[])[]): number[] {
=== mbpp-ts_prompt_mbpp_405_check_tuplex.diff ===
- function check_tuplex(tuplex: string| number[], tuple1: any): boolean {
+ function check_tuplex(tuplex: (string | number)[], tuple1: any): boolean {
=== mbpp-ts_prompt_mbpp_410_min_val.diff ===
- function min_val(listval: string| number[]): number {
+ function min_val(listval: (string | number)[]): number {
=== mbpp-ts_prompt_mbpp_419_round_and_sum.diff ===
- function round_and_sum(list1: number| number[]): number {
+ function round_and_sum(list1: (number | number)[]): number {
=== mbpp-ts_prompt_mbpp_65_recursive_list_sum.diff ===
- function recursive_list_sum(data_list: number| number[][]): number {
+ function recursive_list_sum(data_list: (number | number[])[]): number {
=== mbpp-ts_prompt_mbpp_755_second_smallest.diff ===
- function second_smallest(numbers: number| number[]): number | undefined {
+ function second_smallest(numbers: (number | number)[]): number | undefined {
See GitHub Issue 160 for more information.
Version 3.1
MultiPL-E now supports Dart, thanks to Devon Carew.
Version 3.0
This is the first significant update since MultiPL-E was used in StarCoder 1.
- The dataset was versioned at 3.0, and we are bumping the software version to stay in sync.
- We no longer publish the MultiPL-E ablations, but they are available in revision d23b094 and earlier.
- New programming languages supported:
- Clojure, thanks to Alex Miller
- Elixir, thanks to Marko Vukovic
- Haskell, thanks to Thomas Dwyer
- OCaml, thanks to John Gouwar
- Changes to existing HumanEval-based problems:
- Four Scala problems have fixed prompts/tests (12, 90, 128, 162).
- Some whitespace-only changes to problems for Racket (18 problems), R (36 problems), Julia (159 problems), and D (156 problems). We will try to avoid these kinds of changes in the future.
- The MBPP-based problems have changes analogous to the HumanEval-based problems.
See the directory diffs_v3.0 in the dataset repository for the diffs to
each prompt.
Version 0.5.0
Instruction-following support and new languages
- New languages: Luau, Elixir, Lean, Coq, Dafny
- Support for instruction-following prompts
- vLLM support for faster evaluation
Version 0.4.0
QoL improvements and new languages
- New languages: OCaml, MATLAB
- Using .jsonl instead of .json for prompts
- Several bugfixes to prompts
Version 0.3.0
This version was used to evaluate StarCoder
This version corrects several bugs in prompts and test cases that resulted in lower pass@k rates for some of the statically typed languages. The most significant difference is that the pass@k for Java increases by about 2% on HumanEval.
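For readers unfamiliar with the metric, pass@k here is the standard unbiased estimator popularized by HumanEval, which MultiPL-E inherits from its source benchmarks; the sketch below illustrates that formula and is not the project's own evaluation code.

```python
# Illustrative sketch of the standard unbiased pass@k estimator
# (1 - C(n-c, k) / C(n, k)), computed as a numerically stable product.
# Not necessarily the exact implementation used by MultiPL-E.
import math

def pass_at_k(n: int, c: int, k: int) -> float:
    """n: samples generated per problem, c: samples that passed, k: budget."""
    if n - c < k:
        return 1.0
    return 1.0 - math.prod(1.0 - k / i for i in range(n - c + 1, n + 1))

# Example: 200 samples per problem, 37 passing, estimate pass@10.
print(round(pass_at_k(200, 37, 10), 4))
```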
Version 0.2.0
This version was used to evaluate SantaCoder
Structured Schema (Zero-Fabrication)
| Feature Key | Data Type |
|---|---|
| name | string |
| language | string |
| prompt | string |
| doctests | string |
| original | string |
| prompt_terminology | string |
| tests | string |
| stop_tokens | unknown |
Estimated Rows: 157
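As a hedged sketch of how these fields typically fit together in a MultiPL-E-style harness (an assumption, not the project's own code): the prompt is concatenated with a model completion, the completion is truncated at the first stop token, and the tests are appended to produce an executable program. This also assumes stop_tokens is a list of strings, which the schema above leaves as unknown.

```python
# Hypothetical helper illustrating how the fields above might be combined.
# Assumes row["stop_tokens"] is a list of strings; the schema marks it unknown.
def assemble_program(row: dict, completion: str) -> str:
    # Truncate the completion at the earliest occurrence of any stop token.
    cut = len(completion)
    for token in row["stop_tokens"]:
        idx = completion.find(token)
        if idx != -1:
            cut = min(cut, idx)
    # Prompt + truncated completion + unit tests = a runnable candidate program.
    return row["prompt"] + completion[:cut] + "\n" + row["tests"]
```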
Dataset Transparency Report
Technical metadata sourced from upstream repositories.
Identity & Source
- id: hf-dataset--nuprl--multipl-e
- slug: nuprl--multipl-e
- source: huggingface
- author: nuprl
- license: mit
- tags: annotations_creators:machine-generated, language_creators:machine-generated, language_creators:expert-generated, multilinguality:monolingual, source_datasets:original, source_datasets:extended|openai_humaneval, source_datasets:extended|mbpp, language:en, license:mit, size_categories:10k<n<100k, format:parquet, modality:text, library:datasets, library:pandas, library:mlcroissant, library:polars, arxiv:2301.03988, arxiv:2305.06161, doi:10.57967/hf/4446, region:us
Technical Specs
- architecture: null
- params billions: null
- context length: null
- pipeline tag:
Engagement & Metrics
- downloads: 73,765
- stars: 65
- forks: 0
Data indexed from public sources. Updated daily.