🪐 Introducing RealPBT: A large-scale dataset of property-based tests

Community Article Published January 28, 2026


We're releasing RealPBT, a comprehensive dataset of over 13 million test records extracted from real-world Python and TypeScript codebases. This dataset gives researchers and practitioners unprecedented insight into how developers write and structure property-based tests, and, at a higher level, into how real-world software engineers specify software.

What are property-based tests?

Property-based tests (PBTs) differ fundamentally from traditional unit tests. While a unit test checks specific examples:

def test_reverse_list():
    assert reverse([1, 2, 3]) == [3, 2, 1]

A property-based test verifies general properties across many randomly-generated inputs:

from hypothesis import given
from hypothesis.strategies import integers, lists

@given(lists(integers()))
def test_reverse_twice_is_identity(xs):
    assert reverse(reverse(xs)) == xs

This single PBT automatically checks the property with hundreds of different lists, catching edge cases developers might never think to write by hand.
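The mechanics can be sketched without any framework: generate many random inputs and assert the property on each one. This is only a minimal stdlib sketch; real PBT libraries like Hypothesis add smarter input generation and automatic shrinking of failing examples.

```python
import random

def reverse(xs):
    return xs[::-1]

def random_int_list():
    # Stand-in for Hypothesis's lists(integers()) strategy:
    # a random-length list of random integers.
    return [random.randint(-100, 100) for _ in range(random.randint(0, 20))]

def check_reverse_twice_is_identity(trials=200):
    # Hand-rolled property check: assert the property on many random inputs.
    for _ in range(trials):
        xs = random_int_list()
        assert reverse(reverse(xs)) == xs
    return trials
```

A real framework would also report a minimal counterexample when the property fails, which is where most of the engineering effort in PBT libraries goes.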

Why do we care?

From an academic perspective, property-based tests are considered an entry point to formal methods: a collection of mathematical techniques for software assurance, the most powerful of which produce concrete proofs of software correctness. Since roughly the 1980s, formal methods have offered a clear vision for how real-world software can be assured against bugs and exploitation, yet they have remained unpopular in practice. We therefore envision this dataset as providing anthropological clarity for researchers who want to see how software engineers outside academia use formal methods, when they do use them. We hope such work can then "bubble up" into HCI research on more powerful tools like theorem provers, which need to be made more human-friendly.

Another reason to care about this data is that it offers a corpus of real-world software properties which humans found interesting enough to write down but couldn't prove. In a follow-on dataset, which we will publish soon, we will formalize these properties in Lean as an interactive theorem-proving benchmark.

Another area of active research concerns getting LLMs to write good PBTs. We hope this dataset can aid research in that direction as well, e.g., by enabling multi-shot inference.

What We Scraped

The RealPBT dataset contains:

  • 54,345 Python PBTs with full code quality metrics (complexity, maintainability, Halstead measures)
  • 6,283 TypeScript PBTs

For the Python PBTs, we additionally extracted:

  • 6.3M Python functions: these are the functions under test
  • 6.8M Python unit tests: these are unit tests in the repos which test the same functions that the PBTs test
  • Dependency graphs mapping which functions each test calls
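To make the table shapes concrete, a PBT record can be pictured roughly as follows. The `name`, `repo`, `dependencies`, and `metrics` fields follow the access patterns used in the loading code later in this post; the specific values and the metric key shown are purely illustrative.

```python
# Hypothetical, abbreviated PBT record; real records contain more fields.
example_pbt = {
    "name": "test_reverse_twice_is_identity",
    "repo": {"name": "example-org/example-repo"},  # source repository
    "dependencies": ["reverse"],                   # functions the test calls
    "metrics": {"cyclomatic_complexity": 1},       # illustrative metric key
}

# Unit tests and functions link back to PBTs through shared identifiers,
# e.g. the function names listed in `dependencies`.
assert "reverse" in example_pbt["dependencies"]
```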

Example: Dependency Analysis

Consider this test from CPython's test suite:

def test_dawg(self):
    # Property-based test for DAWG data structure
    ...

Our dataset includes not just the test code, but its complete dependency graph showing all 26 functions it calls, from decode_varint_unsigned to build_compression_dawg. This reveals the test's complexity and helps researchers understand testing patterns in large projects like CPython. It also provides enough detail to, e.g., construct new Lean theorem-proving challenges, as alluded to above (more to come!).


Using the Dataset

from datasets import load_dataset

# Load individual tables
pbts = load_dataset("Benchify/realpbt", data_files="pbts.jsonl", split="train")
ts_pbts = load_dataset("Benchify/realpbt", data_files="pbts_typescript.jsonl", split="train")
unit_tests = load_dataset("Benchify/realpbt", data_files="unit_tests.jsonl", split="train")
functions = load_dataset("Benchify/realpbt", data_files="functions.jsonl", split="train")

Tables connect through shared identifiers. Python PBTs include dependency information showing which functions they call:

# Example: Find functions called by a specific PBT
pbt = pbts[0]
print(f"PBT: {pbt['name']}")
print(f"Repository: {pbt['repo']['name']}")
print(f"Dependencies: {pbt['dependencies']}")

# The dependencies list contains function names that can be matched
# against the functions table to build complete call graphs
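That matching step can be sketched by indexing the functions table by name and resolving each PBT's dependency list against it. This is a sketch over toy records with the `name` and `dependencies` fields shown above; in practice you may need repository-scoped matching to disambiguate functions that share a name across repos.

```python
def build_call_graph(pbts, functions):
    # Index the functions table by name, then map each PBT to the
    # function records it calls; unresolved names are skipped.
    functions_by_name = {fn["name"]: fn for fn in functions}
    return {
        pbt["name"]: [
            functions_by_name[dep]
            for dep in pbt["dependencies"]
            if dep in functions_by_name
        ]
        for pbt in pbts
    }

# Toy records illustrating the join:
toy_functions = [{"name": "reverse", "code": "def reverse(xs): return xs[::-1]"}]
toy_pbts = [{"name": "test_reverse_twice_is_identity",
             "dependencies": ["reverse", "some_missing_fn"]}]

graph = build_call_graph(toy_pbts, toy_functions)
```

Here `graph` maps each PBT name to the resolved function records, silently dropping dependencies with no match in the functions table.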

If you prefer working with SQL, you can reconstruct the original database:

import sqlite3
import json

conn = sqlite3.connect('pbts.db')
cursor = conn.cursor()

# Create tables
cursor.execute('''
    CREATE TABLE pbts (
        id INTEGER PRIMARY KEY,
        name TEXT,
        code TEXT,
        source_file TEXT,
        repository_name TEXT,
        language TEXT,
        dependencies TEXT,
        metrics TEXT
    )
''')

# Load and insert PBTs
with open('pbts.jsonl', 'r') as f:
    for line in f:
        pbt = json.loads(line)
        cursor.execute(
            'INSERT INTO pbts VALUES (?, ?, ?, ?, ?, ?, ?, ?)',
            (pbt['id'], pbt['name'], pbt['code'], pbt['source_file'],
             pbt['repo']['name'], pbt['language'],
             json.dumps(pbt.get('dependencies', [])),
             json.dumps(pbt.get('metrics', {})))
        )

conn.commit()
conn.close()
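Once reconstructed, the database can be queried with plain SQL, for example to count PBTs per repository. This is a self-contained sketch using an in-memory database with the same schema as above and a few made-up rows:

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE pbts (
        id INTEGER PRIMARY KEY, name TEXT, code TEXT, source_file TEXT,
        repository_name TEXT, language TEXT, dependencies TEXT, metrics TEXT
    )
""")

# Made-up rows for illustration only.
rows = [
    (1, "test_a", "...", "tests/a.py", "org/repo1", "python", json.dumps(["f"]), "{}"),
    (2, "test_b", "...", "tests/b.py", "org/repo1", "python", "[]", "{}"),
    (3, "test_c", "...", "tests/c.py", "org/repo2", "python", "[]", "{}"),
]
conn.executemany("INSERT INTO pbts VALUES (?, ?, ?, ?, ?, ?, ?, ?)", rows)

# Count PBTs per repository.
counts = dict(conn.execute(
    "SELECT repository_name, COUNT(*) FROM pbts GROUP BY repository_name"
).fetchall())
# counts == {"org/repo1": 2, "org/repo2": 1}
conn.close()
```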

Acknowledgments

This work was led by myself (Max von Hippel). Benchify, Inc. provided the funding and infrastructure. Evan Boehs and Jake Ginesin built the scraper, and Juan Castaño helped with the database and AWS. The Dartmouth DALI Lab worked on extending the project to include TypeScript. Sekpey Herbert Setor Kwame at DALI worked on TypeScript PBT summarization.
