Buckets: AvaSiG / codeparrot-bucket

71.1 GB

3,182 files

Updated about 1 month ago

Name	Size	Uploaded	Xet hash
data		about 1 month ago	97 items
.gitattributes	2.72 kB xet	about 1 month ago	47863cb1
LICENSE	4.06 kB xet	about 1 month ago	bade2cbe
README.md	11.4 kB xet	about 1 month ago	a53f5e34
compbiobench.v1.tsv	57.4 kB xet	about 1 month ago	a9eb6605
config.json	812 Bytes xet	about 1 month ago	a8d73aec
corpus.parquet	7.68 GB xet	about 1 month ago	a175b34c
file-000000000000.json.gz	255 MB xet	about 2 months ago	d46639a7
file-000000000001.json.gz	253 MB xet	about 2 months ago	a6081d67
file-000000000002.json.gz	254 MB xet	about 2 months ago	6667ec7b
file-000000000003.json.gz	246 MB xet	about 2 months ago	bb0bfb6b
file-000000000004.json.gz	252 MB xet	about 2 months ago	3dd34630
file-000000000005.json.gz	255 MB xet	about 2 months ago	7661227a
file-000000000006.json.gz	253 MB xet	about 2 months ago	a526f120
file-000000000007.json.gz	252 MB xet	about 2 months ago	79ddb631
file-000000000008.json.gz	253 MB xet	about 2 months ago	4a000ff7
file-000000000009.json.gz	252 MB xet	about 2 months ago	21ec9b47
file-000000000010.json.gz	250 MB xet	about 2 months ago	3ea3a30a
file-000000000011.json.gz	253 MB xet	about 2 months ago	12d84da1
file-000000000012.json.gz	254 MB xet	about 2 months ago	8abc665c
file-000000000013.json.gz	253 MB xet	about 2 months ago	fc284002
file-000000000014.json.gz	254 MB xet	about 2 months ago	05ae9a06
file-000000000015.json.gz	255 MB xet	about 2 months ago	1dfee942
file-000000000016.json.gz	250 MB xet	about 2 months ago	1ebed8d8
file-000000000017.json.gz	254 MB xet	about 2 months ago	40eedda4
file-000000000018.json.gz	251 MB xet	about 2 months ago	4c63abde
file-000000000019.json.gz	250 MB xet	about 2 months ago	240a77d5
file-000000000020.json.gz	250 MB xet	about 2 months ago	2cef1381
file-000000000021.json.gz	252 MB xet	about 2 months ago	f63b1f3d
file-000000000022.json.gz	254 MB xet	about 2 months ago	2e628d92
file-000000000023.json.gz	250 MB xet	about 2 months ago	066b4f41
file-000000000024.json.gz	253 MB xet	about 2 months ago	5916c1e9
file-000000000025.json.gz	251 MB xet	about 2 months ago	95d8856f
file-000000000026.json.gz	252 MB xet	about 2 months ago	54d6e994
file-000000000027.json.gz	253 MB xet	about 2 months ago	59528c0f
file-000000000028.json.gz	249 MB xet	about 2 months ago	76d69130
file-000000000029.json.gz	251 MB xet	about 2 months ago	c0263ef1
file-000000000030.json.gz	251 MB xet	about 2 months ago	e2043a67
file-000000000031.json.gz	252 MB xet	about 2 months ago	dc196ddc
file-000000000032.json.gz	254 MB xet	about 2 months ago	6765782d
file-000000000033.json.gz	251 MB xet	about 2 months ago	b5e492ab
file-000000000034.json.gz	249 MB xet	about 2 months ago	606e4b3c
file-000000000035.json.gz	252 MB xet	about 2 months ago	33ce5c52
file-000000000036.json.gz	252 MB xet	about 2 months ago	c23d2583
file-000000000037.json.gz	252 MB xet	about 2 months ago	66fec262
file-000000000038.json.gz	250 MB xet	about 2 months ago	5e2256f1
file-000000000039.json.gz	253 MB xet	about 2 months ago	27b419bd
file-000000000040.json.gz	255 MB xet	about 2 months ago	ddc1e437
file-000000000041.json.gz	252 MB xet	about 2 months ago	68be35a7
file-000000000042.json.gz	256 MB xet	about 2 months ago	92f29574
file-000000000043.json.gz	254 MB xet	about 2 months ago	3eb92441
file-000000000044.json.gz	251 MB xet	about 2 months ago	0db2c8e7
file-000000000045.json.gz	249 MB xet	about 2 months ago	5ef4c4a2
file-000000000046.json.gz	252 MB xet	about 2 months ago	aad4b56e
file-000000000047.json.gz	251 MB xet	about 2 months ago	7724635f
file-000000000048.json.gz	250 MB xet	about 2 months ago	417741a3
file-000000000049.json.gz	251 MB xet	about 2 months ago	53b4f6a8
file-000000000050.json.gz	250 MB xet	about 2 months ago	134f5701
file-000000000051.json.gz	254 MB xet	about 2 months ago	4b8e3cca
file-000000000052.json.gz	253 MB xet	about 2 months ago	0f4d9923
file-000000000053.json.gz	255 MB xet	about 2 months ago	507cb000
file-000000000054.json.gz	253 MB xet	about 2 months ago	c1e5ef38
file-000000000055.json.gz	252 MB xet	about 2 months ago	f0a32674
file-000000000056.json.gz	251 MB xet	about 2 months ago	4d7deb07
file-000000000057.json.gz	254 MB xet	about 2 months ago	b765af17
file-000000000058.json.gz	250 MB xet	about 2 months ago	d334560f
file-000000000059.json.gz	250 MB xet	about 2 months ago	edf5c6fc
file-000000000060.json.gz	252 MB xet	about 2 months ago	c969f8c8
file-000000000061.json.gz	251 MB xet	about 2 months ago	e938cbb7
file-000000000062.json.gz	252 MB xet	about 2 months ago	63fdcc03
file-000000000063.json.gz	252 MB xet	about 2 months ago	fb1459e5
file-000000000064.json.gz	251 MB xet	about 2 months ago	a62601eb
file-000000000065.json.gz	253 MB xet	about 2 months ago	bb0ec036
file-000000000066.json.gz	252 MB xet	about 2 months ago	b29e3ea1
file-000000000067.json.gz	249 MB xet	about 2 months ago	3526cee4
file-000000000068.json.gz	249 MB xet	about 2 months ago	eeba9b73
file-000000000069.json.gz	252 MB xet	about 2 months ago	6d5d06d7
file-000000000070.json.gz	256 MB xet	about 2 months ago	b39f1b5e
file-000000000071.json.gz	253 MB xet	about 2 months ago	af7c3cf5
file-000000000072.json.gz	251 MB xet	about 2 months ago	8bcdfa6c
file-000000000073.json.gz	252 MB xet	about 2 months ago	052511f8
file-000000000074.json.gz	253 MB xet	about 2 months ago	a4d1362b
file-000000000075.json.gz	255 MB xet	about 2 months ago	b5a0237a
file-000000000076.json.gz	255 MB xet	about 2 months ago	d7eaf72b
file-000000000077.json.gz	251 MB xet	about 2 months ago	c2b33f26
file-000000000078.json.gz	254 MB xet	about 2 months ago	a274cb49
file-000000000079.json.gz	253 MB xet	about 2 months ago	0a9ec674
file-000000000080.json.gz	252 MB xet	about 2 months ago	b6eeff7c
file-000000000081.json.gz	253 MB xet	about 2 months ago	22ab0f50
file-000000000082.json.gz	251 MB xet	about 2 months ago	ac33bd29
file-000000000083.json.gz	254 MB xet	about 2 months ago	77fff7ee
file-000000000084.json.gz	252 MB xet	about 2 months ago	d90bde94
file-000000000085.json.gz	253 MB xet	about 2 months ago	18ced15d
file-000000000086.json.gz	253 MB xet	about 2 months ago	0e3169c3
file-000000000087.json.gz	256 MB xet	about 2 months ago	baf8dcba
file-000000000088.json.gz	250 MB xet	about 2 months ago	5ccaa1a3
file-000000000089.json.gz	253 MB xet	about 2 months ago	22b8c55f
file-000000000090.json.gz	252 MB xet	about 2 months ago	8deb8ec8
file-000000000091.json.gz	250 MB xet	about 2 months ago	3a8fb01c
file-000000000092.json.gz	253 MB xet	about 2 months ago	5f32e57b

README.md

Privasis-Zero

Dataset Description:

Privasis-Zero is a large-scale synthetic dataset consisting of diverse text records—such as medical and financial records, legal documents, emails, and messages—containing rich, privacy-sensitive information. Each record includes synthetic profile details, surrounding social context, and annotations of privacy-related content. All data are fully generated using LLMs, supplemented with first names sourced from the U.S. Social Security Administration’s public database.

The dataset is designed to support the training and evaluation of models or agents that operate on privacy-sensitive data. For example, it includes annotated text-sanitization instructions along with their corresponding sanitized outputs. The current release focuses on English-language content.

This dataset is for non-commercial/research and development purposes only.

Dataset Owner(s):

NVIDIA Corporation

Dataset Creation Date:

December 3rd, 2025

License/Terms of Use:

NVIDIA License

Additional Details

This dataset contains synthetic data generated using multiple large language models.
Each model contributes to one or more dataset subsets: General Corpus, Train Set, and Test Set.

The table below summarizes the inclusion of each model’s generations:

Model	General Corpus	Train Set	Test Set
Gemini-2.5-pro	✔️	❌	✔️
GPT-5	✔️	❌	✔️
Llama 4 Maverick	✔️	❌	✔️
Qwen3 235B Instruct	✔️	❌	✔️
GPT-OSS-120B	✔️	✔️	❌
Qwen3 Next 80B Instruct	✔️	✔️	❌
GPT-4.1	✔️	❌	❌
GPT-4.1-mini	✔️	❌	❌

General Corpus includes all models and represents the broadest portion of the dataset.

Train Set contains generations only from:

GPT-OSS-120B
Qwen3 Next 80B Instruct

Test Set contains generations only from:

Gemini-2.5-pro
GPT-5
Llama 4 Maverick
Qwen3 235B Instruct

Corpus Columns

Column	Type	Description
`id`	`str`	SHA-256 hash identifier for the record.
`record_tags`	`list[str]`	Category tags for the record.
`record`	`str`	The generated text containing PII and sensitive attributes.
`profile`	`str`	JSON string of the synthetic person profile.
`background_context`	`str`	Narrative context for the record.
`record_type`	`str`	Description of the document type.
`record_format`	`str`	Style/tone specification.
`attributes`	`str`	JSON string of annotated attributes.
`grouped_attributes`	`str`	JSON string of grouped attribute clusters.
`generator_model`	`str`	Model used to generate the record.
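
Because `profile`, `attributes`, and `grouped_attributes` are stored as JSON strings, they need to be parsed before use. The sketch below shows one way to do that on a single row represented as a plain dict. The helper name `decode_corpus_row` and the example row are hypothetical; only the column names come from the table above.

```python
import json

# Columns the table above describes as JSON strings.
JSON_COLUMNS = ("profile", "attributes", "grouped_attributes")


def decode_corpus_row(row: dict) -> dict:
    """Return a copy of a corpus row with its JSON-string columns parsed."""
    decoded = dict(row)  # shallow copy; the input row is left untouched
    for column in JSON_COLUMNS:
        value = decoded.get(column)
        if isinstance(value, str) and value:
            decoded[column] = json.loads(value)
    return decoded


# Hypothetical row for illustration; real rows contain more fields.
example_row = {
    "id": "0" * 64,
    "record_tags": ["medical"],
    "record": "Reminder: Ana, your appointment is on Friday.",
    "profile": json.dumps({"first_name": "Ana", "age": 34}),
    "attributes": json.dumps({"profile": {"first_name": "Ana"}}),
    "grouped_attributes": json.dumps({"Personal Identifiers": {"first_name": "Ana"}}),
}
decoded_row = decode_corpus_row(example_row)
```

Rows in this shape could come from any Parquet reader, for example `pandas.read_parquet(path).to_dict("records")`, where `path` points at wherever the corpus file is stored.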

Eval Columns

All four eval JSONL files share the same 17-column schema.

Record Metadata

Column	Type	Description
`id`	`str`	SHA-256 hash identifier for the record.
`profile`	`dict`	Synthetic person profile containing demographic info (`first_name`, `last_name`, `sex`, `age`, `citizenship`, etc.) and an `event_list` describing the scenario.
`record_type`	`str`	Description of the document type (e.g., "SMS reminder from MyMedClinic.ro", "Handwritten note inside daily planner").
`background_context`	`str`	Narrative context explaining the circumstances under which the record was created.
`format`	`str`	Style/tone specification for the record (e.g., "Sticky Note Style", "Brief Status Alert").
`generator_model`	`str`	Model used to generate the record (e.g., `qwen3-235b`, `qwen3-80b`, `llama4-maverick`, `gemini-2.5-pro`).
`record_tags`	`list[str]`	Category tags for the record. Possible values: `admin`, `comms`, `creative`, `educational`, `finance`, `hr`, `legal`, `marketing`, `medical`, `notes`, `other`, `project`, `research`, `sales`, `tech`.

Original and Sanitized Records

Column	Type	Description
`original_record`	`str`	The original generated text containing PII and sensitive attributes.
`sanitized_record`	`str`	The sanitized version of the record with attributes abstracted, dropped, or kept per the instructions. Empty string (`""`) in `hard_test.jsonl` and `hard_valid.jsonl` — the hard split is intended for evaluation where models must produce sanitized outputs; our own sanitization pipeline failed on these records, so no reference sanitization is provided.
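
Since the hard split carries an empty `sanitized_record`, evaluation code usually needs to separate records that have a reference sanitization from those that do not. A minimal sketch, assuming each eval file is JSON Lines with one record per line; `parse_jsonl`, `split_by_reference`, and the sample records are hypothetical.

```python
import json


def parse_jsonl(text: str) -> list[dict]:
    """Parse JSON Lines text into a list of dicts, skipping blank lines."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def split_by_reference(records: list[dict]) -> tuple[list[dict], list[dict]]:
    """Split records into (with reference, without reference).

    A record counts as having a reference when `sanitized_record`
    is a non-empty string.
    """
    with_ref, without_ref = [], []
    for record in records:
        target = with_ref if record.get("sanitized_record") else without_ref
        target.append(record)
    return with_ref, without_ref


# Two made-up records: one with a reference, one shaped like the hard split.
sample_jsonl = "\n".join([
    json.dumps({"id": "a", "original_record": "Call Ana at 555-0100.",
                "sanitized_record": "Call the patient."}),
    json.dumps({"id": "b", "original_record": "Note for Ion.",
                "sanitized_record": ""}),
])
with_ref, without_ref = split_by_reference(parse_jsonl(sample_jsonl))
```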

Attribute Annotations

Column	Type	Description
`annotated_attributes`	`dict`	Flat annotation of all identified attributes, split into `profile` (identity-related) and `event` (scenario-related) sub-dicts. Each key is an attribute name, each value is the attribute's text.
`grouped_annotated_attributes`	`dict`	Same attributes as `annotated_attributes`, but grouped into semantically meaningful clusters (e.g., "Personal Identifiers", "Clinic Location and Provider Information"). Keys are group names, values are dicts of attributes.
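
The two annotation columns describe the same attributes in different shapes. The sketch below flattens `annotated_attributes` into rows and inverts `grouped_annotated_attributes` into an attribute-to-group lookup. The helper names and the sample attribute names are hypothetical, and the code assumes each group maps attribute names directly to values, as the description above suggests.

```python
def flatten_annotated_attributes(annotated: dict) -> list[tuple[str, str, object]]:
    """Turn the `profile` / `event` sub-dicts into (scope, name, value) rows."""
    rows = []
    for scope in ("profile", "event"):
        for name, value in (annotated.get(scope) or {}).items():
            rows.append((scope, name, value))
    return rows


def group_of_each_attribute(grouped: dict) -> dict[str, str]:
    """Invert {group_name: {attr_name: value}} into {attr_name: group_name}."""
    lookup = {}
    for group_name, attributes in grouped.items():
        for attr_name in attributes:
            lookup[attr_name] = group_name
    return lookup


# Made-up annotations for illustration only.
sample_annotated = {
    "profile": {"first_name": "Ana"},
    "event": {"event_date": "12 May"},
}
sample_grouped = {
    "Personal Identifiers": {"first_name": "Ana"},
    "Appointment Details": {"event_date": "12 May"},
}
flat_rows = flatten_annotated_attributes(sample_annotated)
attr_groups = group_of_each_attribute(sample_grouped)
```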

Sanitization Instructions

Column	Type	Description
`attributes_to_abstract`	`dict`	Attributes to generalize/anonymize. Contains `selected` (individual attrs or grouped attrs with `group_name`) and `group` (bool indicating whether a group-level abstraction was applied).
`attributes_to_drop`	`dict`	Attributes to remove entirely. Contains `selected` (dict of attr name to value, or `null`) and `group` (bool).
`attributes_to_keep`	`dict`	Attributes to retain as-is. Each key is an attribute name with a sub-dict containing `value`, `sanitization` (always `"keep"`), `group_name`, and `inference_from_original_record`.
`base_instruction`	`str`	Bullet-point sanitization instructions specifying how each attribute should be abstracted, dropped, or kept.
`smoothed_instruction`	`str`	Prose-form rewrite of `base_instruction` as a single coherent directive.
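
One simple use of these columns is a sanity check on a sanitized output: dropped values should no longer appear, and kept values should still be present. The sketch below does only a verbatim substring check, so it misses paraphrases and cannot judge abstractions. Both helper names and the sample record are hypothetical, and the code assumes `selected` in `attributes_to_drop` is either `null` or a dict of attribute name to value.

```python
def dropped_values_still_present(record: dict) -> list[str]:
    """Names of dropped attributes whose value still appears verbatim."""
    selected = (record.get("attributes_to_drop") or {}).get("selected") or {}
    sanitized = record.get("sanitized_record") or ""
    return [
        name for name, value in selected.items()
        if isinstance(value, str) and value and value in sanitized
    ]


def kept_values_missing(record: dict) -> list[str]:
    """Names of kept attributes whose value no longer appears verbatim."""
    keep = record.get("attributes_to_keep") or {}
    sanitized = record.get("sanitized_record") or ""
    return [
        name for name, info in keep.items()
        if isinstance(info.get("value"), str) and info["value"] not in sanitized
    ]


# Made-up record; real records carry more fields per attribute.
sample_record = {
    "sanitized_record": "Reminder from MyMedClinic: your visit is confirmed.",
    "attributes_to_drop": {"selected": {"phone_number": "555-0100"}, "group": False},
    "attributes_to_keep": {
        "clinic_name": {"value": "MyMedClinic", "sanitization": "keep",
                        "group_name": "Clinic Location and Provider Information"},
    },
}
leaky_record = dict(sample_record, sanitized_record="Call 555-0100 to confirm.")
```

These checks only make sense for records that have a sanitized text to inspect; for the hard split, apply them to a model's output rather than to the empty `sanitized_record`.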

Sanitization Trace

Column	Type	Description
`other_sanitization_details`	`dict`	Full provenance of the sanitization process. Contains three sub-fields described below.

`other_sanitization_details` sub-fields

decomposed_record — list[dict]

The original record split into text segments.

Field	Type	Description
`seq`	`str`	Text content of the segment.
`terminator`	`str`	Delimiter following this segment (e.g., `"\n\n"`).
`idx`	`int`	Sequence index.
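
Given these three fields, the record text can presumably be reassembled by concatenating each segment with its terminator in `idx` order. The sketch below assumes exactly that; whether the result matches `original_record` byte for byte is not stated in this README. `rebuild_record` and the sample segments are hypothetical.

```python
def rebuild_record(decomposed_record: list[dict]) -> str:
    """Concatenate `seq` + `terminator` for every segment, ordered by `idx`."""
    ordered = sorted(decomposed_record, key=lambda segment: segment["idx"])
    return "".join(segment["seq"] + segment["terminator"] for segment in ordered)


# Made-up segments, deliberately out of order.
sample_segments = [
    {"seq": "See you Friday.", "terminator": "", "idx": 1},
    {"seq": "Hi Ana,", "terminator": "\n\n", "idx": 0},
]
rebuilt = rebuild_record(sample_segments)
```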

sanitized_sequences_by_attribute — dict[str, list[dict]]

Keyed by attribute value (e.g., a person's name, a date). Each entry is a list of sequence objects showing the sanitized text and all identified spans for that attribute.

Each sequence object contains:

- `text`: sanitized text for this sequence
- `terminator`: segment delimiter
- `idx`: sequence index
- `spans`: dict keyed by attribute value, where each value is a list of span objects with these fields:
  - `attr`: attribute value (str or list)
  - `span`: matched text in the original
  - `location`: `[start, end]` character offsets
  - `confidence`: float (typically 1.0)
  - `attr_type`: attribute type (e.g., "event_date", "event_organizer")
  - `sanitization_option`: "abstract", "drop", or "keep"
  - `group_name`: semantic group this attribute belongs to
  - `merged_from`: (optional) list of sub-spans that were merged into this span
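
Span objects can be checked against the text they were found in. The sketch below makes two assumptions that this README does not confirm: `location` offsets index into the original text of the sequence, and the end offset is exclusive (Python slice semantics). The helper names and the sample data are hypothetical.

```python
def span_matches_location(original_text: str, span: dict) -> bool:
    """True if slicing the text at `location` yields the `span` string.

    Assumes end-exclusive offsets into the original sequence text.
    """
    start, end = span["location"]
    return original_text[start:end] == span["span"]


def spans_with_option(spans: dict, option: str) -> list[dict]:
    """Collect all span objects with the given `sanitization_option`."""
    return [
        span
        for span_list in spans.values()
        for span in span_list
        if span.get("sanitization_option") == option
    ]


# Made-up sequence and span; offsets are computed rather than hand-counted.
sample_text = "Appointment with Dr. Popescu on 12 May."
start = sample_text.find("Dr. Popescu")
sample_span = {
    "attr": "Dr. Popescu",
    "span": "Dr. Popescu",
    "location": [start, start + len("Dr. Popescu")],
    "confidence": 1.0,
    "attr_type": "event_organizer",
    "sanitization_option": "abstract",
    "group_name": "Clinic Location and Provider Information",
}
sample_spans = {"Dr. Popescu": [sample_span]}
```

If the check fails on real data, the offsets may be end-inclusive or relative to the whole record instead, so it is worth testing on a few records before relying on it.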

sequence_sanitization_mapping — dict[str, dict]

Keyed by sequence index (as string). Each entry maps an original sequence to its final sanitized form.

Field	Type	Description
`original_sequence`	`str`	Original text of the sequence.
`sanitized_sequence`	`str`	Final sanitized text.
`target_attributes`	`list[str]`	Attribute values targeted for sanitization in this sequence.
`strategies_used`	`list[str]`	Sanitization strategies applied (e.g., `["abstract"]`).
`spans`	`list[dict]`	List of span objects, each with `attribute`, `span_text`, `location` (`[start, end]`), `confidence`, `attr_type`, `sanitization_option`, and `group_name`.
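
Because the mapping is keyed by sequence index as a string, keys should be sorted numerically rather than lexically before iterating. The sketch below lists only the sequences that the sanitization changed. `changed_sequences` and the sample mapping are hypothetical.

```python
def changed_sequences(mapping: dict) -> list[tuple[int, str, str]]:
    """Return (idx, original, sanitized) for sequences that were modified.

    Keys are strings such as "0", "2", "10", so they are sorted with
    `int` to get document order.
    """
    changed = []
    for key in sorted(mapping, key=int):
        entry = mapping[key]
        if entry["original_sequence"] != entry["sanitized_sequence"]:
            changed.append(
                (int(key), entry["original_sequence"], entry["sanitized_sequence"])
            )
    return changed


# Made-up mapping; only the fields used above are included.
sample_mapping = {
    "10": {"original_sequence": "Regards, Ana", "sanitized_sequence": "Regards, the patient"},
    "2": {"original_sequence": "Visit on 12 May.", "sanitized_sequence": "Visit in spring."},
    "3": {"original_sequence": "Thank you.", "sanitized_sequence": "Thank you."},
}
changes = changed_sequences(sample_mapping)
```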

Intended Usage:

This dataset is intended for researchers working on privacy-related or social-data-related projects, and for individuals who want to sanitize private information in their own texts.

Dataset Characterization

Data Collection Method
- [Synthetic]
Labeling Method
- [Synthetic]

Dataset Format

Text Records

Dataset Quantification

1.3M text records
54M annotated attributes
Measurement of Total Data Storage: 15GB

Reference(s):

Privasis: Synthesizing the Largest “Public” Private Dataset from Scratch

Ethical Considerations:

NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal developer teams to ensure this dataset meets requirements for the relevant industry and use case and addresses unforeseen product misuse.

Please report quality, risk, security vulnerabilities or NVIDIA AI Concerns here.

Last updated: Jun 11
