# Marxist-GRPO Training Dataset

## Overview
This directory contains curated and synthetic Q&A pairs for fine-tuning
Marxist-Leninist language models. The canonical source records live under
`sources/` with full provenance metadata; targeted synthetic corrections live
in `synthetic/*.jsonl`. The dataset is prepared for Hugging Face `datasets`
via the loading script in `dataset.py`.
## Data Layout

- `sources/**.jsonl`: author-attributed Q&A records (`qa_record` schema).
- `synthetic/*.jsonl`: synthetic Q&A records with `qa_record` metadata for targeted fixes.
- `schema/`: JSON Schema definitions for validation and tooling.
- `MANIFEST.yaml`: inventory, checksums, and per-file statistics.
- Training notebooks, logs, and formatted SFT data live under `llm/`.
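Each `.jsonl` file stores one JSON object per line. As orientation only, here is a minimal round-trip sketch with hypothetical field names; the authoritative field set is defined by `schema/qa_record.schema.json`:

```python
import json

# Hypothetical qa_record -- real field names come from
# schema/qa_record.schema.json, not from this sketch.
record = {
    "question": "What distinguishes exchange-value from use-value?",
    "answer": "Use-value is utility in consumption; exchange-value is "
              "the ratio at which commodities exchange.",
    "author": "synthetic",
    "provenance": {"file": "synthetic/example.jsonl"},
}

# JSONL stores one such object per line.
line = json.dumps(record)
assert json.loads(line) == record
```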
## Hugging Face Configs

The dataset script exposes three configs:

- `qa`: full metadata records (sources + synthetic).
- `pairs` (default): instruction/response pairs from source + synthetic files.
- `grpo`: GRPO-ready prompt/answer records with system + user messages.

All configs currently provide a single `train` split.
## Usage

Local usage:

```python
from datasets import load_dataset

dataset = load_dataset("path/to/dataset", "pairs", trust_remote_code=True)
train = dataset["train"]
```
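A `pairs` record can then be rendered into a single training string for SFT. This is a minimal sketch: the `instruction`/`response` field names follow the config description above, and the template itself is illustrative, not the one used by the notebooks under `llm/`:

```python
# Sketch: format one `pairs` record for SFT. Field names and template
# are assumptions; check dataset.py for the authoritative keys.
def format_pair(example: dict) -> dict:
    text = (
        f"### Instruction:\n{example['instruction']}\n\n"
        f"### Response:\n{example['response']}"
    )
    return {"text": text}
```

With a loaded dataset this would typically be applied via `train.map(format_pair)`.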
GRPO usage (for `Marxist_GRPO_Training.ipynb`-style training):

```python
from datasets import load_dataset

grpo = load_dataset("path/to/dataset", "grpo", trust_remote_code=True)["train"]
```

Once published to the Hub, replace the local path with `org/dataset-name`.
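For GRPO training, the dataset's reference answers can drive a reward function. A minimal sketch, assuming the `grpo` records expose the reference under an `answer` key and the trainer (e.g. TRL's `GRPOTrainer`) passes dataset columns to reward functions as keyword arguments and completions as strings or chat-message lists:

```python
# Sketch of a substring-match reward for GRPO. The `answer` key and the
# completion formats handled here are assumptions; adapt to the trainer
# actually used in llm/.
def exact_match_reward(completions, answer, **kwargs):
    rewards = []
    for completion, ref in zip(completions, answer):
        if isinstance(completion, list):  # chat format: use last message
            completion = completion[-1]["content"]
        rewards.append(1.0 if ref.strip() in completion else 0.0)
    return rewards
```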
## Schema Notes

All JSONL records conform to `schema/qa_record.schema.json`. The `pairs` and
`grpo` configs derive their fields from the same source files plus
`synthetic/*.jsonl`.
## License

The dataset is licensed under AGPL-3.0 (see `LICENSE` and `MANIFEST.yaml`).