# Marxist-GRPO Training Dataset

## Overview
This directory contains curated and synthetic Q&A pairs for fine-tuning
Marxist-Leninist language models. The canonical source records live under
`sources/` with full provenance metadata; targeted synthetic corrections live
in `synthetic/*.jsonl`. The dataset is prepared for Hugging Face `datasets`
via the loading script in `dataset.py`.
## Data Layout

- `sources/**.jsonl`: author-attributed Q&A records (`qa_record` schema).
- `synthetic/*.jsonl`: synthetic Q&A records with `qa_record` metadata for targeted fixes.
- `schema/`: JSON Schema definitions for validation and tooling.
- `MANIFEST.yaml`: inventory, checksums, and per-file statistics.
- Training notebooks, logs, and formatted SFT data live under `llm/`.
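Each `.jsonl` file stores one JSON object per line. As orientation only, here is a minimal round-trip sketch with hypothetical field names; the authoritative field set is defined by `schema/qa_record.schema.json`:

```python
import json

# Hypothetical qa_record -- real field names come from
# schema/qa_record.schema.json, not from this sketch.
record = {
    "question": "What distinguishes exchange-value from use-value?",
    "answer": "Use-value is utility in consumption; exchange-value is "
              "the ratio at which commodities exchange.",
    "author": "synthetic",
    "provenance": {"file": "synthetic/example.jsonl"},
}

# JSONL stores one such object per line.
line = json.dumps(record)
assert json.loads(line) == record
```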
## Hugging Face Configs

The dataset script exposes three configs:

- `qa`: full metadata records (sources + synthetic).
- `pairs` (default): instruction/response pairs from source + synthetic files.
- `grpo`: GRPO-ready prompt/answer records with system + user messages.

All configs currently provide a single `train` split.
## Usage

Local usage:

```python
from datasets import load_dataset

dataset = load_dataset("path/to/dataset", "pairs", trust_remote_code=True)
train = dataset["train"]
```
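A `pairs` record can then be rendered into a single training string for SFT. This is a minimal sketch: the `instruction`/`response` field names follow the config description above, and the template itself is illustrative, not the one used by the notebooks under `llm/`:

```python
# Sketch: format one `pairs` record for SFT. Field names and template
# are assumptions; check dataset.py for the authoritative keys.
def format_pair(example: dict) -> dict:
    text = (
        f"### Instruction:\n{example['instruction']}\n\n"
        f"### Response:\n{example['response']}"
    )
    return {"text": text}
```

With a loaded dataset this would typically be applied via `train.map(format_pair)`.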
GRPO usage (for `Marxist_GRPO_Training.ipynb`-style training):

```python
from datasets import load_dataset

grpo = load_dataset("path/to/dataset", "grpo", trust_remote_code=True)["train"]
```

Once published to the Hub, replace the local path with `org/dataset-name`.
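For GRPO training, the dataset's reference answers can drive a reward function. A minimal sketch, assuming the `grpo` records expose the reference under an `answer` key and the trainer (e.g. TRL's `GRPOTrainer`) passes dataset columns to reward functions as keyword arguments and completions as strings or chat-message lists:

```python
# Sketch of a substring-match reward for GRPO. The `answer` key and the
# completion formats handled here are assumptions; adapt to the trainer
# actually used in llm/.
def exact_match_reward(completions, answer, **kwargs):
    rewards = []
    for completion, ref in zip(completions, answer):
        if isinstance(completion, list):  # chat format: use last message
            completion = completion[-1]["content"]
        rewards.append(1.0 if ref.strip() in completion else 0.0)
    return rewards
```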
## Schema Notes

All JSONL records conform to `schema/qa_record.schema.json`. The `pairs` and
`grpo` configs derive their fields from the same source files plus
`synthetic/*.jsonl`.
## License

The dataset is licensed under AGPL-3.0 (see `LICENSE` and `MANIFEST.yaml`).