Marxist-GRPO Training Dataset

Overview

This directory contains curated and synthetic Q&A pairs for fine-tuning Marxist-Leninist language models. The canonical source records live under sources/ with full provenance metadata. Targeted synthetic corrections live in synthetic/*.jsonl. The dataset is prepared for Hugging Face datasets via the loading script in dataset.py.

Data Layout

  • sources/**.jsonl: author-attributed Q&A records (qa_record schema).
  • synthetic/*.jsonl: synthetic Q&A records with qa_record metadata for targeted fixes.
  • schema/: JSON Schema definitions for validation and tooling.
  • MANIFEST.yaml: inventory, checksums, and per-file statistics.
  • llm/: training notebooks, logs, and formatted SFT data.
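As a sketch of how a single JSONL record in these files might be consumed, the snippet below parses one line with the standard library. The field names shown are illustrative assumptions only; schema/qa_record.schema.json defines the actual record shape.

```python
import json

# Hypothetical qa_record line; the real field names come from
# schema/qa_record.schema.json, not from this example.
line = '{"id": "q-001", "question": "What is surplus value?", "answer": "...", "source": "sources/example.jsonl"}'

record = json.loads(line)
print(sorted(record))  # inspect the keys present on this record
```

Each line of a JSONL file is one such self-contained JSON object, so files can be streamed record by record without loading everything into memory.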

Hugging Face Configs

The dataset script exposes three configs:

  • qa: full metadata records (sources + synthetic).
  • pairs (default): instruction/response pairs from sources + synthetic files.
  • grpo: GRPO-ready prompt/answer records with system + user messages.

All configs currently provide a single train split.
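The pairs config derives instruction/response pairs from the full-metadata records. A minimal sketch of that kind of mapping is below; the field names are assumptions for illustration, and the loading script in dataset.py is authoritative for the real transformation.

```python
def to_pair(record):
    # Map a full-metadata qa record to an instruction/response pair.
    # "question"/"answer" are assumed field names; dataset.py defines the real ones.
    return {"instruction": record["question"], "response": record["answer"]}

pair = to_pair({
    "question": "Summarize the labor theory of value.",
    "answer": "...",
    "author": "example",
})
```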

Usage

Local usage:

from datasets import load_dataset

dataset = load_dataset("path/to/dataset", "pairs", trust_remote_code=True)
train = dataset["train"]

GRPO usage (for Marxist_GRPO_Training.ipynb-style training):

from datasets import load_dataset

grpo = load_dataset("path/to/dataset", "grpo", trust_remote_code=True)["train"]
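The grpo config is described above as prompt/answer records with system + user messages. A hedged sketch of what one such record might look like, and how to iterate its messages, is below; the column names and message shape are assumptions, with dataset.py as the authority.

```python
# Hypothetical shape of one grpo record: a "prompt" list of chat messages
# plus a reference "answer". Real column names are defined by dataset.py.
example = {
    "prompt": [
        {"role": "system", "content": "You are a Marxist-Leninist assistant."},
        {"role": "user", "content": "What distinguishes a class from a stratum?"},
    ],
    "answer": "...",
}

roles = [message["role"] for message in example["prompt"]]
```

Under this assumed shape, each record carries the full chat context needed to build a GRPO rollout prompt, with the reference answer available for reward scoring.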

Once the dataset is published to the Hub, replace the local path with the repository ID (org/dataset-name).

Schema Notes

All JSONL records use schema/qa_record.schema.json. The pairs and grpo configs derive their fields from the same sources plus synthetic/*.jsonl.
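Records can be spot-checked against the schema's required keys before training. The stdlib-only sketch below shows the idea; the required-key set here is an assumed subset, and schema/qa_record.schema.json remains the full definition.

```python
import json

# Assumed subset of required keys; see schema/qa_record.schema.json for the real list.
REQUIRED = {"question", "answer"}

def missing_keys(line):
    # Return the required keys absent from one JSONL record.
    record = json.loads(line)
    return REQUIRED - record.keys()

ok = missing_keys('{"question": "q", "answer": "a"}')
bad = missing_keys('{"question": "q"}')
```

A third-party validator such as jsonschema could enforce the full schema; this check only guards against missing required fields.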

License

The dataset is licensed under AGPL-3.0 (see LICENSE and MANIFEST.yaml).
