# prolewiki-llm

GRPO fine-tuning infrastructure for training Marxist-Leninist language models.

## Overview
This repository contains the AI training infrastructure for fine-tuning language models on Marxist-Leninist theory using GRPO (Group Relative Policy Optimization). It includes:
- **Multi-Layer Reward System**: 17+ reward functions that prevent reward hacking (NLI coherence, self-consistency, structural analysis, topic relevance, depth scoring)
- **Headless Training**: Docker container for automated RunPod deployment with auto-shutoff
- **Jupyter Notebook**: Production-ready notebook optimized for A40/A100 GPUs
- **Comprehensive Tests**: Unit and integration tests for all components
## Quick Start

### RunPod Deployment (Recommended)
```bash
# 1. Build Docker image
docker build -t marxist-grpo:latest -f docker/Dockerfile .

# 2. Push to registry and deploy on RunPod
#    Use A40 (48GB, $0.35/hr) for best cost/performance

# 3. Set environment variables on the pod:
#    - HF_TOKEN
#    - WANDB_API_KEY
#    - HF_REPO (optional, for model upload)
```
### Local Development

```bash
# Install dependencies
uv sync --group dev

# Download spaCy model (required for rewards)
python -m spacy download en_core_web_sm

# Run tests
uv run pytest -m "not slow and not gpu"
```
## Repository Structure

```
prolewiki-llm/
├── src/prolewiki_llm/
│   ├── grpo_rewards.py          # Multi-layer reward functions
│   ├── train_headless.py        # Headless training script
│   ├── export_grpo_dataset.py   # Dataset conversion
│   └── wandb_logging.py         # W&B integration
├── docker/
│   ├── Dockerfile               # Training container
│   ├── start.sh                 # Entrypoint with auto-shutoff
│   └── .env.example             # Environment reference
├── notebooks/
│   └── Marxist_GRPO_RunPod_Optimized.ipynb
├── tests/
│   ├── unit/                    # Unit tests
│   ├── integration/             # Shell script tests
│   └── fixtures/                # Mock commands
└── training_data/
    └── grpo_dataset.jsonl       # Training data
```
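The training data lives in `grpo_dataset.jsonl` as JSON Lines. A minimal sketch of consuming it (the per-record field names `prompt`/`answer` are illustrative assumptions; the actual schema is defined by `export_grpo_dataset.py`):

```python
import json

# Hypothetical reader for training_data/grpo_dataset.jsonl.
# Field names ("prompt", "answer") are assumptions for illustration;
# check export_grpo_dataset.py for the real schema.
def load_grpo_dataset(lines):
    """Parse JSONL text into a list of records, skipping blank lines."""
    return [json.loads(line) for line in lines if line.strip()]

sample = '{"prompt": "What is surplus value?", "answer": "..."}\n\n'
records = load_grpo_dataset(sample.splitlines())
```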
## Reward Functions
The reward system uses multiple layers to ensure quality responses:
| Layer | Function | Purpose |
|---|---|---|
| 1 | `match_format_exactly` | Validate `<think>...</think>` tags |
| 2 | `nli_coherence_reward` | Response entails ground truth (BART-MNLI) |
| 3 | `self_consistency_reward` | No internal contradictions |
| 4 | `structural_coherence_reward` | Terms in proper syntactic roles (spaCy) |
| 5 | `topic_relevance_reward` | Answer addresses the question |
| 6 | `interconnection_depth_reward` | Rewards analysis, penalizes buzzword salad |
Use `full_coherence_reward()` for the complete six-layer check, or `robust_coherence_reward()` for a faster three-layer version.
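As an illustration of the layer-1 check, here is a minimal sketch of a `<think>`-tag format reward. The function name matches the table above, but the exact signature and scoring logic in `grpo_rewards.py` are assumptions:

```python
import re

# Hypothetical sketch of the layer-1 format reward; the real function in
# grpo_rewards.py may score differently. Returns 1.0 when the completion
# opens with exactly one well-formed <think>...</think> block followed
# by a visible answer, else 0.0.
THINK_RE = re.compile(r"\A\s*<think>.*?</think>\s*\S", re.DOTALL)

def match_format_exactly(completion: str) -> float:
    """Reward 1.0 for a single <think> block plus an answer, 0.0 otherwise."""
    if completion.count("<think>") != 1 or completion.count("</think>") != 1:
        return 0.0
    return 1.0 if THINK_RE.match(completion) else 0.0
```

A binary format gate like this is typically the cheapest layer, so it runs first and short-circuits malformed generations before the NLI and spaCy layers spend compute on them.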
## Training Configuration

Key environment variables for `train_headless.py`:
| Variable | Default | Description |
|---|---|---|
| `MODEL_NAME` | `unsloth/DeepSeek-R1-0528-Qwen3-8B` | Base model |
| `MAX_STEPS` | `500` | Training steps |
| `BATCH_SIZE` | `2` | Per-device batch size |
| `LEARNING_RATE` | `5e-6` | Learning rate |
| `REWARD_MODE` | `FULL` | `FULL`, `ROBUST`, or `LEGACY` |
| `HF_REPO` | `prolewiki/marxist-grpo-lora` | Upload destination |
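The defaults in the table can be read from the environment in one place; a minimal sketch of such a helper (hypothetical — the real `train_headless.py` may parse its environment differently):

```python
import os

# Hypothetical config loader mirroring the defaults in the table above.
def load_training_config(env=os.environ):
    """Read training settings from environment variables with documented defaults."""
    return {
        "model_name": env.get("MODEL_NAME", "unsloth/DeepSeek-R1-0528-Qwen3-8B"),
        "max_steps": int(env.get("MAX_STEPS", "500")),
        "batch_size": int(env.get("BATCH_SIZE", "2")),
        "learning_rate": float(env.get("LEARNING_RATE", "5e-6")),
        "reward_mode": env.get("REWARD_MODE", "FULL"),
        "hf_repo": env.get("HF_REPO", "prolewiki/marxist-grpo-lora"),
    }
```

Casting `MAX_STEPS` and `LEARNING_RATE` at load time means a malformed pod environment fails immediately rather than mid-run.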
## GPU Requirements
| GPU | VRAM | Price | Recommendation |
|---|---|---|---|
| A40 | 48GB | $0.35/hr | Best value for 8B models |
| A100 | 80GB | $1.19/hr | Overkill for this use case |
| RTX 4090 | 24GB | $0.34/hr | Too small for 16-bit GRPO |
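A rough back-of-envelope estimate shows why 24 GB is too tight for 16-bit GRPO on an 8B model (assumptions, not from the repo: fp16/bf16 weights at 2 bytes per parameter, before optimizer state, KV cache, and multi-completion GRPO rollouts):

```python
# Back-of-envelope VRAM estimate for 16-bit GRPO on an 8B model.
# Assumption: fp16/bf16 weights at 2 bytes/param; activations, KV cache,
# and GRPO's multiple sampled completions per prompt come on top of this.
params = 8e9
weight_gb = params * 2 / 1e9  # 16 GB of weights alone
print(f"weights: {weight_gb:.0f} GB")
```

16 GB of weights leaves only ~8 GB of headroom on an RTX 4090 but ~32 GB on an A40, which is why the A40 is the recommended card.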
## Critical Notes

- `torch.compile` must be disabled on RunPod/Jupyter (it causes hangs)
- `load_in_4bit=False` is required for GRPO (16-bit LoRA adapters)
- Use `use_gradient_checkpointing=True` (not `"unsloth"`) for stability
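These settings can be collected in one place; a minimal sketch (`TORCH_COMPILE_DISABLE` is a PyTorch environment switch, and the keyword names below follow common Unsloth conventions — both are assumptions, not verified against this repo):

```python
import os

# Hypothetical settings block reflecting the critical notes above.
os.environ["TORCH_COMPILE_DISABLE"] = "1"  # avoid torch.compile hangs on RunPod/Jupyter

MODEL_KWARGS = {
    "load_in_4bit": False,               # GRPO trains 16-bit LoRA adapters
    "use_gradient_checkpointing": True,  # plain True, not "unsloth", for stability
}
```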
## Related Projects
## License
AGPL-3.0-only