# prolewiki-llm

GRPO fine-tuning infrastructure for training Marxist-Leninist language models.

## Overview
This repository contains the AI training infrastructure for fine-tuning language models on Marxist-Leninist theory using GRPO (Group Relative Policy Optimization). It includes:
- **Multi-Layer Reward System**: 17+ reward functions that prevent reward hacking (NLI coherence, self-consistency, structural analysis, topic relevance, depth scoring)
- **Headless Training**: Docker container for automated RunPod deployment with auto-shutoff
- **Jupyter Notebook**: Production-ready notebook optimized for A40/A100 GPUs
- **Comprehensive Tests**: Unit and integration tests for all components
## Quick Start

### RunPod Deployment (Recommended)
```bash
# 1. Build Docker image
docker build -t marxist-grpo:latest -f docker/Dockerfile .

# 2. Push to registry and deploy on RunPod
#    Use A40 (48GB, $0.35/hr) for best cost/performance

# 3. Set environment variables on the pod:
#    - HF_TOKEN
#    - WANDB_API_KEY
#    - HF_REPO (optional, for model upload)
```
### Local Development

```bash
# Install dependencies
uv sync --group dev

# Download spaCy model (required for rewards)
python -m spacy download en_core_web_sm

# Run tests
uv run pytest -m "not slow and not gpu"
```
## Repository Structure

```
prolewiki-llm/
├── src/prolewiki_llm/
│   ├── grpo_rewards.py          # Multi-layer reward functions
│   ├── train_headless.py        # Headless training script
│   ├── export_grpo_dataset.py   # Dataset conversion
│   └── wandb_logging.py         # W&B integration
├── docker/
│   ├── Dockerfile               # Training container
│   ├── start.sh                 # Entrypoint with auto-shutoff
│   └── .env.example             # Environment reference
├── notebooks/
│   └── Marxist_GRPO_RunPod_Optimized.ipynb
├── tests/
│   ├── unit/                    # Unit tests
│   ├── integration/             # Shell script tests
│   └── fixtures/                # Mock commands
└── training_data/
    └── grpo_dataset.jsonl       # Training data
```
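The training data lives in `grpo_dataset.jsonl` as JSON Lines. A minimal sketch of consuming it (the per-record field names `prompt`/`answer` are illustrative assumptions; the actual schema is defined by `export_grpo_dataset.py`):

```python
import json

# Hypothetical reader for training_data/grpo_dataset.jsonl.
# Field names ("prompt", "answer") are assumptions for illustration;
# check export_grpo_dataset.py for the real schema.
def load_grpo_dataset(lines):
    """Parse JSONL text into a list of records, skipping blank lines."""
    return [json.loads(line) for line in lines if line.strip()]

sample = '{"prompt": "What is surplus value?", "answer": "..."}\n\n'
records = load_grpo_dataset(sample.splitlines())
```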
## Reward Functions
The reward system uses multiple layers to ensure quality responses:
| Layer | Function | Purpose |
|---|---|---|
| 1 | `match_format_exactly` | Validate `<think>...</think>` tags |
| 2 | `nli_coherence_reward` | Response entails ground truth (BART-MNLI) |
| 3 | `self_consistency_reward` | No internal contradictions |
| 4 | `structural_coherence_reward` | Terms in proper syntactic roles (spaCy) |
| 5 | `topic_relevance_reward` | Answer addresses the question |
| 6 | `interconnection_depth_reward` | Rewards analysis, penalizes buzzword salad |
Use `full_coherence_reward()` for the complete six-layer check, or `robust_coherence_reward()` for a faster three-layer version.
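As an illustration of the layer-1 check, here is a minimal sketch of a `<think>`-tag format reward. The function name matches the table above, but the exact signature and scoring logic in `grpo_rewards.py` are assumptions:

```python
import re

# Hypothetical sketch of the layer-1 format reward; the real function in
# grpo_rewards.py may score differently. Returns 1.0 when the completion
# opens with exactly one well-formed <think>...</think> block followed
# by a visible answer, else 0.0.
THINK_RE = re.compile(r"\A\s*<think>.*?</think>\s*\S", re.DOTALL)

def match_format_exactly(completion: str) -> float:
    """Reward 1.0 for a single <think> block plus an answer, 0.0 otherwise."""
    if completion.count("<think>") != 1 or completion.count("</think>") != 1:
        return 0.0
    return 1.0 if THINK_RE.match(completion) else 0.0
```

A binary format gate like this is typically the cheapest layer, so it runs first and short-circuits malformed generations before the NLI and spaCy layers spend compute on them.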
## Training Configuration

Key environment variables for `train_headless.py`:
| Variable | Default | Description |
|---|---|---|
| `MODEL_NAME` | `unsloth/DeepSeek-R1-0528-Qwen3-8B` | Base model |
| `MAX_STEPS` | `500` | Training steps |
| `BATCH_SIZE` | `2` | Per-device batch size |
| `LEARNING_RATE` | `5e-6` | Learning rate |
| `REWARD_MODE` | `FULL` | `FULL`, `ROBUST`, or `LEGACY` |
| `HF_REPO` | `prolewiki/marxist-grpo-lora` | Upload destination |
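The defaults in the table can be read from the environment in one place; a minimal sketch of such a helper (hypothetical — the real `train_headless.py` may parse its environment differently):

```python
import os

# Hypothetical config loader mirroring the defaults in the table above.
def load_training_config(env=os.environ):
    """Read training settings from environment variables with documented defaults."""
    return {
        "model_name": env.get("MODEL_NAME", "unsloth/DeepSeek-R1-0528-Qwen3-8B"),
        "max_steps": int(env.get("MAX_STEPS", "500")),
        "batch_size": int(env.get("BATCH_SIZE", "2")),
        "learning_rate": float(env.get("LEARNING_RATE", "5e-6")),
        "reward_mode": env.get("REWARD_MODE", "FULL"),
        "hf_repo": env.get("HF_REPO", "prolewiki/marxist-grpo-lora"),
    }
```

Casting `MAX_STEPS` and `LEARNING_RATE` at load time means a malformed pod environment fails immediately rather than mid-run.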
## GPU Requirements
| GPU | VRAM | Price | Recommendation |
|---|---|---|---|
| A40 | 48GB | $0.35/hr | Best value for 8B models |
| A100 | 80GB | $1.19/hr | Overkill for this use case |
| RTX 4090 | 24GB | $0.34/hr | Too small for 16-bit GRPO |
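A rough back-of-envelope estimate shows why 24 GB is too tight for 16-bit GRPO on an 8B model (assumptions, not from the repo: fp16/bf16 weights at 2 bytes per parameter, before optimizer state, KV cache, and multi-completion GRPO rollouts):

```python
# Back-of-envelope VRAM estimate for 16-bit GRPO on an 8B model.
# Assumption: fp16/bf16 weights at 2 bytes/param; activations, KV cache,
# and GRPO's multiple sampled completions per prompt come on top of this.
params = 8e9
weight_gb = params * 2 / 1e9  # 16 GB of weights alone
print(f"weights: {weight_gb:.0f} GB")
```

16 GB of weights leaves only ~8 GB of headroom on an RTX 4090 but ~32 GB on an A40, which is why the A40 is the recommended card.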
## Critical Notes

- `torch.compile` must be disabled on RunPod/Jupyter (it causes hangs)
- `load_in_4bit=False` is required for GRPO (16-bit LoRA adapters)
- Use `use_gradient_checkpointing=True` (not `"unsloth"`) for stability
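These settings can be collected in one place; a minimal sketch (`TORCH_COMPILE_DISABLE` is a PyTorch environment switch, and the keyword names below follow common Unsloth conventions — both are assumptions, not verified against this repo):

```python
import os

# Hypothetical settings block reflecting the critical notes above.
os.environ["TORCH_COMPILE_DISABLE"] = "1"  # avoid torch.compile hangs on RunPod/Jupyter

MODEL_KWARGS = {
    "load_in_4bit": False,               # GRPO trains 16-bit LoRA adapters
    "use_gradient_checkpointing": True,  # plain True, not "unsloth", for stability
}
```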
## Related Projects
## License
AGPL-3.0-only