---
language:
- en
license: agpl-3.0
library_name: transformers
tags:
- grpo
- rlhf
- fine-tuning
- marxism
- political-theory
- lora
- deepseek
- qwen
datasets:
- prolewiki/qa-corpus
base_model: unsloth/DeepSeek-R1-0528-Qwen3-8B
pipeline_tag: text-generation
---
# prolewiki-llm
GRPO fine-tuning infrastructure for training Marxist-Leninist language models.
## Overview
This repository contains the AI training infrastructure for fine-tuning language models on Marxist-Leninist theory using GRPO (Group Relative Policy Optimization). It includes:
- **Multi-Layer Reward System**: 17+ reward functions designed to resist reward hacking (NLI coherence, self-consistency, structural analysis, topic relevance, depth scoring)
- **Headless Training**: Docker container for automated RunPod deployment with auto-shutoff
- **Jupyter Notebook**: Production-ready notebook optimized for A40/A100 GPUs
- **Comprehensive Tests**: Unit and integration tests for all components
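The multi-layer idea can be sketched as a weighted combination of independent scoring functions, so that gaming any single layer yields only a small gain. This is an illustrative sketch, not the actual weighting used in `grpo_rewards.py`:

```python
from typing import Callable

RewardFn = Callable[[list[str]], list[float]]

def combined_reward(completions: list[str],
                    layers: list[tuple[float, RewardFn]]) -> list[float]:
    """Sum weighted per-layer scores for each completion.

    layers: list of (weight, fn) pairs, where fn(completions) -> list[float].
    """
    totals = [0.0] * len(completions)
    for weight, fn in layers:
        for i, score in enumerate(fn(completions)):
            totals[i] += weight * score
    return totals

# Example with two dummy layers (real layers are the NLI, consistency,
# structural, relevance, and depth checks listed above):
layers = [
    (0.5, lambda cs: [1.0] * len(cs)),      # e.g. format check passes for all
    (0.5, lambda cs: [0.0, 1.0]),           # e.g. only the 2nd answer is relevant
]
combined_reward(["answer a", "answer b"], layers)  # → [0.5, 1.0]
```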
## Quick Start
### RunPod Deployment (Recommended)
```bash
# 1. Build Docker image
docker build -t marxist-grpo:latest -f docker/Dockerfile .
# 2. Push to registry and deploy on RunPod
# Use A40 (48GB, $0.35/hr) for best cost/performance
# 3. Set environment variables on pod:
# - HF_TOKEN
# - WANDB_API_KEY
# - HF_REPO (optional, for model upload)
```
### Local Development
```bash
# Install dependencies
uv sync --group dev
# Download spaCy model (required for rewards)
python -m spacy download en_core_web_sm
# Run tests
uv run pytest -m "not slow and not gpu"
```
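The `-m "not slow and not gpu"` filter above relies on pytest markers. The marker names come from the command; the test functions below are hypothetical examples of how such tests would be tagged:

```python
import pytest

@pytest.mark.gpu
def test_full_training_step():
    """Needs CUDA; excluded by -m "not slow and not gpu"."""
    ...

@pytest.mark.slow
def test_nli_model_download():
    """Downloads BART-MNLI; excluded for quick local runs."""
    ...
```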
## Repository Structure
```
prolewiki-llm/
├── src/prolewiki_llm/
│   ├── grpo_rewards.py            # Multi-layer reward functions
│   ├── train_headless.py          # Headless training script
│   ├── export_grpo_dataset.py     # Dataset conversion
│   └── wandb_logging.py           # W&B integration
├── docker/
│   ├── Dockerfile                 # Training container
│   ├── start.sh                   # Entrypoint with auto-shutoff
│   └── .env.example               # Environment reference
├── notebooks/
│   └── Marxist_GRPO_RunPod_Optimized.ipynb
├── tests/
│   ├── unit/                      # Unit tests
│   ├── integration/               # Shell script tests
│   └── fixtures/                  # Mock commands
└── training_data/
    └── grpo_dataset.jsonl         # Training data
```
## Reward Functions
The reward system uses multiple layers to ensure quality responses:
| Layer | Function | Purpose |
|-------|----------|---------|
| 1 | `match_format_exactly` | Validate `<think>...</think>` tags |
| 2 | `nli_coherence_reward` | Response entails ground truth (BART-MNLI) |
| 3 | `self_consistency_reward` | No internal contradictions |
| 4 | `structural_coherence_reward` | Terms in proper syntactic roles (spaCy) |
| 5 | `topic_relevance_reward` | Answer addresses the question |
| 6 | `interconnection_depth_reward` | Rewards analysis, penalizes buzzword salad |
Use `full_coherence_reward()` for the complete 6-layer check, or `robust_coherence_reward()` for a faster 3-layer version.
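For intuition, a Layer-1-style check can be written in a few lines. This is only an illustrative sketch of the idea behind `match_format_exactly`; the actual implementation lives in `grpo_rewards.py`:

```python
import re

# A completion passes if it opens with a complete <think>...</think> block.
THINK_RE = re.compile(r"^<think>.*?</think>", re.DOTALL)

def format_reward(completions: list[str]) -> list[float]:
    """Return 1.0 for completions with well-formed think tags, else 0.0."""
    return [1.0 if THINK_RE.match(c) else 0.0 for c in completions]

format_reward(["<think>reasoning</think>The answer.", "no tags here"])
# → [1.0, 0.0]
```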
## Training Configuration
Key environment variables for `train_headless.py`:
| Variable | Default | Description |
|----------|---------|-------------|
| `MODEL_NAME` | `unsloth/DeepSeek-R1-0528-Qwen3-8B` | Base model |
| `MAX_STEPS` | `500` | Training steps |
| `BATCH_SIZE` | `2` | Per-device batch size |
| `LEARNING_RATE` | `5e-6` | Learning rate |
| `REWARD_MODE` | `FULL` | `FULL`, `ROBUST`, or `LEGACY` |
| `HF_REPO` | `prolewiki/marxist-grpo-lora` | Upload destination |
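A configuration loader for these variables might look like the sketch below. The names and defaults come from the table; the parsing code itself is hypothetical, not a copy of `train_headless.py`:

```python
import os

def load_config() -> dict:
    """Read training configuration from the environment, with table defaults."""
    return {
        "model_name": os.environ.get("MODEL_NAME", "unsloth/DeepSeek-R1-0528-Qwen3-8B"),
        "max_steps": int(os.environ.get("MAX_STEPS", "500")),
        "batch_size": int(os.environ.get("BATCH_SIZE", "2")),
        "learning_rate": float(os.environ.get("LEARNING_RATE", "5e-6")),
        "reward_mode": os.environ.get("REWARD_MODE", "FULL"),
        "hf_repo": os.environ.get("HF_REPO", "prolewiki/marxist-grpo-lora"),
    }
```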
## GPU Requirements
| GPU | VRAM | Price | Recommendation |
|-----|------|-------|----------------|
| **A40** | 48GB | $0.35/hr | Best value for 8B models |
| A100 | 80GB | $1.19/hr | Overkill for this use case |
| RTX 4090 | 24GB | $0.34/hr | Too small for 16-bit GRPO |
## Critical Notes
1. **torch.compile must be disabled** on RunPod/Jupyter (causes hangs)
2. **load_in_4bit=False** is required for GRPO (16-bit LoRA adapters)
3. **use_gradient_checkpointing=True** (not `"unsloth"`) for stability
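One way to make these notes hard to violate is a small guard that checks a config dict before training starts. The key names mirror the options above, but this guard is a hypothetical sketch, not part of the repository:

```python
# Expected values for the three critical settings listed above.
CRITICAL_SETTINGS = {
    "torch_compile": False,              # note 1: hangs on RunPod/Jupyter
    "load_in_4bit": False,               # note 2: GRPO needs 16-bit LoRA adapters
    "use_gradient_checkpointing": True,  # note 3: True, not "unsloth"
}

def check_critical_settings(cfg: dict) -> list[str]:
    """Return a human-readable list of violations of the critical notes."""
    return [
        f"{key} should be {expected!r}, got {cfg.get(key)!r}"
        for key, expected in CRITICAL_SETTINGS.items()
        if cfg.get(key) != expected
    ]
```

Calling `check_critical_settings({"use_gradient_checkpointing": "unsloth", ...})` would flag note 3, since the string `"unsloth"` is not the boolean `True`.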
## Related Projects
- [ProleWiki](https://en.prolewiki.org/) - The Marxist-Leninist encyclopedia
- [pw-mcp](https://github.com/prolewiki/pw-mcp) - MCP server for ProleWiki semantic search
## License
AGPL-3.0-only