---
language:
- en
license: agpl-3.0
library_name: transformers
tags:
- grpo
- rlhf
- fine-tuning
- marxism
- political-theory
- lora
- deepseek
- qwen
datasets:
- prolewiki/qa-corpus
base_model: unsloth/DeepSeek-R1-0528-Qwen3-8B
pipeline_tag: text-generation
---

# prolewiki-llm

GRPO fine-tuning infrastructure for training Marxist-Leninist language models.

## Overview

This repository contains the AI training infrastructure for fine-tuning language models on Marxist-Leninist theory using GRPO (Group Relative Policy Optimization). It includes:

- **Multi-Layer Reward System**: 17+ reward functions that prevent reward hacking (NLI coherence, self-consistency, structural analysis, topic relevance, depth scoring)
- **Headless Training**: Docker container for automated RunPod deployment with auto-shutoff
- **Jupyter Notebook**: Production-ready notebook optimized for A40/A100 GPUs
- **Comprehensive Tests**: Unit and integration tests for all components

## Quick Start

### RunPod Deployment (Recommended)

```bash
# 1. Build the Docker image
docker build -t marxist-grpo:latest -f docker/Dockerfile .

# 2. Push to a registry and deploy on RunPod
#    Use an A40 (48GB, $0.35/hr) for best cost/performance

# 3. Set environment variables on the pod:
#    - HF_TOKEN
#    - WANDB_API_KEY
#    - HF_REPO (optional, for model upload)
```

### Local Development

```bash
# Install dependencies
uv sync --group dev

# Download the spaCy model (required by the reward functions)
python -m spacy download en_core_web_sm

# Run the test suite
uv run pytest -m "not slow and not gpu"
```

## Repository Structure

```
prolewiki-llm/
├── src/prolewiki_llm/
│   ├── grpo_rewards.py          # Multi-layer reward functions
│   ├── train_headless.py        # Headless training script
│   ├── export_grpo_dataset.py   # Dataset conversion
│   └── wandb_logging.py         # W&B integration
├── docker/
│   ├── Dockerfile               # Training container
│   ├── start.sh                 # Entrypoint with auto-shutoff
│   └── .env.example             # Environment reference
├── notebooks/
│   └── Marxist_GRPO_RunPod_Optimized.ipynb
├── tests/
│   ├── unit/                    # Unit tests
│   ├── integration/             # Shell script tests
│   └── fixtures/                # Mock commands
└── training_data/
    └── grpo_dataset.jsonl       # Training data
```

## Reward Functions

The reward system uses multiple layers to ensure quality responses:

| Layer | Function | Purpose |
|-------|----------|---------|
| 1 | `match_format_exactly` | Validate `...` tags |
| 2 | `nli_coherence_reward` | Response entails the ground truth (BART-MNLI) |
| 3 | `self_consistency_reward` | No internal contradictions |
| 4 | `structural_coherence_reward` | Terms appear in proper syntactic roles (spaCy) |
| 5 | `topic_relevance_reward` | Answer addresses the question |
| 6 | `interconnection_depth_reward` | Rewards analysis, penalizes buzzword salad |

Use `full_coherence_reward()` for the complete 6-layer check, or `robust_coherence_reward()` for a faster 3-layer version.
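A minimal sketch of how such layered rewards can compose. The function names mirror the table, but the tag regex, signatures, and gating/weighting scheme are illustrative assumptions, not the repository's actual implementation in `grpo_rewards.py`:

```python
import re
from typing import Callable

# Hypothetical tag pattern; the real layer-1 check validates the model's
# specific response tags, which are elided in the table above.
FORMAT_RE = re.compile(r"<think>.*?</think>", re.DOTALL)


def match_format_exactly(response: str) -> float:
    """Layer 1: full reward only when the expected tags are present."""
    return 1.0 if FORMAT_RE.search(response) else 0.0


def combine_layers(
    response: str,
    layers: list[Callable[[str], float]],
    weights: list[float],
) -> float:
    """Weighted sum of per-layer scores, gated on the format check.

    Gating means a malformed response scores zero no matter how good its
    content is, which removes the incentive to reward-hack the soft layers.
    """
    if match_format_exactly(response) == 0.0:
        return 0.0
    return sum(w * layer(response) for w, layer in zip(weights, layers))
```

A gated design like this is one common way to keep cheap structural checks from being traded off against expensive semantic ones (e.g. the NLI layer).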
## Training Configuration

Key environment variables for `train_headless.py`:

| Variable | Default | Description |
|----------|---------|-------------|
| `MODEL_NAME` | `unsloth/DeepSeek-R1-0528-Qwen3-8B` | Base model |
| `MAX_STEPS` | `500` | Training steps |
| `BATCH_SIZE` | `2` | Per-device batch size |
| `LEARNING_RATE` | `5e-6` | Learning rate |
| `REWARD_MODE` | `FULL` | `FULL`, `ROBUST`, or `LEGACY` |
| `HF_REPO` | `prolewiki/marxist-grpo-lora` | Upload destination |

## GPU Requirements

| GPU | VRAM | Price | Recommendation |
|-----|------|-------|----------------|
| **A40** | 48GB | $0.35/hr | Best value for 8B models |
| A100 | 80GB | $1.19/hr | Overkill for this use case |
| RTX 4090 | 24GB | $0.34/hr | Too small for 16-bit GRPO |

## Critical Notes

1. **`torch.compile` must be disabled** on RunPod/Jupyter (it causes hangs)
2. **`load_in_4bit=False`** is required for GRPO (16-bit LoRA adapters)
3. **`use_gradient_checkpointing=True`** (not `"unsloth"`) for stability

## Related Projects

- [ProleWiki](https://en.prolewiki.org/) - The Marxist-Leninist encyclopedia
- [pw-mcp](https://github.com/prolewiki/pw-mcp) - MCP server for ProleWiki semantic search

## License

AGPL-3.0-only
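For reference, the Training Configuration table above maps onto a launch script like the following. The values shown are the documented defaults, and the `python -m prolewiki_llm.train_headless` invocation is inferred from the repository layout, so check `docker/start.sh` for the exact entrypoint:

```shell
# Hypothetical launch sketch: all variables below are read by
# train_headless.py (defaults shown; secrets come from your accounts).
export HF_TOKEN="..."                  # required
export WANDB_API_KEY="..."             # required
export MODEL_NAME="unsloth/DeepSeek-R1-0528-Qwen3-8B"
export MAX_STEPS=500
export BATCH_SIZE=2
export LEARNING_RATE=5e-6
export REWARD_MODE=FULL                # FULL, ROBUST, or LEGACY
export HF_REPO="prolewiki/marxist-grpo-lora"

python -m prolewiki_llm.train_headless
```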