# LLM Models for Deepfake Annotation
*Author: Laura Wagner*
## Overview
The pipeline now includes **6 LLM options** in individual cells for easy comparison:
1. **Deepseek** - Testing (use first!)
2. **Qwen (API)** - Chinese (Alibaba Cloud)
3. **Llama** - American (Meta)
4. **Mixtral** - French (Mistral AI)
5. **Gemma** - American Open Source (Google)
6. **Qwen-2.5-32B Local** - FREE local inference (NEW!)
## The 6 LLMs
### 1. Deepseek (Testing)
**Cell 10**
- **Model**: deepseek-chat
- **Provider**: DeepSeek
- **API**: https://platform.deepseek.com/
- **Cost**: ~$0.14-0.28 per 1M tokens (~$1-2 for 10k entries)
- **Use case**: **Test this first!** Cheapest option to verify pipeline works
- **API Key**: `misc/credentials/deepseek_api_key.txt`
---
### 2. Qwen API (Chinese)
**Cells 11-12**
- **Model**: qwen-max (automatically uses Qwen3-Max)
- **Provider**: Alibaba Cloud DashScope
- **API**: https://dashscope.aliyun.com/
- **Cost**: Variable (check Alibaba pricing)
- **Use case**: Chinese company, strong multilingual support
- **API Key**: `misc/credentials/qwen_api_key.txt`
- **Note**: Uses latest Qwen3-Max when you specify `qwen-max`
---
### 3. Llama (American)
**Cells 13-14**
- **Model**: meta-llama/Meta-Llama-3.1-70B-Instruct-Turbo
- **Provider**: Together AI (hosting Meta's model)
- **Developer**: Meta (American)
- **API**: https://www.together.ai/
- **Cost**: ~$0.90 per 1M tokens (~$5-10 for 10k entries)
- **Use case**: Open-source American model, good quality
- **API Key**: `misc/credentials/together_api_key.txt`
---
### 4. Mixtral (French)
**Cells 15-16**
- **Model**: open-mixtral-8x22b
- **Provider**: Mistral AI
- **Developer**: Mistral AI (French)
- **API**: https://mistral.ai/
- **Cost**: ~$2 per 1M tokens (~$10-20 for 10k entries)
- **Use case**: European alternative, Mixture-of-Experts architecture
- **API Key**: `misc/credentials/mistral_api_key.txt`
- **Note**: Using open-mixtral-8x22b (cheaper than mistral-large)
---
### 5. Gemma (American Open Source)
**Cells 17-18**
- **Model**: google/gemma-2-27b-it
- **Provider**: Together AI (hosting Google's model)
- **Developer**: Google (American)
- **API**: https://www.together.ai/ (same as Llama)
- **Cost**: ~$0.80 per 1M tokens (~$4-8 for 10k entries)
- **Use case**: American open-source alternative, competitive quality
- **API Key**: `misc/credentials/together_api_key.txt` (same as Llama)
- **Note**: Fully open-source, can be self-hosted
---
### 6. Qwen-2.5-32B Local (FREE!)
**Cells 19-20** (NEW!)
- **Model**: qwen2.5:32b-instruct
- **Provider**: Ollama (local inference)
- **Setup**: https://ollama.com/
- **Cost**: **$0** (FREE - no API costs!)
- **Requirements**:
  - A100 80GB GPU (or similar)
  - ~25GB VRAM during inference
  - ~20GB storage for model download
  - Ollama installed
- **Speed**: 5-10 tokens/sec on A100 (~100-200 samples/hour)
- **Use case**:
  - βœ… Large datasets (>1000 samples) where cost matters
  - βœ… Privacy-sensitive research data
  - βœ… Offline processing
  - βœ… Strong multilingual support
- **Setup guide**: See `QWEN_LOCAL_SETUP.md`
---
## Cost Comparison (10,000 entries)
| Model | Provider | Cost | Time | Origin |
|-------|----------|------|------|--------|
| **Qwen-2.5-32B Local** | Ollama (local) | **$0** | ~50-100 hrs | πŸ‡¨πŸ‡³ Chinese |
| **Deepseek** | DeepSeek | ~$1-2 | ~5-10 hrs | πŸ‡¨πŸ‡³ Chinese |
| **Gemma 2** | Together AI | ~$4-8 | ~5-10 hrs | πŸ‡ΊπŸ‡Έ American (open) |
| **Llama 3.1** | Together AI | ~$5-10 | ~5-10 hrs | πŸ‡ΊπŸ‡Έ American (open) |
| **Mixtral** | Mistral AI | ~$10-20 | ~5-10 hrs | πŸ‡«πŸ‡· French (open) |
| **Qwen API** | Alibaba | Variable | ~5-10 hrs | πŸ‡¨πŸ‡³ Chinese |
**Note**: Local inference is FREE but slower. Good for large datasets where cost matters more than time.
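If you want to sanity-check the ballpark figures in the table, a back-of-the-envelope estimate is just entries Γ— tokens-per-entry Γ— price-per-token. The prices below come from the per-model sections above; the 600 tokens-per-entry figure is an illustrative assumption, not a measured value:

```python
# USD per 1M tokens (approximate, from the sections above)
PRICE_PER_1M = {
    "deepseek": 0.21,   # midpoint of $0.14-0.28
    "gemma": 0.80,
    "llama": 0.90,
    "mixtral": 2.00,
}

def estimate_cost(model: str, entries: int, tokens_per_entry: int = 600) -> float:
    """Return an approximate USD cost for annotating `entries` rows."""
    total_tokens = entries * tokens_per_entry
    return total_tokens / 1_000_000 * PRICE_PER_1M[model]

print(f"Deepseek, 10k entries: ~${estimate_cost('deepseek', 10_000):.2f}")
```

Adjust `tokens_per_entry` to match your actual prompt plus response length before trusting the numbers.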
## Recommended Testing Order
### 1. Start with Deepseek
```python
# Cell 10
TEST_MODE = True
TEST_SIZE = 10
```
- **Why**: Cheapest, verify pipeline works
- **Cost**: Pennies for 10 samples
### 2. Compare on Small Sample
Pick 2-3 models and run on same 100 samples:
```python
# In each cell:
TEST_MODE = True
TEST_SIZE = 100
```
**Good combinations:**
- Budget: Deepseek + Gemma
- Quality: Llama + Mixtral
- Geographic: Qwen + Llama + Mixtral
### 3. Production Run
Choose best model from testing and run full dataset:
```python
TEST_MODE = False
MAX_ROWS = None # or 20000
```
## API Key Setup
### For Deepseek & Qwen (separate keys):
```bash
echo "your-deepseek-key" > misc/credentials/deepseek_api_key.txt
echo "your-qwen-key" > misc/credentials/qwen_api_key.txt
```
### For Llama & Gemma (same Together AI key):
```bash
echo "your-together-key" > misc/credentials/together_api_key.txt
```
Both Llama and Gemma use the same Together AI key!
### For Mixtral:
```bash
echo "your-mistral-key" > misc/credentials/mistral_api_key.txt
```
## Output Files
Each LLM saves to a separate file:
```
data/CSV/
β”œβ”€β”€ deepseek_annotated_POI_test.csv # Deepseek test
β”œβ”€β”€ deepseek_annotated_POI.csv # Deepseek full
β”œβ”€β”€ qwen_annotated_POI_test.csv # Qwen API test
β”œβ”€β”€ qwen_annotated_POI.csv # Qwen API full
β”œβ”€β”€ qwen_local_annotated_POI_test.csv # Qwen Local test (NEW!)
β”œβ”€β”€ qwen_local_annotated_POI.csv # Qwen Local full (NEW!)
β”œβ”€β”€ llama_annotated_POI_test.csv # Llama test
β”œβ”€β”€ llama_annotated_POI.csv # Llama full
β”œβ”€β”€ mixtral_annotated_POI_test.csv # Mixtral test
β”œβ”€β”€ mixtral_annotated_POI.csv # Mixtral full
β”œβ”€β”€ gemma_annotated_POI_test.csv # Gemma test
└── gemma_annotated_POI.csv # Gemma full
```
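Because every file follows the `<model>_annotated_POI*.csv` naming scheme, you can check which runs have completed with a quick glob (a convenience sketch, not part of the pipeline):

```python
from pathlib import Path

def list_annotations(csv_dir: str = "data/CSV") -> dict:
    """Map each model prefix to its annotation files found on disk."""
    found = {}
    for f in sorted(Path(csv_dir).glob("*_annotated_POI*.csv")):
        prefix = f.name.split("_annotated_POI")[0]
        found.setdefault(prefix, []).append(f.name)
    return found
```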
## Comparing Results
After running multiple LLMs, compare results:
```python
import pandas as pd

# Load results from each model (test files; swap in the full files as needed)
models = ["deepseek", "qwen", "qwen_local", "llama", "mixtral", "gemma"]
dfs = {m: pd.read_csv(f"data/CSV/{m}_annotated_POI_test.csv") for m in models}

# Compare profession distributions
for name, df in dfs.items():
    print(f"{name} professions:", df["profession_llm"].value_counts().head())

# Compare specific cases
print("\nIrene identification:")
for name, df in dfs.items():
    print(f"{name}:", df[df["real_name"] == "Irene"]["full_name"].values)
```
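Beyond eyeballing distributions, you can quantify how often two models agree on a label. The helper below is a sketch: it assumes `real_name` uniquely identifies rows and that the label column is `profession_llm`, as in the outputs above:

```python
import pandas as pd

def agreement_rate(df_a: pd.DataFrame, df_b: pd.DataFrame,
                   key: str = "real_name", col: str = "profession_llm") -> float:
    """Fraction of shared rows where two models assigned the same label."""
    merged = df_a[[key, col]].merge(df_b[[key, col]],
                                    on=key, suffixes=("_a", "_b"))
    return (merged[f"{col}_a"] == merged[f"{col}_b"]).mean()

# e.g. agreement_rate(deepseek_df, llama_df)
```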
## Model Characteristics
### Deepseek
- βœ… Very cheap
- βœ… Good for testing
- ⚠️ Less documentation
- πŸ‡¨πŸ‡³ Chinese company
### Qwen (Qwen3-Max)
- βœ… Latest version automatically used
- βœ… Strong multilingual
- βœ… Good Asian name recognition
- πŸ’° Variable cost
- πŸ‡¨πŸ‡³ Chinese company (Alibaba)
### Llama 3.1 70B
- βœ… Open-source
- βœ… Strong overall performance
- βœ… Well-documented
- βœ… American (Meta)
- πŸ’° Mid-range cost
### Mixtral 8x22B
- βœ… Open-source
- βœ… MoE architecture (efficient)
- βœ… European alternative
- πŸ’° Mid-range cost
- πŸ‡«πŸ‡· French company
### Gemma 2 27B
- βœ… Fully open-source
- βœ… Can self-host
- βœ… American (Google)
- βœ… Cheap via API
- βœ… Good quality for size
### Qwen-2.5-32B Local (NEW!)
- βœ… **FREE** - $0 cost (no API fees)
- ⚠️ **SLOWER** - ~5-10 tokens/sec on A100 (expect longer runs than the APIs)
- βœ… **PRIVATE** - Data never leaves your machine
- βœ… **OFFLINE** - Works without internet
- βœ… **HIGH QUALITY** - 32B parameter model
- βœ… Strong multilingual support
- ⚠️ Requires: A100 80GB GPU, ~25GB VRAM, Ollama installed
- πŸ‡¨πŸ‡³ Chinese company (Alibaba)
- πŸ“¦ Model size: ~20GB download
## Decision Matrix
### If you prioritize...
**FREE / Zero Cost**: Use **Qwen-2.5-32B Local** (no API fees!)
**Cost** (with API): Use **Deepseek** or **Gemma**
**Quality**: Use **Qwen-2.5-32B Local**, **Llama**, or **Mixtral**
**Privacy**: Use **Qwen-2.5-32B Local** (data stays on your machine)
**American/Open Source**: Use **Gemma** or **Llama**
**Asian Names**: Use **Qwen** (API or Local - strong multilingual)
**European Provider**: Use **Mixtral**
**Testing**: Use **Deepseek** first, always!
## Running Multiple Models
You can run all 6 models in sequence:
```python
# 1. Run Cell 10 (Deepseek) - verify works (~$1-2 for 10k)
# 2. Run Cell 12 (Qwen API) - Chinese perspective (~variable cost)
# 3. Run Cell 14 (Llama) - American perspective (~$5-10 for 10k)
# 4. Run Cell 16 (Mixtral) - European perspective (~$10-20 for 10k)
# 5. Run Cell 18 (Gemma) - Open source perspective (~$4-8 for 10k)
# 6. Run Cell 20 (Qwen-2.5-32B Local) - FREE local inference ($0!)
```
Each saves to its own file, so you can compare results!
## Notes
- **Llama and Gemma use the same API key** (Together AI)
- All models use the **same 9 profession categories**
- All models have **automatic retries** with exponential backoff
- All models **save progress** every 10 rows
- All models are **resumable** if interrupted
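The retry behavior mentioned above can be sketched as exponential backoff with jitter. This is an illustrative pattern, not the notebook's exact code; `with_retries` and its parameters are hypothetical names:

```python
import random
import time

def with_retries(call, max_attempts: int = 5, base_delay: float = 1.0):
    """Retry `call()` with exponential backoff plus jitter (sketch)."""
    for attempt in range(max_attempts):
        try:
            return call()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error
            # Delay doubles each attempt; jitter spreads out concurrent retries
            time.sleep(base_delay * 2 ** attempt * (1 + random.random()))
```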
## Summary
You now have **6 LLM options** to choose from:
1. πŸ§ͺ **Deepseek** - Test first (cheapest API)
2. πŸ‡¨πŸ‡³ **Qwen3-Max API** - Chinese, strong multilingual
3. πŸ‡ΊπŸ‡Έ **Llama 3.1 70B** - American, open-source
4. πŸ‡«πŸ‡· **Mixtral 8x22B** - French, open-source MoE
5. πŸ‡ΊπŸ‡Έ **Gemma 2 27B** - American open-source (Google)
6. πŸ†“ **Qwen-2.5-32B Local** - FREE local inference (NEW!)
Each in its own cell, easy to run and compare! πŸŽ‰
**Recommended workflow**:
1. Test with Deepseek (Cell 10) - verify pipeline works
2. For small datasets (<1000): Use API (Deepseek/Gemma/Llama)
3. For large datasets (>1000): Use Qwen-2.5-32B Local (Cell 20) - FREE!