code

File size: 10,465 Bytes

5f5806d

# LLM Models for Deepfake Annotation

## Overview

The pipeline now includes **6 LLM options** in individual cells for easy comparison:

1. **Deepseek** - Testing (use first!)
2. **Qwen (API)** - Chinese (Alibaba Cloud)
3. **Llama** - American (Meta)
4. **Mixtral** - French (Mistral AI)
5. **Gemma** - American Open Source (Google)
6. **Qwen-2.5-32B Local** - FREE local inference (NEW!)

## The 6 LLMs

### 1. Deepseek (Testing)
**Cell 10**

- **Model**: deepseek-chat
- **Provider**: DeepSeek
- **API**: https://platform.deepseek.com/
- **Cost**: ~$0.14-0.28 per 1M tokens (~$1-2 for 10k entries)
- **Use case**: **Test this first!** Cheapest option to verify pipeline works
- **API Key**: `misc/credentials/deepseek_api_key.txt`

---

### 2. Qwen API (Chinese)
**Cells 11-12**

- **Model**: qwen-max (automatically uses Qwen3-Max)
- **Provider**: Alibaba Cloud DashScope
- **API**: https://dashscope.aliyun.com/
- **Cost**: Variable (check Alibaba pricing)
- **Use case**: Chinese company, strong multilingual support
- **API Key**: `misc/credentials/qwen_api_key.txt`
- **Note**: Uses latest Qwen3-Max when you specify `qwen-max`

---

### 6. Qwen-2.5-32B Local (FREE!)
**Cells 19-20** (NEW!)

- **Model**: qwen2.5:32b-instruct
- **Provider**: Ollama (local inference)
- **Setup**: https://ollama.com/
- **Cost**: **$0** (FREE - no API costs!)
- **Requirements**:
  - A100 80GB GPU (or similar)
  - ~25GB VRAM during inference
  - ~20GB storage for model download
  - Ollama installed
- **Speed**: 5-10 tokens/sec on A100 (~100-200 samples/hour)
- **Use case**:
  - ✅ Large datasets (>1000 samples) where cost matters
  - ✅ Privacy-sensitive research data
  - ✅ Offline processing
  - ✅ Strong multilingual support
- **Setup guide**: See `QWEN_LOCAL_SETUP.md`

---

### 3. Llama (American)
**Cells 13-14**

- **Model**: meta-llama/Meta-Llama-3.1-70B-Instruct-Turbo
- **Provider**: Together AI (hosting Meta's model)
- **Developer**: Meta (American)
- **API**: https://www.together.ai/
- **Cost**: ~$0.90 per 1M tokens (~$5-10 for 10k entries)
- **Use case**: Open-source American model, good quality
- **API Key**: `misc/credentials/together_api_key.txt`

---

### 4. Mixtral (French)
**Cells 15-16**

- **Model**: open-mixtral-8x22b
- **Provider**: Mistral AI
- **Developer**: Mistral AI (French)
- **API**: https://mistral.ai/
- **Cost**: ~$2 per 1M tokens (~$10-20 for 10k entries)
- **Use case**: European alternative, Mixture-of-Experts architecture
- **API Key**: `misc/credentials/mistral_api_key.txt`
- **Note**: Using open-mixtral-8x22b (cheaper than mistral-large)

---

### 5. Gemma (American Open Source)
**Cells 17-18**

- **Model**: google/gemma-2-27b-it
- **Provider**: Together AI (hosting Google's model)
- **Developer**: Google (American)
- **API**: https://www.together.ai/ (same as Llama)
- **Cost**: ~$0.80 per 1M tokens (~$4-8 for 10k entries)
- **Use case**: American open-source alternative, competitive quality
- **API Key**: `misc/credentials/together_api_key.txt` (same as Llama)
- **Note**: Fully open-source, can be self-hosted

---

## Cost Comparison (10,000 entries)

| Model | Provider | Cost | Time | Origin |
|-------|----------|------|------|--------|
| **Qwen-2.5-32B Local** | Ollama (local) | **$0** | ~50-100 hrs | 🇨🇳 Chinese |
| **Deepseek** | DeepSeek | ~$1-2 | ~5-10 hrs | 🇨🇳 Chinese |
| **Gemma 2** | Together AI | ~$4-8 | ~5-10 hrs | 🇺🇸 American (open) |
| **Llama 3.1** | Together AI | ~$5-10 | ~5-10 hrs | 🇺🇸 American (open) |
| **Mixtral** | Mistral AI | ~$10-20 | ~5-10 hrs | 🇫🇷 French (open) |
| **Qwen API** | Alibaba | Variable | ~5-10 hrs | 🇨🇳 Chinese |

**Note**: Local inference is FREE but slower. Good for large datasets where cost matters more than time.

## Recommended Testing Order

### 1. Start with Deepseek
```python
# Cell 10
TEST_MODE = True
TEST_SIZE = 10
```
- **Why**: Cheapest, verify pipeline works
- **Cost**: Pennies for 10 samples

### 2. Compare on Small Sample
Pick 2-3 models and run on same 100 samples:
```python
# In each cell:
TEST_MODE = True
TEST_SIZE = 100
```

**Good combinations:**
- Budget: Deepseek + Gemma
- Quality: Llama + Mixtral
- Geographic: Qwen + Llama + Mixtral

### 3. Production Run
Choose best model from testing and run full dataset:
```python
TEST_MODE = False
MAX_ROWS = None  # or 20000
```

## API Key Setup

### For Deepseek & Qwen (separate keys):
```bash
echo "your-deepseek-key" > misc/credentials/deepseek_api_key.txt
echo "your-qwen-key" > misc/credentials/qwen_api_key.txt
```

### For Llama & Gemma (same Together AI key):
```bash
echo "your-together-key" > misc/credentials/together_api_key.txt
```
Both Llama and Gemma use the same Together AI key!

### For Mixtral:
```bash
echo "your-mistral-key" > misc/credentials/mistral_api_key.txt
```

## Output Files

Each LLM saves to a separate file:

```
data/CSV/
├── deepseek_annotated_POI_test.csv       # Deepseek test
├── deepseek_annotated_POI.csv            # Deepseek full
├── qwen_annotated_POI_test.csv           # Qwen API test
├── qwen_annotated_POI.csv                # Qwen API full
├── qwen_local_annotated_POI_test.csv     # Qwen Local test (NEW!)
├── qwen_local_annotated_POI.csv          # Qwen Local full (NEW!)
├── llama_annotated_POI_test.csv          # Llama test
├── llama_annotated_POI.csv               # Llama full
├── mixtral_annotated_POI_test.csv        # Mixtral test
├── mixtral_annotated_POI.csv             # Mixtral full
├── gemma_annotated_POI_test.csv          # Gemma test
└── gemma_annotated_POI.csv               # Gemma full
```

## Comparing Results

After running multiple LLMs, compare results:

```python
import pandas as pd

# Load results from different models
deepseek_df = pd.read_csv('data/CSV/deepseek_annotated_POI_test.csv')
qwen_df = pd.read_csv('data/CSV/qwen_annotated_POI_test.csv')
qwen_local_df = pd.read_csv('data/CSV/qwen_local_annotated_POI_test.csv')  # NEW!
llama_df = pd.read_csv('data/CSV/llama_annotated_POI_test.csv')
mixtral_df = pd.read_csv('data/CSV/mixtral_annotated_POI_test.csv')
gemma_df = pd.read_csv('data/CSV/gemma_annotated_POI_test.csv')

# Compare profession distributions
print("Deepseek professions:", deepseek_df['profession_llm'].value_counts().head())
print("Qwen API professions:", qwen_df['profession_llm'].value_counts().head())
print("Qwen Local professions:", qwen_local_df['profession_llm'].value_counts().head())  # NEW!
print("Llama professions:", llama_df['profession_llm'].value_counts().head())
print("Mixtral professions:", mixtral_df['profession_llm'].value_counts().head())
print("Gemma professions:", gemma_df['profession_llm'].value_counts().head())

# Compare specific cases
print("\nIrene identification:")
print("Deepseek:", deepseek_df[deepseek_df['real_name'] == 'Irene']['full_name'].values)
print("Qwen API:", qwen_df[qwen_df['real_name'] == 'Irene']['full_name'].values)
print("Qwen Local:", qwen_local_df[qwen_local_df['real_name'] == 'Irene']['full_name'].values)
print("Llama:", llama_df[llama_df['real_name'] == 'Irene']['full_name'].values)
print("Mixtral:", mixtral_df[mixtral_df['real_name'] == 'Irene']['full_name'].values)
print("Gemma:", gemma_df[gemma_df['real_name'] == 'Irene']['full_name'].values)
```

## Model Characteristics

### Deepseek
- ✅ Very cheap
- ✅ Good for testing
- ⚠️ Less documentation
- 🇨🇳 Chinese company

### Qwen (Qwen3-Max)
- ✅ Latest version automatically used
- ✅ Strong multilingual
- ✅ Good Asian name recognition
- 💰 Variable cost
- 🇨🇳 Chinese company (Alibaba)

### Llama 3.1 70B
- ✅ Open-source
- ✅ Strong overall performance
- ✅ Well-documented
- ✅ American (Meta)
- 💰 Mid-range cost

### Mixtral 8x22B
- ✅ Open-source
- ✅ MoE architecture (efficient)
- ✅ European alternative
- 💰 Mid-range cost
- 🇫🇷 French company

### Gemma 2 27B
- ✅ Fully open-source
- ✅ Can self-host
- ✅ American (Google)
- ✅ Cheap via API
- ✅ Good quality for size

### Qwen-2.5-32B Local (NEW!)
- ✅ **FREE** - $0 cost (no API fees)
- ✅ **FAST** - Local inference on A100 (5-10 tokens/sec)
- ✅ **PRIVATE** - Data never leaves your machine
- ✅ **OFFLINE** - Works without internet
- ✅ **HIGH QUALITY** - 32B parameter model
- ✅ Strong multilingual support
- ⚠️ Requires: A100 80GB GPU, ~25GB VRAM, Ollama installed
- 🇨🇳 Chinese company (Alibaba)
- 📦 Model size: ~20GB download

## Decision Matrix

### If you prioritize...

**FREE / Zero Cost**: Use **Qwen-2.5-32B Local** (no API fees!)

**Cost** (with API): Use **Deepseek** or **Gemma**

**Quality**: Use **Qwen-2.5-32B Local**, **Llama**, or **Mixtral**

**Privacy**: Use **Qwen-2.5-32B Local** (data stays on your machine)

**American/Open Source**: Use **Gemma** or **Llama**

**Asian Names**: Use **Qwen** (API or Local - strong multilingual)

**European Provider**: Use **Mixtral**

**Testing**: Use **Deepseek** first, always!

## Running Multiple Models

You can run all 6 models in sequence:

```python
# 1. Run Cell 10 (Deepseek) - verify works (~$1-2 for 10k)
# 2. Run Cell 12 (Qwen API) - Chinese perspective (~variable cost)
# 3. Run Cell 14 (Llama) - American perspective (~$5-10 for 10k)
# 4. Run Cell 16 (Mixtral) - European perspective (~$10-20 for 10k)
# 5. Run Cell 18 (Gemma) - Open source perspective (~$4-8 for 10k)
# 6. Run Cell 20 (Qwen-2.5-32B Local) - FREE local inference ($0!)
```

Each saves to its own file, so you can compare results!

## Notes

- **Llama and Gemma use the same API key** (Together AI)
- All models use the **same 9 profession categories**
- All models have **automatic retries** with exponential backoff
- All models **save progress** every 10 rows
- All models are **resumable** if interrupted

## Summary

You now have **6 LLM options** to choose from:

1. 🧪 **Deepseek** - Test first (cheapest API)
2. 🇨🇳 **Qwen3-Max API** - Chinese, strong multilingual
3. 🇺🇸 **Llama 3.1 70B** - American, open-source
4. 🇫🇷 **Mixtral 8x22B** - French, open-source MoE
5. 🇺🇸 **Gemma 2 27B** - American open-source (Google)
6. 💰 **Qwen-2.5-32B Local** - FREE local inference (NEW!)

Each in its own cell, easy to run and compare! 🎉

**Recommended workflow**:
1. Test with Deepseek (Cell 10) - verify pipeline works
2. For small datasets (<1000): Use API (Deepseek/Gemma/Llama)
3. For large datasets (>1000): Use Qwen-2.5-32B Local (Cell 20) - FREE!