# LLM Models for Deepfake Annotation
*Author: Laura Wagner*
## Overview
The pipeline now includes **6 LLM options** in individual cells for easy comparison:
1. **Deepseek** - Testing (use first!)
2. **Qwen (API)** - Chinese (Alibaba Cloud)
3. **Llama** - American (Meta)
4. **Mixtral** - French (Mistral AI)
5. **Gemma** - American Open Source (Google)
6. **Qwen-2.5-32B Local** - FREE local inference (NEW!)
## The 6 LLMs
### 1. Deepseek (Testing)
**Cell 10**
- **Model**: deepseek-chat
- **Provider**: DeepSeek
- **API**: https://platform.deepseek.com/
- **Cost**: ~$0.14-0.28 per 1M tokens (~$1-2 for 10k entries)
- **Use case**: **Test this first!** Cheapest option to verify pipeline works
- **API Key**: `misc/credentials/deepseek_api_key.txt`
---
### 2. Qwen API (Chinese)
**Cells 11-12**
- **Model**: qwen-max (automatically uses Qwen3-Max)
- **Provider**: Alibaba Cloud DashScope
- **API**: https://dashscope.aliyun.com/
- **Cost**: Variable (check Alibaba pricing)
- **Use case**: Chinese company, strong multilingual support
- **API Key**: `misc/credentials/qwen_api_key.txt`
- **Note**: Uses latest Qwen3-Max when you specify `qwen-max`
---
### 3. Llama (American)
**Cells 13-14**
- **Model**: meta-llama/Meta-Llama-3.1-70B-Instruct-Turbo
- **Provider**: Together AI (hosting Meta's model)
- **Developer**: Meta (American)
- **API**: https://www.together.ai/
- **Cost**: ~$0.90 per 1M tokens (~$5-10 for 10k entries)
- **Use case**: Open-source American model, good quality
- **API Key**: `misc/credentials/together_api_key.txt`
---
### 4. Mixtral (French)
**Cells 15-16**
- **Model**: open-mixtral-8x22b
- **Provider**: Mistral AI
- **Developer**: Mistral AI (French)
- **API**: https://mistral.ai/
- **Cost**: ~$2 per 1M tokens (~$10-20 for 10k entries)
- **Use case**: European alternative, Mixture-of-Experts architecture
- **API Key**: `misc/credentials/mistral_api_key.txt`
- **Note**: Using open-mixtral-8x22b (cheaper than mistral-large)
---
### 5. Gemma (American Open Source)
**Cells 17-18**
- **Model**: google/gemma-2-27b-it
- **Provider**: Together AI (hosting Google's model)
- **Developer**: Google (American)
- **API**: https://www.together.ai/ (same as Llama)
- **Cost**: ~$0.80 per 1M tokens (~$4-8 for 10k entries)
- **Use case**: American open-source alternative, competitive quality
- **API Key**: `misc/credentials/together_api_key.txt` (same as Llama)
- **Note**: Fully open-source, can be self-hosted
---
### 6. Qwen-2.5-32B Local (FREE!)
**Cells 19-20** (NEW!)
- **Model**: qwen2.5:32b-instruct
- **Provider**: Ollama (local inference)
- **Setup**: https://ollama.com/
- **Cost**: **$0** (FREE - no API costs!)
- **Requirements**:
  - A100 80GB GPU (or similar)
  - ~25GB VRAM during inference
  - ~20GB storage for model download
  - Ollama installed
- **Speed**: 5-10 tokens/sec on A100 (~100-200 samples/hour)
- **Use case**:
  - βœ… Large datasets (>1000 samples) where cost matters
  - βœ… Privacy-sensitive research data
  - βœ… Offline processing
  - βœ… Strong multilingual support
- **Setup guide**: See `QWEN_LOCAL_SETUP.md`
---
## Cost Comparison (10,000 entries)
| Model | Provider | Cost | Time | Origin |
|-------|----------|------|------|--------|
| **Qwen-2.5-32B Local** | Ollama (local) | **$0** | ~50-100 hrs | πŸ‡¨πŸ‡³ Chinese |
| **Deepseek** | DeepSeek | ~$1-2 | ~5-10 hrs | πŸ‡¨πŸ‡³ Chinese |
| **Gemma 2** | Together AI | ~$4-8 | ~5-10 hrs | πŸ‡ΊπŸ‡Έ American (open) |
| **Llama 3.1** | Together AI | ~$5-10 | ~5-10 hrs | πŸ‡ΊπŸ‡Έ American (open) |
| **Mixtral** | Mistral AI | ~$10-20 | ~5-10 hrs | πŸ‡«πŸ‡· French (open) |
| **Qwen API** | Alibaba | Variable | ~5-10 hrs | πŸ‡¨πŸ‡³ Chinese |
**Note**: Local inference is FREE but slower. Good for large datasets where cost matters more than time.
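If you want to sanity-check the ballpark figures in the table, a back-of-the-envelope estimate is just entries Γ— tokens-per-entry Γ— price-per-token. The prices below come from the per-model sections above; the 600 tokens-per-entry figure is an illustrative assumption, not a measured value:

```python
# USD per 1M tokens (approximate, from the sections above)
PRICE_PER_1M = {
    "deepseek": 0.21,   # midpoint of $0.14-0.28
    "gemma": 0.80,
    "llama": 0.90,
    "mixtral": 2.00,
}

def estimate_cost(model: str, entries: int, tokens_per_entry: int = 600) -> float:
    """Return an approximate USD cost for annotating `entries` rows."""
    total_tokens = entries * tokens_per_entry
    return total_tokens / 1_000_000 * PRICE_PER_1M[model]

print(f"Deepseek, 10k entries: ~${estimate_cost('deepseek', 10_000):.2f}")
```

Adjust `tokens_per_entry` to match your actual prompt plus response length before trusting the numbers.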
## Recommended Testing Order
### 1. Start with Deepseek
```python
# Cell 10
TEST_MODE = True
TEST_SIZE = 10
```
- **Why**: Cheapest, verify pipeline works
- **Cost**: Pennies for 10 samples
### 2. Compare on Small Sample
Pick 2-3 models and run on same 100 samples:
```python
# In each cell:
TEST_MODE = True
TEST_SIZE = 100
```
**Good combinations:**
- Budget: Deepseek + Gemma
- Quality: Llama + Mixtral
- Geographic: Qwen + Llama + Mixtral
### 3. Production Run
Choose best model from testing and run full dataset:
```python
TEST_MODE = False
MAX_ROWS = None # or 20000
```
## API Key Setup
### For Deepseek & Qwen (separate keys):
```bash
echo "your-deepseek-key" > misc/credentials/deepseek_api_key.txt
echo "your-qwen-key" > misc/credentials/qwen_api_key.txt
```
### For Llama & Gemma (same Together AI key):
```bash
echo "your-together-key" > misc/credentials/together_api_key.txt
```
Both Llama and Gemma use the same Together AI key!
### For Mixtral:
```bash
echo "your-mistral-key" > misc/credentials/mistral_api_key.txt
```
## Output Files
Each LLM saves to a separate file:
```
data/CSV/
β”œβ”€β”€ deepseek_annotated_POI_test.csv # Deepseek test
β”œβ”€β”€ deepseek_annotated_POI.csv # Deepseek full
β”œβ”€β”€ qwen_annotated_POI_test.csv # Qwen API test
β”œβ”€β”€ qwen_annotated_POI.csv # Qwen API full
β”œβ”€β”€ qwen_local_annotated_POI_test.csv # Qwen Local test (NEW!)
β”œβ”€β”€ qwen_local_annotated_POI.csv # Qwen Local full (NEW!)
β”œβ”€β”€ llama_annotated_POI_test.csv # Llama test
β”œβ”€β”€ llama_annotated_POI.csv # Llama full
β”œβ”€β”€ mixtral_annotated_POI_test.csv # Mixtral test
β”œβ”€β”€ mixtral_annotated_POI.csv # Mixtral full
β”œβ”€β”€ gemma_annotated_POI_test.csv # Gemma test
└── gemma_annotated_POI.csv # Gemma full
```
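Because every file follows the `<model>_annotated_POI*.csv` naming scheme, you can check which runs have completed with a quick glob (a convenience sketch, not part of the pipeline):

```python
from pathlib import Path

def list_annotations(csv_dir: str = "data/CSV") -> dict:
    """Map each model prefix to its annotation files found on disk."""
    found = {}
    for f in sorted(Path(csv_dir).glob("*_annotated_POI*.csv")):
        prefix = f.name.split("_annotated_POI")[0]
        found.setdefault(prefix, []).append(f.name)
    return found
```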
## Comparing Results
After running multiple LLMs, compare results:
```python
import pandas as pd

# Load results from each model (test files; swap in the full files as needed)
models = ["deepseek", "qwen", "qwen_local", "llama", "mixtral", "gemma"]
dfs = {m: pd.read_csv(f"data/CSV/{m}_annotated_POI_test.csv") for m in models}

# Compare profession distributions
for name, df in dfs.items():
    print(f"{name} professions:", df["profession_llm"].value_counts().head())

# Compare specific cases
print("\nIrene identification:")
for name, df in dfs.items():
    print(f"{name}:", df[df["real_name"] == "Irene"]["full_name"].values)
```
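Beyond eyeballing distributions, you can quantify how often two models agree on a label. The helper below is a sketch: it assumes `real_name` uniquely identifies rows and that the label column is `profession_llm`, as in the outputs above:

```python
import pandas as pd

def agreement_rate(df_a: pd.DataFrame, df_b: pd.DataFrame,
                   key: str = "real_name", col: str = "profession_llm") -> float:
    """Fraction of shared rows where two models assigned the same label."""
    merged = df_a[[key, col]].merge(df_b[[key, col]],
                                    on=key, suffixes=("_a", "_b"))
    return (merged[f"{col}_a"] == merged[f"{col}_b"]).mean()

# e.g. agreement_rate(deepseek_df, llama_df)
```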
## Model Characteristics
### Deepseek
- βœ… Very cheap
- βœ… Good for testing
- ⚠️ Less documentation
- πŸ‡¨πŸ‡³ Chinese company
### Qwen (Qwen3-Max)
- βœ… Latest version automatically used
- βœ… Strong multilingual
- βœ… Good Asian name recognition
- πŸ’° Variable cost
- πŸ‡¨πŸ‡³ Chinese company (Alibaba)
### Llama 3.1 70B
- βœ… Open-source
- βœ… Strong overall performance
- βœ… Well-documented
- βœ… American (Meta)
- πŸ’° Mid-range cost
### Mixtral 8x22B
- βœ… Open-source
- βœ… MoE architecture (efficient)
- βœ… European alternative
- πŸ’° Mid-range cost
- πŸ‡«πŸ‡· French company
### Gemma 2 27B
- βœ… Fully open-source
- βœ… Can self-host
- βœ… American (Google)
- βœ… Cheap via API
- βœ… Good quality for size
### Qwen-2.5-32B Local (NEW!)
- βœ… **FREE** - $0 cost (no API fees)
- ⚠️ **SLOWER** - ~5-10 tokens/sec on A100 (expect longer runs than the APIs)
- βœ… **PRIVATE** - Data never leaves your machine
- βœ… **OFFLINE** - Works without internet
- βœ… **HIGH QUALITY** - 32B parameter model
- βœ… Strong multilingual support
- ⚠️ Requires: A100 80GB GPU, ~25GB VRAM, Ollama installed
- πŸ‡¨πŸ‡³ Chinese company (Alibaba)
- πŸ“¦ Model size: ~20GB download
## Decision Matrix
### If you prioritize...
**FREE / Zero Cost**: Use **Qwen-2.5-32B Local** (no API fees!)
**Cost** (with API): Use **Deepseek** or **Gemma**
**Quality**: Use **Qwen-2.5-32B Local**, **Llama**, or **Mixtral**
**Privacy**: Use **Qwen-2.5-32B Local** (data stays on your machine)
**American/Open Source**: Use **Gemma** or **Llama**
**Asian Names**: Use **Qwen** (API or Local - strong multilingual)
**European Provider**: Use **Mixtral**
**Testing**: Use **Deepseek** first, always!
## Running Multiple Models
You can run all 6 models in sequence:
```python
# 1. Run Cell 10 (Deepseek) - verify works (~$1-2 for 10k)
# 2. Run Cell 12 (Qwen API) - Chinese perspective (~variable cost)
# 3. Run Cell 14 (Llama) - American perspective (~$5-10 for 10k)
# 4. Run Cell 16 (Mixtral) - European perspective (~$10-20 for 10k)
# 5. Run Cell 18 (Gemma) - Open source perspective (~$4-8 for 10k)
# 6. Run Cell 20 (Qwen-2.5-32B Local) - FREE local inference ($0!)
```
Each saves to its own file, so you can compare results!
## Notes
- **Llama and Gemma use the same API key** (Together AI)
- All models use the **same 9 profession categories**
- All models have **automatic retries** with exponential backoff
- All models **save progress** every 10 rows
- All models are **resumable** if interrupted
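The retry behavior mentioned above can be sketched as exponential backoff with jitter. This is an illustrative pattern, not the notebook's exact code; `with_retries` and its parameters are hypothetical names:

```python
import random
import time

def with_retries(call, max_attempts: int = 5, base_delay: float = 1.0):
    """Retry `call()` with exponential backoff plus jitter (sketch)."""
    for attempt in range(max_attempts):
        try:
            return call()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error
            # Delay doubles each attempt; jitter spreads out concurrent retries
            time.sleep(base_delay * 2 ** attempt * (1 + random.random()))
```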
## Summary
You now have **6 LLM options** to choose from:
1. πŸ§ͺ **Deepseek** - Test first (cheapest API)
2. πŸ‡¨πŸ‡³ **Qwen3-Max API** - Chinese, strong multilingual
3. πŸ‡ΊπŸ‡Έ **Llama 3.1 70B** - American, open-source
4. πŸ‡«πŸ‡· **Mixtral 8x22B** - French, open-source MoE
5. πŸ‡ΊπŸ‡Έ **Gemma 2 27B** - American open-source (Google)
6. πŸ†“ **Qwen-2.5-32B Local** - FREE local inference (NEW!)
Each in its own cell, easy to run and compare! πŸŽ‰
**Recommended workflow**:
1. Test with Deepseek (Cell 10) - verify pipeline works
2. For small datasets (<1000): Use API (Deepseek/Gemma/Llama)
3. For large datasets (>1000): Use Qwen-2.5-32B Local (Cell 20) - FREE!