Laura Wagner

LLM Models for Deepfake Annotation

Overview

The pipeline now includes 6 LLM options in individual cells for easy comparison:

  1. Deepseek - Testing (use first!)
  2. Qwen (API) - Chinese (Alibaba Cloud)
  3. Llama - American (Meta)
  4. Mixtral - French (Mistral AI)
  5. Gemma - American Open Source (Google)
  6. Qwen-2.5-32B Local - FREE local inference (NEW!)

The 6 LLMs

1. Deepseek (Testing)

Cell 10

  • Model: deepseek-chat
  • Provider: DeepSeek
  • API: https://platform.deepseek.com/
  • Cost: $0.14-0.28 per 1M tokens ($1-2 for 10k entries)
  • Use case: Test this first! Cheapest option to verify pipeline works
  • API Key: misc/credentials/deepseek_api_key.txt
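
DeepSeek exposes an OpenAI-compatible chat endpoint, so a smoke test needs nothing beyond the standard library. A sketch (the endpoint path and helper names are ours; the prompt is a placeholder):

```python
import json
import urllib.request
from pathlib import Path

def load_key(path="misc/credentials/deepseek_api_key.txt"):
    """Read and strip the API key stored in the credentials file."""
    return Path(path).read_text().strip()

def build_request(prompt, model="deepseek-chat"):
    """JSON payload for DeepSeek's OpenAI-compatible chat endpoint."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 16,
    }

def smoke_test(prompt="Reply with OK if you can read this."):
    """One cheap round trip to confirm the key and endpoint work."""
    req = urllib.request.Request(
        "https://api.deepseek.com/chat/completions",
        data=json.dumps(build_request(prompt)).encode(),
        headers={
            "Authorization": f"Bearer {load_key()}",
            "Content-Type": "application/json",
        },
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]
```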

2. Qwen API (Chinese)

Cells 11-12

  • Model: qwen-max (automatically uses Qwen3-Max)
  • Provider: Alibaba Cloud DashScope
  • API: https://dashscope.aliyun.com/
  • Cost: Variable (check Alibaba pricing)
  • Use case: Chinese company, strong multilingual support
  • API Key: misc/credentials/qwen_api_key.txt
  • Note: Uses latest Qwen3-Max when you specify qwen-max

3. Llama (American)

Cells 13-14

  • Model: meta-llama/Meta-Llama-3.1-70B-Instruct-Turbo
  • Provider: Together AI (hosting Meta's model)
  • Developer: Meta (American)
  • API: https://www.together.ai/
  • Cost: $0.90 per 1M tokens ($5-10 for 10k entries)
  • Use case: Open-source American model, good quality
  • API Key: misc/credentials/together_api_key.txt

4. Mixtral (French)

Cells 15-16

  • Model: open-mixtral-8x22b
  • Provider: Mistral AI
  • Developer: Mistral AI (French)
  • API: https://mistral.ai/
  • Cost: $2 per 1M tokens ($10-20 for 10k entries)
  • Use case: European alternative, Mixture-of-Experts architecture
  • API Key: misc/credentials/mistral_api_key.txt
  • Note: Using open-mixtral-8x22b (cheaper than mistral-large)

5. Gemma (American Open Source)

Cells 17-18

  • Model: google/gemma-2-27b-it
  • Provider: Together AI (hosting Google's model)
  • Developer: Google (American)
  • API: https://www.together.ai/ (same as Llama)
  • Cost: $0.80 per 1M tokens ($4-8 for 10k entries)
  • Use case: American open-source alternative, competitive quality
  • API Key: misc/credentials/together_api_key.txt (same as Llama)
  • Note: Fully open-source, can be self-hosted

6. Qwen-2.5-32B Local (FREE!)

Cells 19-20 (NEW!)

  • Model: qwen2.5:32b-instruct
  • Provider: Ollama (local inference)
  • Setup: https://ollama.com/
  • Cost: $0 (FREE - no API costs!)
  • Requirements:
    • A100 80GB GPU (or similar)
    • ~25GB VRAM during inference
    • ~20GB storage for model download
    • Ollama installed
  • Speed: 5-10 tokens/sec on A100 (~100-200 samples/hour)
  • Use case:
    • βœ… Large datasets (>1000 samples) where cost matters
    • βœ… Privacy-sensitive research data
    • βœ… Offline processing
    • βœ… Strong multilingual support
  • Setup guide: See QWEN_LOCAL_SETUP.md
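
The local Qwen-2.5-32B option talks to a running Ollama server. A minimal client sketch against Ollama's default /api/generate endpoint (assumes Ollama is installed and qwen2.5:32b-instruct has already been pulled):

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default endpoint

def build_payload(prompt, model="qwen2.5:32b-instruct"):
    """Non-streaming request body for Ollama's generate API."""
    return {"model": model, "prompt": prompt, "stream": False}

def generate(prompt):
    """Send one prompt to the local model and return its completion."""
    req = urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(build_payload(prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["response"]
```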

Cost Comparison (10,000 entries)

Model                Provider         Cost       Time         Origin
Qwen-2.5-32B Local   Ollama (local)   $0         ~50-100 hrs  πŸ‡¨πŸ‡³ Chinese
Deepseek             DeepSeek         ~$1-2      ~5-10 hrs    πŸ‡¨πŸ‡³ Chinese
Gemma 2              Together AI      ~$4-8      ~5-10 hrs    πŸ‡ΊπŸ‡Έ American (open)
Llama 3.1            Together AI      ~$5-10     ~5-10 hrs    πŸ‡ΊπŸ‡Έ American (open)
Mixtral              Mistral AI       ~$10-20    ~5-10 hrs    πŸ‡«πŸ‡· French (open)
Qwen API             Alibaba          Variable   ~5-10 hrs    πŸ‡¨πŸ‡³ Chinese

Note: Local inference is FREE but slower. Good for large datasets where cost matters more than time.
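
The table's API figures can be reproduced with a back-of-envelope estimate. The 500 tokens-per-entry figure below is an assumption, not a measured value; adjust it to your actual prompts:

```python
def estimate_cost(n_entries, tokens_per_entry, usd_per_million_tokens):
    """Rough API cost: total tokens times the per-million-token price."""
    return n_entries * tokens_per_entry / 1_000_000 * usd_per_million_tokens

# 10,000 entries at ~500 tokens each:
print(estimate_cost(10_000, 500, 0.28))  # Deepseek upper bound: ~$1.40
print(estimate_cost(10_000, 500, 2.00))  # Mixtral: ~$10
```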

Recommended Testing Order

1. Start with Deepseek

# Cell 10
TEST_MODE = True
TEST_SIZE = 10
  • Why: Cheapest, verify pipeline works
  • Cost: Pennies for 10 samples

2. Compare on Small Sample

Pick 2-3 models and run on same 100 samples:

# In each cell:
TEST_MODE = True
TEST_SIZE = 100

Good combinations:

  • Budget: Deepseek + Gemma
  • Quality: Llama + Mixtral
  • Geographic: Qwen + Llama + Mixtral

3. Production Run

Choose the best model from testing and run the full dataset:

TEST_MODE = False
MAX_ROWS = None  # or 20000

API Key Setup

For Deepseek & Qwen (separate keys):

echo "your-deepseek-key" > misc/credentials/deepseek_api_key.txt
echo "your-qwen-key" > misc/credentials/qwen_api_key.txt

For Llama & Gemma (same Together AI key):

echo "your-together-key" > misc/credentials/together_api_key.txt

Both Llama and Gemma use the same Together AI key!

For Mixtral:

echo "your-mistral-key" > misc/credentials/mistral_api_key.txt
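
All four key files follow the same naming pattern, so the notebook cells could share one loader. A sketch (the helper name is ours, not the notebook's):

```python
from pathlib import Path

CRED_DIR = Path("misc/credentials")

def load_api_key(provider):
    """Return the stripped key for e.g. 'deepseek', 'qwen', 'together', 'mistral'.
    Llama and Gemma both read the shared together key file."""
    path = CRED_DIR / f"{provider}_api_key.txt"
    if not path.exists():
        raise FileNotFoundError(f"Missing key file: {path}")
    return path.read_text().strip()
```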

Output Files

Each LLM saves to a separate file:

data/CSV/
β”œβ”€β”€ deepseek_annotated_POI_test.csv       # Deepseek test
β”œβ”€β”€ deepseek_annotated_POI.csv            # Deepseek full
β”œβ”€β”€ qwen_annotated_POI_test.csv           # Qwen API test
β”œβ”€β”€ qwen_annotated_POI.csv                # Qwen API full
β”œβ”€β”€ qwen_local_annotated_POI_test.csv     # Qwen Local test (NEW!)
β”œβ”€β”€ qwen_local_annotated_POI.csv          # Qwen Local full (NEW!)
β”œβ”€β”€ llama_annotated_POI_test.csv          # Llama test
β”œβ”€β”€ llama_annotated_POI.csv               # Llama full
β”œβ”€β”€ mixtral_annotated_POI_test.csv        # Mixtral test
β”œβ”€β”€ mixtral_annotated_POI.csv             # Mixtral full
β”œβ”€β”€ gemma_annotated_POI_test.csv          # Gemma test
└── gemma_annotated_POI.csv               # Gemma full

Comparing Results

After running multiple LLMs, compare results:

import pandas as pd

# Load results from different models
deepseek_df = pd.read_csv('data/CSV/deepseek_annotated_POI_test.csv')
qwen_df = pd.read_csv('data/CSV/qwen_annotated_POI_test.csv')
qwen_local_df = pd.read_csv('data/CSV/qwen_local_annotated_POI_test.csv')  # NEW!
llama_df = pd.read_csv('data/CSV/llama_annotated_POI_test.csv')
mixtral_df = pd.read_csv('data/CSV/mixtral_annotated_POI_test.csv')
gemma_df = pd.read_csv('data/CSV/gemma_annotated_POI_test.csv')

# Compare profession distributions
print("Deepseek professions:", deepseek_df['profession_llm'].value_counts().head())
print("Qwen API professions:", qwen_df['profession_llm'].value_counts().head())
print("Qwen Local professions:", qwen_local_df['profession_llm'].value_counts().head())  # NEW!
print("Llama professions:", llama_df['profession_llm'].value_counts().head())
print("Mixtral professions:", mixtral_df['profession_llm'].value_counts().head())
print("Gemma professions:", gemma_df['profession_llm'].value_counts().head())

# Compare specific cases
print("\nIrene identification:")
print("Deepseek:", deepseek_df[deepseek_df['real_name'] == 'Irene']['full_name'].values)
print("Qwen API:", qwen_df[qwen_df['real_name'] == 'Irene']['full_name'].values)
print("Qwen Local:", qwen_local_df[qwen_local_df['real_name'] == 'Irene']['full_name'].values)
print("Llama:", llama_df[llama_df['real_name'] == 'Irene']['full_name'].values)
print("Mixtral:", mixtral_df[mixtral_df['real_name'] == 'Irene']['full_name'].values)
print("Gemma:", gemma_df[gemma_df['real_name'] == 'Irene']['full_name'].values)
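
A single agreement number per model pair is often easier to scan than the raw distributions. A sketch, assuming each test CSV covers the same samples in the same row order:

```python
import pandas as pd

def agreement_rate(df_a, df_b, column="profession_llm"):
    """Fraction of rows where two annotation runs assign the same label.
    Assumes both frames cover the same samples in the same order."""
    a = df_a[column].reset_index(drop=True)
    b = df_b[column].reset_index(drop=True)
    return (a == b).mean()

# e.g. agreement_rate(deepseek_df, llama_df)
```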

Model Characteristics

Deepseek

  • βœ… Very cheap
  • βœ… Good for testing
  • ⚠️ Less documentation
  • πŸ‡¨πŸ‡³ Chinese company

Qwen (Qwen3-Max)

  • βœ… Latest version automatically used
  • βœ… Strong multilingual
  • βœ… Good Asian name recognition
  • πŸ’° Variable cost
  • πŸ‡¨πŸ‡³ Chinese company (Alibaba)

Llama 3.1 70B

  • βœ… Open-source
  • βœ… Strong overall performance
  • βœ… Well-documented
  • βœ… American (Meta)
  • πŸ’° Mid-range cost

Mixtral 8x22B

  • βœ… Open-source
  • βœ… MoE architecture (efficient)
  • βœ… European alternative
  • πŸ’° Mid-range cost
  • πŸ‡«πŸ‡· French company

Gemma 2 27B

  • βœ… Fully open-source
  • βœ… Can self-host
  • βœ… American (Google)
  • βœ… Cheap via API
  • βœ… Good quality for size

Qwen-2.5-32B Local (NEW!)

  • βœ… FREE - $0 cost (no API fees)
  • ⚠️ SLOWER - 5-10 tokens/sec on A100 (hosted APIs are faster)
  • βœ… PRIVATE - Data never leaves your machine
  • βœ… OFFLINE - Works without internet
  • βœ… HIGH QUALITY - 32B parameter model
  • βœ… Strong multilingual support
  • ⚠️ Requires: A100 80GB GPU, ~25GB VRAM, Ollama installed
  • πŸ‡¨πŸ‡³ Chinese company (Alibaba)
  • πŸ“¦ Model size: ~20GB download

Decision Matrix

If you prioritize...

FREE / Zero Cost: Use Qwen-2.5-32B Local (no API fees!)

Cost (with API): Use Deepseek or Gemma

Quality: Use Qwen-2.5-32B Local, Llama, or Mixtral

Privacy: Use Qwen-2.5-32B Local (data stays on your machine)

American/Open Source: Use Gemma or Llama

Asian Names: Use Qwen (API or Local - strong multilingual)

European Provider: Use Mixtral

Testing: Use Deepseek first, always!

Running Multiple Models

You can run all 6 models in sequence:

# 1. Run Cell 10 (Deepseek) - verify works (~$1-2 for 10k)
# 2. Run Cell 12 (Qwen API) - Chinese perspective (variable cost)
# 3. Run Cell 14 (Llama) - American perspective (~$5-10 for 10k)
# 4. Run Cell 16 (Mixtral) - European perspective (~$10-20 for 10k)
# 5. Run Cell 18 (Gemma) - Open source perspective (~$4-8 for 10k)
# 6. Run Cell 20 (Qwen-2.5-32B Local) - FREE local inference ($0!)

Each saves to its own file, so you can compare results!

Notes

  • Llama and Gemma use the same API key (Together AI)
  • All models use the same 9 profession categories
  • All models have automatic retries with exponential backoff
  • All models save progress every 10 rows
  • All models are resumable if interrupted
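
The retry-with-backoff behavior noted above can be sketched generically; the attempt count and delays here are illustrative, not the notebook's exact settings:

```python
import random
import time

def with_retries(call, max_attempts=5, base_delay=1.0):
    """Call a flaky function, doubling the delay (plus jitter) after each failure."""
    for attempt in range(max_attempts):
        try:
            return call()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            delay = base_delay * (2 ** attempt) * (1 + random.random())
            time.sleep(delay)
```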

Summary

You now have 6 LLM options to choose from:

  1. πŸ§ͺ Deepseek - Test first (cheapest API)
  2. πŸ‡¨πŸ‡³ Qwen3-Max API - Chinese, strong multilingual
  3. πŸ‡ΊπŸ‡Έ Llama 3.1 70B - American, open-source
  4. πŸ‡«πŸ‡· Mixtral 8x22B - French, open-source MoE
  5. πŸ‡ΊπŸ‡Έ Gemma 2 27B - American open-source (Google)
  6. πŸ’° Qwen-2.5-32B Local - FREE local inference (NEW!)

Each in its own cell, easy to run and compare! πŸŽ‰

Recommended workflow:

  1. Test with Deepseek (Cell 10) - verify pipeline works
  2. For small datasets (<1000): Use API (Deepseek/Gemma/Llama)
  3. For large datasets (>1000): Use Qwen-2.5-32B Local (Cell 20) - FREE!