Laura Wagner

LLM Models for Deepfake Annotation

Overview

The pipeline now includes 6 LLM options in individual cells for easy comparison:

  1. Deepseek - Testing (use first!)
  2. Qwen (API) - Chinese (Alibaba Cloud)
  3. Llama - American (Meta)
  4. Mixtral - French (Mistral AI)
  5. Gemma - American Open Source (Google)
  6. Qwen-2.5-32B Local - FREE local inference (NEW!)

The 6 LLMs

1. Deepseek (Testing)

Cell 10

  • Model: deepseek-chat
  • Provider: DeepSeek
  • API: https://platform.deepseek.com/
  • Cost: $0.14-0.28 per 1M tokens ($1-2 for 10k entries)
  • Use case: Test this first! Cheapest option to verify pipeline works
  • API Key: misc/credentials/deepseek_api_key.txt
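
DeepSeek exposes an OpenAI-compatible chat endpoint, so a smoke test needs nothing beyond the standard library. A sketch (the endpoint path and helper names are ours; the prompt is a placeholder):

```python
import json
import urllib.request
from pathlib import Path

def load_key(path="misc/credentials/deepseek_api_key.txt"):
    """Read and strip the API key stored in the credentials file."""
    return Path(path).read_text().strip()

def build_request(prompt, model="deepseek-chat"):
    """JSON payload for DeepSeek's OpenAI-compatible chat endpoint."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 16,
    }

def smoke_test(prompt="Reply with OK if you can read this."):
    """One cheap round trip to confirm the key and endpoint work."""
    req = urllib.request.Request(
        "https://api.deepseek.com/chat/completions",
        data=json.dumps(build_request(prompt)).encode(),
        headers={
            "Authorization": f"Bearer {load_key()}",
            "Content-Type": "application/json",
        },
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]
```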

2. Qwen API (Chinese)

Cells 11-12

  • Model: qwen-max (automatically uses Qwen3-Max)
  • Provider: Alibaba Cloud DashScope
  • API: https://dashscope.aliyun.com/
  • Cost: Variable (check Alibaba pricing)
  • Use case: Chinese company, strong multilingual support
  • API Key: misc/credentials/qwen_api_key.txt
  • Note: Uses latest Qwen3-Max when you specify qwen-max

3. Llama (American)

Cells 13-14

  • Model: meta-llama/Meta-Llama-3.1-70B-Instruct-Turbo
  • Provider: Together AI (hosting Meta's model)
  • Developer: Meta (American)
  • API: https://www.together.ai/
  • Cost: $0.90 per 1M tokens ($5-10 for 10k entries)
  • Use case: Open-source American model, good quality
  • API Key: misc/credentials/together_api_key.txt

4. Mixtral (French)

Cells 15-16

  • Model: open-mixtral-8x22b
  • Provider: Mistral AI
  • Developer: Mistral AI (French)
  • API: https://mistral.ai/
  • Cost: $2 per 1M tokens ($10-20 for 10k entries)
  • Use case: European alternative, Mixture-of-Experts architecture
  • API Key: misc/credentials/mistral_api_key.txt
  • Note: Using open-mixtral-8x22b (cheaper than mistral-large)

5. Gemma (American Open Source)

Cells 17-18

  • Model: google/gemma-2-27b-it
  • Provider: Together AI (hosting Google's model)
  • Developer: Google (American)
  • API: https://www.together.ai/ (same as Llama)
  • Cost: $0.80 per 1M tokens ($4-8 for 10k entries)
  • Use case: American open-source alternative, competitive quality
  • API Key: misc/credentials/together_api_key.txt (same as Llama)
  • Note: Fully open-source, can be self-hosted

6. Qwen-2.5-32B Local (FREE!)

Cells 19-20 (NEW!)

  • Model: qwen2.5:32b-instruct
  • Provider: Ollama (local inference)
  • Setup: https://ollama.com/
  • Cost: $0 (FREE - no API costs!)
  • Requirements:
    • A100 80GB GPU (or similar)
    • ~25GB VRAM during inference
    • ~20GB storage for model download
    • Ollama installed
  • Speed: 5-10 tokens/sec on A100 (~100-200 samples/hour)
  • Use case:
    • βœ… Large datasets (>1000 samples) where cost matters
    • βœ… Privacy-sensitive research data
    • βœ… Offline processing
    • βœ… Strong multilingual support
  • Setup guide: See QWEN_LOCAL_SETUP.md
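
The local Qwen-2.5-32B option talks to a running Ollama server. A minimal client sketch against Ollama's default /api/generate endpoint (assumes Ollama is installed and qwen2.5:32b-instruct has already been pulled):

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default endpoint

def build_payload(prompt, model="qwen2.5:32b-instruct"):
    """Non-streaming request body for Ollama's generate API."""
    return {"model": model, "prompt": prompt, "stream": False}

def generate(prompt):
    """Send one prompt to the local model and return its completion."""
    req = urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(build_payload(prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["response"]
```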

Cost Comparison (10,000 entries)

Model                Provider         Cost       Time         Origin
Qwen-2.5-32B Local   Ollama (local)   $0         ~50-100 hrs  πŸ‡¨πŸ‡³ Chinese
Deepseek             DeepSeek         ~$1-2      ~5-10 hrs    πŸ‡¨πŸ‡³ Chinese
Gemma 2              Together AI      ~$4-8      ~5-10 hrs    πŸ‡ΊπŸ‡Έ American (open)
Llama 3.1            Together AI      ~$5-10     ~5-10 hrs    πŸ‡ΊπŸ‡Έ American (open)
Mixtral              Mistral AI       ~$10-20    ~5-10 hrs    πŸ‡«πŸ‡· French (open)
Qwen API             Alibaba          Variable   ~5-10 hrs    πŸ‡¨πŸ‡³ Chinese

Note: Local inference is FREE but slower. Good for large datasets where cost matters more than time.
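
The table's API figures can be reproduced with a back-of-envelope estimate. The 500 tokens-per-entry figure below is an assumption, not a measured value; adjust it to your actual prompts:

```python
def estimate_cost(n_entries, tokens_per_entry, usd_per_million_tokens):
    """Rough API cost: total tokens times the per-million-token price."""
    return n_entries * tokens_per_entry / 1_000_000 * usd_per_million_tokens

# 10,000 entries at ~500 tokens each:
print(estimate_cost(10_000, 500, 0.28))  # Deepseek upper bound: ~$1.40
print(estimate_cost(10_000, 500, 2.00))  # Mixtral: ~$10
```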

Recommended Testing Order

1. Start with Deepseek

# Cell 10
TEST_MODE = True
TEST_SIZE = 10
  • Why: Cheapest, verify pipeline works
  • Cost: Pennies for 10 samples

2. Compare on Small Sample

Pick 2-3 models and run on same 100 samples:

# In each cell:
TEST_MODE = True
TEST_SIZE = 100

Good combinations:

  • Budget: Deepseek + Gemma
  • Quality: Llama + Mixtral
  • Geographic: Qwen + Llama + Mixtral

3. Production Run

Choose the best model from testing and run the full dataset:

TEST_MODE = False
MAX_ROWS = None  # or 20000

API Key Setup

For Deepseek & Qwen (separate keys):

echo "your-deepseek-key" > misc/credentials/deepseek_api_key.txt
echo "your-qwen-key" > misc/credentials/qwen_api_key.txt

For Llama & Gemma (same Together AI key):

echo "your-together-key" > misc/credentials/together_api_key.txt

Both Llama and Gemma use the same Together AI key!

For Mixtral:

echo "your-mistral-key" > misc/credentials/mistral_api_key.txt
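
All four key files follow the same naming pattern, so the notebook cells could share one loader. A sketch (the helper name is ours, not the notebook's):

```python
from pathlib import Path

CRED_DIR = Path("misc/credentials")

def load_api_key(provider):
    """Return the stripped key for e.g. 'deepseek', 'qwen', 'together', 'mistral'.
    Llama and Gemma both read the shared together key file."""
    path = CRED_DIR / f"{provider}_api_key.txt"
    if not path.exists():
        raise FileNotFoundError(f"Missing key file: {path}")
    return path.read_text().strip()
```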

Output Files

Each LLM saves to a separate file:

data/CSV/
β”œβ”€β”€ deepseek_annotated_POI_test.csv       # Deepseek test
β”œβ”€β”€ deepseek_annotated_POI.csv            # Deepseek full
β”œβ”€β”€ qwen_annotated_POI_test.csv           # Qwen API test
β”œβ”€β”€ qwen_annotated_POI.csv                # Qwen API full
β”œβ”€β”€ qwen_local_annotated_POI_test.csv     # Qwen Local test (NEW!)
β”œβ”€β”€ qwen_local_annotated_POI.csv          # Qwen Local full (NEW!)
β”œβ”€β”€ llama_annotated_POI_test.csv          # Llama test
β”œβ”€β”€ llama_annotated_POI.csv               # Llama full
β”œβ”€β”€ mixtral_annotated_POI_test.csv        # Mixtral test
β”œβ”€β”€ mixtral_annotated_POI.csv             # Mixtral full
β”œβ”€β”€ gemma_annotated_POI_test.csv          # Gemma test
└── gemma_annotated_POI.csv               # Gemma full

Comparing Results

After running multiple LLMs, compare results:

import pandas as pd

# Load results from different models
deepseek_df = pd.read_csv('data/CSV/deepseek_annotated_POI_test.csv')
qwen_df = pd.read_csv('data/CSV/qwen_annotated_POI_test.csv')
qwen_local_df = pd.read_csv('data/CSV/qwen_local_annotated_POI_test.csv')  # NEW!
llama_df = pd.read_csv('data/CSV/llama_annotated_POI_test.csv')
mixtral_df = pd.read_csv('data/CSV/mixtral_annotated_POI_test.csv')
gemma_df = pd.read_csv('data/CSV/gemma_annotated_POI_test.csv')

# Compare profession distributions
print("Deepseek professions:", deepseek_df['profession_llm'].value_counts().head())
print("Qwen API professions:", qwen_df['profession_llm'].value_counts().head())
print("Qwen Local professions:", qwen_local_df['profession_llm'].value_counts().head())  # NEW!
print("Llama professions:", llama_df['profession_llm'].value_counts().head())
print("Mixtral professions:", mixtral_df['profession_llm'].value_counts().head())
print("Gemma professions:", gemma_df['profession_llm'].value_counts().head())

# Compare specific cases
print("\nIrene identification:")
print("Deepseek:", deepseek_df[deepseek_df['real_name'] == 'Irene']['full_name'].values)
print("Qwen API:", qwen_df[qwen_df['real_name'] == 'Irene']['full_name'].values)
print("Qwen Local:", qwen_local_df[qwen_local_df['real_name'] == 'Irene']['full_name'].values)
print("Llama:", llama_df[llama_df['real_name'] == 'Irene']['full_name'].values)
print("Mixtral:", mixtral_df[mixtral_df['real_name'] == 'Irene']['full_name'].values)
print("Gemma:", gemma_df[gemma_df['real_name'] == 'Irene']['full_name'].values)
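
A single agreement number per model pair is often easier to scan than the raw distributions. A sketch, assuming each test CSV covers the same samples in the same row order:

```python
import pandas as pd

def agreement_rate(df_a, df_b, column="profession_llm"):
    """Fraction of rows where two annotation runs assign the same label.
    Assumes both frames cover the same samples in the same order."""
    a = df_a[column].reset_index(drop=True)
    b = df_b[column].reset_index(drop=True)
    return (a == b).mean()

# e.g. agreement_rate(deepseek_df, llama_df)
```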

Model Characteristics

Deepseek

  • βœ… Very cheap
  • βœ… Good for testing
  • ⚠️ Less documentation
  • πŸ‡¨πŸ‡³ Chinese company

Qwen (Qwen3-Max)

  • βœ… Latest version automatically used
  • βœ… Strong multilingual
  • βœ… Good Asian name recognition
  • πŸ’° Variable cost
  • πŸ‡¨πŸ‡³ Chinese company (Alibaba)

Llama 3.1 70B

  • βœ… Open-source
  • βœ… Strong overall performance
  • βœ… Well-documented
  • βœ… American (Meta)
  • πŸ’° Mid-range cost

Mixtral 8x22B

  • βœ… Open-source
  • βœ… MoE architecture (efficient)
  • βœ… European alternative
  • πŸ’° Mid-range cost
  • πŸ‡«πŸ‡· French company

Gemma 2 27B

  • βœ… Fully open-source
  • βœ… Can self-host
  • βœ… American (Google)
  • βœ… Cheap via API
  • βœ… Good quality for size

Qwen-2.5-32B Local (NEW!)

  • βœ… FREE - $0 cost (no API fees)
  • ⚠️ SLOWER - 5-10 tokens/sec on A100 (hosted APIs are faster)
  • βœ… PRIVATE - Data never leaves your machine
  • βœ… OFFLINE - Works without internet
  • βœ… HIGH QUALITY - 32B parameter model
  • βœ… Strong multilingual support
  • ⚠️ Requires: A100 80GB GPU, ~25GB VRAM, Ollama installed
  • πŸ‡¨πŸ‡³ Chinese company (Alibaba)
  • πŸ“¦ Model size: ~20GB download

Decision Matrix

If you prioritize...

FREE / Zero Cost: Use Qwen-2.5-32B Local (no API fees!)

Cost (with API): Use Deepseek or Gemma

Quality: Use Qwen-2.5-32B Local, Llama, or Mixtral

Privacy: Use Qwen-2.5-32B Local (data stays on your machine)

American/Open Source: Use Gemma or Llama

Asian Names: Use Qwen (API or Local - strong multilingual)

European Provider: Use Mixtral

Testing: Use Deepseek first, always!

Running Multiple Models

You can run all 6 models in sequence:

# 1. Run Cell 10 (Deepseek) - verify works (~$1-2 for 10k)
# 2. Run Cell 12 (Qwen API) - Chinese perspective (variable cost)
# 3. Run Cell 14 (Llama) - American perspective (~$5-10 for 10k)
# 4. Run Cell 16 (Mixtral) - European perspective (~$10-20 for 10k)
# 5. Run Cell 18 (Gemma) - Open source perspective (~$4-8 for 10k)
# 6. Run Cell 20 (Qwen-2.5-32B Local) - FREE local inference ($0!)

Each saves to its own file, so you can compare results!

Notes

  • Llama and Gemma use the same API key (Together AI)
  • All models use the same 9 profession categories
  • All models have automatic retries with exponential backoff
  • All models save progress every 10 rows
  • All models are resumable if interrupted
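
The retry-with-backoff behavior noted above can be sketched generically; the attempt count and delays here are illustrative, not the notebook's exact settings:

```python
import random
import time

def with_retries(call, max_attempts=5, base_delay=1.0):
    """Call a flaky function, doubling the delay (plus jitter) after each failure."""
    for attempt in range(max_attempts):
        try:
            return call()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            delay = base_delay * (2 ** attempt) * (1 + random.random())
            time.sleep(delay)
```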

Summary

You now have 6 LLM options to choose from:

  1. πŸ§ͺ Deepseek - Test first (cheapest API)
  2. πŸ‡¨πŸ‡³ Qwen3-Max API - Chinese, strong multilingual
  3. πŸ‡ΊπŸ‡Έ Llama 3.1 70B - American, open-source
  4. πŸ‡«πŸ‡· Mixtral 8x22B - French, open-source MoE
  5. πŸ‡ΊπŸ‡Έ Gemma 2 27B - American open-source (Google)
  6. πŸ’° Qwen-2.5-32B Local - FREE local inference (NEW!)

Each in its own cell, easy to run and compare! πŸŽ‰

Recommended workflow:

  1. Test with Deepseek (Cell 10) - verify pipeline works
  2. For small datasets (<1000): Use API (Deepseek/Gemma/Llama)
  3. For large datasets (>1000): Use Qwen-2.5-32B Local (Cell 20) - FREE!