# LLM Models for Deepfake Annotation

## Overview

The pipeline now includes **6 LLM options** in individual cells for easy comparison:

1. **Deepseek** - Testing (use first!)
2. **Qwen (API)** - Chinese (Alibaba Cloud)
3. **Llama** - American (Meta)
4. **Mixtral** - French (Mistral AI)
5. **Gemma** - American Open Source (Google)
6. **Qwen-2.5-32B Local** - FREE local inference (NEW!)

## The 6 LLMs

### 1. Deepseek (Testing)
**Cell 10**
- **Model**: deepseek-chat
- **Provider**: DeepSeek
- **API**: https://platform.deepseek.com/
- **Cost**: ~$0.14-0.28 per 1M tokens (~$1-2 for 10k entries)
- **Use case**: **Test this first!** Cheapest option to verify pipeline works
- **API Key**: `misc/credentials/deepseek_api_key.txt`

---

### 2. Qwen API (Chinese)
**Cells 11-12**
- **Model**: qwen-max (automatically uses Qwen3-Max)
- **Provider**: Alibaba Cloud DashScope
- **API**: https://dashscope.aliyun.com/
- **Cost**: Variable (check Alibaba pricing)
- **Use case**: Chinese company, strong multilingual support
- **API Key**: `misc/credentials/qwen_api_key.txt`
- **Note**: Uses latest Qwen3-Max when you specify `qwen-max`

---

### 3. Llama (American)
**Cells 13-14**
- **Model**: meta-llama/Meta-Llama-3.1-70B-Instruct-Turbo
- **Provider**: Together AI (hosting Meta's model)
- **Developer**: Meta (American)
- **API**: https://www.together.ai/
- **Cost**: ~$0.90 per 1M tokens (~$5-10 for 10k entries)
- **Use case**: Open-source American model, good quality
- **API Key**: `misc/credentials/together_api_key.txt`

---

### 4. Mixtral (French)
**Cells 15-16**
- **Model**: open-mixtral-8x22b
- **Provider**: Mistral AI
- **Developer**: Mistral AI (French)
- **API**: https://mistral.ai/
- **Cost**: ~$2 per 1M tokens (~$10-20 for 10k entries)
- **Use case**: European alternative, Mixture-of-Experts architecture
- **API Key**: `misc/credentials/mistral_api_key.txt`
- **Note**: Using open-mixtral-8x22b (cheaper than mistral-large)

---

### 5. Gemma (American Open Source)
**Cells 17-18**
- **Model**: google/gemma-2-27b-it
- **Provider**: Together AI (hosting Google's model)
- **Developer**: Google (American)
- **API**: https://www.together.ai/ (same as Llama)
- **Cost**: ~$0.80 per 1M tokens (~$4-8 for 10k entries)
- **Use case**: American open-source alternative, competitive quality
- **API Key**: `misc/credentials/together_api_key.txt` (same as Llama)
- **Note**: Fully open-source, can be self-hosted

---

### 6. Qwen-2.5-32B Local (FREE!)
**Cells 19-20** (NEW!)
- **Model**: qwen2.5:32b-instruct
- **Provider**: Ollama (local inference)
- **Setup**: https://ollama.com/
- **Cost**: **$0** (FREE - no API costs!)
- **Requirements**:
  - A100 80GB GPU (or similar)
  - ~25GB VRAM during inference
  - ~20GB storage for model download
  - Ollama installed
- **Speed**: 5-10 tokens/sec on A100 (~100-200 samples/hour)
- **Use case**:
  - ✅ Large datasets (>1000 samples) where cost matters
  - ✅ Privacy-sensitive research data
  - ✅ Offline processing
  - ✅ Strong multilingual support
- **Setup guide**: See `QWEN_LOCAL_SETUP.md`

---

## Cost Comparison (10,000 entries)

| Model | Provider | Cost | Time | Origin |
|-------|----------|------|------|--------|
| **Qwen-2.5-32B Local** | Ollama (local) | **$0** | ~50-100 hrs | 🇨🇳 Chinese |
| **Deepseek** | DeepSeek | ~$1-2 | ~5-10 hrs | 🇨🇳 Chinese |
| **Gemma 2** | Together AI | ~$4-8 | ~5-10 hrs | 🇺🇸 American (open) |
| **Llama 3.1** | Together AI | ~$5-10 | ~5-10 hrs | 🇺🇸 American (open) |
| **Mixtral** | Mistral AI | ~$10-20 | ~5-10 hrs | 🇫🇷 French (open) |
| **Qwen API** | Alibaba | Variable | ~5-10 hrs | 🇨🇳 Chinese |

**Note**: Local inference is FREE but slower. Good for large datasets where cost matters more than time.
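
The figures in the cost table follow from simple arithmetic, sketched below. The helper names `estimate_cost` and `estimate_hours` are hypothetical, and the 700-tokens-per-entry default is an assumption back-derived from the quoted prices, not a measured value. The per-token prices and the local throughput are the ones listed above.

```python
def estimate_cost(n_entries, price_per_million, tokens_per_entry=700):
    """Approximate API cost in USD.

    tokens_per_entry=700 is an assumed average (prompt + completion),
    chosen because it reproduces the ranges quoted in the table.
    """
    return n_entries * tokens_per_entry * price_per_million / 1_000_000


def estimate_hours(n_entries, samples_per_hour):
    """Approximate wall-clock time for a given throughput."""
    return n_entries / samples_per_hour


# USD per 1M tokens, as quoted in the model sections
PRICES = {
    "Deepseek (low)": 0.14,
    "Deepseek (high)": 0.28,
    "Gemma 2": 0.80,
    "Llama 3.1": 0.90,
    "Mixtral": 2.00,
}

if __name__ == "__main__":
    for name, price in PRICES.items():
        print(f"{name}: ~${estimate_cost(10_000, price):.2f} for 10k entries")
    # Local Qwen at 100-200 samples/hour
    print(f"Local: {estimate_hours(10_000, 200):.0f}-"
          f"{estimate_hours(10_000, 100):.0f} hrs for 10k entries")
```

If your prompts are longer or shorter than the assumed 700 tokens, pass your own `tokens_per_entry`; the cost scales linearly with it.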
## Recommended Testing Order

### 1. Start with Deepseek

```python
# Cell 10
TEST_MODE = True
TEST_SIZE = 10
```

- **Why**: Cheapest, verify pipeline works
- **Cost**: Pennies for 10 samples

### 2. Compare on Small Sample

Pick 2-3 models and run them on the same 100 samples:

```python
# In each cell:
TEST_MODE = True
TEST_SIZE = 100
```

**Good combinations:**
- Budget: Deepseek + Gemma
- Quality: Llama + Mixtral
- Geographic: Qwen + Llama + Mixtral

### 3. Production Run

Choose the best model from testing and run the full dataset:

```python
TEST_MODE = False
MAX_ROWS = None  # or 20000
```

## API Key Setup

### For Deepseek & Qwen (separate keys):

```bash
echo "your-deepseek-key" > misc/credentials/deepseek_api_key.txt
echo "your-qwen-key" > misc/credentials/qwen_api_key.txt
```

### For Llama & Gemma (same Together AI key):

```bash
echo "your-together-key" > misc/credentials/together_api_key.txt
```

Both Llama and Gemma use the same Together AI key!

### For Mixtral:

```bash
echo "your-mistral-key" > misc/credentials/mistral_api_key.txt
```

## Output Files

Each LLM saves to a separate file:

```
data/CSV/
├── deepseek_annotated_POI_test.csv      # Deepseek test
├── deepseek_annotated_POI.csv           # Deepseek full
├── qwen_annotated_POI_test.csv          # Qwen API test
├── qwen_annotated_POI.csv               # Qwen API full
├── qwen_local_annotated_POI_test.csv    # Qwen Local test (NEW!)
├── qwen_local_annotated_POI.csv         # Qwen Local full (NEW!)
├── llama_annotated_POI_test.csv         # Llama test
├── llama_annotated_POI.csv              # Llama full
├── mixtral_annotated_POI_test.csv       # Mixtral test
├── mixtral_annotated_POI.csv            # Mixtral full
├── gemma_annotated_POI_test.csv         # Gemma test
└── gemma_annotated_POI.csv              # Gemma full
```

## Comparing Results

After running multiple LLMs, compare results:

```python
import pandas as pd

# Test-run outputs, one CSV per model
RESULT_FILES = {
    "Deepseek": "data/CSV/deepseek_annotated_POI_test.csv",
    "Qwen API": "data/CSV/qwen_annotated_POI_test.csv",
    "Qwen Local": "data/CSV/qwen_local_annotated_POI_test.csv",  # NEW!
    "Llama": "data/CSV/llama_annotated_POI_test.csv",
    "Mixtral": "data/CSV/mixtral_annotated_POI_test.csv",
    "Gemma": "data/CSV/gemma_annotated_POI_test.csv",
}

# Load results from different models
results = {name: pd.read_csv(path) for name, path in RESULT_FILES.items()}

# Compare profession distributions
for name, df in results.items():
    print(f"{name} professions:")
    print(df["profession_llm"].value_counts().head())

# Compare specific cases
print("\nIrene identification:")
for name, df in results.items():
    print(f"{name}:", df.loc[df["real_name"] == "Irene", "full_name"].values)
```

## Model Characteristics

### Deepseek
- ✅ Very cheap
- ✅ Good for testing
- ⚠️ Less documentation
- 🇨🇳 Chinese company

### Qwen (Qwen3-Max)
- ✅ Latest version automatically used
- ✅ Strong multilingual
- ✅ Good Asian name recognition
- 💰 Variable cost
- 🇨🇳 Chinese company (Alibaba)

### Llama 3.1 70B
- ✅ Open-source
- ✅ Strong overall performance
- ✅ Well-documented
- ✅ American (Meta)
- 💰 Mid-range cost

### Mixtral 8x22B
- ✅ Open-source
- ✅ MoE architecture (efficient)
- ✅ European alternative
- 💰 Mid-range cost
- 🇫🇷 French company

### Gemma 2 27B
- ✅ Fully open-source
- ✅ Can self-host
- ✅ American (Google)
- ✅ Cheap via API
- ✅ Good quality for size

### Qwen-2.5-32B Local (NEW!)
- ✅ **FREE** - $0 cost (no API fees)
- ✅ **PRIVATE** - Data never leaves your machine
- ✅ **OFFLINE** - Works without internet
- ✅ **HIGH QUALITY** - 32B parameter model
- ✅ Strong multilingual support
- ⚠️ **SPEED** - 5-10 tokens/sec on A100 (~100-200 samples/hour), slower than the API options
- ⚠️ Requires: A100 80GB GPU (or similar), ~25GB VRAM, Ollama installed
- 🇨🇳 Chinese company (Alibaba)
- 📦 Model size: ~20GB download

## Decision Matrix

### If you prioritize...

**FREE / Zero Cost**: Use **Qwen-2.5-32B Local** (no API fees!)

**Cost** (with API): Use **Deepseek** or **Gemma**

**Quality**: Use **Qwen-2.5-32B Local**, **Llama**, or **Mixtral**

**Privacy**: Use **Qwen-2.5-32B Local** (data stays on your machine)

**American/Open Source**: Use **Gemma** or **Llama**

**Asian Names**: Use **Qwen** (API or Local - strong multilingual)

**European Provider**: Use **Mixtral**

**Testing**: Use **Deepseek** first, always!

## Running Multiple Models

You can run all 6 models in sequence:

```python
# 1. Run Cell 10 (Deepseek) - verify works (~$1-2 for 10k)
# 2. Run Cell 12 (Qwen API) - Chinese perspective (variable cost)
# 3. Run Cell 14 (Llama) - American perspective (~$5-10 for 10k)
# 4. Run Cell 16 (Mixtral) - European perspective (~$10-20 for 10k)
# 5. Run Cell 18 (Gemma) - Open source perspective (~$4-8 for 10k)
# 6. Run Cell 20 (Qwen-2.5-32B Local) - FREE local inference ($0!)
```

Each saves to its own file, so you can compare results!

## Notes

- **Llama and Gemma use the same API key** (Together AI)
- All models use the **same 9 profession categories**
- All models have **automatic retries** with exponential backoff
- All models **save progress** every 10 rows
- All models are **resumable** if interrupted

## Summary

You now have **6 LLM options** to choose from:

1. 🧪 **Deepseek** - Test first (cheapest API)
2. 🇨🇳 **Qwen3-Max API** - Chinese, strong multilingual
3. 🇺🇸 **Llama 3.1 70B** - American, open-source
4. 🇫🇷 **Mixtral 8x22B** - French, open-source MoE
5. 🇺🇸 **Gemma 2 27B** - American open-source (Google)
6. 💰 **Qwen-2.5-32B Local** - FREE local inference (NEW!)

Each in its own cell, easy to run and compare! 🎉

**Recommended workflow**:
1. Test with Deepseek (Cell 10) - verify pipeline works
2. For small datasets (<1000): Use API (Deepseek/Gemma/Llama)
3. For large datasets (>1000): Use Qwen-2.5-32B Local (Cell 20) - FREE!
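
## Appendix: Retry and Resume Sketch

The retry, progress-saving, and resume behavior listed under Notes can be sketched roughly as follows. This is a minimal illustration and not the notebook's actual code: `call_with_backoff`, `annotate_resumable`, `annotate_fn`, and the `annotation` column are hypothetical names, and the retry count and base delay are assumed values. Only the "save every 10 rows" interval comes from the notes. The sketch also assumes every input row has the same keys.

```python
import csv
import os
import time


def call_with_backoff(fn, *args, max_retries=5, base_delay=1.0, sleep=time.sleep):
    """Call fn(*args); on failure wait 1s, 2s, 4s, ... and try again."""
    for attempt in range(max_retries):
        try:
            return fn(*args)
        except Exception:
            if attempt == max_retries - 1:
                raise  # out of retries: surface the last error
            sleep(base_delay * 2 ** attempt)


def _write_csv(rows, path):
    """Overwrite path with all rows annotated so far."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=list(rows[0].keys()))
        writer.writeheader()
        writer.writerows(rows)


def annotate_resumable(rows, annotate_fn, out_path, save_every=10, sleep=time.sleep):
    """Annotate a list of dict rows, checkpointing so a rerun resumes."""
    done = []
    if os.path.exists(out_path):
        # Resume: rows already in the checkpoint file are not re-sent.
        with open(out_path, newline="", encoding="utf-8") as f:
            done = list(csv.DictReader(f))
    for i in range(len(done), len(rows)):
        row = dict(rows[i])
        row["annotation"] = call_with_backoff(annotate_fn, row, sleep=sleep)
        done.append(row)
        # Save every `save_every` rows, and once more at the very end.
        if len(done) % save_every == 0 or i == len(rows) - 1:
            _write_csv(done, out_path)
    return done
```

With pandas, `rows` can be obtained via `df.to_dict("records")`. If a run is interrupted between checkpoints, at most `save_every - 1` rows are re-annotated on the next run.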