| # LLM Models for Deepfake Annotation | |
| ## Overview | |
| The pipeline now includes **6 LLM options** in individual cells for easy comparison: | |
| 1. **Deepseek** - Testing (use first!) | |
| 2. **Qwen (API)** - Chinese (Alibaba Cloud) | |
| 3. **Llama** - American (Meta) | |
| 4. **Mixtral** - French (Mistral AI) | |
| 5. **Gemma** - American Open Source (Google) | |
| 6. **Qwen-2.5-32B Local** - FREE local inference (NEW!) | |
| ## The 6 LLMs | |
| ### 1. Deepseek (Testing) | |
| **Cell 10** | |
| - **Model**: deepseek-chat | |
| - **Provider**: DeepSeek | |
| - **API**: https://platform.deepseek.com/ | |
| - **Cost**: ~$0.14-0.28 per 1M tokens (~$1-2 for 10k entries) | |
| - **Use case**: **Test this first!** Cheapest option to verify pipeline works | |
| - **API Key**: `misc/credentials/deepseek_api_key.txt` | |
| --- | |
| ### 2. Qwen API (Chinese) | |
| **Cells 11-12** | |
| - **Model**: qwen-max (automatically uses Qwen3-Max) | |
| - **Provider**: Alibaba Cloud DashScope | |
| - **API**: https://dashscope.aliyun.com/ | |
| - **Cost**: Variable (check Alibaba pricing) | |
| - **Use case**: Chinese company, strong multilingual support | |
| - **API Key**: `misc/credentials/qwen_api_key.txt` | |
| - **Note**: Uses latest Qwen3-Max when you specify `qwen-max` | |
| --- | |
### 3. Llama (American)
**Cells 13-14**
- **Model**: meta-llama/Meta-Llama-3.1-70B-Instruct-Turbo
- **Provider**: Together AI (hosting Meta's model)
- **Developer**: Meta (American)
- **API**: https://www.together.ai/
- **Cost**: ~$0.90 per 1M tokens (~$5-10 for 10k entries)
- **Use case**: Open-source American model, good quality
- **API Key**: `misc/credentials/together_api_key.txt`
---
### 4. Mixtral (French)
**Cells 15-16**
- **Model**: open-mixtral-8x22b
- **Provider**: Mistral AI
- **Developer**: Mistral AI (French)
- **API**: https://mistral.ai/
- **Cost**: ~$2 per 1M tokens (~$10-20 for 10k entries)
- **Use case**: European alternative, Mixture-of-Experts architecture
- **API Key**: `misc/credentials/mistral_api_key.txt`
- **Note**: Uses open-mixtral-8x22b (cheaper than mistral-large)
---
### 5. Gemma (American Open Source)
**Cells 17-18**
- **Model**: google/gemma-2-27b-it
- **Provider**: Together AI (hosting Google's model)
- **Developer**: Google (American)
- **API**: https://www.together.ai/ (same as Llama)
- **Cost**: ~$0.80 per 1M tokens (~$4-8 for 10k entries)
- **Use case**: American open-source alternative, competitive quality
- **API Key**: `misc/credentials/together_api_key.txt` (same as Llama)
- **Note**: Fully open-source, can be self-hosted
---
### 6. Qwen-2.5-32B Local (FREE!)
**Cells 19-20** (NEW!)
- **Model**: qwen2.5:32b-instruct
- **Provider**: Ollama (local inference)
- **Setup**: https://ollama.com/
- **Cost**: **$0** (FREE - no API costs!)
- **Requirements**:
  - A100 80GB GPU (or similar)
  - ~25GB VRAM during inference
  - ~20GB storage for model download
  - Ollama installed
- **Speed**: 5-10 tokens/sec on A100 (~100-200 samples/hour)
- **Use case**:
  - ✅ Large datasets (>1000 samples) where cost matters
  - ✅ Privacy-sensitive research data
  - ✅ Offline processing
  - ✅ Strong multilingual support
- **Setup guide**: See `QWEN_LOCAL_SETUP.md`
| --- | |
| ## Cost Comparison (10,000 entries) | |
| | Model | Provider | Cost | Time | Origin | | |
| |-------|----------|------|------|--------| | |
| **Qwen-2.5-32B Local** | Ollama (local) | **$0** | ~50-100 hrs | 🇨🇳 Chinese |
| **Deepseek** | DeepSeek | ~$1-2 | ~5-10 hrs | 🇨🇳 Chinese |
| **Gemma 2** | Together AI | ~$4-8 | ~5-10 hrs | 🇺🇸 American (open) |
| **Llama 3.1** | Together AI | ~$5-10 | ~5-10 hrs | 🇺🇸 American (open) |
| **Mixtral** | Mistral AI | ~$10-20 | ~5-10 hrs | 🇫🇷 French (open) |
| **Qwen API** | Alibaba | Variable | ~5-10 hrs | 🇨🇳 Chinese |
| **Note**: Local inference is FREE but slower. Good for large datasets where cost matters more than time. | |
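The per-10k figures in the table follow from tokens-per-entry times price-per-token. As a rough sanity check (the ~500 tokens per entry below is an assumption — measure your own prompt + completion sizes):

```python
# Rough cost estimator for the API-based models above.
# ASSUMPTION: ~500 tokens per entry (prompt + completion) -- adjust to your prompts.
PRICE_PER_1M = {          # approximate USD per 1M tokens, from the table above
    "deepseek": 0.21,     # midpoint of $0.14-0.28
    "gemma": 0.80,
    "llama": 0.90,
    "mixtral": 2.00,
}

def estimate_cost(model: str, n_entries: int, tokens_per_entry: int = 500) -> float:
    """Return an approximate USD cost for annotating n_entries."""
    total_tokens = n_entries * tokens_per_entry
    return total_tokens / 1_000_000 * PRICE_PER_1M[model]

for m in PRICE_PER_1M:
    print(f"{m}: ~${estimate_cost(m, 10_000):.2f} for 10k entries")
```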
| ## Recommended Testing Order | |
| ### 1. Start with Deepseek | |
| ```python | |
| # Cell 10 | |
| TEST_MODE = True | |
| TEST_SIZE = 10 | |
| ``` | |
- **Why**: Cheapest option; confirms the pipeline works end-to-end
| - **Cost**: Pennies for 10 samples | |
| ### 2. Compare on Small Sample | |
| Pick 2-3 models and run on same 100 samples: | |
| ```python | |
| # In each cell: | |
| TEST_MODE = True | |
| TEST_SIZE = 100 | |
| ``` | |
| **Good combinations:** | |
| - Budget: Deepseek + Gemma | |
| - Quality: Llama + Mixtral | |
| - Geographic: Qwen + Llama + Mixtral | |
| ### 3. Production Run | |
Choose the best model from testing and run the full dataset:
| ```python | |
| TEST_MODE = False | |
| MAX_ROWS = None # or 20000 | |
| ``` | |
| ## API Key Setup | |
| ### For Deepseek & Qwen (separate keys): | |
| ```bash | |
| echo "your-deepseek-key" > misc/credentials/deepseek_api_key.txt | |
| echo "your-qwen-key" > misc/credentials/qwen_api_key.txt | |
| ``` | |
| ### For Llama & Gemma (same Together AI key): | |
| ```bash | |
| echo "your-together-key" > misc/credentials/together_api_key.txt | |
| ``` | |
| Both Llama and Gemma use the same Together AI key! | |
| ### For Mixtral: | |
| ```bash | |
| echo "your-mistral-key" > misc/credentials/mistral_api_key.txt | |
| ``` | |
| ## Output Files | |
| Each LLM saves to a separate file: | |
| ``` | |
| data/CSV/ | |
| βββ deepseek_annotated_POI_test.csv # Deepseek test | |
| βββ deepseek_annotated_POI.csv # Deepseek full | |
| βββ qwen_annotated_POI_test.csv # Qwen API test | |
| βββ qwen_annotated_POI.csv # Qwen API full | |
| βββ qwen_local_annotated_POI_test.csv # Qwen Local test (NEW!) | |
| βββ qwen_local_annotated_POI.csv # Qwen Local full (NEW!) | |
| βββ llama_annotated_POI_test.csv # Llama test | |
| βββ llama_annotated_POI.csv # Llama full | |
| βββ mixtral_annotated_POI_test.csv # Mixtral test | |
| βββ mixtral_annotated_POI.csv # Mixtral full | |
| βββ gemma_annotated_POI_test.csv # Gemma test | |
| βββ gemma_annotated_POI.csv # Gemma full | |
| ``` | |
| ## Comparing Results | |
| After running multiple LLMs, compare results: | |
| ```python | |
| import pandas as pd | |
| # Load results from different models | |
| deepseek_df = pd.read_csv('data/CSV/deepseek_annotated_POI_test.csv') | |
| qwen_df = pd.read_csv('data/CSV/qwen_annotated_POI_test.csv') | |
| qwen_local_df = pd.read_csv('data/CSV/qwen_local_annotated_POI_test.csv') # NEW! | |
| llama_df = pd.read_csv('data/CSV/llama_annotated_POI_test.csv') | |
| mixtral_df = pd.read_csv('data/CSV/mixtral_annotated_POI_test.csv') | |
| gemma_df = pd.read_csv('data/CSV/gemma_annotated_POI_test.csv') | |
| # Compare profession distributions | |
| print("Deepseek professions:", deepseek_df['profession_llm'].value_counts().head()) | |
| print("Qwen API professions:", qwen_df['profession_llm'].value_counts().head()) | |
| print("Qwen Local professions:", qwen_local_df['profession_llm'].value_counts().head()) # NEW! | |
| print("Llama professions:", llama_df['profession_llm'].value_counts().head()) | |
| print("Mixtral professions:", mixtral_df['profession_llm'].value_counts().head()) | |
| print("Gemma professions:", gemma_df['profession_llm'].value_counts().head()) | |
| # Compare specific cases | |
| print("\nIrene identification:") | |
| print("Deepseek:", deepseek_df[deepseek_df['real_name'] == 'Irene']['full_name'].values) | |
| print("Qwen API:", qwen_df[qwen_df['real_name'] == 'Irene']['full_name'].values) | |
| print("Qwen Local:", qwen_local_df[qwen_local_df['real_name'] == 'Irene']['full_name'].values) | |
| print("Llama:", llama_df[llama_df['real_name'] == 'Irene']['full_name'].values) | |
| print("Mixtral:", mixtral_df[mixtral_df['real_name'] == 'Irene']['full_name'].values) | |
| print("Gemma:", gemma_df[gemma_df['real_name'] == 'Irene']['full_name'].values) | |
| ``` | |
| ## Model Characteristics | |
### Deepseek
- ✅ Very cheap
- ✅ Good for testing
- ⚠️ Less documentation
- 🇨🇳 Chinese company
### Qwen (Qwen3-Max)
- ✅ Latest version automatically used
- ✅ Strong multilingual
- ✅ Good Asian name recognition
- 💰 Variable cost
- 🇨🇳 Chinese company (Alibaba)
### Llama 3.1 70B
- ✅ Open-source
- ✅ Strong overall performance
- ✅ Well-documented
- ✅ American (Meta)
- 💰 Mid-range cost
### Mixtral 8x22B
- ✅ Open-source
- ✅ MoE architecture (efficient)
- ✅ European alternative
- 💰 Mid-range cost
- 🇫🇷 French company
### Gemma 2 27B
- ✅ Fully open-source
- ✅ Can self-host
- ✅ American (Google)
- ✅ Cheap via API
- ✅ Good quality for size
### Qwen-2.5-32B Local (NEW!)
- ✅ **FREE** - $0 cost (no API fees)
- ✅ **PRIVATE** - Data never leaves your machine
- ✅ **OFFLINE** - Works without internet
- ✅ **HIGH QUALITY** - 32B parameter model
- ✅ Strong multilingual support
- ⚠️ Slower than the APIs (~5-10 tokens/sec on A100)
- ⚠️ Requires: A100 80GB GPU, ~25GB VRAM, Ollama installed
- 🇨🇳 Chinese company (Alibaba)
- 📦 Model size: ~20GB download
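The local option talks to Ollama's standard REST API rather than a cloud endpoint. A minimal sketch of a single annotation call against the default `/api/generate` endpoint (the `annotate` helper name is illustrative; the notebook's actual cell may differ — see `QWEN_LOCAL_SETUP.md`):

```python
import json
from urllib import request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint
MODEL = "qwen2.5:32b-instruct"

def build_request(prompt: str) -> dict:
    """Payload for Ollama's /api/generate endpoint (non-streaming)."""
    return {"model": MODEL, "prompt": prompt, "stream": False}

def annotate(prompt: str) -> str:
    """Send one prompt to the local model and return its raw text response."""
    data = json.dumps(build_request(prompt)).encode()
    req = request.Request(
        OLLAMA_URL, data=data, headers={"Content-Type": "application/json"}
    )
    with request.urlopen(req) as resp:
        return json.load(resp)["response"]
```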
| ## Decision Matrix | |
| ### If you prioritize... | |
| **FREE / Zero Cost**: Use **Qwen-2.5-32B Local** (no API fees!) | |
| **Cost** (with API): Use **Deepseek** or **Gemma** | |
| **Quality**: Use **Qwen-2.5-32B Local**, **Llama**, or **Mixtral** | |
| **Privacy**: Use **Qwen-2.5-32B Local** (data stays on your machine) | |
| **American/Open Source**: Use **Gemma** or **Llama** | |
| **Asian Names**: Use **Qwen** (API or Local - strong multilingual) | |
| **European Provider**: Use **Mixtral** | |
| **Testing**: Use **Deepseek** first, always! | |
| ## Running Multiple Models | |
| You can run all 6 models in sequence: | |
| ```python | |
| # 1. Run Cell 10 (Deepseek) - verify works (~$1-2 for 10k) | |
| # 2. Run Cell 12 (Qwen API) - Chinese perspective (~variable cost) | |
| # 3. Run Cell 14 (Llama) - American perspective (~$5-10 for 10k) | |
| # 4. Run Cell 16 (Mixtral) - European perspective (~$10-20 for 10k) | |
| # 5. Run Cell 18 (Gemma) - Open source perspective (~$4-8 for 10k) | |
| # 6. Run Cell 20 (Qwen-2.5-32B Local) - FREE local inference ($0!) | |
| ``` | |
| Each saves to its own file, so you can compare results! | |
| ## Notes | |
| - **Llama and Gemma use the same API key** (Together AI) | |
| - All models use the **same 9 profession categories** | |
| - All models have **automatic retries** with exponential backoff | |
| - All models **save progress** every 10 rows | |
| - All models are **resumable** if interrupted | |
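The retry behaviour described above can be sketched as exponential backoff with jitter (`with_retries` is an illustrative helper, not the notebook's actual implementation):

```python
import random
import time

def with_retries(fn, max_attempts=5, base_delay=1.0, max_delay=60.0):
    """Call fn(), retrying on any exception with exponential backoff + jitter.

    The delay doubles on each failed attempt, capped at max_delay; the final
    failure is re-raised so the caller sees the underlying error.
    """
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise
            delay = min(base_delay * 2 ** attempt, max_delay)
            time.sleep(delay + random.uniform(0, 0.1 * delay))
```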
| ## Summary | |
| You now have **6 LLM options** to choose from: | |
1. 🧪 **Deepseek** - Test first (cheapest API)
2. 🇨🇳 **Qwen3-Max API** - Chinese, strong multilingual
3. 🇺🇸 **Llama 3.1 70B** - American, open-source
4. 🇫🇷 **Mixtral 8x22B** - French, open-source MoE
5. 🇺🇸 **Gemma 2 27B** - American open-source (Google)
6. 💰 **Qwen-2.5-32B Local** - FREE local inference (NEW!)
Each in its own cell, easy to run and compare! 🎉
| **Recommended workflow**: | |
| 1. Test with Deepseek (Cell 10) - verify pipeline works | |
| 2. For small datasets (<1000): Use API (Deepseek/Gemma/Llama) | |
| 3. For large datasets (>1000): Use Qwen-2.5-32B Local (Cell 20) - FREE! | |