# Quick Testing Instructions

## Start Here! 🚀

You mentioned you have Deepseek credits, so **start by testing with Deepseek first** before trying the other LLMs.

## Step-by-Step Testing

### 1. Make sure your Deepseek API key is in place

Check if this file exists:

```bash
cat misc/credentials/deepseek_api_key.txt
```

If not, create it:

```bash
echo "your-deepseek-api-key" > misc/credentials/deepseek_api_key.txt
```

### 2. Open the notebook

```bash
jupyter notebook jupyter_notebooks/Section_2-3-4_Figure_8_deepfake_adapters.ipynb
```

### 3. Run the cells in order

1. **Cells 0-4**: Introduction and setup (just markdown, no execution needed)
2. **Cell 5**: NER & Name Cleaning (processes `real_person_adapters.csv`)
3. **Cell 7**: Country/Nationality Mapping
4. **Cell 10**: 🌟 **DEEPSEEK ANNOTATION** (TEST THIS FIRST!)
   - Default: `TEST_MODE = True` (10 samples)
   - Will create: `data/CSV/deepseek_annotated_POI_test.csv`
5. **Cell 12**: Qwen/Llama/Mistral (run later, after Deepseek works)

### 4. Review Deepseek Results

After Cell 10 completes, check:

- Console output shows summary statistics
- Output file: `data/CSV/deepseek_annotated_POI_test.csv`

Example output should look like:

```
✅ Progress saved after 10 rows
✅ Done! Final results saved to data/CSV/deepseek_annotated_POI_test.csv

=== Summary Statistics ===
Total processed: 10

Gender distribution:
Female    8
Male      2
...
```

### 5. If Deepseek Works Well

Once you're satisfied with the Deepseek results:

**Option A: Process the full dataset with Deepseek**

```python
# In Cell 10, change:
TEST_MODE = False
```

**Option B: Try other LLMs for comparison**

1. Set up API keys for Qwen/Llama/Mistral (see `misc/credentials/README.md`)
2. Run Cell 12 with your chosen LLM:

   ```python
   SELECTED_LLM = 'qwen'  # or 'llama' or 'mistral'
   TEST_MODE = True       # Test first!
   ```

## Expected Cost (Deepseek)

- **10 samples** (test): ~$0.01 or less
- **1,000 entries**: ~$0.10-0.20
- **10,000 entries**: ~$1-2

Much cheaper than the other options, making it perfect for testing!

## Troubleshooting

### "deepseek_api_key.txt not found"

```bash
# Create the file with your key
echo "your-api-key" > misc/credentials/deepseek_api_key.txt
```

### "File does not exist: real_person_adapters.csv"

Make sure the input file exists:

```bash
ls -lh data/CSV/real_person_adapters.csv
```

### API Rate Limiting

The code includes automatic rate limiting (`time.sleep(1)` between requests). If you still get rate limited:

- Increase the sleep time in Cell 10: change `time.sleep(1)` to `time.sleep(2)`

### Pipeline Interrupted

No problem! The code saves progress every 10 rows. Just re-run the cell and it will resume from where it left off.

## What's Next?

After testing with Deepseek:

1. **If results look good**: Scale up to the full dataset with Deepseek
2. **Compare LLMs**: Test Qwen/Llama/Mistral on the same sample to see which gives the best results
3. **Production run**: Choose your preferred LLM and process the full dataset

## File Outputs

The pipeline creates these files:

```
data/CSV/
├── NER_POI_step01_pre_annotation.csv    # After Cell 5 (name cleaning)
├── NER_POI_step02_annotated.csv         # After Cell 7 (country mapping)
├── deepseek_annotated_POI_test.csv      # After Cell 10 (test mode)
├── deepseek_annotated_POI.csv           # After Cell 10 (full mode)
├── qwen_annotated_POI_test.csv          # After Cell 12 (if using Qwen)
└── ...

misc/
├── deepseek_query_index.txt             # Progress tracking
└── ...
```

## Quick Commands

```bash
# View first few results
head -20 data/CSV/deepseek_annotated_POI_test.csv

# Count processed rows
wc -l data/CSV/deepseek_annotated_POI_test.csv

# Check progress
cat misc/deepseek_query_index.txt

# Reset progress (start from scratch)
rm misc/deepseek_query_index.txt
```

---

**Ready to start?** Open the notebook and run Cell 5 → Cell 7 → Cell 10! 🎉
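
## Optional: Pre-flight Check

If you'd rather check everything in one go before opening the notebook, the file checks from the Troubleshooting section can be bundled into a small script. This is only a sketch and is not part of the notebook: the function names `preflight` and `saved_progress` are made up for illustration, and the paths are the ones listed in this guide. The progress file is printed as-is, since this guide does not describe its format.

```python
from pathlib import Path

# Paths as given in this guide, relative to the repository root.
API_KEY_FILE = Path("misc/credentials/deepseek_api_key.txt")
INPUT_CSV = Path("data/CSV/real_person_adapters.csv")
PROGRESS_FILE = Path("misc/deepseek_query_index.txt")


def preflight(root="."):
    """Return a list of problems found; an empty list means ready to run.

    Hypothetical helper, not something the notebook provides.
    """
    root = Path(root)
    problems = []

    key_path = root / API_KEY_FILE
    if not key_path.is_file():
        problems.append(f"missing API key file: {API_KEY_FILE}")
    elif not key_path.read_text().strip():
        problems.append(f"API key file is empty: {API_KEY_FILE}")

    if not (root / INPUT_CSV).is_file():
        problems.append(f"missing input file: {INPUT_CSV}")

    return problems


def saved_progress(root="."):
    """Return the raw text of the progress file, or None if there is none."""
    path = Path(root) / PROGRESS_FILE
    return path.read_text().strip() if path.is_file() else None


if __name__ == "__main__":
    issues = preflight()
    for issue in issues:
        print("problem:", issue)
    if not issues:
        print("Ready to run Cell 5 -> Cell 7 -> Cell 10")
    print("Saved progress:", saved_progress() or "none (fresh start)")
```

Run it from the repository root. An empty problem list means the key and input file are where Cells 5 and 10 expect them.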