| # Quick Testing Instructions |
|
|
| ## Start Here! |
|
|
| You mentioned you have Deepseek credits, so **start by testing with Deepseek first** before trying the other LLMs. |
|
|
| ## Step-by-Step Testing |
|
|
| ### 1. Make sure your Deepseek API key is in place |
|
|
| Check if this file exists: |
| ```bash |
| cat misc/credentials/deepseek_api_key.txt |
| ``` |
|
|
| If not, create it: |
| ```bash |
| echo "your-deepseek-api-key" > misc/credentials/deepseek_api_key.txt |
| ``` |
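
For reference, a key file like this could be read in Python as sketched below. This is a minimal illustration, not necessarily what the notebook does: `load_api_key` is a hypothetical helper, and the placeholder check simply guards against the example string above being left in the file.

```python
from pathlib import Path


def load_api_key(path="misc/credentials/deepseek_api_key.txt"):
    """Read an API key from a text file (hypothetical helper, not the notebook's code)."""
    key_file = Path(path)
    if not key_file.is_file():
        raise FileNotFoundError(f"{key_file} not found; create it first")
    # `echo` appends a trailing newline, so strip surrounding whitespace.
    key = key_file.read_text(encoding="utf-8").strip()
    if not key or key == "your-deepseek-api-key":
        raise ValueError(f"{key_file} is empty or still contains the placeholder")
    return key
```

Failing early with a clear message is usually easier to debug than an authentication error from the API later on.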
|
|
| ### 2. Open the notebook |
|
|
| ```bash |
| jupyter notebook jupyter_notebooks/Section_2-3-4_Figure_8_deepfake_adapters.ipynb |
| ``` |
|
|
| ### 3. Run the cells in order |
|
|
| 1. **Cells 0-4**: Introduction and setup (just markdown, no execution needed) |
| 2. **Cell 5**: NER & Name Cleaning (processes `real_person_adapters.csv`) |
| 3. **Cell 7**: Country/Nationality Mapping |
| 4. **Cell 10**: **DEEPSEEK ANNOTATION** (TEST THIS FIRST!) |
|    - Default: `TEST_MODE = True` (10 samples) |
|    - Will create: `data/CSV/deepseek_annotated_POI_test.csv` |
| 5. **Cell 12**: Qwen/Llama/Mistral (run later after Deepseek works) |
|
|
| ### 4. Review Deepseek Results |
|
|
| After Cell 10 completes, check: |
| - Console output shows summary statistics |
| - Output file: `data/CSV/deepseek_annotated_POI_test.csv` |
|
|
| The output should look similar to this: |
| ``` |
| ✅ Progress saved after 10 rows |
| ✅ Done! Final results saved to data/CSV/deepseek_annotated_POI_test.csv |
| |
| === Summary Statistics === |
| Total processed: 10 |
| |
| Gender distribution: |
| Female 8 |
| Male 2 |
| ... |
| ``` |
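
To double-check the summary against the saved file, you can tally a column yourself. The sketch below uses only the standard library. `column_counts` is a hypothetical helper, and the column name `Gender` in the usage comment is an assumption based on the summary above, so check the CSV header for the actual name.

```python
import csv
from collections import Counter


def column_counts(csv_path, column):
    """Count how often each value appears in one column of a CSV file."""
    with open(csv_path, newline="", encoding="utf-8") as f:
        return Counter(row[column] for row in csv.DictReader(f))


# Example usage (the column name is an assumption):
# column_counts("data/CSV/deepseek_annotated_POI_test.csv", "Gender")
```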
|
|
| ### 5. If Deepseek Works Well |
|
|
| Once you're satisfied with the Deepseek results: |
|
|
| **Option A: Process the full dataset with Deepseek** |
| ```python |
| # In Cell 10, change: |
| TEST_MODE = False |
| ``` |
|
|
| **Option B: Try other LLMs for comparison** |
| 1. Set up API keys for Qwen/Llama/Mistral (see `misc/credentials/README.md`) |
| 2. Run Cell 12 with your chosen LLM: |
| ```python |
| SELECTED_LLM = 'qwen' # or 'llama' or 'mistral' |
| TEST_MODE = True # Test first! |
| ``` |
|
|
| ## Expected Cost (Deepseek) |
|
|
| - **10 samples** (test): ~$0.01 or less |
| - **1,000 entries**: ~$0.10-0.20 |
| - **10,000 entries**: ~$1-2 |
|
|
| Much cheaper than the other options, making it perfect for testing! |
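
The estimates above work out to roughly $0.0001 to $0.0002 per entry. As a rough sketch, assuming cost scales linearly with the number of entries (actual cost depends on prompt and response length), you can estimate a range for any dataset size. `estimate_cost` is a hypothetical helper.

```python
# Per-entry bounds implied by the "1,000 entries: ~$0.10-0.20" estimate.
LOW_PER_ENTRY = 0.10 / 1000
HIGH_PER_ENTRY = 0.20 / 1000


def estimate_cost(n_entries):
    """Return a (low, high) cost range in USD, assuming linear scaling."""
    return n_entries * LOW_PER_ENTRY, n_entries * HIGH_PER_ENTRY


low, high = estimate_cost(10_000)
print(f"10,000 entries: ${low:.2f}-${high:.2f}")  # 10,000 entries: $1.00-$2.00
```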
|
|
| ## Troubleshooting |
|
|
| ### "deepseek_api_key.txt not found" |
| ```bash |
| # Create the file with your key |
| echo "your-api-key" > misc/credentials/deepseek_api_key.txt |
| ``` |
|
|
| ### "File does not exist: real_person_adapters.csv" |
| Make sure the input file exists: |
| ```bash |
| ls -lh data/CSV/real_person_adapters.csv |
| ``` |
|
|
| ### API Rate Limiting |
| The code includes automatic rate limiting (`time.sleep(1)` between requests). If you still get rate limited, increase the delay in Cell 10 by changing `time.sleep(1)` to `time.sleep(2)`. |
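
If a fixed delay is still not enough, retrying with exponential backoff is a common alternative. The sketch below is not the notebook's code: `call_with_backoff` and its parameters are hypothetical, and it retries on any exception for brevity, whereas real code would catch only the client's rate-limit error.

```python
import time


def call_with_backoff(request_fn, max_retries=5, base_delay=1.0, sleep=time.sleep):
    """Call request_fn(), doubling the wait after each failure (hypothetical helper)."""
    for attempt in range(max_retries):
        try:
            return request_fn()
        except Exception:
            if attempt == max_retries - 1:
                raise  # out of retries: surface the last error
            sleep(base_delay * 2 ** attempt)  # waits 1s, 2s, 4s, ...
```

The `sleep` parameter exists only so the delay can be stubbed out when testing; in normal use, leave it at the default.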
|
|
| ### Pipeline Interrupted |
| No problem! The code saves progress every 10 rows. Just re-run the cell and it will resume from the last saved checkpoint. |
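
Resuming like this typically relies on a small checkpoint file such as `misc/deepseek_query_index.txt`. The sketch below illustrates the idea only. The function names, and the assumption that the file holds a single integer row index, are hypothetical and not taken from the notebook.

```python
from pathlib import Path


def read_index(index_path):
    """Return the next row to process, or 0 if no checkpoint exists yet."""
    p = Path(index_path)
    return int(p.read_text().strip()) if p.is_file() else 0


def process_rows(rows, annotate, index_path, save_every=10):
    """Annotate rows from the saved index onward, checkpointing every save_every rows."""
    start = read_index(index_path)
    results = []
    for i in range(start, len(rows)):
        results.append(annotate(rows[i]))
        if (i + 1) % save_every == 0 or i + 1 == len(rows):
            # A real pipeline would also write the annotated rows to disk here.
            Path(index_path).write_text(str(i + 1))
    return results
```

With this scheme, an interruption at row 17 restarts from row 10, so up to `save_every - 1` rows may be processed twice.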
|
|
| ## What's Next? |
|
|
| After testing with Deepseek: |
|
|
| 1. **If the results look good**: Scale up to the full dataset with Deepseek |
| 2. **Compare LLMs**: Test Qwen/Llama/Mistral on the same sample to see which gives the best results |
| 3. **Production run**: Choose your preferred LLM and process the full dataset |
|
|
| ## File Outputs |
|
|
| The pipeline creates these files: |
|
|
| ``` |
| data/CSV/ |
| ├── NER_POI_step01_pre_annotation.csv    # After Cell 5 (name cleaning) |
| ├── NER_POI_step02_annotated.csv         # After Cell 7 (country mapping) |
| ├── deepseek_annotated_POI_test.csv      # After Cell 10 (test mode) |
| ├── deepseek_annotated_POI.csv           # After Cell 10 (full mode) |
| ├── qwen_annotated_POI_test.csv          # After Cell 12 (if using Qwen) |
| └── ... |
| |
| misc/ |
| ├── deepseek_query_index.txt             # Progress tracking |
| └── ... |
| ``` |
|
|
| ## Quick Commands |
|
|
| ```bash |
| # View first few results |
| head -20 data/CSV/deepseek_annotated_POI_test.csv |
| |
| # Count lines in the output (this includes the CSV header row, if there is one) |
| wc -l data/CSV/deepseek_annotated_POI_test.csv |
| |
| # Check progress |
| cat misc/deepseek_query_index.txt |
| |
| # Reset progress (start from scratch) |
| rm misc/deepseek_query_index.txt |
| ``` |
|
|
| --- |
|
|
| **Ready to start?** Open the notebook and run Cell 5 → Cell 7 → Cell 10! |
|
|