Quick Testing Instructions
Start Here!
You mentioned you have Deepseek credits, so test with Deepseek first before trying the other LLMs.
Step-by-Step Testing
1. Make sure your Deepseek API key is in place
Check if this file exists:
cat misc/credentials/deepseek_api_key.txt
If not, create it:
echo "your-deepseek-api-key" > misc/credentials/deepseek_api_key.txt
2. Open the notebook
jupyter notebook jupyter_notebooks/Section_2-3-4_Figure_8_deepfake_adapters.ipynb
3. Run the cells in order
- Cell 0-4: Introduction and setup (just markdown, no execution needed)
- Cell 5: NER & Name Cleaning (processes real_person_adapters.csv)
- Cell 7: Country/Nationality Mapping
- Cell 10: DEEPSEEK ANNOTATION (TEST THIS FIRST!)
  - Default: TEST_MODE = True (10 samples)
  - Will create: data/CSV/deepseek_annotated_POI_test.csv
- Cell 12: Qwen/Llama/Mistral (run later after Deepseek works)
4. Review Deepseek Results
After Cell 10 completes, check:
- Console output shows summary statistics
- Output file: data/CSV/deepseek_annotated_POI_test.csv
Example output should look like:
Progress saved after 10 rows
Done! Final results saved to data/CSV/deepseek_annotated_POI_test.csv
=== Summary Statistics ===
Total processed: 10
Gender distribution:
Female 8
Male 2
...
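If you want to recompute the distribution from the output file rather than rely on the console, a minimal sketch along these lines may help. It uses only the standard library; the column name `gender` and the function name `gender_distribution` are assumptions, so check the CSV header for the actual column name.

```python
import csv
from collections import Counter


def gender_distribution(csv_path, column="gender"):
    """Return (row_count, Counter of values) for one column of a CSV file.

    The default column name is an assumption; pass the real one if it differs.
    """
    with open(csv_path, newline="", encoding="utf-8") as f:
        rows = list(csv.DictReader(f))
    # A missing column raises KeyError, which is a useful signal here.
    counts = Counter(row[column] for row in rows)
    return len(rows), counts
```

For example, `gender_distribution("data/CSV/deepseek_annotated_POI_test.csv")` should report 10 rows in test mode if the run completed.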
5. If Deepseek Works Well
Once you're satisfied with the Deepseek results:
Option A: Process full dataset with Deepseek
# In Cell 10, change:
TEST_MODE = False
Option B: Try other LLMs for comparison
- Set up API keys for Qwen/Llama/Mistral (see misc/credentials/README.md)
- Run Cell 12 with your chosen LLM:
SELECTED_LLM = 'qwen'  # or 'llama' or 'mistral'
TEST_MODE = True  # Test first!
Expected Cost (Deepseek)
- 10 samples (test): ~$0.01 or less
- 1,000 entries: ~$0.10-0.20
- 10,000 entries: ~$1-2
Much cheaper than the other options, making it perfect for testing!
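To budget for a dataset of a different size, the figures above can be scaled linearly. The sketch below assumes cost grows in proportion to the number of entries, using the ~$0.10-0.20 per 1,000 entries range quoted above; actual cost depends on prompt and response length. The function name `estimate_cost` is hypothetical.

```python
# Per-entry range implied by the guide's figure of ~$0.10-0.20 per 1,000 entries.
LOW_PER_ENTRY = 0.10 / 1000
HIGH_PER_ENTRY = 0.20 / 1000


def estimate_cost(n_entries):
    """Return a rough (low, high) cost estimate in USD, assuming linear scaling."""
    return n_entries * LOW_PER_ENTRY, n_entries * HIGH_PER_ENTRY
```

Calling `estimate_cost(10_000)` reproduces the ~$1-2 range listed for 10,000 entries.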
Troubleshooting
"deepseek_api_key.txt not found"
# Create the file with your key
echo "your-api-key" > misc/credentials/deepseek_api_key.txt
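The notebook's own loading code is not shown in this guide. As a rough illustration of why the file must exist and contain only the key, here is a minimal sketch of how such a file might be read. The function name `load_api_key` is hypothetical; only the path comes from this guide.

```python
from pathlib import Path

# Path taken from this guide; the loader itself is an illustrative assumption.
DEFAULT_KEY_PATH = Path("misc/credentials/deepseek_api_key.txt")


def load_api_key(path=DEFAULT_KEY_PATH):
    """Return the key stored in `path`, without surrounding whitespace."""
    path = Path(path)
    if not path.is_file():
        raise FileNotFoundError(f"{path} not found; create it with your API key")
    # `echo` writes a trailing newline, so strip it before using the key.
    key = path.read_text(encoding="utf-8").strip()
    if not key:
        raise ValueError(f"{path} is empty")
    return key
```

Stripping whitespace matters because `echo` appends a newline, which would otherwise end up inside the key.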
"File does not exist: real_person_adapters.csv"
Make sure the input file exists:
ls -lh data/CSV/real_person_adapters.csv
API Rate Limiting
The code includes automatic rate limiting (time.sleep(1) between requests). If you still get rate limited:
- Increase the sleep time in Cell 10: change time.sleep(1) to time.sleep(2)
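One way to make this adjustment easier is to keep the pause in a single named constant instead of a hard-coded call. The sketch below is an illustration of that idea, not the notebook's actual code; `REQUEST_DELAY_SECONDS`, `annotate_all`, and `annotate_fn` are hypothetical names.

```python
import time

# Hypothetical constant: the guide's default pause is 1 second; raise it if rate limited.
REQUEST_DELAY_SECONDS = 2


def annotate_all(rows, annotate_fn, delay=REQUEST_DELAY_SECONDS, sleep=time.sleep):
    """Apply annotate_fn to each row, pausing `delay` seconds after every request."""
    results = []
    for row in rows:
        results.append(annotate_fn(row))
        sleep(delay)  # fixed pause between consecutive requests
    return results
```

Passing `sleep` as a parameter is only there so the pause can be replaced with a stub when testing.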
Pipeline Interrupted
No problem! The code saves progress every 10 rows. Just re-run the cell and it will resume from where it left off.
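The resume behaviour described above can be sketched roughly as follows. This is an assumption about how such checkpointing typically works, not a copy of the notebook's code: it supposes the index file holds a single integer (the number of rows already saved), and the names `read_index`, `process_with_resume`, `annotate_fn`, and `save_fn` are hypothetical.

```python
from pathlib import Path


def read_index(index_path):
    """Return the number of rows already saved, or 0 if there is no progress file."""
    p = Path(index_path)
    if p.is_file():
        text = p.read_text().strip()
        return int(text) if text else 0
    return 0


def process_with_resume(rows, annotate_fn, index_path, save_fn, save_every=10):
    """Annotate rows from the last checkpoint onward, saving every `save_every` rows."""
    start = read_index(index_path)
    batch = []
    for i in range(start, len(rows)):
        batch.append(annotate_fn(rows[i]))
        done = i + 1
        if done % save_every == 0 or done == len(rows):
            save_fn(batch)  # persist the finished batch first
            Path(index_path).write_text(str(done))  # then record progress
            batch = []
    return read_index(index_path)
```

Under this scheme, rows annotated after the last checkpoint are lost on interruption and simply redone on the next run, which is why at most 9 rows are repeated.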
What's Next?
After testing with Deepseek:
- If results look good: Scale up to full dataset with Deepseek
- Compare LLMs: Test Qwen/Llama/Mistral on the same sample to see which gives best results
- Production run: Choose your preferred LLM and process the full dataset
File Outputs
The pipeline creates these files:
data/CSV/
├── NER_POI_step01_pre_annotation.csv   # After Cell 5 (name cleaning)
├── NER_POI_step02_annotated.csv        # After Cell 7 (country mapping)
├── deepseek_annotated_POI_test.csv     # After Cell 10 (test mode)
├── deepseek_annotated_POI.csv          # After Cell 10 (full mode)
├── qwen_annotated_POI_test.csv         # After Cell 12 (if using Qwen)
└── ...
misc/
├── deepseek_query_index.txt            # Progress tracking
└── ...
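The annotated file names listed above appear to follow one pattern: the LLM name, then `_annotated_POI`, then `_test` in test mode. A small sketch of that convention, inferred from the listed names rather than taken from the notebook, could look like this; `output_path` is a hypothetical helper.

```python
from pathlib import Path


def output_path(llm, test_mode, base_dir="data/CSV"):
    """Build the annotated-CSV path for a given LLM, following the names listed above."""
    suffix = "_test" if test_mode else ""
    return Path(base_dir) / f"{llm}_annotated_POI{suffix}.csv"
```

This is handy for locating the right file when switching between `SELECTED_LLM` values or toggling `TEST_MODE`.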
Quick Commands
# View first few results
head -20 data/CSV/deepseek_annotated_POI_test.csv
# Count lines in the output (the total includes the header line, if the CSV has one)
wc -l data/CSV/deepseek_annotated_POI_test.csv
# Check progress
cat misc/deepseek_query_index.txt
# Reset progress (start from scratch)
rm misc/deepseek_query_index.txt
Ready to start? Open the notebook and run Cell 5 → Cell 7 → Cell 10!