# Quick Testing Instructions
## Start Here!
You mentioned you have Deepseek credits, so **test with Deepseek first** before trying the other LLMs.
## Step-by-Step Testing
### 1. Make sure your Deepseek API key is in place
Check if this file exists:
```bash
cat misc/credentials/deepseek_api_key.txt
```
If not, create it:
```bash
echo "your-deepseek-api-key" > misc/credentials/deepseek_api_key.txt
```
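If you would rather verify the key from Python (for example, at the top of the notebook), a minimal sketch could look like the following. `load_api_key` is a hypothetical helper, not something the notebook is known to define; only the file path comes from the step above.

```python
from pathlib import Path

# Path taken from step 1 above; adjust it if your checkout differs.
DEFAULT_KEY_PATH = Path("misc/credentials/deepseek_api_key.txt")


def load_api_key(path=DEFAULT_KEY_PATH):
    """Hypothetical helper: return the key stored at `path`, stripped of whitespace."""
    path = Path(path)
    if not path.is_file():
        raise FileNotFoundError(f"API key file not found: {path}")
    key = path.read_text(encoding="utf-8").strip()
    if not key:
        raise ValueError(f"API key file is empty: {path}")
    # Catch the case where the placeholder from the echo command was never replaced.
    if key == "your-deepseek-api-key":
        raise ValueError("API key file still contains the placeholder text")
    return key
```

Calling `load_api_key()` before running the annotation cell fails early with a clear message instead of partway through the API calls.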
### 2. Open the notebook
```bash
jupyter notebook jupyter_notebooks/Section_2-3-4_Figure_8_deepfake_adapters.ipynb
```
### 3. Run the cells in order
1. **Cells 0-4**: Introduction and setup (just markdown, no execution needed)
2. **Cell 5**: NER & Name Cleaning (processes `real_person_adapters.csv`)
3. **Cell 7**: Country/Nationality Mapping
4. **Cell 10**: **DEEPSEEK ANNOTATION** (TEST THIS FIRST!)
   - Default: `TEST_MODE = True` (10 samples)
   - Will create: `data/CSV/deepseek_annotated_POI_test.csv`
5. **Cell 12**: Qwen/Llama/Mistral (run later after Deepseek works)
### 4. Review Deepseek Results
After Cell 10 completes, check:
- Console output shows summary statistics
- Output file: `data/CSV/deepseek_annotated_POI_test.csv`
The output should look something like this:
```
✅ Progress saved after 10 rows
✅ Done! Final results saved to data/CSV/deepseek_annotated_POI_test.csv
=== Summary Statistics ===
Total processed: 10
Gender distribution:
Female 8
Male 2
...
```
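To double-check the summary yourself, you can recount a column straight from the output file. This is a minimal sketch using only the standard library; `column_counts` is a hypothetical helper, and the column name you pass in (for example `"gender"`) is an assumption, so check the CSV header for the real name.

```python
import csv
from collections import Counter


def column_counts(csv_path, column):
    """Count how often each value appears in one column of a CSV file."""
    with open(csv_path, newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f)
        # fieldnames is None for an empty file; otherwise it holds the header row.
        if reader.fieldnames is None or column not in reader.fieldnames:
            raise KeyError(
                f"Column {column!r} not found; available: {reader.fieldnames}"
            )
        return Counter(row[column] for row in reader)
```

For example, `column_counts("data/CSV/deepseek_annotated_POI_test.csv", "gender")` should match the distribution printed by Cell 10, assuming the column is named that way.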
### 5. If Deepseek Works Well
Once you're satisfied with the Deepseek results:
**Option A: Process full dataset with Deepseek**
```python
# In Cell 10, change:
TEST_MODE = False
```
**Option B: Try other LLMs for comparison**
1. Set up API keys for Qwen/Llama/Mistral (see `misc/credentials/README.md`)
2. Run Cell 12 with your chosen LLM:
```python
SELECTED_LLM = 'qwen' # or 'llama' or 'mistral'
TEST_MODE = True # Test first!
```
## Expected Cost (Deepseek)
- **10 samples** (test): ~$0.01 or less
- **1,000 entries**: ~$0.10-0.20
- **10,000 entries**: ~$1-2
Much cheaper than the other options, making it perfect for testing!
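For other dataset sizes, you can scale the figures above. The sketch below assumes cost grows linearly with the number of entries and takes its per-entry bounds from the "1,000 entries" estimate; `estimate_cost` is a hypothetical helper, and actual pricing depends on prompt length and current Deepseek rates.

```python
# Per-entry bounds derived from the "1,000 entries: ~$0.10-0.20" estimate above.
LOW_PER_ENTRY = 0.10 / 1000
HIGH_PER_ENTRY = 0.20 / 1000


def estimate_cost(n_entries):
    """Return a (low, high) USD estimate, assuming cost scales linearly with entries."""
    if n_entries < 0:
        raise ValueError("n_entries must be non-negative")
    return (n_entries * LOW_PER_ENTRY, n_entries * HIGH_PER_ENTRY)
```

`estimate_cost(10_000)` gives roughly `(1.0, 2.0)`, which matches the 10,000-entry line above.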
## Troubleshooting
### "deepseek_api_key.txt not found"
```bash
# Create the file with your key
echo "your-api-key" > misc/credentials/deepseek_api_key.txt
```
### "File does not exist: real_person_adapters.csv"
Make sure the input file exists:
```bash
ls -lh data/CSV/real_person_adapters.csv
```
### API Rate Limiting
The code includes automatic rate limiting (`time.sleep(1)` between requests). If you still get rate limited:
- Increase the sleep time in Cell 10: change `time.sleep(1)` to `time.sleep(2)`
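If a longer fixed sleep is not enough, another option is to retry with exponential backoff. Note that this is a different technique from the fixed `time.sleep(1)` the notebook uses, and everything in this sketch (`call_with_backoff`, `request_fn`, `is_rate_limited`) is a hypothetical name you would wire up to your own request code.

```python
import time


def call_with_backoff(request_fn, is_rate_limited, base_delay=1.0,
                      max_retries=5, sleep=time.sleep):
    """Call request_fn(); on a rate-limit error, wait and retry with a doubling delay."""
    delay = base_delay
    for attempt in range(max_retries + 1):
        try:
            return request_fn()
        except Exception as exc:
            # Re-raise anything that is not a rate limit, or once retries run out.
            if not is_rate_limited(exc) or attempt == max_retries:
                raise
            sleep(delay)
            delay *= 2  # 1s, 2s, 4s, ...
```

Here `is_rate_limited` is a function you supply that inspects the exception (for instance, checking for an HTTP 429 status) and returns `True` when a retry makes sense.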
### Pipeline Interrupted
No problem! The code saves progress every 10 rows. Just re-run the cell and it will resume from where it left off.
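To give a feel for how this kind of resume logic can work, here is a simplified sketch. It assumes the progress file holds a single integer (the index of the next row to process); the real format of `misc/deepseek_query_index.txt` and the notebook's actual code may differ, and `read_start_index` and `process_rows` are hypothetical names.

```python
from pathlib import Path


def read_start_index(index_path):
    """Return the saved row index, or 0 if no progress file exists yet."""
    path = Path(index_path)
    if not path.is_file():
        return 0
    text = path.read_text(encoding="utf-8").strip()
    return int(text) if text else 0


def process_rows(rows, annotate, index_path, save_every=10):
    """Annotate rows from the saved index onward, checkpointing every `save_every` rows."""
    start = read_start_index(index_path)
    results = []
    for i in range(start, len(rows)):
        results.append(annotate(rows[i]))
        done = i + 1
        if done % save_every == 0 or done == len(rows):
            # Record the index of the next row to process.
            # A real pipeline would also write the results to the output CSV here.
            Path(index_path).write_text(str(done), encoding="utf-8")
    return results
```

If the run stops at row 13, the file still says `10`, so the next run redoes rows 10-12 and carries on. Deleting the progress file makes the next run start from row 0.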
## What's Next?
After testing with Deepseek:
1. **If results look good**: Scale up to full dataset with Deepseek
2. **Compare LLMs**: Test Qwen/Llama/Mistral on the same sample to see which gives best results
3. **Production run**: Choose your preferred LLM and process the full dataset
## File Outputs
The pipeline creates these files:
```
data/CSV/
├── NER_POI_step01_pre_annotation.csv   # After Cell 5 (name cleaning)
├── NER_POI_step02_annotated.csv        # After Cell 7 (country mapping)
├── deepseek_annotated_POI_test.csv     # After Cell 10 (test mode)
├── deepseek_annotated_POI.csv          # After Cell 10 (full mode)
├── qwen_annotated_POI_test.csv         # After Cell 12 (if using Qwen)
└── ...
misc/
├── deepseek_query_index.txt            # Progress tracking
└── ...
```
## Quick Commands
```bash
# View first few results
head -20 data/CSV/deepseek_annotated_POI_test.csv
# Count processed rows (the count includes the header line, if the CSV has one)
wc -l data/CSV/deepseek_annotated_POI_test.csv
# Check progress
cat misc/deepseek_query_index.txt
# Reset progress (start from scratch)
rm misc/deepseek_query_index.txt
```
---
**Ready to start?** Open the notebook and run Cell 5 → Cell 7 → Cell 10!