# بِسْمِ اللَّهِ الرَّحْمَٰنِ الرَّحِيمِ
# Session 47 Handoff: al-Biruni Abjad Disambiguator Complete
**Date:** 2026-04-13
**DB state at close:** EN=3260, Roots=3320, Triggers=302, Diwan=8328
---
## What Session 47 accomplished
### 1. Abjad Disambiguator: COMPLETE (`Code_files/biruni_abjad_disambiguate.py`)
Built a full structural-constraint disambiguator for the 4 OCR'd abjad tables. No weights: every fill derives from the MS table's own logic.
**Within-row passes (6):**
- Pass 1: Direct OCR read
- Pass 2: Sequential step-1 fill (degree column counting up/down)
- Pass 3: Multi-cell linear gap fill
- Pass 4: Alternating 0/ل(30) fill (minutes column), per row
- Pass 5: Red-ink `[R]` prefix strip
- Pass 6: Final bounded-pair re-check after updates
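
Two of the within-row passes are simple enough to sketch. The following is a minimal illustration of Pass 5 (red-ink prefix strip) and Pass 2 (step-1 fill), assuming a row is a list of integers with `None` standing in for `[?]`. The function names `strip_red` and `fill_step1` are hypothetical and are not taken from `biruni_abjad_disambiguate.py`.

```python
from typing import List, Optional

RED_PREFIX = "[R]"  # red-ink marker named in Pass 5

def strip_red(cell: str) -> str:
    """Pass 5 sketch: drop a leading red-ink marker and keep the reading."""
    if cell.startswith(RED_PREFIX):
        return cell[len(RED_PREFIX):].strip()
    return cell

def fill_step1(row: List[Optional[int]]) -> List[Optional[int]]:
    """Pass 2 sketch: fill unknowns bracketed by two known values that
    differ by exactly one per cell (counting up or down)."""
    out = list(row)
    known = [i for i, v in enumerate(out) if v is not None]
    for left, right in zip(known, known[1:]):
        gap = right - left
        diff = out[right] - out[left]
        if gap > 1 and abs(diff) == gap:
            step = 1 if diff > 0 else -1
            for k in range(1, gap):
                out[left + k] = out[left] + k * step
    return out

print(strip_red("[R] 30"))
print(fill_step1([10, None, None, 13, None, 17]))
```

In the second call the cell between 13 and 17 stays unknown, because no step of exactly one fits that bracket.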
**Cross-row column passes (5):**
- CONSTANT: all known values identical → fill any remaining [?]
- DOMINANT: 70%+ same value (e.g., ع=70 for all العراق (al-Iraq) cities)
- ALTERNATING: {0, 30} by row parity; catches minutes column cross-strip
- LOCAL LINEAR: for each [?], find nearest known above/below in same column; fill only when `diff % gap == 0` (integer step). Handles step-1 sequential, repeat-2 (half-degree), step-2, any integer step.
- COMPLEMENT PAIRS: detects column pairs summing to a constant (mirror symmetry: degree_R + degree_L = 90 for sine table halves)
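
As a rough illustration of the DOMINANT and LOCAL LINEAR passes, the sketch below treats a column as a list of integers with `None` for `[?]`. It covers only the plain integer-step case (`diff % gap == 0`); the repeat-2 half-degree handling is not reproduced. The names `fill_dominant` and `fill_local_linear` are hypothetical, and measuring the 70% threshold against known cells only is an assumption.

```python
from collections import Counter
from typing import List, Optional

Column = List[Optional[int]]

def fill_dominant(col: Column, threshold: float = 0.70) -> Column:
    """Fill unknowns with the most common value when it covers at least
    `threshold` of the known cells (assumed reading of '70%+')."""
    known = [v for v in col if v is not None]
    if not known:
        return list(col)
    value, count = Counter(known).most_common(1)[0]
    if count / len(known) < threshold:
        return list(col)
    return [value if v is None else v for v in col]

def fill_local_linear(col: Column) -> Column:
    """Fill each unknown from its nearest known neighbours above and below,
    but only when the difference divides evenly over the gap."""
    out = list(col)
    for i, v in enumerate(col):
        if v is not None:
            continue
        above = next((j for j in range(i - 1, -1, -1) if col[j] is not None), None)
        below = next((j for j in range(i + 1, len(col)) if col[j] is not None), None)
        if above is None or below is None:
            continue  # not bracketed on both sides
        gap = below - above
        diff = col[below] - col[above]
        if diff % gap == 0:
            out[i] = col[above] + (i - above) * (diff // gap)
    return out

print(fill_dominant([70, 70, None, 70, 71]))
print(fill_local_linear([10, None, None, 16]))
```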
**Results (structural ceiling, no weights):**
| Table | Cells | OCR-num | Seq+Col | Unknown | Coverage |
|-------|-------|---------|---------|---------|----------|
| Sine (f102-107) | 3,637 | 2,142 | 311 | 1,184 | 67% |
| Shadow (f112) | 1,743 | 1,132 | 98 | 513 | 70% |
| Ascension (f131-136) | 3,929 | 2,335 | 295 | 1,299 | 66% |
| City (f168-175) | 3,571 | 1,553 | 295 | 1,723 | 51% |
| **GRAND** | **12,880** | **7,162** | **999** | **4,719** | **63%** |
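
The table can be checked arithmetically. The figures are consistent with Coverage = (OCR-num + Seq+Col) / Cells, truncated to a whole percent. That formula is inferred from the numbers and is not stated in the session notes.

```python
# (cells, ocr_num, seq_col, unknown) copied from the results table
TABLE = {
    "Sine": (3637, 2142, 311, 1184),
    "Shadow": (1743, 1132, 98, 513),
    "Ascension": (3929, 2335, 295, 1299),
    "City": (3571, 1553, 295, 1723),
}

def coverage_pct(cells: int, ocr_num: int, seq_col: int) -> int:
    """Resolved share of cells, truncated to a whole percent."""
    return 100 * (ocr_num + seq_col) // cells

coverage = {}
for name, (cells, ocr_num, seq_col, unknown) in TABLE.items():
    # Every row must partition its cells into the three buckets.
    assert ocr_num + seq_col + unknown == cells
    coverage[name] = coverage_pct(cells, ocr_num, seq_col)

# Column sums reproduce the GRAND row.
grand = [sum(col) for col in zip(*TABLE.values())]
coverage["GRAND"] = coverage_pct(grand[0], grand[1], grand[2])

print(grand)
print(coverage)
```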
**Why city table is 51%:** Each city row has 4 numeric columns + 1-2 prose columns (place names, region names). Prose is correctly read by OCR but not expressible as abjad numbers → UNRESOLVED in the abjad layer. Actual numeric-cell coverage is higher.
**Why 63% is the structural ceiling:** The 4,719 remaining unknowns split:
- ~1,900: genuine OCR failures where Gemini returned [?] and no column constraint brackets them
- ~2,800: correctly-read label text cells (Arabic prose, region names) counted as UNRESOLVED in abjad layer
**Disambiguated output files (all in `Code_files/biruni_ms/abjad_ocr/`):**
- `sine_table_ocr_disambiguated.json`
- `shadow_table_ocr_disambiguated.json`
- `ascension_table_ocr_disambiguated.json`
- `city_table_ocr_disambiguated.json`
Each cell has `(original_ocr, resolved_value, confidence_label)`. Confidence labels include: OCR, OCR_RED, SEQUENCE, MULTI_GAP, ALTERNATE, CONSTANT, STEP2, PASS6_SEQ, COL_CONSTANT, COL_DOMINANT, COL_ALTERNATE, COL_LOCAL_CONST, COL_LINEAR, COL_HALFDEG, COL_COMPLEMENT, UNRESOLVED.
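
A quick way to audit a disambiguated file is to tally its confidence labels. The sketch below assumes the cells have already been loaded into a flat list of `(original_ocr, resolved_value, confidence_label)` triples; the actual JSON nesting is not described in these notes, and the sample cells are illustrative, not MS readings.

```python
from collections import Counter

# Illustrative cells only; real data comes from the *_disambiguated.json files.
cells = [
    ("[R] 30", 30, "OCR_RED"),
    ("12", 12, "OCR"),
    ("[?]", 13, "SEQUENCE"),
    ("[?]", None, "UNRESOLVED"),
]

def label_counts(cells):
    """Count cells per confidence label."""
    return Counter(label for _ocr, _value, label in cells)

def resolved_share(cells):
    """Fraction of cells whose label is anything other than UNRESOLVED."""
    resolved = sum(1 for _ocr, _value, label in cells if label != "UNRESOLVED")
    return resolved / len(cells)

print(label_counts(cells))
print(resolved_share(cells))
```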
---
### 2. Supporting fixes completed in this session
- **BL-HEB Hebrew block** added to `amr_dereference_audit.py`: catches U+0590-U+05FF in any output field, BL-HEB reference
- **QUF PRIMARY_SOURCE path** fixed in `amr_istakhbarat.py`: arabic_text + source_ms + edition_page ≥ 3 → MEDIUM (was blocking 272 entries)
- **43 AA-concept entries** written (entries 3200-3242) across Maqalat 2-11
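
The Hebrew-block check comes down to a code point range test. This is a minimal sketch of that idea, assuming an output record is a flat dict of field name to value. The function name `hebrew_fields` and the sample record are hypothetical, and the real check in `amr_dereference_audit.py` may be structured differently.

```python
import re

# Unicode Hebrew block, U+0590 through U+05FF
HEBREW_BLOCK = re.compile(r"[\u0590-\u05FF]")

def hebrew_fields(record: dict) -> list:
    """Return the names of string fields containing any Hebrew-block character."""
    return [
        name
        for name, value in record.items()
        if isinstance(value, str) and HEBREW_BLOCK.search(value)
    ]

sample = {"gloss": "line", "note": "contains \u05d0\u05d1", "page": 12}
print(hebrew_fields(sample))
```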
---
## What is NOT yet done
### Priority 1: Parse disambiguated JSON → DB science tables
The 4 `*_disambiguated.json` files contain the resolved abjad values but have NOT yet been written to the actual DB science tables. Four tables exist in DB:
- `biruni_sine_table`
- `biruni_shadow_table`
- `biruni_ascension_table`
- `biruni_city_coordinates`
Only a small number of reference rows were written manually (14 sine, 17 shadow, 11 ascension, 35 city). The bulk of OCR + disambiguated data is still only in JSON.
**Next step:** Write a `biruni_parse_ocr_to_db.py` script that:
1. Reads each `*_disambiguated.json`
2. Maps cells to the correct column (degree, minutes, sin_deg, sin_min, etc.) based on column position
3. Inserts rows with confidence labels into the DB tables
4. Skips UNRESOLVED cells (leave NULL in DB)
Column order to use (confirmed from calibration):
- Sine table: col0=degree, col1=minutes, col2=sin_deg, col3=sin_min, col4=sin_sec, col5=sin_thirds
- Shadow table: same structure but shadow values
- City table: col0=lon_deg, col1=lon_min, col2=lat_deg, col3=lat_min
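
The four steps above might be sketched as follows for one sine-table row. The column names follow the calibration order listed above. Everything else is an assumption made for illustration: the demo table `sine_demo` and its schema, the single comma-joined `confidence` column, and the sample cell values. The real `biruni_sine_table` schema is not described in these notes.

```python
import sqlite3

SINE_COLUMNS = ["degree", "minutes", "sin_deg", "sin_min", "sin_sec", "sin_thirds"]

def row_to_record(cells):
    """Map positional cells to named columns; UNRESOLVED cells become None (NULL)."""
    record, labels = {}, []
    for name, (_ocr, value, label) in zip(SINE_COLUMNS, cells):
        record[name] = None if label == "UNRESOLVED" else value
        labels.append(label)
    record["confidence"] = ",".join(labels)  # hypothetical storage format
    return record

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sine_demo (degree INTEGER, minutes INTEGER, sin_deg INTEGER, "
    "sin_min INTEGER, sin_sec INTEGER, sin_thirds INTEGER, confidence TEXT)"
)

# Illustrative cells, not values from the MS.
cells = [
    ("1", 1, "OCR"), ("0", 0, "COL_ALTERNATE"), ("1", 1, "OCR"),
    ("2", 2, "OCR"), ("[?]", None, "UNRESOLVED"), ("[?]", 50, "COL_LINEAR"),
]
conn.execute(
    "INSERT INTO sine_demo VALUES (:degree, :minutes, :sin_deg, :sin_min, "
    ":sin_sec, :sin_thirds, :confidence)",
    row_to_record(cells),
)
stored = conn.execute("SELECT degree, sin_sec, confidence FROM sine_demo").fetchone()
print(stored)
```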
### Priority 2: Re-OCR high-UNRESOLVED strips
The disambiguated JSON flags which strips have the most UNRESOLVED cells. Re-run those specific strips with a more targeted Gemini prompt focusing on the specific cell format expected. Use `--table sine` with strip-level targeting.
OpenRouter key was: `sk-or-v1-[REDACTED]`
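
Choosing which strips to re-run can be done by counting UNRESOLVED cells per strip. The sketch below assumes each cell can be reduced to a `(strip_id, confidence_label)` pair. The strip identifiers shown are hypothetical, since the way strips are keyed in the JSON is not described here.

```python
from collections import Counter

def worst_strips(cells, top=3):
    """Return the `top` strips with the most UNRESOLVED cells, worst first."""
    counts = Counter(strip for strip, label in cells if label == "UNRESOLVED")
    return counts.most_common(top)

# Hypothetical strip ids and labels for illustration.
cells = [
    ("f102_s1", "UNRESOLVED"), ("f102_s1", "OCR"),
    ("f102_s2", "UNRESOLVED"), ("f102_s2", "UNRESOLVED"),
    ("f103_s1", "OCR"),
]
print(worst_strips(cells))
```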
### Priority 3: 3 missing roots
- ن-ق-ط (point, nuqta): check Quranic tokens, add if found
- ز-و-ي (angle): check Quranic tokens
- د-ق-ق (minute, daqiqa): check Quranic tokens
### Priority 4: 2 blocked vocab entries
- خ-ط-ط (line, khatt): 0 Quranic tokens; blocked by QUF, pending root review
- ه-ل-ل (crescent, hilal): 0 Quranic tokens in current DB; blocked
### Priority 5: f165-167 download
Folios f165-167 are still missing: Gallica returns HTTP 500/429 for them. Download them manually or retry with longer delays.
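
One way to retry with longer delays is exponential backoff on the two status codes observed. This is a generic sketch rather than the project's downloader: `fetch` is a hypothetical callable returning `(status, body)`, and the attempt count and 30-second starting delay are arbitrary assumptions.

```python
import time
from typing import Callable, Tuple

RETRYABLE = {429, 500}  # the two status codes reported for these folios

def fetch_with_backoff(
    fetch: Callable[[], Tuple[int, bytes]],
    attempts: int = 5,
    base_delay: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[int, bytes]:
    """Call `fetch` until it returns a non-retryable status, doubling the
    wait between attempts. Returns the last (status, body) seen."""
    delay = base_delay
    for attempt in range(attempts):
        status, body = fetch()
        if status not in RETRYABLE:
            return status, body
        if attempt < attempts - 1:
            sleep(delay)
            delay *= 2
    return status, body
```

Passing `sleep` as a parameter keeps the function testable without real waiting.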
### Priority 6: al-Athar al-Baqiya science extraction
BnF Arabe 1489 (175 folios) downloaded as PDF. ~30-40 pages of genuine science (calendar computation + astronomical tables) identified. Operator king-list insertions (12 pages) identified and documented. Genuine science not yet extracted to DB.
---
## Architecture reminder
```
biruni_abjad_ocr.py           ← Gemini Flash OCR        → *_ocr.json
biruni_abjad_disambiguate.py  ← structural constraints  → *_disambiguated.json
biruni_parse_ocr_to_db.py     ← [NOT YET WRITTEN]       → DB science tables
```
## Key technical note: abjad authenticity filter
Three numeral systems encountered across MSS:
1. AA abjad (ا=1, ب=2...ص=90): genuine science, pre-operator
2. Bitig word-numerals (bir/eki/üç): genuine ORIG2
3. Positional ١٢٣ (eastern Arabic-Indic digits): operator insertion marker
**Check notation FIRST when opening any MS.** Positional = operator's tradition betraying the operator.
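
A first-pass notation check could look like the sketch below. The letter values are the standard eastern abjad assignments from ا=1 up to ص=90, which matches the range given above. The function names, the additive reading of multi-letter tokens, and the three-word Bitig list are assumptions made for illustration.

```python
ABJAD = {
    "ا": 1, "ب": 2, "ج": 3, "د": 4, "ه": 5, "و": 6, "ز": 7, "ح": 8, "ط": 9,
    "ي": 10, "ك": 20, "ل": 30, "م": 40, "ن": 50, "س": 60, "ع": 70, "ف": 80, "ص": 90,
}
BITIG_WORDS = {"bir", "eki", "üç"}  # only the three examples named above

def is_eastern_digit(ch: str) -> bool:
    """True for eastern Arabic-Indic digits, U+0660 through U+0669."""
    return "\u0660" <= ch <= "\u0669"

def abjad_value(token: str) -> int:
    """Additive value of an abjad token, e.g. lam + ha = 30 + 5."""
    return sum(ABJAD[ch] for ch in token)

def notation(token: str) -> str:
    """Classify a non-empty token by the numeral system it uses."""
    if any(is_eastern_digit(ch) for ch in token):
        return "positional"
    if token in BITIG_WORDS:
        return "word-numeral"
    if all(ch in ABJAD for ch in token):
        return "abjad"
    return "other"

print(abjad_value("له"), notation("١٢٣"), notation("ع"))
```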
---
## DB state at session close
```
entries: EN=3260 (43 Biruni entries 3200-3242, all QUF=TRUE)
roots: 3320
diwan_roots: 8328
triggers: 302 (173 contamination, 25 QUF, 9 auto-index, 6 diwan)
QUF: entries 331/3242 (10%), roots 3259/3320 (98%)
MS registry: ms_id=5 (BnF Arabe 6840, al-Qanun al-Masudi, 502 AH)
```