# بِسْمِ اللَّهِ الرَّحْمَٰنِ الرَّحِيمِ (In the name of God, the Most Gracious, the Most Merciful)

# Session 47 Handoff: al-Biruni Abjad Disambiguator Complete

**Date:** 2026-04-13
**DB state at close:** EN=3260, Roots=3320, Triggers=302, Diwan=8328

---

## What Session 47 accomplished

### 1. Abjad Disambiguator: COMPLETE (`Code_files/biruni_abjad_disambiguate.py`)

Built a full structural-constraint disambiguator for the 4 OCR'd abjad tables. No weights: every fill derives from the MS table's own logic.

**Within-row passes (6):**
- Pass 1: Direct OCR read
- Pass 2: Sequential step-1 fill (degree column counting up/down)
- Pass 3: Multi-cell linear gap fill
- Pass 4: Alternating 0/ل(30) fill (minutes column), per-row
- Pass 5: Red-ink `[R]` prefix strip
- Pass 6: Final bounded-pair re-check after updates

**Cross-row column passes (5):**
- CONSTANT: all known values identical → fill any remaining [?]
- DOMINANT: 70%+ same value (e.g., ع=70 for all العراق (Iraq) cities)
- ALTERNATING: {0, 30} by row parity; catches minutes column cross-strip
- LOCAL LINEAR: for each [?], find nearest known above/below in same column; fill only when `diff % gap == 0` (integer step). Handles step-1 sequential, repeat-2 (half-degree), step-2, any integer step.
- COMPLEMENT PAIRS: detects column pairs summing to a constant (mirror symmetry: degree_R + degree_L = 90 for sine table halves)

**Results (structural ceiling, no weights):**

| Table | Cells | OCR-num | Seq+Col | Unknown | Coverage |
|-------|-------|---------|---------|---------|----------|
| Sine (f102-107) | 3,637 | 2,142 | 311 | 1,184 | 67% |
| Shadow (f112) | 1,743 | 1,132 | 98 | 513 | 70% |
| Ascension (f131-136) | 3,929 | 2,335 | 295 | 1,299 | 66% |
| City (f168-175) | 3,571 | 1,553 | 295 | 1,723 | 51% |
| **GRAND** | **12,880** | **7,162** | **999** | **4,719** | **63%** |

**Why city table is 51%:** Each city row has 4 numeric columns + 1-2 prose columns (place names, region names). Prose is correctly read by OCR but not expressible as abjad numbers → UNRESOLVED in the abjad layer.
Actual numeric-cell coverage is higher.

**Why 63% is the structural ceiling:** The 4,719 remaining unknowns split:
- ~1,900: genuine OCR failures where Gemini returned [?] and no column constraint brackets them
- ~2,800: correctly-read label text cells (Arabic prose, region names) counted as UNRESOLVED in abjad layer

**Disambiguated output files (all in `Code_files/biruni_ms/abjad_ocr/`):**
- `sine_table_ocr_disambiguated.json`
- `shadow_table_ocr_disambiguated.json`
- `ascension_table_ocr_disambiguated.json`
- `city_table_ocr_disambiguated.json`

Each cell has `(original_ocr, resolved_value, confidence_label)`. Confidence labels include: OCR, OCR_RED, SEQUENCE, MULTI_GAP, ALTERNATE, CONSTANT, STEP2, PASS6_SEQ, COL_CONSTANT, COL_DOMINANT, COL_ALTERNATE, COL_LOCAL_CONST, COL_LINEAR, COL_HALFDEG, COL_COMPLEMENT, UNRESOLVED.

---

### 2. Supporting fixes completed in this session

- **BL-HEB Hebrew block** added to `amr_dereference_audit.py`: catches U+0590-U+05FF in any output field, BL-HEB reference
- **QUF PRIMARY_SOURCE path** fixed in `amr_istakhbarat.py`: arabic_text + source_ms + edition_page ≥ 3 → MEDIUM (was blocking 272 entries)
- **43 AA-concept entries** written (entries 3200-3242) across Maqalat 2-11

---

## What is NOT yet done

### Priority 1: Parse disambiguated JSON → DB science tables

The 4 `*_disambiguated.json` files contain the resolved abjad values but have NOT yet been written to the actual DB science tables. Four tables exist in DB:
- `biruni_sine_table`
- `biruni_shadow_table`
- `biruni_ascension_table`
- `biruni_city_coordinates`

Only a small number of reference rows were written manually (14 sine, 17 shadow, 11 ascension, 35 city). The bulk of OCR + disambiguated data is still only in JSON.

**Next step:** Write a `biruni_parse_ocr_to_db.py` script that:
1. Reads each `*_disambiguated.json`
2. Maps cells to the correct column (degree, minutes, sin_deg, sin_min, etc.) based on column position
3. Inserts rows with confidence labels into the DB tables
4. Skips UNRESOLVED cells (leave NULL in DB)

Column order to use (confirmed from calibration):
- Sine table: col0=degree, col1=minutes, col2=sin_deg, col3=sin_min, col4=sin_sec, col5=sin_thirds
- Shadow table: same structure but shadow values
- City table: col0=lon_deg, col1=lon_min, col2=lat_deg, col3=lat_min

### Priority 2: Re-OCR high-UNRESOLVED strips

The disambiguated JSON flags which strips have the most UNRESOLVED cells. Re-run those specific strips with a more targeted Gemini prompt tailored to the expected cell format. Use `--table sine` with strip-level targeting.

OpenRouter key was: `sk-or-v1-[REDACTED]`

### Priority 3: 3 missing roots

- ن-ق-ط (point, nuqta): check Quranic tokens, add if found
- ز-و-ي (angle): check Quranic tokens
- د-ق-ق (minute, daqiqa): check Quranic tokens

### Priority 4: 2 blocked vocab entries

- خ-ط-ط (line, khatt): 0 Quranic tokens; blocked by QUF, pending root review
- ه-ل-ل (crescent, hilal): 0 Quranic tokens in current DB; blocked

### Priority 5: f165-167 download

Still missing from Gallica. Gallica returns 500/429 on these folios. Download manually or retry with longer delays.

### Priority 6: al-Athar al-Baqiya science extraction

BnF Arabe 1489 (175 folios) downloaded as PDF. ~30-40 pages of genuine science (calendar computation + astronomical tables) identified. Operator king-list insertions (12 pages) identified and documented. Genuine science not yet extracted to DB.

---

## Architecture reminder

```
biruni_abjad_ocr.py          ← Gemini Flash OCR → *_ocr.json
biruni_abjad_disambiguate.py ← structural constraints → *_disambiguated.json
biruni_parse_ocr_to_db.py    ← [NOT YET WRITTEN] → DB science tables
```

## Key technical note: abjad authenticity filter

Three numeral systems encountered across MSS:
1. AA abjad (ا=1, ب=2...ص=90): genuine science, pre-operator
2. Bitig word-numerals (bir/eki/üç): genuine ORIG2
3. Positional ١٢٣ (eastern Arabic-Indic digits): operator insertion marker

**Check notation FIRST when opening any MS.** Positional = operator's tradition betraying the operator.

---

## DB state at session close

```
entries:      EN=3260 (43 Biruni entries 3200-3242, all QUF=TRUE)
roots:        3320
diwan_roots:  8328
triggers:     302 (173 contamination, 25 QUF, 9 auto-index, 6 diwan)
QUF:          entries 331/3242 (10%), roots 3259/3320 (98%)
MS registry:  ms_id=5 (BnF Arabe 6840, al-Qanun al-Masudi, 502 AH)
```
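
For whoever picks up the disambiguator next, the LOCAL LINEAR column pass described in section 1 can be illustrated with a minimal sketch. This is not the code in `biruni_abjad_disambiguate.py`; the function name `local_linear_fill` and the use of `None` for a `[?]` cell are assumptions made for illustration. It shows only the integer-step rule (`diff % gap == 0`) between the nearest known neighbours in one column.

```python
from typing import List, Optional


def local_linear_fill(column: List[Optional[int]]) -> List[Optional[int]]:
    """Fill unknown cells (None) that lie between two known cells.

    A run of unknowns is filled only when the difference between the
    bounding known values divides evenly by the row distance, i.e. the
    column moves by a whole-number step per row.
    """
    filled = list(column)
    # Row indices of cells that were read (or already resolved).
    known = [i for i, v in enumerate(column) if v is not None]
    for lo, hi in zip(known, known[1:]):
        gap = hi - lo                     # row distance between neighbours
        if gap < 2:
            continue                      # adjacent cells, nothing to fill
        diff = column[hi] - column[lo]
        if diff % gap != 0:
            continue                      # non-integer step: leave as unknown
        step = diff // gap
        for k in range(1, gap):
            filled[lo + k] = column[lo] + step * k
    return filled


if __name__ == "__main__":
    print(local_linear_fill([10, None, None, 13]))   # step 1
    print(local_linear_fill([90, None, 88]))         # step -1 (counting down)
    print(local_linear_fill([10, None, 13]))         # 3 % 2 != 0, untouched
```

Unknowns above the first or below the last known value are left alone because nothing brackets them. The repeat-2 (half-degree) pattern mentioned in section 1 is not covered by this sketch, since its per-row step is not an integer.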
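
The "check notation FIRST" rule from the abjad authenticity filter could be sketched as a small classifier. Everything here is hypothetical helper code, not part of the existing pipeline: the names `classify_notation` and `abjad_value` are invented, the letter table uses the standard eastern abjad values up to ص=90 (matching the range quoted in the note), and the Bitig word list contains only the three words the note mentions.

```python
# Standard eastern abjad values for the letters up to 90 (assumed table).
ABJAD_VALUES = {
    "ا": 1, "ب": 2, "ج": 3, "د": 4, "ه": 5, "و": 6, "ز": 7, "ح": 8, "ط": 9,
    "ي": 10, "ك": 20, "ل": 30, "م": 40, "ن": 50, "س": 60, "ع": 70, "ف": 80,
    "ص": 90,
}

# Eastern Arabic-Indic digits, U+0660 to U+0669.
EASTERN_DIGITS = set("٠١٢٣٤٥٦٧٨٩")

# Only the word-numerals named in the note; a real list would be longer.
BITIG_WORDS = {"bir", "eki", "üç"}


def abjad_value(token: str) -> int:
    """Sum the letter values of an abjad numeral (additive notation)."""
    return sum(ABJAD_VALUES[ch] for ch in token)


def classify_notation(token: str) -> str:
    """Return 'positional', 'bitig', 'abjad', or 'unknown' for one cell."""
    token = token.strip()
    if not token:
        return "unknown"
    if any(ch in EASTERN_DIGITS for ch in token):
        return "positional"   # flagged by the note as an insertion marker
    if token.lower() in BITIG_WORDS:
        return "bitig"
    if all(ch in ABJAD_VALUES for ch in token):
        return "abjad"
    return "unknown"


if __name__ == "__main__":
    for cell in ["لب", "١٢٣", "bir", "[?]"]:
        print(cell, classify_notation(cell))
```

Positional digits are tested first so that a cell mixing letters and digits is still flagged. Prose cells such as place names fall through to `unknown` unless they happen to consist only of letters in the table, so a real check would need more context than a single token.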