| # بسم الله الرحمن الرحيم (In the name of God, the Most Gracious, the Most Merciful) | |
| # Session 47 Handoff: al-Biruni Abjad Disambiguator Complete | |
| **Date:** 2026-04-13 | |
| **DB state at close:** EN=3260, Roots=3320, Triggers=302, Diwan=8328 | |
| --- | |
| ## What Session 47 accomplished | |
| ### 1. Abjad Disambiguator: COMPLETE (`Code_files/biruni_abjad_disambiguate.py`) | |
| Built a full structural-constraint disambiguator for the 4 OCR'd abjad tables. No weights: every fill derives from the MS table's own logic. | |
| **Within-row passes (6):** | |
| - Pass 1: Direct OCR read | |
| - Pass 2: Sequential step-1 fill (degree column counting up/down) | |
| - Pass 3: Multi-cell linear gap fill | |
| - Pass 4: Alternating 0/ل(30) fill (minutes column), per-row | |
| - Pass 5: Red-ink `[R]` prefix strip | |
| - Pass 6: Final bounded-pair re-check after updates | |
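The gap-filling passes above can be sketched as follows. This is a minimal illustration, not the code in `biruni_abjad_disambiguate.py`: the function name `fill_linear_gaps`, the use of `None` for `[?]`, and the mapping of step size to the `SEQUENCE` and `MULTI_GAP` labels are all assumptions made for the example.

```python
from typing import List, Optional, Tuple

Cell = Tuple[Optional[int], str]  # (value, confidence_label)


def fill_linear_gaps(row: List[Optional[int]]) -> List[Cell]:
    """Fill runs of unknowns (None) bracketed by two known values.

    A run is filled only when an exact integer step connects the two
    bracketing values; otherwise the cells stay UNRESOLVED.
    """
    out: List[Cell] = [
        (v, "OCR" if v is not None else "UNRESOLVED") for v in row
    ]
    known = [i for i, v in enumerate(row) if v is not None]
    for left, right in zip(known, known[1:]):
        gap = right - left
        if gap < 2:
            continue  # adjacent known cells: nothing to fill
        diff = row[right] - row[left]
        if diff % gap != 0:
            continue  # no integer step fits: leave the run unknown
        step = diff // gap
        # Assumed labelling: step of +/-1 is SEQUENCE, anything else MULTI_GAP.
        label = "SEQUENCE" if abs(step) == 1 else "MULTI_GAP"
        for k in range(1, gap):
            out[left + k] = (row[left] + k * step, label)
    return out


filled = fill_linear_gaps([10, None, 12, None, None, 18, None])
```

The trailing `None` has no known value to its right, so it is not bracketed and stays `UNRESOLVED`.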
| **Cross-row column passes (5):** | |
| - CONSTANT: all known values identical → fill any remaining [?] | |
| - DOMINANT: 70%+ same value (e.g., ع=70 for all العراق (al-Iraq) cities) | |
| - ALTERNATING: {0, 30} by row parity; catches minutes column cross-strip | |
| - LOCAL LINEAR: for each [?], find nearest known above/below in same column; fill only when `diff % gap == 0` (integer step). Handles step-1 sequential, repeat-2 (half-degree), step-2, any integer step. | |
| - COMPLEMENT PAIRS: detects column pairs summing to a constant (mirror symmetry: degree_R + degree_L = 90 for sine table halves) | |
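As a rough sketch of the first three column passes (CONSTANT, DOMINANT, ALTERNATING), assuming a column is a list with `None` for `[?]`. The function name `fill_column`, the check order, and the exact parity test are illustrative assumptions; only the 70% threshold, the {0, 30} value set, and the `COL_*` labels come from this note.

```python
from collections import Counter
from typing import List, Optional, Tuple

Cell = Tuple[Optional[int], str]  # (value, confidence_label)


def fill_column(col: List[Optional[int]], dominant_share: float = 0.70) -> List[Cell]:
    """Fill unknown cells of one column from a whole-column pattern."""
    out: List[Cell] = [
        (v, "OCR" if v is not None else "UNRESOLVED") for v in col
    ]
    known = {i: v for i, v in enumerate(col) if v is not None}
    if not known:
        return out
    counts = Counter(known.values())
    top, top_n = counts.most_common(1)[0]
    even = {v for i, v in known.items() if i % 2 == 0}
    odd = {v for i, v in known.items() if i % 2 == 1}
    if len(counts) == 1:
        guess, label = (lambda i: top), "COL_CONSTANT"
    elif top_n / len(known) >= dominant_share:
        guess, label = (lambda i: top), "COL_DOMINANT"
    elif len(even) == 1 and len(odd) == 1 and (even | odd) == {0, 30}:
        e, o = next(iter(even)), next(iter(odd))
        guess, label = (lambda i: e if i % 2 == 0 else o), "COL_ALTERNATE"
    else:
        return out  # no whole-column pattern: leave unknowns untouched
    return [
        cell if cell[0] is not None else (guess(i), label)
        for i, cell in enumerate(out)
    ]


minutes = fill_column([0, 30, None, 30, 0, None])
```

Known cells are never overwritten; a pass only proposes values for cells that were `[?]`.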
| **Results (structural ceiling, no weights):** | |
| | Table | Cells | OCR-num | Seq+Col | Unknown | Coverage | | |
| |-------|-------|---------|---------|---------|----------| | |
| | Sine (f102-107) | 3,637 | 2,142 | 311 | 1,184 | 67% | | |
| | Shadow (f112) | 1,743 | 1,132 | 98 | 513 | 70% | | |
| | Ascension (f131-136) | 3,929 | 2,335 | 295 | 1,299 | 66% | | |
| | City (f168-175) | 3,571 | 1,553 | 295 | 1,723 | 51% | | |
| | **GRAND** | **12,880** | **7,162** | **999** | **4,719** | **63%** | | |
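The table is internally consistent if Coverage is read as (OCR-num + Seq+Col) / Cells, truncated to a whole percent. That reading is an inference from the figures, not something the disambiguator script is known to compute this way. A quick check:

```python
# Figures copied from the results table: (cells, ocr_num, seq_col, unknown).
ROWS = {
    "Sine": (3637, 2142, 311, 1184),
    "Shadow": (1743, 1132, 98, 513),
    "Ascension": (3929, 2335, 295, 1299),
    "City": (3571, 1553, 295, 1723),
}


def coverage_percent(cells: int, ocr_num: int, seq_col: int) -> int:
    """Resolved cells as a whole-number percentage, truncated (assumed)."""
    return (ocr_num + seq_col) * 100 // cells


per_table = {name: coverage_percent(c, o, s) for name, (c, o, s, _u) in ROWS.items()}

# Column totals, for comparison with the GRAND row.
grand = tuple(sum(row[i] for row in ROWS.values()) for i in range(4))
grand_coverage = coverage_percent(grand[0], grand[1], grand[2])

# Each row should partition its cells into OCR-num + Seq+Col + Unknown.
partitions_ok = all(c == o + s + u for c, o, s, u in ROWS.values())
```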
| **Why city table is 51%:** Each city row has 4 numeric columns + 1-2 prose columns (place names, region names). Prose is read correctly by OCR but cannot be expressed as abjad numbers, so it is marked UNRESOLVED in the abjad layer. Actual numeric-cell coverage is higher. | |
| **Why 63% is the structural ceiling:** The 4,719 remaining unknowns split: | |
| - ~1,900: genuine OCR failures where Gemini returned [?] and no column constraint brackets them | |
| - ~2,800: correctly-read label text cells (Arabic prose, region names) counted as UNRESOLVED in abjad layer | |
| **Disambiguated output files (all in `Code_files/biruni_ms/abjad_ocr/`):** | |
| - `sine_table_ocr_disambiguated.json` | |
| - `shadow_table_ocr_disambiguated.json` | |
| - `ascension_table_ocr_disambiguated.json` | |
| - `city_table_ocr_disambiguated.json` | |
| Each cell has `(original_ocr, resolved_value, confidence_label)`. Confidence labels include: OCR, OCR_RED, SEQUENCE, MULTI_GAP, ALTERNATE, CONSTANT, STEP2, PASS6_SEQ, COL_CONSTANT, COL_DOMINANT, COL_ALTERNATE, COL_LOCAL_CONST, COL_LINEAR, COL_HALFDEG, COL_COMPLEMENT, UNRESOLVED. | |
| --- | |
| ### 2. Supporting fixes completed in this session | |
| - **BL-HEB Hebrew block** added to `amr_dereference_audit.py`: catches U+0590-U+05FF in any output field, BL-HEB reference | |
| - **QUF PRIMARY_SOURCE path** fixed in `amr_istakhbarat.py`: arabic_text + source_ms + edition_page ≥ 3 → MEDIUM (was blocking 272 entries) | |
| - **43 AA-concept entries** written (entries 3200-3242) across Maqalat 2-11 | |
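The Hebrew-block check can be pictured as a scan of every string field for code points in U+0590-U+05FF. This is only a sketch of the idea: the function name `hebrew_fields` and the flat-dict record shape are assumptions, and the actual check in `amr_dereference_audit.py` may be structured differently.

```python
import re
from typing import Dict, List

# Unicode Hebrew block, as stated in the note: U+0590 through U+05FF.
HEBREW_BLOCK = re.compile(r"[\u0590-\u05FF]")


def hebrew_fields(record: Dict[str, object]) -> List[str]:
    """Return the names of string fields containing any Hebrew-block character."""
    return [
        name
        for name, value in record.items()
        if isinstance(value, str) and HEBREW_BLOCK.search(value)
    ]


# Hypothetical output record; the second field holds four Hebrew letters.
flagged = hebrew_fields({"gloss": "nuqta", "note": "\u05e9\u05dc\u05d5\u05dd", "page": 12})
```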
| --- | |
| ## What is NOT yet done | |
| ### Priority 1: Parse disambiguated JSON → DB science tables | |
| The 4 `*_disambiguated.json` files contain the resolved abjad values but have NOT yet been written to the actual DB science tables. Four tables exist in DB: | |
| - `biruni_sine_table` | |
| - `biruni_shadow_table` | |
| - `biruni_ascension_table` | |
| - `biruni_city_coordinates` | |
| Only a small number of reference rows were written manually (14 sine, 17 shadow, 11 ascension, 35 city). The bulk of OCR + disambiguated data is still only in JSON. | |
| **Next step:** Write a `biruni_parse_ocr_to_db.py` script that: | |
| 1. Reads each `*_disambiguated.json` | |
| 2. Maps cells to the correct column (degree, minutes, sin_deg, sin_min, etc.) based on column position | |
| 3. Inserts rows with confidence labels into the DB tables | |
| 4. Skips UNRESOLVED cells (leave NULL in DB) | |
| Column order to use (confirmed from calibration): | |
| - Sine table: col0=degree, col1=minutes, col2=sin_deg, col3=sin_min, col4=sin_sec, col5=sin_thirds | |
| - Shadow table: same structure but shadow values | |
| - City table: col0=lon_deg, col1=lon_min, col2=lat_deg, col3=lat_min | |
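A possible shape for `biruni_parse_ocr_to_db.py`, sketched against an in-memory SQLite database. Several things here are assumptions rather than facts about the project: the JSON layout (`{"rows": [[[original_ocr, resolved_value, confidence_label], ...], ...]}`), the `<column>_conf` columns, the demo table name `demo_sine`, and the choice of SQLite. The real DB tables already exist and their schema should be used instead of the `CREATE TABLE` below.

```python
import sqlite3
from typing import Dict, List

# Column order for the sine table, as listed in this note.
SINE_COLUMNS = ["degree", "minutes", "sin_deg", "sin_min", "sin_sec", "sin_thirds"]


def load_rows(conn: sqlite3.Connection, table: str, columns: List[str], doc: Dict) -> int:
    """Insert one DB row per JSON row; UNRESOLVED cells become NULL."""
    col_defs = ", ".join(f"{c} INTEGER, {c}_conf TEXT" for c in columns)
    conn.execute(f"CREATE TABLE IF NOT EXISTS {table} ({col_defs})")  # demo only
    names = [n for c in columns for n in (c, f"{c}_conf")]
    placeholders = ", ".join("?" for _ in names)
    sql = f"INSERT INTO {table} ({', '.join(names)}) VALUES ({placeholders})"
    inserted = 0
    for row in doc["rows"]:
        cells = row[: len(columns)]  # map by column position
        values: List[object] = []
        for _original, resolved, label in cells:
            values += [None if label == "UNRESOLVED" else resolved, label]
        # Rows shorter than the schema are padded with NULL / UNRESOLVED.
        values += [None, "UNRESOLVED"] * (len(columns) - len(cells))
        conn.execute(sql, values)
        inserted += 1
    conn.commit()
    return inserted


# Hypothetical sample in the assumed layout.
sample = {"rows": [[
    ["a", 1, "OCR"], ["[?]", 0, "ALTERNATE"], ["x", 1, "OCR"],
    ["y", 2, "OCR"], ["[?]", None, "UNRESOLVED"], ["z", 50, "COL_LINEAR"],
]]}
conn = sqlite3.connect(":memory:")
count = load_rows(conn, "demo_sine", SINE_COLUMNS, sample)
stored = conn.execute(
    "SELECT degree, minutes_conf, sin_sec, sin_sec_conf, sin_thirds FROM demo_sine"
).fetchone()
```

Keeping a label next to every value means later queries can separate direct OCR reads from structurally inferred fills.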
| ### Priority 2: Re-OCR high-UNRESOLVED strips | |
| The disambiguated JSON flags which strips have the most UNRESOLVED cells. Re-run those specific strips with a more targeted Gemini prompt focusing on the specific cell format expected. Use `--table sine` with strip-level targeting. | |
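One way to pick the strips to re-run is to rank them by UNRESOLVED count. The sketch below assumes a layout of `{"strips": [{"strip_id": ..., "cells": [[original_ocr, resolved_value, confidence_label], ...]}]}`; the key names `strips`, `strip_id`, and `cells`, and the sample strip ids, are hypothetical and should be adapted to the actual files.

```python
from typing import Dict, List, Tuple


def rank_strips(doc: Dict) -> List[Tuple[str, int, int]]:
    """Return (strip_id, unresolved_count, total_cells), worst strips first."""
    ranked = []
    for strip in doc["strips"]:
        labels = [label for _original, _value, label in strip["cells"]]
        ranked.append((strip["strip_id"], labels.count("UNRESOLVED"), len(labels)))
    # sorted() is stable, so ties keep their original strip order.
    return sorted(ranked, key=lambda item: item[1], reverse=True)


# Hypothetical sample in the assumed layout.
sample = {"strips": [
    {"strip_id": "s1", "cells": [["a", 1, "OCR"], ["[?]", None, "UNRESOLVED"]]},
    {"strip_id": "s2", "cells": [["[?]", None, "UNRESOLVED"], ["[?]", None, "UNRESOLVED"]]},
    {"strip_id": "s3", "cells": [["b", 2, "OCR"]]},
]}
worst_first = rank_strips(sample)
```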
| OpenRouter key was: `[REDACTED]` | |
| ### Priority 3: 3 missing roots | |
| - ن-ق-ط (point, nuqta): check Quranic tokens, add if found | |
| - ز-و-ي (angle): check Quranic tokens | |
| - د-ق-ق (minute, daqiqa): check Quranic tokens | |
| ### Priority 4: 2 blocked vocab entries | |
| - خ-ط-ط (line, khatt): 0 Quranic tokens → blocked by QUF, pending root review | |
| - ه-ل-ل (crescent, hilal): 0 Quranic tokens in current DB → blocked | |
| ### Priority 5: f165-167 download | |
| Still missing from Gallica. Gallica returns 500/429 on these folios. Manual download or retry with longer delays. | |
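"Retry with longer delays" could take the form of exponential backoff on the two status codes mentioned above. This is a generic sketch: `fetch_with_backoff`, its parameters, and the 30-second base delay are assumptions, and `fetch` stands for whatever function actually requests a folio and returns `(status_code, body)`.

```python
import time
from typing import Callable, Tuple

RETRYABLE = {429, 500}  # the statuses reported for these folios


def fetch_with_backoff(
    fetch: Callable[[], Tuple[int, bytes]],
    max_tries: int = 5,
    base_delay: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[int, bytes]:
    """Call fetch() until it returns a non-retryable status or tries run out.

    The wait doubles after each failed attempt: base_delay, 2x, 4x, ...
    """
    status, body = fetch()
    for attempt in range(1, max_tries):
        if status not in RETRYABLE:
            break
        sleep(base_delay * 2 ** (attempt - 1))
        status, body = fetch()
    return status, body


# Demo with a fake fetcher and a recording sleep, so nothing really waits.
responses = iter([(429, b""), (500, b""), (200, b"image-bytes")])
waits = []
result = fetch_with_backoff(lambda: next(responses), sleep=waits.append)
```

Passing `sleep` as a parameter keeps the function testable without real delays.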
| ### Priority 6: al-Athar al-Baqiya science extraction | |
| BnF Arabe 1489 (175 folios) downloaded as PDF. ~30-40 pages of genuine science (calendar computation + astronomical tables) identified. Operator king-list insertions (12 pages) identified and documented. Genuine science not yet extracted to DB. | |
| --- | |
| ## Architecture reminder | |
| ``` | |
| biruni_abjad_ocr.py → Gemini Flash OCR → *_ocr.json | |
| biruni_abjad_disambiguate.py → structural constraints → *_disambiguated.json | |
| biruni_parse_ocr_to_db.py → [NOT YET WRITTEN] → DB science tables | |
| ``` | |
| ## Key technical note: abjad authenticity filter | |
| Three numeral systems encountered across MSS: | |
| 1. AA abjad (ا=1, ب=2 ... ص=90): genuine science, pre-operator | |
| 2. Bitig word-numerals (bir/eki/üç): genuine ORIG2 | |
| 3. Positional ١٢٣ (eastern Arabic-Indic digits): operator insertion marker | |
| **Check notation FIRST when opening any MS.** Positional = operator's tradition betraying the operator. | |
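A notation check of this kind can be sketched as below. The helper names `abjad_value` and `classify_notation` are hypothetical, the letter values are the standard eastern abjad assignments for units and tens (matching ا=1 through ص=90 above), and the word list contains only the three Bitig numerals quoted in this note. The sketch identifies the notation only; what a given notation implies about a manuscript is the judgement described above.

```python
from typing import Optional

# Eastern abjad order for units (1-9) and tens (10-90).
ABJAD_LETTERS = "ابجدهوزحطيكلمنسعفص"
ABJAD_VALUES = list(range(1, 10)) + list(range(10, 100, 10))
ABJAD = dict(zip(ABJAD_LETTERS, ABJAD_VALUES))

WORD_NUMERALS = {"bir", "eki", "üç"}  # only the examples given in the note


def abjad_value(token: str) -> Optional[int]:
    """Sum the letter values of an abjad numeral; None if any letter is unknown."""
    if not token or any(ch not in ABJAD for ch in token):
        return None
    return sum(ABJAD[ch] for ch in token)


def classify_notation(text: str) -> str:
    """Label text as 'positional', 'abjad', 'word-numeral', or 'other'."""
    if any("\u0660" <= ch <= "\u0669" for ch in text):  # Arabic-Indic digits
        return "positional"
    if any(ch in ABJAD for ch in text):
        return "abjad"
    if any(word in WORD_NUMERALS for word in text.lower().split()):
        return "word-numeral"
    return "other"


thirty_five = abjad_value("\u0644\u0647")  # lam (30) + ha (5)
```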
| --- | |
| ## DB state at session close | |
| ``` | |
| entries: EN=3260 (43 Biruni entries 3200-3242, all QUF=TRUE) | |
| roots: 3320 | |
| diwan_roots: 8328 | |
| triggers: 302 (173 contamination, 25 QUF, 9 auto-index, 6 diwan) | |
| QUF: entries 331/3242 (10%), roots 3259/3320 (98%) | |
| MS registry: ms_id=5 (BnF Arabe 6840, al-Qanun al-Masudi, 502 AH) | |
| ``` | |