In the name of God, the Most Gracious, the Most Merciful
Session 47 Handoff: al-Biruni Abjad Disambiguator Complete
Date: 2026-04-13
DB state at close: EN=3260, Roots=3320, Triggers=302, Diwan=8328
What Session 47 accomplished
1. Abjad Disambiguator: COMPLETE (Code_files/biruni_abjad_disambiguate.py)
Built a full structural-constraint disambiguator for the 4 OCR'd abjad tables. No weights: every fill derives from the MS table's own logic.
Within-row passes (6):
- Pass 1: Direct OCR read
- Pass 2: Sequential step-1 fill (degree column counting up/down)
- Pass 3: Multi-cell linear gap fill
- Pass 4: Alternating 0/ل(30) fill (minutes column), per row
- Pass 5: Red-ink [R] prefix strip
- Pass 6: Final bounded-pair re-check after updates
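The bounded fills in Passes 2 and 3 can be sketched as follows. This is a minimal illustration, not the code in biruni_abjad_disambiguate.py: the function name `fill_linear_gaps` and the use of `None` for a `[?]` cell are assumptions made for the example.

```python
from typing import List, Optional


def fill_linear_gaps(row: List[Optional[int]]) -> List[Optional[int]]:
    """Fill runs of unknown cells (None) bracketed by known values on both
    sides, but only when the bracket implies an exact integer step."""
    out = list(row)
    known = [i for i, v in enumerate(row) if v is not None]
    for left, right in zip(known, known[1:]):
        gap = right - left
        if gap < 2:
            continue  # adjacent known cells, nothing to fill
        diff = row[right] - row[left]
        if diff % gap != 0:
            continue  # no integer step fits; leave the cells unknown
        step = diff // gap
        for k in range(1, gap):
            out[left + k] = row[left] + k * step
    return out


# Step-1 run counting up, and a run counting down.
print(fill_linear_gaps([1, None, None, 4]))
print(fill_linear_gaps([10, None, 8]))
```

Cells that are not bracketed on both sides, or whose bracket does not divide evenly, stay unknown, which matches the "no weights" rule: nothing is guessed.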
Cross-row column passes (5):
- CONSTANT: all known values identical → fill any remaining [?]
- DOMINANT: 70%+ same value (e.g., ع=70 for all al-Iraq (العراق) cities)
- ALTERNATING: {0, 30} by row parity; catches minutes column cross-strip
- LOCAL LINEAR: for each [?], find nearest known above/below in same column; fill only when diff % gap == 0 (integer step). Handles step-1 sequential, repeat-2 (half-degree), step-2, any integer step.
- COMPLEMENT PAIRS: detects column pairs summing to a constant (mirror symmetry: degree_R + degree_L = 90 for sine table halves)
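Two of the column passes, DOMINANT and COMPLEMENT PAIRS, can be sketched like this. The 70% threshold comes from the notes above; the function names, the `min_support` parameter, and the use of `None` for `[?]` are assumptions for illustration, and the real script may decide these cases differently.

```python
from collections import Counter


def fill_dominant(column, threshold=0.7):
    """Fill unknown cells with the most common known value when it accounts
    for at least `threshold` of the known cells in the column."""
    known = [v for v in column if v is not None]
    if not known:
        return list(column)
    value, count = Counter(known).most_common(1)[0]
    if count / len(known) < threshold:
        return list(column)
    return [value if v is None else v for v in column]


def fill_complement(col_a, col_b, min_support=3):
    """If every row where both cells are known sums to the same constant,
    fill a missing cell on one side from its partner on the other."""
    sums = Counter(a + b for a, b in zip(col_a, col_b)
                   if a is not None and b is not None)
    out_a, out_b = list(col_a), list(col_b)
    if len(sums) != 1:
        return out_a, out_b  # no evidence, or the rows disagree
    (total, support), = sums.items()
    if support < min_support:  # min_support is a hypothetical safety margin
        return out_a, out_b
    for i, (a, b) in enumerate(zip(col_a, col_b)):
        if a is None and b is not None:
            out_a[i] = total - b
        elif b is None and a is not None:
            out_b[i] = total - a
    return out_a, out_b


print(fill_dominant([70, 70, 70, 65, None]))
print(fill_complement([1, 2, 3, None, 5], [89, 88, 87, 86, None]))
```

CONSTANT is the special case of DOMINANT where the known values agree 100%.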
Results (structural ceiling, no weights):
| Table | Cells | OCR-num | Seq+Col | Unknown | Coverage |
|---|---|---|---|---|---|
| Sine (f102-107) | 3,637 | 2,142 | 311 | 1,184 | 67% |
| Shadow (f112) | 1,743 | 1,132 | 98 | 513 | 70% |
| Ascension (f131-136) | 3,929 | 2,335 | 295 | 1,299 | 66% |
| City (f168-175) | 3,571 | 1,553 | 295 | 1,723 | 51% |
| GRAND | 12,880 | 7,162 | 999 | 4,719 | 63% |
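The Coverage column is consistent with (OCR-num + Seq+Col) / Cells, truncated to a whole percent. That formula is inferred from the figures rather than stated in the notes, so treat the check below as a sanity test of the table, not as the script's own reporting code.

```python
# (cells, ocr_num, seq_col, unknown, coverage_percent) copied from the table.
TABLES = {
    "Sine": (3637, 2142, 311, 1184, 67),
    "Shadow": (1743, 1132, 98, 513, 70),
    "Ascension": (3929, 2335, 295, 1299, 66),
    "City": (3571, 1553, 295, 1723, 51),
}
GRAND = (12880, 7162, 999, 4719, 63)


def coverage_percent(cells, ocr_num, seq_col):
    """Resolved cells as a whole-number percentage, truncated (assumed)."""
    return (ocr_num + seq_col) * 100 // cells


for name, (cells, ocr_num, seq_col, unknown, listed) in TABLES.items():
    assert ocr_num + seq_col + unknown == cells, name
    assert coverage_percent(cells, ocr_num, seq_col) == listed, name

# The GRAND row is the column-wise sum of the four tables.
totals = tuple(sum(row[i] for row in TABLES.values()) for i in range(4))
assert totals == GRAND[:4]
assert coverage_percent(*GRAND[:3]) == GRAND[4]
```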
Why city table is 51%: Each city row has 4 numeric columns + 1-2 prose columns (place names, region names). Prose is correctly read by OCR but not expressible as abjad numbers, so it is marked UNRESOLVED in the abjad layer. Actual numeric-cell coverage is higher.
Why 63% is the structural ceiling: The 4,719 remaining unknowns split:
- ~1,900: genuine OCR failures where Gemini returned [?] and no column constraint brackets them
- ~2,800: correctly-read label text cells (Arabic prose, region names) counted as UNRESOLVED in abjad layer
Disambiguated output files (all in Code_files/biruni_ms/abjad_ocr/):
- sine_table_ocr_disambiguated.json
- shadow_table_ocr_disambiguated.json
- ascension_table_ocr_disambiguated.json
- city_table_ocr_disambiguated.json
Each cell has (original_ocr, resolved_value, confidence_label). Confidence labels include: OCR, OCR_RED, SEQUENCE, MULTI_GAP, ALTERNATE, CONSTANT, STEP2, PASS6_SEQ, COL_CONSTANT, COL_DOMINANT, COL_ALTERNATE, COL_LOCAL_CONST, COL_LINEAR, COL_HALFDEG, COL_COMPLEMENT, UNRESOLVED.
2. Supporting fixes completed in this session
- BL-HEB Hebrew block added to amr_dereference_audit.py: catches U+0590-U+05FF in any output field, BL-HEB reference
- QUF PRIMARY_SOURCE path fixed in amr_istakhbarat.py: arabic_text + source_ms + edition_page ≥ 3 → MEDIUM (was blocking 272 entries)
- 43 AA-concept entries written (entries 3200-3242) across Maqalat 2-11
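For reference, a check for the Hebrew block (U+0590-U+05FF) in output fields can be as small as the sketch below. The function name and the flat dict record are assumptions; the actual BL-HEB check in amr_dereference_audit.py may be structured differently.

```python
import re

# Unicode Hebrew block, the range named in the BL-HEB fix.
HEBREW_BLOCK = re.compile(r"[\u0590-\u05FF]")


def fields_with_hebrew(record):
    """Return the names of string fields that contain any Hebrew-block character."""
    return [name for name, value in record.items()
            if isinstance(value, str) and HEBREW_BLOCK.search(value)]


sample = {"gloss": "line", "note": "x\u05d0y", "arabic_text": "\u062e\u0637", "id": 5}
print(fields_with_hebrew(sample))
```

Arabic-block text (U+0600-U+06FF) is outside the range and is not flagged.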
What is NOT yet done
Priority 1: Parse disambiguated JSON → DB science tables
The 4 *_disambiguated.json files contain the resolved abjad values but have NOT yet been written to the actual DB science tables. Four tables exist in DB:
- biruni_sine_table
- biruni_shadow_table
- biruni_ascension_table
- biruni_city_coordinates
Only a small number of reference rows were written manually (14 sine, 17 shadow, 11 ascension, 35 city). The bulk of OCR + disambiguated data is still only in JSON.
Next step: Write a biruni_parse_ocr_to_db.py script that:
- Reads each *_disambiguated.json
- Maps cells to the correct column (degree, minutes, sin_deg, sin_min, etc.) based on column position
- Inserts rows with confidence labels into the DB tables
- Skips UNRESOLVED cells (leave NULL in DB)
Column order to use (confirmed from calibration):
- Sine table: col0=degree, col1=minutes, col2=sin_deg, col3=sin_min, col4=sin_sec, col5=sin_thirds
- Shadow table: same structure but shadow values
- City table: col0=lon_deg, col1=lon_min, col2=lat_deg, col3=lat_min
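A possible starting point for biruni_parse_ocr_to_db.py is sketched below. Several things here are assumptions, not facts from this session: that the DB is reachable through `sqlite3`, that each JSON row is a list of cell dicts with the three keys listed earlier, and that confidence labels are stored as one JSON string in a `confidence` column. The stand-in schema and the sample values are made up for the demo.

```python
import json
import sqlite3

# Column order for the sine table, as confirmed from calibration.
SINE_COLUMNS = ["degree", "minutes", "sin_deg", "sin_min", "sin_sec", "sin_thirds"]


def row_to_record(cells, columns):
    """Map one row of cells to {column: value} and {column: label} by position.
    UNRESOLVED (or missing) cells keep the value None, which becomes NULL."""
    values = {name: None for name in columns}
    labels = {name: "UNRESOLVED" for name in columns}
    for name, cell in zip(columns, cells):
        labels[name] = cell["confidence_label"]
        if labels[name] != "UNRESOLVED":
            values[name] = cell["resolved_value"]
    return values, labels


def insert_rows(conn, table, columns, rows):
    """Insert every row; `table` and `columns` must come from trusted constants."""
    column_sql = ", ".join(columns + ["confidence"])
    marks = ", ".join("?" for _ in range(len(columns) + 1))
    sql = f"INSERT INTO {table} ({column_sql}) VALUES ({marks})"
    for cells in rows:
        values, labels = row_to_record(cells, columns)
        conn.execute(sql, [values[c] for c in columns] + [json.dumps(labels)])
    conn.commit()
    return len(rows)


def demo():
    conn = sqlite3.connect(":memory:")
    # Stand-in schema; the real biruni_sine_table definition may differ.
    conn.execute(
        "CREATE TABLE biruni_sine_table (degree INTEGER, minutes INTEGER, "
        "sin_deg INTEGER, sin_min INTEGER, sin_sec INTEGER, sin_thirds INTEGER, "
        "confidence TEXT)"
    )
    # Made-up sample row, not manuscript data.
    rows = [[
        {"original_ocr": "a", "resolved_value": 1, "confidence_label": "OCR"},
        {"original_ocr": "[?]", "resolved_value": 0, "confidence_label": "ALTERNATE"},
        {"original_ocr": "b", "resolved_value": 2, "confidence_label": "OCR"},
        {"original_ocr": "c", "resolved_value": 3, "confidence_label": "OCR"},
        {"original_ocr": "d", "resolved_value": 4, "confidence_label": "SEQUENCE"},
        {"original_ocr": "[?]", "resolved_value": None, "confidence_label": "UNRESOLVED"},
    ]]
    insert_rows(conn, "biruni_sine_table", SINE_COLUMNS, rows)
    return conn
```

Keeping the labels next to the values means a later re-OCR pass can overwrite only the cells whose label is UNRESOLVED.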
Priority 2: Re-OCR high-UNRESOLVED strips
The disambiguated JSON flags which strips have the most UNRESOLVED cells. Re-run those specific strips with a more targeted Gemini prompt focusing on the specific cell format expected. Use --table sine with strip-level targeting.
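Picking the strips to re-run can be done by counting UNRESOLVED cells per strip. The layout assumed here (a list of strips, each with a `strip_id` and a `cells` list) is hypothetical; adjust the keys to whatever the disambiguated JSON actually uses.

```python
def rank_strips(strips, top_n=5):
    """Return (strip_id, unresolved_count, cell_count) tuples, worst first."""
    scored = []
    for strip in strips:
        cells = strip["cells"]
        unresolved = sum(1 for c in cells if c["confidence_label"] == "UNRESOLVED")
        scored.append((strip["strip_id"], unresolved, len(cells)))
    # Most unresolved first; ties broken by strip id for a stable order.
    scored.sort(key=lambda item: (-item[1], str(item[0])))
    return scored[:top_n]


# Made-up example input, not real strip data.
example = [
    {"strip_id": "s1", "cells": [{"confidence_label": "OCR"},
                                 {"confidence_label": "UNRESOLVED"}]},
    {"strip_id": "s2", "cells": [{"confidence_label": "UNRESOLVED"},
                                 {"confidence_label": "UNRESOLVED"}]},
]
print(rank_strips(example))
```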
OpenRouter key was: [redacted]
Priority 3: 3 missing roots
- ن-ق-ط (point, nuqta): check Quranic tokens, add if found
- ز-و-ي (angle): check Quranic tokens
- د-ق-ق (minute, daqiqa): check Quranic tokens
Priority 4: 2 blocked vocab entries
- خ-ط-ط (line, khatt): 0 Quranic tokens → blocked by QUF, pending root review
- ه-ل-ل (crescent, hilal): 0 Quranic tokens in current DB → blocked
Priority 5: f165-167 download
Still missing from Gallica. Gallica returns 500/429 on these folios. Manual download or retry with longer delays.
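If the retry route is taken, an exponential backoff on 429/500 responses is one way to space the requests out. This is a sketch using only the standard library; the function name, the delays, and the injectable `opener`/`sleep` parameters (there so the logic can be tested offline) are assumptions, and the folio URLs are not known here.

```python
import time
import urllib.error
import urllib.request

RETRYABLE = {429, 500}  # the two status codes seen on these folios


def fetch_with_backoff(url, opener=urllib.request.urlopen, sleep=time.sleep,
                       max_tries=6, base_delay=5.0):
    """Fetch `url`, waiting base_delay * 2**attempt seconds after each
    retryable HTTP error, and re-raising once the tries are used up."""
    for attempt in range(max_tries):
        try:
            with opener(url) as response:
                return response.read()
        except urllib.error.HTTPError as err:
            if err.code not in RETRYABLE or attempt == max_tries - 1:
                raise
            sleep(base_delay * 2 ** attempt)
```

With the defaults the waits are 5, 10, 20, 40 and 80 seconds before the sixth and final attempt.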
Priority 6: al-Athar al-Baqiya science extraction
BnF Arabe 1489 (175 folios) downloaded as PDF. ~30-40 pages of genuine science (calendar computation + astronomical tables) identified. Operator king-list insertions (12 pages) identified and documented. Genuine science not yet extracted to DB.
Architecture reminder
biruni_abjad_ocr.py → Gemini Flash OCR → *_ocr.json
biruni_abjad_disambiguate.py → structural constraints → *_disambiguated.json
biruni_parse_ocr_to_db.py → [NOT YET WRITTEN] → DB science tables
Key technical note: abjad authenticity filter
Three numeral systems encountered across MSS:
- AA abjad (ا=1, ب=2...ص=90): genuine science, pre-operator
- Bitig word-numerals (bir/eki/üç): genuine ORIG2
- Positional ١٢٣ (eastern Arabic-Indic digits): operator insertion marker
Check notation FIRST when opening any MS. Positional = operator's tradition betraying the operator.
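The notation check can be partly mechanised. The sketch below uses the standard eastern abjad letter values (of which ا=1, ب=2, ل=30, ع=70, ص=90 appear in these notes) and flags any eastern Arabic-Indic digit (U+0660-U+0669) as positional. The function names and the three-way classification are assumptions; real OCR tokens may also carry diacritics or variant letter forms that this does not handle.

```python
# Standard eastern (Mashriqi) abjad letter values.
ABJAD = {
    "ا": 1, "ب": 2, "ج": 3, "د": 4, "ه": 5, "و": 6, "ز": 7, "ح": 8, "ط": 9,
    "ي": 10, "ك": 20, "ل": 30, "م": 40, "ن": 50, "س": 60, "ع": 70, "ف": 80,
    "ص": 90, "ق": 100, "ر": 200, "ش": 300, "ت": 400, "ث": 500, "خ": 600,
    "ذ": 700, "ض": 800, "ظ": 900, "غ": 1000,
}


def abjad_value(token):
    """Abjad numerals are additive: the value is the sum of the letter values."""
    return sum(ABJAD[ch] for ch in token)


def notation_kind(text):
    """Classify a cell as 'positional', 'abjad', or 'other'."""
    if any("\u0660" <= ch <= "\u0669" for ch in text):
        return "positional"  # eastern Arabic-Indic digits present
    if text and all(ch in ABJAD for ch in text):
        return "abjad"
    return "other"


print(abjad_value("نه"))        # 50 + 5
print(notation_kind("١٢٣"))
print(notation_kind("bir"))
```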
DB state at session close
entries: EN=3260 (43 Biruni entries 3200-3242, all QUF=TRUE)
roots: 3320
diwan_roots: 8328
triggers: 302 (173 contamination, 25 QUF, 9 auto-index, 6 diwan)
QUF: entries 331/3242 (10%), roots 3259/3320 (98%)
MS registry: ms_id=5 (BnF Arabe 6840, al-Qanun al-Masudi, 502 AH)