بِسْمِ اللَّهِ الرَّحْمَٰنِ الرَّحِيمِ (In the name of God, the Most Gracious, the Most Merciful)

Session 47 Handoff: al-Biruni Abjad Disambiguator Complete

Date: 2026-04-13
DB state at close: EN=3260, Roots=3320, Triggers=302, Diwan=8328


What Session 47 accomplished

1. Abjad Disambiguator: COMPLETE (Code_files/biruni_abjad_disambiguate.py)

Built a full structural-constraint disambiguator for the 4 OCR'd abjad tables. No weights: every fill derives from the manuscript (MS) table's own logic.

Within-row passes (6):

  • Pass 1: Direct OCR read
  • Pass 2: Sequential step-1 fill (degree column counting up/down)
  • Pass 3: Multi-cell linear gap fill
  • Pass 4: Alternating 0/ل(30) fill (minutes column), per row
  • Pass 5: Red-ink [R] prefix strip
  • Pass 6: Final bounded-pair re-check after updates
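
As an illustration of two of the passes above, here is a minimal sketch of the red-ink prefix strip (Pass 5) and the single-cell step-1 fill (Pass 2). It assumes cells have already been converted from abjad letters to integers, with `None` standing in for `[?]`; the function names are hypothetical and are not taken from biruni_abjad_disambiguate.py.

```python
from typing import Optional


def strip_red_ink(cell: str) -> str:
    """Pass 5 analogue: drop a leading '[R]' red-ink marker and keep the rest."""
    return cell[3:].strip() if cell.startswith("[R]") else cell


def fill_step1(cells: list[Optional[int]]) -> list[Optional[int]]:
    """Pass 2 analogue: fill one unknown whose two neighbours differ by exactly 2.

    Works for sequences counting up or down. Runs of several unknowns are left
    alone here; they belong to the multi-cell gap fill.
    """
    out = list(cells)
    for i in range(1, len(out) - 1):
        left, right = out[i - 1], out[i + 1]
        if out[i] is None and left is not None and right is not None:
            if abs(right - left) == 2:
                out[i] = (left + right) // 2
    return out
```

A cell is only filled when both neighbours are known and exactly two apart, so a gap such as `[1, None, 4]` stays unresolved rather than being guessed.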

Cross-row column passes (5):

  • CONSTANT: all known values identical → fill any remaining [?]
  • DOMINANT: 70%+ same value (e.g., ع=70 for all العراق (Iraq) cities)
  • ALTERNATING: {0, 30} by row parity; catches minutes column cross-strip
  • LOCAL LINEAR: for each [?], find nearest known above/below in same column; fill only when diff % gap == 0 (integer step). Handles step-1 sequential, repeat-2 (half-degree), step-2, any integer step.
  • COMPLEMENT PAIRS: detects column pairs summing to a constant (mirror symmetry: degree_R + degree_L = 90 for sine table halves)
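
The column passes above can be sketched roughly as follows. This is a simplified illustration, not the code in biruni_abjad_disambiguate.py: columns are assumed to be lists of integers with `None` for `[?]`, the 70% threshold is assumed to apply to the known values only, the complement constant is passed in rather than detected, and the repeat-2 (half-degree) case is not covered.

```python
from collections import Counter
from typing import Optional

Column = list[Optional[int]]


def local_linear_fill(col: Column) -> Column:
    """LOCAL LINEAR: fill between nearest known cells when the step is an integer."""
    out = list(col)
    known = [i for i, v in enumerate(col) if v is not None]
    for lo, hi in zip(known, known[1:]):
        gap = hi - lo
        if gap < 2:
            continue  # adjacent known cells, nothing to fill
        diff = col[hi] - col[lo]
        if diff % gap != 0:
            continue  # non-integer step: leave the cells unresolved
        step = diff // gap
        for k in range(1, gap):
            out[lo + k] = col[lo] + step * k
    return out


def dominant_fill(col: Column, threshold: float = 0.7) -> Column:
    """DOMINANT: fill unknowns when one value covers >= threshold of known cells."""
    known = [v for v in col if v is not None]
    if not known:
        return list(col)
    value, count = Counter(known).most_common(1)[0]
    if count / len(known) < threshold:
        return list(col)
    return [value if v is None else v for v in col]


def complement_fill(a: Column, b: Column, total: int) -> tuple[Column, Column]:
    """COMPLEMENT PAIRS: where a + b == total, recover a missing side from the other."""
    out_a, out_b = list(a), list(b)
    for i, (x, y) in enumerate(zip(a, b)):
        if x is None and y is not None:
            out_a[i] = total - y
        elif y is None and x is not None:
            out_b[i] = total - x
    return out_a, out_b
```

In this sketch a constant run (difference 0) falls out of `local_linear_fill` as a step of zero, and descending columns work because Python's `%` yields 0 for exact negative multiples.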

Results (structural ceiling, no weights):

| Table | Cells | OCR-num | Seq+Col | Unknown | Coverage |
|---|---|---|---|---|---|
| Sine (f102-107) | 3,637 | 2,142 | 311 | 1,184 | 67% |
| Shadow (f112) | 1,743 | 1,132 | 98 | 513 | 70% |
| Ascension (f131-136) | 3,929 | 2,335 | 295 | 1,299 | 66% |
| City (f168-175) | 3,571 | 1,553 | 295 | 1,723 | 51% |
| GRAND | 12,880 | 7,162 | 999 | 4,719 | 63% |
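
The figures in the results table are internally consistent if coverage is taken as (OCR-num + Seq+Col) / Cells, rounded down to a whole percent. The rounding rule is an inference from the numbers, not something stated in the session notes. A quick check under that assumption:

```python
# Rows copied from the results table:
# name -> (cells, ocr_num, seq_col, unknown, reported_coverage_pct)
RESULTS = {
    "Sine": (3637, 2142, 311, 1184, 67),
    "Shadow": (1743, 1132, 98, 513, 70),
    "Ascension": (3929, 2335, 295, 1299, 66),
    "City": (3571, 1553, 295, 1723, 51),
    "GRAND": (12880, 7162, 999, 4719, 63),
}


def coverage_pct(cells: int, ocr_num: int, seq_col: int) -> int:
    """Resolved share of all cells as a whole percent, rounded down (assumed rule)."""
    return (ocr_num + seq_col) * 100 // cells


def check(results: dict) -> list[str]:
    """Return a list of inconsistencies; an empty list means the table adds up."""
    problems = []
    for name, (cells, ocr_num, seq_col, unknown, reported) in results.items():
        if ocr_num + seq_col + unknown != cells:
            problems.append(f"{name}: parts do not sum to cells")
        if coverage_pct(cells, ocr_num, seq_col) != reported:
            problems.append(f"{name}: coverage mismatch")
    return problems
```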

Why the city table is at 51%: each city row has 4 numeric columns plus 1-2 prose columns (place names, region names). OCR reads the prose correctly, but it cannot be expressed as abjad numbers, so it counts as UNRESOLVED in the abjad layer. Coverage of the numeric cells alone is therefore higher.

Why 63% is the structural ceiling: The 4,719 remaining unknowns split:

  • ~1,900: genuine OCR failures where Gemini returned [?] and no column constraint brackets them
  • ~2,800: correctly-read label text cells (Arabic prose, region names) counted as UNRESOLVED in abjad layer

Disambiguated output files (all in Code_files/biruni_ms/abjad_ocr/):

  • sine_table_ocr_disambiguated.json
  • shadow_table_ocr_disambiguated.json
  • ascension_table_ocr_disambiguated.json
  • city_table_ocr_disambiguated.json

Each cell has (original_ocr, resolved_value, confidence_label). Confidence labels include: OCR, OCR_RED, SEQUENCE, MULTI_GAP, ALTERNATE, CONSTANT, STEP2, PASS6_SEQ, COL_CONSTANT, COL_DOMINANT, COL_ALTERNATE, COL_LOCAL_CONST, COL_LINEAR, COL_HALFDEG, COL_COMPLEMENT, UNRESOLVED.
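
For orientation, a small sketch of how such cell records might be summarised. The dictionary keys mirror the tuple named above, but the actual JSON layout is not shown in these notes, so the record shape is an assumption. Treating OCR and OCR_RED as direct reads and every other non-UNRESOLVED label as a structural fill is likewise inferred from the label names.

```python
from collections import Counter

# Hypothetical cell records; key names follow the (original_ocr,
# resolved_value, confidence_label) tuple, the values are made up.
SAMPLE_CELLS = [
    {"original_ocr": "ل", "resolved_value": 30, "confidence_label": "OCR"},
    {"original_ocr": "[?]", "resolved_value": 0, "confidence_label": "COL_ALTERNATE"},
    {"original_ocr": "[?]", "resolved_value": None, "confidence_label": "UNRESOLVED"},
]


def label_histogram(cells):
    """Count how many cells carry each confidence label."""
    return Counter(cell["confidence_label"] for cell in cells)


def split_by_origin(cells):
    """Group cells into direct OCR reads, structural fills, and unresolved."""
    groups = {"ocr": [], "filled": [], "unresolved": []}
    for cell in cells:
        label = cell["confidence_label"]
        if label == "UNRESOLVED":
            groups["unresolved"].append(cell)
        elif label in ("OCR", "OCR_RED"):
            groups["ocr"].append(cell)
        else:
            groups["filled"].append(cell)
    return groups
```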


2. Supporting fixes completed in this session

  • BL-HEB Hebrew block added to amr_dereference_audit.py: catches U+0590-U+05FF in any output field (BL-HEB reference)
  • QUF PRIMARY_SOURCE path fixed in amr_istakhbarat.py: arabic_text + source_ms + edition_page ≥ 3 → MEDIUM (was blocking 272 entries)
  • 43 AA-concept entries written (entries 3200-3242) across Maqalat 2-11
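
The Hebrew-block check in the first item amounts to a Unicode range test. A minimal sketch, assuming an output record is a flat dict of field name to value; the function name is hypothetical and this is not the code in amr_dereference_audit.py:

```python
import re

# Unicode Hebrew block, U+0590 through U+05FF.
HEBREW_BLOCK = re.compile(r"[\u0590-\u05FF]")


def fields_with_hebrew(record: dict) -> list[str]:
    """Return the names of string fields containing any Hebrew-block character."""
    return [
        name
        for name, value in record.items()
        if isinstance(value, str) and HEBREW_BLOCK.search(value)
    ]
```

Non-string fields are ignored, so numeric IDs and NULLs pass through without raising.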

What is NOT yet done

Priority 1: Parse disambiguated JSON → DB science tables

The 4 *_disambiguated.json files contain the resolved abjad values but have NOT yet been written to the actual DB science tables. Four tables exist in DB:

  • biruni_sine_table
  • biruni_shadow_table
  • biruni_ascension_table
  • biruni_city_coordinates

Only a small number of reference rows were written manually (14 sine, 17 shadow, 11 ascension, 35 city). The bulk of OCR + disambiguated data is still only in JSON.

Next step: Write a biruni_parse_ocr_to_db.py script that:

  1. Reads each *_disambiguated.json
  2. Maps cells to the correct column (degree, minutes, sin_deg, sin_min, etc.) based on column position
  3. Inserts rows with confidence labels into the DB tables
  4. Skips UNRESOLVED cells (leave NULL in DB)

Column order to use (confirmed from calibration):

  • Sine table: col0=degree, col1=minutes, col2=sin_deg, col3=sin_min, col4=sin_sec, col5=sin_thirds
  • Shadow table: same structure but shadow values
  • City table: col0=lon_deg, col1=lon_min, col2=lat_deg, col3=lat_min
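
A possible starting point for biruni_parse_ocr_to_db.py, following the four steps and the sine column order above. Everything beyond the column names is an assumption made for illustration: the use of SQLite, the `demo_sine_table` name and schema (the real biruni_sine_table schema is not given here), the cell dict keys, and storing the per-cell labels as one comma-joined string.

```python
import sqlite3

SINE_COLUMNS = ["degree", "minutes", "sin_deg", "sin_min", "sin_sec", "sin_thirds"]


def row_to_record(cells):
    """Map one row of cell dicts to (values, labels); UNRESOLVED becomes None."""
    values, labels = [], []
    for cell in cells[: len(SINE_COLUMNS)]:
        label = cell["confidence_label"]
        values.append(None if label == "UNRESOLVED" else cell["resolved_value"])
        labels.append(label)
    return values, labels


def insert_rows(conn, rows):
    """Insert rows into a demo table; returns the number of rows written."""
    columns_sql = ", ".join(f"{name} INTEGER" for name in SINE_COLUMNS)
    conn.execute(
        f"CREATE TABLE IF NOT EXISTS demo_sine_table ({columns_sql}, confidence TEXT)"
    )
    placeholders = ", ".join("?" * (len(SINE_COLUMNS) + 1))
    inserted = 0
    for cells in rows:
        values, labels = row_to_record(cells)
        if len(values) != len(SINE_COLUMNS):
            continue  # row shorter than the expected column count: skip it
        conn.execute(
            f"INSERT INTO demo_sine_table VALUES ({placeholders})",
            values + [",".join(labels)],
        )
        inserted += 1
    conn.commit()
    return inserted
```

UNRESOLVED cells are bound as `None`, which SQLite stores as NULL, matching step 4.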

Priority 2: Re-OCR high-UNRESOLVED strips

The disambiguated JSON flags which strips have the most UNRESOLVED cells. Re-run those strips with a Gemini prompt targeted at the expected cell format. Use --table sine with strip-level targeting.
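
Picking the strips to re-run could look like the sketch below. It assumes each cell record carries a `strip` identifier next to its `confidence_label`; that key name is hypothetical, since the JSON layout is not reproduced in these notes.

```python
from collections import Counter


def rank_strips_by_unresolved(cells, top_n=3):
    """Return (strip, unresolved_count) pairs, worst strips first."""
    counts = Counter(
        cell["strip"] for cell in cells if cell["confidence_label"] == "UNRESOLVED"
    )
    return counts.most_common(top_n)
```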

OpenRouter key was: [redacted]

Priority 3: 3 missing roots

  • ن-ق-ط (point, nuqta): check Quranic tokens, add if found
  • ز-و-ي (angle): check Quranic tokens
  • د-ق-ق (minute, daqiqa): check Quranic tokens

Priority 4: 2 blocked vocab entries

  • خ-ط-ط (line, khatt): 0 Quranic tokens; blocked by QUF, pending root review
  • ه-ل-ل (crescent, hilal): 0 Quranic tokens in current DB; blocked

Priority 5: f165-167 download

Still missing: Gallica returns HTTP 500/429 on these folios. Download manually or retry with longer delays.
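
One way to "retry with longer delays" is exponential backoff on the two status codes mentioned. The sketch below is generic and makes no assumptions about Gallica's API: `fetch` is a caller-supplied function returning `(status, body)`, and the delay values are placeholders rather than tested settings.

```python
import time

RETRYABLE = {429, 500}  # the two status codes reported for these folios


def fetch_with_backoff(fetch, url, max_tries=5, base_delay=2.0, sleep=time.sleep):
    """Call fetch(url) until it returns HTTP 200, doubling the wait each retry.

    `fetch` must return a (status_code, body) tuple. `sleep` is injectable so
    the function can be exercised without actually waiting.
    """
    if max_tries < 1:
        raise ValueError("max_tries must be at least 1")
    delay = base_delay
    for attempt in range(1, max_tries + 1):
        status, body = fetch(url)
        if status == 200:
            return body
        if status not in RETRYABLE or attempt == max_tries:
            raise RuntimeError(f"{url}: HTTP {status} after {attempt} attempt(s)")
        sleep(delay)
        delay *= 2
```

Non-retryable statuses fail immediately, so a 404 on a wrong folio URL is not retried.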

Priority 6: al-Athar al-Baqiya science extraction

BnF Arabe 1489 (175 folios) downloaded as PDF. ~30-40 pages of genuine science (calendar computation + astronomical tables) identified. Operator king-list insertions (12 pages) identified and documented. Genuine science not yet extracted to DB.


Architecture reminder

biruni_abjad_ocr.py          ← Gemini Flash OCR → *_ocr.json
biruni_abjad_disambiguate.py ← structural constraints → *_disambiguated.json
biruni_parse_ocr_to_db.py    ← [NOT YET WRITTEN] → DB science tables

Key technical note: abjad authenticity filter

Three numeral systems encountered across MSS:

  1. AA abjad (ا=1, ب=2...ص=90): genuine science, pre-operator
  2. Bitig word-numerals (bir/eki/üç): genuine ORIG2
  3. Positional ١٢٣ (eastern Arabic-Indic digits): operator insertion marker

Check notation FIRST when opening any MS. Positional = operator's tradition betraying the operator.
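
The notation check can be approximated with a Unicode-range test. The sketch below is illustrative only: the abjad map covers just the units and tens named above (ا=1 through ص=90, eastern abjad order) and omits the letters for hundreds, the word list contains only the three Bitig numerals quoted, and the function names are hypothetical.

```python
# Abjad letter values for units and tens only (1 through 90).
ABJAD = {
    "ا": 1, "ب": 2, "ج": 3, "د": 4, "ه": 5, "و": 6, "ز": 7, "ح": 8, "ط": 9,
    "ي": 10, "ك": 20, "ل": 30, "م": 40, "ن": 50, "س": 60, "ع": 70, "ف": 80, "ص": 90,
}

# Only the three word-numerals quoted in the notes.
BITIG_WORDS = {"bir": 1, "eki": 2, "üç": 3}


def classify_notation(token: str) -> str:
    """Classify a token as 'positional', 'abjad', 'bitig_word', or 'unknown'."""
    # Eastern Arabic-Indic digits occupy U+0660 through U+0669.
    if any("\u0660" <= ch <= "\u0669" for ch in token):
        return "positional"
    if token and all(ch in ABJAD for ch in token):
        return "abjad"
    if token.lower() in BITIG_WORDS:
        return "bitig_word"
    return "unknown"


def abjad_value(token: str) -> int:
    """Sum the letter values of an abjad token (abjad notation is additive)."""
    return sum(ABJAD[ch] for ch in token)
```

Positional digits are tested first, so a token mixing digits and letters is flagged as positional rather than passing as abjad.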


DB state at session close

entries:     EN=3260 (43 Biruni entries 3200-3242, all QUF=TRUE)
roots:       3320
diwan_roots: 8328
triggers:    302 (173 contamination, 25 QUF, 9 auto-index, 6 diwan)
QUF:         entries 331/3242 (10%), roots 3259/3320 (98%)
MS registry: ms_id=5 (BnF Arabe 6840, al-Qanun al-Masudi, 502 AH)