uslap-query / Code_files /MIGRATION_GUIDE.md
uslap's picture
Upload folder using huggingface_hub
7cc8e29 verified
|
Raw
History Blame Contribute Delete
16.2 kB

A newer version of the Gradio SDK is available: 6.19.0

Upgrade

USLaP Database Migration Guide

بِسْمِ Ψ§Ω„Ω„ΩŽΩ‘Ω‡Ω Ψ§Ω„Ψ±ΩŽΩ‘Ψ­Ω’Ω…ΩŽΩ°Ω†Ω Ψ§Ω„Ψ±ΩŽΩ‘Ψ­ΩΩŠΩ…Ω

This guide provides step-by-step instructions for migrating data from the Excel master file to the full USLaP SQLite relational database. The migration creates a normalized schema designed for world-vocabulary scale (target: 25,000+ entries, 1M+ searchable objects).

Overview

The migration process:

  1. Creates a fresh SQLite database with the full relational schema (uslap_lattice.db)
  2. Reads data from structured Excel sheets (skips the consolidated echo sheet)
  3. Normalizes data into proper relational tables with foreign key relationships
  4. Registers the extract_consonants() Python UDF for phonetic search
  5. Generates the word_fingerprints table for O(log n) cluster expansion
  6. Creates backups of any existing databases
  7. Verifies data integrity and foreign key constraints

Prerequisites

Required Files

  • USLaP_Final_Data_Consolidated_Master_v3.xlsx – Master Excel file (in workplace root)
  • create_uslap_db.sql – Complete SQLite schema (in Code_files/)
  • migrate_to_sqlite.py – Migration script (in Code_files/)
  • USLaP_Engine.py – Contains consonant extraction logic (in Code_files/)

Python Dependencies

pip install openpyxl

The script requires Python 3.6+ and the openpyxl library for reading Excel files. All other dependencies are in the Python standard library.

Running the Migration

Step 1: Verify File Locations

Ensure all files are in the correct locations:

USLaP workplace/
β”œβ”€β”€ USLaP_Final_Data_Consolidated_Master_v3.xlsx
└── Code_files/
    β”œβ”€β”€ create_uslap_db.sql
    β”œβ”€β”€ migrate_to_sqlite.py
    β”œβ”€β”€ USLaP_Engine.py
    └── [existing .db files]

Step 2: Run the Migration Script

From the USLaP workplace directory, run:

cd "/Users/mmsetubal/Documents/USLaP workplace"
python3 "Code_files/migrate_to_sqlite.py"

Important Notes:

  • The script will automatically create a backup of any existing uslap_lattice.db file
  • Migration may take 1-2 minutes depending on Excel file size
  • All operations are wrapped in a transaction; on failure, the database is rolled back

Step 3: Monitor Migration Output

The script provides real-time progress:

══════════════════════════════════════════════════════════════════════
  USLaP Migration: Excel β†’ SQLite Relational Database
  بِسْمِ Ψ§Ω„Ω„ΩŽΩ‘Ω‡Ω Ψ§Ω„Ψ±ΩŽΩ‘Ψ­Ω’Ω…ΩŽΩ°Ω†Ω Ψ§Ω„Ψ±ΩŽΩ‘Ψ­ΩΩŠΩ…Ω
══════════════════════════════════════════════════════════════════════

πŸ“– Loading Excel file: USLaP_Final_Data_Consolidated_Master_v3.xlsx
  Found 62 sheets

πŸ—„οΈ  Creating database: Code_files/uslap_lattice.db
  Executing schema...
  Registering extract_consonants() UDF...

πŸ“Š Migrating data...
  Migrating A1_ENTRIES...
    Migrated 59 entries
  Migrating A1_Π—ΠΠŸΠ˜Π‘Π˜ (Russian entries)...
    Migrated 0 entries
  [Additional sheets...]

πŸ“ˆ Migration Statistics:
────────────────────────────────────────
  Total entries: 59
  Total roots: 0
  Child entries: 3
  Word fingerprints: 6
  Engine queue items: 0

πŸ” Verifying foreign key constraints...
βœ“ All foreign key constraints satisfied

══════════════════════════════════════════════════════════════════════
βœ… MIGRATION COMPLETED SUCCESSFULLY
βœ… Database: Code_files/uslap_lattice.db
══════════════════════════════════════════════════════════════════════

Verification Steps

Verify Database Structure

After migration, verify all tables were created:

sqlite3 "Code_files/uslap_lattice.db" ".tables"

Expected output (30+ tables):

child_entries          decay_levels          operation_cycles
child_entry_links      derivatives           operators
cluster_cache          detection_patterns    operator_aliases
country_names          engine_queue          operation_codes
cross_refs             entries               phonetic_mappings
change_log             events                phonetic_shifts
dp_codes               host_civilizations    quran_refs
intel_reports          languages             roots
names_of_allah         networks              scholars
nt_codes               op_codes              script_corridors
qur_verification       session_index         sync_status
word_fingerprints

Verify Row Counts

Run the verification queries in the database:

-- Check entry counts
SELECT 'entries' as tbl, COUNT(*) FROM entries
UNION ALL SELECT 'roots', COUNT(*) FROM roots
UNION ALL SELECT 'derivatives', COUNT(*) FROM derivatives
UNION ALL SELECT 'cross_refs', COUNT(*) FROM cross_refs
UNION ALL SELECT 'child_entries', COUNT(*) FROM child_entries
UNION ALL SELECT 'word_fingerprints', COUNT(*) FROM word_fingerprints;

-- Check FK integrity
PRAGMA integrity_check;
PRAGMA foreign_key_check;

-- Check fingerprint coverage
SELECT COUNT(DISTINCT entry_id) as entries_with_fingerprints FROM word_fingerprints;

Expected Results:

  • entries: ~390–404 entries total across all languages
  • roots: ~200 unique roots
  • derivatives: ~632 word forms
  • cross_refs: ~155 relationships
  • child_entries: ~12 operational intelligence entries
  • word_fingerprints: ~1,400+ entries (covers all searchable terms)

Test the UDF Function

Verify the extract_consonants() function works:

sqlite3 "Code_files/uslap_lattice.db" "SELECT extract_consonants('example')"

Expected output: xmpl

Test Basic Queries

-- Get entries with high confidence scores
SELECT entry_id, en_term, score FROM entries WHERE score >= 8 ORDER BY score DESC LIMIT 10;

-- Test phonetic search via word_fingerprints
SELECT wf.raw_word, wf.consonant_skeleton, e.en_term, e.score
FROM word_fingerprints wf
LEFT JOIN entries e ON wf.entry_id = e.entry_id
WHERE wf.consonant_skeleton = extract_consonants('test')
LIMIT 5;

-- View operational intelligence (CHILD schema)
SELECT child_id, shell_name, operation_role, parent_op FROM child_entries;

Database Schema Details

Critical Tables for Engine Operations

1. word_fingerprints – Phonetic Search Index

The most critical table for performance. Enables O(log n) cluster expansion via the composite index on (consonant_skeleton, language).

Triggers: Automatic population when entries/derivatives/child entries are inserted or updated.

Index: idx_fingerprints_lookup enables instant phonetic matching.

2. engine_queue – Write Conflict Prevention

Prevents direct writes from the engine to core tables. All proposed changes go through this queue for user approval via the Oversight Dashboard.

Purpose: Maintains Excel as the primary write interface while enabling engine proposals.

3. session_index – Engine Session Tracking

Tracks every engine run with performance metrics and error logging.

4. child_entries & child_entry_links – Operational Intelligence

CHILD schema integration links operational intelligence (SLV, SQLB, RUS entries) with main A1 entries.

Foreign Key Enforcement

Foreign keys are strictly enforced (PRAGMA foreign_keys = ON). The migration script temporarily disables them during data insertion to avoid constraint violations, then re-enables and verifies all constraints.

Python UDF: extract_consonants()

Registration

The UDF is automatically registered on every database connection via:

conn.create_function("extract_consonants", 1, extract_consonants)

CRITICAL: Must be registered BEFORE any INSERT operations or trigger execution.

Implementation

The function extracts consonant skeletons from words:

  • Removes vowels (a, e, i, o, u)
  • Handles digraphs (sh, ch, gh, th, ph, wh, qu) as single units
  • Normalizes to lowercase
  • Returns empty string for null/empty input

Source: Logic matches PhoneticReversal.extract_consonants() in USLaP_Engine.py.

SQL Triggers Using the UDF

The schema includes triggers that automatically populate word_fingerprints:

CREATE TRIGGER update_fingerprints_on_entry_insert
AFTER INSERT ON entries
BEGIN
    INSERT INTO word_fingerprints (entry_id, language, raw_word, consonant_skeleton)
    SELECT NEW.entry_id, 'en', NEW.en_term, extract_consonants(NEW.en_term)
    WHERE NEW.en_term IS NOT NULL AND NEW.en_term != '';
    -- ... similar for ru_term, fa_term, ar_word
END;

Excel ↔ Database Sync Strategy

Current Architecture

  • Excel is primary write interface: All user-facing writes go through Excel
  • Database is read/query layer: Engine reads from database, proposes changes via queue
  • Sync direction: Excel β†’ Database (one-way during migration)

Maintaining Sync

After migration, keep databases in sync:

  1. Engine proposals: When USLaP_Engine.py detects new patterns, it writes to engine_queue
  2. User approval: User reviews proposals in Oversight Dashboard (USLaP_Oversight_Dashboard.html)
  3. Approved changes: User applies approved changes to Excel manually
  4. Re-sync: Run migration script periodically to update database with Excel changes

Migration Script Updates

The migrate_to_sqlite.py script can be re-run at any time. It will:

  1. Create timestamped backup of existing database
  2. Start fresh with current Excel data
  3. Preserve any engine_queue items that haven't been processed

Switching USLaP_Engine.py to SQLite Reads

Current State

USLaP_Engine.py currently reads directly from Excel via openpyxl.

Target State

Update USLaP_Engine.py to use the database access layer (db_access_layer.py) for:

  1. Entry lookups (instead of reading Excel sheets)
  2. Phonetic search (using word_fingerprints table)
  3. Cluster expansion (O(log n) via indexed searches)
  4. Queue operations (proposing changes via engine_queue)

Implementation Steps

  1. Import db_access_layer module
  2. Replace Excel reading with database queries:
    # Old: reading from Excel
    # New: using database
    from db_access_layer import search_word, PhoneticSearchOperations
    
    results = search_word("example")
    similar = PhoneticSearchOperations.find_similar_words("example", conn)
    
  3. Update cluster expansion to use word_fingerprints index
  4. Route all proposed changes through engine_queue instead of direct writes

Performance Benefits

  • Phonetic search: O(log n) vs O(n) linear scan
  • Cluster expansion: Instant via pre-computed fingerprints
  • Memory usage: Database queries vs loading entire Excel file
  • Concurrency: Multiple engine sessions can query simultaneously

Troubleshooting

Common Issues

1. "Excel file not found"

Solution: Ensure USLaP_Final_Data_Consolidated_Master_v3.xlsx is in the workplace root directory.

2. "Schema file not found"

Solution: Ensure create_uslap_db.sql is in Code_files/ directory.

3. Foreign key constraint violations

Solution: The migration script temporarily disables foreign keys during insertion. If errors persist:

sqlite3 "Code_files/uslap_lattice.db" "PRAGMA foreign_key_check"

Check for circular dependencies or missing reference data.

4. UDF not registered

Solution: Ensure extract_consonants() is registered before any inserts. The migration script does this automatically. For custom connections, use:

from migrate_to_sqlite import extract_consonants
conn.create_function("extract_consonants", 1, extract_consonants)

5. Low row counts after migration

Possible causes:

  • Excel sheet names don't match expected names
  • Header row detection failed
  • Data is in unexpected format

Debug: Run the migration script with additional print statements or examine the Excel sheet structure.

Recovery Procedures

Database Corruption

If the database becomes corrupted:

  1. Restore from latest backup in Code_files/backups/
  2. Or re-run migration script (creates fresh database)

Failed Migration

If migration fails mid-process:

  1. Script automatically rolls back transaction
  2. Old database remains unchanged (if backup was created)
  3. Check error output for specific issue
  4. Fix underlying problem (Excel format, disk space, permissions)
  5. Re-run migration

Data Loss Prevention

  • Migration always creates timestamped backups
  • Excel master file remains unchanged (read-only during migration)
  • Transaction rollback on any error

Performance Optimization

Index Usage

The schema includes optimal indexes for:

  1. Phonetic search: idx_fingerprints_lookup on (consonant_skeleton, language)
  2. Root-based queries: idx_entries_root on entries(root_id)
  3. Score sorting: idx_entries_score on entries(score DESC)
  4. Full-text search: FTS5 virtual table entries_fts

Query Patterns

For best performance:

  • Use word_fingerprints for phonetic searches
  • Use entries_fts for full-text keyword searches
  • Use cluster_cache for repeated expansion of same roots
  • Limit results with LIMIT clauses for UI responsiveness

Scaling Considerations

The schema is designed for 1M+ searchable objects. At that scale:

  • Consider increasing SQLite cache size: PRAGMA cache_size = -2000; (2GB)
  • Use WAL mode for concurrent reads: PRAGMA journal_mode = WAL;
  • Regular VACUUM to maintain performance: VACUUM;

Appendix A: Migration Script Details

Sheets Migrated

The script reads from these structured sheets only:

  • A1_ENTRIES – English entries
  • A1_Π—ΠΠŸΠ˜Π‘Π˜ – Russian entries
  • PERSIAN_A1_MADΔ€KHIL – Persian entries
  • BITIG_A1_ENTRIES – ORIG2/Turkic entries
  • CHILD_SCHEMA – Operational intelligence
  • A4_DERIVATIVES – Word forms
  • A5_CROSS_REFS – Entry relationships
  • A3_QURAN_REFS – Verse references
  • M1_PHONETIC_SHIFTS – Phonetic mechanism
  • M2_DETECTION_PATTERNS – Detection patterns
  • M4_NETWORKS – Network definitions
  • M3_SCHOLARS – Scholar biographies
  • M5_QUR_VERIFICATION – Qur'an verification

Sheets Skipped

  • EXCEL_DATA_CONSOLIDATED – Echo sheet (not a primary source)
  • Various protocol, correction, and warning sheets

Data Flow

  1. Read Excel sheet β†’ Clean column names β†’ Map to schema β†’ Insert
  2. Extract unique roots from entries β†’ Create roots table entries
  3. Triggers automatically create word_fingerprints
  4. Verify foreign key integrity
  5. Commit transaction

Appendix B: Database Access Layer

The db_access_layer.py module provides:

  • DatabaseConnection – Context manager for connections
  • EntryOperations, RootOperations – CRUD operations
  • PhoneticSearchOperations – O(log n) cluster expansion
  • EngineQueueOperations, SessionOperations – Engine control
  • AnalyticsOperations – Statistics and analysis
  • High-level API functions: search_word(), add_new_entry(), run_engine_session()

Example Usage

from db_access_layer import search_word, get_connection, EntryOperations

# High-level search
results = search_word("example")
print(f"Found {len(results['exact_matches'])} exact matches")

# Direct operations
with get_connection() as conn:
    entries = EntryOperations.get_high_score_entries(conn, min_score=8)
    print(f"High-score entries: {len(entries)}")

Support

For issues with migration:

  1. Check error messages in console output
  2. Verify file permissions and locations
  3. Ensure Excel file isn't open in another program
  4. Check Python version and openpyxl installation

To report bugs or request enhancements, use the project's issue tracking system.

ΩˆΩŽΨ§Ω„Ω„ΩŽΩ‘Ω‡Ω Ψ£ΩŽΨΉΩ’Ω„ΩŽΩ…Ω