rubentsui committed (verified)
Commit d883c53 · 1 Parent(s): de7f1df

Upload folder using huggingface_hub
.gitattributes CHANGED
@@ -33,3 +33,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
 *.zip filter=lfs diff=lfs merge=lfs -text
 *.zst filter=lfs diff=lfs merge=lfs -text
 *tfevents* filter=lfs diff=lfs merge=lfs -text
+twl_concordancer.db filter=lfs diff=lfs merge=lfs -text
AGENT.md ADDED
@@ -0,0 +1,1014 @@
1
+ # TWL Concordancer — Build Instructions
2
+
3
+ ## Project Overview
4
+
5
+ Build a bilingual concordancer web app for Taiwan Law (TWL) that allows users to search for keywords/regex across aligned Chinese-English legal texts, with expandable context at paragraph and article levels.
6
+
7
+ ## Directory Structure
8
+
9
+ ```
+ /Users/rubentsui/NLP/TWL/
+ ├── 2026-03-27/
+ │   ├── TWL.2026-03-27.json            # Raw bilingual corpus (123MB)
+ │   └── TWL.2026-03-27.aligned.json    # Aligned corpus (to be generated)
+ ├── .opencode/skills/twl-align-corpus/
+ │   ├── SKILL.md
+ │   └── scripts/
+ │       └── align_corpus.py            # Alignment script
+ └── twl-concordancer/
+     ├── concordancer.py                # Streamlit app (to be built)
+     ├── db.py                          # Database query layer (to be built)
+     ├── build_db.py                    # Ingestion script (to be built)
+     └── twl_concordancer.db            # SQLite database (to be generated)
+ ```
24
+
25
+ ---
26
+
27
+ ## Step 1: Understand the Input Corpus
28
+
29
+ ### Source File: `2026-03-27/TWL.2026-03-27.json`
30
+
31
+ **Structure:**
32
+ ```json
+ {
+   "metadata": {
+     "release_date": "2026-3-27",
+     "description": "Taiwan Law Chinese-English Bilingual Corpus"
+   },
+   "laws": [...],
+   "orders": [...]
+ }
+ ```
42
+
43
+ **Each law/order has this structure:**
44
+ ```json
+ {
+   "law_id": "A0000001",
+   "zh": {
+     "LawLevel": "憲法",
+     "LawName": "中華民國憲法",
+     "LawURL": "https://law.moj.gov.tw/LawClass/LawAll.aspx?pcode=A0000001",
+     "LawCategory": "憲法",
+     "LawModifiedDate": "19470101",
+     "LawEffectiveDate": "",
+     "LawHasEngVersion": "Y",
+     "EngLawName": "Constitution of the Republic of China (Taiwan)",
+     "LawForeword": "中華民國國民大會受全體國民之付託...",
+     "LawArticles": [
+       {
+         "ArticleType": "C",
+         "ArticleNo": "",
+         "ArticleContent": "第 一 章 總綱"
+       },
+       {
+         "ArticleType": "A",
+         "ArticleNo": "第 1 條",
+         "ArticleContent": "中華民國基於三民主義,為民有民治民享之民主共和國。"
+       }
+     ]
+   },
+   "en": {
+     "LawLevel": "憲法",
+     "EngLawName": "Constitution of the Republic of China (Taiwan)",
+     "LawName": "中華民國憲法",
+     "EngLawURL": "https://law.moj.gov.tw/ENG/LawClass/LawAll.aspx?pcode=A0000001",
+     "EngLawForeword": "The National Assembly of the Republic of China...",
+     "EngLawArticles": [
+       {
+         "EngArticleType": "C",
+         "EngArticleNo": "",
+         "EngArticleContent": " Chapter I. General Provisions"
+       },
+       {
+         "EngArticleType": "A",
+         "EngArticleNo": "Article 1",
+         "EngArticleContent": "The Republic of China, founded on the Three Principles of the People, shall be a democratic republic of the people, to be governed by the people and for the people."
+       }
+     ]
+   }
+ }
+ ```
91
+
92
+ **Key details:**
93
+ - `ArticleType`: `"A"` = numbered article, `"C"` = chapter/section header
94
+ - `ArticleContent` may contain `\n` for paragraph breaks within an article
95
+ - English articles use `EngArticleNo` (e.g., `"Article 1"`) vs Chinese `ArticleNo` (e.g., `"第 1 條"`)
96
+ - Chapter headings use Chinese numerals (e.g., 第 一 章), while `ArticleNo` uses Arabic numerals (e.g., 第 1 條)
97
+ - The `zh` object has `LawArticles` key; the `en` object has `EngLawArticles` key
98
+ - Some English articles may be empty or missing
99
+
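The asymmetric key names (`LawArticles` vs `EngLawArticles`, `ArticleType` vs `EngArticleType`) are easy to trip over. A minimal sketch of walking one raw entry, with sample values abridged from the schema above:

```python
# Minimal law entry mirroring the raw corpus schema (values abridged).
law = {
    "law_id": "A0000001",
    "zh": {"LawArticles": [
        {"ArticleType": "C", "ArticleNo": "", "ArticleContent": "第 一 章 總綱"},
        {"ArticleType": "A", "ArticleNo": "第 1 條", "ArticleContent": "中華民國基於三民主義..."},
    ]},
    "en": {"EngLawArticles": [
        {"EngArticleType": "C", "EngArticleNo": "", "EngArticleContent": " Chapter I. General Provisions"},
        {"EngArticleType": "A", "EngArticleNo": "Article 1", "EngArticleContent": "The Republic of China..."},
    ]},
}

def numbered_articles(side, prefix=""):
    """Keep only type 'A' rows; 'C' rows are chapter/section headers."""
    return [a for a in side.get(f"{prefix}LawArticles", [])
            if a[f"{prefix}ArticleType"] == "A"]

zh_arts = numbered_articles(law["zh"])          # zh side: LawArticles
en_arts = numbered_articles(law["en"], "Eng")   # en side: EngLawArticles
print(len(zh_arts), len(en_arts))  # chapter headers excluded on both sides
```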
100
+ ---
101
+
102
+ ## Step 2: Align the Corpus
103
+
104
+ ### Prerequisites
105
+ - Python 3.10+ with: `vecalign`, `sentence-transformers`, `torch`, `numpy`, `regex`, `sentence_splitter`, `tqdm`
106
+ - VecAlignMulti2 directory at `/Users/rubentsui/NLP/VecAlignMulti2/` containing `dp_utils.py`, `vecalign.py`, `score.py`
107
+ - GPU: CUDA (recommended for full corpus) or MPS (Apple Silicon)
108
+
109
+ ### Run Alignment
110
+
111
+ ```bash
+ python3 .opencode/skills/twl-align-corpus/scripts/align_corpus.py \
+     2026-03-27/TWL.2026-03-27.json \
+     2026-03-27/TWL.2026-03-27.aligned.json \
+     --device cuda \
+     --alignment-max-size 7 \
+     --model LaBSE
+ ```
119
+
120
+ **Options:**
121
+ - `--device`: `cuda`, `mps`, or `cpu` (auto-detects if not specified)
122
+ - `--resume`: Resume from existing output file if interrupted
123
+ - `--law-ids A0000001 A0000002`: Process specific laws only
124
+ - `--model`: Embedding model (default: `LaBSE`)
125
+
126
+ ### Alignment Output Structure
127
+
128
+ The aligned JSON has this structure:
129
+ ```json
+ {
+   "metadata": {
+     "release_date": "2026-3-27",
+     "description": "Taiwan Law Aligned Corpus (article, paragraph, sentence)",
+     "alignment_model": "LaBSE",
+     "alignment_max_size": 7,
+     "device": "cuda",
+     "created_at": "2026-04-03T..."
+   },
+   "laws": [
+     {
+       "law_id": "A0000001",
+       "zh_name": "中華民國憲法",
+       "en_name": "Constitution of the Republic of China (Taiwan)",
+       "category": "憲法",
+       "foreword_alignment": [
+         {
+           "score": 0.1658,
+           "zh_indices": [0],
+           "en_indices": [0],
+           "zh": "中華民國國民大會受全體國民之付託...",
+           "en": "The National Assembly of the Republic of China..."
+         }
+       ],
+       "articles": [
+         {
+           "article_no_zh": "第 1 條",
+           "article_no_en": "Article 1",
+           "article_type": "A",
+           "paragraphs": [
+             {
+               "zh_indices": [0],
+               "en_indices": [0],
+               "zh": "中華民國基於三民主義...",
+               "en": "The Republic of China, founded on...",
+               "score": 0.278,
+               "sentences": [
+                 {
+                   "score": 0.278,
+                   "zh_indices": [0],
+                   "en_indices": [0],
+                   "zh": "中華民國基於三民主義,為民有民治民享之民主共和國。",
+                   "en": "The Republic of China, founded on the Three Principles of the People, shall be a democratic republic of the people, to be governed by the people and for the people."
+                 }
+               ]
+             }
+           ]
+         }
+       ]
+     }
+   ],
+   "orders": [...]
+ }
+ ```
184
+
185
+ **Notes:**
186
+ - Articles are matched by number (第 X 條 ↔ Article X); unmatched articles are paired sequentially
187
+ - Paragraphs are created by splitting `ArticleContent` on `\n`
188
+ - Sentences within each paragraph are aligned using vecalign with LaBSE embeddings
189
+ - `score` is cosine distance (lower = better alignment)
190
+
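Since `score` is a distance, low-confidence pairs can be dropped with a simple threshold. A hedged sketch (the 0.6 cutoff is an illustrative choice, not a value from the pipeline):

```python
# One paragraph alignment mirroring the output schema above (values abridged).
paragraph = {
    "sentences": [
        {"score": 0.278, "zh": "中華民國基於三民主義...", "en": "The Republic of China..."},
        {"score": 0.912, "zh": "（對齊可疑的句子）", "en": "(a dubious pairing)"},
    ]
}

MAX_SCORE = 0.6  # illustrative cutoff; lower score = better alignment

good_pairs = [s for s in paragraph["sentences"] if s["score"] <= MAX_SCORE]
print(len(good_pairs))  # the 0.912 pairing is dropped
```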
191
+ ---
192
+
193
+ ## Step 3: Build SQLite Database
194
+
195
+ ### Schema (Lean — No Duplication)
196
+
197
+ Text is stored **only at sentence level**. Paragraph and article text are reconstructed from sentences on demand.
198
+
199
+ ```sql
+ CREATE TABLE laws (
+     id INTEGER PRIMARY KEY,
+     law_id TEXT UNIQUE,          -- A0000001
+     type TEXT,                   -- 'law' or 'order'
+     zh_name TEXT,
+     en_name TEXT,
+     category TEXT,
+     effective_date TEXT,
+     modified_date TEXT
+ );
+
+ CREATE TABLE articles (
+     id INTEGER PRIMARY KEY,
+     law_id INTEGER REFERENCES laws(id),
+     article_no_zh TEXT,
+     article_no_en TEXT,
+     article_type TEXT,           -- 'A' = article, 'C' = chapter
+     article_index INTEGER        -- ordering within law
+ );
+
+ CREATE TABLE paragraphs (
+     id INTEGER PRIMARY KEY,
+     article_id INTEGER REFERENCES articles(id),
+     law_id INTEGER REFERENCES laws(id),
+     paragraph_index INTEGER      -- ordering within article
+ );
+
+ CREATE TABLE sentences (
+     id INTEGER PRIMARY KEY,
+     paragraph_id INTEGER REFERENCES paragraphs(id),
+     article_id INTEGER REFERENCES articles(id),
+     law_id INTEGER REFERENCES laws(id),
+     zh_text TEXT NOT NULL,
+     en_text TEXT NOT NULL,
+     alignment_score REAL,
+     zh_sentence_idx INTEGER,     -- position within zh paragraph
+     en_sentence_idx INTEGER      -- position within en paragraph
+ );
+
+ -- Indexes for context expansion
+ CREATE INDEX idx_sentences_paragraph ON sentences(paragraph_id);
+ CREATE INDEX idx_sentences_article ON sentences(article_id);
+ CREATE INDEX idx_sentences_law ON sentences(law_id);
+ CREATE INDEX idx_paragraphs_article ON paragraphs(article_id);
+ CREATE INDEX idx_articles_law ON articles(law_id);
+ ```
246
+
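Reconstruction on demand is just an ordered query plus a join in Python. A self-contained sketch of the idea, using a toy two-sentence paragraph rather than the real database:

```python
import sqlite3

# In-memory toy database with the sentences table from the schema above.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE sentences (
    id INTEGER PRIMARY KEY, paragraph_id INTEGER,
    zh_text TEXT, zh_sentence_idx INTEGER)""")
conn.executemany(
    "INSERT INTO sentences (paragraph_id, zh_text, zh_sentence_idx) VALUES (?, ?, ?)",
    [(1, "第二句。", 1), (1, "第一句。", 0)],  # deliberately inserted out of order
)

# Rebuild the paragraph text: order by sentence index, then concatenate.
rows = conn.execute(
    "SELECT zh_text FROM sentences WHERE paragraph_id = ? ORDER BY zh_sentence_idx",
    (1,),
).fetchall()
para_text = "".join(r[0] for r in rows)
print(para_text)  # → 第一句。第二句。
```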
247
+ ### Ingestion Script (`build_db.py`)
248
+
249
+ ```python
+ #!/usr/bin/env python3
+ """Build SQLite database from TWL aligned corpus JSON."""
+
+ import json
+ import sqlite3
+ import argparse
+
+ DDL = """
+ CREATE TABLE IF NOT EXISTS laws (
+     id INTEGER PRIMARY KEY,
+     law_id TEXT UNIQUE,
+     type TEXT,
+     zh_name TEXT,
+     en_name TEXT,
+     category TEXT,
+     effective_date TEXT,
+     modified_date TEXT
+ );
+
+ CREATE TABLE IF NOT EXISTS articles (
+     id INTEGER PRIMARY KEY,
+     law_id INTEGER REFERENCES laws(id),
+     article_no_zh TEXT,
+     article_no_en TEXT,
+     article_type TEXT,
+     article_index INTEGER
+ );
+
+ CREATE TABLE IF NOT EXISTS paragraphs (
+     id INTEGER PRIMARY KEY,
+     article_id INTEGER REFERENCES articles(id),
+     law_id INTEGER REFERENCES laws(id),
+     paragraph_index INTEGER
+ );
+
+ CREATE TABLE IF NOT EXISTS sentences (
+     id INTEGER PRIMARY KEY,
+     paragraph_id INTEGER REFERENCES paragraphs(id),
+     article_id INTEGER REFERENCES articles(id),
+     law_id INTEGER REFERENCES laws(id),
+     zh_text TEXT NOT NULL,
+     en_text TEXT NOT NULL,
+     alignment_score REAL,
+     zh_sentence_idx INTEGER,
+     en_sentence_idx INTEGER
+ );
+
+ CREATE INDEX IF NOT EXISTS idx_sentences_paragraph ON sentences(paragraph_id);
+ CREATE INDEX IF NOT EXISTS idx_sentences_article ON sentences(article_id);
+ CREATE INDEX IF NOT EXISTS idx_sentences_law ON sentences(law_id);
+ CREATE INDEX IF NOT EXISTS idx_paragraphs_article ON paragraphs(article_id);
+ CREATE INDEX IF NOT EXISTS idx_articles_law ON articles(law_id);
+ """
+
+
+ def build_db(input_file, db_file, append=False):
+     conn = sqlite3.connect(db_file)
+     conn.execute("PRAGMA journal_mode=WAL")
+     conn.execute("PRAGMA synchronous=NORMAL")
+     conn.execute("PRAGMA foreign_keys=ON")
+     cur = conn.cursor()
+
+     if not append:
+         cur.executescript("""
+             DROP TABLE IF EXISTS sentences;
+             DROP TABLE IF EXISTS paragraphs;
+             DROP TABLE IF EXISTS articles;
+             DROP TABLE IF EXISTS laws;
+         """)
+
+     cur.executescript(DDL)
+
+     with open(input_file, encoding="utf-8") as f:
+         corpus = json.load(f)
+
+     law_count = article_count = paragraph_count = sentence_count = 0
+
+     for entry_type, key in [("law", "laws"), ("order", "orders")]:
+         items = corpus.get(key, [])
+         for item in items:
+             law_id = item.get("law_id") or item.get("order_id", "")
+             zh_name = item.get("zh_name", "")
+             en_name = item.get("en_name", "")
+             category = item.get("category", "")
+
+             try:
+                 cur.execute(
+                     "INSERT INTO laws (law_id, type, zh_name, en_name, category) VALUES (?, ?, ?, ?, ?)",
+                     (law_id, entry_type, zh_name, en_name, category),
+                 )
+             except sqlite3.IntegrityError:
+                 if append:
+                     cur.execute("SELECT id FROM laws WHERE law_id = ?", (law_id,))
+                     row = cur.fetchone()
+                     if row:
+                         cur.execute(
+                             "UPDATE laws SET zh_name=?, en_name=?, category=? WHERE id=?",
+                             (zh_name, en_name, category, row[0]),
+                         )
+                         law_db_id = row[0]
+                     else:
+                         continue
+                 else:
+                     raise
+             else:
+                 law_db_id = cur.lastrowid
+
+             law_count += 1
+             articles = item.get("articles", [])
+
+             for art_idx, art in enumerate(articles):
+                 article_no_zh = art.get("article_no_zh", "")
+                 article_no_en = art.get("article_no_en", "")
+                 article_type = art.get("article_type", "")
+
+                 cur.execute(
+                     "INSERT INTO articles (law_id, article_no_zh, article_no_en, article_type, article_index) VALUES (?, ?, ?, ?, ?)",
+                     (law_db_id, article_no_zh, article_no_en, article_type, art_idx),
+                 )
+                 article_db_id = cur.lastrowid
+                 article_count += 1
+
+                 paragraphs = art.get("paragraphs", [])
+                 for para_idx, para in enumerate(paragraphs):
+                     cur.execute(
+                         "INSERT INTO paragraphs (article_id, law_id, paragraph_index) VALUES (?, ?, ?)",
+                         (article_db_id, law_db_id, para_idx),
+                     )
+                     paragraph_db_id = cur.lastrowid
+                     paragraph_count += 1
+
+                     aligned_sentences = para.get("sentences", [])
+                     for sent_idx, sent in enumerate(aligned_sentences):
+                         zh_text = sent.get("zh", "")
+                         en_text = sent.get("en", "")
+                         score = sent.get("score", 0.0)
+                         zh_sidx = sent.get("zh_indices", [sent_idx])[0] if sent.get("zh_indices") else sent_idx
+                         en_sidx = sent.get("en_indices", [sent_idx])[0] if sent.get("en_indices") else sent_idx
+
+                         if zh_text or en_text:
+                             cur.execute(
+                                 "INSERT INTO sentences (paragraph_id, article_id, law_id, zh_text, en_text, alignment_score, zh_sentence_idx, en_sentence_idx) VALUES (?, ?, ?, ?, ?, ?, ?, ?)",
+                                 (paragraph_db_id, article_db_id, law_db_id, zh_text, en_text, score, zh_sidx, en_sidx),
+                             )
+                             sentence_count += 1
+
+             if law_count % 100 == 0:
+                 conn.commit()
+                 print(f"  Processed {law_count} {entry_type}s, {sentence_count} sentences...")
+
+     conn.commit()
+     print(f"\nDatabase built: {db_file}")
+     print(f"  Laws/orders: {law_count}")
+     print(f"  Articles:    {article_count}")
+     print(f"  Paragraphs:  {paragraph_count}")
+     print(f"  Sentences:   {sentence_count}")
+     conn.close()
+
+
+ if __name__ == "__main__":
+     parser = argparse.ArgumentParser()
+     parser.add_argument("input_file")
+     parser.add_argument("db_file")
+     parser.add_argument("--append", action="store_true")
+     args = parser.parse_args()
+     build_db(args.input_file, args.db_file, append=args.append)
+ ```
419
+
420
+ **Run:**
421
+ ```bash
+ python3 twl-concordancer/build_db.py \
+     2026-03-27/TWL.2026-03-27.aligned.json \
+     twl-concordancer/twl_concordancer.db
+ ```
426
+
427
+ ---
428
+
429
+ ## Step 4: Database Query Layer (`db.py`)
430
+
431
+ ```python
+ """Database query layer for TWL Concordancer."""
+
+ import sqlite3
+ import regex
+ from pathlib import Path
+
+ DEFAULT_DB = Path(__file__).parent / "twl_concordancer.db"
+
+
+ def get_conn(db_path=None):
+     if db_path is None:
+         db_path = DEFAULT_DB
+     conn = sqlite3.connect(str(db_path))
+     conn.row_factory = sqlite3.Row
+     conn.execute("PRAGMA journal_mode=WAL")
+     return conn
+
+
+ def search_sentences(conn, query, use_regex=False, law_id=None, article_no=None,
+                      max_score=None, lang="both", limit=100, offset=0):
+     if use_regex:
+         return _search_regex(conn, query, law_id, article_no, max_score, lang, limit, offset)
+     return _search_like(conn, query, law_id, article_no, max_score, lang, limit, offset)
+
+
+ def _search_like(conn, query, law_id, article_no, max_score, lang, limit, offset):
+     """LIKE-based search. Works for both Chinese and English."""
+     terms = query.strip()
+     if not terms:
+         return [], 0
+
+     where = []
+     params = []
+
+     if lang == "zh":
+         where.append("s.zh_text LIKE ?")
+         params.append(f"%{terms}%")
+     elif lang == "en":
+         where.append("s.en_text LIKE ?")
+         params.append(f"%{terms}%")
+     else:
+         where.append("(s.zh_text LIKE ? OR s.en_text LIKE ?)")
+         params.extend([f"%{terms}%", f"%{terms}%"])
+
+     if max_score is not None:
+         where.append("s.alignment_score <= ?")
+         params.append(max_score)
+     if law_id:
+         where.append("l.law_id = ?")
+         params.append(law_id)
+     if article_no:
+         pat = f"%{article_no}%"
+         where.append("(a.article_no_zh LIKE ? OR a.article_no_en LIKE ?)")
+         params.extend([pat, pat])
+
+     where_clause = " AND ".join(where)
+
+     count_sql = f"""
+         SELECT count(*) FROM sentences s
+         JOIN laws l ON s.law_id = l.id
+         JOIN articles a ON s.article_id = a.id
+         WHERE {where_clause}
+     """
+
+     data_sql = f"""
+         SELECT s.id, s.zh_text, s.en_text, s.alignment_score,
+                l.law_id, l.zh_name, l.en_name, l.type,
+                a.article_no_zh, a.article_no_en, a.article_type,
+                s.zh_sentence_idx, s.en_sentence_idx
+         FROM sentences s
+         JOIN laws l ON s.law_id = l.id
+         JOIN articles a ON s.article_id = a.id
+         WHERE {where_clause}
+         ORDER BY s.alignment_score
+         LIMIT ? OFFSET ?
+     """
+     data_params = params + [limit, offset]
+
+     cur = conn.execute(count_sql, params)
+     total = cur.fetchone()[0]
+
+     cur = conn.execute(data_sql, data_params)
+     rows = [dict(r) for r in cur.fetchall()]
+     return rows, total
+
+
+ def _search_regex(conn, pattern, law_id, article_no, max_score, lang, limit, offset):
+     """Regex search using a custom REGEXP function."""
+     try:
+         regex.compile(pattern)
+     except regex.error:
+         return [], 0
+
+     where = "1=1"
+     params = []
+
+     if lang == "zh":
+         where += " AND s.zh_text REGEXP ?"
+         params.append(pattern)
+     elif lang == "en":
+         where += " AND s.en_text REGEXP ?"
+         params.append(pattern)
+     else:
+         where += " AND (s.zh_text REGEXP ? OR s.en_text REGEXP ?)"
+         params.extend([pattern, pattern])
+
+     if max_score is not None:
+         where += " AND s.alignment_score <= ?"
+         params.append(max_score)
+     if law_id:
+         where += " AND l.law_id = ?"
+         params.append(law_id)
+     if article_no:
+         where += " AND (a.article_no_zh LIKE ? OR a.article_no_en LIKE ?)"
+         pat = f"%{article_no}%"
+         params.extend([pat, pat])
+
+     count_sql = f"""
+         SELECT count(*) FROM sentences s
+         JOIN laws l ON s.law_id = l.id
+         JOIN articles a ON s.article_id = a.id
+         WHERE {where}
+     """
+
+     data_sql = f"""
+         SELECT s.id, s.zh_text, s.en_text, s.alignment_score,
+                l.law_id, l.zh_name, l.en_name, l.type,
+                a.article_no_zh, a.article_no_en, a.article_type,
+                s.zh_sentence_idx, s.en_sentence_idx
+         FROM sentences s
+         JOIN laws l ON s.law_id = l.id
+         JOIN articles a ON s.article_id = a.id
+         WHERE {where}
+         ORDER BY s.alignment_score
+         LIMIT ? OFFSET ?
+     """
+     data_params = params + [limit, offset]
+
+     # Register REGEXP function (SQLite calls it as REGEXP(pattern, text))
+     conn.create_function("REGEXP", 2, lambda pat, txt: bool(regex.search(pat, txt)) if txt else False)
+
+     cur = conn.execute(count_sql, params)
+     total = cur.fetchone()[0]
+
+     cur = conn.execute(data_sql, data_params)
+     rows = [dict(r) for r in cur.fetchall()]
+     return rows, total
+
+
+ def get_paragraph(conn, sentence_id):
+     """Get all sentences in the same paragraph as the given sentence.
+     Returns BOTH zh and en text for every sentence."""
+     cur = conn.execute(
+         """
+         SELECT s.id, s.zh_text, s.en_text, s.alignment_score,
+                s.zh_sentence_idx, s.en_sentence_idx,
+                p.paragraph_index, a.article_no_zh, a.article_no_en
+         FROM sentences s
+         JOIN paragraphs p ON s.paragraph_id = p.id
+         JOIN articles a ON s.article_id = a.id
+         WHERE s.paragraph_id = (SELECT paragraph_id FROM sentences WHERE id = ?)
+         ORDER BY s.zh_sentence_idx
+         """,
+         (sentence_id,),
+     )
+     rows = [dict(r) for r in cur.fetchall()]
+     if not rows:
+         return None
+     return {
+         "paragraph_index": rows[0]["paragraph_index"],
+         "article_no_zh": rows[0]["article_no_zh"],
+         "article_no_en": rows[0]["article_no_en"],
+         "sentences": rows,
+     }
+
+
+ def get_article(conn, sentence_id):
+     """Get all sentences in the same article as the given sentence.
+     Returns BOTH zh and en text for every sentence, grouped by paragraph."""
+     cur = conn.execute(
+         """
+         SELECT s.id, s.zh_text, s.en_text, s.alignment_score,
+                s.zh_sentence_idx, s.en_sentence_idx,
+                p.paragraph_index, p.id as paragraph_id,
+                a.article_no_zh, a.article_no_en, a.article_type
+         FROM sentences s
+         JOIN paragraphs p ON s.paragraph_id = p.id
+         JOIN articles a ON s.article_id = a.id
+         WHERE s.article_id = (SELECT article_id FROM sentences WHERE id = ?)
+         ORDER BY p.paragraph_index, s.zh_sentence_idx
+         """,
+         (sentence_id,),
+     )
+     rows = [dict(r) for r in cur.fetchall()]
+     if not rows:
+         return None
+
+     paragraphs = {}
+     for r in rows:
+         pidx = r["paragraph_index"]
+         if pidx not in paragraphs:
+             paragraphs[pidx] = {
+                 "paragraph_index": pidx,
+                 "paragraph_id": r["paragraph_id"],
+                 "sentences": [],
+             }
+         paragraphs[pidx]["sentences"].append({
+             "id": r["id"],
+             "zh_text": r["zh_text"],
+             "en_text": r["en_text"],
+             "alignment_score": r["alignment_score"],
+             "zh_sentence_idx": r["zh_sentence_idx"],
+             "en_sentence_idx": r["en_sentence_idx"],
+         })
+
+     return {
+         "article_no_zh": rows[0]["article_no_zh"],
+         "article_no_en": rows[0]["article_no_en"],
+         "article_type": rows[0]["article_type"],
+         "paragraphs": [paragraphs[k] for k in sorted(paragraphs.keys())],
+     }
+
+
+ def list_laws(conn, law_type=None, category=None):
+     where = []
+     params = []
+     if law_type:
+         where.append("type = ?")
+         params.append(law_type)
+     if category:
+         where.append("category = ?")
+         params.append(category)
+
+     where_clause = " AND ".join(where) if where else "1=1"
+     cur = conn.execute(
+         f"SELECT law_id, type, zh_name, en_name, category FROM laws WHERE {where_clause} ORDER BY law_id",
+         params,
+     )
+     return [dict(r) for r in cur.fetchall()]
+
+
+ def list_categories(conn, law_type=None):
+     if law_type:
+         where = "WHERE type = ? AND category IS NOT NULL AND category != ''"
+         params = [law_type]
+     else:
+         where = "WHERE category IS NOT NULL AND category != ''"
+         params = []
+     cur = conn.execute(
+         f"SELECT DISTINCT category FROM laws {where} ORDER BY category",
+         params,
+     )
+     return [r["category"] for r in cur.fetchall()]
+
+
+ def get_law_articles(conn, law_id):
+     cur = conn.execute(
+         """
+         SELECT article_no_zh, article_no_en, article_type, article_index
+         FROM articles
+         WHERE law_id = (SELECT id FROM laws WHERE law_id = ?)
+         ORDER BY article_index
+         """,
+         (law_id,),
+     )
+     return [dict(r) for r in cur.fetchall()]
+ ```
699
+
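The REGEXP trick in `_search_regex` works because SQLite rewrites `X REGEXP Y` as a call to a user-registered function `REGEXP(Y, X)`, pattern first. A dependency-free sketch of just that mechanism (stdlib `re` stands in for the `regex` library used above):

```python
import re
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite has no built-in REGEXP; register one (the pattern arrives first).
conn.create_function(
    "REGEXP", 2,
    lambda pat, txt: bool(re.search(pat, txt)) if txt else False,
)
conn.execute("CREATE TABLE t (en_text TEXT)")
conn.executemany("INSERT INTO t VALUES (?)",
                 [("Article 1 of the Act",), ("General Provisions",)])

rows = conn.execute(
    "SELECT en_text FROM t WHERE en_text REGEXP ?", (r"Article \d+",)
).fetchall()
print(rows)  # → [('Article 1 of the Act',)]
```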
700
+ ---
701
+
702
+ ## Step 5: Streamlit App (`concordancer.py`)
703
+
704
+ ### Critical Requirements
705
+
706
+ 1. **Use `regex` library, NOT `re`** — The `re` module has issues with some Unicode patterns. Use `import regex` instead.
707
+
708
+ 2. **Use `st.html()` NOT `st.markdown(..., unsafe_allow_html=True)`** for expanded paragraph/article views. `st.markdown()` strips text before the first HTML tag. `st.html()` renders raw HTML correctly.
709
+
710
+ 3. **Always HTML-escape text before inserting into HTML** — Use `html.escape()` to prevent `<`, `>`, `&` in legal text from being interpreted as HTML.
711
+
712
+ 4. **Match detection uses sentence `id`, NOT indices** — Compare `s["id"] == sid` to highlight the matched sentence in expanded views. Do NOT use `zh_sentence_idx` comparison — it breaks when searching English-only.
713
+
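Requirements 2 and 3 boil down to: build the HTML yourself, but escape the legal text first. A tiny sketch of the escape-before-wrap order (`render_sentence` is a hypothetical helper for illustration, not part of the app below):

```python
import html

def render_sentence(text, css_class="zh-text"):
    # Escape FIRST so literal <, >, & in statute text survive as text.
    return f'<div class="{css_class}">{html.escape(text)}</div>'

out = render_sentence("罰金 < 新臺幣五百元 & 以上")
print(out)  # → <div class="zh-text">罰金 &lt; 新臺幣五百元 &amp; 以上</div>
```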
714
+ ### Full Implementation
715
+
716
+ ```python
+ """TWL Bilingual Concordancer — Streamlit App."""
+
+ import html
+ import regex
+ import streamlit as st
+ import db
+ from pathlib import Path
+
+ st.set_page_config(page_title="TWL Concordancer", page_icon="⚖️", layout="wide")
+
+ DB_PATH = Path(__file__).parent / "twl_concordancer.db"
+
+ st.markdown(
+     """
+     <style>
+     .zh-text, .en-text {
+         line-height: 1.8;
+         padding: 6px 10px;
+         border-radius: 4px;
+         font-size: 0.95rem;
+         white-space: pre-wrap;
+     }
+     .zh-text {
+         font-family: "Noto Serif TC", "Source Han Serif TC", "MingLiU", serif;
+     }
+     .en-text {
+         font-family: "Source Sans 3", "Segoe UI", sans-serif;
+     }
+     .zh-text.match, .en-text.match {
+         background-color: #fff9e6;
+         border-left: 3px solid #f5c518;
+     }
+     </style>
+     """,
+     unsafe_allow_html=True,
+ )
+
+
+ def _highlight(text, query):
+     """Highlight query in text. Always HTML-escapes text first."""
+     if not text or not query:
+         return html.escape(text or "")
+     escaped_text = html.escape(text)
+     escaped_query = html.escape(query)
+     return regex.sub(
+         rf"({regex.escape(escaped_query)})",
+         r'<mark style="background:#fef08a;padding:1px 2px;border-radius:2px">\1</mark>',
+         escaped_text,
+         flags=regex.IGNORECASE | regex.V1,
+     )
+
+
+ if "page" not in st.session_state:
+     st.session_state.page = 0
+ if "expanded" not in st.session_state:
+     st.session_state.expanded = {}
+
+ st.title("⚖️ TWL Bilingual Concordancer")
+ st.caption("Taiwan Law Chinese–English Aligned Corpus")
+
+ conn = db.get_conn(DB_PATH)
+
+ # ── Sidebar Filters ──────────────────────────────────────────
+ with st.sidebar:
+     st.header("Filters")
+
+     law_types = ["All", "law", "order"]
+     selected_type = st.selectbox("Type", law_types, index=0)
+     type_filter = None if selected_type == "All" else selected_type
+
+     categories = db.list_categories(conn, type_filter)
+     selected_cat = st.selectbox("Category", ["All"] + categories, index=0)
+     cat_filter = None if selected_cat == "All" else selected_cat
+
+     laws = db.list_laws(conn, type_filter, cat_filter)
+     law_options = ["All"] + [f"{l['law_id']} — {l['zh_name']}" for l in laws]
+     selected_law = st.selectbox("Law / Order", law_options, index=0)
+     law_id_filter = None
+     if selected_law != "All":
+         law_id_filter = selected_law.split(" — ")[0]
+
+     max_score = st.slider("Max alignment score (lower = better)", 0.0, 1.0, 1.0, 0.05)
+     max_score_filter = None if max_score >= 1.0 else max_score
+
+     lang_options = ["Both", "Chinese only", "English only"]
+     selected_lang = st.radio("Search language", lang_options, index=0)
+     lang_filter = {"Both": "both", "Chinese only": "zh", "English only": "en"}[selected_lang]
+
+     st.divider()
+     st.caption(f"{len(laws)} laws/orders in database")
+
+ st.divider()
+
+ # ── Search Bar ───────────────────────────────────────────────
+ col1, col2, col3 = st.columns([4, 1, 1])
+ with col1:
+     query = st.text_input("Search", placeholder="Enter keyword or regex…", key="search_query")
+ with col2:
+     use_regex = st.checkbox("Regex", value=False)
+ with col3:
+     per_page = st.selectbox("Per page", [10, 25, 50, 100], index=1)
+
+ # Article filter (only shown when a law is selected)
+ article_filter = None
+ if law_id_filter:
+     articles = db.get_law_articles(conn, law_id_filter)
+     art_options = ["All"] + [
+         f"{a['article_no_zh']} / {a['article_no_en']}"
+         for a in articles if a["article_no_zh"] or a["article_no_en"]
+     ]
+     selected_art = st.selectbox("Article", art_options, index=0)
+     if selected_art != "All":
+         article_filter = selected_art.split(" / ")[0]
+
+ # ── Search & Display ─────────────────────────────────────────
+ if query:
+     results, total = db.search_sentences(
+         conn, query, use_regex=use_regex,
+         law_id=law_id_filter, article_no=article_filter,
+         max_score=max_score_filter, lang=lang_filter,
+         limit=per_page, offset=st.session_state.page * per_page,
+     )
+
+     st.write(f"**{total}** sentence pair{'s' if total != 1 else ''} found")
+
+     # Pagination
+     if total > per_page:
+         total_pages = (total + per_page - 1) // per_page
+         cols = st.columns([1, 4, 1])
+         with cols[0]:
+             if st.button("← Previous", disabled=st.session_state.page == 0, use_container_width=True):
+                 st.session_state.page -= 1
+                 st.session_state.expanded = {}
+                 st.rerun()  # rerun so the new offset takes effect immediately
+         with cols[1]:
+             st.write(f"Page {st.session_state.page + 1} of {total_pages}")
+         with cols[2]:
+             if st.button("Next →", disabled=(st.session_state.page + 1) * per_page >= total, use_container_width=True):
+                 st.session_state.page += 1
+                 st.session_state.expanded = {}
+                 st.rerun()  # rerun so the new offset takes effect immediately
+
+     for row in results:
+         sid = row["id"]
+         score = row["alignment_score"]
+         law_ref = f"{row['law_id']} {row['zh_name']}"
+         art_ref = f"{row['article_no_zh']} / {row['article_no_en']}" if row["article_no_zh"] or row["article_no_en"] else ""
+
+         with st.container(border=True):
+             st.markdown(f"`{law_ref}`{' | ' + art_ref if art_ref else ''} | Score: `{score:.4f}`")
+
+             # Sentence-level display (always bilingual)
+             zh_text = row["zh_text"] or ""
+             en_text = row["en_text"] or ""
+
+             if query and not use_regex:
+                 zh_display = _highlight(zh_text, query)
+                 en_display = _highlight(en_text, query)
+             else:
+                 zh_display = html.escape(zh_text)
+                 en_display = html.escape(en_text)
+
+             col_zh, col_en = st.columns(2)
+             with col_zh:
+                 st.html(f'<div class="zh-text">{zh_display}</div>')
+             with col_en:
+                 st.html(f'<div class="en-text">{en_display}</div>')
+
+             # Expand buttons
+             exp_col1, exp_col2 = st.columns(2)
+             with exp_col1:
+                 if st.button("▸ Paragraph", key=f"para_{sid}"):
+                     st.session_state.expanded[f"para_{sid}"] = not st.session_state.expanded.get(f"para_{sid}", False)
+             with exp_col2:
+                 if st.button("▸ Article", key=f"art_{sid}"):
+                     st.session_state.expanded[f"art_{sid}"] = not st.session_state.expanded.get(f"art_{sid}", False)
+
+             # ── Expanded Paragraph View ──────────────────────
+             if st.session_state.expanded.get(f"para_{sid}"):
+                 para = db.get_paragraph(conn, sid)
+                 if para:
+                     with st.container(border=True):
+                         st.markdown(f"**Paragraph** ({para['article_no_zh']} / {para['article_no_en']})")
+                         for s in para["sentences"]:
+                             p_zh = s["zh_text"] or ""
+                             p_en = s["en_text"] or ""
+                             if query and not use_regex:
+                                 p_zh_display = _highlight(p_zh, query)
+                                 p_en_display = _highlight(p_en, query)
+                             else:
+                                 p_zh_display = html.escape(p_zh)
+                                 p_en_display = html.escape(p_en)
+
+                             # CRITICAL: Match by sentence id, NOT by index
+                             is_match = s["id"] == sid
+                             marker = "◀" if is_match else ""
+
+                             c1, c2 = st.columns(2)
+                             with c1:
+                                 st.html(f'<div class="zh-text{" match" if is_match else ""}>{marker} {p_zh_display}</div>')
+                             with c2:
+                                 st.html(f'<div class="en-text{" match" if is_match else ""}>{marker} {p_en_display}</div>')
+
+             # ── Expanded Article View ────────────────────────
+             if st.session_state.expanded.get(f"art_{sid}"):
+                 article = db.get_article(conn, sid)
+                 if article:
+                     with st.container(border=True):
+                         st.markdown(f"**Article** ({article['article_no_zh']} / {article['article_no_en']})")
+                         for pi, para in enumerate(article["paragraphs"]):
+                             st.markdown(f"*Paragraph {pi + 1}*")
+                             for s in para["sentences"]:
+                                 a_zh = s["zh_text"] or ""
+                                 a_en = s["en_text"] or ""
+                                 if query and not use_regex:
+                                     a_zh_display = _highlight(a_zh, query)
+                                     a_en_display = _highlight(a_en, query)
+                                 else:
+                                     a_zh_display = html.escape(a_zh)
934
+ a_en_display = html.escape(a_en)
935
+
936
+ # CRITICAL: Match by sentence id, NOT by index
937
+ is_match = s["id"] == sid
938
+ marker = "◀" if is_match else ""
939
+
940
+ c1, c2 = st.columns(2)
941
+ with c1:
942
+ st.html(f'<div class="zh-text{" match" if is_match else ""}>{marker} {a_zh_display}</div>')
943
+ with c2:
944
+ st.html(f'<div class="en-text{" match" if is_match else ""}>{marker} {a_en_display}</div>')
945
+
946
+ elif not query:
947
+ st.info("Enter a search term above to find aligned sentence pairs.")
948
+
949
+ st.divider()
950
+ st.caption("TWL Concordancer | Taiwan Law Bilingual Corpus")
951
+
952
+ conn.close()
953
+ ```
954
+
+ ---
+
+ ## Step 6: Run
+
+ ```bash
+ # 1. Align the corpus (on a CUDA machine)
+ python3 .opencode/skills/twl-align-corpus/scripts/align_corpus.py \
+     2026-03-27/TWL.2026-03-27.json \
+     2026-03-27/TWL.2026-03-27.aligned.json \
+     --device cuda --resume
+
+ # 2. Build the SQLite database
+ python3 twl-concordancer/build_db.py \
+     2026-03-27/TWL.2026-03-27.aligned.json \
+     twl-concordancer/twl_concordancer.db
+
+ # 3. Launch the Streamlit app
+ streamlit run twl-concordancer/concordancer.py
+ ```
+
+ ---
+
+ ## Known Pitfalls & Fixes
+
+ ### 1. `list_categories` SQL syntax error
+ **Bug:** `SELECT DISTINCT category FROM laws AND category IS NOT NULL` — missing `WHERE` when `law_type` is `None`.
+ **Fix:** Always include the `WHERE` clause.
+
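The fix can be sketched as a query builder that always emits the `WHERE` clause and only appends optional filters as `AND` terms. (The function and argument names below are illustrative, not the actual `db.py` signature.)

```python
def list_categories_sql(law_type=None):
    # Always start with WHERE; optional filters are appended as AND terms,
    # so the statement stays syntactically valid when law_type is None.
    sql = "SELECT DISTINCT category FROM laws WHERE category IS NOT NULL"
    params = []
    if law_type is not None:
        sql += " AND type = ?"
        params.append(law_type)
    return sql, params
```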
983
+ ### 2. Paragraph/Article shows only text AFTER the match
+ **Bug:** `st.markdown(..., unsafe_allow_html=True)` strips text before the first HTML tag (`<mark>`).
+ **Fix:** Use `st.html()` instead of `st.markdown()` for all HTML rendering.
+
+ ### 3. Match highlighting highlights wrong sentence
+ **Bug:** Comparing `s["zh_sentence_idx"] == row["zh_sentence_idx"]` fails when searching English-only (the matched sentence's Chinese index may differ).
+ **Fix:** Compare by sentence `id`: `is_match = s["id"] == sid`.
+
+ ### 4. HTML injection from legal text
+ **Bug:** Legal text contains `<`, `>`, `&` characters that break HTML rendering.
+ **Fix:** Always `html.escape()` text BEFORE inserting `<mark>` tags.
+
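The safe escape-then-mark ordering can be shown in a minimal, dependency-free sketch (stdlib `re` is used here only to keep the example self-contained; the app itself uses the `regex` module, see Pitfall 6):

```python
import html
import re

def highlight(text, query):
    # Escape the raw text FIRST, so "<", ">", "&" in the legal text become
    # entities; only then inject the <mark> tags around the escaped query.
    escaped = html.escape(text)
    needle = re.escape(html.escape(query))
    return re.sub(f"({needle})", r"<mark>\1</mark>", escaped, flags=re.IGNORECASE)
```

Doing it in the other order (mark first, escape second) would turn the `<mark>` tags themselves into literal `&lt;mark&gt;` text.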
995
+ ### 5. `get_article()` missing `id` key
+ **Bug:** The dict builder in `get_article()` dropped the `s.id` column from the SQL result.
+ **Fix:** Include `"id": r["id"]` in the sentence dict.
+
+ ### 6. Regex library
+ **Requirement:** Use `regex` (PyPI), NOT `re` (stdlib). The `regex` module has fuller Unicode support (e.g. `\p{Han}` property classes), and its `regex.V1` flag opts into the newer, better-defined matching behavior (nested character sets, stricter escape handling).
+
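A quick illustration of the difference (assumes the `regex` package is installed):

```python
import re
import regex

text = "第三條 (Article 3)"

# Unicode property classes such as \p{Han} work in the third-party
# `regex` module...
match = regex.search(r"\p{Han}+", text, flags=regex.V1)
assert match.group() == "第三條"

# ...but are a parse error in the stdlib `re` module.
try:
    re.compile(r"\p{Han}+")
    stdlib_accepts = True
except re.error:
    stdlib_accepts = False
assert stdlib_accepts is False
```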
1002
+ ---
+
+ ## Expected Database Size
+
+ | Metric | Full Corpus (est.) |
+ |--------|-------------------|
+ | Laws | ~967 |
+ | Orders | ~2,193 |
+ | Articles | ~50,000+ |
+ | Sentences | ~300,000+ |
+ | DB Size | ~200-400MB |
+
+ A `LIKE '%…%'` search cannot use the B-tree indexes directly, but a full scan of ~300K rows still completes in well under 100ms on typical hardware; the schema's indexes speed up the law/article joins and filters. No FTS5 needed.
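That throughput claim is easy to sanity-check with a disposable in-memory benchmark over synthetic rows (timings vary by machine, and this is not the production schema):

```python
import sqlite3
import time

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sentences (id INTEGER PRIMARY KEY, en_text TEXT)")
conn.executemany(
    "INSERT INTO sentences (en_text) VALUES (?)",
    ((f"sentence {i} about {'tax' if i % 10 == 0 else 'trade'} law",)
     for i in range(300_000)),
)
conn.commit()

start = time.perf_counter()
hits = conn.execute(
    "SELECT count(*) FROM sentences WHERE en_text LIKE ?", ("%tax%",)
).fetchone()[0]
elapsed_ms = (time.perf_counter() - start) * 1000
print(f"{hits} hits in {elapsed_ms:.1f} ms")
```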
README.md CHANGED
@@ -1,20 +1,23 @@
 ---
-title: TWL
-emoji: 🚀
-colorFrom: red
-colorTo: red
-sdk: docker
-app_port: 8501
-tags:
-- streamlit
+title: TWL Concordancer
+emoji: ⚖️
+colorFrom: blue
+colorTo: gray
+sdk: streamlit
+sdk_version: 1.39.0
+app_file: app.py
 pinned: false
-short_description: Taiwan Law Database
 license: mit
 ---
 
-# Welcome to Streamlit!
+# TWL Concordancer
 
-Edit `/src/streamlit_app.py` to customize this app to your heart's desire. :heart:
+Streamlit bilingual concordancer for the Taiwan Law Chinese-English aligned corpus.
 
-If you have any questions, checkout our [documentation](https://docs.streamlit.io) and [community
-forums](https://discuss.streamlit.io).
+This Space expects the SQLite database file `twl_concordancer.db` to be present at the repository root.
+
+## Local Development
+
+```bash
+streamlit run app.py
+```
__pycache__/app.cpython-314.pyc ADDED
Binary file (243 Bytes).
 
__pycache__/concordancer.cpython-314.pyc ADDED
Binary file (18.4 kB).
 
__pycache__/db.cpython-314.pyc ADDED
Binary file (14.9 kB).
 
app.py ADDED
@@ -0,0 +1,3 @@
+ """Hugging Face Spaces entrypoint for the TWL concordancer."""
+
+ import concordancer  # noqa: F401
build_db.py ADDED
@@ -0,0 +1,214 @@
1
+ #!/usr/bin/env python3
2
+ """
3
+ Build SQLite database from TWL aligned corpus JSON.
4
+
5
+ Usage:
6
+ python3 build_db.py 2026-03-27/TWL.2026-03-27.aligned.json twl_concordancer.db
7
+ python3 build_db.py 2026-03-27/A0000001.aligned.json twl_concordancer.db --append
8
+ """
9
+
10
+ import sys
11
+ import json
12
+ import sqlite3
13
+ import argparse
14
+ from pathlib import Path
15
+
16
+
17
+ DDL = """
18
+ CREATE TABLE IF NOT EXISTS laws (
19
+ id INTEGER PRIMARY KEY,
20
+ law_id TEXT UNIQUE,
21
+ type TEXT,
22
+ zh_name TEXT,
23
+ en_name TEXT,
24
+ category TEXT,
25
+ effective_date TEXT,
26
+ modified_date TEXT
27
+ );
28
+
29
+ CREATE TABLE IF NOT EXISTS articles (
30
+ id INTEGER PRIMARY KEY,
31
+ law_id INTEGER REFERENCES laws(id),
32
+ article_no_zh TEXT,
33
+ article_no_en TEXT,
34
+ article_type TEXT,
35
+ article_index INTEGER
36
+ );
37
+
38
+ CREATE TABLE IF NOT EXISTS paragraphs (
39
+ id INTEGER PRIMARY KEY,
40
+ article_id INTEGER REFERENCES articles(id),
41
+ law_id INTEGER REFERENCES laws(id),
42
+ paragraph_index INTEGER
43
+ );
44
+
45
+ CREATE TABLE IF NOT EXISTS sentences (
46
+ id INTEGER PRIMARY KEY,
47
+ paragraph_id INTEGER REFERENCES paragraphs(id),
48
+ article_id INTEGER REFERENCES articles(id),
49
+ law_id INTEGER REFERENCES laws(id),
50
+ zh_text TEXT NOT NULL,
51
+ en_text TEXT NOT NULL,
52
+ alignment_score REAL,
53
+ zh_sentence_idx INTEGER,
54
+ en_sentence_idx INTEGER
55
+ );
56
+
57
+ CREATE INDEX IF NOT EXISTS idx_sentences_paragraph ON sentences(paragraph_id);
58
+ CREATE INDEX IF NOT EXISTS idx_sentences_article ON sentences(article_id);
59
+ CREATE INDEX IF NOT EXISTS idx_sentences_law ON sentences(law_id);
60
+ CREATE INDEX IF NOT EXISTS idx_paragraphs_article ON paragraphs(article_id);
61
+ CREATE INDEX IF NOT EXISTS idx_articles_law ON articles(law_id);
62
+ """
63
+
64
+ TRIGGERS = ""
65
+
66
+
67
+ def build_db(input_file, db_file, append=False):
68
+ conn = sqlite3.connect(db_file)
69
+ conn.execute("PRAGMA journal_mode=WAL")
70
+ conn.execute("PRAGMA synchronous=NORMAL")
71
+ conn.execute("PRAGMA foreign_keys=ON")
72
+ cur = conn.cursor()
73
+
74
+ if not append:
75
+ cur.executescript(
76
+ "DROP TABLE IF EXISTS sentences; DROP TABLE IF EXISTS paragraphs; DROP TABLE IF EXISTS articles; DROP TABLE IF EXISTS laws; DROP TABLE IF EXISTS sentence_fts;"
77
+ )
78
+
79
+ cur.executescript(DDL)
80
+ cur.executescript(TRIGGERS)
81
+
82
+ with open(input_file, encoding="utf-8") as f:
83
+ corpus = json.load(f)
84
+
85
+ law_count = 0
86
+ article_count = 0
87
+ paragraph_count = 0
88
+ sentence_count = 0
89
+
90
+ for entry_type, key in [("law", "laws"), ("order", "orders")]:
91
+ items = corpus.get(key, [])
92
+ for item in items:
93
+ law_id = item.get("law_id") or item.get("order_id", "")
94
+ zh_name = item.get("zh_name", "")
95
+ en_name = item.get("en_name", "")
96
+ category = item.get("category", "")
97
+
98
+ try:
99
+ cur.execute(
100
+ "INSERT INTO laws (law_id, type, zh_name, en_name, category) VALUES (?, ?, ?, ?, ?)",
101
+ (law_id, entry_type, zh_name, en_name, category),
102
+ )
103
+ except sqlite3.IntegrityError:
104
+ if append:
105
+ cur.execute("SELECT id FROM laws WHERE law_id = ?", (law_id,))
106
+ row = cur.fetchone()
107
+ if row:
108
+ cur.execute(
109
+ "UPDATE laws SET zh_name = ?, en_name = ?, category = ? WHERE id = ?",
110
+ (zh_name, en_name, category, row[0]),
111
+ )
112
+ law_db_id = row[0]
113
+ else:
114
+ cur.execute(
115
+ "INSERT INTO laws (law_id, type, zh_name, en_name, category) VALUES (?, ?, ?, ?, ?)",
116
+ (law_id, entry_type, zh_name, en_name, category),
117
+ )
118
+ law_db_id = cur.lastrowid
119
+ else:
120
+ raise
121
+ else:
122
+ law_db_id = cur.lastrowid
123
+
124
+ law_count += 1
125
+
126
+ articles = item.get("articles", [])
127
+ for art_idx, art in enumerate(articles):
128
+ article_no_zh = art.get("article_no_zh", "")
129
+ article_no_en = art.get("article_no_en", "")
130
+ article_type = art.get("article_type", "")
131
+
132
+ cur.execute(
133
+ "INSERT INTO articles (law_id, article_no_zh, article_no_en, article_type, article_index) VALUES (?, ?, ?, ?, ?)",
134
+ (law_db_id, article_no_zh, article_no_en, article_type, art_idx),
135
+ )
136
+ article_db_id = cur.lastrowid
137
+ article_count += 1
138
+
139
+ paragraphs = art.get("paragraphs", [])
140
+ for para_idx, para in enumerate(paragraphs):
141
+ cur.execute(
142
+ "INSERT INTO paragraphs (article_id, law_id, paragraph_index) VALUES (?, ?, ?)",
143
+ (article_db_id, law_db_id, para_idx),
144
+ )
145
+ paragraph_db_id = cur.lastrowid
146
+ paragraph_count += 1
147
+
148
+ aligned_sentences = para.get("sentences", [])
149
+ for sent_idx, sent in enumerate(aligned_sentences):
150
+ zh_text = sent.get("zh", "")
151
+ en_text = sent.get("en", "")
152
+ score = sent.get("score", 0.0)
153
+ zh_sidx = (
154
+ sent.get("zh_indices", [sent_idx])[0]
155
+ if sent.get("zh_indices")
156
+ else sent_idx
157
+ )
158
+ en_sidx = (
159
+ sent.get("en_indices", [sent_idx])[0]
160
+ if sent.get("en_indices")
161
+ else sent_idx
162
+ )
163
+
164
+ if zh_text or en_text:
165
+ cur.execute(
166
+ "INSERT INTO sentences (paragraph_id, article_id, law_id, zh_text, en_text, alignment_score, zh_sentence_idx, en_sentence_idx) VALUES (?, ?, ?, ?, ?, ?, ?, ?)",
167
+ (
168
+ paragraph_db_id,
169
+ article_db_id,
170
+ law_db_id,
171
+ zh_text,
172
+ en_text,
173
+ score,
174
+ zh_sidx,
175
+ en_sidx,
176
+ ),
177
+ )
178
+ sentence_count += 1
179
+
180
+ if law_count % 100 == 0:
181
+ conn.commit()
182
+ print(
183
+ f" Processed {law_count} {entry_type}s, {sentence_count} sentences..."
184
+ )
185
+
186
+ conn.commit()
187
+
188
+ print(f"\nDatabase built: {db_file}")
189
+ print(f" Laws/orders: {law_count}")
190
+ print(f" Articles: {article_count}")
191
+ print(f" Paragraphs: {paragraph_count}")
192
+ print(f" Sentences: {sentence_count}")
193
+
194
+ cur.execute("SELECT count(*) FROM sentences")
195
+ sent_count = cur.fetchone()[0]
196
+ print(f" Sentence records: {sent_count}")
197
+
198
+ conn.close()
199
+
200
+
201
+ if __name__ == "__main__":
202
+ parser = argparse.ArgumentParser(
203
+ description="Build SQLite concordancer DB from TWL aligned JSON"
204
+ )
205
+ parser.add_argument("input_file", help="Path to aligned corpus JSON")
206
+ parser.add_argument("db_file", help="Output SQLite database path")
207
+ parser.add_argument(
208
+ "--append",
209
+ action="store_true",
210
+ help="Append to existing database instead of replacing",
211
+ )
212
+ args = parser.parse_args()
213
+
214
+ build_db(args.input_file, args.db_file, append=args.append)
concordancer.py ADDED
@@ -0,0 +1,369 @@
1
+ """TWL Bilingual Concordancer — Streamlit App."""
2
+
3
+ import html
4
+ from pathlib import Path
5
+
6
+ import regex
7
+ import streamlit as st
8
+
9
+ import db
10
+
11
+ st.set_page_config(page_title="TWL Concordancer", page_icon="⚖️", layout="wide")
12
+
13
+ DB_PATH = Path(__file__).parent / "twl_concordancer.db"
14
+
15
+ st.markdown(
16
+ """
17
+ <style>
18
+ section[data-testid="stSidebar"] > div:first-child {
19
+ top: 0;
20
+ height: 100vh;
21
+ }
22
+ section[data-testid="stSidebar"] div[data-testid="stSidebarContent"] {
23
+ padding-top: 0rem !important;
24
+ margin-top: 0rem !important;
25
+ }
26
+ section[data-testid="stSidebar"] div[data-testid="stSidebarHeader"] {
27
+ min-height: 0rem !important;
28
+ height: 0.25rem !important;
29
+ padding-top: 0rem !important;
30
+ padding-bottom: 0rem !important;
31
+ margin-bottom: 0rem !important;
32
+ }
33
+ section[data-testid="stSidebar"] div[data-testid="stSidebarUserContent"] {
34
+ padding-top: 0rem !important;
35
+ margin-top: 0rem !important;
36
+ }
37
+ div[data-testid="stMainBlockContainer"],
38
+ .main .block-container {
39
+ padding-top: 1.2rem;
40
+ }
41
+ .zh-text, .en-text {
42
+ line-height: 1.8;
43
+ padding: 6px 10px;
44
+ border-radius: 4px;
45
+ white-space: pre-wrap;
46
+ color: var(--text-color);
47
+ word-break: break-word;
48
+ }
49
+ .zh-text {
50
+ font-family: "Microsoft JhengHei", "Source Han Sans", "Noto Sans CJK TC Regular", "Hiragino Sans CNS", "LantingHei TC", "Source Han Serif", sans-serif;
51
+ font-size: 20px;
52
+ letter-spacing: 0.01em;
53
+ }
54
+ .en-text {
55
+ font-family: "Source Pro", Consolas, "LingWai TC", Menlo, "Courier New", Arial, sans-serif;
56
+ font-size: 15px;
57
+ }
58
+ .zh-text.match, .en-text.match {
59
+ background-color: color-mix(in srgb, var(--primary-color) 12%, var(--background-color));
60
+ border-left: 3px solid #f5c518;
61
+ color: var(--text-color);
62
+ }
63
+ mark {
64
+ background: #f5c518;
65
+ color: #111827;
66
+ padding: 1px 2px;
67
+ border-radius: 2px;
68
+ }
69
+ </style>
70
+ """,
71
+ unsafe_allow_html=True,
72
+ )
73
+
74
+
75
+ def _highlight(text, query):
76
+ if not text or not query:
77
+ return html.escape(text)
78
+ escaped_text = html.escape(text)
79
+ escaped_query = html.escape(query)
80
+ return regex.sub(
81
+ rf"({regex.escape(escaped_query)})",
82
+ r'<mark style="background:#fef08a;padding:1px 2px;border-radius:2px">\1</mark>',
83
+ escaped_text,
84
+ flags=regex.IGNORECASE | regex.V1,
85
+ )
86
+
87
+
88
+ def _highlight_regex(text, pattern):
89
+ if not text or not pattern:
90
+ return html.escape(text)
91
+ try:
92
+ compiled = regex.compile(pattern, flags=regex.IGNORECASE | regex.V1)
93
+ except regex.error:
94
+ return html.escape(text)
95
+
96
+ parts = []
97
+ last_end = 0
98
+ for match in compiled.finditer(text):
99
+ start, end = match.span()
100
+ if start == end:
101
+ continue
102
+ parts.append(html.escape(text[last_end:start]))
103
+ parts.append(
104
+ f'<mark style="background:#fef08a;padding:1px 2px;border-radius:2px">{html.escape(text[start:end])}</mark>'
105
+ )
106
+ last_end = end
107
+ parts.append(html.escape(text[last_end:]))
108
+ return "".join(parts)
109
+
110
+
111
+ def _join_sentences(sentences, lang):
112
+ parts = [
113
+ (s.get("zh_text", "") if lang == "zh" else s.get("en_text", "")).strip()
114
+ for s in sentences
115
+ ]
116
+ parts = [p for p in parts if p]
117
+ if not parts:
118
+ return ""
119
+ if lang == "zh":
120
+ return "".join(parts)
121
+ return " ".join(parts)
122
+
123
+
124
+ if "page" not in st.session_state:
125
+ st.session_state.page = 0
126
+ if "expanded" not in st.session_state:
127
+ st.session_state.expanded = {}
128
+ if "search_signature" not in st.session_state:
129
+ st.session_state.search_signature = None
130
+
131
+ st.title("⚖️ 全國法規資料庫 華英檢索系統")
132
+ st.caption("Taiwan Law (TWL) Chinese–English Aligned Corpus - Bilingual Concordancer")
133
+
134
+ conn = db.get_conn(DB_PATH)
135
+
136
+ with st.sidebar:
137
+ st.header("搜尋範圍過濾 Filters")
138
+
139
+ law_types = ["All", "law", "order"]
140
+ selected_type = st.selectbox("法規/命令 Type", law_types, index=0)
141
+ type_filter = None if selected_type == "All" else selected_type
142
+
143
+ categories = db.list_categories(conn, type_filter)
144
+ selected_cat = st.selectbox("機關 Category", ["All"] + categories, index=0)
145
+ cat_filter = None if selected_cat == "All" else selected_cat
146
+
147
+ laws = db.list_laws(conn, type_filter, cat_filter)
148
+ law_options = ["All"] + [f"{l['law_id']} — {l['zh_name']}" for l in laws]
149
+ selected_law = st.selectbox("單一法規/命令 Law/Order", law_options, index=0)
150
+ law_id_filter = None
151
+ if selected_law != "All":
152
+ law_id_filter = selected_law.split(" — ")[0]
153
+
154
+ max_score = st.slider("Max alignment score (lower = better)", 0.0, 1.0, 1.0, 0.05)
155
+ max_score_filter = None if max_score >= 1.0 else max_score
156
+
157
+ lang_options = {
158
+ "中英 / Both": "both",
159
+ "中文 / Chinese": "zh",
160
+ "英文 / English": "en",
161
+ }
162
+ selected_lang = st.radio("搜尋語言 / Search language", list(lang_options), index=0)
163
+ lang_filter = lang_options[selected_lang]
164
+
165
+ st.divider()
166
+ st.caption(f"{len(laws)} laws/orders in database")
167
+
168
+ with st.form("search_form", clear_on_submit=False):
169
+ col1, col2, col3 = st.columns([4, 1, 1])
170
+ with col1:
171
+ query = st.text_input(
172
+ "Search", placeholder="Enter keyword or regex…", key="search_query"
173
+ )
174
+ with col2:
175
+ use_regex = st.checkbox("Regex", value=False)
176
+ submitted = st.form_submit_button("Submit", use_container_width=True)
177
+ with col3:
178
+ per_page = st.selectbox("Per page", [10, 25, 50, 100], index=1)
179
+
180
+ article_filter = None
181
+ if law_id_filter:
182
+ articles = db.get_law_articles(conn, law_id_filter)
183
+ art_options = ["All"] + [
184
+ f"{a['article_no_zh']} / {a['article_no_en']}"
185
+ for a in articles
186
+ if a["article_no_zh"] or a["article_no_en"]
187
+ ]
188
+ selected_art = st.selectbox("Article", art_options, index=0)
189
+ if selected_art != "All":
190
+ parts = selected_art.split(" / ")
191
+ article_filter = parts[0] if parts else None
192
+
193
+ search_signature = (
194
+ query,
195
+ use_regex,
196
+ per_page,
197
+ cat_filter,
198
+ law_id_filter,
199
+ article_filter,
200
+ max_score_filter,
201
+ lang_filter,
202
+ )
203
+ if st.session_state.search_signature != search_signature:
204
+ st.session_state.page = 0
205
+ st.session_state.expanded = {}
206
+ st.session_state.search_signature = search_signature
207
+
208
+ if query:
209
+ results, total = db.search_sentences(
210
+ conn,
211
+ query,
212
+ use_regex=use_regex,
213
+ law_id=law_id_filter,
214
+ category=cat_filter,
215
+ article_no=article_filter,
216
+ max_score=max_score_filter,
217
+ lang=lang_filter,
218
+ limit=per_page,
219
+ offset=st.session_state.page * per_page,
220
+ )
221
+
222
+ st.write(f"**{total}** sentence pair{'s' if total != 1 else ''} found")
223
+
224
+ if total > per_page:
225
+ total_pages = (total + per_page - 1) // per_page
226
+ cols = st.columns([1, 4, 1])
227
+ with cols[0]:
228
+ if st.button(
229
+ "← Previous",
230
+ disabled=st.session_state.page == 0,
231
+ use_container_width=True,
232
+ ):
233
+ st.session_state.page -= 1
234
+ st.session_state.expanded = {}
235
+ st.rerun()
236
+ with cols[1]:
237
+ st.write(f"Page {st.session_state.page + 1} of {total_pages}")
238
+ with cols[2]:
239
+ if st.button(
240
+ "Next →",
241
+ disabled=(st.session_state.page + 1) * per_page >= total,
242
+ use_container_width=True,
243
+ ):
244
+ st.session_state.page += 1
245
+ st.session_state.expanded = {}
246
+ st.rerun()
247
+
248
+ for row in results:
249
+ sid = row["id"]
250
+ score = row["alignment_score"]
251
+ law_ref = f"{row['law_id']} {row['zh_name']}"
252
+ art_ref = (
253
+ f"{row['article_no_zh']} / {row['article_no_en']}"
254
+ if row["article_no_zh"] or row["article_no_en"]
255
+ else ""
256
+ )
257
+
258
+ with st.container(border=True):
259
+ st.markdown(
260
+ f"`{law_ref}`{' | ' + art_ref if art_ref else ''} | Score: `{score:.4f}`"
261
+ )
262
+
263
+ zh_text = row["zh_text"] or ""
264
+ en_text = row["en_text"] or ""
265
+
266
+ if query and use_regex:
267
+ zh_display = _highlight_regex(zh_text, query)
268
+ en_display = _highlight_regex(en_text, query)
269
+ elif query and not use_regex:
270
+ zh_display = _highlight(zh_text, query)
271
+ en_display = _highlight(en_text, query)
272
+ else:
273
+ zh_display = html.escape(zh_text)
274
+ en_display = html.escape(en_text)
275
+
276
+ col_zh, col_en = st.columns([2, 3])
277
+ with col_zh:
278
+ st.markdown(
279
+ f'<div class="zh-text">{zh_display}</div>', unsafe_allow_html=True
280
+ )
281
+ with col_en:
282
+ st.markdown(
283
+ f'<div class="en-text">{en_display}</div>', unsafe_allow_html=True
284
+ )
285
+
286
+ exp_col1, exp_col2 = st.columns(2)
287
+ with exp_col1:
288
+ if st.button("▸ Paragraph", key=f"para_{sid}"):
289
+ st.session_state.expanded[
290
+ f"para_{sid}"
291
+ ] = not st.session_state.expanded.get(f"para_{sid}", False)
292
+ with exp_col2:
293
+ if st.button("▸ Article", key=f"art_{sid}"):
294
+ st.session_state.expanded[
295
+ f"art_{sid}"
296
+ ] = not st.session_state.expanded.get(f"art_{sid}", False)
297
+
298
+ if st.session_state.expanded.get(f"para_{sid}"):
299
+ para = db.get_paragraph(conn, sid)
300
+ if para:
301
+ with st.container(border=True):
302
+ st.markdown(
303
+ f"**Paragraph** ({para['article_no_zh']} / {para['article_no_en']})"
304
+ )
305
+ para_zh = _join_sentences(para["sentences"], "zh")
306
+ para_en = _join_sentences(para["sentences"], "en")
307
+ if query and use_regex:
308
+ para_zh_display = _highlight_regex(para_zh, query)
309
+ para_en_display = _highlight_regex(para_en, query)
310
+ elif query and not use_regex:
311
+ para_zh_display = _highlight(para_zh, query)
312
+ para_en_display = _highlight(para_en, query)
313
+ else:
314
+ para_zh_display = html.escape(para_zh)
315
+ para_en_display = html.escape(para_en)
316
+ c1, c2 = st.columns([2, 3])
317
+ with c1:
318
+ st.markdown(
319
+ f'<div class="zh-text match">{para_zh_display}</div>',
320
+ unsafe_allow_html=True,
321
+ )
322
+ with c2:
323
+ st.markdown(
324
+ f'<div class="en-text match">{para_en_display}</div>',
325
+ unsafe_allow_html=True,
326
+ )
327
+
328
+ if st.session_state.expanded.get(f"art_{sid}"):
329
+ article = db.get_article(conn, sid)
330
+ if article:
331
+ with st.container(border=True):
332
+ st.markdown(
333
+ f"**Article** ({article['article_no_zh']} / {article['article_no_en']})"
334
+ )
335
+ for pi, para in enumerate(article["paragraphs"]):
336
+ st.markdown(f"*Paragraph {pi + 1}*")
337
+ art_zh = _join_sentences(para["sentences"], "zh")
338
+ art_en = _join_sentences(para["sentences"], "en")
339
+ if query and use_regex:
340
+ art_zh_display = _highlight_regex(art_zh, query)
341
+ art_en_display = _highlight_regex(art_en, query)
342
+ elif query and not use_regex:
343
+ art_zh_display = _highlight(art_zh, query)
344
+ art_en_display = _highlight(art_en, query)
345
+ else:
346
+ art_zh_display = html.escape(art_zh)
347
+ art_en_display = html.escape(art_en)
348
+ contains_match = any(
349
+ s["id"] == sid for s in para["sentences"]
350
+ )
351
+ c1, c2 = st.columns([2, 3])
352
+ with c1:
353
+ st.markdown(
354
+ f'<div class="zh-text{" match" if contains_match else ""}">{art_zh_display}</div>',
355
+ unsafe_allow_html=True,
356
+ )
357
+ with c2:
358
+ st.markdown(
359
+ f'<div class="en-text{" match" if contains_match else ""}">{art_en_display}</div>',
360
+ unsafe_allow_html=True,
361
+ )
362
+
363
+ elif not query:
364
+ st.info("Enter a search term above to find aligned sentence pairs.")
365
+
366
+ st.divider()
367
+ st.caption("TWL Concordancer | Taiwan Law Bilingual Corpus")
368
+
369
+ conn.close()
db.py ADDED
@@ -0,0 +1,385 @@
1
+ """Database query layer for TWL Concordancer."""
2
+
3
+ import sqlite3
4
+ from pathlib import Path
5
+ from typing import Iterable
6
+
7
+ import regex
8
+
9
+ DEFAULT_DB = Path(__file__).parent / "twl_concordancer.db"
10
+
11
+
12
+ def get_conn(db_path=None):
13
+ if db_path is None:
14
+ db_path = DEFAULT_DB
15
+ conn = sqlite3.connect(str(db_path))
16
+ conn.row_factory = sqlite3.Row
17
+ conn.execute("PRAGMA busy_timeout=5000")
18
+ try:
19
+ conn.execute("PRAGMA journal_mode=WAL")
20
+ except sqlite3.OperationalError:
21
+ # Some hosted environments are fine with plain read access but do not
22
+ # allow switching journal mode.
23
+ pass
24
+ return conn
25
+
26
+
27
+ def _expand_category_options(categories: Iterable[str]):
28
+ expanded = set()
29
+ for category in categories:
30
+ category = (category or "").strip()
31
+ if not category:
32
+ continue
33
+ expanded.add(category)
34
+ parts = [part.strip() for part in category.split(">") if part.strip()]
35
+ if len(parts) <= 1:
36
+ continue
37
+ for depth in range(1, len(parts)):
38
+ expanded.add(">".join(parts[:depth]) + ">")
39
+ return sorted(expanded, key=_category_sort_key)
40
+
41
+
42
+ def _category_sort_key(category: str):
43
+ category = (category or "").strip()
44
+ is_repealed = category.startswith("廢止法規>")
45
+ return (1 if is_repealed else 0, category)
46
+
47
+
48
+ def _build_order_by(lang):
49
+ if lang == "zh":
50
+ return """
51
+ ORDER BY
52
+ CASE
53
+ WHEN s.en_text IS NOT NULL AND trim(s.en_text) != '' THEN 0
54
+ ELSE 1
55
+ END,
56
+ s.alignment_score,
57
+ s.id
58
+ """
59
+ if lang == "en":
60
+ return """
61
+ ORDER BY
62
+ CASE
63
+ WHEN s.zh_text IS NOT NULL AND trim(s.zh_text) != '' THEN 0
64
+ ELSE 1
65
+ END,
66
+ s.alignment_score,
67
+ s.id
68
+ """
69
+ return """
70
+ ORDER BY
71
+ CASE
72
+ WHEN s.zh_text IS NOT NULL AND trim(s.zh_text) != ''
73
+ AND s.en_text IS NOT NULL AND trim(s.en_text) != '' THEN 0
74
+ ELSE 1
75
+ END,
76
+ s.alignment_score,
77
+ s.id
78
+ """
79
+
80
+
81
+ def search_sentences(
82
+ conn,
83
+ query,
84
+ use_regex=False,
85
+ law_id=None,
86
+ category=None,
87
+ article_no=None,
88
+ max_score=None,
89
+ lang="both",
90
+ limit=100,
91
+ offset=0,
92
+ ):
93
+ if use_regex:
94
+ return _search_regex(
95
+ conn, query, law_id, category, article_no, max_score, lang, limit, offset
96
+ )
97
+ return _search_like(
98
+ conn, query, law_id, category, article_no, max_score, lang, limit, offset
99
+ )
100
+
101
+
102
+ def _search_like(
103
+ conn, query, law_id, category, article_no, max_score, lang, limit, offset
104
+ ):
105
+ terms = query.strip()
106
+ if not terms:
107
+ return [], 0
108
+
109
+ where = []
110
+ params = []
111
+
112
+ if lang == "zh":
113
+ where.append("s.zh_text LIKE ?")
114
+ params.append(f"%{terms}%")
115
+ elif lang == "en":
116
+ where.append("s.en_text LIKE ?")
117
+ params.append(f"%{terms}%")
118
+ else:
119
+ where.append("(s.zh_text LIKE ? OR s.en_text LIKE ?)")
120
+ params.extend([f"%{terms}%", f"%{terms}%"])
121
+
122
+ if max_score is not None:
123
+ where.append("s.alignment_score <= ?")
124
+ params.append(max_score)
125
+ if law_id:
126
+ where.append("l.law_id = ?")
127
+ params.append(law_id)
128
+ if category:
129
+ if category.endswith(">"):
130
+ where.append("l.category LIKE ?")
131
+ params.append(f"{category}%")
132
+ else:
133
+ where.append("l.category = ?")
134
+ params.append(category)
135
+ if article_no:
136
+ pat = f"%{article_no}%"
137
+ where.append("(a.article_no_zh LIKE ? OR a.article_no_en LIKE ?)")
138
+ params.extend([pat, pat])
139
+
140
+ where_clause = " AND ".join(where)
141
+
142
+ count_sql = f"""
143
+ SELECT count(*) FROM sentences s
144
+ JOIN laws l ON s.law_id = l.id
145
+ JOIN articles a ON s.article_id = a.id
146
+ WHERE {where_clause}
147
+ """
148
+
149
+ order_by = _build_order_by(lang)
150
+
151
+ data_sql = f"""
152
+ SELECT s.id, s.zh_text, s.en_text, s.alignment_score,
153
+ l.law_id, l.zh_name, l.en_name, l.type,
154
+ a.article_no_zh, a.article_no_en, a.article_type,
155
+ s.zh_sentence_idx, s.en_sentence_idx
156
+ FROM sentences s
157
+ JOIN laws l ON s.law_id = l.id
158
+ JOIN articles a ON s.article_id = a.id
159
+ WHERE {where_clause}
160
+ {order_by}
161
+ LIMIT ? OFFSET ?
162
+ """
163
+ data_params = params + [limit, offset]
164
+
165
+ cur = conn.execute(count_sql, params)
166
+ total = cur.fetchone()[0]
167
+
168
+ cur = conn.execute(data_sql, data_params)
169
+ rows = [dict(r) for r in cur.fetchall()]
170
+
171
+ return rows, total
172
+
173
+
174
+ def _search_regex(
+     conn, pattern, law_id, category, article_no, max_score, lang, limit, offset
+ ):
+     try:
+         regex.compile(pattern)
+     except regex.error:
+         return [], 0
+
+     where = "1=1"
+     params = []
+
+     if lang == "zh":
+         where += " AND s.zh_text REGEXP ?"
+         params.append(pattern)
+     elif lang == "en":
+         where += " AND s.en_text REGEXP ?"
+         params.append(pattern)
+     else:
+         where += " AND (s.zh_text REGEXP ? OR s.en_text REGEXP ?)"
+         params.extend([pattern, pattern])
+
+     if max_score is not None:
+         where += " AND s.alignment_score <= ?"
+         params.append(max_score)
+     if law_id:
+         where += " AND l.law_id = ?"
+         params.append(law_id)
+     if category:
+         if category.endswith(">"):
+             where += " AND l.category LIKE ?"
+             params.append(f"{category}%")
+         else:
+             where += " AND l.category = ?"
+             params.append(category)
+     if article_no:
+         where += " AND (a.article_no_zh LIKE ? OR a.article_no_en LIKE ?)"
+         pat = f"%{article_no}%"
+         params.extend([pat, pat])
+
+     count_sql = f"""
+         SELECT count(*) FROM sentences s
+         JOIN laws l ON s.law_id = l.id
+         JOIN articles a ON s.article_id = a.id
+         WHERE {where}
+     """
+
+     order_by = _build_order_by(lang)
+
+     data_sql = f"""
+         SELECT s.id, s.zh_text, s.en_text, s.alignment_score,
+                l.law_id, l.zh_name, l.en_name, l.type,
+                a.article_no_zh, a.article_no_en, a.article_type,
+                s.zh_sentence_idx, s.en_sentence_idx
+         FROM sentences s
+         JOIN laws l ON s.law_id = l.id
+         JOIN articles a ON s.article_id = a.id
+         WHERE {where}
+         {order_by}
+         LIMIT ? OFFSET ?
+     """
+     data_params = params + [limit, offset]
+
+     conn.create_function(
+         "REGEXP",
+         2,
+         lambda pat, txt: bool(regex.search(pat, txt, flags=regex.V1)) if txt else False,
+     )
+
+     cur = conn.execute(count_sql, params)
+     total = cur.fetchone()[0]
+
+     cur = conn.execute(data_sql, data_params)
+     rows = [dict(r) for r in cur.fetchall()]
+
+     return rows, total
+
+
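`_search_regex` depends on SQLite's `REGEXP` operator, which SQLite does not implement itself: `x REGEXP y` is rewritten as a call to a user-defined function named `REGEXP`, so one must be registered with `create_function` before the query runs, exactly as the diff does. A minimal sketch of that mechanism (the table and rows are illustrative; the third-party `regex` module from requirements.txt is used, not stdlib `re`):

```python
import sqlite3
import regex  # third-party `regex` package; supports \p{Han} and the V1 mode

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sentences (zh_text TEXT)")
conn.executemany("INSERT INTO sentences VALUES (?)", [("第1條",), ("abc",), (None,)])

# SQLite evaluates `txt REGEXP pat` as REGEXP(pat, txt).
# Guard against NULL columns: a NULL sentence must simply not match.
conn.create_function(
    "REGEXP",
    2,
    lambda pat, txt: bool(regex.search(pat, txt, flags=regex.V1)) if txt else False,
)

hits = [
    r[0]
    for r in conn.execute(
        "SELECT zh_text FROM sentences WHERE zh_text REGEXP ?", (r"\p{Han}+",)
    )
]
```

Without the `create_function` call, the same query raises `sqlite3.OperationalError: no such function: REGEXP`.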
+ def get_paragraph(conn, sentence_id):
+     cur = conn.execute(
+         """
+         SELECT s.id, s.zh_text, s.en_text, s.alignment_score, s.zh_sentence_idx, s.en_sentence_idx,
+                p.paragraph_index, a.article_no_zh, a.article_no_en
+         FROM sentences s
+         JOIN paragraphs p ON s.paragraph_id = p.id
+         JOIN articles a ON s.article_id = a.id
+         WHERE s.paragraph_id = (SELECT paragraph_id FROM sentences WHERE id = ?)
+         ORDER BY s.zh_sentence_idx
+         """,
+         (sentence_id,),
+     )
+     rows = [dict(r) for r in cur.fetchall()]
+     if not rows:
+         return None
+     return {
+         "paragraph_index": rows[0]["paragraph_index"],
+         "article_no_zh": rows[0]["article_no_zh"],
+         "article_no_en": rows[0]["article_no_en"],
+         "sentences": rows,
+     }
+
+
+ def get_article(conn, sentence_id):
+     cur = conn.execute(
+         """
+         SELECT s.id, s.zh_text, s.en_text, s.alignment_score, s.zh_sentence_idx, s.en_sentence_idx,
+                p.paragraph_index, p.id as paragraph_id,
+                a.article_no_zh, a.article_no_en, a.article_type
+         FROM sentences s
+         JOIN paragraphs p ON s.paragraph_id = p.id
+         JOIN articles a ON s.article_id = a.id
+         WHERE s.article_id = (SELECT article_id FROM sentences WHERE id = ?)
+         ORDER BY p.paragraph_index, s.zh_sentence_idx
+         """,
+         (sentence_id,),
+     )
+     rows = [dict(r) for r in cur.fetchall()]
+     if not rows:
+         return None
+
+     paragraphs = {}
+     for r in rows:
+         pidx = r["paragraph_index"]
+         if pidx not in paragraphs:
+             paragraphs[pidx] = {
+                 "paragraph_index": pidx,
+                 "paragraph_id": r["paragraph_id"],
+                 "sentences": [],
+             }
+         paragraphs[pidx]["sentences"].append(
+             {
+                 "id": r["id"],
+                 "zh_text": r["zh_text"],
+                 "en_text": r["en_text"],
+                 "alignment_score": r["alignment_score"],
+                 "zh_sentence_idx": r["zh_sentence_idx"],
+                 "en_sentence_idx": r["en_sentence_idx"],
+             }
+         )
+
+     return {
+         "article_no_zh": rows[0]["article_no_zh"],
+         "article_no_en": rows[0]["article_no_en"],
+         "article_type": rows[0]["article_type"],
+         "paragraphs": [paragraphs[k] for k in sorted(paragraphs.keys())],
+     }
+
+
+ def list_laws(conn, law_type=None, category=None):
+     where = []
+     params = []
+     if law_type:
+         where.append("type = ?")
+         params.append(law_type)
+     if category:
+         if category.endswith(">"):
+             where.append("category LIKE ?")
+             params.append(f"{category}%")
+         else:
+             where.append("category = ?")
+             params.append(category)
+
+     where_clause = " AND ".join(where) if where else "1=1"
+     cur = conn.execute(
+         f"SELECT law_id, type, zh_name, en_name, category FROM laws WHERE {where_clause} ORDER BY law_id",
+         params,
+     )
+     return [dict(r) for r in cur.fetchall()]
+
+
+ def list_categories(conn, law_type=None):
+     if law_type:
+         where = "WHERE type = ? AND category IS NOT NULL AND category != ''"
+         params = [law_type]
+     else:
+         where = "WHERE category IS NOT NULL AND category != ''"
+         params = []
+     cur = conn.execute(
+         f"SELECT DISTINCT category FROM laws {where} ORDER BY category",
+         params,
+     )
+     return _expand_category_options(r["category"] for r in cur.fetchall())
+
+
+ def get_law_articles(conn, law_id):
+     cur = conn.execute(
+         """
+         SELECT article_no_zh, article_no_en, article_type, article_index
+         FROM articles
+         WHERE law_id = (SELECT id FROM laws WHERE law_id = ?)
+         ORDER BY article_index
+         """,
+         (law_id,),
+     )
+     return [dict(r) for r in cur.fetchall()]
+
+
+ def get_law_full_text(conn, law_id):
+     cur = conn.execute(
+         """
+         SELECT s.zh_text, s.en_text, s.alignment_score,
+                a.article_no_zh, a.article_no_en, a.article_type, a.article_index,
+                p.paragraph_index
+         FROM sentences s
+         JOIN paragraphs p ON s.paragraph_id = p.id
+         JOIN articles a ON s.article_id = a.id
+         JOIN laws l ON s.law_id = l.id
+         WHERE l.law_id = ?
+         ORDER BY a.article_index, p.paragraph_index, s.zh_sentence_idx
+         """,
+         (law_id,),
+     )
+     return [dict(r) for r in cur.fetchall()]
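Several helpers above share one filter convention: a category value ending in `>` is treated as a subtree prefix (matched with `LIKE 'prefix%'`), while any other value must match exactly. A small sketch of the two branches, assuming `>`-delimited category paths of the kind the filters imply (table and sample values are illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE laws (law_id TEXT, category TEXT)")
conn.executemany(
    "INSERT INTO laws VALUES (?, ?)",
    [("L1", "民事>契約"), ("L2", "民事>物權"), ("L3", "民事")],
)

def by_category(conn, category):
    # A trailing ">" selects the whole subtree; anything else matches exactly.
    if category.endswith(">"):
        clause, arg = "category LIKE ?", f"{category}%"
    else:
        clause, arg = "category = ?", category
    cur = conn.execute(
        f"SELECT law_id FROM laws WHERE {clause} ORDER BY law_id", (arg,)
    )
    return [r[0] for r in cur.fetchall()]
```

Note that plain `LIKE` treats `%` and `_` in the stored prefix as wildcards; that is harmless here only as long as category names never contain those characters.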
list_unmatched_sentences.sql ADDED
@@ -0,0 +1,34 @@
+ SELECT
+     l.law_id AS "Law/Order ID",
+     CASE
+         WHEN COALESCE(TRIM(a.article_no_zh), '') != ''
+              AND COALESCE(TRIM(a.article_no_en), '') != ''
+         THEN a.article_no_zh || ' / ' || a.article_no_en
+         ELSE COALESCE(NULLIF(TRIM(a.article_no_zh), ''), TRIM(a.article_no_en))
+     END AS "Article No.",
+     CASE
+         WHEN TRIM(COALESCE(s.zh_text, '')) != ''
+              AND TRIM(COALESCE(s.en_text, '')) = ''
+         THEN TRIM(s.zh_text)
+         ELSE ''
+     END AS "Source sentence(s) with no matched Target sentence(s)",
+     CASE
+         WHEN TRIM(COALESCE(s.en_text, '')) != ''
+              AND TRIM(COALESCE(s.zh_text, '')) = ''
+         THEN TRIM(s.en_text)
+         ELSE ''
+     END AS "Target sentence(s) with no matched Source sentence(s)"
+ FROM sentences s
+ JOIN laws l ON s.law_id = l.id
+ JOIN articles a ON s.article_id = a.id
+ WHERE s.alignment_score = 0.0
+   AND (
+         (TRIM(COALESCE(s.zh_text, '')) != '' AND TRIM(COALESCE(s.en_text, '')) = '')
+      OR (TRIM(COALESCE(s.en_text, '')) != '' AND TRIM(COALESCE(s.zh_text, '')) = '')
+   )
+ ORDER BY
+     l.law_id,
+     a.id,
+     COALESCE(s.zh_sentence_idx, 999999),
+     COALESCE(s.en_sentence_idx, 999999),
+     s.id;
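The query above leans on SQLite's `COALESCE(NULLIF(TRIM(x), ''), y)` idiom: trim the first value, convert a now-empty string to NULL, then fall through to the alternative. A quick check of how the three functions compose, run from Python against an in-memory connection (the sample values are illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")

def fallback(zh, en):
    # NULLIF(TRIM(zh), '') turns a blank or whitespace-only zh into NULL,
    # so COALESCE then falls through to the trimmed en value.
    return conn.execute(
        "SELECT COALESCE(NULLIF(TRIM(?), ''), TRIM(?))", (zh, en)
    ).fetchone()[0]
```

The same expression in the report picks the Chinese article number when it exists and otherwise falls back to the English one, which is why blank-vs-NULL differences in the source data never surface in the output.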
requirements.txt CHANGED
@@ -1,3 +1 @@
- altair
- pandas
- streamlit
+ regex>=2024.0.0
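requirements.txt now pins the third-party `regex` package, which the REGEXP callback uses with `regex.V1`. Unlike stdlib `re`, it supports Unicode script properties such as `\p{Han}` and, under V1, set operations inside character classes; both are useful for searching a bilingual legal corpus. A quick illustration (sample strings are invented):

```python
import regex  # third-party package from requirements.txt, not stdlib `re`

# \p{Han} matches Chinese characters by Unicode script -- stdlib `re`
# has no equivalent property class.
han = regex.findall(r"\p{Han}+", "民法 Art. 1 契約", flags=regex.V1)

# V1 mode additionally allows nested classes with set difference:
# lowercase letters minus vowels.
consonants = regex.findall(r"[[a-z]--[aeiou]]+", "freedom", flags=regex.V1)
```

Patterns compiled this way are exactly what `_search_regex` feeds to SQLite through the registered REGEXP function.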
 
 
twl_concordancer.db ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:f19d8a6a060ab2d1a71aa2c2a77d86ed698b1b977e043827c9b0f62cca72731b
+ size 129490944
unmatched_sentences.tsv ADDED
The diff for this file is too large to render. See raw diff
 
unmatched_sentences_by_article.tsv ADDED
The diff for this file is too large to render. See raw diff