hungnha committed on Feb 25

Commit

92c9b4d

1 Parent(s): bf7ec12

Change prompt

Files changed (23)

.gitattributes +2 -0
.gitignore +3 -0
README.md +324 -27
colab.ipynb +0 -476
core/gradio/gradio_rag.py +2 -182
core/gradio/user_gradio.py +99 -34
core/hash_file/hash_data_goc.py +23 -26
core/hash_file/hash_file.py +20 -25
core/preprocessing/docling_processor.py +24 -27
core/preprocessing/pdf_parser.py +11 -11
core/rag/chunk.py +52 -52
core/rag/embedding_model.py +19 -20
core/rag/generator.py +27 -21
core/rag/{retrival.py → retrieval.py} +57 -59
core/rag/vector_store.py +48 -48
evaluation/eval_utils.py +15 -17
evaluation/ragas_eval.py +34 -26
scripts/build_data.py +42 -44
scripts/run_app.py +52 -0
scripts/run_eval.py +5 -5
setup.bat +35 -0
setup.sh +38 -0
test_chunk.md +0 -696

.gitattributes CHANGED

@@ -59,3 +59,5 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
 *.webm filter=lfs diff=lfs merge=lfs -text
 *.pdf filter=lfs diff=lfs merge=lfs -text
 data/files/*.pdf filter=lfs diff=lfs merge=lfs -text
+# SCM syntax highlighting & preventing 3-way merges
+pixi.lock merge=binary linguist-language=YAML linguist-generated=true -diff

.gitignore CHANGED

@@ -157,3 +157,6 @@ __pycache__/
 # Download with: python scripts/download_data.py
 data/
+# pixi environments
+.pixi/*
+!.pixi/config.toml

README.md CHANGED

@@ -1,56 +1,353 @@
-# HUST RAG - Student Regulations Q&A System
-A RAG system that helps students look up regulations and policies at Hanoi University of Science and Technology.
-## Features
-- Hybrid Search (Vector + BM25)
-- Reranking with Qwen3-Reranker
-- Small-to-Big Retrieval for tables
-- Gradio chat interface
-## Installation
-**Requirements:** Python 3.10+
-Ubuntu/Debian also needs:
-sudo apt update
-sudo apt install python3-venv
-**Step 1:** Run the setup script
-- **Linux/Mac:** `bash setup.sh`
-- **Windows:** double-click `setup.bat` or type `setup.bat` in cmd
-> The script will: create a venv → install dependencies → download data → create .env
-**Step 2:** Configure API keys
-Edit the `.env` file:
-SILICONFLOW_API_KEY=your_key    # Embedding & Reranking
-GROQ_API_KEY=your_key           # LLM Generation
-Get API keys at: [SiliconFlow](https://siliconflow.ai/) | [Groq](https://groq.com/)
-**Step 3:** Run the application
 source venv/bin/activate        # Linux/Mac
-venv\Scripts\activate           # Windows
 python scripts/run_app.py
-Open: http://127.0.0.1:7860
-## Data
-Data on HuggingFace: [hungnha/do_an_tot_nghiep](https://huggingface.co/datasets/hungnha/do_an_tot_nghiep)
-Manual download:
 huggingface-cli download hungnha/do_an_tot_nghiep --repo-type dataset --local-dir ./data

+# HUST RAG — Student Regulations Q&A System
+A Retrieval-Augmented Generation (RAG) system that helps students query academic regulations and policies at Hanoi University of Science and Technology (HUST). The system processes Markdown-based regulation documents, stores them in a vector database, and uses a hybrid retrieval pipeline with reranking to provide accurate, context-grounded answers through a conversational chat interface.
+---
+## ✨ Key Features
+- **Hybrid Search** — Combines vector similarity search (ChromaDB) with BM25 keyword matching for both semantic and lexical retrieval
+- **Reranking** — Uses Qwen3-Reranker-8B via SiliconFlow API to re-score and sort retrieved documents by relevance
+- **Small-to-Big Retrieval** — Summarizes large tables with an LLM, embeds the summary for search, and returns the full original table at query time
+- **4 Retrieval Modes** — `vector_only`, `bm25_only`, `hybrid`, `hybrid_rerank` — configurable per query
+- **Incremental Data Build** — Hash-based change detection ensures only modified files are re-processed when rebuilding the database
+- **Streaming Chat UI** — Gradio-based conversational interface with real-time response streaming
+- **RAGAS Evaluation** — Built-in evaluation pipeline using the RAGAS framework with metrics like faithfulness, relevancy, precision, recall, and ROUGE scores
+---
+## 🏗️ System Architecture
+```
+┌────────────────────────────────────────────────────────────────────┐
+│                        User Query (Gradio UI)                      │
+└──────────────────────────────┬─────────────────────────────────────┘
+                               │
+                               ▼
+┌──────────────────────────────────────────────────────────────────────┐
+│                        Retrieval Pipeline                            │
+│                                                                      │
+│  ┌──────────────┐   ┌──────────────┐   ┌────────────────────────┐   │
+│  │ Vector Search │ + │ BM25 Search  │ → │ Ensemble (weighted)    │   │
+│  │  (ChromaDB)   │   │ (rank-bm25)  │   │ vector:0.5 + bm25:0.5 │   │
+│  └──────────────┘   └──────────────┘   └──────────┬─────────────┘   │
+│                                                     │                │
+│                                                     ▼                │
+│                                          ┌──────────────────┐        │
+│                                          │ Qwen3-Reranker   │        │
+│                                          │ (SiliconFlow API) │        │
+│                                          └────────┬─────────┘        │
+│                                                   │                  │
+│                                    Small-to-Big:  │                  │
+│                                    summary hit →  │                  │
+│                                    swap w/ parent │                  │
+└───────────────────────────────────────────┬──────────────────────────┘
+                                            │
+                                            ▼
+┌──────────────────────────────────────────────────────────────────────┐
+│                      Context Builder + LLM                           │
+│                                                                      │
+│   Context (top-k docs + metadata) → Prompt → LLM (Groq API)         │
+│                                              → Streaming Response    │
+└──────────────────────────────────────────────────────────────────────┘
+```
+---
+## 📁 Project Structure
+```
+DoAn/
+├── core/                          # Core application modules
+│   ├── rag/                       # RAG engine
+│   │   ├── chunk.py               # Markdown chunking with table extraction & Small-to-Big
+│   │   ├── embedding_model.py     # Qwen3-Embedding wrapper (SiliconFlow API)
+│   │   ├── vector_store.py        # ChromaDB wrapper with parent node storage
+│   │   ├── retrieval.py           # Hybrid retriever + SiliconFlow reranker
+│   │   └── generator.py           # Context builder & prompt construction
+│   ├── gradio/                    # Chat interfaces
+│   │   ├── user_gradio.py         # Main Gradio app (production + debug modes)
+│   │   └── gradio_rag.py          # Debug mode launcher (thin wrapper)
+│   └── hash_file/                 # File hashing utilities
+│       └── hash_file.py           # SHA-256 hash processor for change detection
+│
+├── scripts/                       # Workflow scripts
+│   ├── run_app.py                 # Application entry point (data check + env check + launch)
+│   ├── build_data.py              # Build/update ChromaDB from markdown files
+│   ├── download_data.py           # Download data from HuggingFace
+│   └── run_eval.py                # Run RAGAS evaluation
+│
+├── evaluation/                    # Evaluation pipeline
+│   ├── eval_utils.py              # Shared utilities (RAG init, answer generation)
+│   └── ragas_eval.py              # RAGAS evaluation with multiple metrics
+│
+├── test/                          # Unit tests
+│   ├── conftest.py                # Shared fixtures and sample data
+│   ├── test_chunk.py              # Chunking logic tests
+│   ├── test_embedding.py          # Embedding model tests
+│   ├── test_vector_store.py       # Vector store tests
+│   ├── test_retrieval.py          # Retrieval pipeline tests
+│   ├── test_generator.py          # Generator/context builder tests
+│   └── ...
+│
+├── data/                          # Data directory (downloaded from HuggingFace)
+│   ├── data_process/              # Processed markdown files
+│   └── chroma/                    # ChromaDB persistence directory
+│
+├── requirements.txt               # Python dependencies
+├── setup.sh                       # Linux/Mac setup script
+├── setup.bat                      # Windows setup script
+└── .env                           # API keys (not tracked in git)
+```
+---
+## 🚀 Getting Started
+### Prerequisites
+- **Python 3.10+**
+- **API Keys:**
+  - [SiliconFlow](https://siliconflow.ai/) — for embedding (Qwen3-Embedding-4B) and reranking (Qwen3-Reranker-8B)
+  - [Groq](https://groq.com/) — for LLM generation (Qwen3-32B)
+### Quick Setup (Recommended)
+Run the automated setup script, which creates a virtual environment, installs dependencies, downloads data, and creates the `.env` file:
+```bash
+# Linux / macOS
+bash setup.sh
+# Windows
+setup.bat
+```
+Then edit `.env` with your API keys:
+```env
+SILICONFLOW_API_KEY=your_siliconflow_key
+GROQ_API_KEY=your_groq_key
+```
+### Manual Setup
+```bash
+# 1. Create and activate virtual environment
+python3 -m venv venv
 source venv/bin/activate        # Linux/Mac
+# venv\Scripts\activate         # Windows
+# 2. Install dependencies
+pip install -r requirements.txt
+# 3. Download data from HuggingFace
+python scripts/download_data.py
+# 4. Create .env file with your API keys
+echo "SILICONFLOW_API_KEY=your_key" > .env
+echo "GROQ_API_KEY=your_key" >> .env
+```
+### Running the Application
+```bash
+source venv/bin/activate        # Linux/Mac
 python scripts/run_app.py
+```
+Access the chat interface at: **http://127.0.0.1:7860**
+---
+## 📖 Usage Guide
+### Chat Interface
+The Gradio chat interface supports natural language questions about HUST student regulations. Example questions:
+| Question | Topic |
+|----------|-------|
+| Sinh viên vi phạm quy chế thi thì bị xử lý như thế nào? | Exam violation penalties |
+| Điều kiện để đổi ngành là gì? | Major transfer requirements |
+| Làm thế nào để đăng ký hoãn thi? | Exam postponement registration |
+### Debug Mode
+To launch the debug interface that shows retrieved documents and relevance scores:
+```bash
+python core/gradio/gradio_rag.py
+```
+### Building/Updating the Database
+When you add, modify, or delete markdown files in `data/data_process/`, rebuild the database:
+```bash
+# Incremental update (only changed files)
+python scripts/build_data.py
+# Force full rebuild
+python scripts/build_data.py --force
+# Skip orphan deletion
+python scripts/build_data.py --no-delete
+```
+The build script will:
+1. Detect changed files via SHA-256 hash comparison
+2. Delete chunks from removed files
+3. Re-chunk and re-embed only modified files
+4. Automatically invalidate the BM25 cache
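Note on the build steps above: hash-based change detection can be sketched as follows. This is a minimal illustration under stated assumptions, not the code in `scripts/build_data.py`. The helper names `sha256_file` and `diff_hashes`, and the idea of keeping hashes in a plain `{path: hash}` mapping, are hypothetical.

```python
import hashlib
from pathlib import Path


def sha256_file(path: Path, block_size: int = 65536) -> str:
    """Return the SHA-256 hex digest of a file, read in fixed-size blocks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(block_size), b""):
            digest.update(block)
    return digest.hexdigest()


def diff_hashes(stored: dict, current: dict) -> tuple:
    """Compare two {path: hash} mappings (hypothetical layout).

    Returns (changed, removed):
      changed - paths that are new or whose hash differs (re-chunk and re-embed)
      removed - paths that no longer exist (delete their chunks)
    """
    changed = sorted(p for p, h in current.items() if stored.get(p) != h)
    removed = sorted(p for p in stored if p not in current)
    return changed, removed
```

Only the paths in `changed` would need re-chunking and re-embedding, which is what makes the rebuild incremental.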
+---
+## 🔧 Core Components
+### Chunking (`core/rag/chunk.py`)
+Processes Markdown documents into searchable chunks:
+| Feature | Description |
+|---------|-------------|
+| **YAML Frontmatter Extraction** | Parses metadata (document type, year, cohort, program) into chunk metadata |
+| **Heading-based Splitting** | Uses `MarkdownNodeParser` to split by headings, preserving document structure |
+| **Table Extraction & Splitting** | Extracts Markdown tables, splits large tables into chunks of 15 rows |
+| **Small-to-Big Pattern** | Summarizes tables with LLM → embeds summary → links to parent (full table) |
+| **Small Chunk Merging** | Merges chunks smaller than 200 characters with adjacent chunks |
+| **Metadata Enrichment** | Extracts course names and codes from content using regex patterns |
+**Configuration:**
+```python
+CHUNK_SIZE = 1500          # Maximum chunk size in characters
+CHUNK_OVERLAP = 150        # Overlap between consecutive chunks
+MIN_CHUNK_SIZE = 200       # Minimum chunk size (smaller chunks get merged)
+TABLE_ROWS_PER_CHUNK = 15  # Maximum rows per table chunk
+```
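Note on `TABLE_ROWS_PER_CHUNK`: splitting a large Markdown table into fixed-size row groups might look like the sketch below. It assumes that each chunk repeats the header and separator rows so it stays a valid table on its own; whether `core/rag/chunk.py` does this is not shown in the diff, and the function name `split_markdown_table` is hypothetical.

```python
TABLE_ROWS_PER_CHUNK = 15  # value taken from the README configuration


def split_markdown_table(table: str, rows_per_chunk: int = TABLE_ROWS_PER_CHUNK) -> list:
    """Split a Markdown table into chunks of at most rows_per_chunk data rows.

    Assumption: the first line is the header and the second is the
    |---|---| separator; both are repeated at the top of every chunk.
    """
    lines = [ln for ln in table.strip().splitlines() if ln.strip()]
    if len(lines) < 3:
        # Header-only or malformed table: nothing to split.
        return ["\n".join(lines)]
    header, separator, rows = lines[0], lines[1], lines[2:]
    if len(rows) <= rows_per_chunk:
        return ["\n".join(lines)]
    chunks = []
    for start in range(0, len(rows), rows_per_chunk):
        group = rows[start:start + rows_per_chunk]
        chunks.append("\n".join([header, separator, *group]))
    return chunks
```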
+### Embedding (`core/rag/embedding_model.py`)
+- **Model:** Qwen3-Embedding-4B via SiliconFlow API
+- **Dimensions:** 2048
+- **Batch processing** with configurable batch size (default: 16)
+- **Rate limit handling** with exponential backoff retry
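Note on rate limit handling: exponential backoff retry generally follows the pattern below. This is a generic sketch, not the wrapper in `core/rag/embedding_model.py`. `RateLimitError` is a hypothetical stand-in for whatever the API client raises on HTTP 429, and the default retry count and delay are assumptions.

```python
import time


class RateLimitError(Exception):
    """Hypothetical stand-in for the client's rate-limit exception."""


def call_with_backoff(fn, max_retries: int = 5, base_delay: float = 1.0, sleep=time.sleep):
    """Call fn(), retrying on RateLimitError with delays of base_delay * 2**attempt.

    The sleep function is injectable so the schedule can be checked without waiting.
    """
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except RateLimitError:
            if attempt == max_retries:
                raise  # retries exhausted; surface the error to the caller
            sleep(base_delay * (2 ** attempt))
```

With the defaults, the waits are 1, 2, 4, 8, and 16 seconds before the final attempt.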
+### Vector Store (`core/rag/vector_store.py`)
+- **Backend:** ChromaDB with LangChain integration
+- **Parent node storage:** Separate JSON file for Small-to-Big parent nodes (not embedded)
+- **Content-based document IDs:** SHA-256 hash of (source_file, header_path, chunk_index, content)
+- **Metadata flattening:** Converts complex metadata types to ChromaDB-compatible formats
+- **Batch operations:** `add_documents()` and `upsert_documents()` with configurable batch size
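Note on content-based document IDs: an ID derived from the four fields the README lists could be computed as in this sketch. The function name `make_document_id` and the choice of a unit-separator character to join the fields are assumptions; the actual encoding in `core/rag/vector_store.py` is not visible in this diff.

```python
import hashlib


def make_document_id(source_file: str, header_path: str, chunk_index: int, content: str) -> str:
    """Derive a deterministic ID from (source_file, header_path, chunk_index, content).

    Fields are joined with the ASCII unit separator so that
    ("ab", "c") and ("a", "bc") cannot collide (illustrative choice).
    """
    key = "\x1f".join([source_file, header_path, str(chunk_index), content])
    return hashlib.sha256(key.encode("utf-8")).hexdigest()
```

Because the ID depends only on content and position, upserting an unchanged chunk maps to the same ID instead of creating a duplicate.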
+### Retrieval (`core/rag/retrieval.py`)
+| Mode | Description |
+|------|-------------|
+| `vector_only` | Pure vector similarity search via ChromaDB |
+| `bm25_only` | Pure keyword matching via BM25 (with lazy-load and disk caching) |
+| `hybrid` | Ensemble of vector + BM25 with configurable weights (default: 0.5/0.5) |
+| `hybrid_rerank` | Hybrid search followed by Qwen3-Reranker-8B reranking **(default)** |
+**Small-to-Big at retrieval time:** When a table summary node is retrieved, it is automatically swapped with the full parent table before returning results to the user.
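Note on the Small-to-Big swap: the retrieval-time replacement of a table summary by its parent table can be sketched as below. Hits are modeled as plain dicts, and the `parent_id` metadata key and the `parent_store` mapping are hypothetical; the real structures in `core/rag/retrieval.py` may differ.

```python
def swap_summaries_for_parents(hits: list, parent_store: dict) -> list:
    """Replace summary hits with their full parent content, keeping rank order.

    hits         - ranked list of {"content": str, "metadata": dict}
    parent_store - hypothetical {parent_id: full_table_text} mapping
    A parent that several summaries point to is returned only once.
    """
    results, seen_parents = [], set()
    for hit in hits:
        parent_id = hit.get("metadata", {}).get("parent_id")
        if parent_id and parent_id in parent_store:
            if parent_id in seen_parents:
                continue  # the full table is already in the results
            seen_parents.add(parent_id)
            results.append({"content": parent_store[parent_id], "metadata": hit["metadata"]})
        else:
            results.append(hit)  # ordinary chunk: return unchanged
    return results
```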
+**Configuration:**
+```python
+rerank_model = "Qwen/Qwen3-Reranker-8B"  # Reranker model
+initial_k = 25                             # Documents fetched before reranking
+top_k = 5                                  # Final documents returned
+vector_weight = 0.5                        # Weight for vector search
+bm25_weight = 0.5                          # Weight for BM25 search
+```
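Note on `vector_weight` and `bm25_weight`: one common way to combine two ranked lists with weights is weighted Reciprocal Rank Fusion (RRF). The sketch below shows that technique; whether `core/rag/retrieval.py` fuses results with this exact formula is an assumption, and the constant `c = 60` is the conventional RRF default rather than a value from this repository.

```python
def weighted_rrf(vector_ranked: list, bm25_ranked: list,
                 vector_weight: float = 0.5, bm25_weight: float = 0.5,
                 c: int = 60) -> list:
    """Fuse two ranked lists of document IDs with weighted RRF.

    score(d) = sum over lists of weight / (c + rank), with rank starting at 1.
    Documents missing from a list contribute nothing for that list.
    """
    scores = {}
    for weight, ranked in ((vector_weight, vector_ranked), (bm25_weight, bm25_ranked)):
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + weight / (c + rank)
    # Highest fused score first; Python's stable sort keeps first-seen order on ties.
    return sorted(scores, key=scores.get, reverse=True)
```

In the `hybrid_rerank` mode described above, the top `initial_k` fused results would then be passed to the reranker, which keeps the best `top_k`.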
+### Generator (`core/rag/generator.py`)
+- Builds rich context strings with metadata (source, document type, year, cohort, program, faculty)
+- Constructs prompts with a Vietnamese system prompt that enforces context-grounded answers
+- `RAGContextBuilder` combines retrieval and context preparation into a single step
+---
+## 📊 Evaluation
+The project includes a RAGAS-based evaluation pipeline.
+### Running Evaluation
+```bash
+# Evaluate with default settings (10 samples, hybrid_rerank)
+python scripts/run_eval.py
+# Custom sample size and mode
+python scripts/run_eval.py --samples 50 --mode hybrid_rerank
+# Run all retrieval modes for comparison
+python scripts/run_eval.py --samples 20 --mode all
+```
+### Metrics
+| Metric | Description |
+|--------|-------------|
+| **Faithfulness** | How well the answer is grounded in the retrieved context |
+| **Answer Relevancy** | How relevant the answer is to the question |
+| **Context Precision** | How precise the retrieved contexts are |
+| **Context Recall** | How well the retrieved contexts cover the ground truth |
+| **ROUGE-1 / ROUGE-2 / ROUGE-L** | N-gram overlap with ground truth answers |
+Results are saved to `evaluation/results/` as both JSON and CSV files with timestamps.
+---
+## 🧪 Testing
+```bash
+# Run all tests
+pytest test/ -v
+# Run specific test module
+pytest test/test_chunk.py -v
+pytest test/test_retrieval.py -v
+# Run with coverage
+pytest test/ --cov=core --cov-report=term-missing
+```
+---
+## 🛠️ Technology Stack
+| Category | Technology |
+|----------|------------|
+| **Embedding** | Qwen3-Embedding-4B (SiliconFlow API) |
+| **Reranking** | Qwen3-Reranker-8B (SiliconFlow API) |
+| **LLM** | Qwen3-32B (Groq API) |
+| **Vector Database** | ChromaDB |
+| **Keyword Search** | BM25 (rank-bm25) |
+| **Framework** | LangChain + LlamaIndex (chunking) |
+| **UI** | Gradio |
+| **Evaluation** | RAGAS |
+| **Language** | Python 3.10+ |
+---
+## 📦 Data
+The processed data is hosted on HuggingFace: [hungnha/do_an_tot_nghiep](https://huggingface.co/datasets/hungnha/do_an_tot_nghiep)
+**Manual download:**
+```bash
 huggingface-cli download hungnha/do_an_tot_nghiep --repo-type dataset --local-dir ./data
+```
+The data directory contains:
+- `data_process/` — Processed Markdown regulation documents
+- `chroma/` — ChromaDB persistence files (vector index + parent nodes)
+- `data.csv` — Evaluation dataset (questions + ground truth answers)
+---
+## 📄 License
+This project is developed as an undergraduate thesis at Hanoi University of Science and Technology (HUST).

colab.ipynb DELETED

@@ -1,476 +0,0 @@
-{
- "cells": [
-  {
-   "cell_type": "code",
-   "execution_count": 1,
-   "id": "287f0df4",
-   "metadata": {},
-   "outputs": [
-    {
-     "ename": "KeyboardInterrupt",
-     "evalue": "",
-     "output_type": "error",
-     "traceback": [
-      "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
-      "\u001b[0;31mKeyboardInterrupt\u001b[0m                         Traceback (most recent call last)",
-      "\u001b[0;32m/tmp/ipython-input-3329394316.py\u001b[0m in \u001b[0;36m<cell line: 0>\u001b[0;34m()\u001b[0m\n\u001b[1;32m      1\u001b[0m \u001b[0;32mfrom\u001b[0m \u001b[0mgoogle\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mcolab\u001b[0m \u001b[0;32mimport\u001b[0m \u001b[0mdrive\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m----> 2\u001b[0;31m \u001b[0mdrive\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mmount\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m'/content/drive'\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mforce_remount\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;32mTrue\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m",
-      "\u001b[0;32m/usr/local/lib/python3.12/dist-packages/google/colab/drive.py\u001b[0m in \u001b[0;36mmount\u001b[0;34m(mountpoint, force_remount, timeout_ms, readonly)\u001b[0m\n\u001b[1;32m     95\u001b[0m \u001b[0;32mdef\u001b[0m \u001b[0mmount\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mmountpoint\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mforce_remount\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;32mFalse\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mtimeout_ms\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;36m120000\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mreadonly\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;32mFalse\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m     96\u001b[0m   \u001b[0;34m\"\"\"Mount your Google Drive at the specified mountpoint path.\"\"\"\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m---> 97\u001b[0;31m   return _mount(\n\u001b[0m\u001b[1;32m     98\u001b[0m       \u001b[0mmountpoint\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m     99\u001b[0m       \u001b[0mforce_remount\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mforce_remount\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
-      "\u001b[0;32m/usr/local/lib/python3.12/dist-packages/google/colab/drive.py\u001b[0m in \u001b[0;36m_mount\u001b[0;34m(mountpoint, force_remount, timeout_ms, ephemeral, readonly)\u001b[0m\n\u001b[1;32m    132\u001b[0m   )\n\u001b[1;32m    133\u001b[0m   \u001b[0;32mif\u001b[0m \u001b[0mephemeral\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 134\u001b[0;31m     _message.blocking_request(\n\u001b[0m\u001b[1;32m    135\u001b[0m         \u001b[0;34m'request_auth'\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m    136\u001b[0m         \u001b[0mrequest\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;34m{\u001b[0m\u001b[0;34m'authType'\u001b[0m\u001b[0;34m:\u001b[0m \u001b[0;34m'dfs_ephemeral'\u001b[0m\u001b[0;34m}\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
-      "\u001b[0;32m/usr/local/lib/python3.12/dist-packages/google/colab/_message.py\u001b[0m in \u001b[0;36mblocking_request\u001b[0;34m(request_type, request, timeout_sec, parent)\u001b[0m\n\u001b[1;32m    174\u001b[0m       \u001b[0mrequest_type\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mrequest\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mparent\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mparent\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mexpect_reply\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;32mTrue\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m    175\u001b[0m   )\n\u001b[0;32m--> 176\u001b[0;31m   \u001b[0;32mreturn\u001b[0m \u001b[0mread_reply_from_input\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mrequest_id\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mtimeout_sec\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m",
-      "\u001b[0;32m/usr/local/lib/python3.12/dist-packages/google/colab/_message.py\u001b[0m in \u001b[0;36mread_reply_from_input\u001b[0;34m(message_id, timeout_sec)\u001b[0m\n\u001b[1;32m     94\u001b[0m     \u001b[0mreply\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0m_read_next_input_message\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m     95\u001b[0m     \u001b[0;32mif\u001b[0m \u001b[0mreply\u001b[0m \u001b[0;34m==\u001b[0m \u001b[0m_NOT_READY\u001b[0m \u001b[0;32mor\u001b[0m \u001b[0;32mnot\u001b[0m \u001b[0misinstance\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mreply\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mdict\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m---> 96\u001b[0;31m       \u001b[0mtime\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0msleep\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;36m0.025\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m     97\u001b[0m       \u001b[0;32mcontinue\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m     98\u001b[0m     if (\n",
-      "\u001b[0;31mKeyboardInterrupt\u001b[0m: "
-     ]
-    }
-   ],
-   "source": [
-    "from google.colab import drive\n",
-    "drive.mount('/content/drive', force_remount=True)"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "id": "f6891108",
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "# 2. Install dependencies\n",
-    "# Install system Tesseract, the Vietnamese language pack, and the development libraries needed to build tesserocr\n",
-    "!sudo apt-get update > /dev/null\n",
-    "!sudo apt-get install -y tesseract-ocr tesseract-ocr-vie libtesseract-dev libleptonica-dev pkg-config > /dev/null\n",
-    "\n",
-    "# Install tesserocr (Python wrapper for Tesseract) and docling\n",
-    "# Note: tesserocr is built from source, so it needs the dev libraries above\n",
-    "!pip install tesserocr docling pypdfium2"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "id": "ca42bfce",
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "# 3. Extract Data\n",
-    "import os\n",
-    "import zipfile\n",
-    "\n",
-    "# Path to your zip file on Drive\n",
-    "zip_path = '/content/drive/MyDrive/data_rag.zip' \n",
-    "extract_path = '/content/data_rag/files'\n",
-    "\n",
-    "if not os.path.exists(extract_path):\n",
-    "    os.makedirs(extract_path, exist_ok=True)\n",
-    "    print(f\"Extracting {zip_path}...\")\n",
-    "    try:\n",
-    "        with zipfile.ZipFile(zip_path, 'r') as zip_ref:\n",
-    "            zip_ref.extractall(extract_path)\n",
-    "        print(\"Done extraction!\")\n",
-    "    except FileNotFoundError:\n",
-    "        print(f\"❌ File not found: {zip_path}. Please check the path.\")\n",
-    "else:\n",
-    "    print(\"Files already extracted.\")"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "id": "988f7e96",
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "# 4. Define Processor Class (Refactored for High Quality & Performance with Tesseract)\n",
-    "import json\n",
-    "import os\n",
-    "import logging\n",
-    "import shutil\n",
-    "import re\n",
-    "import gc\n",
-    "import signal\n",
-    "from pathlib import Path\n",
-    "from typing import Optional\n",
-    "\n",
-    "# --- AUTO-CONFIG TESSERACT DATA PATH ---\n",
-    "# Fixes the \"No language models have been detected\" error\n",
-    "# Locate the directory containing the language file (vie.traineddata) and set the environment variable\n",
-    "def setup_tesseract_path():\n",
-    "    possible_paths = [\n",
-    "        \"/usr/share/tesseract-ocr/4.00/tessdata\",\n",
-    "        \"/usr/share/tesseract-ocr/5/tessdata\",\n",
-    "        \"/usr/share/tesseract-ocr/tessdata\",\n",
-    "        \"/usr/local/share/tessdata\"\n",
-    "    ]\n",
-    "    \n",
-    "    found = False\n",
-    "    for path in possible_paths:\n",
-    "        if os.path.exists(os.path.join(path, \"vie.traineddata\")):\n",
-    "            os.environ[\"TESSDATA_PREFIX\"] = path\n",
-    "            print(f\"✅ Found Tesseract data at: {path}\")\n",
-    "            print(f\"   Set TESSDATA_PREFIX={path}\")\n",
-    "            found = True\n",
-    "            break\n",
-    "            \n",
-    "    if not found:\n",
-    "        print(\"⚠️ WARNING: Could not find 'vie.traineddata'. Tesseract might fail.\")\n",
-    "        print(\"   Please run Cell #2 to install tesseract-ocr-vie.\")\n",
-    "\n",
-    "setup_tesseract_path()\n",
-    "\n",
-    "# Setup logging\n",
-    "logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(message)s')\n",
-    "logger = logging.getLogger(__name__)\n",
-    "\n",
-    "# Docling imports\n",
-    "from docling.document_converter import DocumentConverter, FormatOption\n",
-    "from docling.datamodel.base_models import InputFormat\n",
-    "from docling.datamodel.pipeline_options import (\n",
-    "    PdfPipelineOptions, \n",
-    "    TableStructureOptions,\n",
-    "    AcceleratorOptions,\n",
-    "    AcceleratorDevice,\n",
-    "    TesseractOcrOptions  # Use Tesseract for the highest accuracy\n",
-    ")\n",
-    "from docling.datamodel.settings import settings\n",
-    "from docling.backend.pypdfium2_backend import PyPdfiumDocumentBackend\n",
-    "from docling.pipeline.standard_pdf_pipeline import StandardPdfPipeline\n",
-    "\n",
-    "class ColabDoclingProcessor:\n",
-    "    def __init__(self, output_dir: str, use_ocr: bool = True, timeout: int = 300):\n",
-    "        self.output_dir = output_dir\n",
-    "        self.use_ocr = use_ocr\n",
-    "        self.timeout = timeout\n",
-    "        os.makedirs(output_dir, exist_ok=True)\n",
-    "        \n",
-    "        # 1. Configure pipeline options\n",
-    "        pipeline_options = PdfPipelineOptions()\n",
-    "        \n",
-    "        # --- TableFormer configuration (top priority) ---\n",
-    "        # Enable table structure recognition\n",
-    "        pipeline_options.do_table_structure = True\n",
-    "        # Use ACCURATE mode so complex tables (grades, tuition fees) do not break apart\n",
-    "        pipeline_options.table_structure_options = TableStructureOptions(\n",
-    "            do_cell_matching=True,  # Match text to cells more precisely\n",
-    "            mode=\"accurate\"         # High-accuracy mode\n",
-    "        )\n",
-    "\n",
-    "        # --- Fix for blurry images (important) ---\n",
-    "        # Scale images up 3x so Tesseract can read Vietnamese diacritics clearly\n",
-    "        # The default is 1.0 (blurry); 3.0 gives a sharp image.\n",
-    "        pipeline_options.images_scale = 3.0\n",
-    "\n",
-    "        # --- OCR strategy with Tesseract ---\n",
-    "        if use_ocr:\n",
-    "            pipeline_options.do_ocr = True\n",
-    "            \n",
-    "            # --- Explicit Tesseract configuration ---\n",
-    "            ocr_options = TesseractOcrOptions()\n",
-    "            \n",
-    "            # Set the language to Vietnamese (vie); must match the tesseract-ocr-vie package\n",
-    "            ocr_options.lang = [\"vie\"] \n",
-    "            \n",
-    "            # --- Hybrid (smart) mode ---\n",
-    "            # Disable force_full_page_ocr so Docling decides for itself:\n",
-    "            # 1. If the text layer is good -> use the text layer (fast, lightweight)\n",
-    "            # 2. If the text layer is broken or the page is an image -> use OCR\n",
-    "            ocr_options.force_full_page_ocr = False\n",
-    "            \n",
-    "            # Attach the options to the pipeline\n",
-    "            pipeline_options.ocr_options = ocr_options\n",
-    "        else:\n",
-    "            pipeline_options.do_ocr = False\n",
-    "\n",
-    "        # --- Hardware optimization (GPU acceleration) ---\n",
-    "        # Auto-detect and use a GPU if available (Colab T4/L4)\n",
-    "        pipeline_options.accelerator_options = AcceleratorOptions(\n",
-    "            num_threads=8,  # More threads for Tesseract\n",
-    "            device=AcceleratorDevice.AUTO \n",
-    "        )\n",
-    "\n",
-    "        # 2. Create format options\n",
-    "        format_options = {\n",
-    "            InputFormat.PDF: FormatOption(\n",
-    "                backend=PyPdfiumDocumentBackend,\n",
-    "                pipeline_cls=StandardPdfPipeline,\n",
-    "                pipeline_options=pipeline_options\n",
-    "            )\n",
-    "        }\n",
-    "        \n",
-    "        # Initialize the converter\n",
-    "        self.converter = DocumentConverter(format_options=format_options)\n",
-    "        print(f\"🚀 Docling Processor Initialized\")\n",
-    "        print(f\"   - OCR Engine: TESSERACT (Vietnamese)\")\n",
-    "        print(f\"   - Mode: HYBRID (Text Layer + OCR fallback)\")\n",
-    "        print(f\"   - Image Scale: 3.0 (High Resolution)\")\n",
-    "        print(f\"   - Table Mode: Accurate\")\n",
-    "        print(f\"   - Device: Auto-detect (GPU/CPU)\")\n",
-    "        print(f\"   - Timeout: {self.timeout}s per file\")\n",
-    "\n",
-    "    def clean_markdown(self, text: str) -> str:\n",
-    "        \"\"\"Post-processing: clean up the Markdown.\"\"\"\n",
-    "        # 1. Remove \"Trang x\" (page number) lines (safe)\n",
-    "        text = re.sub(r'\\n\\s*Trang\\s+\\d+\\s*\\n', '\\n', text)\n",
-    "        \n",
-    "        # 2. Collapse runs of blank lines (safe and necessary)\n",
-    "        text = re.sub(r'\\n{3,}', '\\n\\n', text)\n",
-    "        return text.strip()\n",
-    "\n",
-    "    def parse_directory(self, source_dir: str):\n",
-    "        print(f\"📂 Parsing PDFs in: {source_dir}\")\n",
-    "        source_path = Path(source_dir)\n",
-    "        pdf_files = list(source_path.rglob(\"*.pdf\"))\n",
-    "        print(f\"   Found {len(pdf_files)} PDF files.\")\n",
-    "        \n",
-    "        results = {\"total\": 0, \"parsed\": 0, \"skipped\": 0, \"errors\": 0}\n",
-    "        \n",
-    "        # Define timeout handler\n",
-    "        def timeout_handler(signum, frame):\n",
-    "            raise TimeoutError(\"Processing timeout\")\n",
-    "        \n",
-    "        # Register signal for timeout\n",
-    "        signal.signal(signal.SIGALRM, timeout_handler)\n",
-    "        \n",
-    "        for i, file_path in enumerate(pdf_files):\n",
-    "            filename = file_path.name\n",
-    "            \n",
-    "            # --- Preserve the directory structure ---\n",
-    "            # Compute the relative path: data/files/subdir/file.pdf -> subdir/file.pdf\n",
-    "            try:\n",
-    "                relative_path = file_path.relative_to(source_path)\n",
-    "            except ValueError:\n",
-    "                # Fallback if the file is not under source_dir (rare with rglob)\n",
-    "                relative_path = Path(filename)\n",
-    "\n",
-    "            # Build the matching output path: output_dir/subdir/file.md\n",
-    "            output_file_path = Path(self.output_dir) / relative_path.with_suffix(\".md\")\n",
-    "            \n",
-    "            # Create the subdirectory if it does not exist\n",
-    "            output_file_path.parent.mkdir(parents=True, exist_ok=True)\n",
-    "            \n",
-    "            output_path = str(output_file_path)\n",
-    "            \n",
-    "            # --- TỐI ƯU 1: SKIP NẾU ĐÃ CÓ KẾT QUẢ (Checkpoint) ---\n",
-    "            if os.path.exists(output_path):\n",
-    "                results[\"skipped\"] += 1\n",
-    "                if results[\"skipped\"] % 50 == 0:\n",
-    "                    print(f\"⏩ Skipped {results['skipped']} files (already processed)...\")\n",
-    "                continue\n",
-    "\n",
-    "            try:\n",
-    "                # Set timeout\n",
-    "                signal.alarm(self.timeout)\n",
-    "                \n",
-    "                # Convert\n",
-    "                result = self.converter.convert(str(file_path))\n",
-    "                \n",
-    "                # Cancel timeout\n",
-    "                signal.alarm(0)\n",
-    "                \n",
-    "                # Export to Markdown (Làm sạch dữ liệu ảnh rác)\n",
-    "                markdown_content = result.document.export_to_markdown(image_placeholder=\"\")\n",
-    "                \n",
-    "                # Post-processing cleaning\n",
-    "                markdown_content = self.clean_markdown(markdown_content)\n",
-    "                \n",
-    "                # Metadata Extraction (Chuẩn bị cho RAG)\n",
-    "                metadata_header = f\"\"\"---\n",
-    "filename: {filename}\n",
-    "filepath: {file_path}\n",
-    "page_count: {len(result.document.pages)}\n",
-    "processed_at: {os.path.getmtime(file_path)}\n",
-    "---\n",
-    "\n",
-    "\"\"\"\n",
-    "                final_content = metadata_header + markdown_content\n",
-    "                \n",
-    "                # Save\n",
-    "                with open(output_path, 'w', encoding='utf-8') as f:\n",
-    "                    f.write(final_content)\n",
-    "                \n",
-    "                results[\"parsed\"] += 1\n",
-    "                \n",
-    "                # --- TỐI ƯU 2: GIẢI PHÓNG RAM ---\n",
-    "                del result\n",
-    "                del markdown_content\n",
-    "                \n",
-    "                if (i+1) % 10 == 0:\n",
-    "                    gc.collect()\n",
-    "                    print(f\"✅ Processed {i+1}/{len(pdf_files)} files (Skipped: {results['skipped']})\")\n",
-    "                    \n",
-    "            except TimeoutError:\n",
-    "                print(f\"⏰ Timeout parsing {filename} (>{self.timeout}s)\")\n",
-    "                results[\"errors\"] += 1\n",
-    "            except Exception as e:\n",
-    "                print(f\"❌ Failed to parse {filename}: {e}\")\n",
-    "                results[\"errors\"] += 1\n",
-    "            finally:\n",
-    "                signal.alarm(0) # Ensure alarm is off\n",
-    "                \n",
-    "        return results"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "id": "0b87fec5",
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "# 5.5. Test Run on Specific File\n",
-    "# Chạy cell này để kiểm tra chất lượng trên file cụ thể (giống Marker)\n",
-    "import os\n",
-    "from pathlib import Path\n",
-    "\n",
-    "# Setup paths (đồng bộ với Cell 3)\n",
-    "source_dir = '/content/data_rag/files'\n",
-    "root = Path(source_dir)\n",
-    "\n",
-    "if not root.exists():\n",
-    "    print(f\"❌ Source directory not found: {root}\")\n",
-    "    print(\"⚠️ Hãy chạy Cell 3 (Extract Data) trước.\")\n",
-    "else:\n",
-    "    # Nếu zip giải nén ra 1 thư mục con 'files' thì đi vào đó\n",
-    "    nested_files = root / 'files'\n",
-    "    if nested_files.exists():\n",
-    "        root = nested_files\n",
-    "\n",
-    "    # Tìm file cụ thể\n",
-    "    target_filename = \"1.1. Kỹ thuật Cơ điện tử.pdf\"\n",
-    "    # Nếu bạn biết chắc thư mục con, điền ở đây (vd: 'quy_che'); nếu không chắc có thể để None\n",
-    "    target_subdir = \"quy_che\"\n",
-    "\n",
-    "    preferred_path = (root / target_subdir / target_filename) if target_subdir else (root / target_filename)\n",
-    "    target_path = preferred_path\n",
-    "\n",
-    "    if not target_path.exists():\n",
-    "        # Fallback: tự động tìm theo tên file trong toàn bộ cây thư mục\n",
-    "        matches = list(root.rglob(target_filename))\n",
-    "        if len(matches) == 1:\n",
-    "            target_path = matches[0]\n",
-    "            print(f\"🔎 Auto-found file at: {target_path}\")\n",
-    "        elif len(matches) > 1:\n",
-    "            print(\"⚠️ Found multiple matches. Showing up to 20:\")\n",
-    "            for p in matches[:20]:\n",
-    "                print(f\" - {p}\")\n",
-    "            target_path = matches[0]\n",
-    "            print(f\"➡️ Using first match: {target_path}\")\n",
-    "        else:\n",
-    "            print(f\"❌ File not found: {preferred_path}\")\n",
-    "            print(f\"Searching in: {root}\")\n",
-    "            # Gợi ý: in ra các thư mục cấp 1 để bạn chọn đúng target_subdir\n",
-    "            subdirs = sorted([p.name for p in root.iterdir() if p.is_dir()])\n",
-    "            if subdirs:\n",
-    "                print(\"📁 Top-level folders:\")\n",
-    "                for name in subdirs[:30]:\n",
-    "                    print(f\" - {name}\")\n",
-    "            raise FileNotFoundError(target_filename)\n",
-    "\n",
-    "    print(f\"🧪 Using target file: {target_path}\")\n",
-    "    \n",
-    "    # Initialize processor for test\n",
-    "    test_output_dir = '/content/data/test_output'\n",
-    "    os.makedirs(test_output_dir, exist_ok=True)\n",
-    "    \n",
-    "    print(\"🚀 Initializing processor for test run (OCR Enabled - Default)...\")\n",
-    "    # Use ColabDoclingProcessor defined in previous cell\n",
-    "    test_processor = ColabDoclingProcessor(\n",
-    "        output_dir=test_output_dir,\n",
-    "        use_ocr=True,\n",
-    "    )\n",
-    "    \n",
-    "    try:\n",
-    "        print(f\"⏳ Processing {target_path.name}...\")\n",
-    "        result = test_processor.converter.convert(str(target_path))\n",
-    "        markdown_content = result.document.export_to_markdown()\n",
-    "        \n",
-    "        # Save to local output\n",
-    "        output_file = Path(test_output_dir) / f\"{target_path.stem}.md\"\n",
-    "        with open(output_file, 'w', encoding='utf-8') as f:\n",
-    "            f.write(markdown_content)\n",
-    "            \n",
-    "        print(f\"💾 Saved local test file: {output_file}\")\n",
-    "        \n",
-    "        print(\"\\n\" + \"=\"*50)\n",
-    "        print(\"📄 RESULT PREVIEW (First 2000 characters)\")\n",
-    "        print(\"=\"*50)\n",
-    "        print(markdown_content[:2000])\n",
-    "        print(\"\\n\" + \"=\"*50)\n",
-    "        print(\"✅ Test completed! Hãy chạy cell tiếp theo để lưu kết quả lên Drive.\")\n",
-    "        \n",
-    "    except Exception as e:\n",
-    "        print(f\"❌ Test failed: {e}\")"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "id": "a46429ed",
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "# 5.6. Save Test Result to Google Drive\n",
-    "import shutil\n",
-    "\n",
-    "# Cấu hình đường dẫn lưu trên Drive (Lưu vào folder riêng để dễ so sánh với Marker)\n",
-    "drive_test_folder = '/content/drive/MyDrive/docling/test_result_docling'\n",
-    "\n",
-    "# Biến test_output_dir được định nghĩa ở cell 5.5\n",
-    "if 'test_output_dir' in locals() and os.path.exists(test_output_dir):\n",
-    "    # Tạo thư mục cha trên Drive nếu chưa có\n",
-    "    if not os.path.exists(os.path.dirname(drive_test_folder)):\n",
-    "        os.makedirs(os.path.dirname(drive_test_folder), exist_ok=True)\n",
-    "        \n",
-    "    print(f\"📂 Copying test results to: {drive_test_folder}\")\n",
-    "    \n",
-    "    # Sử dụng copytree với dirs_exist_ok=True để copy cả thư mục con và ghi đè nếu cần\n",
-    "    # Cách này giữ nguyên cấu trúc thư mục (subdir)\n",
-    "    try:\n",
-    "        shutil.copytree(test_output_dir, drive_test_folder, dirs_exist_ok=True)\n",
-    "        print(f\"   ✅ Copied entire folder structure successfully!\")\n",
-    "    except Exception as e:\n",
-    "        print(f\"   ❌ Error copying folder: {e}\")\n",
-    "            \n",
-    "    print(f\"\\n🎉 Done! Bạn có thể mở Drive để xem file markdown đầy đủ tại: {drive_test_folder}\")\n",
-    "else:\n",
-    "    print(\"❌ Không tìm thấy thư mục kết quả test hoặc biến 'test_output_dir' chưa được định nghĩa.\")\n",
-    "    print(\"⚠️ Hãy chạy cell 5.5 (Test Run) trước khi chạy cell này!\")"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "id": "8228498a",
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "# 6. Run Processing & Save Results\n",
-    "output_dir = '/content/data/docling_output'\n",
-    "# Bật OCR \n",
-    "processor = ColabDoclingProcessor(output_dir=output_dir, use_ocr=True) \n",
-    "\n",
-    "# Determine source directory (handle if zip extracted to subfolder)\n",
-    "source_dir = '/content/data_rag/files' \n",
-    "# Check if files are in a subfolder named 'files' inside the extraction path\n",
-    "if os.path.exists(os.path.join(source_dir, 'files')):\n",
-    "    source_dir = os.path.join(source_dir, 'files')\n",
-    "\n",
-    "# Run\n",
-    "processor.parse_directory(source_dir)\n",
-    "\n",
-    "# Zip output and save to Drive\n",
-    "output_zip_path = '/content/drive/MyDrive/docling/docling_output.zip'\n",
-    "print(f\"Zipping output to {output_zip_path}...\")\n",
-    "shutil.make_archive(output_zip_path.replace('.zip', ''), 'zip', output_dir)\n",
-    "print(\"🎉 Done! Check your Google Drive for docling_output.zip\")"
-   ]
-  }
- ],
- "metadata": {
-  "kernelspec": {
-   "display_name": "Python 3 (ipykernel)",
-   "language": "python",
-   "name": "python3"
-  }
- },
- "nbformat": 4,
- "nbformat_minor": 5
-}
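
Note on the removed notebook code: `parse_directory` guarded each conversion by registering a `SIGALRM` handler that raises `TimeoutError`, arming `signal.alarm(self.timeout)` before the call and clearing it afterwards. The same pattern can be sketched as a context manager. This is only a minimal sketch under stated assumptions, not the notebook's code: the names `time_limit` and `convert_with_timeout` are hypothetical, `signal.setitimer` is swapped in for `signal.alarm` to allow fractional seconds, and the approach only works on Unix and in the main thread.

```python
import signal
import time
from contextlib import contextmanager


@contextmanager
def time_limit(seconds: float):
    """Raise TimeoutError if the wrapped block runs longer than `seconds`.

    Hypothetical helper: Unix only, and it must run in the main thread.
    """
    def _handler(signum, frame):
        raise TimeoutError(f"Processing exceeded {seconds}s")

    # Install our handler and remember the previous one.
    previous = signal.signal(signal.SIGALRM, _handler)
    signal.setitimer(signal.ITIMER_REAL, seconds)
    try:
        yield
    finally:
        # Always cancel the timer and restore the previous handler,
        # mirroring the `finally: signal.alarm(0)` in the removed code.
        signal.setitimer(signal.ITIMER_REAL, 0)
        signal.signal(signal.SIGALRM, previous)


def convert_with_timeout(convert, path: str, seconds: float):
    """Run `convert(path)` and return its result, or None on timeout."""
    try:
        with time_limit(seconds):
            return convert(path)
    except TimeoutError:
        return None
```

Restoring the previous handler on exit keeps the helper from leaking a global signal handler into the rest of the process, which the original loop avoided only because it registered the handler once per run.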

core/gradio/gradio_rag.py CHANGED

@@ -1,188 +1,8 @@
-from __future__ import annotations
-import os
-import sys
-from dataclasses import dataclass
-from pathlib import Path
-from typing import Dict, List, Optional
-import gradio as gr
-from dotenv import find_dotenv, load_dotenv
-from openai import OpenAI
-# Thêm thư mục gốc vào Python path
-REPO_ROOT = Path(__file__).resolve().parents[2]
-if str(REPO_ROOT) not in sys.path:
-    sys.path.insert(0, str(REPO_ROOT))
-@dataclass
-class GradioConfig:
-    """Cấu hình Gradio server: host và port."""
-    server_host: str = "127.0.0.1"
-    server_port: int = 7860
-def _load_env() -> None:
-    """Tải biến môi trường từ file .env."""
-    dotenv_path = find_dotenv(usecwd=True) or ""
-    load_dotenv(dotenv_path=dotenv_path or None, override=False)
-# Import các module RAG
-from core.rag.embedding_model import EmbeddingConfig, QwenEmbeddings
-from core.rag.vector_store import ChromaConfig, ChromaVectorDB
-from core.rag.retrival import Retriever, RetrievalMode, get_retrieval_config
-from core.rag.generator import RAGContextBuilder, build_context, build_prompt, SYSTEM_PROMPT
-_load_env()
-# Cấu hình retrieval và LLM
-RETRIEVAL_MODE = RetrievalMode.HYBRID_RERANK  # Chế độ tìm kiếm
-LLM_MODEL = os.getenv("LLM_MODEL", "qwen/qwen3-32b")  # Model LLM
-LLM_API_BASE = "https://api.groq.com/openai/v1"  # Groq API endpoint
-LLM_API_KEY_ENV = "GROQ_API_KEY"  # Biến môi trường chứa API key
-# Khởi tạo cấu hình
-GRADIO_CFG = GradioConfig()
-RETRIEVAL_CFG = get_retrieval_config()
-class AppState:
-    """Quản lý trạng thái ứng dụng: database, retriever, LLM client."""
-    def __init__(self) -> None:
-        self.db: Optional[ChromaVectorDB] = None
-        self.retriever: Optional[Retriever] = None
-        self.rag_builder: Optional[RAGContextBuilder] = None
-        self.client: Optional[OpenAI] = None
-STATE = AppState()  # Singleton state
-def _init_resources() -> None:
-    """Khởi tạo các tài nguyên: DB, Retriever, LLM client (lazy init)."""
-    if STATE.db is not None:
-        return
-    print(f" Đang khởi tạo Database & Re-ranker...")
-    print(f" Retrieval Mode: {RETRIEVAL_MODE.value}")
-    # Khởi tạo embedding và database
-    emb = QwenEmbeddings(EmbeddingConfig())
-    db_cfg = ChromaConfig()
-    STATE.db = ChromaVectorDB(embedder=emb, config=db_cfg)
-    STATE.retriever = Retriever(vector_db=STATE.db)
-    # Khởi tạo LLM client
-    api_key = (os.getenv(LLM_API_KEY_ENV) or "").strip()
-    if not api_key:
-        raise RuntimeError(f"Missing {LLM_API_KEY_ENV}")
-    STATE.client = OpenAI(api_key=api_key, base_url=LLM_API_BASE)
-    # Khởi tạo RAG builder
-    STATE.rag_builder = RAGContextBuilder(retriever=STATE.retriever)
-    print(" Đã sẵn sàng!")
-def rag_chat(message: str, history: List[Dict[str, str]] | None = None):
-    """Xử lý chat: retrieve documents -> gọi LLM -> stream response"""
-    _init_resources()
-    assert STATE.db is not None
-    assert STATE.client is not None
-    assert STATE.retriever is not None
-    assert STATE.rag_builder is not None
-    # Retrieve và chuẩn bị context
-    prepared = STATE.rag_builder.retrieve_and_prepare(
-        message,
-        k=RETRIEVAL_CFG.top_k,
-        initial_k=RETRIEVAL_CFG.initial_k,
-        mode=RETRIEVAL_MODE.value,
-    )
-    results = prepared["results"]
-    if not results:
-        yield "Xin lỗi, tôi không tìm thấy thông tin phù hợp trong dữ liệu."
-        return
-    # Gọi LLM với streaming
-    completion = STATE.client.chat.completions.create(
-        model=LLM_MODEL,
-        messages=[{"role": "user", "content": prepared["prompt"]}],
-        temperature=0.0,
-        max_tokens=4096,
-        stream=True,
-    )
-    # Stream response
-    acc = ""
-    for chunk in completion:
-        delta = getattr(chunk.choices[0].delta, "content", "") or ""
-        if delta:
-            acc += delta
-            yield acc
-    # Thêm debug info về các documents đã retrieve
-    debug_info = f"\n\n---\n\n**Retrieved (Top {len(results)} | Mode: {RETRIEVAL_MODE.value})**\n\n"
-    for i, r in enumerate(results, 1):
-        md = r.get("metadata", {})
-        content = r.get("content", "").strip()
-        rerank_score = r.get("rerank_score")
-        distance = r.get("distance")
-        # Trích xuất metadata
-        source = md.get("source_file", "N/A")
-        doc_type = md.get("document_type", "N/A")
-        header = md.get("header_path", "")
-        cohorts = md.get("applicable_cohorts", "")
-        program = md.get("program_name", "")
-        issued_year = md.get("issued_year", "")
-        # Format score
-        score_info = ""
-        if rerank_score is not None:
-            score_info += f"Rerank: `{rerank_score:.4f}` "
-        if distance is not None:
-            score_info += f"Distance: `{distance:.4f}`"
-        if not score_info:
-            score_info = f"Rank: `{r.get('final_rank', i)}`"
-        # Format metadata
-        meta_parts = [f"**Nguồn:** {source}", f"**Loại:** {doc_type}"]
-        if issued_year:
-            meta_parts.append(f"**Năm:** {issued_year}")
-        if cohorts:
-            meta_parts.append(f"**Áp dụng:** {cohorts}")
-        if program:
-            meta_parts.append(f"**CTĐT:** {program}")
-        debug_info += f"**#{i}** | {score_info}\n"
-        debug_info += f"   - {' | '.join(meta_parts)}\n"
-        if header and header != "/":
-            debug_info += f"   - **Mục:** {header[:80]}{'...' if len(header) > 80 else ''}\n"
-        debug_info += f"   - **Content:** {content[:200]}{'...' if len(content) > 200 else ''}\n\n"
-    yield acc + debug_info
-# Tạo giao diện Gradio
-demo = gr.ChatInterface(
-    fn=rag_chat,
-    title=f"HUST RAG Assistant",
-    description=f"Trợ lý học vụ Đại học Bách khoa Hà Nội",
-    examples=[
-        "Điều kiện tốt nghiệp đại học là gì?",
-        "Điều kiện để đổi ngành là gì?",
-        "Làm thế nào để đăng ký hoãn thi?",
-    ],
-)
 if __name__ == "__main__":
     print(f"\n{'='*60}")
-    print(f"Starting HUST RAG Assistant")
     print(f"{'='*60}\n")
     demo.launch(
         server_name=GRADIO_CFG.server_host,

+from core.gradio.user_gradio import demo_debug as demo, GRADIO_CFG
 if __name__ == "__main__":
     print(f"\n{'='*60}")
+    print(f"Starting HUST RAG Assistant (Debug Mode)")
     print(f"{'='*60}\n")
     demo.launch(
         server_name=GRADIO_CFG.server_host,

core/gradio/user_gradio.py CHANGED

@@ -9,6 +9,7 @@ from dotenv import find_dotenv, load_dotenv
 from openai import OpenAI
 import re
 REPO_ROOT = Path(__file__).resolve().parents[2]
 if str(REPO_ROOT) not in sys.path:
     sys.path.insert(0, str(REPO_ROOT))
@@ -19,26 +20,27 @@ class GradioConfig:
     server_host: str = "127.0.0.1"
     server_port: int = 7860
 def _load_env() -> None:
     dotenv_path = find_dotenv(usecwd=True) or ""
     load_dotenv(dotenv_path=dotenv_path or None, override=False)
 from core.rag.embedding_model import EmbeddingConfig, QwenEmbeddings
 from core.rag.vector_store import ChromaConfig, ChromaVectorDB
-from core.rag.retrival import Retriever, RetrievalMode, get_retrieval_config
 from core.rag.generator import RAGContextBuilder, build_context, build_prompt, SYSTEM_PROMPT
 _load_env()
-RETRIEVAL_MODE = RetrievalMode.HYBRID_RERANK  # Test with debug logs
-# LLM Config (hardcoded sau khi xóa LLMConfig từ generator)
 LLM_MODEL = os.getenv("LLM_MODEL", "qwen/qwen3-32b")
 LLM_API_BASE = "https://api.groq.com/openai/v1"
 LLM_API_KEY_ENV = "GROQ_API_KEY"
-# Load retrieval config
 GRADIO_CFG = GradioConfig()
 RETRIEVAL_CFG = get_retrieval_config()
@@ -51,39 +53,84 @@ class AppState:
         self.client: Optional[OpenAI] = None
-STATE = AppState()
 def _init_resources() -> None:
     if STATE.db is not None:
         return
-    print(f" Đang khởi tạo Database & Re-ranker...")
     print(f" Retrieval Mode: {RETRIEVAL_MODE.value}")
     emb = QwenEmbeddings(EmbeddingConfig())
     db_cfg = ChromaConfig()
-    STATE.db = ChromaVectorDB(
-        embedder=emb,
-        config=db_cfg,
-    )
     STATE.retriever = Retriever(vector_db=STATE.db)
-    # LLM Client
     api_key = (os.getenv(LLM_API_KEY_ENV) or "").strip()
     if not api_key:
         raise RuntimeError(f"Missing {LLM_API_KEY_ENV}")
     STATE.client = OpenAI(api_key=api_key, base_url=LLM_API_BASE)
-    # RAGContextBuilder
     STATE.rag_builder = RAGContextBuilder(retriever=STATE.retriever)
-    print(" Đã sẵn sàng!")
-def rag_chat(message: str, history: List[Dict[str, str]] | None = None):
     _init_resources()
     assert STATE.db is not None
@@ -91,7 +138,7 @@ def rag_chat(message: str, history: List[Dict[str, str]] | None = None):
     assert STATE.retriever is not None
     assert STATE.rag_builder is not None
-    # Retrieve và prepare context
     prepared = STATE.rag_builder.retrieve_and_prepare(
         message,
         k=RETRIEVAL_CFG.top_k,
@@ -104,7 +151,7 @@ def rag_chat(message: str, history: List[Dict[str, str]] | None = None):
         yield "Xin lỗi, tôi không tìm thấy thông tin phù hợp trong dữ liệu."
         return
-    # LLM streaming để generate answer
     completion = STATE.client.chat.completions.create(
         model=LLM_MODEL,
         messages=[{"role": "user", "content": prepared["prompt"]}],
@@ -113,36 +160,54 @@ def rag_chat(message: str, history: List[Dict[str, str]] | None = None):
         stream=True,
     )
     acc = ""
     for chunk in completion:
         delta = getattr(chunk.choices[0].delta, "content", "") or ""
         if delta:
             acc += delta
-            # Lọc bỏ phần <think>...</think> trước khi yield
-            display_text = _filter_think_tags(acc)
-            yield display_text
-    # Yield kết quả cuối cùng (đã lọc think)
-    yield _filter_think_tags(acc)
-def _filter_think_tags(text: str) -> str:
-    # Loại bỏ các block <think>...</think> (kể cả multi-line)
-    filtered = re.sub(r'<think>.*?</think>', '', text, flags=re.DOTALL)
-    # Loại bỏ khoảng trắng thừa đầu dòng
-    filtered = filtered.strip()
-    return filtered
-# Create Gradio interface
 demo = gr.ChatInterface(
-    fn=rag_chat,
-    title=f"HUST RAG Assistant",
-    description=f"Trợ lý học vụ Đại học Bách khoa Hà Nội",
     examples=[
-        "Sinh viên vi phạm quy chế thi thì bị xử lý như thế nào?",
         "Điều kiện để đổi ngành là gì?",
         "Làm thế nào để đăng ký hoãn thi?",
     ],

 from openai import OpenAI
 import re
+# Add project root to Python path
 REPO_ROOT = Path(__file__).resolve().parents[2]
 if str(REPO_ROOT) not in sys.path:
     sys.path.insert(0, str(REPO_ROOT))
     server_host: str = "127.0.0.1"
     server_port: int = 7860
 def _load_env() -> None:
     dotenv_path = find_dotenv(usecwd=True) or ""
     load_dotenv(dotenv_path=dotenv_path or None, override=False)
+# RAG module imports
 from core.rag.embedding_model import EmbeddingConfig, QwenEmbeddings
 from core.rag.vector_store import ChromaConfig, ChromaVectorDB
+from core.rag.retrieval import Retriever, RetrievalMode, get_retrieval_config
 from core.rag.generator import RAGContextBuilder, build_context, build_prompt, SYSTEM_PROMPT
 _load_env()
+# Retrieval and LLM configuration
+RETRIEVAL_MODE = RetrievalMode.HYBRID_RERANK
 LLM_MODEL = os.getenv("LLM_MODEL", "qwen/qwen3-32b")
 LLM_API_BASE = "https://api.groq.com/openai/v1"
 LLM_API_KEY_ENV = "GROQ_API_KEY"
+# Initialize configs
 GRADIO_CFG = GradioConfig()
 RETRIEVAL_CFG = get_retrieval_config()
         self.client: Optional[OpenAI] = None
+STATE = AppState()  # Singleton state
 def _init_resources() -> None:
     if STATE.db is not None:
         return
+    print(f" Initializing Database & Reranker...")
     print(f" Retrieval Mode: {RETRIEVAL_MODE.value}")
+    # Initialize embedding and vector database
     emb = QwenEmbeddings(EmbeddingConfig())
     db_cfg = ChromaConfig()
+    STATE.db = ChromaVectorDB(embedder=emb, config=db_cfg)
     STATE.retriever = Retriever(vector_db=STATE.db)
+    # Initialize LLM client
     api_key = (os.getenv(LLM_API_KEY_ENV) or "").strip()
     if not api_key:
         raise RuntimeError(f"Missing {LLM_API_KEY_ENV}")
     STATE.client = OpenAI(api_key=api_key, base_url=LLM_API_BASE)
+    # Initialize RAG context builder
     STATE.rag_builder = RAGContextBuilder(retriever=STATE.retriever)
+    print(" Ready!")
+def _filter_think_tags(text: str) -> str:
+    filtered = re.sub(r'<think>.*?</think>', '', text, flags=re.DOTALL)
+    return filtered.strip()
+def _format_debug_info(results: List[Dict]) -> str:
+    debug_info = f"\n\n---\n\n**Retrieved (Top {len(results)} | Mode: {RETRIEVAL_MODE.value})**\n\n"
+    for i, r in enumerate(results, 1):
+        md = r.get("metadata", {})
+        content = r.get("content", "").strip()
+        rerank_score = r.get("rerank_score")
+        distance = r.get("distance")
+        # Extract metadata fields
+        source = md.get("source_file", "N/A")
+        doc_type = md.get("document_type", "N/A")
+        header = md.get("header_path", "")
+        cohorts = md.get("applicable_cohorts", "")
+        program = md.get("program_name", "")
+        issued_year = md.get("issued_year", "")
+        # Format relevance score
+        score_info = ""
+        if rerank_score is not None:
+            score_info += f"Rerank: `{rerank_score:.4f}` "
+        if distance is not None:
+            score_info += f"Distance: `{distance:.4f}`"
+        if not score_info:
+            score_info = f"Rank: `{r.get('final_rank', i)}`"
+        # Format metadata labels
+        meta_parts = [f"**Nguồn:** {source}", f"**Loại:** {doc_type}"]
+        if issued_year:
+            meta_parts.append(f"**Năm:** {issued_year}")
+        if cohorts:
+            meta_parts.append(f"**Áp dụng:** {cohorts}")
+        if program:
+            meta_parts.append(f"**CTĐT:** {program}")
+        debug_info += f"**#{i}** | {score_info}\n"
+        debug_info += f"   - {' | '.join(meta_parts)}\n"
+        if header and header != "/":
+            debug_info += f"   - **Mục:** {header[:80]}{'...' if len(header) > 80 else ''}\n"
+        debug_info += f"   - **Content:** {content[:200]}{'...' if len(content) > 200 else ''}\n\n"
+    return debug_info
+def rag_chat(message: str, history: List[Dict[str, str]] | None = None, *, debug: bool = False):
     _init_resources()
     assert STATE.db is not None
     assert STATE.retriever is not None
     assert STATE.rag_builder is not None
+    # Retrieve and prepare context
     prepared = STATE.rag_builder.retrieve_and_prepare(
         message,
         k=RETRIEVAL_CFG.top_k,
         yield "Xin lỗi, tôi không tìm thấy thông tin phù hợp trong dữ liệu."
         return
+    # Call LLM with streaming
     completion = STATE.client.chat.completions.create(
         model=LLM_MODEL,
         messages=[{"role": "user", "content": prepared["prompt"]}],
         stream=True,
     )
+    # Stream response tokens
     acc = ""
     for chunk in completion:
         delta = getattr(chunk.choices[0].delta, "content", "") or ""
         if delta:
             acc += delta
+            # Filter out <think>...</think> blocks before yielding
+            yield _filter_think_tags(acc)
+    # Yield final result
+    final_text = _filter_think_tags(acc)
+    if debug:
+        # Append debug info about retrieved documents
+        final_text += _format_debug_info(results)
+    yield final_text
+# --- User interface (production) ---
+def _rag_chat_user(message: str, history: List[Dict[str, str]] | None = None):
+    yield from rag_chat(message, history, debug=False)
+# --- Debug interface (development) ---
+def _rag_chat_debug(message: str, history: List[Dict[str, str]] | None = None):
+    yield from rag_chat(message, history, debug=True)
+# Production interface (no debug info)
 demo = gr.ChatInterface(
+    fn=_rag_chat_user,
+    title="HUST RAG Assistant",
+    description="Trợ lý học vụ Đại học Bách khoa Hà Nội",
+    examples=[
+        "Cách tính điểm học tập học kỳ ?",
+        "Điều kiện để đổi ngành là gì?",
+        "Làm thế nào để đăng ký hoãn thi?",
+    ],
+)
+# Debug interface (shows retrieved docs info)
+demo_debug = gr.ChatInterface(
+    fn=_rag_chat_debug,
+    title="HUST RAG Assistant (Debug)",
+    description="Trợ lý học vụ Đại học Bách khoa Hà Nội - Chế độ debug",
     examples=[
+        "Điều kiện tốt nghiệp đại học là gì?",
         "Điều kiện để đổi ngành là gì?",
         "Làm thế nào để đăng ký hoãn thi?",
     ],
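
Note on `_filter_think_tags`: the pattern `<think>.*?</think>` only matches once the closing tag has arrived. While the model is still streaming its reasoning, the accumulated text contains an unclosed `<think>` block, the regex does not match, and the partial reasoning is yielded to the UI until `</think>` shows up. A minimal sketch of a streaming-safe variant follows. The name `filter_think_streaming` is hypothetical and not part of this commit, and the sketch assumes the model emits literal `<think>` and `</think>` tags.

```python
import re

# Completed reasoning blocks, possibly spanning multiple lines.
_CLOSED_THINK = re.compile(r"<think>.*?</think>", flags=re.DOTALL)
# An opening tag with no closing tag yet, running to the end of the text.
_OPEN_THINK = re.compile(r"<think>.*\Z", flags=re.DOTALL)


def filter_think_streaming(text: str) -> str:
    """Hide both finished and still-streaming <think> blocks."""
    # First drop every block that is already closed.
    visible = _CLOSED_THINK.sub("", text)
    # Then hide a trailing block whose closing tag has not arrived yet.
    visible = _OPEN_THINK.sub("", visible)
    return visible.strip()
```

A tag that is itself split across chunks (for example a trailing `<thi`) would still be shown briefly. Handling that case would need extra buffering and is left out of this sketch.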

core/hash_file/hash_data_goc.py CHANGED

@@ -9,20 +9,19 @@ if str(PROJECT_ROOT) not in sys.path:
 from core.hash_file.hash_file import HashProcessor
-# HuggingFace repo chứa PDF gốc
 HF_RAW_PDF_REPO = "hungnha/Do_An_Dataset"
 def download_from_hf(cache_dir: Path) -> Path:
-    """Tải PDF từ HuggingFace, trả về đường dẫn tới folder data_rag."""
     from huggingface_hub import snapshot_download
-    # Kiểm tra cache đã tồn tại chưa
     if cache_dir.exists() and any(cache_dir.iterdir()):
-        print(f"Cache đã tồn tại: {cache_dir}")
         return cache_dir / "data_rag"
-    print(f"Đang tải từ HuggingFace: {HF_RAW_PDF_REPO}")
     snapshot_download(
         repo_id=HF_RAW_PDF_REPO,
         repo_type="dataset",
@@ -33,7 +32,6 @@ def download_from_hf(cache_dir: Path) -> Path:
 def load_existing_hashes(path: Path) -> dict:
-    """Đọc hash index cũ từ file JSON."""
     if not path.exists():
         return {}
     try:
@@ -44,10 +42,9 @@ def load_existing_hashes(path: Path) -> dict:
 def process_pdfs(source_root: Path, dest_dir: Path, existing_hashes: dict) -> tuple:
-    """Copy PDFs và tính hash. Trả về (results, processed, skipped)."""
     hasher = HashProcessor(verbose=False)
     pdf_files = list(source_root.rglob("*.pdf"))
-    print(f"Tìm thấy {len(pdf_files)} file PDF\n")
     results, processed, skipped = [], 0, 0
@@ -56,7 +53,7 @@ def process_pdfs(source_root: Path, dest_dir: Path, existing_hashes: dict) -> tu
         dest = dest_dir / rel_path
         dest.parent.mkdir(parents=True, exist_ok=True)
-        # Bỏ qua nếu file không thay đổi (hash khớp)
         if dest.exists() and rel_path in existing_hashes:
             current_hash = hasher.get_file_hash(str(dest))
             if current_hash == existing_hashes[rel_path]:
@@ -64,7 +61,7 @@ def process_pdfs(source_root: Path, dest_dir: Path, existing_hashes: dict) -> tu
                 skipped += 1
                 continue
-        # Copy và tính hash
         try:
             shutil.copy2(src, dest)
             file_hash = hasher.get_file_hash(str(dest))
@@ -72,20 +69,20 @@ def process_pdfs(source_root: Path, dest_dir: Path, existing_hashes: dict) -> tu
                 results.append({'filename': rel_path, 'hash': file_hash, 'index': idx})
                 processed += 1
         except Exception as e:
-            print(f"Lỗi: {rel_path} - {e}")
-        # Hiển thị tiến độ
         if (idx + 1) % 20 == 0:
-            print(f"Tiến độ: {idx + 1}/{len(pdf_files)}")
     return results, processed, skipped
 def main():
     import argparse
-    parser = argparse.ArgumentParser(description="Tải PDF và tạo hash index")
-    parser.add_argument("--source", type=str, help="Đường dẫn local tới PDFs (bỏ qua tải HF)")
-    parser.add_argument("--download-only", action="store_true", help="Chỉ tải về, không copy")
     args = parser.parse_args()
     data_dir = PROJECT_ROOT / "data"
@@ -93,34 +90,34 @@ def main():
     files_dir.mkdir(parents=True, exist_ok=True)
     hash_file = data_dir / "hash_data_goc_index.json"
-    # Xác định thư mục nguồn
     if args.source:
         source_root = Path(args.source)
         if not source_root.exists():
-            return print(f"Không tìm thấy thư mục nguồn: {source_root}")
     else:
-        # Tải từ HuggingFace
         source_root = download_from_hf(data_dir / "raw_pdf_cache")
         if args.download_only:
-            return print(f"PDF đã cache tại: {source_root}")
     if not source_root.exists():
-        return print(f"Không tìm thấy thư mục PDF: {source_root}")
-    # Xử lý
     existing = load_existing_hashes(hash_file)
-    print(f"Đã tải {len(existing)} hash từ index cũ")
     results, processed, skipped = process_pdfs(source_root, files_dir, existing)
-    # Lưu kết quả
     hash_file.write_text(json.dumps({
         'train': results,
         'total_files': len(results)
     }, ensure_ascii=False, indent=2), encoding='utf-8')
-    print(f"\nHoàn tất! Tổng: {len(results)} | Mới: {processed} | Bỏ qua: {skipped}")
-    print(f"File index: {hash_file}")
 if __name__ == "__main__":

 from core.hash_file.hash_file import HashProcessor
+# HuggingFace repo containing raw PDFs
 HF_RAW_PDF_REPO = "hungnha/Do_An_Dataset"
 def download_from_hf(cache_dir: Path) -> Path:
     from huggingface_hub import snapshot_download
+    # Check if cache already exists
     if cache_dir.exists() and any(cache_dir.iterdir()):
+        print(f"Cache already exists: {cache_dir}")
         return cache_dir / "data_rag"
+    print(f"Downloading from HuggingFace: {HF_RAW_PDF_REPO}")
     snapshot_download(
         repo_id=HF_RAW_PDF_REPO,
         repo_type="dataset",
 def load_existing_hashes(path: Path) -> dict:
     if not path.exists():
         return {}
     try:
 def process_pdfs(source_root: Path, dest_dir: Path, existing_hashes: dict) -> tuple:
     hasher = HashProcessor(verbose=False)
     pdf_files = list(source_root.rglob("*.pdf"))
+    print(f"Found {len(pdf_files)} PDF files\n")
     results, processed, skipped = [], 0, 0
         dest = dest_dir / rel_path
         dest.parent.mkdir(parents=True, exist_ok=True)
+        # Skip if file unchanged (hash matches)
         if dest.exists() and rel_path in existing_hashes:
             current_hash = hasher.get_file_hash(str(dest))
             if current_hash == existing_hashes[rel_path]:
                 skipped += 1
                 continue
+        # Copy and compute hash
         try:
             shutil.copy2(src, dest)
             file_hash = hasher.get_file_hash(str(dest))
                 results.append({'filename': rel_path, 'hash': file_hash, 'index': idx})
                 processed += 1
         except Exception as e:
+            print(f"Error: {rel_path} - {e}")
+        # Display progress
         if (idx + 1) % 20 == 0:
+            print(f"Progress: {idx + 1}/{len(pdf_files)}")
     return results, processed, skipped
 def main():
     import argparse
+    parser = argparse.ArgumentParser(description="Download PDFs and build hash index")
+    parser.add_argument("--source", type=str, help="Local path to PDFs (skip HF download)")
+    parser.add_argument("--download-only", action="store_true", help="Download only, no copy")
     args = parser.parse_args()
     data_dir = PROJECT_ROOT / "data"
     files_dir.mkdir(parents=True, exist_ok=True)
     hash_file = data_dir / "hash_data_goc_index.json"
+    # Determine source directory
     if args.source:
         source_root = Path(args.source)
         if not source_root.exists():
+            return print(f"Source directory not found: {source_root}")
     else:
+        # Download from HuggingFace
         source_root = download_from_hf(data_dir / "raw_pdf_cache")
         if args.download_only:
+            return print(f"PDFs cached at: {source_root}")
     if not source_root.exists():
+        return print(f"PDF directory not found: {source_root}")
+    # Process
     existing = load_existing_hashes(hash_file)
+    print(f"Loaded {len(existing)} hashes from existing index")
     results, processed, skipped = process_pdfs(source_root, files_dir, existing)
+    # Save results
     hash_file.write_text(json.dumps({
         'train': results,
         'total_files': len(results)
     }, ensure_ascii=False, indent=2), encoding='utf-8')
+    print(f"\nDone! Total: {len(results)} | New: {processed} | Skipped: {skipped}")
+    print(f"Index file: {hash_file}")
 if __name__ == "__main__":
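
Reviewer note: the skip check in `process_pdfs` relies on `HashProcessor.get_file_hash`, whose read loop is elided from this hunk. A minimal sketch of how block-wise hashing and the "unchanged file" test could fit together is shown below. The helper names `file_sha256` and `is_unchanged` are hypothetical, and the `iter(...)` read loop is an assumption about the elided code, not a copy of it.

```python
import hashlib
from pathlib import Path
from typing import Dict, Optional

CHUNK_SIZE = 8192  # same 8KB block size that hash_file.py declares


def file_sha256(path: Path) -> Optional[str]:
    """Hash a file block by block so a large PDF is never loaded whole."""
    digest = hashlib.sha256()
    try:
        with path.open("rb") as f:
            # iter() with a sentinel keeps reading until f.read() returns b""
            for block in iter(lambda: f.read(CHUNK_SIZE), b""):
                digest.update(block)
    except OSError:
        return None  # mirrors the diff: unreadable files yield no hash
    return digest.hexdigest()


def is_unchanged(dest: Path, rel_path: str, known: Dict[str, str]) -> bool:
    """True only when dest exists and its hash equals the indexed hash."""
    if not dest.exists() or rel_path not in known:
        return False
    return file_sha256(dest) == known[rel_path]
```

A file is skipped only when both conditions hold, so a missing index entry or a modified copy always falls through to the copy-and-rehash path.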

core/hash_file/hash_file.py CHANGED Viewed

@@ -9,23 +9,22 @@ from pathlib import Path
 from typing import Dict, List, Optional
 from datetime import datetime
-# Hằng số
-CHUNK_SIZE = 8192  # Đọc file theo chunk 8KB
 DEFAULT_FILE_EXTENSION = '.pdf'
 class HashProcessor:
-    """Lớp xử lý hash cho files - dùng để phát hiện thay đổi và tránh xử lý lại."""
     def __init__(self, verbose: bool = True):
-        """Khởi tạo HashProcessor."""
         self.verbose = verbose
         self.logger = logging.getLogger(__name__)
         if not verbose:
             self.logger.setLevel(logging.WARNING)
     def get_file_hash(self, path: str) -> Optional[str]:
-        """Tính SHA256 hash của một file."""
         h = hashlib.sha256()
         try:
             with open(path, "rb") as f:
@@ -33,10 +32,10 @@ class HashProcessor:
                     h.update(chunk)
             return h.hexdigest()
         except (IOError, OSError) as e:
-            self.logger.error(f"Lỗi khi đọc file {path}: {e}")
             return None
         except Exception as e:
-            self.logger.error(f"Lỗi không xác định khi xử lý file {path}: {e}")
             return None
     def scan_files_for_hash(
@@ -45,13 +44,13 @@ class HashProcessor:
         file_extension: str = DEFAULT_FILE_EXTENSION,
         recursive: bool = False
     ) -> Dict[str, List[Dict[str, str]]]:
-        """Quét thư mục và tính hash cho tất cả files."""
         source_path = Path(source_dir)
         if not source_path.exists():
-            raise FileNotFoundError(f"Thư mục không tồn tại: {source_dir}")
         hash_to_files = defaultdict(list)
-        self.logger.info(f"Đang quét file trong: {source_dir}")
         pattern = f"**/*{file_extension}" if recursive else f"*{file_extension}"
@@ -62,7 +61,7 @@ class HashProcessor:
                 if not file_path.is_file():
                     continue
-                self.logger.info(f"Đang tính hash cho: {file_path.name}")
                 file_hash = self.get_file_hash(str(file_path))
                 if file_hash:
@@ -72,53 +71,49 @@ class HashProcessor:
                         'size': file_path.stat().st_size
                     })
         except PermissionError as e:
-            self.logger.error(f"Lỗi quyền truy cập: {e}")
             raise
         return hash_to_files
     def load_processed_index(self, index_file: str) -> Dict:
-        """Đọc file index đã xử lý từ JSON."""
         if os.path.exists(index_file):
             try:
                 with open(index_file, "r", encoding="utf-8") as f:
                     return json.load(f)
             except json.JSONDecodeError as e:
-                self.logger.error(f"Lỗi đọc file index {index_file}: {e}")
                 return {}
             except Exception as e:
-                self.logger.error(f"Lỗi không xác định khi đọc index: {e}")
                 return {}
         return {}
     def save_processed_index(self, index_file: str, processed_hashes: Dict) -> None:
-        """Lưu index đã xử lý vào file JSON (atomic write).
-        Ghi vào file tạm trước, sau đó rename để đảm bảo an toàn.
-        """
         temp_name = None
         try:
             os.makedirs(os.path.dirname(index_file), exist_ok=True)
-            # Ghi vào file tạm trước
             dir_name = os.path.dirname(index_file)
             with tempfile.NamedTemporaryFile('w', dir=dir_name, delete=False, encoding='utf-8') as tmp_file:
                 json.dump(processed_hashes, tmp_file, indent=2, ensure_ascii=False)
                 temp_name = tmp_file.name
-            # Rename file tạm thành file chính (atomic operation trên POSIX)
             shutil.move(temp_name, index_file)
-            self.logger.info(f"Đã lưu index file an toàn: {index_file}")
         except Exception as e:
-            self.logger.error(f"Lỗi khi lưu index file {index_file}: {e}")
             if temp_name and os.path.exists(temp_name):
                 os.remove(temp_name)
     def get_current_timestamp(self) -> str:
-        """Lấy timestamp hiện tại theo định dạng ISO."""
         return datetime.now().isoformat()
     def get_string_hash(self, text: str) -> str:
-        """Tính SHA256 hash của một chuỗi text."""
         return hashlib.sha256(text.encode('utf-8')).hexdigest()

 from typing import Dict, List, Optional
 from datetime import datetime
+# Constants
+CHUNK_SIZE = 8192  # Read files in 8KB chunks
 DEFAULT_FILE_EXTENSION = '.pdf'
 class HashProcessor:
     def __init__(self, verbose: bool = True):
         self.verbose = verbose
         self.logger = logging.getLogger(__name__)
         if not verbose:
             self.logger.setLevel(logging.WARNING)
     def get_file_hash(self, path: str) -> Optional[str]:
         h = hashlib.sha256()
         try:
             with open(path, "rb") as f:
                     h.update(chunk)
             return h.hexdigest()
         except (IOError, OSError) as e:
+            self.logger.error(f"Error reading file {path}: {e}")
             return None
         except Exception as e:
+            self.logger.error(f"Unexpected error processing file {path}: {e}")
             return None
     def scan_files_for_hash(
         file_extension: str = DEFAULT_FILE_EXTENSION,
         recursive: bool = False
     ) -> Dict[str, List[Dict[str, str]]]:
         source_path = Path(source_dir)
         if not source_path.exists():
+            raise FileNotFoundError(f"Directory not found: {source_dir}")
         hash_to_files = defaultdict(list)
+        self.logger.info(f"Scanning files in: {source_dir}")
         pattern = f"**/*{file_extension}" if recursive else f"*{file_extension}"
                 if not file_path.is_file():
                     continue
+                self.logger.info(f"Computing hash for: {file_path.name}")
                 file_hash = self.get_file_hash(str(file_path))
                 if file_hash:
                         'size': file_path.stat().st_size
                     })
         except PermissionError as e:
+            self.logger.error(f"Permission error: {e}")
             raise
         return hash_to_files
     def load_processed_index(self, index_file: str) -> Dict:
         if os.path.exists(index_file):
             try:
                 with open(index_file, "r", encoding="utf-8") as f:
                     return json.load(f)
             except json.JSONDecodeError as e:
+                self.logger.error(f"Error reading index file {index_file}: {e}")
                 return {}
             except Exception as e:
+                self.logger.error(f"Unexpected error reading index: {e}")
                 return {}
         return {}
     def save_processed_index(self, index_file: str, processed_hashes: Dict) -> None:
         temp_name = None
         try:
             os.makedirs(os.path.dirname(index_file), exist_ok=True)
+            # Write to temp file first
             dir_name = os.path.dirname(index_file)
             with tempfile.NamedTemporaryFile('w', dir=dir_name, delete=False, encoding='utf-8') as tmp_file:
                 json.dump(processed_hashes, tmp_file, indent=2, ensure_ascii=False)
                 temp_name = tmp_file.name
+            # Atomic rename temp to target (POSIX)
             shutil.move(temp_name, index_file)
+            self.logger.info(f"Saved index file safely: {index_file}")
         except Exception as e:
+            self.logger.error(f"Error saving index file {index_file}: {e}")
             if temp_name and os.path.exists(temp_name):
                 os.remove(temp_name)
     def get_current_timestamp(self) -> str:
         return datetime.now().isoformat()
     def get_string_hash(self, text: str) -> str:
         return hashlib.sha256(text.encode('utf-8')).hexdigest()
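
Reviewer note: `save_processed_index` writes to a temporary file in the target directory and then moves it over the index. The sketch below shows the same write-then-rename pattern using `os.replace`, which overwrites the destination in a single step when both paths are on the same filesystem. The function name `write_json_atomic` is hypothetical, and unlike the method in the diff (which logs and swallows errors) this version re-raises so callers notice a failed save.

```python
import json
import os
import tempfile
from pathlib import Path


def write_json_atomic(target: Path, payload: dict) -> None:
    """Write JSON to a temp file beside target, then rename it into place."""
    target.parent.mkdir(parents=True, exist_ok=True)
    # Creating the temp file in the same directory keeps the rename on one filesystem.
    fd, tmp_name = tempfile.mkstemp(dir=target.parent, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as tmp:
            json.dump(payload, tmp, indent=2, ensure_ascii=False)
            tmp.flush()
            os.fsync(tmp.fileno())  # push the bytes to disk before renaming
        os.replace(tmp_name, target)
    except BaseException:
        # Remove the partial temp file, then let the error propagate.
        if os.path.exists(tmp_name):
            os.remove(tmp_name)
        raise
```

Readers of the index then see either the previous complete file or the new complete file, never a half-written one.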

core/preprocessing/docling_processor.py CHANGED Viewed

@@ -13,7 +13,7 @@ from docling.datamodel.pipeline_options import PdfPipelineOptions, TableStructur
 from docling.backend.pypdfium2_backend import PyPdfiumDocumentBackend
 from docling.pipeline.standard_pdf_pipeline import StandardPdfPipeline
-# Thêm project root vào path để import HashProcessor
 PROJECT_ROOT = Path(__file__).resolve().parents[2]
 if str(PROJECT_ROOT) not in sys.path:
     sys.path.insert(0, str(PROJECT_ROOT))
@@ -22,26 +22,26 @@ from core.hash_file.hash_file import HashProcessor
 class DoclingProcessor:
-    """Chuyển đổi PDF sang Markdown bằng Docling."""
     def __init__(self, output_dir: str, use_ocr: bool = True, timeout: int = 300, images_scale: float = 3.0):
-        """Khởi tạo processor với cấu hình OCR và table extraction."""
         self.output_dir = output_dir
         self.timeout = timeout
         self.logger = logging.getLogger(__name__)
         self.hasher = HashProcessor(verbose=False)
         os.makedirs(output_dir, exist_ok=True)
-        # File lưu hash index
         self.hash_index_path = Path(output_dir) / "docling_hash_index.json"
         self.hash_index = self.hasher.load_processed_index(str(self.hash_index_path))
-        # Cấu hình pipeline PDF
         opts = PdfPipelineOptions(do_ocr=use_ocr, do_table_structure=True)
         opts.table_structure_options = TableStructureOptions(do_cell_matching=True, mode=TableFormerMode.ACCURATE)
         opts.images_scale = images_scale
-        # Cấu hình OCR tiếng Việt
         if use_ocr:
             ocr = EasyOcrOptions()
             ocr.lang = ["vi"]
@@ -54,39 +54,39 @@ class DoclingProcessor:
         self.logger.info(f"Docling | OCR={use_ocr} | Table=accurate | Scale={images_scale} | timeout={timeout}s")
     def clean_markdown(self, text: str) -> str:
-        """Xóa số trang và khoảng trắng thừa."""
         text = re.sub(r'\n\s*Trang\s+\d+\s*\n', '\n', text)
         return re.sub(r'\n{3,}', '\n\n', text).strip()
     def _should_process(self, pdf_path: str, output_path: Path) -> bool:
-        """Kiểm tra xem file PDF có cần xử lý lại không (dựa trên hash)."""
-        # Nếu output chưa tồn tại -> cần xử lý
         if not output_path.exists():
             return True
-        # Tính hash file PDF hiện tại
         current_hash = self.hasher.get_file_hash(pdf_path)
         if not current_hash:
             return True
-        # So sánh với hash đã lưu
         saved_hash = self.hash_index.get(pdf_path, {}).get("hash")
         return current_hash != saved_hash
     def _save_hash(self, pdf_path: str, file_hash: str) -> None:
-        """Lưu hash của file đã xử lý vào index."""
         self.hash_index[pdf_path] = {
             "hash": file_hash,
             "processed_at": self.hasher.get_current_timestamp()
         }
     def parse_document(self, file_path: str) -> str | None:
-        """Chuyển đổi 1 file PDF sang Markdown với timeout."""
         if not os.path.exists(file_path):
             return None
         filename = os.path.basename(file_path)
         try:
-            # Đặt timeout để tránh treo
             signal.signal(signal.SIGALRM, lambda s, f: (_ for _ in ()).throw(TimeoutError()))
             signal.alarm(self.timeout)
@@ -95,22 +95,19 @@ class DoclingProcessor:
             signal.alarm(0)
             md = self.clean_markdown(md)
-            # Thêm frontmatter metadata
             return f"---\nfilename: {filename}\nfilepath: {file_path}\npage_count: {len(result.document.pages)}\nprocessed_at: {datetime.now().isoformat()}\n---\n\n{md}"
         except TimeoutError:
             self.logger.warning(f"Timeout: {filename}")
             signal.alarm(0)
             return None
         except Exception as e:
-            self.logger.error(f"Lỗi: {filename}: {e}")
             signal.alarm(0)
             return None
     def parse_directory(self, source_dir: str) -> dict:
-        """Xử lý toàn bộ thư mục PDF, bỏ qua file không thay đổi (dựa trên hash)."""
-        source_path = Path(source_dir)
-        pdf_files = list(source_path.rglob("*.pdf"))
-        self.logger.info(f"Tìm thấy {len(pdf_files)} file PDF trong {source_dir}")
         results = {"total": len(pdf_files), "parsed": 0, "skipped": 0, "errors": 0}
@@ -124,31 +121,31 @@ class DoclingProcessor:
             pdf_path = str(fp)
-            # Kiểm tra hash để quyết định có cần xử lý không
             if not self._should_process(pdf_path, out):
                 results["skipped"] += 1
                 continue
-            # Tính hash trước khi xử lý
             file_hash = self.hasher.get_file_hash(pdf_path)
             md = self.parse_document(pdf_path)
             if md:
                 out.write_text(md, encoding="utf-8")
                 results["parsed"] += 1
-                # Lưu hash sau khi xử lý thành công
                 if file_hash:
                     self._save_hash(pdf_path, file_hash)
             else:
                 results["errors"] += 1
-            # Dọn memory mỗi 10 files
             if (i + 1) % 10 == 0:
                 gc.collect()
-                self.logger.info(f"{i+1}/{len(pdf_files)} (bỏ qua: {results['skipped']})")
-        # Lưu hash index sau khi xử lý xong
         self.hasher.save_processed_index(str(self.hash_index_path), self.hash_index)
-        self.logger.info(f"Xong: {results['parsed']} đã xử lý, {results['skipped']} bỏ qua, {results['errors']} lỗi")
         return results

 from docling.backend.pypdfium2_backend import PyPdfiumDocumentBackend
 from docling.pipeline.standard_pdf_pipeline import StandardPdfPipeline
+# Add project root to path for HashProcessor import
 PROJECT_ROOT = Path(__file__).resolve().parents[2]
 if str(PROJECT_ROOT) not in sys.path:
     sys.path.insert(0, str(PROJECT_ROOT))
 class DoclingProcessor:
     def __init__(self, output_dir: str, use_ocr: bool = True, timeout: int = 300, images_scale: float = 3.0):
         self.output_dir = output_dir
         self.timeout = timeout
         self.logger = logging.getLogger(__name__)
         self.hasher = HashProcessor(verbose=False)
         os.makedirs(output_dir, exist_ok=True)
+        # Hash index file
         self.hash_index_path = Path(output_dir) / "docling_hash_index.json"
         self.hash_index = self.hasher.load_processed_index(str(self.hash_index_path))
+        # PDF pipeline configuration
         opts = PdfPipelineOptions(do_ocr=use_ocr, do_table_structure=True)
         opts.table_structure_options = TableStructureOptions(do_cell_matching=True, mode=TableFormerMode.ACCURATE)
         opts.images_scale = images_scale
+        # Vietnamese OCR configuration
         if use_ocr:
             ocr = EasyOcrOptions()
             ocr.lang = ["vi"]
         self.logger.info(f"Docling | OCR={use_ocr} | Table=accurate | Scale={images_scale} | timeout={timeout}s")
     def clean_markdown(self, text: str) -> str:
         text = re.sub(r'\n\s*Trang\s+\d+\s*\n', '\n', text)
         return re.sub(r'\n{3,}', '\n\n', text).strip()
     def _should_process(self, pdf_path: str, output_path: Path) -> bool:
+        # If output doesn't exist -> needs processing
         if not output_path.exists():
             return True
+        # Compute hash of current PDF file
         current_hash = self.hasher.get_file_hash(pdf_path)
         if not current_hash:
             return True
+        # Compare with saved hash
         saved_hash = self.hash_index.get(pdf_path, {}).get("hash")
         return current_hash != saved_hash
     def _save_hash(self, pdf_path: str, file_hash: str) -> None:
         self.hash_index[pdf_path] = {
             "hash": file_hash,
             "processed_at": self.hasher.get_current_timestamp()
         }
     def parse_document(self, file_path: str) -> str | None:
         if not os.path.exists(file_path):
             return None
         filename = os.path.basename(file_path)
         try:
+            # Set timeout to prevent hanging
             signal.signal(signal.SIGALRM, lambda s, f: (_ for _ in ()).throw(TimeoutError()))
             signal.alarm(self.timeout)
             signal.alarm(0)
             md = self.clean_markdown(md)
+            # Add frontmatter metadata
             return f"---\nfilename: {filename}\nfilepath: {file_path}\npage_count: {len(result.document.pages)}\nprocessed_at: {datetime.now().isoformat()}\n---\n\n{md}"
         except TimeoutError:
             self.logger.warning(f"Timeout: {filename}")
             signal.alarm(0)
             return None
         except Exception as e:
+            self.logger.error(f"Error: {filename}: {e}")
             signal.alarm(0)
             return None
     def parse_directory(self, source_dir: str) -> dict:
+        self.logger.info(f"Found {len(pdf_files)} PDF files in {source_dir}")
         results = {"total": len(pdf_files), "parsed": 0, "skipped": 0, "errors": 0}
             pdf_path = str(fp)
+            # Check hash to decide if processing is needed
             if not self._should_process(pdf_path, out):
                 results["skipped"] += 1
                 continue
+            # Compute hash before processing
             file_hash = self.hasher.get_file_hash(pdf_path)
             md = self.parse_document(pdf_path)
             if md:
                 out.write_text(md, encoding="utf-8")
                 results["parsed"] += 1
+                # Save hash after successful processing
                 if file_hash:
                     self._save_hash(pdf_path, file_hash)
             else:
                 results["errors"] += 1
+            # Clean up memory every 10 files
             if (i + 1) % 10 == 0:
                 gc.collect()
+                self.logger.info(f"{i+1}/{len(pdf_files)} (skipped: {results['skipped']})")
+        # Save hash index after processing
         self.hasher.save_processed_index(str(self.hash_index_path), self.hash_index)
+        self.logger.info(f"Done: {results['parsed']} processed, {results['skipped']} skipped, {results['errors']} errors")
         return results
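
Reviewer note: `parse_document` arms `signal.alarm` and raises `TimeoutError` through a generator `throw` inside a lambda, then resets the alarm in three separate places. A possible alternative is sketched below as a context manager. The name `time_limit` is hypothetical and is not part of the commit. Like the original, it depends on `SIGALRM`, so it works only on Unix and only in the main thread.

```python
import signal
import time
from contextlib import contextmanager


@contextmanager
def time_limit(seconds: float):
    """Raise TimeoutError if the with-block runs longer than `seconds`."""

    def _on_alarm(signum, frame):
        raise TimeoutError(f"timed out after {seconds}s")

    previous = signal.signal(signal.SIGALRM, _on_alarm)
    signal.setitimer(signal.ITIMER_REAL, seconds)
    try:
        yield
    finally:
        # Always disarm the timer and restore the earlier handler,
        # whether the block finished, timed out, or failed another way.
        signal.setitimer(signal.ITIMER_REAL, 0)
        signal.signal(signal.SIGALRM, previous)
```

With this helper the conversion call could be wrapped in `with time_limit(self.timeout): ...`, leaving a single cleanup path instead of the repeated `signal.alarm(0)` calls.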

core/preprocessing/pdf_parser.py CHANGED Viewed

@@ -1,22 +1,22 @@
 from docling_processor import DoclingProcessor
-# Cấu hình đường dẫn
-PDF_FILE = ""  # File đơn lẻ (để trống nếu muốn parse cả thư mục)
-SOURCE_DIR = "data/data_raw"  # Thư mục chứa PDFs
-OUTPUT_DIR = "data"           # Thư mục xuất Markdown
-USE_OCR = False               # Bật OCR cho PDF scan
 if __name__ == "__main__":
     processor = DoclingProcessor(OUTPUT_DIR, use_ocr=USE_OCR)
     if PDF_FILE:
-        # Parse 1 file đơn lẻ
-        print(f"Đang xử lý: {PDF_FILE}")
         result = processor.parse_document(PDF_FILE)
-        print("Xong!" if result else "Lỗi hoặc bỏ qua")
     else:
-        # Parse cả thư mục
-        print(f"Đang xử lý thư mục: {SOURCE_DIR}")
         r = processor.parse_directory(SOURCE_DIR)
-        print(f"Tổng: {r['total']} | Thành công: {r['parsed']} | Bỏ qua: {r['skipped']} | Lỗi: {r['errors']}")

 from docling_processor import DoclingProcessor
+# Configuration
+PDF_FILE = ""  # Single file (leave empty to parse entire directory)
+SOURCE_DIR = "data/data_raw"  # Directory containing PDFs
+OUTPUT_DIR = "data"           # Markdown output directory
+USE_OCR = False               # Enable OCR for scanned PDFs
 if __name__ == "__main__":
     processor = DoclingProcessor(OUTPUT_DIR, use_ocr=USE_OCR)
     if PDF_FILE:
+        # Parse a single file
+        print(f"Processing: {PDF_FILE}")
         result = processor.parse_document(PDF_FILE)
+        print("Done!" if result else "Error or skipped")
     else:
+        # Parse entire directory
+        print(f"Processing directory: {SOURCE_DIR}")
         r = processor.parse_directory(SOURCE_DIR)
+        print(f"Total: {r['total']} | Success: {r['parsed']} | Skipped: {r['skipped']} | Errors: {r['errors']}")

core/rag/chunk.py CHANGED Viewed

@@ -10,13 +10,13 @@ from llama_index.core import Document
 from llama_index.core.node_parser import MarkdownNodeParser, SentenceSplitter
 from llama_index.core.schema import BaseNode, TextNode
-# Cấu hình chunking
 CHUNK_SIZE = 1500
 CHUNK_OVERLAP = 150
 MIN_CHUNK_SIZE = 200
 TABLE_ROWS_PER_CHUNK = 15
-# Cấu hình Small-to-Big
 ENABLE_TABLE_SUMMARY = True
 MIN_TABLE_ROWS_FOR_SUMMARY = 0
 SUMMARY_MODEL = "openai/gpt-oss-120b"
@@ -31,20 +31,20 @@ TABLE_TITLE_PATTERN = re.compile(r"(?:^|\n)#+\s*(?:Bảng|BẢNG)\s*(\d+(?:\.\d+
 def _is_table_row(line: str) -> bool:
-    """Kiểm tra dòng có phải là hàng trong bảng Markdown không."""
     s = line.strip()
     return s.startswith("|") and s.endswith("|") and s.count("|") >= 2
 def _is_separator(line: str) -> bool:
-    """Kiểm tra dòng có phải là separator của bảng (|---|---|)."""
     if not _is_table_row(line):
         return False
     return not line.strip().replace("|", "").replace("-", "").replace(":", "").replace(" ", "")
 def _is_header(line: str) -> bool:
-    """Kiểm tra dòng có phải là header của bảng không."""
     if not _is_table_row(line):
         return False
     cells = [c.strip() for c in line.split("|") if c.strip()]
@@ -54,7 +54,7 @@ def _is_header(line: str) -> bool:
 def _extract_tables(text: str) -> Tuple[List[Tuple[str, List[str]]], str]:
-    """Trích xuất bảng từ text và thay bằng placeholder."""
     lines, tables, last_header, i = text.split("\n"), [], None, 0
     while i < len(lines) - 1:
@@ -78,7 +78,7 @@ def _extract_tables(text: str) -> Tuple[List[Tuple[str, List[str]]], str]:
         else:
             i += 1
-    # Thay bảng bằng placeholder
     result, tbl_idx, i = [], 0, 0
     while i < len(lines):
         if tbl_idx < len(tables) and i < len(lines) - 1 and _is_table_row(lines[i]) and _is_separator(lines[i + 1]):
@@ -95,7 +95,7 @@ def _extract_tables(text: str) -> Tuple[List[Tuple[str, List[str]]], str]:
 def _split_table(header: str, rows: List[str], max_rows: int = TABLE_ROWS_PER_CHUNK) -> List[str]:
-    """Chia bảng lớn thành nhiều chunks nhỏ."""
     if len(rows) <= max_rows:
         return [header + "\n".join(rows)]
@@ -104,7 +104,7 @@ def _split_table(header: str, rows: List[str], max_rows: int = TABLE_ROWS_PER_CH
         chunk_rows = rows[i:i + max_rows]
         chunks.append(chunk_rows)
-    # Gộp chunk cuối nếu quá nhỏ (< 5 dòng)
     if len(chunks) > 1 and len(chunks[-1]) < 5:
         chunks[-2].extend(chunks[-1])
         chunks.pop()
@@ -116,14 +116,14 @@ _summary_client: Optional[OpenAI] = None
 def _get_summary_client() -> Optional[OpenAI]:
-    """Lấy Groq client để tóm tắt bảng."""
     global _summary_client
     if _summary_client is not None:
         return _summary_client
     api_key = os.getenv("GROQ_API_KEY", "").strip()
     if not api_key:
-        print("Chưa đặt GROQ_API_KEY. Tắt tính năng tóm tắt bảng.")
         return None
     _summary_client = OpenAI(api_key=api_key, base_url=GROQ_BASE_URL)
@@ -139,17 +139,17 @@ def _summarize_table(
     max_retries: int = 5,
     base_delay: float = 2.0
 ) -> str:
-    """Tóm tắt bảng bằng LLM với retry logic."""
     import time
     if not ENABLE_TABLE_SUMMARY:
-        raise RuntimeError("Tính năng tóm tắt bảng đã tắt. Đặt ENABLE_TABLE_SUMMARY = True")
     client = _get_summary_client()
     if client is None:
-        raise RuntimeError("Chưa đặt GROQ_API_KEY. Không thể tóm tắt bảng.")
-    # Tạo chuỗi định danh bảng
     table_id_parts = []
     if table_number:
         table_id_parts.append(f"Bảng {table_number}")
@@ -188,17 +188,17 @@ Bảng:
             if summary.strip():
                 return summary.strip()
             else:
-                raise ValueError("API trả về summary rỗng")
         except Exception as e:
             last_error = e
             delay = base_delay * (2 ** attempt)  # Exponential backoff: 2, 4, 8, 16, 32 seconds
-            print(f"Thử lại {attempt + 1}/{max_retries} cho {table_identifier}: {e}")
-            print(f"   Đợi {delay:.1f}s trước khi thử lại...")
             time.sleep(delay)
-    # Tất cả retry đều thất bại
-    raise RuntimeError(f"Không thể tóm tắt {table_identifier} sau {max_retries} lần thử. Lỗi cuối: {last_error}")
 def _create_table_nodes(
@@ -209,11 +209,11 @@ def _create_table_nodes(
     table_title: str = "",
     source_file: str = ""
 ) -> List[TextNode]:
-    """Tạo nodes cho bảng. Bảng lớn sẽ có parent + summary node."""
-    # Đếm số dòng để quyết định có cần tóm tắt không
     row_count = table_text.count("\n")
-    # Thêm thông tin bảng vào metadata
     table_meta = {**metadata}
     if table_number:
         table_meta["table_number"] = table_number
@@ -221,15 +221,15 @@ def _create_table_nodes(
         table_meta["table_title"] = table_title
     if row_count < MIN_TABLE_ROWS_FOR_SUMMARY:
-        # Bảng quá nhỏ, không cần tóm tắt
         return [TextNode(text=table_text, metadata={**table_meta, "is_table": True})]
-    # Kiểm tra có thể tóm tắt không (cần API key)
     if _get_summary_client() is None:
-        # Không có API key -> trả về node bảng đơn giản, không tóm tắt
         return [TextNode(text=table_text, metadata={**table_meta, "is_table": True})]
-    # Tạo summary với retry logic
     summary = _summarize_table(
         table_text,
         context_hint,
@@ -238,36 +238,36 @@ def _create_table_nodes(
         source_file=source_file
     )
-    # Tạo parent node (bảng gốc - KHÔNG embed)
     parent_id = str(uuid.uuid4())
     parent_node = TextNode(
         text=table_text,
         metadata={
             **table_meta,
             "is_table": True,
-            "is_parent": True,  # Flag để bỏ qua embedding
             "node_id": parent_id,
         }
     )
     parent_node.id_ = parent_id
-    # Tạo summary node (SẼ được embed để search)
     summary_node = TextNode(
         text=summary,
         metadata={
             **table_meta,
             "is_table_summary": True,
-            "parent_id": parent_id,  # Link tới parent
         }
     )
-    table_id = f"Bảng {table_number}" if table_number else "bảng"
-    print(f"Đã tạo summary cho {table_id} ({row_count} dòng)")
     return [parent_node, summary_node]
 def _enrich_metadata(node: BaseNode, source_path: Path | None) -> None:
-    """Bổ sung metadata từ source path và trích xuất thông tin học phần."""
     if source_path:
         node.metadata.update({"source_path": str(source_path), "source_file": source_path.name})
     if "Học phần" in (text := node.get_content()) and (m := COURSE_PATTERN.search(text)):
@@ -275,7 +275,7 @@ def _enrich_metadata(node: BaseNode, source_path: Path | None) -> None:
 def _chunk_text(text: str, metadata: dict) -> List[BaseNode]:
-    """Chia text thành chunks theo kích thước cấu hình."""
     if len(text) <= CHUNK_SIZE:
         return [TextNode(text=text, metadata=metadata.copy())]
     return SentenceSplitter(chunk_size=CHUNK_SIZE, chunk_overlap=CHUNK_OVERLAP).get_nodes_from_documents(
@@ -284,7 +284,7 @@ def _chunk_text(text: str, metadata: dict) -> List[BaseNode]:
 def _extract_frontmatter(text: str) -> Tuple[Dict[str, Any], str]:
-    """Trích xuất YAML frontmatter từ đầu file."""
     match = FRONTMATTER_PATTERN.match(text)
     if not match:
         return {}, text
@@ -298,23 +298,23 @@ def _extract_frontmatter(text: str) -> Tuple[Dict[str, Any], str]:
 def chunk_markdown(text: str, source_path: str | Path | None = None) -> List[BaseNode]:
-    """Chunk một file Markdown thành các nodes."""
     if not text or not text.strip():
         return []
     path = Path(source_path) if source_path else None
-    # Trích xuất YAML frontmatter làm metadata (không chunk)
     frontmatter_meta, text = _extract_frontmatter(text)
     tables, text_with_placeholders = _extract_tables(text)
-    # Metadata cơ bản từ frontmatter + source path
     base_meta = {**frontmatter_meta}
     if path:
         base_meta.update({"source_path": str(path), "source_file": path.name})
-    # Parse theo headings
     doc = Document(text=text_with_placeholders, metadata=base_meta.copy())
     heading_nodes = MarkdownNodeParser().get_nodes_from_documents([doc])
@@ -329,10 +329,10 @@ def chunk_markdown(text: str, source_path: str | Path | None = None) -> List[Bas
         last_end = 0
         for match in matches:
-            # Text trước bảng
             before_text = content[last_end:match.start()].strip()
-            # Trích xuất số bảng và tiêu đề từ text trước bảng
             table_number = ""
             table_title = ""
             if before_text:
@@ -344,15 +344,15 @@ def chunk_markdown(text: str, source_path: str | Path | None = None) -> List[Bas
             if before_text and len(before_text) >= MIN_CHUNK_SIZE:
                 nodes.extend(_chunk_text(before_text, meta) if len(before_text) > CHUNK_SIZE else [TextNode(text=before_text, metadata=meta.copy())])
-            # Chunk bảng - sử dụng Small-to-Big pattern
             if (idx := int(match.group(1))) < len(tables):
                 header, rows = tables[idx]
                 table_chunks = _split_table(header, rows)
-                # Lấy context hint từ header path
                 context_hint = meta.get("Header 1", "") or meta.get("section", "")
-                # Lấy source file cho summary
                 source_file = meta.get("source_file", "") or (path.name if path else "")
                 for i, chunk in enumerate(table_chunks):
@@ -360,7 +360,7 @@ def chunk_markdown(text: str, source_path: str | Path | None = None) -> List[Bas
                     if len(table_chunks) > 1:
                         chunk_meta["table_part"] = f"{i+1}/{len(table_chunks)}"
-                    # Tạo parent + summary nodes nếu cần
                     table_nodes = _create_table_nodes(
                         chunk,
                         chunk_meta,
@@ -373,11 +373,11 @@ def chunk_markdown(text: str, source_path: str | Path | None = None) -> List[Bas
             last_end = match.end()
-        # Text sau bảng
         if (after := content[last_end:].strip()) and len(after) >= MIN_CHUNK_SIZE:
             nodes.extend(_chunk_text(after, meta) if len(after) > CHUNK_SIZE else [TextNode(text=after, metadata=meta.copy())])
-    # Gộp các node nhỏ với node kế tiếp
     final: List[BaseNode] = []
     i = 0
     while i < len(nodes):
@@ -385,12 +385,12 @@ def chunk_markdown(text: str, source_path: str | Path | None = None) -> List[Bas
         curr_content = curr.get_content()
         curr_is_table = curr.metadata.get("is_table")
-        # Bỏ qua node rỗng
         if not curr_content.strip():
             i += 1
             continue
-        # Nếu node hiện tại nhỏ và không phải bảng -> gộp với node sau
         if not curr_is_table and len(curr_content) < MIN_CHUNK_SIZE and i + 1 < len(nodes):
             next_node = nodes[i + 1]
             next_is_table = next_node.metadata.get("is_table")
@@ -417,8 +417,8 @@ def chunk_markdown(text: str, source_path: str | Path | None = None) -> List[Bas
 def chunk_markdown_file(path: str | Path) -> List[BaseNode]:
-    """Đọc và chunk một file Markdown."""
     p = Path(path)
     if not p.exists():
-        raise FileNotFoundError(f"Không tìm thấy file: {p}")
     return chunk_markdown(p.read_text(encoding="utf-8"), source_path=p)

 from llama_index.core.node_parser import MarkdownNodeParser, SentenceSplitter
 from llama_index.core.schema import BaseNode, TextNode
+# Chunking configuration
 CHUNK_SIZE = 1500
 CHUNK_OVERLAP = 150
 MIN_CHUNK_SIZE = 200
 TABLE_ROWS_PER_CHUNK = 15
+# Small-to-Big configuration
 ENABLE_TABLE_SUMMARY = True
 MIN_TABLE_ROWS_FOR_SUMMARY = 0
 SUMMARY_MODEL = "openai/gpt-oss-120b"
 def _is_table_row(line: str) -> bool:
     s = line.strip()
     return s.startswith("|") and s.endswith("|") and s.count("|") >= 2
 def _is_separator(line: str) -> bool:
     if not _is_table_row(line):
         return False
     return not line.strip().replace("|", "").replace("-", "").replace(":", "").replace(" ", "")
 def _is_header(line: str) -> bool:
     if not _is_table_row(line):
         return False
     cells = [c.strip() for c in line.split("|") if c.strip()]
 def _extract_tables(text: str) -> Tuple[List[Tuple[str, List[str]]], str]:
     lines, tables, last_header, i = text.split("\n"), [], None, 0
     while i < len(lines) - 1:
         else:
             i += 1
+    # Replace tables with placeholders
     result, tbl_idx, i = [], 0, 0
     while i < len(lines):
         if tbl_idx < len(tables) and i < len(lines) - 1 and _is_table_row(lines[i]) and _is_separator(lines[i + 1]):
 def _split_table(header: str, rows: List[str], max_rows: int = TABLE_ROWS_PER_CHUNK) -> List[str]:
     if len(rows) <= max_rows:
         return [header + "\n".join(rows)]
         chunk_rows = rows[i:i + max_rows]
         chunks.append(chunk_rows)
+    # Merge last chunk if too small (< 5 rows)
     if len(chunks) > 1 and len(chunks[-1]) < 5:
         chunks[-2].extend(chunks[-1])
         chunks.pop()
 def _get_summary_client() -> Optional[OpenAI]:
     global _summary_client
     if _summary_client is not None:
         return _summary_client
     api_key = os.getenv("GROQ_API_KEY", "").strip()
     if not api_key:
+        print("GROQ_API_KEY not set. Table summarization disabled.")
         return None
     _summary_client = OpenAI(api_key=api_key, base_url=GROQ_BASE_URL)
     max_retries: int = 5,
     base_delay: float = 2.0
 ) -> str:
     import time
     if not ENABLE_TABLE_SUMMARY:
+        raise RuntimeError("Table summarization is disabled. Set ENABLE_TABLE_SUMMARY = True")
     client = _get_summary_client()
     if client is None:
+        raise RuntimeError("GROQ_API_KEY not set. Cannot summarize table.")
+    # Build table identifier string
     table_id_parts = []
     if table_number:
         table_id_parts.append(f"Bảng {table_number}")
             if summary.strip():
                 return summary.strip()
             else:
+                raise ValueError("API returned empty summary")
         except Exception as e:
             last_error = e
             delay = base_delay * (2 ** attempt)  # Exponential backoff: 2, 4, 8, 16, 32 seconds
+            print(f"Retry {attempt + 1}/{max_retries} for {table_identifier}: {e}")
+            print(f"   Waiting {delay:.1f}s before retry...")
             time.sleep(delay)
+    # All retries failed
+    raise RuntimeError(f"Failed to summarize {table_identifier} after {max_retries} attempts. Last error: {last_error}")
 def _create_table_nodes(
     table_title: str = "",
     source_file: str = ""
 ) -> List[TextNode]:
+    # Count rows to decide if summarization is needed
     row_count = table_text.count("\n")
+    # Add table info to metadata
     table_meta = {**metadata}
     if table_number:
         table_meta["table_number"] = table_number
         table_meta["table_title"] = table_title
     if row_count < MIN_TABLE_ROWS_FOR_SUMMARY:
+        # Table too small, no summary needed
         return [TextNode(text=table_text, metadata={**table_meta, "is_table": True})]
+    # Check if summarization is possible (needs API key)
     if _get_summary_client() is None:
+        # No API key -> return simple table node without summary
         return [TextNode(text=table_text, metadata={**table_meta, "is_table": True})]
+    # Create summary with retry logic
     summary = _summarize_table(
         table_text,
         context_hint,
         source_file=source_file
     )
+    # Create parent node (original table - NOT embedded)
     parent_id = str(uuid.uuid4())
     parent_node = TextNode(
         text=table_text,
         metadata={
             **table_meta,
             "is_table": True,
+            "is_parent": True,  # Flag to skip embedding
             "node_id": parent_id,
         }
     )
     parent_node.id_ = parent_id
+    # Create summary node (WILL be embedded for search)
     summary_node = TextNode(
         text=summary,
         metadata={
             **table_meta,
             "is_table_summary": True,
+            "parent_id": parent_id,  # Link to parent
         }
     )
+    table_id = f"Table {table_number}" if table_number else "table"
+    print(f"Created summary for {table_id} ({row_count} rows)")
     return [parent_node, summary_node]
 def _enrich_metadata(node: BaseNode, source_path: Path | None) -> None:
     if source_path:
         node.metadata.update({"source_path": str(source_path), "source_file": source_path.name})
     if "Học phần" in (text := node.get_content()) and (m := COURSE_PATTERN.search(text)):
 def _chunk_text(text: str, metadata: dict) -> List[BaseNode]:
     if len(text) <= CHUNK_SIZE:
         return [TextNode(text=text, metadata=metadata.copy())]
     return SentenceSplitter(chunk_size=CHUNK_SIZE, chunk_overlap=CHUNK_OVERLAP).get_nodes_from_documents(
 def _extract_frontmatter(text: str) -> Tuple[Dict[str, Any], str]:
     match = FRONTMATTER_PATTERN.match(text)
     if not match:
         return {}, text
 def chunk_markdown(text: str, source_path: str | Path | None = None) -> List[BaseNode]:
     if not text or not text.strip():
         return []
     path = Path(source_path) if source_path else None
+    # Extract YAML frontmatter as metadata (not chunked)
     frontmatter_meta, text = _extract_frontmatter(text)
     tables, text_with_placeholders = _extract_tables(text)
+    # Base metadata from frontmatter + source path
     base_meta = {**frontmatter_meta}
     if path:
         base_meta.update({"source_path": str(path), "source_file": path.name})
+    # Parse by headings
     doc = Document(text=text_with_placeholders, metadata=base_meta.copy())
     heading_nodes = MarkdownNodeParser().get_nodes_from_documents([doc])
         last_end = 0
         for match in matches:
+            # Text before table
             before_text = content[last_end:match.start()].strip()
+            # Extract table number and title from text before table
             table_number = ""
             table_title = ""
             if before_text:
             if before_text and len(before_text) >= MIN_CHUNK_SIZE:
                 nodes.extend(_chunk_text(before_text, meta) if len(before_text) > CHUNK_SIZE else [TextNode(text=before_text, metadata=meta.copy())])
+            # Chunk table - using Small-to-Big pattern
             if (idx := int(match.group(1))) < len(tables):
                 header, rows = tables[idx]
                 table_chunks = _split_table(header, rows)
+                # Get context hint from header path
                 context_hint = meta.get("Header 1", "") or meta.get("section", "")
+                # Get source file for summary
                 source_file = meta.get("source_file", "") or (path.name if path else "")
                 for i, chunk in enumerate(table_chunks):
                     if len(table_chunks) > 1:
                         chunk_meta["table_part"] = f"{i+1}/{len(table_chunks)}"
+                    # Create parent + summary nodes if needed
                     table_nodes = _create_table_nodes(
                         chunk,
                         chunk_meta,
             last_end = match.end()
+        # Text after table
         if (after := content[last_end:].strip()) and len(after) >= MIN_CHUNK_SIZE:
             nodes.extend(_chunk_text(after, meta) if len(after) > CHUNK_SIZE else [TextNode(text=after, metadata=meta.copy())])
+    # Merge small nodes with next node
     final: List[BaseNode] = []
     i = 0
     while i < len(nodes):
         curr_content = curr.get_content()
         curr_is_table = curr.metadata.get("is_table")
+        # Skip empty nodes
         if not curr_content.strip():
             i += 1
             continue
+        # If current node is small and not a table -> merge with next
         if not curr_is_table and len(curr_content) < MIN_CHUNK_SIZE and i + 1 < len(nodes):
             next_node = nodes[i + 1]
             next_is_table = next_node.metadata.get("is_table")
 def chunk_markdown_file(path: str | Path) -> List[BaseNode]:
     p = Path(path)
     if not p.exists():
+        raise FileNotFoundError(f"File not found: {p}")
     return chunk_markdown(p.read_text(encoding="utf-8"), source_path=p)
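
Reviewer note: the table-splitting rule in `_split_table` (at most `TABLE_ROWS_PER_CHUNK` rows per chunk, with a final chunk of fewer than 5 rows merged into the previous one) can be checked in isolation. The sketch below is a minimal standalone version that works on row lists only. `split_rows` and `MIN_TAIL_ROWS` are hypothetical names introduced for illustration, and header handling is omitted because that part of the function is not visible in the diff.

```python
from typing import List

TABLE_ROWS_PER_CHUNK = 15  # same value as in core/rag/chunk.py
MIN_TAIL_ROWS = 5          # hypothetical name for the "< 5 rows" threshold


def split_rows(
    rows: List[str],
    max_rows: int = TABLE_ROWS_PER_CHUNK,
    min_tail: int = MIN_TAIL_ROWS,
) -> List[List[str]]:
    """Group rows into chunks of max_rows; fold a short tail into the previous chunk."""
    if len(rows) <= max_rows:
        return [list(rows)]
    chunks = [list(rows[i:i + max_rows]) for i in range(0, len(rows), max_rows)]
    # A trailing chunk with fewer than min_tail rows is merged into its predecessor.
    if len(chunks) > 1 and len(chunks[-1]) < min_tail:
        tail = chunks.pop()
        chunks[-1].extend(tail)
    return chunks


rows = [f"| row {n} |" for n in range(32)]
sizes = [len(c) for c in split_rows(rows)]
print(sizes)  # [15, 17]: the 2-row tail was merged into the second chunk
```

With these settings a merged chunk holds at most 19 rows (15 plus a tail of up to 4), so chunk size stays bounded.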

core/rag/embedding_model.py CHANGED Viewed

@@ -13,18 +13,17 @@ logger = logging.getLogger(__name__)
 @dataclass
 class EmbeddingConfig:
-    """Cấu hình cho embedding model."""
-    api_base_url: str = "https://api.siliconflow.com/v1"  # SiliconFlow API
-    model: str = "Qwen/Qwen3-Embedding-4B"                # Model embedding
-    dimension: int = 2048                                  # Số chiều vector
-    batch_size: int = 16                                   # Số text mỗi batch
 _embed_config: EmbeddingConfig | None = None
 def get_embedding_config() -> EmbeddingConfig:
-    """Lấy cấu hình embedding (singleton pattern)."""
     global _embed_config
     if _embed_config is None:
         _embed_config = EmbeddingConfig()
@@ -32,32 +31,32 @@ def get_embedding_config() -> EmbeddingConfig:
 class QwenEmbeddings(Embeddings):
-    """Wrapper embedding model Qwen qua SiliconFlow API"""
     def __init__(self, config: EmbeddingConfig | None = None):
-        """Khởi tạo embedding client."""
         self.config = config or get_embedding_config()
         api_key = os.getenv("SILICONFLOW_API_KEY", "").strip()
         if not api_key:
-            raise ValueError("Chưa đặt biến môi trường SILICONFLOW_API_KEY")
         self._client = OpenAI(
             api_key=api_key,
             base_url=self.config.api_base_url,
         )
-        logger.info(f"Đã khởi tạo QwenEmbeddings: {self.config.model}")
     def embed_query(self, text: str) -> List[float]:
-        """Embed một câu query (dùng cho search)."""
         return self._embed_texts([text])[0]
     def embed_documents(self, texts: List[str]) -> List[List[float]]:
-        """Embed nhiều documents (dùng khi index)."""
         return self._embed_texts(texts)
     def _embed_texts(self, texts: Sequence[str]) -> List[List[float]]:
-        """Embed danh sách texts theo batch với retry logic."""
         if not texts:
             return []
@@ -65,11 +64,11 @@ class QwenEmbeddings(Embeddings):
         batch_size = self.config.batch_size
         max_retries = 3
-        # Xử lý theo batch
         for i in range(0, len(texts), batch_size):
             batch = list(texts[i:i + batch_size])
-            # Retry logic cho rate limit
             for attempt in range(max_retries):
                 try:
                     response = self._client.embeddings.create(
@@ -80,10 +79,10 @@ class QwenEmbeddings(Embeddings):
                         all_embeddings.append(item.embedding)
                     break
                 except Exception as e:
-                    # Nếu bị rate limit -> đợi rồi thử lại
                     if "rate" in str(e).lower() and attempt < max_retries - 1:
-                        wait_time = 2 ** attempt  # Exponential backoff: 1s, 2s, 4s
-                        logger.warning(f"Bị rate limit, đợi {wait_time}s...")
                         time.sleep(wait_time)
                     else:
                         raise
@@ -91,10 +90,10 @@ class QwenEmbeddings(Embeddings):
         return all_embeddings
     def embed_texts_np(self, texts: Sequence[str]) -> np.ndarray:
-        """Embed texts và trả về numpy array (tiện cho tính toán)."""
         return np.asarray(self._embed_texts(list(texts)), dtype=np.float32)
-# Alias để tương thích ngược
 SiliconFlowConfig = EmbeddingConfig
 get_config = get_embedding_config

 @dataclass
 class EmbeddingConfig:
+    api_base_url: str = "https://api.siliconflow.com/v1"
+    model: str = "Qwen/Qwen3-Embedding-4B"
+    dimension: int = 2048
+    batch_size: int = 16
 _embed_config: EmbeddingConfig | None = None
 def get_embedding_config() -> EmbeddingConfig:
     global _embed_config
     if _embed_config is None:
         _embed_config = EmbeddingConfig()
 class QwenEmbeddings(Embeddings):
     def __init__(self, config: EmbeddingConfig | None = None):
         self.config = config or get_embedding_config()
         api_key = os.getenv("SILICONFLOW_API_KEY", "").strip()
         if not api_key:
+            raise ValueError("Missing SILICONFLOW_API_KEY environment variable")
         self._client = OpenAI(
             api_key=api_key,
             base_url=self.config.api_base_url,
         )
+        logger.info(f"Initialized QwenEmbeddings: {self.config.model}")
     def embed_query(self, text: str) -> List[float]:
         return self._embed_texts([text])[0]
     def embed_documents(self, texts: List[str]) -> List[List[float]]:
         return self._embed_texts(texts)
     def _embed_texts(self, texts: Sequence[str]) -> List[List[float]]:
         if not texts:
             return []
         batch_size = self.config.batch_size
         max_retries = 3
+        # Process in batches
         for i in range(0, len(texts), batch_size):
             batch = list(texts[i:i + batch_size])
+            # Retry logic for rate limits
             for attempt in range(max_retries):
                 try:
                     response = self._client.embeddings.create(
                         all_embeddings.append(item.embedding)
                     break
                 except Exception as e:
+                    # Rate limit -> wait and retry
                     if "rate" in str(e).lower() and attempt < max_retries - 1:
+                        wait_time = 2 ** attempt
+                        logger.warning(f"Rate limited, waiting {wait_time}s...")
                         time.sleep(wait_time)
                     else:
                         raise
         return all_embeddings
     def embed_texts_np(self, texts: Sequence[str]) -> np.ndarray:
         return np.asarray(self._embed_texts(list(texts)), dtype=np.float32)
+# Backward compatibility aliases
 SiliconFlowConfig = EmbeddingConfig
 get_config = get_embedding_config
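
Reviewer note: the retry loop in `_embed_texts` waits `2 ** attempt` seconds when the error message contains "rate" and re-raises otherwise. The sketch below isolates that control flow so it can be tested without calling the API. `call_with_backoff` and the injectable `sleep` parameter are hypothetical and not part of the repository.

```python
import time
from typing import Callable, List, TypeVar

T = TypeVar("T")


def call_with_backoff(
    fn: Callable[[], T],
    max_retries: int = 3,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn; on rate-limit-looking errors wait 2**attempt seconds and retry."""
    for attempt in range(max_retries):
        try:
            return fn()
        except Exception as e:
            is_rate_limit = "rate" in str(e).lower()
            if is_rate_limit and attempt < max_retries - 1:
                sleep(2 ** attempt)
            else:
                raise  # non-rate-limit error, or the last attempt failed
    raise ValueError("max_retries must be at least 1")


# Demo: a function that is rate limited twice, then succeeds.
waits: List[float] = []
calls = {"n": 0}


def flaky() -> str:
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("Rate limit exceeded")
    return "ok"


result = call_with_backoff(flaky, sleep=waits.append)
print(result, waits)  # ok [1, 2]
```

With `max_retries = 3`, only two waits (1 s and 2 s) can occur, because the third failure is re-raised without sleeping.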

core/rag/generator.py CHANGED Viewed

@@ -2,21 +2,28 @@ from __future__ import annotations
 from typing import Any, Dict, List, TYPE_CHECKING
 if TYPE_CHECKING:
-    from core.rag.retrival import Retriever
-# System prompt cho LLM (export để gradio/eval dùng)
-SYSTEM_PROMPT = """Bạn là Trợ lý học vụ Đại học Bách khoa Hà Nội.
 ## NGUYÊN TẮC:
-1. Chỉ được đưa ra câu trả lời dựa trên CONTEXT được cung cấp. Không suy đoán, không bổ sung thông tin ngoài CONTEXT.
-2. Nếu CONTEXT chứa nhiều văn bản khác nhau, ưu tiên nội dung mới nhất, TRỪ KHI có điều khoản chuyển tiếp nói khác.
-3. Nếu không tìm thấy thông tin trong CONTEXT, trả lời: "Không tìm thấy thông tin trong dữ liệu hiện có."
 """
 def build_context(results: List[Dict[str, Any]], max_chars: int = 8000) -> str:
-    """Xây dựng context từ kết quả retrieval để đưa vào prompt."""
     parts = []
     for i, r in enumerate(results, 1):
         meta = r.get("metadata", {})
@@ -31,7 +38,7 @@ def build_context(results: List[Dict[str, Any]], max_chars: int = 8000) -> str:
         issued_year = meta.get("issued_year", "")
         content = r.get("content", "").strip()
-        # Tạo dòng metadata
         meta_info = f"Nguồn: {source}"
         if header and header != "/":
             meta_info += f" | Mục: {header}"
@@ -54,20 +61,20 @@ def build_context(results: List[Dict[str, Any]], max_chars: int = 8000) -> str:
         parts.append(f"[TÀI LIỆU {i}]\n{meta_info}\n{content}")
     context = "\n---\n".join(parts)
-    # Cắt ngắn nếu vượt quá giới hạn
     return context[:max_chars] if len(context) > max_chars else context
 def build_prompt(question: str, context: str) -> str:
-    """Ghép system prompt, context và câu hỏi thành prompt hoàn chỉnh."""
     return f"{SYSTEM_PROMPT}\n\n## CONTEXT:\n{context}\n\n## CÂU HỎI: {question}\n\n## TRẢ LỜI:"
 class RAGContextBuilder:
-    """Kết hợp retrieval và context building thành một bước."""
     def __init__(self, retriever: "Retriever", max_context_chars: int = 8000):
-        """Khởi tạo với retriever và giới hạn context."""
         self._retriever = retriever
         self._max_context_chars = max_context_chars
@@ -78,11 +85,10 @@ class RAGContextBuilder:
         initial_k: int = 20,
         mode: str = "hybrid_rerank"
     ) -> Dict[str, Any]:
-        """Retrieve documents và chuẩn bị context + prompt cho LLM."""
-        # Tìm kiếm documents liên quan
         results = self._retriever.flexible_search(question, k=k, initial_k=initial_k, mode=mode)
-        # Không tìm thấy kết quả
         if not results:
             return {
                 "results": [],
@@ -91,17 +97,17 @@ class RAGContextBuilder:
                 "prompt": "",
             }
-        # Xây dựng context và prompt
         context_text = build_context(results, self._max_context_chars)
         prompt = build_prompt(question, context_text)
         return {
-            "results": results,                                          # Kết quả retrieval gốc
-            "contexts": [r.get("content", "")[:1000] for r in results],  # Context rút gọn (cho eval)
-            "context_text": context_text,                                # Context đầy đủ
-            "prompt": prompt,                                            # Prompt hoàn chỉnh
         }
-# Alias để tương thích ngược
 RAGGenerator = RAGContextBuilder

 from typing import Any, Dict, List, TYPE_CHECKING
 if TYPE_CHECKING:
+    from core.rag.retrieval import Retriever
+# System prompt for LLM (exported for gradio/eval usage)
+SYSTEM_PROMPT = """Bạn là Trợ lý học vụ Đại học Bách khoa Hà Nội. Nhiệm vụ: trả lời câu hỏi của sinh viên về quy chế, quy định dựa trên tài liệu được cung cấp.
 ## NGUYÊN TẮC:
+1. **Chỉ dùng CONTEXT:** Chỉ trả lời dựa trên CONTEXT được cung cấp. Tuyệt đối không suy đoán hay bổ sung thông tin ngoài CONTEXT.
+2. **Ưu tiên văn bản mới:** Nếu CONTEXT chứa nhiều văn bản khác nhau, ưu tiên nội dung mới nhất (năm ban hành lớn hơn), TRỪ KHI có điều khoản chuyển tiếp nói khác.
+3. **Trích dẫn nguồn:** Ghi rõ thông tin lấy từ văn bản nào (tên file, điều/khoản nếu có) để sinh viên có thể tự tra cứu.
+4. **Lưu ý phạm vi áp dụng:** Nếu quy định chỉ áp dụng cho khóa/chương trình cụ thể, hãy nêu rõ điều kiện áp dụng.
+5. **Không tìm thấy:** Nếu CONTEXT không chứa thông tin liên quan, trả lời: "Không tìm thấy thông tin trong dữ liệu hiện có. Bạn nên liên hệ Phòng Đào tạo để được hỗ trợ."
+## CÁCH TRÌNH BÀY:
+- Trả lời rõ ràng, dễ hiểu, thân thiện với sinh viên.
+- Sử dụng bullet points khi liệt kê nhiều điều kiện/bước.
+- Nếu câu trả lời phức tạp, chia thành các phần nhỏ có tiêu đề.
 """
 def build_context(results: List[Dict[str, Any]], max_chars: int = 8000) -> str:
     parts = []
     for i, r in enumerate(results, 1):
         meta = r.get("metadata", {})
         issued_year = meta.get("issued_year", "")
         content = r.get("content", "").strip()
+        # Build metadata line
         meta_info = f"Nguồn: {source}"
         if header and header != "/":
             meta_info += f" | Mục: {header}"
         parts.append(f"[TÀI LIỆU {i}]\n{meta_info}\n{content}")
     context = "\n---\n".join(parts)
+    # Truncate if exceeds limit
     return context[:max_chars] if len(context) > max_chars else context
 def build_prompt(question: str, context: str) -> str:
     return f"{SYSTEM_PROMPT}\n\n## CONTEXT:\n{context}\n\n## CÂU HỎI: {question}\n\n## TRẢ LỜI:"
 class RAGContextBuilder:
     def __init__(self, retriever: "Retriever", max_context_chars: int = 8000):
         self._retriever = retriever
         self._max_context_chars = max_context_chars
         initial_k: int = 20,
         mode: str = "hybrid_rerank"
     ) -> Dict[str, Any]:
+        # Search for relevant documents
         results = self._retriever.flexible_search(question, k=k, initial_k=initial_k, mode=mode)
+        # No results found
         if not results:
             return {
                 "results": [],
                 "prompt": "",
             }
+        # Build context and prompt
         context_text = build_context(results, self._max_context_chars)
         prompt = build_prompt(question, context_text)
         return {
+            "results": results,
+            "contexts": [r.get("content", "")[:1000] for r in results],
+            "context_text": context_text,
+            "prompt": prompt,
         }
+# Backward compatibility alias
 RAGGenerator = RAGContextBuilder
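
Reviewer note: `build_context` truncates with `context[:max_chars]`, which can cut the last document mid-sentence. One possible alternative, sketched below as a suggestion rather than what this commit does, is to stop adding whole documents once the budget is reached. `build_context_whole_docs` is a hypothetical name. Reading the source from the `source_file` metadata key is an assumption, since the diff elides how `source` is derived.

```python
from typing import Any, Dict, List


def build_context_whole_docs(results: List[Dict[str, Any]], max_chars: int = 8000) -> str:
    """Hypothetical variant: keep whole documents while they fit in max_chars."""
    sep = "\n---\n"
    parts: List[str] = []
    used = 0
    for i, r in enumerate(results, 1):
        source = r.get("metadata", {}).get("source_file", "")  # assumed key
        content = r.get("content", "").strip()
        block = f"[TÀI LIỆU {i}]\nNguồn: {source}\n{content}"
        extra = len(block) + (len(sep) if parts else 0)
        # Always keep the first document; later ones only if they fit.
        if parts and used + extra > max_chars:
            break
        parts.append(block)
        used += extra
    # The final slice only matters when the first document alone is too long.
    return sep.join(parts)[:max_chars]


docs = [
    {"metadata": {"source_file": "a.md"}, "content": "x" * 50},
    {"metadata": {"source_file": "b.md"}, "content": "y" * 50},
]
ctx_small = build_context_whole_docs(docs, max_chars=100)
ctx_large = build_context_whole_docs(docs, max_chars=8000)
print("b.md" in ctx_small, "b.md" in ctx_large)  # False True
```

The trade-off is that a tight budget drops lower-ranked documents entirely instead of including a fragment of them.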

core/rag/{retrival.py → retrieval.py} RENAMED Viewed

@@ -22,30 +22,28 @@ logger = logging.getLogger(__name__)
 class RetrievalMode(str, Enum):
-    """Các chế độ retrieval hỗ trợ."""
-    VECTOR_ONLY = "vector_only"      # Chỉ dùng vector search
-    BM25_ONLY = "bm25_only"          # Chỉ dùng BM25 keyword search
-    HYBRID = "hybrid"                 # Kết hợp vector + BM25
-    HYBRID_RERANK = "hybrid_rerank"   # Hybrid + reranking
 @dataclass
 class RetrievalConfig:
-    """Cấu hình cho retrieval system."""
-    rerank_api_base_url: str = "https://api.siliconflow.com/v1"  # API reranker
-    rerank_model: str = "Qwen/Qwen3-Reranker-8B"                 # Model reranker
-    rerank_top_n: int = 10                                        # Số kết quả sau rerank
-    initial_k: int = 25                                           # Số docs lấy ban đầu
-    top_k: int = 5                                                # Số kết quả cuối cùng
-    vector_weight: float = 0.5                                    # Trọng số vector search
-    bm25_weight: float = 0.5                                      # Trọng số BM25
 _retrieval_config: RetrievalConfig | None = None
 def get_retrieval_config() -> RetrievalConfig:
-    """Lấy cấu hình retrieval (singleton pattern)."""
     global _retrieval_config
     if _retrieval_config is None:
         _retrieval_config = RetrievalConfig()
@@ -53,7 +51,7 @@ def get_retrieval_config() -> RetrievalConfig:
 class SiliconFlowReranker(BaseDocumentCompressor):
-    """Reranker sử dụng SiliconFlow API để sắp xếp lại kết quả."""
     api_key: str = Field(default="")
     api_base_url: str = Field(default="")
     model: str = Field(default="")
@@ -68,11 +66,11 @@ class SiliconFlowReranker(BaseDocumentCompressor):
         query: str,
         callbacks: Optional[Callbacks] = None,
     ) -> Sequence[Document]:
-        """Rerank documents dựa trên độ liên quan với query."""
         if not documents or not self.api_key:
             return list(documents)
-        # Retry logic với exponential backoff
         for attempt in range(3):
             try:
                 response = requests.post(
@@ -95,7 +93,7 @@ class SiliconFlowReranker(BaseDocumentCompressor):
                 if "results" not in data:
                     return list(documents)
-                # Tạo danh sách documents đã rerank với score
                 reranked: List[Document] = []
                 for result in data["results"]:
                     doc = documents[result["index"]]
@@ -106,36 +104,36 @@ class SiliconFlowReranker(BaseDocumentCompressor):
                 return reranked
             except Exception as e:
-                # Rate limit -> đợi rồi thử lại
                 if "rate" in str(e).lower() and attempt < 2:
                     time.sleep(2 ** attempt)
                 else:
-                    logger.error(f"Lỗi rerank: {e}")
                     return list(documents)
         return list(documents)
 class Retriever:
-    """Retriever chính hỗ trợ nhiều chế độ tìm kiếm."""
     def __init__(self, vector_db: "ChromaVectorDB", use_reranker: bool = True):
-        """Khởi tạo retriever với vector DB và reranker."""
         self._vector_db = vector_db
         self._config = get_retrieval_config()
         self._reranker: Optional[SiliconFlowReranker] = None
-        # Vector retriever từ ChromaDB
         self._vector_retriever = self._vector_db.vectorstore.as_retriever(
             search_kwargs={"k": self._config.initial_k}
         )
-        # Lazy-load BM25 - chỉ khởi tạo khi cần
         self._bm25_retriever: Optional[BM25Retriever] = None
         self._bm25_initialized = False
         self._ensemble_retriever: Optional[EnsembleRetriever] = None
-        # Đường dẫn cache BM25 (lưu vào disk)
         from pathlib import Path
         persist_dir = getattr(self._vector_db.config, 'persist_dir', None)
         if persist_dir:
@@ -146,22 +144,22 @@ class Retriever:
         if use_reranker:
             self._reranker = self._init_reranker()
-        logger.info("Đã khởi tạo Retriever")
     def _save_bm25_cache(self, bm25: BM25Retriever) -> None:
-        """Lưu BM25 index vào cache file."""
         if not self._bm25_cache_path:
             return
         try:
             import pickle
             with open(self._bm25_cache_path, 'wb') as f:
                 pickle.dump(bm25, f)
-            logger.info(f"Đã lưu BM25 cache vào {self._bm25_cache_path}")
         except Exception as e:
-            logger.warning(f"Không thể lưu BM25 cache: {e}")
     def _load_bm25_cache(self) -> Optional[BM25Retriever]:
-        """Tải BM25 index từ cache file."""
         if not self._bm25_cache_path or not self._bm25_cache_path.exists():
             return None
         try:
@@ -170,33 +168,33 @@ class Retriever:
             with open(self._bm25_cache_path, 'rb') as f:
                 bm25 = pickle.load(f)
             bm25.k = self._config.initial_k
-            logger.info(f"Đã tải BM25 từ cache trong {time.time() - start:.2f}s")
             return bm25
         except Exception as e:
-            logger.warning(f"Không thể tải BM25 cache: {e}")
             return None
     def _init_bm25(self) -> Optional[BM25Retriever]:
-        """Khởi tạo BM25 retriever (lazy-load với cache)."""
         if self._bm25_initialized:
             return self._bm25_retriever
         self._bm25_initialized = True
-        # Thử tải từ cache trước
         cached = self._load_bm25_cache()
         if cached:
             self._bm25_retriever = cached
             return cached
-        # Build từ đầu nếu không có cache
         try:
             start = time.time()
-            logger.info("Đang xây dựng BM25 index từ documents...")
             docs = self._vector_db.get_all_documents()
             if not docs:
-                logger.warning("Không tìm thấy documents cho BM25")
                 return None
             lc_docs = [
@@ -207,18 +205,18 @@ class Retriever:
             bm25.k = self._config.initial_k
             self._bm25_retriever = bm25
-            logger.info(f"Đã xây dựng BM25 với {len(docs)} docs trong {time.time() - start:.2f}s")
-            # Lưu vào cache cho lần sau
             self._save_bm25_cache(bm25)
             return bm25
         except Exception as e:
-            logger.error(f"Không thể khởi tạo BM25: {e}")
             return None
     def _get_ensemble_retriever(self) -> EnsembleRetriever:
-        """Lấy ensemble retriever (vector + BM25)."""
         if self._ensemble_retriever is not None:
             return self._ensemble_retriever
@@ -229,7 +227,7 @@ class Retriever:
                 weights=[self._config.vector_weight, self._config.bm25_weight]
             )
         else:
-            # Fallback về vector only
             self._ensemble_retriever = EnsembleRetriever(
                 retrievers=[self._vector_retriever],
                 weights=[1.0]
@@ -237,7 +235,7 @@ class Retriever:
         return self._ensemble_retriever
     def _init_reranker(self) -> Optional[SiliconFlowReranker]:
-        """Khởi tạo reranker nếu có API key."""
         api_key = os.getenv("SILICONFLOW_API_KEY", "").strip()
         if not api_key:
             return None
@@ -249,7 +247,7 @@ class Retriever:
         )
     def _build_final(self):
-        """Build retriever cuối cùng (ensemble + reranker nếu có)."""
         ensemble = self._get_ensemble_retriever()
         if self._reranker:
             return ContextualCompressionRetriever(
@@ -260,20 +258,20 @@ class Retriever:
     @property
     def has_reranker(self) -> bool:
-        """Kiểm tra có reranker không."""
         return self._reranker is not None
     def _to_result(self, doc: Document, rank: int, **extra) -> Dict[str, Any]:
-        """Chuyển Document thành dict result, xử lý Small-to-Big."""
         metadata = doc.metadata or {}
         content = doc.page_content
-        # Small-to-Big: Nếu là summary node -> swap với parent (bảng gốc)
         if metadata.get("is_table_summary") and metadata.get("parent_id"):
             parent = self._vector_db.get_parent_node(metadata["parent_id"])
             if parent:
                 content = parent.get("content", content)
-                # Merge metadata, giữ lại info summary để debug
                 metadata = {
                     **parent.get("metadata", {}),
                     "original_summary": doc.page_content[:200],
@@ -291,7 +289,7 @@ class Retriever:
     def vector_search(
         self, text: str, *, k: int | None = None, where: Optional[Dict[str, Any]] = None
     ) -> List[Dict[str, Any]]:
-        """Tìm kiếm bằng vector similarity."""
         if not text.strip():
             return []
         k = k or self._config.top_k
@@ -299,7 +297,7 @@ class Retriever:
         return [self._to_result(doc, i + 1, distance=score) for i, (doc, score) in enumerate(results)]
     def bm25_search(self, text: str, *, k: int | None = None) -> List[Dict[str, Any]]:
-        """Tìm kiếm bằng BM25 keyword matching."""
         if not text.strip():
             return []
         bm25 = self._init_bm25()  # Lazy-load BM25
@@ -313,7 +311,7 @@ class Retriever:
     def hybrid_search(
         self, text: str, *, k: int | None = None, initial_k: int | None = None
     ) -> List[Dict[str, Any]]:
-        """Tìm kiếm hybrid (vector + BM25) không có rerank."""
         if not text.strip():
             return []
         k = k or self._config.top_k
@@ -335,13 +333,13 @@ class Retriever:
         where: Optional[Dict[str, Any]] = None,
         initial_k: int | None = None,
     ) -> List[Dict[str, Any]]:
-        """Tìm kiếm hybrid + reranking để có kết quả tốt nhất."""
         if not text.strip():
             return []
         k = k or self._config.top_k
         initial_k = initial_k or self._config.initial_k
-        # Có filter -> dùng vector search + manual rerank
         if where:
             results = self._vector_db.vectorstore.similarity_search(text, k=initial_k, filter=where)
             if self._reranker:
@@ -351,7 +349,7 @@ class Retriever:
                 for i, doc in enumerate(results[:k])
             ]
-        # Cập nhật k cho initial fetch
         if initial_k:
             self._vector_retriever.search_kwargs["k"] = initial_k
             bm25 = self._init_bm25()
@@ -362,7 +360,7 @@ class Retriever:
         ensemble = self._get_ensemble_retriever()
         ensemble_results = ensemble.invoke(text)
-        # Rerank nếu có
         if self._reranker:
             results = self._reranker.compress_documents(ensemble_results, text)
         else:
@@ -382,11 +380,11 @@ class Retriever:
         initial_k: int | None = None,
         where: Optional[Dict[str, Any]] = None,
     ) -> List[Dict[str, Any]]:
-        """Tìm kiếm linh hoạt với nhiều chế độ."""
         if not text.strip():
             return []
-        # Parse mode từ string
         if isinstance(mode, str):
             try:
                 mode = RetrievalMode(mode.lower())
@@ -396,7 +394,7 @@ class Retriever:
         k = k or self._config.top_k
         initial_k = initial_k or self._config.initial_k
-        # Gọi method tương ứng theo mode
         if mode == RetrievalMode.VECTOR_ONLY:
             return self.vector_search(text, k=k, where=where)
         elif mode == RetrievalMode.BM25_ONLY:
@@ -408,5 +406,5 @@ class Retriever:
         else:  # HYBRID_RERANK
             return self.search_with_rerank(text, k=k, where=where, initial_k=initial_k)
-    # Alias để tương thích ngược
     query = vector_search
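
Note on the hybrid path: `hybrid_search` and `search_with_rerank` rely on an `EnsembleRetriever` configured with `vector_weight` and `bm25_weight`. To my understanding, LangChain's ensemble merges the ranked lists with weighted Reciprocal Rank Fusion. The sketch below illustrates that idea on plain document IDs; the function name `weighted_rrf` and the constant `c=60` are assumptions for illustration, not code from this repository.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def weighted_rrf(
    rankings: Sequence[Sequence[str]],
    weights: Sequence[float],
    c: int = 60,
) -> List[str]:
    """Fuse several ranked ID lists into one list, best first.

    Each list contributes weight / (rank + c) to a document's score,
    where rank starts at 1. `c` dampens the influence of top ranks
    (60 is a commonly used value; treat it as an assumption here).
    """
    if len(rankings) != len(weights):
        raise ValueError("rankings and weights must have the same length")

    scores: Dict[str, float] = defaultdict(float)
    for ranking, weight in zip(rankings, weights):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += weight / (rank + c)

    # Highest fused score first; ties keep first-seen order (stable sort).
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)


if __name__ == "__main__":
    vector_hits = ["a", "b", "c"]  # hypothetical IDs from vector search
    bm25_hits = ["b", "d", "a"]    # hypothetical IDs from BM25
    print(weighted_rrf([vector_hits, bm25_hits], weights=[0.5, 0.5]))
```

With equal weights, a document that ranks reasonably well in both lists tends to beat one that appears in only one list, which is the usual motivation for combining keyword and vector search.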

 class RetrievalMode(str, Enum):
+    VECTOR_ONLY = "vector_only"
+    BM25_ONLY = "bm25_only"
+    HYBRID = "hybrid"
+    HYBRID_RERANK = "hybrid_rerank"
 @dataclass
 class RetrievalConfig:
+    rerank_api_base_url: str = "https://api.siliconflow.com/v1"
+    rerank_model: str = "Qwen/Qwen3-Reranker-8B"
+    rerank_top_n: int = 10
+    initial_k: int = 25
+    top_k: int = 5
+    vector_weight: float = 0.5
+    bm25_weight: float = 0.5
 _retrieval_config: RetrievalConfig | None = None
 def get_retrieval_config() -> RetrievalConfig:
     global _retrieval_config
     if _retrieval_config is None:
         _retrieval_config = RetrievalConfig()
 class SiliconFlowReranker(BaseDocumentCompressor):
     api_key: str = Field(default="")
     api_base_url: str = Field(default="")
     model: str = Field(default="")
         query: str,
         callbacks: Optional[Callbacks] = None,
     ) -> Sequence[Document]:
         if not documents or not self.api_key:
             return list(documents)
+        # Retry with exponential backoff
         for attempt in range(3):
             try:
                 response = requests.post(
                 if "results" not in data:
                     return list(documents)
+                # Build reranked document list with scores
                 reranked: List[Document] = []
                 for result in data["results"]:
                     doc = documents[result["index"]]
                 return reranked
             except Exception as e:
+                # Rate limit -> wait and retry
                 if "rate" in str(e).lower() and attempt < 2:
                     time.sleep(2 ** attempt)
                 else:
+                    logger.error(f"Rerank error: {e}")
                     return list(documents)
         return list(documents)
 class Retriever:
     def __init__(self, vector_db: "ChromaVectorDB", use_reranker: bool = True):
         self._vector_db = vector_db
         self._config = get_retrieval_config()
         self._reranker: Optional[SiliconFlowReranker] = None
+        # Vector retriever from ChromaDB
         self._vector_retriever = self._vector_db.vectorstore.as_retriever(
             search_kwargs={"k": self._config.initial_k}
         )
+        # Lazy-load BM25 - only initialized when needed
         self._bm25_retriever: Optional[BM25Retriever] = None
         self._bm25_initialized = False
         self._ensemble_retriever: Optional[EnsembleRetriever] = None
+        # BM25 cache path (saved to disk)
         from pathlib import Path
         persist_dir = getattr(self._vector_db.config, 'persist_dir', None)
         if persist_dir:
         if use_reranker:
             self._reranker = self._init_reranker()
+        logger.info("Initialized Retriever")
     def _save_bm25_cache(self, bm25: BM25Retriever) -> None:
         if not self._bm25_cache_path:
             return
         try:
             import pickle
             with open(self._bm25_cache_path, 'wb') as f:
                 pickle.dump(bm25, f)
+            logger.info(f"Saved BM25 cache to {self._bm25_cache_path}")
         except Exception as e:
+            logger.warning(f"Failed to save BM25 cache: {e}")
     def _load_bm25_cache(self) -> Optional[BM25Retriever]:
         if not self._bm25_cache_path or not self._bm25_cache_path.exists():
             return None
         try:
             with open(self._bm25_cache_path, 'rb') as f:
                 bm25 = pickle.load(f)
             bm25.k = self._config.initial_k
+            logger.info(f"Loaded BM25 from cache in {time.time() - start:.2f}s")
             return bm25
         except Exception as e:
+            logger.warning(f"Failed to load BM25 cache: {e}")
             return None
     def _init_bm25(self) -> Optional[BM25Retriever]:
         if self._bm25_initialized:
             return self._bm25_retriever
         self._bm25_initialized = True
+        # Try loading from cache first
         cached = self._load_bm25_cache()
         if cached:
             self._bm25_retriever = cached
             return cached
+        # Build from scratch if no cache
         try:
             start = time.time()
+            logger.info("Building BM25 index from documents...")
             docs = self._vector_db.get_all_documents()
             if not docs:
+                logger.warning("No documents found for BM25")
                 return None
             lc_docs = [
             bm25.k = self._config.initial_k
             self._bm25_retriever = bm25
+            logger.info(f"Built BM25 with {len(docs)} docs in {time.time() - start:.2f}s")
+            # Save to cache for next time
             self._save_bm25_cache(bm25)
             return bm25
         except Exception as e:
+            logger.error(f"Failed to initialize BM25: {e}")
             return None
     def _get_ensemble_retriever(self) -> EnsembleRetriever:
         if self._ensemble_retriever is not None:
             return self._ensemble_retriever
                 weights=[self._config.vector_weight, self._config.bm25_weight]
             )
         else:
+            # Fallback to vector only
             self._ensemble_retriever = EnsembleRetriever(
                 retrievers=[self._vector_retriever],
                 weights=[1.0]
         return self._ensemble_retriever
     def _init_reranker(self) -> Optional[SiliconFlowReranker]:
         api_key = os.getenv("SILICONFLOW_API_KEY", "").strip()
         if not api_key:
             return None
         )
     def _build_final(self):
         ensemble = self._get_ensemble_retriever()
         if self._reranker:
             return ContextualCompressionRetriever(
     @property
     def has_reranker(self) -> bool:
         return self._reranker is not None
     def _to_result(self, doc: Document, rank: int, **extra) -> Dict[str, Any]:
         metadata = doc.metadata or {}
         content = doc.page_content
+        # Small-to-Big: if summary node -> swap with parent (original table)
         if metadata.get("is_table_summary") and metadata.get("parent_id"):
             parent = self._vector_db.get_parent_node(metadata["parent_id"])
             if parent:
                 content = parent.get("content", content)
+                # Merge metadata, keep summary info for debugging
                 metadata = {
                     **parent.get("metadata", {}),
                     "original_summary": doc.page_content[:200],
     def vector_search(
         self, text: str, *, k: int | None = None, where: Optional[Dict[str, Any]] = None
     ) -> List[Dict[str, Any]]:
         if not text.strip():
             return []
         k = k or self._config.top_k
         return [self._to_result(doc, i + 1, distance=score) for i, (doc, score) in enumerate(results)]
     def bm25_search(self, text: str, *, k: int | None = None) -> List[Dict[str, Any]]:
         if not text.strip():
             return []
         bm25 = self._init_bm25()  # Lazy-load BM25
     def hybrid_search(
         self, text: str, *, k: int | None = None, initial_k: int | None = None
     ) -> List[Dict[str, Any]]:
         if not text.strip():
             return []
         k = k or self._config.top_k
         where: Optional[Dict[str, Any]] = None,
         initial_k: int | None = None,
     ) -> List[Dict[str, Any]]:
         if not text.strip():
             return []
         k = k or self._config.top_k
         initial_k = initial_k or self._config.initial_k
+        # Has filter -> use vector search + manual rerank
         if where:
             results = self._vector_db.vectorstore.similarity_search(text, k=initial_k, filter=where)
             if self._reranker:
                 for i, doc in enumerate(results[:k])
             ]
+        # Update k for initial fetch
         if initial_k:
             self._vector_retriever.search_kwargs["k"] = initial_k
             bm25 = self._init_bm25()
         ensemble = self._get_ensemble_retriever()
         ensemble_results = ensemble.invoke(text)
+        # Rerank if available
         if self._reranker:
             results = self._reranker.compress_documents(ensemble_results, text)
         else:
         initial_k: int | None = None,
         where: Optional[Dict[str, Any]] = None,
     ) -> List[Dict[str, Any]]:
         if not text.strip():
             return []
+        # Parse mode from string
         if isinstance(mode, str):
             try:
                 mode = RetrievalMode(mode.lower())
         k = k or self._config.top_k
         initial_k = initial_k or self._config.initial_k
+        # Dispatch to corresponding method by mode
         if mode == RetrievalMode.VECTOR_ONLY:
             return self.vector_search(text, k=k, where=where)
         elif mode == RetrievalMode.BM25_ONLY:
         else:  # HYBRID_RERANK
             return self.search_with_rerank(text, k=k, where=where, initial_k=initial_k)
+    # Backward compatibility alias
     query = vector_search
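
Note on the reranker's retry loop: `SiliconFlowReranker.compress_documents` tries up to three times, sleeps `2 ** attempt` seconds when the error message mentions a rate limit, and otherwise returns the documents unchanged. A minimal, library-free sketch of that pattern follows. The helper name `call_with_backoff` and its parameters are hypothetical, and the `sleep` argument is injectable only so the behaviour can be checked without waiting.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def _looks_rate_limited(exc: Exception) -> bool:
    # Same heuristic as the diff: look for "rate" in the error text.
    return "rate" in str(exc).lower()


def call_with_backoff(
    fn: Callable[[], T],
    *,
    fallback: T,
    max_attempts: int = 3,
    base_delay: float = 1.0,
    is_retryable: Callable[[Exception], bool] = _looks_rate_limited,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn(); on retryable errors wait base_delay * 2**attempt and retry.

    Returns `fallback` when the error is not retryable or when the
    last attempt also fails.
    """
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception as exc:
            is_last = attempt == max_attempts - 1
            if is_last or not is_retryable(exc):
                return fallback
            sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, ...
    return fallback
```

Returning a fallback instead of raising mirrors the choice in the diff: a failed rerank degrades to the unranked candidate list rather than failing the whole query. Matching on the error string is fragile; checking the HTTP status code (for example 429) would be a more robust assumption-free signal if the response object is available.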

core/rag/vector_store.py CHANGED

@@ -13,76 +13,76 @@ logger = logging.getLogger(__name__)
 @dataclass
 class ChromaConfig:
-    """Cấu hình cho ChromaDB."""
     def _default_persist_dir() -> str:
-        """Lấy đường dẫn mặc định cho persist directory."""
         repo_root = Path(__file__).resolve().parents[2]
         return str((repo_root / "data" / "chroma").resolve())
-    persist_dir: str = field(default_factory=_default_persist_dir)  # Thư mục lưu DB
-    collection_name: str = "hust_rag_collection"                    # Tên collection
 class ChromaVectorDB:
-    """Wrapper cho ChromaDB với hỗ trợ Small-to-Big retrieval."""
     def __init__(
         self,
         embedder: Any,
         config: ChromaConfig | None = None,
     ):
-        """Khởi tạo ChromaDB với embedder và config."""
         self.embedder = embedder
         self.config = config or ChromaConfig()
         self._hasher = HashProcessor(verbose=False)
-        # Lưu trữ parent nodes (không embed, dùng cho Small-to-Big)
         self._parent_nodes_path = Path(self.config.persist_dir) / "parent_nodes.json"
         self._parent_nodes: Dict[str, Dict[str, Any]] = self._load_parent_nodes()
-        # Khởi tạo ChromaDB
         self._vs = Chroma(
             collection_name=self.config.collection_name,
             embedding_function=self.embedder,
             persist_directory=self.config.persist_dir,
         )
-        logger.info(f"Đã khởi tạo ChromaVectorDB: {self.config.collection_name}")
     def _load_parent_nodes(self) -> Dict[str, Dict[str, Any]]:
-        """Tải parent nodes từ file JSON."""
         if self._parent_nodes_path.exists():
             try:
                 with open(self._parent_nodes_path, 'r', encoding='utf-8') as f:
                     data = json.load(f)
-                    logger.info(f"Đã tải {len(data)} parent nodes từ {self._parent_nodes_path}")
                     return data
             except Exception as e:
-                logger.warning(f"Không thể tải parent nodes: {e}")
         return {}
     def _save_parent_nodes(self) -> None:
-        """Lưu parent nodes vào file JSON."""
         try:
             self._parent_nodes_path.parent.mkdir(parents=True, exist_ok=True)
             with open(self._parent_nodes_path, 'w', encoding='utf-8') as f:
                 json.dump(self._parent_nodes, f, ensure_ascii=False, indent=2)
-            logger.info(f"Đã lưu {len(self._parent_nodes)} parent nodes vào {self._parent_nodes_path}")
         except Exception as e:
-            logger.warning(f"Không thể lưu parent nodes: {e}")
     @property
     def collection(self):
-        """Lấy collection gốc của ChromaDB."""
         return getattr(self._vs, "_collection", None)
     @property
     def vectorstore(self):
-        """Lấy LangChain Chroma vectorstore."""
         return self._vs
     def _flatten_metadata(self, metadata: Dict[str, Any]) -> Dict[str, Any]:
-        """Chuyển metadata phức tạp thành format ChromaDB hỗ trợ."""
         out: Dict[str, Any] = {}
         for k, v in (metadata or {}).items():
             if v is None:
@@ -90,33 +90,33 @@ class ChromaVectorDB:
             if isinstance(v, (str, int, float, bool)):
                 out[str(k)] = v
             elif isinstance(v, (list, tuple, set, dict)):
-                # Chuyển list/dict thành JSON string
                 out[str(k)] = json.dumps(v, ensure_ascii=False)
             else:
                 out[str(k)] = str(v)
         return out
     def _normalize_doc(self, doc: Any) -> Dict[str, Any]:
-        """Chuẩn hóa document từ nhiều format khác nhau thành dict."""
-        # Đã là dict
         if isinstance(doc, dict):
             return doc
-        # TextNode/BaseNode từ llama_index
         if hasattr(doc, "get_content") and hasattr(doc, "metadata"):
             return {
                 "content": doc.get_content(),
                 "metadata": dict(doc.metadata) if doc.metadata else {},
             }
-        # Document từ LangChain
         if hasattr(doc, "page_content") and hasattr(doc, "metadata"):
             return {
                 "content": doc.page_content,
                 "metadata": dict(doc.metadata) if doc.metadata else {},
             }
-        raise TypeError(f"Không hỗ trợ loại document: {type(doc)}")
     def _to_documents(self, docs: Sequence[Any], ids: Sequence[str]) -> List[Document]:
-        """Chuyển danh sách docs thành LangChain Documents."""
         out: List[Document] = []
         for d, doc_id in zip(docs, ids):
             normalized = self._normalize_doc(d)
@@ -126,7 +126,7 @@ class ChromaVectorDB:
         return out
     def _doc_id(self, doc: Any) -> str:
-        """Tạo ID duy nhất cho document dựa trên nội dung."""
         normalized = self._normalize_doc(doc)
         md = normalized.get("metadata") or {}
         key = {
@@ -144,14 +144,14 @@ class ChromaVectorDB:
         ids: Optional[Sequence[str]] = None,
         batch_size: int = 128,
     ) -> int:
-        """Thêm documents vào vector store."""
         if not docs:
             return 0
         if ids is not None and len(ids) != len(docs):
-            raise ValueError("Số lượng ids phải bằng số lượng docs")
-        # Tách parent nodes (không embed) khỏi regular nodes
         regular_docs = []
         regular_ids = []
         parent_count = 0
@@ -162,7 +162,7 @@ class ChromaVectorDB:
             doc_id = ids[i] if ids else self._doc_id(d)
             if md.get("is_parent"):
-                # Lưu parent node riêng (cho Small-to-Big)
                 parent_id = md.get("node_id", doc_id)
                 self._parent_nodes[parent_id] = {
                     "id": parent_id,
@@ -175,13 +175,13 @@ class ChromaVectorDB:
                 regular_ids.append(doc_id)
         if parent_count > 0:
-            logger.info(f"Đã lưu {parent_count} parent nodes (không embed)")
             self._save_parent_nodes()
         if not regular_docs:
             return parent_count
-        # Thêm theo batch
         bs = max(1, batch_size)
         total = 0
@@ -193,13 +193,13 @@ class ChromaVectorDB:
             try:
                 self._vs.add_documents(lc_docs, ids=batch_ids)
             except TypeError:
-                # Fallback nếu add_documents không nhận ids
                 texts = [d.page_content for d in lc_docs]
                 metas = [d.metadata for d in lc_docs]
                 self._vs.add_texts(texts=texts, metadatas=metas, ids=batch_ids)
             total += len(batch)
-        logger.info(f"Đã thêm {total} documents vào vector store")
         return total + parent_count
     def upsert_documents(
@@ -209,14 +209,14 @@ class ChromaVectorDB:
         ids: Optional[Sequence[str]] = None,
         batch_size: int = 128,
     ) -> int:
-        """Upsert documents (thêm mới hoặc cập nhật nếu đã tồn tại)."""
         if not docs:
             return 0
         if ids is not None and len(ids) != len(docs):
-            raise ValueError("Số lượng ids phải bằng số lượng docs")
-        # Tách parent nodes khỏi regular nodes
         regular_docs = []
         regular_ids = []
         parent_count = 0
@@ -227,7 +227,7 @@ class ChromaVectorDB:
             doc_id = ids[i] if ids else self._doc_id(d)
             if md.get("is_parent"):
-                # Lưu parent node riêng
                 parent_id = md.get("node_id", doc_id)
                 self._parent_nodes[parent_id] = {
                     "id": parent_id,
@@ -240,7 +240,7 @@ class ChromaVectorDB:
                 regular_ids.append(doc_id)
         if parent_count > 0:
-            logger.info(f"Đã lưu {parent_count} parent nodes (không embed)")
             self._save_parent_nodes()
         if not regular_docs:
@@ -249,11 +249,11 @@ class ChromaVectorDB:
         bs = max(1, batch_size)
         col = self.collection
-        # Fallback nếu không có collection
         if col is None:
             return self.add_documents(regular_docs, ids=regular_ids, batch_size=bs) + parent_count
-        # Upsert theo batch
         total = 0
         for start in range(0, len(regular_docs), bs):
             batch = regular_docs[start : start + bs]
@@ -265,16 +265,16 @@ class ChromaVectorDB:
             col.upsert(ids=batch_ids, documents=texts, metadatas=metas, embeddings=embs)
             total += len(batch)
-        logger.info(f"Đã upsert {total} documents vào vector store")
         return total + parent_count
     def count(self) -> int:
-        """Đếm số documents trong collection."""
         col = self.collection
         return int(col.count()) if col else 0
     def get_all_documents(self, limit: int = 5000) -> List[Dict[str, Any]]:
-        """Lấy tất cả documents từ collection."""
         col = self.collection
         if col is None:
             return []
@@ -291,7 +291,7 @@ class ChromaVectorDB:
         return docs
     def delete_documents(self, ids: Sequence[str]) -> int:
-        """Xóa documents theo danh sách IDs."""
         if not ids:
             return 0
@@ -300,14 +300,14 @@ class ChromaVectorDB:
             return 0
         col.delete(ids=list(ids))
-        logger.info(f"Đã xóa {len(ids)} documents khỏi vector store")
         return len(ids)
     def get_parent_node(self, parent_id: str) -> Optional[Dict[str, Any]]:
-        """Lấy parent node theo ID (cho Small-to-Big)."""
         return self._parent_nodes.get(parent_id)
     @property
     def parent_nodes(self) -> Dict[str, Dict[str, Any]]:
-        """Lấy tất cả parent nodes."""
         return self._parent_nodes
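
Note on Small-to-Big: the store keeps parent nodes (`is_parent`) out of the embedding index and saves them in `parent_nodes.json`, while `Retriever._to_result` swaps a matched table summary (`is_table_summary` with a `parent_id`) for its parent content. The sketch below reproduces that flow with plain dicts. The function names `split_parents` and `resolve_hit` are hypothetical, and the dict layout (`content`, `metadata`) is assumed from `_normalize_doc`.

```python
from typing import Any, Dict, List, Sequence, Tuple

Doc = Dict[str, Any]


def split_parents(docs: Sequence[Doc]) -> Tuple[List[Doc], Dict[str, Doc]]:
    """Separate parent nodes (stored, not embedded) from regular nodes."""
    regular: List[Doc] = []
    parents: Dict[str, Doc] = {}
    for doc in docs:
        metadata = doc.get("metadata") or {}
        if metadata.get("is_parent"):
            parents[metadata["node_id"]] = doc
        else:
            regular.append(doc)
    return regular, parents


def resolve_hit(hit: Doc, parents: Dict[str, Doc]) -> Doc:
    """Return the parent's content when the hit is a table summary."""
    metadata = hit.get("metadata") or {}
    parent = None
    if metadata.get("is_table_summary") and metadata.get("parent_id"):
        parent = parents.get(metadata["parent_id"])
    if parent is None:
        return hit  # regular chunk, or parent missing: keep the hit as-is
    return {
        "content": parent["content"],
        "metadata": {
            **(parent.get("metadata") or {}),
            # Keep a short trace of what was actually matched, for debugging.
            "original_summary": hit["content"][:200],
        },
    }
```

The point of the split is that a short summary embeds and matches well, while the full table is what the generator needs to read; only the summary pays the embedding cost.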

 @dataclass
 class ChromaConfig:
     def _default_persist_dir() -> str:
         repo_root = Path(__file__).resolve().parents[2]
         return str((repo_root / "data" / "chroma").resolve())
+    persist_dir: str = field(default_factory=_default_persist_dir)
+    collection_name: str = "hust_rag_collection"
 class ChromaVectorDB:
     def __init__(
         self,
         embedder: Any,
         config: ChromaConfig | None = None,
     ):
         self.embedder = embedder
         self.config = config or ChromaConfig()
         self._hasher = HashProcessor(verbose=False)
+        # Parent node storage (not embedded, used for Small-to-Big)
         self._parent_nodes_path = Path(self.config.persist_dir) / "parent_nodes.json"
         self._parent_nodes: Dict[str, Dict[str, Any]] = self._load_parent_nodes()
+        # Initialize ChromaDB
         self._vs = Chroma(
             collection_name=self.config.collection_name,
             embedding_function=self.embedder,
             persist_directory=self.config.persist_dir,
         )
+        logger.info(f"Initialized ChromaVectorDB: {self.config.collection_name}")
     def _load_parent_nodes(self) -> Dict[str, Dict[str, Any]]:
         if self._parent_nodes_path.exists():
             try:
                 with open(self._parent_nodes_path, 'r', encoding='utf-8') as f:
                     data = json.load(f)
+                    logger.info(f"Loaded {len(data)} parent nodes from {self._parent_nodes_path}")
                     return data
             except Exception as e:
+                logger.warning(f"Failed to load parent nodes: {e}")
         return {}
     def _save_parent_nodes(self) -> None:
         try:
             self._parent_nodes_path.parent.mkdir(parents=True, exist_ok=True)
             with open(self._parent_nodes_path, 'w', encoding='utf-8') as f:
                 json.dump(self._parent_nodes, f, ensure_ascii=False, indent=2)
+            logger.info(f"Saved {len(self._parent_nodes)} parent nodes to {self._parent_nodes_path}")
         except Exception as e:
+            logger.warning(f"Failed to save parent nodes: {e}")
     @property
     def collection(self):
         return getattr(self._vs, "_collection", None)
     @property
     def vectorstore(self):
         return self._vs
     def _flatten_metadata(self, metadata: Dict[str, Any]) -> Dict[str, Any]:
         out: Dict[str, Any] = {}
         for k, v in (metadata or {}).items():
             if v is None:
             if isinstance(v, (str, int, float, bool)):
                 out[str(k)] = v
             elif isinstance(v, (list, tuple, set, dict)):
+                # Convert list/dict to JSON string
                 out[str(k)] = json.dumps(v, ensure_ascii=False)
             else:
                 out[str(k)] = str(v)
         return out
     def _normalize_doc(self, doc: Any) -> Dict[str, Any]:
+        # Already a dict
         if isinstance(doc, dict):
             return doc
+        # TextNode/BaseNode from llama_index
         if hasattr(doc, "get_content") and hasattr(doc, "metadata"):
             return {
                 "content": doc.get_content(),
                 "metadata": dict(doc.metadata) if doc.metadata else {},
             }
+        # Document from LangChain
         if hasattr(doc, "page_content") and hasattr(doc, "metadata"):
             return {
                 "content": doc.page_content,
                 "metadata": dict(doc.metadata) if doc.metadata else {},
             }
+        raise TypeError(f"Unsupported document type: {type(doc)}")
     def _to_documents(self, docs: Sequence[Any], ids: Sequence[str]) -> List[Document]:
         out: List[Document] = []
         for d, doc_id in zip(docs, ids):
             normalized = self._normalize_doc(d)
         return out
     def _doc_id(self, doc: Any) -> str:
         normalized = self._normalize_doc(doc)
         md = normalized.get("metadata") or {}
         key = {
         ids: Optional[Sequence[str]] = None,
         batch_size: int = 128,
     ) -> int:
         if not docs:
             return 0
         if ids is not None and len(ids) != len(docs):
+            raise ValueError("Number of ids must match number of docs")
+        # Separate parent nodes (not embedded) from regular nodes
         regular_docs = []
         regular_ids = []
         parent_count = 0
             doc_id = ids[i] if ids else self._doc_id(d)
             if md.get("is_parent"):
+                # Store parent node separately (for Small-to-Big)
                 parent_id = md.get("node_id", doc_id)
                 self._parent_nodes[parent_id] = {
                     "id": parent_id,
                 regular_ids.append(doc_id)
         if parent_count > 0:
+            logger.info(f"Saved {parent_count} parent nodes (not embedded)")
             self._save_parent_nodes()
         if not regular_docs:
             return parent_count
+        # Add in batches
         bs = max(1, batch_size)
         total = 0
             try:
                 self._vs.add_documents(lc_docs, ids=batch_ids)
             except TypeError:
+                # Fallback if add_documents doesn't accept ids
                 texts = [d.page_content for d in lc_docs]
                 metas = [d.metadata for d in lc_docs]
                 self._vs.add_texts(texts=texts, metadatas=metas, ids=batch_ids)
             total += len(batch)
+        logger.info(f"Added {total} documents to vector store")
         return total + parent_count
     def upsert_documents(
         ids: Optional[Sequence[str]] = None,
         batch_size: int = 128,
     ) -> int:
         if not docs:
             return 0
         if ids is not None and len(ids) != len(docs):
+            raise ValueError("Number of ids must match number of docs")
+        # Separate parent nodes from regular nodes
         regular_docs = []
         regular_ids = []
         parent_count = 0
             doc_id = ids[i] if ids else self._doc_id(d)
             if md.get("is_parent"):
+                # Store parent node separately
                 parent_id = md.get("node_id", doc_id)
                 self._parent_nodes[parent_id] = {
                     "id": parent_id,
                 regular_ids.append(doc_id)
         if parent_count > 0:
+            logger.info(f"Saved {parent_count} parent nodes (not embedded)")
             self._save_parent_nodes()
         if not regular_docs:
         bs = max(1, batch_size)
         col = self.collection
+        # Fallback if no collection available
         if col is None:
             return self.add_documents(regular_docs, ids=regular_ids, batch_size=bs) + parent_count
+        # Upsert in batches
         total = 0
         for start in range(0, len(regular_docs), bs):
             batch = regular_docs[start : start + bs]
             col.upsert(ids=batch_ids, documents=texts, metadatas=metas, embeddings=embs)
             total += len(batch)
+        logger.info(f"Upserted {total} documents to vector store")
         return total + parent_count
     def count(self) -> int:
         col = self.collection
         return int(col.count()) if col else 0
     def get_all_documents(self, limit: int = 5000) -> List[Dict[str, Any]]:
         col = self.collection
         if col is None:
             return []
         return docs
     def delete_documents(self, ids: Sequence[str]) -> int:
         if not ids:
             return 0
             return 0
         col.delete(ids=list(ids))
+        logger.info(f"Deleted {len(ids)} documents from vector store")
         return len(ids)
     def get_parent_node(self, parent_id: str) -> Optional[Dict[str, Any]]:
         return self._parent_nodes.get(parent_id)
     @property
     def parent_nodes(self) -> Dict[str, Dict[str, Any]]:
         return self._parent_nodes
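
Note on `_flatten_metadata`: ChromaDB metadata values are limited to simple scalars, so the method keeps `str`/`int`/`float`/`bool` values and serializes containers to JSON strings. One caveat worth flagging: `json.dumps` cannot serialize a Python `set` directly and raises `TypeError`, so the standalone sketch below converts sets to sorted lists first. That conversion, the function name `flatten_metadata`, and the assumption that `None` values are skipped (the line after `if v is None:` is not visible in the diff) are mine, not confirmed by the source.

```python
import json
from typing import Any, Dict, Optional


def flatten_metadata(metadata: Optional[Dict[str, Any]]) -> Dict[str, Any]:
    """Reduce arbitrary metadata to scalar values a vector store can hold."""
    out: Dict[str, Any] = {}
    for key, value in (metadata or {}).items():
        if value is None:
            continue  # assumption: None values are dropped
        if isinstance(value, (str, int, float, bool)):
            out[str(key)] = value
        elif isinstance(value, (set, frozenset)):
            # json.dumps rejects sets, so emit a deterministic sorted list.
            out[str(key)] = json.dumps(sorted(value, key=str), ensure_ascii=False)
        elif isinstance(value, (list, tuple, dict)):
            out[str(key)] = json.dumps(value, ensure_ascii=False)
        else:
            out[str(key)] = str(value)
    return out


if __name__ == "__main__":
    print(flatten_metadata({"page": 3, "tags": ["quy chế", "bảng"], "extra": None}))
```

`ensure_ascii=False` keeps Vietnamese text readable in the stored string instead of turning it into `\u` escapes. Readers of the stored metadata need to call `json.loads` on those fields to get the original structure back.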

evaluation/eval_utils.py CHANGED

@@ -1,5 +1,3 @@
-"""Các utility functions cho evaluation."""
 import os
 import sys
 import re
@@ -17,17 +15,17 @@ load_dotenv(find_dotenv(usecwd=True))
 from openai import OpenAI
 from core.rag.embedding_model import EmbeddingConfig, QwenEmbeddings
 from core.rag.vector_store import ChromaConfig, ChromaVectorDB
-from core.rag.retrival import Retriever
 from core.rag.generator import RAGGenerator
 def strip_thinking(text: str) -> str:
-    """Loại bỏ các block <think>...</think> từ output của LLM."""
     return re.sub(r'<think>.*?</think>\s*', '', text, flags=re.DOTALL).strip()
 def load_csv_data(csv_path: str, sample_size: int = 0) -> tuple[list, list]:
-    """Đọc dữ liệu câu hỏi và ground truth từ file CSV."""
     questions, ground_truths = [], []
     with open(csv_path, 'r', encoding='utf-8') as f:
         for row in csv.DictReader(f):
@@ -35,7 +33,7 @@ def load_csv_data(csv_path: str, sample_size: int = 0) -> tuple[list, list]:
                 questions.append(row['question'])
                 ground_truths.append(row['ground_truth'])
-    # Giới hạn số lượng sample
     if sample_size > 0:
         questions = questions[:sample_size]
         ground_truths = ground_truths[:sample_size]
@@ -44,16 +42,16 @@ def load_csv_data(csv_path: str, sample_size: int = 0) -> tuple[list, list]:
 def init_rag() -> tuple[RAGGenerator, QwenEmbeddings, OpenAI]:
-    """Khởi tạo các components RAG cho evaluation."""
     embeddings = QwenEmbeddings(EmbeddingConfig())
     db = ChromaVectorDB(embedder=embeddings, config=ChromaConfig())
     retriever = Retriever(vector_db=db)
     rag = RAGGenerator(retriever=retriever)
-    # Khởi tạo LLM client
     api_key = os.getenv("SILICONFLOW_API_KEY", "").strip()
     if not api_key:
-        raise ValueError("Chưa đặt SILICONFLOW_API_KEY")
     llm_client = OpenAI(api_key=api_key, base_url="https://api.siliconflow.com/v1", timeout=60.0)
     return rag, embeddings, llm_client
@@ -67,18 +65,18 @@ def generate_answers(
     retrieval_mode: str = "hybrid_rerank",
     max_workers: int = 8,
 ) -> tuple[list, list]:
-    """Generate câu trả lời cho danh sách câu hỏi với parallel processing."""
     def process(idx_q):
-        """Xử lý một câu hỏi: retrieve + generate."""
         idx, q = idx_q
         try:
-            # Retrieve và chuẩn bị context
             prepared = rag.retrieve_and_prepare(q, mode=retrieval_mode)
             if not prepared["results"]:
                 return idx, "Không tìm thấy thông tin.", []
-            # Gọi LLM để generate answer
             resp = llm_client.chat.completions.create(
                 model=llm_model,
                 messages=[{"role": "user", "content": prepared["prompt"]}],
@@ -88,20 +86,20 @@ def generate_answers(
             answer = strip_thinking(resp.choices[0].message.content or "")
             return idx, answer, prepared["contexts"]
         except Exception as e:
-            print(f"  Q{idx+1} Lỗi: {e}")
             return idx, "Không thể trả lời.", []
     n = len(questions)
     answers, contexts = [""] * n, [[] for _ in range(n)]
-    print(f"  Đang generate {n} câu trả lời...")
-    # Xử lý song song với ThreadPoolExecutor
     with ThreadPoolExecutor(max_workers=max_workers) as executor:
         futures = {executor.submit(process, (i, q)): i for i, q in enumerate(questions)}
         for i, future in enumerate(as_completed(futures), 1):
             idx, ans, ctx = future.result(timeout=120)
             answers[idx], contexts[idx] = ans, ctx
-            print(f"  [{i}/{n}] Xong")
     return answers, contexts
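
Note on `generate_answers`: questions are processed in a thread pool and results arrive in completion order, so each result has to be written back to the slot of its original question. The diff carries the index through the worker's return value. The sketch below shows the same idea using the future-to-index mapping instead. The helper name `map_in_parallel` and its `on_error` parameter are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable, List, Sequence


def map_in_parallel(
    items: Sequence[str],
    worker: Callable[[str], str],
    *,
    max_workers: int = 8,
    on_error: str = "",
) -> List[str]:
    """Run worker over items concurrently; results keep the input order.

    A failing item yields `on_error` instead of aborting the whole batch.
    """
    results: List[str] = [on_error] * len(items)
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # Map each future back to the position of its input item.
        futures = {executor.submit(worker, item): i for i, item in enumerate(items)}
        for future in as_completed(futures):
            idx = futures[future]
            try:
                results[idx] = future.result()
            except Exception:
                results[idx] = on_error
    return results
```

Threads suit this workload because each item spends most of its time waiting on network calls (retrieval, rerank, LLM). Note that futures yielded by `as_completed` are already finished, so a `timeout` passed to `future.result()` at that point has no practical effect; a timeout on `as_completed` itself or on the HTTP client is what bounds the wait.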

 import os
 import sys
 import re
 from openai import OpenAI
 from core.rag.embedding_model import EmbeddingConfig, QwenEmbeddings
 from core.rag.vector_store import ChromaConfig, ChromaVectorDB
+from core.rag.retrieval import Retriever
 from core.rag.generator import RAGGenerator
 def strip_thinking(text: str) -> str:
     return re.sub(r'<think>.*?</think>\s*', '', text, flags=re.DOTALL).strip()
 def load_csv_data(csv_path: str, sample_size: int = 0) -> tuple[list, list]:
     questions, ground_truths = [], []
     with open(csv_path, 'r', encoding='utf-8') as f:
         for row in csv.DictReader(f):
                 questions.append(row['question'])
                 ground_truths.append(row['ground_truth'])
+    # Limit sample size
     if sample_size > 0:
         questions = questions[:sample_size]
         ground_truths = ground_truths[:sample_size]
 def init_rag() -> tuple[RAGGenerator, QwenEmbeddings, OpenAI]:
     embeddings = QwenEmbeddings(EmbeddingConfig())
     db = ChromaVectorDB(embedder=embeddings, config=ChromaConfig())
     retriever = Retriever(vector_db=db)
     rag = RAGGenerator(retriever=retriever)
+    # Initialize LLM client
     api_key = os.getenv("SILICONFLOW_API_KEY", "").strip()
     if not api_key:
+        raise ValueError("Missing SILICONFLOW_API_KEY")
     llm_client = OpenAI(api_key=api_key, base_url="https://api.siliconflow.com/v1", timeout=60.0)
     return rag, embeddings, llm_client
     retrieval_mode: str = "hybrid_rerank",
     max_workers: int = 8,
 ) -> tuple[list, list]:
     def process(idx_q):
         idx, q = idx_q
         try:
+            # Retrieve and prepare context
             prepared = rag.retrieve_and_prepare(q, mode=retrieval_mode)
             if not prepared["results"]:
                 return idx, "Không tìm thấy thông tin.", []
+            # Call LLM to generate answer
             resp = llm_client.chat.completions.create(
                 model=llm_model,
                 messages=[{"role": "user", "content": prepared["prompt"]}],
             answer = strip_thinking(resp.choices[0].message.content or "")
             return idx, answer, prepared["contexts"]
         except Exception as e:
+            print(f"  Q{idx+1} Error: {e}")
             return idx, "Không thể trả lời.", []
     n = len(questions)
     answers, contexts = [""] * n, [[] for _ in range(n)]
+    print(f"  Generating {n} answers...")
+    # Parallel processing with ThreadPoolExecutor
     with ThreadPoolExecutor(max_workers=max_workers) as executor:
         futures = {executor.submit(process, (i, q)): i for i, q in enumerate(questions)}
         for i, future in enumerate(as_completed(futures), 1):
             idx, ans, ctx = future.result(timeout=120)
             answers[idx], contexts[idx] = ans, ctx
+            print(f"  [{i}/{n}] Done")
     return answers, contexts
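
Review note: `generate_answers` submits one task per question and writes each result back by its original index, so the output order matches the input even though `as_completed` yields futures in completion order. A minimal sketch of that pattern follows, assuming a hypothetical `answer_fn` that stands in for the retrieve-and-generate step (`answer_all` and `fake_answer` are illustrative names, not part of the repository):

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def answer_all(questions, answer_fn, max_workers=8):
    """Run answer_fn over questions in parallel, keeping input order."""
    def process(idx_q):
        idx, q = idx_q
        try:
            return idx, answer_fn(q)
        except Exception as e:
            # One failed question should not abort the whole batch
            print(f"  Q{idx + 1} Error: {e}")
            return idx, "Không thể trả lời."

    answers = [""] * len(questions)
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        futures = [executor.submit(process, (i, q)) for i, q in enumerate(questions)]
        for future in as_completed(futures):
            idx, ans = future.result()
            answers[idx] = ans  # slot by original index, not completion order
    return answers


def fake_answer(q):
    # Hypothetical stand-in for the retrieval + LLM call
    if not q:
        raise ValueError("empty question")
    return q.upper()
```

Because `process` catches its own exceptions, `future.result()` does not re-raise here; the fallback string is stored in the failed slot instead.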

evaluation/ragas_eval.py CHANGED

@@ -1,5 +1,3 @@
-"""Script đánh giá RAG bằng RAGAS framework."""
 import os
 import sys
 import json
@@ -23,34 +21,34 @@ from ragas.run_config import RunConfig
 from evaluation.eval_utils import load_csv_data, init_rag, generate_answers
-# Cấu hình
-CSV_PATH = "data/data.csv"                                        # File dữ liệu test
-OUTPUT_DIR = "evaluation/results"                                  # Thư mục output
-LLM_MODEL = os.getenv("EVAL_LLM_MODEL", "nex-agi/DeepSeek-V3.1-Nex-N1")  # Model đánh giá
 API_BASE = "https://api.siliconflow.com/v1"
 def run_evaluation(sample_size: int = 10, retrieval_mode: str = "hybrid_rerank") -> dict:
-    """Chạy đánh giá RAGAS trên dữ liệu test."""
     print(f"\n{'='*60}")
     print(f"RAGAS EVALUATION - Mode: {retrieval_mode}")
     print(f"{'='*60}")
-    # Khởi tạo RAG components
     rag, embeddings, llm_client = init_rag()
-    # Tải dữ liệu test
     questions, ground_truths = load_csv_data(str(REPO_ROOT / CSV_PATH), sample_size)
-    print(f"  Đã tải {len(questions)} samples")
-    # Generate câu trả lời
     answers, contexts = generate_answers(
         rag, questions, llm_client,
         llm_model=LLM_MODEL,
         retrieval_mode=retrieval_mode,
     )
-    # Thiết lập RAGAS evaluator
     api_key = os.getenv("SILICONFLOW_API_KEY", "")
     evaluator_llm = LangchainLLMWrapper(ChatOpenAI(
         model=LLM_MODEL,
@@ -62,7 +60,7 @@ def run_evaluation(sample_size: int = 10, retrieval_mode: str = "hybrid_rerank")
     ))
     evaluator_embeddings = LangchainEmbeddingsWrapper(embeddings)
-    # Chuyển dữ liệu thành format Dataset
     dataset = Dataset.from_dict({
         "question": questions,
         "answer": answers,
@@ -70,8 +68,8 @@ def run_evaluation(sample_size: int = 10, retrieval_mode: str = "hybrid_rerank")
         "ground_truth": ground_truths,
     })
-    # Chạy đánh giá RAGAS
-    print("\n  Đang chạy RAGAS metrics...")
     results = evaluate(
         dataset=dataset,
         metrics=[
@@ -79,9 +77,9 @@ def run_evaluation(sample_size: int = 10, retrieval_mode: str = "hybrid_rerank")
             answer_relevancy,       # Độ liên quan của câu trả lời
             context_precision,      # Độ chính xác của context
             context_recall,         # Độ bao phủ của context
-            RougeScore(rouge_type='rouge1', mode='fmeasure'),  # ROUGE-1
-            RougeScore(rouge_type='rouge2', mode='fmeasure'),  # ROUGE-2
-            RougeScore(rouge_type='rougeL', mode='fmeasure'),  # ROUGE-L
         ],
         llm=evaluator_llm,
         embeddings=evaluator_embeddings,
@@ -89,37 +87,47 @@ def run_evaluation(sample_size: int = 10, retrieval_mode: str = "hybrid_rerank")
         run_config=RunConfig(max_workers=8, timeout=600, max_retries=3),
     )
-    # Trích xuất điểm số
     df = results.to_pandas()
     metric_cols = [c for c in df.columns if c not in ("question", "answer", "contexts", "ground_truth", "user_input", "response", "reference", "retrieved_contexts")]
-    # Tính điểm trung bình cho mỗi metric
     avg_scores = {}
     for col in metric_cols:
         values = df[col].dropna().tolist()
         if values:
             avg_scores[col] = sum(values) / len(values)
-    # Lưu kết quả
     out_path = REPO_ROOT / OUTPUT_DIR
     out_path.mkdir(parents=True, exist_ok=True)
     timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
-    # Lưu file CSV (tóm tắt)
     csv_path = out_path / f"ragas_{retrieval_mode}_{timestamp}.csv"
     with open(csv_path, 'w', encoding='utf-8') as f:
         f.write("retrieval_mode,sample_size," + ",".join(avg_scores.keys()) + "\n")
         f.write(f"{retrieval_mode},{len(questions)}," + ",".join(f"{v:.4f}" for v in avg_scores.values()) + "\n")
-    # In kết quả
     print(f"\n{'='*60}")
-    print(f"KẾT QUẢ - {retrieval_mode} ({len(questions)} samples)")
     print(f"{'='*60}")
     for metric, score in avg_scores.items():
         bar = "#" * int(score * 20) + "-" * (20 - int(score * 20))
         print(f"  {metric:25} [{bar}] {score:.4f}")
-    print(f"\nĐã lưu: {json_path}")
-    print(f"Đã lưu: {csv_path}")
     return avg_scores

 import os
 import sys
 import json
 from evaluation.eval_utils import load_csv_data, init_rag, generate_answers
+# Configuration
+CSV_PATH = "data/data.csv"
+OUTPUT_DIR = "evaluation/results"
+LLM_MODEL = os.getenv("EVAL_LLM_MODEL", "nex-agi/DeepSeek-V3.1-Nex-N1")
 API_BASE = "https://api.siliconflow.com/v1"
 def run_evaluation(sample_size: int = 10, retrieval_mode: str = "hybrid_rerank") -> dict:
     print(f"\n{'='*60}")
     print(f"RAGAS EVALUATION - Mode: {retrieval_mode}")
     print(f"{'='*60}")
+    # Initialize RAG components
     rag, embeddings, llm_client = init_rag()
+    # Load test data
     questions, ground_truths = load_csv_data(str(REPO_ROOT / CSV_PATH), sample_size)
+    print(f"  Loaded {len(questions)} samples")
+    # Generate answers
     answers, contexts = generate_answers(
         rag, questions, llm_client,
         llm_model=LLM_MODEL,
         retrieval_mode=retrieval_mode,
     )
+    # Setup RAGAS evaluator
     api_key = os.getenv("SILICONFLOW_API_KEY", "")
     evaluator_llm = LangchainLLMWrapper(ChatOpenAI(
         model=LLM_MODEL,
     ))
     evaluator_embeddings = LangchainEmbeddingsWrapper(embeddings)
+    # Convert data to Dataset format
     dataset = Dataset.from_dict({
         "question": questions,
         "answer": answers,
         "ground_truth": ground_truths,
     })
+    # Run RAGAS evaluation
+    print("\n  Running RAGAS metrics...")
     results = evaluate(
         dataset=dataset,
         metrics=[
             answer_relevancy,       # Answer relevancy
             context_precision,      # Context precision
             context_recall,         # Context recall
+            RougeScore(rouge_type='rouge1', mode='fmeasure'),
+            RougeScore(rouge_type='rouge2', mode='fmeasure'),
+            RougeScore(rouge_type='rougeL', mode='fmeasure'),
         ],
         llm=evaluator_llm,
         embeddings=evaluator_embeddings,
         run_config=RunConfig(max_workers=8, timeout=600, max_retries=3),
     )
+    # Extract scores
     df = results.to_pandas()
     metric_cols = [c for c in df.columns if c not in ("question", "answer", "contexts", "ground_truth", "user_input", "response", "reference", "retrieved_contexts")]
+    # Calculate average score for each metric
     avg_scores = {}
     for col in metric_cols:
         values = df[col].dropna().tolist()
         if values:
             avg_scores[col] = sum(values) / len(values)
+    # Save results
     out_path = REPO_ROOT / OUTPUT_DIR
     out_path.mkdir(parents=True, exist_ok=True)
     timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
+    # Save JSON file (detailed)
+    json_path = out_path / f"ragas_{retrieval_mode}_{timestamp}.json"
+    with open(json_path, 'w', encoding='utf-8') as f:
+        json.dump({
+            "retrieval_mode": retrieval_mode,
+            "sample_size": len(questions),
+            "timestamp": timestamp,
+            "scores": avg_scores,
+        }, f, ensure_ascii=False, indent=2)
+    # Save CSV file (summary)
     csv_path = out_path / f"ragas_{retrieval_mode}_{timestamp}.csv"
     with open(csv_path, 'w', encoding='utf-8') as f:
         f.write("retrieval_mode,sample_size," + ",".join(avg_scores.keys()) + "\n")
         f.write(f"{retrieval_mode},{len(questions)}," + ",".join(f"{v:.4f}" for v in avg_scores.values()) + "\n")
+    # Print results
     print(f"\n{'='*60}")
+    print(f"RESULTS - {retrieval_mode} ({len(questions)} samples)")
     print(f"{'='*60}")
     for metric, score in avg_scores.items():
         bar = "#" * int(score * 20) + "-" * (20 - int(score * 20))
         print(f"  {metric:25} [{bar}] {score:.4f}")
+    print(f"\nSaved: {json_path}")
+    print(f"Saved: {csv_path}")
     return avg_scores
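
Review note: the score aggregation above averages each metric column after dropping missing values, then prints a 20-character bar. The sketch below reproduces that logic without pandas, assuming rows are plain dicts and that a missing score is `None` rather than NaN (`average_scores` and `render_bar` are illustrative names, not functions from the repository):

```python
def average_scores(rows, metric_names):
    """Mean of each metric, ignoring missing (None) values."""
    averages = {}
    for name in metric_names:
        values = [r[name] for r in rows if r.get(name) is not None]
        if values:  # a metric with no valid values is left out entirely
            averages[name] = sum(values) / len(values)
    return averages


def render_bar(score, width=20):
    """Text bar for a score in [0, 1]; '#' marks the filled part."""
    filled = int(score * width)
    return "#" * filled + "-" * (width - filled)
```

The bar assumes scores lie in [0, 1], which the code does not enforce. A score above 1 would make the bar longer than `width`.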

scripts/build_data.py CHANGED

@@ -1,5 +1,3 @@
-"""Script build ChromaDB từ markdown files với incremental update."""
 import sys
 import argparse
 from pathlib import Path
@@ -20,7 +18,7 @@ _hasher = HashProcessor(verbose=False)
 def get_db_file_info(db: ChromaVectorDB) -> dict:
-    """Lấy thông tin files đã có trong DB (IDs và hash)."""
     docs = db.get_all_documents()
     file_to_ids = {}
     file_to_hash = {}
@@ -36,7 +34,7 @@ def get_db_file_info(db: ChromaVectorDB) -> dict:
                 file_to_ids[source] = set()
             file_to_ids[source].add(doc_id)
-            # Lưu hash đầu tiên tìm thấy cho file
             if source not in file_to_hash and content_hash:
                 file_to_hash[source] = content_hash
@@ -44,65 +42,65 @@ def get_db_file_info(db: ChromaVectorDB) -> dict:
 def main():
-    parser = argparse.ArgumentParser(description="Build ChromaDB từ markdown files")
-    parser.add_argument("--force", action="store_true", help="Build lại tất cả files")
-    parser.add_argument("--no-delete", action="store_true", help="Không xóa docs orphaned")
     args = parser.parse_args()
     print("=" * 60)
     print("BUILD HUST RAG DATABASE")
     print("=" * 60)
-    # Bước 1: Khởi tạo embedder
-    print("\n[1/5] Khởi tạo embedder...")
     emb_cfg = EmbeddingConfig()
     emb = QwenEmbeddings(emb_cfg)
     print(f"  Model: {emb_cfg.model}")
     print(f"  API: {emb_cfg.api_base_url}")
-    # Bước 2: Khởi tạo ChromaDB
-    print("\n[2/5] Khởi tạo ChromaDB...")
     db_cfg = ChromaConfig()
     db = ChromaVectorDB(embedder=emb, config=db_cfg)
     old_count = db.count()
     print(f"  Collection: {db_cfg.collection_name}")
-    print(f"  Số docs hiện tại: {old_count}")
-    # Lấy trạng thái hiện tại của DB
     db_info = {"ids": {}, "hashes": {}}
     if not args.force and old_count > 0:
-        print("\n  Đang quét documents trong DB...")
         db_info = get_db_file_info(db)
-        print(f"  Tìm thấy {len(db_info['ids'])} source files trong DB")
-    # Bước 3: Quét markdown files
-    print("\n[3/5] Quét markdown files...")
     root = REPO_ROOT / "data" / "data_process"
     md_files = sorted(root.rglob("*.md"))
-    print(f"  Tìm thấy {len(md_files)} markdown files trên disk")
-    # So sánh files trên disk vs trong DB
     current_files = {f.name for f in md_files}
     db_files = set(db_info["ids"].keys())
-    # Tìm files cần xóa (có trong DB nhưng không có trên disk)
     files_to_delete = db_files - current_files
-    # Bước 4: Xóa docs orphaned
     deleted_count = 0
     if files_to_delete and not args.no_delete:
-        print(f"\n[4/5] Dọn dẹp {len(files_to_delete)} files đã xóa...")
         for filename in files_to_delete:
             doc_ids = list(db_info["ids"].get(filename, []))
             if doc_ids:
                 db.delete_documents(doc_ids)
                 deleted_count += len(doc_ids)
-                print(f"  Đã xóa: {filename} ({len(doc_ids)} chunks)")
     else:
-        print("\n[4/5] Không có files cần xóa")
-    # Bước 5: Xử lý markdown files (thêm mới, cập nhật)
-    print("\n[5/5] Xử lý markdown files...")
     total_added = 0
     total_updated = 0
     skipped = 0
@@ -112,16 +110,16 @@ def main():
         db_hash = db_info["hashes"].get(f.name, "")
         existing_ids = db_info["ids"].get(f.name, set())
-        # Bỏ qua nếu hash khớp (file không thay đổi)
         if not args.force and db_hash == file_hash:
-            print(f"  [{i}/{len(md_files)}] {f.name}: BỎ QUA (không đổi)")
             skipped += 1
             continue
-        # Nếu file thay đổi, xóa chunks cũ trước
         if existing_ids and not args.force:
             db.delete_documents(list(existing_ids))
-            print(f"  [{i}/{len(md_files)}] {f.name}: CẬP NHẬT (xóa {len(existing_ids)} chunks cũ)")
             is_update = True
         else:
             is_update = False
@@ -129,7 +127,7 @@ def main():
         try:
             docs = chunk_markdown_file(f)
             if docs:
-                # Thêm hash vào metadata để phát hiện thay đổi lần sau
                 for doc in docs:
                     if hasattr(doc, 'metadata'):
                         doc.metadata["content_hash"] = file_hash
@@ -139,36 +137,36 @@ def main():
                 n = db.upsert_documents(docs)
                 if is_update:
                     total_updated += n
-                    print(f"  [{i}/{len(md_files)}] {f.name}: +{n} chunks mới")
                 else:
                     total_added += n
                     print(f"  [{i}/{len(md_files)}] {f.name}: {n} chunks")
             else:
-                print(f"  [{i}/{len(md_files)}] {f.name}: BỎ QUA (không có chunks)")
         except Exception as e:
-            print(f"  [{i}/{len(md_files)}] {f.name}: LỖI - {e}")
-    # Tổng kết
     new_count = db.count()
     has_changes = deleted_count > 0 or total_updated > 0 or total_added > 0
-    # Xóa BM25 cache nếu có thay đổi (vì BM25 không hỗ trợ incremental update)
     if has_changes:
         bm25_cache = REPO_ROOT / "data" / "chroma" / "bm25_cache.pkl"
         if bm25_cache.exists():
             bm25_cache.unlink()
-            print("\n[!] Đã xóa BM25 cache (sẽ tự rebuild khi query)")
     print(f"\n{'=' * 60}")
-    print("TỔNG KẾT")
     print("=" * 60)
-    print(f"  Đã xóa (orphaned): {deleted_count} chunks")
-    print(f"  Đã cập nhật: {total_updated} chunks")
-    print(f"  Đã thêm mới: {total_added} chunks")
-    print(f"  Đã bỏ qua: {skipped} files")
-    print(f"  Số docs trong DB: {old_count} -> {new_count} ({new_count - old_count:+d})")
-    print("\nHOÀN TẤT!")
 if __name__ == "__main__":

 import sys
 import argparse
 from pathlib import Path
 def get_db_file_info(db: ChromaVectorDB) -> dict:
     docs = db.get_all_documents()
     file_to_ids = {}
     file_to_hash = {}
                 file_to_ids[source] = set()
             file_to_ids[source].add(doc_id)
+            # Store first hash found for file
             if source not in file_to_hash and content_hash:
                 file_to_hash[source] = content_hash
 def main():
+    parser = argparse.ArgumentParser(description="Build ChromaDB from markdown files")
+    parser.add_argument("--force", action="store_true", help="Rebuild all files")
+    parser.add_argument("--no-delete", action="store_true", help="Don't delete orphaned docs")
     args = parser.parse_args()
     print("=" * 60)
     print("BUILD HUST RAG DATABASE")
     print("=" * 60)
+    # Step 1: Initialize embedder
+    print("\n[1/5] Initializing embedder...")
     emb_cfg = EmbeddingConfig()
     emb = QwenEmbeddings(emb_cfg)
     print(f"  Model: {emb_cfg.model}")
     print(f"  API: {emb_cfg.api_base_url}")
+    # Step 2: Initialize ChromaDB
+    print("\n[2/5] Initializing ChromaDB...")
     db_cfg = ChromaConfig()
     db = ChromaVectorDB(embedder=emb, config=db_cfg)
     old_count = db.count()
     print(f"  Collection: {db_cfg.collection_name}")
+    print(f"  Current docs: {old_count}")
+    # Get current DB state
     db_info = {"ids": {}, "hashes": {}}
     if not args.force and old_count > 0:
+        print("\n  Scanning documents in DB...")
         db_info = get_db_file_info(db)
+        print(f"  Found {len(db_info['ids'])} source files in DB")
+    # Step 3: Scan markdown files
+    print("\n[3/5] Scanning markdown files...")
     root = REPO_ROOT / "data" / "data_process"
     md_files = sorted(root.rglob("*.md"))
+    print(f"  Found {len(md_files)} markdown files on disk")
+    # Compare files on disk vs in DB
     current_files = {f.name for f in md_files}
     db_files = set(db_info["ids"].keys())
+    # Find files to delete (in DB but not on disk)
     files_to_delete = db_files - current_files
+    # Step 4: Delete orphaned docs
     deleted_count = 0
     if files_to_delete and not args.no_delete:
+        print(f"\n[4/5] Cleaning up {len(files_to_delete)} deleted files...")
         for filename in files_to_delete:
             doc_ids = list(db_info["ids"].get(filename, []))
             if doc_ids:
                 db.delete_documents(doc_ids)
                 deleted_count += len(doc_ids)
+                print(f"  Deleted: {filename} ({len(doc_ids)} chunks)")
     else:
+        print("\n[4/5] No files to delete")
+    # Step 5: Process markdown files (add new, update)
+    print("\n[5/5] Processing markdown files...")
     total_added = 0
     total_updated = 0
     skipped = 0
         db_hash = db_info["hashes"].get(f.name, "")
         existing_ids = db_info["ids"].get(f.name, set())
+        # Skip if hash matches (file unchanged)
         if not args.force and db_hash == file_hash:
+            print(f"  [{i}/{len(md_files)}] {f.name}: SKIPPED (unchanged)")
             skipped += 1
             continue
+        # If file changed, delete old chunks first
         if existing_ids and not args.force:
             db.delete_documents(list(existing_ids))
+            print(f"  [{i}/{len(md_files)}] {f.name}: UPDATED (deleted {len(existing_ids)} old chunks)")
             is_update = True
         else:
             is_update = False
         try:
             docs = chunk_markdown_file(f)
             if docs:
+                # Add hash to metadata for change detection
                 for doc in docs:
                     if hasattr(doc, 'metadata'):
                         doc.metadata["content_hash"] = file_hash
                 n = db.upsert_documents(docs)
                 if is_update:
                     total_updated += n
+                    print(f"  [{i}/{len(md_files)}] {f.name}: +{n} new chunks")
                 else:
                     total_added += n
                     print(f"  [{i}/{len(md_files)}] {f.name}: {n} chunks")
             else:
+                print(f"  [{i}/{len(md_files)}] {f.name}: SKIPPED (no chunks)")
         except Exception as e:
+            print(f"  [{i}/{len(md_files)}] {f.name}: ERROR - {e}")
+    # Summary
     new_count = db.count()
     has_changes = deleted_count > 0 or total_updated > 0 or total_added > 0
+    # Delete BM25 cache if changes detected (BM25 doesn't support incremental update)
     if has_changes:
         bm25_cache = REPO_ROOT / "data" / "chroma" / "bm25_cache.pkl"
         if bm25_cache.exists():
             bm25_cache.unlink()
+            print("\n[!] Deleted BM25 cache (will auto-rebuild on next query)")
     print(f"\n{'=' * 60}")
+    print("SUMMARY")
     print("=" * 60)
+    print(f"  Deleted (orphaned): {deleted_count} chunks")
+    print(f"  Updated: {total_updated} chunks")
+    print(f"  Added: {total_added} chunks")
+    print(f"  Skipped: {skipped} files")
+    print(f"  DB docs: {old_count} -> {new_count} ({new_count - old_count:+d})")
+    print("\nDONE!")
 if __name__ == "__main__":
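
Review note: the build script decides per file whether to delete, skip, update, or add by comparing a content hash stored in chunk metadata against the hash of the file on disk. A minimal sketch of that decision step follows. `plan_sync` and `content_hash` are hypothetical helpers, and SHA-256 is an assumption, since the digest used by the repository's `HashProcessor` is not visible in this diff:

```python
import hashlib


def content_hash(text: str) -> str:
    # SHA-256 is an assumption; the repository's HashProcessor may differ
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def plan_sync(disk_files: dict, db_hashes: dict) -> dict:
    """Classify files for an incremental rebuild.

    disk_files: file name -> current text on disk
    db_hashes:  file name -> hash stored with the indexed chunks
    """
    plan = {
        # In the DB but no longer on disk: orphaned chunks to remove
        "delete": sorted(set(db_hashes) - set(disk_files)),
        "add": [],
        "update": [],
        "skip": [],
    }
    for name in sorted(disk_files):
        new_hash = content_hash(disk_files[name])
        if name not in db_hashes:
            plan["add"].append(name)
        elif db_hashes[name] == new_hash:
            plan["skip"].append(name)  # unchanged, no re-embedding needed
        else:
            plan["update"].append(name)  # delete old chunks, then re-insert
    return plan
```

The script keys both maps by `f.name`, so two markdown files with the same name in different subfolders would share one entry. Keying by a path relative to the data root would likely avoid that collision.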

scripts/run_app.py ADDED

	@@ -0,0 +1,52 @@

+import os
+import sys
+from pathlib import Path
+# Add project root to path
+REPO_ROOT = Path(__file__).resolve().parents[1]
+sys.path.insert(0, str(REPO_ROOT))
+def check_data():
+    data_path = REPO_ROOT / "data"
+    if not data_path.exists() or not any(data_path.iterdir()):
+        print("Data folder not found. Downloading from HuggingFace...")
+        from scripts.download_data import download_data
+        download_data()
+def check_env():
+    from dotenv import load_dotenv
+    load_dotenv()
+    required_vars = ["GROQ_API_KEY", "SILICONFLOW_API_KEY"]
+    missing = [var for var in required_vars if not os.getenv(var)]
+    if missing:
+        print(f"Missing environment variables: {', '.join(missing)}")
+        print("Please create a .env file with the required variables.")
+        print("Example:")
+        print("  GROQ_API_KEY=your_groq_key")
+        print("  SILICONFLOW_API_KEY=your_siliconflow_key")
+        sys.exit(1)
+def main():
+    print("=" * 60)
+    print("HUST RAG Assistant - Startup")
+    print("=" * 60)
+    # Check data
+    check_data()
+    # Check environment
+    check_env()
+    # Run Gradio app
+    print("\nStarting Gradio server...")
+    from core.gradio.user_gradio import demo, GRADIO_CFG
+    demo.launch(
+        server_name=GRADIO_CFG.server_host,
+        server_port=GRADIO_CFG.server_port
+    )
+if __name__ == "__main__":
+    main()
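
Review note: `check_env` treats any non-empty value as present, so the `your_key_here` placeholder written by the setup scripts would pass the check. The sketch below shows a slightly stricter variant. `missing_env` and its `placeholders` parameter are hypothetical and not part of this commit:

```python
import os


def missing_env(required, environ=None, placeholders=("your_key_here",)):
    """Return names in `required` that are unset, blank, or still a placeholder."""
    env = os.environ if environ is None else environ
    missing = []
    for name in required:
        value = env.get(name, "").strip()
        if not value or value in placeholders:
            missing.append(name)
    return missing
```

Passing the environment as a parameter keeps the function testable without touching the real process environment.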

scripts/run_eval.py CHANGED

@@ -8,19 +8,19 @@ if str(REPO_ROOT) not in sys.path:
 def main():
-    parser = argparse.ArgumentParser(description="Đánh giá RAG bằng RAGAS")
-    parser.add_argument("--samples", type=int, default=10, help="Số lượng samples (0 = tất cả)")
     parser.add_argument("--mode", type=str, default="hybrid_rerank",
                         choices=["vector_only", "bm25_only", "hybrid", "hybrid_rerank", "all"],
-                        help="Chế độ retrieval")
     args = parser.parse_args()
     from evaluation.ragas_eval import run_evaluation
     if args.mode == "all":
-        # Chạy tất cả các chế độ retrieval
         print("\n" + "=" * 60)
-        print("CHẠY TẤT CẢ CÁC CHẾ ĐỘ RETRIEVAL")
         print("=" * 60)
         for mode in ["vector_only", "bm25_only", "hybrid", "hybrid_rerank"]:
             run_evaluation(args.samples, mode)

 def main():
+    parser = argparse.ArgumentParser(description="Evaluate RAG with RAGAS")
+    parser.add_argument("--samples", type=int, default=10, help="Number of samples (0 = all)")
     parser.add_argument("--mode", type=str, default="hybrid_rerank",
                         choices=["vector_only", "bm25_only", "hybrid", "hybrid_rerank", "all"],
+                        help="Retrieval mode")
     args = parser.parse_args()
     from evaluation.ragas_eval import run_evaluation
     if args.mode == "all":
+        # Run all retrieval modes
         print("\n" + "=" * 60)
+        print("RUNNING ALL RETRIEVAL MODES")
         print("=" * 60)
         for mode in ["vector_only", "bm25_only", "hybrid", "hybrid_rerank"]:
             run_evaluation(args.samples, mode)
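
Review note: the `--mode all` branch expands to the four concrete retrieval modes. A minimal sketch of that argument handling, isolated from the evaluation itself, is shown below (`parse_modes` and `MODES` are illustrative names, not part of the script):

```python
import argparse

MODES = ["vector_only", "bm25_only", "hybrid", "hybrid_rerank"]


def parse_modes(argv):
    """Parse CLI args and expand 'all' into the concrete retrieval modes."""
    parser = argparse.ArgumentParser(description="Evaluate RAG with RAGAS")
    parser.add_argument("--samples", type=int, default=10,
                        help="Number of samples (0 = all)")
    parser.add_argument("--mode", type=str, default="hybrid_rerank",
                        choices=MODES + ["all"], help="Retrieval mode")
    args = parser.parse_args(argv)
    modes = MODES if args.mode == "all" else [args.mode]
    return args.samples, modes
```

Keeping the mode list in one constant means the `choices` list and the `all` expansion cannot drift apart, which they can when the list is written out twice.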

setup.bat ADDED

	@@ -0,0 +1,35 @@

+@echo off
+echo HUST RAG - Setup Script
+echo.
+echo [1/5] Checking Python...
+python --version 2>nul || (echo Error: Python not found & exit /b 1)
+echo [2/5] Creating virtual environment...
+if exist "venv" if not exist "venv\Scripts\activate.bat" (
+    echo Removing broken venv...
+    rmdir /s /q venv
+)
+if not exist "venv" (
+    python -m venv venv || (echo Error: Cannot create venv & exit /b 1)
+)
+echo [3/5] Installing dependencies...
+call venv\Scripts\activate.bat
+pip install --upgrade pip -q
+pip install -r requirements.txt -q
+echo [4/5] Downloading data...
+if not exist "data\chroma" python scripts\download_data.py
+echo [5/5] Creating .env...
+if not exist ".env" (
+    echo SILICONFLOW_API_KEY=your_key_here> .env
+    echo GROQ_API_KEY=your_key_here>> .env
+    echo Please edit .env with your API keys
+)
+echo.
+echo Setup complete!
+echo Run: venv\Scripts\activate ^& python scripts\run_app.py
+pause

setup.sh ADDED

	@@ -0,0 +1,38 @@

+#!/bin/bash
+set -e
+echo "HUST RAG - Setup Script"
+echo ""
+echo "[1/5] Checking Python..."
+python3 --version || { echo "Error: Python 3.10+ required"; exit 1; }
+echo "[2/5] Creating virtual environment..."
+if [ -d "venv" ] && [ ! -f "venv/bin/activate" ]; then
+    echo "Removing broken venv..."
+    rm -rf venv
+fi
+if [ ! -d "venv" ]; then
+    python3 -m venv venv || { echo "Error: Cannot create venv. Run: sudo apt install python3-venv"; exit 1; }
+fi
+echo "[3/5] Installing dependencies..."
+source venv/bin/activate
+pip install --upgrade pip -q
+pip install -r requirements.txt -q
+echo "[4/5] Downloading data..."
+[ -d "data/chroma" ] || python scripts/download_data.py
+echo "[5/5] Creating .env..."
+if [ ! -f ".env" ]; then
+    echo "SILICONFLOW_API_KEY=your_key_here" > .env
+    echo "GROQ_API_KEY=your_key_here" >> .env
+    echo "Please edit .env with your API keys"
+fi
+echo ""
+echo "Setup complete!"
+echo "Run: source venv/bin/activate && python scripts/run_app.py"

test_chunk.md DELETED

@@ -1,696 +0,0 @@
-# NODE 0
-**Loại:** TextNode
-**Metadata:**
-- document_type: chuong_trinh_dao_tao
-- program_name: Kỹ thuật Cơ điện tử
-- program_code: ME1
-- faculty: Trường Cơ khí
-- degree_levels: ['Cu nhan', 'Ky su']
-- source_file: 1.1. Kỹ thuật Cơ điện tử.md
-- source_path: data/data_process/chuong_trinh_dao_tao/1.1. Kỹ thuật Cơ điện tử.md
-- header_path: /
-- chunk_index: 0
-**Nội dung:**
-# 1. Tên chương trình: KỸ THUẬT CƠ ĐIỆN TỬ
-Chương trình đào tạo ngành Cơ điện tử hiện nay được xây dựng trên cơ sở phát triển chương trình đào tạo ngành Cơ điện tử năm 2009 kết hợp với sự tham khảo chương trình đào tạo ngành Cơ điện tử của các trường đại học nổi tiếng trên thế giới như Stanford, Chico (Koa Kỳ), Sibaura (Nhật Bản), Đại học Quốc gia Đài Loan (NTU)…; Chương trình được kiểm định theo tiêu chuẩn AUN -QA năm 2017;
-Sinh viên theo học ngành này sẽ được trang bị các kiến thức cơ sở và chuyên ngành vững chắc, có kỹ năng nghề nghiệp và năng lực nghiên cứu, khả năng làm việc và sáng tạo trong mọi môi trường lao động để giải quyết những vấn đề liên quan đến nghiên cứu thiết kế, chế tạo thiết bị, hệ thống cơ điện tử và vận hành các hệ thống sản xuất công nghiệp, nhanh chóng thích ứng với môi trường làm việc của cuộc cách mạng công nghiệp 4.0.
----
-# NODE 1
-**Loại:** TextNode
-**Metadata:**
-- document_type: chuong_trinh_dao_tao
-- program_name: Kỹ thuật Cơ điện tử
-- program_code: ME1
-- faculty: Trường Cơ khí
-- degree_levels: ['Cu nhan', 'Ky su']
-- source_file: 1.1. Kỹ thuật Cơ điện tử.md
-- source_path: data/data_process/chuong_trinh_dao_tao/1.1. Kỹ thuật Cơ điện tử.md
-- header_path: /2. Kiến thức, kỹ năng đạt được sau tốt nghiệp/
-- chunk_index: 1
-**Nội dung:**
-# 2. Kiến thức, kỹ năng đạt được sau tốt nghiệp
-## a. Kiến thức
-Có kiến thức chuyên môn rộng và vững chắc, thích ứng tốt với những công việc phù hợp với ngành, chú trọng khả năng áp dụng kiến thức cơ sở và cốt lõi ngành Cơ điện tử kết hợp khả năng sử dụng công cụ hiện đại để nghiên cứu, thiết kế, chế tạo, xây dựng và vận hành các hệ thống/quá trình/sản phẩm Cơ điện tử.
----
-# NODE 2
-**Loại:** TextNode
-**Metadata:**
-- document_type: chuong_trinh_dao_tao
-- program_name: Kỹ thuật Cơ điện tử
-- program_code: ME1
-- faculty: Trường Cơ khí
-- degree_levels: ['Cu nhan', 'Ky su']
-- source_file: 1.1. Kỹ thuật Cơ điện tử.md
-- source_path: data/data_process/chuong_trinh_dao_tao/1.1. Kỹ thuật Cơ điện tử.md
-- header_path: /2. Kiến thức, kỹ năng đạt được sau tốt nghiệp/
-- chunk_index: 2
-**Nội dung:**
-## b. Kỹ năng
-Thiết kế, chế tạo, lắp ráp, vận hành và bảo dưỡng các thiết bị, hệ thống, dây chuyền sản xuất Cơ điện tử như: Rô bốt, máy bay, ô tô… hay các hệ thống máy móc trong sản xuất công nghiệp;
-Có kỹ năng làm việc hiệu quả trong nhóm đa ngành và trong môi trường quốc tế;
-Có thể tham gia triển khai và thử nghiệm hệ thống/quá trình/sản phẩm/ giải pháp công nghệ kỹ thuật Cơ điện tử và năng lực vận hành/sử dụng/ khai thác hệ thống/sản phẩm/giải pháp kỹ thuật thuộc lĩnh vực Cơ điện tử.
----
-# NODE 3
-**Loại:** TextNode
-**Metadata:**
-- document_type: chuong_trinh_dao_tao
-- program_name: Kỹ thuật Cơ điện tử
-- program_code: ME1
-- faculty: Trường Cơ khí
-- degree_levels: ['Cu nhan', 'Ky su']
-- source_file: 1.1. Kỹ thuật Cơ điện tử.md
-- source_path: data/data_process/chuong_trinh_dao_tao/1.1. Kỹ thuật Cơ điện tử.md
-- header_path: /2. Kiến thức, kỹ năng đạt được sau tốt nghiệp/
-- chunk_index: 3
-**Nội dung:**
-## c. Ngoại ngữ
-Sử dụng hiệu quả ngôn ngữ tiếng Anh trong giao tiếp và công việc, đạt TOEIC từ 500 điểm trở lên.
-## 3.Thời gian đào tạo và khả năng học lên bậc học cao hơn
-- Đào tạo Cử nhân: 4 năm
-- Đào tạo Kỹ sư: 5 năm
-- Đào tạo tích hợp Cử nhân - Thạc sĩ: 5,5 năm
-- Đào tạo tích hợp Cử nhân - Thạc sĩ – Tiến sĩ: 8,5 năm
----
-# NODE 4
-**Loại:** TextNode
-**Metadata:**
-- document_type: chuong_trinh_dao_tao
-- program_name: Kỹ thuật Cơ điện tử
-- program_code: ME1
-- faculty: Trường Cơ khí
-- degree_levels: ['Cu nhan', 'Ky su']
-- source_file: 1.1. Kỹ thuật Cơ điện tử.md
-- source_path: data/data_process/chuong_trinh_dao_tao/1.1. Kỹ thuật Cơ điện tử.md
-- header_path: /2. Kiến thức, kỹ năng đạt được sau tốt nghiệp/
-- chunk_index: 4
-**Nội dung:**
-## 4. Danh mục học phần và thời lượng học tập:
-Chương trình đào tạo có thể được điều chỉnh hàng năm để đảm bảo tính cập nhật với sự phát triển của khoa học, kỹ thuật và công nghệ; tuy nhiên đảm bảo nguyên tắc không gây ảnh hưởng ngược tới kết quả người học đã tích lũy.
----
-# NODE 5
-**Loại:** TextNode
-**Metadata:**
-- document_type: chuong_trinh_dao_tao
-- program_name: Kỹ thuật Cơ điện tử
-- program_code: ME1
-- faculty: Trường Cơ khí
-- degree_levels: ['Cu nhan', 'Ky su']
-- source_file: 1.1. Kỹ thuật Cơ điện tử.md
-- source_path: data/data_process/chuong_trinh_dao_tao/1.1. Kỹ thuật Cơ điện tử.md
-- header_path: /2. Kiến thức, kỹ năng đạt được sau tốt nghiệp/
-- table_part: 1/2
-- is_table: True
-- is_parent: True
-- node_id: f7ab4d2a-b1d3-4fb0-a188-a2b124ddc2f5
-- chunk_index: 5
-**Nội dung:**
-| TT                                         | MÃ SỐ                                      | TÊN HỌC PHẦN                                                   | KHỐI  LƯỢNG (Tín chỉ)   |
-|--------------------------------------------|--------------------------------------------|----------------------------------------------------------------|-------------------------|
-| Lý luận chính trị + Pháp luật đại cương    | Lý luận chính trị + Pháp luật đại cương    | Lý luận chính trị + Pháp luật đại cương                        | 12                      |
-| 1                                          | SSH1110                                    | Những NLCB của CN Mác-Lênin I                                  | 2(2-1-0-4)              |
-| 2                                          | SSH1120                                    | Những NLCB của CN Mác-Lênin II                                 | 3(2-1-0-6)              |
-| 3                                          | SSH1050                                    | Tư tưởng Hồ Chí Minh                                           | 2(2 - 0 - 0 - 4)        |
-| 4                                          | SSH1130                                    | Đường lối CM của Đảng CSVN                                     | 3(2 - 1 - 0 - 6)        |
-| 5                                          | EM1170                                     | Pháp luật đại cương                                            | 2(2-0-0-4)              |
-| Giáo dục thể chất                          | Giáo dục thể chất                          | Giáo dục thể chất                                              | 5                       |
-| 6                                          | PE1014                                     | Lý luận thể dục thể thao (bắt buộc)                            | 1(0 - 0 - 2 - 0)        |
-| 7                                          | PE1024                                     | Bơi lội (bắt buộc)                                             | 1(0 - 0 - 2 - 0)        |
-| 8                                          |                                            | Tự chọn thể dục 1                                              | 1(0 - 0 - 2 - 0)        |
-| 9                                          | Tự chọn trong danh  mục                    | Tự chọn thể dục 2                                              | 1(0 - 0 - 2 - 0)        |
-| 10                                         | Tự chọn trong danh  mục                    | Tự chọn thể dục 3                                              | 1(0 - 0 - 2 - 0)        |
-| Giáo dục Quốc phòng  -  An ninh (165 tiết) | Giáo dục Quốc phòng  -  An ninh (165 tiết) | Giáo dục Quốc phòng  -  An ninh (165 tiết)                     |                         |
-| 11                                         | MIL1110                                    | Đường lối quân sự của Đảng                                     | 0(3 - 0 - 0 - 6)        |
-| 12                                         | MIL1120                                    | Công tác quốc phòng, an ninh                                   | 0(3 - 0 - 0 - 6)        |
----
-# NODE 6
-**Loại:** TextNode
-**Metadata:**
-- document_type: chuong_trinh_dao_tao
-- program_name: Kỹ thuật Cơ điện tử
-- program_code: ME1
-- faculty: Trường Cơ khí
-- degree_levels: ['Cu nhan', 'Ky su']
-- source_file: 1.1. Kỹ thuật Cơ điện tử.md
-- source_path: data/data_process/chuong_trinh_dao_tao/1.1. Kỹ thuật Cơ điện tử.md
-- header_path: /2. Kiến thức, kỹ năng đạt được sau tốt nghiệp/
-- table_part: 1/2
-- is_table_summary: True
-- parent_id: f7ab4d2a-b1d3-4fb0-a188-a2b124ddc2f5
-- chunk_index: 6
-**Nội dung:**
-**Bảng này thuộc file “1.1. Kỹ thuật Cơ điện tử.md”** – là bảng tổng hợp các học phần (môn học) và khối lượng tín chỉ được quy định cho chương trình đào tạo Kỹ thuật Cơ‑điện tử.
-### Nội dung bảng liệt kê
-Bảng liệt kê **các học phần bắt buộc và tự chọn** trong chương trình, kèm theo mã số môn, tên học phần và số tín chỉ (khối lượng) được phân chia thành các thành phần (giờ lý thuyết – giờ thực hành – giờ thí nghiệm – giờ tự học).
-### Các cột chính
-| Cột | Nội dung |
-|-----|----------|
-| **TT** | Số thứ tự (cũng có một số dòng tổng hợp như “Lý luận chính trị + Pháp luật đại cương”, “Giáo dục thể chất”, “Giáo dục Quốc phòng – An ninh”). |
-| **MÃ SỐ** | Mã số của học phần (ví dụ: SSH1110, PE1014, MIL1110…). |
-| **TÊN HỌC PHẦN** | Tên đầy đủ của môn học. |
-| **KHỐI LƯỢNG (Tín chỉ)** | Tổng số tín chỉ và chi tiết phân bố (giờ lý thuyết – giờ thực hành – giờ thí nghiệm – giờ tự học) trong ngoặc. |
-### Thông tin quan trọng / ví dụ số liệu
-- **Lý luận chính trị + Pháp luật đại cương**: Tổng cộng 12 tín chỉ (không chi tiết trong ngoặc).
-- **SSH1110 – “Những NLCB của CN Mác‑Lênin I”**: 2 tín chỉ, chi tiết **2‑1‑0‑4** (2 giờ lý thuyết, 1 giờ thực hành, 0 giờ thí nghiệm, 4 giờ tự học).
-- **SSH1120 – “Những NLCB của CN Mác‑Lênin II”**: 3 tín chỉ, chi tiết **2‑1‑0‑6**.
-- **SSH1050 – “Tư tưởng Hồ Chí Minh”**: 2 tín chỉ, chi tiết **2‑0‑0‑4**.
-- **EM1170 – “Pháp luật đại cương”**: 2 tín chỉ, chi tiết **2‑0‑0‑4**.
-- **Giáo dục thể chất**: Tổng 5 tín chỉ, bao gồm các môn bắt buộc như **PE1014 – Lý luận thể dục thể thao** (1 tín chỉ, **0‑0‑2‑0**) và **PE1024 – Bơi lội** (1 tín chỉ, **0‑0‑2‑0**), cùng ba môn tự chọn mỗi môn 1 tín chỉ (**0‑0‑2‑0**).
-- **Giáo dục Quốc phòng – An ninh (165 tiết)**: Ví dụ **MIL1110 – Đường lối quân sự của Đảng** có 0 tín chỉ tổng, nhưng chi tiết **3‑0‑0‑6** (đánh dấu 3 giờ lý thuyết, 6 giờ tự học).
-### Kết luận ngắn gọn
-Bảng này là danh sách chi tiết các học phần trong chương trình Kỹ thuật Cơ‑điện tử, nêu rõ **mã môn, tên môn và khối lượng tín chỉ** (cùng cách phân bố giờ học). Nó giúp sinh viên và nhà quản lý chương trình nắm bắt được yêu cầu học tập, số tín chỉ cần hoàn thành và cấu trúc thời gian học cho từng môn
----
-# NODE 7
-**Loại:** TextNode
-**Metadata:**
-- document_type: chuong_trinh_dao_tao
-- program_name: Kỹ thuật Cơ điện tử
-- program_code: ME1
-- faculty: Trường Cơ khí
-- degree_levels: ['Cu nhan', 'Ky su']
-- source_file: 1.1. Kỹ thuật Cơ điện tử.md
-- source_path: data/data_process/chuong_trinh_dao_tao/1.1. Kỹ thuật Cơ điện tử.md
-- header_path: /2. Kiến thức, kỹ năng đạt được sau tốt nghiệp/
-- table_part: 2/2
-- is_table: True
-- is_parent: True
-- node_id: d02a07a0-6b6a-4372-912c-2b3668bd162d
-- chunk_index: 7
-**Nội dung:**
-| TT                                         | MÃ SỐ                                      | TÊN HỌC PHẦN                                                   | KHỐI  LƯỢNG (Tín chỉ)   |
-|--------------------------------------------|--------------------------------------------|----------------------------------------------------------------|-------------------------|
-| 13                                         | MIL1130                                    | QS chung và chiến thuật, kỹ thuật bắn súng tiểu liên AK  (CKC) | 0(3-0-2-8)              |
-| Tiếng Anh                                  | Tiếng Anh                                  | Tiếng Anh                                                      | 6                       |
-| 14                                         | FL1100                                     | Tiếng Anh I                                                    | 3(0 - 6 - 0 - 6)        |
-| 15                                         | FL1101                                     | Tiếng Anh II                                                   | 3(0 - 6 - 0 - 6)        |
-| Khối kiến thức Toán và Khoa học cơ bản     | Khối kiến thức Toán và Khoa học cơ bản     | Khối kiến thức Toán và Khoa học cơ bản                         | 32                      |
-| 16                                         | MI1111                                     | Giải tích I                                                    | 4(3-2-0-8)              |
-| 17                                         | MI1121                                     | Giải tích II                                                   | 3(2-2-0-6)              |
-| 18                                         | MI1131                                     | Giải tích III                                                  | 3(2-2-0-6)              |
-| 19                                         | MI1141                                     | Đại số                                                         | 4(3 - 2 - 0 - 8)        |
-| 20                                         | ME2030                                     | Cơ khí đại cương                                               | 2(2-0-0-4)              |
-| 21                                         | PH1110                                     | Vật lý đại cương I                                             | 3(2-1-1-6)              |
-| 22                                         | PH1120                                     | Vật lý đại cương II                                            | 3(2-1-1-6)              |
-| 23                                         | IT1110                                     | Tin học đại cương                                              | 4(3-1-1-8)              |
-| 24                                         | MI2110                                     | Phương pháp tính và Matlab                                     | 3(2-0-2-6)              |
-| 25                                         | ME2011                                     | Đồ họa kỹ thuật I                                              | 3(3 - 1 - 0 - 6)        |
-| Cơ sở và cốt lõi ngành                     | Cơ sở và cốt lõi ngành                     | Cơ sở và cốt lõi ngành                                         | 47                      |
----
-# NODE 8
-**Loại:** TextNode
-**Metadata:**
-- document_type: chuong_trinh_dao_tao
-- program_name: Kỹ thuật Cơ điện tử
-- program_code: ME1
-- faculty: Trường Cơ khí
-- degree_levels: ['Cu nhan', 'Ky su']
-- source_file: 1.1. Kỹ thuật Cơ điện tử.md
-- source_path: data/data_process/chuong_trinh_dao_tao/1.1. Kỹ thuật Cơ điện tử.md
-- header_path: /2. Kiến thức, kỹ năng đạt được sau tốt nghiệp/
-- table_part: 2/2
-- is_table_summary: True
-- parent_id: d02a07a0-6b6a-4372-912c-2b3668bd162d
-- chunk_index: 8
-**Nội dung:**
-**Bảng này thuộc file “1.1. Kỹ thuật Cơ điện tử.md”**
-- **Nội dung bảng:** Liệt kê các môn học (hoặc khối kiến thức) trong chương trình đào tạo Kỹ thuật Cơ‑điện tử, kèm theo mã số môn, tên môn và số tín chỉ (cùng với cách phân bố tín chỉ/giờ học).
-- **Các cột chính:**
-  1. **TT** – Số thứ tự (hoặc tên khối kiến thức).
-  2. **MÃ SỐ** – Mã định danh của môn học.
-  3. **TÊN HỌC PHẦN** – Tên đầy đủ của môn hoặc khối kiến thức.
-  4. **KHỐI LƯỢNG (Tín chỉ)** – Số tín chỉ và thường đi kèm dạng “X(A‑B‑C‑D)” trong đó:
-     - **X** = tổng tín chỉ;
-     - **A** = tín chỉ lý thuyết;
-     - **B** = tín chỉ thực hành;
-     - **C** = tín chỉ thí nghiệm/lab;
-     - **D** = tổng số giờ học (theo chuẩn).
-- **Một số thông tin quan trọng / ví dụ:**
-  - Môn **MIL1130 – “QS chung và chiến thuật, kỹ thuật bắn súng tiểu liên AK (CKC)”** có khối lượng **0(3‑0‑2‑8)** → 0 tín chỉ tổng, trong đó 3 tín chỉ lý thuyết, 0 thực hành, 2 lab, 8 giờ học.
-  - Các môn **Tiếng Anh I (FL1100)** và **Tiếng Anh II (FL1101)** mỗi môn có **3(0‑6‑0‑6)** → 3 tín chỉ, toàn bộ là thực hành (6 tín chỉ thực hành tương đương 6 giờ).
-  - Khối **“Toán và Khoa học cơ bản”** được gộp lại với tổng **32** tín chỉ.
-  - Các môn cơ bản như **Giải tích I (MI1111)**, **Giải tích II (MI1121)**, **Giải tích III (MI1131)**, **Đại số (MI1141)** có khối lượng từ **3‑4** tín chỉ, với phân bố lý thuyết‑thực hành rõ ràng (ví dụ: MI1111 – 4(3‑2‑0‑8)).
-  - **Cơ khí đại cương (ME2030)**: 2(2‑0‑0‑4) → 2 tín chỉ, toàn lý thuyết, 4 giờ học.
-  - **Vật lý đại cương I & II (PH1110, PH1120)**: mỗi môn 3(2‑1‑1‑6).
-  - **Tin học đại cương (IT1110)**: 4(3‑1‑1‑8).
-  - **Phương pháp tính và Matlab (MI2110)**: 3(2‑0‑2‑6).
-Tóm lại, bảng này là danh sách chi tiết các môn học và khối kiến thức trong chương trình Kỹ thuật Cơ‑điện tử, kèm theo mã môn và thông tin về số tín chỉ cũng như cách phân bố tín chỉ/giờ học cho từng môn.
----
-# NODE 9
-**Loại:** TextNode
-**Metadata:**
-- document_type: chuong_trinh_dao_tao
-- program_name: Kỹ thuật Cơ điện tử
-- program_code: ME1
-- faculty: Trường Cơ khí
-- degree_levels: ['Cu nhan', 'Ky su']
-- source_file: 1.1. Kỹ thuật Cơ điện tử.md
-- source_path: data/data_process/chuong_trinh_dao_tao/1.1. Kỹ thuật Cơ điện tử.md
-- header_path: /2. Kiến thức, kỹ năng đạt được sau tốt nghiệp/
-- table_part: 1/3
-- is_table: True
-- is_parent: True
-- node_id: c2ee7f29-5ba5-4c27-b3b1-4848cf6c3bdc
-- chunk_index: 9
-**Nội dung:**
-| TT                                         | MÃ SỐ                                      | TÊN HỌC PHẦN                                                   | KHỐI  LƯỢNG (Tín chỉ)   |
-|--------------------------------------------|--------------------------------------------|----------------------------------------------------------------|-------------------------|
-| 26                                                  | ME2201                                              | Đồ họa kỹ thuật II                                  | 2(2 - 1 - 0 - 4)   |
-| 27                                                  | ME2002                                              | Nhập môn Cơ Điện Tử                                 | 3(2-1-1-6)         |
-| 28                                                  | EE2012                                              | Kỹ thuật điện                                       | 2(2-1-0-4)         |
-| 29                                                  | ET2012                                              | Kỹ thuật điện tử                                    | 2(2-0-1-6)         |
-| 30                                                  | ME2112                                              | Cơ học kỹ thuật I                                   | 2(2-1-0-4)         |
-| 31                                                  | ME2101                                              | Sức bền vật liệu I                                  | 2(2 - 0 - 1 - 4)   |
-| 32                                                  | ME2211                                              | Cơ học kỹ thuật II                                  | 3(2-2-0-6)         |
-| 33                                                  | ME2202                                              | Sức bền vật liệu II                                 | 2(2 - 0 - 1 - 4)   |
-| 34                                                  | ME2203                                              | Nguyên lý máy                                       | 2(2-0-1-4)         |
-| 35                                                  | EE3359                                              | LT Điều khiển tự động                               | 3(3 - 1 - 0 - 6)   |
-| 36                                                  | MSE2228                                             | Vật liệu học                                        | 2(2-0-1-4)         |
-| 37                                                  | ME3212                                              | Chi tiết máy                                        | 2(2 - 0 - 1 - 4)   |
-| 38                                                  | ME3072                                              | Kỹ thuật đo                                         | 2(2-0-1-4)         |
-| 39                                                  | IT3011                                              | Cấu trúc dữ liệu và thuật toán                      | 2(2 - 1 - 0 - 4)   |
-| 40                                                  | ME3031                                              | Công nghệ chế tạo máy                               | 3(3 - 0 - 1 - 6)   |
----
-# NODE 10
-**Loại:** TextNode
-**Metadata:**
-- document_type: chuong_trinh_dao_tao
-- program_name: Kỹ thuật Cơ điện tử
-- program_code: ME1
-- faculty: Trường Cơ khí
-- degree_levels: ['Cu nhan', 'Ky su']
-- source_file: 1.1. Kỹ thuật Cơ điện tử.md
-- source_path: data/data_process/chuong_trinh_dao_tao/1.1. Kỹ thuật Cơ điện tử.md
-- header_path: /2. Kiến thức, kỹ năng đạt được sau tốt nghiệp/
-- table_part: 1/3
-- is_table_summary: True
-- parent_id: c2ee7f29-5ba5-4c27-b3b1-4848cf6c3bdc
-- chunk_index: 10
-**Nội dung:**
-**Bảng này thuộc file **`1.1. Kỹ thuật Cơ điện tử.md`** – nó liệt kê các môn học thuộc chương trình “Kỹ thuật Cơ‑điện‑tử” kèm theo mã số môn và khối lượng tín chỉ.**
-### Các cột chính của bảng
-| Cột | Nội dung |
-|-----|----------|
-| **TT** | Số thứ tự trong danh sách |
-| **MÃ SỐ** | Mã định danh của môn học (theo chuẩn trường) |
-| **TÊN HỌC PHẦN** | Tên đầy đủ của môn học |
-| **KHỐI LƯỢNG (Tín chỉ)** | Số tín chỉ và phân bố giờ học, thường dạng `X(Y‑Z‑W‑V)` trong đó: <br>• **X** = tổng tín chỉ <br>• **Y** = giờ lý thuyết <br>• **Z** = giờ thực hành <br>• **W** = giờ thí nghiệm/laboratory <br>• **V** = tổng giờ học (theo chuẩn 1 tín chỉ = 4‑6 giờ) |
-### Thông tin quan trọng / ví dụ cụ thể
-- **Môn “Đồ họa kỹ thuật II” (ME2201)**: khối lượng `2(2‑1‑0‑4)` → 2 tín chỉ, 2 giờ lý thuyết, 1 giờ thực hành, không có giờ thí nghiệm, tổng 4 giờ.
-- **Môn “Nhập môn Cơ Điện Tử” (ME2002)**: `3(2‑1‑1‑6)` → 3 tín chỉ, 2 giờ lý thuyết, 1 giờ thực hành, 1 giờ thí nghiệm, tổng 6 giờ.
-- **Môn “LT Điều khiển tự động” (EE3359)**: `3(3‑1‑0‑6)` → 3 tín chỉ, 3 giờ lý thuyết, 1 giờ thực hành, không thí nghiệm, tổng 6 giờ.
-- Các môn khác như “Kỹ thuật điện” (EE2012), “Cơ học kỹ thuật I” (ME2112), “Sức bền vật liệu I” (ME2101) đều có khối lượng `2(2‑1‑0‑4)` hoặc `2(2‑0‑1‑4)`, cho thấy mức độ cân bằng giữa lý thuyết và thực hành/laboratory.
-### Tổng quan nhanh
-- **Số môn liệt kê**: từ TT 26 đến 39 (tổng 14 môn).
-- **Mã số** đa dạng: bắt đầu bằng `ME`, `EE`, `ET`, `IT`, `MSE` phản ánh các ngành chuyên môn (Cơ‑điện‑tử, Điện tử, Vật liệu, Tin học…).
-- **Khối lượng tín chỉ** chủ yếu là 2 hoặc 3 tín chỉ, với cấu trúc giờ học tiêu chuẩn của chương trình kỹ thuật.
----
-# NODE 11
-**Loại:** TextNode
-**Metadata:**
-- document_type: chuong_trinh_dao_tao
-- program_name: Kỹ thuật Cơ điện tử
-- program_code: ME1
-- faculty: Trường Cơ khí
-- degree_levels: ['Cu nhan', 'Ky su']
-- source_file: 1.1. Kỹ thuật Cơ điện tử.md
-- source_path: data/data_process/chuong_trinh_dao_tao/1.1. Kỹ thuật Cơ điện tử.md
-- header_path: /2. Kiến thức, kỹ năng đạt được sau tốt nghiệp/
-- table_part: 2/3
-- is_table: True
-- is_parent: True
-- node_id: c7fd469a-aa76-45cb-8c11-6faecdeb3404
-- chunk_index: 11
-**Nội dung:**
-| TT                                         | MÃ SỐ                                      | TÊN HỌC PHẦN                                                   | KHỐI  LƯỢNG (Tín chỉ)   |
-|--------------------------------------------|--------------------------------------------|----------------------------------------------------------------|-------------------------|
-| 41                                                  | ME3209                                              | Robotics                                            | 3(3-1-0-6)         |
-| 42                                                  | HE2012                                              | Kỹ thuật nhiệt                                      | 2(2-1-0-4)         |
-| 43                                                  | ME3213                                              | Kỹ thuật lập trình trong CĐT                        | 3(2-2-0-6)         |
-| 44                                                  | TE3600                                              | Kỹ thuật thủy khí                                   | 2(2-1-0-4)         |
-| 45                                                  | ME3215                                              | Cơ sở máy CNC                                       | 3(3-0-1-6)         |
-| Kiến thức bổ trợ                                    | Kiến thức bổ trợ                                    | Kiến thức bổ trợ                                    | 9                  |
-| 46                                                  | EM1010                                              | Quản trị học đại cương                              | 2(2-1-0-4)         |
-| 47                                                  | EM1180                                              | Văn hóa kinh doanh và tinh thần khởi nghiệp         | 2(2 - 1 - 0 - 4)   |
-| 48                                                  | ED3280                                              | Tâm lý học ứng dụng                                 | 2(1-2-0-4)         |
-| 49                                                  | ED3220                                              | Kỹ năng mềm                                         | 2(1 - 2 - 0 - 4)   |
-| 50                                                  | ET3262                                              | Tư duy công nghệ và thiết kế kỹ thuật               | 2(1 - 2 - 0 - 4)   |
-| 51                                                  | TEX3123                                             | Thiết kế mỹ thuật công nghiệp                       | 2(1 - 2 - 0 - 4)   |
-| 52                                                  | ME2021                                              | Technical Writing and Presentation                  | 3(2-2-0-6)         |
-| Tự chọn theo định hướng ứng dụng (chọn theo mô đun) | Tự chọn theo định hướng ứng dụng (chọn theo mô đun) | Tự chọn theo định hướng ứng dụng (chọn theo mô đun) |                    |
-| Mô đun 1: Hệ thống sản xuất tự động                 | Mô đun 1: Hệ thống sản xuất tự động                 | Mô đun 1: Hệ thống sản xuất tự động                 | 17                 |
----
-# NODE 12
-**Loại:** TextNode
-**Metadata:**
-- document_type: chuong_trinh_dao_tao
-- program_name: Kỹ thuật Cơ điện tử
-- program_code: ME1
-- faculty: Trường Cơ khí
-- degree_levels: ['Cu nhan', 'Ky su']
-- source_file: 1.1. Kỹ thuật Cơ điện tử.md
-- source_path: data/data_process/chuong_trinh_dao_tao/1.1. Kỹ thuật Cơ điện tử.md
-- header_path: /2. Kiến thức, kỹ năng đạt được sau tốt nghiệp/
-- table_part: 2/3
-- is_table_summary: True
-- parent_id: c7fd469a-aa76-45cb-8c11-6faecdeb3404
-- chunk_index: 12
-**Nội dung:**
-**Bảng này thuộc file “1.1. Kỹ thuật Cơ điện tử.md”**
-- **Nội dung bảng:** Liệt kê các môn học (cả chuyên ngành và các môn bổ trợ) được đưa vào chương trình đào tạo Kỹ thuật Cơ điện tử, kèm theo mã số môn, tên môn và khối lượng tín chỉ.
-- **Các cột chính:**
-  1. **TT** – số thứ tự trong danh sách.
-  2. **MÃ SỐ** – mã định danh của môn học (ví dụ: ME3209, HE2012…).
-  3. **TÊN HỌC PHẦN** – tên đầy đủ của môn học (ví dụ: Robotics, Kỹ thuật nhiệt, Kỹ thuật lập trình trong CĐT…).
-  4. **KHỐI LƯỢNG (Tín chỉ)** – tổng số tín chỉ và phân bố theo dạng “Lý thuyết‑Thực hành‑Thực tập‑Giờ thực tế” (ví dụ: 3(3‑1‑0‑6) nghĩa là 3 tín chỉ, trong đó 3 giờ lý thuyết, 1 giờ thực hành, 0 giờ thực tập, 6 giờ thực tế).
-- **Thông tin quan trọng / ví dụ:**
-  - Các môn chuyên ngành như **Robotics (ME3209)** có khối lượng 3 tín chỉ (3‑1‑0‑6).
-  - Các môn kỹ thuật khác như **Kỹ thuật nhiệt (HE2012)**, **Kỹ thuật thủy khí (TE3600)** đều 2 tín chỉ (2‑1‑0‑4).
-  - Các môn “Kiến thức bổ trợ” tổng cộng 9 tín chỉ, không có mã số cụ thể.
-  - Các môn mềm và kỹ năng (ví dụ: **Văn hoá kinh doanh và tinh thần khởi nghiệp (EM1180)**, **Kỹ năng mềm (ED3220)**) cũng được liệt kê với khối lượng 2 tín chỉ (2‑1‑0‑4 hoặc 1‑2‑0‑4).
-  - Cuối bảng có mục “Tự chọn theo định hướng ứng dụng (chọn theo mô đun)”, cho phép sinh viên lựa chọn các môn theo mô-đun (ví dụ: Mô đun 1: Hệ thống sản xuất tự động).
-Tóm lại, bảng này là danh sách chi tiết các môn học và số tín chỉ tương ứng trong chương trình Kỹ thuật Cơ điện tử, phân loại rõ ràng giữa môn chuyên ngành, kiến thức bổ trợ và các môn kỹ năng mềm.
----
-# NODE 13
-**Loại:** TextNode
-**Metadata:**
-- document_type: chuong_trinh_dao_tao
-- program_name: Kỹ thuật Cơ điện tử
-- program_code: ME1
-- faculty: Trường Cơ khí
-- degree_levels: ['Cu nhan', 'Ky su']
-- source_file: 1.1. Kỹ thuật Cơ điện tử.md
-- source_path: data/data_process/chuong_trinh_dao_tao/1.1. Kỹ thuật Cơ điện tử.md
-- header_path: /2. Kiến thức, kỹ năng đạt được sau tốt nghiệp/
-- table_part: 3/3
-- is_table: True
-- is_parent: True
-- node_id: 07a25372-e992-4a10-9b3f-4a05fc5e1960
-- chunk_index: 13
-**Nội dung:**
-| TT                                         | MÃ SỐ                                      | TÊN HỌC PHẦN                                                   | KHỐI  LƯỢNG (Tín chỉ)   |
-|--------------------------------------------|--------------------------------------------|----------------------------------------------------------------|-------------------------|
-| 53                                                  | IT4162                                              | Vi xử lý                                            | 2(2-1-0-4)         |
-| 54                                                  | ME4511                                              | Cảm biến & xử lý tín hiệu                           | 2(2 - 1 - 0 - 4)   |
-| 55                                                  | ME4601                                              | Thực tập xưởng Hệ thống SXTĐ                        | 2(2 - 0 - 1 - 4)   |
-| 56                                                  | ME4181                                              | Phương pháp phần tử hữu hạn                         | 2(2 - 1 - 0 - 4)   |
-| 57                                                  | ME4503                                              | ĐA TKHT Cơ khí - SXTĐ                               | 3(1-2-2-6)         |
-| 58                                                  | ME4501                                              | PLC và mạng công nghiệp                             | 2(2-1-0-4)         |
-| 59                                                  | ME4082                                              | Công nghệ CNC                                       | 2(2-1-0-4)         |
-| 60                                                  | ME4112                                              | Tự động hóa sản xuất                                | 2(2 - 1 - 0 - 4)   |
-| Mô đun 2: Thiết bị tự động                          | Mô đun 2: Thiết bị tự động                          | Mô đun 2: Thiết bị tự động                          | 17                 |
-| 61                                                  | IT4162                                              | Vi xử lý                                            | 2(2-1-0-4)         |
-| 62                                                  | ME4511                                              | Cảm biến & xử lý tín hiệu                           | 2(2 - 1 - 0 - 4)   |
-| 63                                                  | ME4602                                              | Thực tập xưởng TBTĐ                                 | 2(2-0-1-4)         |
----
-# NODE 14
-**Loại:** TextNode
-**Metadata:**
-- document_type: chuong_trinh_dao_tao
-- program_name: Kỹ thuật Cơ điện tử
-- program_code: ME1
-- faculty: Trường Cơ khí
-- degree_levels: ['Cu nhan', 'Ky su']
-- source_file: 1.1. Kỹ thuật Cơ điện tử.md
-- source_path: data/data_process/chuong_trinh_dao_tao/1.1. Kỹ thuật Cơ điện tử.md
-- header_path: /2. Kiến thức, kỹ năng đạt được sau tốt nghiệp/
-- table_part: 3/3
-- is_table_summary: True
-- parent_id: 07a25372-e992-4a10-9b3f-4a05fc5e1960
-- chunk_index: 14
-**Nội dung:**
-**Bảng này thuộc file:** **1.1. Kỹ thuật Cơ điện tử.md**
-**Nội dung bảng:**
-Bảng liệt kê danh sách các môn học (học phần) thuộc chương trình “Thiết bị tự động” của ngành Kỹ thuật Cơ điện tử, kèm theo mã số môn và khối lượng tín chỉ (cấu trúc tín chỉ/giờ học).
-**Các cột chính của bảng**
-| Cột | Nội dung |
-|-----|----------|
-| **TT** | Số thứ tự trong danh sách |
-| **MÃ SỐ** | Mã số định danh của môn học (theo chuẩn trường) |
-| **TÊN HỌC PHẦN** | Tên đầy đủ của môn học |
-| **KHỐI LƯỢNG (Tín chỉ)** | Số tín chỉ và phân bố giờ học (theory‑lab‑practice‑total) |
-**Thông tin quan trọng / ví dụ cụ thể**
-- Hầu hết các môn có **khối lượng 2 tín chỉ** với định dạng `2(2‑1‑0‑4)`, nghĩa là:
-  - 2 tín chỉ tổng cộng,
-  - 2 giờ lý thuyết, 1 giờ thực hành (lab), 0 giờ thực tập, và 4 giờ học tổng cộng mỗi tuần.
-- Một số môn có cấu trúc khác, ví dụ:
-  - **ĐA TKHT Cơ khí - SXTĐ** (mã ME4503) có **3 tín chỉ** với cấu trúc `3(1‑2‑2‑6)` → 1 giờ lý thuyết, 2 giờ lab, 2 giờ thực tập, tổng 6 giờ.
-- Bảng còn có một dòng tổng hợp **“Mô đun 2: Thiết bị tự động”** với **khối lượng 17 tín chỉ**, thể hiện tổng số tín chỉ của toàn bộ các môn trong mô-đun này.
-- Các môn lặp lại ở phần cuối (TT 61‑63) là phiên bản cập nhật/tiếp nối của các môn đã liệt kê ở trên, ví dụ:
-  - **Vi xử lý** (IT4162) và **Cảm biến & xử lý tín hiệu** (ME4511) xuất hiện lại với cùng khối lượng tín chỉ.
-Như vậy, bảng cung cấp một cái nhìn tổng quan về các học phần, mã số và khối lượng tín chỉ cần hoàn thành trong mô-đun “Thiết bị tự động” của chương trình Kỹ thuật Cơ điện tử.
----
-# NODE 15
-**Loại:** TextNode
-**Metadata:**
-- document_type: chuong_trinh_dao_tao
-- program_name: Kỹ thuật Cơ điện tử
-- program_code: ME1
-- faculty: Trường Cơ khí
-- degree_levels: ['Cu nhan', 'Ky su']
-- source_file: 1.1. Kỹ thuật Cơ điện tử.md
-- source_path: data/data_process/chuong_trinh_dao_tao/1.1. Kỹ thuật Cơ điện tử.md
-- header_path: /2. Kiến thức, kỹ năng đạt được sau tốt nghiệp/
-- table_part: 1/2
-- is_table: True
-- is_parent: True
-- node_id: 59b96dee-6f57-4e27-b073-b66959fe2598
-- chunk_index: 15
-**Nội dung:**
-| TT                                         | MÃ SỐ                                      | TÊN HỌC PHẦN                                                   | KHỐI  LƯỢNG (Tín chỉ)   |
-|--------------------------------------------|--------------------------------------------|----------------------------------------------------------------|-------------------------|
-| 64                                            | ME4181                                        | Phương pháp phần tử hữu hạn                   | 2(2 - 1 - 0 - 4)   |
-| 65                                            | ME4504                                        | ĐA TKHT Cơ khí - TBTĐ                         | 3(1-2-2-6)         |
-| 66                                            | ME4501                                        | PLC và mạng công nghiệp                       | 2(2-1-0-4)         |
-| 67                                            | ME4082                                        | Công nghệ CNC                                 | 2(2-1-0-4)         |
-| 68                                            | ME4507                                        | Robot công nghiệp                             | 2(2-1-0-4)         |
-| Mô đun 3: Robot                               | Mô đun 3: Robot                               | Mô đun 3: Robot                               | 17                 |
-| 69                                            | IT4162                                        | Vi xử lý                                      | 2(2-1-0-4)         |
-| 70                                            | ME4511                                        | Cảm biến & xử lý tín hiệu                     | 2(2 - 1 - 0 - 4)   |
-| 71                                            | ME4603                                        | Thực tập xưởng Robot                          | 2(2-1-0-4)         |
-| 72                                            | ME4181                                        | Phương pháp phần tử hữu hạn                   | 2(2 - 1 - 0 - 4)   |
-| 73                                            | ME4505                                        | ĐA TKHTCK - Robot                             | 3(1-2-2-6)         |
-| 74                                            | ME4508                                        | Giao diện người máy                           | 2(0-0-4-4)         |
-| 75                                            | ME4509                                        | Xử lý ảnh                                     | 2(2-1-0-4)         |
-| 76                                            | ME4512                                        | Robot tự hành                                 | 2(2-1-0-4)         |
-| Mô đun 4: Hệ thống cơ điện tử thông minh      | Mô đun 4: Hệ thống cơ điện tử thông minh      | Mô đun 4: Hệ thống cơ điện tử thông minh      | 17                 |
----
-# NODE 16
-**Loại:** TextNode
-**Metadata:**
-- document_type: chuong_trinh_dao_tao
-- program_name: Kỹ thuật Cơ điện tử
-- program_code: ME1
-- faculty: Trường Cơ khí
-- degree_levels: ['Cu nhan', 'Ky su']
-- source_file: 1.1. Kỹ thuật Cơ điện tử.md
-- source_path: data/data_process/chuong_trinh_dao_tao/1.1. Kỹ thuật Cơ điện tử.md
-- header_path: /2. Kiến thức, kỹ năng đạt được sau tốt nghiệp/
-- table_part: 1/2
-- is_table_summary: True
-- parent_id: 59b96dee-6f57-4e27-b073-b66959fe2598
-- chunk_index: 16
-**Content:**
-**This table belongs to the file “1.1. Kỹ thuật Cơ điện tử.md”**. It lists the courses (học phần) in the Kỹ thuật Cơ điện tử (Mechatronics Engineering) curriculum, with each course's code, name, and credit load.
-### Main content of the table
-- **Lists the courses** in each module (Robot, Hệ thống cơ điện tử thông minh…) of the program.
-- **Specifies** the number of credits and the breakdown of hours (lecture, practice, lab, self-study) for each course.
-### Main columns
-| Column | Content |
-|-----|----------|
-| **TT** | Sequence number in the list |
-| **MÃ SỐ** | Course code (for example, ME4181) |
-| **TÊN HỌC PHẦN** | Full course name (for example, “Phương pháp phần tử hữu hạn”) |
-| **KHỐI LƯỢNG (Tín chỉ)** | Total credits and hour breakdown in the form “credits(lecture-practice-lab-self-study)”. For example, **2(2-1-0-4)** means 2 credits, with 2 lecture hours, 1 practice hour, 0 lab hours, and 4 self-study hours. |
-### Key information / examples
-- Courses worth **2 credits** usually have the form **2(2-1-0-4)**; 3-credit courses have the form **3(1-2-2-6)**.
-- The two rows “Mô đun 3: Robot” and “Mô đun 4: Hệ thống cơ điện tử thông minh” are module headings, not courses; each module totals **17 credits**.
-- Some notable courses:
-  - **ME4508 – Giao diện người máy** has a load of **2(0-0-4-4)**: no lecture or practice hours, only 4 lab hours and 4 self-study hours.
-  - **ME4603 – Thực tập xưởng Robot** also has **2(2-1-0-4)**, which shows that the workshop internship still includes lecture and practice hours.
-In summary, this table gives a detailed list of the courses in the Kỹ thuật Cơ điện tử program, with the code, name, and credit/hour structure of each course, along with the total credits of the main modules.
----
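The credit-load notation described in the NODE 16 summary can be parsed mechanically. The sketch below is illustrative only: `CreditLoad` and `parse_credit_load` are hypothetical names that do not appear in the repository, and the field names follow the summary's reading of the four numbers (lecture, practice, lab, self-study), which is an assumption rather than an official definition. It also tolerates the spaced variant `2(2 - 1 - 0 - 4)` and the non-breaking hyphens used in the summaries.

```python
import re
from typing import NamedTuple


class CreditLoad(NamedTuple):
    """Hypothetical container; field names follow the NODE 16 summary."""
    credits: int
    lecture: int
    practice: int
    lab: int
    self_study: int


# Matches "2(2-1-0-4)" and spaced variants such as "2(2 - 1 - 0 - 4)".
_CREDIT_RE = re.compile(
    r"^\s*(\d+)\s*\(\s*(\d+)\s*-\s*(\d+)\s*-\s*(\d+)\s*-\s*(\d+)\s*\)\s*$"
)


def parse_credit_load(text: str) -> CreditLoad:
    """Parse a KHỐI LƯỢNG cell into its five integer components."""
    # Normalise non-breaking hyphens (U+2011) and en dashes (U+2013).
    normalized = text.replace("\u2011", "-").replace("\u2013", "-")
    match = _CREDIT_RE.match(normalized)
    if match is None:
        # Module heading rows carry only a total such as "17".
        raise ValueError(f"not a credit-load cell: {text!r}")
    return CreditLoad(*(int(group) for group in match.groups()))


load = parse_credit_load("2(2 - 1 - 0 - 4)")
print(load.credits, load.self_study)
```

Rows that hold only a module total (for example `17`) raise `ValueError`, so a caller can tell heading rows apart from course rows.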
-# NODE 17
-**Type:** TextNode
-**Metadata:**
-- document_type: chuong_trinh_dao_tao
-- program_name: Kỹ thuật Cơ điện tử
-- program_code: ME1
-- faculty: Trường Cơ khí
-- degree_levels: ['Cu nhan', 'Ky su']
-- source_file: 1.1. Kỹ thuật Cơ điện tử.md
-- source_path: data/data_process/chuong_trinh_dao_tao/1.1. Kỹ thuật Cơ điện tử.md
-- header_path: /2. Kiến thức, kỹ năng đạt được sau tốt nghiệp/
-- table_part: 2/2
-- is_table: True
-- is_parent: True
-- node_id: cda6c2e6-7fea-4fa6-b3ef-3b99e583b948
-- chunk_index: 17
-**Content:**
-| TT                                         | MÃ SỐ                                      | TÊN HỌC PHẦN                                                   | KHỐI LƯỢNG (Tín chỉ)    |
-|--------------------------------------------|--------------------------------------------|----------------------------------------------------------------|-------------------------|
-| 77                                            | IT4162                                        | Vi xử lý                                      | 2(2-1-0-4)         |
-| 78                                            | ME4511                                        | Cảm biến & xử lý tín hiệu                     | 2(2-1-0-4)         |
-| 79                                            | ME4604                                        | Thực tập xưởng HTCĐT TM                       | 2(2-1-0-4)         |
-| 80                                            | ME4181                                        | Phương pháp phần tử hữu hạn                   | 2(2-1-0-4)         |
-| 81                                            | ME4506                                        | ĐA TKHTCK - CĐTTM                             | 3(1-2-2-6)         |
-| 82                                            | ME4508                                        | Giao diện người máy                           | 2(0-0-4-4)         |
-| 83                                            | ME4509                                        | Xử lý ảnh                                     | 2(2-1-0-4)         |
-| 84                                            | EE4829                                        | Điều khiển nối mạng                           | 2(2-1-0-4)         |
-| Thực tập kỹ thuật và Đồ án tốt nghiệp Cử nhân | Thực tập kỹ thuật và Đồ án tốt nghiệp Cử nhân | Thực tập kỹ thuật và Đồ án tốt nghiệp Cử nhân | 8                  |
-| 85                                            | ME4258                                        | Thực tập kỹ thuật                             | 2(0-0-6-4)         |
-| 86                                            | ME4992                                        | Đồ án tốt nghiệp cử nhân                      | 6(0-0-12-12)       |
-| Khối kiến thức kỹ sư                          | Khối kiến thức kỹ sư                          | Khối kiến thức kỹ sư                          | 35                 |
-|                                               |                                               | Tự chọn kỹ sư                                 | 19                 |
-|                                               |                                               | Thực tập kỹ sư                                | 4                  |
-|                                               |                                               | Đồ án tốt nghiệp kỹ sư                        | 12                 |
----
-# NODE 18
-**Type:** TextNode
-**Metadata:**
-- document_type: chuong_trinh_dao_tao
-- program_name: Kỹ thuật Cơ điện tử
-- program_code: ME1
-- faculty: Trường Cơ khí
-- degree_levels: ['Cu nhan', 'Ky su']
-- source_file: 1.1. Kỹ thuật Cơ điện tử.md
-- source_path: data/data_process/chuong_trinh_dao_tao/1.1. Kỹ thuật Cơ điện tử.md
-- header_path: /2. Kiến thức, kỹ năng đạt được sau tốt nghiệp/
-- table_part: 2/2
-- is_table_summary: True
-- parent_id: cda6c2e6-7fea-4fa6-b3ef-3b99e583b948
-- chunk_index: 18
-**Content:**
-**This table belongs to the file 1.1. Kỹ thuật Cơ điện tử.md**. It **lists the courses, course codes, and credit loads (in the form “credits(lecture-practice-lab-self-study)”) of the Kỹ thuật Cơ điện tử program**.
-### Main columns
-| Column | Content |
-|-----|----------|
-| **TT** | Sequence number in the course list |
-| **MÃ SỐ** | Course code (per the university's scheme) |
-| **TÊN HỌC PHẦN** | Course name |
-| **KHỐI LƯỢNG (Tín chỉ)** | Total credits and hour breakdown (lecture hours, practice hours, lab hours, self-study hours) |
-### Highlights
-- **Specialized courses**, for example:
-  - `IT4162 – Vi xử lý – 2(2-1-0-4)` → 2 credits: 2 lecture hours, 1 practice hour, 0 lab hours, 4 self-study hours.
-  - `ME4506 – ĐA TKHTCK - CĐTTM – 3(1-2-2-6)` → 3 credits: 1 lecture hour, 2 practice hours, 2 lab hours, 6 self-study hours.
-  - `ME4508 – Giao diện người máy – 2(0-0-4-4)` → 2 credits, made up entirely of lab and self-study hours (4 hours each).
-- **Internship and capstone courses**:
-  - `Thực tập kỹ thuật và Đồ án tốt nghiệp Cử nhân` (an 8-credit block).
-  - `ME4258 – Thực tập kỹ thuật – 2(0-0-6-4)`.
-  - `ME4992 – Đồ án tốt nghiệp cử nhân – 6(0-0-12-12)`.
-- **Credit totals** (by knowledge block):
-  - **Khối kiến thức kỹ sư** (engineer-level block): 35 credits, which is the sum of the three items below.
-  - **Tự chọn kỹ sư** (engineer electives): 19 credits.
-  - **Thực tập kỹ sư** (engineer internship): 4 credits.
-  - **Đồ án tốt nghiệp kỹ sư** (engineer graduation project): 12 credits.
-In short, the table gives an overview of the course structure and credit allocation of the Kỹ thuật Cơ điện tử program, so that students and lecturers can see the study requirements and the hours for each course.
----
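In the nodes above, each summary node (`is_table_summary: True`) carries a `parent_id` that equals the `node_id` of the raw table node (`is_parent: True`). The sketch below shows one way such a link could be followed after retrieval, so that a matched summary is swapped for the full table. It is a minimal illustration, not the repository's code: the dict layout and the `expand_to_parents` function are hypothetical.

```python
from typing import Any, Dict, List

Node = Dict[str, Any]  # assumed shape: {"text": str, "metadata": dict}


def expand_to_parents(hits: List[Node], all_nodes: List[Node]) -> List[Node]:
    """Swap table-summary hits for their parent table nodes, keeping order."""
    # Index parent tables by node_id for constant-time lookup.
    parents = {
        node["metadata"]["node_id"]: node
        for node in all_nodes
        if node["metadata"].get("is_parent") and "node_id" in node["metadata"]
    }
    expanded: List[Node] = []
    seen = set()
    for hit in hits:
        meta = hit["metadata"]
        target = hit
        if meta.get("is_table_summary"):
            # Fall back to the summary itself if its parent is not indexed.
            target = parents.get(meta.get("parent_id"), hit)
        if id(target) not in seen:  # avoid returning the same table twice
            seen.add(id(target))
            expanded.append(target)
    return expanded


# Values copied from NODE 17 and NODE 18; the text fields are shortened.
table_node = {
    "text": "| TT | MÃ SỐ | TÊN HỌC PHẦN | KHỐI LƯỢNG (Tín chỉ) |",
    "metadata": {
        "is_table": True,
        "is_parent": True,
        "node_id": "cda6c2e6-7fea-4fa6-b3ef-3b99e583b948",
        "chunk_index": 17,
    },
}
summary_node = {
    "text": "Summary of table part 2/2",
    "metadata": {
        "is_table_summary": True,
        "parent_id": "cda6c2e6-7fea-4fa6-b3ef-3b99e583b948",
        "chunk_index": 18,
    },
}

result = expand_to_parents([summary_node], [table_node, summary_node])
print([node["metadata"]["chunk_index"] for node in result])
```

The short summary is what gets matched against the query, while the full table is what would be handed to the generator; hits that are not summaries pass through unchanged.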