Spaces: ManB2207540/demo-question-generation

ManB2207540 committed on Jul 18, 2025

Commit

c750faa

1 Parent(s): e33772f

generate demo

Files changed (4)

README.md +153 -11
app.py +186 -0
requirements.txt +70 -0
test.py +106 -0

README.md CHANGED

@@ -1,14 +1,156 @@
 ---
-title: Demo Question Generation
-emoji: 📈
-colorFrom: indigo
-colorTo: red
-sdk: gradio
-sdk_version: 5.38.0
-app_file: app.py
-pinned: false
-license: unknown
-short_description: A demo for using transformer models for question generation.
 ---
-Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference

+# 📘 Question Generation from Context (ProphetNet + spaCy NER)
+This project implements an automatic question generation system that works from a passage of text (the context) and the entities extracted from it. It uses a fine-tuned ProphetNet model to generate questions and the spaCy library for Named Entity Recognition (NER). The user interface is built with Gradio for easy interaction.
+---
+## 🚀 Key features
+- **Question generation:** Based on a passage of text (the context) and the extracted answers (entities).
+- **Entity extraction (NER):** Uses spaCy's `en_core_web_md` to identify entities that can serve as answers.
+- **Multiple models:** Lets you choose between trained versions of the ProphetNet model.
+- **Friendly web interface:** Built with Gradio; easy to use and test.
+- **Reproducibility:** Detailed instructions so you can install and run the project on your own machine.
+---
+## 🛠️ System requirements
+- Python 3.8+ (a virtual environment such as Conda or `venv` is recommended).
+- At least 8GB of RAM (to load the large language models).
+- A GPU with enough VRAM (for example, 8GB+) is recommended for faster question generation. Without a GPU, the model runs on the CPU but can be significantly slower.
 ---
+## 📦 Installing and setting up the project
+You can use `conda` (if Anaconda is installed) or `venv` to create a virtual environment.
+### Method 1: Conda (recommended)
+If you have **Anaconda** or **Miniconda** installed:
+1.  **Create and activate a new Conda environment:**
+    ```bash
+    conda create -n qg_env python=3.9
+    conda activate qg_env
+    ```
+    (You can pick another Python version such as 3.8 or 3.10, but 3.9 is a good choice.)
+2.  **Install the libraries from `requirements.txt`:**
+    **Important:** You need a `requirements.txt` file listing the libraries used in this project. If you do not have one yet, create it:
+    ```bash
+    # Change to the root directory of this project
+    cd path/to/your/project_folder
+    # Create requirements.txt
+    pip freeze > requirements.txt
+    ```
+    Once `requirements.txt` is in the project root, run:
+    ```bash
+    pip install -r requirements.txt
+    ```
+3.  **Download spaCy's `en_core_web_md` language model:**
+    ```bash
+    python -m spacy download en_core_web_md
+    ```
+4.  **Download and place the ProphetNet models:**
+    This project uses fine-tuned ProphetNet models. Download them and place them at the exact paths declared in the code (or edit the code to match your own paths).
+    In the sample code:
+    - `prophetnet1`: `/Users/trantieuman/Downloads/prophetnet_1epoch/prophetnet_context_to_question_finetuned`
+    - `prophetnet2`: `/Users/trantieuman/Downloads/prophetnet_2epoch_final/final_model`
+    - (If present) `prophetnet3`: `/path/to/prophetnet_model_3`
+    **Note:** For easier management, create a subfolder in the project (for example, `models/prophetnet_1epoch_finetuned`), put the models there, and then update `MODEL_PATHS` in your code to use relative paths. For example:
+    ```python
+    MODEL_PATHS = {
+        "prophetnet1": "./models/prophetnet_1epoch_finetuned",
+        "prophetnet2": "./models/prophetnet_2epoch_final",
+        # ...
+    }
+    ```
+    Make sure these model folders contain files such as `config.json`, `pytorch_model.bin`, `tokenizer_config.json`, `vocab.json`, and so on.
+### Method 2: `venv`
+1.  **Create and activate a new virtual environment:**
+    ```bash
+    # Change to the root directory of this project
+    cd path/to/your/project_folder
+    # Create the virtual environment
+    python -m venv venv_qg
+    # Activate the virtual environment
+    source venv_qg/bin/activate  # On Linux/macOS
+    # Or: .\venv_qg\Scripts\activate # On Windows
+    ```
+2.  **Install the libraries from `requirements.txt`:**
+    As in step 2 of the Conda method, create `requirements.txt` if you do not have one yet:
+    ```bash
+    pip freeze > requirements.txt
+    ```
+    Then install:
+    ```bash
+    pip install -r requirements.txt
+    ```
+3.  **Download spaCy's `en_core_web_md` language model:**
+    ```bash
+    python -m spacy download en_core_web_md
+    ```
+4.  **Download and place the ProphetNet models:**
+    As in step 4 of the Conda method, make sure the fine-tuned ProphetNet model files are placed at the correct paths.
+---
+## 🏃 Running the project
+After completing the installation steps and activating the virtual environment:
+1.  **Make sure you are in the project root directory.**
+2.  **Run the main script:**
+    ```bash
+    python app.py
+    ```
+3.  **Open your browser:**
+    When the Gradio app starts, a URL appears in the terminal (usually `http://127.0.0.1:7860` or similar). Copy this URL into your web browser to interact with the question generation interface.
+---
+## ⚠️ Important notes
+- **Model paths:** Check and adjust the paths in the `MODEL_PATHS` variable in your code (`app.py` or the corresponding file) so that they point to where the ProphetNet model folders are stored on your machine.
+- **GPU performance:** Using a GPU significantly improves question generation speed. Install CUDA and a CUDA-enabled PyTorch build if you want to use the GPU.
+- **Check the cache:** Using many models can fill up the cache.
+  ```bash
+  du -sh ~/.cache/huggingface/hub
+  ```
+- **Clear the cache:** Using many models can fill up the cache; delete cached models you no longer use.
+  ```bash
+  huggingface-cli delete-cache
+  ```
 ---
+I hope this guide helps you and others set up and run the project easily!
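
The cache check in the README notes can also be done from Python. The following is a minimal sketch using only the standard library, assuming the cache lives at `~/.cache/huggingface/hub` as in the README's `du` command. The helpers `dir_size_bytes` and `format_size` are hypothetical names for illustration, not part of this commit.

```python
from pathlib import Path


def dir_size_bytes(root):
    """Sum the sizes of regular files under root; return 0 if root is missing.

    Symlinks are skipped so that linked files are not counted twice.
    """
    root = Path(root).expanduser()
    if not root.is_dir():
        return 0
    return sum(
        p.stat().st_size
        for p in root.rglob("*")
        if p.is_file() and not p.is_symlink()
    )


def format_size(num_bytes):
    """Render a byte count with a binary unit (B, KiB, MiB, GiB)."""
    size = float(num_bytes)
    for unit in ("B", "KiB", "MiB", "GiB"):
        if size < 1024 or unit == "GiB":
            return f"{size:.1f} {unit}"
        size /= 1024
```

Calling `format_size(dir_size_bytes("~/.cache/huggingface/hub"))` would report the current cache size before deciding whether to clear it.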

app.py ADDED

	@@ -0,0 +1,186 @@

+import gradio as gr
+import spacy
+from transformers import ProphetNetTokenizer, ProphetNetForConditionalGeneration, pipeline
+import torch
+import time
+import re
+nlp = spacy.load("en_core_web_md")
+MODEL_PATHS = {
+    "prophetnet2": "ManB2207540/prophetnet_SQuAD_1.1-2epoch_break",
+    "prophetnet tieu chuan": "microsoft/prophetnet-large-uncased-squad-qg"
+}
+def load_pipeline(model_path):
+    tokenizer = ProphetNetTokenizer.from_pretrained(model_path)
+    model = ProphetNetForConditionalGeneration.from_pretrained(model_path)
+    return pipeline(
+        "text2text-generation",
+        model=model,
+        tokenizer=tokenizer,
+        max_length=256,
+        num_return_sequences=1,
+        device=0 if torch.cuda.is_available() else -1
+    )
+pipeline_cache = {}
+def get_pipeline(model_name):
+    model_path = MODEL_PATHS[model_name]
+    if model_name not in pipeline_cache:
+        pipeline_cache[model_name] = load_pipeline(model_path)
+    return pipeline_cache[model_name]
+# Custom "smart" capitalization helper
+def smart_capitalize(text):
+    # Uppercase only the first character, leave the rest of the text as written,
+    # and append a period when there is no ending punctuation
+    text = text.strip()
+    if not text:
+        return text
+    text = text[0].upper() + text[1:]
+    if not re.search(r'[.?!]$', text):
+        text += '.'
+    return text
+def generate_question(context, answer, model_name):
+    pipe = get_pipeline(model_name)
+    tokenizer = pipe.tokenizer
+    prompt = f"context: {context} answer: {answer}"
+    # Truncate the prompt if it exceeds the token limit
+    encoded = tokenizer(prompt, return_tensors="pt", truncation=True, max_length=512)
+    input_ids = encoded["input_ids"]
+    attention_mask = encoded["attention_mask"]
+    try:
+        output = pipe.model.generate(
+            input_ids=input_ids.to(pipe.model.device),
+            attention_mask=attention_mask.to(pipe.model.device),
+            max_length=64,
+            num_return_sequences=1,
+            num_beams=4
+        )
+        result = pipe.tokenizer.decode(output[0], skip_special_tokens=True).strip()
+        # smart_capitalize also appends a period when the text does not end in ., ? or !
+        result = smart_capitalize(result)
+        print(f"Generated question: {result}")
+        return result
+    except Exception as e:
+        return f"Error while generating the question: {e}"
+def generate_qa_list(context, num_questions, model_choice):
+    doc = nlp(context)
+    entities = list(set([ent.text for ent in doc.ents]))
+    entities = [e for e in entities if len(e.strip().split()) <= 10]
+    if not entities:
+        return gr.update(visible=True), ["❌ No entities found to generate questions from."]
+    count = min(int(num_questions), len(entities))
+    qa_list = []
+    for i in range(count):
+        answer = entities[i]
+        question = generate_question(context, answer, model_choice)
+        answer = smart_capitalize(entities[i])
+        qa = f"**{question}**\n<details><summary>Show answer</summary><p>{answer}</p></details>"
+        qa_list.append(qa)
+    return gr.update(visible=True), qa_list
+# Analyze the context separately and update the slider
+def analyze_context(context):
+    doc = nlp(context)
+    entities = list(set([ent.text for ent in doc.ents]))
+    entities = [e for e in entities if len(e.strip().split()) <= 10]
+    entity_count = len(entities)
+    if entity_count == 0:
+        return (
+            gr.update(visible=True, value="❌ No entities found to generate questions from."),
+            gr.update(visible=False),
+            gr.update(visible=False),
+            gr.update(visible=False)
+        )
+    else:
+        return (
+            gr.update(visible=False),
+            gr.update(visible=True, maximum=min(5, entity_count), value=min(3, entity_count), label=f"Number of questions (max: {min(5, entity_count)})"),  # capped at 5: only five result slots exist
+            gr.update(visible=True),
+            gr.update(visible=True)
+        )
+with gr.Blocks() as demo:
+    gr.Markdown("## Question generation from context with ProphetNet + spaCy NER")
+    with gr.Row():
+        with gr.Column(scale=4):
+            context_input = gr.Textbox(label="Context", lines=15, placeholder="Enter a passage of text...")
+            elapsed_time_md = gr.Markdown(visible=False)
+        with gr.Column(scale=1):
+            model_choice = gr.Dropdown(
+                label="Model",
+                choices=list(MODEL_PATHS.keys()),
+                # The default must be one of the MODEL_PATHS keys
+                value="prophetnet2"
+            )
+            num_input = gr.Slider(label="Number of questions", minimum=1, maximum=5, value=3, step=1, visible=False)
+            generate_btn = gr.Button("Generate questions", visible=False)
+    # Status message: processing, or no entities found
+    status_message = gr.Markdown(visible=False)
+    # Results are displayed here
+    with gr.Column(visible=False) as output_container:
+        result_md_list = [gr.Markdown(visible=False) for _ in range(5)]
+    # Handler for the generate button
+    def run_generation(context, num_questions, model_choice):
+        start_time = time.time()
+        visible_container, qa_list = generate_qa_list(context, num_questions, model_choice)
+        status_hide = gr.update(visible=False)
+        updates = []
+        for i in range(5):
+            if i < len(qa_list):
+                updates.append(gr.update(value=qa_list[i], visible=True))
+            else:
+                updates.append(gr.update(visible=False))
+        elapsed = time.time() - start_time
+        elapsed_msg = f"⏱️ Processing time: {elapsed:.2f} seconds"
+        elapsed_md = gr.update(value=elapsed_msg, visible=True)
+        return [status_hide, visible_container, elapsed_md] + updates
+    # When the user changes the context, re-run entity analysis and update the slider
+    context_input.change(
+        fn=analyze_context,
+        inputs=[context_input],
+        outputs=[status_message, num_input, generate_btn, elapsed_time_md]
+    )
+    def show_processing():
+        return gr.update(value="⏳ Processing...", visible=True)
+    generate_btn.click(
+        fn=show_processing,
+        inputs=[],
+        outputs=[status_message]
+    ).then(
+        fn=run_generation,
+        inputs=[context_input, num_input, model_choice],
+        outputs=[status_message, output_container, elapsed_time_md] + result_md_list
+    )
+demo.launch()
+# #/Users/trantieuman/anaconda3/bin/python /Users/trantieuman/Documents/NLP/project/demo.py
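
In `app.py`, entities are deduplicated with `set()`, whose iteration order for strings is not stable between Python runs, so the same context and slider value can select different answers each time. The following is a minimal sketch of an order-preserving alternative that works on plain strings such as `ent.text`. The name `select_answer_candidates` and its parameters are hypothetical; `max_words=10` mirrors the word filter in the code above, and `limit=5` mirrors the five result slots.

```python
def select_answer_candidates(entity_texts, max_words=10, limit=5):
    """Deduplicate entity strings in first-seen order, drop long ones, cap the count."""
    # dict.fromkeys keeps insertion order, unlike set()
    unique = dict.fromkeys(t.strip() for t in entity_texts if t.strip())
    kept = [t for t in unique if len(t.split()) <= max_words]
    return kept[:limit]
```

With spaCy, this could be called as `select_answer_candidates(ent.text for ent in doc.ents)`, which would keep entities in the order they appear in the context.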

requirements.txt ADDED

	@@ -0,0 +1,70 @@

+aiofiles==24.1.0
+aiohappyeyeballs==2.6.1
+aiohttp==3.12.14
+aiosignal==1.4.0
+annotated-types==0.7.0
+anyio==4.9.0
+attrs==25.3.0
+audioop-lts==0.2.1
+Brotli==1.1.0
+certifi==2025.7.9
+charset-normalizer==3.4.2
+click==8.2.1
+datasets==4.0.0
+dill==0.3.8
+fastapi==0.116.1
+ffmpy==0.6.0
+filelock==3.18.0
+frozenlist==1.7.0
+fsspec==2025.3.0
+gradio==5.38.0
+gradio_client==1.11.0
+groovy==0.1.2
+h11==0.16.0
+hf-xet==1.1.5
+httpcore==1.0.9
+httpx==0.28.1
+huggingface-hub==0.33.2
+idna==3.10
+Jinja2==3.1.6
+markdown-it-py==3.0.0
+MarkupSafe==3.0.2
+mdurl==0.1.2
+multidict==6.6.3
+multiprocess==0.70.16
+numpy==2.3.1
+orjson==3.11.0
+packaging==25.0
+pandas==2.3.1
+pillow==11.3.0
+propcache==0.3.2
+pyarrow==20.0.0
+pydantic==2.11.7
+pydantic_core==2.33.2
+pydub==0.25.1
+Pygments==2.19.2
+python-dateutil==2.9.0.post0
+python-multipart==0.0.20
+pytz==2025.2
+PyYAML==6.0.2
+requests==2.32.4
+rich==14.0.0
+ruff==0.12.3
+safehttpx==0.1.6
+safetensors==0.5.3
+semantic-version==2.10.0
+shellingham==1.5.4
+six==1.17.0
+sniffio==1.3.1
+starlette==0.47.1
+tomlkit==0.13.3
+tqdm==4.67.1
+typer==0.16.0
+typing-inspection==0.4.1
+typing_extensions==4.14.1
+tzdata==2025.2
+urllib3==2.5.0
+uvicorn==0.35.0
+websockets==15.0.1
+xxhash==3.5.0
+yarl==1.20.1
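
The pinned list above has no entry for `spacy`, `transformers`, or `torch`, all of which `app.py` imports, so a fresh install from this file would probably fail at import time. The following is a rough sketch of how such gaps could be detected, assuming that import names match distribution names (which is not always true). The function names and the `ignore` default are hypothetical.

```python
import ast


def top_level_imports(source):
    """Collect the top-level module names imported by a piece of Python source."""
    names = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            names.add(node.module.split(".")[0])
    return names


def pinned_names(requirements_text):
    """Return normalized package names from `name==version` lines."""
    names = set()
    for line in requirements_text.splitlines():
        line = line.strip()
        if line and not line.startswith("#"):
            names.add(line.split("==")[0].strip().lower().replace("_", "-"))
    return names


def unpinned_imports(source, requirements_text, ignore=("time", "re", "os")):
    """List imported modules with no matching pin; `ignore` holds stdlib modules."""
    wanted = {n.lower().replace("_", "-") for n in top_level_imports(source)}
    return sorted(wanted - set(ignore) - pinned_names(requirements_text))
```

Run against the contents of `app.py` and this `requirements.txt`, the check would be expected to report the three packages named above. Separately, `numpy==2.3.1` needs a newer Python than the 3.9 suggested in the README, so version compatibility is worth checking before installing.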

test.py ADDED

	@@ -0,0 +1,106 @@

+# finetuned model
+# Import the required libraries
+from transformers import ProphetNetTokenizer, ProphetNetForConditionalGeneration, pipeline
+import torch
+import os
+import pandas as pd
+import time
+from datasets import Dataset
+# Load the Parquet data with error handling and retries
+def load_squad_parquet(split='train', max_retries=3, delay=5):
+    splits = {'train': 'plain_text/train-00000-of-00001.parquet'}
+    path = "hf://datasets/rajpurkar/squad/" + splits[split]
+    for attempt in range(max_retries):
+        try:
+            df = pd.read_parquet(path)
+            print(f"Loaded the SQuAD {split} split after {attempt + 1} attempt(s)!")
+            return df
+        except Exception as e:
+            print(f"Attempt {attempt + 1}/{max_retries} failed: {e}")
+            if attempt < max_retries - 1:
+                print(f"Waiting {delay} seconds before retrying...")
+                time.sleep(delay)
+            else:
+                print("No attempts left. Check your internet connection or reinstall the environment.")
+                return None
+# Load the SQuAD dataset (train split only, for testing)
+train_df = load_squad_parquet('train')
+if train_df is None:
+    raise ValueError("Could not load the SQuAD dataset. Check your internet connection or reinstall the environment.")
+# Convert the DataFrame to a Dataset for compatibility with the pipeline
+train_ds = Dataset.from_pandas(train_df)
+# Path to the directory containing the fine-tuned model and tokenizer
+model_dir = "/Users/trantieuman/Downloads/prophetnet_1epoch/prophetnet_context_to_question_finetuned"
+# Check that the directory exists
+if not os.path.exists(model_dir):
+    raise FileNotFoundError(f"Directory {model_dir} does not exist. Please check the path.")
+# Files required for the model and tokenizer
+required_model_files = ['config.json', 'model.safetensors']  # Only model.safetensors is needed because the weights use this format
+required_tokenizer_files = ['prophetnet.tokenizer', 'tokenizer_config.json']  # Required tokenizer files
+all_files = os.listdir(model_dir)
+missing_model_files = [f for f in required_model_files if f not in all_files]
+missing_tokenizer_files = [f for f in required_tokenizer_files if f not in all_files]
+if missing_model_files or missing_tokenizer_files:
+    print(f"Missing files in {model_dir}:")
+    if missing_model_files:
+        print(f" - Missing model files: {missing_model_files}")
+    if missing_tokenizer_files:
+        print(f" - Missing tokenizer files: {missing_tokenizer_files}")
+    raise FileNotFoundError("Please provide all model and tokenizer files.")
+# Initialize the tokenizer and model from the fine-tuned directory
+try:
+    # Explicitly request the safetensors weights when loading the model
+    tokenizer = ProphetNetTokenizer.from_pretrained(model_dir)
+    model = ProphetNetForConditionalGeneration.from_pretrained(model_dir, use_safetensors=True)
+    print("Loaded the model and tokenizer from the fine-tuned directory!")
+except Exception as e:
+    raise RuntimeError(f"Error loading the model/tokenizer: {e}. Check the directory structure or update the transformers library.")
+# Create the question generation pipeline
+pipe = pipeline(
+    "text2text-generation",
+    model=model,
+    tokenizer=tokenizer,
+    max_length=256,  # Maximum length of the generated question
+    num_return_sequences=1,  # Generate a single question
+    device=0 if torch.cuda.is_available() else -1  # Use the GPU if available, otherwise the CPU
+)
+# Generate a question from a context and an answer
+def generate_question(context, answer):
+    # Format the input the same way the model was fine-tuned
+    input_text = f"context: {context} answer: {answer}"
+    try:
+        result = pipe(input_text)[0]['generated_text']
+        return result
+    except Exception as e:
+        print(f"Error while generating the question: {e}")
+        return None
+# Try the pipeline on one example
+context = "The Vatican Apostolic Library is located in Vatican City."
+answer = "Vatican City"
+question = generate_question(context, answer)
+if question:
+    print(f"Context: {context}")
+    print(f"Answer: {answer}")
+    print(f"Generated Question: {question}")
+# (Optional) Test with data from SQuAD
+sample = train_ds[0]  # Take the first sample from the dataset
+context_sample = sample['context']
+answer_sample = sample['answers']['text'][0] if sample['answers']['text'] else "No answer"
+question_sample = generate_question(context_sample, answer_sample)
+if question_sample:
+    print(f"\nSample Context: {context_sample}")
+    print(f"Sample Answer: {answer_sample}")
+    print(f"Generated Question: {question_sample}")
+# /Users/trantieuman/anaconda3/bin/python /Users/trantieuman/Documents/NLP/project/test.py
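
The retry loop in `load_squad_parquet` could be factored into a reusable helper. The following is a minimal sketch under that assumption; `call_with_retries` is a hypothetical name, and unlike the script above (which returns `None` once attempts run out) it re-raises the last error so the caller sees the original cause. The `sleep` parameter is injectable so the delay can be skipped in tests.

```python
import time


def call_with_retries(fn, max_retries=3, delay=5.0, sleep=time.sleep):
    """Call fn() up to max_retries times, waiting `delay` seconds between attempts.

    Returns fn()'s result on the first success; re-raises the last exception
    if every attempt fails.
    """
    if max_retries < 1:
        raise ValueError("max_retries must be at least 1")
    last_error = None
    for attempt in range(1, max_retries + 1):
        try:
            return fn()
        except Exception as exc:
            last_error = exc
            # Wait only if another attempt is still to come
            if attempt < max_retries:
                sleep(delay)
    raise last_error
```

The Parquet load could then be written as `call_with_retries(lambda: pd.read_parquet(path))`, keeping the retry policy separate from the data loading.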