Spaces: Paused

Nattapong Tapachoom committed
Commit · 1da8c51
1 Parent(s): 7908154

Enhance README with supported tasks and schemas; update app.py for task templates and localization; modify requirements.txt for additional dependencies

Browse files:
- README.md +71 -16
- __pycache__/app.cpython-313.pyc +0 -0
- app.py +447 -57
- requirements.txt +8 -0
README.md CHANGED

@@ -1,8 +1,8 @@
 ---
-title: AutoGDataset
-emoji:
-colorFrom:
-colorTo:
+title: AutoGDataset Thai
+emoji: 🇹🇭
+colorFrom: blue
+colorTo: green
 sdk: gradio
 sdk_version: 5.44.1
 app_file: app.py

@@ -10,22 +10,77 @@ pinned: false
 hf_oauth: true
 ---
 
+# AutoGDataset Thai 🇹🇭
+
+A tool for generating Thai-language datasets from PDF files, using LangChain with the Hugging Face Inference API.
+
+## Key Features ✨
+
+- **Supports many task types**: QA, RLHF, DPO, Constitutional AI, Chain of Thought, Dialogue, and more
+- **Thai-first**: supports Thai-language models and prompts suited to the Thai cultural context
+- **Supported models**: OpenThaiGPT, Typhoon, WangchanBERTa, and multilingual models
+- **Customizable**: prompts and generation parameters can be adjusted
+
+## Recommended Models 🤖
+
+### Thai models
+- `openthaigpt/openthaigpt-1.0.0-alpha-7b-chat`
+- `scb10x/llama-3-typhoon-v1.5-8b-instruct`
+- `airesearch/wangchanberta-base-att-spm-uncased`
+
+### Multilingual models
+- `google/mt5-large`
+- `microsoft/mdeberta-v3-base`
+- `facebook/xglm-7.5B`
+
+## Usage 🚀
+
+### Run locally
+```bash
+pip install -r requirements.txt
+python app.py
+```
+
+### On Hugging Face Spaces
+1. Add an `HF_TOKEN` secret if required
+2. Upload PDF files
+3. Choose a task type and a model
+4. Click "Generate Dataset"
+
+## Supported Task Types 📋
+
+### Basic tasks
+- **QA**: question-answer pairs `{question: str, answer: str}`
+- **Summarization**: summaries `{summary: str}`
+- **Keywords**: key terms `{keyword: str}`
+- **NER**: named-entity recognition `{text: str, label: str, start: int, end: int}`
+- **Classification**: labelling `{labels: [str], rationale: str}`
+- **MCQ**: multiple-choice questions `{question: str, options: [str], answer_index: int}`
+- **True/False**: true/false statements `{statement: str, answer: bool, explanation: str}`
+- **Translation**: parallel sentence pairs `{source: str, target: str}`
+
+### Advanced tasks for AI training
+- **RLHF**: `{prompt: str, responses: [str], scores: [float], preferred_response: str}`
+- **DPO**: `{prompt: str, chosen: str, rejected: str, reason: str}`
+- **Instruction_Following**: `{instruction: str, input: str, output: str, difficulty: str}`
+- **Constitutional_AI**: `{problematic_prompt: str, constitutional_response: str, principle: str}`
+- **Chain_of_Thought**: `{problem: str, thinking_steps: [str], final_answer: str}`
+- **Dialogue**: `{dialogue: [{role: str, content: str}], context: str}`
+- **Thai_Culture**: `{question_th: str, answer_th: str, cultural_context: str}`
+
+## Important Notes 📝
+
+- Uses the HF Inference API via LangChain; no local `transformers` install is required
+- Output files are written to `outputs/` in both JSON and JSONL formats
+- Hugging Face login is required on Spaces (set `REQUIRE_LOGIN=0` to disable it)
+- Prompts can be customized for better results
+
+## Installing Dependencies 📦
+
+```bash
+pip install gradio pypdf huggingface_hub langchain langchain-community pythainlp transformers torch
+```
+
+For better Thai text processing, it is recommended to install:
+- `pythainlp`: Thai language processing
+- `thai-word-segmentation`: Thai word segmentation
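Each line of a generated JSONL file is one record with the task-specific fields listed above, plus the `context` and `task` fields that app.py attaches to every item. A minimal sketch for reading a result back in (the file name below is hypothetical; real names follow the `dataset_<task>_<timestamp>.jsonl` pattern used in app.py):

```python
import json

records = []
with open("outputs/dataset_qa_20240101_000000.jsonl", encoding="utf-8") as f:
    for line in f:
        records.append(json.loads(line))

# A QA-task record looks like:
# {"question": "...", "answer": "...", "context": "...", "task": "QA"}
print(len(records), records[0] if records else None)
```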
__pycache__/app.cpython-313.pyc CHANGED

Binary files a/__pycache__/app.cpython-313.pyc and b/__pycache__/app.cpython-313.pyc differ
app.py CHANGED
@@ -89,11 +89,89 @@ def chunk_text(text: str, chunk_size: int = 1500, overlap: int = 200, max_chunks
 
 
 DEFAULT_QA_PROMPT_TMPL = (
+    'คุณเป็นผู้สร้างชุดข้อมูลที่เป็นประโยชน์ อ่านเนื้อหาที่ให้มาและสร้างคู่คำถาม-คำตอบที่มีคุณภาพสูงและตรงตามข้อเท็จจริง จำนวน {min_pairs} ถึง {max_pairs} คู่ '
+    'ส่งคืนเฉพาะ JSON array ที่มี objects ในรูปแบบ {{"question": str, "answer": str}} เท่านั้น ไม่ต้องใส่ข้อความเพิ่มเติม คำอธิบาย หรือ code fences\n\n'
+    'เนื้อหา:\n{content}\n'
 )
 
+TASK_TEMPLATES: Dict[str, str] = {
+    "QA": DEFAULT_QA_PROMPT_TMPL,
+    "Summarization": (
+        'สรุปเนื้อหาต่อไปนี้เป็นบทสรุปที่กระชับ จำนวน {min_pairs} ถึง {max_pairs} บทสรุป โดยครอบคลุมข้อมูลสำคัญ '
+        'ส่งคืนเฉพาะ JSON array ที่มี objects ในรูปแบบ {{"summary": str}} เท่านั้น ไม่ต้องมีข้อความเพิ่มเติม\n\n'
+        'เนื้อหา:\n{content}\n'
+    ),
+    "Keywords": (
+        'แยกคำสำคัญหรือวลีสำคัญจากเนื้อหา จำนวน {min_pairs} ถึง {max_pairs} คำ '
+        'ส่งคืนเฉพาะ JSON array ของ objects ที่มี {{"keyword": str}} เท่านั้น ไม่ต้องมีข้อความเพิ่มเติม\n\n'
+        'เนื้อหา:\n{content}\n'
+    ),
+    "NER": (
+        'แยกเอนทิตีที่มีชื่อเฉพาะจากเนื้อหา ส่งคืนเฉพาะ JSON array ของ objects ที่มี {{"text": str, "label": str, "start": int, "end": int}} '
+        'ป้ายกำกับควรเป็นประเภทมาตรฐาน เช่น PER (บุคคล), ORG (องค์กร), LOC (สถานที่), MISC (อื่นๆ){ner_labels_clause}\n\n'
+        'เนื้อหา:\n{content}\n'
+    ),
+    "Classification": (
+        'จำแนกเนื้อหาตามป้ายกำกับต่อไปนี้: {labels} {multi_label_clause} '
+        'ส่งคืนเฉพาะ JSON array ที่มี objects ในรูปแบบ {{"labels": [str], "rationale": str}} เท่านั้น ไม่ต้องมีข้อความเพิ่มเติม\n\n'
+        'เนื้อหา:\n{content}\n'
+    ),
+    "MCQ": (
+        'สร้างคำถามแบบเลือกตอบจากเนื้อหา จำนวน {min_pairs} ถึง {max_pairs} ข้อ แต่ละข้อมี {num_options} ตัวเลือก '
+        'ส่งคืนเฉพาะ JSON array ของ objects ที่มี {{"question": str, "options": [str], "answer_index": int}} เท่านั้น ไม่ต้องมีข้อความเพิ่มเติม\n\n'
+        'เนื้อหา:\n{content}\n'
+    ),
+    "True/False": (
+        'สร้างข้อความจริง/เท็จที่อิงจากเนื้อหาเท่านั้น จำนวน {min_pairs} ถึง {max_pairs} ข้อความ '
+        'ส่งคืนเฉพาะ JSON array ของ objects ที่มี {{"statement": str, "answer": bool, "explanation": str}} เท่านั้น ไม่ต้องมีข้อความเพิ่มเติม\n\n'
+        'เนื้อหา:\n{content}\n'
+    ),
+    "Translation": (
+        'แปลเนื้อหาเป็น{target_language} สร้างคู่ประโยคแบบคู่ขนาน จำนวน {min_pairs} ถึง {max_pairs} คู่ '
+        'ส่งคืนเฉพาะ JSON array ของ objects ที่มี {{"source": str, "target": str}} เท่านั้น ไม่ต้องมีข้อความเพิ่มเติม\n\n'
+        'เนื้อหา:\n{content}\n'
+    ),
+    "RLHF": (
+        'สร้างข้อมูลสำหรับ Reinforcement Learning from Human Feedback (RLHF) จากเนื้อหานี้ '
+        'สร้างคำถามและการตอบสนองหลายแบบ พร้อมคะแนนความต้องการของมนุษย์ จำนวน {min_pairs} ถึง {max_pairs} ชุด '
+        'ส่งคืนเฉพาะ JSON array ของ objects ที่มี {{"prompt": str, "responses": [str], "scores": [float], "preferred_response": str}} เท่านั้น\n\n'
+        'เนื้อหา:\n{content}\n'
+    ),
+    "DPO": (
+        'สร้างข้อมูลสำหรับ Direct Preference Optimization (DPO) จากเนื้อหานี้ '
+        'สร้างคำถามพร้อมการตอบสนองที่ดีและไม่ดี จำนวน {min_pairs} ถึง {max_pairs} คู่ '
+        'ส่งคืนเฉพาะ JSON array ของ objects ที่มี {{"prompt": str, "chosen": str, "rejected": str, "reason": str}} เท่านั้น\n\n'
+        'เนื้อหา:\n{content}\n'
+    ),
+    "Instruction_Following": (
+        'สร้างคำสั่งและการตอบสนองสำหรับการฝึกการทำตามคำสั่ง จำนวน {min_pairs} ถึง {max_pairs} คู่ '
+        'ส่งคืนเฉพาะ JSON array ของ objects ที่มี {{"instruction": str, "input": str, "output": str, "difficulty": str}} เท่านั้น\n\n'
+        'เนื้อหา:\n{content}\n'
+    ),
+    "Constitutional_AI": (
+        'สร้างข้อมูลสำหรับ Constitutional AI โดยสร้างคำถามที่อาจมีปัญหาทางจริยธรรมและคำตอบที่เหมาะสม '
+        'จำนวน {min_pairs} ถึง {max_pairs} คู่ '
+        'ส่งคืนเฉพาะ JSON array ของ objects ที่มี {{"problematic_prompt": str, "constitutional_response": str, "principle": str}} เท่านั้น\n\n'
+        'เนื้อหา:\n{content}\n'
+    ),
+    "Chain_of_Thought": (
+        'สร้างตัวอย่างการคิดแบบขั้นตอน (Chain of Thought) จากเนื้อหา จำนวน {min_pairs} ถึง {max_pairs} ตัวอย่าง '
+        'ส่งคืนเฉพาะ JSON array ของ objects ที่มี {{"problem": str, "thinking_steps": [str], "final_answer": str}} เท่านั้น\n\n'
+        'เนื้อหา:\n{content}\n'
+    ),
+    "Dialogue": (
+        'สร้างบทสนทนาระหว่างผู้ใช้และผู้ช่วย AI จากเนื้อหา จำนวน {min_pairs} ถึง {max_pairs} บทสนทนา '
+        'ส่งคืนเฉพาะ JSON array ของ objects ที่มี {{"dialogue": [{{"role": str, "content": str}}], "context": str}} เท่านั้น\n\n'
+        'เนื้อหา:\n{content}\n'
+    ),
+    "Thai_Culture": (
+        'สร้างคำถาม-คำตอบเกี่ยวกับวัฒนธรรมไทยจากเนื้อหา เน้นความเข้าใจภาษาไทยและบริบททางวัฒนธรรม '
+        'จำนวน {min_pairs} ถึง {max_pairs} คู่ '
+        'ส่งคืนเฉพาะ JSON array ของ objects ที่มี {{"question_th": str, "answer_th": str, "cultural_context": str}} เท่านั้น\n\n'
+        'เนื้อหา:\n{content}\n'
+    ),
+}
+
 
 def extract_json_array(text: str) -> List[Dict[str, Any]]:
     if not text:
 
@@ -126,11 +204,10 @@ def extract_json_array(text: str) -> List[Dict[str, Any]]:
     return []
 
 
-def build_langchain(model_id: str, hf_token: str | None, max_new_tokens: int, temperature: float,
+def build_langchain(model_id: str, hf_token: str | None, max_new_tokens: int, temperature: float, template: str):
     if any(x is None for x in [PromptTemplate, JsonOutputParser, HuggingFaceHub]):
         raise RuntimeError("langchain and langchain-community are required. Please add to requirements.txt.")
     # Prompt
-    template = custom_instruction.strip() + "\n\nContent:\n{content}\n" if (custom_instruction and custom_instruction.strip()) else DEFAULT_QA_PROMPT_TMPL
     prompt = PromptTemplate.from_template(template)
     # Model wrapper (Hugging Face Inference API)
     llm = HuggingFaceHub(
 
@@ -144,14 +221,195 @@ def build_langchain(model_id: str, hf_token: str | None, max_new_tokens: int, te
     )
     parser = JsonOutputParser()
     chain = prompt | llm | parser
-    # Provide default formatting variables via partials
-    chain = chain.bind(min_pairs=min_pairs, max_pairs=max_pairs)
     return chain
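The chain built above is a LangChain pipeline of PromptTemplate → HuggingFaceHub → JsonOutputParser. A rough standalone sketch of what that composition does, with a stub in place of the Inference API call (the stub, its output, and the short template below are made up for illustration, not the app's code):

```python
import json

def run_chain(template: str, llm, variables: dict) -> list:
    # PromptTemplate step: str.format-style substitution, so the doubled braces
    # {{...}} used in TASK_TEMPLATES come out as literal JSON braces in the prompt.
    prompt_text = template.format(**variables)
    raw = llm(prompt_text)   # HuggingFaceHub step (stubbed here)
    return json.loads(raw)   # JsonOutputParser step: expects a pure JSON array

fake_llm = lambda _prompt: '[{"question": "ตัวอย่างคำถาม?", "answer": "ตัวอย่างคำตอบ"}]'
template = 'Return only a JSON array of {{"question": str, "answer": str}}.\n\nContent:\n{content}\n'
print(run_chain(template, fake_llm, {"content": "sample text"}))
```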
 
 
+def get_task_template(task: str, custom_instruction: str | None) -> str:
+    base = TASK_TEMPLATES.get(task, DEFAULT_QA_PROMPT_TMPL)
+    if custom_instruction and custom_instruction.strip():
+        # Allow user to override fully, but ensure {content} is present
+        if "{content}" not in custom_instruction:
+            custom_instruction = custom_instruction.strip() + "\n\nContent:\n{content}\n"
+        return custom_instruction
+    return base
+
+
+def normalize_items(task: str, data: Any) -> List[Dict[str, Any]]:
+    # Convert model output to list[dict] per task
+    items: List[Dict[str, Any]] = []
+    if data is None:
+        return items
+    if isinstance(data, str):
+        data = extract_json_array(data)
+    if isinstance(data, dict):
+        # handle wrappers like {"items": [...]}
+        if "items" in data and isinstance(data["items"], list):
+            data = data["items"]
+        else:
+            data = [data]
+    if isinstance(data, list):
+        # keywords may be list[str]
+        if task == "Keywords" and data and all(isinstance(x, str) for x in data):
+            return [{"keyword": x} for x in data if x]
+        for el in data:
+            if isinstance(el, dict):
+                items.append(el)
+    # Validate per-task required fields and normalize variants
+    norm: List[Dict[str, Any]] = []
+    for it in items:
+        if task == "QA":
+            q = str(it.get("question", "")).strip()
+            a = str(it.get("answer", "")).strip()
+            if q and a:
+                norm.append({"question": q, "answer": a})
+        elif task == "Summarization":
+            s = str(it.get("summary", "")).strip()
+            if s:
+                norm.append({"summary": s})
+        elif task == "Keywords":
+            k = it.get("keyword")
+            if isinstance(k, str) and k.strip():
+                norm.append({"keyword": k.strip()})
+            elif isinstance(it.get("keywords"), list):
+                for kw in it["keywords"]:
+                    if isinstance(kw, str) and kw.strip():
+                        norm.append({"keyword": kw.strip()})
+        elif task == "NER":
+            txt = it.get("text")
+            label = it.get("label")
+            start = it.get("start")
+            end = it.get("end")
+            if isinstance(txt, str) and isinstance(label, str) and isinstance(start, int) and isinstance(end, int):
+                norm.append({"text": txt, "label": label, "start": start, "end": end})
+            elif isinstance(it.get("entities"), list):
+                for ent in it["entities"]:
+                    if all(k in ent for k in ("text", "label", "start", "end")):
+                        norm.append({
+                            "text": str(ent.get("text", "")),
+                            "label": str(ent.get("label", "")),
+                            "start": int(ent.get("start", 0)),
+                            "end": int(ent.get("end", 0)),
+                        })
+        elif task == "Classification":
+            labels = it.get("labels")
+            if isinstance(labels, str):
+                labels = [labels]
+            if isinstance(labels, list):
+                labels = [str(x).strip() for x in labels if str(x).strip()]
+            rationale = str(it.get("rationale", "")).strip()
+            if labels:
+                norm.append({"labels": labels, "rationale": rationale})
+        elif task == "MCQ":
+            q = it.get("question")
+            options = it.get("options")
+            answer_index = it.get("answer_index")
+            answer = it.get("answer")
+            if isinstance(options, list) and all(isinstance(o, str) for o in options) and isinstance(q, str):
+                if isinstance(answer_index, int):
+                    idx = answer_index
+                elif isinstance(answer, str) and answer in options:
+                    idx = options.index(answer)
+                else:
+                    continue
+                norm.append({"question": q, "options": options, "answer_index": idx})
+        elif task == "True/False":
+            st = it.get("statement")
+            ans = it.get("answer")
+            expl = it.get("explanation", "")
+            if isinstance(st, str):
+                if isinstance(ans, bool):
+                    val = ans
+                elif isinstance(ans, str):
+                    val = ans.strip().lower() in ("true", "t", "yes", "1")
+                else:
+                    continue
+                norm.append({"statement": st, "answer": val, "explanation": str(expl)})
+        elif task == "Translation":
+            src = it.get("source")
+            tgt = it.get("target")
+            if isinstance(src, str) and isinstance(tgt, str) and src.strip() and tgt.strip():
+                norm.append({"source": src, "target": tgt})
+        elif task == "RLHF":
+            prompt = it.get("prompt")
+            responses = it.get("responses")
+            scores = it.get("scores")
+            preferred = it.get("preferred_response")
+            if isinstance(prompt, str) and isinstance(responses, list) and isinstance(scores, list):
+                norm.append({
+                    "prompt": prompt,
+                    "responses": responses,
+                    "scores": scores,
+                    "preferred_response": str(preferred) if preferred else ""
+                })
+        elif task == "DPO":
+            prompt = it.get("prompt")
+            chosen = it.get("chosen")
+            rejected = it.get("rejected")
+            reason = it.get("reason", "")
+            if isinstance(prompt, str) and isinstance(chosen, str) and isinstance(rejected, str):
+                norm.append({
+                    "prompt": prompt,
+                    "chosen": chosen,
+                    "rejected": rejected,
+                    "reason": str(reason)
+                })
+        elif task == "Instruction_Following":
+            instruction = it.get("instruction")
+            input_text = it.get("input", "")
+            output = it.get("output")
+            difficulty = it.get("difficulty", "medium")
+            if isinstance(instruction, str) and isinstance(output, str):
+                norm.append({
+                    "instruction": instruction,
+                    "input": str(input_text),
+                    "output": output,
+                    "difficulty": str(difficulty)
+                })
+        elif task == "Constitutional_AI":
+            problematic = it.get("problematic_prompt")
+            constitutional = it.get("constitutional_response")
+            principle = it.get("principle", "")
+            if isinstance(problematic, str) and isinstance(constitutional, str):
+                norm.append({
+                    "problematic_prompt": problematic,
+                    "constitutional_response": constitutional,
+                    "principle": str(principle)
+                })
+        elif task == "Chain_of_Thought":
+            problem = it.get("problem")
+            steps = it.get("thinking_steps")
+            answer = it.get("final_answer")
+            if isinstance(problem, str) and isinstance(steps, list) and isinstance(answer, str):
+                norm.append({
+                    "problem": problem,
+                    "thinking_steps": steps,
+                    "final_answer": answer
+                })
+        elif task == "Dialogue":
+            dialogue = it.get("dialogue")
+            context = it.get("context", "")
+            if isinstance(dialogue, list):
+                norm.append({
+                    "dialogue": dialogue,
+                    "context": str(context)
+                })
+        elif task == "Thai_Culture":
+            question_th = it.get("question_th")
+            answer_th = it.get("answer_th")
+            cultural_context = it.get("cultural_context", "")
+            if isinstance(question_th, str) and isinstance(answer_th, str):
+                norm.append({
+                    "question_th": question_th,
+                    "answer_th": answer_th,
+                    "cultural_context": str(cultural_context)
+                })
+    return norm
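normalize_items() is the tolerant layer between the model and the saved dataset: it accepts a JSON string, a bare list, or an `{"items": [...]}` wrapper, and keeps only records that carry the required fields for the chosen task. An illustrative sketch of the input shapes it is written to absorb for the Keywords task (the values are made up):

```python
# Three raw outputs that should all normalize to [{"keyword": "..."}] records:
raw_variants = [
    '[{"keyword": "ภาษาไทย"}, {"keyword": "ชุดข้อมูล"}]',  # JSON text (parser fallback path)
    {"items": [{"keyword": "ภาษาไทย"}]},                    # object wrapper around the list
    ["ภาษาไทย", "ชุดข้อมูล"],                                # bare list of strings
]
for raw in raw_variants:
    print(type(raw).__name__, "->", raw)
```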
+
+
 def generate_dataset(
     user_profile: Any | None,
     files: List[gr.File],
+    task: str,
     preset_model: str,
     custom_model_id: str,
     hf_token: str,
 
@@ -163,83 +421,159 @@ def generate_dataset(
     custom_instruction: str,
     min_pairs: int,
     max_pairs: int,
+    class_labels_text: str,
+    multi_label: bool,
+    target_language: str,
+    num_options: int,
+    ner_labels_text: str,
 ):
     # Enforce login if required
     if REQUIRE_LOGIN and not user_profile:
+        return "กรุณาเข้าสู่ระบบก่อนเพื่อสร้างชุดข้อมูล", None, None
 
     # Read and chunk
     full_text, _docs = read_pdfs(files)
     chunks = chunk_text(full_text, chunk_size=chunk_size, overlap=overlap, max_chunks=max_chunks)
     if not chunks:
+        return "ไม่สามารถแยกข้อความจากไฟล์ PDF ได้", None, None
 
     model_id = (custom_model_id or "").strip() or preset_model
+    # Prepare template per task
+    base_template = get_task_template(task, custom_instruction)
+    # enrich template with conditional clauses
+    ner_clause = ""
+    if ner_labels_text.strip():
+        ner_clause = f" (limit to: {ner_labels_text.strip()})"
+    base_template = base_template.replace("{ner_labels_clause}", ner_clause)
+    if "{labels}" in base_template:
+        labels_text = class_labels_text.strip() or "[]"
+        base_template = base_template.replace("{labels}", labels_text)
+    if "{multi_label_clause}" in base_template:
+        base_template = base_template.replace("{multi_label_clause}", " Allow multiple labels." if multi_label else " Choose a single best label.")
+    if "{num_options}" in base_template:
+        base_template = base_template.replace("{num_options}", str(int(num_options)))
     try:
-        chain = build_langchain(model_id, hf_token or None, max_new_tokens, temperature,
+        chain = build_langchain(model_id, hf_token or None, max_new_tokens, temperature, base_template)
     except Exception as e:
+        return f"ข้อผิดพลาดในการเตรียม LangChain: {e}", None, None
 
     results: List[Dict[str, Any]] = []
     for ch in chunks:
         try:
+            variables = {"content": ch["text"], "min_pairs": min_pairs, "max_pairs": max_pairs}
+            if "{target_language}" in base_template:
+                variables["target_language"] = target_language or "English"
+            data = chain.invoke(variables)
+            items = normalize_items(task, data)
         except Exception:
             # If parser fails, try best-effort extraction on raw string
             try:
-                items = extract_json_array(str(raw))
+                raw = (PromptTemplate.from_template(base_template) | HuggingFaceHub(repo_id=model_id, huggingfacehub_api_token=hf_token)).invoke(variables)  # type: ignore
+                items = normalize_items(task, raw)
             except Exception:
                 items = []
 
         for it in items:
+            # Enrich with context and task
+            it["context"] = (ch["text"][:500] + ("..." if len(ch["text"]) > 500 else ""))
+            it["task"] = task
+            results.append(it)
 
     if not results:
+        return f"โมเดลไม่ได้ส่งคืนข้อมูลที่ถูกต้องสำหรับงาน {task} ลองปรับ prompt หรือโมเดล", None, None
 
-    # Deduplicate
+    # Deduplicate per task key
+    unique: List[Dict[str, Any]] = []
     seen = set()
+    def key_of(item: Dict[str, Any]) -> str:
+        if task == "QA":
+            return (item.get("question") or "").strip().lower()
+        if task == "Summarization":
+            return (item.get("summary") or "").strip().lower()
+        if task == "Keywords":
+            return (item.get("keyword") or "").strip().lower()
+        if task == "NER":
+            return f"{item.get('text')}|{item.get('label')}|{item.get('start')}|{item.get('end')}"
+        if task == "Classification":
+            return ",".join(sorted([str(x).lower() for x in item.get("labels", [])]))
+        if task == "MCQ":
+            return (item.get("question") or "").strip().lower()
+        if task == "True/False":
+            return (item.get("statement") or "").strip().lower()
+        if task == "Translation":
+            return f"{item.get('source')}|{item.get('target')}"
+        if task == "RLHF":
+            return (item.get("prompt") or "").strip().lower()
+        if task == "DPO":
+            return (item.get("prompt") or "").strip().lower()
+        if task == "Instruction_Following":
+            return (item.get("instruction") or "").strip().lower()
+        if task == "Constitutional_AI":
+            return (item.get("problematic_prompt") or "").strip().lower()
+        if task == "Chain_of_Thought":
+            return (item.get("problem") or "").strip().lower()
+        if task == "Dialogue":
+            dialogue = item.get("dialogue", [])
+            if dialogue and isinstance(dialogue, list):
+                return str(dialogue[0].get("content", "")).strip().lower()
+            return ""
+        if task == "Thai_Culture":
+            return (item.get("question_th") or "").strip().lower()
+        return json.dumps(item, ensure_ascii=False)
     for r in results:
+        k = key_of(r)
+        if k and k not in seen:
             unique.append(r)
+            seen.add(k)
 
     # Save to outputs
     outdir = ensure_output_dir()
     ts = datetime.utcnow().strftime("%Y%m%d_%H%M%S")
+    safe_task = task.lower().replace("/", "-").replace(" ", "_")
+    json_path = os.path.join(outdir, f"dataset_{safe_task}_{ts}.json")
+    jsonl_path = os.path.join(outdir, f"dataset_{safe_task}_{ts}.jsonl")
     with io.open(json_path, "w", encoding="utf-8") as f:
         json.dump(unique, f, ensure_ascii=False, indent=2)
     with io.open(jsonl_path, "w", encoding="utf-8") as f:
         for item in unique:
             f.write(json.dumps(item, ensure_ascii=False) + "\n")
 
+    return f"สร้างข้อมูลสำเร็จ {len(unique)} รายการสำหรับงาน: {task} 🎉", json_path, jsonl_path
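The deduplication step keeps the first record seen for each task-specific key (question text for QA, `source|target` for Translation, and so on). The same idea as a tiny standalone helper, with made-up sample data (illustrative only, not the app's code):

```python
from typing import Any, Callable, Dict, List

def dedup_by_key(records: List[Dict[str, Any]], key: Callable[[Dict[str, Any]], str]) -> List[Dict[str, Any]]:
    seen: set = set()
    unique: List[Dict[str, Any]] = []
    for r in records:
        k = key(r)
        if k and k not in seen:   # empty keys are never added, same as in generate_dataset
            unique.append(r)
            seen.add(k)
    return unique

qa = [{"question": "AutoGDataset คืออะไร?", "answer": "A"},
      {"question": "autogdataset คืออะไร? ", "answer": "B"}]
print(dedup_by_key(qa, key=lambda r: (r.get("question") or "").strip().lower()))
# -> only the first record survives; the second differs only in case and whitespace
```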
 
 
 PRESET_MODELS = [
+    # Thai-capable models
+    "openthaigpt/openthaigpt-1.0.0-alpha-7b-chat",
+    "scb10x/llama-3-typhoon-v1.5-8b-instruct",
+    "airesearch/wangchanberta-base-att-spm-uncased",
+
+    # Multilingual models good for Thai
+    "google/mt5-large",
+    "microsoft/mdeberta-v3-base",
+    "facebook/xglm-7.5B",
+    "microsoft/DialoGPT-medium",
+
+    # General powerful models
     "HuggingFaceH4/zephyr-7b-beta",
     "mistralai/Mistral-7B-Instruct-v0.2",
     "google/flan-t5-large",
+    "meta-llama/Llama-2-7b-chat-hf",
+    "microsoft/DialoGPT-large",
 ]
 
 
-with gr.Blocks(title="AutoGDataset - PDF to
+with gr.Blocks(title="AutoGDataset Thai - PDF to Dataset Generator") as demo:
     gr.Markdown("""
-    # AutoGDataset
+    # AutoGDataset Thai 🇹🇭
+    สร้างชุดข้อมูล (Dataset) ภาษาไทยจากไฟล์ PDF โดยใช้ LangChain กับโมเดล Hugging Face
+
+    **คุณสมบัติ:**
+    - รองรับงานหลากหลายประเภท: QA, RLHF, DPO, Constitutional AI และอื่นๆ
+    - เน้นการสร้างข้อมูลภาษาไทยคุณภาพสูง
+    - รองรับโมเดลภาษาไทยและ multilingual models
+    - สามารถปรับแต่ง prompt เพื่อเพิ่มประสิทธิภาพ
+
+    เลือกโมเดลที่มีอยู่หรือระบุ repo id ที่กำหนดเอง ระบุ `HF_TOKEN` หากจำเป็นสำหรับโมเดล
     """)
 
     # Login requirement (Hugging Face OAuth via Gradio LoginButton when available)
 
@@ -248,54 +582,105 @@ with gr.Blocks(title="AutoGDataset - PDF to QA Dataset (LangChain)") as demo:
     with gr.Row():
         login_info = gr.Markdown(
             value=(
+                "กรุณาเข้าสู่ระบบด้วยบัญชี Hugging Face เพื่อใช้งานแอป"
                 if effective_require_login
                 else (
+                    "การเข้าสู่ระบบเป็นทางเลือก" if OAUTH_AVAILABLE else "ไม่ได้ตั้งค่าการเข้าสู่ระบบ OAuth ในการติดตั้งนี้"
                 )
             ),
             elem_id="login-info",
         )
     if OAUTH_AVAILABLE:
         with gr.Row():
+            login_btn = gr.LoginButton(value="เข้าสู่ระบบด้วย Hugging Face")
 
     with gr.Row():
+        pdf_files = gr.File(label="อัปโหลดไฟล์ PDF", file_count="multiple", file_types=[".pdf"])
 
     with gr.Group():
         with gr.Row():
+            task = gr.Dropdown(
+                label="งานที่ต้องการ (Task Type)",
+                choices=[
+                    "QA",
+                    "Summarization",
+                    "Keywords",
+                    "NER",
+                    "Classification",
+                    "MCQ",
+                    "True/False",
+                    "Translation",
+                    "RLHF",
+                    "DPO",
+                    "Instruction_Following",
+                    "Constitutional_AI",
+                    "Chain_of_Thought",
+                    "Dialogue",
+                    "Thai_Culture",
+                ],
+                value="Thai_Culture",
+            )
         with gr.Row():
+            preset_model = gr.Dropdown(label="โมเดลที่กำหนดไว้ (Preset Model)", choices=PRESET_MODELS, value=PRESET_MODELS[0])
+            custom_model_id = gr.Textbox(label="รหัสโมเดลกำหนดเอง (ไม่บังคับ)", placeholder="org/model-name")
        with gr.Row():
+            hf_token = gr.Textbox(label="HF Token", type="password", value=os.environ.get("HF_TOKEN", ""), placeholder="hf_xxx (จำเป็นสำหรับหลายโมเดล)")
+        with gr.Row():
+            max_new_tokens = gr.Slider(64, 1024, value=512, step=16, label="จำนวน Token สูงสุด")
+            temperature = gr.Slider(0.0, 1.5, value=0.2, step=0.05, label="อุณหภูมิ (ความสร้างสรรค์)")
 
-    with gr.Accordion("Advanced", open=False):
+    with gr.Accordion("การตั้งค่าขั้นสูง (Advanced Settings)", open=False):
         with gr.Row():
+            chunk_size = gr.Slider(500, 4000, value=1500, step=50, label="ขนาดส่วนข้อความ (ตัวอักษร)")
+            overlap = gr.Slider(0, 1000, value=200, step=50, label="การทับซ้อน (ตัวอักษร)")
+            max_chunks = gr.Slider(1, 40, value=5, step=1, label="จำนวนส่วนสูงสุด")
         with gr.Row():
+            min_pairs = gr.Slider(1, 10, value=3, step=1, label="คู่ข้อมูลต่ำสุด/ส่วน")
+            max_pairs = gr.Slider(1, 12, value=6, step=1, label="คู่ข้อมูลสูงสุด/ส่วน")
+        custom_instruction = gr.Textbox(
+            label="คำสั่งกำหนดเอง (ไม่บังคับ)",
+            lines=3,
+            placeholder="แทนที่คำสั่งเริ่มต้น ต้องส่งคืน JSON array บริสุทธิ์ตามโครงสร้างงาน",
+            value="สร้างข้อมูลภาษาไทยคุณภาพสูงที่เข้าใจบริบททางวัฒนธรรมไทย ใช้ภาษาไทยที่เป็นธรรมชาติและเหมาะสมกับเนื้อหา"
+        )
+
+    # Task-specific controls
+    classification_labels = gr.Textbox(label="ป้ายกำกับการจำแนก (คั่นด้วยคอมมา)", visible=False)
+    multi_label = gr.Checkbox(label="อนุญาตหลายป้ายกำกับ", value=False, visible=False)
+    target_language = gr.Textbox(label="ภาษาเป้าหมาย (การแปล)", value="ไทย", visible=False)
+    num_options = gr.Slider(3, 6, value=4, step=1, label="ตัวเลือก MCQ", visible=False)
+    ner_labels = gr.Textbox(label="ป้ายกำกับ NER (คั่นด้วยคอมมา, ไม่บังคับ)", visible=False)
 
-    generate_btn = gr.Button("Generate Dataset", variant="primary", interactive=(not effective_require_login))
+    generate_btn = gr.Button("สร้างชุดข้อมูล (Generate Dataset)", variant="primary", interactive=(not effective_require_login))
 
     with gr.Row():
         status = gr.Markdown()
     with gr.Row():
+        out_json = gr.File(label="ดาวน์โหลด JSON")
+        out_jsonl = gr.File(label="ดาวน์โหลด JSONL")
+
+    # Toggle visibility for task-specific controls
+    def _switch_task(t: str):
+        is_cls = t == "Classification"
+        is_tr = t == "Translation"
+        is_mcq = t == "MCQ"
+        is_ner = t == "NER"
+        return (
+            gr.update(visible=is_cls),  # classification_labels
+            gr.update(visible=is_cls),  # multi_label
+            gr.update(visible=is_tr),   # target_language
+            gr.update(visible=is_mcq),  # num_options
+            gr.update(visible=is_ner),  # ner_labels
+        )
+
+    task.change(_switch_task, inputs=task, outputs=[classification_labels, multi_label, target_language, num_options, ner_labels])
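The `_switch_task` handler shows or hides the task-specific inputs by returning a `gr.update(visible=...)` for each control. A self-contained toy version of the same pattern (the component names here are illustrative, not the app's):

```python
import gradio as gr

with gr.Blocks() as toy:
    mode = gr.Dropdown(choices=["QA", "Translation"], value="QA", label="Task")
    target = gr.Textbox(label="Target language", visible=False)
    # Show the textbox only when "Translation" is selected.
    mode.change(lambda t: gr.update(visible=(t == "Translation")), inputs=mode, outputs=target)

# toy.launch()  # uncomment to try it locally
```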
 
     generate_btn.click(
         fn=generate_dataset,
         inputs=[
             user_state,
             pdf_files,
+            task,
             preset_model,
             custom_model_id,
             hf_token,
 
@@ -307,6 +692,11 @@ with gr.Blocks(title="AutoGDataset - PDF to QA Dataset (LangChain)") as demo:
             custom_instruction,
             min_pairs,
             max_pairs,
+            classification_labels,
+            multi_label,
+            target_language,
+            num_options,
+            ner_labels,
         ],
         outputs=[status, out_json, out_jsonl],
         show_progress=True,
 
@@ -321,9 +711,9 @@ with gr.Blocks(title="AutoGDataset - PDF to QA Dataset (LangChain)") as demo:
             username = user.get("username") or user.get("name")
             if not username and hasattr(user, "username"):
                 username = getattr(user, "username")
+            msg = f"เข้าสู่ระบบแล้วในนาม @{username}" if username else "เข้าสู่ระบบแล้ว"
         except Exception:
+            msg = "เข้าสู่ระบบแล้ว"
         return user, gr.update(value=msg), gr.update(interactive=True)
 
     # Enable Generate button after login and store user profile
 
@@ -331,7 +721,7 @@ with gr.Blocks(title="AutoGDataset - PDF to QA Dataset (LangChain)") as demo:
         login_btn.login(_on_login, inputs=None, outputs=[user_state, login_info, generate_btn])
     else:
         # In local/dev without OAuth routing, clicking will mock-login
-        login_btn.click(lambda: ("local_user", gr.update(value="
+        login_btn.click(lambda: ("local_user", gr.update(value="เข้าสู่ระบบแล้ว (ภายในเครื่อง)"), gr.update(interactive=True)), inputs=None, outputs=[user_state, login_info, generate_btn])
 
 if __name__ == "__main__":
     # For local runs
requirements.txt CHANGED

@@ -3,3 +3,11 @@ pypdf>=4.2.0
 huggingface_hub>=0.23.0
 langchain>=0.2.0
 langchain-community>=0.2.0
+# Thai language processing
+pythainlp>=5.0.0
+thai-word-segmentation>=0.1.0
+# Additional ML libraries for better text processing
+transformers>=4.30.0
+torch>=2.0.0
+# For better JSON parsing
+ujson>=5.8.0
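`pythainlp` is pulled in for Thai-specific preprocessing. A quick sanity check that the new dependency works, assuming the default dictionary-based tokenizer (the sample sentence is arbitrary):

```python
from pythainlp.tokenize import word_tokenize

# Thai text has no spaces between words, so tokenization needs a dictionary-based engine.
print(word_tokenize("เครื่องมือสร้างชุดข้อมูลภาษาไทย", engine="newmm"))
```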