Upload 59 files
This view is limited to 50 files because it contains too many changes. See the raw diff for the full change set.
- .gitattributes +1 -0
- README.md +99 -13
- app.py +563 -0
- data/.DS_Store +0 -0
- data/vector_store/.gitkeep +0 -0
- data/vector_store/5108300c-ddfb-4f8b-9c77-5f4004790ec3/data_level0.bin +3 -0
- data/vector_store/5108300c-ddfb-4f8b-9c77-5f4004790ec3/header.bin +3 -0
- data/vector_store/5108300c-ddfb-4f8b-9c77-5f4004790ec3/length.bin +3 -0
- data/vector_store/5108300c-ddfb-4f8b-9c77-5f4004790ec3/link_lists.bin +3 -0
- data/vector_store/chroma.sqlite3 +3 -0
- requirements.txt +21 -3
- src/__init__.py +0 -0
- src/processing/__init__.py +20 -0
- src/processing/__pycache__/__init__.cpython-312.pyc +0 -0
- src/processing/__pycache__/batch_processor.cpython-312.pyc +0 -0
- src/processing/__pycache__/chunk_documents.cpython-312.pyc +0 -0
- src/processing/__pycache__/cli.cpython-312.pyc +0 -0
- src/processing/__pycache__/document_processor.cpython-312.pyc +0 -0
- src/processing/__pycache__/metadata_utils.cpython-312.pyc +0 -0
- src/processing/__pycache__/scrape_to_metadata.cpython-312.pyc +0 -0
- src/processing/batch_processor.py +182 -0
- src/processing/chunk_documents.py +54 -0
- src/processing/document_processor.py +46 -0
- src/processing/metadata_utils.py +38 -0
- src/processing/scrape_to_metadata.py +55 -0
- src/qa/__init__.py +3 -0
- src/qa/__pycache__/__init__.cpython-312.pyc +0 -0
- src/qa/__pycache__/chain.cpython-312.pyc +0 -0
- src/qa/__pycache__/prompt.cpython-312.pyc +0 -0
- src/qa/chain.py +341 -0
- src/qa/prompt.py +51 -0
- src/scraping/__init__.py +15 -0
- src/scraping/__pycache__/__init__.cpython-312.pyc +0 -0
- src/scraping/__pycache__/batch.cpython-312.pyc +0 -0
- src/scraping/__pycache__/cli.cpython-312.pyc +0 -0
- src/scraping/__pycache__/convert.cpython-312.pyc +0 -0
- src/scraping/__pycache__/exceptions.cpython-312.pyc +0 -0
- src/scraping/__pycache__/extract.cpython-312.pyc +0 -0
- src/scraping/__pycache__/fetch.cpython-312.pyc +0 -0
- src/scraping/__pycache__/io.cpython-312.pyc +0 -0
- src/scraping/__pycache__/pipeline.cpython-312.pyc +0 -0
- src/scraping/__pycache__/textutil.cpython-312.pyc +0 -0
- src/scraping/batch.py +165 -0
- src/scraping/convert.py +4 -0
- src/scraping/exceptions.py +11 -0
- src/scraping/extract.py +55 -0
- src/scraping/fetch.py +39 -0
- src/scraping/io.py +30 -0
- src/scraping/pipeline.py +46 -0
- src/scraping/textutil.py +10 -0
.gitattributes
CHANGED

@@ -33,3 +33,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
 *.zip filter=lfs diff=lfs merge=lfs -text
 *.zst filter=lfs diff=lfs merge=lfs -text
 *tfevents* filter=lfs diff=lfs merge=lfs -text
+data/vector_store/chroma.sqlite3 filter=lfs diff=lfs merge=lfs -text
README.md
CHANGED

@@ -1,19 +1,105 @@ The placeholder front matter (empty `title:`, `emoji:`, `colorFrom:`, `colorTo:`, and `sdk:` fields) was replaced with the full Space metadata and README below (translated from Japanese):

---
title: EmpathemeBot
emoji: 🤖
colorFrom: indigo
colorTo: purple
sdk: streamlit
sdk_version: 1.32.0
app_file: app.py
pinned: false
license: mit
---

# 🤖 EmpathemeBot

**An English-learning support AI chatbot by KurageSan®**

## Overview

EmpathemeBot is an English-learning support bot built on a Potion-based question-answering system.
It keeps the conversation history and answers questions in context.

## Features

- 📚 **RAG-based question answering**: high-accuracy answers backed by a vector store
- 💬 **Conversation history**: context is preserved within a session
- 🎨 **Polished UI**: a LINE-style chat interface with speech bubbles
- 🔑 **API key management**: the OpenAI API key is handled securely

## Usage

1. **Set the API key**
   - Click the ">" button at the top left to open the sidebar
   - Enter your OpenAI API key (in the `sk-...` format)
   - Press Enter to save the setting

2. **Enter a question**
   - Type your question into the chat input at the bottom of the page
   - Press Enter to send it

3. **Start a new chat**
   - Click the "New chat" button in the sidebar

## Deploying to Hugging Face Spaces

This application is optimized to run on Hugging Face Spaces.

### Deployment steps

1. **Create a Hugging Face account**
   - Sign up at [Hugging Face](https://huggingface.co/)

2. **Create a new Space**
   - Click "New Space" in the Hugging Face dashboard
   - Enter a Space name (e.g. `empathemebot`)
   - Select "Streamlit" as the SDK
   - Choose the visibility (Public/Private)

3. **Upload the files**
   ```
   your-space/
   ├── app.py               # main application
   ├── requirements.txt     # dependencies (the contents of requirements_hf.txt)
   ├── README.md            # this file
   ├── src/                 # source code
   │   ├── qa/
   │   │   ├── chain.py
   │   │   └── prompt.py
   │   ├── vector/
   │   │   └── ...
   │   └── ...
   └── data/                # vector store (optional)
       └── vector_store/
   ```

4. **Set environment variables (optional)**
   - Set `OPENAI_API_KEY` under Settings → Repository secrets
   - Or let users enter the key directly in the UI

5. **Confirm the deployment**
   - The build starts automatically
   - Once the build finishes, the application is available

## Tech stack

- **Frontend**: Streamlit
- **LLM framework**: LangChain
- **Vector DB**: ChromaDB
- **LLM**: OpenAI GPT-4

## License

MIT License

## Developers

The Empatheme development team

## Support

If you run into problems, please let us know via [Issues](https://github.com/your-username/empathemebot/issues).

---

**Note**: An OpenAI API key is required to use this application.
You can obtain one from the [OpenAI dashboard](https://platform.openai.com/api-keys).
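The README's environment-variable step offers two ways to supply the key: a Repository secret (exposed as an environment variable) or direct input in the UI. The resolution order can be sketched as follows; this is a minimal sketch, and `resolve_api_key` and the placeholder key strings are hypothetical names for illustration — only `OPENAI_API_KEY` comes from the README:

```python
import os

# A key set as a Repository secret appears as the OPENAI_API_KEY
# environment variable; the UI input is used only as a fallback.
def resolve_api_key(ui_input: str = "") -> str:
    return os.getenv("OPENAI_API_KEY") or ui_input

os.environ.pop("OPENAI_API_KEY", None)
print(resolve_api_key("sk-from-ui"))   # no secret set: falls back to the UI value
os.environ["OPENAI_API_KEY"] = "sk-from-secret"
print(resolve_api_key("sk-from-ui"))   # the Repository secret wins
```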
app.py
ADDED
@@ -0,0 +1,563 @@ (comments and docstrings translated from Japanese; the five duplicated chat-bubble blocks are shown once via a `render_bubble` helper, with behavior unchanged)

"""
EmpathemeBot - unified Streamlit app for Hugging Face Spaces
"""

import html
import logging
import re
import time
import uuid
from datetime import datetime
from typing import List, Dict, Optional
from pathlib import Path
import sys

import streamlit as st
from dotenv import load_dotenv
import os

# Load environment variables
load_dotenv()

# Add the project root to sys.path so `src` imports resolve
sys.path.append(str(Path(__file__).parent))

# QA chain import
from src.qa.chain import QAChain

# Logging setup
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(name)s - %(levelname)s - %(message)s'
)
logger = logging.getLogger(__name__)

# Page configuration
st.set_page_config(
    page_title="EmpathemeBot QA System",
    page_icon="🤖",
    layout="wide",
    initial_sidebar_state="collapsed",
    menu_items={}
)


class EmpathemeBotUI:
    """Unified EmpathemeBot UI class for Hugging Face Spaces"""

    def __init__(self):
        # Initialize session state
        if 'session_id' not in st.session_state:
            st.session_state.session_id = str(uuid.uuid4())
        if 'messages' not in st.session_state:
            st.session_state.messages = []
        if 'qa_chain' not in st.session_state:
            st.session_state.qa_chain = None
        if 'api_key' not in st.session_state:
            # Try to read the key from Hugging Face Secrets
            st.session_state.api_key = os.getenv("OPENAI_API_KEY", "")
        if 'last_activity' not in st.session_state:
            st.session_state.last_activity = datetime.now()
        if 'vector_store_initialized' not in st.session_state:
            st.session_state.vector_store_initialized = False

    def initialize_qa_chain(self, api_key: str) -> bool:
        """
        Initialize the QAChain.

        Args:
            api_key: OpenAI API key

        Returns:
            True if initialization succeeded
        """
        try:
            # Expose the API key via the environment
            os.environ["OPENAI_API_KEY"] = api_key

            # Check the vector store path
            vector_store_path = Path("data/vector_store")

            # Initialize the QAChain
            if st.session_state.qa_chain is None:
                logger.info("Initializing QAChain...")

                # Fall back to demo mode when the vector store is missing
                if not vector_store_path.exists():
                    st.warning("ベクトルストアが見つかりません。デモモードで実行します。")
                    # Lightweight demo-mode implementation
                    st.session_state.qa_chain = self.create_demo_chain()
                else:
                    st.session_state.qa_chain = QAChain(
                        persist_dir=str(vector_store_path),
                        verbose=False,
                        max_history_turns=10,
                        max_history_chars=10000
                    )
                    st.session_state.vector_store_initialized = True
                logger.info("QAChain initialized")
            return True

        except Exception as e:
            logger.error(f"QAChain initialization error: {e}")
            st.error(f"初期化エラー: {str(e)}")
            return False

    def create_demo_chain(self):
        """
        Create a lightweight demo QA chain (no vector store).
        """
        class DemoQAChain:
            def __init__(self):
                self.conversation_history = []

            def ask_with_history(self, question: str):
                # Simple demo response
                self.conversation_history.append(f"Q: {question}")

                # Call the OpenAI API directly to generate a response
                try:
                    from langchain_openai import ChatOpenAI
                    from langchain.schema import HumanMessage, SystemMessage

                    llm = ChatOpenAI(
                        model_name="gpt-4o-mini",
                        temperature=0.7
                    )

                    messages = [
                        SystemMessage(content="あなたは英語学習をサポートするKurageSan®という親切なアシスタントです。"),
                        HumanMessage(content=question)
                    ]

                    response = llm.invoke(messages)
                    answer = response.content

                except Exception as e:
                    answer = f"申し訳ございません。現在デモモードで動作しており、詳細な回答ができません。エラー: {str(e)}"

                self.conversation_history.append(f"A: {answer}")
                return answer, []

            def clear_history(self):
                self.conversation_history = []

            def get_history(self):
                return "\n".join(self.conversation_history)

        return DemoQAChain()

    def ask_question(self, question: str) -> Optional[Dict]:
        """
        Process a question and return the answer.

        Args:
            question: the user's question

        Returns:
            Answer data
        """
        try:
            if st.session_state.qa_chain is None:
                st.error("システムが初期化されていません。")
                return None

            # Process the question
            logger.info(f"Processing question: {question[:100]}...")
            answer, source_docs = st.session_state.qa_chain.ask_with_history(question)

            # Extract source URLs (deduplicated)
            source_urls = []
            for doc in source_docs:
                url = doc.metadata.get('source_url', '')
                if url and url not in source_urls:
                    source_urls.append(url)

            result = {
                "answer": answer,
                "source_count": len(source_docs),
                "source_urls": source_urls
            }

            logger.info(f"Answer generated with {len(source_docs)} source references")
            return result

        except Exception as e:
            logger.error(f"Error: {e}")
            st.error(f"予期しないエラーが発生しました: {str(e)}")
            return None

    def clear_history(self):
        """Clear the conversation history"""
        try:
            if st.session_state.qa_chain:
                st.session_state.qa_chain.clear_history()
            st.session_state.messages = []
            st.success("会話履歴をクリアしました")
            logger.info("History cleared")
        except Exception as e:
            logger.error(f"History clear error: {e}")
            st.error("エラーが発生しました")

    def create_new_session(self):
        """Generate a new session ID"""
        st.session_state.session_id = str(uuid.uuid4())
        st.session_state.messages = []
        if st.session_state.qa_chain:
            st.session_state.qa_chain.clear_history()
        logger.info(f"New session created: {st.session_state.session_id}")


def render_bubble(content: str, timestamp: str, role: str):
    """Render a LINE-style chat bubble: user on the right, assistant on the left."""
    if role == "user":
        align = "flex-end"
        side_pad = "padding-right: 1rem"
        bubble = ("background: linear-gradient(135deg, #4F46E5 0%, #6366F1 100%); "
                  "color: white; border-radius: 18px 18px 4px 18px; "
                  "box-shadow: 0 2px 10px rgba(79, 70, 229, 0.2);")
        ts_style = "opacity: 0.8; text-align: right;"
    else:
        align = "flex-start"
        side_pad = "padding-left: 1rem"
        bubble = ("background: #F3F4F6; color: #111827; "
                  "border-radius: 18px 18px 18px 4px; "
                  "box-shadow: 0 2px 10px rgba(0, 0, 0, 0.08);")
        ts_style = "opacity: 0.6;"
    st.markdown(
        f"""
        <div style="display: flex; justify-content: {align}; margin: 1rem 0; {side_pad};">
            <div style="{bubble} padding: 0.75rem 1.25rem; max-width: 60%; word-wrap: break-word;">
                <pre class="bubble-content">{html.escape(content)}</pre>
                <div style="font-size: 0.7rem; margin-top: 0.3rem; {ts_style}">
                    {timestamp}
                </div>
            </div>
        </div>
        """,
        unsafe_allow_html=True
    )


def main():
    """Main entry point"""

    # Custom CSS
    st.markdown("""
    <style>
    /* Main container */
    .main {
        padding-top: 1rem;
        max-width: 1000px;
        margin: 0 auto;
    }

    .block-container {
        padding: 1rem 2rem;
        max-width: 100%;
    }

    /* Chat input */
    .stChatInput {
        border: none !important;
        box-shadow: none !important;
        position: fixed;
        bottom: 0;
        padding-bottom: 1rem;
        background: white !important;
        z-index: 999;
    }

    /* Chat input textarea */
    .stChatInput textarea {
        font-size: 14px;
        border: 1px solid #E5E7EB !important;
        border-radius: 8px !important;
        padding: 0.6rem 1rem !important;
        background: #FAFAFA !important;
        transition: all 0.2s ease;
    }

    .stChatInput textarea:focus {
        background: white !important;
        border-color: #4F46E5 !important;
        outline: none !important;
        box-shadow: 0 0 0 2px rgba(79, 70, 229, 0.1) !important;
    }

    /* Buttons */
    .stButton > button {
        background: #4F46E5;
        color: white;
        border: none;
        border-radius: 6px;
        padding: 0.5rem 1rem;
        font-weight: 500;
        font-size: 13px;
        transition: all 0.15s ease;
    }

    .stButton > button:hover {
        background: #4338CA;
    }

    /* Title */
    h1 {
        color: #111827;
        font-weight: 600;
        text-align: center;
        font-size: 1.75rem;
        margin-bottom: 0.5rem;
    }

    /* Sidebar */
    section[data-testid="stSidebar"] {
        background: #FAFAFB;
    }

    /* Bubble content */
    .bubble-content {
        font-family: inherit;
        font-size: inherit;
        white-space: pre-wrap;
        word-wrap: break-word;
        margin: 0;
        padding: 0;
        color: inherit;
    }
    </style>
    """, unsafe_allow_html=True)

    # Create the UI instance
    bot = EmpathemeBotUI()

    # Sidebar
    with st.sidebar:
        st.markdown("## 設定")

        # API key input
        st.markdown("### OpenAI API キー")
        api_key = st.text_input(
            "APIキーを入力(必須)",
            value=st.session_state.api_key,
            type="password",
            placeholder="sk-...",
            help="OpenAI APIキーを入力してください。このフィールドは必須です。"
        )

        # Update session state when the key changes
        if api_key != st.session_state.api_key:
            st.session_state.api_key = api_key
            if api_key:
                # Initialize the QAChain
                if bot.initialize_qa_chain(api_key):
                    st.success("✅ APIキーが設定されました")
                else:
                    st.error("❌ 初期化に失敗しました")
            else:
                st.warning("⚠️ APIキーが未入力です")

        st.markdown("---")

        # Control buttons
        st.markdown("### コントロール")
        if st.button("新しいチャット", use_container_width=True):
            bot.create_new_session()
            st.rerun()

        if st.button("履歴クリア", use_container_width=True):
            bot.clear_history()
            st.rerun()

        st.markdown("---")

        # Status
        st.markdown("### ステータス")
        if st.session_state.vector_store_initialized:
            st.success("システム準備完了")
        else:
            st.info("システム待機中")

    # Main header
    st.markdown(
        """
        <div style='text-align: center; margin-bottom: 2rem;'>
            <h1 style='margin-bottom: 0.25rem;'>🤖 EmpathemeBot</h1>
            <p style='color: #6B7280; font-size: 0.9rem;'>Potionベースの質問応答システム</p>
        </div>
        """,
        unsafe_allow_html=True
    )

    # Warning shown when no API key has been entered
    if not st.session_state.api_key:
        st.markdown(
            """
            <div style="background: #FEF3C7; border: 2px solid #F59E0B; border-radius: 12px; padding: 1.5rem; margin: 2rem 0;">
                <h3 style="color: #92400E; margin-top: 0;">APIキーの入力が必要です</h3>
                <p style="color: #78350F; margin-bottom: 1rem;">
                    EmpathemeBotを使用するには、OpenAI APIキーが必要です。
                </p>
                <ol style="color: #78350F; margin-left: 1.5rem;">
                    <li>左上の「>」ボタンをクリックしてサイドバーを開く</li>
                    <li>「OpenAI API キー」セクションにAPIキー(sk-...)を入力</li>
                    <li>Enterキーを押してAPIキーを設定</li>
                </ol>
                <p style="color: #78350F; font-size: 0.9rem; margin-top: 1rem;">
                    APIキーは <a href="https://platform.openai.com/api-keys" target="_blank" style="color: #F59E0B;">OpenAIのダッシュボード</a> から取得できます。
                </p>
            </div>
            """,
            unsafe_allow_html=True
        )
        st.stop()

    # API key present but QAChain not initialized yet
    if st.session_state.api_key and st.session_state.qa_chain is None:
        if bot.initialize_qa_chain(st.session_state.api_key):
            st.rerun()

    # Welcome message (first run only)
    if len(st.session_state.messages) == 0:
        st.markdown(
            """
            <div style="text-align: center; padding: 3rem 0; color: #6B7280;">
                <p style="font-size: 0.95rem;">こんにちは、KurageSan®だよ!何か英語学習に関して困っていることはありますか?</p>
            </div>
            """,
            unsafe_allow_html=True
        )

    # Render the chat history
    for message in st.session_state.messages:
        render_bubble(message["content"], message.get("timestamp", ""), message["role"])

    # Chat input
    if prompt := st.chat_input("質問を入力してください...", key="chat_input"):
        # Timestamp
        timestamp = datetime.now().strftime("%H:%M")

        # Append and render the user message
        st.session_state.messages.append({
            "role": "user",
            "content": prompt,
            "timestamp": timestamp
        })
        render_bubble(prompt, timestamp, "user")

        # Generate the assistant response
        with st.spinner("考えています..."):
            response_timestamp = datetime.now().strftime("%H:%M")
            response_data = bot.ask_question(prompt)

        if response_data:
            answer = response_data["answer"]

            # Append to the message history
            st.session_state.messages.append({
                "role": "assistant",
                "content": answer,
                "timestamp": response_timestamp,
                "metadata": {
                    "source_count": response_data.get("source_count", 0)
                }
            })
            render_bubble(answer, response_timestamp, "assistant")
        else:
            # Error case
            error_message = "申し訳ございません。回答の生成に失敗しました。もう一度お試しください。"
            st.session_state.messages.append({
                "role": "assistant",
                "content": error_message,
                "timestamp": response_timestamp
            })
            render_bubble(error_message, response_timestamp, "assistant")

        # Update the activity timestamp
        st.session_state.last_activity = datetime.now()

    # Footer
    st.markdown(
        f"""
        <div style="text-align: center; margin-top: 3rem; padding: 1rem 0;
                    border-top: 1px solid #E5E7EB; color: #9CA3AF; font-size: 0.8rem;">
            EmpathemeBot · セッション: {st.session_state.session_id[:8]}
        </div>
        """,
        unsafe_allow_html=True
    )


if __name__ == "__main__":
    main()
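The source-URL handling in `ask_question` in app.py above is an order-preserving dedupe over document metadata. The same logic in isolation, using plain dicts in place of LangChain document metadata (the function name and sample URLs are illustrative, not from the repository):

```python
def collect_source_urls(metadatas):
    """Keep the first occurrence of each non-empty source_url, in order."""
    urls = []
    for meta in metadatas:
        url = meta.get("source_url", "")
        if url and url not in urls:
            urls.append(url)
    return urls

docs = [
    {"source_url": "https://a.example"},
    {"source_url": ""},                   # missing/empty URLs are skipped
    {"source_url": "https://b.example"},
    {"source_url": "https://a.example"},  # duplicate is dropped
]
print(collect_source_urls(docs))  # ['https://a.example', 'https://b.example']
```

The `url not in urls` membership test is linear, which is fine for the handful of retrieved documents per question; a set-backed variant would only matter at much larger scales.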
data/.DS_Store
ADDED
Binary file (6.15 kB).

data/vector_store/.gitkeep
ADDED
File without changes.
data/vector_store/5108300c-ddfb-4f8b-9c77-5f4004790ec3/data_level0.bin
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:bb76ae0f6ca830a9a048b0cf53962a78c88c5c7fcda63fc846077d3456eb3890
+size 62840000

data/vector_store/5108300c-ddfb-4f8b-9c77-5f4004790ec3/header.bin
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:ec666c9828420c69fc6b597461d8c18487becec1527c7d1cff9b898cbb393c2d
+size 100

data/vector_store/5108300c-ddfb-4f8b-9c77-5f4004790ec3/length.bin
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:fbeb5cb0c36f6258be7779f5e16bfa212d9750a0183f6eaf196473ad5293babc
+size 40000

data/vector_store/5108300c-ddfb-4f8b-9c77-5f4004790ec3/link_lists.bin
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855
+size 0

data/vector_store/chroma.sqlite3
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:cabab722f170e66bfea7089f24a34969058525cb56bbcfbe7a8cbb63f23fb518
+size 5783552
requirements.txt
CHANGED

@@ -1,3 +1,21 @@
+# Requirements for Hugging Face Spaces deployment
+# FastAPI and uvicorn removed since we're using unified Streamlit app
+
+aiohttp>=3.9
+beautifulsoup4>=4.12
+chromadb>=0.5.0
+langchain==0.3.7
+langchain-chroma>=0.1.0
+langchain-community==0.3.5
+langchain-core
+langchain-openai==0.2.5
+langchain-text-splitters>=0.0.1
+langgraph>0.2.27
+lxml>=5.2
+markdownify>=0.13
+numpy>=1.24.0
+python-dotenv
+readability-lxml>=0.8
+requests>=2.31.0
+streamlit>=1.32.0
+tqdm>=4.66
src/__init__.py
ADDED
File without changes.

src/processing/__init__.py
ADDED
@@ -0,0 +1,20 @@
+"""RAG (Retrieval-Augmented Generation) System
+
+This module provides tools for building and managing RAG systems,
+including document processing, metadata generation, and batch processing.
+"""
+
+__version__ = "0.1.0"
+
+from src.processing.batch_processor import BatchProcessor
+from src.processing.document_processor import DocumentProcessor
+from src.processing.metadata_utils import generate_file_name, extract_title
+from src.processing.scrape_to_metadata import ScrapeToMetadata
+
+__all__ = [
+    "BatchProcessor",
+    "DocumentProcessor",
+    "ScrapeToMetadata",
+    "extract_title",
+    "generate_file_name",
+]
src/processing/__pycache__/__init__.cpython-312.pyc
ADDED
Binary file (810 Bytes).

src/processing/__pycache__/batch_processor.cpython-312.pyc
ADDED
Binary file (7.81 kB).

src/processing/__pycache__/chunk_documents.cpython-312.pyc
ADDED
Binary file (2.95 kB).

src/processing/__pycache__/cli.cpython-312.pyc
ADDED
Binary file (4.89 kB).

src/processing/__pycache__/document_processor.cpython-312.pyc
ADDED
Binary file (2.04 kB).

src/processing/__pycache__/metadata_utils.cpython-312.pyc
ADDED
Binary file (1.85 kB).
src/processing/batch_processor.py
ADDED
@@ -0,0 +1,182 @@
"""Batch processing and progress management"""

import asyncio
import json
import logging
from pathlib import Path
from typing import List, Dict, Any, Optional

from tqdm import tqdm

from src.processing.document_processor import DocumentProcessor
from src.scraping.exceptions import ArticleNotFoundError, FetchError

logger = logging.getLogger(__name__)

class BatchProcessor:
    """Batch processing and progress management"""

    def __init__(self, wait_time: float = 1.0):
        """
        Args:
            wait_time: Wait time between requests (seconds)
        """
        self.wait_time = wait_time
        self.document_processor = DocumentProcessor()

    async def process_urls_batch(
        self,
        urls: List[str],
        start_id: int = 1,
        mode: str = "memory",
        show_progress: bool = True,
        save_dir: Optional[Path] = None,
        verbose: bool = False
    ) -> List[Dict[str, Any]]:
        """
        Batch-process multiple URLs and generate metadata

        Args:
            urls: List of URLs to process
            start_id: Starting ID
            mode: "memory" or "save"
            show_progress: Whether to show a progress bar
            save_dir: Destination directory for Markdown files in save mode
            verbose: Whether to emit detailed logs

        Returns:
            List of generated document metadata
        """
        documents = []
        success_count = 0
        skip_count = 0
        fail_count = 0

        total = len(urls)
        end_id = start_id + total - 1

        # In save mode, set up the destination directory
        if mode == "save":
            if save_dir is None:
                save_dir = Path("data/raw")
            save_dir.mkdir(parents=True, exist_ok=True)
            logger.info(f"Save directory: {save_dir}")

        logger.info(f"Scraping started: IDs {start_id} to {end_id} ({total} items)")
        logger.info(f"Mode: {'in-memory' if mode == 'memory' else 'file save'}")
        logger.info(f"Wait time: {self.wait_time}s\n")

        # Create the progress bar (updated on a single line)
        pbar = None
        if show_progress:
            pbar = tqdm(
                total=total,
                desc="Processing",
                leave=True,
                ncols=80,
                bar_format='{desc}: {percentage:3.0f}%|{bar}| {n_fmt}/{total_fmt} [{elapsed}<{remaining}, {postfix}]'
            )

        try:
            for i, url in enumerate(urls):
                current_id = start_id + i

                try:
                    # In save mode, pass save_dir through
                    if mode == "save":
                        document = await self.document_processor.process_url(url, current_id, save_dir)
                    else:
                        document = await self.document_processor.process_url(url, current_id)

                    documents.append(document)
                    success_count += 1

                    # In verbose mode, also emit detailed logs
                    if verbose:
                        # Temporarily clear the progress bar to show details
                        if pbar:
                            pbar.clear()
                        if mode == "save":
                            logger.info(f"  ✓ {url}: saved ({document['metadata']['file_name']})")
                        else:
                            logger.info(f"  ✓ {url}: processed ({document['metadata']['file_name']})")
                        if pbar:
                            pbar.refresh()

                except ArticleNotFoundError:
                    skip_count += 1

                    # In verbose mode, also emit detailed logs
                    if verbose:
                        if pbar:
                            pbar.clear()
                        logger.warning(f"  ⊘ {url}: article not found")
                        if pbar:
                            pbar.refresh()

                except FetchError as e:
                    fail_count += 1

                    # In verbose mode, also emit detailed logs
                    if verbose:
                        if pbar:
                            pbar.clear()
                        logger.error(f"  ✗ {url}: fetch error: {str(e)}")
                        if pbar:
                            pbar.refresh()

                except Exception as e:
                    fail_count += 1

                    # In verbose mode, also emit detailed logs
                    if verbose:
                        if pbar:
                            pbar.clear()
                        logger.error(f"  ✗ {url}: error: {str(e)}")
                        if pbar:
                            pbar.refresh()

                # Update the progress bar
                if pbar:
                    pbar.set_postfix({
                        'ok': success_count,
                        'skipped': skip_count,
                        'failed': fail_count
                    })
                    pbar.update(1)

                # Wait before the next request (no wait after the last URL)
                if i < len(urls) - 1:
                    await asyncio.sleep(self.wait_time)
        finally:
            if pbar:
                pbar.close()

        # Print a summary
        logger.info("\n" + "=" * 50)
        logger.info("Processing summary")
        logger.info("=" * 50)
        logger.info(f"Total: {total}")
        logger.info(f"Succeeded: {success_count}")
        logger.info(f"Skipped (no article): {skip_count}")
        logger.info(f"Failed: {fail_count}")

        if mode == "save" and success_count > 0:
            logger.info(f"\nMarkdown files saved to: {save_dir}")

        return documents

    def save_metadata(self, documents: List[Dict[str, Any]], output_path: Path):
        """Save metadata as JSON

        Args:
            documents: List of document metadata to save
            output_path: Output file path
        """
        output_path.parent.mkdir(parents=True, exist_ok=True)

        with open(output_path, 'w', encoding='utf-8') as f:
            json.dump(documents, f, ensure_ascii=False, indent=2)

        logger.info(f"\nSaved metadata to: {output_path}")
        logger.info(f"Saved {len(documents)} documents")
src/processing/chunk_documents.py
ADDED
@@ -0,0 +1,54 @@
import argparse
import json
import logging

from langchain_core.documents import Document
from langchain_text_splitters import CharacterTextSplitter

logger = logging.getLogger(__name__)

def chunk_documents(input_file_path: str, output_file_path: str):
    """
    Load a JSON file, split its documents into chunks, and save them to a new JSON file.
    """
    # Initialize the text splitter
    text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=50)

    # Load documents from the input JSON file
    with open(input_file_path, 'r', encoding='utf-8') as f:
        documents_data = json.load(f)

    chunked_documents = []
    for doc_data in documents_data:
        page_content = doc_data.get("page_content", "")
        metadata = doc_data.get("metadata", {})

        # Split page_content into chunks
        splits_texts = text_splitter.split_text(page_content)

        # Create a new document chunk for each split
        for i, text in enumerate(splits_texts):
            chunk_metadata = metadata.copy()
            chunk_metadata['chunk_index'] = i

            chunked_doc = {
                "metadata": chunk_metadata,
                "page_content": text
            }
            chunked_documents.append(chunked_doc)

    # Save the chunked documents to the output JSON file
    with open(output_file_path, 'w', encoding='utf-8') as f:
        json.dump(chunked_documents, f, ensure_ascii=False, indent=2)

    logger.info(f"Successfully split {len(documents_data)} documents into {len(chunked_documents)} chunks.")
    logger.info(f"Output saved to {output_file_path}")

if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Chunk documents from a JSON file.")
    parser.add_argument("input_file", help="Path to the input JSON file (e.g. documents.json).")
    parser.add_argument("output_file", help="Path to the output JSON file (e.g. chunks.json).")

    args = parser.parse_args()

    chunk_documents(args.input_file, args.output_file)
src/processing/document_processor.py
ADDED
@@ -0,0 +1,46 @@
"""Scrape a single URL and generate its metadata"""

from pathlib import Path
from typing import Dict, Any, Optional

from src.processing.metadata_utils import extract_title, generate_file_name
from src.scraping import io
from src.scraping import pipeline as scraping_pipeline

class DocumentProcessor:
    """Scrape a single URL and generate its metadata"""

    async def process_url(self, url: str, index: int, save_dir: Optional[Path] = None) -> Dict[str, Any]:
        """
        Generate metadata directly from a URL.
        If save_dir is given, also save the content as a Markdown file.

        Args:
            url: URL to process
            index: Index number of the URL
            save_dir: Destination directory for the Markdown file (None to skip saving)

        Returns:
            Document metadata in LangChain format
        """
        # Scrape via the centralized pipeline function
        md_content = await scraping_pipeline.to_markdown_from_url(url)

        # Save to a file if save_dir is given
        if save_dir:
            saved_path = io.save_markdown(md_content, save_dir, url)
            file_name = saved_path.name
        else:
            file_name = generate_file_name(url, index)

        # Build the metadata (LangChain format)
        document = {
            "metadata": {
                "title": extract_title(md_content),
                "file_name": file_name,
                "source_url": url,
            },
            "page_content": md_content
        }

        return document
src/processing/metadata_utils.py
ADDED
@@ -0,0 +1,38 @@
"""Utility functions for metadata generation"""

from urllib.parse import urlparse


def generate_file_name(url: str, index: int) -> str:
    """Generate a file name from a URL (potion_XXX.md format)

    Args:
        url: URL to process
        index: Index number of the URL

    Returns:
        Generated file name (e.g. potion_001.md)
    """
    # Extract an ID from the URL, or fall back to the index
    url_parts = urlparse(url).path.strip('/').split('/')
    if url_parts and url_parts[-1].isdigit():
        doc_id = url_parts[-1].zfill(3)
    else:
        doc_id = str(index).zfill(3)
    return f"potion_{doc_id}.md"


def extract_title(content: str) -> str:
    """Extract the title from Markdown content

    Args:
        content: Markdown content

    Returns:
        Extracted title ("Untitled" if none is found)
    """
    lines = content.split('\n')
    for line in lines:
        if line.startswith('# '):
            return line.replace('# ', '').strip()
    return "Untitled"
src/processing/scrape_to_metadata.py
ADDED
@@ -0,0 +1,55 @@
from pathlib import Path
from typing import List, Dict, Any, Optional

from src.processing.batch_processor import BatchProcessor

class ScrapeToMetadata:
    """Facade class that handles scraping through metadata generation end to end"""

    def __init__(self, wait_time: float = 1.0):
        """
        Args:
            wait_time: Wait time between requests (seconds)
        """
        self.batch_processor = BatchProcessor(wait_time)

    async def process_urls_batch(
        self,
        urls: List[str],
        start_id: int = 1,
        mode: str = "memory",
        show_progress: bool = True,
        save_dir: Optional[Path] = None,
        verbose: bool = False
    ) -> List[Dict[str, Any]]:
        """
        Batch-process multiple URLs and generate metadata

        Args:
            urls: List of URLs to process
            start_id: Starting ID
            mode: "memory" or "save"
            show_progress: Whether to show a progress bar
            save_dir: Destination directory for Markdown files in save mode
            verbose: Whether to emit detailed logs

        Returns:
            List of generated document metadata
        """
        return await self.batch_processor.process_urls_batch(
            urls=urls,
            start_id=start_id,
            mode=mode,
            show_progress=show_progress,
            save_dir=save_dir,
            verbose=verbose
        )

    def save_metadata(self, documents: List[Dict[str, Any]], output_path: Path):
        """Save metadata as JSON

        Args:
            documents: List of document metadata to save
            output_path: Output file path
        """
        self.batch_processor.save_metadata(documents, output_path)
src/qa/__init__.py
ADDED
@@ -0,0 +1,3 @@
from .chain import QAChain, ask_question

__all__ = ['QAChain', 'ask_question']
src/qa/__pycache__/__init__.cpython-312.pyc
ADDED
Binary file (295 Bytes).

src/qa/__pycache__/chain.cpython-312.pyc
ADDED
Binary file (17.8 kB).

src/qa/__pycache__/prompt.cpython-312.pyc
ADDED
Binary file (2.24 kB).
src/qa/chain.py
ADDED
@@ -0,0 +1,341 @@
"""
QA Chain module for RAG-based question answering
"""

import os
import logging
from pathlib import Path
from typing import List, Optional, Tuple

from dotenv import load_dotenv
from langchain_chroma import Chroma
from langchain_core.documents import Document
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnableLambda, RunnablePassthrough
from langchain_openai import OpenAIEmbeddings, ChatOpenAI

from .prompt import QA_TEMPLATE, CHARACTER_TEMPLATE, QA_TEMPLATE_WITH_HISTORY

logger = logging.getLogger(__name__)


class QAChain:
    """QA chain for RAG-based question answering"""

    def __init__(
        self,
        persist_dir: str = "data/vector_store",
        model_name: str = "text-embedding-3-small",
        k: int = 5,
        verbose: bool = False,
        llm_model: str = "gpt-5-nano",
        llm_temperature: float = 0.3,
        llm_max_tokens: Optional[int] = None,
        max_history_turns: int = 10,
        max_history_chars: int = 10000
    ):
        """
        Initialize the QA chain using a persisted vector store

        Args:
            persist_dir: Directory where the vector store is persisted
            model_name: Embedding model name
            k: Number of documents to retrieve
            verbose: Emit detailed logs
            llm_model: LLM model name to use
            llm_temperature: LLM temperature parameter (0-2)
            llm_max_tokens: Maximum number of tokens (None for automatic)
            max_history_turns: Maximum number of conversation turns to keep (default: 10)
            max_history_chars: Maximum number of characters of history (default: 10000)
        """
        self.persist_dir = persist_dir
        self.model_name = model_name
        self.k = k
        self.verbose = verbose
        self.llm_model = llm_model
        self.llm_temperature = llm_temperature
        self.llm_max_tokens = llm_max_tokens
        self.max_history_turns = max_history_turns
        self.max_history_chars = max_history_chars
        self.conversation_history = []  # managed as a list

        # Load environment variables (more robust implementation)
        try:
            load_dotenv(dotenv_path=".env")
            api_key = os.getenv("OPENAI_API_KEY")
            if not api_key:
                raise ValueError("OPENAI_API_KEY not found in the .env file")
        except FileNotFoundError:
            if self.verbose:
                logger.warning(".env file not found; falling back to environment variables.")
            api_key = os.getenv("OPENAI_API_KEY")
            if not api_key:
                raise ValueError("OPENAI_API_KEY is not set in the environment")
        except Exception as e:
            logger.error(f"Error while loading environment variables: {e}")
            raise

        self.db = None
        self.retriever = None
        self.rag_chain = None
        self.model_with_history = None
        self._setup_chain()

    def _load_vector_store(self) -> Chroma:
        """Load the persisted vector store"""
        persist_path = Path(self.persist_dir)
        if not persist_path.exists():
            raise FileNotFoundError(
                f"Vector store not found: {self.persist_dir}\n"
                "Build it first with: python -m src.cli vector build"
            )
        embeddings = OpenAIEmbeddings(model=self.model_name)
        db = Chroma(
            persist_directory=self.persist_dir,
            embedding_function=embeddings
        )
        if self.verbose:
            logger.info(f"Loaded vector store from {self.persist_dir}")
        return db

    @staticmethod
    def _format_docs_with_metadata(docs: List[Document]) -> str:
        """Format documents with their metadata for use as context"""
        return '\n\n'.join(
            f"[Source: {doc.metadata.get('title', 'Unknown Title')} - chunk {doc.metadata.get('chunk_index', 'N/A')}]"
            f"\nURL: {doc.metadata.get('source_url', 'Unknown Source')}"
            f"\nContent: {doc.page_content}\n---"
            for doc in docs
        )

    def _setup_chain(self):
        """Set up the whole RAG chain"""
        try:
            self.db = self._load_vector_store()
            self.retriever = self.db.as_retriever(
                search_type='similarity',
                search_kwargs={'k': self.k}
            )

            # Build the LLM parameters dynamically
            llm_params = {
                'model': self.llm_model,
            }
            if self.llm_model == 'gpt-5-nano':
                llm_params['temperature'] = 1.0
                llm_params['reasoning_effort'] = 'minimal'
                llm_params['verbosity'] = 'low'
            else:
                # For other models, use the requested temperature
                if self.llm_temperature != 0.3:
                    llm_params['temperature'] = self.llm_temperature

            if self.llm_max_tokens is not None:
                llm_params['max_tokens'] = self.llm_max_tokens

            model = ChatOpenAI(**llm_params)
            format_docs = RunnableLambda(self._format_docs_with_metadata)
            qa_prompt = ChatPromptTemplate.from_template(QA_TEMPLATE)

            # Plain RAG chain
            self.rag_chain = {
                'context': self.retriever | format_docs,
                'question': RunnablePassthrough(),
            } | qa_prompt | model | StrOutputParser()

            # Keep references to the model
            self.model = model
            self.model_with_history = model
            self.format_docs_with_history = format_docs

            if self.verbose:
                logger.info("Created the RAG chain using the persisted vector store!")
        except Exception as e:
            logger.error(f"Error while setting up the QA chain: {e}")
            raise

    def ask(self, question: str) -> Tuple[str, List[Document]]:
        """
        Ask a question and get the answer together with the source documents
        """
        if not self.model:
            raise RuntimeError("The model is not initialized correctly")
        try:
            # Retrieve and format documents
            source_docs = self.retriever.invoke(question)
            context = self._format_docs_with_metadata(source_docs)

            # Build a deduplicated list of URLs
            source_urls = []
            for doc in source_docs:
                url = doc.metadata.get('source_url', '')
                if url and url not in source_urls:
                    source_urls.append(url)
            urls_text = '\n'.join(f"- {url}" for url in source_urls)

            # Build the prompt and run it
            prompt_input = {
                'context': context,
                'question': question,
                'source_urls': urls_text
            }
            qa_prompt = ChatPromptTemplate.from_template(QA_TEMPLATE)

            # Apply the prompt template and generate the answer
            messages = qa_prompt.invoke(prompt_input)
            answer = self.model.invoke(messages).content

            return answer, source_docs
        except Exception as e:
            logger.error(f"Error while handling the question: {e}")
            raise

    def _manage_history_window(self):
        """
        Manage the history with a sliding window,
        honoring both the maximum turn count and the maximum character count
        """
        # Turn-count limit
        if len(self.conversation_history) > self.max_history_turns:
            self.conversation_history = self.conversation_history[-self.max_history_turns:]

        # Character-count limit (drop the oldest turns first)
        total_chars = sum(len(turn) for turn in self.conversation_history)
        while total_chars > self.max_history_chars and len(self.conversation_history) > 1:
            removed = self.conversation_history.pop(0)
            total_chars -= len(removed)
            if self.verbose:
                logger.info(f"History exceeded the limit; dropped an old turn ({len(removed)} characters)")

    def _format_history_text(self) -> str:
        """
        Format the conversation-history list as a string
        """
        if not self.conversation_history:
            return "No conversation history yet"
        return "\n".join(self.conversation_history)

    def ask_with_history(self, question: str, retry_count: int = 0) -> Tuple[str, List[Document]]:
        """
        Answer a question taking the conversation history into account (with sliding window)

        Args:
            question: The question
            retry_count: Retry count (internal use)
        """
        if not self.model_with_history:
            raise RuntimeError("The history-aware model is not initialized correctly")

        try:
            # Convert the history to a string
            history_text = self._format_history_text()

            # Retrieve and format documents
            source_docs = self.retriever.invoke(question)
            context = self._format_docs_with_metadata(source_docs)

            # Build a deduplicated list of URLs
            source_urls = []
            for doc in source_docs:
                url = doc.metadata.get('source_url', '')
                if url and url not in source_urls:
                    source_urls.append(url)
            urls_text = '\n'.join(f"- {url}" for url in source_urls)

            # Build the prompt manually and run it
            prompt_input = {
                'context': context,
                'question': question,
                'conversation_history': history_text,
                'source_urls': urls_text
            }
            qa_prompt_with_history = ChatPromptTemplate.from_messages([
                ("system", CHARACTER_TEMPLATE),
                ("system", QA_TEMPLATE_WITH_HISTORY),
            ])

            # Apply the prompt template and generate the answer
            messages = qa_prompt_with_history.invoke(prompt_input)
            answer = self.model_with_history.invoke(messages).content

            # Append the new turn to the history
            new_turn = f"User: {question}\nKurageSan®: {answer}"
            self.conversation_history.append(new_turn)

            # Apply the sliding window
            self._manage_history_window()

            if self.verbose:
                total_chars = sum(len(turn) for turn in self.conversation_history)
                logger.info(f"Handled question with history. Turns: {len(self.conversation_history)}, total characters: {total_chars}")

            return answer, source_docs

        except Exception as e:
            # Handle token-limit errors
            error_message = str(e).lower()
            if retry_count < 2 and ('maximum context length' in error_message or
                                    'token' in error_message and 'limit' in error_message):
                logger.warning(f"Hit a token-limit error; trimming the history and retrying (attempt {retry_count + 1})")

                # Cut the history in half
                if len(self.conversation_history) > 1:
                    old_size = len(self.conversation_history)
                    self.conversation_history = self.conversation_history[old_size//2:]
                    logger.info(f"Trimmed conversation history: {old_size} -> {len(self.conversation_history)} turns")
                else:
                    # Clear the history entirely if there is at most one turn
                    self.conversation_history = []
                    logger.info("Cleared the conversation history completely")

                # Retry
                return self.ask_with_history(question, retry_count + 1)

            logger.error(f"Error while handling the question with history: {e}")
            raise

    def clear_history(self):
        """
        Clear the conversation history
        """
        self.conversation_history = []
        if self.verbose:
            logger.info("Cleared the conversation history")

    def get_history(self) -> str:
        """
        Get the current conversation history (for debugging)
        """
        return self._format_history_text()

    def search_similar(self, query: str, k: int = 5) -> List[Document]:
        """
        Search for similar documents
        """
        if not self.retriever:
            raise RuntimeError("The retriever is not initialized correctly")
        self.retriever.search_kwargs['k'] = k
        return self.retriever.invoke(query)


def ask_question(question: str, persist_dir: str = "data/vector_store") -> None:
    """
    Ask a question and print the answer and its sources (backward-compatibility helper)
    """
    try:
        qa_chain = QAChain(persist_dir=persist_dir, verbose=True)
        answer, source_docs = qa_chain.ask(question)

        print(f"\n{'='*50}")
        print(f"Question: {question}")
        print(f"{'='*50}")
        print(f"\nAnswer:\n{answer}")

        print(f"\nSource documents ({len(source_docs)} items):")
        for i, doc in enumerate(source_docs, 1):
            print(f"\n{i}. {doc.metadata.get('title', 'No Title')}")
            print(f"   URL: {doc.metadata.get('source_url', 'N/A')}")
            print(f"   Chunk: {doc.metadata.get('chunk_index', 'N/A')}")
    except Exception as e:
        print(f"Error: {e}")
src/qa/prompt.py
ADDED
|
@@ -0,0 +1,51 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
QA_TEMPLATE = '''
Using the context below, give an accurate and helpful answer to the question.

[Reference URLs]
The information comes from the following URLs. Whenever you cite information in your answer, always state which of these URLs it came from:
{source_urls}

Context:
{context}

Question: {question}

Answer:
'''

CHARACTER_TEMPLATE = '''
Answer in a friendly tone, not in formal honorific language. Never use honorifics.
You are friendly and kind.
Your name is "KurageSan®".
Always answer clearly. Do not use emoji.
Write in a way that is easy for the user to read.
Your role is to answer questions about English pronunciation, practice, English learning, and attitudes toward English learning, based on the information at "https://ja.empatheme.org/potion/".

[Your conversation style]
- Try to understand the user's learning situation and background
- Pick up on the real needs or difficulties behind a question
- Stay empathetic and encouraging
- Aim for practical, concrete advice
- Do not create practice menus

[Strict style rules]
- Output is limited to whichever is shorter: at most 5 sentences or 150 characters
- Bullet lists: at most 3 items
- No unnecessary headings or long-winded introductions
'''

QA_TEMPLATE_WITH_HISTORY = '''
Using the context below, give an accurate and helpful answer to the question.
When relevant, include the URL of the most important source document in your answer.
If attaching it would help the user, include a link to the relevant potion.

Context:
{context}

Question: {question}

Conversation history:
{conversation_history}

Answer:
'''
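These templates are plain `str.format` templates; the QA chain is what fills the `{source_urls}`, `{context}`, and `{question}` placeholders at query time. A minimal sketch of that substitution, with a simplified English stand-in template and made-up values:

```python
# Simplified stand-in for QA_TEMPLATE; the values below are illustrative only.
QA_TEMPLATE = (
    "Using the context below, answer the question.\n"
    "References:\n{source_urls}\n\n"
    "Context:\n{context}\n\n"
    "Question: {question}\n\nAnswer:"
)

prompt = QA_TEMPLATE.format(
    source_urls="- https://ja.empatheme.org/potion/108/",
    context="Shadowing means repeating speech aloud as you hear it.",
    question="What is shadowing?",
)
print(prompt)
```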
src/scraping/__init__.py
ADDED
@@ -0,0 +1,15 @@
from src.scraping.convert import to_markdown
from src.scraping.extract import extract_main_html
from src.scraping.fetch import fetch_html
from src.scraping.io import save_markdown
from src.scraping.pipeline import run
from src.scraping.textutil import compact_blank_lines

__all__ = [
    "fetch_html",
    "extract_main_html",
    "to_markdown",
    "compact_blank_lines",
    "save_markdown",
    "run",
]
src/scraping/__pycache__/__init__.cpython-312.pyc
ADDED
Binary file (603 Bytes)

src/scraping/__pycache__/batch.cpython-312.pyc
ADDED
Binary file (7.68 kB)

src/scraping/__pycache__/cli.cpython-312.pyc
ADDED
Binary file (5.78 kB)

src/scraping/__pycache__/convert.cpython-312.pyc
ADDED
Binary file (451 Bytes)

src/scraping/__pycache__/exceptions.cpython-312.pyc
ADDED
Binary file (741 Bytes)

src/scraping/__pycache__/extract.cpython-312.pyc
ADDED
Binary file (2.59 kB)

src/scraping/__pycache__/fetch.cpython-312.pyc
ADDED
Binary file (2.72 kB)

src/scraping/__pycache__/io.cpython-312.pyc
ADDED
Binary file (1.73 kB)

src/scraping/__pycache__/pipeline.cpython-312.pyc
ADDED
Binary file (2.17 kB)

src/scraping/__pycache__/textutil.cpython-312.pyc
ADDED
Binary file (815 Bytes)
src/scraping/batch.py
ADDED
@@ -0,0 +1,165 @@
"""
Batch scraping module.
"""

import asyncio
import logging
from enum import Enum
from pathlib import Path
from typing import List, Tuple

from tqdm import tqdm

from src.scraping.exceptions import ArticleNotFoundError, FetchError
from src.scraping.pipeline import run as run_pipeline

# Logger setup
logging.basicConfig(level=logging.INFO, format='%(message)s')
logger = logging.getLogger(__name__)


class ScrapeStatus(Enum):
    """Status of a scrape attempt."""
    SUCCESS = "success"
    SKIPPED = "skipped"  # the article does not exist
    FAILED = "failed"    # any other error


async def scrape_single_page(url: str, out_dir: Path) -> Tuple[str, ScrapeStatus, str]:
    """
    Scrape a single page.

    Returns:
        A (url, status, message) tuple.
    """
    try:
        path = await run_pipeline(url, out_dir)
        return (url, ScrapeStatus.SUCCESS, f"Saved to: {path}")
    except ArticleNotFoundError:
        return (url, ScrapeStatus.SKIPPED, "Article not found")
    except FetchError as e:
        return (url, ScrapeStatus.FAILED, f"Fetch error: {str(e)}")
    except Exception as e:
        return (url, ScrapeStatus.FAILED, f"Error: {str(e)}")


async def batch_scrape(
    start_id: int,
    end_id: int,
    out_dir: Path,
    delay: float = 1.0,
    base_url: str = "https://ja.empatheme.org/potion",
    verbose: bool = False
) -> List[Tuple[str, ScrapeStatus, str]]:
    """
    Run a batch scrape over the given ID range.

    Args:
        start_id: First ID.
        end_id: Last ID (inclusive).
        out_dir: Output directory.
        delay: Wait time between requests, in seconds.
        base_url: Base URL.
        verbose: Whether to print per-URL details.

    Returns:
        A list of per-URL results.
    """
    results = []
    total = end_id - start_id + 1

    logger.info(f"Starting scrape: IDs {start_id} to {end_id} ({total} pages)")
    logger.info(f"Output directory: {out_dir}")
    logger.info(f"Delay: {delay}s\n")

    # Initialize counters
    success_count = 0
    skipped_count = 0
    failed_count = 0

    # Create the progress bar (updated on a single line)
    pbar = tqdm(
        total=total,
        desc="Processing",
        leave=True,
        ncols=80,
        bar_format='{desc}: {percentage:3.0f}%|{bar}| {n_fmt}/{total_fmt} [{elapsed}<{remaining}, {postfix}]'
    )

    try:
        for page_id in range(start_id, end_id + 1):
            url = f"{base_url}/{page_id:03d}/"

            # Run the scrape
            result = await scrape_single_page(url, out_dir)
            results.append(result)

            # Update counters
            url, status, message = result
            if status == ScrapeStatus.SUCCESS:
                success_count += 1
            elif status == ScrapeStatus.SKIPPED:
                skipped_count += 1
            else:  # FAILED
                failed_count += 1

            # Update the progress bar postfix
            pbar.set_postfix({
                'success': success_count,
                'skipped': skipped_count,
                'failed': failed_count
            })

            # In verbose mode, also print per-URL details
            if verbose:
                # Temporarily clear the progress bar to print the detail line
                pbar.clear()
                if status == ScrapeStatus.SUCCESS:
                    print(f"  ✓ {url}: {message}")
                elif status == ScrapeStatus.SKIPPED:
                    print(f"  ⊘ {url}: {message}")
                else:  # FAILED
                    print(f"  ✗ {url}: {message}")
                pbar.refresh()

            # Advance the progress bar
            pbar.update(1)

            # Wait unless this was the last page
            if page_id < end_id:
                await asyncio.sleep(delay)
    finally:
        pbar.close()

    return results


def print_summary(results: List[Tuple[str, ScrapeStatus, str]]) -> None:
    """Print a summary of the batch results."""
    total = len(results)
    success_count = sum(1 for _, status, _ in results if status == ScrapeStatus.SUCCESS)
    skipped_count = sum(1 for _, status, _ in results if status == ScrapeStatus.SKIPPED)
    failed_count = sum(1 for _, status, _ in results if status == ScrapeStatus.FAILED)

    logger.info("\n" + "="*50)
    logger.info("Batch summary")
    logger.info("="*50)
    logger.info(f"Total: {total}")
    logger.info(f"Succeeded: {success_count}")
    logger.info(f"Skipped (no article): {skipped_count}")
    logger.info(f"Failed: {failed_count}")

    # List skipped URLs (articles that do not exist)
    if skipped_count > 0:
        logger.info("\nSkipped URLs (article does not exist):")
        for url, status, message in results:
            if status == ScrapeStatus.SKIPPED:
                logger.info(f"  ⊘ {url}")

    # List failed URLs with details
    if failed_count > 0:
        logger.info("\nFailed URLs:")
        for url, status, message in results:
            if status == ScrapeStatus.FAILED:
                logger.info(f"  ✗ {url}: {message}")
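The counting logic that `batch_scrape` and `print_summary` share can be exercised without any network access by feeding in synthetic results (the URLs and messages below are made up for illustration):

```python
from enum import Enum

class ScrapeStatus(Enum):
    # Mirrors ScrapeStatus in src/scraping/batch.py
    SUCCESS = "success"
    SKIPPED = "skipped"
    FAILED = "failed"

# Synthetic (url, status, message) tuples, as scrape_single_page would return
results = [
    ("https://ja.empatheme.org/potion/001/", ScrapeStatus.SUCCESS, "saved"),
    ("https://ja.empatheme.org/potion/002/", ScrapeStatus.SKIPPED, "article not found"),
    ("https://ja.empatheme.org/potion/003/", ScrapeStatus.SUCCESS, "saved"),
    ("https://ja.empatheme.org/potion/004/", ScrapeStatus.FAILED, "timeout"),
]

# Same tallying expressions as print_summary
success = sum(1 for _, s, _ in results if s == ScrapeStatus.SUCCESS)
skipped = sum(1 for _, s, _ in results if s == ScrapeStatus.SKIPPED)
failed = sum(1 for _, s, _ in results if s == ScrapeStatus.FAILED)
print(success, skipped, failed)  # 2 1 1
```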
src/scraping/convert.py
ADDED
@@ -0,0 +1,4 @@
from markdownify import markdownify as md

def to_markdown(html: str) -> str:
    return md(html, heading_style="ATX")
src/scraping/exceptions.py
ADDED
@@ -0,0 +1,11 @@
"""
Custom exceptions for the scraping pipeline.
"""

class ArticleNotFoundError(Exception):
    """Raised when an article does not exist (HTTP 404)."""
    pass

class FetchError(Exception):
    """Raised for any other fetch error."""
    pass
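Because `ArticleNotFoundError` is reserved for 404s and `FetchError` covers everything else, callers can tell "no such article" apart from transient failures. A small sketch of that dispatch; the `classify` helper is illustrative, not part of the module:

```python
class ArticleNotFoundError(Exception):
    """Raised when an article does not exist (HTTP 404)."""

class FetchError(Exception):
    """Raised for any other fetch error."""

def classify(exc: Exception) -> str:
    # Check the more specific exception first, matching the except order
    # used by scrape_single_page in src/scraping/batch.py.
    if isinstance(exc, ArticleNotFoundError):
        return "skipped"
    if isinstance(exc, FetchError):
        return "failed"
    return "failed"  # any other exception also counts as a failure

print(classify(ArticleNotFoundError("404")))  # skipped
print(classify(FetchError("timeout")))        # failed
```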
src/scraping/extract.py
ADDED
@@ -0,0 +1,55 @@
from bs4 import BeautifulSoup
from readability import Document

def extract_main_html(html: str) -> str:
    soup = BeautifulSoup(html, "lxml")

    main = soup.select_one("main#content") or soup
    for sel in [
        "header", "footer", "nav", "aside",
        ".breadcrumbs", ".ast-breadcrumbs", ".yoast-breadcrumbs",
        ".site-header", ".site-footer",
        ".widget", ".sidebar", ".post-navigation", ".navigation",
        ".comments-area", ".comment-respond", ".entry-footer", ".entry-meta",
        ".jp-relatedposts", ".related-posts", ".sharedaddy", ".share-buttons",
        ".wp-block-search", ".search-form",
        'a[href*="facebook.com/sharer"]', 'a[href*="twitter.com/share"]',
    ]:
        for el in main.select(sel):
            el.decompose()

    article = main.select_one("article") or main
    h1 = article.select_one("h1.entry-title, h1")
    content = (
        article.select_one(".entry-content")
        or article.select_one(".nv-content-wrap")
        or article.select_one(".post-content")
        or article.select_one(".single-content")
        or article.select_one(".content")
    )

    if content:
        for sel in [
            ".sharedaddy", ".share", ".sns", ".advertisement", ".adsbygoogle",
            ".post-navigation", ".entry-footer", ".toc_container", ".table-of-contents",
            ".jp-relatedposts", ".related-posts"
        ]:
            for el in content.select(sel):
                el.decompose()

        parts = []
        if h1:
            parts.append(str(h1))
        parts.append(str(content))
        return "".join(parts)

    global_content = soup.select_one(".entry-content")
    if global_content:
        parts = []
        if h1:
            parts.append(str(h1))
        parts.append(str(global_content))
        return "".join(parts)

    doc = Document(html)
    return f"<h1>{doc.short_title()}</h1>{doc.summary()}"
src/scraping/fetch.py
ADDED
@@ -0,0 +1,39 @@
import asyncio
from typing import Optional

import aiohttp

from src.scraping.exceptions import ArticleNotFoundError, FetchError

async def fetch_html(
    url: str,
    timeout_s: float = 20.0,
    user_agent: Optional[str] = None,
) -> str:
    """
    Fetch the HTML for a URL.

    Raises:
        ArticleNotFoundError: On a 404 response.
        FetchError: On any other network error.
    """
    headers = {
        "User-Agent": user_agent or (
            "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
            "(KHTML, like Gecko) Chrome/124.0 Safari/537.36"
        )
    }
    timeout = aiohttp.ClientTimeout(total=timeout_s)
    try:
        async with aiohttp.ClientSession(timeout=timeout, headers=headers) as session:
            async with session.get(url, allow_redirects=True) as resp:
                if resp.status == 404:
                    raise ArticleNotFoundError(f"Article not found: {url}")
                resp.raise_for_status()
                return await resp.text()
    except ArticleNotFoundError:
        raise
    except aiohttp.ClientResponseError as e:
        raise FetchError(f"HTTP error {e.status}: {url}")
    except (aiohttp.ClientError, aiohttp.http_exceptions.HttpProcessingError, asyncio.TimeoutError) as e:
        raise FetchError(f"Network error: {url} - {str(e)}")
src/scraping/io.py
ADDED
@@ -0,0 +1,30 @@
import re
from pathlib import Path
from urllib.parse import urlparse

def _slug_from_url(url: str) -> str:
    """
    Build a safe filename from a URL. Example: https://ja.empatheme.org/potion/108/ → potion_108.md
    """
    p = urlparse(url)
    parts = [part for part in p.path.strip("/").split("/") if part]
    if len(parts) >= 2:
        base = f"{parts[-2]}_{parts[-1]}"
    elif parts:
        base = parts[-1]
    else:
        base = "index"
    # Sanitize: replace characters not allowed in filenames with underscores
    base = re.sub(r'[^\w\-_.]', '_', base)
    return f"{base}.md"

def ensure_dir(path: Path) -> None:
    path.mkdir(parents=True, exist_ok=True)

def save_markdown(md_text: str, out_dir: Path, url: str) -> Path:
    ensure_dir(out_dir)
    filename = _slug_from_url(url)
    out_path = out_dir / filename
    out_path.write_text(md_text, encoding="utf-8")
    return out_path
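To check the slug rule concretely, the same logic is restated below as a standalone snippet (duplicated from `_slug_from_url` so it runs on its own, with the leading underscore dropped):

```python
import re
from urllib.parse import urlparse

def slug_from_url(url: str) -> str:
    # Same logic as _slug_from_url in src/scraping/io.py
    p = urlparse(url)
    parts = [part for part in p.path.strip("/").split("/") if part]
    if len(parts) >= 2:
        base = f"{parts[-2]}_{parts[-1]}"   # last two path segments
    elif parts:
        base = parts[-1]
    else:
        base = "index"                       # bare domain, no path
    base = re.sub(r'[^\w\-_.]', '_', base)   # sanitize for filenames
    return f"{base}.md"

print(slug_from_url("https://ja.empatheme.org/potion/108/"))  # potion_108.md
print(slug_from_url("https://ja.empatheme.org/"))             # index.md
```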
src/scraping/pipeline.py
ADDED
@@ -0,0 +1,46 @@
from pathlib import Path

from src.scraping.convert import to_markdown
from src.scraping.exceptions import ArticleNotFoundError, FetchError
from src.scraping.extract import extract_main_html
from src.scraping.fetch import fetch_html
from src.scraping.io import save_markdown
from src.scraping.textutil import compact_blank_lines

async def run(url: str, out_dir: Path) -> Path:
    """
    Fetch the URL, extract the main content, convert it to Markdown,
    compact blank lines, and save the result.
    Returns the saved file's path.

    Raises:
        ArticleNotFoundError: If the article does not exist.
        FetchError: On a network error.
    """
    # Fetch the HTML (may raise)
    html = await fetch_html(url, timeout_s=25.0)

    # Extract main content → Markdown → compact blank lines → save
    main_html = extract_main_html(html)
    md_out = to_markdown(main_html)
    md_out = compact_blank_lines(md_out)
    return save_markdown(md_out, out_dir, url)

async def to_markdown_from_url(url: str, timeout_s: float = 25.0) -> str:
    """
    Fetch the URL, extract the main content, convert it to Markdown, and
    compact blank lines, returning the Markdown string.

    Args:
        url: URL to fetch.
        timeout_s: Fetch timeout in seconds.

    Returns:
        The extracted body as Markdown.

    Raises:
        ArticleNotFoundError: If the article does not exist.
        FetchError: On a network error.
    """
    html = await fetch_html(url, timeout_s=timeout_s)
    main_html = extract_main_html(html)
    md_out = to_markdown(main_html)
    return compact_blank_lines(md_out)
src/scraping/textutil.py
ADDED
@@ -0,0 +1,10 @@
def compact_blank_lines(text: str) -> str:
    lines = [line.rstrip() for line in text.splitlines()]
    out, prev_blank = [], False
    for ln in lines:
        blank = (ln.strip() == "")
        if blank and prev_blank:
            continue
        out.append("" if blank else ln)
        prev_blank = blank
    return "\n".join(out)
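A quick check of the collapsing behaviour: runs of blank lines become a single blank line, and trailing whitespace is stripped. The function is restated here so the snippet is self-contained:

```python
def compact_blank_lines(text: str) -> str:
    # Same logic as src/scraping/textutil.py
    lines = [line.rstrip() for line in text.splitlines()]
    out, prev_blank = [], False
    for ln in lines:
        blank = (ln.strip() == "")
        if blank and prev_blank:
            continue          # drop consecutive blank lines
        out.append("" if blank else ln)
        prev_blank = blank
    return "\n".join(out)

raw = "# Title\n\n\n\nBody line   \n\nNext"
print(compact_blank_lines(raw))
```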