Commit 3915e0a
Parent(s): b788390
README update
- README.md +251 -91
- app/models/trained_pipeline.joblib +0 -0
- models/trained_pipeline.joblib +0 -0
README.md
CHANGED
|
@@ -9,100 +9,260 @@ pinned: false
|
|
| 9 |
license: mit
|
| 10 |
---
|
| 11 |
|
| 12 |
-
#
|
| 13 |
|
| 14 |
-
|
| 15 |
|
| 16 |
-
|
| 17 |
|
| 18 |
-
|
| 19 |
|
| 20 |
pytest -v
|
| 21 |
|
| 22 |
-
|
| 23 |
-
|
| 24 |
-
|
| 25 |
-
### Train-evaluate model
|
| 26 |
-
|
| 27 |
-
python scripts\seed_data.py
|
| 28 |
-
|
| 29 |
-
python scripts\train_model.py
|
| 30 |
-
|
| 31 |
-
python scripts\evaluate.py
|
| 32 |
-
|
| 33 |
-
## Initial struture
|
| 34 |
-
|
| 35 |
-
Context-aware NLP classification platform with MCP/
|
| 36 |
-
├── Dockerfile
|
| 37 |
-
├── docker-compose.yml
|
| 38 |
-
├── LICENSE
|
| 39 |
-
├── README.md
|
| 40 |
-
├── requirements-dev.txt
|
| 41 |
-
├── requirements.txt
|
| 42 |
-
├── start.sh
|
| 43 |
-
├── test_backend.py
|
| 44 |
-
├── app/
|
| 45 |
-
│ ├── config.py
|
| 46 |
-
│ ├── logging_config.py
|
| 47 |
-
│ ├── main.py # FastAPI entrypoint
|
| 48 |
-
│ ├── api/
|
| 49 |
-
│ │ ├── routes.py # API endpoints (e.g., /predict)
|
| 50 |
-
│ │ ├── schemas.py
|
| 51 |
-
│ ├── classification/
|
| 52 |
-
│ │ ├── decision.py
|
| 53 |
-
│ │ ├── llm_adapter.py
|
| 54 |
-
│ │ ├── model.py
|
| 55 |
-
│ │ ├── preprocess.py
|
| 56 |
-
│ │ ├── sklearn_model.py
|
| 57 |
-
│ ├── context/
|
| 58 |
-
│ │ ├── resolver.py
|
| 59 |
-
│ ├── logging/
|
| 60 |
-
│ ├── context_log.py
|
| 61 |
-
│ ├── inference_log.py
|
| 62 |
-
├── orchestration/
|
| 63 |
-
│ ├── context_resolver.py
|
| 64 |
-
│ ├── mcp_client.py
|
| 65 |
-
├── utils/
|
| 66 |
-
│ ├── validators.py
|
| 67 |
-
├── data/
|
| 68 |
-
│ ├── mcp/
|
| 69 |
-
│ │ ├── history.json
|
| 70 |
-
│ │ ├── policies.json
|
| 71 |
-
│ │ ├── taxonomy.json
|
| 72 |
-
│ ├── processed/
|
| 73 |
-
│ ├── raw/
|
| 74 |
-
│ ├── samples/
|
| 75 |
-
│ ├── training_data.json
|
| 76 |
-
├── docs/
|
| 77 |
-
│ ├── TECH_DEBT.md
|
| 78 |
-
├── logs/
|
| 79 |
-
├── mcp_servers/
|
| 80 |
-
│ ├── history_server/
|
| 81 |
-
│ │ ├── server.py
|
| 82 |
-
│ │ ├── data/
|
| 83 |
-
│ │ ├── labels.csv
|
| 84 |
-
│ ├── policy_server/
|
| 85 |
-
│ │ ├── server.py
|
| 86 |
-
│ │ ├── data/
|
| 87 |
-
│ │ ├── rules.yaml
|
| 88 |
-
│ ├── taxonomy_server/
|
| 89 |
-
│ ├── server.py
|
| 90 |
-
│ ├── data/
|
| 91 |
-
├── models/
|
| 92 |
-
│ ├── trained_pipeline.joblib
|
| 93 |
-
├── scripts/
|
| 94 |
-
│ ├── evaluate.py
|
| 95 |
-
│ ├── seed_data.py
|
| 96 |
-
│ ├── train_model.py
|
| 97 |
-
├── tests/
|
| 98 |
-
│ ├── conftest.py
|
| 99 |
-
│ ├── test_api.py
|
| 100 |
-
│ ├── test_classification.py
|
| 101 |
-
│ ├── test_context_resolution.py
|
| 102 |
-
│ ├── test_mcp_servers.py
|
| 103 |
-
├── ui/
|
| 104 |
-
├── static/
|
| 105 |
-
│ ├── style.css
|
| 106 |
-
│ ├── script.js
|
| 107 |
-
├── templates/
|
| 108 |
-
├── index.html
|
|
|
|
| 9 |
license: mit
|
| 10 |
---
|
| 11 |
|
| 12 |
+
# Context-aware NLP Classification Platform with MCP
|
| 13 |
|
| 14 |
+
## Overview
|
| 15 |
|
| 16 |
+
This repository implements a **context-aware NLP classification platform** that combines a lightweight TF-IDF + Logistic Regression baseline with optional **LLM-assisted context re-ranking** via MCP (Managed Context Platform). It supports multi-domain classification (finance, HR, legal), structured context resolution, logging, and evaluation.
|
| 17 |
|
| 18 |
+
The platform is modular and can run either **locally in a virtual environment** or inside a **Docker container**.
|
| 19 |
|
| 20 |
+
---
|
| 21 |
+
|
| 22 |
+
## Repository Structure
|
| 23 |
+
|
| 24 |
+
```
|
| 25 |
+
Dockerfile
|
| 26 |
+
LICENSE
|
| 27 |
+
README.md
|
| 28 |
+
requirements-dev.txt
|
| 29 |
+
requirements.txt
|
| 30 |
+
|
| 31 |
+
app/
|
| 32 |
+
config.py # Configuration and settings
|
| 33 |
+
logging_config.py # Logging configuration
|
| 34 |
+
main.py # Main entry point for API server
|
| 35 |
+
api/
|
| 36 |
+
routes.py # FastAPI routes
|
| 37 |
+
schemas.py # Pydantic schemas
|
| 38 |
+
classification/
|
| 39 |
+
decision.py # Classification decision & abstention logic
|
| 40 |
+
llm_adapter.py # Optional LLM integration for context
|
| 41 |
+
model.py # Abstract classifier orchestration
|
| 42 |
+
preprocess.py # Text preprocessing and tokenization
|
| 43 |
+
sklearn_model.py # TF-IDF + Logistic Regression classifier
|
| 44 |
+
context/
|
| 45 |
+
resolver.py # Context resolution logic
|
| 46 |
+
logging/
|
| 47 |
+
context_log.py # Context logging to JSON
|
| 48 |
+
inference_log.py # Inference logging to JSON
|
| 49 |
+
|
| 50 |
+
orchestration/
|
| 51 |
+
context_resolver.py # MCP-based structured context orchestration
|
| 52 |
+
mcp_client.py # MCP server communication utilities
|
| 53 |
+
|
| 54 |
+
utils/
|
| 55 |
+
validators.py # Metadata validation utilities
|
| 56 |
+
|
| 57 |
+
data/
|
| 58 |
+
samples/
|
| 59 |
+
train.json # Training samples (small dataset)
|
| 60 |
+
eval.json # Evaluation samples
|
| 61 |
+
training_data.json # Full training dataset
|
| 62 |
+
|
| 63 |
+
docs/
|
| 64 |
+
TECH_DEBT.md # Technical debt documentation
|
| 65 |
+
|
| 66 |
+
logs/ # Runtime logs
|
| 67 |
+
|
| 68 |
+
mcp_servers/
|
| 69 |
+
history_server/ # Historical label MCP server
|
| 70 |
+
server.py
|
| 71 |
+
data/labels.csv
|
| 72 |
+
policy_server/ # Policy MCP server
|
| 73 |
+
server.py
|
| 74 |
+
data/rules.yaml
|
| 75 |
+
taxonomy_server/ # Taxonomy MCP server
|
| 76 |
+
server.py
|
| 77 |
+
data/taxonomy.sqlite
|
| 78 |
+
|
| 79 |
+
models/
|
| 80 |
+
trained_pipeline.joblib # Trained sklearn model pipeline
|
| 81 |
+
|
| 82 |
+
scripts/
|
| 83 |
+
evaluate.py # Offline evaluation script
|
| 84 |
+
populate_taxonomy.py # Populate taxonomy.sqlite for MCP
|
| 85 |
+
seed_data.py # Seed initial data into MCP files
|
| 86 |
+
train_model.py # Train sklearn model from JSON dataset
|
| 87 |
+
|
| 88 |
+
tests/
|
| 89 |
+
conftest.py # Pytest configuration
|
| 90 |
+
test_api.py # API endpoint tests
|
| 91 |
+
test_classification.py # Classification module tests
|
| 92 |
+
test_context_resolution.py # Context resolver tests
|
| 93 |
+
test_mcp_servers.py # MCP server tests
|
| 94 |
+
|
| 95 |
+
ui/
|
| 96 |
+
static/
|
| 97 |
+
script.js # Frontend JS
|
| 98 |
+
style.css # Frontend CSS
|
| 99 |
+
templates/
|
| 100 |
+
index.html # Frontend template
|
| 101 |
+
```
|
| 102 |
+
|
| 103 |
+
---
|
| 104 |
+
|
| 105 |
+
## Installation (Local)
|
| 106 |
+
|
| 107 |
+
### 1. Clone the repository
|
| 108 |
+
|
| 109 |
+
```bash
|
| 110 |
+
git clone https://github.com/LeonardoMdSACode/Context-aware-NLP-classification-platform-with-MCP.git
|
| 111 |
+
cd Context-aware-NLP-classification-platform-with-MCP
|
| 112 |
+
```
|
| 113 |
+
|
| 114 |
+
### 2. Create virtual environment
|
| 115 |
+
|
| 116 |
+
```bash
|
| 117 |
+
python -m venv venv
|
| 118 |
+
source venv/bin/activate # Linux/macOS
|
| 119 |
+
venv\Scripts\activate # Windows
|
| 120 |
+
```
|
| 121 |
+
|
| 122 |
+
### 3. Install dependencies
|
| 123 |
+
|
| 124 |
+
```bash
|
| 125 |
+
pip install -r requirements.txt
|
| 126 |
+
pip install -r requirements-dev.txt # for testing and development
|
| 127 |
+
```
|
| 128 |
+
|
| 129 |
+
### 4. Populate MCP Taxonomy (first time setup)
|
| 130 |
+
|
| 131 |
+
```bash
|
| 132 |
+
python scripts/populate_taxonomy.py
|
| 133 |
+
```
|
| 134 |
+
|
| 135 |
+
This populates `mcp_servers/taxonomy_server/data/taxonomy.sqlite`.
|
| 136 |
+
|
| 137 |
+
### 5. Train the model
|
| 138 |
+
|
| 139 |
+
```bash
|
| 140 |
+
python scripts/train_model.py
|
| 141 |
+
```
|
| 142 |
+
|
| 143 |
+
This trains the TF-IDF + Logistic Regression model and saves it to `models/trained_pipeline.joblib`.
|
| 144 |
+
|
| 145 |
+
### 6. Evaluate the model
|
| 146 |
+
|
| 147 |
+
```bash
|
| 148 |
+
python scripts/evaluate.py
|
| 149 |
+
```
|
| 150 |
+
|
| 151 |
+
Shows offline evaluation metrics (accuracy, precision, recall, F1-score).
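
As a rough illustration of the metrics listed above, the following sketch computes accuracy and macro-averaged precision, recall, and F1 from two label lists using only the standard library. It is not the code in `scripts/evaluate.py`; the function name `macro_report` and the sample labels are assumptions made for this example, and the real script may average differently.

```python
from collections import Counter


def macro_report(y_true, y_pred):
    """Return accuracy plus macro-averaged precision, recall and F1."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted label was wrong
            fn[truth] += 1  # true label was missed

    def safe_div(num, den):
        # Avoid ZeroDivisionError for labels that never occur.
        return num / den if den else 0.0

    precisions, recalls, f1s = [], [], []
    for label in labels:
        p = safe_div(tp[label], tp[label] + fp[label])
        r = safe_div(tp[label], tp[label] + fn[label])
        precisions.append(p)
        recalls.append(r)
        f1s.append(safe_div(2 * p * r, p + r))

    n = len(labels)
    return {
        "accuracy": safe_div(sum(tp.values()), len(y_true)),
        "precision": safe_div(sum(precisions), n),
        "recall": safe_div(sum(recalls), n),
        "f1": safe_div(sum(f1s), n),
    }


# Hypothetical labels, for illustration only.
report = macro_report(
    ["finance", "hr", "legal", "hr"],
    ["finance", "hr", "hr", "hr"],
)
```

Macro averaging weights every class equally, which makes weak performance on a rare class visible even when overall accuracy looks good.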
|
| 152 |
+
|
| 153 |
+
---
|
| 154 |
+
|
| 155 |
+
## Running the API Locally
|
| 156 |
+
|
| 157 |
+
### 1. Start the server
|
| 158 |
+
|
| 159 |
+
```bash
|
| 160 |
+
uvicorn app.main:app --reload
|
| 161 |
+
```
|
| 162 |
+
|
| 163 |
+
This runs the FastAPI server at `http://127.0.0.1:8000`.
|
| 164 |
+
|
| 165 |
+
### 2. Run MCP embedded servers (if using embedded mode)
|
| 166 |
+
|
| 167 |
+
Embedded MCP servers are started automatically via `app.orchestration.mcp_client.start_embedded_mcp_servers()`.
|
| 168 |
+
|
| 169 |
+
### 3. Access UI
|
| 170 |
+
|
| 171 |
+
Open your browser at `http://127.0.0.1:8000` to use the HTML/JS frontend.
|
| 172 |
+
|
| 173 |
+
### 4. API Endpoints
|
| 174 |
+
|
| 175 |
+
* `POST /classify`: send `text` and optional `metadata` to get a classification together with the context used.
|
| 176 |
+
* Swagger UI: `http://127.0.0.1:8000/docs`
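
A minimal client sketch for the `POST /classify` endpoint listed above, using only the standard library. The `text` and `metadata` fields come from the endpoint description; the `domain` metadata key, the helper names, and the assumption that the response body is JSON are illustrative and not confirmed by the repository. Check the Swagger UI for the actual request and response schemas.

```python
import json
import urllib.request

BASE_URL = "http://127.0.0.1:8000"  # local uvicorn address from the steps above


def build_classify_request(text, metadata=None, base_url=BASE_URL):
    """Build (but do not send) a POST /classify request."""
    payload = {"text": text}
    if metadata is not None:
        payload["metadata"] = metadata
    return urllib.request.Request(
        url=f"{base_url}/classify",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def classify(text, metadata=None):
    """Send the request and decode the body, assuming a JSON response."""
    request = build_classify_request(text, metadata)
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.loads(response.read().decode("utf-8"))


# Building the request does not require a running server.
demo_request = build_classify_request(
    "Invoice 4411 is overdue",
    {"domain": "finance"},  # hypothetical metadata key
)
```

With the server running, `classify("Invoice 4411 is overdue", {"domain": "finance"})` would send the request and return the decoded response.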
|
| 177 |
+
|
| 178 |
+
---
|
| 179 |
+
|
| 180 |
+
## Testing
|
| 181 |
+
|
| 182 |
+
### 1. Run all tests
|
| 183 |
+
|
| 184 |
+
```bash
|
| 185 |
pytest -v
|
| 186 |
+
```
|
| 187 |
+
|
| 188 |
+
### 2. Smoke Test
|
| 189 |
+
|
| 190 |
+
* Run `test_backend.py` to ensure core API routes respond correctly.
|
| 191 |
+
* Check MCP servers respond to `/resolve` endpoints.
|
| 192 |
+
|
| 193 |
+
### 3. Module-specific tests
|
| 194 |
+
|
| 195 |
+
* `test_classification.py`: validates `SklearnClassifier` and `LLMAdapter` predictions.
|
| 196 |
+
* `test_context_resolution.py`: checks context resolver output.
|
| 197 |
+
* `test_mcp_servers.py`: verifies taxonomy, policy, history MCP servers.
|
| 198 |
+
|
| 199 |
+
---
|
| 200 |
+
|
| 201 |
+
## How It Works
|
| 202 |
+
|
| 203 |
+
### 1. Classification Layer
|
| 204 |
+
|
| 205 |
+
* **Baseline:** `app/classification/sklearn_model.py` (TF-IDF + Logistic Regression)
|
| 206 |
+
* **LLM-assisted:** `app/classification/llm_adapter.py` (optional MCP context re-ranking)
|
| 207 |
+
* **Decision logic:** `app/classification/decision.py` (applies confidence, abstention, logging)
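
To make the confidence and abstention step concrete, here is a small sketch of threshold-based abstention. It does not reproduce `decision.py`; the function name `decide`, the `0.6` threshold, and the shape of the returned dictionary are assumptions chosen for illustration.

```python
def decide(probabilities, threshold=0.6):
    """Pick the most probable label, or abstain when confidence is too low.

    `probabilities` maps label -> probability; `threshold` is a
    hypothetical cut-off, not a value taken from the repository.
    """
    if not probabilities:
        raise ValueError("probabilities must not be empty")
    label, confidence = max(probabilities.items(), key=lambda item: item[1])
    if confidence < threshold:
        # Below the cut-off: return no label and flag the abstention.
        return {"label": None, "confidence": confidence, "abstained": True}
    return {"label": label, "confidence": confidence, "abstained": False}


confident = decide({"finance": 0.82, "hr": 0.10, "legal": 0.08})
uncertain = decide({"finance": 0.40, "hr": 0.35, "legal": 0.25})
```

Abstaining on low-confidence inputs trades coverage for precision: fewer items receive a label, but the labels that are returned are more reliable.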
|
| 208 |
+
|
| 209 |
+
### 2. Context Resolution
|
| 210 |
+
|
| 211 |
+
* **Embedded MCP mode:** `app/orchestration/context_resolver.py` loads JSON/SQLite local files
|
| 212 |
+
* **Distributed MCP mode:** fetches context from taxonomy, policy, and history MCP servers
|
| 213 |
+
* Logs all context resolution for auditability
|
| 214 |
+
|
| 215 |
+
### 3. Logging
|
| 216 |
+
|
| 217 |
+
* `app/logging/inference_log.py`: logs every prediction
|
| 218 |
+
* `app/logging/context_log.py`: logs context used in classification
|
| 219 |
+
* Logs stored as JSON in `logs/`
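
One way such JSON logging could look is sketched below: each prediction is appended as one JSON object per line. The JSON Lines layout, the field names, and the helper names `log_inference` and `read_log` are assumptions for this example; the repository's log modules may structure their files differently.

```python
import json
from datetime import datetime, timezone
from pathlib import Path


def log_inference(log_path, text, label, confidence):
    """Append one prediction record as a single JSON line."""
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "text": text,
        "label": label,
        "confidence": confidence,
    }
    path = Path(log_path)
    path.parent.mkdir(parents=True, exist_ok=True)  # create logs/ if missing
    with path.open("a", encoding="utf-8") as handle:
        handle.write(json.dumps(record) + "\n")
    return record


def read_log(log_path):
    """Load every record back, skipping blank lines."""
    with Path(log_path).open(encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]
```

Appending one record per line keeps writes cheap and lets an auditor process the log without loading the whole file into memory.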
|
| 220 |
+
|
| 221 |
+
### 4. MCP Servers
|
| 222 |
+
|
| 223 |
+
* `taxonomy_server`: serves categories and descriptions from SQLite
|
| 224 |
+
* `policy_server`: serves policy rules from YAML
|
| 225 |
+
* `history_server`: serves historical label data from CSV
|
| 226 |
+
* The platform communicates with these servers over HTTP endpoints
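
The SQLite-backed taxonomy lookup can be pictured with the sketch below, which builds a throwaway in-memory database and queries categories by domain. The table name, column names, and sample rows are hypothetical; the actual schema of `taxonomy.sqlite` is defined by `scripts/populate_taxonomy.py` and is not shown in this README.

```python
import sqlite3


def build_demo_taxonomy():
    """Create an in-memory database with a hypothetical taxonomy schema."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE categories ("
        "name TEXT PRIMARY KEY, domain TEXT NOT NULL, description TEXT NOT NULL)"
    )
    conn.executemany(
        "INSERT INTO categories VALUES (?, ?, ?)",
        [
            ("invoice", "finance", "Billing and payment requests"),
            ("expense_report", "finance", "Employee expense claims"),
            ("leave_request", "hr", "Time-off requests"),
            ("contract", "legal", "Agreements and amendments"),
        ],
    )
    conn.commit()
    return conn


def categories_for_domain(conn, domain):
    """Return the categories of one domain, sorted by name."""
    rows = conn.execute(
        "SELECT name, description FROM categories WHERE domain = ? ORDER BY name",
        (domain,),  # parameterized query, never string formatting
    ).fetchall()
    return [{"name": name, "description": description} for name, description in rows]


demo_conn = build_demo_taxonomy()
finance_categories = categories_for_domain(demo_conn, "finance")
```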
|
| 227 |
+
|
| 228 |
+
### 5. Scripts
|
| 229 |
+
|
| 230 |
+
* `train_model.py`: trains and saves the sklearn pipeline
|
| 231 |
+
* `evaluate.py`: offline evaluation
|
| 232 |
+
* `populate_taxonomy.py`: populates SQLite taxonomy
|
| 233 |
+
* `seed_data.py`: seeds MCP JSON files
|
| 234 |
+
|
| 235 |
+
### 6. Frontend UI
|
| 236 |
+
|
| 237 |
+
* Simple interface in `ui/templates/index.html`
|
| 238 |
+
* Uses JS (`static/script.js`) to call `/classify` endpoint
|
| 239 |
+
* Styled via `static/style.css`
|
| 240 |
+
|
| 241 |
+
---
|
| 242 |
+
|
| 243 |
+
## Recommendations
|
| 244 |
+
|
| 245 |
+
* Use a **larger, more diverse dataset** for real-world deployment to avoid overfitting
|
| 246 |
+
* Use **sigmoid calibration** (Platt scaling) so that confidence scores track observed accuracy more closely
|
| 247 |
+
* Keep logs for **auditability** and context traceability
|
| 248 |
+
* Run tests regularly with `pytest -v` to ensure stability
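
The sigmoid calibration recommended in the list above can be sketched as follows: raw classifier scores are mapped to probabilities through `sigmoid(a * score + b)`, with `a` and `b` fitted on held-out data by minimizing log loss. This is a plain-Python illustration, not the repository's training code; the learning rate, epoch count, and the toy scores and labels are assumptions.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_platt(scores, labels, lr=0.1, epochs=2000):
    """Fit p = sigmoid(a * score + b) by gradient descent on log loss."""
    a, b = 1.0, 0.0
    n = len(scores)
    for _ in range(epochs):
        grad_a = grad_b = 0.0
        for score, label in zip(scores, labels):
            error = sigmoid(a * score + b) - label  # d(log loss)/dz
            grad_a += error * score
            grad_b += error
        a -= lr * grad_a / n
        b -= lr * grad_b / n
    return a, b


def calibrate(score, a, b):
    """Map a raw score to a calibrated probability."""
    return sigmoid(a * score + b)


# Toy held-out scores and binary labels, for illustration only.
held_out_scores = [-2.0, -1.0, -0.5, 0.5, 1.0, 2.0]
held_out_labels = [0, 0, 1, 0, 1, 1]
a, b = fit_platt(held_out_scores, held_out_labels)
```

In a scikit-learn setup, `CalibratedClassifierCV(method="sigmoid")` provides the same kind of calibration; whether the project uses it is not stated here.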
|
| 249 |
+
|
| 250 |
+
---
|
| 251 |
+
|
| 252 |
+
## References / Docs
|
| 253 |
+
|
| 254 |
+
* `docs/TECH_DEBT.md`: Technical debt notes and improvement suggestions
|
| 255 |
+
* `data/samples/`: Sample training/evaluation datasets
|
| 256 |
+
* `models/trained_pipeline.joblib`: Pretrained baseline model
|
| 257 |
+
|
| 258 |
+
---
|
| 259 |
+
|
| 260 |
+
## Contact / Author
|
| 261 |
+
|
| 262 |
+
Repository: [LeonardoMdSACode / Context-aware-NLP-classification-platform-with-MCP](https://github.com/LeonardoMdSACode/Context-aware-NLP-classification-platform-with-MCP)
|
| 263 |
+
|
| 264 |
+
---
|
| 265 |
+
|
| 266 |
+
## MIT License
|
| 267 |
|
| 268 |
+
This project is licensed under the MIT License. See the LICENSE file for details.
|
app/models/trained_pipeline.joblib
DELETED
|
Binary file (5.91 kB)
|
|
|
models/trained_pipeline.joblib
CHANGED
|
Binary files a/models/trained_pipeline.joblib and b/models/trained_pipeline.joblib differ
|
|
|