LeonardoMdSA committed on
Commit
3915e0a
·
1 Parent(s): b788390

README update

README.md CHANGED
@@ -9,100 +9,260 @@ pinned: false
  license: mit
  ---
 
- # Under construction...
 
- venv\Scripts\activate
 
- uvicorn app.main:app --reload --host 127.0.0.1 --port 8000
 
- ### Tests
 
  pytest -v
 
- Or manual smoke test in test_backend.py
-
-
- ### Train-evaluate model
-
- python scripts\seed_data.py
-
- python scripts\train_model.py
-
- python scripts\evaluate.py
-
- ## Initial structure
-
- Context-aware NLP classification platform with MCP/
- ├─ Dockerfile
- ├─ docker-compose.yml
- ├─ LICENSE
- ├─ README.md
- ├─ requirements-dev.txt
- ├─ requirements.txt
- ├─ start.sh
- ├─ test_backend.py
- ├─ app/
- │ ├─ config.py
- │ ├─ logging_config.py
- │ ├─ main.py # FastAPI entrypoint
- │ ├─ api/
- │ │ ├─ routes.py # API endpoints (e.g., /predict)
- │ │ └─ schemas.py
- │ ├─ classification/
- │ │ ├─ decision.py
- │ │ ├─ llm_adapter.py
- │ │ ├─ model.py
- │ │ ├─ preprocess.py
- │ │ └─ sklearn_model.py
- │ ├─ context/
- │ │ └─ resolver.py
- │ └─ logging/
- │   ├─ context_log.py
- │   └─ inference_log.py
- ├─ orchestration/
- │ ├─ context_resolver.py
- │ └─ mcp_client.py
- ├─ utils/
- │ └─ validators.py
- ├─ data/
- │ ├─ mcp/
- │ │ ├─ history.json
- │ │ ├─ policies.json
- │ │ └─ taxonomy.json
- │ ├─ processed/
- │ ├─ raw/
- │ └─ samples/
- │   └─ training_data.json
- ├─ docs/
- │ └─ TECH_DEBT.md
- ├─ logs/
- ├─ mcp_servers/
- │ ├─ history_server/
- │ │ ├─ server.py
- │ │ └─ data/
- │ │   └─ labels.csv
- │ ├─ policy_server/
- │ │ ├─ server.py
- │ │ └─ data/
- │ │   └─ rules.yaml
- │ └─ taxonomy_server/
- │   ├─ server.py
- │   └─ data/
- ├─ models/
- │ └─ trained_pipeline.joblib
- ├─ scripts/
- │ ├─ evaluate.py
- │ ├─ seed_data.py
- │ └─ train_model.py
- ├─ tests/
- │ ├─ conftest.py
- │ ├─ test_api.py
- │ ├─ test_classification.py
- │ ├─ test_context_resolution.py
- │ └─ test_mcp_servers.py
- └─ ui/
-   ├─ static/
-   │ ├─ style.css
-   │ └─ script.js
-   └─ templates/
-     └─ index.html
 
  license: mit
  ---
 
+ # Context-aware NLP Classification Platform with MCP
 
+ ## Overview
 
+ This repository implements a **context-aware NLP classification platform** that combines a lightweight TF-IDF + Logistic Regression baseline with optional **LLM-assisted context re-ranking** via MCP (Model Context Protocol). It supports multi-domain classification (finance, HR, legal), structured context resolution, logging, and evaluation.
 
+ The platform is modular and can run either **locally in a virtual environment** or inside a **Docker container**.
 
+ ---
+ 
+ ## Repository Structure
+ 
+ ```
+ Dockerfile
+ LICENSE
+ README.md
+ requirements-dev.txt
+ requirements.txt
+ 
+ app/
+   config.py                # Configuration and settings
+   logging_config.py        # Logging configuration
+   main.py                  # Main entry point for API server
+   api/
+     routes.py              # FastAPI routes
+     schemas.py             # Pydantic schemas
+   classification/
+     decision.py            # Classification decision & abstention logic
+     llm_adapter.py         # Optional LLM integration for context
+     model.py               # Abstract classifier orchestration
+     preprocess.py          # Text preprocessing and tokenization
+     sklearn_model.py       # TF-IDF + Logistic Regression classifier
+   context/
+     resolver.py            # Context resolution logic
+   logging/
+     context_log.py         # Context logging to JSON
+     inference_log.py       # Inference logging to JSON
+ 
+ orchestration/
+   context_resolver.py      # MCP-based structured context orchestration
+   mcp_client.py            # MCP server communication utilities
+ 
+ utils/
+   validators.py            # Metadata validation utilities
+ 
+ data/
+   samples/
+     train.json             # Training samples (small dataset)
+     eval.json              # Evaluation samples
+     training_data.json     # Full training dataset
+ 
+ docs/
+   TECH_DEBT.md             # Technical debt documentation
+ 
+ logs/                      # Runtime logs
+ 
+ mcp_servers/
+   history_server/          # Historical label MCP server
+     server.py
+     data/labels.csv
+   policy_server/           # Policy MCP server
+     server.py
+     data/rules.yaml
+   taxonomy_server/         # Taxonomy MCP server
+     server.py
+     data/taxonomy.sqlite
+ 
+ models/
+   trained_pipeline.joblib  # Trained sklearn model pipeline
+ 
+ scripts/
+   evaluate.py              # Offline evaluation script
+   populate_taxonomy.py     # Populate taxonomy.sqlite for MCP
+   seed_data.py             # Seed initial data into MCP files
+   train_model.py           # Train sklearn model from JSON dataset
+ 
+ tests/
+   conftest.py              # Pytest configuration
+   test_api.py              # API endpoint tests
+   test_classification.py   # Classification module tests
+   test_context_resolution.py  # Context resolver tests
+   test_mcp_servers.py      # MCP server tests
+ 
+ ui/
+   static/
+     script.js              # Frontend JS
+     style.css              # Frontend CSS
+   templates/
+     index.html             # Frontend template
+ ```
+ 
+ ---
+ 
+ ## Installation (Local)
+ 
+ ### 1. Clone the repository
+ 
+ ```bash
+ git clone https://github.com/LeonardoMdSACode/Context-aware-NLP-classification-platform-with-MCP.git
+ cd Context-aware-NLP-classification-platform-with-MCP
+ ```
+ 
+ ### 2. Create a virtual environment
+ 
+ ```bash
+ python -m venv venv
+ source venv/bin/activate   # Linux/macOS
+ venv\Scripts\activate      # Windows
+ ```
+ 
+ ### 3. Install dependencies
+ 
+ ```bash
+ pip install -r requirements.txt
+ pip install -r requirements-dev.txt   # for testing and development
+ ```
+ 
+ ### 4. Populate the MCP taxonomy (first-time setup)
+ 
+ ```bash
+ python scripts/populate_taxonomy.py
+ ```
+ 
+ This populates `mcp_servers/taxonomy_server/data/taxonomy.sqlite`.
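As a rough sketch of what a populated taxonomy store can look like (the table and column names below are assumptions for illustration; the real schema is defined by `scripts/populate_taxonomy.py`):

```python
import sqlite3

# In-memory stand-in for mcp_servers/taxonomy_server/data/taxonomy.sqlite.
# The schema here is hypothetical; check scripts/populate_taxonomy.py for the real one.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE taxonomy (label TEXT PRIMARY KEY, description TEXT)")
conn.executemany(
    "INSERT INTO taxonomy (label, description) VALUES (?, ?)",
    [
        ("finance", "Budgets, invoices, revenue and accounting topics"),
        ("hr", "Hiring, payroll, benefits and people operations"),
        ("legal", "Contracts, compliance and regulatory topics"),
    ],
)
conn.commit()

# The taxonomy MCP server would serve rows like these over HTTP.
rows = conn.execute("SELECT label FROM taxonomy ORDER BY label").fetchall()
print([r[0] for r in rows])  # ['finance', 'hr', 'legal']
```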
+ 
+ ### 5. Train the model
+ 
+ ```bash
+ python scripts/train_model.py
+ ```
+ 
+ This trains the TF-IDF + Logistic Regression model and saves it to `models/trained_pipeline.joblib`.
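The baseline is a standard scikit-learn pipeline. A minimal, self-contained sketch of the same idea on toy data (the actual script reads the datasets under `data/samples/` and persists the fitted pipeline with joblib):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Tiny toy dataset; the real training data lives in data/samples/.
texts = [
    "quarterly revenue and budget forecast",
    "invoice payment overdue",
    "new hire onboarding checklist",
    "employee benefits enrollment",
    "contract breach liability clause",
    "regulatory compliance audit",
]
labels = ["finance", "finance", "hr", "hr", "legal", "legal"]

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),            # text -> sparse TF-IDF features
    ("clf", LogisticRegression(max_iter=1000)),
])
pipeline.fit(texts, labels)

pred = pipeline.predict(["invoice payment and budget forecast"])[0]
print(pred)
```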
+ 
+ ### 6. Evaluate the model
+ 
+ ```bash
+ python scripts/evaluate.py
+ ```
+ 
+ This prints offline evaluation metrics (accuracy, precision, recall, F1-score).
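As an illustration of how those metrics are computed (toy labels, not the project's evaluation set):

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

y_true = ["finance", "hr", "legal", "finance", "hr", "legal"]
y_pred = ["finance", "hr", "legal", "finance", "legal", "legal"]  # one mistake

accuracy = accuracy_score(y_true, y_pred)
precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="macro", zero_division=0
)
print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
# accuracy=0.83 precision=0.89 recall=0.83 f1=0.82
```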
+ 
+ ---
+ 
+ ## Running the API Locally
+ 
+ ### 1. Start the server
+ 
+ ```bash
+ uvicorn app.main:app --reload
+ ```
+ 
+ This runs the FastAPI server at `http://127.0.0.1:8000`.
+ 
+ ### 2. Run the embedded MCP servers (embedded mode)
+ 
+ Embedded MCP servers are started automatically via `app.orchestration.mcp_client.start_embedded_mcp_servers()`.
+ 
+ ### 3. Access the UI
+ 
+ Open your browser at `http://127.0.0.1:8000` to use the HTML/JS frontend.
+ 
+ ### 4. API Endpoints
+ 
+ * `POST /classify`: send `text` and optional `metadata` to get a classification with context.
+ * Swagger UI: `http://127.0.0.1:8000/docs`
+ 
+ ---
+ 
+ ## Testing
+ 
+ ### 1. Run all tests
+ 
+ ```bash
  pytest -v
+ ```
+ 
+ ### 2. Smoke test
+ 
+ * Run `test_backend.py` to ensure the core API routes respond correctly.
+ * Check that the MCP servers respond on their `/resolve` endpoints.
+ 
+ ### 3. Module-specific tests
+ 
+ * `test_classification.py` → validates `SklearnClassifier` and `LLMAdapter` predictions.
+ * `test_context_resolution.py` → checks context resolver output.
+ * `test_mcp_servers.py` → verifies the taxonomy, policy, and history MCP servers.
+ 
+ ---
+ 
+ ## How It Works
+ 
+ ### 1. Classification Layer
+ 
+ * **Baseline:** `app/classification/sklearn_model.py` → TF-IDF + Logistic Regression
+ * **LLM-assisted:** `app/classification/llm_adapter.py` → optional MCP context re-ranking
+ * **Decision logic:** `app/classification/decision.py` → applies confidence thresholds, abstention, and logging
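The core of confidence-based abstention can be sketched like this (the threshold value and return shape are illustrative, not the actual `decision.py` interface):

```python
def decide(probabilities: dict[str, float], threshold: float = 0.6) -> dict:
    """Pick the top label, or abstain when confidence is too low.

    `probabilities` maps label -> predicted probability (illustrative shape).
    """
    label, confidence = max(probabilities.items(), key=lambda kv: kv[1])
    if confidence < threshold:
        # Abstain: defer to a human or to LLM-assisted re-ranking.
        return {"label": None, "confidence": confidence, "abstained": True}
    return {"label": label, "confidence": confidence, "abstained": False}

print(decide({"finance": 0.82, "hr": 0.10, "legal": 0.08}))  # confident -> finance
print(decide({"finance": 0.40, "hr": 0.35, "legal": 0.25}))  # low confidence -> abstain
```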
+ 
+ ### 2. Context Resolution
+ 
+ * **Embedded MCP mode:** `app/orchestration/context_resolver.py` loads local JSON/SQLite files
+ * **Distributed MCP mode:** fetches context from the taxonomy, policy, and history MCP servers
+ * All context resolution is logged for auditability
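Conceptually, both modes reduce to merging per-source context into one structured object. A hedged sketch (all field names here are assumptions; the real shapes live in the resolver modules):

```python
def merge_context(taxonomy: dict, policies: dict, history: dict) -> dict:
    """Merge per-source context into one structured dict (illustrative shape)."""
    return {
        "taxonomy": taxonomy,   # label definitions
        "policies": policies,   # routing/override rules
        "history": history,     # prior labels for similar texts
        "sources": ["taxonomy", "policy", "history"],
    }

ctx = merge_context(
    {"finance": "Budgets and invoices"},
    {"force_review_below_confidence": 0.5},
    {"similar_text_label": "finance"},
)
print(ctx["sources"])  # ['taxonomy', 'policy', 'history']
```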
+ 
+ ### 3. Logging
+ 
+ * `app/logging/inference_log.py` → logs every prediction
+ * `app/logging/context_log.py` → logs the context used in each classification
+ * Logs are stored as JSON in `logs/`
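A minimal JSON-lines logger in the same spirit (the file name and record fields here are assumptions; see `app/logging/inference_log.py` for the real format):

```python
import json
import tempfile
from datetime import datetime, timezone
from pathlib import Path

def log_inference(log_path: Path, text: str, label: str, confidence: float) -> None:
    """Append one JSON record per prediction (JSON-lines style)."""
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "text": text,
        "label": label,
        "confidence": confidence,
    }
    with log_path.open("a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")

log_file = Path(tempfile.mkdtemp()) / "inference.jsonl"
log_inference(log_file, "Invoice overdue", "finance", 0.91)
last = json.loads(log_file.read_text().splitlines()[-1])
print(last["label"], last["confidence"])  # finance 0.91
```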
+ 
+ ### 4. MCP Servers
+ 
+ * `taxonomy_server` → serves categories and descriptions from SQLite
+ * `policy_server` → serves policy rules from YAML
+ * `history_server` → serves historical label data from CSV
+ * All are reached via HTTP endpoints
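The history server's core operation is a lookup over prior labels. A stdlib sketch over a `labels.csv`-shaped file (the column names are assumed; see `mcp_servers/history_server/data/labels.csv` for the real layout):

```python
import csv
import io

# Stand-in for mcp_servers/history_server/data/labels.csv (columns assumed).
LABELS_CSV = """text,label
invoice payment overdue,finance
employee benefits enrollment,hr
contract breach liability,legal
"""

def lookup_history(query: str) -> list[str]:
    """Return labels of historical rows sharing any word with the query."""
    query_words = set(query.lower().split())
    matches = []
    for row in csv.DictReader(io.StringIO(LABELS_CSV)):
        if query_words & set(row["text"].lower().split()):
            matches.append(row["label"])
    return matches

print(lookup_history("overdue invoice from vendor"))  # ['finance']
```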
+ 
+ ### 5. Scripts
+ 
+ * `train_model.py` → trains and saves the sklearn pipeline
+ * `evaluate.py` → runs offline evaluation
+ * `populate_taxonomy.py` → populates the SQLite taxonomy
+ * `seed_data.py` → seeds the MCP JSON files
+ 
+ ### 6. Frontend UI
+ 
+ * Simple interface in `ui/templates/index.html`
+ * Uses JS (`static/script.js`) to call the `/classify` endpoint
+ * Styled via `static/style.css`
+ 
+ ---
+ 
+ ## Recommendations
+ 
+ * Use a **larger, more diverse dataset** for real-world deployment to avoid overfitting
+ * Use **sigmoid calibration** for realistic confidence scores
+ * Keep logs for **auditability** and context traceability
+ * Run tests regularly with `pytest -v` to ensure stability
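Sigmoid (Platt) calibration can be layered onto the baseline with scikit-learn's `CalibratedClassifierCV`; a toy sketch (not the project's actual training code — the dataset and `cv=2` are chosen only to keep the example small):

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

texts = [
    "quarterly revenue forecast", "invoice payment overdue",
    "budget approval meeting", "expense report audit",
    "new hire onboarding", "employee benefits enrollment",
    "payroll schedule update", "performance review cycle",
]
labels = ["finance"] * 4 + ["hr"] * 4

# Wrap the classifier in sigmoid calibration; cv=2 keeps the toy example valid.
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", CalibratedClassifierCV(LogisticRegression(max_iter=1000),
                                   method="sigmoid", cv=2)),
])
pipeline.fit(texts, labels)

# Calibrated probabilities sum to 1 and are less overconfident on tiny data.
probs = pipeline.predict_proba(["invoice for the budget meeting"])[0]
print(dict(zip(pipeline.classes_, probs.round(3))))
```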
+ 
+ ---
+ 
+ ## References / Docs
+ 
+ * `docs/TECH_DEBT.md` → technical debt notes and improvement suggestions
+ * `data/samples/` → sample training/evaluation datasets
+ * `models/trained_pipeline.joblib` → pretrained baseline model
+ 
+ ---
+ 
+ ## Contact / Author
+ 
+ Repository: [LeonardoMdSACode / Context-aware-NLP-classification-platform-with-MCP](https://github.com/LeonardoMdSACode/Context-aware-NLP-classification-platform-with-MCP)
+ 
+ ---
+ 
+ ## License
+ 
+ This project is licensed under the MIT License. See the LICENSE file for details.
app/models/trained_pipeline.joblib DELETED
Binary file (5.91 kB)
 
models/trained_pipeline.joblib CHANGED
Binary files a/models/trained_pipeline.joblib and b/models/trained_pipeline.joblib differ