# AliveAI Bucket README
This document explains the current code snapshot in `hf://buckets/meet4150/ALIV_AI`, what was implemented today, and what each file is used for.
## What Was Implemented Today
1. File ingestion endpoints were added:
   - `POST /ingest/file`
   - `POST /ingest/text`
   - `GET /ingest/task/{task_id}`
   - `GET /ingest/schema`
2. An upload processing pipeline was added:
   - extract text from `.txt`/`.pdf`/`.doc`/`.docx`
   - chunk text with overlap
   - generate embeddings (`BAAI/bge-base-en-v1.5`)
   - store records using the vector DB schema
3. Background ingestion was added via Celery.
4. Ingestion endpoints are configured to store into the local Chroma collection `medical_kb`.
5. RAG tuning env params were added (`chunk_size`, `chunk_overlap`, `top_k`, `top_p`, `llm_top_k`).
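The chunk-with-overlap step in item 2 can be sketched roughly as below. This is a character-based illustration, not the actual pipeline code: `chunk_text` is a hypothetical name, and the defaults simply mirror the documented `ALIVEAI_CHUNK_SIZE`/`ALIVEAI_CHUNK_OVERLAP` values.

```python
def chunk_text(text: str, chunk_size: int = 700, chunk_overlap: int = 150) -> list[str]:
    """Split text into fixed-size chunks where consecutive chunks share
    `chunk_overlap` characters (a stand-in for the pipeline's chunker)."""
    if chunk_overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk size")
    step = chunk_size - chunk_overlap  # how far each new chunk advances
    chunks: list[str] = []
    for start in range(0, len(text), step):
        chunk = text[start:start + chunk_size]
        if chunk:
            chunks.append(chunk)
        if start + chunk_size >= len(text):
            break  # last chunk already covers the end of the text
    return chunks
```

The overlap keeps sentences that straddle a chunk boundary retrievable from both neighboring chunks.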
## Canonical Ingestion Schema
All ingested chunks are stored in this structure:
```json
{
  "id": "string",
  "content": "string",
  "metadata": {
    "disease_id": "string",
    "topic": "string",
    "source": "string",
    "document_id": "string",
    "chunk_index": 0,
    "scraped_at": "YYYY-MM-DD"
  }
}
```
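For illustration, a record matching this schema could be assembled as follows. `IngestRecord`, `make_record`, and the `{document_id}-{chunk_index}` id scheme are assumptions for the example, not identifiers from the codebase.

```python
from datetime import date
from typing import TypedDict


class Metadata(TypedDict):
    disease_id: str
    topic: str
    source: str
    document_id: str
    chunk_index: int
    scraped_at: str


class IngestRecord(TypedDict):
    id: str
    content: str
    metadata: Metadata


def make_record(content: str, disease_id: str, topic: str,
                source: str, document_id: str, chunk_index: int) -> IngestRecord:
    # Hypothetical deterministic id derived from the parent document and chunk position
    return {
        "id": f"{document_id}-{chunk_index}",
        "content": content,
        "metadata": {
            "disease_id": disease_id,
            "topic": topic,
            "source": source,
            "document_id": document_id,
            "chunk_index": chunk_index,
            "scraped_at": date.today().isoformat(),
        },
    }
```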
## Runtime Flow
### A) File upload flow
1. Client uploads file to `POST /ingest/file`.
2. File is saved to `data/uploads/`.
3. Text is extracted by a file-type-specific parser.
4. Text is chunked with overlap.
5. Embeddings are computed.
6. Records are inserted into the Chroma collection `medical_kb`.
7. In async mode, task status is polled via `GET /ingest/task/{task_id}`.
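The status-polling step can be sketched as below. The status fetcher is injected so the example stays transport-agnostic; the terminal states follow Celery's standard task states, but the exact response shape of `GET /ingest/task/{task_id}` is an assumption here.

```python
import time
from typing import Callable


def wait_for_task(task_id: str,
                  fetch_status: Callable[[str], str],
                  poll_interval: float = 1.0,
                  timeout: float = 60.0) -> str:
    """Poll `fetch_status` (e.g. a wrapper around GET /ingest/task/{task_id})
    until the task reaches a terminal Celery state or the timeout expires."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        state = fetch_status(task_id)
        if state in ("SUCCESS", "FAILURE", "REVOKED"):
            return state
        time.sleep(poll_interval)
    raise TimeoutError(f"task {task_id} did not finish within {timeout}s")
```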
### B) Raw text flow
1. Client posts text to `POST /ingest/text`.
2. Text is chunked, embedded, and inserted with the same schema.
## How To Run
Install deps:
```bash
python3.12 -m pip install -r requirements.txt
```
Start API:
```bash
python3.12 -m uvicorn app.main:app --reload --port 8000
```
Start Celery worker (for async ingestion):
```bash
export ALIVEAI_CELERY_BROKER_URL=redis://localhost:6379/0
export ALIVEAI_CELERY_RESULT_BACKEND=redis://localhost:6379/0
celery -A app.celery_app.celery_app worker --loglevel=info
```
## Important Env Vars
```bash
# Ingestion + chunking
export ALIVEAI_CHUNK_SIZE=700
export ALIVEAI_CHUNK_OVERLAP=150
export ALIVEAI_INGEST_BATCH_SIZE=256
# RAG retrieval/generation
export ALIVEAI_RAG_TOP_K=5
export ALIVEAI_LLM_TOP_P=0.9
export ALIVEAI_LLM_TOP_K=40
# Backend selector (global app)
export ALIVEAI_VECTOR_BACKEND=auto # auto | pinecone | chroma
```
Note: ingestion endpoints explicitly target Chroma for upload processing.
## File Purpose Map
### Root files
- `README.md`: main project documentation and setup instructions.
- `BUCKET_README.md`: this bucket-focused developer guide.
- `requirements.txt`: Python dependencies.
- `test_rag.py`: basic checks for embedding similarity, retrieval, and NLP routing.
- `.gitignore`: ignore rules.
### `app/`
- `app/main.py`: FastAPI app entrypoint, routes (`/chat`, `/health`, `/ingest/*`).
- `app/celery_app.py`: Celery app configuration (broker/backend + task settings).
### `app/agent/`
- `app/agent/health_agent.py`: chat orchestration and response generation with Ollama/HF fallback.
- `app/agent/kb_embedding.py`: KB embedding service (bge-base-en-v1.5).
- `app/agent/kb_retrieval.py`: vector retrieval functions used by chat flow.
### `app/db/`
- `app/db/chroma_client.py`: vector DB adapter (Chroma + optional Pinecone integration), collection access.
### `app/nlp/`
- `app/nlp/nlp_service.py`: intent + disease routing using MiniLM embeddings and heuristics.
### `app/ingestion/`
- `app/ingestion/pipeline.py`: ingestion core logic:
  - file text extraction
  - chunking
  - schema record creation
  - embedding + vector insertion
### `app/tasks/`
- `app/tasks/ingestion_tasks.py`: Celery tasks for async `ingest_file` and `ingest_text`.
### `scripts/`
- `scripts/download_dataset.py`: download raw MedQuAD dataset and normalize fields.
- `scripts/prepare_dataset.py`: chunk + transform raw dataset to ingestion schema JSONL.
- `scripts/ingest.py`: batch ingestion of prepared JSONL into vector DB.
- `scripts/download_models.py`: download local embedding models.
- `scripts/download_hf_chat_model.py`: download local HF fallback chat model.
## Current Scope
This bucket is now clean and code-focused for developers:
- no duplicate project folders
- no local model binaries
- no local vector DB files
- no raw data dumps
If needed, data/model artifacts should be uploaded separately in dedicated paths.
