# AliveAI Bucket README

This document explains the current code snapshot in `hf://buckets/meet4150/ALIV_AI`, what was implemented today, and what each file is used for.
## What Was Implemented Today

1. File ingestion endpoints were added:
   - `POST /ingest/file`
   - `POST /ingest/text`
   - `GET /ingest/task/{task_id}`
   - `GET /ingest/schema`
2. An upload processing pipeline was added:
   - extract text from `.txt`/`.pdf`/`.doc`/`.docx`
   - chunk text with overlap
   - generate embeddings (`BAAI/bge-base-en-v1.5`)
   - store records in the canonical vector DB schema
3. Background ingestion was added via Celery.
4. Ingestion endpoints are configured to store into the local Chroma collection `medical_kb`.
5. RAG tuning env params were added (`chunk_size`, `chunk_overlap`, `top_k`, `top_p`, `llm_top_k`).
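The "chunk text with overlap" step above can be sketched as a simple sliding window. This is an illustrative sketch, not the pipeline's actual API: the function name is hypothetical, and the defaults mirror the `ALIVEAI_CHUNK_SIZE`/`ALIVEAI_CHUNK_OVERLAP` env params documented below.

```python
def chunk_text(text: str, chunk_size: int = 700, chunk_overlap: int = 150) -> list[str]:
    """Split text into overlapping character windows (illustrative sketch)."""
    if chunk_size <= chunk_overlap:
        raise ValueError("chunk_size must exceed chunk_overlap")
    step = chunk_size - chunk_overlap  # how far the window advances each iteration
    chunks = []
    for start in range(0, len(text), step):
        chunk = text[start:start + chunk_size]
        if chunk:
            chunks.append(chunk)
        if start + chunk_size >= len(text):
            break  # the final window already reached the end of the text
    return chunks
```

Each chunk shares its last `chunk_overlap` characters with the start of the next chunk, which keeps sentence fragments retrievable even when they straddle a chunk boundary.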
## Canonical Ingestion Schema

All ingested chunks are stored in this structure:

```json
{
  "id": "string",
  "content": "string",
  "metadata": {
    "disease_id": "string",
    "topic": "string",
    "source": "string",
    "document_id": "string",
    "chunk_index": 0,
    "scraped_at": "YYYY-MM-DD"
  }
}
```
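A minimal sketch of assembling one record in this schema. The helper name and the `"<document_id>_<chunk_index>"` ID convention are assumptions for illustration, not the pipeline's actual code:

```python
from datetime import date

def make_record(content: str, disease_id: str, topic: str,
                source: str, document_id: str, chunk_index: int) -> dict:
    """Build one chunk record in the canonical ingestion schema (sketch)."""
    return {
        # Hypothetical ID convention: "<document_id>_<chunk_index>"
        "id": f"{document_id}_{chunk_index}",
        "content": content,
        "metadata": {
            "disease_id": disease_id,
            "topic": topic,
            "source": source,
            "document_id": document_id,
            "chunk_index": chunk_index,
            "scraped_at": date.today().isoformat(),  # "YYYY-MM-DD"
        },
    }
```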
## Runtime Flow

### A) File upload flow

1. Client uploads a file to `POST /ingest/file`.
2. The file is saved to `data/uploads/`.
3. Text is extracted by the file-type-specific parser.
4. Text is chunked with overlap.
5. Embeddings are computed for each chunk.
6. Records are inserted into the Chroma collection `medical_kb`.
7. If async mode is on, task status can be polled via `GET /ingest/task/{task_id}`.

### B) Raw text flow

1. Client posts text to `POST /ingest/text`.
2. Text is chunked, embedded, and inserted with the same schema.
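A hedged sketch of preparing a request for the raw-text flow using only the standard library. The payload field names (`text`, `disease_id`, `topic`) are assumptions; check `GET /ingest/schema` for the authoritative request shape:

```python
import json
import urllib.request

def build_ingest_text_request(text: str, disease_id: str, topic: str,
                              base_url: str = "http://localhost:8000") -> urllib.request.Request:
    """Prepare a POST /ingest/text request (payload field names are assumed)."""
    payload = {"text": text, "disease_id": disease_id, "topic": topic}
    return urllib.request.Request(
        f"{base_url}/ingest/text",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

# To actually send it against a running API:
#   urllib.request.urlopen(build_ingest_text_request("...", "d1", "overview"))
```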
## How To Run

Install deps:

```bash
python3.12 -m pip install -r requirements.txt
```

Start the API:

```bash
python3.12 -m uvicorn app.main:app --reload --port 8000
```

Start a Celery worker (for async ingestion):

```bash
export ALIVEAI_CELERY_BROKER_URL=redis://localhost:6379/0
export ALIVEAI_CELERY_RESULT_BACKEND=redis://localhost:6379/0
celery -A app.celery_app.celery_app worker --loglevel=info
```
## Important Env Vars

```bash
# Ingestion + chunking
export ALIVEAI_CHUNK_SIZE=700
export ALIVEAI_CHUNK_OVERLAP=150
export ALIVEAI_INGEST_BATCH_SIZE=256

# RAG retrieval/generation
export ALIVEAI_RAG_TOP_K=5
export ALIVEAI_LLM_TOP_P=0.9
export ALIVEAI_LLM_TOP_K=40

# Backend selector (global app)
export ALIVEAI_VECTOR_BACKEND=auto  # auto | pinecone | chroma
```

Note: ingestion endpoints explicitly target Chroma for upload processing.
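The tuning knobs above can be read with typed defaults roughly as follows. This is an illustrative sketch, not the app's actual settings module; the defaults mirror the values shown in the block above:

```python
import os

def rag_settings(env=os.environ) -> dict:
    """Read RAG/ingestion tuning env vars with typed defaults (sketch)."""
    return {
        "chunk_size": int(env.get("ALIVEAI_CHUNK_SIZE", 700)),
        "chunk_overlap": int(env.get("ALIVEAI_CHUNK_OVERLAP", 150)),
        "ingest_batch_size": int(env.get("ALIVEAI_INGEST_BATCH_SIZE", 256)),
        "rag_top_k": int(env.get("ALIVEAI_RAG_TOP_K", 5)),
        "llm_top_p": float(env.get("ALIVEAI_LLM_TOP_P", 0.9)),
        "llm_top_k": int(env.get("ALIVEAI_LLM_TOP_K", 40)),
        "vector_backend": env.get("ALIVEAI_VECTOR_BACKEND", "auto"),
    }
```

Passing a plain dict as `env` makes the function easy to unit-test without mutating the real environment.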
## File Purpose Map

### Root files

- `README.md`: main project documentation and setup instructions.
- `BUCKET_README.md`: this bucket-focused developer guide.
- `requirements.txt`: Python dependencies.
- `test_rag.py`: basic checks for embedding similarity, retrieval, and NLP routing.
- `.gitignore`: ignore rules.

### `app/`

- `app/main.py`: FastAPI app entrypoint and routes (`/chat`, `/health`, `/ingest/*`).
- `app/celery_app.py`: Celery app configuration (broker/backend + task settings).

### `app/agent/`

- `app/agent/health_agent.py`: chat orchestration and response generation with Ollama/HF fallback.
- `app/agent/kb_embedding.py`: KB embedding service (`BAAI/bge-base-en-v1.5`).
- `app/agent/kb_retrieval.py`: vector retrieval functions used by the chat flow.

### `app/db/`

- `app/db/chroma_client.py`: vector DB adapter (Chroma + optional Pinecone integration) and collection access.

### `app/nlp/`

- `app/nlp/nlp_service.py`: intent + disease routing using MiniLM embeddings and heuristics.
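Embedding-based routing of this kind typically reduces to cosine similarity between a query vector and per-label candidate vectors. A dependency-free sketch (the real service uses MiniLM embeddings plus heuristics; the function names here are illustrative):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def route(query_vec: list[float], candidates: dict[str, list[float]]) -> str:
    """Pick the candidate label whose vector is most similar to the query."""
    return max(candidates, key=lambda k: cosine_similarity(query_vec, candidates[k]))
```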
### `app/ingestion/`

- `app/ingestion/pipeline.py`: ingestion core logic:
  - file text extraction
  - chunking
  - schema record creation
  - embedding + vector insertion

### `app/tasks/`

- `app/tasks/ingestion_tasks.py`: Celery tasks for async `ingest_file` and `ingest_text`.

### `scripts/`

- `scripts/download_dataset.py`: download the raw MedQuAD dataset and normalize fields.
- `scripts/prepare_dataset.py`: chunk + transform the raw dataset into ingestion-schema JSONL.
- `scripts/ingest.py`: batch ingestion of prepared JSONL into the vector DB.
- `scripts/download_models.py`: download local embedding models.
- `scripts/download_hf_chat_model.py`: download the local HF fallback chat model.
## Current Scope

This bucket is now clean and code-focused for developers:

- no duplicate project folders
- no local model binaries
- no local vector DB files
- no raw data dumps

If needed, data/model artifacts should be uploaded separately under dedicated paths.