# AliveAI Bucket README

This document explains the current code snapshot in `hf://buckets/meet4150/ALIV_AI`, what was implemented today, and what each file is used for.
## What Was Implemented Today

- File ingestion endpoints were added:
  - `POST /ingest/file`
  - `POST /ingest/text`
  - `GET /ingest/task/{task_id}`
  - `GET /ingest/schema`
- Upload processing pipeline was added:
  - extract text from `.txt`/`.pdf`/`.doc`/`.docx`
  - chunk text with overlap
  - generate embeddings (`BAAI/bge-base-en-v1.5`)
  - store records into the vector DB schema
- Background ingestion was added via Celery.
- Ingestion endpoints are configured to store into the local Chroma collection `medical_kb`.
- RAG tuning env params were added (`chunk_size`, `chunk_overlap`, `top_k`, `top_p`, `llm_top_k`).
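The chunk-with-overlap step can be sketched as below. This is a minimal illustration, not the actual `app/ingestion/pipeline.py` code: the function name `chunk_text` is assumed, and the defaults mirror the `ALIVEAI_CHUNK_SIZE`/`ALIVEAI_CHUNK_OVERLAP` values shown later in this document.

```python
def chunk_text(text: str, chunk_size: int = 700, chunk_overlap: int = 150) -> list[str]:
    """Split text into overlapping character windows.

    Hypothetical sketch of the chunking step; the real pipeline may
    chunk on tokens or sentences rather than raw characters.
    """
    if chunk_size <= chunk_overlap:
        raise ValueError("chunk_size must be greater than chunk_overlap")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunk = text[start:start + chunk_size]
        if chunk:
            chunks.append(chunk)
        if start + chunk_size >= len(text):
            break  # last window already covers the tail
    return chunks
```

Because each window starts `chunk_size - chunk_overlap` characters after the previous one, consecutive chunks share `chunk_overlap` characters, which helps retrieval when an answer spans a chunk boundary.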
## Canonical Ingestion Schema

All ingested chunks are stored in this structure:

```json
{
  "id": "string",
  "content": "string",
  "metadata": {
    "disease_id": "string",
    "topic": "string",
    "source": "string",
    "document_id": "string",
    "chunk_index": 0,
    "scraped_at": "YYYY-MM-DD"
  }
}
```
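A record matching this schema can be built with a small helper like the following. The helper name `make_record` and the `"{document_id}-{chunk_index}"` id convention are assumptions for illustration; only the field names come from the schema above.

```python
from datetime import date


def make_record(content: str, disease_id: str, topic: str,
                source: str, document_id: str, chunk_index: int) -> dict:
    """Build one chunk record in the canonical ingestion schema.

    Hypothetical helper; the id convention here (document_id plus
    chunk_index) is an assumption, not the pipeline's actual scheme.
    """
    return {
        "id": f"{document_id}-{chunk_index}",
        "content": content,
        "metadata": {
            "disease_id": disease_id,
            "topic": topic,
            "source": source,
            "document_id": document_id,
            "chunk_index": chunk_index,
            "scraped_at": date.today().isoformat(),  # YYYY-MM-DD
        },
    }
```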
## Runtime Flow

### A) File upload flow

- Client uploads a file to `POST /ingest/file`.
- The file is saved to `data/uploads/`.
- Text is extracted by the file-type parser.
- Text is chunked.
- Embeddings are computed.
- Data is inserted into the Chroma collection `medical_kb`.
- If async mode is on, task status is checked via `GET /ingest/task/{task_id}`.

### B) Raw text flow

- Client posts text to `POST /ingest/text`.
- Text is chunked, embedded, and inserted with the same schema.
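Both flows share the same chunk → embed → insert core. A minimal sketch of that shared step, with the embedder and vector store passed in as callables so it can be shown standalone (the function name `ingest_text` and its signature are assumptions; the real `app/ingestion/pipeline.py` API may differ):

```python
from typing import Callable


def ingest_text(text: str, document_id: str,
                chunker: Callable[[str], list[str]],
                embed: Callable[[list[str]], list[list[float]]],
                insert: Callable[[list[dict], list[list[float]]], None]) -> int:
    """Chunk, embed, and insert raw text; returns the number of chunks stored.

    Hypothetical sketch of the shared ingestion core; metadata is
    trimmed to two fields here for brevity.
    """
    chunks = chunker(text)
    embeddings = embed(chunks)
    records = [
        {"id": f"{document_id}-{i}", "content": chunk,
         "metadata": {"document_id": document_id, "chunk_index": i}}
        for i, chunk in enumerate(chunks)
    ]
    insert(records, embeddings)
    return len(records)


# Usage with trivial stand-ins for the chunker, embedder, and vector store:
stored: list[dict] = []
n = ingest_text(
    "some raw text " * 50,
    document_id="doc-1",
    chunker=lambda t: [t[i:i + 100] for i in range(0, len(t), 100)],
    embed=lambda chunks: [[float(len(c))] for c in chunks],
    insert=lambda recs, embs: stored.extend(recs),
)
```

In the real service the chunker would apply the overlap settings, `embed` would call the `BAAI/bge-base-en-v1.5` model, and `insert` would write into the `medical_kb` Chroma collection.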
## How To Run

Install deps:

```shell
python3.12 -m pip install -r requirements.txt
```

Start API:

```shell
python3.12 -m uvicorn app.main:app --reload --port 8000
```

Start Celery worker (for async ingestion):

```shell
export ALIVEAI_CELERY_BROKER_URL=redis://localhost:6379/0
export ALIVEAI_CELERY_RESULT_BACKEND=redis://localhost:6379/0
celery -A app.celery_app.celery_app worker --loglevel=info
```
## Important Env Vars

```shell
# Ingestion + chunking
export ALIVEAI_CHUNK_SIZE=700
export ALIVEAI_CHUNK_OVERLAP=150
export ALIVEAI_INGEST_BATCH_SIZE=256

# RAG retrieval/generation
export ALIVEAI_RAG_TOP_K=5
export ALIVEAI_LLM_TOP_P=0.9
export ALIVEAI_LLM_TOP_K=40

# Backend selector (global app)
export ALIVEAI_VECTOR_BACKEND=auto  # auto | pinecone | chroma
```
Note: the ingestion endpoints always write to Chroma for upload processing, regardless of the backend selector.
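A settings loader for these variables could look like the sketch below. The function name `load_rag_settings` is an assumption; the env var names and defaults match the values listed above.

```python
import os


def _env_int(name: str, default: int) -> int:
    return int(os.environ.get(name, default))


def _env_float(name: str, default: float) -> float:
    return float(os.environ.get(name, default))


def load_rag_settings() -> dict:
    """Read RAG/ingestion tuning from the environment, with defaults.

    Hypothetical loader; the real app config may use pydantic or a
    different structure.
    """
    return {
        "chunk_size": _env_int("ALIVEAI_CHUNK_SIZE", 700),
        "chunk_overlap": _env_int("ALIVEAI_CHUNK_OVERLAP", 150),
        "ingest_batch_size": _env_int("ALIVEAI_INGEST_BATCH_SIZE", 256),
        "top_k": _env_int("ALIVEAI_RAG_TOP_K", 5),
        "llm_top_p": _env_float("ALIVEAI_LLM_TOP_P", 0.9),
        "llm_top_k": _env_int("ALIVEAI_LLM_TOP_K", 40),
        "vector_backend": os.environ.get("ALIVEAI_VECTOR_BACKEND", "auto"),
    }
```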
## File Purpose Map

### Root files

- `README.md`: main project documentation and setup instructions.
- `BUCKET_README.md`: this bucket-focused developer guide.
- `requirements.txt`: Python dependencies.
- `test_rag.py`: basic checks for embedding similarity, retrieval, and NLP routing.
- `.gitignore`: ignore rules.

### app/

- `app/main.py`: FastAPI app entrypoint and routes (`/chat`, `/health`, `/ingest/*`).
- `app/celery_app.py`: Celery app configuration (broker/backend + task settings).

### app/agent/

- `app/agent/health_agent.py`: chat orchestration and response generation with Ollama/HF fallback.
- `app/agent/kb_embedding.py`: KB embedding service (`bge-base-en-v1.5`).
- `app/agent/kb_retrieval.py`: vector retrieval functions used by the chat flow.

### app/db/

- `app/db/chroma_client.py`: vector DB adapter (Chroma + optional Pinecone integration) and collection access.

### app/nlp/

- `app/nlp/nlp_service.py`: intent + disease routing using MiniLM embeddings and heuristics.

### app/ingestion/

- `app/ingestion/pipeline.py`: ingestion core logic:
  - file text extraction
  - chunking
  - schema record creation
  - embedding + vector insertion

### app/tasks/

- `app/tasks/ingestion_tasks.py`: Celery tasks for async `ingest_file` and `ingest_text`.

### scripts/

- `scripts/download_dataset.py`: download the raw MedQuAD dataset and normalize fields.
- `scripts/prepare_dataset.py`: chunk + transform the raw dataset into ingestion-schema JSONL.
- `scripts/ingest.py`: batch ingestion of prepared JSONL into the vector DB.
- `scripts/download_models.py`: download local embedding models.
- `scripts/download_hf_chat_model.py`: download the local HF fallback chat model.
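The batch flow in `scripts/ingest.py` (read prepared JSONL, insert in batches sized by `ALIVEAI_INGEST_BATCH_SIZE`) can be sketched as follows. The helper names `read_jsonl` and `batched` are assumptions for illustration:

```python
import json
from typing import Iterable, Iterator


def read_jsonl(lines: Iterable[str]) -> Iterator[dict]:
    """Parse one schema record per non-empty JSONL line."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def batched(items: Iterable[dict], batch_size: int) -> Iterator[list[dict]]:
    """Yield successive fixed-size batches from an iterable of records."""
    batch: list[dict] = []
    for item in items:
        batch.append(item)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch  # final partial batch


# Usage with an in-memory stand-in for the prepared JSONL file:
jsonl = "\n".join(json.dumps({"id": f"doc-{i}", "content": "..."}) for i in range(5))
batches = list(batched(read_jsonl(jsonl.splitlines()), batch_size=2))
```

Each batch would then be embedded and inserted into the vector DB in one call, which keeps memory bounded on large prepared datasets.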
## Current Scope
This bucket is now clean and code-focused for developers:
- no duplicate project folders
- no local model binaries
- no local vector DB files
- no raw data dumps
If needed, data/model artifacts should be uploaded separately in dedicated paths.