
AliveAI Bucket README

This document explains the current code snapshot in hf://buckets/meet4150/ALIV_AI, what was implemented today, and what each file is used for.

What Was Implemented Today

  1. File ingestion endpoints were added:
  • POST /ingest/file
  • POST /ingest/text
  • GET /ingest/task/{task_id}
  • GET /ingest/schema
  2. Upload processing pipeline was added:
  • extract text from .txt/.pdf/.doc/.docx
  • chunk text with overlap
  • generate embeddings (BAAI/bge-base-en-v1.5)
  • store records into the vector DB schema
  3. Background ingestion was added via Celery.

  4. Ingestion endpoints are configured to store into the local Chroma collection medical_kb.

  5. RAG tuning env params were added (chunk_size, chunk_overlap, top_k, top_p, llm_top_k).

Canonical Ingestion Schema

All ingested chunks are stored in this structure:

{
  "id": "string",
  "content": "string",
  "metadata": {
    "disease_id": "string",
    "topic": "string",
    "source": "string",
    "document_id": "string",
    "chunk_index": 0,
    "scraped_at": "YYYY-MM-DD"
  }
}
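A record in this schema can be assembled with a small helper. The `make_record` function and the `{document_id}-{chunk_index}` ID convention below are illustrative assumptions, not code from the repository:

```python
from datetime import date

def make_record(content: str, disease_id: str, topic: str,
                source: str, document_id: str, chunk_index: int) -> dict:
    """Build one chunk record in the canonical ingestion schema."""
    return {
        "id": f"{document_id}-{chunk_index}",  # assumed ID convention
        "content": content,
        "metadata": {
            "disease_id": disease_id,
            "topic": topic,
            "source": source,
            "document_id": document_id,
            "chunk_index": chunk_index,
            "scraped_at": date.today().isoformat(),  # "YYYY-MM-DD"
        },
    }

record = make_record("Diabetes is a chronic condition...", "diabetes",
                     "overview", "upload", "doc-001", 0)
```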

Runtime Flow

A) File upload flow

  1. Client uploads a file to POST /ingest/file.
  2. The file is saved to data/uploads/.
  3. Text is extracted by a file-type-specific parser.
  4. The text is chunked.
  5. Embeddings are computed.
  6. Records are inserted into the Chroma collection medical_kb.
  7. If async mode is enabled, task status is polled via GET /ingest/task/{task_id}.
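The chunking step above can be sketched as a sliding character window sized by ALIVEAI_CHUNK_SIZE/ALIVEAI_CHUNK_OVERLAP. This is a minimal sketch; the real pipeline may split on tokens or sentence boundaries instead of raw characters:

```python
def chunk_text(text: str, chunk_size: int = 700, overlap: int = 150) -> list[str]:
    """Split text into fixed-size chunks; consecutive chunks share `overlap` chars."""
    if chunk_size <= overlap:
        raise ValueError("chunk_size must exceed overlap")
    step = chunk_size - overlap  # how far the window advances each chunk
    return [text[i:i + chunk_size]
            for i in range(0, max(len(text) - overlap, 1), step)]
```

With the defaults, a 1000-character document yields two chunks whose last/first 150 characters coincide, which preserves context across chunk boundaries at retrieval time.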

B) Raw text flow

  1. Client posts text to POST /ingest/text.
  2. The text is chunked, embedded, and inserted under the same canonical schema.
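A minimal client call to the raw-text endpoint could look like the following. The payload field names (`text`, `metadata`) are assumptions about the request body, not confirmed by this document:

```python
import json

# Hypothetical request body for POST /ingest/text.
payload = {
    "text": "Hypertension is persistently elevated blood pressure...",
    "metadata": {"disease_id": "hypertension", "topic": "overview",
                 "source": "manual", "document_id": "doc-002"},
}
body = json.dumps(payload)

# Sending it (not executed here; requires the API to be running):
# import requests
# resp = requests.post("http://localhost:8000/ingest/text", data=body,
#                      headers={"Content-Type": "application/json"})
```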

How To Run

Install deps:

python3.12 -m pip install -r requirements.txt

Start API:

python3.12 -m uvicorn app.main:app --reload --port 8000

Start Celery worker (for async ingestion):

export ALIVEAI_CELERY_BROKER_URL=redis://localhost:6379/0
export ALIVEAI_CELERY_RESULT_BACKEND=redis://localhost:6379/0
celery -A app.celery_app.celery_app worker --loglevel=info

Important Env Vars

# Ingestion + chunking
export ALIVEAI_CHUNK_SIZE=700
export ALIVEAI_CHUNK_OVERLAP=150
export ALIVEAI_INGEST_BATCH_SIZE=256

# RAG retrieval/generation
export ALIVEAI_RAG_TOP_K=5
export ALIVEAI_LLM_TOP_P=0.9
export ALIVEAI_LLM_TOP_K=40

# Backend selector (global app)
export ALIVEAI_VECTOR_BACKEND=auto   # auto | pinecone | chroma

Note: ingestion endpoints explicitly target Chroma for upload processing.
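These variables might be read at startup roughly as follows; the defaults shown mirror the example exports above, but the actual config loader (e.g. pydantic settings vs. plain os.getenv) is an assumption:

```python
import os

def _int(name: str, default: int) -> int:
    """Read an integer setting from the environment, falling back to a default."""
    return int(os.getenv(name, default))

def _float(name: str, default: float) -> float:
    return float(os.getenv(name, default))

CHUNK_SIZE = _int("ALIVEAI_CHUNK_SIZE", 700)
CHUNK_OVERLAP = _int("ALIVEAI_CHUNK_OVERLAP", 150)
INGEST_BATCH_SIZE = _int("ALIVEAI_INGEST_BATCH_SIZE", 256)
RAG_TOP_K = _int("ALIVEAI_RAG_TOP_K", 5)
LLM_TOP_P = _float("ALIVEAI_LLM_TOP_P", 0.9)
LLM_TOP_K = _int("ALIVEAI_LLM_TOP_K", 40)
VECTOR_BACKEND = os.getenv("ALIVEAI_VECTOR_BACKEND", "auto")  # auto | pinecone | chroma
```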

File Purpose Map

Root files

  • README.md: main project documentation and setup instructions.
  • BUCKET_README.md: this bucket-focused developer guide.
  • requirements.txt: Python dependencies.
  • test_rag.py: basic checks for embedding similarity, retrieval, and NLP routing.
  • .gitignore: ignore rules.

app/

  • app/main.py: FastAPI app entrypoint, routes (/chat, /health, /ingest/*).
  • app/celery_app.py: Celery app configuration (broker/backend + task settings).

app/agent/

  • app/agent/health_agent.py: chat orchestration and response generation with Ollama/HF fallback.
  • app/agent/kb_embedding.py: KB embedding service (bge-base-en-v1.5).
  • app/agent/kb_retrieval.py: vector retrieval functions used by chat flow.
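The retrieval step in kb_retrieval.py amounts to nearest-neighbour search over stored embeddings. A dependency-free sketch of that idea is below; the real code delegates scoring to Chroma/Pinecone rather than computing cosine similarity in Python:

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either is zero-length)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def top_k(query_vec: list[float], records: list[tuple[str, list[float]]],
          k: int = 5) -> list[str]:
    """records: (id, embedding) pairs; return the ids of the k closest vectors."""
    scored = sorted(records, key=lambda r: cosine(query_vec, r[1]), reverse=True)
    return [rid for rid, _ in scored[:k]]
```

The default k of 5 matches the ALIVEAI_RAG_TOP_K example above.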

app/db/

  • app/db/chroma_client.py: vector DB adapter (Chroma + optional Pinecone integration), collection access.

app/nlp/

  • app/nlp/nlp_service.py: intent + disease routing using MiniLM embeddings and heuristics.

app/ingestion/

  • app/ingestion/pipeline.py: ingestion core logic:
    • file text extraction
    • chunking
    • schema record creation
    • embedding + vector insertion
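The file text extraction step can be sketched as a dispatch on file extension. Only the .txt branch is implemented in this sketch; the parser libraries named in the comment are suggestions, not necessarily what pipeline.py uses:

```python
from pathlib import Path

def extract_text(path: str) -> str:
    """Dispatch text extraction by file extension (sketch: only .txt handled)."""
    suffix = Path(path).suffix.lower()
    if suffix == ".txt":
        return Path(path).read_text(encoding="utf-8")
    if suffix in {".pdf", ".doc", ".docx"}:
        # The real pipeline would call a dedicated parser here
        # (e.g. pypdf for .pdf, python-docx for .docx).
        raise NotImplementedError(f"use a {suffix} parser")
    raise ValueError(f"unsupported file type: {suffix}")
```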

app/tasks/

  • app/tasks/ingestion_tasks.py: Celery tasks for async ingest_file and ingest_text.

scripts/

  • scripts/download_dataset.py: download raw MedQuAD dataset and normalize fields.
  • scripts/prepare_dataset.py: chunk + transform raw dataset to ingestion schema JSONL.
  • scripts/ingest.py: batch ingestion of prepared JSONL into vector DB.
  • scripts/download_models.py: download local embedding models.
  • scripts/download_hf_chat_model.py: download local HF fallback chat model.

Current Scope

This bucket is now clean and code-focused for developers:

  • no duplicate project folders
  • no local model binaries
  • no local vector DB files
  • no raw data dumps

If needed, data/model artifacts should be uploaded separately in dedicated paths.
