
AliveAI Bucket README

This document explains the current code snapshot in hf://buckets/meet4150/ALIV_AI, what was implemented today, and what each file is used for.

What Was Implemented Today

  1. File ingestion endpoints were added:
  • POST /ingest/file
  • POST /ingest/text
  • GET /ingest/task/{task_id}
  • GET /ingest/schema
  2. Upload processing pipeline was added:
  • extract text from .txt/.pdf/.doc/.docx
  • chunk text with overlap
  • generate embeddings (BAAI/bge-base-en-v1.5)
  • store records into the vector DB schema
  3. Background ingestion was added via Celery.

  4. Ingestion endpoints are configured to store into the local Chroma collection medical_kb.

  5. RAG tuning env params were added (chunk_size, chunk_overlap, top_k, top_p, llm_top_k).

Canonical Ingestion Schema

All ingested chunks are stored in this structure:

{
  "id": "string",
  "content": "string",
  "metadata": {
    "disease_id": "string",
    "topic": "string",
    "source": "string",
    "document_id": "string",
    "chunk_index": 0,
    "scraped_at": "YYYY-MM-DD"
  }
}
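A record in this schema can be assembled with a small helper. The `make_record` function and the `{document_id}-{chunk_index}` ID convention below are illustrative assumptions, not code from the repository:

```python
from datetime import date

def make_record(content: str, disease_id: str, topic: str,
                source: str, document_id: str, chunk_index: int) -> dict:
    """Build one chunk record in the canonical ingestion schema."""
    return {
        "id": f"{document_id}-{chunk_index}",  # assumed ID convention
        "content": content,
        "metadata": {
            "disease_id": disease_id,
            "topic": topic,
            "source": source,
            "document_id": document_id,
            "chunk_index": chunk_index,
            "scraped_at": date.today().isoformat(),  # "YYYY-MM-DD"
        },
    }

record = make_record("Diabetes is a chronic condition...", "diabetes",
                     "overview", "upload", "doc-001", 0)
```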

Runtime Flow

A) File upload flow

  1. Client uploads a file to POST /ingest/file.
  2. The file is saved to data/uploads/.
  3. Text is extracted by a file-type-specific parser.
  4. The text is chunked.
  5. Embeddings are computed.
  6. Records are inserted into the Chroma collection medical_kb.
  7. If async mode is enabled, task status is polled via GET /ingest/task/{task_id}.
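The chunking step above can be sketched as a sliding character window sized by ALIVEAI_CHUNK_SIZE/ALIVEAI_CHUNK_OVERLAP. This is a minimal sketch; the real pipeline may split on tokens or sentence boundaries instead of raw characters:

```python
def chunk_text(text: str, chunk_size: int = 700, overlap: int = 150) -> list[str]:
    """Split text into fixed-size chunks; consecutive chunks share `overlap` chars."""
    if chunk_size <= overlap:
        raise ValueError("chunk_size must exceed overlap")
    step = chunk_size - overlap  # how far the window advances each chunk
    return [text[i:i + chunk_size]
            for i in range(0, max(len(text) - overlap, 1), step)]
```

With the defaults, a 1000-character document yields two chunks whose last/first 150 characters coincide, which preserves context across chunk boundaries at retrieval time.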

B) Raw text flow

  1. Client posts text to POST /ingest/text.
  2. The text is chunked, embedded, and inserted under the same canonical schema.
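A minimal client call to the raw-text endpoint could look like the following. The payload field names (`text`, `metadata`) are assumptions about the request body, not confirmed by this document:

```python
import json

# Hypothetical request body for POST /ingest/text.
payload = {
    "text": "Hypertension is persistently elevated blood pressure...",
    "metadata": {"disease_id": "hypertension", "topic": "overview",
                 "source": "manual", "document_id": "doc-002"},
}
body = json.dumps(payload)

# Sending it (not executed here; requires the API to be running):
# import requests
# resp = requests.post("http://localhost:8000/ingest/text", data=body,
#                      headers={"Content-Type": "application/json"})
```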

How To Run

Install deps:

python3.12 -m pip install -r requirements.txt

Start API:

python3.12 -m uvicorn app.main:app --reload --port 8000

Start Celery worker (for async ingestion):

export ALIVEAI_CELERY_BROKER_URL=redis://localhost:6379/0
export ALIVEAI_CELERY_RESULT_BACKEND=redis://localhost:6379/0
celery -A app.celery_app.celery_app worker --loglevel=info

Important Env Vars

# Ingestion + chunking
export ALIVEAI_CHUNK_SIZE=700
export ALIVEAI_CHUNK_OVERLAP=150
export ALIVEAI_INGEST_BATCH_SIZE=256

# RAG retrieval/generation
export ALIVEAI_RAG_TOP_K=5
export ALIVEAI_LLM_TOP_P=0.9
export ALIVEAI_LLM_TOP_K=40

# Backend selector (global app)
export ALIVEAI_VECTOR_BACKEND=auto   # auto | pinecone | chroma

Note: ingestion endpoints explicitly target Chroma for upload processing.
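These variables might be read at startup roughly as follows; the defaults shown mirror the example exports above, but the actual config loader (e.g. pydantic settings vs. plain os.getenv) is an assumption:

```python
import os

def _int(name: str, default: int) -> int:
    """Read an integer setting from the environment, falling back to a default."""
    return int(os.getenv(name, default))

def _float(name: str, default: float) -> float:
    return float(os.getenv(name, default))

CHUNK_SIZE = _int("ALIVEAI_CHUNK_SIZE", 700)
CHUNK_OVERLAP = _int("ALIVEAI_CHUNK_OVERLAP", 150)
INGEST_BATCH_SIZE = _int("ALIVEAI_INGEST_BATCH_SIZE", 256)
RAG_TOP_K = _int("ALIVEAI_RAG_TOP_K", 5)
LLM_TOP_P = _float("ALIVEAI_LLM_TOP_P", 0.9)
LLM_TOP_K = _int("ALIVEAI_LLM_TOP_K", 40)
VECTOR_BACKEND = os.getenv("ALIVEAI_VECTOR_BACKEND", "auto")  # auto | pinecone | chroma
```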

File Purpose Map

Root files

  • README.md: main project documentation and setup instructions.
  • BUCKET_README.md: this bucket-focused developer guide.
  • requirements.txt: Python dependencies.
  • test_rag.py: basic checks for embedding similarity, retrieval, and NLP routing.
  • .gitignore: ignore rules.

app/

  • app/main.py: FastAPI app entrypoint, routes (/chat, /health, /ingest/*).
  • app/celery_app.py: Celery app configuration (broker/backend + task settings).

app/agent/

  • app/agent/health_agent.py: chat orchestration and response generation with Ollama/HF fallback.
  • app/agent/kb_embedding.py: KB embedding service (bge-base-en-v1.5).
  • app/agent/kb_retrieval.py: vector retrieval functions used by chat flow.
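The retrieval step in kb_retrieval.py amounts to nearest-neighbour search over stored embeddings. A dependency-free sketch of that idea is below; the real code delegates scoring to Chroma/Pinecone rather than computing cosine similarity in Python:

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either is zero-length)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def top_k(query_vec: list[float], records: list[tuple[str, list[float]]],
          k: int = 5) -> list[str]:
    """records: (id, embedding) pairs; return the ids of the k closest vectors."""
    scored = sorted(records, key=lambda r: cosine(query_vec, r[1]), reverse=True)
    return [rid for rid, _ in scored[:k]]
```

The default k of 5 matches the ALIVEAI_RAG_TOP_K example above.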

app/db/

  • app/db/chroma_client.py: vector DB adapter (Chroma + optional Pinecone integration), collection access.

app/nlp/

  • app/nlp/nlp_service.py: intent + disease routing using MiniLM embeddings and heuristics.

app/ingestion/

  • app/ingestion/pipeline.py: ingestion core logic:
    • file text extraction
    • chunking
    • schema record creation
    • embedding + vector insertion
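The file text extraction step can be sketched as a dispatch on file extension. Only the .txt branch is implemented in this sketch; the parser libraries named in the comment are suggestions, not necessarily what pipeline.py uses:

```python
from pathlib import Path

def extract_text(path: str) -> str:
    """Dispatch text extraction by file extension (sketch: only .txt handled)."""
    suffix = Path(path).suffix.lower()
    if suffix == ".txt":
        return Path(path).read_text(encoding="utf-8")
    if suffix in {".pdf", ".doc", ".docx"}:
        # The real pipeline would call a dedicated parser here
        # (e.g. pypdf for .pdf, python-docx for .docx).
        raise NotImplementedError(f"use a {suffix} parser")
    raise ValueError(f"unsupported file type: {suffix}")
```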

app/tasks/

  • app/tasks/ingestion_tasks.py: Celery tasks for async ingest_file and ingest_text.

scripts/

  • scripts/download_dataset.py: download raw MedQuAD dataset and normalize fields.
  • scripts/prepare_dataset.py: chunk + transform raw dataset to ingestion schema JSONL.
  • scripts/ingest.py: batch ingestion of prepared JSONL into vector DB.
  • scripts/download_models.py: download local embedding models.
  • scripts/download_hf_chat_model.py: download local HF fallback chat model.

Current Scope

This bucket is now clean and code-focused for developers:

  • no duplicate project folders
  • no local model binaries
  • no local vector DB files
  • no raw data dumps

If needed, data/model artifacts should be uploaded separately in dedicated paths.
