"""
data_downloader.py
──────────────────
Downloads a free, publicly available AI research report to use as a
demo document — no manual steps needed.
Primary : Stanford AI Index Report 2024 (summary chapter, public PDF)
Fallback 1: Our World in Data – AI progress summary (txt)
Fallback 2: Generate a synthetic AI overview document locally
"""
import os
import time
import textwrap
import urllib.request
from pathlib import Path
CACHE_DIR = Path("./sample_docs")
SAMPLE_PDF = CACHE_DIR / "ai_report_sample.pdf"
SAMPLE_TXT = CACHE_DIR / "ai_overview.txt"
# Public, stable, lightweight PDFs (< 5 MB each)
PDF_SOURCES = [
    (
        "https://arxiv.org/pdf/2311.02462",  # "Levels of AGI" Google DeepMind paper
        "Levels_of_AGI_DeepMind.pdf",
    ),
    (
        "https://arxiv.org/pdf/2303.12712",  # "Sparks of AGI" Microsoft Research
        "Sparks_of_AGI_Microsoft.pdf",
    ),
    (
        "https://arxiv.org/pdf/2306.02224",  # "Auto-GPT for Online Decision Making"
        "AutoGPT_Decision_Making.pdf",
    ),
]
def download_sample_doc() -> tuple[str, str]:
    """
    Returns (local_path, display_name).
    Tries PDF sources first; falls back to a generated TXT file.
    """
    CACHE_DIR.mkdir(exist_ok=True)
    # ── Try each PDF source ────────────────────────────────────────────────────
    for url, fname in PDF_SOURCES:
        dest = CACHE_DIR / fname
        if dest.exists():
            return str(dest), fname  # already cached
        try:
            print(f"Attempting download: {url}")
            req = urllib.request.Request(
                url,
                headers={
                    "User-Agent": (
                        "Mozilla/5.0 (X11; Linux x86_64) "
                        "AppleWebKit/537.36 (KHTML, like Gecko) "
                        "Chrome/120.0 Safari/537.36"
                    )
                },
            )
            with urllib.request.urlopen(req, timeout=20) as resp:
                data = resp.read()
            # Sanity-check: must look like a PDF
            if data[:4] == b"%PDF" and len(data) > 10_000:
                dest.write_bytes(data)
                print(f"✓ Downloaded {fname} ({len(data)//1024} KB)")
                return str(dest), fname
        except Exception as ex:
            print(f"  ✗ Failed: {ex}")
            time.sleep(1)
    # ── Fallback: generate a rich synthetic TXT document ──────────────────────
    print("All PDF downloads failed – generating synthetic document.")
    return _generate_synthetic_doc()
def _generate_synthetic_doc() -> tuple[str, str]:
    """Creates a comprehensive synthetic AI overview document locally."""
    fname = "AI_Technology_Overview_2024.txt"
    dest = CACHE_DIR / fname
    content = textwrap.dedent("""
═══════════════════════════════════════════════════════════════
ARTIFICIAL INTELLIGENCE: STATE OF THE FIELD — 2024 OVERVIEW
A Comprehensive Technical Reference Document
═══════════════════════════════════════════════════════════════
── SECTION 1: LARGE LANGUAGE MODELS ──────────────────────────
Large Language Models (LLMs) are neural networks trained on vast corpora
of text data using the Transformer architecture introduced by Vaswani et
al. in 2017. Modern LLMs such as GPT-4, Claude 3, Gemini Ultra, and
LLaMA-3 contain hundreds of billions of parameters.
Training involves two primary phases:
1. Pre-training: Self-supervised learning on internet-scale text data
(Common Crawl, Wikipedia, Books, GitHub code). The model learns to
predict the next token in a sequence.
2. Fine-tuning / RLHF: Reinforcement Learning from Human Feedback aligns
the model with human preferences, improving helpfulness, harmlessness,
and honesty.
Key capabilities: text generation, translation, summarization, question
answering, code generation, reasoning, and multimodal understanding.
Limitations: hallucinations (generating plausible but false information),
knowledge cutoff dates, context-window constraints, and sensitivity to
prompt phrasing (prompt brittleness).
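The next-token objective from the pre-training phase can be illustrated with a
toy vocabulary and greedy decoding (all numbers here are invented):

```python
import math

def softmax(logits):
    # subtract the max for numerical stability before exponentiating
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# hypothetical three-word vocabulary and model scores (logits)
vocab = ["cat", "sat", "mat"]
logits = [2.0, 1.0, 0.1]
probs = softmax(logits)
next_token = vocab[probs.index(max(probs))]  # greedy decoding picks "cat"
```

Real models do this over vocabularies of ~100k tokens, and sampling strategies
(temperature, top-p) replace the greedy argmax shown here.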
── SECTION 2: RETRIEVAL-AUGMENTED GENERATION (RAG) ──────────
RAG is an architectural pattern that enhances LLM accuracy by grounding
generation in retrieved factual documents. It was introduced in a 2020
paper by Lewis et al. at Facebook AI Research.
RAG Pipeline Architecture:
1. Document Ingestion: PDFs, text files, or web pages are loaded.
2. Chunking: Documents are split into smaller overlapping segments
(typically 256–1024 tokens) to fit the model's context window.
3. Embedding: Each chunk is converted to a dense vector using a sentence
transformer model (e.g., all-MiniLM-L6-v2, text-embedding-ada-002).
4. Vector Storage: Embeddings are stored in a vector database such as
ChromaDB, Pinecone, Weaviate, or Qdrant for fast similarity search.
5. Query Processing: A user query is embedded and compared against stored
vectors using cosine similarity or ANN algorithms (HNSW, IVF).
6. Context Injection: The top-k most relevant chunks are retrieved and
injected into the LLM prompt as grounding context.
7. Generation: The LLM generates an answer informed by retrieved context.
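The chunking step above can be sketched as a character-level sliding window
(toy sizes; production systems usually split on tokens or sentences):

```python
def chunk_text(text, size=30, overlap=10):
    # sliding window: neighbouring chunks share `overlap` characters
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

doc = "Retrieval-Augmented Generation grounds model output in documents."
chunks = chunk_text(doc)
```

The overlap preserves context that would otherwise be cut mid-sentence at a
chunk boundary.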
Advantages over pure LLMs:
- Up-to-date information (no knowledge cutoff)
- Reduced hallucination (grounded in real documents)
- Source attribution and transparency
- Domain-specific knowledge without expensive fine-tuning
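Steps 5–6 of the pipeline reduce to a similarity ranking. A minimal sketch,
with made-up 3-dimensional vectors standing in for real embeddings:

```python
import math

def cosine(a, b):
    # cosine similarity: dot product scaled by both vector norms
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# toy chunk "embeddings" — illustration only, not real model output
chunks = {
    "RAG grounds generation in retrieved documents.": [0.9, 0.1, 0.0],
    "Vector databases store dense embeddings.": [0.2, 0.9, 0.1],
    "LoRA adapts frozen base models cheaply.": [0.0, 0.2, 0.9],
}
query_vec = [0.8, 0.3, 0.0]  # embedded user query
top_k = sorted(chunks, key=lambda c: cosine(chunks[c], query_vec), reverse=True)[:2]
```

The `top_k` chunks would then be injected into the LLM prompt as grounding
context (step 6).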
── SECTION 3: VECTOR DATABASES ───────────────────────────────
Vector databases are specialized systems optimized for storing and
querying high-dimensional embedding vectors.
ChromaDB: Open-source, runs locally in Python. Ideal for development
and small-to-medium scale projects. Supports persistent and in-memory
storage. Integrates seamlessly with LangChain.
Pinecone: Managed cloud vector database. Scales to billions of vectors.
Supports metadata filtering, sparse-dense hybrid search.
Qdrant: Open-source with cloud option. Supports payload filtering,
multi-vector collections, and quantization for memory efficiency.
Weaviate: GraphQL-native vector search with modular ML integrations.
FAISS (Facebook AI Similarity Search): Library (not a database) for
efficient similarity search. Excellent for research and batch processing.
Approximate Nearest Neighbor (ANN) algorithms used by these systems
include HNSW (Hierarchical Navigable Small World graphs), which provides
O(log n) search complexity with high recall.
── SECTION 4: EMBEDDING MODELS ───────────────────────────────
Embedding models convert text into dense numerical vectors that capture
semantic meaning. Similar texts produce vectors that are close in the
embedding space (measured by cosine similarity or dot product).
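For unit-normalized vectors the two measures named above coincide: cosine
similarity is just the dot product. A quick check with toy 2-D vectors:

```python
import math

def normalize(v):
    # scale a vector to unit length
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

a = normalize([3.0, 4.0])  # -> [0.6, 0.8]
b = normalize([4.0, 3.0])  # -> [0.8, 0.6]
dot = sum(x * y for x, y in zip(a, b))  # equals cosine similarity here
```

This is why many embedding models ship pre-normalized vectors: the cheaper
dot product can be used at query time.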
Popular models:
- all-MiniLM-L6-v2: 22M parameters, 384 dimensions, very fast, good
quality. Best for real-time applications.
- all-mpnet-base-v2: 110M parameters, 768 dimensions, higher quality.
- text-embedding-3-small (OpenAI): 1536 dims, strong general performance.
- text-embedding-3-large (OpenAI): 3072 dims, state-of-the-art quality.
- UAE-Large-V1 (WhereIsAI): Top performer on MTEB benchmark as of 2024.
The MTEB (Massive Text Embedding Benchmark) is the standard evaluation
suite for embedding models, covering retrieval, clustering, classification,
and semantic similarity tasks across 58 datasets.

── SECTION 5: AI AGENTS & AGENTIC SYSTEMS ────────────────────
AI agents are LLM-powered systems that can take actions in the world—
browsing the web, executing code, calling APIs, and managing files—in
pursuit of a goal.
ReAct (Reason + Act) Framework: The model alternates between reasoning
steps (Thought) and actions (Act), observing results after each action.
LangGraph: A framework for building stateful, graph-based agent workflows.
Supports cycles, branching, parallel execution, and human-in-the-loop
interrupts.
CrewAI: Multi-agent framework where specialized agents collaborate on
complex tasks. Agents have roles, goals, tools, and can delegate to peers.
AutoGen (Microsoft): Framework for multi-agent conversation and code
execution. Supports human-agent collaboration workflows.
Key challenges in agent development:
- Long-horizon planning and task decomposition
- Reliable tool use and API integration
- Memory management (short-term, long-term, episodic)
- Error recovery and graceful degradation
- Safety and sandboxing of code execution
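A single ReAct-style Thought → Act → Observation cycle can be sketched with a
scripted tool call (hypothetical tool table; `eval` appears only because the
input is hard-coded — real agents sandbox code execution, per the safety
point above):

```python
# hypothetical tool registry mapping tool names to callables
TOOLS = {"calculator": lambda expr: str(eval(expr))}

def react_step(thought, action, tool_input):
    # one Thought -> Act -> Observation cycle of the ReAct loop
    observation = TOOLS[action](tool_input)
    return {"thought": thought, "action": action, "observation": observation}

step = react_step(
    thought="I need to compute the total.",
    action="calculator",
    tool_input="17 * 3",
)
```

In a full agent, the observation is appended to the prompt and the model
produces the next thought, looping until it emits a final answer.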
── SECTION 6: FINE-TUNING & PEFT METHODS ─────────────────────
Full fine-tuning of LLMs is computationally expensive. Parameter-Efficient
Fine-Tuning (PEFT) methods adapt pre-trained models with minimal resources.
LoRA (Low-Rank Adaptation): Adds small trainable rank-decomposition matrices
to attention layers while freezing the base model. Reduces trainable
parameters by 10,000x while achieving near-full fine-tune quality.
QLoRA: Quantizes the base model to 4-bit precision (NF4), then applies
LoRA adapters. Enables fine-tuning of 70B models on a single consumer GPU.
Instruction tuning: Fine-tuning on (instruction, response) pairs to
improve the model's ability to follow natural language directions.
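The LoRA update can be written as h = W x + B A x (the alpha/r scaling factor
is omitted here); a toy 2-dimensional numeric sketch:

```python
def matvec(M, v):
    # plain matrix-vector product
    return [sum(m_ij * v_j for m_ij, v_j in zip(row, v)) for row in M]

# frozen base weight W (2x2) and trainable low-rank factors, rank r = 1
W = [[1.0, 0.0], [0.0, 1.0]]
A = [[0.5, 0.5]]      # r x d, trainable
B = [[1.0], [0.0]]    # d x r, trainable
x = [2.0, 4.0]

Ax = matvec(A, x)     # -> [3.0]
BAx = matvec(B, Ax)   # -> [3.0, 0.0]
h = [w + b for w, b in zip(matvec(W, x), BAx)]  # W x + B A x
```

Only A and B (2r·d values) are trained; W stays frozen, which is where the
parameter savings come from.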
Popular open-source base models for fine-tuning:
- LLaMA-3 (Meta AI): 8B and 70B versions, strong multilingual support.
- Mistral-7B: Efficient 7B model with sliding window attention.
- Phi-3 (Microsoft): Small but surprisingly capable models (3.8B–14B).
- Gemma-2 (Google): 2B, 9B, and 27B versions, optimized for efficiency.
── SECTION 7: MLOPS AND MODEL DEPLOYMENT ─────────────────────
MLOps (Machine Learning Operations) covers the practices of deploying,
monitoring, and maintaining ML models in production.
Key components:
- Experiment Tracking: MLflow, Weights & Biases (W&B) track metrics,
hyperparameters, and model artifacts across training runs.
- Model Registry: Central repository for versioned model artifacts.
- Serving Infrastructure: FastAPI, TorchServe, Triton Inference Server,
or vLLM for high-throughput LLM serving.
- Containerization: Docker packages models with all dependencies.
Kubernetes orchestrates containers at scale.
- CI/CD: GitHub Actions or GitLab CI automates testing, building,
and deployment pipelines.
- Monitoring: Track data drift, concept drift, latency, and error rates
in production. Tools: Evidently AI, Arize, WhyLabs.
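As a toy illustration of drift monitoring (a made-up heuristic, not how the
tools named above work internally), one can compare a live feature
distribution against its training-time reference:

```python
def mean(xs):
    return sum(xs) / len(xs)

def drift_score(reference, live):
    # absolute shift in means, scaled by the reference spread (toy heuristic)
    ref_mean = mean(reference)
    spread = max(reference) - min(reference) or 1.0
    return abs(mean(live) - ref_mean) / spread

ref = [0.0, 1.0, 2.0, 3.0, 4.0]   # feature values seen at training time
live = [3.0, 4.0, 5.0, 6.0, 7.0]  # feature values seen in production
score = drift_score(ref, live)    # -> 0.75
```

Production systems use proper statistical tests (KS test, PSI) per feature,
but the idea is the same: alert when `score` crosses a threshold.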
Deployment platforms:
- HuggingFace Spaces: Free hosting for Gradio/Streamlit ML demos.
- AWS SageMaker: Enterprise ML deployment on AWS infrastructure.
- Google Vertex AI: Managed ML platform on Google Cloud.
- Replicate: API-first model deployment, pay-per-prediction.
- Modal: Serverless GPU compute for ML inference.
── SECTION 8: RESPONSIBLE AI & SAFETY ────────────────────────
As AI systems become more capable, ensuring they are safe, fair, and
aligned with human values is a critical research and engineering challenge.
Key principles:
- Helpfulness: The system should assist users effectively.
- Harmlessness: Avoid generating content that could cause real-world harm.
- Honesty: Acknowledge uncertainty; do not hallucinate or deceive.
Techniques:
- RLHF (Reinforcement Learning from Human Feedback): Trains reward models
from human preferences to guide LLM behavior.
- Constitutional AI (Anthropic): Models self-critique and revise outputs
against a set of principles.
- Red Teaming: Adversarial testing to discover model failure modes.
- Interpretability Research: Understanding internal model representations
(mechanistic interpretability, probing classifiers, attention analysis).
Regulatory landscape (2024):
- EU AI Act: First comprehensive AI regulation, risk-based tiered approach.
- US Executive Order on AI (Oct. 2023): Safety testing requirements for
large AI models.
- China AI Regulations: Content moderation and algorithmic transparency
requirements for generative AI services.
═══════════════════════════════════════════════════════════════
END OF DOCUMENT
═══════════════════════════════════════════════════════════════
""").strip()
dest.write_text(content, encoding="utf-8")
print(f"✓ Generated synthetic document ({len(content)} chars)")
return str(dest), fname
if __name__ == "__main__":
path, name = download_sample_doc()
print(f"\nReady: {path} ({name})")