"""
data_downloader.py
──────────────────
Downloads a free, publicly available AI research report to use as a
demo document — no manual steps needed.
Primary : Stanford AI Index Report 2024 (summary chapter, public PDF)
Fallback 1: Our World in Data – AI progress summary (txt)
Fallback 2: Generate a synthetic AI overview document locally
"""
import os
import time
import textwrap
import urllib.request
from pathlib import Path
CACHE_DIR = Path("./sample_docs")
SAMPLE_PDF = CACHE_DIR / "ai_report_sample.pdf"
SAMPLE_TXT = CACHE_DIR / "ai_overview.txt"
# Public, stable, lightweight PDFs (< 5 MB each)
PDF_SOURCES = [
    (
        "https://arxiv.org/pdf/2311.02462",  # "Levels of AGI" Google DeepMind paper
        "Levels_of_AGI_DeepMind.pdf",
    ),
    (
        "https://arxiv.org/pdf/2303.12712",  # "Sparks of AGI" Microsoft Research
        "Sparks_of_AGI_Microsoft.pdf",
    ),
    (
        "https://arxiv.org/pdf/2306.02224",  # "Auto-GPT for Online Decision Making"
        "AutoGPT_Decision_Making.pdf",
    ),
]
def download_sample_doc() -> tuple[str, str]:
    """
    Returns (local_path, display_name).
    Tries PDF sources first; falls back to a generated TXT file.
    """
    CACHE_DIR.mkdir(exist_ok=True)
    # ── Try each PDF source ────────────────────────────────────────────────────
    for url, fname in PDF_SOURCES:
        dest = CACHE_DIR / fname
        if dest.exists():
            return str(dest), fname  # already cached
        try:
            print(f"Attempting download: {url}")
            req = urllib.request.Request(
                url,
                headers={
                    "User-Agent": (
                        "Mozilla/5.0 (X11; Linux x86_64) "
                        "AppleWebKit/537.36 (KHTML, like Gecko) "
                        "Chrome/120.0 Safari/537.36"
                    )
                },
            )
            with urllib.request.urlopen(req, timeout=20) as resp:
                data = resp.read()
            # Sanity-check: must look like a PDF
            if data[:4] == b"%PDF" and len(data) > 10_000:
                dest.write_bytes(data)
                print(f"✓ Downloaded {fname} ({len(data)//1024} KB)")
                return str(dest), fname
        except Exception as ex:
            print(f"  ✗ Failed: {ex}")
            time.sleep(1)
    # ── Fallback: generate a rich synthetic TXT document ──────────────────────
    print("All PDF downloads failed – generating synthetic document.")
    return _generate_synthetic_doc()
def _generate_synthetic_doc() -> tuple[str, str]:
    """Creates a comprehensive synthetic AI overview document locally."""
    fname = "AI_Technology_Overview_2024.txt"
    dest = CACHE_DIR / fname
    content = textwrap.dedent("""
═══════════════════════════════════════════════════════════════
ARTIFICIAL INTELLIGENCE: STATE OF THE FIELD — 2024 OVERVIEW
A Comprehensive Technical Reference Document
═══════════════════════════════════════════════════════════════
── SECTION 1: LARGE LANGUAGE MODELS ──────────────────────────
Large Language Models (LLMs) are neural networks trained on vast corpora
of text data using the Transformer architecture introduced by Vaswani et
al. in 2017. Modern LLMs such as GPT-4, Claude 3, Gemini Ultra, and
LLaMA-3 contain hundreds of billions of parameters.
Training involves two primary phases:
1. Pre-training: Self-supervised learning on internet-scale text data
(Common Crawl, Wikipedia, Books, GitHub code). The model learns to
predict the next token in a sequence.
2. Fine-tuning / RLHF: Reinforcement Learning from Human Feedback aligns
the model with human preferences, improving helpfulness, harmlessness,
and honesty.
Key capabilities: text generation, translation, summarization, question
answering, code generation, reasoning, and multimodal understanding.
Limitations: hallucinations (generating plausible but false information),
knowledge cutoff dates, context-window constraints, and sensitivity to
prompt phrasing (prompt brittleness).
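The next-token objective from the pre-training phase can be illustrated with a
toy vocabulary and greedy decoding (all numbers here are invented):

```python
import math

def softmax(logits):
    # subtract the max for numerical stability before exponentiating
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# hypothetical three-word vocabulary and model scores (logits)
vocab = ["cat", "sat", "mat"]
logits = [2.0, 1.0, 0.1]
probs = softmax(logits)
next_token = vocab[probs.index(max(probs))]  # greedy decoding picks "cat"
```

Real models do this over vocabularies of ~100k tokens, and sampling strategies
(temperature, top-p) replace the greedy argmax shown here.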
── SECTION 2: RETRIEVAL-AUGMENTED GENERATION (RAG) ──────────
RAG is an architectural pattern that enhances LLM accuracy by grounding
generation in retrieved factual documents. It was introduced in a 2020
paper by Lewis et al. at Facebook AI Research.
RAG Pipeline Architecture:
1. Document Ingestion: PDFs, text files, or web pages are loaded.
2. Chunking: Documents are split into smaller overlapping segments
(typically 256–1024 tokens) to fit the model's context window.
3. Embedding: Each chunk is converted to a dense vector using a sentence
transformer model (e.g., all-MiniLM-L6-v2, text-embedding-ada-002).
4. Vector Storage: Embeddings are stored in a vector database such as
ChromaDB, Pinecone, Weaviate, or Qdrant for fast similarity search.
5. Query Processing: A user query is embedded and compared against stored
vectors using cosine similarity or ANN algorithms (HNSW, IVF).
6. Context Injection: The top-k most relevant chunks are retrieved and
injected into the LLM prompt as grounding context.
7. Generation: The LLM generates an answer informed by retrieved context.
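The chunking step above can be sketched as a character-level sliding window
(toy sizes; production systems usually split on tokens or sentences):

```python
def chunk_text(text, size=30, overlap=10):
    # sliding window: neighbouring chunks share `overlap` characters
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

doc = "Retrieval-Augmented Generation grounds model output in documents."
chunks = chunk_text(doc)
```

The overlap preserves context that would otherwise be cut mid-sentence at a
chunk boundary.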
Advantages over pure LLMs:
- Up-to-date information (no knowledge cutoff)
- Reduced hallucination (grounded in real documents)
- Source attribution and transparency
- Domain-specific knowledge without expensive fine-tuning
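Steps 5–6 of the pipeline reduce to a similarity ranking. A minimal sketch,
with made-up 3-dimensional vectors standing in for real embeddings:

```python
import math

def cosine(a, b):
    # cosine similarity: dot product scaled by both vector norms
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# toy chunk "embeddings" — illustration only, not real model output
chunks = {
    "RAG grounds generation in retrieved documents.": [0.9, 0.1, 0.0],
    "Vector databases store dense embeddings.": [0.2, 0.9, 0.1],
    "LoRA adapts frozen base models cheaply.": [0.0, 0.2, 0.9],
}
query_vec = [0.8, 0.3, 0.0]  # embedded user query
top_k = sorted(chunks, key=lambda c: cosine(chunks[c], query_vec), reverse=True)[:2]
```

The `top_k` chunks would then be injected into the LLM prompt as grounding
context (step 6).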
── SECTION 3: VECTOR DATABASES ───────────────────────────────
Vector databases are specialized systems optimized for storing and
querying high-dimensional embedding vectors.
ChromaDB: Open-source, runs locally in Python. Ideal for development
and small-to-medium scale projects. Supports persistent and in-memory
storage. Integrates seamlessly with LangChain.
Pinecone: Managed cloud vector database. Scales to billions of vectors.
Supports metadata filtering, sparse-dense hybrid search.
Qdrant: Open-source with cloud option. Supports payload filtering,
multi-vector collections, and quantization for memory efficiency.
Weaviate: GraphQL-native vector search with modular ML integrations.
FAISS (Facebook AI Similarity Search): Library (not a database) for
efficient similarity search. Excellent for research and batch processing.
Approximate Nearest Neighbor (ANN) algorithms used by these systems
include HNSW (Hierarchical Navigable Small World graphs), which provides
O(log n) search complexity with high recall.
── SECTION 4: EMBEDDING MODELS ───────────────────────────────
Embedding models convert text into dense numerical vectors that capture
semantic meaning. Similar texts produce vectors that are close in the
embedding space (measured by cosine similarity or dot product).
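For unit-normalized vectors the two measures named above coincide: cosine
similarity is just the dot product. A quick check with toy 2-D vectors:

```python
import math

def normalize(v):
    # scale a vector to unit length
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

a = normalize([3.0, 4.0])  # -> [0.6, 0.8]
b = normalize([4.0, 3.0])  # -> [0.8, 0.6]
dot = sum(x * y for x, y in zip(a, b))  # equals cosine similarity here
```

This is why many embedding models ship pre-normalized vectors: the cheaper
dot product can be used at query time.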
Popular models:
- all-MiniLM-L6-v2: 22M parameters, 384 dimensions, very fast, good
quality. Best for real-time applications.
- all-mpnet-base-v2: 110M parameters, 768 dimensions, higher quality.
- text-embedding-3-small (OpenAI): 1536 dims, strong general performance.
- text-embedding-3-large (OpenAI): 3072 dims, state-of-the-art quality.
- UAE-Large-V1 (WhereIsAI): Top performer on MTEB benchmark as of 2024.
The MTEB (Massive Text Embedding Benchmark) is the standard evaluation
suite for embedding models, covering retrieval, clustering, classification,
and semantic similarity tasks across 58 datasets.

── SECTION 5: AI AGENTS & AGENTIC SYSTEMS ────────────────────
AI agents are LLM-powered systems that can take actions in the world—
browsing the web, executing code, calling APIs, and managing files—in
pursuit of a goal.
ReAct (Reason + Act) Framework: The model alternates between reasoning
steps (Thought) and actions (Act), observing results after each action.
LangGraph: A framework for building stateful, graph-based agent workflows.
Supports cycles, branching, parallel execution, and human-in-the-loop
interrupts.
CrewAI: Multi-agent framework where specialized agents collaborate on
complex tasks. Agents have roles, goals, tools, and can delegate to peers.
AutoGen (Microsoft): Framework for multi-agent conversation and code
execution. Supports human-agent collaboration workflows.
Key challenges in agent development:
- Long-horizon planning and task decomposition
- Reliable tool use and API integration
- Memory management (short-term, long-term, episodic)
- Error recovery and graceful degradation
- Safety and sandboxing of code execution
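A single ReAct-style Thought → Act → Observation cycle can be sketched with a
scripted tool call (hypothetical tool table; `eval` appears only because the
input is hard-coded — real agents sandbox code execution, per the safety
point above):

```python
# hypothetical tool registry mapping tool names to callables
TOOLS = {"calculator": lambda expr: str(eval(expr))}

def react_step(thought, action, tool_input):
    # one Thought -> Act -> Observation cycle of the ReAct loop
    observation = TOOLS[action](tool_input)
    return {"thought": thought, "action": action, "observation": observation}

step = react_step(
    thought="I need to compute the total.",
    action="calculator",
    tool_input="17 * 3",
)
```

In a full agent, the observation is appended to the prompt and the model
produces the next thought, looping until it emits a final answer.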
── SECTION 6: FINE-TUNING & PEFT METHODS ─────────────────────
Full fine-tuning of LLMs is computationally expensive. Parameter-Efficient
Fine-Tuning (PEFT) methods adapt pre-trained models with minimal resources.
LoRA (Low-Rank Adaptation): Adds small trainable rank-decomposition matrices
to attention layers while freezing the base model. Reduces trainable
parameters by 10,000x while achieving near-full fine-tune quality.
QLoRA: Quantizes the base model to 4-bit precision (NF4), then applies
LoRA adapters. Enables fine-tuning of 70B models on a single consumer GPU.
Instruction tuning: Fine-tuning on (instruction, response) pairs to
improve the model's ability to follow natural language directions.
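The LoRA update can be written as h = W x + B A x (the alpha/r scaling factor
is omitted here); a toy 2-dimensional numeric sketch:

```python
def matvec(M, v):
    # plain matrix-vector product
    return [sum(m_ij * v_j for m_ij, v_j in zip(row, v)) for row in M]

# frozen base weight W (2x2) and trainable low-rank factors, rank r = 1
W = [[1.0, 0.0], [0.0, 1.0]]
A = [[0.5, 0.5]]      # r x d, trainable
B = [[1.0], [0.0]]    # d x r, trainable
x = [2.0, 4.0]

Ax = matvec(A, x)     # -> [3.0]
BAx = matvec(B, Ax)   # -> [3.0, 0.0]
h = [w + b for w, b in zip(matvec(W, x), BAx)]  # W x + B A x
```

Only A and B (2r·d values) are trained; W stays frozen, which is where the
parameter savings come from.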
Popular open-source base models for fine-tuning:
- LLaMA-3 (Meta AI): 8B and 70B versions, strong multilingual support.
- Mistral-7B: Efficient 7B model with sliding window attention.
- Phi-3 (Microsoft): Small but surprisingly capable models (3.8B–14B).
- Gemma-2 (Google): 2B, 9B, and 27B versions, optimized for efficiency.
── SECTION 7: MLOPS AND MODEL DEPLOYMENT ─────────────────────
MLOps (Machine Learning Operations) covers the practices of deploying,
monitoring, and maintaining ML models in production.
Key components:
- Experiment Tracking: MLflow, Weights & Biases (W&B) track metrics,
hyperparameters, and model artifacts across training runs.
- Model Registry: Central repository for versioned model artifacts.
- Serving Infrastructure: FastAPI, TorchServe, Triton Inference Server,
or vLLM for high-throughput LLM serving.
- Containerization: Docker packages models with all dependencies.
Kubernetes orchestrates containers at scale.
- CI/CD: GitHub Actions or GitLab CI automates testing, building,
and deployment pipelines.
- Monitoring: Track data drift, concept drift, latency, and error rates
in production. Tools: Evidently AI, Arize, WhyLabs.
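As a toy illustration of drift monitoring (a made-up heuristic, not how the
tools named above work internally), one can compare a live feature
distribution against its training-time reference:

```python
def mean(xs):
    return sum(xs) / len(xs)

def drift_score(reference, live):
    # absolute shift in means, scaled by the reference spread (toy heuristic)
    ref_mean = mean(reference)
    spread = max(reference) - min(reference) or 1.0
    return abs(mean(live) - ref_mean) / spread

ref = [0.0, 1.0, 2.0, 3.0, 4.0]   # feature values seen at training time
live = [3.0, 4.0, 5.0, 6.0, 7.0]  # feature values seen in production
score = drift_score(ref, live)    # -> 0.75
```

Production systems use proper statistical tests (KS test, PSI) per feature,
but the idea is the same: alert when `score` crosses a threshold.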
Deployment platforms:
- HuggingFace Spaces: Free hosting for Gradio/Streamlit ML demos.
- AWS SageMaker: Enterprise ML deployment on AWS infrastructure.
- Google Vertex AI: Managed ML platform on Google Cloud.
- Replicate: API-first model deployment, pay-per-prediction.
- Modal: Serverless GPU compute for ML inference.
── SECTION 8: RESPONSIBLE AI & SAFETY ────────────────────────
As AI systems become more capable, ensuring they are safe, fair, and
aligned with human values is a critical research and engineering challenge.
Key principles:
- Helpfulness: The system should assist users effectively.
- Harmlessness: Avoid generating content that could cause real-world harm.
- Honesty: Acknowledge uncertainty; do not hallucinate or deceive.
Techniques:
- RLHF (Reinforcement Learning from Human Feedback): Trains reward models
from human preferences to guide LLM behavior.
- Constitutional AI (Anthropic): Models self-critique and revise outputs
against a set of principles.
- Red Teaming: Adversarial testing to discover model failure modes.
- Interpretability Research: Understanding internal model representations
(mechanistic interpretability, probing classifiers, attention analysis).
Regulatory landscape (2024):
- EU AI Act: First comprehensive AI regulation, risk-based tiered approach.
- US Executive Order on AI (Oct. 2023): Safety testing requirements for
large AI models.
- China AI Regulations: Content moderation and algorithmic transparency
requirements for generative AI services.
═══════════════════════════════════════════════════════════════
END OF DOCUMENT
═══════════════════════════════════════════════════════════════
""").strip()
dest.write_text(content, encoding="utf-8")
print(f"✓ Generated synthetic document ({len(content)} chars)")
return str(dest), fname
if __name__ == "__main__":
path, name = download_sample_doc()
print(f"\nReady: {path} ({name})")