"""
data_downloader.py
──────────────────
Downloads a free, publicly available AI research paper to use as a
demo document — no manual steps needed.

Primary  : public arXiv PDFs (tried in order, see PDF_SOURCES)
Fallback : generate a synthetic AI overview document locally
"""
import time
import textwrap
import urllib.request
from pathlib import Path

CACHE_DIR = Path("./sample_docs")
SAMPLE_PDF = CACHE_DIR / "ai_report_sample.pdf"
SAMPLE_TXT = CACHE_DIR / "ai_overview.txt"

# Public, stable, lightweight PDFs (< 5 MB each)
PDF_SOURCES = [
    (
        "https://arxiv.org/pdf/2311.02462",  # "Levels of AGI" Google DeepMind paper
        "Levels_of_AGI_DeepMind.pdf",
    ),
    (
        "https://arxiv.org/pdf/2303.12712",  # "Sparks of AGI" Microsoft Research
        "Sparks_of_AGI_Microsoft.pdf",
    ),
    (
        "https://arxiv.org/pdf/2306.02224",  # "Auto-GPT for Online Decision Making"
        "AutoGPT_Decision_Making.pdf",
    ),
]


def download_sample_doc() -> tuple[str, str]:
    """
    Returns (local_path, display_name).
    Tries PDF sources first; falls back to a generated TXT file.
    """
    CACHE_DIR.mkdir(exist_ok=True)

    # ── Try each PDF source ──────────────────────────────────────────────
    for url, fname in PDF_SOURCES:
        dest = CACHE_DIR / fname
        if dest.exists():
            return str(dest), fname  # already cached
        try:
            print(f"Attempting download: {url}")
            req = urllib.request.Request(
                url,
                headers={
                    "User-Agent": (
                        "Mozilla/5.0 (X11; Linux x86_64) "
                        "AppleWebKit/537.36 (KHTML, like Gecko) "
                        "Chrome/120.0 Safari/537.36"
                    )
                },
            )
            with urllib.request.urlopen(req, timeout=20) as resp:
                data = resp.read()
            # Sanity-check: must look like a PDF
            if data[:4] == b"%PDF" and len(data) > 10_000:
                dest.write_bytes(data)
                print(f"✓ Downloaded {fname} ({len(data) // 1024} KB)")
                return str(dest), fname
            print("  ✗ Response did not look like a PDF – skipping")
        except Exception as ex:
            print(f"  ✗ Failed: {ex}")
            time.sleep(1)

    # ── Fallback: generate a rich synthetic TXT document ─────────────────
    print("All PDF downloads failed – generating synthetic document.")
    return _generate_synthetic_doc()


def _generate_synthetic_doc() -> tuple[str, str]:
    """Creates a comprehensive synthetic AI overview document locally."""
    CACHE_DIR.mkdir(exist_ok=True)  # safe if this helper is called directly
    fname = "AI_Technology_Overview_2024.txt"
    dest = CACHE_DIR / fname
    content = textwrap.dedent("""
        ═══════════════════════════════════════════════════════════════
        ARTIFICIAL INTELLIGENCE: STATE OF THE FIELD — 2024 OVERVIEW
        A Comprehensive Technical Reference Document
        ═══════════════════════════════════════════════════════════════

        ── SECTION 1: LARGE LANGUAGE MODELS ──────────────────────────

        Large Language Models (LLMs) are neural networks trained on vast corpora
        of text data using the Transformer architecture introduced by Vaswani et
        al. in 2017. Modern LLMs such as GPT-4, Claude 3, Gemini Ultra, and
        LLaMA-3 contain hundreds of billions of parameters.

        Training involves two primary phases:
        1. Pre-training: Self-supervised learning on internet-scale text data
           (Common Crawl, Wikipedia, books, GitHub code). The model learns to
           predict the next token in a sequence.
        2. Fine-tuning / RLHF: Reinforcement Learning from Human Feedback aligns
           the model with human preferences, improving helpfulness, harmlessness,
           and honesty.

        Key capabilities: text generation, translation, summarization, question
        answering, code generation, reasoning, and multimodal understanding.

        Limitations: hallucinations (generating plausible but false information),
        knowledge cutoff dates, context-window constraints, and sensitivity to
        prompt phrasing (prompt brittleness).

        ── SECTION 2: RETRIEVAL-AUGMENTED GENERATION (RAG) ───────────

        RAG is an architectural pattern that enhances LLM accuracy by grounding
        generation in retrieved factual documents. It was introduced in a 2020
        paper by Lewis et al. at Facebook AI Research.

        RAG Pipeline Architecture:
        1. Document Ingestion: PDFs, text files, or web pages are loaded.
        2. Chunking: Documents are split into smaller overlapping segments
           (typically 256–1024 tokens) to fit the model's context window.
        3. Embedding: Each chunk is converted to a dense vector using a sentence
           transformer model (e.g., all-MiniLM-L6-v2, text-embedding-ada-002).
        4. Vector Storage: Embeddings are stored in a vector database such as
           ChromaDB, Pinecone, Weaviate, or Qdrant for fast similarity search.
        5. Query Processing: A user query is embedded and compared against stored
           vectors using cosine similarity or ANN algorithms (HNSW, IVF).
        6. Context Injection: The top-k most relevant chunks are retrieved and
           injected into the LLM prompt as grounding context.
        7. Generation: The LLM generates an answer informed by retrieved context.

        Advantages over pure LLMs:
        - Up-to-date information (no knowledge cutoff)
        - Reduced hallucination (grounded in real documents)
        - Source attribution and transparency
        - Domain-specific knowledge without expensive fine-tuning

        ── SECTION 3: VECTOR DATABASES ───────────────────────────────

        Vector databases are specialized systems optimized for storing and
        querying high-dimensional embedding vectors.

        ChromaDB: Open-source, runs locally in Python. Ideal for development
        and small-to-medium scale projects. Supports persistent and in-memory
        storage. Integrates seamlessly with LangChain.

        Pinecone: Managed cloud vector database. Scales to billions of vectors.
        Supports metadata filtering and sparse-dense hybrid search.

        Qdrant: Open-source with a cloud option. Supports payload filtering,
        multi-vector collections, and quantization for memory efficiency.

        Weaviate: GraphQL-native vector search with modular ML integrations.

        FAISS (Facebook AI Similarity Search): A library (not a database) for
        efficient similarity search. Excellent for research and batch processing.

        Approximate Nearest Neighbor (ANN) algorithms used by these systems
        include HNSW (Hierarchical Navigable Small World graphs), which provides
        O(log n) search complexity with high recall.

        ── SECTION 4: EMBEDDING MODELS ───────────────────────────────

        Embedding models convert text into dense numerical vectors that capture
        semantic meaning. Similar texts produce vectors that are close in the
        embedding space (measured by cosine similarity or dot product).

        Popular models:
        - all-MiniLM-L6-v2: 22M parameters, 384 dimensions, very fast, good
          quality. Best for real-time applications.
        - all-mpnet-base-v2: 110M parameters, 768 dimensions, higher quality.
        - text-embedding-3-small (OpenAI): 1536 dims, strong general performance.
        - text-embedding-3-large (OpenAI): 3072 dims, state-of-the-art quality.
        - UAE-Large-V1 (WhereIsAI): Top performer on the MTEB benchmark as of 2024.

        The MTEB (Massive Text Embedding Benchmark) is the standard evaluation
        suite for embedding models, covering retrieval, clustering, classification,
        and semantic similarity tasks across 56 datasets.

        ── SECTION 5: AI AGENTS & AGENTIC SYSTEMS ────────────────────

        AI agents are LLM-powered systems that can take actions in the world—
        browsing the web, executing code, calling APIs, and managing files—in
        pursuit of a goal.

        ReAct (Reason + Act) Framework: The model alternates between reasoning
        steps (Thought) and actions (Act), observing results after each action.

        LangGraph: A framework for building stateful, graph-based agent workflows.
        Supports cycles, branching, parallel execution, and human-in-the-loop
        interrupts.

        CrewAI: Multi-agent framework where specialized agents collaborate on
        complex tasks. Agents have roles, goals, tools, and can delegate to peers.

        AutoGen (Microsoft): Framework for multi-agent conversation and code
        execution. Supports human-agent collaboration workflows.

        Key challenges in agent development:
        - Long-horizon planning and task decomposition
        - Reliable tool use and API integration
        - Memory management (short-term, long-term, episodic)
        - Error recovery and graceful degradation
        - Safety and sandboxing of code execution

        ── SECTION 6: FINE-TUNING & PEFT METHODS ─────────────────────

        Full fine-tuning of LLMs is computationally expensive. Parameter-Efficient
        Fine-Tuning (PEFT) methods adapt pre-trained models with minimal resources.

        LoRA (Low-Rank Adaptation): Adds small trainable rank-decomposition matrices
        to attention layers while freezing the base model. Reduces trainable
        parameters by up to 10,000x while achieving near-full fine-tune quality.

        QLoRA: Quantizes the base model to 4-bit precision (NF4), then applies
        LoRA adapters. Enables fine-tuning of a 65B model on a single 48 GB GPU.

        Instruction tuning: Fine-tuning on (instruction, response) pairs to
        improve the model's ability to follow natural language directions.

        Popular open-source base models for fine-tuning:
        - LLaMA-3 (Meta AI): 8B and 70B versions, strong multilingual support.
        - Mistral-7B: Efficient 7B model with sliding-window attention.
        - Phi-3 (Microsoft): Small but surprisingly capable models (3.8B–14B).
        - Gemma-2 (Google): 2B and 9B versions, optimized for efficiency.

        ── SECTION 7: MLOPS AND MODEL DEPLOYMENT ─────────────────────

        MLOps (Machine Learning Operations) covers the practices of deploying,
        monitoring, and maintaining ML models in production.

        Key components:
        - Experiment Tracking: MLflow and Weights & Biases (W&B) track metrics,
          hyperparameters, and model artifacts across training runs.
        - Model Registry: Central repository for versioned model artifacts.
        - Serving Infrastructure: FastAPI, TorchServe, Triton Inference Server,
          or vLLM for high-throughput LLM serving.
        - Containerization: Docker packages models with all dependencies;
          Kubernetes orchestrates containers at scale.
        - CI/CD: GitHub Actions or GitLab CI automates testing, building,
          and deployment pipelines.
        - Monitoring: Track data drift, concept drift, latency, and error rates
          in production. Tools: Evidently AI, Arize, WhyLabs.

        Deployment platforms:
        - HuggingFace Spaces: Free hosting for Gradio/Streamlit ML demos.
        - AWS SageMaker: Enterprise ML deployment on AWS infrastructure.
        - Google Vertex AI: Managed ML platform on Google Cloud.
        - Replicate: API-first model deployment, pay-per-prediction.
        - Modal: Serverless GPU compute for ML inference.

        ── SECTION 8: RESPONSIBLE AI & SAFETY ────────────────────────

        As AI systems become more capable, ensuring they are safe, fair, and
        aligned with human values is a critical research and engineering challenge.

        Key principles:
        - Helpfulness: The system should assist users effectively.
        - Harmlessness: Avoid generating content that could cause real-world harm.
        - Honesty: Acknowledge uncertainty; do not hallucinate or deceive.

        Techniques:
        - RLHF (Reinforcement Learning from Human Feedback): Trains reward models
          from human preferences to guide LLM behavior.
        - Constitutional AI (Anthropic): Models self-critique and revise outputs
          against a set of principles.
        - Red Teaming: Adversarial testing to discover model failure modes.
        - Interpretability Research: Understanding internal model representations
          (mechanistic interpretability, probing classifiers, attention analysis).

        Regulatory landscape (2024):
        - EU AI Act: First comprehensive AI regulation, risk-based tiered approach.
        - US Executive Order on AI (Oct. 2023): Safety-testing requirements for
          large AI models.
        - China AI regulations: Content-moderation and algorithmic-transparency
          requirements for generative AI services.

        ═══════════════════════════════════════════════════════════════
        END OF DOCUMENT
        ═══════════════════════════════════════════════════════════════
    """).strip()
    dest.write_text(content, encoding="utf-8")
    print(f"✓ Generated synthetic document ({len(content)} chars)")
    return str(dest), fname
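

# Illustrative sketch (not used by the downloader): the "Chunking" step that
# SECTION 2 of the synthetic document describes — splitting text into
# overlapping segments. Sizes here are counted in words for simplicity; real
# RAG pipelines usually count tokens. The helper name `_demo_chunk_text` is
# ours, not a library API; it assumes chunk_size > overlap.
def _demo_chunk_text(text: str, chunk_size: int = 200, overlap: int = 50) -> list[str]:
    """Split `text` into overlapping word-based chunks."""
    words = text.split()
    step = chunk_size - overlap  # how far each chunk's start advances
    chunks = []
    for start in range(0, len(words), step):
        chunk = words[start:start + chunk_size]
        if chunk:
            chunks.append(" ".join(chunk))
        if start + chunk_size >= len(words):
            break  # last chunk already covered the tail
    return chunks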


if __name__ == "__main__":
    path, name = download_sample_doc()
    print(f"\nReady: {path} ({name})")
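

# Illustrative sketch (not used by the downloader): cosine similarity, the
# metric SECTION 4 of the synthetic document names for comparing embedding
# vectors. Pure-Python on toy vectors; real pipelines compute this over
# model-produced embeddings inside a vector store. `_demo_cosine` is ours.
import math


def _demo_cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Vectors pointing the same way score ~1.0; orthogonal vectors score 0.0.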