
Agentic AI System for Individual Information Collection & RAG-Based Search

A sophisticated autonomous intelligence system that discovers, collects, and indexes researcher profiles using multiple academic data sources, with semantic search and RAG-powered question answering capabilities.

🌟 Key Features

🤖 Autonomous Data Collection

  • Multi-source aggregation: Automatically collects data from OpenAlex, Google Scholar, and arXiv
  • Intelligent crawling: Adaptive strategies for discovering relevant individuals
  • Profile synthesis: Combines data from multiple sources into unified profiles (a merge sketch follows this list)
  • Batch processing: Efficiently collects data for multiple individuals
  • Caching: Avoids redundant API calls with an in-memory collection cache
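
To make profile synthesis concrete, here is a minimal sketch that merges per-source dicts into one record. The field names and the first-non-empty merge rule are illustrative assumptions, not the collector's actual logic.

# Hypothetical merge helper: keep the first non-empty value per field.
def merge_profiles(*source_profiles):
    merged = {}
    for profile in source_profiles:
        for key, value in profile.items():
            if value not in (None, "", [], {}) and key not in merged:
                merged[key] = value
    return merged

openalex_data = {"name": "Geoffrey Hinton", "h_index": 185, "affiliation": None}
scholar_data = {"name": "Geoffrey Hinton", "affiliation": "University of Toronto"}
unified = merge_profiles(openalex_data, scholar_data)  # fields from both sources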

๐Ÿ” Semantic Search

  • Vector embeddings: Uses sentence-transformers/all-MiniLM-L6-v2 for semantic understanding
  • In-memory vector store: Fast, efficient storage without external dependencies (sketched after this list)
  • Relevance ranking: Multi-factor scoring based on content similarity and metrics
  • Deduplication: Intelligent aggregation of search results
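
The sketch below shows the core idea behind the in-memory vector store: normalized embedding vectors held in a numpy array and searched by cosine similarity. It assumes embeddings are produced elsewhere (e.g. by all-MiniLM-L6-v2, which outputs 384-dimensional vectors); the actual store's interface may differ.

import numpy as np

class SimpleVectorStore:
    """Illustrative in-memory store; not the system's actual class."""

    def __init__(self, dim=384):  # all-MiniLM-L6-v2 embedding size
        self.vectors = np.empty((0, dim), dtype=np.float32)
        self.payloads = []  # parallel list of profile/chunk metadata

    def add(self, embedding, payload):
        unit = embedding / np.linalg.norm(embedding)  # normalize once, at insert time
        self.vectors = np.vstack([self.vectors, unit])
        self.payloads.append(payload)

    def search(self, query_embedding, k=5):
        query = query_embedding / np.linalg.norm(query_embedding)
        scores = self.vectors @ query              # cosine similarity per stored row
        top = np.argsort(scores)[::-1][:k]         # indices of the best k matches
        return [(self.payloads[i], float(scores[i])) for i in top]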

🧠 RAG-Powered Q&A

  • Context-aware synthesis: Uses Llama-3-8B-Instruct via the HuggingFace API (call sketched below)
  • Source attribution: Every answer includes relevant researcher profiles
  • No local models: All inference via API (no downloads required)
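
For illustration, the following is one way to call Meta-Llama-3-8B-Instruct through the HuggingFace Inference API using huggingface_hub's InferenceClient. The prompt wording and token limit are placeholders; the orchestrator's actual call may differ.

import os
from huggingface_hub import InferenceClient

client = InferenceClient(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    token=os.environ["HF_TOKEN"],  # see Configuration below
)

response = client.chat_completion(
    messages=[
        {"role": "system", "content": "Answer using only the provided researcher profiles."},
        {"role": "user", "content": "Who are the leading researchers in neural networks?"},
    ],
    max_tokens=512,
)
print(response.choices[0].message.content)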

📊 Rich Profile Data

Each collected profile includes (one plausible shape is sketched after this list):

  • Name, affiliation, biography
  • H-index, total citations, paper count
  • Research interests/topics
  • Recent publications
  • Profile URLs and metadata
  • Source attribution
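
Since the export example later in this README passes profiles to dataclasses.asdict, a dataclass like the following is one plausible shape for these records. The field names are inferred from the list above and are not guaranteed to match the implementation.

from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class ResearcherProfile:
    name: str
    affiliation: str = ""
    biography: str = ""
    h_index: int = 0
    total_citations: int = 0
    paper_count: int = 0
    interests: List[str] = field(default_factory=list)
    recent_publications: List[Dict] = field(default_factory=list)
    profile_url: str = ""
    sources: List[str] = field(default_factory=list)  # source attribution
    metadata: Dict = field(default_factory=dict)      # extra per-source data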

🚀 Quick Start

Installation

# Install dependencies
pip install flask langchain langchain-huggingface requests scholarly feedparser --break-system-packages

# Set HuggingFace token (required for LLM features)
export HF_TOKEN="your_huggingface_token_here"

Basic Usage

from agentic_rag_system import AgenticRAGOrchestrator

# Initialize the system
orchestrator = AgenticRAGOrchestrator()

# Autonomous discovery: Find and index experts in a field
result = orchestrator.discover_and_index(
    query="machine learning",
    max_profiles=20
)

# Search for specific expertise
search_results = orchestrator.search("deep learning", k=5)

# Ask questions and get synthesized answers
answer = orchestrator.ask(
    "Who are the leading researchers in neural networks?",
    k=5
)

print(answer['answer'])
for source in answer['sources']:
    print(f"- {source['name']} ({source['affiliation']})")

📚 Core Components

1. AgenticDataCollector

Autonomously collects comprehensive data about individuals.

from agentic_rag_system import AgenticDataCollector

collector = AgenticDataCollector()

# Collect data for a specific person
profile = collector.collect_individual_data(
    name="Geoffrey Hinton",
    additional_context="deep learning"
)

# Batch collection
names = ["Yann LeCun", "Yoshua Bengio", "Andrew Ng"]
profiles = collector.batch_collect(names, context="machine learning")

Features:

  • Multi-step collection pipeline
  • Caching to prevent redundant calls
  • Error handling and retries (backoff sketched below)
  • Progress tracking
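
The retry behavior can be pictured as exponential backoff around a flaky network call, reusing the collector from the snippet above. The attempt count and delays here are illustrative; the collector's internal policy is not documented in this README.

import time

def with_retries(fetch, attempts=3, base_delay=1.0):
    for attempt in range(attempts):
        try:
            return fetch()
        except Exception:
            if attempt == attempts - 1:
                raise                              # out of retries: surface the error
            time.sleep(base_delay * 2 ** attempt)  # wait 1s, 2s, 4s, ...

profile = with_retries(lambda: collector.collect_individual_data("Andrew Ng"))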

Data Sources:

  • OpenAlex: Comprehensive academic database (primary source; query sketched below)
  • Google Scholar: Citation metrics and h-index verification
  • arXiv: Recent publications and latest research output
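
The /authors search endpoint and the mailto "polite pool" parameter below are part of the public OpenAlex API; which fields the collector actually reads is an assumption.

import requests

resp = requests.get(
    "https://api.openalex.org/authors",
    params={"search": "Geoffrey Hinton", "mailto": "you@example.org"},
    timeout=10,
)
resp.raise_for_status()
author = resp.json()["results"][0]  # best-matching author record
print(author["display_name"], author.get("summary_stats", {}).get("h_index"))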

2. IntelligentRAGSystem

RAG system optimized for researcher profile search.

from agentic_rag_system import IntelligentRAGSystem

rag = IntelligentRAGSystem()

# Index profiles
rag.index_profiles(profiles)

# Search
results = rag.search("computer vision experts", k=5)

# Generate synthesized answer
answer = rag.synthesize_answer(
    "Which researchers focus on attention mechanisms?",
    k=5
)

Features:

  • Semantic chunking with overlap (sketched after this list)
  • Metadata-rich documents
  • Deduplication and aggregation
  • Context building for LLM prompts
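
Chunking with overlap can be sketched as a sliding window where adjacent chunks share some trailing text, so facts that straddle a boundary survive in at least one chunk. The sizes below are placeholders, not the system's actual settings.

def chunk_text(text, chunk_size=500, overlap=100):
    step = chunk_size - overlap
    end = max(len(text) - overlap, 1)
    return [text[i:i + chunk_size] for i in range(0, end, step)]

bio = "Geoffrey Hinton is a pioneer of deep learning..."
chunks = chunk_text(bio)  # adjacent chunks share 100 characters of context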

3. AgenticRAGOrchestrator

High-level orchestrator combining all components.

from agentic_rag_system import AgenticRAGOrchestrator

orchestrator = AgenticRAGOrchestrator()

# All-in-one: discover, collect, index
orchestrator.discover_and_index("quantum computing", max_profiles=15)

# Search
results = orchestrator.search("quantum algorithms", k=5)

# Ask questions
answer = orchestrator.ask("Who are the top quantum computing researchers?")

# Export data
orchestrator.export_profiles("/path/to/export.json")

๐ŸŒ Flask Integration

API Endpoints

1. Autonomous Discovery

POST /api/agentic/discover
Content-Type: application/json

{
  "query": "artificial intelligence",
  "max_profiles": 20
}

Response:

{
  "success": true,
  "profiles_collected": 18,
  "profiles_indexed": 18,
  "elapsed_time": 45.2,
  "query": "artificial intelligence"
}

2. Semantic Search

GET /api/agentic/search?q=neural%20networks&k=5

Response:

{
  "query": "neural networks",
  "results": [
    {
      "name": "Geoffrey Hinton",
      "affiliation": "University of Toronto",
      "h_index": 185,
      "total_citations": 487000,
      "profile_url": "https://openalex.org/authors/A1234567890",
      "relevance_score": 3
    }
  ],
  "total_indexed": 18
}

3. RAG Question Answering

POST /api/agentic/ask
Content-Type: application/json

{
  "question": "Who are the leading deep learning researchers?",
  "k": 5
}

Response:

{
  "answer": "Based on the indexed profiles, leading deep learning researchers include Geoffrey Hinton from University of Toronto with h-index of 185...",
  "sources": [...],
  "context_used": 5
}

4. Get All Profiles

GET /api/agentic/profiles

5. System Statistics

GET /api/agentic/stats

6. Collect Specific Individual

POST /api/agentic/collect-individual
Content-Type: application/json

{
  "name": "Andrew Ng",
  "context": "machine learning stanford"
}
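
A hedged client-side example for the endpoints above, assuming the Flask app is running locally on its default port (5000):

import requests

BASE = "http://localhost:5000"

# 1. Discover and index profiles for a topic
requests.post(f"{BASE}/api/agentic/discover",
              json={"query": "artificial intelligence", "max_profiles": 20})

# 2. Semantic search over the indexed profiles
hits = requests.get(f"{BASE}/api/agentic/search",
                    params={"q": "neural networks", "k": 5}).json()

# 3. RAG question answering with source attribution
answer = requests.post(f"{BASE}/api/agentic/ask",
                       json={"question": "Who are the leading deep learning researchers?", "k": 5}).json()
print(answer["answer"])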

Web Interface Routes

  • /rag - Main RAG search interface
  • /agentic-dashboard - System monitoring and control dashboard
  • /health - Health check endpoint

📖 Example Use Cases

Use Case 1: Building a Research Team

orchestrator = AgenticRAGOrchestrator()

# Discover experts in required areas
for expertise in ['medical imaging', 'deep learning', 'computer vision']:
    orchestrator.discover_and_index(expertise, max_profiles=10)

# Search for qualified candidates
results = orchestrator.search(
    "AI healthcare medical imaging deep learning",
    k=15
)

# Filter by criteria
qualified = [
    r for r in results['results']
    if r['h_index'] >= 20 and r['total_citations'] >= 5000
]

# Select team
team = qualified[:5]

Use Case 2: Literature Review Assistant

orchestrator = AgenticRAGOrchestrator()

# Build knowledge base for a topic
orchestrator.discover_and_index("transformer models NLP", max_profiles=30)

# Ask research questions
questions = [
    "Who pioneered transformer architectures?",
    "Which researchers focus on attention mechanisms?",
    "Who has recent work on large language models?"
]

for question in questions:
    answer = orchestrator.ask(question, k=5)
    print(f"Q: {question}")
    print(f"A: {answer['answer']}\n")

Use Case 3: Collaboration Discovery

orchestrator = AgenticRAGOrchestrator()

# Index your research area
orchestrator.discover_and_index("reinforcement learning", max_profiles=50)

# Find potential collaborators
results = orchestrator.search(
    "multi-agent systems game theory reinforcement learning",
    k=10
)

# Analyze collaboration potential
for researcher in results['results']:
    print(f"{researcher['name']}")
    print(f"  Interests: {', '.join(researcher.get('interests', []))}")
    print(f"  H-index: {researcher['h_index']}")

โš™๏ธ Configuration

Environment Variables

# Required for LLM generation
export HF_TOKEN="your_huggingface_token"

# Optional: Configure rate limits
export OPENALEX_RATE_LIMIT=10  # requests per second
export SCHOLAR_RATE_LIMIT=2    # requests per second
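
Inside the application these settings might be read as follows; the fallback defaults mirror the comments above, but the exact variable handling is an assumption.

import os

hf_token = os.environ.get("HF_TOKEN")                            # None disables LLM features
openalex_rps = int(os.environ.get("OPENALEX_RATE_LIMIT", "10"))  # requests per second
scholar_rps = int(os.environ.get("SCHOLAR_RATE_LIMIT", "2"))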

System Requirements

  • Python: 3.8+
  • Memory: 2GB+ RAM (for embeddings)
  • Network: Internet connection for API calls
  • Storage: Minimal (in-memory vector store)

Model Configuration

The system uses these models via the HuggingFace API (a configuration sketch follows the list):

  • Embeddings: sentence-transformers/all-MiniLM-L6-v2

    • Lightweight, fast, high-quality
    • No local download required
  • LLM: meta-llama/Meta-Llama-3-8B-Instruct

    • Via HuggingFace Inference API
    • Requires HF_TOKEN
    • No local download required
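
One way to wire these up with the langchain-huggingface package installed in Quick Start is shown below; both classes call the HuggingFace Inference API rather than downloading weights. Whether the system constructs them exactly this way is an assumption.

import os
from langchain_huggingface import HuggingFaceEndpoint, HuggingFaceEndpointEmbeddings

embeddings = HuggingFaceEndpointEmbeddings(
    model="sentence-transformers/all-MiniLM-L6-v2",
    huggingfacehub_api_token=os.environ["HF_TOKEN"],
)

llm = HuggingFaceEndpoint(
    repo_id="meta-llama/Meta-Llama-3-8B-Instruct",
    huggingfacehub_api_token=os.environ["HF_TOKEN"],
    max_new_tokens=512,
)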

🔧 Advanced Features

Custom Data Collection

class CustomCollector(AgenticDataCollector):
    def _execute_collection_pipeline(self, name, context):
        # Add custom data sources
        custom_data = self._collect_from_custom_source(name)
        
        # Call parent implementation
        profile = super()._execute_collection_pipeline(name, context)
        
        # Enrich profile
        profile.metadata['custom_data'] = custom_data
        return profile

Custom RAG Prompts

from langchain_core.prompts import ChatPromptTemplate

rag_system = IntelligentRAGSystem()

# Modify the system prompt
custom_prompt = ChatPromptTemplate.from_messages([
    ("system", "You are a domain-specific research assistant..."),
    ("user", "{query}\n\nContext: {context}")
])

# Use in synthesis
answer = rag_system.synthesize_answer(
    query="Who are the experts?",
    k=5,
    custom_prompt=custom_prompt
)

Export Formats

# JSON export
orchestrator.export_profiles("profiles.json")

# Custom export (requires pandas)
from dataclasses import asdict
import pandas as pd

profiles = orchestrator.get_all_profiles()
df = pd.DataFrame([asdict(p) for p in profiles])
df.to_csv("profiles.csv", index=False)

🎯 Performance Optimization

Batch Processing

# Efficient batch collection
names = [f"researcher_{i}" for i in range(100)]
batch_size = 10

for i in range(0, len(names), batch_size):
    batch = names[i:i+batch_size]
    profiles = collector.batch_collect(batch)
    rag_system.index_profiles(profiles)

Caching Strategy

# The system automatically caches collected profiles for 1 hour
# Force refresh by clearing cache:
collector.collection_memory.clear()
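
The 1-hour TTL described above could be implemented as in this sketch, which assumes collection_memory maps a cache key to a (timestamp, profile) pair; the actual internal structure is not documented here.

import time

TTL_SECONDS = 3600  # cache entries expire after 1 hour

def get_cached(memory, key):
    entry = memory.get(key)
    if entry is None:
        return None
    stored_at, profile = entry
    if time.time() - stored_at > TTL_SECONDS:
        del memory[key]  # expired: evict so the next lookup triggers a fresh fetch
        return None
    return profile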

Rate Limiting

import time

# Add delays between API calls
for name in names:
    profile = collector.collect_individual_data(name)
    time.sleep(1)  # 1 second delay

๐Ÿ› Troubleshooting

Common Issues

Issue: "No HF_TOKEN provided"

# Solution: Set environment variable
import os
os.environ['HF_TOKEN'] = 'your_token_here'

Issue: "Rate limit exceeded"

# Solution: Add delays or reduce batch size
collector = AgenticDataCollector()
collector.rate_limit = 1  # 1 request per second

Issue: "No profiles found"

# Solution: Try broader search terms
result = orchestrator.discover_and_index(
    "machine learning",  # Broader term
    max_profiles=30      # More profiles
)

📊 Monitoring & Logging

Enable Verbose Logging

import logging

logging.basicConfig(level=logging.DEBUG)
logger = logging.getLogger('agentic_rag_system')

Track Performance

import time

start = time.time()
result = orchestrator.discover_and_index("AI", max_profiles=20)
elapsed = time.time() - start

print(f"Time: {elapsed:.2f}s")
print(f"Rate: {result['profiles_collected']/elapsed:.2f} profiles/sec")

🔒 Security Considerations

  • API tokens are never logged or exposed
  • Rate limiting prevents abuse
  • A descriptive User-Agent header identifies requests as legitimate academic use
  • No scraping of paywalled content
  • Respects robots.txt and API terms of service

📄 License

This system respects academic data sources and their terms of service:

  • OpenAlex: CC0 License (public domain)
  • Google Scholar: Accessed via the scholarly library
  • arXiv: Open access repository

๐Ÿค Contributing

Contributions welcome! Areas for improvement:

  • Additional data sources (Semantic Scholar, ORCID, etc.)
  • Enhanced profile enrichment
  • Better deduplication algorithms
  • UI/UX improvements
  • Performance optimizations

📮 Support

For issues, questions, or feature requests:

  1. Check the troubleshooting section
  2. Review example usage scripts
  3. Examine system logs
  4. Contact the development team

🎓 Citation

If you use this system in your research, please cite:

@software{agentic_rag_system,
  title={Agentic RAG System for Academic Profile Collection},
  author={Your Organization},
  year={2025},
  url={https://github.com/your-repo}
}

๐Ÿ“ Changelog

Version 1.0.0 (2025-01-28)

  • Initial release
  • Multi-source data collection
  • Semantic search with vector embeddings
  • RAG-powered question answering
  • Flask API integration
  • Web dashboard

Built with: Python, LangChain, HuggingFace, the OpenAlex API, and Google Scholar (via scholarly)

Status: Production-ready ✅