Agentic AI System for Individual Information Collection & RAG-Based Search
A sophisticated autonomous intelligence system that discovers, collects, and indexes researcher profiles using multiple academic data sources, with semantic search and RAG-powered question answering capabilities.
Key Features
Autonomous Data Collection
- Multi-source aggregation: Automatically collects data from OpenAlex, Google Scholar, and arXiv
- Intelligent crawling: Adaptive strategies for discovering relevant individuals
- Profile synthesis: Combines data from multiple sources into unified profiles
- Batch processing: Efficiently collects data for multiple individuals
- Caching: Prevents redundant API calls with intelligent memory
Semantic Search
- Vector embeddings: Uses sentence-transformers/all-MiniLM-L6-v2 for semantic understanding
- In-memory vector store: Fast, efficient storage without external dependencies
- Relevance ranking: Multi-factor scoring based on content similarity and metrics
- Deduplication: Intelligent aggregation of search results
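The deduplication step can be sketched as merging hits that share a normalized name and keeping the best score. This is an illustrative sketch, not the system's actual implementation; the `dedupe_results` helper and the dict shape are assumptions:

```python
def dedupe_results(results):
    """Merge duplicate hits for the same researcher, keeping the best score."""
    merged = {}
    for r in results:
        key = r["name"].strip().lower()  # normalized name as the identity key
        if key not in merged or r["relevance_score"] > merged[key]["relevance_score"]:
            merged[key] = r
    # Return unique researchers, best-scoring first
    return sorted(merged.values(), key=lambda r: r["relevance_score"], reverse=True)

hits = [
    {"name": "Geoffrey Hinton", "relevance_score": 3},
    {"name": "geoffrey hinton ", "relevance_score": 1},
    {"name": "Yann LeCun", "relevance_score": 2},
]
print([h["name"] for h in dedupe_results(hits)])  # two unique names, Hinton first
```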
RAG-Powered Q&A
- Context-aware synthesis: Uses Llama-3-8B-Instruct via HuggingFace API
- Source attribution: Every answer includes relevant researcher profiles
- No local models: All inference via API (no downloads required)
Rich Profile Data
Each collected profile includes:
- Name, affiliation, biography
- H-index, total citations, paper count
- Research interests/topics
- Recent publications
- Profile URLs and metadata
- Source attribution
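The fields above can be pictured as a dataclass. The field names below are illustrative assumptions, not the actual class definition in the source:

```python
from dataclasses import dataclass, field, asdict

@dataclass
class ResearcherProfile:
    """Sketch of a unified profile; field names are illustrative."""
    name: str
    affiliation: str = ""
    biography: str = ""
    h_index: int = 0
    total_citations: int = 0
    paper_count: int = 0
    interests: list = field(default_factory=list)
    recent_publications: list = field(default_factory=list)
    profile_url: str = ""
    sources: list = field(default_factory=list)  # attribution: which APIs contributed

p = ResearcherProfile(name="Ada Lovelace", affiliation="Analytical Engine Lab")
print(asdict(p)["name"])
```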
Quick Start
Installation
# Install dependencies
pip install flask langchain langchain-huggingface requests scholarly feedparser --break-system-packages
# Set HuggingFace token (required for LLM features)
export HF_TOKEN="your_huggingface_token_here"
Basic Usage
from agentic_rag_system import AgenticRAGOrchestrator
# Initialize the system
orchestrator = AgenticRAGOrchestrator()
# Autonomous discovery: Find and index experts in a field
result = orchestrator.discover_and_index(
query="machine learning",
max_profiles=20
)
# Search for specific expertise
search_results = orchestrator.search("deep learning", k=5)
# Ask questions and get synthesized answers
answer = orchestrator.ask(
"Who are the leading researchers in neural networks?",
k=5
)
print(answer['answer'])
for source in answer['sources']:
print(f"- {source['name']} ({source['affiliation']})")
Core Components
1. AgenticDataCollector
Autonomously collects comprehensive data about individuals.
from agentic_rag_system import AgenticDataCollector
collector = AgenticDataCollector()
# Collect data for a specific person
profile = collector.collect_individual_data(
name="Geoffrey Hinton",
additional_context="deep learning"
)
# Batch collection
names = ["Yann LeCun", "Yoshua Bengio", "Andrew Ng"]
profiles = collector.batch_collect(names, context="machine learning")
Features:
- Multi-step collection pipeline
- Caching to prevent redundant calls
- Error handling and retries
- Progress tracking
Data Sources:
- OpenAlex: Comprehensive academic database (primary source)
- Google Scholar: Citation metrics and h-index verification
- Recent Publications: Latest research output
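For the primary source, author lookups go against the public OpenAlex authors endpoint. A minimal sketch of building such a query with the standard library (the optional `mailto` parameter opts into OpenAlex's polite pool; actually fetching and parsing the response is shown only in comments):

```python
from urllib.parse import urlencode

OPENALEX_AUTHORS = "https://api.openalex.org/authors"

def build_author_search_url(name, mailto=None):
    """Build an OpenAlex author-search URL for a researcher name."""
    params = {"search": name}
    if mailto:
        params["mailto"] = mailto  # identifies you for OpenAlex's polite pool
    return f"{OPENALEX_AUTHORS}?{urlencode(params)}"

url = build_author_search_url("Geoffrey Hinton")
print(url)
# Fetching is a plain GET, e.g.:
#   import requests
#   data = requests.get(url, timeout=10).json()
#   first = data["results"][0]  # author objects carry citation metadata
```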
2. IntelligentRAGSystem
RAG system optimized for researcher profile search.
from agentic_rag_system import IntelligentRAGSystem
rag = IntelligentRAGSystem()
# Index profiles
rag.index_profiles(profiles)
# Search
results = rag.search("computer vision experts", k=5)
# Generate synthesized answer
answer = rag.synthesize_answer(
"Which researchers focus on attention mechanisms?",
k=5
)
Features:
- Semantic chunking with overlap
- Metadata-rich documents
- Deduplication and aggregation
- Context building for LLM prompts
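Chunking with overlap can be sketched as a sliding window over words, so that context spanning a chunk boundary appears in both neighbors. The window sizes here are illustrative assumptions, not the system's actual parameters:

```python
def chunk_text(text, chunk_size=50, overlap=10):
    """Split text into word windows that overlap so context spans chunk edges."""
    if chunk_size <= overlap:
        raise ValueError("chunk_size must exceed overlap")
    words = text.split()
    step = chunk_size - overlap
    return [" ".join(words[i:i + chunk_size])
            for i in range(0, max(len(words) - overlap, 1), step)]

bio = " ".join(f"w{i}" for i in range(120))
chunks = chunk_text(bio, chunk_size=50, overlap=10)
print(len(chunks))  # 3 windows: words 0-49, 40-89, 80-119
```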
3. AgenticRAGOrchestrator
High-level orchestrator combining all components.
from agentic_rag_system import AgenticRAGOrchestrator
orchestrator = AgenticRAGOrchestrator()
# All-in-one: discover, collect, index
orchestrator.discover_and_index("quantum computing", max_profiles=15)
# Search
results = orchestrator.search("quantum algorithms", k=5)
# Ask questions
answer = orchestrator.ask("Who are the top quantum computing researchers?")
# Export data
orchestrator.export_profiles("/path/to/export.json")
Flask Integration
API Endpoints
1. Autonomous Discovery
POST /api/agentic/discover
Content-Type: application/json
{
"query": "artificial intelligence",
"max_profiles": 20
}
Response:
{
"success": true,
"profiles_collected": 18,
"profiles_indexed": 18,
"elapsed_time": 45.2,
"query": "artificial intelligence"
}
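From a Python client, the discovery endpoint can be called with `requests`. The host and port below are assumptions; adjust them to your deployment (the actual POST is shown in comments so the sketch runs without a live server):

```python
import json

# JSON body matching the endpoint's expected schema
payload = {"query": "artificial intelligence", "max_profiles": 20}
body = json.dumps(payload)
print(body)

# With the Flask app running locally, the call would look like:
#   import requests
#   resp = requests.post("http://localhost:5000/api/agentic/discover",
#                        data=body,
#                        headers={"Content-Type": "application/json"},
#                        timeout=120)
#   print(resp.json()["profiles_collected"])
```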
2. Semantic Search
GET /api/agentic/search?q=neural%20networks&k=5
Response:
{
"query": "neural networks",
"results": [
{
"name": "Geoffrey Hinton",
"affiliation": "University of Toronto",
"h_index": 185,
"total_citations": 487000,
"profile_url": "https://openalex.org/authors/A1234567890",
"relevance_score": 3
}
],
"total_indexed": 18
}
3. RAG Question Answering
POST /api/agentic/ask
Content-Type: application/json
{
"question": "Who are the leading deep learning researchers?",
"k": 5
}
Response:
{
"answer": "Based on the indexed profiles, leading deep learning researchers include Geoffrey Hinton from University of Toronto with h-index of 185...",
"sources": [...],
"context_used": 5
}
4. Get All Profiles
GET /api/agentic/profiles
5. System Statistics
GET /api/agentic/stats
6. Collect Specific Individual
POST /api/agentic/collect-individual
Content-Type: application/json
{
"name": "Andrew Ng",
"context": "machine learning stanford"
}
Web Interface Routes
- /rag - Main RAG search interface
- /agentic-dashboard - System monitoring and control dashboard
- /health - Health check endpoint
Example Use Cases
Use Case 1: Building a Research Team
orchestrator = AgenticRAGOrchestrator()
# Discover experts in required areas
for expertise in ['medical imaging', 'deep learning', 'computer vision']:
orchestrator.discover_and_index(expertise, max_profiles=10)
# Search for qualified candidates
results = orchestrator.search(
"AI healthcare medical imaging deep learning",
k=15
)
# Filter by criteria
qualified = [
r for r in results['results']
if r['h_index'] >= 20 and r['total_citations'] >= 5000
]
# Select team
team = qualified[:5]
Use Case 2: Literature Review Assistant
orchestrator = AgenticRAGOrchestrator()
# Build knowledge base for a topic
orchestrator.discover_and_index("transformer models NLP", max_profiles=30)
# Ask research questions
questions = [
"Who pioneered transformer architectures?",
"Which researchers focus on attention mechanisms?",
"Who has recent work on large language models?"
]
for question in questions:
answer = orchestrator.ask(question, k=5)
print(f"Q: {question}")
print(f"A: {answer['answer']}\n")
Use Case 3: Collaboration Discovery
orchestrator = AgenticRAGOrchestrator()
# Index your research area
orchestrator.discover_and_index("reinforcement learning", max_profiles=50)
# Find potential collaborators
results = orchestrator.search(
"multi-agent systems game theory reinforcement learning",
k=10
)
# Analyze collaboration potential
for researcher in results['results']:
print(f"{researcher['name']}")
print(f" Interests: {', '.join(researcher.get('interests', []))}")
print(f" H-index: {researcher['h_index']}")
Configuration
Environment Variables
# Required for LLM generation
export HF_TOKEN="your_huggingface_token"
# Optional: Configure rate limits
export OPENALEX_RATE_LIMIT=10 # requests per second
export SCHOLAR_RATE_LIMIT=2 # requests per second
System Requirements
- Python: 3.8+
- Memory: 2GB+ RAM (for embeddings)
- Network: Internet connection for API calls
- Storage: Minimal (in-memory vector store)
Model Configuration
The system uses these models via HuggingFace API:
Embeddings:
sentence-transformers/all-MiniLM-L6-v2 - Lightweight, fast, high-quality
- No local download required
LLM:
meta-llama/Meta-Llama-3-8B-Instruct - Via HuggingFace Inference API
- Requires HF_TOKEN
- No local download required
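Under the hood, semantic search reduces to cosine similarity between embedding vectors. A dependency-free illustration with toy 3-dimensional vectors (real all-MiniLM-L6-v2 embeddings are 384-dimensional):

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

query = [1.0, 0.0, 1.0]
docs = {
    "deep learning":     [0.9, 0.1, 0.8],  # points roughly the same way as query
    "organic chemistry": [0.0, 1.0, 0.1],  # nearly orthogonal to query
}
ranked = sorted(docs, key=lambda d: cosine(query, docs[d]), reverse=True)
print(ranked[0])  # "deep learning" ranks first
```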
Advanced Features
Custom Data Collection
class CustomCollector(AgenticDataCollector):
def _execute_collection_pipeline(self, name, context):
# Add custom data sources
custom_data = self._collect_from_custom_source(name)
# Call parent implementation
profile = super()._execute_collection_pipeline(name, context)
# Enrich profile
profile.metadata['custom_data'] = custom_data
return profile
Custom RAG Prompts
from langchain_core.prompts import ChatPromptTemplate

rag_system = IntelligentRAGSystem()
# Modify the system prompt
custom_prompt = ChatPromptTemplate.from_messages([
("system", "You are a domain-specific research assistant..."),
("user", "{query}\n\nContext: {context}")
])
# Use in synthesis
answer = rag_system.synthesize_answer(
query="Who are the experts?",
k=5,
custom_prompt=custom_prompt
)
Export Formats
# JSON export
orchestrator.export_profiles("profiles.json")
# Custom export (requires pandas)
from dataclasses import asdict
import pandas as pd

profiles = orchestrator.get_all_profiles()
df = pd.DataFrame([asdict(p) for p in profiles])
df.to_csv("profiles.csv", index=False)
Performance Optimization
Batch Processing
# Efficient batch collection
names = [f"researcher_{i}" for i in range(100)]
batch_size = 10
for i in range(0, len(names), batch_size):
batch = names[i:i+batch_size]
profiles = collector.batch_collect(batch)
rag_system.index_profiles(profiles)
Caching Strategy
# The system automatically caches collected profiles for 1 hour
# Force refresh by clearing cache:
collector.collection_memory.clear()
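The 1-hour cache can be pictured as a timestamped dict. This is a minimal sketch of the idea; the class and method names below are illustrative, not the actual `collection_memory` implementation:

```python
import time

class TTLCache:
    """Dict-backed cache whose entries expire after a fixed time-to-live."""
    def __init__(self, ttl_seconds=3600):
        self.ttl = ttl_seconds
        self._store = {}  # key -> (timestamp, value)

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        ts, value = entry
        if time.time() - ts > self.ttl:  # expired: drop the entry and miss
            del self._store[key]
            return None
        return value

    def set(self, key, value):
        self._store[key] = (time.time(), value)

    def clear(self):
        self._store.clear()

cache = TTLCache(ttl_seconds=3600)
cache.set("Geoffrey Hinton", {"h_index": 185})
print(cache.get("Geoffrey Hinton"))  # fresh entry: hit
```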
Rate Limiting
import time
# Add delays between API calls
for name in names:
profile = collector.collect_individual_data(name)
time.sleep(1) # 1 second delay
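The fixed sleep above is simple but wastes time when the API call itself is already slow. A minimal limiter that only sleeps for the remaining interval (a sketch; the `collector.rate_limit` attribute mentioned later may be implemented differently):

```python
import time

class RateLimiter:
    """Enforce a minimum interval between calls, e.g. 1 request per second."""
    def __init__(self, calls_per_second=1.0):
        self.min_interval = 1.0 / calls_per_second
        self._last = 0.0

    def wait(self):
        now = time.monotonic()
        remaining = self.min_interval - (now - self._last)
        if remaining > 0:
            time.sleep(remaining)  # sleep only for what's left of the interval
        self._last = time.monotonic()

limiter = RateLimiter(calls_per_second=5)  # at most one call per 0.2 s
start = time.monotonic()
for _ in range(3):
    limiter.wait()  # in real use: call the API here
elapsed = time.monotonic() - start
print(round(elapsed, 1))
```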
Troubleshooting
Common Issues
Issue: "No HF_TOKEN provided"
# Solution: Set environment variable
import os
os.environ['HF_TOKEN'] = 'your_token_here'
Issue: "Rate limit exceeded"
# Solution: Add delays or reduce batch size
collector = AgenticDataCollector()
collector.rate_limit = 1 # 1 request per second
Issue: "No profiles found"
# Solution: Try broader search terms
result = orchestrator.discover_and_index(
"machine learning", # Broader term
max_profiles=30 # More profiles
)
Monitoring & Logging
Enable Verbose Logging
import logging
logging.basicConfig(level=logging.DEBUG)
logger = logging.getLogger('agentic_rag_system')
Track Performance
import time
start = time.time()
result = orchestrator.discover_and_index("AI", max_profiles=20)
elapsed = time.time() - start
print(f"Time: {elapsed:.2f}s")
print(f"Rate: {result['profiles_collected']/elapsed:.2f} profiles/sec")
Security Considerations
- API tokens are never logged or exposed
- Rate limiting prevents abuse
- User agent identifies legitimate academic use
- No scraping of paywalled content
- Respects robots.txt and API terms of service
License
This system respects academic data sources and their terms of service:
- OpenAlex: CC0 License (public domain)
- Google Scholar: Accessed via the scholarly library
- arXiv: Open access repository
Contributing
Contributions welcome! Areas for improvement:
- Additional data sources (Semantic Scholar, ORCID, etc.)
- Enhanced profile enrichment
- Better deduplication algorithms
- UI/UX improvements
- Performance optimizations
Support
For issues, questions, or feature requests:
- Check the troubleshooting section
- Review example usage scripts
- Examine system logs
- Contact the development team
Citation
If you use this system in your research, please cite:
@software{agentic_rag_system,
title={Agentic RAG System for Academic Profile Collection},
author={Your Organization},
year={2025},
url={https://github.com/your-repo}
}
Changelog
Version 1.0.0 (2025-01-28)
- Initial release
- Multi-source data collection
- Semantic search with vector embeddings
- RAG-powered question answering
- Flask API integration
- Web dashboard
Built with: Python, LangChain, HuggingFace, OpenAlex API, Google Scholar API
Status: Production-ready