Agentic AI System for Individual Information Collection & RAG-Based Search
A sophisticated autonomous intelligence system that discovers, collects, and indexes researcher profiles using multiple academic data sources, with semantic search and RAG-powered question answering capabilities.
Key Features
Autonomous Data Collection
- Multi-source aggregation: Automatically collects data from OpenAlex, Google Scholar, and arXiv
- Intelligent crawling: Adaptive strategies for discovering relevant individuals
- Profile synthesis: Combines data from multiple sources into unified profiles
- Batch processing: Efficiently collects data for multiple individuals
- Caching: Prevents redundant API calls with intelligent memory
Semantic Search
- Vector embeddings: Uses sentence-transformers/all-MiniLM-L6-v2 for semantic understanding
- In-memory vector store: Fast, efficient storage without external dependencies
- Relevance ranking: Multi-factor scoring based on content similarity and metrics
- Deduplication: Intelligent aggregation of search results
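The deduplication step can be sketched as merging hits that share a normalized name and keeping the best score. This is an illustrative sketch, not the system's actual implementation; the `dedupe_results` helper and the dict shape are assumptions:

```python
def dedupe_results(results):
    """Merge duplicate hits for the same researcher, keeping the best score."""
    merged = {}
    for r in results:
        key = r["name"].strip().lower()  # normalized name as the identity key
        if key not in merged or r["relevance_score"] > merged[key]["relevance_score"]:
            merged[key] = r
    # Return unique researchers, best-scoring first
    return sorted(merged.values(), key=lambda r: r["relevance_score"], reverse=True)

hits = [
    {"name": "Geoffrey Hinton", "relevance_score": 3},
    {"name": "geoffrey hinton ", "relevance_score": 1},
    {"name": "Yann LeCun", "relevance_score": 2},
]
print([h["name"] for h in dedupe_results(hits)])  # two unique names, Hinton first
```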
RAG-Powered Q&A
- Context-aware synthesis: Uses Llama-3-8B-Instruct via HuggingFace API
- Source attribution: Every answer includes relevant researcher profiles
- No local models: All inference via API (no downloads required)
Rich Profile Data
Each collected profile includes:
- Name, affiliation, biography
- H-index, total citations, paper count
- Research interests/topics
- Recent publications
- Profile URLs and metadata
- Source attribution
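The fields above can be pictured as a dataclass. The field names below are illustrative assumptions, not the actual class definition in the source:

```python
from dataclasses import dataclass, field, asdict

@dataclass
class ResearcherProfile:
    """Sketch of a unified profile; field names are illustrative."""
    name: str
    affiliation: str = ""
    biography: str = ""
    h_index: int = 0
    total_citations: int = 0
    paper_count: int = 0
    interests: list = field(default_factory=list)
    recent_publications: list = field(default_factory=list)
    profile_url: str = ""
    sources: list = field(default_factory=list)  # attribution: which APIs contributed

p = ResearcherProfile(name="Ada Lovelace", affiliation="Analytical Engine Lab")
print(asdict(p)["name"])
```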
Quick Start
Installation
# Install dependencies
pip install flask langchain langchain-huggingface requests scholarly feedparser --break-system-packages
# Set HuggingFace token (required for LLM features)
export HF_TOKEN="your_huggingface_token_here"
Basic Usage
from agentic_rag_system import AgenticRAGOrchestrator
# Initialize the system
orchestrator = AgenticRAGOrchestrator()
# Autonomous discovery: Find and index experts in a field
result = orchestrator.discover_and_index(
query="machine learning",
max_profiles=20
)
# Search for specific expertise
search_results = orchestrator.search("deep learning", k=5)
# Ask questions and get synthesized answers
answer = orchestrator.ask(
"Who are the leading researchers in neural networks?",
k=5
)
print(answer['answer'])
for source in answer['sources']:
print(f"- {source['name']} ({source['affiliation']})")
Core Components
1. AgenticDataCollector
Autonomously collects comprehensive data about individuals.
from agentic_rag_system import AgenticDataCollector
collector = AgenticDataCollector()
# Collect data for a specific person
profile = collector.collect_individual_data(
name="Geoffrey Hinton",
additional_context="deep learning"
)
# Batch collection
names = ["Yann LeCun", "Yoshua Bengio", "Andrew Ng"]
profiles = collector.batch_collect(names, context="machine learning")
Features:
- Multi-step collection pipeline
- Caching to prevent redundant calls
- Error handling and retries
- Progress tracking
Data Sources:
- OpenAlex: Comprehensive academic database (primary source)
- Google Scholar: Citation metrics and h-index verification
- Recent Publications: Latest research output
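For the primary source, author lookups go against the public OpenAlex authors endpoint. A minimal sketch of building such a query with the standard library (the optional `mailto` parameter opts into OpenAlex's polite pool; actually fetching and parsing the response is shown only in comments):

```python
from urllib.parse import urlencode

OPENALEX_AUTHORS = "https://api.openalex.org/authors"

def build_author_search_url(name, mailto=None):
    """Build an OpenAlex author-search URL for a researcher name."""
    params = {"search": name}
    if mailto:
        params["mailto"] = mailto  # identifies you for OpenAlex's polite pool
    return f"{OPENALEX_AUTHORS}?{urlencode(params)}"

url = build_author_search_url("Geoffrey Hinton")
print(url)
# Fetching is a plain GET, e.g.:
#   import requests
#   data = requests.get(url, timeout=10).json()
#   first = data["results"][0]  # author objects carry citation metadata
```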
2. IntelligentRAGSystem
RAG system optimized for researcher profile search.
from agentic_rag_system import IntelligentRAGSystem
rag = IntelligentRAGSystem()
# Index profiles
rag.index_profiles(profiles)
# Search
results = rag.search("computer vision experts", k=5)
# Generate synthesized answer
answer = rag.synthesize_answer(
"Which researchers focus on attention mechanisms?",
k=5
)
Features:
- Semantic chunking with overlap
- Metadata-rich documents
- Deduplication and aggregation
- Context building for LLM prompts
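Chunking with overlap can be sketched as a sliding window over words, so that context spanning a chunk boundary appears in both neighbors. The window sizes here are illustrative assumptions, not the system's actual parameters:

```python
def chunk_text(text, chunk_size=50, overlap=10):
    """Split text into word windows that overlap so context spans chunk edges."""
    if chunk_size <= overlap:
        raise ValueError("chunk_size must exceed overlap")
    words = text.split()
    step = chunk_size - overlap
    return [" ".join(words[i:i + chunk_size])
            for i in range(0, max(len(words) - overlap, 1), step)]

bio = " ".join(f"w{i}" for i in range(120))
chunks = chunk_text(bio, chunk_size=50, overlap=10)
print(len(chunks))  # 3 windows: words 0-49, 40-89, 80-119
```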
3. AgenticRAGOrchestrator
High-level orchestrator combining all components.
from agentic_rag_system import AgenticRAGOrchestrator
orchestrator = AgenticRAGOrchestrator()
# All-in-one: discover, collect, index
orchestrator.discover_and_index("quantum computing", max_profiles=15)
# Search
results = orchestrator.search("quantum algorithms", k=5)
# Ask questions
answer = orchestrator.ask("Who are the top quantum computing researchers?")
# Export data
orchestrator.export_profiles("/path/to/export.json")
Flask Integration
API Endpoints
1. Autonomous Discovery
POST /api/agentic/discover
Content-Type: application/json
{
"query": "artificial intelligence",
"max_profiles": 20
}
Response:
{
"success": true,
"profiles_collected": 18,
"profiles_indexed": 18,
"elapsed_time": 45.2,
"query": "artificial intelligence"
}
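From a Python client, the discovery endpoint can be called with `requests`. The host and port below are assumptions; adjust them to your deployment (the actual POST is shown in comments so the sketch runs without a live server):

```python
import json

# JSON body matching the endpoint's expected schema
payload = {"query": "artificial intelligence", "max_profiles": 20}
body = json.dumps(payload)
print(body)

# With the Flask app running locally, the call would look like:
#   import requests
#   resp = requests.post("http://localhost:5000/api/agentic/discover",
#                        data=body,
#                        headers={"Content-Type": "application/json"},
#                        timeout=120)
#   print(resp.json()["profiles_collected"])
```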
2. Semantic Search
GET /api/agentic/search?q=neural%20networks&k=5
Response:
{
"query": "neural networks",
"results": [
{
"name": "Geoffrey Hinton",
"affiliation": "University of Toronto",
"h_index": 185,
"total_citations": 487000,
"profile_url": "https://openalex.org/authors/A1234567890",
"relevance_score": 3
}
],
"total_indexed": 18
}
3. RAG Question Answering
POST /api/agentic/ask
Content-Type: application/json
{
"question": "Who are the leading deep learning researchers?",
"k": 5
}
Response:
{
"answer": "Based on the indexed profiles, leading deep learning researchers include Geoffrey Hinton from University of Toronto with h-index of 185...",
"sources": [...],
"context_used": 5
}
4. Get All Profiles
GET /api/agentic/profiles
5. System Statistics
GET /api/agentic/stats
6. Collect Specific Individual
POST /api/agentic/collect-individual
Content-Type: application/json
{
"name": "Andrew Ng",
"context": "machine learning stanford"
}
Web Interface Routes
- /rag - Main RAG search interface
- /agentic-dashboard - System monitoring and control dashboard
- /health - Health check endpoint
Example Use Cases
Use Case 1: Building a Research Team
orchestrator = AgenticRAGOrchestrator()
# Discover experts in required areas
for expertise in ['medical imaging', 'deep learning', 'computer vision']:
orchestrator.discover_and_index(expertise, max_profiles=10)
# Search for qualified candidates
results = orchestrator.search(
"AI healthcare medical imaging deep learning",
k=15
)
# Filter by criteria
qualified = [
r for r in results['results']
if r['h_index'] >= 20 and r['total_citations'] >= 5000
]
# Select team
team = qualified[:5]
Use Case 2: Literature Review Assistant
orchestrator = AgenticRAGOrchestrator()
# Build knowledge base for a topic
orchestrator.discover_and_index("transformer models NLP", max_profiles=30)
# Ask research questions
questions = [
"Who pioneered transformer architectures?",
"Which researchers focus on attention mechanisms?",
"Who has recent work on large language models?"
]
for question in questions:
answer = orchestrator.ask(question, k=5)
print(f"Q: {question}")
print(f"A: {answer['answer']}\n")
Use Case 3: Collaboration Discovery
orchestrator = AgenticRAGOrchestrator()
# Index your research area
orchestrator.discover_and_index("reinforcement learning", max_profiles=50)
# Find potential collaborators
results = orchestrator.search(
"multi-agent systems game theory reinforcement learning",
k=10
)
# Analyze collaboration potential
for researcher in results['results']:
print(f"{researcher['name']}")
print(f" Interests: {', '.join(researcher.get('interests', []))}")
print(f" H-index: {researcher['h_index']}")
Configuration
Environment Variables
# Required for LLM generation
export HF_TOKEN="your_huggingface_token"
# Optional: Configure rate limits
export OPENALEX_RATE_LIMIT=10 # requests per second
export SCHOLAR_RATE_LIMIT=2 # requests per second
System Requirements
- Python: 3.8+
- Memory: 2GB+ RAM (for embeddings)
- Network: Internet connection for API calls
- Storage: Minimal (in-memory vector store)
Model Configuration
The system uses these models via HuggingFace API:
Embeddings:
sentence-transformers/all-MiniLM-L6-v2 - Lightweight, fast, high-quality
- No local download required
LLM:
meta-llama/Meta-Llama-3-8B-Instruct - Via HuggingFace Inference API
- Requires HF_TOKEN
- No local download required
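Under the hood, semantic search reduces to cosine similarity between embedding vectors. A dependency-free illustration with toy 3-dimensional vectors (real all-MiniLM-L6-v2 embeddings are 384-dimensional):

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

query = [1.0, 0.0, 1.0]
docs = {
    "deep learning":     [0.9, 0.1, 0.8],  # points roughly the same way as query
    "organic chemistry": [0.0, 1.0, 0.1],  # nearly orthogonal to query
}
ranked = sorted(docs, key=lambda d: cosine(query, docs[d]), reverse=True)
print(ranked[0])  # "deep learning" ranks first
```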
Advanced Features
Custom Data Collection
class CustomCollector(AgenticDataCollector):
def _execute_collection_pipeline(self, name, context):
# Add custom data sources
custom_data = self._collect_from_custom_source(name)
# Call parent implementation
profile = super()._execute_collection_pipeline(name, context)
# Enrich profile
profile.metadata['custom_data'] = custom_data
return profile
Custom RAG Prompts
from langchain_core.prompts import ChatPromptTemplate

rag_system = IntelligentRAGSystem()
# Modify the system prompt
custom_prompt = ChatPromptTemplate.from_messages([
("system", "You are a domain-specific research assistant..."),
("user", "{query}\n\nContext: {context}")
])
# Use in synthesis
answer = rag_system.synthesize_answer(
query="Who are the experts?",
k=5,
custom_prompt=custom_prompt
)
Export Formats
# JSON export
orchestrator.export_profiles("profiles.json")
# Custom export (requires pandas)
from dataclasses import asdict
import pandas as pd

profiles = orchestrator.get_all_profiles()
df = pd.DataFrame([asdict(p) for p in profiles])
df.to_csv("profiles.csv", index=False)
Performance Optimization
Batch Processing
# Efficient batch collection
names = [f"researcher_{i}" for i in range(100)]
batch_size = 10
for i in range(0, len(names), batch_size):
batch = names[i:i+batch_size]
profiles = collector.batch_collect(batch)
rag_system.index_profiles(profiles)
Caching Strategy
# The system automatically caches collected profiles for 1 hour
# Force refresh by clearing cache:
collector.collection_memory.clear()
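The 1-hour cache can be pictured as a timestamped dict. This is a minimal sketch of the idea; the class and method names below are illustrative, not the actual `collection_memory` implementation:

```python
import time

class TTLCache:
    """Dict-backed cache whose entries expire after a fixed time-to-live."""
    def __init__(self, ttl_seconds=3600):
        self.ttl = ttl_seconds
        self._store = {}  # key -> (timestamp, value)

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        ts, value = entry
        if time.time() - ts > self.ttl:  # expired: drop the entry and miss
            del self._store[key]
            return None
        return value

    def set(self, key, value):
        self._store[key] = (time.time(), value)

    def clear(self):
        self._store.clear()

cache = TTLCache(ttl_seconds=3600)
cache.set("Geoffrey Hinton", {"h_index": 185})
print(cache.get("Geoffrey Hinton"))  # fresh entry: hit
```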
Rate Limiting
import time
# Add delays between API calls
for name in names:
profile = collector.collect_individual_data(name)
time.sleep(1) # 1 second delay
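The fixed sleep above is simple but wastes time when the API call itself is already slow. A minimal limiter that only sleeps for the remaining interval (a sketch; the `collector.rate_limit` attribute mentioned later may be implemented differently):

```python
import time

class RateLimiter:
    """Enforce a minimum interval between calls, e.g. 1 request per second."""
    def __init__(self, calls_per_second=1.0):
        self.min_interval = 1.0 / calls_per_second
        self._last = 0.0

    def wait(self):
        now = time.monotonic()
        remaining = self.min_interval - (now - self._last)
        if remaining > 0:
            time.sleep(remaining)  # sleep only for what's left of the interval
        self._last = time.monotonic()

limiter = RateLimiter(calls_per_second=5)  # at most one call per 0.2 s
start = time.monotonic()
for _ in range(3):
    limiter.wait()  # in real use: call the API here
elapsed = time.monotonic() - start
print(round(elapsed, 1))
```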
Troubleshooting
Common Issues
Issue: "No HF_TOKEN provided"
# Solution: Set environment variable
import os
os.environ['HF_TOKEN'] = 'your_token_here'
Issue: "Rate limit exceeded"
# Solution: Add delays or reduce batch size
collector = AgenticDataCollector()
collector.rate_limit = 1 # 1 request per second
Issue: "No profiles found"
# Solution: Try broader search terms
result = orchestrator.discover_and_index(
"machine learning", # Broader term
max_profiles=30 # More profiles
)
Monitoring & Logging
Enable Verbose Logging
import logging
logging.basicConfig(level=logging.DEBUG)
logger = logging.getLogger('agentic_rag_system')
Track Performance
import time
start = time.time()
result = orchestrator.discover_and_index("AI", max_profiles=20)
elapsed = time.time() - start
print(f"Time: {elapsed:.2f}s")
print(f"Rate: {result['profiles_collected']/elapsed:.2f} profiles/sec")
Security Considerations
- API tokens are never logged or exposed
- Rate limiting prevents abuse
- User agent identifies legitimate academic use
- No scraping of paywalled content
- Respects robots.txt and API terms of service
License
This system respects academic data sources and their terms of service:
- OpenAlex: CC0 License (public domain)
- Google Scholar: Accessed via the scholarly library
- arXiv: Open access repository
Contributing
Contributions welcome! Areas for improvement:
- Additional data sources (Semantic Scholar, ORCID, etc.)
- Enhanced profile enrichment
- Better deduplication algorithms
- UI/UX improvements
- Performance optimizations
Support
For issues, questions, or feature requests:
- Check the troubleshooting section
- Review example usage scripts
- Examine system logs
- Contact the development team
Citation
If you use this system in your research, please cite:
@software{agentic_rag_system,
title={Agentic RAG System for Academic Profile Collection},
author={Your Organization},
year={2025},
url={https://github.com/your-repo}
}
Changelog
Version 1.0.0 (2025-01-28)
- Initial release
- Multi-source data collection
- Semantic search with vector embeddings
- RAG-powered question answering
- Flask API integration
- Web dashboard
Built with: Python, LangChain, HuggingFace, OpenAlex API, Google Scholar API
Status: Production-ready