AdityaAdaki committed
Commit 4ac113f · 1 Parent(s): 9d37152

ui updates and readme add
README.md CHANGED
@@ -1,12 +1,129 @@
# F1-AI: Formula 1 RAG Application

F1-AI is a Retrieval-Augmented Generation (RAG) application specifically designed for Formula 1 information. It features an intelligent web scraper that automatically discovers and extracts Formula 1-related content from the web, stores it in a vector database, and enables natural language querying of the stored information.

## Features

![F1-AI Demo]()

- Web scraping of Formula 1 content with automatic content extraction
- Vector database storage using Pinecone for efficient similarity search
- OpenRouter integration with Mistral-7B-Instruct model for advanced LLM capabilities
- HuggingFace embeddings for improved semantic understanding
- RAG-powered question answering with contextual understanding and source citations
- Command-line interface for automation and scripting
- User-friendly Streamlit web interface with chat history
- Asynchronous data ingestion and processing for improved performance

## Architecture

F1-AI is built on a modern tech stack:

- **LangChain**: Orchestrates the RAG pipeline and manages interactions between components
- **Pinecone**: Vector database for storing and retrieving embeddings
- **OpenRouter**: Primary LLM provider with Mistral-7B-Instruct model
- **HuggingFace**: Provides all-MiniLM-L6-v2 embeddings model
- **Playwright**: Handles web scraping with JavaScript support
- **BeautifulSoup4**: Processes HTML content and extracts relevant information
- **Streamlit**: Provides an interactive web interface with chat functionality
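
A minimal sketch of how these pieces could be wired together with LangChain (the index name, import paths, and `k` value are illustrative assumptions, not the project's actual code):

```python
# Hypothetical wiring of the stack above; names are illustrative.
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_pinecone import PineconeVectorStore

# all-MiniLM-L6-v2 produces 384-dimensional sentence embeddings
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")

# Assumes a Pinecone index (here "f1-ai") already exists and
# PINECONE_API_KEY is set in the environment
vectorstore = PineconeVectorStore(index_name="f1-ai", embedding=embeddings)

# Fetch the most relevant stored chunks for a question
retriever = vectorstore.as_retriever(search_kwargs={"k": 4})
docs = retriever.invoke("Who won the 2023 F1 World Championship?")
```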

## Prerequisites

- Python 3.8 or higher
- OpenRouter API key (set as OPENROUTER_API_KEY environment variable)
- Pinecone API key (set as PINECONE_API_KEY environment variable)
- 8GB RAM minimum (16GB recommended)
- Internet connection for web scraping

## Installation

1. Clone the repository:
```bash
git clone <repository-url>
cd f1-ai
```

2. Install the required dependencies:
```bash
pip install -r requirements.txt
```

3. Install Playwright browsers:
```bash
playwright install chromium
```

4. Set up environment variables by creating a .env file with:
```
OPENROUTER_API_KEY=your_api_key_here  # Required for LLM functionality
PINECONE_API_KEY=your_api_key_here    # Required for vector storage
```
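
To confirm both variables are picked up, a quick check with python-dotenv (already a project dependency) might look like this; the snippet is illustrative, not part of the repository:

```python
# check_env.py -- hypothetical helper script
import os
from dotenv import load_dotenv

load_dotenv()  # reads .env from the current directory
for key in ("OPENROUTER_API_KEY", "PINECONE_API_KEY"):
    if not os.getenv(key):
        raise SystemExit(f"{key} is not set; add it to your .env file")
print("Environment looks good.")
```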

## Usage

### Command Line Interface

1. Scrape and ingest F1 content:
```bash
python f1_scraper.py --start-urls https://www.formula1.com/ --max-pages 100 --depth 2 --ingest
```

Options:
- `--start-urls`: Space-separated list of URLs to start crawling from
- `--max-pages`: Maximum number of pages to crawl (default: 100)
- `--depth`: Maximum crawl depth (default: 2)
- `--ingest`: Flag to ingest discovered content into the RAG system
- `--max-chunks`: Maximum chunks per URL for ingestion (default: 50)
- `--output`: Output file for discovered URLs (default: f1_urls.txt)
- `--llm-provider`: LLM provider to use, `ollama` or `openrouter` (default: openrouter)

2. Ask questions about Formula 1:
```bash
python f1_ai.py ask "Who won the 2023 F1 World Championship?"
```

### Streamlit Interface

Run the Streamlit app:
```bash
streamlit run app.py
```

This will open a web interface where you can:
- Ask questions about Formula 1
- View responses in a chat-like interface
- See source citations for answers
- Track conversation history
- Get real-time updates on response generation

## Project Structure

- `f1_scraper.py`: Intelligent web crawler implementation
  - Automatically discovers F1-related content using keyword scoring
  - Handles content relevance detection with priority paths
  - Manages crawling depth and limits
  - Implements domain-specific filtering
- `f1_ai.py`: Core RAG application implementation
  - Handles data ingestion and chunking
  - Manages vector database operations
  - Implements question-answering logic with source tracking
  - Provides robust error handling
- `llm_manager.py`: LLM provider management (see the sketch after this list)
  - Integrates with OpenRouter for advanced LLM capabilities
  - Manages HuggingFace embeddings generation
  - Implements rate limiting and error recovery
  - Handles async API interactions
- `app.py`: Streamlit web interface
  - Provides chat-based UI with message history
  - Manages conversation state
  - Handles async operations with progress tracking
  - Implements error handling and user feedback
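
As a rough illustration of the kind of request `llm_manager.py` issues, OpenRouter exposes an OpenAI-compatible chat-completions endpoint; the function name and parameters below are assumptions, not the module's actual interface:

```python
# Hypothetical async OpenRouter call; the real llm_manager.py may differ.
import os
import httpx

async def ask_llm(prompt: str) -> str:
    async with httpx.AsyncClient(timeout=60) as client:
        resp = await client.post(
            "https://openrouter.ai/api/v1/chat/completions",
            headers={"Authorization": f"Bearer {os.environ['OPENROUTER_API_KEY']}"},
            json={
                "model": "mistralai/mistral-7b-instruct",
                "messages": [{"role": "user", "content": prompt}],
            },
        )
        resp.raise_for_status()
        return resp.json()["choices"][0]["message"]["content"]
```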

## Contributing

Contributions are welcome! Please follow these steps:

1. Fork the repository
2. Create a feature branch
3. Commit your changes
4. Push to the branch
5. Submit a Pull Request

app.py CHANGED
@@ -24,98 +24,67 @@ st.markdown("""
 This application uses Retrieval-Augmented Generation (RAG) to answer questions about Formula 1.
 """)
 
+# Custom CSS for better styling
+st.markdown("""
+<style>
+    .stChatMessage {
+        padding: 1rem;
+        border-radius: 0.5rem;
+        margin-bottom: 1rem;
+        box-shadow: 0 2px 4px rgba(0,0,0,0.1);
+    }
+    .stChatMessage.user {
+        background-color: #f0f2f6;
+    }
+    .stChatMessage.assistant {
+        background-color: #ffffff;
+    }
+    .source-link {
+        font-size: 0.8rem;
+        color: #666;
+        text-decoration: none;
+    }
+</style>
+""", unsafe_allow_html=True)
+
+# Display chat history with enhanced formatting
+for message in st.session_state.chat_history:
+    with st.chat_message(message["role"]):
+        if message["role"] == "assistant" and isinstance(message["content"], dict):
+            st.markdown(message["content"]["answer"])
+            if message["content"]["sources"]:
+                st.markdown("---")
+                st.markdown("**Sources:**")
+                for source in message["content"]["sources"]:
+                    st.markdown(f"- [{source['url']}]({source['url']})")
+        else:
+            st.markdown(message["content"])
+
+# Question input
+if question := st.chat_input("Ask a question about Formula 1"):
+    # Add user question to chat history
+    st.session_state.chat_history.append({"role": "user", "content": question})
+
+    # Display user question
+    with st.chat_message("user"):
+        st.write(question)
+
+    # Generate and display response with enhanced formatting
+    with st.chat_message("assistant"):
+        with st.spinner("🤔 Analyzing Formula 1 knowledge..."):
+            response = asyncio.run(st.session_state.f1_ai.ask_question(question))
+            st.markdown(response["answer"])
+
+            # Display sources if available
+            if response["sources"]:
+                st.markdown("---")
+                st.markdown("**Sources:**")
+                for source in response["sources"]:
+                    st.markdown(f"- [{source['url']}]({source['url']})")
+
+            # Add assistant response to chat history
+            st.session_state.chat_history.append({"role": "assistant", "content": response})
-            placeholder="https://en.wikipedia.org/wiki/Formula_One\nhttps://www.formula1.com/en/latest/article....")
-        st.error("Please enter at least one valid URL.")
-    else:
-        st.error("Please enter at least one URL to ingest.")
 
 # Add a footer with credits
 st.markdown("---")
-st.markdown("F1-AI: A Formula 1 RAG Application
+st.markdown("F1-AI: A Formula 1 RAG Application")

f1_scraper.py ADDED
@@ -0,0 +1,294 @@
import os
import asyncio
import argparse
import logging
from datetime import datetime
from urllib.parse import urlparse, urljoin
from typing import List, Dict, Set, Optional, Any
from rich.console import Console
from rich.progress import Progress
from playwright.async_api import async_playwright, TimeoutError
from bs4 import BeautifulSoup
from dotenv import load_dotenv

# Import our custom F1AI class
from f1_ai import F1AI

# Configure logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)
console = Console()

# Load environment variables
load_dotenv()

class F1Scraper:
    def __init__(self, max_pages: int = 100, depth: int = 2, f1_ai: Optional[F1AI] = None):
        """
        Initialize the F1 web scraper.

        Args:
            max_pages (int): Maximum number of pages to scrape
            depth (int): Maximum depth for crawling
            f1_ai (F1AI): Optional F1AI instance to use for ingestion
        """
        self.max_pages = max_pages
        self.depth = depth
        self.visited_urls: Set[str] = set()
        self.f1_urls: List[str] = []
        self.f1_ai = f1_ai if f1_ai else F1AI(llm_provider="openrouter")

        # Define F1-related keywords to identify relevant pages
        self.f1_keywords = [
            "formula 1", "formula one", "f1", "grand prix", "gp", "race", "racing",
            "driver", "team", "championship", "qualifying", "podium", "ferrari",
            "mercedes", "red bull", "mclaren", "williams", "alpine", "aston martin",
            "haas", "alfa romeo", "alphatauri", "fia", "pirelli", "drs", "pit stop",
            "verstappen", "hamilton", "leclerc", "sainz", "norris", "perez",
            "russell", "alonso", "track", "circuit", "lap", "pole position"
        ]

        # Core F1 websites to target
        self.f1_core_sites = [
            "formula1.com",
            "autosport.com",
            "motorsport.com",
            "f1i.com",
            "racefans.net",
            "crash.net/f1",
            "espn.com/f1",
            "bbc.com/sport/formula1",
            "skysports.com/f1"
        ]

    def is_f1_related(self, url: str, content: Optional[str] = None) -> bool:
        """Determine if a URL and its content are F1-related."""
        # Check if URL is from a core F1 site
        parsed_url = urlparse(url)
        domain = parsed_url.netloc

        for core_site in self.f1_core_sites:
            if core_site in domain:
                return True

        # High-priority paths that are definitely F1-related
        priority_paths = [
            "/racing/", "/drivers/", "/teams/", "/results/",
            "/grands-prix/", "/championship/", "/races/",
            "/season/", "/standings/", "/stats/", "/calendar/",
            "/schedule/"
        ]

        # Skip these paths even if they contain F1-related terms
        skip_paths = [
            "/privacy/", "/terms/", "/legal/", "/contact/",
            "/cookie/", "/account/", "/login/", "/register/",
            "/admin/", "/about/", "/careers/", "/press/",
            "/media-centre/", "/corporate/", "/investors/",
            "/f1store", "f1authentics", "/articles/", "/news/",
            "/blog/", "/videos/", "/photos/", "/gallery/", "/photoshoot/"
        ]

        url_lower = url.lower()

        # Check if URL is in skip paths
        if any(path in url_lower for path in skip_paths):
            return False

        # Priority paths are always considered F1-related
        if any(path in url_lower for path in priority_paths):
            return True

        # Check URL path for F1 keywords
        url_path = parsed_url.path.lower()
        for keyword in self.f1_keywords:
            if keyword in url_path:
                return True

        # If content provided, check for F1 keywords
        if content:
            content_lower = content.lower()
            # Count keyword occurrences to determine relevance
            keyword_count = sum(1 for keyword in self.f1_keywords if keyword in content_lower)
            # If many keywords are found, it's likely F1-related
            if keyword_count >= 3:
                return True

        return False

    async def extract_links(self, url: str) -> List[str]:
        """Extract links from a webpage."""
        links = []
        try:
            async with async_playwright() as p:
                browser = await p.chromium.launch()
                page = await browser.new_page()

                try:
                    await page.goto(url, timeout=30000)
                    html_content = await page.content()
                    soup = BeautifulSoup(html_content, 'html.parser')

                    # Get base domain for domain restriction
                    parsed_url = urlparse(url)
                    base_domain = parsed_url.netloc

                    # Find all links
                    for a_tag in soup.find_all('a', href=True):
                        href = a_tag['href']
                        # Convert relative URLs to absolute
                        if href.startswith('/'):
                            href = urljoin(url, href)

                        # Skip non-http(s) URLs
                        if not href.startswith(('http://', 'https://')):
                            continue

                        # Only include links from formula1.com if it's the default start URL
                        if base_domain == 'www.formula1.com':
                            parsed_href = urlparse(href)
                            if parsed_href.netloc != 'www.formula1.com':
                                continue

                        links.append(href)

                    # Check if content is F1 related before returning
                    text_content = soup.get_text(separator=' ', strip=True)
                    if self.is_f1_related(url, text_content):
                        self.f1_urls.append(url)
                        logger.info(f"✅ F1-related content found: {url}")

                except TimeoutError:
                    logger.error(f"Timeout while loading {url}")
                finally:
                    await browser.close()

            return links
        except Exception as e:
            logger.error(f"Error extracting links from {url}: {str(e)}")
            return []

    async def crawl(self, start_urls: List[str]) -> List[str]:
        """
        Crawl F1-related websites starting from the provided URLs.

        Args:
            start_urls (List[str]): Starting URLs for crawling

        Returns:
            List[str]: List of discovered F1-related URLs
        """
        to_visit = start_urls.copy()
        current_depth = 0

        with Progress() as progress:
            task = progress.add_task("[green]Crawling F1 websites...", total=self.max_pages)

            while to_visit and len(self.visited_urls) < self.max_pages and current_depth <= self.depth:
                current_depth += 1
                next_level = []

                for url in to_visit:
                    if url in self.visited_urls:
                        continue

                    self.visited_urls.add(url)
                    progress.update(task, advance=1, description=f"[green]Crawling: {url[:50]}...")

                    links = await self.extract_links(url)
                    next_level.extend([link for link in links if link not in self.visited_urls])

                    # Update progress
                    progress.update(task, completed=len(self.visited_urls), total=self.max_pages)
                    if len(self.visited_urls) >= self.max_pages:
                        break

                to_visit = next_level
                logger.info(f"Completed depth {current_depth}, discovered {len(self.f1_urls)} F1-related URLs")

        # Deduplicate and return results
        self.f1_urls = list(set(self.f1_urls))
        return self.f1_urls

    async def ingest_discovered_urls(self, max_chunks_per_url: int = 50) -> None:
        """
        Ingest discovered F1-related URLs into the RAG system.

        Args:
            max_chunks_per_url (int): Maximum chunks to extract per URL
        """
        if not self.f1_urls:
            logger.warning("No F1-related URLs to ingest. Run crawl() first.")
            return

        logger.info(f"Ingesting {len(self.f1_urls)} F1-related URLs into RAG system...")
        await self.f1_ai.ingest(self.f1_urls, max_chunks_per_url=max_chunks_per_url)
        logger.info("✅ Ingestion complete!")

    def save_urls_to_file(self, filename: str = "f1_urls.txt") -> None:
        """
        Save discovered F1 URLs to a text file.

        Args:
            filename (str): Name of the output file
        """
        if not self.f1_urls:
            logger.warning("No F1-related URLs to save. Run crawl() first.")
            return

        with open(filename, "w") as f:
            f.write(f"# F1-related URLs discovered on {datetime.now().isoformat()}\n")
            f.write(f"# Total URLs: {len(self.f1_urls)}\n\n")
            for url in self.f1_urls:
                f.write(f"{url}\n")

        logger.info(f"✅ Saved {len(self.f1_urls)} URLs to {filename}")

async def main():
    """Main function to run the F1 scraper."""
    parser = argparse.ArgumentParser(description="F1 Web Scraper to discover and ingest F1-related content")
    parser.add_argument("--start-urls", nargs="+", default=["https://www.formula1.com/"],
                        help="Starting URLs for crawling")
    parser.add_argument("--max-pages", type=int, default=100,
                        help="Maximum number of pages to crawl")
    parser.add_argument("--depth", type=int, default=2,
                        help="Maximum crawl depth")
    parser.add_argument("--ingest", action="store_true",
                        help="Ingest discovered URLs into RAG system")
    parser.add_argument("--max-chunks", type=int, default=50,
                        help="Maximum chunks per URL for ingestion")
    parser.add_argument("--output", type=str, default="f1_urls.txt",
                        help="Output file for discovered URLs")
    parser.add_argument("--llm-provider", choices=["ollama", "openrouter"], default="openrouter",
                        help="Provider for LLM (default: openrouter)")

    args = parser.parse_args()

    # Initialize F1AI if needed
    f1_ai = None
    if args.ingest:
        f1_ai = F1AI(llm_provider=args.llm_provider)

    # Initialize and run the scraper
    scraper = F1Scraper(
        max_pages=args.max_pages,
        depth=args.depth,
        f1_ai=f1_ai
    )

    # Crawl to discover F1-related URLs
    console.print("[bold blue]Starting F1 web crawler[/bold blue]")
    discovered_urls = await scraper.crawl(args.start_urls)
    console.print(f"[bold green]Discovered {len(discovered_urls)} F1-related URLs[/bold green]")

    # Save URLs to file
    scraper.save_urls_to_file(args.output)

    # Ingest if requested
    if args.ingest:
        console.print("[bold yellow]Starting ingestion into RAG system...[/bold yellow]")
        await scraper.ingest_discovered_urls(max_chunks_per_url=args.max_chunks)
        console.print("[bold green]Ingestion complete![/bold green]")

if __name__ == "__main__":
    asyncio.run(main())

image.png ADDED