eligapris committed
Commit e51e040 · verified · 1 parent: 62f3268

Upload 7 files

Files changed (7)
  1. Dockerfile +48 -0
  2. README.md +122 -12
  3. main.py +78 -0
  4. pyproject.toml +24 -0
  5. requirements.txt +5 -0
  6. search.py +93 -0
  7. website_viewer.py +59 -0
Dockerfile ADDED
@@ -0,0 +1,48 @@
+ # Use Python 3.11 as the base image
+ FROM python:3.11-slim
+
+ # Set working directory
+ WORKDIR /app
+
+ # Install system libraries needed by Playwright's headless browsers
+ RUN apt-get update && apt-get install -y \
+     gcc \
+     libglib2.0-0 \
+     libnss3 \
+     libnspr4 \
+     libdbus-1-3 \
+     libatk1.0-0 \
+     libatk-bridge2.0-0 \
+     libcups2 \
+     libdrm2 \
+     libxkbcommon0 \
+     libatspi2.0-0 \
+     libxcomposite1 \
+     libxdamage1 \
+     libxext6 \
+     libxfixes3 \
+     libxrandr2 \
+     libgbm1 \
+     libpango-1.0-0 \
+     libcairo2 \
+     libasound2 \
+     && rm -rf /var/lib/apt/lists/*
+
+ # Copy the application code first: the editable install below needs
+ # the module sources present, not just pyproject.toml
+ COPY . .
+
+ # Install Python dependencies
+ RUN pip install -e .
+
+ # Install Playwright browsers (Playwright is pulled in by crawl4ai)
+ RUN playwright install
+
+ # Set environment variables
+ ENV PYTHONUNBUFFERED=1
+
+ # Expose the port the app runs on
+ EXPOSE 8000
+
+ # Run the FastAPI application with uvicorn
+ CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000"]
README.md CHANGED
@@ -1,12 +1,122 @@
- ---
- title: Search Api
- emoji: 🐠
- colorFrom: red
- colorTo: red
- sdk: docker
- pinned: false
- license: mit
- short_description: 'Opensource Search Api '
- ---
-
- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
+ # Search Engine API
+
+ A FastAPI-based search API that queries DuckDuckGo's lite search interface.
+
+ ## Setup
+
+ 1. Install the required dependencies:
+ ```bash
+ pip install -r requirements.txt
+ ```
+
+ 2. Run the API server:
+ ```bash
+ uvicorn main:app --reload
+ ```
+
+ The API will be available at `http://localhost:8000`.
+
+ ## API Documentation
+
+ Once the server is running, you can access the interactive API documentation at:
+ - Swagger UI: `http://localhost:8000/` (served at the root)
+ - ReDoc: `http://localhost:8000/redoc`
+
+ ## API Endpoints
+
+ ### POST /search
+
+ Search for content using a search phrase.
+
+ Request body:
+ ```json
+ {
+     "search_phrase": "your search query"
+ }
+ ```
+
+ Response:
+ ```json
+ {
+     "results": {
+         "Item_1": {
+             "title": "Result title",
+             "snippet": "Result snippet",
+             "linkText": "Link text"
+         },
+         // ... more items
+     }
+ }
+ ```
+
+ ### POST /websiteView
+
+ View a website's content in Markdown format.
+
+ Request body:
+ ```json
+ {
+     "url": "https://example.com"
+ }
+ ```
+
+ Response:
+ ```json
+ {
+     "title": "Website Title",
+     "markdown": "# Website Title\n\nMain content in Markdown format...",
+     "links": {
+         "internal": [
+             { "href": "https://example.com/about", "text": "About" }
+         ],
+         "external": [
+             { "href": "https://other.example.com", "text": "Other site" }
+         ]
+     },
+     "images": [
+         { "src": "https://example.com/image.jpg", "alt": "Image description" }
+     ],
+     "url": "https://example.com"
+ }
+ ```
+
+ ### POST /searchWithContent
+
+ Search for content and retrieve the full content of the top N results.
+
+ Request body:
+ ```json
+ {
+     "search_phrase": "your search query",
+     "top_n": 5
+ }
+ ```
+
+ Response:
+ ```json
+ {
+     "results": {
+         "Item_1": {
+             "title": "Result title",
+             "snippet": "Result snippet",
+             "linkText": "https://example.com",
+             "content": {
+                 "title": "Website Title",
+                 "markdown": "# Website Title\n\nMain content in Markdown format...",
+                 "links": {...},
+                 "images": [...],
+                 "url": "https://example.com"
+             }
+         },
+         // ... more items up to top_n
+     }
+ }
+ ```
+
+ The `content` field of each result contains the page converted to Markdown, together with the links and images extracted from it.
+
+ ## Error Handling
+
+ The API returns appropriate HTTP status codes and error messages on failure:
+ - 500: Internal server error
+ - 422: Validation error (invalid request body)
main.py ADDED
@@ -0,0 +1,78 @@
+ from fastapi import FastAPI, HTTPException
+ from fastapi.middleware.cors import CORSMiddleware
+ from pydantic import BaseModel
+ from search import search_n_browse, search_with_content
+ from website_viewer import fetch_website_content
+ from typing import Dict, List
+
+
+ app = FastAPI(title="VerboAI Search Engine API", description="""
+ VerboAI Search Engine API provides three core functionalities:
+ 1. Web search with browsing capabilities
+ 2. Website content extraction with structured data (title, markdown, links, and images)
+ 3. Content-based search that returns the top N most relevant results from websites
+
+ The API is built with FastAPI, supports CORS, and is ready for cloud deployment. Powered by VerboAI.
+ """, version="1.0.0", docs_url="/", redoc_url="/redoc")
+
+ # Enable CORS
+ app.add_middleware(
+     CORSMiddleware,
+     allow_origins=["*"],
+     allow_credentials=True,
+     allow_methods=["*"],
+     allow_headers=["*"],
+ )
+
+ class SearchRequest(BaseModel):
+     search_phrase: str
+
+ class SearchResponse(BaseModel):
+     results: dict
+
+ class WebsiteViewRequest(BaseModel):
+     url: str
+
+ class WebsiteViewResponse(BaseModel):
+     title: str
+     markdown: str
+     # crawl4ai groups links into "internal"/"external" lists;
+     # images are a flat list of dicts with "src"/"alt" keys
+     links: Dict[str, List[Dict]] = {}
+     images: List[Dict] = []
+     url: str
+
+ class SearchWithContentRequest(BaseModel):
+     search_phrase: str
+     top_n: int = 5
+
+ class SearchWithContentResponse(BaseModel):
+     results: dict
+
+ @app.post("/search", response_model=SearchResponse)
+ async def search(request: SearchRequest):
+     try:
+         results = await search_n_browse(request.search_phrase)
+         return SearchResponse(results=results)
+     except Exception as e:
+         raise HTTPException(status_code=500, detail=str(e))
+
+ @app.post("/websiteView", response_model=WebsiteViewResponse)
+ async def view_website(request: WebsiteViewRequest):
+     try:
+         website_content = await fetch_website_content(request.url)
+         return WebsiteViewResponse(**website_content)
+     except Exception as e:
+         raise HTTPException(status_code=500, detail=str(e))
+
+ @app.post("/searchWithContent", response_model=SearchWithContentResponse)
+ async def search_with_content_endpoint(request: SearchWithContentRequest):
+     try:
+         results = await search_with_content(request.search_phrase, request.top_n)
+         return SearchWithContentResponse(results=results)
+     except Exception as e:
+         raise HTTPException(status_code=500, detail=str(e))
+
+ if __name__ == "__main__":
+     import uvicorn
+     uvicorn.run(app, host="0.0.0.0", port=8000)
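
The routes above can be smoke-tested in-process with FastAPI's TestClient, without starting a server. A sketch, assuming `httpx` (which TestClient needs) is installed; the live-search check is guarded because it depends on network access to DuckDuckGo.

```python
from fastapi.testclient import TestClient

from main import app

client = TestClient(app)

# A missing body fails validation with 422, per the README's error table
resp = client.post("/search", json={})
assert resp.status_code == 422

# A well-formed request either succeeds with a results dict or returns
# a 500 if the upstream search is unreachable from this machine
resp = client.post("/search", json={"search_phrase": "fastapi"})
if resp.status_code == 200:
    assert "results" in resp.json()
```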
pyproject.toml ADDED
@@ -0,0 +1,24 @@
+ [project]
+ name = "search-engine-api"
+ version = "0.1.0"
+ description = "A search engine API"
+ authors = [
+     { name = "Your Name", email = "your.email@example.com" }
+ ]
+ dependencies = [
+     "fastapi==0.109.2",
+     "uvicorn==0.27.1",
+     "crawl4ai==0.3.3",
+     # imported directly by search.py and website_viewer.py
+     "aiohttp",
+     "beautifulsoup4",
+ ]
+ requires-python = ">=3.9,<3.13"
+
+ [build-system]
+ requires = ["hatchling"]
+ build-backend = "hatchling.build"
+
+ [tool.hatch.build.targets.wheel]
+ # flat layout: ship the top-level modules directly
+ include = ["main.py", "search.py", "website_viewer.py"]
requirements.txt ADDED
@@ -0,0 +1,5 @@
+ fastapi==0.109.2
+ uvicorn==0.27.1
+ crawl4ai==0.3.3
+ aiohttp
+ beautifulsoup4
search.py ADDED
@@ -0,0 +1,93 @@
+ import aiohttp
+ from bs4 import BeautifulSoup
+ from website_viewer import fetch_website_content
+
+ async def search_n_browse(search_phrase: str) -> dict:
+     url = "https://lite.duckduckgo.com/lite/"
+     headers = {
+         "Content-Type": "application/x-www-form-urlencoded",
+     }
+     data = {
+         "q": search_phrase,
+     }
+
+     try:
+         async with aiohttp.ClientSession() as session:
+             async with session.post(url, data=data, headers=headers) as response:
+                 if response.status == 200:
+                     body = await response.text()
+                     soup = BeautifulSoup(body, 'html.parser')
+
+                     # Extract result titles
+                     titles = [a.text.strip() for a in soup.select('tr > td > a')]
+
+                     # Extract snippets
+                     snippets = [td.text.strip() for td in soup.select('td.result-snippet')]
+
+                     # Extract link texts (the displayed result URLs)
+                     link_texts = [span.text for span in soup.select('td > span.link-text')]
+
+                     # Assemble results keyed as Item_1, Item_2, ...
+                     json_data = {}
+                     if snippets:
+                         for index, snippet in enumerate(snippets):
+                             json_data[f"Item_{index + 1}"] = {
+                                 "title": titles[index] if index < len(titles) else "",
+                                 "snippet": snippet,
+                                 "linkText": link_texts[index] if index < len(link_texts) else "",
+                             }
+                     else:
+                         json_data["Result"] = {
+                             "title": "No results were found.",
+                             "snippet": "Our search engine could not find what you are looking for.",
+                             "linkText": "Thank you for using our search engine.",
+                         }
+
+                     return json_data
+                 else:
+                     raise Exception(f"Search API error. Status code: {response.status}")
+     except Exception as e:
+         raise Exception(f"Error during search: {str(e)}")
+
+ async def search_with_content(search_phrase: str, top_n: int = 5) -> dict:
+     """
+     Search for content and retrieve the content of the top N results.
+     """
+     try:
+         # First, get the search results
+         search_results = await search_n_browse(search_phrase)
+
+         # Process only the top N results
+         top_results = {}
+         count = 0
+
+         for key, result in search_results.items():
+             if key.startswith("Item_") and count < top_n:
+                 try:
+                     # The displayed link text doubles as the result URL
+                     url = result.get("linkText", "")
+                     if url:
+                         # Fetch the content of the page
+                         content = await fetch_website_content(url)
+
+                         # Attach the fetched content to the result
+                         top_results[key] = {
+                             "title": result.get("title", ""),
+                             "snippet": result.get("snippet", ""),
+                             "linkText": result.get("linkText", ""),
+                             "content": content,
+                         }
+                         count += 1
+                 except Exception as e:
+                     # If the content can't be fetched, keep the basic result
+                     top_results[key] = {
+                         "title": result.get("title", ""),
+                         "snippet": result.get("snippet", ""),
+                         "linkText": result.get("linkText", ""),
+                         "error": f"Could not fetch content: {str(e)}",
+                     }
+                     count += 1
+
+         return top_results
+     except Exception as e:
+         raise Exception(f"Error during search with content: {str(e)}")
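
A minimal sketch of driving `search_n_browse` directly from a script, assuming network access to lite.duckduckgo.com:

```python
import asyncio

from search import search_n_browse

async def main():
    results = await search_n_browse("python web scraping")
    for key, item in results.items():
        print(f"{key}: {item['title']} ({item['linkText']})")

asyncio.run(main())
```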
website_viewer.py ADDED
@@ -0,0 +1,59 @@
+ import logging
+ from bs4 import BeautifulSoup
+ from typing import Dict
+ from crawl4ai import AsyncWebCrawler
+
+
+ # Configure logging
+ logging.basicConfig(level=logging.INFO)
+ logger = logging.getLogger(__name__)
+
+ class WebsiteViewerError(Exception):
+     pass
+
+ async def fetch_website_content(url: str) -> Dict:
+     """
+     Fetch website content using crawl4ai's AsyncWebCrawler.
+
+     Args:
+         url (str): The URL of the website to fetch
+
+     Returns:
+         Dict: A dictionary containing:
+             - title: The page title (str)
+             - markdown: The page content converted to Markdown (str)
+             - links: Links found on the page, grouped by crawl4ai
+               into "internal" and "external" lists (Dict)
+             - images: Image entries found on the page (List[Dict])
+             - url: The URL that was fetched (str)
+     """
+     try:
+         if not url.startswith(('http://', 'https://')):
+             url = 'https://' + url
+
+         async with AsyncWebCrawler() as crawler:
+             result = await crawler.arun(url=url)
+             soup = BeautifulSoup(result.html, 'html.parser')
+
+             # crawl4ai has already extracted the links, grouped internal/external
+             links = result.links
+
+             # ...and the media entries; keep only the images
+             images = result.media.get("images", []) if result.media else []
+
+             # Get the page title from the raw HTML
+             title = soup.title.string if soup.title and soup.title.string else ''
+
+             output = {
+                 "title": title,
+                 "markdown": result.markdown,
+                 "links": links,
+                 "images": images,
+                 "url": url
+             }
+
+             return output
+
+     except Exception as e:
+         logger.error(f"Error fetching website content: {str(e)}")
+         raise WebsiteViewerError(f"Failed to fetch website content: {str(e)}")
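
A usage sketch for `fetch_website_content`, relying on the return shape documented above; crawl4ai drives a headless browser, so `playwright install` must have been run first.

```python
import asyncio

from website_viewer import fetch_website_content

async def main():
    # The https:// scheme is added automatically when missing
    page = await fetch_website_content("example.com")
    print(page["title"])
    print(page["markdown"][:300])
    print(f"{len(page['images'])} images, "
          f"{len(page['links'].get('external', []))} external links")

asyncio.run(main())
```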