Huseyin Kir committed 0f0e7c3 · 1 Parent(s): 7340cef

download service added

Files changed (3)
  1. README.md +129 -1
  2. app.py +159 -6
  3. requirements.txt +2 -1
README.md CHANGED
@@ -8,4 +8,132 @@ pinned: false
8
  short_description: Semantic search API for NDL Core datasets
9
  ---
10
 
11
- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
8
  short_description: Semantic search API for NDL Core datasets
9
  ---
10
 
11
+ # NDL Core Data API
12
+
13
+ A FastAPI-based service that provides semantic search and data download capabilities for NDL Core datasets. The API uses LanceDB for vector search, with Sentence Transformers embeddings.
14
+
15
+ ## Base URL
16
+
17
+ ```
18
+ https://hkir-dev-ndl-core-data-api.hf.space
19
+ ```
20
+
21
+ ## Endpoints
22
+
23
+ ### Search
24
+
25
+ **GET** `/search`
26
+
27
+ Perform semantic search across NDL Core datasets using natural language queries.
28
+
29
+ **Parameters:**
30
+ | Parameter | Type | Required | Default | Description |
31
+ |-----------|------|----------|---------|-------------|
32
+ | `query` | string | Yes | - | Natural language search query |
33
+ | `limit` | integer | No | 5 | Maximum number of results to return |
34
+
35
+ **Example:**
36
+ ```bash
37
+ curl "https://hkir-dev-ndl-core-data-api.hf.space/search?query=Police%20use%20of%20force&limit=10"
38
+ ```
39
+
40
+ **Response:**
41
+ ```json
42
+ [
43
+ {
44
+ "identifier": "UUID1",
45
+ "title": "Police use of force dataset1",
46
+ "description": "...",
47
+ "format": "parquet",
48
+ ...
49
+ },
50
+ {
51
+ "identifier": "UUID2",
52
+ "title": "Police use of force dataset2",
53
+ "description": "...",
54
+ "format": "text",
55
+ ...
56
+ }
57
+ ]
58
+ ```
59
+
60
+ See [NDL Corpus](https://huggingface.co/datasets/hkir-dev/ndl-core-corpus) for the definitions of all fields.
61
+
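Beyond curl, the endpoint can be called from Python with only the standard library. A minimal client sketch (the `build_search_url` and `search` helper names are illustrative, not part of the API):

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

BASE_URL = "https://hkir-dev-ndl-core-data-api.hf.space"

def build_search_url(query: str, limit: int = 5) -> str:
    # percent-encode the free-text query so spaces etc. are URL-safe
    return f"{BASE_URL}/search?{urlencode({'query': query, 'limit': limit})}"

def search(query: str, limit: int = 5) -> list:
    # hits the live Space; each element is one matching corpus record
    with urlopen(build_search_url(query, limit), timeout=30) as resp:
        return json.load(resp)
```

`urlencode` handles the percent-encoding, so callers can pass plain strings instead of hand-escaping queries as in the curl example.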
62
+ ---
63
+
64
+ ### Download
65
+
66
+ **GET** `/download`
67
+
68
+ Get download information for one or more datasets by their identifiers.
69
+
70
+ **Parameters:**
71
+ | Parameter | Type | Required | Description |
72
+ |-----------|------|----------|-------------|
73
+ | `identifiers` | string | Yes | Comma-separated list of dataset identifiers |
74
+
75
+ **Example:**
76
+ ```bash
77
+ curl "https://hkir-dev-ndl-core-data-api.hf.space/download?identifiers=UUID1,UUID2"
78
+ ```
79
+
80
+ **Response:**
81
+ ```json
82
+ [
83
+ {
84
+ "identifier": "UUID1",
85
+ "format": "parquet",
86
+ "data": ["https://huggingface.co/datasets/hkir-dev/ndl-core-structured-data/resolve/main/some.parquet"]
87
+ },
88
+ {
89
+ "identifier": "UUID2",
90
+ "format": "text",
91
+ "data": ["https://hkir-dev-ndl-core-data-api.hf.space/download/text/UUID2"]
92
+ }
93
+ ]
94
+ ```
95
+
96
+ **Response Fields:**
97
+ - `identifier` - The requested dataset identifier
98
+ - `format` - Either `text` or `parquet`
99
+ - `data` - Array of download URLs
100
+ - `error` - Error message (only present if the request failed)
101
+
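Because each entry in the response carries either `data` URLs or an `error`, a client should branch on the `error` key before downloading. A sketch of that pattern (the helper names here are illustrative, not part of the API):

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

BASE_URL = "https://hkir-dev-ndl-core-data-api.hf.space"

def fetch_download_info(identifiers: list) -> list:
    # one request resolves every identifier in the comma-separated list
    url = f"{BASE_URL}/download?{urlencode({'identifiers': ','.join(identifiers)})}"
    with urlopen(url, timeout=30) as resp:
        return json.load(resp)

def collect_download_urls(entries: list) -> tuple:
    # split successful entries (with "data" URLs) from failed ones (with "error")
    urls, errors = [], []
    for entry in entries:
        if "error" in entry:
            errors.append(entry)
        else:
            urls.extend(entry.get("data", []))
    return urls, errors
```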
102
+ ---
103
+
104
+ ### Download Text File
105
+
106
+ **GET** `/download/text/{identifier}`
107
+
108
+ Stream text content as a downloadable `.txt` file.
109
+
110
+ **Parameters:**
111
+ | Parameter | Type | Required | Description |
112
+ |-----------|------|----------|-------------|
113
+ | `identifier` | path | Yes | The dataset identifier |
114
+
115
+ **Example:**
116
+ ```bash
117
+ curl -O "https://hkir-dev-ndl-core-data-api.hf.space/download/text/UUID2"
118
+ ```
119
+
120
+ **Response:**
121
+ - Returns a `text/plain` file download with `Content-Disposition: attachment`
122
+
123
+ **Errors:**
124
+ - `404` - No record found with the given identifier
125
+ - `400` - Record exists but is not in text format
126
+
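The same download can be scripted in Python; a minimal sketch that saves the body under the `identifier.txt` name the server advertises via `Content-Disposition` (helper names are illustrative):

```python
from pathlib import Path
from urllib.request import urlopen

BASE_URL = "https://hkir-dev-ndl-core-data-api.hf.space"

def text_download_path(identifier: str, out_dir: str = ".") -> Path:
    # local target path, matching the server's attachment filename
    return Path(out_dir) / f"{identifier}.txt"

def save_text_dataset(identifier: str, out_dir: str = ".") -> Path:
    # a 404/400 from the endpoint surfaces here as urllib.error.HTTPError
    path = text_download_path(identifier, out_dir)
    with urlopen(f"{BASE_URL}/download/text/{identifier}", timeout=30) as resp:
        path.write_bytes(resp.read())
    return path
```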
127
+ ---
128
+
129
+ ## Data Sources
130
+
131
+ - **Vector Index:** [hkir-dev/ndl-core-rag-index](https://huggingface.co/datasets/hkir-dev/ndl-core-rag-index)
132
+ - **Structured Data:** [hkir-dev/ndl-core-structured-data](https://huggingface.co/datasets/hkir-dev/ndl-core-structured-data)
133
+
134
+ ## Technology Stack
135
+
136
+ - **Framework:** FastAPI
137
+ - **Vector Database:** LanceDB
138
+ - **Embeddings:** Sentence Transformers (all-MiniLM-L6-v2)
139
+ - **Deployment:** Docker on Hugging Face Spaces
app.py CHANGED
@@ -1,9 +1,17 @@
1
  import os
2
- from fastapi import FastAPI
 
3
  import lancedb
4
  from sentence_transformers import SentenceTransformer
5
  from huggingface_hub import snapshot_download
6
  import shutil
7
 
8
  app = FastAPI()
9
 
@@ -16,6 +24,7 @@ index_path = snapshot_download(
16
  force_download=True # ensure we get the latest version
17
  )
18
 
 
19
  dst = "/tmp/lancedb_search_index"
20
  shutil.copytree(f"{index_path}/lancedb_search_index", dst)
21
 
@@ -36,10 +45,154 @@ model = SentenceTransformer('all-MiniLM-L6-v2')
36
  def search(query: str, limit: int = 5):
37
  query_vector = model.encode(query)
38
  results = (
39
- table.search(query_vector) # Your vector search
40
- .metric("cosine") # Ensure metric matches your index
41
- .select(columns_to_select) # <--- The key step: explicit column selection
42
- .limit(5) # Number of results
43
- .to_pandas() # Convert to DataFrame
44
  )
45
  return results.to_dict(orient='records')
1
  import os
2
+ from fastapi import FastAPI, HTTPException
3
+ from fastapi.responses import StreamingResponse
4
  import lancedb
5
  from sentence_transformers import SentenceTransformer
6
  from huggingface_hub import snapshot_download
7
  import shutil
8
+ import requests
9
+ import io
10
+
11
+ HF_DATASET_BASE_URL = "https://huggingface.co/datasets/hkir-dev/ndl-core-structured-data"
12
+ HF_API_BASE_URL = "https://huggingface.co/api/datasets/hkir-dev/ndl-core-structured-data"
13
+
14
+ THIS_API_URL = "https://hkir-dev-ndl-core-data-api.hf.space"
15
 
16
  app = FastAPI()
17
 
 
24
  force_download=True # ensure we get the latest version
25
  )
26
 
27
+ # This is mandatory to avoid "file size is too small" errors from LanceDB
28
  dst = "/tmp/lancedb_search_index"
29
  shutil.copytree(f"{index_path}/lancedb_search_index", dst)
30
 
 
45
  def search(query: str, limit: int = 5):
46
  query_vector = model.encode(query)
47
  results = (
48
+ table.search(query_vector) # vector search
49
+ .metric("cosine") # Ensure metric matches index
50
+ .select(columns_to_select) # explicit column selection
51
+ .limit(limit)
52
+ .to_pandas()
53
  )
54
+
55
+ # Truncate text column to preview only
56
+ if "text" in results.columns:
57
+ results["text"] = results["text"].apply(truncate_text)
58
+
59
  return results.to_dict(orient='records')
60
+
61
+
62
+ @app.get("/download")
63
+ def download(identifiers: str):
64
+ """
65
+ Download endpoint that returns data based on the identifiers.
66
+
67
+ Args:
68
+ identifiers: Comma-separated list of identifiers
69
+
70
+ Returns:
71
+ List of objects with:
72
+ - For text format: {"identifier": "...", "format": "text", "data": ["<text content>"]}
73
+ - For parquet format: {"identifier": "...", "format": "parquet", "data": ["<download links>"]}
74
+ """
75
+ identifier_list = [id.strip() for id in identifiers.split(",") if id.strip()]
76
+ return [process_single_identifier(identifier) for identifier in identifier_list]
77
+
78
+ @app.get("/download/text/{identifier}")
79
+ def download_text_file(identifier: str):
80
+ """
81
+ Stream text content as a downloadable file.
82
+
83
+ Args:
84
+ identifier: The record identifier
85
+
86
+ Returns:
87
+ StreamingResponse with the text content as a downloadable file
88
+ """
89
+ record = find_record_by_identifier(identifier)
90
+
91
+ if record is None:
92
+ raise HTTPException(status_code=404, detail=f"No record found with identifier: {identifier}")
93
+
94
+ record_format = record.get("format", "")
95
+ if record_format != "text":
96
+ raise HTTPException(status_code=400, detail=f"Record is not text format: {record_format}")
97
+
98
+ text_data = record.get("text", "")
99
+
100
+ # Create a file-like object from the text
101
+ file_stream = io.BytesIO(text_data.encode("utf-8"))
102
+
103
+ return StreamingResponse(
104
+ file_stream,
105
+ media_type="text/plain",
106
+ headers={
107
+ "Content-Disposition": f"attachment; filename={identifier}.txt"
108
+ }
109
+ )
110
+
111
+ def truncate_text(text: str, max_length: int = 100) -> str:
112
+ """Return first max_length characters of text with '...' if truncated, or empty string if no text."""
113
+ if not text:
114
+ return ""
115
+ if len(text) <= max_length:
116
+ return text
117
+ return text[:max_length] + "..."
118
+
119
+ def get_folder_file_urls(folder_name: str) -> list:
120
+ """Fetch all file URLs from a folder in the HuggingFace dataset."""
121
+ api_url = f"{HF_API_BASE_URL}/tree/main/{folder_name}"
122
+ response = requests.get(api_url)
123
+ if response.status_code != 200:
124
+ return []
125
+
126
+ files = response.json()
127
+ file_urls = []
128
+ for file_info in files:
129
+ if file_info.get("type") == "file":
130
+ file_path = file_info.get("path", "")
131
+ download_url = f"{HF_DATASET_BASE_URL}/resolve/main/{file_path}"
132
+ file_urls.append(download_url)
133
+ return file_urls
134
+
135
+
136
+ def find_record_by_identifier(identifier: str):
137
+ """Search for a record in LanceDB by identifier."""
138
+ results = (
139
+ table.search()
140
+ .where(f"identifier = '{identifier}'")
141
+ .select(columns_to_select)
142
+ .limit(1)
143
+ .to_pandas()
144
+ )
145
+ return results.iloc[0] if not results.empty else None
146
+
147
+
148
+ def build_error_response(identifier: str, message: str) -> dict:
149
+ """Build an error response object."""
150
+ return {"identifier": identifier, "error": message}
151
+
152
+
153
+ def build_success_response(identifier: str, format_type: str, data: list) -> dict:
154
+ """Build a success response object."""
155
+ return {"identifier": identifier, "format": format_type, "data": data}
156
+
157
+
158
+ def process_text_record(identifier: str, record) -> dict:
159
+ """Process a text format record and return response."""
160
+ download_url = f"{THIS_API_URL}/download/text/{identifier}"
161
+ return build_success_response(identifier, "text", [download_url])
162
+
163
+
164
+ def process_parquet_record(identifier: str, record) -> dict:
165
+ """Process a parquet format record and return response."""
166
+ data_file = record.get("data_file", "")
167
+
168
+ if not data_file:
169
+ return build_error_response(identifier, "No data_file found for this parquet record")
170
+
171
+ if data_file.endswith(".parquet"):
172
+ download_url = f"{HF_DATASET_BASE_URL}/resolve/main/{data_file}"
173
+ return build_success_response(identifier, "parquet", [download_url])
174
+
175
+ # It's a folder (UUID) - fetch all files in the folder
176
+ file_urls = get_folder_file_urls(data_file)
177
+ if not file_urls:
178
+ return build_error_response(identifier, f"No files found in folder: {data_file}")
179
+
180
+ return build_success_response(identifier, "parquet", file_urls)
181
+
182
+
183
+ def process_single_identifier(identifier: str) -> dict:
184
+ """Process a single identifier and return the appropriate download response based on its format."""
185
+ record = find_record_by_identifier(identifier)
186
+
187
+ if record is None:
188
+ return build_error_response(identifier, f"No record found with identifier: {identifier}")
189
+
190
+ record_format = record.get("format", "")
191
+
192
+ if record_format == "text":
193
+ return process_text_record(identifier, record)
194
+ elif record_format == "parquet":
195
+ return process_parquet_record(identifier, record)
196
+ else:
197
+ return build_error_response(identifier, f"Unknown format: {record_format}")
198
+
requirements.txt CHANGED
@@ -6,4 +6,5 @@ pandas
6
  huggingface-hub
7
  pyarrow
8
  torch
9
- numpy
 
 
6
  huggingface-hub
7
  pyarrow
8
  torch
9
+ numpy
10
+ requests