---
title: Ndl Core Data Api
emoji: 🏃
colorFrom: yellow
colorTo: purple
sdk: docker
pinned: false
short_description: Semantic search API for NDL Core datasets
---

# NDL Core Data API

A FastAPI-based service that provides semantic search and data download for NDL Core datasets. The API uses LanceDB for vector search and Sentence Transformers to compute embeddings.

## Base URL

```
https://theodi-ndl-core-data-api.hf.space
```

## Endpoints

### Search

**GET** `/search`

Performs a semantic search across NDL Core datasets using a natural language query, and returns dataset details along with download links.

**Parameters:**
| Parameter | Type | Required | Default | Description |
|-----------|------|----------|---------|-------------|
| `query` | string | Yes | - | Natural language search query |
| `limit` | integer | No | 5 | Maximum number of results to return |

**Example:**
```bash
curl "https://theodi-ndl-core-data-api.hf.space/search?query=Police%20use%20of%20force&limit=10"
```

**Response:**
```json
[
  {
    "identifier": "UUID1",
    "title": "Police use of force dataset1",
    "description": "...",
    "format": "parquet",
    "download": ["https://huggingface.co/datasets/theodi/ndl-core-structured-data/resolve/main/ac923bbd-57ca-4a84-8d6d-53dbb3614d3d/eca8b02a-c09a-43e9-86c4-b5a9294bce67.parquet"],
    ...
  },
  {
    "identifier": "UUID2",
    "title": "Police use of force dataset2",
    "description": "...",
    "format": "text",
    "download": ["https://theodi-ndl-core-data-api.hf.space/download/text/e06b5cf8-3e2f-4bc6-a6e9-0530b5bd165d"],
    ...
  }
]
```

See [NDL Corpus](https://huggingface.co/datasets/theodi/ndl-core-corpus) for the definition of all fields.
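
The same request can be made from Python. The following is a minimal sketch using only the standard library; the helper names `build_search_url` and `search` are hypothetical and not part of the API, and the fields read from each result are the ones shown in the response example above.

```python
import json
import urllib.parse
import urllib.request

BASE_URL = "https://theodi-ndl-core-data-api.hf.space"


def build_search_url(query: str, limit: int = 5) -> str:
    """Build the /search URL, percent-encoding the query text (hypothetical helper)."""
    params = urllib.parse.urlencode(
        {"query": query, "limit": limit},
        quote_via=urllib.parse.quote,  # encode spaces as %20 rather than +
    )
    return f"{BASE_URL}/search?{params}"


def search(query: str, limit: int = 5, timeout: float = 30.0) -> list:
    """Call /search and return the decoded JSON list of results (hypothetical helper)."""
    with urllib.request.urlopen(build_search_url(query, limit), timeout=timeout) as resp:
        return json.load(resp)


# Example usage (performs a network request):
# for hit in search("Police use of force", limit=10):
#     print(hit.get("identifier"), hit.get("title"), hit.get("format"), hit.get("download"))
```

Using `.get()` on each result avoids a `KeyError` if a field is absent from a particular record.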

---

### Download Text File

**GET** `/download/text/{identifier}`

Stream text content as a downloadable `.txt` file.

**Parameters:**
| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| `identifier` | path | Yes | The dataset identifier |

**Example:**
```bash
curl -O "https://theodi-ndl-core-data-api.hf.space/download/text/UUID2"
```

**Response:**
- Returns a `text/plain` file download with `Content-Disposition: attachment`

**Errors:**
- `404` - No record found with the given identifier
- `400` - Record exists but is not in text format
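
A download with error handling might look like the sketch below. It uses only the Python standard library; `download_text`, `describe_error`, and the destination path argument are hypothetical names chosen for illustration, and the messages simply restate the two error codes listed above.

```python
import urllib.error
import urllib.request

BASE_URL = "https://theodi-ndl-core-data-api.hf.space"

# Meaning of the error codes documented for this endpoint.
ERROR_MESSAGES = {
    404: "no record found with the given identifier",
    400: "record exists but is not in text format",
}


def describe_error(status: int) -> str:
    """Map an HTTP status code to a readable message (hypothetical helper)."""
    return ERROR_MESSAGES.get(status, f"unexpected HTTP status {status}")


def download_text(identifier: str, dest_path: str, timeout: float = 30.0) -> None:
    """Save the text for `identifier` to `dest_path` (hypothetical helper)."""
    url = f"{BASE_URL}/download/text/{identifier}"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp, open(dest_path, "wb") as out:
            # Copy in chunks so large files are not held in memory at once.
            while chunk := resp.read(64 * 1024):
                out.write(chunk)
    except urllib.error.HTTPError as err:
        raise RuntimeError(f"download failed: {describe_error(err.code)}") from err


# Example usage (performs a network request):
# download_text("UUID2", "UUID2.txt")
```

Because the request is opened before the output file, a `404` or `400` response raises before any file is created on disk.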

---

## Data Sources

- **Vector Index:** [theodi/ndl-core-rag-index](https://huggingface.co/datasets/theodi/ndl-core-rag-index)
- **Structured Data:** [theodi/ndl-core-structured-data](https://huggingface.co/datasets/theodi/ndl-core-structured-data)

## Technology Stack

- **Framework:** FastAPI
- **Vector Database:** LanceDB
- **Embeddings:** Sentence Transformers (all-MiniLM-L6-v2)
- **Deployment:** Docker on Hugging Face Spaces
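
To illustrate what vector search over embeddings means in general, here is a toy sketch in plain Python. It is not the service's implementation: the three-dimensional vectors stand in for real sentence embeddings, the `top_k` helper and `toy_index` are hypothetical, and ranking by cosine similarity is an assumption, since the distance metric configured in LanceDB is not stated here.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec, index, k=5):
    """Return the identifiers of the k vectors most similar to query_vec."""
    scored = [(cosine_similarity(query_vec, vec), ident) for ident, vec in index.items()]
    scored.sort(reverse=True)  # highest similarity first
    return [ident for _, ident in scored[:k]]


# Toy stand-ins for embedded dataset descriptions (assumed values).
toy_index = {
    "doc-a": [0.9, 0.1, 0.0],
    "doc-b": [0.1, 0.9, 0.0],
    "doc-c": [0.7, 0.3, 0.1],
}

print(top_k([1.0, 0.0, 0.0], toy_index, k=2))  # prints ['doc-a', 'doc-c']
```

In the real service the query text would first be embedded by the model listed above, and the nearest-neighbour lookup would be delegated to the vector database rather than computed by a linear scan.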