ruidiao committed
Commit cf1c7cb · 1 Parent(s): 649d78e

Update README.md

Files changed (1)
  1. README.md +39 -1
README.md CHANGED
@@ -9,4 +9,42 @@ license: mit
  short_description: 'Client-side AI quote search: fast, private, no servers.'
  ---
 
- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
+ # Client-Side Semantic Quote Retrieval Engine
+
+ This project implements a "Zero-Infrastructure" client-side semantic quote retrieval engine. All computationally intensive tasks, including vectorization, indexing, and model quantization, are handled offline. The search and retrieval operations then occur entirely within the user's browser, ensuring data privacy and minimal operational overhead.
+
+ ## Project Overview
+
+ The core idea is to pre-process a large dataset of quotes into a highly optimized, quantized vector index. This index, along with a small machine learning model, is then loaded by the client-side application. When a user enters a query, the application generates an embedding for the query and performs a fast Approximate Nearest Neighbor Search (ANNS) directly in the browser to find semantically similar quotes.
+
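The retrieval idea described above can be illustrated with a minimal sketch. It is not the project's code: the helper names (`normalize`, `quantize_int8`, `search`) and the toy 3-dimensional vectors are assumptions made for illustration, and it scans every vector exhaustively rather than using an approximate nearest-neighbor structure. Real embeddings would come from the embedding model, not from hand-written lists.

```python
import math

def normalize(vec):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)

def quantize_int8(vec):
    """Map a float vector to integers in [-127, 127] after normalizing it."""
    return [round(x * 127) for x in normalize(vec)]

def search(query_vec, index, top_k=3):
    """Rank quantized vectors by approximate cosine similarity to the query."""
    q = normalize(query_vec)
    scored = []
    for quote_id, qvec in enumerate(index):
        # Dividing by 127 undoes the quantization scale.
        score = sum(a * b for a, b in zip(q, qvec)) / 127
        scored.append((score, quote_id))
    scored.sort(reverse=True)
    return [(quote_id, score) for score, quote_id in scored[:top_k]]

# Toy data standing in for real quotes and model-generated embeddings.
quotes = ["on courage", "on patience", "on cooking"]
embeddings = [[0.9, 0.1, 0.0], [0.7, 0.6, 0.1], [0.0, 0.1, 0.95]]
index = [quantize_int8(v) for v in embeddings]

results = search([1.0, 0.2, 0.0], index, top_k=2)
best_quotes = [quotes[i] for i, _ in results]
```

Storing one signed byte per dimension instead of a 32-bit float cuts the index to roughly a quarter of its size, which matters when the whole index has to be downloaded by the browser.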
+ ## Data Source
+
+ The quotes data used in this project was sourced from:
+ [https://archive.org/details/quotes_20230625](https://archive.org/details/quotes_20230625)
+
+ ## Setup and Usage
+
+ 1. **Offline Data Processing**:
+ * Ensure you have Python and the necessary libraries (e.g., `pandas`, `numpy`, `torch`, `sentence-transformers`, `tqdm`) installed.
+ * Run the `offline_processing.py` script to generate the `quotes_index.bin` file. This file contains the pre-computed embeddings and metadata.
+ ```bash
+ python offline_processing.py
+ ```
+ * *Note: This step can be time-consuming for large datasets, especially on the first run, when the embedding model must be downloaded.*
+ * **Important:** The script validates that categories in the CSV contain no uppercase letters. Rows with invalid categories are ignored.
+
+ 2. **Client-Side Application**:
+ * Open `index.html` in your web browser.
+ * The search input and button are immediately available.
+ * The application first checks for a cached index. If none is found, it displays a message warning of a significant one-time download (the model and the index), which happens on your first search.
+ * `quotes_index.bin` is loaded from the cache or downloaded, and the required machine learning model is fetched on demand via `transformers.js` during your first search.
+ * A progress bar with detailed status is shown during downloads and processing and disappears once they complete.
+ * A "Delete Cached Data" button appears once data is cached, allowing you to clear local storage.
+ * Enter your search queries to retrieve semantically similar quotes.
+
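The category rule from the offline processing step (rows whose category contains uppercase letters are ignored) could look roughly like the sketch below. This is an assumption-laden illustration, not the contents of `offline_processing.py`: the column name `category`, the function name `load_valid_rows`, and the sample rows are all invented, and it uses the standard-library `csv` module rather than `pandas` to stay dependency-free.

```python
import csv
import io

def load_valid_rows(csv_file, category_column="category"):
    """Return (kept_rows, skipped_count), dropping rows whose category has uppercase letters."""
    kept, skipped = [], 0
    for row in csv.DictReader(csv_file):
        category = row.get(category_column) or ""
        if any(ch.isupper() for ch in category):
            skipped += 1  # invalid category: ignore this row
            continue
        kept.append(row)
    return kept, skipped

# Invented sample data; a real run would open the quotes CSV instead.
sample = io.StringIO(
    "quote,author,category\n"
    '"Stay hungry.",Author A,motivation\n'
    '"Be kind.",Author B,Life\n'
    '"Keep going.",Author C,"hope, grit"\n'
)
rows, skipped = load_valid_rows(sample)
```

Returning the skipped count alongside the kept rows makes it easy to report how much of the dataset the validation discarded.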
+ ## Technologies Used
+
+ * **Frontend**: HTML, CSS (Tailwind CSS), JavaScript
+ * **Offline Processing**: Python (pandas, numpy, torch, sentence-transformers)
+ * **Embedding Model**: nomic-ai/nomic-embed-text-v1.5
+ * **Client-Side ML**: transformers.js, Web Workers
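One detail worth knowing about the embedding model listed above: the model card for nomic-ai/nomic-embed-text-v1.5 describes task prefixes such as `search_document: ` for indexed texts and `search_query: ` for queries. Whether this project's scripts apply them is not stated here, so the following is only a sketch under that assumption; `with_prefix` and `embed` are hypothetical helper names.

```python
DOCUMENT_PREFIX = "search_document: "
QUERY_PREFIX = "search_query: "

def with_prefix(texts, prefix):
    """Prepend a task prefix to each text, trimming stray whitespace first."""
    return [prefix + text.strip() for text in texts]

def embed(texts, prefix):
    """Embed prefixed texts; downloads the model on first use."""
    # Imported lazily so with_prefix works without the heavy dependency.
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer(
        "nomic-ai/nomic-embed-text-v1.5", trust_remote_code=True
    )
    # Unit-length output lets a dot product act as cosine similarity.
    return model.encode(with_prefix(texts, prefix), normalize_embeddings=True)

prefixed_query = with_prefix(["  quotes about perseverance "], QUERY_PREFIX)
```

If prefixes are used, the offline index and the in-browser query path need to agree on them, since documents and queries are embedded in different places.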