ruidiao/QuoteSearch

ruidiao committed on Oct 19, 2025

Commit cf1c7cb

1 Parent(s): 649d78e

Update README.md

Files changed (1)

README.md +39 -1

@@ -9,4 +9,42 @@ license: mit
 short_description: 'Client-side AI quote search: fast, private, no servers.'
 ---
-Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference

+# Client-Side Semantic Quote Retrieval Engine
+This project implements a "Zero-Infrastructure" client-side semantic quote retrieval engine. The computationally intensive corpus-side tasks (vectorization, indexing, and model quantization) are handled offline. Search and retrieval then run entirely within the user's browser, which keeps queries on the user's device and keeps operational overhead minimal.
+## Project Overview
+The core idea is to pre-process a large dataset of quotes into a highly optimized, quantized vector index. This index, along with a small machine learning model, is then loaded by the client-side application. When a user enters a query, the application generates an embedding for the query and performs a fast Approximate Nearest Neighbor Search (ANNS) directly in the browser to find semantically similar quotes.
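The retrieval idea described above can be illustrated with a minimal sketch. It is not the project's implementation: it uses an exhaustive cosine-similarity scan rather than a true ANNS structure, a simple per-vector int8 scaling scheme chosen for illustration, and hand-written toy vectors instead of real embeddings. The names `quantize`, `cosine`, and `search` are hypothetical.

```python
import math

def quantize(vec):
    """Map a float vector to integers in [-127, 127]; return (ints, scale).

    Assumption: per-vector symmetric scaling. The project's actual
    quantization scheme is not described in the README.
    """
    peak = max(abs(x) for x in vec) or 1.0
    scale = peak / 127.0
    return [round(x / scale) for x in vec], scale

def cosine(query, ints, scale):
    """Cosine similarity between a float query and a quantized vector."""
    dot = sum(q * (i * scale) for q, i in zip(query, ints))
    norm_q = math.sqrt(sum(q * q for q in query))
    norm_d = math.sqrt(sum((i * scale) ** 2 for i in ints))
    return dot / (norm_q * norm_d) if norm_q and norm_d else 0.0

def search(query, index, top_k=3):
    """Exhaustive scan (not ANNS): score every entry, return the best top_k."""
    scored = [(cosine(query, ints, scale), label) for label, ints, scale in index]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]

# Toy 3-dimensional vectors standing in for real quote embeddings.
toy_vectors = {
    "quote A": [0.9, 0.1, 0.0],
    "quote B": [0.0, 1.0, 0.2],
    "quote C": [0.1, 0.0, 0.95],
}
index = [(label, *quantize(vec)) for label, vec in toy_vectors.items()]

# A toy query vector pointing mostly along the first axis.
results = search([1.0, 0.0, 0.1], index, top_k=2)
for score, label in results:
    print(f"{score:.3f}  {label}")
```

An exhaustive scan like this is often adequate for small indexes; an approximate index trades some recall for speed as the collection grows.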
+## Data Source
+The quotes data used in this project was sourced from:
+[https://archive.org/details/quotes_20230625](https://archive.org/details/quotes_20230625)
+## Setup and Usage
+1.  **Offline Data Processing**:
+    *   Ensure you have Python and the necessary libraries (e.g., `pandas`, `numpy`, `torch`, `sentence-transformers`, `tqdm`) installed.
+    *   Run the `offline_processing.py` script to generate the `quotes_index.bin` file. This file contains the pre-computed embeddings and metadata.
+    ```bash
+    python offline_processing.py
+    ```
+    *   *Note: This step can be time-consuming for large datasets, especially on the first run, when the embedding model must be downloaded.*
+    *   **Important:** The script validates the categories in the CSV: a row whose category contains any uppercase letter is treated as invalid and ignored.
+2.  **Client-Side Application**:
+    *   Open `index.html` in your web browser.
+    *   The search input and button are immediately available.
+    *   The application first checks for a cached index. If none is found, it displays a message warning that a large one-time download (the model and the index) will occur on your first search.
+    *   On your first search, `quotes_index.bin` is loaded (from the cache, or downloaded if not cached) and the machine learning model is downloaded on demand via `transformers.js`.
+    *   A progress bar with detailed status is shown during downloads and processing and disappears once they complete.
+    *   A "Delete Cached Data" button will appear when data is cached, allowing you to clear local storage.
+    *   Enter your search queries to retrieve semantically similar quotes.
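The category validation mentioned in the offline processing step could look roughly like the sketch below. It assumes a CSV with `quote`, `author`, and `category` columns and uses only the standard library; the actual column names and the logic inside `offline_processing.py` are not shown in this README, so `valid_rows` and the sample rows are hypothetical.

```python
import csv
import io

def valid_rows(csv_text, category_field="category"):
    """Return the rows whose category contains no uppercase letters.

    Assumption: a missing or empty category has no uppercase letters
    and is therefore kept; the real script may treat it differently.
    """
    kept = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        category = row.get(category_field) or ""
        if any(ch.isupper() for ch in category):
            continue  # invalid category: skip the row
        kept.append(row)
    return kept

# Placeholder data, not taken from the real dataset.
sample = (
    "quote,author,category\n"
    "first sample quote,Author A,life\n"
    "second sample quote,Author B,Wisdom\n"
    "third sample quote,Author C,love\n"
)

rows = valid_rows(sample)
print([row["category"] for row in rows])
```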
+## Technologies Used
+*   **Frontend**: HTML, CSS (Tailwind CSS), JavaScript
+*   **Offline Processing**: Python (pandas, numpy, torch, sentence-transformers, tqdm)
+*   **Embedding Model**: nomic-ai/nomic-embed-text-v1.5
+*   **Client-Side ML**: transformers.js, Web Workers
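One detail worth keeping in mind with this embedding model: the model card for `nomic-ai/nomic-embed-text-v1.5` describes task prefixes such as `search_document: ` for indexed text and `search_query: ` for queries. Whether this project applies them is not stated in the README, so the sketch below only shows how the offline and in-browser sides could stay consistent; the helper names are hypothetical.

```python
# Task prefixes as described on the nomic-embed-text-v1.5 model card.
DOCUMENT_PREFIX = "search_document: "
QUERY_PREFIX = "search_query: "

def as_document(text):
    """Prepare a quote for embedding during offline indexing."""
    return DOCUMENT_PREFIX + text.strip()

def as_query(text):
    """Prepare a user query for embedding at search time."""
    return QUERY_PREFIX + text.strip()

print(as_document("  a sample quote  "))
print(as_query("quotes about courage"))
```

If prefixes are used, the same convention has to be applied in both the Python indexing step and the client-side query path, since mismatched prefixes can degrade retrieval quality.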