# Semantic IR: Project Details & Architecture

## 1. Problem Statement: The Limits of Traditional Search
Traditional Information Retrieval (IR) relies heavily on two main paradigms:

1. **Lexical / Keyword Matching (BM25):** Matches exact words but ignores grammar, word order, and context.
2. **Dense Retrieval (Vector Embeddings):** Captures semantic similarity and context but frequently struggles with **directionality** and **role reversal**.

For example, given the query *"Company A acquired Startup B"*, dense vectors might return *"Startup B acquired Company A"* with a near-perfect (≈99%) similarity score, because the vocabulary is identical even though the meaning is exactly reversed.
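This failure mode is easy to reproduce. Below is a minimal sketch, assuming only `sentence-transformers` is installed, using the same `all-MiniLM-L6-v2` model this project uses; the exact score varies by model version, but it lands close to 1.0:

```python
# Minimal sketch of the role-reversal failure mode in dense retrieval.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

query = "Company A acquired Startup B"
doc = "Startup B acquired Company A"

# Encode both sentences and compare them with cosine similarity.
emb_query, emb_doc = model.encode([query, doc])
score = util.cos_sim(emb_query, emb_doc).item()

# Prints a score close to 1.0 even though the meaning is reversed.
print(f"cosine similarity: {score:.3f}")
```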
## 2. Our Solution: Semantic Role Labeling (SRL)

To solve this, we introduced **Semantic Role Labeling (SRL)** into the retrieval pipeline. SRL extracts the predicate-argument structure of a sentence, answering *"who did what to whom, when, and where."*

We explicitly parse documents and queries into structured roles:

- **Predicate:** The main action (e.g., `acquire`)
- **ARG0 (Agent):** The doer (e.g., `Company A`)
- **ARG1 (Patient):** The receiver (e.g., `Startup B`)

This lets us structurally compare the query against the document. If the query asks for `Company A` as the `ARG0`, the document *must* have `Company A` as the `ARG0` to get a high SRL score.
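To make the comparison concrete, here is an illustrative pair of parsed structures; the field names are hypothetical and do not necessarily match the exact `corpus.json` schema:

```python
# Illustrative role structures for the query and the reversed document.
query_roles = {"predicate": "acquire", "ARG0": "Company A", "ARG1": "Startup B"}
doc_roles = {"predicate": "acquire", "ARG0": "Startup B", "ARG1": "Company A"}

# A role-aware comparison only rewards tokens that fill the SAME role, so the
# reversed document fails the agent check despite identical vocabulary.
print(query_roles["ARG0"] == doc_roles["ARG0"])  # False: the agents differ
```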
## 3. What We Built

We built a **hybrid search engine and visualizer** that runs a real-time, side-by-side comparison of four retrieval paradigms:

1. **BM25 (Lexical):** Using `rank-bm25`.
2. **Dense (Vectors):** Using `sentence-transformers` (`all-MiniLM-L6-v2`).
3. **SRL (Structural):** A custom scoring algorithm that calculates the Intersection-over-Union (IoU) of tokens within matching semantic roles between the query and the document (see the sketch after this list).
4. **Hybrid:** A weighted ensemble score: `(0.5 * BM25) + (0.3 * Dense) + (0.2 * SRL)`.
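A minimal sketch of the SRL scorer and the ensemble follows. The weights are the ones listed above; the helper names are hypothetical, and the BM25 and dense scores are assumed to be normalized to [0, 1] before mixing:

```python
# Sketch of the per-role IoU scorer and the weighted ensemble.
def role_iou(query_roles: dict[str, str], doc_roles: dict[str, str]) -> float:
    """Average token Intersection-over-Union across the roles both sides share."""
    shared = set(query_roles) & set(doc_roles)
    if not shared:
        return 0.0
    total = 0.0
    for role in shared:
        q = set(query_roles[role].lower().split())
        d = set(doc_roles[role].lower().split())
        total += len(q & d) / len(q | d) if q | d else 0.0
    return total / len(shared)

def hybrid_score(bm25: float, dense: float, srl: float) -> float:
    """Weighted ensemble of the three (normalized) paradigm scores."""
    return 0.5 * bm25 + 0.3 * dense + 0.2 * srl

query = {"predicate": "acquire", "ARG0": "Company A", "ARG1": "Startup B"}
reversed_doc = {"predicate": "acquire", "ARG0": "Startup B", "ARG1": "Company A"}
print(role_iou(query, reversed_doc))  # ~0.33: only the predicate role lines up
```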
### Key Uniqueness & Features

- **4-Way Comparison UI:** Users can type a query and instantly see how the four paradigms rank the exact same corpus.
- **Semantic Highlighting:** The React frontend parses the `roles` JSON returned by the backend and color-codes the Predicate (red), ARG0 (blue), and ARG1 (green) directly in the text.
- **Interactive Parse Trees:** The backend uses `spaCy` and `displaCy` to generate raw HTML/SVG dependency trees for every search result, which the frontend renders inside an expandable accordion. This lets researchers see exactly which grammatical structure the algorithm used (a sketch follows this list).
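The tree generation itself is a one-liner in spaCy. A minimal sketch, assuming `spacy` and its small English model are installed (`python -m spacy download en_core_web_sm`):

```python
# Minimal sketch of the dependency-tree generation.
import spacy
from spacy import displacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Company A acquired Startup B")

# displacy.render returns a self-contained SVG string that the frontend can
# embed directly inside the expandable accordion.
svg = displacy.render(doc, style="dep", page=False)
```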
## 4. Technical Architecture

- **Data Source:** We used the English PropBank dataset (`mertas7/propbank` on Hugging Face), which contains high-quality, pre-annotated SRL tags in a CoNLL-style, token-per-line TSV format.
- **Data Parsing (`parse_data.py`):** A script that flattens these token-by-token TSV files into a unified `corpus.json`, where each document is a sentence mapped to its extracted roles.
- **Backend (`api.py`):** A FastAPI server that loads `corpus.json` into memory. On every request, it uses `spaCy` to dynamically extract the semantic roles of the *user's query*, scores it against the corpus, and returns the top results (a sketch of this endpoint follows the list).
- **Frontend (React/Vite):** A dashboard styled with Tailwind CSS v4 that fetches from the backend and cleanly renders the comparative data.
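A minimal sketch of that request path follows. The endpoint path, field names, and the dependency-based role heuristic are assumptions rather than the exact implementation, and `role_iou` is the helper sketched in Section 3:

```python
# Minimal sketch of the api.py request path.
import json

import spacy
from fastapi import FastAPI

from srl_scoring import role_iou  # hypothetical module; see Section 3 sketch

app = FastAPI()
nlp = spacy.load("en_core_web_sm")

# Load the pre-parsed corpus once at startup.
with open("corpus.json") as f:
    CORPUS = json.load(f)  # assumed shape: [{"text": ..., "roles": {...}}, ...]

@app.get("/search")
def search(q: str, k: int = 10):
    # Derive a rough predicate-argument structure for the query from the
    # spaCy dependency parse (verb -> predicate, subject -> ARG0, object -> ARG1).
    parsed = nlp(q)
    query_roles = {}
    for token in parsed:
        if token.pos_ == "VERB":
            query_roles["predicate"] = token.lemma_
        elif token.dep_ == "nsubj":
            query_roles["ARG0"] = " ".join(t.text for t in token.subtree)
        elif token.dep_ == "dobj":
            query_roles["ARG1"] = " ".join(t.text for t in token.subtree)

    # Rank the corpus by the SRL score and return the top-k documents.
    ranked = sorted(
        CORPUS,
        key=lambda doc: role_iou(query_roles, doc["roles"]),
        reverse=True,
    )
    return ranked[:k]
```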