# Semantic IR: Project Details & Architecture

## 1. Problem Statement: The Limits of Traditional Search

Traditional Information Retrieval (IR) relies heavily on two main paradigms:

1. **Lexical / Keyword Matching (BM25):** Matches exact words but completely ignores grammar, word order, and context.
2. **Dense Retrieval (Vector Embeddings):** Captures semantic similarity and context but frequently struggles with **directionality** and **role reversal**.

For example, given the query *"Company A acquired Startup B"*, dense vectors might return *"Startup B acquired Company A"* with a near-perfect similarity score, because the words and overall context are identical even though the meaning is exactly the opposite.

## 2. Our Solution: Semantic Role Labeling (SRL)

To solve this, we introduced **Semantic Role Labeling (SRL)** into the retrieval pipeline. SRL extracts the predicate-argument structure of a sentence, answering *"who did what to whom, when, and where."* It explicitly parses documents and queries into structured roles:

- **Predicate:** The main action (e.g., `acquire`)
- **ARG0 (Agent):** The doer (e.g., `Company A`)
- **ARG1 (Patient):** The receiver (e.g., `Startup B`)

This lets us structurally compare the query against the document: if the query asks for `Company A` as the `ARG0`, the document *must* have `Company A` as the `ARG0` to receive a high SRL score.

## 3. What We Built

We built a **hybrid search engine and visualizer** that runs a real-time, side-by-side comparison of four retrieval paradigms:

1. **BM25 (Lexical):** Using `rank-bm25`.
2. **Dense (Vectors):** Using `sentence-transformers` (`all-MiniLM-L6-v2`).
3. **SRL (Structural):** A custom scoring algorithm that calculates the Intersection-over-Union (IoU) of tokens within matching semantic roles between the query and the document.
4. **Hybrid:** A weighted ensemble score: `(0.5 * BM25) + (0.3 * Dense) + (0.2 * SRL)`.
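The SRL and Hybrid scores above can be sketched in a few lines. This is a minimal illustration, not the project's exact implementation: the role dictionaries, function names, and per-role averaging are assumptions; only the IoU idea and the 0.5/0.3/0.2 weights come from the description above.

```python
def srl_score(query_roles: dict, doc_roles: dict) -> float:
    """Average token-set IoU over the roles present in the query.

    Both arguments map role labels (e.g. "PRED", "ARG0", "ARG1")
    to lists of lowercase tokens. (Illustrative sketch.)
    """
    if not query_roles:
        return 0.0
    total = 0.0
    for role, q_tokens in query_roles.items():
        q, d = set(q_tokens), set(doc_roles.get(role, []))
        union = q | d
        total += len(q & d) / len(union) if union else 0.0
    return total / len(query_roles)

def hybrid_score(bm25: float, dense: float, srl: float) -> float:
    """Weighted ensemble using the weights stated above."""
    return 0.5 * bm25 + 0.3 * dense + 0.2 * srl

# Role reversal is penalized: same tokens, swapped ARG0/ARG1.
query    = {"PRED": ["acquire"], "ARG0": ["company", "a"], "ARG1": ["startup", "b"]}
match    = {"PRED": ["acquire"], "ARG0": ["company", "a"], "ARG1": ["startup", "b"]}
reversed_doc = {"PRED": ["acquire"], "ARG0": ["startup", "b"], "ARG1": ["company", "a"]}

print(srl_score(query, match))         # 1.0
print(srl_score(query, reversed_doc))  # ~0.33 — only the predicate overlaps
```

Note how a purely lexical or dense scorer would treat `match` and `reversed_doc` as near-identical, while the role-wise IoU separates them sharply.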
### Key Uniqueness & Features

- **4-Way Comparison UI:** Users can type a query and instantly see how the four paradigms rank the exact same corpus.
- **Semantic Highlighting:** The React frontend parses the `roles` JSON returned by the backend and color-codes the Predicate (red), ARG0 (blue), and ARG1 (green) directly in the text.
- **Interactive Parse Trees:** The backend uses `spaCy` and `displaCy` to generate raw HTML/SVG dependency trees for every search result, which the frontend renders inside an expandable accordion. This lets researchers see the exact grammatical structure the algorithm used.

## 4. Technical Architecture

- **Data Source:** The English PropBank dataset (`mertas7/propbank` on Hugging Face), which contains high-quality, pre-annotated SRL tags in CoNLL-U format.
- **Data Parsing (`parse_data.py`):** A script that flattens these complex token-by-token TSV files into a unified `corpus.json`, where each document is a sentence mapped to its extracted roles.
- **Backend (`api.py`):** A FastAPI server that loads `corpus.json` into memory. On every request, it uses `spaCy` to dynamically extract the semantic roles of the *user's query*, scores the query against the corpus, and returns the top results.
- **Frontend (React/Vite):** A Tailwind CSS v4 dashboard that fetches from the backend and renders the comparative data cleanly.
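The flattening step in `parse_data.py` can be sketched as follows. The two-column row layout, BIO-style tag names, and `flatten_sentence` helper are illustrative assumptions (the real PropBank TSV carries more columns); only the outcome — one JSON document per sentence with its extracted roles — reflects the pipeline described above.

```python
import json

def flatten_sentence(rows):
    """Collapse token-level SRL rows into one corpus document.

    Each row is (token, srl_tag); tags like "B-ARG0"/"I-ARG0" mark
    role spans, "B-V" marks the predicate, "O" is outside any role.
    (Assumed two-column layout for illustration.)
    """
    text, roles = [], {}
    for token, tag in rows:
        text.append(token)
        if tag == "O":
            continue
        role = tag.split("-", 1)[1]            # strip the B-/I- prefix
        role = "PRED" if role == "V" else role
        roles.setdefault(role, []).append(token.lower())
    return {"text": " ".join(text), "roles": roles}

rows = [
    ("Company", "B-ARG0"), ("A", "I-ARG0"),
    ("acquired", "B-V"),
    ("Startup", "B-ARG1"), ("B", "I-ARG1"),
]
print(json.dumps(flatten_sentence(rows), indent=2))
```

Applying this over every sentence and dumping the resulting list yields a `corpus.json` the backend can load once at startup and score on every request.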