Kamal Nayan Kumar
# Semantic IR: Project Details & Architecture
## 1. Problem Statement: The Limits of Traditional Search
Traditional Information Retrieval (IR) relies heavily on two main paradigms:
1. **Lexical / Keyword Matching (BM25):** Matches exact words but completely ignores grammar, word order, and context.
2. **Dense Retrieval (Vector Embeddings):** Captures semantic similarity and context but frequently struggles with **directionality** and **role reversal**.
For example, given the query *"Company A acquired Startup B"*, dense vectors might return *"Startup B acquired Company A"* with a near-perfect similarity score, because the vocabulary and overall context are identical even though the meaning is exactly the opposite.
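The failure mode is easy to demonstrate with any order-blind representation. The sketch below uses plain bag-of-words cosine similarity (not the project's actual embedding model) purely to show that a representation which ignores word order cannot distinguish the two sentences:

```python
from collections import Counter
from math import sqrt

def bag_of_words_cosine(a: str, b: str) -> float:
    """Cosine similarity over unordered token counts (order-blind)."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[t] * vb[t] for t in va)
    norm = sqrt(sum(c * c for c in va.values())) * sqrt(sum(c * c for c in vb.values()))
    return dot / norm if norm else 0.0

query = "Company A acquired Startup B"
reversed_doc = "Startup B acquired Company A"
print(bag_of_words_cosine(query, reversed_doc))  # 1.0 — identical, despite opposite meaning
```

Real dense embeddings are not fully order-blind, but in practice they behave similarly on role-reversed pairs: the score stays near the top of the scale.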
## 2. Our Solution: Semantic Role Labeling (SRL)
To solve this, we introduced **Semantic Role Labeling (SRL)** into the retrieval pipeline. SRL extracts the predicate-argument structure of a sentence—answering *"who did what to whom, when, and where."*
We explicitly parse documents and queries into structured roles:
- **Predicate:** The main action (e.g., `acquire`)
- **ARG0 (Agent):** The doer (e.g., `Company A`)
- **ARG1 (Patient):** The receiver (e.g., `Startup B`)
With both sides parsed this way, we can structurally compare the query against the document. If the query asks for `Company A` as the `ARG0`, the document *must* also have `Company A` as the `ARG0` to receive a high SRL score.
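The role-level comparison can be sketched as a token IoU averaged over the roles the query specifies (a minimal illustration of the idea; function names here are illustrative, not the actual implementation):

```python
def token_iou(a: str, b: str) -> float:
    """Intersection-over-union of the token sets of two role spans."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

def srl_score(query_roles: dict, doc_roles: dict) -> float:
    """Average token IoU over the roles present in the query."""
    shared = [r for r in query_roles if r in doc_roles]
    if not shared:
        return 0.0
    return sum(token_iou(query_roles[r], doc_roles[r]) for r in shared) / len(shared)

query   = {"PRED": "acquire", "ARG0": "Company A", "ARG1": "Startup B"}
doc_ok  = {"PRED": "acquire", "ARG0": "Company A", "ARG1": "Startup B"}
doc_rev = {"PRED": "acquire", "ARG0": "Startup B", "ARG1": "Company A"}
print(srl_score(query, doc_ok))   # 1.0 — roles align
print(srl_score(query, doc_rev))  # ~0.33 — only the predicate matches
```

Because the comparison is per-role, swapping agent and patient collapses the score even when every word matches.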
## 3. What We Built
We built a **hybrid search engine and visualizer** that runs a real-time, side-by-side comparison of 4 different retrieval paradigms:
1. **BM25 (Lexical):** Using `rank-bm25`.
2. **Dense (Vectors):** Using `sentence-transformers` (`all-MiniLM-L6-v2`).
3. **SRL (Structural):** A custom scoring algorithm that calculates the Intersection-over-Union (IoU) of tokens within matching semantic roles between the query and the document.
4. **Hybrid:** A weighted ensemble score: `(0.5 * BM25) + (0.3 * Dense) + (0.2 * SRL)`.
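The hybrid ensemble can be sketched as below. Since BM25, dense, and SRL scores live on different scales, this sketch min-max normalizes each paradigm's scores before applying the weights — the normalization step is an assumption for illustration, not something the formula above specifies:

```python
def min_max_normalize(scores):
    """Scale a score list to [0, 1] so the three paradigms are comparable."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0] * len(scores)
    return [(s - lo) / (hi - lo) for s in scores]

def hybrid_scores(bm25, dense, srl, w=(0.5, 0.3, 0.2)):
    """Weighted ensemble: (0.5 * BM25) + (0.3 * Dense) + (0.2 * SRL)."""
    b, d, s = (min_max_normalize(x) for x in (bm25, dense, srl))
    return [w[0] * bi + w[1] * di + w[2] * si for bi, di, si in zip(b, d, s)]

# Example: three candidate documents, each scored by all three paradigms.
print(hybrid_scores([2.1, 0.4, 1.0], [0.91, 0.88, 0.35], [1.0, 0.0, 0.33]))
```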
### Key Uniqueness & Features
- **4-Way Comparison UI:** Users can type a query and instantly see how the four paradigms rank the exact same corpus.
- **Semantic Highlighting:** The React frontend parses the `roles` JSON returned by the backend and color-codes the Predicate (Red), ARG0 (Blue), and ARG1 (Green) directly in the text.
- **Interactive Parse Trees:** The backend uses `spaCy` and `displaCy` to generate raw HTML/SVG dependency trees for every search result, which the frontend renders inside an expandable accordion. This lets researchers visually inspect the grammatical structure the algorithm used.
## 4. Technical Architecture
- **Data Source:** We used the English PropBank dataset (`mertas7/propbank` on Hugging Face), which contains high-quality, pre-annotated SRL tags in a CoNLL-style tab-separated format.
- **Data Parsing (`parse_data.py`):** We wrote a script to flatten these complex token-by-token TSV files into a unified `corpus.json` where each document is a sentence mapped to its extracted roles.
- **Backend (`api.py`):** A FastAPI server that loads the `corpus.json` into memory. On every request, it uses `spaCy` to dynamically extract the semantic roles of the *user's query*, scores it against the corpus, and returns the top results.
- **Frontend (React/Vite):** A beautifully styled dashboard (Tailwind CSS v4) that fetches from the backend and renders the complex comparative data cleanly.
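The flattening step in `parse_data.py` can be sketched as follows. The two-column `(token, BIO tag)` layout here is a simplification for illustration — the real PropBank files carry more columns per token:

```python
import json

# Hypothetical CoNLL-style rows for one sentence: (token, BIO-encoded SRL tag).
rows = [
    ("Company", "B-ARG0"), ("A", "I-ARG0"),
    ("acquired", "B-V"),
    ("Startup", "B-ARG1"), ("B", "I-ARG1"),
]

def flatten(rows):
    """Collapse token-level BIO tags into one document entry: text + role spans."""
    roles = {}
    tokens = []
    for token, tag in rows:
        tokens.append(token)
        if tag == "O":
            continue
        label = tag.split("-", 1)[1]  # strip the B-/I- prefix
        roles.setdefault(label, []).append(token)
    return {"text": " ".join(tokens),
            "roles": {k: " ".join(v) for k, v in roles.items()}}

doc = flatten(rows)
print(json.dumps(doc, indent=2))
```

Each flattened entry pairs the reconstructed sentence with its role spans, which is the shape the scoring code consumes from `corpus.json`.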
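The per-request flow in `api.py` can be sketched without the web framework. Here `extract_roles` is a stand-in for the real spaCy-based extractor, and the tiny in-memory corpus mirrors the shape loaded from `corpus.json`; the exact-match role overlap is a deliberate simplification of the scoring described above:

```python
corpus = [
    {"text": "Company A acquired Startup B",
     "roles": {"PRED": "acquire", "ARG0": "company a", "ARG1": "startup b"}},
    {"text": "Startup B acquired Company A",
     "roles": {"PRED": "acquire", "ARG0": "startup b", "ARG1": "company a"}},
]

def extract_roles(query: str) -> dict:
    """Placeholder for the spaCy-based extraction of the query's roles."""
    return {"PRED": "acquire", "ARG0": "company a", "ARG1": "startup b"}

def role_overlap(q: dict, d: dict) -> float:
    """Fraction of the query's roles that the document fills identically."""
    matches = [r for r in q if d.get(r) == q[r]]
    return len(matches) / len(q) if q else 0.0

def search(query: str, top_k: int = 5):
    """Extract the query's roles, score every document, return the top results."""
    q_roles = extract_roles(query)
    scored = [(role_overlap(q_roles, doc["roles"]), doc) for doc in corpus]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [{"text": d["text"], "score": s} for s, d in scored[:top_k]]

results = search("Company A acquired Startup B")
print(results[0]["text"])  # the role-consistent document ranks first
```

In the actual server this logic sits behind a FastAPI endpoint, with the corpus loaded once at startup rather than per request.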