Kamal Nayan Kumar

Semantic IR: Project Details & Architecture

1. Problem Statement: The Limits of Traditional Search

Traditional Information Retrieval (IR) relies heavily on two main paradigms:

  1. Lexical / Keyword Matching (BM25): Matches exact words but completely ignores grammar, word order, and context.
  2. Dense Retrieval (Vector Embeddings): Captures semantic similarity and context but frequently struggles with directionality and role reversal.

For example, given the query: "Company A acquired Startup B", dense vectors might return "Startup B acquired Company A" with a near-perfect 99% similarity score because the words and overall context are identical, despite the meaning being exactly the opposite.
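The failure of order-insensitive representations is easy to demonstrate. Under a bag-of-words view (the representation lexical scorers like BM25 operate on), the two sentences above are literally indistinguishable, so no order-insensitive scorer can separate them:

```python
from collections import Counter

def bag_of_words(sentence: str) -> Counter:
    """Order-insensitive token counts -- all that lexical scorers see."""
    return Counter(sentence.lower().split())

query = "Company A acquired Startup B"
reversed_doc = "Startup B acquired Company A"

# The two bags are identical, so any scorer built on them
# (BM25, TF-IDF, cosine over term counts) assigns both readings
# the same score, even though the meaning is inverted.
print(bag_of_words(query) == bag_of_words(reversed_doc))  # True
```

Dense embeddings do encode some word-order signal, but in practice the shared vocabulary and topic dominate, which is what produces the near-identical similarity scores described above.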

2. Our Solution: Semantic Role Labeling (SRL)

To solve this, we introduced Semantic Role Labeling (SRL) into the retrieval pipeline. SRL extracts the predicate-argument structure of a sentence—answering "who did what to whom, when, and where."

By explicitly parsing documents and queries into structured roles:

  • Predicate: The main action (e.g., acquire)
  • ARG0 (Agent): The doer (e.g., Company A)
  • ARG1 (Patient): The receiver (e.g., Startup B)

We can structurally compare the query against the document. If the query asks for Company A as the ARG0, the document must have Company A as the ARG0 to get a high SRL score.
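Sketched in Python (the role dicts below are hand-written stand-ins; in the real pipeline they come from an SRL extractor):

```python
def roles_match(query_roles: dict, doc_roles: dict) -> bool:
    """Role-aware check: every role the query specifies must be
    filled by the same text in the document."""
    return all(doc_roles.get(role) == filler
               for role, filler in query_roles.items())

query        = {"PRED": "acquire", "ARG0": "Company A", "ARG1": "Startup B"}
doc_ok       = {"PRED": "acquire", "ARG0": "Company A", "ARG1": "Startup B"}
doc_reversed = {"PRED": "acquire", "ARG0": "Startup B", "ARG1": "Company A"}

print(roles_match(query, doc_ok))        # True
print(roles_match(query, doc_reversed))  # False: agent and patient swapped
```

Because the comparison is keyed by role rather than by surface token, the reversed sentence fails exactly where it should: `Startup B` fills ARG0 instead of ARG1.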

3. What We Built

We built a hybrid search engine and visualizer that runs a real-time, side-by-side comparison of four retrieval paradigms:

  1. BM25 (Lexical): Using rank-bm25.
  2. Dense (Vectors): Using sentence-transformers (all-MiniLM-L6-v2).
  3. SRL (Structural): A custom scoring algorithm that calculates the Intersection-over-Union (IoU) of tokens within matching semantic roles between the query and the document.
  4. Hybrid: A weighted ensemble score: (0.5 * BM25) + (0.3 * Dense) + (0.2 * SRL).
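The two custom scoring steps above can be sketched as follows (function names are illustrative; the real implementations live in the backend, and the ensemble assumes per-paradigm scores normalized to [0, 1]):

```python
def role_iou(query_roles: dict, doc_roles: dict) -> float:
    """Mean Intersection-over-Union of token sets for the roles the
    query and document share; 0.0 when no roles align."""
    shared = query_roles.keys() & doc_roles.keys()
    if not shared:
        return 0.0
    total = 0.0
    for role in shared:
        q = set(query_roles[role].lower().split())
        d = set(doc_roles[role].lower().split())
        total += len(q & d) / len(q | d)
    return total / len(shared)

def hybrid_score(bm25: float, dense: float, srl: float) -> float:
    """Fixed-weight ensemble as defined above."""
    return 0.5 * bm25 + 0.3 * dense + 0.2 * srl

query = {"PRED": "acquired", "ARG0": "company a", "ARG1": "startup b"}
doc   = {"PRED": "acquired", "ARG0": "startup b", "ARG1": "company a"}
print(role_iou(query, doc))  # ~0.33: only the predicate role matches
```

Note how the role-reversed document scores only ~0.33 on SRL despite sharing every token with the query, which is exactly the signal the hybrid score folds in.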

Key Uniqueness & Features

  • 4-Way Comparison UI: Users can type a query and instantly see how the four paradigms rank the exact same corpus.
  • Semantic Highlighting: The React frontend parses the roles JSON returned by the backend and color-codes the Predicate (Red), ARG0 (Blue), and ARG1 (Green) directly in the text.
  • Interactive Parse Trees: The backend uses spaCy and displaCy to generate raw HTML/SVG dependency trees for every search result, which the frontend renders inside an expandable accordion. This lets researchers visually inspect the grammatical structure the algorithm used.

4. Technical Architecture

  • Data Source: We used the English PropBank dataset (mertas7/propbank on Hugging Face), which contains high-quality, pre-annotated SRL tags in CoNLL-style tab-separated format.
  • Data Parsing (parse_data.py): We wrote a script to flatten these complex token-by-token TSV files into a unified corpus.json where each document is a sentence mapped to its extracted roles.
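A simplified sketch of that flattening step (the column layout here is illustrative; the real PropBank files carry more fields per token):

```python
import json

# Illustrative token-per-row input: token<TAB>role, one sentence.
raw_tsv = """\
Company\tARG0
A\tARG0
acquired\tPRED
Startup\tARG1
B\tARG1
"""

def flatten(tsv_text: str) -> dict:
    """Collapse token rows into one corpus document: the full
    sentence text plus each role mapped to its token span."""
    tokens, roles = [], {}
    for line in tsv_text.splitlines():
        if not line.strip():
            continue
        token, role = line.split("\t")
        tokens.append(token)
        roles.setdefault(role, []).append(token)
    return {
        "text": " ".join(tokens),
        "roles": {role: " ".join(span) for role, span in roles.items()},
    }

doc = flatten(raw_tsv)
print(json.dumps(doc, indent=2))
```

The real script iterates over every sentence block and writes the resulting list of documents out as corpus.json.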
  • Backend (api.py): A FastAPI server that loads the corpus.json into memory. On every request, it uses spaCy to dynamically extract the semantic roles of the user's query, scores it against the corpus, and returns the top results.
  • Frontend (React/Vite): A dashboard styled with Tailwind CSS v4 that fetches results from the backend and renders the comparative data cleanly.