Semantic IR: Project Details & Architecture
1. Problem Statement: The Limits of Traditional Search
Traditional Information Retrieval (IR) relies heavily on two main paradigms:
- Lexical / Keyword Matching (BM25): Matches exact words but completely ignores grammar, word order, and context.
- Dense Retrieval (Vector Embeddings): Captures semantic similarity and context but frequently struggles with directionality and role reversal.
For example, given the query: "Company A acquired Startup B", dense vectors might return "Startup B acquired Company A" with a near-perfect 99% similarity score because the words and overall context are identical, despite the meaning being exactly the opposite.
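To see why order-insensitive representations fail here, consider the degenerate case: a bag-of-words vector (a deliberate simplification of mean-pooled embeddings, not the actual `all-MiniLM-L6-v2` model) is literally identical for the two sentences, so cosine similarity cannot fall below a perfect score:

```python
# Simplified illustration of the failure mode: a bag-of-words vector
# ignores word order entirely, so the query and its role-reversed
# counterpart get a cosine similarity of ~1.0 despite opposite meanings.
from collections import Counter
import math

def bow_cosine(a: str, b: str) -> float:
    """Cosine similarity between bag-of-words count vectors of two sentences."""
    ca, cb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(ca[t] * cb[t] for t in ca)
    na = math.sqrt(sum(v * v for v in ca.values()))
    nb = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (na * nb)

query = "Company A acquired Startup B"
reversed_doc = "Startup B acquired Company A"
print(bow_cosine(query, reversed_doc))  # ≈ 1.0 — meaning flipped, score perfect
```

Real transformer embeddings are not literally bag-of-words, but pooling over contextual token vectors exhibits the same tendency: shared vocabulary and topic dominate the score, while agent/patient direction contributes very little.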
2. Our Solution: Semantic Role Labeling (SRL)
To solve this, we introduced Semantic Role Labeling (SRL) into the retrieval pipeline. SRL extracts the predicate-argument structure of a sentence—answering "who did what to whom, when, and where."
By explicitly parsing documents and queries into structured roles:
- Predicate: The main action (e.g., `acquire`)
- ARG0 (Agent): The doer (e.g., `Company A`)
- ARG1 (Patient): The receiver (e.g., `Startup B`)
We can structurally compare the query against the document. If the query asks for Company A as the ARG0, the document must have Company A as the ARG0 to get a high SRL score.
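This structural constraint can be sketched in a few lines. The role-dictionary shape below is illustrative, not the project's exact schema: a document only matches when every role the query specifies is filled by the same entity.

```python
# Sketch of role-level matching: agent/patient reversals are rejected
# because the same string must fill the same role on both sides.
def roles_match(query_roles: dict, doc_roles: dict) -> bool:
    """True only when every role in the query is filled identically in the doc."""
    return all(doc_roles.get(role) == filler for role, filler in query_roles.items())

query        = {"PRED": "acquire", "ARG0": "Company A", "ARG1": "Startup B"}
doc_ok       = {"PRED": "acquire", "ARG0": "Company A", "ARG1": "Startup B"}
doc_reversed = {"PRED": "acquire", "ARG0": "Startup B", "ARG1": "Company A"}

print(roles_match(query, doc_ok))        # True
print(roles_match(query, doc_reversed))  # False — roles are swapped
```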
3. What We Built
We built a hybrid search engine and visualizer that runs a real-time, side-by-side comparison of 4 different retrieval paradigms:
- BM25 (Lexical): Using `rank-bm25`.
- Dense (Vectors): Using `sentence-transformers` (`all-MiniLM-L6-v2`).
- SRL (Structural): A custom scoring algorithm that calculates the Intersection-over-Union (IoU) of tokens within matching semantic roles between the query and the document.
- Hybrid: A weighted ensemble score: `(0.5 * BM25) + (0.3 * Dense) + (0.2 * SRL)`.
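The two scores described above can be sketched as follows. The per-role token sets are an assumed input shape; the IoU averaging strategy (mean over roles present in the query) is one reasonable reading of the description, and the ensemble weights are taken directly from the formula above.

```python
# Sketch: per-role token IoU for the SRL score, then the weighted ensemble.
def srl_score(q_roles: dict[str, set[str]], d_roles: dict[str, set[str]]) -> float:
    """Mean Intersection-over-Union of tokens, computed role by role."""
    ious = []
    for role, q_toks in q_roles.items():
        d_toks = d_roles.get(role, set())  # missing role in the doc -> IoU 0
        union = q_toks | d_toks
        ious.append(len(q_toks & d_toks) / len(union) if union else 0.0)
    return sum(ious) / len(ious) if ious else 0.0

def hybrid(bm25: float, dense: float, srl: float) -> float:
    """Weighted ensemble with the fixed weights from the write-up."""
    return 0.5 * bm25 + 0.3 * dense + 0.2 * srl

q = {"ARG0": {"company", "a"}, "ARG1": {"startup", "b"}}
d = {"ARG0": {"company", "a"}, "ARG1": {"startup", "b"}}
print(srl_score(q, d))        # 1.0 — every query role aligns perfectly
print(hybrid(0.8, 0.9, 1.0))  # ≈ 0.87
```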
Key Uniqueness & Features
- 4-Way Comparison UI: Users can type a query and instantly see how the four paradigms rank the exact same corpus.
- Semantic Highlighting: The React frontend parses the `roles` JSON returned by the backend and color-codes the Predicate (Red), ARG0 (Blue), and ARG1 (Green) directly in the text.
- Interactive Parse Trees: The backend uses `spaCy` and `displaCy` to generate raw HTML/SVG dependency trees for every search result, which the frontend renders inside an expandable accordion. This lets researchers see exactly the grammatical structure the algorithm used.
4. Technical Architecture
- Data Source: We used the English PropBank dataset (`mertas7/propbank` on Hugging Face), which contains high-quality, pre-annotated SRL tags in CoNLL-U CSV format.
- Data Parsing (`parse_data.py`): We wrote a script to flatten these complex token-by-token TSV files into a unified `corpus.json`, where each document is a sentence mapped to its extracted roles.
- Backend (`api.py`): A FastAPI server that loads the `corpus.json` into memory. On every request, it uses `spaCy` to dynamically extract the semantic roles of the user's query, scores it against the corpus, and returns the top results.
- Frontend (React/Vite): A beautifully styled dashboard (Tailwind CSS v4) that fetches from the backend and renders the complex comparative data cleanly.
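The flattening step can be sketched as below. The token-per-row input shape (sentence id, token, role tag) and the `"O"` outside-any-role tag are illustrative assumptions about the annotated files, not the dataset's exact schema.

```python
# Sketch of the parse_data.py flattening: group token-level rows into
# one document per sentence, collecting role -> text spans along the way.
from collections import defaultdict
import json

rows = [
    ("s1", "Company", "ARG0"), ("s1", "A", "ARG0"),
    ("s1", "acquired", "PRED"),
    ("s1", "Startup", "ARG1"), ("s1", "B", "ARG1"),
]

def flatten(rows):
    """Turn token rows into [{id, text, roles}] documents, one per sentence."""
    sentences = defaultdict(lambda: {"text": [], "roles": defaultdict(list)})
    for sent_id, token, tag in rows:
        sentences[sent_id]["text"].append(token)
        if tag != "O":  # assumed marker for tokens outside any role
            sentences[sent_id]["roles"][tag].append(token)
    return [
        {"id": sid,
         "text": " ".join(s["text"]),
         "roles": {r: " ".join(toks) for r, toks in s["roles"].items()}}
        for sid, s in sentences.items()
    ]

corpus = flatten(rows)
print(json.dumps(corpus[0], indent=2))
```

Writing the result with `json.dump` yields a `corpus.json` the backend can load once at startup and score on every request.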