Kamal Nayan Kumar
# Semantic IR: Project Details & Architecture
## 1. Problem Statement: The Limits of Traditional Search
Traditional Information Retrieval (IR) relies heavily on two main paradigms:
1. **Lexical / Keyword Matching (BM25):** Matches exact words but completely ignores grammar, word order, and context.
2. **Dense Retrieval (Vector Embeddings):** Captures semantic similarity and context but frequently struggles with **directionality** and **role reversal**.
For example, given the query *"Company A acquired Startup B"*, dense vectors might return *"Startup B acquired Company A"* with a near-perfect similarity score, because the vocabulary and overall context are identical even though the meaning is exactly the opposite.
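The failure mode is easy to demonstrate with any order-blind representation. The sketch below uses plain bag-of-words cosine similarity (not the project's actual embedding model) purely to show that a representation which ignores word order cannot distinguish the two sentences:

```python
from collections import Counter
from math import sqrt

def bag_of_words_cosine(a: str, b: str) -> float:
    """Cosine similarity over unordered token counts (order-blind)."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[t] * vb[t] for t in va)
    norm = sqrt(sum(c * c for c in va.values())) * sqrt(sum(c * c for c in vb.values()))
    return dot / norm if norm else 0.0

query = "Company A acquired Startup B"
reversed_doc = "Startup B acquired Company A"
print(bag_of_words_cosine(query, reversed_doc))  # 1.0 — identical, despite opposite meaning
```

Real dense embeddings are not fully order-blind, but in practice they behave similarly on role-reversed pairs: the score stays near the top of the scale.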
## 2. Our Solution: Semantic Role Labeling (SRL)
To solve this, we introduced **Semantic Role Labeling (SRL)** into the retrieval pipeline. SRL extracts the predicate-argument structure of a sentence—answering *"who did what to whom, when, and where."*
We explicitly parse documents and queries into structured roles:
- **Predicate:** The main action (e.g., `acquire`)
- **ARG0 (Agent):** The doer (e.g., `Company A`)
- **ARG1 (Patient):** The receiver (e.g., `Startup B`)
With both sides parsed this way, we can structurally compare the query against the document. If the query asks for `Company A` as the `ARG0`, the document *must* also have `Company A` as the `ARG0` to receive a high SRL score.
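The role-level comparison can be sketched as a token IoU averaged over the roles the query specifies (a minimal illustration of the idea; function names here are illustrative, not the actual implementation):

```python
def token_iou(a: str, b: str) -> float:
    """Intersection-over-union of the token sets of two role spans."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

def srl_score(query_roles: dict, doc_roles: dict) -> float:
    """Average token IoU over the roles present in the query."""
    shared = [r for r in query_roles if r in doc_roles]
    if not shared:
        return 0.0
    return sum(token_iou(query_roles[r], doc_roles[r]) for r in shared) / len(shared)

query   = {"PRED": "acquire", "ARG0": "Company A", "ARG1": "Startup B"}
doc_ok  = {"PRED": "acquire", "ARG0": "Company A", "ARG1": "Startup B"}
doc_rev = {"PRED": "acquire", "ARG0": "Startup B", "ARG1": "Company A"}
print(srl_score(query, doc_ok))   # 1.0 — roles align
print(srl_score(query, doc_rev))  # ~0.33 — only the predicate matches
```

Because the comparison is per-role, swapping agent and patient collapses the score even when every word matches.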
## 3. What We Built
We built a **hybrid search engine and visualizer** that runs a real-time, side-by-side comparison of 4 different retrieval paradigms:
1. **BM25 (Lexical):** Using `rank-bm25`.
2. **Dense (Vectors):** Using `sentence-transformers` (`all-MiniLM-L6-v2`).
3. **SRL (Structural):** A custom scoring algorithm that calculates the Intersection-over-Union (IoU) of tokens within matching semantic roles between the query and the document.
4. **Hybrid:** A weighted ensemble score: `(0.5 * BM25) + (0.3 * Dense) + (0.2 * SRL)`.
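The hybrid ensemble can be sketched as below. Since BM25, dense, and SRL scores live on different scales, this sketch min-max normalizes each paradigm's scores before applying the weights — the normalization step is an assumption for illustration, not something the formula above specifies:

```python
def min_max_normalize(scores):
    """Scale a score list to [0, 1] so the three paradigms are comparable."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0] * len(scores)
    return [(s - lo) / (hi - lo) for s in scores]

def hybrid_scores(bm25, dense, srl, w=(0.5, 0.3, 0.2)):
    """Weighted ensemble: (0.5 * BM25) + (0.3 * Dense) + (0.2 * SRL)."""
    b, d, s = (min_max_normalize(x) for x in (bm25, dense, srl))
    return [w[0] * bi + w[1] * di + w[2] * si for bi, di, si in zip(b, d, s)]

# Example: three candidate documents, each scored by all three paradigms.
print(hybrid_scores([2.1, 0.4, 1.0], [0.91, 0.88, 0.35], [1.0, 0.0, 0.33]))
```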
### Key Uniqueness & Features
- **4-Way Comparison UI:** Users can type a query and instantly see how the four paradigms rank the exact same corpus.
- **Semantic Highlighting:** The React frontend parses the `roles` JSON returned by the backend and color-codes the Predicate (Red), ARG0 (Blue), and ARG1 (Green) directly in the text.
- **Interactive Parse Trees:** The backend uses `spaCy` and `displaCy` to generate raw HTML/SVG dependency trees for every search result, which the frontend renders inside an expandable accordion. This lets researchers visually inspect the grammatical structure the algorithm used.
## 4. Technical Architecture
- **Data Source:** We used the English PropBank dataset (`mertas7/propbank` on Hugging Face), which contains high-quality, pre-annotated SRL tags in a CoNLL-style tab-separated format.
- **Data Parsing (`parse_data.py`):** We wrote a script to flatten these complex token-by-token TSV files into a unified `corpus.json` where each document is a sentence mapped to its extracted roles.
- **Backend (`api.py`):** A FastAPI server that loads the `corpus.json` into memory. On every request, it uses `spaCy` to dynamically extract the semantic roles of the *user's query*, scores it against the corpus, and returns the top results.
- **Frontend (React/Vite):** A beautifully styled dashboard (Tailwind CSS v4) that fetches from the backend and renders the complex comparative data cleanly.
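The flattening step in `parse_data.py` can be sketched as follows. The two-column `(token, BIO tag)` layout here is a simplification for illustration — the real PropBank files carry more columns per token:

```python
import json

# Hypothetical CoNLL-style rows for one sentence: (token, BIO-encoded SRL tag).
rows = [
    ("Company", "B-ARG0"), ("A", "I-ARG0"),
    ("acquired", "B-V"),
    ("Startup", "B-ARG1"), ("B", "I-ARG1"),
]

def flatten(rows):
    """Collapse token-level BIO tags into one document entry: text + role spans."""
    roles = {}
    tokens = []
    for token, tag in rows:
        tokens.append(token)
        if tag == "O":
            continue
        label = tag.split("-", 1)[1]  # strip the B-/I- prefix
        roles.setdefault(label, []).append(token)
    return {"text": " ".join(tokens),
            "roles": {k: " ".join(v) for k, v in roles.items()}}

doc = flatten(rows)
print(json.dumps(doc, indent=2))
```

Each flattened entry pairs the reconstructed sentence with its role spans, which is the shape the scoring code consumes from `corpus.json`.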
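The per-request flow in `api.py` can be sketched without the web framework. Here `extract_roles` is a stand-in for the real spaCy-based extractor, and the tiny in-memory corpus mirrors the shape loaded from `corpus.json`; the exact-match role overlap is a deliberate simplification of the scoring described above:

```python
corpus = [
    {"text": "Company A acquired Startup B",
     "roles": {"PRED": "acquire", "ARG0": "company a", "ARG1": "startup b"}},
    {"text": "Startup B acquired Company A",
     "roles": {"PRED": "acquire", "ARG0": "startup b", "ARG1": "company a"}},
]

def extract_roles(query: str) -> dict:
    """Placeholder for the spaCy-based extraction of the query's roles."""
    return {"PRED": "acquire", "ARG0": "company a", "ARG1": "startup b"}

def role_overlap(q: dict, d: dict) -> float:
    """Fraction of the query's roles that the document fills identically."""
    matches = [r for r in q if d.get(r) == q[r]]
    return len(matches) / len(q) if q else 0.0

def search(query: str, top_k: int = 5):
    """Extract the query's roles, score every document, return the top results."""
    q_roles = extract_roles(query)
    scored = [(role_overlap(q_roles, doc["roles"]), doc) for doc in corpus]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [{"text": d["text"], "score": s} for s, d in scored[:top_k]]

results = search("Company A acquired Startup B")
print(results[0]["text"])  # the role-consistent document ranks first
```

In the actual server this logic sits behind a FastAPI endpoint, with the corpus loaded once at startup rather than per request.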