	# Text-Attributed Network Analysis Documentation

	This document explains how the implementation in `assignment_sc_2/code.py` addresses the assignment requirements and grading rubric.

	## 1. Objective

	The assignment analyzes a network of research papers where:

	- each node is a paper with metadata (`id`, `year`, `authors`, `title`, `abstract`),
	- each edge represents semantic similarity between two papers,
	- edge `weight` indicates tie strength (higher weight = stronger topical similarity).

	The code loads `aclbib.graphml`, extracts the Largest Connected Component (LCC), and performs:

	- weak/strong tie removal analysis,
	- centrality analysis,
	- centrality ranking correlation analysis,
	- optional temporal topic-shift analysis.

	---

	## 2. Rubric Coverage Summary

	### (Part 2, 30%) Weak/Strong Ties and LCC Dynamics

	Covered in `weaktie_analysis(LCC)`:

	- ties are ordered by weight to represent weak-to-strong and strong-to-weak removal,
	- two experiments are run:
	  - removing weakest ties first,
	  - removing strongest ties first,
	- after each single edge removal, LCC size is recomputed,
	- x-axis is fraction of ties removed,
	- y-axis is LCC size (number of nodes).

	Note: The implementation uses rank-based weak/strong definitions (by sorted weights). If explicit threshold-based counts are required by instructor policy, add a threshold rule (e.g., bottom/top quartile) and print those counts.
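The threshold rule mentioned in the note could look like the following sketch. It assumes a quartile cutoff (weights at or below the first quartile count as weak, weights at or above the third quartile count as strong). The function name `count_weak_strong` and the toy weights are hypothetical and are not taken from `code.py`.

```python
from statistics import quantiles


def count_weak_strong(weights):
    """Count weak (<= Q1) and strong (>= Q3) ties under a quartile rule."""
    # method="inclusive" treats the data as the full population of edges.
    q1, _, q3 = quantiles(weights, n=4, method="inclusive")
    weak = sum(1 for w in weights if w <= q1)
    strong = sum(1 for w in weights if w >= q3)
    return {"q1": q1, "q3": q3, "weak": weak, "strong": strong}


# Hypothetical edge weights, for illustration only.
weights = [0.125, 0.25, 0.375, 0.5, 0.5, 0.625, 0.75, 0.875, 1.0]
summary = count_weak_strong(weights)
print(summary)
```

Any other cutoff (a fixed weight, a different percentile) would slot into the same two comparisons.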

	### (Part 2, 35%) Centrality + Central Papers + Correlation + Interpretation

	Covered in `centrality_analysis(LCC)`:

	- computes degree, closeness, and betweenness centrality,
	- identifies top 10 papers for each metric,
	- outputs entries in `ID<TAB>Title` format,
	- converts centrality scores to ranking vectors,
	- computes Pearson correlation between metric rankings,
	- prints a correlation table,
	- identifies the lowest-correlation pair,
	- provides interpretation grounded in metric definitions.

	### (Part 2, 10%) Report Quality

	This markdown report provides:

	- clear method descriptions,
	- consistent structure by rubric item,
	- direct mapping from requirements to implementation,
	- interpretation guidance and limitations.

	### (Part 2, Optional Extra Credit, 50%) Research Evolution Analysis

	Covered in `research_evolution_analysis(G)`:

	- splits papers into a pre-2023 group and a 2023-and-later group,
	- tokenizes title + abstract,
	- builds a shared global dictionary (vocabulary),
	- trains LDA models for both groups using same vocabulary,
	- obtains comparable topic-term matrices:
	  - `D` for pre-2023,
	  - `S` for post-2023,
	- computes topic shift using cosine similarity,
	- ranks potentially disappearing and emerging themes,
	- prints top words for contextual interpretation.

	---

	## 3. Detailed Methodology

	## 3.1 Data Loading and LCC Extraction

	1. Load graph from `aclbib.graphml`.
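	2. Extract the largest connected component:
	   - restricting the analysis to one connected component guarantees a finite shortest path between every pair of nodes, so path-based metrics (closeness, betweenness) are well defined and comparable.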
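As a rough illustration of the extraction step, the sketch below finds the largest connected component with a breadth-first search. It assumes the graph is already held as a plain dict-of-sets adjacency structure; GraphML parsing is left out, and the helper names and toy graph are hypothetical rather than taken from `code.py`.

```python
from collections import deque


def connected_components(adj):
    """Yield each connected component of an undirected graph as a set of nodes."""
    seen = set()
    for start in adj:
        if start in seen:
            continue
        component = {start}
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for neighbor in adj[node]:
                if neighbor not in component:
                    component.add(neighbor)
                    queue.append(neighbor)
        seen |= component
        yield component


def largest_component(adj):
    """Return the node set of the largest connected component."""
    return max(connected_components(adj), key=len)


# Toy undirected graph: one 4-node component and one 2-node component.
adj = {
    "a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"}, "d": {"c"},
    "x": {"y"}, "y": {"x"},
}
lcc_nodes = largest_component(adj)
# Induced subgraph: keep only LCC nodes and the edges among them.
lcc = {node: adj[node] & lcc_nodes for node in lcc_nodes}
print(sorted(lcc_nodes))
```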

	---

	## 3.2 Weak vs Strong Tie Analysis

	### Definitions

	- Weak ties: lower edge weights (lower semantic similarity).
	- Strong ties: higher edge weights (higher semantic similarity).

	### Procedure

	1. Sort edges by weight ascending (`weak -> strong`).
	2. Create reversed order (`strong -> weak`).
	3. For each removal order:
	   - remove one edge at a time,
	   - recompute LCC size after each removal,
	   - record:
	     - fraction removed = removed_edges / total_edges,
	     - LCC size = number of nodes in current largest connected component.
	4. Plot both removal curves.
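The procedure above can be sketched without any graph library. This is a minimal version under stated assumptions: edges are `(u, v, weight)` tuples, the LCC size is recomputed from scratch after every removal, and plotting is omitted. The names `lcc_size` and `removal_curve` and the toy graph are hypothetical.

```python
from collections import deque


def lcc_size(nodes, edges):
    """Number of nodes in the largest connected component for an edge list."""
    adj = {n: set() for n in nodes}
    for u, v, _ in edges:
        adj[u].add(v)
        adj[v].add(u)
    best, seen = 0, set()
    for start in nodes:
        if start in seen:
            continue
        seen.add(start)
        queue, size = deque([start]), 1
        while queue:
            for nb in adj[queue.popleft()]:
                if nb not in seen:
                    seen.add(nb)
                    queue.append(nb)
                    size += 1
        best = max(best, size)
    return best


def removal_curve(nodes, edges, strongest_first=False):
    """Return (fraction_removed, lcc_size) pairs, one per single-edge removal."""
    remaining = sorted(edges, key=lambda e: e[2], reverse=strongest_first)
    total = len(remaining)
    curve = []
    for removed in range(1, total + 1):
        remaining.pop(0)  # drop the next edge in the chosen order
        curve.append((removed / total, lcc_size(nodes, remaining)))
    return curve


# Toy graph: two tightly linked triangles joined by one weak bridge (c-d).
nodes = ["a", "b", "c", "d", "e", "f"]
edges = [
    ("a", "b", 0.90), ("b", "c", 0.80), ("a", "c", 0.70),
    ("d", "e", 0.95), ("e", "f", 0.85), ("d", "f", 0.75),
    ("c", "d", 0.10),
]
weak_first = removal_curve(nodes, edges)
strong_first = removal_curve(nodes, edges, strongest_first=True)
print([size for _, size in weak_first])
print([size for _, size in strong_first])
```

In this toy graph the weak edge is the only bridge, so removing weak ties first halves the LCC immediately, while removing strong ties first erodes it gradually. Recomputing the LCC after every removal costs one full traversal per edge, which is worth keeping in mind for larger graphs.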

	### What this shows

	- If removing weak ties first rapidly fragments the network, weak ties are acting as bridges.
	- If removing strong ties first shrinks the LCC faster, strong ties are more critical to global cohesion.

	---

	## 3.3 Centrality Analysis

	### Metrics

	- Degree centrality: local connectivity prominence.
	- Closeness centrality: global proximity to all nodes.
	- Betweenness centrality: control over shortest-path flow.
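To make the first two definitions concrete, here is a small sketch of degree and closeness centrality on a connected, unweighted graph, plus a helper that formats the top entries as `ID<TAB>Title`. It assumes hop-count distances (whether `code.py` uses edge weights as distances is not stated), and all function names, node IDs, and titles are hypothetical. Betweenness is left out to keep the sketch short; it is typically computed with Brandes' algorithm.

```python
from collections import deque


def degree_centrality(adj):
    """Fraction of the other nodes that each node is directly linked to."""
    n = len(adj)
    return {node: len(nbrs) / (n - 1) for node, nbrs in adj.items()}


def closeness_centrality(adj):
    """(n - 1) divided by the sum of hop distances to every other node.

    Assumes the graph is connected, as it is after LCC extraction.
    """
    n = len(adj)
    scores = {}
    for source in adj:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            node = queue.popleft()
            for nb in adj[node]:
                if nb not in dist:
                    dist[nb] = dist[node] + 1
                    queue.append(nb)
        scores[source] = (n - 1) / sum(dist.values())
    return scores


def top_k(scores, titles, k=10):
    """Format the k highest-scoring nodes as 'ID<TAB>Title' lines."""
    ranked = sorted(scores, key=scores.get, reverse=True)[:k]
    return [f"{node}\t{titles[node]}" for node in ranked]


# Toy path graph p1 - p2 - p3 - p4 - p5 with placeholder titles.
adj = {
    "p1": {"p2"}, "p2": {"p1", "p3"}, "p3": {"p2", "p4"},
    "p4": {"p3", "p5"}, "p5": {"p4"},
}
titles = {node: f"Paper {node[1]}" for node in adj}
degree = degree_centrality(adj)
closeness = closeness_centrality(adj)
print("\n".join(top_k(closeness, titles, k=3)))
```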

	### Output

	- Top 10 papers for each metric, as `ID<TAB>Title`.
	- These lists identify influential papers under different notions of centrality.

	---

	## 3.4 Correlation Between Centrality Rankings

	The assignment requests correlation between rankings, not raw centrality values.

	### Procedure

	1. Convert each metric score map into a rank vector (rank 1 = highest centrality).
	2. Compute the Pearson correlation between the rank vectors for each pair (Pearson correlation on ranks is Spearman's rank correlation when ties receive average ranks):
	   - Degree vs Closeness,
	   - Degree vs Betweenness,
	   - Closeness vs Betweenness.
	3. Build and print the correlation table.
	4. Find the lowest-correlation pair and print an interpretation.
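The steps above can be sketched as follows. This is an illustrative version, not the code in `code.py`: it assumes tied scores share their average rank (the source does not say how ties are handled), and the score dictionaries are made-up toy values.

```python
from itertools import combinations
from math import sqrt


def to_ranks(scores):
    """Map node -> rank, with rank 1 for the highest score; ties share the average rank."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    ranks = {}
    i = 0
    while i < len(ordered):
        j = i
        while j + 1 < len(ordered) and scores[ordered[j + 1]] == scores[ordered[i]]:
            j += 1
        average = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for node in ordered[i:j + 1]:
            ranks[node] = average
        i = j + 1
    return ranks


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(vx * vy)


def rank_correlation(a, b):
    """Pearson correlation between the rank vectors of two score maps."""
    ra, rb = to_ranks(a), to_ranks(b)
    nodes = sorted(ra)
    return pearson([ra[n] for n in nodes], [rb[n] for n in nodes])


# Hypothetical centrality scores for four nodes.
metrics = {
    "degree": {"a": 0.9, "b": 0.7, "c": 0.5, "d": 0.3},
    "closeness": {"a": 0.9, "b": 0.6, "c": 0.7, "d": 0.2},
    "betweenness": {"a": 0.8, "b": 0.1, "c": 0.6, "d": 0.2},
}
table = {
    pair: rank_correlation(metrics[pair[0]], metrics[pair[1]])
    for pair in combinations(metrics, 2)
}
lowest_pair = min(table, key=table.get)
for pair, r in table.items():
    print(f"{pair[0]} vs {pair[1]}: r = {r:.4f}")
print("lowest:", lowest_pair)
```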

	### Interpretation principle

	Low correlation occurs when two metrics encode different structural roles, e.g.:

	- local popularity (degree) vs bridge control (betweenness),
	- global distance efficiency (closeness) vs brokerage roles (betweenness).

	---

	## 3.5 Optional Extra Credit: Research Evolution

	### Goal

	Trace thematic shifts in research trends before and after 2023.

	### Procedure

	1. Split nodes by publication year:
	   - before 2023,
	   - 2023 and later.
	2. Build documents from title + abstract.
	3. Tokenize and clean text.
	4. Create one shared vocabulary dictionary for both groups.
	5. Train two LDA models (same vocabulary, separate corpora).
	6. Extract topic-term matrices:
	   - `D` (pre-2023),
	   - `S` (post-2023).
	7. Compute shift score for each topic:
	   - shift = `1 - max cosine similarity` to any topic in the opposite period.
	8. Rank:
	   - pre-2023 topics with highest shift (potentially disappearing),
	   - post-2023 topics with highest shift (potentially emerging).
	9. Print top words for each ranked topic.
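Steps 7 and 8 can be illustrated with a short sketch. It assumes `D` and `S` are row-per-topic matrices over the same vocabulary columns; the toy matrices below are invented for the example and are much smaller than the real `(5, 5000)` outputs. The function names are hypothetical.

```python
from math import sqrt


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def shift_scores(own, other):
    """For each topic row in `own`: 1 - max cosine similarity to any row in `other`."""
    return [1.0 - max(cosine(topic, cand) for cand in other) for topic in own]


def rank_by_shift(own, other):
    """Return (topic_index, shift) pairs sorted from highest to lowest shift."""
    scored = enumerate(shift_scores(own, other))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy topic-term matrices over a shared 4-term vocabulary.
D = [[0.7, 0.2, 0.1, 0.0],   # pre-2023 topic 0
     [0.0, 0.1, 0.2, 0.7]]   # pre-2023 topic 1
S = [[0.6, 0.3, 0.1, 0.0],   # post-2023 topic 0, close to D[0]
     [0.1, 0.1, 0.7, 0.1]]   # post-2023 topic 1, no close match in D

disappearing = rank_by_shift(D, S)  # pre-2023 topics lacking a later match
emerging = rank_by_shift(S, D)      # post-2023 topics lacking an earlier match
print(disappearing)
print(emerging)
```

Because topic-term weights are non-negative, the cosine similarity lies in [0, 1], so the shift score does too: 0 means an identical topic exists in the other period, and values near 1 mean no similar topic was found.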

	### Why this is valid

	- A shared vocabulary means each column of `D` and `S` refers to the same term, so topic vectors from the two periods can be compared directly.
	- Cosine similarity captures semantic overlap between topic distributions.
	- Ranking by shift provides interpretable emergence/disappearance candidates.
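The first point can be made concrete with a small sketch of a shared vocabulary. The tokenizer here is a deliberately simple stand-in for whatever cleaning `code.py` performs, and the helper names and example titles are hypothetical. LDA training itself is not shown, since the source does not state which library is used.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase and keep alphabetic tokens of length >= 3 (simplified cleaning)."""
    return [tok for tok in re.findall(r"[a-z]+", text.lower()) if len(tok) >= 3]


def build_vocabulary(*corpora):
    """Assign one column index per distinct token across all corpora."""
    tokens = sorted({tok for corpus in corpora for doc in corpus for tok in doc})
    return {tok: index for index, tok in enumerate(tokens)}


def to_bow(doc, vocab):
    """Dense bag-of-words count vector aligned with the shared vocabulary."""
    counts = Counter(doc)
    return [counts.get(tok, 0) for tok in vocab]


# Hypothetical documents (title text only) for each period.
before_docs = [tokenize("Event extraction with knowledge graphs"),
               tokenize("Question answering over graphs")]
after_docs = [tokenize("Large language models for reasoning")]

# One vocabulary built from BOTH periods, then reused for each corpus.
vocab = build_vocabulary(before_docs, after_docs)
before_bow = [to_bow(doc, vocab) for doc in before_docs]
after_bow = [to_bow(doc, vocab) for doc in after_docs]
print(len(vocab), before_bow[0], after_bow[0])
```

Since both corpora are encoded against the same `vocab`, column `vocab["graphs"]` means the same term in every vector, which is what allows topic rows from separately trained models to be compared.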

	---

	## 4. Observed Results from Current Run

	The following results were generated by running:

	`python /home/mshahidul/readctrl/assignment_sc_2/code.py`

	### 4.1 Network and LCC Summary

	- LCC contains `1662` nodes and `26134` edges.
	- The analysis therefore runs on a large connected core, which suits the centrality and connectivity experiments.

	### 4.2 Centrality Correlation Results

	Pearson correlation between centrality rankings:

	| Metric | Degree | Closeness | Betweenness |
	|---|---:|---:|---:|
	| Degree | 1.0000 | 0.9361 | 0.8114 |
	| Closeness | 0.9361 | 1.0000 | 0.7684 |
	| Betweenness | 0.8114 | 0.7684 | 1.0000 |

	- Lowest-correlation pair: Closeness vs Betweenness (`r = 0.7684`).
	- Interpretation: closeness captures global proximity, while betweenness captures shortest-path brokerage; these are related but not identical structural roles.

	### 4.3 Central Papers (Top-10) Highlights

	Across Degree, Closeness, and Betweenness top-10 lists, several papers repeatedly appear, including:

	- `ahuja-etal-2023-mega` (`{MEGA}: Multilingual Evaluation of Generative {AI}`),
	- `ding-etal-2020-discriminatively`,
	- `shin-etal-2020-autoprompt`,
	- `weller-etal-2020-learning`,
	- `qin-etal-2023-chatgpt`.

	This overlap suggests robust influence of these papers across local connectivity, global accessibility, and bridge-like structural importance.

	### 4.4 Optional Topic Evolution Results

	Topic matrices:

	- `D` (before 2023): shape `(5, 5000)`
	- `S` (2023 and later): shape `(5, 5000)`

	Top potentially disappearing theme example:

	- Before Topic 4, shift `0.1912`, keywords: `question, knowledge, event, performance, questions, task, graph, can`

	Top potentially emerging theme example:

	- After Topic 2, shift `0.1989`, keywords: `llms, large, data, tasks, knowledge, reasoning, generation, performance`

	Interpretation: post-2023 topics show stronger emphasis on LLMs, reasoning, and generation-centered trends.

	---

	## 5. Limitations and Practical Notes

	- Weak/strong tie counts are currently implicit via sorted order; explicit threshold-based counts can be added if required.
	- Topic modeling quality depends on preprocessing and corpus size.
	- For stronger grading, the final report should connect the output topics and central papers to real NLP/AI trends.

	---