# Text-Attributed Network Analysis Documentation
This document explains how the implementation in `assignment_sc_2/code.py` addresses the assignment requirements and grading rubric.
## 1. Objective
The assignment analyzes a network of research papers where:
- each node is a paper with metadata (`id`, `year`, `authors`, `title`, `abstract`),
- each edge represents semantic similarity between two papers,
- edge `weight` indicates tie strength (higher weight = stronger topical similarity).
The code loads `aclbib.graphml`, extracts the Largest Connected Component (LCC), and performs:
- weak/strong tie removal analysis,
- centrality analysis,
- centrality ranking correlation analysis,
- optional temporal topic-shift analysis.
---
## 2. Rubric Coverage Summary
### (Part 2, 30%) Weak/Strong Ties and LCC Dynamics
Covered in `weaktie_analysis(LCC)`:
- ties are ordered by weight to represent weak-to-strong and strong-to-weak removal,
- two experiments are run:
  - removing weakest ties first,
  - removing strongest ties first,
- after each single edge removal, LCC size is recomputed,
- x-axis is fraction of ties removed,
- y-axis is LCC size (number of nodes).
Note: The implementation uses rank-based weak/strong definitions (by sorted weights). If explicit threshold-based counts are required by instructor policy, add a threshold rule (e.g., bottom/top quartile) and print those counts.
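If explicit counts are needed, a quartile rule is one possible threshold. The sketch below is only an illustration: the function name `count_weak_strong`, the quartile cut-offs, and the toy weights are assumptions and are not taken from `code.py`.

```python
from statistics import quantiles

def count_weak_strong(weights):
    """Count ties at or below the first quartile (weak) and at or above the third (strong)."""
    q1, _median, q3 = quantiles(weights, n=4)  # the three quartile cut points
    weak = sum(1 for w in weights if w <= q1)
    strong = sum(1 for w in weights if w >= q3)
    return weak, strong, (q1, q3)

# Hypothetical values standing in for the edge `weight` attribute.
toy_weights = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]
weak, strong, (q1, q3) = count_weak_strong(toy_weights)
print(f"weak ties: {weak}, strong ties: {strong}")
```

Any other cut-off (for example a fixed weight value) could be substituted by changing how `q1` and `q3` are chosen.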
### (Part 2, 35%) Centrality + Central Papers + Correlation + Interpretation
Covered in `centrality_analysis(LCC)`:
- computes degree, closeness, and betweenness centrality,
- identifies top 10 papers for each metric,
- outputs entries in `ID<TAB>Title` format,
- converts centrality scores to ranking vectors,
- computes Pearson correlation between metric rankings (Pearson applied to ranks is equivalent to Spearman's rank correlation),
- prints a correlation table,
- identifies the lowest-correlation pair,
- provides interpretation grounded in metric definitions.
### (Part 2, 10%) Report Quality
This markdown report provides:
- clear method descriptions,
- consistent structure by rubric item,
- direct mapping from requirements to implementation,
- interpretation guidance and limitations.
### (Part 2, Optional Extra Credit, 50%) Research Evolution Analysis
Covered in `research_evolution_analysis(G)`:
- splits papers into pre-2023 and post-2023 groups (the latter covering 2023 and later, per Section 3.5),
- tokenizes title + abstract,
- builds a shared global dictionary (vocabulary),
- trains LDA models for both groups using the same vocabulary,
- obtains comparable topic-term matrices:
  - `D` for pre-2023,
  - `S` for post-2023,
- computes topic shift using cosine similarity,
- ranks potentially disappearing and emerging themes,
- prints top words for contextual interpretation.
---
## 3. Detailed Methodology
## 3.1 Data Loading and LCC Extraction
1. Load graph from `aclbib.graphml`.
2. Extract the largest connected component:
   - this ensures path-based metrics (closeness, betweenness) are meaningful and comparable.
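
The extraction step can be sketched without any dependencies as follows. This is a minimal illustration under stated assumptions: the graph is represented as a plain adjacency dict, and the helper names `connected_components` and `largest_component` are hypothetical rather than taken from `code.py`. A graph library such as `networkx` offers `connected_components` for the same purpose.

```python
from collections import deque

def connected_components(adj):
    """Yield each connected component of an undirected graph as a set of nodes."""
    seen = set()
    for start in adj:
        if start in seen:
            continue
        component = {start}
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for nbr in adj[node]:
                if nbr not in component:
                    component.add(nbr)
                    queue.append(nbr)
        seen |= component
        yield component

def largest_component(adj):
    """Return the adjacency dict restricted to the largest connected component."""
    keep = max(connected_components(adj), key=len)
    return {n: {m for m in adj[n] if m in keep} for n in keep}

# Toy graph (hypothetical): a 3-node chain plus a separate 2-node pair.
toy = {"a": {"b"}, "b": {"a", "c"}, "c": {"b"}, "x": {"y"}, "y": {"x"}}
lcc = largest_component(toy)
print(sorted(lcc))
```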
---
## 3.2 Weak vs Strong Tie Analysis
### Definitions
- Weak ties: lower edge weights (lower semantic similarity).
- Strong ties: higher edge weights (higher semantic similarity).
### Procedure
1. Sort edges by weight ascending (`weak -> strong`).
2. Create reversed order (`strong -> weak`).
3. For each removal order:
   - remove one edge at a time,
   - recompute LCC size after each removal,
   - record:
     - fraction removed = removed_edges / total_edges,
     - LCC size = number of nodes in current largest connected component.
4. Plot both removal curves.
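
The removal procedure above can be sketched as follows. This is a dependency-free illustration, not the code in `code.py`: the helper names `lcc_size` and `removal_curve`, the `(u, v, weight)` edge tuples, and the toy graph are assumptions, and the sketch also records the baseline point at zero edges removed.

```python
from collections import deque

def lcc_size(nodes, edges):
    """Number of nodes in the largest connected component of (nodes, edges)."""
    adj = {n: set() for n in nodes}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    seen, best = set(), 0
    for start in nodes:
        if start in seen:
            continue
        seen.add(start)
        queue, size = deque([start]), 0
        while queue:
            node = queue.popleft()
            size += 1
            for nbr in adj[node]:
                if nbr not in seen:
                    seen.add(nbr)
                    queue.append(nbr)
        best = max(best, size)
    return best

def removal_curve(nodes, weighted_edges, strongest_first=False):
    """Return (fraction_removed, lcc_size) points for one removal order."""
    order = sorted(weighted_edges, key=lambda e: e[2], reverse=strongest_first)
    total = len(order)
    curve = []
    for removed in range(total + 1):  # removed == 0 is the intact graph
        remaining = [(u, v) for u, v, _w in order[removed:]]
        curve.append((removed / total, lcc_size(nodes, remaining)))
    return curve

# Toy graph (hypothetical): a strongly tied triangle plus one weak bridge to "d".
toy_nodes = ["a", "b", "c", "d"]
toy_edges = [("a", "b", 0.9), ("b", "c", 0.9), ("a", "c", 0.8), ("c", "d", 0.1)]
weak_first = removal_curve(toy_nodes, toy_edges)
strong_first = removal_curve(toy_nodes, toy_edges, strongest_first=True)
print(weak_first)
print(strong_first)
```

Recomputing the component sizes from scratch after every removal is simple but slow on large graphs; it is used here only to keep the sketch short.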
### What this shows
- If removing weak ties first rapidly fragments the network, weak ties are acting as bridges.
- If removing strong ties first shrinks the LCC faster, strong ties are more critical to global cohesion.
---
## 3.3 Centrality Analysis
### Metrics
- Degree centrality: local connectivity prominence.
- Closeness centrality: global proximity to all nodes.
- Betweenness centrality: control over shortest-path flow.
### Output
- Top 10 papers for each metric, as `ID<TAB>Title`.
- These lists identify influential papers under different notions of centrality.
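
To make the three definitions concrete, here is a dependency-free sketch on a toy graph. It assumes unweighted (hop-count) shortest paths, a connected graph, and unnormalized betweenness; the helper names, the `titles` mapping, and the toy data are hypothetical, and `code.py` may compute these metrics differently (for example through a graph library).

```python
from collections import deque

def bfs_distances(adj, source):
    """Hop-count distances from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        v = queue.popleft()
        for w in adj[v]:
            if w not in dist:
                dist[w] = dist[v] + 1
                queue.append(w)
    return dist

def degree_centrality(adj):
    """Fraction of the other nodes each node is directly connected to."""
    n = len(adj)
    return {v: len(adj[v]) / (n - 1) for v in adj}

def closeness_centrality(adj):
    """(n - 1) divided by the sum of distances; assumes a connected graph."""
    n = len(adj)
    return {v: (n - 1) / sum(bfs_distances(adj, v).values()) for v in adj}

def betweenness_centrality(adj):
    """Unnormalized betweenness via Brandes' algorithm on an undirected graph."""
    bc = dict.fromkeys(adj, 0.0)
    for s in adj:
        stack, preds = [], {v: [] for v in adj}
        sigma = dict.fromkeys(adj, 0.0)  # number of shortest paths from s
        sigma[s] = 1.0
        dist = {s: 0}
        queue = deque([s])
        while queue:
            v = queue.popleft()
            stack.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = dict.fromkeys(adj, 0.0)
        while stack:  # accumulate dependencies in order of decreasing distance
            w = stack.pop()
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                bc[w] += delta[w]
    return {v: b / 2 for v, b in bc.items()}  # each pair was counted twice

def top_k(scores, titles, k=10):
    """Format the k highest-scoring nodes as ID<TAB>Title lines."""
    best = sorted(scores, key=scores.get, reverse=True)[:k]
    return [f"{node}\t{titles[node]}" for node in best]

# Toy graph (hypothetical): triangle a-b-c with a tail c-d-e.
toy = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"},
       "d": {"c", "e"}, "e": {"d"}}
titles = {n: f"Paper {n.upper()}" for n in toy}
degree = degree_centrality(toy)
closeness = closeness_centrality(toy)
betweenness = betweenness_centrality(toy)
print("\n".join(top_k(betweenness, titles, k=2)))
```

In this toy graph, node `c` leads all three metrics, while `d` has only average degree but the second-highest betweenness because every path to `e` passes through it.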
---
## 3.4 Correlation Between Centrality Rankings
The assignment requests correlation between rankings, not raw centrality values.
### Procedure
1. Convert each metric score map into rank vector (rank 1 = highest centrality).
2. Compute Pearson correlation for each pair:
   - Degree vs Closeness,
   - Degree vs Betweenness,
   - Closeness vs Betweenness.
3. Build and print correlation table.
4. Find lowest-correlation pair and print interpretation.
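
A minimal sketch of these steps is shown below. The helper names `to_ranks` and `pearson`, the average-rank treatment of ties, and the toy score values are assumptions made for illustration; the documentation does not say how `code.py` handles tied scores.

```python
from itertools import combinations
from math import sqrt

def to_ranks(scores):
    """Map node -> rank (1 = highest score); tied scores share their average rank."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    ranks, i = {}, 0
    while i < len(ordered):
        j = i
        while j + 1 < len(ordered) and scores[ordered[j + 1]] == scores[ordered[i]]:
            j += 1
        for node in ordered[i:j + 1]:
            ranks[node] = (i + j) / 2 + 1  # average of positions i+1 .. j+1
        i = j + 1
    return ranks

def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (undefined if one is constant)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical centrality scores for five nodes.
metrics = {
    "degree": {"a": 0.5, "b": 0.5, "c": 0.75, "d": 0.5, "e": 0.25},
    "closeness": {"a": 4 / 7, "b": 4 / 7, "c": 0.8, "d": 4 / 6, "e": 4 / 9},
    "betweenness": {"a": 0.0, "b": 0.0, "c": 4.0, "d": 3.0, "e": 0.0},
}
nodes = sorted(metrics["degree"])
ranks = {name: to_ranks(scores) for name, scores in metrics.items()}

# Correlate the rank vectors for every pair of metrics.
table = {
    (m1, m2): pearson([ranks[m1][n] for n in nodes], [ranks[m2][n] for n in nodes])
    for m1, m2 in combinations(metrics, 2)
}
lowest_pair = min(table, key=table.get)
for pair, r in table.items():
    print(f"{pair[0]} vs {pair[1]}: r = {r:.4f}")
print("lowest:", lowest_pair)
```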
### Interpretation principle
Low correlation occurs when two metrics encode different structural roles, e.g.:
- local popularity (degree) vs bridge control (betweenness),
- global distance efficiency (closeness) vs brokerage roles (betweenness).
---
## 3.5 Optional Extra Credit: Research Evolution
### Goal
Trace thematic shifts in research trends before and after 2023.
### Procedure
1. Split nodes by publication year:
   - before 2023,
   - 2023 and later.
2. Build documents from title + abstract.
3. Tokenize and clean text.
4. Create one shared vocabulary dictionary for both groups.
5. Train two LDA models (same vocabulary, separate corpora).
6. Extract topic-term matrices:
   - `D` (pre-2023),
   - `S` (post-2023).
7. Compute shift score for each topic:
   - shift = `1 - max cosine similarity` to any topic in opposite period.
8. Rank:
   - pre-2023 topics with highest shift (potentially disappearing),
   - post-2023 topics with highest shift (potentially emerging).
9. Print top words for each ranked topic.
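
Steps 7 and 8 can be sketched as follows, assuming `D` and `S` are already available as lists of topic rows over the same vocabulary. The LDA training itself is omitted, and the helper names `cosine` and `shift_scores` as well as the tiny toy matrices are hypothetical.

```python
from math import sqrt

def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))

def shift_scores(own, other):
    """For each topic in `own`, 1 minus its best cosine match among `other`."""
    return [1.0 - max(cosine(topic, cand) for cand in other) for topic in own]

# Toy topic-term matrices over a 3-word vocabulary (hypothetical values).
D = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]  # pre-2023 topics
S = [[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]  # post-2023 topics

disappearing = shift_scores(D, S)  # high shift: no close match after 2023
emerging = shift_scores(S, D)      # high shift: no close match before 2023

# Rank topic indices from highest to lowest shift.
rank_disappearing = sorted(range(len(D)), key=lambda i: disappearing[i], reverse=True)
rank_emerging = sorted(range(len(S)), key=lambda i: emerging[i], reverse=True)
print(disappearing, rank_disappearing)
print(emerging, rank_emerging)
```

Taking the maximum over all topics in the other period matters because the two LDA models are trained separately, so topic indices do not line up between `D` and `S`.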
### Why this is valid
- Shared vocabulary ensures `D` and `S` are directly comparable.
- Cosine similarity captures semantic overlap between topic distributions.
- Ranking by shift provides interpretable emergence/disappearance candidates.
---
## 4. Observed Results from Current Run
The following results were generated by running:
`python /home/mshahidul/readctrl/assignment_sc_2/code.py`
### 4.1 Network and LCC Summary
- LCC contains `1662` nodes and `26134` edges.
- All subsequent analyses run on this large connected core, so the path-based centralities and the connectivity experiments are well defined.
### 4.2 Centrality Correlation Results
Pearson correlation between centrality rankings:
| Metric | Degree | Closeness | Betweenness |
|---|---:|---:|---:|
| Degree | 1.0000 | 0.9361 | 0.8114 |
| Closeness | 0.9361 | 1.0000 | 0.7684 |
| Betweenness | 0.8114 | 0.7684 | 1.0000 |
- Lowest-correlation pair: **Closeness vs Betweenness** (`r = 0.7684`).
- Interpretation: closeness captures global proximity, while betweenness captures shortest-path brokerage; these are related but not identical structural roles.
### 4.3 Central Papers (Top-10) Highlights
Across Degree, Closeness, and Betweenness top-10 lists, several papers repeatedly appear, including:
- `ahuja-etal-2023-mega` (`{MEGA}: Multilingual Evaluation of Generative {AI}`),
- `ding-etal-2020-discriminatively`,
- `shin-etal-2020-autoprompt`,
- `weller-etal-2020-learning`,
- `qin-etal-2023-chatgpt`.
This overlap suggests robust influence of these papers across local connectivity, global accessibility, and bridge-like structural importance.
### 4.4 Optional Topic Evolution Results
Topic matrices:
- `D` (before 2023): shape `(5, 5000)`
- `S` (after 2023): shape `(5, 5000)`
Top potentially disappearing theme example:
- Before Topic 4, shift `0.1912`, keywords:
`question, knowledge, event, performance, questions, task, graph, can`
Top potentially emerging theme example:
- After Topic 2, shift `0.1989`, keywords:
`llms, large, data, tasks, knowledge, reasoning, generation, performance`
Interpretation: post-2023 topics show stronger emphasis on **LLMs**, reasoning, and generation-centered trends.
---
## 5. Limitations and Practical Notes
- Weak/strong tie counts are currently implicit via sorted order; explicit threshold-based counts can be added if required.
- Topic modeling quality depends on preprocessing and corpus size.
- For stronger grading, the interpretation in the final report should connect the output topics and central papers to real NLP/AI trends.
---