# Text-Attributed Network Analysis Documentation

This document explains how the implementation in `assignment_sc_2/code.py` addresses the assignment requirements and grading rubric.

## 1. Objective

The assignment analyzes a network of research papers where:

- each node is a paper with metadata (`id`, `year`, `authors`, `title`, `abstract`),
- each edge represents semantic similarity between two papers,
- edge `weight` indicates tie strength (higher weight = stronger topical similarity).

The code loads `aclbib.graphml`, extracts the Largest Connected Component (LCC), and performs:

- weak/strong tie removal analysis,
- centrality analysis,
- centrality ranking correlation analysis,
- optional temporal topic-shift analysis.

---

## 2. Rubric Coverage Summary

### (Part 2, 30%) Weak/Strong Ties and LCC Dynamics

Covered in `weaktie_analysis(LCC)`:

- ties are ordered by weight to represent weak-to-strong and strong-to-weak removal,
- two experiments are run:
  - removing weakest ties first,
  - removing strongest ties first,
- after each single edge removal, LCC size is recomputed,
- x-axis is fraction of ties removed,
- y-axis is LCC size (number of nodes).

Note: The implementation uses rank-based weak/strong definitions (by sorted weights). If explicit threshold-based counts are required by instructor policy, add a threshold rule (e.g., bottom/top quartile) and print those counts.
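
One way to add explicit counts is a quartile rule. The sketch below is illustrative only: `tie_counts`, the 25th/75th percentile cutoffs, and the sample weights are assumptions and are not taken from `code.py`.

```python
import numpy as np


def tie_counts(weights, lower_q=25, upper_q=75):
    """Count weak and strong ties using percentile cutoffs (hypothetical rule)."""
    w = np.asarray(weights, dtype=float)
    weak_cut, strong_cut = np.percentile(w, [lower_q, upper_q])
    return {
        "weak_threshold": float(weak_cut),
        "strong_threshold": float(strong_cut),
        "weak": int((w <= weak_cut).sum()),      # ties at or below the lower cutoff
        "strong": int((w >= strong_cut).sum()),  # ties at or above the upper cutoff
    }


# Made-up similarity weights, evenly spaced between 0 and 1.
example = [0.0, 0.125, 0.25, 0.375, 0.5, 0.625, 0.75, 0.875, 1.0]
counts = tie_counts(example)
print(counts)
```

On the real graph the weights could be collected with `[w for _, _, w in LCC.edges(data="weight")]`, assuming every edge carries a numeric `weight` attribute.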

### (Part 2, 35%) Centrality + Central Papers + Correlation + Interpretation

Covered in `centrality_analysis(LCC)`:

- computes degree, closeness, and betweenness centrality,
- identifies top 10 papers for each metric,
- outputs entries in `ID<TAB>Title` format,
- converts centrality scores to ranking vectors,
- computes Pearson correlation between metric rankings,
- prints a correlation table,
- identifies the lowest-correlation pair,
- provides interpretation grounded in metric definitions.

### (Part 2, 10%) Report Quality

This markdown report provides:

- clear method descriptions,
- consistent structure by rubric item,
- direct mapping from requirements to implementation,
- interpretation guidance and limitations.

### (Part 2, Optional Extra Credit, 50%) Research Evolution Analysis

Covered in `research_evolution_analysis(G)`:

- splits papers by publication year into two groups: before 2023, and 2023 and later,
- tokenizes title + abstract,
- builds a shared global dictionary (vocabulary),
- trains an LDA model for each group using the same vocabulary,
- obtains comparable topic-term matrices:
  - `D` for pre-2023,
  - `S` for 2023 and later,
- computes topic shift using cosine similarity,
- ranks potentially disappearing and emerging themes,
- prints top words for contextual interpretation.

---

## 3. Detailed Methodology

## 3.1 Data Loading and LCC Extraction

1. Load graph from `aclbib.graphml`.
2. Extract the largest connected component:
   - this ensures path-based metrics (closeness, betweenness) are meaningful and comparable.
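
A minimal sketch of the LCC extraction step with NetworkX follows. The helper name `largest_component` and the toy graph are illustrative assumptions; the actual code in `code.py` may be organized differently.

```python
import networkx as nx


def largest_component(G: nx.Graph) -> nx.Graph:
    """Return a copy of the subgraph induced by the largest connected component."""
    # connected_components yields one node set per component; keep the biggest.
    biggest = max(nx.connected_components(G), key=len)
    return G.subgraph(biggest).copy()


# Toy graph standing in for the real network: a 3-node path plus a separate pair.
toy = nx.Graph()
toy.add_weighted_edges_from([("a", "b", 0.9), ("b", "c", 0.4), ("x", "y", 0.7)])

lcc = largest_component(toy)
print(lcc.number_of_nodes(), lcc.number_of_edges())
```

For the real data the input would come from `nx.read_graphml("aclbib.graphml")`. If the file declares directed edges, the graph would need `to_undirected()` first, since `connected_components` only accepts undirected graphs.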

---

## 3.2 Weak vs Strong Tie Analysis

### Definitions

- Weak ties: lower edge weights (lower semantic similarity).
- Strong ties: higher edge weights (higher semantic similarity).

### Procedure

1. Sort edges by weight ascending (`weak -> strong`).
2. Create reversed order (`strong -> weak`).
3. For each removal order:
   - remove one edge at a time,
   - recompute LCC size after each removal,
   - record:
     - fraction removed = removed_edges / total_edges,
     - LCC size = number of nodes in current largest connected component.
4. Plot both removal curves.
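
The removal loop can be sketched as below. This is an assumption-laden illustration, not the code from `weaktie_analysis`: the function name `removal_curve`, its `strongest_first` flag, and the toy graph (two tight triangles joined by one weak edge) are hypothetical.

```python
import networkx as nx


def removal_curve(G: nx.Graph, strongest_first: bool = False):
    """Remove edges one at a time by weight and track the LCC size.

    Returns a list of (fraction_removed, lcc_size) pairs.
    """
    H = G.copy()  # work on a copy so the caller's graph is untouched
    ordered = sorted(
        H.edges(data="weight", default=1.0),
        key=lambda edge: edge[2],
        reverse=strongest_first,
    )
    total = len(ordered)
    curve = []
    for removed, (u, v, _weight) in enumerate(ordered, start=1):
        H.remove_edge(u, v)
        lcc_size = len(max(nx.connected_components(H), key=len))
        curve.append((removed / total, lcc_size))
    return curve


# Two strongly tied triangles connected by a single weak tie (c-d).
toy = nx.Graph()
toy.add_weighted_edges_from([
    ("a", "b", 0.90), ("b", "c", 0.80), ("a", "c", 0.85),
    ("c", "d", 0.10),
    ("d", "e", 0.95), ("e", "f", 0.70), ("d", "f", 0.75),
])

weak_first = removal_curve(toy, strongest_first=False)
strong_first = removal_curve(toy, strongest_first=True)
print(weak_first[0], strong_first[0])
```

Recomputing components after every removal costs one graph traversal per edge, which is simple but grows quickly with edge count. The two returned curves can be passed to any plotting library with the fraction on the x-axis and LCC size on the y-axis.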

### What this shows

- If removing weak ties first rapidly fragments the network, weak ties are acting as bridges.
- If removing strong ties first shrinks the LCC faster, strong ties contribute more to global cohesion.

---

## 3.3 Centrality Analysis

### Metrics

- Degree centrality: local connectivity prominence.
- Closeness centrality: global proximity to all nodes.
- Betweenness centrality: control over shortest-path flow.

### Output

- Top 10 papers for each metric, as `ID<TAB>Title`.
- These lists identify influential papers under different notions of centrality.
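
The three metrics and the `ID<TAB>Title` output can be sketched as follows. The helper `top_papers` and the star-shaped toy graph are illustrative assumptions. The calls below are unweighted, which is also an assumption: this document does not state whether `code.py` passes edge weights to the path-based metrics.

```python
import networkx as nx


def top_papers(G: nx.Graph, scores: dict, k: int = 10):
    """Return the k highest-scoring nodes as 'ID<TAB>Title' strings."""
    ranked = sorted(scores, key=scores.get, reverse=True)[:k]
    return [f"{node}\t{G.nodes[node].get('title', '')}" for node in ranked]


# Toy graph: node 0 is a hub connected to leaves 1..4.
toy = nx.star_graph(4)
nx.set_node_attributes(toy, {n: f"Paper {n}" for n in toy}, "title")

metrics = {
    "degree": nx.degree_centrality(toy),
    "closeness": nx.closeness_centrality(toy),
    "betweenness": nx.betweenness_centrality(toy),
}
for name, scores in metrics.items():
    print(name)
    print("\n".join(top_papers(toy, scores, k=3)))
```

NetworkX treats the `distance` argument of `closeness_centrality` and the `weight` argument of `betweenness_centrality` as path lengths. Because `weight` here encodes similarity, passing it unchanged would make the most similar papers the farthest apart, so a conversion such as `1 / weight` would be needed before using weighted variants.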

---

## 3.4 Correlation Between Centrality Rankings

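The assignment requests correlation between rankings, not raw centrality values. Pearson correlation computed on rank vectors is Spearman's rank correlation of the underlying scores, provided tied scores receive averaged ranks.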

### Procedure

1. Convert each metric score map into a rank vector (rank 1 = highest centrality).
2. Compute Pearson correlation for each pair:
   - Degree vs Closeness,
   - Degree vs Betweenness,
   - Closeness vs Betweenness.
3. Build and print correlation table.
4. Find lowest-correlation pair and print interpretation.
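
The steps above can be sketched with NumPy. The helper `rank_vector`, the average-rank treatment of ties, and the toy scores are assumptions made for illustration; the tie handling in `code.py` is not described in this document.

```python
from itertools import combinations

import numpy as np


def rank_vector(scores: dict, nodes: list) -> np.ndarray:
    """Rank nodes by score (rank 1 = highest); tied scores share their average rank."""
    values = np.array([scores[n] for n in nodes], dtype=float)
    order = np.argsort(-values, kind="stable")
    ranks = np.empty(len(values), dtype=float)
    ranks[order] = np.arange(1, len(values) + 1)
    for value in np.unique(values):
        tied = values == value
        ranks[tied] = ranks[tied].mean()
    return ranks


# Made-up centrality scores for four nodes.
nodes = ["a", "b", "c", "d"]
scores = {
    "degree": {"a": 0.9, "b": 0.6, "c": 0.3, "d": 0.1},
    "closeness": {"a": 0.8, "b": 0.5, "c": 0.7, "d": 0.4},
    "betweenness": {"a": 0.5, "b": 0.0, "c": 0.2, "d": 0.0},
}

ranks = {name: rank_vector(s, nodes) for name, s in scores.items()}
corr = {
    (m1, m2): float(np.corrcoef(ranks[m1], ranks[m2])[0, 1])
    for m1, m2 in combinations(ranks, 2)
}
lowest_pair = min(corr, key=corr.get)
print(lowest_pair, round(corr[lowest_pair], 4))
```

Using one fixed `nodes` list for every metric keeps the rank vectors aligned, so position `i` always refers to the same paper.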

### Interpretation principle

Low correlation occurs when two metrics encode different structural roles, e.g.:

- local popularity (degree) vs bridge control (betweenness),
- global distance efficiency (closeness) vs brokerage roles (betweenness).

---

## 3.5 Optional Extra Credit: Research Evolution

### Goal

Trace thematic shifts in research trends before and after 2023.

### Procedure

1. Split nodes by publication year:
   - before 2023,
   - 2023 and later.
2. Build documents from title + abstract.
3. Tokenize and clean text.
4. Create one shared vocabulary dictionary for both groups.
5. Train two LDA models (same vocabulary, separate corpora).
6. Extract topic-term matrices:
   - `D` (pre-2023),
   - `S` (2023 and later).
7. Compute shift score for each topic:
   - shift = `1 - max cosine similarity` to any topic in opposite period.
8. Rank:
   - pre-2023 topics with highest shift (potentially disappearing),
   - post-2023 topics with highest shift (potentially emerging).
9. Print top words for each ranked topic.
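
The shift score in step 7 can be sketched with NumPy once the two topic-term matrices exist. The function name `topic_shift` and the tiny 3-word matrices are illustrative assumptions; only the names `D` and `S` and the formula come from this document.

```python
import numpy as np


def topic_shift(A: np.ndarray, B: np.ndarray) -> np.ndarray:
    """For each row (topic) of A, return 1 - max cosine similarity to any row of B."""
    A_unit = A / np.linalg.norm(A, axis=1, keepdims=True)
    B_unit = B / np.linalg.norm(B, axis=1, keepdims=True)
    cosine = A_unit @ B_unit.T  # shape: (topics in A, topics in B)
    return 1.0 - cosine.max(axis=1)


# Toy topic-term matrices over a shared 3-word vocabulary (rows are topics).
D = np.array([[0.8, 0.1, 0.1],
              [0.1, 0.8, 0.1]])  # earlier period
S = np.array([[0.8, 0.1, 0.1],
              [0.1, 0.1, 0.8]])  # later period

shift_D = topic_shift(D, S)  # high values: candidates for disappearing themes
shift_S = topic_shift(S, D)  # high values: candidates for emerging themes

disappearing_order = np.argsort(-shift_D)
emerging_order = np.argsort(-shift_S)
print(shift_D, shift_S)
```

In this toy case the first topic is identical in both periods, so its shift is about zero, while the second topic in each period has no close counterpart and ranks first.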

### Why this is valid

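- A shared vocabulary gives `D` and `S` the same columns (column `j` is the same word in both), so topic rows are directly comparable.
- Cosine similarity measures the overlap between two topic-word distributions regardless of vector length.
- Topic indices from two separately trained LDA models are not aligned, so each topic is matched to its most similar counterpart before the shift is scored.
- Ranking by shift provides interpretable emergence/disappearance candidates.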

---

## 4. Observed Results from Current Run

The following results were generated by running:

`python /home/mshahidul/readctrl/assignment_sc_2/code.py`

### 4.1 Network and LCC Summary

- LCC contains `1662` nodes and `26134` edges, a mean degree of about 31.4.
- All later analyses run on this single connected component, so every pair of nodes is joined by a path and closeness and betweenness are well defined.

### 4.2 Centrality Correlation Results

Pearson correlation between centrality rankings:

| Metric | Degree | Closeness | Betweenness |
|---|---:|---:|---:|
| Degree | 1.0000 | 0.9361 | 0.8114 |
| Closeness | 0.9361 | 1.0000 | 0.7684 |
| Betweenness | 0.8114 | 0.7684 | 1.0000 |

- Lowest-correlation pair: **Closeness vs Betweenness** (`r = 0.7684`).
- Interpretation: closeness captures global proximity, while betweenness captures shortest-path brokerage; these are related but not identical structural roles.

### 4.3 Central Papers (Top-10) Highlights

Across Degree, Closeness, and Betweenness top-10 lists, several papers repeatedly appear, including:

- `ahuja-etal-2023-mega` (`{MEGA}: Multilingual Evaluation of Generative {AI}`),
- `ding-etal-2020-discriminatively`,
- `shin-etal-2020-autoprompt`,
- `weller-etal-2020-learning`,
- `qin-etal-2023-chatgpt`.

This overlap suggests robust influence of these papers across local connectivity, global accessibility, and bridge-like structural importance.

### 4.4 Optional Topic Evolution Results

Topic matrices:

- `D` (before 2023): shape `(5, 5000)`, i.e. 5 topics over a 5000-term vocabulary
- `S` (2023 and later): shape `(5, 5000)`

Top potentially disappearing theme example:

- Before Topic 4, shift `0.1912`, keywords:
  `question, knowledge, event, performance, questions, task, graph, can`

Top potentially emerging theme example:

- After Topic 2, shift `0.1989`, keywords:
  `llms, large, data, tasks, knowledge, reasoning, generation, performance`

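Interpretation: topics from 2023 onward show stronger emphasis on **LLMs**, reasoning, and generation-centered trends. Both shift scores are small: a shift of about 0.19 to 0.20 means the best-matching topic in the other period still has a cosine similarity of about 0.80 to 0.81, which suggests a change in emphasis rather than wholly new or vanished themes.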

---

## 5. Limitations and Practical Notes

- Weak/strong tie counts are currently implicit via sorted order; explicit threshold-based counts can be added if required.
- Topic modeling quality depends on preprocessing and corpus size.
- The final report should tie the output topics and central papers to real NLP/AI trends; doing so strengthens the interpretation for grading.

---