| # Text-Attributed Network Analysis Documentation |
|
|
| This document explains how the implementation in `assignment_sc_2/code.py` addresses the assignment requirements and grading rubric. |
|
|
| ## 1. Objective |
|
|
| The assignment analyzes a network of research papers where: |
|
|
| - each node is a paper with metadata (`id`, `year`, `authors`, `title`, `abstract`), |
| - each edge represents semantic similarity between two papers, |
| - edge `weight` indicates tie strength (higher weight = stronger topical similarity). |
|
|
| The code loads `aclbib.graphml`, extracts the Largest Connected Component (LCC), and performs: |
|
|
| - weak/strong tie removal analysis, |
| - centrality analysis, |
| - centrality ranking correlation analysis, |
| - optional temporal topic-shift analysis. |
|
|
| --- |
|
|
| ## 2. Rubric Coverage Summary |
|
|
| ### (Part 2, 30%) Weak/Strong Ties and LCC Dynamics |
|
|
| Covered in `weaktie_analysis(LCC)`: |
|
|
| - ties are ordered by weight to represent weak-to-strong and strong-to-weak removal, |
| - two experiments are run: |
|   - removing weakest ties first, |
|   - removing strongest ties first, |
| - after each single edge removal, LCC size is recomputed, |
| - x-axis is fraction of ties removed, |
| - y-axis is LCC size (number of nodes). |
|
|
| Note: The implementation uses rank-based weak/strong definitions (by sorted weights). If explicit threshold-based counts are required by instructor policy, add a threshold rule (e.g., bottom/top quartile) and print those counts. |
|
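The threshold rule mentioned in the note could look like the sketch below. It is a hypothetical helper, not part of `code.py`: the name `count_ties_by_quartile` and the choice of quartile cut points (weak = at or below Q1, strong = at or above Q3) are assumptions made for illustration.

```python
from statistics import quantiles


def count_ties_by_quartile(weights):
    """Count weak, middle, and strong ties using quartile cut points.

    Assumption: "weak" means weight <= Q1 and "strong" means weight >= Q3.
    """
    q1, _, q3 = quantiles(weights, n=4)  # three cut points: Q1, median, Q3
    return {
        "q1": q1,
        "q3": q3,
        "weak": sum(1 for w in weights if w <= q1),
        "middle": sum(1 for w in weights if q1 < w < q3),
        "strong": sum(1 for w in weights if w >= q3),
    }


# Made-up edge weights; in practice, collect them from the LCC edges.
example_weights = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]
counts = count_ties_by_quartile(example_weights)
print(f"weak={counts['weak']} middle={counts['middle']} strong={counts['strong']}")
```

If many edges share the same weight, the weak and strong groups can overlap at the cut points, so the three counts need not sum to the number of edges.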
|
| ### (Part 2, 35%) Centrality + Central Papers + Correlation + Interpretation |
|
|
| Covered in `centrality_analysis(LCC)`: |
|
|
| - computes degree, closeness, and betweenness centrality, |
| - identifies top 10 papers for each metric, |
| - outputs entries in `ID<TAB>Title` format, |
| - converts centrality scores to ranking vectors, |
| - computes Pearson correlation between metric rankings, |
| - prints a correlation table, |
| - identifies the lowest-correlation pair, |
| - provides interpretation grounded in metric definitions. |
|
|
| ### (Part 2, 10%) Report Quality |
|
|
| This markdown report provides: |
|
|
| - clear method descriptions, |
| - consistent structure by rubric item, |
| - direct mapping from requirements to implementation, |
| - interpretation guidance and limitations. |
|
|
| ### (Part 2, Optional Extra Credit, 50%) Research Evolution Analysis |
|
|
| Covered in `research_evolution_analysis(G)`: |
|
|
| - splits papers into pre-2023 and 2023-onward groups, |
| - tokenizes title + abstract, |
| - builds a shared global dictionary (vocabulary), |
| - trains LDA models for both groups using same vocabulary, |
| - obtains comparable topic-term matrices: |
|   - `D` for pre-2023, |
|   - `S` for post-2023, |
| - computes topic shift using cosine similarity, |
| - ranks potentially disappearing and emerging themes, |
| - prints top words for contextual interpretation. |
|
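As a minimal sketch of the year split, assuming the "before 2023" / "2023 and later" boundary given in the procedure of Section 3.5: the helper name `split_by_year` and the decision to skip papers with an unparseable year are illustrative choices, not taken from `code.py`.

```python
def split_by_year(papers, cutoff=2023):
    """Split {paper_id: attrs} into (before_cutoff, cutoff_and_later)."""
    before, later = {}, {}
    for paper_id, attrs in papers.items():
        try:
            # GraphML attributes may arrive as strings, so convert explicitly.
            year = int(attrs.get("year", ""))
        except (TypeError, ValueError):
            continue  # assumption: papers with a missing or malformed year are skipped
        (before if year < cutoff else later)[paper_id] = attrs
    return before, later


# Made-up records for illustration only.
papers = {
    "p1": {"year": "2021", "title": "Older paper"},
    "p2": {"year": 2023, "title": "Boundary paper"},
    "p3": {"year": "n/a", "title": "Unknown year"},
    "p4": {"year": "2024", "title": "Recent paper"},
}
before, later = split_by_year(papers)
print(sorted(before), sorted(later))
```

With a NetworkX graph, `dict(G.nodes(data=True))` produces a mapping in the shape this helper expects.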
|
| --- |
|
|
| ## 3. Detailed Methodology |
|
|
| ## 3.1 Data Loading and LCC Extraction |
|
|
| 1. Load graph from `aclbib.graphml`. |
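| 2. Extract the largest connected component: |
|    - closeness and betweenness are defined over shortest paths, which do not exist between disconnected components, so restricting the analysis to the LCC keeps these metrics well defined and comparable across nodes. |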
|
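The two steps above can be sketched with NetworkX as follows. The helper names `largest_component` and `load_lcc` are hypothetical, and dropping edge direction is an assumption that only matters if the GraphML file declares directed edges.

```python
import networkx as nx


def largest_component(G):
    """Return the largest connected component of G as an independent graph."""
    if G.is_directed():
        # Assumption: similarity ties are symmetric, so direction is dropped.
        G = G.to_undirected()
    nodes = max(nx.connected_components(G), key=len)
    return G.subgraph(nodes).copy()


def load_lcc(path="aclbib.graphml"):
    """Read the GraphML file and keep only its largest connected component."""
    return largest_component(nx.read_graphml(path))


# Toy graph standing in for the real data: one 3-node and one 2-node component.
toy = nx.Graph()
toy.add_weighted_edges_from([("a", "b", 0.9), ("b", "c", 0.4), ("d", "e", 0.7)])
lcc = largest_component(toy)
print(lcc.number_of_nodes(), lcc.number_of_edges())
```

Copying the subgraph matters for the later edge-removal experiments, because a subgraph view cannot be modified independently of the graph it came from.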
|
| --- |
|
|
| ## 3.2 Weak vs Strong Tie Analysis |
|
|
| ### Definitions |
|
|
| - Weak ties: lower edge weights (lower semantic similarity). |
| - Strong ties: higher edge weights (higher semantic similarity). |
|
|
| ### Procedure |
|
|
| 1. Sort edges by weight ascending (`weak -> strong`). |
| 2. Create reversed order (`strong -> weak`). |
| 3. For each removal order: |
|    - remove one edge at a time, |
|    - recompute LCC size after each removal, |
|    - record: |
|      - fraction removed = `removed_edges / total_edges`, |
|      - LCC size = number of nodes in the current largest connected component. |
| 4. Plot both removal curves. |
|
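A minimal sketch of this removal loop, assuming every edge carries a numeric `weight` attribute. The function names are hypothetical, and the actual `weaktie_analysis` may differ in detail.

```python
import networkx as nx


def lcc_size(G):
    """Number of nodes in the largest connected component (0 for an empty graph)."""
    return max((len(c) for c in nx.connected_components(G)), default=0)


def removal_curve(G, strongest_first=False):
    """Remove edges one at a time by weight and track the LCC size.

    Returns a list of (fraction_removed, lcc_size) pairs, one per removal.
    """
    H = G.copy()  # keep the caller's graph intact
    edges = sorted(H.edges(data="weight"), key=lambda e: e[2], reverse=strongest_first)
    total = len(edges)
    curve = []
    for removed, (u, v, _) in enumerate(edges, start=1):
        H.remove_edge(u, v)
        curve.append((removed / total, lcc_size(H)))
    return curve


# Toy graph: two tightly knit triangles joined by a single weak tie (c-d).
toy = nx.Graph()
toy.add_weighted_edges_from([
    ("a", "b", 0.9), ("b", "c", 0.9), ("a", "c", 0.9),
    ("c", "d", 0.1),
    ("d", "e", 0.8), ("e", "f", 0.8), ("d", "f", 0.8),
])
weak_first = removal_curve(toy)
strong_first = removal_curve(toy, strongest_first=True)
print(weak_first[0], strong_first[0])
```

In the toy graph the weak tie is the only bridge, so removing it first halves the LCC while removing a strong tie first leaves it intact. Recomputing components after every single removal is simple but costs one full traversal per edge, which adds up on a graph with tens of thousands of edges.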
|
| ### What this shows |
|
|
| - If removing weak ties first rapidly fragments the network, weak ties are acting as bridges between otherwise separate clusters. |
| - If removing strong ties first shrinks the LCC faster, strong ties are the ones holding the network together. |
|
|
| --- |
|
|
| ## 3.3 Centrality Analysis |
|
|
| ### Metrics |
|
|
| - Degree centrality: local connectivity prominence. |
| - Closeness centrality: global proximity to all nodes. |
| - Betweenness centrality: control over shortest-path flow. |
|
|
| ### Output |
|
|
| - Top 10 papers for each metric, as `ID<TAB>Title`. |
| - These lists identify influential papers under different notions of centrality. |
|
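The metric computation and the `ID<TAB>Title` output can be sketched as below. The helper names are hypothetical, and the sketch computes all three metrics on the unweighted topology, which is an assumption: the document does not say whether `centrality_analysis` uses edge weights as distances.

```python
import networkx as nx


def centrality_scores(G):
    """Compute the three metrics on the unweighted topology (an assumption)."""
    return {
        "Degree": nx.degree_centrality(G),
        "Closeness": nx.closeness_centrality(G),
        "Betweenness": nx.betweenness_centrality(G),
    }


def top_papers(G, scores, k=10):
    """Return the k highest-scoring nodes as 'ID<TAB>Title' strings."""
    ranked = sorted(scores, key=scores.get, reverse=True)[:k]
    return [f"{node}\t{G.nodes[node].get('title', '')}" for node in ranked]


# Toy star graph: node "hub" touches every other node.
toy = nx.Graph()
toy.add_node("hub", title="Hub paper")
for i in range(4):
    toy.add_node(f"leaf{i}", title=f"Leaf paper {i}")
    toy.add_edge("hub", f"leaf{i}")

all_scores = centrality_scores(toy)
for metric, scores in all_scores.items():
    print(metric, top_papers(toy, scores, k=2))
```

On a star graph the hub leads every list, which makes it a convenient sanity check before running the same code on the LCC.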
|
| --- |
|
|
| ## 3.4 Correlation Between Centrality Rankings |
|
|
| The assignment requests correlation between rankings, not raw centrality values. |
|
|
| ### Procedure |
|
|
| 1. Convert each metric's score map into a rank vector (rank 1 = highest centrality). |
| 2. Compute the Pearson correlation for each pair: |
|    - Degree vs Closeness, |
|    - Degree vs Betweenness, |
|    - Closeness vs Betweenness. |
| 3. Build and print the correlation table. |
| 4. Find the lowest-correlation pair and print an interpretation. |
|
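The ranking and correlation steps can be sketched in plain Python as follows. The function names and the example scores are made up, and giving tied scores their average rank is an assumption, since the procedure does not say how ties are handled. With average ranks, the Pearson correlation of the rank vectors coincides with Spearman's rank correlation of the raw scores.

```python
import math
from itertools import combinations


def to_ranks(scores, nodes):
    """Rank nodes so that rank 1 = highest score; tied scores share their average rank."""
    ordered = sorted(nodes, key=lambda n: scores[n], reverse=True)
    ranks = {}
    start = 0
    while start < len(ordered):
        end = start
        while end + 1 < len(ordered) and scores[ordered[end + 1]] == scores[ordered[start]]:
            end += 1
        average = (start + end) / 2 + 1  # mean of the 1-based positions in the tie group
        for node in ordered[start:end + 1]:
            ranks[node] = average
        start = end + 1
    return [ranks[n] for n in nodes]


def pearson(x, y):
    """Plain Pearson correlation of two equal-length numeric sequences."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)  # undefined when either vector is constant


def rank_correlations(metrics):
    """Return {(metric_a, metric_b): r} for every pair of metrics."""
    nodes = sorted(next(iter(metrics.values())))  # fixed node order for all vectors
    ranks = {name: to_ranks(scores, nodes) for name, scores in metrics.items()}
    return {(a, b): pearson(ranks[a], ranks[b]) for a, b in combinations(ranks, 2)}


# Made-up scores for four papers, purely for illustration.
metrics = {
    "Degree": {"p1": 0.9, "p2": 0.6, "p3": 0.3, "p4": 0.1},
    "Closeness": {"p1": 0.8, "p2": 0.5, "p3": 0.7, "p4": 0.2},
    "Betweenness": {"p1": 0.1, "p2": 0.4, "p3": 0.0, "p4": 0.3},
}
table = rank_correlations(metrics)
lowest_pair = min(table, key=table.get)
print(lowest_pair, round(table[lowest_pair], 4))
```

Fixing one node order for every rank vector is the important detail: the correlation is only meaningful if position `i` refers to the same paper in each vector.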
|
| ### Interpretation principle |
|
|
| Low correlation occurs when two metrics encode different structural roles, e.g.: |
|
|
| - local popularity (degree) vs bridge control (betweenness), |
| - global distance efficiency (closeness) vs brokerage roles (betweenness). |
|
|
| --- |
|
|
| ## 3.5 Optional Extra Credit: Research Evolution |
|
|
| ### Goal |
|
|
| Trace thematic shifts in research trends before and after 2023. |
|
|
| ### Procedure |
|
|
| 1. Split nodes by publication year: |
|    - before 2023, |
|    - 2023 and later. |
| 2. Build documents from title + abstract. |
| 3. Tokenize and clean the text. |
| 4. Create one shared vocabulary dictionary for both groups. |
| 5. Train two LDA models (same vocabulary, separate corpora). |
| 6. Extract the topic-term matrices: |
|    - `D` (pre-2023), |
|    - `S` (post-2023). |
| 7. Compute a shift score for each topic: |
|    - shift = `1 - max cosine similarity` to any topic in the opposite period. |
| 8. Rank: |
|    - pre-2023 topics with highest shift (potentially disappearing), |
|    - post-2023 topics with highest shift (potentially emerging). |
| 9. Print the top words for each ranked topic. |
|
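Steps 7 and 8 can be sketched with NumPy as below, assuming `D` and `S` are already available as topic-by-vocabulary arrays over the same columns. The function name `topic_shift` and the tiny example matrices are illustrative, not output from the real models.

```python
import numpy as np


def topic_shift(D, S):
    """Shift score per topic: 1 - max cosine similarity to any topic on the other side.

    D and S are (topics x vocabulary) matrices over the same vocabulary columns.
    """
    D_unit = D / np.linalg.norm(D, axis=1, keepdims=True)
    S_unit = S / np.linalg.norm(S, axis=1, keepdims=True)
    sim = D_unit @ S_unit.T               # sim[i, j] = cosine(D topic i, S topic j)
    shift_before = 1.0 - sim.max(axis=1)  # one score per pre-2023 topic
    shift_after = 1.0 - sim.max(axis=0)   # one score per post-2023 topic
    return shift_before, shift_after


# Tiny made-up matrices: 2 topics each over a 3-word vocabulary.
D = np.array([[0.8, 0.1, 0.1], [0.1, 0.8, 0.1]])
S = np.array([[0.7, 0.2, 0.1], [0.1, 0.1, 0.8]])
shift_before, shift_after = topic_shift(D, S)
disappearing = np.argsort(-shift_before)  # pre-2023 topics, highest shift first
emerging = np.argsort(-shift_after)       # post-2023 topics, highest shift first
print(disappearing, emerging)
```

A topic with a close counterpart in the other period gets a shift near 0, while a topic with no counterpart approaches 1. Because topic-term weights are non-negative, the score stays within that range.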
|
| ### Why this is valid |
|
|
| - Shared vocabulary ensures `D` and `S` are directly comparable. |
| - Cosine similarity measures how strongly two topics' word distributions overlap, independent of vector length. |
| - Ranking by shift provides interpretable emergence/disappearance candidates. |
|
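To illustrate why a shared vocabulary makes the two matrices comparable, here is a small sketch that builds one word-to-column mapping from both document groups and vectorizes each group against it. The tokenizer is deliberately simple (lowercased alphabetic tokens, no stop-word removal), and all names and example texts are assumptions rather than the preprocessing used in `code.py`.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase and keep alphabetic tokens only (a deliberately simple cleaner)."""
    return re.findall(r"[a-z]+", text.lower())


def build_vocabulary(*doc_groups):
    """One word -> column index mapping built from every group's documents."""
    words = sorted({tok for docs in doc_groups for doc in docs for tok in tokenize(doc)})
    return {word: idx for idx, word in enumerate(words)}


def to_count_rows(docs, vocab):
    """Bag-of-words count vectors whose columns follow the shared vocabulary."""
    rows = []
    for doc in docs:
        counts = Counter(tokenize(doc))
        rows.append([counts.get(word, 0) for word in vocab])
    return rows


# Made-up titles standing in for title + abstract text.
before_docs = ["Event extraction with knowledge graphs"]
after_docs = ["Reasoning with LLMs", "LLMs for generation"]
vocab = build_vocabulary(before_docs, after_docs)
before_rows = to_count_rows(before_docs, vocab)
after_rows = to_count_rows(after_docs, vocab)
print(len(vocab), len(before_rows[0]), len(after_rows[0]))
```

Column `vocab["llms"]` refers to the same word in both groups even though the word never occurs in the earlier documents. This alignment is what allows a row of `D` to be compared with a row of `S`.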
|
| --- |
|
|
| ## 4. Observed Results from Current Run |
|
|
| The following results were generated by running: |
|
|
| `python /home/mshahidul/readctrl/assignment_sc_2/code.py` |
|
|
| ### 4.1 Network and LCC Summary |
|
|
| - LCC contains `1662` nodes and `26134` edges. |
| - The analysis therefore runs on a single large connected core, which suits the centrality and connectivity experiments. |
|
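For a sense of scale, the reported node and edge counts imply an average degree and a density that can be checked with a few lines of arithmetic. The sketch assumes the LCC is an undirected simple graph, which the document does not state explicitly.

```python
nodes, edges = 1662, 26134  # LCC size reported above

# In an undirected simple graph, each edge contributes to two node degrees.
average_degree = 2 * edges / nodes
# Density = realised edges / possible edges, with n * (n - 1) / 2 possible pairs.
density = 2 * edges / (nodes * (nodes - 1))

print(f"average degree = {average_degree:.2f}, density = {density:.4f}")
```

Under that assumption each paper has roughly 31 neighbours on average, while fewer than 2% of all possible paper pairs are linked, so the core is well connected but still sparse.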
|
| ### 4.2 Centrality Correlation Results |
|
|
| Pearson correlation between centrality rankings: |
|
|
| | Metric | Degree | Closeness | Betweenness | |
| |---|---:|---:|---:| |
| | Degree | 1.0000 | 0.9361 | 0.8114 | |
| | Closeness | 0.9361 | 1.0000 | 0.7684 | |
| | Betweenness | 0.8114 | 0.7684 | 1.0000 | |
|
|
| - Lowest-correlation pair: **Closeness vs Betweenness** (`r = 0.7684`). |
| - Interpretation: closeness captures global proximity, while betweenness captures shortest-path brokerage; these are related but not identical structural roles. |
|
|
| ### 4.3 Central Papers (Top-10) Highlights |
|
|
| Across Degree, Closeness, and Betweenness top-10 lists, several papers repeatedly appear, including: |
|
|
| - `ahuja-etal-2023-mega` (`{MEGA}: Multilingual Evaluation of Generative {AI}`), |
| - `ding-etal-2020-discriminatively`, |
| - `shin-etal-2020-autoprompt`, |
| - `weller-etal-2020-learning`, |
| - `qin-etal-2023-chatgpt`. |
|
|
| This overlap suggests these papers are influential under all three notions of centrality: local connectivity, global accessibility, and bridge-like structural importance. |
|
|
| ### 4.4 Optional Topic Evolution Results |
|
|
| Topic matrices: |
|
|
| - `D` (before 2023): shape `(5, 5000)` |
| - `S` (after 2023): shape `(5, 5000)` |
|
|
| Top potentially disappearing theme example: |
|
|
| - Before Topic 4, shift `0.1912`, keywords: |
| `question, knowledge, event, performance, questions, task, graph, can` |
|
|
| Top potentially emerging theme example: |
|
|
| - After Topic 2, shift `0.1989`, keywords: |
| `llms, large, data, tasks, knowledge, reasoning, generation, performance` |
|
|
| Interpretation: post-2023 topics show stronger emphasis on **LLMs**, reasoning, and generation-centered trends. |
|
|
| --- |
|
|
| ## 5. Limitations and Practical Notes |
|
|
| - Weak/strong tie counts are currently implicit via sorted order; explicit threshold-based counts can be added if required. |
| - Topic modeling quality depends on preprocessing and corpus size. |
| - The final report should connect the output topics and central papers to real NLP/AI trends; this strengthens the interpretation and the grade it earns. |
|
|
| --- |