# Text-Attributed Network Analysis Documentation
This document explains how the implementation in `assignment_sc_2/code.py` addresses the assignment requirements and grading rubric.
## 1. Objective
The assignment analyzes a network of research papers where:
- each node is a paper with metadata (`id`, `year`, `authors`, `title`, `abstract`),
- each edge represents semantic similarity between two papers,
- edge `weight` indicates tie strength (higher weight = stronger topical similarity).
The code loads `aclbib.graphml`, extracts the Largest Connected Component (LCC), and performs:
- weak/strong tie removal analysis,
- centrality analysis,
- centrality ranking correlation analysis,
- optional temporal topic-shift analysis.
---
## 2. Rubric Coverage Summary
### (Part 2, 30%) Weak/Strong Ties and LCC Dynamics
Covered in `weaktie_analysis(LCC)`:
- ties are ordered by weight to produce weak-to-strong and strong-to-weak removal orders,
- two experiments are run:
  - removing the weakest ties first,
  - removing the strongest ties first,
- after each single edge removal, the LCC size is recomputed,
- the x-axis is the fraction of ties removed,
- the y-axis is the LCC size (number of nodes).
Note: The implementation uses rank-based weak/strong definitions (by sorted weights). If explicit threshold-based counts are required by instructor policy, add a threshold rule (e.g., bottom/top quartile) and print those counts.
### (Part 2, 35%) Centrality + Central Papers + Correlation + Interpretation
Covered in `centrality_analysis(LCC)`:
- computes degree, closeness, and betweenness centrality,
- identifies the top 10 papers for each metric,
- outputs entries in `ID<TAB>Title` format,
- converts centrality scores to ranking vectors,
- computes Pearson correlation between metric rankings,
- prints a correlation table,
- identifies the lowest-correlation pair,
- provides interpretation grounded in metric definitions.
### (Part 2, 10%) Report Quality
This markdown report provides:
- clear method descriptions,
- consistent structure by rubric item,
- direct mapping from requirements to implementation,
- interpretation guidance and limitations.
### (Part 2, Optional Extra Credit, 50%) Research Evolution Analysis
Covered in `research_evolution_analysis(G)`:
- splits papers into pre-2023 and 2023-onward groups,
- tokenizes title + abstract,
- builds a shared global dictionary (vocabulary),
- trains LDA models for both groups using the same vocabulary,
- obtains comparable topic-term matrices:
  - `D` for pre-2023,
  - `S` for post-2023,
- computes topic shift using cosine similarity,
- ranks potentially disappearing and emerging themes,
- prints top words for contextual interpretation.
---
## 3. Detailed Methodology
### 3.1 Data Loading and LCC Extraction
1. Load the graph from `aclbib.graphml`.
2. Extract the largest connected component:
   - this ensures path-based metrics (closeness, betweenness) are meaningful and comparable.
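The two steps above can be sketched with `networkx`. This is a minimal illustrative sketch, not the assignment's exact code: the function name `largest_connected_component` and the tiny demo graph are assumptions, and the real script would read `aclbib.graphml` instead.

```python
import networkx as nx

# Minimal sketch (illustrative, not the assignment's exact code):
# keep only the largest connected component so that path-based metrics
# (closeness, betweenness) are well defined for every node pair.
def largest_connected_component(G: nx.Graph) -> nx.Graph:
    nodes = max(nx.connected_components(G), key=len)
    return G.subgraph(nodes).copy()

# In the real script the graph would come from nx.read_graphml("aclbib.graphml");
# here a tiny demo graph stands in: a 3-node component plus a 2-node component.
demo = nx.Graph([(1, 2), (2, 3), (4, 5)])
lcc = largest_connected_component(demo)
print(sorted(lcc.nodes()))  # [1, 2, 3]
```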
---
### 3.2 Weak vs Strong Tie Analysis
#### Definitions
- Weak ties: lower edge weights (lower semantic similarity).
- Strong ties: higher edge weights (higher semantic similarity).
#### Procedure
1. Sort edges by weight ascending (`weak -> strong`).
2. Create the reversed order (`strong -> weak`).
3. For each removal order:
   - remove one edge at a time,
   - recompute the LCC size after each removal,
   - record:
     - fraction removed = removed_edges / total_edges,
     - LCC size = number of nodes in the current largest connected component.
4. Plot both removal curves.
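Steps 1–4 can be sketched as follows (plotting omitted). The function and variable names are illustrative assumptions, not the script's actual identifiers:

```python
import networkx as nx

# Sketch of the removal experiment (illustrative, not the exact code):
# remove edges one at a time in the given order and record the LCC size
# after each removal, plus the fraction of ties removed so far.
def lcc_sizes_during_removal(G, edge_order):
    H = G.copy()
    total = H.number_of_edges()
    fractions, sizes = [], []
    for i, (u, v) in enumerate(edge_order, start=1):
        H.remove_edge(u, v)
        fractions.append(i / total)
        sizes.append(len(max(nx.connected_components(H), key=len)))
    return fractions, sizes

# Toy weighted triangle; the weak-to-strong order sorts edges ascending by weight.
demo = nx.Graph()
demo.add_weighted_edges_from([(1, 2, 0.1), (2, 3, 0.9), (3, 1, 0.5)])
weak_first = [(u, v) for u, v, _ in sorted(demo.edges(data="weight"), key=lambda e: e[2])]
strong_first = list(reversed(weak_first))
fracs, sizes = lcc_sizes_during_removal(demo, weak_first)
print(sizes)  # [3, 2, 1]: the triangle survives one removal, then fragments
```

The two curves in the report come from running this once per removal order and plotting `fracs` against `sizes`.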
#### What this shows
- If removing weak ties first rapidly fragments the network, weak ties are acting as bridges.
- If removing strong ties first has the larger impact, strong ties are most critical to global cohesion.
---
### 3.3 Centrality Analysis
#### Metrics
- Degree centrality: local connectivity prominence.
- Closeness centrality: global proximity to all other nodes.
- Betweenness centrality: control over shortest-path flow.
#### Output
- Top 10 papers for each metric, as `ID<TAB>Title`.
- These lists identify influential papers under different notions of centrality.
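A minimal sketch of this output step, assuming nodes carry a `title` attribute as in `aclbib.graphml` (the helper name and the karate-club stand-in graph are illustrative assumptions):

```python
import networkx as nx

# Sketch (illustrative): compute the three centralities on the LCC and emit
# the top-10 for each metric as "ID<TAB>Title" lines.
def top10_lines(scores, G):
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:10]
    # node attributes are assumed to carry a 'title' field, as in aclbib.graphml
    return [f"{node}\t{G.nodes[node].get('title', '')}" for node, _ in ranked]

G = nx.karate_club_graph()  # stand-in for the paper LCC
for name, scores in [
    ("Degree", nx.degree_centrality(G)),
    ("Closeness", nx.closeness_centrality(G)),
    ("Betweenness", nx.betweenness_centrality(G)),
]:
    lines = top10_lines(scores, G)
    print(name, "top entry:", lines[0])
```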
---
### 3.4 Correlation Between Centrality Rankings
The assignment requests correlation between rankings, not raw centrality values.
#### Procedure
1. Convert each metric's score map into a rank vector (rank 1 = highest centrality).
2. Compute the Pearson correlation for each pair (Pearson correlation of rank vectors is equivalent to Spearman's rank correlation):
   - Degree vs Closeness,
   - Degree vs Betweenness,
   - Closeness vs Betweenness.
3. Build and print the correlation table.
4. Find the lowest-correlation pair and print an interpretation.
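The rank conversion and one pairwise correlation can be sketched with `numpy` alone (toy scores and the `to_ranks` helper are illustrative assumptions):

```python
import numpy as np

# Sketch of the ranking correlation (illustrative): convert each score list
# to ranks (rank 1 = highest score), then take the Pearson correlation of
# the rank vectors -- which is exactly Spearman's rank correlation.
def to_ranks(scores):
    order = np.argsort(-np.asarray(scores))  # indices from highest to lowest
    ranks = np.empty(len(scores), dtype=int)
    ranks[order] = np.arange(1, len(scores) + 1)
    return ranks

# Toy scores for four nodes under two hypothetical metrics.
degree_scores    = [0.9, 0.5, 0.7, 0.1]
closeness_scores = [0.8, 0.4, 0.6, 0.2]
r = np.corrcoef(to_ranks(degree_scores), to_ranks(closeness_scores))[0, 1]
print(round(r, 4))  # 1.0 -- both metrics rank the four nodes identically
```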
#### Interpretation principle
Low correlation occurs when two metrics encode different structural roles, e.g.:
- local popularity (degree) vs bridge control (betweenness),
- global distance efficiency (closeness) vs brokerage roles (betweenness).
---
### 3.5 Optional Extra Credit: Research Evolution
#### Goal
Trace thematic shifts in research trends before and after 2023.
#### Procedure
1. Split nodes by publication year:
   - before 2023,
   - 2023 and later.
2. Build documents from title + abstract.
3. Tokenize and clean the text.
4. Create one shared vocabulary dictionary for both groups.
5. Train two LDA models (same vocabulary, separate corpora).
6. Extract topic-term matrices:
   - `D` (pre-2023),
   - `S` (post-2023).
7. Compute a shift score for each topic:
   - shift = `1 - max cosine similarity` to any topic in the opposite period.
8. Rank:
   - pre-2023 topics with the highest shift (potentially disappearing),
   - post-2023 topics with the highest shift (potentially emerging).
9. Print top words for each ranked topic.
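The shift computation of step 7 can be sketched with `numpy` (the `topic_shifts` helper and the tiny toy matrices are illustrative assumptions; the real matrices have shape `(5, 5000)`):

```python
import numpy as np

# Sketch of step 7 (illustrative shapes): D and S are topic-term matrices
# over the same vocabulary; each topic's shift is 1 minus its best cosine
# match against any topic from the other period.
def topic_shifts(A, B):
    A = A / np.linalg.norm(A, axis=1, keepdims=True)
    B = B / np.linalg.norm(B, axis=1, keepdims=True)
    sims = A @ B.T                 # pairwise cosine similarities
    return 1.0 - sims.max(axis=1)  # one shift score per row of A

# Toy 2-topic, 3-word matrices standing in for the real (5, 5000) ones.
D = np.array([[0.7, 0.2, 0.1], [0.1, 0.1, 0.8]])  # "pre-2023" topics
S = np.array([[0.6, 0.3, 0.1], [0.2, 0.7, 0.1]])  # "post-2023" topics
shifts = topic_shifts(D, S)
# Topics are then ranked by descending shift to flag candidates (step 8).
print(np.argsort(-shifts))  # indices of D's topics, most-shifted first
```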
#### Why this is valid
- A shared vocabulary ensures `D` and `S` are directly comparable.
- Cosine similarity captures semantic overlap between topic distributions.
- Ranking by shift provides interpretable emergence/disappearance candidates.
---
## 4. Observed Results from Current Run
The following results were generated by running:
`python /home/mshahidul/readctrl/assignment_sc_2/code.py`
### 4.1 Network and LCC Summary
- The LCC contains `1662` nodes and `26134` edges.
- This indicates the analysis is performed on a large connected core, suitable for centrality and connectivity experiments.
### 4.2 Centrality Correlation Results
Pearson correlation between centrality rankings:
| Metric | Degree | Closeness | Betweenness |
|---|---:|---:|---:|
| Degree | 1.0000 | 0.9361 | 0.8114 |
| Closeness | 0.9361 | 1.0000 | 0.7684 |
| Betweenness | 0.8114 | 0.7684 | 1.0000 |
- Lowest-correlation pair: **Closeness vs Betweenness** (`r = 0.7684`).
- Interpretation: closeness captures global proximity, while betweenness captures shortest-path brokerage; these are related but not identical structural roles.
### 4.3 Central Papers (Top-10) Highlights
Across the Degree, Closeness, and Betweenness top-10 lists, several papers appear repeatedly, including:
- `ahuja-etal-2023-mega` (`{MEGA}: Multilingual Evaluation of Generative {AI}`),
- `ding-etal-2020-discriminatively`,
- `shin-etal-2020-autoprompt`,
- `weller-etal-2020-learning`,
- `qin-etal-2023-chatgpt`.
This overlap suggests these papers are influential under all three notions of centrality: local connectivity, global accessibility, and bridge-like structural importance.
### 4.4 Optional Topic Evolution Results
Topic matrices:
- `D` (before 2023): shape `(5, 5000)`
- `S` (after 2023): shape `(5, 5000)`
Top potentially disappearing theme example:
- Before Topic 4, shift `0.1912`, keywords:
  `question, knowledge, event, performance, questions, task, graph, can`
Top potentially emerging theme example:
- After Topic 2, shift `0.1989`, keywords:
  `llms, large, data, tasks, knowledge, reasoning, generation, performance`
Interpretation: post-2023 topics show a stronger emphasis on **LLMs**, reasoning, and generation-centered trends.
---
## 5. Limitations and Practical Notes
- Weak/strong tie counts are currently implicit in the sorted order; explicit threshold-based counts can be added if required.
- Topic modeling quality depends on preprocessing and corpus size.
- The final report's interpretation should connect the output topics and central papers to real NLP/AI trends for stronger grading.
---