# Text-Attributed Network Analysis Documentation

This document explains how the implementation in `assignment_sc_2/code.py` addresses the assignment requirements and grading rubric.

## 1. Objective

The assignment analyzes a network of research papers where:

- each node is a paper with metadata (`id`, `year`, `authors`, `title`, `abstract`),
- each edge represents semantic similarity between two papers,
- edge `weight` indicates tie strength (higher weight = stronger topical similarity).

The code loads `aclbib.graphml`, extracts the Largest Connected Component (LCC), and performs:

- weak/strong tie removal analysis,
- centrality analysis,
- centrality ranking correlation analysis,
- optional temporal topic-shift analysis.

---

## 2. Rubric Coverage Summary

### (Part 2, 30%) Weak/Strong Ties and LCC Dynamics

Covered in `weaktie_analysis(LCC)`:

- ties are ordered by weight to represent weak-to-strong and strong-to-weak removal,
- two experiments are run:
  - removing weakest ties first,
  - removing strongest ties first,
- after each single edge removal, LCC size is recomputed,
- the x-axis is the fraction of ties removed,
- the y-axis is LCC size (number of nodes).

Note: The implementation uses rank-based weak/strong definitions (by sorted weights). If explicit threshold-based counts are required by instructor policy, add a threshold rule (e.g., bottom/top quartile) and print those counts.

### (Part 2, 35%) Centrality + Central Papers + Correlation + Interpretation

Covered in `centrality_analysis(LCC)`:

- computes degree, closeness, and betweenness centrality,
- identifies the top 10 papers for each metric,
- outputs entries in `IDTitle` format,
- converts centrality scores to ranking vectors,
- computes Pearson correlation between metric rankings,
- prints a correlation table,
- identifies the lowest-correlation pair,
- provides interpretation grounded in metric definitions.
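The two removal experiments described under the 30% item can be sketched as follows. This is a minimal stdlib-only illustration on a toy graph; `weaktie_analysis` in `code.py` presumably works on the networkx LCC instead, and the helper names here (`lcc_size`, `removal_curve`) are hypothetical.

```python
from collections import defaultdict

def lcc_size(nodes, edges):
    """Size of the largest connected component, via BFS over an adjacency map."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    seen, best = set(), 0
    for start in nodes:
        if start in seen:
            continue
        seen.add(start)
        stack, comp = [start], 0
        while stack:
            node = stack.pop()
            comp += 1
            for nbr in adj[node]:
                if nbr not in seen:
                    seen.add(nbr)
                    stack.append(nbr)
        best = max(best, comp)
    return best

def removal_curve(nodes, weighted_edges, strongest_first=False):
    """Remove ties one at a time in weight order, recording
    (fraction removed, LCC size) after each removal."""
    order = sorted(weighted_edges, key=lambda e: e[2], reverse=strongest_first)
    total = len(order)
    curve = []
    for i in range(1, total + 1):
        remaining = [(u, v) for u, v, _ in order[i:]]
        curve.append((i / total, lcc_size(nodes, remaining)))
    return curve

# Toy graph: two triangles joined by a single low-weight bridge.
nodes = list("abcdef")
edges = [("a", "b", 0.9), ("b", "c", 0.8), ("a", "c", 0.9),
         ("d", "e", 0.9), ("e", "f", 0.8), ("d", "f", 0.9),
         ("c", "d", 0.1)]  # the weak bridge
weak_first = removal_curve(nodes, edges, strongest_first=False)
strong_first = removal_curve(nodes, edges, strongest_first=True)
# Removing the weak bridge first immediately halves the LCC (6 -> 3),
# while removing a strong within-triangle tie first leaves it intact.
```

On this toy graph, weak ties act as the bridge, so the weak-first curve drops sooner, which is exactly the contrast the two plotted curves are meant to expose.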
### (Part 2, 10%) Report Quality

This markdown report provides:

- clear method descriptions,
- consistent structure by rubric item,
- a direct mapping from requirements to implementation,
- interpretation guidance and limitations.

### (Part 2, Optional Extra Credit, 50%) Research Evolution Analysis

Covered in `research_evolution_analysis(G)`:

- splits papers into before-2023 and after-2023 groups,
- tokenizes title + abstract,
- builds a shared global dictionary (vocabulary),
- trains LDA models for both groups using the same vocabulary,
- obtains comparable topic-term matrices:
  - `D` for pre-2023,
  - `S` for post-2023,
- computes topic shift using cosine similarity,
- ranks potentially disappearing and emerging themes,
- prints top words for contextual interpretation.

---

## 3. Detailed Methodology

### 3.1 Data Loading and LCC Extraction

1. Load the graph from `aclbib.graphml`.
2. Extract the largest connected component:
   - this ensures path-based metrics (closeness, betweenness) are meaningful and comparable.

---

### 3.2 Weak vs Strong Tie Analysis

#### Definitions

- Weak ties: lower edge weights (lower semantic similarity).
- Strong ties: higher edge weights (higher semantic similarity).

#### Procedure

1. Sort edges by weight ascending (`weak -> strong`).
2. Create the reversed order (`strong -> weak`).
3. For each removal order:
   - remove one edge at a time,
   - recompute LCC size after each removal,
   - record:
     - fraction removed = removed_edges / total_edges,
     - LCC size = number of nodes in the current largest connected component.
4. Plot both removal curves.

#### What this shows

- If removing weak ties first rapidly fragments the network, weak ties are acting as bridges.
- If removing strong ties first causes the larger impact, strong ties are most critical to global cohesion.

---

### 3.3 Centrality Analysis

#### Metrics

- Degree centrality: local connectivity prominence.
- Closeness centrality: global proximity to all nodes.
- Betweenness centrality: control over shortest-path flow.
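The three metrics above can be illustrated with a small stdlib-only sketch on a toy path graph; `centrality_analysis` in `code.py` presumably uses library routines (e.g., networkx's centrality functions) rather than this hand-rolled version, and the helper names here are hypothetical.

```python
from collections import deque
from itertools import combinations

def bfs(adj, src):
    """Shortest-path distances and predecessor lists from src (unweighted)."""
    dist, preds = {src: 0}, {src: []}
    q = deque([src])
    while q:
        u = q.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                preds[v] = [u]
                q.append(v)
            elif dist[v] == dist[u] + 1:
                preds[v].append(u)
    return dist, preds

def npaths(preds, v):
    """Number of shortest paths from the BFS source to v."""
    if not preds[v]:
        return 1
    return sum(npaths(preds, p) for p in preds[v])

def centralities(adj):
    """Normalized degree, closeness, betweenness for a small connected
    undirected graph given as an adjacency dict (brute-force sketch)."""
    n = len(adj)
    paths = {u: bfs(adj, u) for u in adj}
    deg = {u: len(adj[u]) / (n - 1) for u in adj}
    close = {u: (n - 1) / sum(paths[u][0].values()) for u in adj}
    betw = {u: 0.0 for u in adj}
    for s, t in combinations(adj, 2):
        dist_s, preds_s = paths[s]
        sigma_st = npaths(preds_s, t)  # count of shortest s-t paths
        for v in adj:
            if v in (s, t):
                continue
            dist_v, preds_v = paths[v]
            # v lies on a shortest s-t path iff distances are additive.
            if dist_s[v] + dist_v[t] == dist_s[t]:
                betw[v] += npaths(preds_s, v) * npaths(preds_v, t) / sigma_st
    norm = (n - 1) * (n - 2) / 2
    return deg, close, {v: b / norm for v, b in betw.items()}

# Toy 5-node path graph a-b-c-d-e: c is most central under all three notions.
adj = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"],
       "d": ["c", "e"], "e": ["d"]}
deg, close, betw = centralities(adj)
```

The path graph makes the distinctions concrete: `b`, `c`, and `d` tie on degree, but `c` wins on closeness and betweenness because it sits on the most shortest paths.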
#### Output

- Top 10 papers for each metric, as `IDTitle`.
- These lists identify influential papers under different notions of centrality.

---

### 3.4 Correlation Between Centrality Rankings

The assignment requests correlation between rankings, not raw centrality values.

#### Procedure

1. Convert each metric's score map into a rank vector (rank 1 = highest centrality).
2. Compute Pearson correlation for each pair:
   - Degree vs Closeness,
   - Degree vs Betweenness,
   - Closeness vs Betweenness.
3. Build and print the correlation table.
4. Find the lowest-correlation pair and print an interpretation.

#### Interpretation principle

Low correlation occurs when two metrics encode different structural roles, e.g.:

- local popularity (degree) vs bridge control (betweenness),
- global distance efficiency (closeness) vs brokerage roles (betweenness).

---

### 3.5 Optional Extra Credit: Research Evolution

#### Goal

Trace thematic shifts in research trends before and after 2023.

#### Procedure

1. Split nodes by publication year:
   - before 2023,
   - 2023 and later.
2. Build documents from title + abstract.
3. Tokenize and clean the text.
4. Create one shared vocabulary dictionary for both groups.
5. Train two LDA models (same vocabulary, separate corpora).
6. Extract topic-term matrices:
   - `D` (pre-2023),
   - `S` (post-2023).
7. Compute a shift score for each topic:
   - shift = `1 - max cosine similarity` to any topic in the opposite period.
8. Rank:
   - pre-2023 topics with the highest shift (potentially disappearing),
   - post-2023 topics with the highest shift (potentially emerging).
9. Print the top words for each ranked topic.

#### Why this is valid

- The shared vocabulary ensures `D` and `S` are directly comparable.
- Cosine similarity captures semantic overlap between topic distributions.
- Ranking by shift provides interpretable emergence/disappearance candidates.

---
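The shift score from step 7 of the research-evolution procedure can be sketched as below. The toy 4-word topic rows stand in for the real `D` and `S` LDA matrices and are purely hypothetical.

```python
from math import sqrt

def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))

def shift_scores(D, S):
    """For each topic row of D: 1 - max cosine similarity to any row of S."""
    return [1 - max(cosine(d, s) for s in S) for d in D]

# Hypothetical topic-term rows over a 4-word shared vocabulary:
D = [[0.7, 0.1, 0.1, 0.1],   # pre-2023 topic 0
     [0.1, 0.1, 0.1, 0.7]]   # pre-2023 topic 1
S = [[0.7, 0.1, 0.1, 0.1],   # post-2023 topic 0: identical to D[0]
     [0.1, 0.7, 0.1, 0.1]]   # post-2023 topic 1
shifts = shift_scores(D, S)
# D[0] has an exact counterpart in S -> shift ~0 (stable theme);
# D[1] has no close match -> large shift (potentially disappearing).
```

Ranking pre-2023 topics by `shifts` descending yields the "potentially disappearing" candidates; swapping the roles of `D` and `S` yields the "potentially emerging" ones.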
## 4. Observed Results from Current Run

The following results were generated by running:

`python /home/mshahidul/readctrl/assignment_sc_2/code.py`

### 4.1 Network and LCC Summary

- The LCC contains `1662` nodes and `26134` edges.
- This indicates the analysis is performed on a large connected core, suitable for centrality and connectivity experiments.

### 4.2 Centrality Correlation Results

Pearson correlation between centrality rankings:

| Metric | Degree | Closeness | Betweenness |
|---|---:|---:|---:|
| Degree | 1.0000 | 0.9361 | 0.8114 |
| Closeness | 0.9361 | 1.0000 | 0.7684 |
| Betweenness | 0.8114 | 0.7684 | 1.0000 |

- Lowest-correlation pair: **Closeness vs Betweenness** (`r = 0.7684`).
- Interpretation: closeness captures global proximity, while betweenness captures shortest-path brokerage; these are related but not identical structural roles.

### 4.3 Central Papers (Top-10) Highlights

Across the Degree, Closeness, and Betweenness top-10 lists, several papers appear repeatedly, including:

- `ahuja-etal-2023-mega` (`{MEGA}: Multilingual Evaluation of Generative {AI}`),
- `ding-etal-2020-discriminatively`,
- `shin-etal-2020-autoprompt`,
- `weller-etal-2020-learning`,
- `qin-etal-2023-chatgpt`.

This overlap suggests these papers are robustly influential across local connectivity, global accessibility, and bridge-like structural importance.

### 4.4 Optional Topic Evolution Results

Topic matrices:

- `D` (before 2023): shape `(5, 5000)`
- `S` (after 2023): shape `(5, 5000)`

Top potentially disappearing theme example:

- Before Topic 4, shift `0.1912`, keywords: `question, knowledge, event, performance, questions, task, graph, can`

Top potentially emerging theme example:

- After Topic 2, shift `0.1989`, keywords: `llms, large, data, tasks, knowledge, reasoning, generation, performance`

Interpretation: post-2023 topics show a stronger emphasis on **LLMs**, reasoning, and generation-centered trends.

---
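For reference, the rank conversion and Pearson computation behind a correlation table like the one in 4.2 can be sketched as follows. This is a stdlib-only illustration with hypothetical scores; the actual implementation may use numpy or `scipy.stats.pearsonr`, and this simple ranking assigns tied scores distinct ranks in sort order.

```python
from math import sqrt

def to_ranks(scores):
    """Map each node to a rank; rank 1 = highest centrality score."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return {node: i + 1 for i, node in enumerate(ordered)}

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length vectors."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical centrality scores for four papers:
deg_scores = {"p1": 0.9, "p2": 0.7, "p3": 0.4, "p4": 0.1}
bet_scores = {"p1": 0.8, "p2": 0.2, "p3": 0.5, "p4": 0.1}
nodes = sorted(deg_scores)               # fixed node order for both vectors
r_deg, r_bet = to_ranks(deg_scores), to_ranks(bet_scores)
r = pearson([r_deg[v] for v in nodes], [r_bet[v] for v in nodes])
# The two rankings agree except that p2 and p3 swap places, so r is
# high but below 1.0.
```

Correlating ranks rather than raw scores is what the assignment asks for; note that Pearson on ranks corresponds to Spearman correlation when there are no ties.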
## 5. Limitations and Practical Notes

- Weak/strong tie counts are currently implicit in the sorted removal order; explicit threshold-based counts can be added if required.
- Topic modeling quality depends on preprocessing and corpus size.
- The interpretation in the final report should connect the output topics and central papers to real NLP/AI trends for stronger grading.

---
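If explicit threshold-based counts are ever required (per the first note above), a quartile rule could look like the sketch below; the function name and crude index-based cut points are illustrative only, not part of `code.py`.

```python
def quartile_tie_counts(weights):
    """Count ties at or below the first quartile (weak) and at or
    above the third quartile (strong), using crude sorted-index cuts."""
    ws = sorted(weights)
    n = len(ws)
    q1, q3 = ws[n // 4], ws[(3 * n) // 4]
    weak = sum(1 for w in weights if w <= q1)
    strong = sum(1 for w in weights if w >= q3)
    return weak, strong, q1, q3

# Toy edge weights:
weights = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]
weak, strong, q1, q3 = quartile_tie_counts(weights)
print(f"weak ties (w <= {q1}): {weak}; strong ties (w >= {q3}): {strong}")
```

Printing these counts alongside the removal curves would make the weak/strong definitions explicit for grading.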