
Part 2 Rubric Explanation

1) Weak/strong ties and LCC change during removal

Tie strength is defined by edge weight in the LCC.

  • Weak ties: weight <= median
  • Strong ties: weight > median

I run two removal orders on the LCC:

  1. weakest to strongest
  2. strongest to weakest

Edges are removed one at a time. After every removal, the LCC is recomputed and its size (node count) is recorded against the fraction of ties removed, so the x-axis is the fraction of ties removed and the y-axis is LCC size. This directly satisfies the rubric requirement to compare structural robustness under weak-first and strong-first deletions.
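The removal loop described above can be sketched as follows. This is a minimal illustration assuming `networkx`; the toy four-node graph and its weights are invented for demonstration, not the assignment's actual LCC.

```python
import networkx as nx

def lcc_size(G):
    """Number of nodes in the largest connected component (0 if empty)."""
    if G.number_of_nodes() == 0:
        return 0
    return max(len(c) for c in nx.connected_components(G))

def removal_curve(G, weak_first=True):
    """Remove edges in weight order; record (fraction removed, LCC size) after each step."""
    H = G.copy()
    edges = sorted(H.edges(data="weight"), key=lambda e: e[2], reverse=not weak_first)
    total = len(edges)
    curve = [(0.0, lcc_size(H))]
    for i, (u, v, _) in enumerate(edges, start=1):
        H.remove_edge(u, v)
        curve.append((i / total, lcc_size(H)))
    return curve

# toy weighted graph: a triangle with a weakly attached pendant node
G = nx.Graph()
G.add_weighted_edges_from([(1, 2, 0.9), (2, 3, 0.8), (3, 1, 0.7), (3, 4, 0.2)])
weak = removal_curve(G, weak_first=True)     # weakest-to-strongest order
strong = removal_curve(G, weak_first=False)  # strongest-to-weakest order
```

On this toy graph, weak-first removal detaches the pendant node immediately (LCC drops to 3), while strong-first removal leaves the graph connected after the first deletion.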

2) Centrality, top papers, and correlation analysis

From the run output, the starting LCC has:

  • 1662 nodes
  • 26134 edges

The code also prints exact weak/strong tie statistics:

  • total number of ties in the LCC: 26134
  • weak-tie threshold (median weight): 0.6276
  • number of weak ties (weight <= 0.6276): 13067
  • number of strong ties (weight > 0.6276): 13067

So both tie classification and total weak/strong counts are explicitly reported before the stepwise removal process.
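The median-split classification can be sketched as below. A minimal version using only the standard library; the example weights are illustrative. Note that with an even number of ties and the median falling between two values, the split is exactly half and half, matching the 13067/13067 counts reported above.

```python
from statistics import median

def classify_ties(weights):
    """Split tie weights at the median: weak (<= median) vs strong (> median)."""
    thr = median(weights)
    weak = [w for w in weights if w <= thr]
    strong = [w for w in weights if w > thr]
    return thr, len(weak), len(strong)

# toy weights; an even count with the median between two values splits 50/50
thr, n_weak, n_strong = classify_ties([0.1, 0.3, 0.6276, 0.8, 0.9, 1.0])
```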

Centrality, central papers, interpretation, correlation

Three centrality measures are computed on the LCC:

  • Degree
  • Closeness
  • Betweenness

For each metric, the top-10 papers are printed in ID<TAB>Title format.
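Computing the three measures and extracting a top-k list can be sketched as follows, assuming `networkx`. The small star-plus-tail graph and `top_k` helper are illustrative, not the assignment's data or code.

```python
import networkx as nx

def top_k(scores, k=10):
    """Return the k highest-scoring node IDs from a centrality dict."""
    return [n for n, _ in sorted(scores.items(), key=lambda kv: -kv[1])[:k]]

# toy LCC: node "a" is a local hub, "d" bridges to the tail node "e"
G = nx.Graph([("a", "b"), ("a", "c"), ("a", "d"), ("d", "e")])
deg = nx.degree_centrality(G)
clo = nx.closeness_centrality(G)
bet = nx.betweenness_centrality(G)
top_deg = top_k(deg, k=3)
```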

For correlation, I first convert centrality scores to ranking vectors and then compute Pearson correlation between the rankings (Pearson on ranks is equivalent to Spearman rank correlation).
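The rank-then-correlate step can be sketched as below, assuming `scipy`. The two small score dicts are invented for illustration.

```python
from scipy.stats import rankdata, pearsonr

def rank_correlation(scores_a, scores_b):
    """Convert two centrality dicts (same keys) to rank vectors, then Pearson r."""
    nodes = sorted(scores_a)
    ra = rankdata([scores_a[n] for n in nodes])
    rb = rankdata([scores_b[n] for n in nodes])
    return pearsonr(ra, rb)[0]

# two measures that order the nodes identically -> perfect rank correlation
a = {"x": 3.0, "y": 2.0, "z": 1.0}
b = {"x": 10.0, "y": 5.0, "z": 1.0}
r = rank_correlation(a, b)
```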

Results from the run:

| Metric      | Degree | Closeness | Betweenness |
|-------------|--------|-----------|-------------|
| Degree      | 1.0000 | 0.9361    | 0.8114      |
| Closeness   | 0.9361 | 1.0000    | 0.7684      |
| Betweenness | 0.8114 | 0.7684    | 1.0000      |

Lowest-correlation pair: Closeness vs Betweenness (0.7684).

  • Degree vs Closeness: 0.9361
  • Degree vs Betweenness: 0.8114
  • Closeness vs Betweenness: 0.7684 (lowest)

Interpretation of the lowest pair:

  • closeness measures overall proximity in the graph
  • betweenness measures bridge roles on shortest paths
  • these are related but different structural roles, so their rankings are less aligned: a node can be globally near many others without being a major bridge, which is why this pair diverges more than the others

The output explicitly reports the lowest-correlation pair. Papers that repeatedly appear across the top-10 lists include:
  • ahuja-etal-2023-mega
  • ding-etal-2020-discriminatively
  • shin-etal-2020-autoprompt
  • weller-etal-2020-learning
  • qin-etal-2023-chatgpt

The code also explicitly prints papers that appear in multiple metric top-10 lists (with metric names), which strengthens the evidence for identifying robustly central papers.
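The cross-list check can be sketched as follows, using only the standard library. The `repeated_across_lists` helper and the placeholder paper IDs (`p1`, `p2`, ...) are illustrative, not the assignment's actual identifiers.

```python
from collections import defaultdict

def repeated_across_lists(top_lists):
    """Map each paper to the metrics whose top-10 list contains it; keep only repeats."""
    seen = defaultdict(list)
    for metric, papers in top_lists.items():
        for p in papers:
            seen[p].append(metric)
    return {p: ms for p, ms in seen.items() if len(ms) > 1}

tops = {
    "degree": ["p1", "p2", "p3"],
    "closeness": ["p1", "p3", "p4"],
    "betweenness": ["p2", "p3", "p5"],
}
rep = repeated_across_lists(tops)  # p3 appears in all three lists
```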

Optional Extra Credit (50%): Theme shift before and after 2023

I compare two time groups: before 2023 and 2023+.

Steps used:

  1. split papers by year
  2. create text from title + abstract
  3. tokenize and clean
  4. build one shared vocabulary
  5. train LDA for each period
  6. extract topic-term matrices D (before) and S (after)
  7. compare topics with cosine similarity and rank by shift score
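The steps above can be sketched as below, assuming scikit-learn. The toy documents, k=2 topics, and the "1 minus best cosine match" shift score are illustrative stand-ins for the assignment's actual corpus and k=5 setup; the key point is the single shared vocabulary, which makes the rows of D and S directly comparable.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# toy stand-ins for title+abstract text in the two periods
before = ["question answering knowledge graph",
          "event extraction knowledge graph",
          "question generation task performance"]
after = ["llm reasoning generation task",
         "large language model reasoning data",
         "llm knowledge reasoning performance"]

# one shared vocabulary so topic-term rows of D and S are comparable
vec = CountVectorizer().fit(before + after)
Xb, Xa = vec.transform(before), vec.transform(after)

k = 2
lda_b = LatentDirichletAllocation(n_components=k, random_state=0).fit(Xb)
lda_a = LatentDirichletAllocation(n_components=k, random_state=0).fit(Xa)

# row-normalized topic-term matrices: D (before) and S (after)
D = lda_b.components_ / lda_b.components_.sum(axis=1, keepdims=True)
S = lda_a.components_ / lda_a.components_.sum(axis=1, keepdims=True)

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# shift score: 1 - best cosine match against the other period's topics
emerging = [1 - max(cosine(S[i], D[j]) for j in range(k)) for i in range(k)]
disappearing = [1 - max(cosine(D[i], S[j]) for j in range(k)) for i in range(k)]
```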

Run evidence:

  • D shape: (5, 5000)
  • S shape: (5, 5000)

Examples from output:

  • emerging: After Topic 2 | shift=0.1989 | llms, large, data, tasks, knowledge, reasoning, generation, performance
  • disappearing: Before Topic 4 | shift=0.1912 | question, knowledge, event, performance, questions, task, graph, can

This indicates a stronger LLM/reasoning focus in the later period.

Results (from current execution)

  • Network loaded successfully; LCC size is 1662 nodes and 26134 edges.
  • Weak/strong tie section reports:
    • total ties: 26134
    • median-weight threshold: 0.6276
    • weak ties: 13067
    • strong ties: 13067
  • Centrality ranking correlations:
    • Degree-Closeness: 0.9361
    • Degree-Betweenness: 0.8114
    • Closeness-Betweenness: 0.7684
  • Lowest-correlation pair: Closeness vs Betweenness.
  • Top-10 central papers were produced for all three metrics in ID<TAB>Title format.
  • Repeated papers across multiple centrality top-10 lists are explicitly reported.
  • Topic-evolution matrices were produced:
    • D (before 2023): (5, 5000)
    • S (2023+): (5, 5000)
  • Highest-shift emerging topic: After Topic 2 (shift=0.1989) with keywords around llms, reasoning, and generation.
  • Highest-shift disappearing topic: Before Topic 4 (shift=0.1912) with keywords around question, knowledge, and graph.

Findings

Post-2023 topics shift toward LLM- and reasoning-centered themes, while earlier topics are more question/knowledge/graph-oriented.

  • The centrality rankings are strongly related overall, but not identical.
  • Degree and closeness are most aligned (0.9361), indicating that papers with strong local connectivity are often globally well-positioned.
  • Closeness and betweenness are least aligned (0.7684), showing that global proximity and bridge-role influence capture different node functions.
  • Repeated appearance of papers such as ahuja-etal-2023-mega, ding-etal-2020-discriminatively, and qin-etal-2023-chatgpt across multiple lists suggests robust influence across different centrality definitions.
  • Topic-shift outputs indicate post-2023 movement toward LLM-oriented and reasoning-heavy themes.
  • Overall, the network remains highly connected at baseline, and the analysis pipeline covers connectivity, influence, and temporal theme evolution in a consistent way.