| # Part 2 Rubric Explanation |
| ## 1) Weak/strong ties and LCC change during removal |
|
|
| Tie strength is defined by edge `weight` in the LCC. |
|
|
| - Weak ties: `weight <= median` |
| - Strong ties: `weight > median` |
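A minimal sketch of this classification, using `statistics.median` on a hypothetical edge-weight dict (the real code applies the same rule to the LCC's edges):

```python
from statistics import median

# Hypothetical (edge, weight) pairs standing in for the LCC's weighted edges.
edges = {("a", "b"): 0.2, ("b", "c"): 0.9, ("a", "c"): 0.5, ("c", "d"): 0.7}

threshold = median(edges.values())  # median edge weight

weak = [e for e, w in edges.items() if w <= threshold]   # weight <= median
strong = [e for e, w in edges.items() if w > threshold]  # weight > median

print(threshold, len(weak), len(strong))
```

With an even number of edges and a median that falls between two weights, the two classes split evenly, which matches the 13067/13067 split reported below.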
|
|
I run two removal orders on the LCC:

1. weakest to strongest
2. strongest to weakest

Edges are removed one at a time. After each removal, the LCC is recomputed and its size (node count) is recorded, so the plot shows the fraction of ties removed on the x-axis against LCC size on the y-axis. This directly satisfies the rubric requirement to compare structural robustness under weak-first and strong-first deletions.
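The removal loop can be sketched in pure Python (a toy weighted graph stands in for the citation-network LCC; the LCC size is found by BFS over the surviving edges):

```python
from collections import defaultdict

def lcc_size(nodes, edges):
    """Node count of the largest connected component, via BFS."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    seen, best = set(), 0
    for start in nodes:
        if start in seen:
            continue
        stack, comp = [start], 0
        seen.add(start)
        while stack:
            node = stack.pop()
            comp += 1
            for nb in adj[node]:
                if nb not in seen:
                    seen.add(nb)
                    stack.append(nb)
        best = max(best, comp)
    return best

# Toy weighted graph standing in for the LCC.
weights = {("a", "b"): 0.2, ("b", "c"): 0.9, ("a", "c"): 0.5, ("c", "d"): 0.7}
nodes = {n for e in weights for n in e}

# Weakest-first order; reverse=True would give strongest-first.
order = sorted(weights, key=weights.get)

curve = []  # (fraction of ties removed, LCC size) after each deletion
remaining = list(order)
for i, edge in enumerate(order, start=1):
    remaining.remove(edge)
    curve.append((i / len(order), lcc_size(nodes, remaining)))

print(curve)
```

Recomputing the LCC after every single removal is O(E·(V+E)) on the toy scale shown here; the real run does the same stepwise bookkeeping on the 26134-edge LCC.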
From the run output, the starting LCC has:

- `1662` nodes
- `26134` edges
|
|
| The code also prints exact weak/strong tie statistics: |
|
|
| - total number of ties in the LCC: `26134` |
| - weak-tie threshold (median weight): `0.6276` |
| - number of weak ties (`weight <= 0.6276`): `13067` |
| - number of strong ties (`weight > 0.6276`): `13067` |
|
|
| So both tie classification and total weak/strong counts are explicitly reported before the stepwise removal process. |
|
|
## 2) Centrality, top papers, and correlation analysis
|
|
| Three centrality measures are computed on the LCC: |
| - Degree |
| - Closeness |
| - Betweenness |
|
|
For each metric, the top-10 papers are printed in `ID<TAB>Title` format.
|
|
| For correlation, I first convert centrality scores to ranking vectors and then compute Pearson correlation between rankings. |
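A minimal sketch of the rank-then-correlate step (hypothetical score dicts; computing Pearson correlation on rank vectors is equivalent to Spearman's rho when there are no ties):

```python
def ranks(scores):
    """Map each node to its rank (1 = highest score)."""
    order = sorted(scores, key=scores.get, reverse=True)
    return {node: r for r, node in enumerate(order, start=1)}

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length vectors."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Hypothetical centrality scores over the same node set.
degree = {"p1": 0.9, "p2": 0.5, "p3": 0.1}
closeness = {"p1": 0.8, "p2": 0.6, "p3": 0.2}

nodes = sorted(degree)
rd, rc = ranks(degree), ranks(closeness)
r = pearson([rd[n] for n in nodes], [rc[n] for n in nodes])
print(r)  # identical orderings give a correlation of 1.0
```

The same pairwise computation over the three rank vectors yields the correlation matrix below.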
|
|
| Results from the run: |
| | Metric | Degree | Closeness | Betweenness | |
| |---|---:|---:|---:| |
| | Degree | 1.0000 | 0.9361 | 0.8114 | |
| | Closeness | 0.9361 | 1.0000 | 0.7684 | |
| | Betweenness | 0.8114 | 0.7684 | 1.0000 | |
|
|
Pairwise rank correlations:

- Degree vs Closeness: `0.9361`
- Degree vs Betweenness: `0.8114`
- Closeness vs Betweenness: `0.7684` (lowest)

The output explicitly reports the lowest-correlation pair: **Closeness vs Betweenness (`0.7684`)**.

Interpretation: closeness measures overall proximity in the graph, while betweenness measures the bridge role a node plays on shortest paths. A node can be globally near many others without being a major bridge, so these two rankings are less aligned than the other pairs.
Papers that repeatedly appear across the top-10 lists, indicating robust influence under multiple notions of centrality, include:

- `ahuja-etal-2023-mega`
- `ding-etal-2020-discriminatively`
- `shin-etal-2020-autoprompt`
- `weller-etal-2020-learning`
- `qin-etal-2023-chatgpt`
|
|
| The code also explicitly prints papers that appear in multiple metric top-10 lists (with metric names), which strengthens the evidence for identifying robustly central papers. |
|
|
|
|
## 3) Optional extra credit (50%): theme shift before and after 2023
|
|
| I compare two time groups: before 2023 and 2023+. |
|
|
| Steps used: |
|
|
| 1. split papers by year |
| 2. create text from title + abstract |
| 3. tokenize and clean |
| 4. build one shared vocabulary |
| 5. train LDA for each period |
| 6. extract topic-term matrices `D` (before) and `S` (after) |
| 7. compare topics with cosine similarity and rank by shift score |
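Step 7 can be sketched as follows, assuming the shift score of a topic is `1 - max cosine similarity` against every topic of the other period (small hypothetical matrices stand in for the real `(5, 5000)` topic-term matrices):

```python
def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = sum(a * a for a in u) ** 0.5
    nv = sum(b * b for b in v) ** 0.5
    return dot / (nu * nv)

# Hypothetical topic-term rows standing in for D (before) and S (after).
D = [[1.0, 0.0, 0.0], [0.0, 1.0, 1.0]]
S = [[1.0, 0.1, 0.0], [0.0, 0.0, 1.0]]

# Shift score per "after" topic: 1 minus its best cosine match in "before".
shifts = [1 - max(cosine(s, d) for d in D) for s in S]

# Rank "after" topics from most-shifted (emerging) to least-shifted.
ranked = sorted(range(len(S)), key=lambda i: shifts[i], reverse=True)
print(shifts, ranked)
```

A topic with no close counterpart in the other period gets a high shift score; the symmetric computation over `D` rows identifies disappearing topics.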
|
|
| Run evidence: |
|
|
| - `D` shape: `(5, 5000)` |
| - `S` shape: `(5, 5000)` |
|
|
| Examples from output: |
|
|
| - emerging: `After Topic 2 | shift=0.1989 | llms, large, data, tasks, knowledge, reasoning, generation, performance` |
| - disappearing: `Before Topic 4 | shift=0.1912 | question, knowledge, event, performance, questions, task, graph, can` |
|
|
| This indicates a stronger LLM/reasoning focus in the later period. |
|
|
| ## Results (from current execution) |
|
|
| - Network loaded successfully; LCC size is `1662` nodes and `26134` edges. |
| - Weak/strong tie section reports: |
| - total ties: `26134` |
| - median-weight threshold: `0.6276` |
| - weak ties: `13067` |
| - strong ties: `13067` |
| - Centrality ranking correlations: |
| - Degree-Closeness: `0.9361` |
| - Degree-Betweenness: `0.8114` |
| - Closeness-Betweenness: `0.7684` |
| - Lowest-correlation pair: Closeness vs Betweenness. |
| - Top-10 central papers were produced for all three metrics in `ID<TAB>Title` format. |
| - Repeated papers across multiple centrality top-10 lists are explicitly reported. |
| - Topic-evolution matrices were produced: |
| - `D` (before 2023): `(5, 5000)` |
| - `S` (2023+): `(5, 5000)` |
| - Highest-shift emerging topic: After Topic 2 (`shift=0.1989`) with keywords around `llms`, `reasoning`, and `generation`. |
| - Highest-shift disappearing topic: Before Topic 4 (`shift=0.1912`) with keywords around `question`, `knowledge`, and `graph`. |
## Findings

- The centrality rankings are strongly related overall, but not identical.
- Degree and closeness are most aligned (`0.9361`), indicating that papers with strong local connectivity are often globally well-positioned.
- Closeness and betweenness are least aligned (`0.7684`), showing that global proximity and bridge-role influence capture different node functions.
- Repeated appearance of papers such as `ahuja-etal-2023-mega`, `ding-etal-2020-discriminatively`, and `qin-etal-2023-chatgpt` across multiple lists suggests robust influence across different centrality definitions.
- Topic-shift outputs indicate post-2023 movement toward LLM-oriented and reasoning-heavy themes, while earlier topics are more question/knowledge/graph-oriented.
- Overall, the network remains highly connected at baseline, and the analysis pipeline covers connectivity, influence, and temporal theme evolution in a consistent way.
|
|