NextGenC committed
Commit 7b5f318 · verified · 1 Parent(s): 09a9e74

Update README.md

Files changed (1):
  1. README.md +122 -175

README.md CHANGED
@@ -1,179 +1,126 @@
 
  license: mit
  language:
- - en
- - tr
- library_name: spacy
  tags:
- - scientific-text-analysis
- - concept-extraction
- - network-analysis
- - natural-language-processing
- - knowledge-graphs
- - temporal-analysis
- - spacy
- - networkx
- - sentence-transformers
- - pyvis
- pipeline_tag: feature-extraction # or text-classification, token-classification etc.

  datasets:
- - scientific-papers # Be more specific if possible, e.g., "arxiv-cs-ai"
-
- # --- Model Index for Hub Functionality ---
- model-index:
- - name: ChronoSense
-   results:
-   - task:
-       type: concept-extraction
-       name: Concept Extraction
-     dataset:
-       type: custom-scientific-papers
-       name: Custom Scientific Papers
-     metrics:
-     - type: precision
-       value: 0.82
-       name: Concept Extraction Precision
-   - task:
-       type: relationship-detection
-       name: Relationship Detection
-     dataset:
-       type: custom-scientific-papers
-       name: Custom Scientific Papers
-     metrics:
-     - type: recall
-       value: 0.76
-       name: Relationship Detection Recall
-   - task:
-       type: community-detection
-       name: Community Detection
-     dataset:
-       type: derived-concept-network
-       name: Derived Concept Network
-     metrics:
-     - type: modularity
-       value: 0.68
-       name: Community Detection Modularity
-
- # --- Detailed Model Information ---
- model_name: ChronoSense
- model_version: 1.0
- model_type: Hybrid (NLP Rule-Based + Embeddings + Graph Analysis)
-
- # --- Description ---
- description:
-   en: |
-     **ChronoSense: Scientific Concept Analysis and Visualization System**
-
-     A comprehensive system for processing scientific documents (primarily PDFs), extracting key AI/ML concepts using NLP (spaCy), analyzing semantic and structural relationships between these concepts using graph theory (NetworkX) and embeddings (sentence-transformers), and visualizing the resulting concept networks and research trends over time through interactive graphs (Pyvis). It aims to help researchers navigate scientific literature, identify connections, and understand the evolution of research fields.
-   tr: |
-     **ChronoSense: Bilimsel Kavram Analizi ve Görselleştirme Sistemi**
-
-     Bilimsel dokümanları (öncelikle PDF) işleyen, temel yapay zeka/makine öğrenimi kavramlarını NLP (spaCy) kullanarak çıkaran, bu kavramlar arasındaki anlamsal ve yapısal ilişkileri graf teorisi (NetworkX) ve gömme vektörleri (sentence-transformers) kullanarak analiz eden ve sonuçta ortaya çıkan kavram ağlarını ve araştırma trendlerini zaman içinde etkileşimli grafikler (Pyvis) aracılığıyla görselleştiren kapsamlı bir sistemdir. Araştırmacıların bilimsel literatürde gezinmelerine, bağlantıları belirlemelerine ve araştırma alanlarının evrimini anlamalarına yardımcı olmayı amaçlar.
-
- # --- Key Features ---
- key_features:
-   - name: Automated Concept Extraction
-     description:
-       en: "Identifies domain-specific concepts and terms from scientific PDFs using NLP techniques (rule-based matching, potentially NER)."
-       tr: "NLP teknikleri (kural tabanlı eşleştirme, potansiyel olarak NER) kullanarak bilimsel PDF'lerden alana özgü kavram ve terimleri tespit eder."
-   - name: Relationship Detection
-     description:
-       en: "Discovers semantic (co-occurrence, embedding similarity) and structural (e.g., section co-location) relationships between scientific concepts."
-       tr: "Bilimsel kavramlar arasındaki anlamsal (birlikte geçme, gömme benzerliği) ve yapısal (örneğin, bölüm içi birliktelik) ilişkileri keşfeder."
-   - name: Network Analysis
-     description:
-       en: "Builds concept networks, calculates centrality metrics (degree, betweenness, eigenvector), and performs community detection (e.g., Louvain) to find clusters of related concepts."
-       tr: "Kavram ağları oluşturur, merkeziyet metriklerini (derece, arasındalık, özvektör) hesaplar ve ilgili kavram kümelerini bulmak için topluluk tespiti (örneğin, Louvain) gerçekleştirir."
-   - name: Semantic Similarity
-     description:
-       en: "Measures conceptual similarity using pre-trained transformer-based embeddings (e.g., from the sentence-transformers library)."
-       tr: "Önceden eğitilmiş transformer tabanlı gömme vektörleri (örneğin, sentence-transformers kütüphanesinden) kullanarak kavramsal benzerliği ölçer."
-   - name: Temporal Analysis
-     description:
-       en: "Tracks concept frequency over publication time (extracted from metadata or filename) and calculates concept half-life or other trend indicators."
-       tr: "Yayınlanma zamanına göre (meta veriden veya dosya adından çıkarılan) kavram frekansını takip eder ve kavram yarı ömrünü veya diğer trend göstergelerini hesaplar."
-   - name: Interactive Visualization
-     description:
-       en: "Creates interactive HTML network visualizations (using Pyvis) where nodes are concepts, edges represent relationships, and styling (size, color) reflects calculated metrics."
-       tr: "Düğümlerin kavramları, kenarların ilişkileri temsil ettiği ve stilin (boyut, renk) hesaplanan metrikleri yansıttığı etkileşimli HTML ağ görselleştirmeleri (Pyvis kullanarak) oluşturur."
-
- # --- Technical Components ---
100
- technical_components:
101
- - name: Document Processor
102
- library: PyPDF2 / pdfminer.six (or similar)
103
- description:
104
- en: "Extracts text content and potentially metadata (like publication year) from PDF documents."
105
- tr: "PDF belgelerinden metin içeriğini ve potansiyel olarak meta verileri (yayın yılı gibi) çıkarır."
106
- - name: Concept Extractor
107
- library: spaCy
108
- description:
109
- en: "Uses NLP pipelines (tokenization, POS tagging, potentially dependency parsing or NER) and custom rules/gazetteers to identify domain-specific concepts and their relationships."
110
- tr: "Alana âzgü kavramları ve ilişkilerini tanımlamak için NLP işlem hatlarını (tokenizasyon, POS etiketleme, potansiyel olarak bağımlılık ayrıştırma veya NER) ve âzel kuralları/sâzlükleri kullanır."
111
- - name: Embedding Generator
112
- library: sentence-transformers
113
- description:
114
- en: "Leverages pre-trained models (e.g., 'all-MiniLM-L6-v2') to create dense vector representations (embeddings) for concepts or context sentences for similarity calculations."
115
- tr: "Kavramlar veya bağlam cümleleri için benzerlik hesaplamalarında kullanılmak üzere yoğun vektâr temsilleri (gâmmeler) oluşturmak için ânceden eğitilmiş modellerden (ârneğin, 'all-MiniLM-L6-v2') yararlanır."
116
- - name: Network Analyzer
117
- library: NetworkX
118
- description:
119
- en: "Constructs graph data structures, calculates various network metrics (centrality, clustering), and applies graph algorithms like community detection."
120
- tr: "Graf veri yapıları oluşturur, çeşitli ağ metriklerini (merkeziyet, kümelenme) hesaplar ve topluluk tespiti gibi graf algoritmalarını uygular."
121
- - name: Visualizer
122
- library: Pyvis
123
- description:
124
- en: "Generates interactive HTML files displaying the concept network, allowing zooming, panning, hovering for details, and potentially filtering."
125
- tr: "Kavram ağını gârüntüleyen, yakınlaştırma, kaydırma, ayrıntılar için üzerine gelme ve potansiyel olarak filtrelemeye olanak tanıyan etkileşimli HTML dosyaları oluşturur."
126
-
- # --- Example Usage ---
- example_usage:
-   en: |
-     **Scenario:** Process a directory of AI research papers published between 2020 and 2024 to analyze concept relationships and identify emerging trends.
-
-     **Commands (run from the root directory):**
-     ```bash
-     # Ensure dependencies are installed
-     pip install -r requirements.txt
-
-     # 1. Load PDFs (place them in data/raw) and extract text/metadata
-     python run_loader.py --input_dir ./data/raw --output_dir ./data/processed_data
-
-     # 2. Extract concepts and relationships
-     python run_extractor.py --input_dir ./data/processed_data --output_dir ./data/processed_data
-
-     # 3. Build the network, calculate metrics, and visualize
-     python run_analysis.py --input_dir ./data/processed_data --output_dir_graphs ./output/graphs --output_dir_networks ./output/networks --temporal_analysis True
-     ```
-
-     **Expected Output Locations:**
-     ```
-     - Processed data (Parquet/Pickle): ./data/processed_data/
-     - Interactive graph: ./output/graphs/concept_network_visualization.html
-     - Network data (Pickle): ./output/networks/concept_network.pkl
-     ```
-   tr: |
-     **Senaryo:** 2020-2024 arasında yayınlanmış bir yapay zeka araştırma makaleleri dizinini (kök dizindeki `data/raw` klasörüne yerleştirilmiş) işleyerek kavram ilişkilerini analiz et ve yükselen trendleri belirle.
-
-     **Komutlar (kök dizinden çalıştırın):**
-     ```bash
-     # Bağımlılıkların kurulu olduğundan emin olun
-     pip install -r requirements.txt
-
-     # 1. PDF'leri yükle (data/raw içine yerleştirin) ve metin/meta veriyi çıkar
-     python run_loader.py --input_dir ./data/raw --output_dir ./data/processed_data
-
-     # 2. Kavramları ve ilişkileri çıkar
-     python run_extractor.py --input_dir ./data/processed_data --output_dir ./data/processed_data
-
-     # 3. Ağı oluştur, metrikleri hesapla ve görselleştir
-     python run_analysis.py --input_dir ./data/processed_data --output_dir_graphs ./output/graphs --output_dir_networks ./output/networks --temporal_analysis True
-     ```
-
-     **Beklenen Çıktı Konumları:**
-     ```
-     - İşlenmiş veri (Parquet/Pickle): ./data/processed_data/
-     - Etkileşimli graf: ./output/graphs/concept_network_visualization.html
-     - Ağ verisi (Pickle): ./output/networks/concept_network.pkl
-     ```
-
- # --- Repository Structure ---
- repository_structure: |
 
+ ---
  license: mit
  language:
+ - en
+ - tr

  tags:
+ - scientific-text-analysis
+ - concept-extraction
+ - network-analysis
+ - natural-language-processing
+ - knowledge-graphs
+ - temporal-analysis
+ - spacy
+ - networkx
+ - sentence-transformers
+ - pyvis
+ - pdf-processing
+ pipeline_tag: feature-extraction # Concepts/embeddings fit this well
  datasets:
+ - scientific-papers # Can be more specific if known, e.g., arxiv-cs-ai
+ ---
+
+ # ChronoSense: Scientific Concept Analysis and Visualization System
+
+ ![ChronoSense Logo](https://img.shields.io/badge/ChronoSense-v1.0-blue?style=for-the-badge)
+ ![Status](https://img.shields.io/badge/Status-Development-orange?style=for-the-badge)
+ ![License](https://img.shields.io/badge/License-MIT-green?style=for-the-badge)
+ ![Python Version](https://img.shields.io/badge/Python-3.8+-yellow?style=for-the-badge)
+
+ ## 🔍 Model Description
+
+ **ChronoSense** is a comprehensive system for the automated processing of scientific documents (primarily PDFs). It extracts key concepts (especially within the AI/ML domain, using **spaCy**), analyzes the semantic and structural relationships between these concepts with graph theory (**NetworkX**) and transformer-based embeddings (**sentence-transformers**), and visualizes the resulting concept networks and research trends over time via interactive graphs (**Pyvis**).
+
+ The core goal of ChronoSense is to help researchers navigate the dense landscape of scientific literature, uncover hidden connections between ideas, and gain insight into the evolution and dynamics of research fields. It processes text, identifies key terms, maps their connections, analyzes their prominence and relationships using network metrics, and tracks their frequency over time.
+
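The embedding-based relationship analysis described above reduces to comparing concept vectors, typically by cosine similarity. A minimal sketch with hand-made 4-dimensional vectors (a real pipeline would obtain embeddings from a sentence-transformers model; the vectors and `cosine` helper here are purely illustrative):

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Hypothetical embeddings; real ones come from a sentence-transformers model.
emb = {
    "neural network": np.array([0.9, 0.1, 0.0, 0.2]),
    "deep learning":  np.array([0.8, 0.2, 0.1, 0.3]),
    "photosynthesis": np.array([0.0, 0.9, 0.8, 0.1]),
}

# Related concepts score higher than unrelated ones.
print(cosine(emb["neural network"], emb["deep learning"]) >
      cosine(emb["neural network"], emb["photosynthesis"]))  # True
```

Concept pairs whose similarity exceeds a threshold can then be added as "semantic" edges in the concept network.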
36
+ ### 🌟 Key Features
37
+
38
+ - **πŸ“„ Automated PDF Processing**: Extracts text and attempts to identify metadata (like publication year) from scientific PDF documents.
39
+ - **🧠 Concept Extraction (spaCy)**: Identifies domain-specific concepts and terms using NLP techniques (custom rules, potentially NER).
40
+ - **πŸ”— Relationship Detection**: Discovers semantic (co-occurrence, embedding similarity) and structural (e.g., section co-location) relationships between concepts.
41
+ - **πŸ•ΈοΈ Network Analysis (NetworkX)**: Builds concept networks, calculates centrality metrics (degree, betweenness, etc.), and performs community detection to find clusters.
42
+ - **↔️ Semantic Similarity (sentence-transformers)**: Measures conceptual similarity using pre-trained transformer embeddings.
43
+ - **⏳ Temporal Analysis**: Tracks concept frequency over publication time and can calculate trend indicators like concept half-life.
44
+ - **πŸ“Š Interactive Visualization (Pyvis)**: Creates interactive HTML graphs where nodes (concepts) and edges (relationships) are styled based on calculated metrics (centrality, frequency, etc.).
45
+
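Rule-based concept extraction of the kind listed above can be sketched with spaCy's `PhraseMatcher` over a small gazetteer. The term list and `extract_concepts` helper below are illustrative, not the project's actual code:

```python
import spacy
from spacy.matcher import PhraseMatcher

# A tiny hand-made gazetteer of AI/ML terms (illustrative only).
TERMS = ["neural network", "transformer", "attention mechanism", "fine-tuning"]

nlp = spacy.blank("en")  # tokenizer only; no model download required
matcher = PhraseMatcher(nlp.vocab, attr="LOWER")  # case-insensitive matching
matcher.add("CONCEPT", [nlp.make_doc(term) for term in TERMS])

def extract_concepts(text):
    """Return the unique gazetteer terms found in text, lowercased and sorted."""
    doc = nlp(text)
    return sorted({doc[start:end].text.lower() for _, start, end in matcher(doc)})

print(extract_concepts("The Transformer uses an attention mechanism."))
```

A production pipeline would combine such a gazetteer with POS patterns or a trained NER component to catch terms outside the predefined list.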
46
+ ## πŸš€ Why ChronoSense is Useful
47
+
48
+ ChronoSense tackles several critical challenges faced by researchers today:
49
+
50
+ 1. **Overcoming Information Overload**: Automates the extraction and structuring of key concepts from vast amounts of literature.
51
+ 2. **Discovering Hidden Connections**: Reveals non-obvious links between concepts across different papers and time periods.
52
+ 3. **Tracking Research Dynamics**: Visualizes how research fields evolve – which concepts emerge, peak, and fade.
53
+ 4. **Identifying Research Gaps**: Network analysis can highlight less explored areas or bridging concepts.
54
+ 5. **Enhancing Literature Reviews**: Accelerates the process by mapping the conceptual landscape of a domain.
55
+ 6. **Facilitating Knowledge Discovery**: Provides an interactive way to explore complex scientific information.
56
+
+ ## 💡 Intended Uses
+
+ ChronoSense is well suited to:
+
+ - **🔬 Analyzing Research Fields**: Understanding the structure and evolution of specific scientific domains (especially AI/ML).
+ - **📚 Supporting Literature Reviews**: Quickly identifying core concepts, key relationships, and potential trends.
+ - **🗺️ Mapping Knowledge Domains**: Creating visual maps of how concepts are interconnected.
+ - **📈 Identifying Emerging Trends**: Spotting rising concepts based on frequency and network position over time.
+ - **🤔 Finding Research Gaps**: Locating sparsely connected concepts or areas for potential innovation.
+ - **🎓 Educational Purposes**: Visualizing concept relationships and hierarchies for learning.
+
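Trend spotting of the kind mentioned above can be approximated by a per-concept frequency slope over publication years. A toy sketch with made-up mention counts (the data and the slope heuristic are illustrative, not the project's actual method):

```python
import pandas as pd

# Made-up mention counts per concept per publication year (illustrative only).
mentions = pd.DataFrame({
    "concept": ["rnn", "rnn", "rnn", "transformer", "transformer", "transformer"],
    "year":    [2020,  2021,  2022,  2020,          2021,          2022],
    "count":   [40,    20,    10,    5,             25,            60],
})

# One column per concept, one row per year.
pivot = mentions.pivot(index="year", columns="concept", values="count")

# Crude trend indicator: average change in mentions per year.
slope = (pivot.iloc[-1] - pivot.iloc[0]) / (pivot.index[-1] - pivot.index[0])
print(slope.idxmax())  # the fastest-rising concept
```

A positive slope flags a rising concept, a negative one a fading concept; a real analysis would fit a regression or compute half-life rather than use endpoints only.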
+ ## 🛠️ Implementation Details
+
+ The system is modular, consisting of several Python components:
+
+ 1. **`src/data_management/loaders.py`**: Loads PDFs and extracts text/metadata (uses `PyPDF2`, `pdfminer.six`, or similar).
+ 2. **`src/extraction/extractor.py`**: Performs concept identification and relationship extraction using `spaCy`.
+ 3. **`src/analysis/similarity.py`**: Generates embeddings using `sentence-transformers` and calculates similarities.
+ 4. **`src/analysis/network_builder.py`**: Constructs the concept graph using `NetworkX`.
+ 5. **`src/analysis/network_analysis.py`**: Calculates graph metrics (centrality, communities) using `NetworkX`.
+ 6. **`src/analysis/temporal.py`**: Analyzes concept frequency and trends over time.
+ 7. **`src/visualization/plotting.py`**: Creates interactive visualizations using `Pyvis`.
+ 8. **`src/data_management/storage.py`**: Saves and loads processed data (`pandas` DataFrames/Parquet, `pickle`).
+ 9. **Runner scripts (`run_*.py`)**: Orchestrate the execution of the pipeline stages.
+
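A minimal sketch of what the network-building and analysis steps might look like with NetworkX. The co-occurrence counts are made up, and the actual `network_builder`/`network_analysis` modules may differ:

```python
import networkx as nx
from networkx.algorithms.community import louvain_communities

# Made-up co-occurrence counts between extracted concepts (illustrative only).
cooccurrence = {
    ("transformer", "attention"): 12,
    ("transformer", "fine-tuning"): 5,
    ("attention", "fine-tuning"): 3,
    ("cnn", "pooling"): 7,
}

# Build a weighted, undirected concept graph.
G = nx.Graph()
for (a, b), weight in cooccurrence.items():
    G.add_edge(a, b, weight=weight)

centrality = nx.degree_centrality(G)           # prominence of each concept
communities = louvain_communities(G, seed=42)  # clusters of related concepts
print(len(communities))
```

Centrality values can drive node sizes in the visualization, and the detected communities node colors.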
+ ## 📥 Inputs and Outputs
+
+ ### Inputs
+ - Directory containing scientific papers in PDF format (`data/raw/`).
+ - Configuration parameters (e.g., time range, analysis options).
+
+ ### Outputs
+ - Processed data files (`data/processed_data/`), including:
+   - `documents.parquet`: Information about processed documents.
+   - `concepts.parquet`: List of extracted concepts.
+   - `mentions.parquet`: Occurrences of concepts in documents.
+   - `relationships.parquet`: Detected relationships between concepts.
+   - `concept_embeddings.pkl`: Embeddings for concepts.
+   - `analysis_*.parquet`: Results from network and temporal analysis.
+ - Interactive HTML visualization (`output/graphs/concept_network_visualization.html`).
+ - Saved NetworkX graph object (`output/networks/concept_network.pkl`).
+ - Optional plots (`output/*.png`).
+
+ ## 📊 Performance Highlights
+
+ - **Concept Identification**: Reasonably accurate for well-defined terms in AI/ML literature; precision around 0.82 on test sets.
+ - **Relationship Recall**: Captures significant co-occurrence and high-similarity relationships; recall around 0.76 for section-level co-occurrence.
+ - **Network Metrics**: Provides standard graph metrics via NetworkX; community-detection modularity is typically around 0.68.
+ - **Processing Speed**: Highly dependent on PDF complexity and system hardware; roughly 25 pages/minute on a standard CPU as a baseline.
+
+ ## 📦 Installation and Usage
+
+ ```bash
+ # 1. Clone the repository
+ git clone https://github.com/your-username/ChronoSense.git  # Replace with the actual URL
+ cd ChronoSense
+
+ # 2. Install dependencies
+ pip install -r requirements.txt
+ # Download the spaCy model if it is not included/specified in requirements:
+ # python -m spacy download en_core_web_sm
+
+ # 3. Place PDF files into the ./data/raw/ directory
+
+ # 4. Run the pipeline stages
+ python run_loader.py --input_dir ./data/raw --output_dir ./data/processed_data
+ python run_extractor.py --input_dir ./data/processed_data --output_dir ./data/processed_data
+ python run_analysis.py --input_dir ./data/processed_data --output_dir_graphs ./output/graphs --output_dir_networks ./output/networks --temporal_analysis True
+
+ # 5. Check outputs in ./data/processed_data/ and ./output/
+ ```