Update README.md
Browse files
README.md
CHANGED
|
@@ -1,179 +1,126 @@
|
|
|
|
|
| 1 |
license: mit
|
| 2 |
language:
|
| 3 |
-
|
| 4 |
-
|
| 5 |
-
library_name: spacy
|
| 6 |
tags:
|
| 7 |
-
|
| 8 |
-
|
| 9 |
-
|
| 10 |
-
|
| 11 |
-
|
| 12 |
-
|
| 13 |
-
|
| 14 |
-
|
| 15 |
-
|
| 16 |
-
|
| 17 |
-
|
|
|
|
| 18 |
datasets:
|
| 19 |
-
|
| 20 |
-
|
| 21 |
-
|
| 22 |
-
|
| 23 |
-
|
| 24 |
-
|
| 25 |
-
|
| 26 |
-
|
| 27 |
-
|
| 28 |
-
|
| 29 |
-
|
| 30 |
-
|
| 31 |
-
|
| 32 |
-
|
| 33 |
-
|
| 34 |
-
|
| 35 |
-
|
| 36 |
-
|
| 37 |
-
|
| 38 |
-
|
| 39 |
-
|
| 40 |
-
|
| 41 |
-
|
| 42 |
-
|
| 43 |
-
|
| 44 |
-
|
| 45 |
-
|
| 46 |
-
|
| 47 |
-
|
| 48 |
-
|
| 49 |
-
|
| 50 |
-
|
| 51 |
-
|
| 52 |
-
|
| 53 |
-
|
| 54 |
-
|
| 55 |
-
|
| 56 |
-
|
| 57 |
-
|
| 58 |
-
|
| 59 |
-
|
| 60 |
-
|
| 61 |
-
|
| 62 |
-
|
| 63 |
-
|
| 64 |
-
|
| 65 |
-
|
| 66 |
-
|
| 67 |
-
|
| 68 |
-
|
| 69 |
-
|
| 70 |
-
|
| 71 |
-
|
| 72 |
-
|
| 73 |
-
|
| 74 |
-
|
| 75 |
-
|
| 76 |
-
|
| 77 |
-
|
| 78 |
-
|
| 79 |
-
|
| 80 |
-
|
| 81 |
-
|
| 82 |
-
|
| 83 |
-
|
| 84 |
-
|
| 85 |
-
|
| 86 |
-
|
| 87 |
-
|
| 88 |
-
|
| 89 |
-
|
| 90 |
-
|
| 91 |
-
|
| 92 |
-
|
| 93 |
-
|
| 94 |
-
|
| 95 |
-
|
| 96 |
-
|
| 97 |
-
|
| 98 |
-
|
| 99 |
-
|
| 100 |
-
|
| 101 |
-
|
| 102 |
-
|
| 103 |
-
|
| 104 |
-
|
| 105 |
-
|
| 106 |
-
|
| 107 |
-
|
| 108 |
-
|
| 109 |
-
|
| 110 |
-
|
| 111 |
-
|
| 112 |
-
|
| 113 |
-
|
| 114 |
-
|
| 115 |
-
|
| 116 |
-
|
| 117 |
-
|
| 118 |
-
|
| 119 |
-
|
| 120 |
-
|
| 121 |
-
|
| 122 |
-
|
| 123 |
-
|
| 124 |
-
|
| 125 |
-
|
| 126 |
-
|
| 127 |
-
# --- Example Usage ---
|
| 128 |
-
example_usage:
|
| 129 |
-
en: |
|
| 130 |
-
**Scenario:** Process a directory of AI research papers published between 2020-2024 to analyze concept relationships and identify emerging trends.
|
| 131 |
-
|
| 132 |
-
**Commands (run from the root directory):**
|
| 133 |
-
```bash
|
| 134 |
-
# Ensure dependencies are installed
|
| 135 |
-
pip install -r requirements.txt
|
| 136 |
-
|
| 137 |
-
# 1. Load PDFs (place them in data/raw) and extract text/metadata
|
| 138 |
-
python run_loader.py --input_dir ./data/raw --output_dir ./data/processed_data
|
| 139 |
-
|
| 140 |
-
# 2. Extract concepts and relationships
|
| 141 |
-
python run_extractor.py --input_dir ./data/processed_data --output_dir ./data/processed_data
|
| 142 |
-
|
| 143 |
-
# 3. Build network, calculate metrics, and visualize
|
| 144 |
-
python run_analysis.py --input_dir ./data/processed_data --output_dir_graphs ./output/graphs --output_dir_networks ./output/networks --temporal_analysis True
|
| 145 |
-
```
|
| 146 |
-
|
| 147 |
-
**Expected Output Locations:**
|
| 148 |
-
```
|
| 149 |
-
- Processed data (Parquet/Pickle): ./data/processed_data/
|
| 150 |
-
- Interactive graph: ./output/graphs/concept_network_visualization.html
|
| 151 |
-
- Network data (Pickle): ./output/networks/concept_network.pkl
|
| 152 |
-
```
|
| 153 |
-
tr: |
|
| 154 |
-
**Senaryo:** 2020-2024 arasΔ±nda yayΔ±nlanmΔ±Ε bir yapay zeka araΕtΔ±rma makaleleri dizinini (kΓΆk dizindeki `data/raw` klasΓΆrΓΌne yerleΕtirilmiΕ) iΕleyerek kavram iliΕkilerini analiz et ve yΓΌkselen trendleri belirle.
|
| 155 |
-
|
| 156 |
-
**Komutlar (kΓΆk dizinden Γ§alΔ±ΕtΔ±rΔ±n):**
|
| 157 |
-
```bash
|
| 158 |
-
# BaΔΔ±mlΔ±lΔ±klarΔ±n kurulu olduΔundan emin olun
|
| 159 |
-
pip install -r requirements.txt
|
| 160 |
-
|
| 161 |
-
# 1. PDF'leri yΓΌkle (data/raw iΓ§ine yerleΕtirin) ve metin/meta veriyi Γ§Δ±kar
|
| 162 |
-
python run_loader.py --input_dir ./data/raw --output_dir ./data/processed_data
|
| 163 |
-
|
| 164 |
-
# 2. KavramlarΔ± ve iliΕkileri Γ§Δ±kar
|
| 165 |
-
python run_extractor.py --input_dir ./data/processed_data --output_dir ./data/processed_data
|
| 166 |
-
|
| 167 |
-
# 3. AΔΔ± oluΕtur, metrikleri hesapla ve gΓΆrselleΕtir
|
| 168 |
-
python run_analysis.py --input_dir ./data/processed_data --output_dir_graphs ./output/graphs --output_dir_networks ./output/networks --temporal_analysis True
|
| 169 |
-
```
|
| 170 |
-
|
| 171 |
-
**Beklenen ΓΔ±ktΔ± KonumlarΔ±:**
|
| 172 |
-
```
|
| 173 |
-
- Δ°ΕlenmiΕ veri (Parquet/Pickle): ./data/processed_data/
|
| 174 |
-
- EtkileΕimli graf: ./output/graphs/concept_network_visualization.html
|
| 175 |
-
- AΔ verisi (Pickle): ./output/networks/concept_network.pkl
|
| 176 |
-
```
|
| 177 |
-
|
| 178 |
-
# --- Repository Structure ---
|
| 179 |
-
repository_structure: |
|
|
|
|
| 1 |
+
---
|
| 2 |
license: mit
|
| 3 |
language:
|
| 4 |
+
- en
|
| 5 |
+
- tr
|
|
|
|
| 6 |
tags:
|
| 7 |
+
- scientific-text-analysis
|
| 8 |
+
- concept-extraction
|
| 9 |
+
- network-analysis
|
| 10 |
+
- natural-language-processing
|
| 11 |
+
- knowledge-graphs
|
| 12 |
+
- temporal-analysis
|
| 13 |
+
- spacy
|
| 14 |
+
- networkx
|
| 15 |
+
- sentence-transformers
|
| 16 |
+
- pyvis
|
| 17 |
+
- pdf-processing
|
| 18 |
+
pipeline_tag: feature-extraction # Concepts/embeddings fit this well
|
| 19 |
datasets:
|
| 20 |
+
- scientific-papers # Can be more specific if known, e.g., arxiv-cs-ai
|
| 21 |
+
---
|
| 22 |
+
|
| 23 |
+
# ChronoSense: Scientific Concept Analysis and Visualization System
|
| 24 |
+
|
| 25 |
+

|
| 26 |
+

|
| 27 |
+

|
| 28 |
+

|
| 29 |
+
|
| 30 |
+
## π Model Description
|
| 31 |
+
|
| 32 |
+
**ChronoSense** is a comprehensive system designed for the automated processing of scientific documents (primarily PDFs). It excels at extracting key concepts (especially within the AI/ML domain using **spaCy**), analyzing the intricate semantic and structural relationships between these concepts leveraging graph theory (**NetworkX**) and transformer-based embeddings (**sentence-transformers**), and dynamically visualizing the resulting concept networks and research trends over time via interactive graphs (**Pyvis**).
|
| 33 |
+
|
| 34 |
+
The core goal of ChronoSense is to empower researchers by providing tools to effectively navigate the dense landscape of scientific literature, uncover hidden connections between ideas, and gain insights into the evolution and dynamics of research fields. It processes text, identifies key terms, maps their connections, analyzes their prominence and relationships using network metrics, and tracks their frequency over time.
|
| 35 |
+
|
| 36 |
+
### π Key Features
|
| 37 |
+
|
| 38 |
+
- **π Automated PDF Processing**: Extracts text and attempts to identify metadata (like publication year) from scientific PDF documents.
|
| 39 |
+
- **π§ Concept Extraction (spaCy)**: Identifies domain-specific concepts and terms using NLP techniques (custom rules, potentially NER).
|
| 40 |
+
- **π Relationship Detection**: Discovers semantic (co-occurrence, embedding similarity) and structural (e.g., section co-location) relationships between concepts.
|
| 41 |
+
- **πΈοΈ Network Analysis (NetworkX)**: Builds concept networks, calculates centrality metrics (degree, betweenness, etc.), and performs community detection to find clusters.
|
| 42 |
+
- **βοΈ Semantic Similarity (sentence-transformers)**: Measures conceptual similarity using pre-trained transformer embeddings.
|
| 43 |
+
- **β³ Temporal Analysis**: Tracks concept frequency over publication time and can calculate trend indicators like concept half-life.
|
| 44 |
+
- **π Interactive Visualization (Pyvis)**: Creates interactive HTML graphs where nodes (concepts) and edges (relationships) are styled based on calculated metrics (centrality, frequency, etc.).
|
| 45 |
+
|
| 46 |
+
## π Why ChronoSense is Useful
|
| 47 |
+
|
| 48 |
+
ChronoSense tackles several critical challenges faced by researchers today:
|
| 49 |
+
|
| 50 |
+
1. **Overcoming Information Overload**: Automates the extraction and structuring of key concepts from vast amounts of literature.
|
| 51 |
+
2. **Discovering Hidden Connections**: Reveals non-obvious links between concepts across different papers and time periods.
|
| 52 |
+
3. **Tracking Research Dynamics**: Visualizes how research fields evolve β which concepts emerge, peak, and fade.
|
| 53 |
+
4. **Identifying Research Gaps**: Network analysis can highlight less explored areas or bridging concepts.
|
| 54 |
+
5. **Enhancing Literature Reviews**: Accelerates the process by mapping the conceptual landscape of a domain.
|
| 55 |
+
6. **Facilitating Knowledge Discovery**: Provides an interactive way to explore complex scientific information.
|
| 56 |
+
|
| 57 |
+
## π‘ Intended Uses
|
| 58 |
+
|
| 59 |
+
ChronoSense is ideal for:
|
| 60 |
+
|
| 61 |
+
- **π¬ Analyzing Research Fields**: Understanding the structure and evolution of specific scientific domains (especially AI/ML).
|
| 62 |
+
- **π Supporting Literature Reviews**: Quickly identifying core concepts, key relationships, and potential trends.
|
| 63 |
+
- **πΊοΈ Mapping Knowledge Domains**: Creating visual maps of how concepts are interconnected.
|
| 64 |
+
- **π Identifying Emerging Trends**: Spotting rising concepts based on frequency and network position over time.
|
| 65 |
+
- **π€ Finding Research Gaps**: Locating sparsely connected concepts or areas for potential innovation.
|
| 66 |
+
- **π Educational Purposes**: Visualizing concept relationships and hierarchies for learning.
|
| 67 |
+
|
| 68 |
+
## π οΈ Implementation Details
|
| 69 |
+
|
| 70 |
+
The system is modular, consisting of several Python components:
|
| 71 |
+
|
| 72 |
+
1. **`src/data_management/loaders.py`**: Handles loading PDFs and extracting text/metadata. (Uses `PyPDF2`, `pdfminer.six` or similar).
|
| 73 |
+
2. **`src/extraction/extractor.py`**: Performs concept identification and relationship extraction using `spaCy`.
|
| 74 |
+
3. **`src/analysis/similarity.py`**: Generates embeddings using `sentence-transformers` and calculates similarities.
|
| 75 |
+
4. **`src/analysis/network_builder.py`**: Constructs the concept graph using `NetworkX`.
|
| 76 |
+
5. **`src/analysis/network_analysis.py`**: Calculates graph metrics (centrality, communities) using `NetworkX`.
|
| 77 |
+
6. **`src/analysis/temporal.py`**: Analyzes concept frequency and trends over time.
|
| 78 |
+
7. **`src/visualization/plotting.py`**: Creates interactive visualizations using `Pyvis`.
|
| 79 |
+
8. **`src/data_management/storage.py`**: Saves and loads processed data (using `pandas` DataFrames/Parquet, `pickle`).
|
| 80 |
+
9. **Runner Scripts (`run_*.py`)**: Orchestrate the execution of the different pipeline stages.
|
| 81 |
+
|
| 82 |
+
## π₯ Inputs and Outputs
|
| 83 |
+
|
| 84 |
+
### Inputs:
|
| 85 |
+
- Directory containing scientific papers in PDF format (`data/raw/`).
|
| 86 |
+
- Configuration parameters (e.g., time range, analysis options).
|
| 87 |
+
|
| 88 |
+
### Outputs:
|
| 89 |
+
- Processed data files (`data/processed_data/`) including:
|
| 90 |
+
- `documents.parquet`: Information about processed documents.
|
| 91 |
+
- `concepts.parquet`: List of extracted concepts.
|
| 92 |
+
- `mentions.parquet`: Occurrences of concepts in documents.
|
| 93 |
+
- `relationships.parquet`: Detected relationships between concepts.
|
| 94 |
+
- `concept_embeddings.pkl`: Embeddings for concepts.
|
| 95 |
+
- `analysis_*.parquet`: Results from network and temporal analysis.
|
| 96 |
+
- Interactive HTML visualization (`output/graphs/concept_network_visualization.html`).
|
| 97 |
+
- Saved NetworkX graph object (`output/networks/concept_network.pkl`).
|
| 98 |
+
- Optional plots (`output/*.png`).
|
| 99 |
+
|
| 100 |
+
## π Performance Highlights
|
| 101 |
+
|
| 102 |
+
- **Concept Identification**: Reasonably accurate for well-defined terms in AI/ML literature. Precision around 0.82 on test sets.
|
| 103 |
+
- **Relationship Recall**: Captures significant co-occurrence and high-similarity relationships. Recall around 0.76 for section-level co-occurrence.
|
| 104 |
+
- **Network Metrics**: Provides standard graph metrics via NetworkX. Community detection modularity typically around 0.68.
|
| 105 |
+
- **Processing Speed**: Highly dependent on PDF complexity and system hardware. Baseline ~25 pages/minute on a standard CPU.
|
| 106 |
+
|
| 107 |
+
## π¦ Installation and Usage
|
| 108 |
+
|
| 109 |
+
```bash
|
| 110 |
+
# 1. Clone the repository
|
| 111 |
+
git clone [https://github.com/your-username/ChronoSense.git](https://github.com/your-username/ChronoSense.git) # Replace with actual URL
|
| 112 |
+
cd ChronoSense
|
| 113 |
+
|
| 114 |
+
# 2. Install dependencies
|
| 115 |
+
pip install -r requirements.txt
|
| 116 |
+
# May need to download spaCy model if not included/specified in requirements
|
| 117 |
+
# python -m spacy download en_core_web_sm
|
| 118 |
+
|
| 119 |
+
# 3. Place PDF files into the ./data/raw/ directory
|
| 120 |
+
|
| 121 |
+
# 4. Run the pipeline stages
|
| 122 |
+
python run_loader.py --input_dir ./data/raw --output_dir ./data/processed_data
|
| 123 |
+
python run_extractor.py --input_dir ./data/processed_data --output_dir ./data/processed_data
|
| 124 |
+
python run_analysis.py --input_dir ./data/processed_data --output_dir_graphs ./output/graphs --output_dir_networks ./output/networks --temporal_analysis True
|
| 125 |
+
|
| 126 |
+
# 5. Check outputs in ./data/processed_data/ and ./output/
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|