# 3GPP-Innovation-Extractor
<div align="center">
<img width="1200" height="475" alt="GHBanner" src="https://github.com/user-attachments/assets/0aa67016-6eaf-458a-adb2-6e31a0763ed6" />
</div>
## Run Locally
**Prerequisites:** Node.js
1. Install dependencies:
   `npm install`
2. Set the `GEMINI_API_KEY` in [.env.local](.env.local) to your Gemini API key.
3. Run the app:
   `npm run dev`
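The `.env.local` file referenced in step 2 is a plain key-value file read at startup; a minimal example (the key value shown is a placeholder):

```shell
# .env.local — placeholder value, replace with your own Gemini API key
GEMINI_API_KEY=your_api_key_here
```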
## Workflow Overview
```mermaid
flowchart TD
subgraph S1 [Phase 1: Data Ingestion]
A[User Selects Working Group] -->|SA1-6, RAN1-2| B[Fetch Meetings via POST]
B --> C[User Selects Meeting]
C --> D[Filter Docs by Metadata]
D --> E[Extract Raw Text]
end
subgraph S2 [Phase 2: Refinement & Caching]
E --> F{Text in Cache?}
F -- Yes --> G[Retrieve Cached Refinement]
F -- No --> H[LLM Processing]
H --> I[Task: Dense Chunking & 'What's New']
I --> J[Store in Dataset]
J --> G
end
subgraph S3 [Phase 3: Pattern Analysis]
G --> K[User Selects Pattern/Prompt]
K --> L{Result in Cache?}
L -- Yes --> M[Retrieve Analysis]
L -- No --> N[Execute Pattern]
N --> O[Multi-Model Verification]
O --> P[Store Result]
end
S1 --> S2 --> S3
```
### Detailed Process Specification
#### Phase 1: Data Ingestion & Extraction
The user navigates a strict hierarchy to isolate relevant source text.
1. **Working Group Selection:** User selects one group from the allowlist: `['SA1', 'SA2', 'SA3', 'SA4', 'SA5', 'SA6', 'RAN1', 'RAN2']`.
2. **Meeting Retrieval:** System sends a `POST` request, with the selected Working Group in the request body, to the meeting-list endpoint to retrieve the available meetings.
3. **Document Filtering:** User selects a meeting, then filters the resulting file list using available metadata.
4. **Text Extraction:** System extracts raw content from the filtered files into a text list.
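The selection and filtering steps above can be sketched as follows. The endpoint URL, the `DocMeta` shape, and the function names are illustrative assumptions; the real API and metadata fields may differ.

```typescript
// Phase 1 sketch: allowlist check, meeting retrieval, metadata filtering.
const WORKING_GROUPS = ['SA1', 'SA2', 'SA3', 'SA4', 'SA5', 'SA6', 'RAN1', 'RAN2'] as const;
type WorkingGroup = typeof WORKING_GROUPS[number];

// Assumed document metadata shape — the real file list may expose other fields.
interface DocMeta {
  id: string;
  title: string;
  type: string;    // e.g. 'pCR', 'CR', 'discussion'
  source: string;  // submitting company
}

// Reject anything outside the allowlist before issuing the POST.
function isAllowedGroup(group: string): group is WorkingGroup {
  return (WORKING_GROUPS as readonly string[]).includes(group);
}

// Hypothetical meeting-list fetch; the URL is a placeholder.
async function fetchMeetings(group: WorkingGroup): Promise<string[]> {
  const res = await fetch('https://example.org/api/meetings', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ workingGroup: group }),
  });
  return res.json();
}

// Keep only documents whose metadata matches every requested field.
function filterDocs(docs: DocMeta[], wanted: Partial<DocMeta>): DocMeta[] {
  return docs.filter(d =>
    Object.entries(wanted).every(([k, v]) => d[k as keyof DocMeta] === v)
  );
}
```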
#### Phase 2: Content Refinement (with Caching)
Raw text is processed into high-value summaries to reduce noise.
* **Cache Check:** Before processing, check the dataset for existing `(text_hash, refined_output)` pairs to prevent duplicate processing.
* **LLM Processing:** If not cached, pass the text to the selected LLM (a default is provided; the user can change it).
* **Prompt Objective:**
1. Create information-dense chunks (minimize near-duplicates).
2. Generate a "What's New" paragraph wrapped in `SUGGESTION START` and `SUGGESTION END` tags.
* **Storage:** Save the input text and the LLM output to the dataset.
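The cache-then-refine flow above can be sketched as below. The LLM call is stubbed out as a `refineFn` parameter, and the `Map`-backed cache is illustrative — the real project persists `(text_hash, refined_output)` pairs to a dataset.

```typescript
// Phase 2 sketch: hash the raw text, reuse a cached refinement if present,
// otherwise call the (stubbed) LLM and store the result.
import { createHash } from 'node:crypto';

function hashText(text: string): string {
  return createHash('sha256').update(text, 'utf8').digest('hex');
}

// In-memory stand-in for the project's dataset: text_hash -> refined_output.
const refinementCache = new Map<string, string>();

async function refineWithCache(
  text: string,
  refineFn: (raw: string) => Promise<string>, // e.g. a Gemini call
): Promise<string> {
  const key = hashText(text);
  const hit = refinementCache.get(key);
  if (hit !== undefined) return hit;        // cached refinement, skip the LLM
  const refined = await refineFn(text);     // dense chunks + "What's New" block
  refinementCache.set(key, refined);        // store (text_hash, refined_output)
  return refined;
}
```

Hashing the input rather than comparing full texts keeps the cache lookup cheap even for long contributions.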
#### Phase 3: Pattern Analysis & Verification
Refined text is analyzed using specific user-defined patterns.
* **Pattern Selection:** User applies a specific prompt/pattern to the refined documents.
* **Cache Check:** Check the results database for existing `(document_id, pattern_id)` results.
* **Execution & Verification:**
* Run the selected pattern against the documents.
* **Verifier Mode:** Optionally execute the same input across multiple models simultaneously to compare results and ensure accuracy.
* **Storage:** Save the final analysis in the database to prevent future re-computation.
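Verifier Mode can be sketched as running the same prompt against several models in parallel and checking agreement. The `ModelRunner` signature and the normalization rule are assumptions for illustration; the real project wires this to its own model providers.

```typescript
// Phase 3 sketch: execute the same input across multiple models
// simultaneously and flag whether their answers agree.
type ModelRunner = (prompt: string, doc: string) => Promise<string>;

interface VerifiedResult {
  answers: string[]; // one answer per model, same order as runners
  agreed: boolean;   // true if all models returned the same normalized text
}

// Naive normalization so trivial whitespace/case differences don't count
// as disagreement; a real comparison might be semantic instead.
function normalize(s: string): string {
  return s.trim().toLowerCase();
}

async function runWithVerification(
  prompt: string,
  doc: string,
  runners: ModelRunner[],
): Promise<VerifiedResult> {
  // Fire all model calls at once rather than sequentially.
  const answers = await Promise.all(runners.map(run => run(prompt, doc)));
  const first = normalize(answers[0] ?? '');
  const agreed = answers.every(a => normalize(a) === first);
  return { answers, agreed };
}
```

A disagreement (`agreed: false`) is a signal to review the pattern output before storing it as a final result.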