LiamKhoaLe committed on
Commit 80cb919 · 0 Parent(s)
Files changed (21)
  1. .DS_Store +0 -0
  2. .gitattributes +35 -0
  3. .gitignore +4 -0
  4. DATA_PROCESSING.md +250 -0
  5. Dockerfile +31 -0
  6. LICENSE.txt +201 -0
  7. README.md +32 -0
  8. REQUEST.md +156 -0
  9. app.py +423 -0
  10. mount_drive.py +9 -0
  11. requirements.txt +13 -0
  12. utils/__init__.py +22 -0
  13. utils/.DS_Store +0 -0
  14. utils/augment.py +105 -0
  15. utils/datasets.py +66 -0
  16. utils/drive_saver.py +88 -0
  17. utils/llm.py +186 -0
  18. utils/processor.py +411 -0
  19. utils/rag.py +345 -0
  20. utils/schema.py +68 -0
  21. utils/token.py +107 -0
.DS_Store ADDED
Binary file (6.15 kB).
.gitattributes ADDED
@@ -0,0 +1,35 @@
+ *.7z filter=lfs diff=lfs merge=lfs -text
+ *.arrow filter=lfs diff=lfs merge=lfs -text
+ *.bin filter=lfs diff=lfs merge=lfs -text
+ *.bz2 filter=lfs diff=lfs merge=lfs -text
+ *.ckpt filter=lfs diff=lfs merge=lfs -text
+ *.ftz filter=lfs diff=lfs merge=lfs -text
+ *.gz filter=lfs diff=lfs merge=lfs -text
+ *.h5 filter=lfs diff=lfs merge=lfs -text
+ *.joblib filter=lfs diff=lfs merge=lfs -text
+ *.lfs.* filter=lfs diff=lfs merge=lfs -text
+ *.mlmodel filter=lfs diff=lfs merge=lfs -text
+ *.model filter=lfs diff=lfs merge=lfs -text
+ *.msgpack filter=lfs diff=lfs merge=lfs -text
+ *.npy filter=lfs diff=lfs merge=lfs -text
+ *.npz filter=lfs diff=lfs merge=lfs -text
+ *.onnx filter=lfs diff=lfs merge=lfs -text
+ *.ot filter=lfs diff=lfs merge=lfs -text
+ *.parquet filter=lfs diff=lfs merge=lfs -text
+ *.pb filter=lfs diff=lfs merge=lfs -text
+ *.pickle filter=lfs diff=lfs merge=lfs -text
+ *.pkl filter=lfs diff=lfs merge=lfs -text
+ *.pt filter=lfs diff=lfs merge=lfs -text
+ *.pth filter=lfs diff=lfs merge=lfs -text
+ *.rar filter=lfs diff=lfs merge=lfs -text
+ *.safetensors filter=lfs diff=lfs merge=lfs -text
+ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
+ *.tar.* filter=lfs diff=lfs merge=lfs -text
+ *.tar filter=lfs diff=lfs merge=lfs -text
+ *.tflite filter=lfs diff=lfs merge=lfs -text
+ *.tgz filter=lfs diff=lfs merge=lfs -text
+ *.wasm filter=lfs diff=lfs merge=lfs -text
+ *.xz filter=lfs diff=lfs merge=lfs -text
+ *.zip filter=lfs diff=lfs merge=lfs -text
+ *.zst filter=lfs diff=lfs merge=lfs -text
+ *tfevents* filter=lfs diff=lfs merge=lfs -text
.gitignore ADDED
@@ -0,0 +1,4 @@
+ .env
+ client1.json
+ client2.json
+ medai.json
DATA_PROCESSING.md ADDED
@@ -0,0 +1,250 @@
+ # 📊 MedAI Data Processing Techniques
+
+ This document outlines the data processing techniques implemented in the MedAI Processing project for augmenting medical datasets and centralizing them for LLM fine-tuning.
+
+ ## 🎯 Project Overview
+
+ The MedAI Processing system transforms raw medical datasets into a **centralized fine-tuning format** (JSONL + CSV) with comprehensive data augmentation capabilities. The system processes multiple medical dataset types and applies various enhancement techniques to improve data quality and diversity.
+
+ ## 🏗️ System Architecture
+
+ ### Core Components
+ - **FastAPI Web Service**: RESTful API for dataset processing
+ - **Multi-LLM Rotator**: NVIDIA API + Google Gemini integration
+ - **Centralized Writer**: Parallel JSONL + CSV output generation
+ - **Google Drive Integration**: Automated artifact storage
+ - **Progress Monitoring**: Real-time job status tracking
+
+ ### Supported Datasets
+ 1. **HealthCareMagic** (100k medical dialogues)
+ 2. **iCliniq** (10k medical consultations)
+ 3. **PubMedQA-Labelled** (biomedical Q&A with answers)
+ 4. **PubMedQA-Unlabelled** (biomedical Q&A without answers)
+ 5. **PubMedQA-Map** (biomedical Q&A mapping format)
+
+ ## 🔧 Data Processing Pipeline
+
+ ### 1. Data Ingestion & Download
+ - **Hugging Face Hub Integration**: Automatic dataset downloading
+ - **Format Detection**: JSON/JSONL auto-detection and parsing
+ - **Caching System**: Local storage with symlink optimization
+
+ ### 2. Data Cleaning & Preprocessing
+
+ #### Text Normalization
+ - **Unicode Fixing**: `ftfy` library for text encoding issues
+ - **Whitespace Standardization**: Consistent spacing and line breaks
+ - **Quote Canonicalization**: Standard quote character conversion
+ - **Terminal Punctuation**: Ensures proper sentence endings
+
+ #### Content Sanitization
+ - **Length Capping**: Configurable maximum character limits (default: 5000)
+ - **Language Detection**: English language validation using `langid`
+ - **Content Truncation**: Smart sentence boundary cutting for long texts
+
+ ### 3. Data Augmentation Techniques
+
+ #### LLM-Based Paraphrasing
+ - **Multi-Model Rotation**: NVIDIA API (primary) + Gemini (fallback); see the sketch below
+ - **Difficulty Levels**: Easy vs. Hard paraphrasing modes
+ - **Medical Context Preservation**: Maintains clinical terminology accuracy
+ - **Configurable Ratios**: User-defined augmentation percentages (0.0-1.0)
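+
+ A minimal sketch of the rotation pattern, assuming separate NVIDIA and Gemini callables; the project's actual logic lives in `utils/llm.py` and may differ:
+
+ ```python
+ # Illustrative primary/fallback rotation; nvidia_call and gemini_call
+ # are assumed callables, not the project's real function names.
+ def paraphrase_with_fallback(text: str, difficulty: str, nvidia_call, gemini_call) -> str:
+     try:
+         return nvidia_call(text, difficulty)   # primary: NVIDIA API
+     except Exception:
+         return gemini_call(text, difficulty)   # fallback: Google Gemini
+ ```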
+
+ #### Back-Translation Augmentation
+ - **Multi-Language Support**: German as intermediate language (see the sketch below)
+ - **Meaning Preservation**: Maintains semantic accuracy through translation cycles
+ - **Fallback Mechanisms**: Automatic retry with alternative models
+ - **Quality Control**: Length and content validation
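+
+ A minimal sketch of the round trip, assuming a generic `translate` callable; the project routes this through `paraphraser.backtranslate(text, via_lang="de")` in `utils/augment.py`:
+
+ ```python
+ # Hypothetical EN -> DE -> EN round trip; `translate` is an assumed
+ # callable, and the 50% length gate is an illustrative quality check.
+ def backtranslate(text: str, translate, via_lang: str = "de") -> str:
+     pivot = translate(text, source="en", target=via_lang)   # forward pass
+     back = translate(pivot, source=via_lang, target="en")   # return pass
+     return back if back and len(back) >= 0.5 * len(text) else text
+ ```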
+
+ #### Style Standardization
+ - **Clinical Voice Enforcement**: Neutral, professional medical tone
+ - **Absolute Language Removal**: Replaces guarantees with probabilistic language
+ - **Forum Sign-off Removal**: Eliminates informal communication patterns
+ - **Consistent Punctuation**: Standardized sentence structure
+
+ ### 4. Data Quality Assurance
+
+ #### De-identification (PHI Removal)
+ - **Email Redaction**: `[REDACTED_EMAIL]` placeholder
+ - **Phone Number Masking**: `[REDACTED_PHONE]` placeholder
+ - **URL/IP Address Removal**: `[REDACTED_URL]` and `[REDACTED_IP]` placeholders
+ - **Configurable Privacy**: Optional PHI removal per dataset
+
+ #### Deduplication
+ - **Fingerprinting Algorithm**: MD5-based content hashing (see the example below)
+ - **Multi-Field Matching**: Instruction + Input + Output combination
+ - **Normalized Comparison**: Case-insensitive, whitespace-normalized matching
+ - **Performance Optimized**: In-memory set-based deduplication
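+
+ This is essentially the `fingerprint` helper from `utils/augment.py` (included later in this commit), combined with an in-memory set:
+
+ ```python
+ import hashlib
+ import re
+
+ def fingerprint(instr: str, user: str, out: str) -> str:
+     # Lowercase, strip non-alphanumerics, collapse whitespace, then hash.
+     def norm(x: str) -> str:
+         return re.sub(r"\s+", " ", re.sub(r"[^a-z0-9]+", " ", x.lower())).strip()
+     core = "||".join([norm(instr), norm(user), norm(out)])
+     return hashlib.md5(core.encode("utf-8")).hexdigest()
+
+ seen = set()
+ def is_duplicate(instr: str, user: str, out: str) -> bool:
+     fp = fingerprint(instr, user, out)
+     if fp in seen:
+         return True          # skip: already written
+     seen.add(fp)
+     return False
+ ```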
+
+ #### Consistency Validation
+ - **LLM-Based QA Check**: Automated answer validation against context (sketched below)
+ - **Configurable Sampling**: Ratio-based consistency checking (e.g., 0.01)
+ - **Medical Safety Validation**: Ensures clinical accuracy and safety
+ - **Failure Tagging**: Marks samples with consistency issues
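+
+ A hypothetical prompt-based check is sketched below; the real implementation is `paraphraser.consistency_check` in `utils/llm.py`, which is not shown in this excerpt:
+
+ ```python
+ # Illustrative YES/NO validation prompt; `llm_call` is an assumed callable.
+ def consistency_check(question: str, answer: str, llm_call) -> bool:
+     prompt = (
+         "Does the answer address the question accurately and safely for a "
+         f"medical context? Reply YES or NO.\nQuestion: {question}\nAnswer: {answer}"
+     )
+     return llm_call(prompt).strip().upper().startswith("YES")
+ ```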
+
+ ### 5. Advanced Augmentation Features
+
+ #### Knowledge Distillation
+ - **Pseudo-Label Generation**: Creates labels for unlabeled data
+ - **Fractional Processing**: Configurable percentage for distillation
+ - **Single-Prompt Approach**: Efficient single LLM call per sample (sketched below)
+ - **Length Control**: Maintains reasonable output lengths
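+
+ A minimal sketch of the idea, assuming a generic `llm_call`; the project's actual distillation prompt and length limits live in `utils/processor.py` (not shown here):
+
+ ```python
+ import random
+
+ # Illustrative pseudo-labelling: only a configurable fraction of the
+ # unlabeled samples gets one LLM call each.
+ def maybe_distill(question: str, context: str, llm_call, fraction: float):
+     if random.random() >= fraction:
+         return None                       # leave this sample unlabeled
+     prompt = (
+         "Answer concisely using only the context.\n"
+         f"Context: {context}\nQuestion: {question}"
+     )
+     return llm_call(prompt)[:1000]        # assumed cap to keep outputs short
+ ```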
+
+ #### Multi-Variant Generation
+ - **Configurable Counts**: 1-3 augmented variants per sample
+ - **Tagged Augmentations**: Tracks applied augmentation techniques
+ - **Original Preservation**: Always maintains base sample
+ - **Randomized IDs**: Unique identifiers for augmented variants
+
+ ### 6. Output Generation & Storage
+
+ #### Centralized Format
+ - **SFT Schema**: Standardized Supervised Fine-Tuning format
+ - **Metadata Preservation**: Source, task type, and augmentation tags
+ - **Dual Output**: Simultaneous JSONL and CSV generation (sketched below)
+ - **Memory-Safe Streaming**: Handles large datasets efficiently
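+
+ A minimal sketch of such a writer is shown below. The real `CentralisedWriter` lives in `utils/schema.py`, which is not shown in this excerpt, so the field names here are assumptions:
+
+ ```python
+ import csv
+ import json
+
+ class DualWriter:
+     """Stream each row to JSONL and CSV so large datasets never sit in memory."""
+     def __init__(self, jsonl_path: str, csv_path: str):
+         self.jf = open(jsonl_path, "w", encoding="utf-8")
+         self.cf = open(csv_path, "w", newline="", encoding="utf-8")
+         self.cw = csv.DictWriter(self.cf, fieldnames=["instruction", "input", "output"])
+         self.cw.writeheader()
+
+     def write(self, row: dict) -> None:
+         self.jf.write(json.dumps(row, ensure_ascii=False) + "\n")   # full record + metadata
+         self.cw.writerow({k: row.get(k, "") for k in ("instruction", "input", "output")})
+
+     def close(self) -> None:
+         self.jf.close()
+         self.cf.close()
+ ```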
+
+ #### Storage Integration
+ - **Local Caching**: `cache/outputs/` directory storage
+ - **Google Drive Upload**: Automated cloud storage integration
+ - **Timestamped Naming**: Unique file identification
+ - **MIME Type Handling**: Proper content type specification
+
+ ## ⚙️ Configuration Options
+
+ ### Augmentation Parameters
+ ```python
+ class AugmentOptions:
+     paraphrase_ratio: float = 0.0         # 0.0-1.0
+     paraphrase_outputs: bool = False      # Augment model answers
+     backtranslate_ratio: float = 0.0      # 0.0-1.0
+     style_standardize: bool = True        # Enforce clinical style
+     deidentify: bool = True               # Remove PHI
+     dedupe: bool = True                   # Remove duplicates
+     max_chars: int = 5000                 # Text length limit
+     consistency_check_ratio: float = 0.0  # 0.0-1.0
+     distill_fraction: float = 0.0         # 0.0-1.0 for unlabeled
+     expand: bool = True                   # Enable augmentation
+     max_aug_per_sample: int = 2           # 1-3 variants
+ ```
+
+ ### Processing Parameters
+ ```python
+ class ProcessParams:
+     augment: AugmentOptions             # Augmentation settings
+     sample_limit: Optional[int] = None  # Dataset sampling
+     seed: int = 42                      # Reproducibility
+ ```
+
+ ## 📈 Performance & Monitoring
+
+ ### Progress Tracking
+ - **Real-time Updates**: Live progress percentage and status messages
+ - **Background Processing**: Non-blocking job execution
+ - **State Management**: Thread-safe status tracking
+ - **Error Handling**: Comprehensive exception logging
+
+ ### Resource Management
+ - **API Key Rotation**: Automatic fallback between multiple API keys
+ - **Rate Limiting**: Configurable request throttling
+ - **Memory Optimization**: Streaming processing for large datasets
+ - **Concurrent Processing**: Background task execution
+
+ ## 🔒 Security & Privacy
+
+ ### Data Protection
+ - **PHI Removal**: Automatic sensitive information redaction
+ - **Secure Storage**: Google Drive integration with OAuth2
+ - **Access Control**: Environment-based API key management
+ - **Audit Logging**: Comprehensive processing logs
+
+ ### API Security
+ - **OAuth2 Integration**: Google Drive authentication
+ - **Token Management**: Secure credential handling
+ - **Request Validation**: Pydantic model validation
+ - **Error Sanitization**: Safe error message handling
+
+ ## 🚀 Usage Examples
+
+ ### Basic Processing
+ ```bash
+ # Process HealthCareMagic with default settings
+ curl -X POST \
+   -H "Content-Type: application/json" \
+   -d '{"augment": {"paraphrase_ratio": 0.1}}' \
+   https://binkhoale1812-medai-processing.hf.space/process/healthcaremagic
+ ```
+
+ ### Advanced Augmentation
+ ```bash
+ # Process with comprehensive augmentation
+ curl -X POST \
+   -H "Content-Type: application/json" \
+   -d '{
+     "augment": {
+       "paraphrase_ratio": 0.2,
+       "backtranslate_ratio": 0.1,
+       "paraphrase_outputs": true,
+       "style_standardize": true,
+       "deidentify": true,
+       "dedupe": true,
+       "max_chars": 5000,
+       "consistency_check_ratio": 0.01,
+       "max_aug_per_sample": 3
+     },
+     "sample_limit": 1000,
+     "seed": 42
+   }' \
+   https://binkhoale1812-medai-processing.hf.space/process/icliniq
+ ```
+
+ ## 📊 Output Statistics
+
+ ### Processing Metrics
+ - **Written Rows**: Total processed samples
+ - **Paraphrased Inputs**: Count of augmented user inputs
+ - **Paraphrased Outputs**: Count of augmented model responses
+ - **Back-translated**: Count of translation-augmented samples
+ - **Deduplication**: Count of skipped duplicate samples
+ - **Consistency Failures**: Count of validation failures
+
+ ### File Outputs
+ - **JSONL Format**: Structured fine-tuning data with metadata
+ - **CSV Format**: Simplified tabular representation
+ - **Google Drive**: Cloud storage with automatic upload
+ - **Local Cache**: Persistent local storage
+
+ ## 🔮 Future Enhancements
+
+ ### Planned Features
+ - **Additional Dataset Support**: More medical dataset types
+ - **Advanced Augmentation**: More sophisticated LLM techniques
+ - **Quality Metrics**: Automated data quality scoring
+ - **Batch Processing**: Concurrent processing of multiple datasets
+ - **Custom Schemas**: User-defined output formats
+
+ ### Scalability Improvements
+ - **Distributed Processing**: Multi-node processing support
+ - **Streaming Augmentation**: Real-time data enhancement
+ - **Caching Optimization**: Improved performance and cost efficiency
+ - **API Rate Limiting**: Better resource management
+
+ ## 📚 Technical Dependencies
+
+ ### Core Libraries
+ - **FastAPI**: Web framework for API development
+ - **Hugging Face Hub**: Dataset downloading and management
+ - **Google GenAI**: Gemini model integration
+ - **ftfy**: Text encoding and normalization
+ - **langid**: Language detection
+ - **orjson**: High-performance JSON processing
+
+ ### External Services
+ - **NVIDIA API**: Primary LLM service for paraphrasing
+ - **Google Gemini**: Fallback LLM service
+ - **Google Drive**: Cloud storage integration
+ - **Hugging Face Spaces**: Deployment platform
+
+ ---
+
+ *This document provides a comprehensive overview of the data processing techniques implemented in the MedAI Processing project. For specific implementation details, refer to the individual module files in the `utils/` directory.*
Dockerfile ADDED
@@ -0,0 +1,31 @@
+ FROM python:3.11-slim
+
+ # Install system dependencies as root (no sudo!)
+ RUN apt-get update && apt-get install -y --no-install-recommends \
+     ca-certificates curl && rm -rf /var/lib/apt/lists/*
+
+ # Create non-root user
+ RUN useradd -m -u 1000 user
+ ENV HOME=/home/user
+ WORKDIR $HOME/app
+
+ # Install Python dependencies first (better layer caching)
+ COPY --chown=user requirements.txt .
+ RUN pip install --upgrade pip && pip install --no-cache-dir -r requirements.txt
+
+ # Copy the application
+ COPY --chown=user . .
+
+ # Hugging Face cache setup
+ ENV HF_HOME="$HOME/.cache/huggingface"
+ ENV SENTENCE_TRANSFORMERS_HOME="$HOME/.cache/huggingface/sentence-transformers"
+ ENV MEDGEMMA_HOME="$HOME/.cache/huggingface/sentence-transformers"
+
+ # Prepare runtime dirs
+ RUN mkdir -p $HOME/app/logs $HOME/app/cache $HOME/app/cache/hf $HOME/app/cache/outputs && \
+     chown -R user:user $HOME/app
+
+ USER user
+
+ EXPOSE 7860
+ CMD ["uvicorn", "app:app", "--host", "0.0.0.0", "--port", "7860", "--workers", "1"]
LICENSE.txt ADDED
@@ -0,0 +1,201 @@
+                                  Apache License
+                            Version 2.0, January 2004
+                         http://www.apache.org/licenses/
+
+    TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
+
+    1. Definitions.
+
+       "License" shall mean the terms and conditions for use, reproduction,
+       and distribution as defined by Sections 1 through 9 of this document.
+
+       "Licensor" shall mean the copyright owner or entity authorized by
+       the copyright owner that is granting the License.
+
+       "Legal Entity" shall mean the union of the acting entity and all
+       other entities that control, are controlled by, or are under common
+       control with that entity. For the purposes of this definition,
+       "control" means (i) the power, direct or indirect, to cause the
+       direction or management of such entity, whether by contract or
+       otherwise, or (ii) ownership of fifty percent (50%) or more of the
+       outstanding shares, or (iii) beneficial ownership of such entity.
+
+       "You" (or "Your") shall mean an individual or Legal Entity
+       exercising permissions granted by this License.
+
+       "Source" form shall mean the preferred form for making modifications,
+       including but not limited to software source code, documentation
+       source, and configuration files.
+
+       "Object" form shall mean any form resulting from mechanical
+       transformation or translation of a Source form, including but
+       not limited to compiled object code, generated documentation,
+       and conversions to other media types.
+
+       "Work" shall mean the work of authorship, whether in Source or
+       Object form, made available under the License, as indicated by a
+       copyright notice that is included in or attached to the work
+       (an example is provided in the Appendix below).
+
+       "Derivative Works" shall mean any work, whether in Source or Object
+       form, that is based on (or derived from) the Work and for which the
+       editorial revisions, annotations, elaborations, or other modifications
+       represent, as a whole, an original work of authorship. For the purposes
+       of this License, Derivative Works shall not include works that remain
+       separable from, or merely link (or bind by name) to the interfaces of,
+       the Work and Derivative Works thereof.
+
+       "Contribution" shall mean any work of authorship, including
+       the original version of the Work and any modifications or additions
+       to that Work or Derivative Works thereof, that is intentionally
+       submitted to Licensor for inclusion in the Work by the copyright owner
+       or by an individual or Legal Entity authorized to submit on behalf of
+       the copyright owner. For the purposes of this definition, "submitted"
+       means any form of electronic, verbal, or written communication sent
+       to the Licensor or its representatives, including but not limited to
+       communication on electronic mailing lists, source code control systems,
+       and issue tracking systems that are managed by, or on behalf of, the
+       Licensor for the purpose of discussing and improving the Work, but
+       excluding communication that is conspicuously marked or otherwise
+       designated in writing by the copyright owner as "Not a Contribution."
+
+       "Contributor" shall mean Licensor and any individual or Legal Entity
+       on behalf of whom a Contribution has been received by Licensor and
+       subsequently incorporated within the Work.
+
+    2. Grant of Copyright License. Subject to the terms and conditions of
+       this License, each Contributor hereby grants to You a perpetual,
+       worldwide, non-exclusive, no-charge, royalty-free, irrevocable
+       copyright license to reproduce, prepare Derivative Works of,
+       publicly display, publicly perform, sublicense, and distribute the
+       Work and such Derivative Works in Source or Object form.
+
+    3. Grant of Patent License. Subject to the terms and conditions of
+       this License, each Contributor hereby grants to You a perpetual,
+       worldwide, non-exclusive, no-charge, royalty-free, irrevocable
+       (except as stated in this section) patent license to make, have made,
+       use, offer to sell, sell, import, and otherwise transfer the Work,
+       where such license applies only to those patent claims licensable
+       by such Contributor that are necessarily infringed by their
+       Contribution(s) alone or by combination of their Contribution(s)
+       with the Work to which such Contribution(s) was submitted. If You
+       institute patent litigation against any entity (including a
+       cross-claim or counterclaim in a lawsuit) alleging that the Work
+       or a Contribution incorporated within the Work constitutes direct
+       or contributory patent infringement, then any patent licenses
+       granted to You under this License for that Work shall terminate
+       as of the date such litigation is filed.
+
+    4. Redistribution. You may reproduce and distribute copies of the
+       Work or Derivative Works thereof in any medium, with or without
+       modifications, and in Source or Object form, provided that You
+       meet the following conditions:
+
+       (a) You must give any other recipients of the Work or
+           Derivative Works a copy of this License; and
+
+       (b) You must cause any modified files to carry prominent notices
+           stating that You changed the files; and
+
+       (c) You must retain, in the Source form of any Derivative Works
+           that You distribute, all copyright, patent, trademark, and
+           attribution notices from the Source form of the Work,
+           excluding those notices that do not pertain to any part of
+           the Derivative Works; and
+
+       (d) If the Work includes a "NOTICE" text file as part of its
+           distribution, then any Derivative Works that You distribute must
+           include a readable copy of the attribution notices contained
+           within such NOTICE file, excluding those notices that do not
+           pertain to any part of the Derivative Works, in at least one
+           of the following places: within a NOTICE text file distributed
+           as part of the Derivative Works; within the Source form or
+           documentation, if provided along with the Derivative Works; or,
+           within a display generated by the Derivative Works, if and
+           wherever such third-party notices normally appear. The contents
+           of the NOTICE file are for informational purposes only and
+           do not modify the License. You may add Your own attribution
+           notices within Derivative Works that You distribute, alongside
+           or as an addendum to the NOTICE text from the Work, provided
+           that such additional attribution notices cannot be construed
+           as modifying the License.
+
+       You may add Your own copyright statement to Your modifications and
+       may provide additional or different license terms and conditions
+       for use, reproduction, or distribution of Your modifications, or
+       for any such Derivative Works as a whole, provided Your use,
+       reproduction, and distribution of the Work otherwise complies with
+       the conditions stated in this License.
+
+    5. Submission of Contributions. Unless You explicitly state otherwise,
+       any Contribution intentionally submitted for inclusion in the Work
+       by You to the Licensor shall be under the terms and conditions of
+       this License, without any additional terms or conditions.
+       Notwithstanding the above, nothing herein shall supersede or modify
+       the terms of any separate license agreement you may have executed
+       with Licensor regarding such Contributions.
+
+    6. Trademarks. This License does not grant permission to use the trade
+       names, trademarks, service marks, or product names of the Licensor,
+       except as required for reasonable and customary use in describing the
+       origin of the Work and reproducing the content of the NOTICE file.
+
+    7. Disclaimer of Warranty. Unless required by applicable law or
+       agreed to in writing, Licensor provides the Work (and each
+       Contributor provides its Contributions) on an "AS IS" BASIS,
+       WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
+       implied, including, without limitation, any warranties or conditions
+       of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
+       PARTICULAR PURPOSE. You are solely responsible for determining the
+       appropriateness of using or redistributing the Work and assume any
+       risks associated with Your exercise of permissions under this License.
+
+    8. Limitation of Liability. In no event and under no legal theory,
+       whether in tort (including negligence), contract, or otherwise,
+       unless required by applicable law (such as deliberate and grossly
+       negligent acts) or agreed to in writing, shall any Contributor be
+       liable to You for damages, including any direct, indirect, special,
+       incidental, or consequential damages of any character arising as a
+       result of this License or out of the use or inability to use the
+       Work (including but not limited to damages for loss of goodwill,
+       work stoppage, computer failure or malfunction, or any and all
+       other commercial damages or losses), even if such Contributor
+       has been advised of the possibility of such damages.
+
+    9. Accepting Warranty or Additional Liability. While redistributing
+       the Work or Derivative Works thereof, You may choose to offer,
+       and charge a fee for, acceptance of support, warranty, indemnity,
+       or other liability obligations and/or rights consistent with this
+       License. However, in accepting such obligations, You may act only
+       on Your own behalf and on Your sole responsibility, not on behalf
+       of any other Contributor, and only if You agree to indemnify,
+       defend, and hold each Contributor harmless for any liability
+       incurred by, or claims asserted against, such Contributor by reason
+       of your accepting any such warranty or additional liability.
+
+    END OF TERMS AND CONDITIONS
+
+    APPENDIX: How to apply the Apache License to your work.
+
+       To apply the Apache License to your work, attach the following
+       boilerplate notice, with the fields enclosed by brackets "[]"
+       replaced with your own identifying information. (Don't include
+       the brackets!) The text should be enclosed in the appropriate
+       comment syntax for the file format. We also recommend that a
+       file or class name and description of purpose be included on the
+       same "printed page" as the copyright notice for easier
+       identification within third-party archives.
+
+    Copyright 2025 Dang Khoa Le
+
+    Licensed under the Apache License, Version 2.0 (the "License");
+    you may not use this file except in compliance with the License.
+    You may obtain a copy of the License at
+
+        http://www.apache.org/licenses/LICENSE-2.0
+
+    Unless required by applicable law or agreed to in writing, software
+    distributed under the License is distributed on an "AS IS" BASIS,
+    WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+    See the License for the specific language governing permissions and
+    limitations under the License.
README.md ADDED
@@ -0,0 +1,32 @@
+ ---
+ title: MedAI Processing
+ emoji: ⚕️
+ colorFrom: indigo
+ colorTo: blue
+ sdk: docker
+ pinned: false
+ license: apache-2.0
+ short_description: Process and centralise medical docs for LLM fine-tuning
+ ---
+
+ ## Quick Access
+
+ [HF Space](https://huggingface.co/spaces/BinKhoaLe1812/MedAI_Processing)
+
+ [MedDialog-100k](https://huggingface.co/datasets/BinKhoaLe1812/MedDialog-EN-100k)
+
+ [MedDialog-10k](https://huggingface.co/datasets/BinKhoaLe1812/MedDialog-EN-10k)
+
+ [PubMedQA-Labelled](https://huggingface.co/datasets/BinKhoaLe1812/PubMedQA-L)
+
+ [PubMedQA-Unlabelled](https://huggingface.co/datasets/BinKhoaLe1812/PubMedQA-U)
+
+ [PubMedQA-Mapper](https://huggingface.co/datasets/BinKhoaLe1812/PubMedQA-MAP)
+
+
+ ## cURL Request Instructions
+ [Request Doc](https://huggingface.co/spaces/MedAI-COS30018/MedAI_Processing/blob/main/REQUEST.md)
+
+ ## License
+ [Apache-2.0 LICENSE](https://huggingface.co/spaces/BinKhoaLe1812/MedAI_Processing/blob/main/LICENSE.txt)
+
REQUEST.md ADDED
@@ -0,0 +1,156 @@
+ # 📑 MedAI Processing – Request Examples
+
+ Base URL of the Space:
+ **`https://binkhoale1812-medai-processing.hf.space`**
+
+ This Space processes medical datasets into a centralised fine-tuning format (JSONL + CSV) with optional augmentations such as **paraphrasing**, **back-translation**, **style standardisation**, **de-identification**, and **deduplication**.
+
+ ---
+
+ ## 🔹 1. Process HealthCareMagic
+
+ ```bash
+ curl -X POST \
+   -H "Content-Type: application/json" \
+   -d '{
+     "augment": {
+       "paraphrase_ratio": 0.1,
+       "backtranslate_ratio": 0.05,
+       "paraphrase_outputs": false,
+       "style_standardize": true,
+       "deidentify": true,
+       "dedupe": true,
+       "max_chars": 5000
+     },
+     "sample_limit": 2000,
+     "seed": 42
+   }' \
+   https://binkhoale1812-medai-processing.hf.space/process/healthcaremagic
+ ```
+
+ ---
+
+ ## 🔹 2. Process iCliniq
+
+ ```bash
+ curl -X POST \
+   -H "Content-Type: application/json" \
+   -d '{
+     "augment": {
+       "paraphrase_ratio": 0.2,
+       "backtranslate_ratio": 0.1,
+       "paraphrase_outputs": true,
+       "style_standardize": true,
+       "deidentify": true,
+       "dedupe": true,
+       "max_chars": 5000
+     },
+     "sample_limit": 1500,
+     "seed": 123
+   }' \
+   https://binkhoale1812-medai-processing.hf.space/process/icliniq
+ ```
+
+ ---
+
+ ## 🔹 3. Process PubMedQA (Labelled)
+
+ ```bash
+ curl -X POST \
+   -H "Content-Type: application/json" \
+   -d '{
+     "augment": {
+       "paraphrase_ratio": 0.05,
+       "backtranslate_ratio": 0.02,
+       "paraphrase_outputs": false,
+       "style_standardize": true,
+       "deidentify": false,
+       "dedupe": true,
+       "max_chars": 8000
+     },
+     "sample_limit": 1000,
+     "seed": 99
+   }' \
+   https://binkhoale1812-medai-processing.hf.space/process/pubmedqa_l
+ ```
+
+ ---
+
+ ## 🔹 4. Process PubMedQA (Unlabelled)
+
+ ```bash
+ curl -X POST \
+   -H "Content-Type: application/json" \
+   -d '{
+     "augment": {
+       "paraphrase_ratio": 0.05,
+       "backtranslate_ratio": 0.05,
+       "paraphrase_outputs": false,
+       "style_standardize": true,
+       "deidentify": true,
+       "dedupe": true,
+       "max_chars": 7000,
+       "consistency_check_ratio": 0.01,
+       "distill_fraction": 0.1
+     },
+     "sample_limit": 500,
+     "seed": 7
+   }' \
+   https://binkhoale1812-medai-processing.hf.space/process/pubmedqa_u
+ ```
+
+ ---
+
+ ## 🔹 5. Process PubMedQA (Map)
+
+ ```bash
+ curl -X POST \
+   -H "Content-Type: application/json" \
+   -d '{
+     "augment": {
+       "paraphrase_ratio": 0.1,
+       "backtranslate_ratio": 0.05,
+       "paraphrase_outputs": true,
+       "style_standardize": true,
+       "deidentify": true,
+       "dedupe": true,
+       "max_chars": 6000
+     },
+     "sample_limit": 1200,
+     "seed": 2024
+   }' \
+   https://binkhoale1812-medai-processing.hf.space/process/pubmedqa_map
+ ```
+
+ ---
+
+ ## 🔹 6. Check Current Job Status
+
+ ```bash
+ curl https://binkhoale1812-medai-processing.hf.space/status
+ ```
+
+ ---
+
+ ## 🔹 7. List Generated Artifacts
+
+ ```bash
+ curl https://binkhoale1812-medai-processing.hf.space/files
+ ```
+
+ ---
+
+ # ✅ Notes
+
+ * Each run outputs both `.jsonl` and `.csv` in `cache/outputs/` and also uploads them to Google Drive folder ID:
+   `1JvW7its63E58fLxurH8ZdhxzdpcMrMbt`
+ * `augment` options can be adjusted per dataset:
+
+   * `paraphrase_ratio` – % of rows paraphrased (0–1)
+   * `backtranslate_ratio` – % of rows back-translated
+   * `paraphrase_outputs` – whether to also augment model answers
+   * `style_standardize` – enforce neutral, clinical style
+   * `deidentify` – redact PHI (emails, phones, URLs, IPs)
+   * `dedupe` – skip duplicate pairs
+   * `consistency_check_ratio` – run lightweight QA sanity check
+   * `distill_fraction` – generate pseudo-labels for unlabelled data
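+
+ ---
+
+ Jobs run in the background, so `/status` can be polled until completion. A minimal Python polling loop (field names match the `/status` payload defined in `app.py`; `requests` is assumed to be installed on the client):
+
+ ```python
+ import time
+ import requests
+
+ BASE = "https://binkhoale1812-medai-processing.hf.space"
+
+ while True:
+     state = requests.get(f"{BASE}/status", timeout=30).json()
+     print(f"{state.get('progress', 0.0):.0%} - {state.get('message')}")
+     if not state.get("running"):  # finished: message is 'done' or an error
+         break
+     time.sleep(10)
+
+ print(state.get("last_result"))  # artifacts, stats, duration
+ ```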
app.py ADDED
@@ -0,0 +1,423 @@
+ # Root FastAPI
+ import os
+ import json
+ import time, logging
+ import threading
+ import datetime as dt
+ from typing import Optional, Dict
+
+ from fastapi import FastAPI, HTTPException, BackgroundTasks, Request
+ from fastapi.responses import HTMLResponse, JSONResponse
+ from pydantic import BaseModel
+ from dotenv import load_dotenv
+
+ from utils.datasets import resolve_dataset, hf_download_dataset
+ from utils.processor import process_file_into_sft
+ from utils.rag import process_file_into_rag
+ from utils.drive_saver import DriveSaver
+ from utils.llm import Paraphraser
+ from utils.schema import CentralisedWriter
+ from utils.token import get_credentials, exchange_code, build_auth_url
+
+ # ────────── Log ───────────
+ logger = logging.getLogger("app")
+ if not logger.handlers:
+     logger.setLevel(logging.INFO)
+     handler = logging.StreamHandler()
+     logger.addHandler(handler)
+
+ # ────────── Boot ──────────
+ load_dotenv(override=True)
+
+ SPACE_NAME = os.getenv("SPACE_NAME", "MedAI Processor")
+ OUTPUT_DIR = os.path.abspath(os.getenv("OUTPUT_DIR", "cache/outputs"))
+ LOG_DIR = os.path.abspath(os.getenv("LOG_DIR", "logs"))
+ os.makedirs(OUTPUT_DIR, exist_ok=True)
+ os.makedirs(LOG_DIR, exist_ok=True)
+
+ # --- Bootstrap Google OAuth ---
+ try:
+     creds = get_credentials()
+     if creds:
+         logger.info("✅ OAuth credentials loaded and valid")
+ except Exception as e:
+     logger.warning(f"⚠️ OAuth not initialized yet: {e}")
+
+ # --- Bootstrap Google Drive ---
+ drive = DriveSaver(default_folder_id=os.getenv("GDRIVE_FOLDER_ID"))
+
+ # LLM rotator with paraphraser nodes
+ paraphraser = Paraphraser(
+     nvidia_model=os.getenv("NVIDIA_MODEL", "meta/llama-3.1-8b-instruct"),
+     gemini_model_easy=os.getenv("GEMINI_MODEL_EASY", "gemini-2.5-flash-lite"),
+     gemini_model_hard=os.getenv("GEMINI_MODEL_HARD", "gemini-2.5-flash"),
+ )
+
+ app = FastAPI(title="Medical Dataset Augmenter", version="1.1.0")
+
+ STATE_LOCK = threading.Lock()
+ STATE: Dict[str, object] = {
+     "running": False,
+     "dataset": None,
+     "started_at": None,
+     "progress": 0.0,
+     "message": "idle",
+     "last_result": None
+ }
+
+ class AugmentOptions(BaseModel):
+     # ratios are 0..1
+     paraphrase_ratio: float = 0.0
+     paraphrase_outputs: bool = False
+     backtranslate_ratio: float = 0.0
+     style_standardize: bool = True
+     deidentify: bool = True
+     dedupe: bool = True
+     max_chars: int = 5000                 # cap extremely long contexts
+     consistency_check_ratio: float = 0.0  # small ratio e.g. 0.01
+     # KD / distillation (optional, keeps default off)
+     distill_fraction: float = 0.0         # for unlabeled only
+     expand: bool = True                   # Enable back-translation and complex augmentation
+     max_aug_per_sample: int = 2           # Between 1-3, number of LLM calls to augment/paraphrase data
+
+ class ProcessParams(BaseModel):
+     augment: AugmentOptions = AugmentOptions()
+     sample_limit: Optional[int] = None    # Set data sampling if needed
+     seed: int = 42
+     rag_processing: bool = False          # Enable RAG-specific processing
+
+ def set_state(**kwargs):
+     with STATE_LOCK:
+         STATE.update(kwargs)
+
+ def now_iso():
+     return dt.datetime.utcnow().isoformat()
+
+ # Instructional UI
+ @app.get("/", response_class=HTMLResponse)
+ def root():
+     return f"""
+     <html>
+       <head>
+         <title>{SPACE_NAME} – Medical Dataset Augmenter</title>
+         <style>
+           body {{ font-family: Arial, sans-serif; max-width: 900px; margin: 2rem auto; line-height: 1.5; }}
+           h1, h2 {{ color: #2c3e50; }}
+           button {{
+             background: #2d89ef; color: white; border: none; padding: 8px 16px;
+             border-radius: 5px; cursor: pointer; margin: 5px 0;
+           }}
+           button:hover {{ background: #1b5dab; }}
+           .section {{ margin-bottom: 2rem; }}
+           #log {{ background:#f5f5f5; padding:10px; border-radius:6px; margin-top:10px; font-size:0.9rem; }}
+           a {{ color:#2d89ef; text-decoration:none; }}
+           a:hover {{ text-decoration:underline; }}
+         </style>
+       </head>
+       <body>
+         <h1>📊 {SPACE_NAME} – Medical Dataset Augmenter</h1>
+         <p>This Hugging Face Space processes medical datasets into a <b>centralised fine-tuning format</b>
+         (JSONL + CSV), with optional <i>data augmentation</i>.</p>
+
+         <div class="section">
+           <h2>⚡ Quick Actions</h2>
+           <p>Click a button below to start processing a dataset with default augmentation parameters.</p>
+           <button onclick="startJob('healthcaremagic')">▶ ProcAugment HealthCareMagic (100k)</button><br>
+           <button onclick="startJob('icliniq')">▶ ProcAugment iCliniq (10k-derived)</button><br>
+           <button onclick="startJob('pubmedqa_l')">▶ ProcAugment PubMedQA (Labelled)</button><br>
+           <button onclick="startJob('pubmedqa_u')">▶ ProcAugment PubMedQA (Unlabelled)</button><br>
+           <button onclick="startJob('pubmedqa_map')">▶ ProcAugment PubMedQA (Map)</button><br><br>
+           <div style="border-top: 1px solid #ddd; padding-top: 10px; margin-top: 10px;">
+             <strong>RAG Processing:</strong> Convert to QCA format for RAG systems<br>
+             <button onclick="startRagJob('healthcaremagic')" style="background: #e74c3c;">▶ RAG HealthCareMagic (100k)</button><br>
+             <button onclick="startRagJob('icliniq')" style="background: #e74c3c;">▶ RAG iCliniq (10k-derived)</button><br>
+             <button onclick="startRagJob('pubmedqa_u')" style="background: #e74c3c;">▶ RAG PubMedQA (Unlabelled)</button><br>
+             <button onclick="startRagJob('pubmedqa_l')" style="background: #e74c3c;">▶ RAG PubMedQA (Labelled)</button><br>
+             <button onclick="startRagJob('pubmedqa_map')" style="background: #e74c3c;">▶ RAG PubMedQA (Map)</button>
+           </div>
+         </div>
+
+         <div class="section">
+           <h2>📂 Monitoring</h2>
+           <ul>
+             <li><a href="/status" target="_blank">Check current job status</a></li>
+             <li><a href="/files" target="_blank">List generated artifacts</a></li>
+             <li><a href="https://binkhoale1812-medai-processing.hf.space/oauth2/start" target="_blank">Authorize your GCS credential</a></li>
+             <li><a href="https://huggingface.co/spaces/BinKhoaLe1812/MedAI_Processing/blob/main/REQUEST.md" target="_blank">📑 Request Doc (all curl examples)</a></li>
+           </ul>
+         </div>
+
+         <div class="section">
+           <h2>📝 Log</h2>
+           <div id="log">Click a button above to run a job...</div>
+         </div>
+
+         <script>
+           async function startJob(dataset) {{
+             const log = document.getElementById("log");
+             const ragToggle = document.getElementById("ragToggle");
+             const isRagMode = ragToggle ? ragToggle.checked : false; // no RAG toggle on this page yet; default to SFT
+
+             log.innerHTML = "⏳ Starting " + (isRagMode ? "RAG " : "") + "job for <b>" + dataset + "</b>...";
+             try {{
+               const resp = await fetch("/process/" + dataset, {{
+                 method: "POST",
+                 headers: {{ "Content-Type": "application/json" }},
+                 body: JSON.stringify({{
+                   augment: {{
+                     paraphrase_ratio: 0.1,
+                     backtranslate_ratio: 0.00, // Increase to 0.05-0.1 for back-translation
+                     paraphrase_outputs: false,
+                     style_standardize: true,
+                     deidentify: true,
+                     dedupe: true,
+                     max_chars: 5000,
+                     expand: true,
+                     max_aug_per_sample: 2
+                   }},
+                   sample_limit: null, // Sample down (currently disabled)
+                   seed: 42,
+                   rag_processing: isRagMode
+                 }})
+               }});
+               const data = await resp.json();
+               if (resp.ok) {{
+                 log.innerHTML = "✅ " + JSON.stringify(data);
+               }} else {{
+                 log.innerHTML = "❌ Error: " + JSON.stringify(data);
+               }}
+             }} catch (err) {{
+               log.innerHTML = "❌ JS Error: " + err;
+             }}
+           }}
+
+           async function startRagJob(dataset) {{
+             const log = document.getElementById("log");
+             log.innerHTML = "⏳ Starting RAG processing for <b>" + dataset + "</b>...";
+             try {{
+               const resp = await fetch("/rag/" + dataset, {{
+                 method: "POST",
+                 headers: {{ "Content-Type": "application/json" }},
+                 body: JSON.stringify({{
+                   sample_limit: null,
+                   seed: 42
+                 }})
+               }});
+               const data = await resp.json();
+               if (resp.ok) {{
+                 log.innerHTML = "✅ RAG Processing Started: " + JSON.stringify(data);
+               }} else {{
+                 log.innerHTML = "❌ Error: " + JSON.stringify(data);
+               }}
+             }} catch (err) {{
+               log.innerHTML = "❌ JS Error: " + err;
+             }}
+           }}
+         </script>
+       </body>
+     </html>
+     """
+
+ @app.get("/status")
+ def status():
+     with STATE_LOCK:
+         return JSONResponse(STATE)
+
+ # ──────── GCS token ────────
+ @app.get("/oauth2/start")
+ def oauth2_start(request: Request):
+     # Compute redirect URI dynamically from the actual host the Space is using
+     host = request.headers.get("x-forwarded-host") or request.headers.get("host")
+     scheme = "https"  # Spaces are HTTPS at the edge
+     redirect_uri = f"{scheme}://{host}/oauth2/callback"
+
+     try:
+         url = build_auth_url(redirect_uri)
+         return JSONResponse({"authorize_url": url})
+     except Exception as e:
+         raise HTTPException(500, f"OAuth init failed: {e}")
+
+ # Display your token
+ @app.get("/oauth2/callback")
+ def oauth2_callback(request: Request, code: str = "", state: str = ""):
+     if not code:
+         raise HTTPException(400, "Missing 'code'")
+     # Send req
+     host = request.headers.get("x-forwarded-host") or request.headers.get("host")
+     scheme = "https"
+     redirect_uri = f"{scheme}://{host}/oauth2/callback"
+     # Parse and show token code
+     try:
+         creds = exchange_code(code, redirect_uri)
+         refresh = creds.refresh_token or os.getenv("GDRIVE_REFRESH_TOKEN", "")
+         # UI
+         html = f"""
+         <html>
+           <head>
+             <style>
+               body {{ font-family: sans-serif; margin: 2em; }}
+               .token-box {{
+                 padding: 1em; border: 1px solid #ccc; border-radius: 6px;
+                 background: #f9f9f9; font-family: monospace;
+                 word-break: break-all; white-space: pre-wrap;
+               }}
+               .note {{ margin-top: 1em; color: #555; }}
+             </style>
+           </head>
+           <body>
+             <h2>✅ Google Drive Authorized</h2>
+             <p>Your refresh token is:</p>
+             <div class="token-box">{refresh}</div>
+             <p class="note">
+               👉 Copy this token and save it into your Hugging Face Space Secrets
+               as <code>GDRIVE_REFRESH_TOKEN</code>.
+               This ensures persistence across rebuilds.
+             </p>
+           </body>
+         </html>
+         """
+         return HTMLResponse(html)
+     except Exception as e:
+         raise HTTPException(500, f"OAuth exchange failed: {e}")
+
+ @app.get("/files")
+ def files():
+     out = []
+     for root, _, fns in os.walk(OUTPUT_DIR):
+         for fn in fns:
+             out.append(os.path.relpath(os.path.join(root, fn), OUTPUT_DIR))
+     return {"output_dir": OUTPUT_DIR, "files": sorted(out)}
+
+ @app.post("/process/{dataset_key}")
+ def process_dataset(dataset_key: str, params: ProcessParams, background: BackgroundTasks):
+     with STATE_LOCK:
+         if STATE["running"]:
+             logger.warning(
+                 f"[JOB] Rejecting new job dataset={dataset_key} "
+                 f"current={STATE['dataset']} started_at={STATE['started_at']}"
+             )
+             raise HTTPException(409, detail="Another job is running.")
+         STATE["running"] = True
+         STATE["dataset"] = dataset_key
+         STATE["started_at"] = now_iso()
+         STATE["progress"] = 0.0
+         STATE["message"] = "starting"
+         STATE["last_result"] = None
+     logger.info(
+         f"[JOB] Queued dataset={dataset_key} "
+         f"params={{'sample_limit': {params.sample_limit}, 'seed': {params.seed}, "
+         f"'rag_processing': {params.rag_processing}, 'augment': {params.augment.dict()} }}"
+     )
+     # Start job to background runner thread
+     logger.info(f"[JOB] Started dataset={dataset_key}")
+     background.add_task(_run_job, dataset_key, params)
+     return {"ok": True, "message": f"Job for '{dataset_key}' started."}
+
+ @app.post("/rag/{dataset_key}")
+ def process_rag_dataset(dataset_key: str, params: ProcessParams, background: BackgroundTasks):
+     """Dedicated RAG processing endpoint"""
+     # Force RAG processing mode
+     params.rag_processing = True
+
+     with STATE_LOCK:
+         if STATE["running"]:
+             logger.warning(
+                 f"[RAG] Rejecting new RAG job dataset={dataset_key} "
+                 f"current={STATE['dataset']} started_at={STATE['started_at']}"
+             )
+             raise HTTPException(409, detail="Another job is running.")
+         STATE["running"] = True
+         STATE["dataset"] = dataset_key
+         STATE["started_at"] = now_iso()
+         STATE["progress"] = 0.0
+         STATE["message"] = "starting RAG processing"
+         STATE["last_result"] = None
+     logger.info(
+         f"[RAG] Queued RAG dataset={dataset_key} "
+         f"params={{'sample_limit': {params.sample_limit}, 'seed': {params.seed} }}"
+     )
+     # Start job to background runner thread
+     logger.info(f"[RAG] Started RAG dataset={dataset_key}")
+     background.add_task(_run_job, dataset_key, params)
+     return {"ok": True, "message": f"RAG processing job for '{dataset_key}' started."}
+
+ def _run_job(dataset_key: str, params: ProcessParams):
+     t0 = time.time()
+     try:
+         ds = resolve_dataset(dataset_key)
+         if not ds:
+             set_state(running=False, message="unknown dataset")
+             return
+
+         # Download HF Dataset and start processing units
+         set_state(message="downloading")
+         local_path = hf_download_dataset(ds["repo_id"], ds["filename"], ds["repo_type"])
+         logger.info(f"[JOB] Downloaded {ds['repo_id']}/{ds['filename']} → {local_path}")
+
+         # Prepare timestamp for file writing
+         ts = dt.datetime.utcnow().strftime("%Y%m%d-%H%M%S")
+         mode_suffix = "rag" if params.rag_processing else "sft"
+         stem = f"{dataset_key}-{mode_suffix}-{ts}"
+         jsonl_path = os.path.join(OUTPUT_DIR, f"{stem}.jsonl")
+         csv_path = os.path.join(OUTPUT_DIR, f"{stem}.csv")
+         # Change state
+         set_state(message="processing", progress=0.05)
+
+         # Writer
+         writer = CentralisedWriter(jsonl_path=jsonl_path, csv_path=csv_path)
+
+         if params.rag_processing:
+             # RAG processing mode
+             set_state(message="RAG processing", progress=0.1)
+             count, stats = process_file_into_rag(
+                 dataset_key=dataset_key,
+                 input_path=local_path,
+                 writer=writer,
+                 nvidia_model=os.getenv("NVIDIA_MODEL", "meta/llama-3.1-8b-instruct"),
+                 sample_limit=params.sample_limit,
+                 seed=params.seed,
+                 progress_cb=lambda p, msg=None: set_state(progress=p, message=msg or STATE["message"])
+             )
+         else:
+             # Standard SFT processing mode
+             set_state(message="SFT processing", progress=0.1)
+             count, stats = process_file_into_sft(
+                 dataset_key=dataset_key,
+                 input_path=local_path,
+                 writer=writer,
+                 paraphraser=paraphraser,
+                 augment_opts=params.augment.dict(),
+                 sample_limit=params.sample_limit,
+                 seed=params.seed,
+                 progress_cb=lambda p, msg=None: set_state(progress=p, message=msg or STATE["message"])
+             )
+         logger.info(f"[JOB] Processed dataset={dataset_key} rows={count} stats={stats}")
+         writer.close()
+
+         # Upload to GDrive
+         set_state(message="uploading to Google Drive", progress=0.95)
+         up1 = drive.upload_file_to_drive(jsonl_path, mimetype="application/json")
+         up2 = drive.upload_file_to_drive(csv_path, mimetype="text/csv")
+         logger.info(
+             f"[JOB] Uploads complete uploaded={bool(up1 and up2)} "
+             f"jsonl={jsonl_path} csv={csv_path}"
+         )
+
+         # Finalize the task
+         result = {
+             "dataset": dataset_key,
+             "processing_mode": "RAG" if params.rag_processing else "SFT",
+             "processed_rows": count,
+             "stats": stats,
+             "artifacts": {"jsonl": jsonl_path, "csv": csv_path},
+             "uploaded": bool(up1 and up2),
+             "duration_sec": round(time.time() - t0, 2)
+         }
+         set_state(message="done", progress=1.0, last_result=result, running=False)
+         logger.info(
+             f"[JOB] Finished dataset={dataset_key} "
+             f"duration_sec={round(time.time()-t0, 2)}"
+         )
+     except Exception as e:
+         logger.exception(f"[JOB] Error for dataset={dataset_key}: {e}")
+         set_state(message=f"error: {e}", running=False)
mount_drive.py ADDED
@@ -0,0 +1,9 @@
+ # Check Google Drive status
+ from utils.drive_saver import DriveSaver
+
+ if __name__ == "__main__":
+     ds = DriveSaver()
+     if ds.is_service_available():
+         print("Drive ready.")
+     else:
+         print("Drive NOT ready.")
requirements.txt ADDED
@@ -0,0 +1,13 @@
+ fastapi
+ uvicorn[standard]
+ python-dotenv
+ huggingface_hub
+ requests
+ google-genai
+ google-api-python-client
+ google-auth
+ google-auth-httplib2
+ google-auth-oauthlib
+ orjson
+ ftfy
+ langid
utils/__init__.py ADDED
@@ -0,0 +1,22 @@
+ """
+ Utility package for the Medical Dataset Augmenter Space.
+
+ This package provides:
+ - drive_saver: Google Drive upload helper
+ - llm: API key rotation, paraphraser, translation/backtranslation
+ - datasets: Hugging Face dataset resolver & downloader
+ - processor: dataset-specific processing pipeline with augmentation
+ - schema: centralised SFT writer (JSONL + CSV)
+ - token: GCS project token refresher and authenticator
+ - augment: low-level augmentation utilities (text cleanup, deid, paraphrase hooks)
+ """
+
+ from . import drive_saver
+ from . import llm
+ from . import datasets
+ from . import processor
+ from . import schema
+ from . import augment
+ from . import token
+
+ __all__ = ["drive_saver", "llm", "datasets", "processor", "schema", "augment", "token"]
utils/.DS_Store ADDED
Binary file (6.15 kB).
utils/augment.py ADDED
@@ -0,0 +1,105 @@
+ # augmentation utility agent
+ import re
+ import random
+ from typing import Dict, Tuple
+ import ftfy
+ import langid
+
+ P_EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
+ P_PHONE = re.compile(r"(?:(?:\+?\d{1,3})?[\s-]?)?(?:\(?\d{2,4}\)?[\s-]?)?\d{3,4}[\s-]?\d{3,4}")
+ P_URL = re.compile(r"https?://\S+|www\.\S+")
+ P_IP = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")
+
+ def fix_unicode(s: str) -> str:
+     return ftfy.fix_text(s or "")
+
+ def normalize_whitespace(s: str) -> str:
+     s = s.replace("\u00A0", " ")
+     s = re.sub(r"[ \t]+", " ", s)
+     s = re.sub(r"\s+\n", "\n", s)
+     s = re.sub(r"\n{3,}", "\n\n", s)
+     return s.strip()
+
+ def canonicalize_quotes(s: str) -> str:
+     return s.replace("“", '"').replace("”", '"').replace("’", "'").replace("‘", "'")
+
+ def ensure_terminal_punct(s: str) -> str:
+     if not s: return s
+     if s[-1] in ".!?": return s
+     return s + "."
+
+ def deidentify(s: str) -> str:
+     s = P_EMAIL.sub("[REDACTED_EMAIL]", s)
+     s = P_PHONE.sub("[REDACTED_PHONE]", s)
+     s = P_URL.sub("[REDACTED_URL]", s)
+     s = P_IP.sub("[REDACTED_IP]", s)
+     return s
+
+ def lang_is_english(s: str) -> bool:
+     try:
+         lang, _ = langid.classify((s or "")[:2000])
+         return lang == "en"
+     except Exception:
+         return True
+
+ def length_cap(s: str, max_chars: int) -> str:
+     if len(s) <= max_chars:
+         return s
+     # try to cut at sentence boundary
+     cut = s[:max_chars]
+     last_dot = cut.rfind(". ")
+     if last_dot > 300:  # don't cut too aggressively
+         return cut[:last_dot+1] + " …"
+     return cut + " …"
+
+ def fingerprint(instr: str, user: str, out: str) -> str:
+     # Simple, fast fingerprint for dedupe
+     def norm(x: str) -> str:
+         x = x.lower()
+         x = re.sub(r"[^a-z0-9]+", " ", x)
+         x = re.sub(r"\s+", " ", x).strip()
+         return x
+     core = "||".join([norm(instr), norm(user), norm(out)])
+     # lightweight hash
+     import hashlib
+     return hashlib.md5(core.encode("utf-8")).hexdigest()
+
+ def style_standardize_answer(ans: str) -> str:
+     if not ans: return ans
+     ans = ans.strip()
+     # Gentle guardrails, neutral voice
+     prefix = ""
+     # Avoid absolute guarantees
+     ans = re.sub(r"\b(guarantee|100%|certainly|always|never)\b", "likely", ans, flags=re.I)
+     # Remove sign-offs typical of forums
+     ans = re.sub(r"\n*(thanks|thank you|regards|cheers)[^\n]*$", "", ans, flags=re.I)
+     return ensure_terminal_punct(ans)
+
+ def base_cleanup(s: str, max_chars: int, do_deid: bool) -> str:
+     s = fix_unicode(s)
+     s = canonicalize_quotes(s)
+     s = normalize_whitespace(s)
+     if do_deid:
+         s = deidentify(s)
+     s = length_cap(s, max_chars)
+     return s
+
+ def maybe_paraphrase(text: str, ratio: float, paraphraser, difficulty: str) -> Tuple[str, bool]:
+     if ratio <= 0 or not text: return text, False
+     if random.random() < ratio:
+         return paraphraser.paraphrase(text, difficulty=difficulty), True
+     return text, False
+
+ def maybe_backtranslate(text: str, ratio: float, paraphraser) -> Tuple[str, bool]:
+     if ratio <= 0 or not text: return text, False
+     if random.random() < ratio:
+         bt = paraphraser.backtranslate(text, via_lang="de")
+         return bt if bt else text, bool(bt)
+     return text, False
+
+ def consistency_ok(user: str, out: str, ratio: float, paraphraser) -> bool:
+     if ratio <= 0 or (not user) or (not out):
+         return True
+     if random.random() >= ratio:
+         return True
+     return paraphraser.consistency_check(user, out)
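+
+ if __name__ == "__main__":
+     # Illustrative smoke test added for this write-up, not part of the
+     # module's original logic; it exercises the cleanup + dedupe helpers
+     # on a synthetic example with a hypothetical address.
+     raw = "Contact  me at jane@example.com   for a follow-up"
+     cleaned = base_cleanup(raw, max_chars=100, do_deid=True)
+     print(cleaned)                                   # email becomes [REDACTED_EMAIL]
+     print(fingerprint("instr", cleaned, "answer"))   # stable MD5 key for dedupe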
utils/datasets.py ADDED
@@ -0,0 +1,66 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
+ # HF dataset download resolver + downloader
+ import os
+ from typing import Optional
+ from huggingface_hub import hf_hub_download
+ import logging
+
+ # Logger
+ logger = logging.getLogger("datasets")
+ if not logger.handlers:
+     logger.setLevel(logging.INFO)
+     logger.addHandler(logging.StreamHandler())
+
+
+ DATASETS = {
+     "healthcaremagic": {
+         "repo_id": "BinKhoaLe1812/MedDialog-EN-100k",
+         "filename": "HealthCareMagic-100k.json",
+         "repo_type": "dataset"
+     },
+     "icliniq": {
+         "repo_id": "BinKhoaLe1812/MedDialog-EN-10k",
+         "filename": "iCliniq.json",
+         "repo_type": "dataset"
+     },
+     "pubmedqa_l": {
+         "repo_id": "BinKhoaLe1812/PubMedQA-L",
+         "filename": "ori_pqal.json",
+         "repo_type": "dataset"
+     },
+     "pubmedqa_u": {
+         "repo_id": "BinKhoaLe1812/PubMedQA-U",
+         "filename": "ori_pqau.json",
+         "repo_type": "dataset"
+     },
+     "pubmedqa_map": {
+         "repo_id": "BinKhoaLe1812/PubMedQA-Map",
+         "filename": "pubmed_qa_map.json",
+         "repo_type": "dataset"
+     }
+ }
+
+
+ def resolve_dataset(key: str) -> Optional[dict]:
+     return DATASETS.get(key.lower())
+
+
+ def hf_download_dataset(repo_id: str, filename: str, repo_type: str = "dataset") -> str:
+     token = os.getenv("HF_TOKEN")
+     logger.info(
+         f"[HF] Download {repo_id}/{filename} (type={repo_type}) token={'yes' if token else 'no'}"
+     )
+     path = hf_hub_download(
+         repo_id=repo_id,
+         filename=filename,
+         repo_type=repo_type,
+         token=token,
+         local_dir=os.path.abspath("cache/hf"),
+         local_dir_use_symlinks=False
+     )
+     try:
+         size = os.path.getsize(path)
+         logger.info(f"[HF] Downloaded to {path} size={size} bytes")
+     except Exception:
+         logger.info(f"[HF] Downloaded to {path}")
+     return path
+
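
Resolution and download compose in two calls. A minimal usage sketch, assuming HF_TOKEN is set in the environment when the repos require auth:

    from utils.datasets import resolve_dataset, hf_download_dataset

    spec = resolve_dataset("icliniq")             # dict with repo_id / filename / repo_type, or None
    if spec is not None:
        local_path = hf_download_dataset(**spec)  # cached under cache/hf
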
utils/drive_saver.py ADDED
@@ -0,0 +1,88 @@
+ # Save final post-process to Google Drive
+ import os, json, logging
+ from typing import Optional
+ from google.oauth2 import service_account
+ from googleapiclient.discovery import build
+ from googleapiclient.http import MediaFileUpload
+
+ from utils.token import get_credentials
+
+ logger = logging.getLogger("dsaver")
+ if not logger.handlers:
+     logger.setLevel(logging.INFO)
+     fmt = logging.Formatter("[%(levelname)s] %(asctime)s - %(message)s")
+     handler = logging.StreamHandler()
+     handler.setFormatter(fmt)
+     logger.addHandler(handler)
+
+ class DriveSaver:
+     """Google Drive uploader. Prefers OAuth; optional SA fallback (Shared Drive only)."""
+
+     def __init__(self, default_folder_id: Optional[str] = None):
+         self.service = None
+         self.folder_id = default_folder_id or os.getenv("GDRIVE_FOLDER_ID")
+         self.supports_all_drives = os.getenv("GDRIVE_FOLDER_IS_SHARED", "false").lower() in ("1", "true", "yes")
+         self.allow_sa_fallback = os.getenv("GDRIVE_ALLOW_SA_FALLBACK", "false").lower() in ("1", "true", "yes")
+         if not self.folder_id:
+             logger.warning("📁 No GDRIVE_FOLDER_ID set; uploads must provide folder_id explicitly")
+         self._initialize_service()
+
+     def _initialize_service(self):
+         creds = get_credentials()
+         if creds:
+             logger.info("✅ Using OAuth credentials")
+         else:
+             # Optional SA fallback — ONLY valid for Shared Drives where the SA is a member
+             if self.allow_sa_fallback:
+                 creds_env = os.getenv("GDRIVE_CREDENTIALS_JSON")
+                 if creds_env:
+                     try:
+                         info = json.loads(creds_env)
+                         if info.get("type") == "service_account":
+                             creds = service_account.Credentials.from_service_account_info(
+                                 info, scopes=["https://www.googleapis.com/auth/drive"]
+                             )
+                             logger.info("✅ Using Service Account credentials (fallback)")
+                             if not self.supports_all_drives:
+                                 logger.warning("⚠️ SA fallback without Shared Drive mode will likely fail (no quota). "
+                                                "Set GDRIVE_FOLDER_IS_SHARED=true and use a Shared Drive folder ID.")
+                         else:
+                             logger.error("❌ GDRIVE_CREDENTIALS_JSON is not a service account JSON")
+                     except Exception as e:
+                         logger.error(f"❌ Failed to init Service Account: {e}")
+         if not creds:
+             logger.error("❌ No valid Google credentials available (OAuth or SA).")
+             self.service = None
+             return
+         # Build Drive service
+         self.service = build("drive", "v3", credentials=creds)
+         logger.info("✅ Google Drive service initialized")
+
+     def upload_file_to_drive(self, file_path: str, folder_id: Optional[str] = None, mimetype: Optional[str] = None) -> bool:
+         if not self.service:
+             logger.error("❌ Drive service not initialized")
+             return False
+         try:
+             target_folder = folder_id or self.folder_id
+             name = os.path.basename(file_path)
+             media = MediaFileUpload(file_path, mimetype=mimetype or "application/octet-stream")
+             metadata = {"name": name, "parents": [target_folder]}
+             req = self.service.files().create(
+                 body=metadata,
+                 media_body=media,
+                 fields="id",
+                 supportsAllDrives=self.supports_all_drives
+             )
+             req.execute()
+             logger.info(f"✅ Uploaded '{name}' to Drive (folder: {target_folder})")
+             return True
+         except Exception as e:
+             logger.error(f"❌ Drive upload failed: {e}")
+             return False
+
+     def is_service_available(self) -> bool:
+         return self.service is not None
+
+     def set_folder_id(self, folder_id: str):
+         self.folder_id = folder_id
+         logger.info(f"📁 Default folder ID updated: {folder_id}")
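
A minimal upload sketch. The output path and mimetype here are illustrative; GDRIVE_FOLDER_ID (and, for Shared Drives, GDRIVE_FOLDER_IS_SHARED) are read from the environment as above:

    from utils.drive_saver import DriveSaver

    saver = DriveSaver()  # picks up GDRIVE_FOLDER_ID and related env vars
    if saver.is_service_available():
        # "processed/central_sft.jsonl" is a placeholder output path
        saver.upload_file_to_drive("processed/central_sft.jsonl", mimetype="application/json")
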
utils/llm.py ADDED
@@ -0,0 +1,186 @@
+ # Round-robin rotator + paraphrasing + translation/backtranslation
+ import os
+ import re
+ import logging
+ import requests
+ from typing import Optional
+ from google import genai
+
+ logger = logging.getLogger("llm")
+ if not logger.handlers:
+     logger.setLevel(logging.INFO)
+     handler = logging.StreamHandler()
+     logger.addHandler(handler)
+
+ # Trim LLM output to a short snippet for logging
+ def snip(s: str, n: int = 12) -> str:
+     if not isinstance(s, str): return "∅"
+     parts = s.strip().split()
+     return " ".join(parts[:n]) + (" …" if len(parts) > n else "")
+
+ class KeyRotator:
+     def __init__(self, env_prefix: str, max_keys: int = 5):
+         keys = []
+         for i in range(1, max_keys + 1):
+             v = os.getenv(f"{env_prefix}_{i}")
+             if v:
+                 keys.append(v.strip())
+         if not keys:
+             logger.warning(f"[LLM] No keys found for prefix {env_prefix}_*")
+         self.keys = keys
+         self.dead = set()
+         self.idx = 0
+
+     def next_key(self) -> Optional[str]:
+         if not self.keys:
+             return None
+         for _ in range(len(self.keys)):
+             k = self.keys[self.idx % len(self.keys)]
+             self.idx += 1
+             if k not in self.dead:
+                 return k
+         return None
+
+     def mark_bad(self, key: Optional[str]):
+         if key:
+             self.dead.add(key)
+             logger.warning(f"[LLM] Quarantined key (prefix hidden): {key[:6]}***")
+
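
The rotator walks the configured keys round-robin and permanently skips quarantined ones, so a single failing key drops out without stalling the pipeline. A small sketch with dummy key values (illustrative only):

    import os
    from utils.llm import KeyRotator

    os.environ["NVIDIA_API_1"] = "key-a"   # dummy values for the sketch
    os.environ["NVIDIA_API_2"] = "key-b"

    rot = KeyRotator("NVIDIA_API")
    k1 = rot.next_key()   # "key-a"
    rot.mark_bad(k1)      # quarantine it after an auth/quota error
    k2 = rot.next_key()   # "key-b"; "key-a" is skipped from now on
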
+ class GeminiClient:
+     def __init__(self, rotator: KeyRotator, default_model: str):
+         self.rotator = rotator
+         self.default_model = default_model
+
+     def generate(self, prompt: str, model: Optional[str] = None, temperature: float = 0.2, max_output_tokens: int = 512) -> Optional[str]:
+         key = self.rotator.next_key()
+         if not key:
+             return None
+         try:
+             client = genai.Client(api_key=key)
+             # NOTE: matches your required pattern/use
+             res = client.models.generate_content(
+                 model=model or self.default_model,
+                 contents=prompt
+             )
+             text = getattr(res, "text", None)
+             if text:
+                 logger.info(f"[LLM][Gemini] out={snip(text)}")
+                 return text
+         except Exception as e:
+             logger.error(f"[LLM][Gemini] {e}")
+             self.rotator.mark_bad(key)
+         return None
+
+ class NvidiaClient:
+     def __init__(self, rotator: KeyRotator, default_model: str):
+         self.rotator = rotator
+         self.default_model = default_model
+         self.url = os.getenv("NVIDIA_API_URL", "https://integrate.api.nvidia.com/v1/chat/completions")
+
+     # Regex-based cleanup of boilerplate prefixes in responses
+     def _clean_resp(self, resp: str) -> str:
+         if not resp: return resp
+         txt = resp.strip()
+         # Remove common boilerplate prefixes
+         for pat in [
+             r"^Here is (a|the) .*?:\s*",
+             r"^Paraphrased(?: version)?:\s*",
+             r"^Sure[,.]?\s*",
+             r"^Okay[,.]?\s*"
+         ]:
+             txt = re.sub(pat, "", txt, flags=re.I)
+         return txt.strip()
+
+     def generate(self, prompt: str, model: Optional[str] = None, temperature: float = 0.2, max_tokens: int = 512) -> Optional[str]:
+         key = self.rotator.next_key()
+         if not key:
+             return None
+         try:
+             headers = {"Authorization": f"Bearer {key}", "Content-Type": "application/json"}
+             payload = {
+                 "model": model or self.default_model,
+                 "messages": [{"role": "user", "content": prompt}],
+                 "temperature": temperature,
+                 "max_tokens": max_tokens
+             }
+             r = requests.post(self.url, headers=headers, json=payload, timeout=45)
+             if r.status_code >= 400:
+                 raise RuntimeError(f"HTTP {r.status_code}: {r.text[:200]}")
+             data = r.json()
+             text = data["choices"][0]["message"]["content"]
+             clean = self._clean_resp(text)
+             logger.info(f"[LLM][NVIDIA] out={snip(clean)}")
+             return clean
+         except Exception as e:
+             logger.error(f"[LLM][NVIDIA] {e}")
+             self.rotator.mark_bad(key)
+         return None
+
+ class Paraphraser:
+     """Prefers NVIDIA (cheap), falls back to Gemini. Also offers translate/backtranslate and a tiny consistency judge."""
+     def __init__(self, nvidia_model: str, gemini_model_easy: str, gemini_model_hard: str):
+         self.nv = NvidiaClient(KeyRotator("NVIDIA_API"), nvidia_model)
+         self.gm_easy = GeminiClient(KeyRotator("GEMINI_API"), gemini_model_easy)
+         self.gm_hard = GeminiClient(KeyRotator("GEMINI_API"), gemini_model_hard)
+
+     # Regex-based cleanup of boilerplate prefixes in responses
+     def _clean_resp(self, resp: str) -> str:
+         if not resp: return resp
+         txt = resp.strip()
+         # Remove common boilerplate prefixes
+         for pat in [
+             r"^Here is (a|the) .*?:\s*",
+             r"^Paraphrased(?: version)?:\s*",
+             r"^Sure[,.]?\s*",
+             r"^Okay[,.]?\s*"
+         ]:
+             txt = re.sub(pat, "", txt, flags=re.I)
+         return txt.strip()
+
+     # ————— Paraphrase —————
+     def paraphrase(self, text: str, difficulty: str = "easy") -> str:
+         if not text or len(text) < 12:
+             return text
+         prompt = (
+             "Paraphrase the following medical text concisely, preserve meaning and clinical terms.\n"
+             "Do not fabricate or remove factual claims.\n"
+             "Return ONLY the rewritten text, without any introduction or commentary.\n" + text
+         )
+         out = self.nv.generate(prompt, temperature=0.1, max_tokens=min(600, max(128, len(text)//2)))
+         if out: return self._clean_resp(out)
+         gm = self.gm_easy if difficulty == "easy" else self.gm_hard
+         out = gm.generate(prompt, max_output_tokens=min(600, max(128, len(text)//2)))
+         return self._clean_resp(out) if out else text
+
+     # ————— Translate & Backtranslate —————
+     def translate(self, text: str, target_lang: str = "de") -> Optional[str]:
+         if not text: return text
+         prompt = f"Translate to {target_lang}. Keep meaning exact, preserve medical terms:\n\n{text}"
+         out = self.nv.generate(prompt, temperature=0.0, max_tokens=min(800, len(text)+100))
+         if out: return out.strip()
+         return self.gm_easy.generate(prompt, max_output_tokens=min(800, len(text)+100))
+
+     def backtranslate(self, text: str, via_lang: str = "de") -> Optional[str]:
+         if not text: return text
+         mid = self.translate(text, target_lang=via_lang)
+         if not mid: return None
+         prompt = f"Translate the following {via_lang} text back to English, preserving the exact meaning:\n\n{mid}"
+         out = self.nv.generate(prompt, temperature=0.0, max_tokens=min(900, len(text)+150))
+         if out: return out.strip()
+         res = self.gm_easy.generate(prompt, max_output_tokens=min(900, len(text)+150))
+         return res.strip() if res else None
+
+     # ————— Consistency Judge (cheap, ratio-based) —————
+     def consistency_check(self, user: str, output: str) -> bool:
+         """Return True if 'output' appears supported by 'user' (context/question). Soft heuristic via LLM."""
+         prompt = (
+             "You are a strict medical QA validator. Given the USER input (question+context) "
+             "and the MODEL ANSWER, reply with exactly 'PASS' if the answer is supported and safe, "
+             "otherwise 'FAIL'. No extra text.\n\n"
+             f"USER:\n{user}\n\nANSWER:\n{output}"
+         )
+         out = self.nv.generate(prompt, temperature=0.0, max_tokens=3)
+         if not out:
+             out = self.gm_easy.generate(prompt, max_output_tokens=3)
+         return isinstance(out, str) and "PASS" in out.upper()
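
End-to-end, the Paraphraser tries the NVIDIA endpoint first and only falls back to Gemini when that returns nothing. A usage sketch; the model names are illustrative placeholders, and real NVIDIA_API_* / GEMINI_API_* keys must be present for the calls to succeed:

    from utils.llm import Paraphraser

    p = Paraphraser(
        nvidia_model="meta/llama-3.1-8b-instruct",   # illustrative
        gemini_model_easy="gemini-2.0-flash",        # illustrative
        gemini_model_hard="gemini-2.0-pro",          # illustrative
    )
    better = p.paraphrase("Patient reports a dull occipital headache for 2 days.", difficulty="easy")
    round_trip = p.backtranslate("Take ibuprofen with food.", via_lang="de")
    ok = p.consistency_check(user="Question: Is rest advised?", output=better)
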
utils/processor.py ADDED
@@ -0,0 +1,411 @@
+ # Dataset-specific parsers + paraphrasing flow
+ import json
+ import random
+ import hashlib
+ import logging
+ from typing import Callable, Optional, Dict, Tuple
+
+ from utils.schema import sft_row
+ from utils import augment as A
+
+ # Logger
+ logger = logging.getLogger("processor")
+ if not logger.handlers:
+     logger.setLevel(logging.INFO)
+     logger.addHandler(logging.StreamHandler())
+
+
+ def _hash_id(*parts) -> str:
+     h = hashlib.sha256()
+     for p in parts:
+         h.update(str(p).encode("utf-8"))
+     return h.hexdigest()[:16]
+
+ def _iter_json_or_jsonl(path: str):
+     with open(path, "r", encoding="utf-8") as f:
+         first = f.read(1); f.seek(0)
+         if first == "[":
+             data = json.load(f)
+             for obj in data: yield obj
+         else:
+             for line in f:
+                 line = line.strip()
+                 if line: yield json.loads(line)
+
+ def process_file_into_sft(
+     dataset_key: str,
+     input_path: str,
+     writer,
+     paraphraser,
+     augment_opts: Dict,
+     sample_limit: Optional[int],
+     seed: int,
+     progress_cb: Optional[Callable[[float, str], None]]
+ ) -> Tuple[int, Dict]:
+     random.seed(seed)
+     stats = {
+         "written": 0,
+         "paraphrased_input": 0,
+         "paraphrased_output": 0,
+         "backtranslated_input": 0,
+         "backtranslated_output": 0,
+         "dedup_skipped": 0,
+         "consistency_failed": 0
+     }
+     # Start processing SFT
+     key_summary = {k: augment_opts.get(k) for k in (
+         "paraphrase_ratio", "backtranslate_ratio", "paraphrase_outputs",
+         "style_standardize", "deidentify", "dedupe",
+         "consistency_check_ratio", "distill_fraction"
+     )}
+     logger.info(
+         f"[PROC] Begin dataset={dataset_key} sample_limit={sample_limit} opts={key_summary}"
+     )
+     # Shared fingerprint set if deduplication is enabled
+     dedupe_seen = set() if augment_opts.get("dedupe", True) else None
+
+     key = dataset_key.lower()
+     if key in ("healthcaremagic", "icliniq"):
+         count = _proc_med_dialog(source=key, path=input_path, writer=writer,
+                                  paraphraser=paraphraser, opts=augment_opts,
+                                  sample_limit=sample_limit, stats=stats, cb=progress_cb, dedupe_seen=dedupe_seen)
+     elif key == "pubmedqa_l":
+         count = _proc_pubmedqa_l(input_path, writer, paraphraser, augment_opts, sample_limit, stats, progress_cb, dedupe_seen=dedupe_seen)
+     elif key == "pubmedqa_u":
+         count = _proc_pubmedqa_u(input_path, writer, paraphraser, augment_opts, sample_limit, stats, progress_cb, dedupe_seen=dedupe_seen)
+     elif key == "pubmedqa_map":
+         count = _proc_pubmedqa_map(input_path, writer, paraphraser, augment_opts, sample_limit, stats, progress_cb, dedupe_seen=dedupe_seen)
+     else:
+         raise ValueError(f"Unknown dataset: {dataset_key}")
+     logger.info(f"[PROC] End dataset={dataset_key} stats={stats}")
+     return count, stats
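
A minimal driver sketch for this entry point, wiring in the writer from utils.schema and the paraphraser from utils.llm (file paths, model names, and option values below are illustrative):

    from utils.processor import process_file_into_sft
    from utils.schema import CentralisedWriter
    from utils.llm import Paraphraser

    writer = CentralisedWriter("central.jsonl", "central.csv")
    paraphraser = Paraphraser("meta/llama-3.1-8b-instruct",
                              "gemini-2.0-flash", "gemini-2.0-pro")  # illustrative models
    opts = {"paraphrase_ratio": 0.2, "backtranslate_ratio": 0.1,
            "dedupe": True, "expand": True, "max_aug_per_sample": 1}
    count, stats = process_file_into_sft(
        "icliniq", "cache/hf/iCliniq.json", writer, paraphraser,
        augment_opts=opts, sample_limit=1000, seed=42, progress_cb=None
    )
    writer.close()
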
+
+ # ——————————— helpers ———————————
+ def _build_variants(user: str, out: str, paraphraser, opts: Dict, stats: Dict):
+     """Return a list of (user_variant, out_variant, applied_tags) not including the original."""
+     variants = []
+     max_k = max(0, int(opts.get("max_aug_per_sample", 1)))
+     for _ in range(max_k):
+         applied = []
+         u2, did_p = A.maybe_paraphrase(user, opts.get("paraphrase_ratio", 0.0), paraphraser, "easy")
+         if did_p: applied.append("paraphrase_input"); stats["paraphrased_input"] += 1
+         u3, did_bt = A.maybe_backtranslate(u2, opts.get("backtranslate_ratio", 0.0), paraphraser)
+         if did_bt: applied.append("backtranslate_input"); stats["backtranslated_input"] += 1
+
+         o3 = out
+         if opts.get("paraphrase_outputs", False):
+             o2, did_p2 = A.maybe_paraphrase(out, opts.get("paraphrase_ratio", 0.0), paraphraser, "hard")
+             if did_p2: applied.append("paraphrase_output"); stats["paraphrased_output"] += 1
+             o3b, did_bt2 = A.maybe_backtranslate(o2, opts.get("backtranslate_ratio", 0.0), paraphraser)
+             if did_bt2: applied.append("backtranslate_output"); stats["backtranslated_output"] += 1
+             o3 = o3b
+
+         # If nothing applied, skip this variant
+         if not applied:
+             continue
+         # Style standardize and punctuation for the variant too
+         if opts.get("style_standardize", True):
+             o3 = A.style_standardize_answer(o3)
+         u3 = A.ensure_terminal_punct(u3) if u3 else u3
+         o3 = A.ensure_terminal_punct(o3) if o3 else o3
+         variants.append((u3, o3, applied))
+     return variants
+
+ def _apply_aug(instr: str, user: str, out: str, source: str, opts: Dict, paraphraser, stats: Dict):
+     # Base cleanup & caps (returns cleaned strings)
+     user = A.base_cleanup(user, opts.get("max_chars", 5000), opts.get("deidentify", True))
+     out = A.base_cleanup(out, opts.get("max_chars", 5000), opts.get("deidentify", True))
+     instr = A.base_cleanup(instr, opts.get("max_chars", 5000), False)
+
+     # Language sanity (mostly English; skip aggressive transforms if not)
+     if not A.lang_is_english(user):  # very rare
+         return instr, user, out, []
+
+     # Track which augmentations and stylings have been applied
+     applied = []
+
+     # Style-standardize the answer
+     if opts.get("style_standardize", True):
+         out = A.style_standardize_answer(out)
+         applied.append("style_standardize")
+
+     # Ensure punctuation/whitespace
+     user = A.ensure_terminal_punct(user) if user else user
+     out = A.ensure_terminal_punct(out) if out else out
+
+     return instr, user, out, applied
+
+ def _commit_row(writer, source, rid, task, instr, user, out, opts, stats, aug_applied, extra_meta=None, dedupe_seen=None):
+     # Skip duplicates by content fingerprint
+     if dedupe_seen is not None:
+         fp = A.fingerprint(instr, user, out)
+         if fp in dedupe_seen:
+             stats["dedup_skipped"] += 1
+             return False
+         dedupe_seen.add(fp)
+
+     meta = {"augmentations": aug_applied}
+     if extra_meta:
+         meta.update(extra_meta)
+
+     row = sft_row(instr, user, out, source=source, rid=rid, task=task, meta=meta)
+     writer.write(row)
+     stats["written"] += 1
+     return True
+
+ # ——————————— dataset processors ———————————
+
+ def _proc_med_dialog(source, path, writer, paraphraser, opts, sample_limit, stats, cb, dedupe_seen=None):
+     count = 0
+     for i, obj in enumerate(_iter_json_or_jsonl(path), start=1):
+         try:
+             instr_raw = obj.get("instruction") or "Answer the patient's question like a clinician. Be concise and safe."
+             user_raw = obj.get("input") or ""
+             out_raw = obj.get("output") or ""
+
+             # Ensure we have string values
+             instr = str(instr_raw).strip()
+             user = str(user_raw).strip()
+             out = str(out_raw).strip()
+             rid = _hash_id(source, i, len(user), len(out))
+         except Exception as e:
+             logger.warning(f"[PROC] {source} error processing item {i}: {e}, item: {obj}")
+             continue
+
+         try:
+             instr, user, out, applied = _apply_aug(instr, user, out, source, opts, paraphraser, stats)
+
+             # Optional consistency spot-check (cheap)
+             if not A.consistency_ok(user, out, opts.get("consistency_check_ratio", 0.0), paraphraser):
+                 stats["consistency_failed"] += 1
+                 # keep the sample but tag it
+                 applied.append("consistency_flag")
+
+             # 1) ALWAYS write the original (cleaned/style-standardised only)
+             _commit_row(writer, source, rid, "medical_dialogue", instr, user, out, opts, stats, ["base"] + applied, dedupe_seen=dedupe_seen)
+             # 2) If expansion is enabled, add augmented copies
+             if opts.get("expand", True):
+                 for (u_aug, o_aug, aug_tags) in _build_variants(user, out, paraphraser, opts, stats):
+                     rid_aug = f"{rid}-aug{random.randint(1000,9999)}"
+                     _commit_row(writer, source, rid_aug, "medical_dialogue", instr, u_aug, o_aug, opts, stats, aug_tags, dedupe_seen=dedupe_seen)
+
+             # Increment count only on success
+             count += 1
+         except Exception as e:
+             logger.warning(f"[PROC] {source} error in processing/augmentation for item {i}: {e}")
+             continue
+         if sample_limit and count >= sample_limit:
+             break
+         if cb and i % 1000 == 0:
+             cb(min(0.9, 0.05 + i/200000), f"{source}: processed {i} rows")
+     if cb:
+         cb(0.92, f"{source} done ({count})")
+     logger.info(f"[PROC] {source} done count={count} written={stats['written']} dedup_skipped={stats['dedup_skipped']}")
+     return count
+
+ def _proc_pubmedqa_l(path, writer, paraphraser, opts, sample_limit, stats, cb, dedupe_seen=None):
+     with open(path, "r", encoding="utf-8") as f:
+         data = json.load(f)
+     count = 0
+     for k, v in data.items():
+         try:
+             q_raw = v.get("QUESTION") or ""
+             ctx_list = v.get("CONTEXTS") or []
+             long_ans_raw = v.get("LONG_ANSWER") or ""
+             final_raw = v.get("final_decision") or ""
+
+             # Ensure we have string values
+             q = str(q_raw).strip() if q_raw else ""
+             if isinstance(ctx_list, list):
+                 context = "\n".join(str(ctx) for ctx in ctx_list).strip()
+             else:
+                 context = str(ctx_list).strip()
+             long_ans = str(long_ans_raw).strip() if long_ans_raw else ""
+             final = str(final_raw).strip() if final_raw else ""
+         except Exception as e:
+             logger.warning(f"[PROC] pubmedqa_l error processing item {k}: {e}, item: {v}")
+             continue
+
+         try:
+             instr = "Answer the biomedical question using the provided context. Include a concise rationale if possible."
+             user = f"Question: {q}\n\nContext:\n{context}" if context else f"Question: {q}"
+             out = long_ans if long_ans else final
+             rid = str(k)
+
+             instr, user, out, applied = _apply_aug(instr, user, out, "pubmedqa_l", opts, paraphraser, stats)
+             _commit_row(writer, "pubmedqa_l", rid, "biomedical_qa", instr, user, out, opts, stats, applied,
+                         extra_meta={"year": v.get("YEAR"), "meshes": v.get("MESHES"), "labels": v.get("LABELS")}, dedupe_seen=dedupe_seen)
+             if opts.get("expand", True):
+                 for (u_aug, o_aug, aug_tags) in _build_variants(user, out, paraphraser, opts, stats):
+                     rid_aug = f"{rid}-aug{random.randint(1000,9999)}"
+                     _commit_row(writer, "pubmedqa_l", rid_aug, "biomedical_qa",
+                                 instr, u_aug, o_aug, opts, stats, aug_tags, dedupe_seen=dedupe_seen)
+
+             # Increment count only on success
+             count += 1
+         except Exception as e:
+             logger.warning(f"[PROC] pubmedqa_l error in processing/augmentation for item {k}: {e}")
+             continue
+         if sample_limit and count >= sample_limit:
+             break
+         if cb and count % 1000 == 0:
+             cb(min(0.9, 0.05 + count/60000), f"pubmedqa_l processed {count}")
+     if cb:
+         cb(0.93, f"pubmedqa_l done ({count})")
+     logger.info(f"[PROC] pubmedqa_l done count={count} written={stats['written']} dedup_skipped={stats['dedup_skipped']}")
+     return count
+
+ def _proc_pubmedqa_u(path, writer, paraphraser, opts, sample_limit, stats, cb, dedupe_seen=None):
+     with open(path, "r", encoding="utf-8") as f:
+         data = json.load(f)
+     count = 0
+     for k, v in data.items():
+         try:
+             q_raw = v.get("QUESTION") or ""
+             ctx_list = v.get("CONTEXTS") or []
+
+             # Ensure we have string values
+             q = str(q_raw).strip() if q_raw else ""
+             if isinstance(ctx_list, list):
+                 context = "\n".join(str(ctx) for ctx in ctx_list).strip()
+             else:
+                 context = str(ctx_list).strip()
+         except Exception as e:
+             logger.warning(f"[PROC] pubmedqa_u error processing item {k}: {e}, item: {v}")
+             continue
+
+         try:
+             instr = "Rewrite the context into a succinct note, then answer the question. If unknown, say 'insufficient evidence'."
+             user = f"Question: {q}\n\nContext:\n{context}" if context else f"Question: {q}"
+             out = ""  # unlabeled
+             rid = str(k)
+
+             # Optional KD/distillation for a small fraction
+             if opts.get("distill_fraction", 0.0) > 0.0 and random.random() < float(opts["distill_fraction"]):
+                 prompt = f"{instr}\n\n{user}\n\nAnswer briefly and safely."
+                 guess = paraphraser.paraphrase(prompt, difficulty="hard")  # cheap single call
+                 if guess and len(guess) < 2000:
+                     out = guess.strip()
+
+             instr, user, out, applied = _apply_aug(instr, user, out, "pubmedqa_u", opts, paraphraser, stats)
+             _commit_row(writer, "pubmedqa_u", rid, "biomedical_qa_unlabeled", instr, user, out, opts, stats, applied, dedupe_seen=dedupe_seen)
+             if opts.get("expand", True):
+                 for (u_aug, o_aug, aug_tags) in _build_variants(user, out, paraphraser, opts, stats):
+                     rid_aug = f"{rid}-aug{random.randint(1000,9999)}"
+                     _commit_row(writer, "pubmedqa_u", rid_aug, "biomedical_qa",
+                                 instr, u_aug, o_aug, opts, stats, aug_tags, dedupe_seen=dedupe_seen)
+
+             # Increment count only on success
+             count += 1
+         except Exception as e:
+             logger.warning(f"[PROC] pubmedqa_u error in processing/augmentation for item {k}: {e}")
+             continue
+         if sample_limit and count >= sample_limit:
+             break
+         if cb and count % 2000 == 0:
+             cb(min(0.9, 0.05 + count/80000), f"pubmedqa_u processed {count}")
+     if cb:
+         cb(0.94, f"pubmedqa_u done ({count})")
+     logger.info(f"[PROC] pubmedqa_u done count={count} written={stats['written']} dedup_skipped={stats['dedup_skipped']}")
+     return count
+
+ def _proc_pubmedqa_map(path, writer, paraphraser, opts, sample_limit, stats, cb, dedupe_seen=None):
+     with open(path, "r", encoding="utf-8") as f:
+         obj = json.load(f)
+
+     # Log the structure for debugging
+     logger.info(f"[PROC] pubmedqa_map data type: {type(obj)}")
+     if isinstance(obj, dict):
+         logger.info(f"[PROC] pubmedqa_map dict keys: {list(obj.keys())}")
+         if len(obj) > 0:
+             sample_key = next(iter(obj.keys()))
+             sample_value = obj[sample_key]
+             logger.info(f"[PROC] pubmedqa_map sample value type: {type(sample_value)}")
+             if isinstance(sample_value, dict):
+                 logger.info(f"[PROC] pubmedqa_map sample value keys: {list(sample_value.keys())}")
+
+     # Iterate items in whichever shape the file uses
+     def iter_items():
+         try:
+             if isinstance(obj, list):
+                 for it in obj:
+                     if isinstance(it, dict):
+                         yield it
+                     else:
+                         logger.warning(f"[PROC] pubmedqa_map skipping non-dict list item: {type(it)}")
+             elif isinstance(obj, dict):
+                 qs, cs, ans = obj.get("question"), obj.get("context"), obj.get("answer")
+                 if isinstance(qs, list) and isinstance(cs, list) and isinstance(ans, list):
+                     for i in range(min(len(qs), len(cs), len(ans))):
+                         yield {"question": qs[i], "context": cs[i], "answer": ans[i]}
+                 else:
+                     # Handle case where values might be dictionaries or other objects
+                     for k, v in obj.items():
+                         if isinstance(v, dict):
+                             # If v is a dict, ensure it has the expected structure
+                             if "question" in v and "context" in v and "answer" in v:
+                                 yield v
+                             else:
+                                 # Try to map the keys to the expected structure
+                                 yield {
+                                     "question": v.get("question") or v.get("QUESTION") or str(k),
+                                     "context": v.get("context") or v.get("CONTEXT") or "",
+                                     "answer": v.get("answer") or v.get("ANSWER") or ""
+                                 }
+                         else:
+                             # If v is not a dict, create a simple structure
+                             yield {"question": str(k), "context": str(v) if v else "", "answer": ""}
+             else:
+                 logger.warning(f"[PROC] pubmedqa_map unexpected data type: {type(obj)}")
+         except Exception as e:
+             logger.error(f"[PROC] pubmedqa_map error in iter_items: {e}")
+             return
+
+     count = 0
+     for i, v in enumerate(iter_items(), start=1):
+         try:
+             # Ensure we have string values, convert if necessary
+             q_raw = v.get("question") or ""
+             c_raw = v.get("context") or ""
+             a_raw = v.get("answer") or ""
+
+             # Convert to string if not already
+             q = str(q_raw).strip() if q_raw else ""
+             c = str(c_raw).strip() if c_raw else ""
+             a = str(a_raw).strip() if a_raw else ""
+
+             instr = "Answer the biomedical question based on the context. Justify briefly."
+             user = f"Question: {q}\n\nContext:\n{c}" if c else f"Question: {q}"
+             out = a
+             rid = _hash_id("pubmedqa_map", i, len(q))
+
+             # Process the item
+             instr, user, out, applied = _apply_aug(instr, user, out, "pubmedqa_map", opts, paraphraser, stats)
+             _commit_row(writer, "pubmedqa_map", rid, "biomedical_qa", instr, user, out, opts, stats, applied, dedupe_seen=dedupe_seen)
+
+             # Handle expansion if enabled
+             if opts.get("expand", True):
+                 for (u_aug, o_aug, aug_tags) in _build_variants(user, out, paraphraser, opts, stats):
+                     rid_aug = f"{rid}-aug{random.randint(1000,9999)}"
+                     _commit_row(writer, "pubmedqa_map", rid_aug, "biomedical_qa",
+                                 instr, u_aug, o_aug, opts, stats, aug_tags, dedupe_seen=dedupe_seen)
+
+             # Increment count only on success
+             count += 1
+
+         except Exception as e:
+             logger.warning(f"[PROC] pubmedqa_map error processing item {i}: {e}, item: {v}")
+             continue
+
+         # Check sample limit
+         if sample_limit and count >= sample_limit:
+             break
+         if cb and i % 2000 == 0:
+             cb(min(0.9, 0.05 + i/120000), f"pubmedqa_map processed {i}")
+
+     if cb:
+         cb(0.95, f"pubmedqa_map done ({count})")
+     logger.info(f"[PROC] pubmedqa_map done count={count} written={stats['written']} dedup_skipped={stats['dedup_skipped']}")
+     return count
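
For reference, the shape of one JSONL row these processors emit via _commit_row/sft_row; the field values here are illustrative:

    # One base row from _proc_med_dialog, as written to the centralized JSONL:
    row = {
        "source": "icliniq",
        "id": "9f2c1a7d3b5e8c04",                     # 16-hex prefix from _hash_id
        "task": "medical_dialogue",
        "sft": {
            "instruction": "Answer the patient's question like a clinician. Be concise and safe.",
            "input": "I have had a mild headache for two days. What should I do?",
            "output": "Rest, hydrate, and consider paracetamol; see a doctor if it worsens.",
        },
        "meta": {"augmentations": ["base", "style_standardize"]},
    }
    # Augmented copies reuse the id with an "-augNNNN" suffix and carry their own tags.
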
utils/rag.py ADDED
@@ -0,0 +1,345 @@
+ # RAG-specific dataset processor
+ import json
+ import logging
+ import hashlib
+ import random
+ from typing import Dict, List, Tuple, Optional, Callable
+
+ from utils.schema import sft_row
+ from utils.llm import NvidiaClient, KeyRotator
+
+ # Logger
+ logger = logging.getLogger("rag_processor")
+ if not logger.handlers:
+     logger.setLevel(logging.INFO)
+     logger.addHandler(logging.StreamHandler())
+
+ def _hash_id(*parts) -> str:
+     """Generate a hash ID for RAG entries"""
+     h = hashlib.sha256()
+     for p in parts:
+         h.update(str(p).encode("utf-8"))
+     return h.hexdigest()[:16]
+
+ def _iter_json_or_jsonl(path: str):
+     """Iterate over JSON or JSONL files"""
+     with open(path, "r", encoding="utf-8") as f:
+         first = f.read(1)
+         f.seek(0)
+         if first == "[":
+             data = json.load(f)
+             for obj in data:
+                 yield obj
+         else:
+             for line in f:
+                 line = line.strip()
+                 if line:
+                     yield json.loads(line)
+
+ class RAGProcessor:
+     """Processes medical datasets into RAG-specific QCA (Question, Context, Answer) format"""
+
+     def __init__(self, nvidia_model: str):
+         self.nvidia_client = NvidiaClient(KeyRotator("NVIDIA_API"), nvidia_model)
+
+     def clean_conversational_content(self, text: str) -> str:
+         """Remove conversational elements and non-medical information using the NVIDIA model"""
+         if not text or len(text.strip()) < 10:
+             return text
+
+         prompt = f"""
+ You are a medical data cleaning expert. Clean the following text by:
+ 1. Remove conversational elements (greetings, pleasantries)
+ 2. Remove non-medical small talk and social interactions
+ 3. Keep only medically relevant information
+ 4. Preserve clinical facts, symptoms, diagnoses, treatments, and medical advice
+ 5. Maintain professional medical language
+ 6. Return only cleaned medical content, only plain text, no special characters, or formatting.
+
+ Text to clean:
+ {text}
+
+ Cleaned medical content:"""
+
+         try:
+             cleaned = self.nvidia_client.generate(
+                 prompt,
+                 temperature=0.1,
+                 max_tokens=min(1000, len(text) + 200)
+             )
+             return cleaned.strip() if cleaned else text
+         except Exception as e:
+             logger.warning(f"[RAG] Error cleaning text: {e}")
+             return text
+
+     def generate_context_from_qa(self, question: str, answer: str) -> str:
+         """Generate synthetic context from question and answer using the NVIDIA model"""
+         if not question or not answer:
+             return ""
+
+         prompt = f"""You are a medical knowledge expert. Given a medical question and its answer, generate a brief relevant medical context that would help someone understand the answer better. Write about 2 sentences that provide relevant background information. Use only plain text without any formatting or symbols.
+
+ Question: {question}
+
+ Answer: {answer}
+
+ Generate a concise medical context:"""
+
+         try:
+             context = self.nvidia_client.generate(
+                 prompt,
+                 temperature=0.2,
+                 max_tokens=200
+             )
+             return context.strip() if context else ""
+         except Exception as e:
+             logger.warning(f"[RAG] Error generating context: {e}")
+             return ""
+
+     def convert_to_qca_format(self, instruction: str, user_input: str, output: str) -> Tuple[str, str, str]:
+         """Convert SFT format to QCA (Question, Context, Answer) format"""
+         # Clean the content to remove conversational elements
+         cleaned_input = self.clean_conversational_content(user_input)
+         cleaned_output = self.clean_conversational_content(output)
+
+         # Extract question from user input
+         question = self.extract_question(cleaned_input)
+
+         # Extract or generate context
+         context = self.extract_context(cleaned_input, question, cleaned_output)
+
+         # Clean answer
+         answer = cleaned_output
+
+         return question, context, answer
+
+     def extract_question(self, user_input: str) -> str:
+         """Extract the main question from user input"""
+         if not user_input:
+             return ""
+
+         # Try to identify question patterns
+         lines = user_input.split('\n')
+         for line in lines:
+             line = line.strip()
+             if line.startswith('Question:') or line.startswith('Q:'):
+                 return line.replace('Question:', '').replace('Q:', '').strip()
+             elif '?' in line and len(line) > 10:
+                 return line
+
+         # If no clear question found, use the first meaningful line
+         for line in lines:
+             line = line.strip()
+             if len(line) > 10:
+                 return line
+
+         return user_input
+
+     def extract_context(self, user_input: str, question: str, answer: str) -> str:
+         """Extract context from user input or generate synthetic context"""
+         # Look for context in the original input
+         context_candidates = []
+         lines = user_input.split('\n')
+
+         for line in lines:
+             line = line.strip()
+             if (line.startswith('Context:') or
+                     line.startswith('Background:') or
+                     line.startswith('Information:') or
+                     (len(line) > 50 and not line.startswith('Question:') and '?' not in line)):
+                 context_candidates.append(line)
+
+         if context_candidates:
+             # Clean and combine context candidates
+             context = ' '.join(context_candidates)
+             context = self.clean_conversational_content(context)
+             if len(context) > 20:  # Ensure we have meaningful context
+                 return context
+
+         # Generate synthetic context if none found
+         if question and answer:
+             synthetic_context = self.generate_context_from_qa(question, answer)
+             if synthetic_context:
+                 return synthetic_context
+
+         return ""
+
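
A sketch of the QCA conversion on a single forum-style record. Strings are illustrative, and the cleaning and context-generation steps call the NVIDIA endpoint, so NVIDIA_API_* keys must be configured for the calls to do anything:

    from utils.rag import RAGProcessor

    proc = RAGProcessor(nvidia_model="meta/llama-3.1-8b-instruct")  # model name illustrative
    q, c, a = proc.convert_to_qca_format(
        "Answer the medical question based on the provided context.",
        "Hi doctor! I am 34 and have had heartburn after meals for a week. What could it be?",
        "Hello! It sounds like reflux; try smaller meals and antacids. Regards.",
    )
    # q -> the extracted question line; c -> extracted or synthesized context; a -> cleaned answer
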
+     def process_medical_dialog(self, source: str, path: str, writer, sample_limit: Optional[int],
+                                stats: Dict, progress_cb: Optional[Callable], dedupe_seen: set = None) -> int:
+         """Process medical dialogue datasets into RAG format"""
+         count = 0
+         written = 0
+
+         for i, obj in enumerate(_iter_json_or_jsonl(path), start=1):
+             try:
+                 instr_raw = obj.get("instruction") or "Answer the medical question based on the provided context."
+                 user_raw = obj.get("input") or ""
+                 out_raw = obj.get("output") or ""
+
+                 instr = str(instr_raw).strip()
+                 user = str(user_raw).strip()
+                 out = str(out_raw).strip()
+                 rid = _hash_id(source, i, len(user), len(out))
+
+                 # Convert to QCA format
+                 question, context, answer = self.convert_to_qca_format(instr, user, out)
+
+                 if not question or not answer:
+                     continue
+
+                 # Create RAG-specific instruction
+                 rag_instruction = "Answer the medical question based on the provided context. If the context is insufficient, provide the best available medical information."
+
+                 # Format user input as QCA
+                 if context:
+                     rag_user = f"Question: {question}\n\nContext: {context}"
+                 else:
+                     rag_user = f"Question: {question}"
+
+                 # Commit the RAG-formatted row
+                 if self._commit_rag_row(writer, source, rid, "rag_medical_qa",
+                                         rag_instruction, rag_user, answer,
+                                         stats, dedupe_seen=dedupe_seen):
+                     written += 1
+
+                 count += 1
+
+             except Exception as e:
+                 logger.warning(f"[RAG] {source} error processing item {i}: {e}")
+                 continue
+
+             if sample_limit and count >= sample_limit:
+                 break
+             if progress_cb and i % 1000 == 0:
+                 progress_cb(min(0.9, 0.05 + i/200000), f"{source}: processed {i} rows for RAG")
+
+         if progress_cb:
+             progress_cb(0.92, f"{source} RAG processing done ({count})")
+
+         logger.info(f"[RAG] {source} RAG processing done count={count} written={written}")
+         return count
+
+     def process_pubmedqa(self, source: str, path: str, writer, sample_limit: Optional[int],
+                          stats: Dict, progress_cb: Optional[Callable], dedupe_seen: set = None) -> int:
+         """Process PubMedQA datasets into RAG format"""
+         with open(path, "r", encoding="utf-8") as f:
+             data = json.load(f)
+
+         count = 0
+         written = 0
+
+         for k, v in data.items():
+             try:
+                 q_raw = v.get("QUESTION") or ""
+                 ctx_list = v.get("CONTEXTS") or []
+                 long_ans_raw = v.get("LONG_ANSWER") or ""
+                 final_raw = v.get("final_decision") or ""
+
+                 question = str(q_raw).strip() if q_raw else ""
+                 if isinstance(ctx_list, list):
+                     context = "\n".join(str(ctx) for ctx in ctx_list).strip()
+                 else:
+                     context = str(ctx_list).strip()
+                 answer = str(long_ans_raw).strip() if long_ans_raw else str(final_raw).strip()
+
+                 if not question or not answer:
+                     continue
+
+                 # Clean the content
+                 question = self.clean_conversational_content(question)
+                 context = self.clean_conversational_content(context)
+                 answer = self.clean_conversational_content(answer)
+
+                 # Generate context if missing
+                 if not context:
+                     context = self.generate_context_from_qa(question, answer)
+
+                 rid = str(k)
+                 rag_instruction = "Answer the biomedical question based on the provided context."
+
+                 if context:
+                     rag_user = f"Question: {question}\n\nContext: {context}"
+                 else:
+                     rag_user = f"Question: {question}"
+
+                 # Commit the RAG-formatted row
+                 if self._commit_rag_row(writer, source, rid, "rag_biomedical_qa",
+                                         rag_instruction, rag_user, answer,
+                                         stats, dedupe_seen=dedupe_seen):
+                     written += 1
+
+                 count += 1
+
+             except Exception as e:
+                 logger.warning(f"[RAG] {source} error processing item {k}: {e}")
+                 continue
+
+             if sample_limit and count >= sample_limit:
+                 break
+             if progress_cb and count % 1000 == 0:
+                 progress_cb(min(0.9, 0.05 + count/60000), f"{source} RAG processed {count}")
+
+         if progress_cb:
+             progress_cb(0.93, f"{source} RAG processing done ({count})")
+
+         logger.info(f"[RAG] {source} RAG processing done count={count} written={written}")
+         return count
+
+     def _commit_rag_row(self, writer, source: str, rid: str, task: str,
+                         instruction: str, user_input: str, output: str,
+                         stats: Dict, dedupe_seen: set = None) -> bool:
+         """Commit a RAG-formatted row to the writer"""
+         # Simple deduplication based on content hash
+         if dedupe_seen is not None:
+             content_hash = hashlib.md5(f"{user_input}{output}".encode()).hexdigest()
+             if content_hash in dedupe_seen:
+                 stats["dedup_skipped"] = stats.get("dedup_skipped", 0) + 1
+                 return False
+             dedupe_seen.add(content_hash)
+
+         meta = {"rag_processing": True, "format": "qca"}
+         row = sft_row(instruction, user_input, output, source=source, rid=rid, task=task, meta=meta)
+         writer.write(row)
+         stats["written"] = stats.get("written", 0) + 1
+         return True
+
+ def process_file_into_rag(
+     dataset_key: str,
+     input_path: str,
+     writer,
+     nvidia_model: str,
+     sample_limit: Optional[int],
+     seed: int,
+     progress_cb: Optional[Callable[[float, str], None]]
+ ) -> Tuple[int, Dict]:
+     """Main entry point for RAG processing"""
+     random.seed(seed)
+     stats = {
+         "written": 0,
+         "dedup_skipped": 0
+     }
+
+     logger.info(f"[RAG] Begin RAG processing dataset={dataset_key} sample_limit={sample_limit}")
+
+     # Initialize RAG processor
+     rag_processor = RAGProcessor(nvidia_model)
+     dedupe_seen = set()
+
+     key = dataset_key.lower()
+     if key in ("healthcaremagic", "icliniq"):
+         count = rag_processor.process_medical_dialog(
+             source=key, path=input_path, writer=writer,
+             sample_limit=sample_limit, stats=stats,
+             progress_cb=progress_cb, dedupe_seen=dedupe_seen
+         )
+     elif key in ("pubmedqa_l", "pubmedqa_u", "pubmedqa_map"):
+         count = rag_processor.process_pubmedqa(
+             source=key, path=input_path, writer=writer,
+             sample_limit=sample_limit, stats=stats,
+             progress_cb=progress_cb, dedupe_seen=dedupe_seen
+         )
+     else:
+         raise ValueError(f"Unknown dataset for RAG processing: {dataset_key}")
+
+     logger.info(f"[RAG] End RAG processing dataset={dataset_key} stats={stats}")
+     return count, stats
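
A minimal driver sketch for the RAG entry point, mirroring the SFT one (paths and model name are illustrative):

    from utils.rag import process_file_into_rag
    from utils.schema import CentralisedWriter

    writer = CentralisedWriter("rag.jsonl", "rag.csv")  # output paths illustrative
    count, stats = process_file_into_rag(
        "pubmedqa_l", "cache/hf/ori_pqal.json", writer,
        nvidia_model="meta/llama-3.1-8b-instruct",       # illustrative
        sample_limit=500, seed=42, progress_cb=None
    )
    writer.close()
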
utils/schema.py ADDED
@@ -0,0 +1,68 @@
+ # Centralized SFT writer (JSONL + CSV)
+ import csv
+ import orjson
+ from typing import Optional, Dict
+ import logging
+
+ # Logger
+ logger = logging.getLogger("schema")
+ if not logger.handlers:
+     logger.setLevel(logging.INFO)
+     logger.addHandler(logging.StreamHandler())
+
+ def sft_row(instruction: str, user_input: str, output: str, source: str, rid: str, task: str, meta: Optional[dict] = None):
+     return {
+         "source": source,
+         "id": rid,
+         "task": task,
+         "sft": {
+             "instruction": instruction,
+             "input": user_input,
+             "output": output
+         },
+         "meta": meta or {}
+     }
+
+ def is_valid_row(row: Dict, max_chars: int = 20000) -> bool:
+     s = row.get("sft", {})
+     instr = s.get("instruction", "")
+     inp = s.get("input", "")
+     out = s.get("output", "")
+     # basic sanity: non-empty input OR output; cap extremes
+     if not (inp or out): return False
+     if any(len(x) > max_chars for x in (instr, inp, out)): return False
+     return True
+
+ class CentralisedWriter:
+     """Streams JSONL + CSV in parallel to stay memory-safe."""
+     def __init__(self, jsonl_path: str, csv_path: str):
+         self.jsonl_fp = open(jsonl_path, "wb")
+         self.csv_fp = open(csv_path, "w", newline="", encoding="utf-8")
+         self.csv_wr = csv.DictWriter(self.csv_fp, fieldnames=["instruction","input","output","source","id","task"])
+         self.csv_wr.writeheader()
+
+     def write(self, row: dict):
+         if not is_valid_row(row):
+             s = row.get("sft", {})
+             logger.warning(
+                 f"[WRITER] Skipping invalid row id={row.get('id')} "
+                 f"(len instr={len(s.get('instruction',''))}, input={len(s.get('input',''))}, output={len(s.get('output',''))})"
+             )
+             return
+         self.jsonl_fp.write(orjson.dumps(row))
+         self.jsonl_fp.write(b"\n")
+         s = row["sft"]
+         self.csv_wr.writerow({
+             "instruction": s.get("instruction",""),
+             "input": s.get("input",""),
+             "output": s.get("output",""),
+             "source": row.get("source",""),
+             "id": row.get("id",""),
+             "task": row.get("task","")
+         })
+
+     def close(self):
+         try:
+             self.jsonl_fp.close()
+         finally:
+             self.csv_fp.close()
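
A minimal round-trip sketch of the writer: one valid row lands in both output files, while rows failing is_valid_row are logged and skipped (file names illustrative):

    from utils.schema import CentralisedWriter, sft_row

    w = CentralisedWriter("out.jsonl", "out.csv")
    w.write(sft_row(
        "Answer the question.", "What is hypertension?",
        "Persistently elevated blood pressure.",
        source="demo", rid="0001", task="biomedical_qa"
    ))
    w.write(sft_row("Answer.", "", "", source="demo", rid="0002", task="biomedical_qa"))  # skipped: empty
    w.close()  # one JSONL line plus one CSV row (after the header)
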
utils/token.py ADDED
@@ -0,0 +1,107 @@
+ # Google Drive OAuth credential/token refresher
+ import os, json, logging
+ from typing import Optional
+ from google.oauth2.credentials import Credentials
+ from google_auth_oauthlib.flow import Flow
+ from google.auth.transport.requests import Request
+
+ logger = logging.getLogger("token")
+ if not logger.handlers:
+     logger.setLevel(logging.INFO)
+     handler = logging.StreamHandler()
+     logger.addHandler(handler)
+
+ SCOPES = ["https://www.googleapis.com/auth/drive.file"]
+ TOKEN_FILE = os.getenv("GDRIVE_TOKEN_FILE", "cache/secrets/gdrive_token.json")
+
+ def _load_oauth_client_web():
+     cfg_env = os.getenv("GDRIVE_CREDENTIALS_JSON")
+     if not cfg_env:
+         return None
+     try:
+         cfg = json.loads(cfg_env)
+         return cfg.get("web")
+     except Exception as e:
+         logger.error(f"❌ Failed to parse GDRIVE_CREDENTIALS_JSON: {e}")
+         return None
+
+ def _ensure_dirs():
+     base = os.path.dirname(TOKEN_FILE)
+     if base and not os.path.exists(base):
+         os.makedirs(base, exist_ok=True)
+
+ def get_credentials() -> Optional[Credentials]:
+     # 1) Token file
+     if os.path.exists(TOKEN_FILE):
+         try:
+             with open(TOKEN_FILE, "r", encoding="utf-8") as f:
+                 data = json.load(f)
+             creds = Credentials.from_authorized_user_info(data, scopes=SCOPES)
+             if creds and creds.expired and creds.refresh_token:
+                 creds.refresh(Request())
+                 logger.info("🔄 Refreshed access token from token file")
+             return creds
+         except Exception as e:
+             logger.warning(f"⚠️ Failed to load token file: {e}")
+
+     # 2) Refresh token in env
+     refresh = os.getenv("GDRIVE_REFRESH_TOKEN")
+     web = _load_oauth_client_web()
+     if refresh and web:
+         creds = Credentials(
+             None,
+             refresh_token=refresh,
+             token_uri="https://oauth2.googleapis.com/token",
+             client_id=web.get("client_id"),
+             client_secret=web.get("client_secret"),
+             scopes=SCOPES,
+         )
+         if creds and (creds.expired or not creds.valid):
+             try:
+                 creds.refresh(Request())
+                 logger.info("🔄 Refreshed access token from env refresh token")
+             except Exception as e:
+                 logger.warning(f"⚠️ Refresh with env token failed: {e}")
+         return creds
+
+     # 3) Nothing available
+     return None
+
+ def build_auth_url(redirect_uri: str) -> str:
+     web = _load_oauth_client_web()
+     if not web:
+         raise RuntimeError("GDRIVE_CREDENTIALS_JSON missing or invalid ('web' section required)")
+     flow = Flow.from_client_config({"web": web}, scopes=SCOPES, redirect_uri=redirect_uri)
+     auth_url, _ = flow.authorization_url(
+         prompt="consent",
+         access_type="offline",
+         include_granted_scopes="true"
+     )
+     return auth_url
+
+ def exchange_code(code: str, redirect_uri: str) -> Credentials:
+     web = _load_oauth_client_web()
+     if not web:
+         raise RuntimeError("GDRIVE_CREDENTIALS_JSON missing or invalid ('web' section required)")
+     flow = Flow.from_client_config({"web": web}, scopes=SCOPES, redirect_uri=redirect_uri)
+     flow.fetch_token(code=code)
+     creds: Credentials = flow.credentials
+
+     info = {
+         "token": creds.token,
+         "refresh_token": creds.refresh_token,
+         "token_uri": "https://oauth2.googleapis.com/token",
+         "client_id": web.get("client_id"),
+         "client_secret": web.get("client_secret"),
+         "scopes": SCOPES,
+     }
+     _ensure_dirs()
+     with open(TOKEN_FILE, "w", encoding="utf-8") as f:
+         json.dump(info, f)
+     logger.info("✅ Saved Google refresh token to %s", TOKEN_FILE)
+
+     # also set env for current process
+     if creds.refresh_token:
+         os.environ["GDRIVE_REFRESH_TOKEN"] = creds.refresh_token
+
+     return creds
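
A one-time OAuth bootstrap sketch: build the consent URL, exchange the returned code, then rely on get_credentials() for silent refreshes afterwards. The redirect URI and code here are placeholders; the URI must match one registered on the OAuth client inside GDRIVE_CREDENTIALS_JSON:

    from utils.token import build_auth_url, exchange_code, get_credentials

    url = build_auth_url("http://localhost:7860/oauth2callback")  # illustrative redirect URI
    print("Visit:", url)                     # user consents and returns with ?code=...
    creds = exchange_code("PASTED_CODE", "http://localhost:7860/oauth2callback")
    assert get_credentials() is not None     # later runs refresh from the saved token file
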