Text Classification
Transformers
Safetensors
Chinese
ai-text-detection
ensemble
bert
roberta
qwen
lora
research
dataset
How to use LUCIFerace/enhanced-replica-model-pack with Transformers:

```python
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-classification", model="LUCIFerace/enhanced-replica-model-pack")

# Load model directly
from transformers import AutoModel

model = AutoModel.from_pretrained("LUCIFerace/enhanced-replica-model-pack", dtype="auto")
```
# Dataset Overview

This repository bundles the final datasets selected for handoff, along with the upstream source materials and prompt assets needed to trace where those datasets came from.
## Data Layers

The bundled data is organized into two layers:

- `data/source-materials/`
  - Human text source pools
  - AI-generated text source pools
  - Prompt assets used to generate the AI text pools
- `data/dataset/`
  - Five experiment-ready datasets that can be used directly by the packaged scripts
  - One central manifest at `data/dataset/90_manifests/dataset_manifests.json`
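
The central manifest can be inspected without knowing its schema in advance. The following is a minimal sketch, assuming the working directory is the repository root. `load_manifest` and `summarize` are hypothetical helpers, and because the manifest's internal fields are not documented here, the code only reports the top-level structure.

```python
import json
from pathlib import Path

# Path taken from the layer description above, relative to the repository root.
MANIFEST_PATH = Path("data/dataset/90_manifests/dataset_manifests.json")


def load_manifest(path=MANIFEST_PATH):
    """Parse the manifest JSON and return it unchanged (hypothetical helper)."""
    with Path(path).open(encoding="utf-8") as fh:
        return json.load(fh)


def summarize(manifest):
    """Return sorted top-level keys for a dict, or the entry count for a list."""
    if isinstance(manifest, dict):
        return sorted(manifest)
    if isinstance(manifest, list):
        return len(manifest)
    return type(manifest).__name__


if __name__ == "__main__":
    if MANIFEST_PATH.exists():
        print(summarize(load_manifest()))
    else:
        print(f"manifest not found at {MANIFEST_PATH}")
```

Printing the summary first is a cheap way to learn which keys the manifest actually exposes before writing code that depends on them.
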
## Source Materials

`data/source-materials/` keeps the upstream materials in a delivery-friendly layout:

- `human-core-pool`
  - The main manually collected human text pool
  - This is the core human-side source used to build the downstream human datasets
- `human-recovery-pool-v2`
  - High-confidence human texts recovered from the quarantine branch, version 2
  - Used as a supplement to the main human pool
- `human-recovery-pool-v3`
  - High-confidence human texts recovered from the quarantine branch, version 3
  - Used as an additional supplement to the main human pool
- `ai-generated-standard`
  - AI-generated texts under the standard generation style
  - Keeps the original topic/subtopic/prompt tree for traceability
- `ai-generated-natural-v1`
  - AI-generated texts under the more natural writing-style branch
  - Also keeps the original topic/subtopic/prompt tree
- `prompts-standard`
  - Prompt files corresponding to the standard AI generation branch
- `prompts-natural-v1`
  - Prompt files corresponding to the natural-style AI generation branch

Prompt files remain as `.txt` in the packaged data tree so the repository keeps documentation in `docs/` while the data area stays asset-oriented.
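
Because the prompt assets are plain `.txt` files, a prompt branch can be enumerated with the standard library alone. The sketch below assumes each branch (for example `prompts-standard`) sits directly under `data/source-materials/` and nests its files in subdirectories. The exact depth of the topic/subtopic tree is not stated here, so the search is recursive. `list_prompt_files` is a hypothetical helper.

```python
from pathlib import Path

# Assumed location of the source-material branches, relative to the repository root.
SOURCE_ROOT = Path("data/source-materials")


def list_prompt_files(branch, root=SOURCE_ROOT):
    """Return sorted POSIX-style paths of every .txt file under a branch.

    Paths are relative to the branch directory, so any topic/subtopic
    nesting stays visible. A missing branch yields an empty list.
    """
    branch_dir = Path(root) / branch
    if not branch_dir.is_dir():
        return []
    return sorted(
        p.relative_to(branch_dir).as_posix()
        for p in branch_dir.rglob("*.txt")
        if p.is_file()
    )


if __name__ == "__main__":
    for rel_path in list_prompt_files("prompts-standard"):
        print(rel_path)
```
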
## Experiment-Ready Datasets

`data/dataset/` includes five packaged datasets:

- `DS04_Human_pools_merged_v1`
  - Pure human text pool
  - Works as the main human-side source dataset
  - All records are kept in `train.jsonl`
- `DS11_Generated_AI_v1`
  - Standard-style AI text pool
  - Pairs naturally with DS04 when building a standard human-vs-AI setting
  - All records are kept in `train.jsonl`
- `DS12_Generated_AI_natural_v1`
  - Natural-style AI text pool
  - Used for the harder, more natural writing branch
  - All records are kept in `train.jsonl`
- `DS06_External_core_balanced_v1`
  - Balanced experiment set built from DS04 and DS11
  - Includes `train/dev/test` and is suitable for direct evaluation and cross-domain experiments
- `DS07_External_long_v1`
  - Balanced experiment set built from DS04 and DS12
  - Includes `train/dev/test` and emphasizes the natural-style branch
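
To illustrate what "balanced" with `train/dev/test` can mean when a human pool is paired with an AI pool, here is a minimal sketch. It is not the procedure used to build DS06 or DS07, which is not described here. The label convention (0 = human, 1 = AI), the `text` and `label` field names, the 80/10/10 ratios, and the seed are all assumptions made for illustration, and `build_balanced_splits` is a hypothetical helper.

```python
import random


def build_balanced_splits(human, ai, ratios=(0.8, 0.1, 0.1), seed=0):
    """Downsample both pools to the same size, then split each class.

    Assumptions (not from the README): label 0 = human, label 1 = AI,
    records shaped as {"text": ..., "label": ...}, 80/10/10 split.
    """
    rng = random.Random(seed)
    n = min(len(human), len(ai))  # size of the smaller pool
    n_train = int(n * ratios[0])
    n_dev = int(n * ratios[1])
    splits = {"train": [], "dev": [], "test": []}
    for label, pool in ((0, human), (1, ai)):
        sample = rng.sample(list(pool), n)  # equal-sized draw without replacement
        parts = {
            "train": sample[:n_train],
            "dev": sample[n_train:n_train + n_dev],
            "test": sample[n_train + n_dev:],
        }
        for name, texts in parts.items():
            splits[name].extend({"text": t, "label": label} for t in texts)
    for rows in splits.values():
        rng.shuffle(rows)  # interleave the two classes within each split
    return splits
```

Splitting each class separately before merging keeps every split balanced, which a single shuffle-then-cut over the merged pool would not guarantee.
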
Each dataset directory keeps its own `train/dev/test.jsonl`, `manifest.json`, and `check_noise.py`, while the repository-level manifest provides one portable entry point for script loading.
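
A dataset directory laid out this way can be read with a few lines of standard-library code. The sketch below assumes the split files are named `train.jsonl`, `dev.jsonl`, and `test.jsonl` and hold one JSON object per line. The record fields are not specified here, so rows are returned as plain dictionaries. `read_jsonl` and `load_splits` are hypothetical helpers, not the packaged scripts.

```python
import json
from pathlib import Path

# Assumed dataset root, relative to the repository root.
DATASET_ROOT = Path("data/dataset")


def read_jsonl(path):
    """Yield one parsed JSON object per non-empty line."""
    with Path(path).open(encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if line:
                yield json.loads(line)


def load_splits(dataset_name, root=DATASET_ROOT):
    """Load whichever of the train/dev/test files exist for one dataset."""
    dataset_dir = Path(root) / dataset_name
    splits = {}
    for split in ("train", "dev", "test"):
        path = dataset_dir / f"{split}.jsonl"
        if path.exists():
            splits[split] = list(read_jsonl(path))
    return splits


if __name__ == "__main__":
    for name, rows in load_splits("DS06_External_core_balanced_v1").items():
        print(name, len(rows))
```

Skipping missing files lets the same loader serve both the pool datasets, where all records are kept in `train.jsonl`, and the balanced sets with three splits.
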
## What Is Intentionally Not Bundled

- Raw public benchmark downloads such as NLPCC, HC3, CLTS, or Zhihu RLHF source packages
- Processed datasets that depend on the excluded public-source branches
- The deprecated `DS10` branch

This repository is meant to be a focused research asset pack, not a mirror of every intermediate or publicly downloadable dataset used during exploration.