# Dataset Overview
This repository bundles the final datasets that were selected and kept for handoff, along with the upstream source materials and prompt assets needed to trace where those datasets came from.
## Data Layers
The bundled data is organized into two layers:
- `data/source-materials/`
  - Human text source pools
  - AI-generated text source pools
  - Prompt assets used to generate the AI text pools
- `data/dataset/`
  - Five experiment-ready datasets that can be used directly by the packaged scripts
  - One central manifest at `data/dataset/90_manifests/dataset_manifests.json`
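
As a starting point for scripts that consume the central manifest, here is a minimal sketch that loads the file and lists its top-level entries. Only the manifest path comes from this README. The manifest schema is not documented here, so the helpers `load_central_manifest` and `summarize_manifest` are hypothetical and make no assumption about key names.

```python
import json
from pathlib import Path

# Path as stated in this README; everything else below is illustrative.
MANIFEST_RELPATH = Path("data/dataset/90_manifests/dataset_manifests.json")


def load_central_manifest(repo_root="."):
    """Parse the central manifest and return the decoded JSON value."""
    manifest_path = Path(repo_root) / MANIFEST_RELPATH
    with manifest_path.open(encoding="utf-8") as fh:
        return json.load(fh)


def summarize_manifest(manifest):
    """List top-level entries without assuming a particular schema."""
    if isinstance(manifest, dict):
        return sorted(str(key) for key in manifest)
    if isinstance(manifest, list):
        return [f"item[{i}]" for i in range(len(manifest))]
    # Scalars are unlikely for a manifest, but report the type rather than fail.
    return [type(manifest).__name__]
```

Calling `summarize_manifest(load_central_manifest("."))` from the repository root is a quick way to see what the manifest exposes before writing loader code against it.
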
## Source Materials
`data/source-materials/` keeps the upstream materials in a delivery-friendly layout:
- `human-core-pool`
  - The main manually collected human text pool
  - This is the core human-side source used to build the downstream human datasets
- `human-recovery-pool-v2`
  - High-confidence human texts recovered from the quarantine branch, version 2
  - Used as a supplement to the main human pool
- `human-recovery-pool-v3`
  - High-confidence human texts recovered from the quarantine branch, version 3
  - Used as an additional supplement to the main human pool
- `ai-generated-standard`
  - AI-generated texts under the standard generation style
  - Keeps the original topic/subtopic/prompt tree for traceability
- `ai-generated-natural-v1`
  - AI-generated texts under the more natural writing-style branch
  - Also keeps the original topic/subtopic/prompt tree
- `prompts-standard`
  - Prompt files corresponding to the standard AI generation branch
- `prompts-natural-v1`
  - Prompt files corresponding to the natural-style AI generation branch

Prompt files remain as `.txt` in the packaged data tree so the repository keeps documentation in `docs/` while the data area stays asset-oriented.
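
To inspect the prompt assets, the following sketch indexes every `.txt` file under a prompt directory by its parent folder. The exact depth and naming of the topic/subtopic tree is not specified in this README, so the hypothetical helper `index_prompt_files` simply groups files by whatever relative directory they sit in.

```python
from collections import defaultdict
from pathlib import Path


def index_prompt_files(prompt_root):
    """Group .txt prompt files by their directory, relative to prompt_root."""
    root = Path(prompt_root)
    index = defaultdict(list)
    # sorted() keeps the listing stable across platforms and runs.
    for path in sorted(root.rglob("*.txt")):
        rel = path.relative_to(root)
        index[rel.parent.as_posix()].append(rel.name)
    return dict(index)
```

For example, `index_prompt_files("data/source-materials/prompts-standard")` would return a dictionary whose keys are relative folder paths and whose values are the prompt file names found there.
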
## Experiment-Ready Datasets
`data/dataset/` includes five packaged datasets:
- `DS04_Human_pools_merged_v1`
  - Pure human text pool
  - Works as the main human-side source dataset
  - All records are kept in `train.jsonl`
- `DS11_Generated_AI_v1`
  - Standard-style AI text pool
  - Pairs naturally with DS04 when building a standard human-vs-AI setting
  - All records are kept in `train.jsonl`
- `DS12_Generated_AI_natural_v1`
  - Natural-style AI text pool
  - Used for the harder, more natural writing branch
  - All records are kept in `train.jsonl`
- `DS06_External_core_balanced_v1`
  - Balanced experiment set built from DS04 and DS11
  - Includes `train/dev/test` and is suitable for direct evaluation and cross-domain experiments
- `DS07_External_long_v1`
  - Balanced experiment set built from DS04 and DS12
  - Includes `train/dev/test` and emphasizes the natural-style branch
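
To illustrate what "balanced" can mean when pairing a human pool with an AI pool, here is a generic sketch that downsamples the larger pool and then splits the result. This is not the procedure used to build DS06 or DS07, which this README does not describe. The field names `text` and `label`, the split fractions, and both helper names are assumptions made for illustration.

```python
import random


def build_balanced_records(human_texts, ai_texts, seed=0):
    """Downsample the larger pool so both classes have equal size, then shuffle."""
    rng = random.Random(seed)
    n = min(len(human_texts), len(ai_texts))
    human = rng.sample(list(human_texts), n)
    ai = rng.sample(list(ai_texts), n)
    # "text" and "label" are hypothetical field names.
    records = [{"text": t, "label": "human"} for t in human]
    records += [{"text": t, "label": "ai"} for t in ai]
    rng.shuffle(records)
    return records


def split_records(records, dev_frac=0.1, test_frac=0.1):
    """Cut an already shuffled list into train/dev/test (not stratified per class)."""
    n = len(records)
    n_dev = int(n * dev_frac)
    n_test = int(n * test_frac)
    n_train = n - n_dev - n_test
    return {
        "train": records[:n_train],
        "dev": records[n_train:n_train + n_dev],
        "test": records[n_train + n_dev:],
    }
```

Fixing the seed keeps the sampled subset reproducible, which matters when the same pairing has to be regenerated for later comparisons.
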

Each dataset directory keeps its own split files (`train.jsonl`, plus `dev.jsonl` and `test.jsonl` for the datasets that include them), `manifest.json`, and `check_noise.py`, while the repository-level manifest provides one portable entry point for script loading.
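
A dataset directory can also be read directly, without going through the manifest. The sketch below loads whichever split files are present, so it works for the single-file pools as well as the three-way split sets. It assumes the split files are named `train.jsonl`, `dev.jsonl`, and `test.jsonl` and contain one JSON object per line. The record fields are not documented here, so the code does not inspect them, and `read_jsonl` and `load_dataset_dir` are hypothetical helpers.

```python
import json
from pathlib import Path

# Assumed split file stems; adjust if the packaged layout differs.
SPLIT_NAMES = ("train", "dev", "test")


def read_jsonl(path):
    """Yield one decoded JSON value per non-empty line."""
    with Path(path).open(encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if line:
                yield json.loads(line)


def load_dataset_dir(dataset_dir):
    """Return {split_name: [records]} for every split file that exists."""
    base = Path(dataset_dir)
    splits = {}
    for name in SPLIT_NAMES:
        split_path = base / f"{name}.jsonl"
        if split_path.is_file():
            splits[name] = list(read_jsonl(split_path))
    return splits
```

For instance, `load_dataset_dir("data/dataset/DS06_External_core_balanced_v1")` would return up to three lists of records, while a pool that ships only `train.jsonl` would yield a dictionary with a single `train` key.
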
## What Is Intentionally Not Bundled
- Raw public benchmark downloads such as NLPCC, HC3, CLTS, or Zhihu RLHF source packages
- Processed datasets that depend on the excluded public-source branches
- The deprecated `DS10` branch

This repository is meant to be a focused research asset pack, not a mirror of every intermediate or publicly downloadable dataset used during exploration.