# TEXT-AUTH: System Architecture Documentation
> TEXT-AUTH is an evidence-first, domain-aware AI text detection system
> designed around independent signals, calibrated aggregation, and
> explainability rather than black-box classification.
---
## Table of Contents
1. [System Overview](#system-overview)
2. [High-Level Architecture](#high-level-architecture)
3. [Layer-by-Layer Architecture](#layer-by-layer-architecture)
4. [Data Flow](#data-flow)
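5. [Technology Stack](#technology-stack)
6. [Deployment Architecture](#deployment-architecture)
7. [Performance Characteristics](#performance-characteristics)
8. [Security & Privacy](#security--privacy)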
---
## System Overview
**TEXT-AUTH** is an AI text detection system that combines six independent detection metrics with ensemble aggregation to estimate whether a text is synthetically generated, authentically written, or a hybrid of the two.
### Key Capabilities
- **Multi-Metric Analysis**: 6 independent detection metrics (Structural, Perplexity, Entropy, Semantic, Linguistic, Multi-Perturbation Stability)
- **Domain-Aware Calibration**: Adaptive thresholds for 16 text domains (Academic, Creative, Technical, etc.)
- **Ensemble Aggregation**: Confidence-weighted combination with uncertainty quantification
- **Sentence-Level Highlighting**: Visual feedback with probability scores
- **Comprehensive Reporting**: JSON and PDF reports with detailed analysis
### Design Principles
- **Modular Architecture**: Clean separation of concerns across layers
- **Fail-Safe Design**: Graceful degradation with fallback strategies
- **Parallel Processing**: Multi-threaded metric execution for performance
- **Domain Expertise**: Specialized thresholds calibrated per content type
## Why Multi-Metric Instead of a Single Classifier?
- Single classifiers tend to overfit stylistic artifacts
- LLMs adapt to detectors quickly
- Independent statistical signals decay more slowly
- Disagreement between metrics is itself evidence and feeds the uncertainty estimate
---
## High-Level Architecture
```mermaid
graph TB
subgraph "Presentation Layer"
UI[Web Interface/API]
end
subgraph "Application Layer"
ORCH[Detection Orchestrator]
ORCH --> |coordinates| PIPE[Processing Pipeline]
end
subgraph "Service Layer"
ENSEMBLE[Ensemble Classifier]
HIGHLIGHT[Text Highlighter]
REASON[Reasoning Generator]
REPORT[Report Generator]
end
subgraph "Processing Layer"
EXTRACT[Document Extractor]
TEXTPROC[Text Processor]
DOMAIN[Domain Classifier]
LANG[Language Detector]
end
subgraph "Metrics Layer"
STRUCT[Structural Metric]
PERP[Perplexity Metric]
ENT[Entropy Metric]
SEM[Semantic Metric]
LING[Linguistic Metric]
MPS[Multi-Perturbation Stability]
end
subgraph "Model Layer"
MANAGER[Model Manager]
REGISTRY[Model Registry]
CACHE[(Model Cache)]
end
subgraph "Configuration Layer"
CONFIG[Settings]
ENUMS[Enums]
SCHEMAS[Data Schemas]
CONSTANTS[Constants]
THRESHOLDS[Domain Thresholds]
end
UI --> ORCH
ORCH --> EXTRACT
ORCH --> TEXTPROC
ORCH --> DOMAIN
ORCH --> LANG
ORCH --> STRUCT
ORCH --> PERP
ORCH --> ENT
ORCH --> SEM
ORCH --> LING
ORCH --> MPS
ORCH --> ENSEMBLE
ENSEMBLE --> HIGHLIGHT
ENSEMBLE --> REASON
ENSEMBLE --> REPORT
STRUCT --> MANAGER
PERP --> MANAGER
ENT --> MANAGER
SEM --> MANAGER
LING --> MANAGER
MPS --> MANAGER
DOMAIN --> MANAGER
LANG --> MANAGER
MANAGER --> REGISTRY
MANAGER --> CACHE
ORCH --> CONFIG
ENSEMBLE --> THRESHOLDS
style UI fill:#e1f5ff
style ORCH fill:#fff3e0
style ENSEMBLE fill:#f3e5f5
style MANAGER fill:#e8f5e9
style CONFIG fill:#fce4ec
```
---
## Layer-by-Layer Architecture
### 1. Configuration Layer (`config/`)
The foundation layer providing enums, schemas, constants, and domain-specific thresholds.
```mermaid
graph LR
subgraph "Configuration Layer"
direction TB
ENUMS["enums.py<br/>Domain, Language, Script,<br/>ModelType, ConfidenceLevel"]
SCHEMAS["schemas.py<br/>ModelConfig, ProcessedText,<br/>MetricResult, EnsembleResult,<br/>DetectionResult"]
CONSTANTS["constants.py<br/>TextProcessingParams, MetricParams,<br/>EnsembleParams"]
THRESHOLDS["threshold_config.py<br/>DomainThresholds,<br/>16 Domain Configs,<br/>MetricThresholds"]
MODELCFG["model_config.py<br/>Model Registry, Model Groups,<br/>Default Weights"]
SETTINGS["settings.py<br/>App Settings, Paths,<br/>Feature Flags"]
end
ENUMS -.->|used by| SCHEMAS
ENUMS -.->|used by| THRESHOLDS
SCHEMAS -.->|used by| CONSTANTS
THRESHOLDS -.->|imports| ENUMS
MODELCFG -.->|imports| ENUMS
style ENUMS fill:#ffebee
style SCHEMAS fill:#fff3e0
style CONSTANTS fill:#e8f5e9
style THRESHOLDS fill:#e1f5ff
style MODELCFG fill:#f3e5f5
style SETTINGS fill:#fce4ec
```
**Key Components:**
- **enums.py**: Core enumerations (Domain, Language, Script, ModelType, ConfidenceLevel)
- **schemas.py**: Data classes for structured data exchange
- **constants.py**: Frozen dataclasses with hyperparameters for each metric
- **threshold_config.py**: Domain-specific thresholds for 16 domains
- **model_config.py**: Model registry with download priorities and configurations
- **settings.py**: Application settings with Pydantic validation
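
The frozen-dataclass and per-domain threshold pattern described above can be sketched as follows. This is a minimal illustration, not the project's actual `threshold_config.py`: the field names, the numeric values, and the `thresholds_for` helper are all hypothetical.

```python
from dataclasses import dataclass
from enum import Enum


class Domain(Enum):
    # Only three of the 16 domains are named in this document.
    ACADEMIC = "academic"
    CREATIVE = "creative"
    TECHNICAL = "technical"


@dataclass(frozen=True)
class MetricThresholds:
    # Hypothetical fields; the values used below are placeholders.
    synthetic_threshold: float
    authentic_threshold: float
    weight: float


@dataclass(frozen=True)
class DomainThresholds:
    structural: MetricThresholds
    perplexity: MetricThresholds


# Fallback used when a domain has no dedicated entry (assumed behaviour).
DEFAULT_THRESHOLDS = DomainThresholds(
    structural=MetricThresholds(0.60, 0.40, 0.18),
    perplexity=MetricThresholds(0.60, 0.40, 0.25),
)

THRESHOLD_REGISTRY = {
    Domain.ACADEMIC: DomainThresholds(
        structural=MetricThresholds(0.65, 0.35, 0.16),
        perplexity=MetricThresholds(0.62, 0.38, 0.27),
    ),
}


def thresholds_for(domain: Domain) -> DomainThresholds:
    """Return the domain's thresholds, or the default when none are registered."""
    return THRESHOLD_REGISTRY.get(domain, DEFAULT_THRESHOLDS)
```

Freezing the dataclasses means a threshold cannot be mutated at runtime by accident; changing a value requires editing the configuration itself.
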
---
### 2. Model Abstraction Layer (`models/`)
A conceptual model abstraction layer that the metrics use for centralized model loading, caching, and unified access.
```mermaid
graph TB
subgraph "Model Layer"
direction TB
MANAGER["Model Manager<br/>Singleton Pattern<br/>Lazy Loading"]
REGISTRY["Model Registry<br/>10 Model Configs<br/>Priority Groups"]
subgraph "Model Cache"
direction LR
GPT2["GPT-2<br/>548MB<br/>Perplexity/MPS"]
MINILM["MiniLM-L6-v2<br/>80MB<br/>Semantic"]
SPACY["spaCy sm<br/>13MB<br/>Linguistic"]
ROBERTA["RoBERTa<br/>500MB<br/>Domain Classifier"]
DISTIL["DistilRoBERTa<br/>330MB<br/>MPS Mask"]
XLM["XLM-RoBERTa<br/>1100MB<br/>Language Detection"]
end
STATS["Usage Statistics<br/>Tracking Performance Metrics"]
end
MANAGER -->|loads from| REGISTRY
MANAGER -->|manages| GPT2
MANAGER -->|manages| MINILM
MANAGER -->|manages| SPACY
MANAGER -->|manages| ROBERTA
MANAGER -->|manages| DISTIL
MANAGER -->|manages| XLM
MANAGER -->|tracks| STATS
REGISTRY -.->|defines| GPT2
REGISTRY -.->|defines| MINILM
REGISTRY -.->|defines| SPACY
style MANAGER fill:#e3f2fd
style REGISTRY fill:#f3e5f5
style STATS fill:#fff3e0
```
**Key Features:**
- **Lazy Loading**: Models loaded on-demand
- **Caching Strategy**: LRU cache with max 5 models
- **Usage Tracking**: Statistics for optimization
- **Priority Groups**: Essential, Extended, Optional
- **Total Size**: ~2.8GB for all models
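
A minimal sketch of how a singleton manager with lazy loading and a five-model LRU cache could be put together, using only the standard library. The class layout, the `register`/`get` method names, and the loader callables are assumptions for illustration; priority groups and real model loading are left out.

```python
from collections import OrderedDict
from threading import Lock


class ModelManager:
    """Process-wide singleton that loads models on demand and evicts the least recently used."""

    _instance = None
    _instance_lock = Lock()

    def __new__(cls):
        with cls._instance_lock:
            if cls._instance is None:
                instance = super().__new__(cls)
                instance.max_models = 5        # cache limit stated in this section
                instance.load_counts = {}      # simple usage statistics
                instance._cache = OrderedDict()
                instance._loaders = {}
                instance._lock = Lock()
                cls._instance = instance
        return cls._instance

    def register(self, name, loader):
        """Associate a model name with a zero-argument callable that loads it."""
        self._loaders[name] = loader

    def get(self, name):
        with self._lock:
            if name in self._cache:
                self._cache.move_to_end(name)  # mark as most recently used
                return self._cache[name]
            model = self._loaders[name]()      # lazy load on first request
            self.load_counts[name] = self.load_counts.get(name, 0) + 1
            self._cache[name] = model
            if len(self._cache) > self.max_models:
                self._cache.popitem(last=False)  # evict least recently used
            return model

    def cached_names(self):
        return list(self._cache)
```

With six registered models, requesting all of them in turn evicts the first one, and asking for it again triggers a second load.
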
---
### 3. Processing Layer (`processors/`)
Handles document extraction, text preprocessing, domain classification, and language detection.
```mermaid
graph TB
subgraph "Processing Layer"
direction TB
subgraph "Document Extraction"
EXTRACT[Document Extractor]
EXTRACT -->|PDF| PYPDF[PyMuPDF Primary]
EXTRACT -->|PDF| PDFPLUMB[pdfplumber Fallback]
EXTRACT -->|PDF| PYPDF2[PyPDF2 Fallback]
EXTRACT -->|DOCX| DOCX[python-docx]
EXTRACT -->|HTML| BS4[BeautifulSoup4]
EXTRACT -->|RTF| RTF[Basic Parser]
EXTRACT -->|TXT| TXT[Chardet Encoding]
end
subgraph "Text Processing"
TEXTPROC[Text Processor]
TEXTPROC --> CLEAN["Unicode Normalization<br/>URL/Email Removal<br/>Whitespace Cleaning"]
TEXTPROC --> SPLIT["Smart Sentence Splitting<br/>Abbreviation Handling<br/>Word Tokenization"]
TEXTPROC --> VALIDATE["Length Validation<br/>Quality Checks<br/>Statistics"]
end
subgraph "Domain Classification"
DOMAIN[Domain Classifier]
DOMAIN --> ZERO["Heuristic + optional model-assisted<br/>domain inference<br/>RoBERTa/DeBERTa"]
DOMAIN --> LABELS["16 Domain Labels<br/>Multi-Label Candidates"]
DOMAIN --> THRESH["Domain-Specific<br/>Threshold Selection"]
end
subgraph "Language Detection"
LANG[Language Detector]
LANG --> MODEL["XLM-RoBERTa<br/>Chunk-Based Analysis"]
LANG --> FALLBACK[langdetect Library]
LANG --> HEURISTIC["Script Detection<br/>Character Analysis"]
end
end
EXTRACT -->|ProcessedText| TEXTPROC
TEXTPROC -->|Cleaned Text| DOMAIN
TEXTPROC -->|Cleaned Text| LANG
style EXTRACT fill:#e8f5e9
style TEXTPROC fill:#fff3e0
style DOMAIN fill:#e1f5ff
style LANG fill:#f3e5f5
```
**Processing Pipeline:**
1. **Document Extraction**: Multi-format support with fallback strategies
2. **Text Cleaning**: Unicode normalization, noise removal, validation
3. **Domain Classification**: Zero-shot classification with confidence scores
4. **Language Detection**: Multi-strategy approach with script analysis
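
The first two pipeline steps can be illustrated with a short standard-library sketch: a fallback chain that tries extractors in order, and a cleaning pass that normalizes Unicode, strips URLs and email addresses, and collapses whitespace. The function names and regular expressions are assumptions, and the extractors are plain callables standing in for PyMuPDF, pdfplumber, and PyPDF2.

```python
import re
import unicodedata

URL_RE = re.compile(r"https?://\S+|www\.\S+")
EMAIL_RE = re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b")
WHITESPACE_RE = re.compile(r"\s+")


def extract_with_fallback(source, extractors):
    """Try each extractor in order and return the first non-empty text."""
    errors = []
    for extractor in extractors:
        try:
            text = extractor(source)
        except Exception as exc:  # a failing backend must not abort the chain
            errors.append(f"{extractor.__name__}: {exc}")
            continue
        if text and text.strip():
            return text
        errors.append(f"{extractor.__name__}: empty result")
    raise RuntimeError("all extractors failed: " + "; ".join(errors))


def clean_text(text):
    """Normalize Unicode, drop URLs and emails, and collapse whitespace."""
    text = unicodedata.normalize("NFKC", text)
    text = URL_RE.sub(" ", text)
    text = EMAIL_RE.sub(" ", text)
    return WHITESPACE_RE.sub(" ", text).strip()
```

Treating an empty result like a failure lets a later backend recover text that an earlier one silently missed, for example in a scanned or oddly encoded PDF.
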
---
### 4. Metrics Layer (`metrics/`)
Six independent detection metrics analyzing different text characteristics.
```mermaid
graph TB
subgraph "Metrics Layer"
direction TB
BASE["Base Metric<br/>Abstract Class<br/>Common Interface"]
subgraph "Statistical Metrics"
STRUCT["Structural Metric<br/>No ML Model<br/>Statistical Features"]
STRUCT --> SF1["Sentence Length Distribution<br/>Burstiness Score<br/>Readability"]
STRUCT --> SF2["N-gram Diversity<br/>Type-Token Ratio<br/>Repetition Patterns"]
end
subgraph "ML-Based Metrics"
PERP["Perplexity Metric<br/>GPT-2 Model<br/>Text Predictability"]
PERP --> PF1["Overall Perplexity<br/>Sentence-Level Perplexity<br/>Cross-Entropy"]
PERP --> PF2["Chunk Analysis<br/>Variance Scoring<br/>Normalization"]
ENT["Entropy Metric<br/>GPT-2 Tokenizer<br/>Randomness Analysis"]
ENT --> EF1["Character Entropy<br/>Word Entropy<br/>Token Entropy"]
ENT --> EF2["Token Diversity<br/>Sequence Unpredictability<br/>Pattern Detection"]
SEM["Semantic Metric<br/>MiniLM Embeddings<br/>Coherence Analysis"]
SEM --> SF3["Sentence Similarity<br/>Topic Consistency<br/>Coherence Score"]
SEM --> SF4["Repetition Detection<br/>Topic Drift<br/>Contextual Consistency"]
LING["Linguistic Metric<br/>spaCy NLP<br/>Grammar Analysis"]
LING --> LF1["POS Diversity<br/>POS Entropy<br/>Syntactic Complexity"]
LING --> LF2["Grammatical Patterns<br/>Writing Style<br/>Pattern Detection"]
MPS["Multi-Perturbation<br/>GPT-2 + DistilRoBERTa<br/>Stability Analysis"]
MPS --> MF1["Text Perturbation<br/>Likelihood Calculation<br/>Stability Score"]
MPS --> MF2["Curvature Analysis<br/>Chunk Stability<br/>Variance Scoring"]
end
end
BASE -.->|inherited by| STRUCT
BASE -.->|inherited by| PERP
BASE -.->|inherited by| ENT
BASE -.->|inherited by| SEM
BASE -.->|inherited by| LING
BASE -.->|inherited by| MPS
style BASE fill:#ffebee
style STRUCT fill:#e8f5e9
style PERP fill:#fff3e0
style ENT fill:#e1f5ff
style SEM fill:#f3e5f5
style LING fill:#fce4ec
style MPS fill:#fff9c4
```
**Metric Characteristics:**
| Metric | Model Required | Complexity | Typical Influence Range (Indicative) |
|--------|---------------|------------|--------------|
| Structural | ❌ | Low | 15-20% |
| Perplexity | GPT-2 | Medium | 20-27% |
| Entropy | GPT-2 Tokenizer | Medium | 13-17% |
| Semantic | MiniLM | Medium | 18-20% |
| Linguistic | spaCy | Medium | 12-16% |
| MPS | GPT-2 + DistilRoBERTa | High | 8-10% |
> *Actual weights are dynamically calibrated per domain and configuration.*
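
To make the common interface concrete, here is a minimal sketch of an abstract base metric and a model-free structural metric. Everything in it is an assumption for illustration: the `MetricResult` fields, the definition of burstiness as the coefficient of variation of sentence length, and the toy mapping from features to a probability are not taken from the project's code.

```python
import re
from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from statistics import mean, pstdev
from typing import Optional


@dataclass
class MetricResult:
    metric_name: str
    synthetic_probability: float
    confidence: float
    details: dict = field(default_factory=dict)
    error: Optional[str] = None


class BaseMetric(ABC):
    name = "base"

    @abstractmethod
    def compute(self, text: str) -> MetricResult:
        """Analyse the text and return a result in the shared format."""


class StructuralMetric(BaseMetric):
    name = "structural"

    def compute(self, text: str) -> MetricResult:
        sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
        words = re.findall(r"[A-Za-z']+", text.lower())
        if len(sentences) < 2 or not words:
            return MetricResult(self.name, 0.5, 0.0, error="text too short")

        lengths = [len(s.split()) for s in sentences]
        burstiness = pstdev(lengths) / mean(lengths)  # variation in sentence length
        ttr = len(set(words)) / len(words)            # type-token ratio

        # Toy scoring: uniform sentence lengths and low lexical diversity
        # push the score toward "synthetic". Real calibration is per domain.
        uniformity = 1.0 - min(burstiness, 1.0)
        probability = 0.5 * uniformity + 0.5 * (1.0 - ttr)
        return MetricResult(
            self.name,
            probability,
            confidence=min(len(sentences) / 10, 1.0),
            details={"burstiness": burstiness, "type_token_ratio": ttr},
        )
```

Returning an error-carrying result instead of raising keeps a short or malformed input from breaking the other metrics.
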
---
### 5. Service Layer (`services/`)
Coordinates ensemble aggregation, highlighting, reasoning generation, and orchestration.
```mermaid
graph TB
subgraph "Service Layer"
direction TB
subgraph "Orchestrator"
ORCH["Detection Orchestrator<br/>Pipeline Coordinator"]
ORCH --> PIPE["Processing Pipeline<br/>6-Step Execution"]
PIPE --> STEP1[1. Text Preprocessing]
PIPE --> STEP2[2. Language Detection]
PIPE --> STEP3[3. Domain Classification]
PIPE --> STEP4["4. Metric Execution<br/>Parallel/Sequential"]
PIPE --> STEP5[5. Ensemble Aggregation]
PIPE --> STEP6[6. Result Compilation]
end
subgraph "Ensemble Classifier"
ENSEMBLE["Ensemble Classifier<br/>Multi-Strategy Aggregation"]
ENSEMBLE --> METHOD1["Confidence Calibrated<br/>Sigmoid Weighting"]
ENSEMBLE --> METHOD2["Consensus Based<br/>Agreement Rewards"]
ENSEMBLE --> METHOD3["Domain Weighted<br/>Static Weights"]
ENSEMBLE --> METHOD4["Simple Average<br/>Fallback"]
ENSEMBLE --> CALC["Uncertainty Quantification<br/>Consensus Analysis<br/>Confidence Scoring"]
end
subgraph "Highlighter"
HIGHLIGHT["Text Highlighter<br/>Sentence-Level Analysis"]
HIGHLIGHT --> COLORS["4-Color System<br/>Authentic/Uncertain<br/>Hybrid/Synthetic"]
HIGHLIGHT --> SENTENCE["Sentence Ensemble<br/>Domain Adjustments<br/>Tooltip Generation"]
end
subgraph "Reasoning"
REASON["Reasoning Generator<br/>Explainable AI"]
REASON --> SUMMARY["Executive Summary<br/>Verdict Explanation"]
REASON --> INDICATORS["Key Indicators<br/>Metric Breakdown"]
REASON --> EVIDENCE["Supporting Evidence<br/>Contradicting Evidence"]
REASON --> RECOM["Recommendations<br/>Uncertainty Analysis"]
end
end
ORCH -->|coordinates| ENSEMBLE
ORCH -->|uses| HIGHLIGHT
ORCH -->|uses| REASON
ENSEMBLE -->|provides| HIGHLIGHT
ENSEMBLE -->|provides| REASON
style ORCH fill:#fff3e0
style ENSEMBLE fill:#e3f2fd
style HIGHLIGHT fill:#f3e5f5
style REASON fill:#e8f5e9
```
**Service Features:**
- **Parallel Execution**: ThreadPoolExecutor for metric computation
- **Ensemble Methods**: 4 aggregation strategies with fallbacks
- **Sentence Highlighting**: 4-category color system (Authentic/Uncertain/Hybrid/Synthetic)
- **Explainable AI**: Detailed reasoning with metric contributions
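
Parallel metric execution with graceful degradation might look roughly like the sketch below. It uses the standard `concurrent.futures.ThreadPoolExecutor`; the `run_metrics` function and its return shape are hypothetical, and metrics are modelled as plain callables that return a score.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def run_metrics(metrics, text, parallel=True, max_workers=6):
    """Run every metric on the text; collect failures instead of raising."""
    results, errors = {}, {}

    def record(name, call):
        try:
            results[name] = call()
        except Exception as exc:  # one failing metric must not sink the rest
            errors[name] = str(exc)

    if not parallel:
        for name, fn in metrics.items():
            record(name, lambda fn=fn: fn(text))
        return results, errors

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(fn, text): name for name, fn in metrics.items()}
        for future in as_completed(futures):
            # future.result() re-raises whatever the metric raised in its thread
            record(futures[future], future.result)
    return results, errors
```

The ensemble can then aggregate whatever is in `results` and report the contents of `errors` as warnings.
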
---
### 6. Reporter Layer (`reporter/`)
Generates comprehensive reports in multiple formats.
```mermaid
graph TB
subgraph "Reporter Layer"
direction TB
REPORT[Report Generator]
subgraph "JSON Report"
JSON[Structured JSON]
JSON --> META["Report Metadata<br/>Timestamp<br/>Version"]
JSON --> RESULTS["Overall Results<br/>Probabilities<br/>Confidence"]
JSON --> METRICS["Detailed Metrics<br/>Sub-metrics<br/>Weights"]
JSON --> REASONING["Detection Reasoning<br/>Evidence<br/>Recommendations"]
JSON --> HIGHLIGHT["Highlighted Sentences<br/>Color Classes<br/>Probabilities"]
JSON --> PERF["Performance Metrics<br/>Execution Times<br/>Warnings/Errors"]
end
subgraph "PDF Report"
PDF[Professional PDF]
PDF --> PAGE1["Page 1: Executive Summary<br/>Verdict, Stats, Reasoning"]
PDF --> PAGE2["Page 2: Content Analysis<br/>Domain, Metrics, Weights"]
PDF --> PAGE3[Page 3: Structural & Entropy]
PDF --> PAGE4[Page 4: Perplexity & Semantic]
PDF --> PAGE5[Page 5: Linguistic & MPS]
PDF --> PAGE6[Page 6: Recommendations]
STYLE[Premium Styling]
STYLE --> COLORS["Color Scheme<br/>Blue/Green/Red/Purple"]
STYLE --> TABLES["Professional Tables<br/>Charts, Metrics"]
STYLE --> LAYOUT["Multi-Page Layout<br/>Headers, Footers"]
end
end
REPORT -->|generates| JSON
REPORT -->|generates| PDF
PDF -->|uses| STYLE
style REPORT fill:#fff3e0
style JSON fill:#e8f5e9
style PDF fill:#e3f2fd
style STYLE fill:#f3e5f5
```
**Report Formats:**
- **JSON**: Machine-readable with complete data
- **PDF**: Human-readable with professional formatting
- **Charts**: Pie charts for probability distribution
- **Tables**: Metric contributions, detailed sub-metrics
- **Styling**: Color-coded, multi-page layout with branding
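
As a rough illustration of the JSON report's shape, the sketch below assembles the sections named in the diagram into one document. The key names, the function signature, and the placeholder version string are assumptions; the real schema may differ.

```python
import json
from datetime import datetime, timezone


def build_json_report(verdict, probabilities, metric_results, highlights,
                      version="0.0.0"):
    """Serialize one detection result; all key names here are hypothetical."""
    report = {
        "report_metadata": {
            "generated_at": datetime.now(timezone.utc).isoformat(),
            "version": version,
        },
        "overall_results": {"verdict": verdict, "probabilities": probabilities},
        "detailed_metrics": metric_results,
        "highlighted_sentences": highlights,
    }
    # ensure_ascii=False keeps non-English text readable in the output file
    return json.dumps(report, indent=2, ensure_ascii=False)
```
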
---
## Data Flow
### Complete Detection Pipeline
```mermaid
sequenceDiagram
participant User
participant Orchestrator
participant Processors
participant Metrics
participant Ensemble
participant Services
participant Reporter
User->>Orchestrator: analyze(text)
Note over Orchestrator: Step 1: Preprocessing
Orchestrator->>Processors: TextProcessor.process()
Processors-->>Orchestrator: ProcessedText
Note over Orchestrator: Step 2: Language Detection
Orchestrator->>Processors: LanguageDetector.detect()
Processors-->>Orchestrator: LanguageResult
Note over Orchestrator: Step 3: Domain Classification
Orchestrator->>Processors: DomainClassifier.classify()
Processors-->>Orchestrator: DomainPrediction
Note over Orchestrator: Step 4: Parallel Metric Execution
par Structural
Orchestrator->>Metrics: Structural.compute()
Metrics-->>Orchestrator: MetricResult
and Perplexity
Orchestrator->>Metrics: Perplexity.compute()
Metrics-->>Orchestrator: MetricResult
and Entropy
Orchestrator->>Metrics: Entropy.compute()
Metrics-->>Orchestrator: MetricResult
and Semantic
Orchestrator->>Metrics: Semantic.compute()
Metrics-->>Orchestrator: MetricResult
and Linguistic
Orchestrator->>Metrics: Linguistic.compute()
Metrics-->>Orchestrator: MetricResult
and MPS
Orchestrator->>Metrics: MPS.compute()
Metrics-->>Orchestrator: MetricResult
end
Note over Orchestrator: Step 5: Ensemble Aggregation
Orchestrator->>Ensemble: predict(metric_results, domain)
Ensemble-->>Orchestrator: EnsembleResult
Note over Orchestrator: Step 6: Services
Orchestrator->>Services: generate_highlights()
Services-->>Orchestrator: HighlightedSentences
Orchestrator->>Services: generate_reasoning()
Services-->>Orchestrator: DetailedReasoning
Orchestrator->>Reporter: generate_report()
Reporter-->>Orchestrator: Report Files
Orchestrator-->>User: DetectionResult
```
### Ensemble Aggregation Flow
```mermaid
graph TD
START[Metric Results] --> FILTER["Filter Valid Metrics<br/>Remove Errors"]
FILTER --> WEIGHTS["Get Domain Weights<br/>Base Weights"]
WEIGHTS --> METHOD{Primary Method?}
METHOD -->|Confidence Calibrated| CONF["Sigmoid Confidence<br/>Adjustment"]
METHOD -->|Consensus Based| CONS["Agreement<br/>Calculation"]
METHOD -->|Domain Weighted| DOMAIN["Static Domain<br/>Weights"]
CONF --> AGGREGATE[Weighted Aggregation]
CONS --> AGGREGATE
DOMAIN --> AGGREGATE
AGGREGATE --> NORMALIZE[Normalize to 1.0]
NORMALIZE --> CALC[Calculate Metrics]
CALC --> CONFIDENCE["Overall Confidence<br/>Base + Agreement<br/>+ Certainty + Quality"]
CALC --> UNCERTAINTY["Uncertainty Score<br/>Variance + Confidence<br/>+ Decision"]
CALC --> CONSENSUS["Consensus Level<br/>Std Dev Analysis"]
CONFIDENCE --> THRESHOLD["Apply Adaptive<br/>Threshold"]
UNCERTAINTY --> THRESHOLD
THRESHOLD --> VERDICT{Verdict}
VERDICT -->|Synthetic >= 0.6| SYNTH[Synthetically-Generated]
VERDICT -->|Authentic >= 0.6| AUTH[Authentically-Written]
VERDICT -->|Hybrid > 0.25| HYBRID[Hybrid]
VERDICT -->|Uncertain| UNC[Uncertain]
SYNTH --> REASON[Generate Reasoning]
AUTH --> REASON
HYBRID --> REASON
UNC --> REASON
REASON --> RESULT[EnsembleResult]
style START fill:#e8f5e9
style RESULT fill:#e3f2fd
style SYNTH fill:#ffebee
style AUTH fill:#e8f5e9
style HYBRID fill:#fff3e0
style UNC fill:#f5f5f5
```
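
The confidence-calibrated path through this flow can be sketched in a few lines. The verdict thresholds (0.6, 0.6, 0.25) come from the diagram; everything else is an assumption for illustration, including the result dictionary layout, the sigmoid steepness and midpoint, the consensus formula, and the base weights, which are simply picked from inside the indicative ranges in the metrics table.

```python
import math
from statistics import pstdev

LABELS = ("synthetic", "authentic", "hybrid")

# Illustrative base weights only; real weights are calibrated per domain.
BASE_WEIGHTS = {
    "structural": 0.18, "perplexity": 0.25, "entropy": 0.15,
    "semantic": 0.19, "linguistic": 0.14, "mps": 0.09,
}
UNCERTAIN = {"verdict": "Uncertain", "probabilities": None, "consensus": 0.0}


def confidence_weight(base, confidence, steepness=10.0, midpoint=0.5):
    """Scale a base weight by a sigmoid of the metric's confidence."""
    return base / (1.0 + math.exp(-steepness * (confidence - midpoint)))


def aggregate(results, base_weights=BASE_WEIGHTS):
    # 1. Filter out metrics that reported an error.
    valid = {n: r for n, r in results.items() if not r.get("error")}
    if not valid:
        return dict(UNCERTAIN)

    # 2. Confidence-calibrated weights, with a simple average as fallback.
    weights = {n: confidence_weight(base_weights.get(n, 0.0), r["confidence"])
               for n, r in valid.items()}
    total = sum(weights.values())
    if total <= 0.0:
        weights = {n: 1.0 for n in valid}
        total = float(len(valid))

    # 3. Weighted aggregation, then normalize so the probabilities sum to 1.0.
    probs = {label: sum(weights[n] * valid[n][label] for n in valid) / total
             for label in LABELS}
    norm = sum(probs.values())
    if norm <= 0.0:
        return dict(UNCERTAIN)
    probs = {label: p / norm for label, p in probs.items()}

    # 4. Consensus: a small spread of synthetic scores means high agreement.
    spread = pstdev([r["synthetic"] for r in valid.values()])
    consensus = 1.0 - min(1.0, 2.0 * spread)

    # 5. Verdict thresholds as labelled in the flow diagram.
    if probs["synthetic"] >= 0.6:
        verdict = "Synthetically-Generated"
    elif probs["authentic"] >= 0.6:
        verdict = "Authentically-Written"
    elif probs["hybrid"] > 0.25:
        verdict = "Hybrid"
    else:
        verdict = "Uncertain"
    return {"verdict": verdict, "probabilities": probs, "consensus": consensus}
```

Because the sigmoid shrinks the weight of low-confidence metrics toward zero, a metric that is unsure of itself contributes little even when its base weight is high.
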
---
## Technology Stack
### Core Technologies
```mermaid
graph LR
subgraph "Language & Runtime"
PYTHON[Python 3.10+]
CONDA[Conda Environment]
end
subgraph "ML Frameworks"
TORCH[PyTorch]
HF[HuggingFace Transformers]
SPACY[spaCy]
SKLEARN[scikit-learn]
end
subgraph "NLP Models"
GPT2["GPT-2<br/>Perplexity/MPS"]
MINILM["MiniLM-L6-v2<br/>Semantic"]
ROBERTA["RoBERTa<br/>Domain Classify"]
DISTIL["DistilRoBERTa<br/>MPS Mask"]
XLM["XLM-RoBERTa<br/>Language Detect"]
SPACYMODEL["en_core_web_sm<br/>Linguistic"]
end
subgraph "Document Processing"
PYMUPDF[PyMuPDF]
PDFPLUMBER[pdfplumber]
PYPDF2[PyPDF2]
DOCX[python-docx]
BS4[BeautifulSoup4]
end
subgraph "Utilities"
NUMPY[NumPy]
PYDANTIC[Pydantic]
LOGURU[Loguru]
REPORTLAB[ReportLab]
end
PYTHON --> TORCH
TORCH --> HF
HF --> GPT2
HF --> MINILM
HF --> ROBERTA
HF --> DISTIL
HF --> XLM
PYTHON --> SPACY
SPACY --> SPACYMODEL
style PYTHON fill:#306998
style TORCH fill:#ee4c2c
style HF fill:#ff6f00
style SPACY fill:#09a3d5
```
### Dependencies Summary
| Category | Libraries | Purpose |
|----------|-----------|---------|
| **ML Core** | PyTorch, Transformers, spaCy | Model execution, NLP |
| **Document** | PyMuPDF, pdfplumber, python-docx | Multi-format extraction |
| **Analysis** | NumPy, scikit-learn | Numerical computation |
| **Validation** | Pydantic | Data validation |
| **Logging** | Loguru | Structured logging |
| **Reporting** | ReportLab | PDF generation |
---
## Deployment Architecture
```mermaid
graph TB
subgraph "Deployment Options"
direction TB
subgraph "Standalone Application"
SCRIPT[Python Scripts]
end
subgraph "Web Application"
FASTAPI[FastAPI Server]
end
subgraph "API Service"
REST[REST API Endpoints]
BATCH[Batch Processing]
ASYNC[Async Workers]
end
subgraph "Infrastructure"
DOCKER[Docker Container]
GPU["GPU Support<br/>Optional"]
STORAGE["Model Cache<br/>2.8GB"]
end
end
FASTAPI --> DOCKER
REST --> DOCKER
DOCKER --> GPU
DOCKER --> STORAGE
style FASTAPI fill:#e3f2fd
style DOCKER fill:#2496ed
style GPU fill:#76b900
```
### System Requirements
- **Python**: 3.10+
- **RAM**: 8GB minimum, 16GB recommended
- **Storage**: 5GB (models + data)
- **GPU**: Optional (CUDA/MPS for faster inference)
- **CPU**: 4+ cores for parallel execution
---
## Performance Characteristics
### Execution Modes
```mermaid
graph LR
subgraph "Sequential Mode"
S1[Metric 1] --> S2[Metric 2]
S2 --> S3[Metric 3]
S3 --> S4[Metric 4]
S4 --> S5[Metric 5]
S5 --> S6[Metric 6]
S6 --> SRESULT[~15-25s]
end
subgraph "Parallel Mode"
P1[Metric 1]
P2[Metric 2]
P3[Metric 3]
P4[Metric 4]
P5[Metric 5]
P6[Metric 6]
P1 --> PRESULT[~8-12s]
P2 --> PRESULT
P3 --> PRESULT
P4 --> PRESULT
P5 --> PRESULT
P6 --> PRESULT
end
style SRESULT fill:#ffebee
style PRESULT fill:#e8f5e9
```
### Metric Execution Times
| Metric | Avg Time | Complexity | Model Size |
|--------|----------|------------|------------|
| Structural | 0.5-1s | Low | 0MB |
| Perplexity | 2-4s | Medium | 548MB |
| Entropy | 1-2s | Medium | ~50MB (shared) |
| Semantic | 3-5s | Medium | 80MB |
| Linguistic | 2-3s | Medium | 13MB |
| MPS | 5-10s | High | 878MB (GPT-2 + DistilRoBERTa) |

- **Total Sequential**: ~15-25 seconds
- **Total Parallel**: ~8-12 seconds (limited by slowest metric)
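
The relationship between the two totals is simple arithmetic: sequential time is the sum of the per-metric times, while parallel time cannot drop below the slowest single metric. The sketch below recomputes both bounds from the table above; the helper names are hypothetical, and it does not model thread scheduling or contention on shared models, which would account for a parallel figure above the ideal floor.

```python
# (low, high) seconds per metric, copied from the execution-time table.
METRIC_SECONDS = {
    "structural": (0.5, 1.0),
    "perplexity": (2.0, 4.0),
    "entropy": (1.0, 2.0),
    "semantic": (3.0, 5.0),
    "linguistic": (2.0, 3.0),
    "mps": (5.0, 10.0),
}


def sequential_range(times):
    """Sequential execution adds up every metric's time."""
    return (sum(lo for lo, _ in times.values()),
            sum(hi for _, hi in times.values()))


def parallel_floor(times):
    """Ideal parallel execution is bounded below by the slowest metric."""
    return (max(lo for lo, _ in times.values()),
            max(hi for _, hi in times.values()))
```
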
---
## Security & Privacy
### Data Handling
```mermaid
graph TD
INPUT[Text Input] --> PROCESS[Processing]
PROCESS --> MEMORY[In-Memory Only]
MEMORY --> ANALYSIS[Analysis]
ANALYSIS --> CLEANUP[Auto Cleanup]
MODELS[Model Cache] -.->|Read Only| ANALYSIS
REPORTS[Optional Reports] --> STORAGE[Local Storage Only]
CLEANUP --> DISCARD[Data Discarded]
style INPUT fill:#e3f2fd
style MEMORY fill:#fff3e0
style CLEANUP fill:#e8f5e9
style DISCARD fill:#ffebee
```
### Security Features
- ✅ **No External Data Transmission**: All processing local
- ✅ **No Data Persistence**: Text data not stored by default
- ✅ **Model Integrity**: Checksums for downloaded models
- ✅ **Input Validation**: Pydantic schemas for all inputs
- ✅ **Error Isolation**: Graceful degradation, no information leakage
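
Checksum verification for a downloaded model file can be done with the standard library alone. This is a minimal sketch assuming SHA-256 digests and a hypothetical `verify_model_file` helper; the document does not state which hash algorithm the project uses.

```python
import hashlib
import hmac


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large model files never sit fully in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_model_file(path, expected_sha256):
    """Return True when the file's digest matches the expected hex string."""
    expected = expected_sha256.strip().lower()
    return hmac.compare_digest(sha256_of(path), expected)
```

A loader would call `verify_model_file` after download and refuse to load the model when it returns `False`.
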
---
> This system does not claim ground truth authorship. It estimates probabilistic authenticity signals based on measurable text properties.