# TEXT-AUTH API Documentation

## Overview

The TEXT-AUTH API provides evidence-based text forensics and statistical consistency assessment through a RESTful interface. This document covers all endpoints, request/response formats, authentication, rate limiting, and integration examples.

**API Version:** 1.0.0  

---

## Table of Contents

1. [Authentication & Security](#authentication--security)
2. [Rate Limiting](#rate-limiting)
3. [Common Response Format](#common-response-format)
4. [Error Handling](#error-handling)
5. [Core Endpoints](#core-endpoints)
   - [Text Analysis](#text-analysis)
   - [File Analysis](#file-analysis)
   - [Batch Analysis](#batch-analysis)
6. [Report Endpoints](#report-endpoints)
7. [Utility Endpoints](#utility-endpoints)
8. [Best Practices](#best-practices)

---

## Authentication & Security

### API Key Authentication

*Authentication is not enforced in the current deployment. API key authentication may be added in future versions.*

---

## Rate Limiting

*Rate limiting is not enforced at the application level. Deployments should use an external gateway (NGINX, API Gateway, Cloudflare) to enforce rate limits if required.*

---

## Common Response Format

All successful responses follow this structure:

```json
{
  "status": "success",
  "analysis_id": "...",
  "detection_result": {...},
  "highlighted_html": "...",
  "reasoning": {...},
  "processing_time": 2.34,
  "timestamp": "..."
}
```

### HTTP Status Codes

| Code | Meaning | Description |
|------|---------|-------------|
| 200 | OK | Request succeeded |
| 201 | Created | Resource created successfully |
| 400 | Bad Request | Invalid request parameters |
| 404 | Not Found | Resource not found |
| 500 | Internal Server Error | Server error |
| 503 | Service Unavailable | Service temporarily unavailable |

---

## Error Handling

### Error Response Format

```json
{
  "status": "error",
  "error": "Invalid domain...",
  "timestamp": "..."
}
```

### Common Error Codes

| Code | Description | Resolution |
|------|-------------|------------|
| `TEXT_TOO_LONG` | Text exceeds maximum length (50,000 chars) | Split into multiple requests |
| `FILE_TOO_LARGE` | File exceeds size limit | Compress or split file |
| `UNSUPPORTED_FORMAT` | File format not supported | Use .txt, .pdf, .docx, .doc, or .md |
| `EXTRACTION_FAILED` | Document text extraction failed | Ensure file is not corrupted or password-protected |
| `MODEL_UNAVAILABLE` | Required model temporarily unavailable | Retry after a few minutes |
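
The error format above can be handled in one place on the client. The following is a minimal sketch: `parse_response` and `TextAuthError` are hypothetical names, not part of the API. This document does not specify whether the codes from the table appear inside the `error` string, so the helper only surfaces the message as received.

```python
import json


class TextAuthError(Exception):
    """Raised when a response body reports status == "error" (hypothetical name)."""


def parse_response(raw_body):
    """Decode a JSON response body and raise if it reports an error."""
    data = json.loads(raw_body)
    if data.get("status") == "error":
        # Pass the server's message through unchanged.
        raise TextAuthError(data.get("error", "unknown error"))
    return data
```

Callers can then treat any returned dictionary as a successful result and catch `TextAuthError` for the failure path.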

---

## Core Endpoints

### Text Analysis

**Endpoint:** `POST /api/analyze`

Analyze raw text for statistical consistency patterns and forensic signals.

#### Request

**Headers:**
```http
Content-Type: application/json
```

**Body:**
```json
{
  "text": "Your text content here...",
  "domain": "academic",
  "enable_highlighting": true,
  "skip_expensive_metrics": false,
  "use_sentence_level": true,
  "include_metrics_summary": true,
  "generate_report": false
}
```

**Parameters:**

| Parameter | Type | Required | Default | Description |
|-----------|------|----------|---------|-------------|
| `text` | string | **Yes** | - | Text to analyze (50-50,000 chars) |
| `domain` | string | No | `null` (auto-detect) | Content domain (see [Domains](#supported-domains)) |
| `enable_highlighting` | boolean | No | `true` | Generate sentence-level highlights |
| `skip_expensive_metrics` | boolean | No | `false` | Skip computationally expensive metrics for faster results |
| `use_sentence_level` | boolean | No | `true` | Use sentence-level granularity for highlighting |
| `include_metrics_summary` | boolean | No | `true` | Include metric summaries in highlights |
| `generate_report` | boolean | No | `false` | Generate downloadable PDF/JSON report |
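
As an illustration of the parameter table, the sketch below builds the JSON body for `POST /api/analyze` and rejects input outside the documented length range before any request is sent. `build_analyze_payload` is a hypothetical client-side helper, and measuring length with Python's `len()` is an assumption about how the server counts characters.

```python
import json

MIN_CHARS = 50       # lower bound from the parameter table
MAX_CHARS = 50_000   # upper bound from the parameter table

# Optional boolean flags listed in the parameter table.
OPTIONAL_FLAGS = {
    "enable_highlighting",
    "skip_expensive_metrics",
    "use_sentence_level",
    "include_metrics_summary",
    "generate_report",
}


def build_analyze_payload(text, domain=None, **options):
    """Return the request body for POST /api/analyze as a dict."""
    if not MIN_CHARS <= len(text) <= MAX_CHARS:
        raise ValueError(
            f"text must be {MIN_CHARS}-{MAX_CHARS} characters, got {len(text)}"
        )
    payload = {"text": text}
    if domain is not None:
        # Leaving the field out keeps the default (null, auto-detect).
        payload["domain"] = domain
    for name, value in options.items():
        if name not in OPTIONAL_FLAGS:
            raise ValueError(f"unknown parameter: {name}")
        payload[name] = bool(value)
    return payload


body = json.dumps(
    build_analyze_payload("A" * 120, domain="academic", skip_expensive_metrics=True)
)
print(body)
```

Only the flags that differ from their defaults need to be passed; omitted flags are left for the server to default.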

#### Response

```json
{
  "status": "success",
  "analysis_id": "analysis_1735555800000",
  "detection_result": {
    "ensemble_result": {
      "final_verdict": "Synthetic",
      "overall_confidence": 0.89,
      "synthetic_probability": 0.92,
      "authentic_probability": 0.08,
      "uncertainty_score": 0.23,
      "decision_boundary_distance": 0.42
    },
    "metric_results": {
      "perplexity": {
        "synthetic_probability": 0.94,
        "confidence": 0.91,
        "raw_score": 15.23,
        "evidence_strength": "strong"
      },
      "entropy": {
        "synthetic_probability": 0.88,
        "confidence": 0.85,
        "raw_score": 4.67,
        "evidence_strength": "moderate"
      },
      "structural": {
        "synthetic_probability": 0.91,
        "confidence": 0.87,
        "burstiness": -0.12,
        "uniformity": 0.85,
        "evidence_strength": "strong"
      },
      "linguistic": {
        "synthetic_probability": 0.86,
        "confidence": 0.82,
        "pos_diversity": 0.42,
        "mean_tree_depth": 4.2,
        "evidence_strength": "moderate"
      },
      "semantic": {
        "synthetic_probability": 0.93,
        "confidence": 0.88,
        "coherence_mean": 0.91,
        "coherence_variance": 0.03,
        "evidence_strength": "strong"
      },
      "multi_perturbation_stability": {
        "synthetic_probability": 0.89,
        "confidence": 0.84,
        "stability_score": 0.12,
        "evidence_strength": "moderate"
      }
    },
    "domain_prediction": {
      "primary_domain": "academic",
      "confidence": 0.94,
      "alternative_domains": [
        {"domain": "technical_doc", "probability": 0.23},
        {"domain": "science", "probability": 0.18}
      ]
    },
    "processed_text": {
      "word_count": 487,
      "sentence_count": 23,
      "paragraph_count": 5,
      "avg_sentence_length": 21.2,
      "language": "en"
    }
  },
  "highlighted_html": "<div class=\"text-forensics-highlight\">...</div>",
  "reasoning": {
    "summary": "The text exhibits strong statistical consistency patterns typical of language model generation...",
    "key_indicators": [
      "Unusually uniform sentence structure (burstiness: -0.12)",
      "High semantic coherence across all sentences (mean: 0.91)",
      "Low perplexity variance indicating predictable token sequences"
    ],
    "confidence_factors": {
      "supporting_evidence": [
        "6/6 metrics indicate synthetic patterns",
        "Strong cross-metric agreement (correlation: 0.87)"
      ],
      "uncertainty_sources": [
        "Domain-specific terminology may affect baseline expectations"
      ]
    },
    "metric_contributions": {
      "perplexity": 0.28,
      "entropy": 0.19,
      "structural": 0.16,
      "semantic": 0.17,
      "linguistic": 0.12,
      "multi_perturbation_stability": 0.08
    }
  },
  "report_files": null,
  "processing_time": 2.34,
  "timestamp": "2025-12-30T10:30:00Z"
}
```

#### Verdict Interpretation

| Verdict | Probability Range | Interpretation |
|---------|-------------------|----------------|
| **Synthetic** | > 0.70 | High consistency with language model generation patterns |
| **Likely Synthetic** | 0.55 - 0.70 | Moderate consistency with synthetic patterns |
| **Inconclusive** | 0.45 - 0.55 | Insufficient evidence for confident assessment |
| **Likely Authentic** | 0.30 - 0.45 | Moderate consistency with human authorship patterns |
| **Authentic** | < 0.30 | High consistency with human authorship patterns |

**Important:** These verdicts represent statistical consistency assessments, not definitive authorship claims.
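
The verdict table can be mirrored on the client, for example to label results consistently in a UI. This is a sketch, not the service's implementation: `interpret_verdict` is a hypothetical name, and because adjacent ranges in the table share endpoints, the handling of exactly 0.55 and 0.45 below is an assumption.

```python
def interpret_verdict(synthetic_probability):
    """Map a synthetic probability in [0, 1] to a verdict label from the table."""
    p = synthetic_probability
    if not 0.0 <= p <= 1.0:
        raise ValueError(f"probability must be in [0, 1], got {p}")
    if p > 0.70:
        return "Synthetic"
    if p >= 0.55:  # assumption: 0.55 belongs to "Likely Synthetic"
        return "Likely Synthetic"
    if p > 0.45:
        return "Inconclusive"
    if p >= 0.30:  # assumption: 0.45 belongs to "Likely Authentic"
        return "Likely Authentic"
    return "Authentic"


print(interpret_verdict(0.92))  # Synthetic
```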

#### Highlighting Color Key

| Color | Meaning | Probability Range |
|-------|---------|-------------------|
| 🔴 Red | Strong synthetic signals | > 0.80 |
| 🟠 Orange | Moderate synthetic signals | 0.60 - 0.80 |
| 🟡 Yellow | Weak signals | 0.40 - 0.60 |
| 🟢 Green | Authentic signals | < 0.40 |

---

### File Analysis

**Endpoint:** `POST /api/analyze/file`

Analyze uploaded documents (PDF, DOCX, DOC, TXT, MD).

#### Request

**Headers:**
```http
Content-Type: multipart/form-data
```

**Body (form-data):**
```
file: [binary file data]
domain: "academic"
skip_expensive_metrics: false
use_sentence_level: true
include_metrics_summary: true
generate_report: false
```

**Parameters:**

| Parameter | Type | Required | Default | Description |
|-----------|------|----------|---------|-------------|
| `file` | file | **Yes** | - | Document file (max 25MB) |
| `domain` | string | No | `null` | Content domain override |
| `skip_expensive_metrics` | boolean | No | `false` | Skip expensive metrics |
| `use_sentence_level` | boolean | No | `true` | Sentence-level highlighting |
| `include_metrics_summary` | boolean | No | `true` | Include metric summaries |
| `generate_report` | boolean | No | `false` | Generate report |

#### Supported File Formats

| Format | Extensions | Max Size | Notes |
|--------|-----------|----------|-------|
| Plain Text | .txt, .md | 25MB | UTF-8 encoding recommended |
| PDF | .pdf | 25MB | Text-based PDFs; OCR not supported |
| Word | .docx, .doc | 25MB | Modern and legacy formats |
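
A client can apply the same format and size rules before uploading, which avoids sending files the API would reject. The sketch below is a hypothetical pre-flight check (`check_upload` is not part of the API). It assumes "25MB" means 25 × 1024 × 1024 bytes, and it reuses the error code names from [Error Handling](#error-handling) purely as return values.

```python
from pathlib import Path

ALLOWED_EXTENSIONS = {".txt", ".md", ".pdf", ".docx", ".doc"}
MAX_FILE_BYTES = 25 * 1024 * 1024  # assumption: "25MB" read as 25 MiB


def check_upload(filename, size_bytes):
    """Return an error code string, or None if the file looks acceptable."""
    # Compare extensions case-insensitively so "paper.PDF" is accepted.
    if Path(filename).suffix.lower() not in ALLOWED_EXTENSIONS:
        return "UNSUPPORTED_FORMAT"
    if size_bytes > MAX_FILE_BYTES:
        return "FILE_TOO_LARGE"
    return None


print(check_upload("research_paper.pdf", 524288))  # None
```

The check only looks at the name and size; it cannot tell whether a PDF is text-based or scanned, so extraction can still fail server-side.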

#### Response

Same structure as [Text Analysis](#text-analysis), with an additional `file_info` object:

```json
{
  "status": "success",
  "analysis_id": "file_1735555800000",
  "file_info": {
    "filename": "research_paper.pdf",
    "file_type": ".pdf",
    "pages": 12,
    "extraction_method": "pdfplumber",
    "highlighted_html": true
  },
  "detection_result": { /* same as text analysis */ },
  "highlighted_html": "...",
  "reasoning": { /* same as text analysis */ },
  "processing_time": 4.12,
  "timestamp": "2025-12-30T10:30:00Z"
}
```

#### cURL Example

```bash
curl -X POST https://your-domain.com/api/analyze/file \
  -F "file=@/path/to/document.pdf" \
  -F "domain=academic" \
  -F "generate_report=true"
```

---

### Batch Analysis

**Endpoint:** `POST /api/analyze/batch`

Analyze multiple texts in a single request for efficiency.

#### Request

```json
{
  "texts": [
    "First text to analyze...",
    "Second text to analyze...",
    "Third text to analyze..."
  ],
  "domain": "academic",
  "skip_expensive_metrics": true,
  "generate_reports": false
}
```

**Parameters:**

| Parameter | Type | Required | Default | Description |
|-----------|------|----------|---------|-------------|
| `texts` | array[string] | **Yes** | - | 1-100 texts to analyze |
| `domain` | string | No | `null` | Apply same domain to all texts |
| `skip_expensive_metrics` | boolean | No | `true` | Skip expensive metrics (recommended for batch) |
| `generate_reports` | boolean | No | `false` | Generate reports for each text |

#### Response

```json
{
  "status": "success",
  "batch_id": "batch_1735555800000",
  "total": 3,
  "successful": 2,
  "failed": 1,
  "results": [
    {
      "index": 0,
      "status": "success",
      "detection": {
        "ensemble_result": { /* ... */ },
        "metric_results": { /* ... */ }
      },
      "reasoning": { /* ... */ },
      "report_files": null
    },
    {
      "index": 1,
      "status": "success",
      "detection": { /* ... */ }
    },
    {
      "index": 2,
      "status": "error",
      "error": "Text too short (minimum 50 characters)"
    }
  ],
  "processing_time": 8.92,
  "timestamp": "2025-12-30T10:30:00Z"
}
```
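
Because each entry in `results` carries its own `status`, a batch can succeed overall while individual texts fail. A minimal sketch of separating the two cases follows; `split_batch_results` is a hypothetical helper name.

```python
def split_batch_results(response):
    """Return (succeeded, failed) dicts keyed by the item's index in the batch."""
    succeeded, failed = {}, {}
    for item in response.get("results", []):
        if item.get("status") == "success":
            succeeded[item["index"]] = item.get("detection")
        else:
            failed[item["index"]] = item.get("error", "unknown error")
    return succeeded, failed


sample = {
    "results": [
        {"index": 0, "status": "success", "detection": {}},
        {"index": 2, "status": "error",
         "error": "Text too short (minimum 50 characters)"},
    ]
}
done, errors = split_batch_results(sample)
print(sorted(done), errors)
```

Keying by `index` makes it straightforward to map failures back to the original position in `texts` and resubmit only those items.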

#### Performance Tips

- Set `skip_expensive_metrics: true` for faster batch processing
- Keep batch size under 50 texts for optimal performance
- Split workloads of more than 100 texts (the per-request maximum) across several requests, which can be sent in parallel
- Monitor `processing_time` to adjust batch sizes
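
One way to follow the batch-size guidance is to chunk a large list of texts on the client before submitting. The sketch below uses a hypothetical helper, `chunk_texts`; the default of 50 follows the tip above, and the hard cap of 100 follows the `texts` parameter limit.

```python
def chunk_texts(texts, batch_size=50):
    """Split texts into consecutive lists of at most batch_size items."""
    if not 1 <= batch_size <= 100:
        raise ValueError("batch_size must be between 1 and 100")
    return [texts[i:i + batch_size] for i in range(0, len(texts), batch_size)]


batches = chunk_texts([f"text {n}" for n in range(120)])
print([len(b) for b in batches])  # [50, 50, 20]
```

Each chunk then becomes the `texts` array of one `POST /api/analyze/batch` request.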

---

## Report Endpoints

### Generate Report

**Endpoint:** `POST /api/report/generate`

Generate detailed PDF/JSON reports for cached analyses.

#### Request

**Headers:**
```http
Content-Type: application/x-www-form-urlencoded
```

**Body:**
```
analysis_id=analysis_1735555800000
formats=json,pdf
include_highlights=true
```

**Parameters:**

| Parameter | Type | Required | Default | Description |
|-----------|------|----------|---------|-------------|
| `analysis_id` | string | **Yes** | - | Analysis ID from previous request |
| `formats` | string | No | `"json,pdf"` | Comma-separated formats |
| `include_highlights` | boolean | No | `true` | Include sentence highlights in report |
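
The form body for this endpoint can be assembled with the standard library. This is a sketch under assumptions: `build_report_form` is a hypothetical helper, and the set of accepted formats is taken to be just `json` and `pdf`, as listed in this section.

```python
from urllib.parse import urlencode

SUPPORTED_FORMATS = {"json", "pdf"}  # assumption: only the formats named here


def build_report_form(analysis_id, formats=("json", "pdf"), include_highlights=True):
    """Return an application/x-www-form-urlencoded body for report generation."""
    unknown = set(formats) - SUPPORTED_FORMATS
    if unknown:
        raise ValueError(f"unsupported formats: {sorted(unknown)}")
    return urlencode({
        "analysis_id": analysis_id,
        "formats": ",".join(formats),
        # Booleans are sent as lowercase strings, matching the body shown above.
        "include_highlights": "true" if include_highlights else "false",
    })


print(build_report_form("analysis_1735555800000"))
```

`urlencode` percent-encodes the comma in `formats` as `%2C`; a standards-compliant form parser decodes this back to `json,pdf`.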

#### Response

```json
{
  "status": "success",
  "analysis_id": "analysis_1735555800000",
  "reports": {
    "json": "analysis_1735555800000.json",
    "pdf": "analysis_1735555800000.pdf"
  },
  "timestamp": "2025-12-30T10:30:00Z"
}
```

### Download Report

**Endpoint:** `GET /api/report/download/{filename}`

Download a generated report file.

#### Request

```http
GET /api/report/download/analysis_1735555800000.pdf
```

#### Response

Binary file download with appropriate `Content-Type` header.

**Headers:**
```http
Content-Type: application/pdf
Content-Disposition: attachment; filename="analysis_1735555800000.pdf"
Content-Length: 524288
```

---

## Utility Endpoints

### Health Check

**Endpoint:** `GET /health`

Check API health and model availability.

#### Response

```json
{
  "status": "healthy",
  "version": "1.0.0",
  "uptime": 86400.5,
  "models_loaded": {
    "orchestrator": true,
    "highlighter": true,
    "reporter": true,
    "reasoning_generator": true,
    "document_extractor": true,
    "analysis_cache": true,
    "parallel_executor": true
  }
}
```
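
A readiness probe can use this payload to decide whether to route traffic to an instance. The helpers below are a hypothetical sketch; treating any `false` entry in `models_loaded` as "not ready" is an assumption about how strict a deployment wants to be.

```python
def unavailable_components(health):
    """Return the sorted names of components reported as not loaded."""
    loaded = health.get("models_loaded", {})
    return sorted(name for name, ok in loaded.items() if not ok)


def is_ready(health):
    """True when status is "healthy" and every listed component is loaded."""
    return health.get("status") == "healthy" and not unavailable_components(health)


sample = {
    "status": "healthy",
    "models_loaded": {"orchestrator": True, "reporter": False},
}
print(is_ready(sample), unavailable_components(sample))  # False ['reporter']
```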

### List Domains

**Endpoint:** `GET /api/domains`

Get all supported content domains with descriptions.

#### Response

```json
{
  "domains": [
    {
      "value": "general",
      "name": "General",
      "description": "General-purpose text without domain-specific structure"
    },
    {
      "value": "academic",
      "name": "Academic",
      "description": "Academic papers, essays, research"
    },
    {
      "value": "creative",
      "name": "Creative",
      "description": "Creative writing, fiction, poetry"
    },
    {
      "value": "technical_doc",
      "name": "Technical Doc",
      "description": "Technical documentation, manuals, specs"
    }
    // ... 12 more domains
  ]
}
```

### Supported Domains

| Domain | Use Cases | Threshold Adjustments |
|--------|-----------|----------------------|
| `general` | Default fallback | Balanced weights |
| `academic` | Research papers, essays | Higher linguistic weight |
| `creative` | Fiction, poetry | Higher entropy/structural |
| `ai_ml` | ML papers, technical AI content | Semantic prioritized |
| `software_dev` | Code docs, READMEs | Structural relaxed |
| `technical_doc` | Manuals, specs | Higher semantic weight |
| `engineering` | Technical reports | Balanced technical focus |
| `science` | Scientific papers | Academic-like calibration |
| `business` | Reports, proposals | Formal structure emphasis |
| `legal` | Contracts, court filings | Strict structural patterns |
| `medical` | Clinical notes, research | Domain-specific terminology |
| `journalism` | News articles | Balanced, lower burstiness |
| `marketing` | Ad copy, campaigns | Creative elements |
| `social_media` | Posts, casual writing | Relaxed metrics, high perplexity weight |
| `blog_personal` | Personal blogs, diaries | Creative + casual mix |
| `tutorial` | How-to guides | Instructional patterns |

### Cache Statistics

**Endpoint:** `GET /api/cache/stats`

Get analysis cache statistics (admin only).

#### Response

```json
{
  "cache_size": 42,
  "max_size": 100,
  "ttl_seconds": 3600
}
```

### Clear Cache

**Endpoint:** `POST /api/cache/clear`

Clear analysis cache (admin only).

#### Response

```json
{
  "status": "success",
  "message": "Cache cleared"
}
```

---

## Best Practices

### Optimization Tips

1. **Domain Selection**
   - Always specify domain when known for better accuracy
   - Use `/api/domains` to explore available options
   - Let system auto-detect only when domain is truly unknown

2. **Performance**
   - Set `skip_expensive_metrics: true` for faster results when speed matters
   - Use batch API for multiple texts instead of sequential single requests
   - Cache `analysis_id` to regenerate reports without reanalysis

3. **Accuracy**
   - Provide clean, well-formatted text (remove excessive whitespace)
   - Minimum 100 words recommended for reliable results
   - Avoid mixing languages in single analysis

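4. **Rate Limiting**
   - Rate limits, where present, come from an external gateway rather than the application
   - Implement exponential backoff on 429 responses returned by the gateway
   - Monitor the `X-RateLimit-Remaining` header if the gateway sends it
   - Upgrade tier if consistently hitting limits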

5. **Error Handling**
   - Always check `status` field in response
   - Log the `analysis_id` (when present) and `timestamp` for support requests
   - Implement retry logic with jitter for transient errors
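
The backoff and jitter advice above can be sketched as a delay generator. This uses the "full jitter" variant, where each delay is drawn uniformly between zero and an exponentially growing ceiling. The function name and the default values for `retries`, `base`, and `cap` are illustrative assumptions, not values prescribed by the API.

```python
import random


def backoff_delays(retries=5, base=0.5, cap=30.0, rng=random.random):
    """Yield one delay in seconds per retry attempt, with full jitter."""
    for attempt in range(retries):
        # The ceiling doubles each attempt until it reaches the cap.
        ceiling = min(cap, base * 2 ** attempt)
        # rng() returns a float in [0, 1), so the delay stays below the ceiling.
        yield rng() * ceiling


for delay in backoff_delays(retries=3):
    print(f"would sleep {delay:.2f}s before retrying")
```

A caller would sleep for each yielded delay between attempts and stop retrying once the generator is exhausted. Retrying is most useful for transient failures such as 503 responses; a 400 response will not succeed on retry.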

### Security Recommendations

1. **API Key Management** (applies once API key authentication is enabled)
   - Rotate keys every 90 days
   - Use separate keys for dev/staging/production
   - Revoke compromised keys immediately

2. **Data Privacy**
   - Never send PII unless absolutely necessary
   - Use client-side redaction before API calls
   - Enable data retention policies in dashboard

3. **Input Validation**
   - Sanitize user input before sending to API
   - Validate file types client-side
   - Implement size limits before upload

---

## Version History

- **1.0.0** (2025-12-30): Initial release
  - 6 forensic metrics
  - 16 supported domains
  - PDF/JSON reporting
  - Batch processing

---

## Appendix

### Complete Domain List with Aliases

```python
DOMAIN_ALIASES = {
    'general': ['default', 'generic'],
    'academic': ['education', 'research', 'scholarly', 'university'],
    'creative': ['fiction', 'literature', 'story', 'narrative'],
    'ai_ml': ['ai', 'ml', 'machinelearning', 'neural'],
    'software_dev': ['software', 'code', 'programming', 'dev'],
    'technical_doc': ['technical', 'tech', 'documentation', 'manual'],
    'engineering': ['engineer'],
    'science': ['scientific'],
    'business': ['corporate', 'commercial', 'enterprise'],
    'legal': ['law', 'contract', 'court'],
    'medical': ['healthcare', 'clinical', 'medicine', 'health'],
    'journalism': ['news', 'reporting', 'media', 'press'],
    'marketing': ['advertising', 'promotional', 'brand', 'sales'],
    'social_media': ['social', 'casual', 'informal', 'posts'],
    'blog_personal': ['blog', 'personal', 'diary', 'lifestyle'],
    'tutorial': ['guide', 'howto', 'instructional', 'walkthrough']
}
```
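
A table like this is typically inverted once so that any alias resolves to its canonical domain in a single lookup. The sketch below shows one way to do that on the client. It copies only a few entries from the table for brevity, `resolve_domain` is a hypothetical helper, and the case-insensitive, whitespace-trimming match is an assumption rather than documented server behaviour.

```python
# Subset of the alias table above, copied for illustration.
DOMAIN_ALIASES = {
    'general': ['default', 'generic'],
    'academic': ['education', 'research', 'scholarly', 'university'],
    'technical_doc': ['technical', 'tech', 'documentation', 'manual'],
    'journalism': ['news', 'reporting', 'media', 'press'],
}

# Reverse index: every canonical name and alias points at its canonical name.
_LOOKUP = {name: name for name in DOMAIN_ALIASES}
for canonical, aliases in DOMAIN_ALIASES.items():
    for alias in aliases:
        _LOOKUP[alias] = canonical


def resolve_domain(name):
    """Return the canonical domain for a name or alias, or None if unknown."""
    return _LOOKUP.get(name.strip().lower())


print(resolve_domain("News"))  # journalism
```

Returning `None` for unknown names lets the caller fall back to omitting `domain`, which leaves detection to the API.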

### Metric Weight Defaults

```python
DEFAULT_WEIGHTS = {
    'perplexity': 0.25,
    'entropy': 0.20,
    'structural': 0.15,
    'semantic': 0.15,
    'linguistic': 0.15,
    'multi_perturbation_stability': 0.10
}
```
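
To show how such weights could combine per-metric scores, the sketch below computes a plain weighted average of the `synthetic_probability` values from the example response in [Text Analysis](#text-analysis). This is an illustrative assumption, not the service's ensemble: the plain average comes out near 0.905, while that example response reports 0.92, so the real ensemble evidently applies additional logic (for instance domain-specific adjustments or confidence weighting). Renormalising over the metrics that are present is likewise an assumption.

```python
DEFAULT_WEIGHTS = {
    'perplexity': 0.25,
    'entropy': 0.20,
    'structural': 0.15,
    'semantic': 0.15,
    'linguistic': 0.15,
    'multi_perturbation_stability': 0.10,
}


def weighted_probability(metric_probs, weights=DEFAULT_WEIGHTS):
    """Weighted average of per-metric probabilities over the metrics present."""
    used = {m: w for m, w in weights.items() if m in metric_probs}
    total = sum(used.values())
    if total == 0:
        raise ValueError("no weighted metrics present")
    # Dividing by the total renormalises when some metrics were skipped.
    return sum(metric_probs[m] * w for m, w in used.items()) / total


example = {
    'perplexity': 0.94,
    'entropy': 0.88,
    'structural': 0.91,
    'semantic': 0.93,
    'linguistic': 0.86,
    'multi_perturbation_stability': 0.89,
}
print(round(weighted_probability(example), 3))  # 0.905
```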

### Response Time Estimates

| Operation | Min | Avg | Max | P95 |
|-----------|-----|-----|-----|-----|
| Text Analysis (500 words) | 1.2s | 2.3s | 4.5s | 3.8s |
| File Analysis (PDF, 10 pages) | 2.5s | 4.1s | 8.2s | 6.9s |
| Batch (10 texts) | 5.8s | 9.2s | 15.3s | 13.1s |
| Report Generation | 0.3s | 0.8s | 2.1s | 1.5s |

---

*Last Updated: December 30, 2025*  
*API Version: 1.0.0*  
*Documentation Version: 1.0.0*