---
title: Open Chatbot
emoji: 🤖
colorFrom: blue
colorTo: purple
sdk: docker
app_port: 7860
---

# Open Chatbot

Zero File Leakage AI Chatbot — Privacy by Design

Your documents never leave your server. Only text goes to the cloud.

The architecture follows the principle of Privacy by Design: every document is processed locally on your own server before any interaction with an AI model's API. The original files — whether PDF, Word, Excel, or OCR-extracted text from images — are never transmitted to the AI provider. Only parsed, sliced text segments, structured as JSON, are sent: the textual content needed for reasoning, without raw files, full document structures, or sensitive metadata.


The Problem · The Solution · How It Works · Quick Start · Features


## 🚨 The Problem: Document Data Leakage

Many companies and individuals want to leverage AI to analyze internal documents — financial reports, contracts, HR data, medical records, legal documents — but face serious risks:

| Risk | Explanation |
| --- | --- |
| Files sent raw to the cloud | When you upload a PDF or Word file to ChatGPT, Claude, or another AI service, the original file is sent to their servers |
| Binary files stored on AI servers | .pdf, .docx, and .xlsx files are stored temporarily or permanently on AI provider infrastructure |
| Metadata leakage | Filenames, author info, revision history, and hidden comments are all transmitted |
| No control | Once a file is sent, you have no control over data retention and usage |
| Compliance violations | May violate GDPR, HIPAA, SOC 2, or internal corporate policies |

Real-world example: You upload a Q4 financial report to ChatGPT. The 5MB PDF is sent in full to OpenAI's servers — including metadata, embedded images, hidden text, and revision history. You have no idea how long that file is retained.


## ✅ How Open Chatbot Solves It

Open Chatbot uses a Local Processing + Text-Only Inference architecture that ensures original files never leave your infrastructure:

```
  YOUR INFRASTRUCTURE (On-Premise / VPS)              CLOUD (AI Provider)
  ==========================================          =======================

  📄 PDF  📝 DOCX  📊 XLSX  🖼️ Image                  OpenAI / Claude /
       |                                              DeepSeek API
       v
  +-----------------------+                           Only receives:
  | LOCAL FILE PROCESSOR  |                           ✅ Sliced text (JSON)
  | • PDF   → pdftotext   |                           ✅ User questions
  | • DOCX  → mammoth     |     text JSON             ✅ System prompt
  | • XLSX  → SheetJS     | ─────────────────────►
  | • Image → Tesseract   |    (max 30KB/file)        Never receives:
  | • OCR scanned docs    |                           ❌ Original files (binary)
  +-----------------------+                           ❌ Images / scans
       |                                              ❌ Document metadata
       v                                              ❌ Revision history
  🗑️ File deleted from                                ❌ Hidden content
     memory after                                     ❌ Embedded objects
     text extraction
```

### Security Principles

| Principle | Implementation |
| --- | --- |
| Files never sent to AI | Files are processed locally → only extracted text is transmitted |
| Text is sliced | Each file is capped at 30KB of text before being sent as a JSON chunk |
| No server storage | Files are buffered in memory, extracted, then immediately deleted from temp |
| Stateless API inference | DeepSeek, OpenAI, and Claude APIs do not store data from API calls |
| API keys in browser | Keys are stored in browser localStorage, never on the server |
| Zero binary transfer | What's sent to AI: `{"role":"system","content":"text..."}` — not files |
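The "API keys in browser" principle can be pictured with a small sketch. This is an illustration only: the storage-key format and function names are assumptions, not the actual `use-chat-store.ts` code. The invariant it shows is that each provider gets its own localStorage slot and the server never persists a key.

```typescript
// Hypothetical sketch of per-provider API key slots (not the real
// use-chat-store.ts). In the browser, `localStorage` satisfies this
// minimal interface; keys never touch server-side storage.
interface KeyValueStore {
  getItem(key: string): string | null;
  setItem(key: string, value: string): void;
}

type Provider = "deepseek" | "openai" | "anthropic";

// One storage key per provider, e.g. "open-chatbot:apiKey:openai"
const keySlot = (p: Provider): string => `open-chatbot:apiKey:${p}`;

function saveApiKey(store: KeyValueStore, provider: Provider, key: string): void {
  store.setItem(keySlot(provider), key);
}

function loadApiKey(store: KeyValueStore, provider: Provider): string | null {
  return store.getItem(keySlot(provider));
}
```

Because the keys live client-side, wiping the browser's localStorage revokes them from the app entirely; the server only ever sees a key transiently, attached to an individual HTTPS request.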

## 🔬 How It Works Technically

### 1. Upload & Local Processing

When a user uploads a file, all processing happens on your own server:

```
POST /api/upload  →  FormData { files: [File] }
                            |
                            v
               ┌─────────────────────┐
               │  file-processor.ts  │
               │                     │
               │  PDF  ──► pdftotext │  CLI tool (poppler)
               │  PDF  ──► OCR       │  tesseract CLI (fallback for scanned docs)
               │  DOCX ──► mammoth   │  npm library
               │  DOC  ──► word-ext  │  npm library
               │  XLSX ──► SheetJS   │  npm library
               │  IMG  ──► tesseract │  OCR engine
               │  TXT  ──► buffer    │  native Node.js
               └──────────┬──────────┘
                          │
                          ▼
               { id, filename, text, size }  ← JSON response
                                               (text only, not the file)
```

What happens in memory:

  1. File is buffered as a Buffer in RAM
  2. Text is extracted by the appropriate library
  3. Buffer and temp files are immediately deleted after extraction
  4. Only a string of text is returned to the client
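The lifecycle above can be sketched as follows. The helper name and shape are hypothetical, not the actual `file-processor.ts` internals; the property being illustrated is that a temp file (needed only for CLI tools such as pdftotext) is removed in a `finally` block no matter what, so only a string survives the call.

```typescript
import { promises as fs } from "node:fs";
import * as os from "node:os";
import * as path from "node:path";

// Hypothetical sketch of the extraction lifecycle (not the real
// file-processor.ts code). The finally block guarantees the temp file
// is deleted even when extraction throws.
async function withTempFile(
  filename: string,
  buffer: Buffer,
  extract: (tmpPath: string) => Promise<string>
): Promise<string> {
  const tmpPath = path.join(os.tmpdir(), `upload-${Date.now()}-${filename}`);
  try {
    await fs.writeFile(tmpPath, buffer);   // 1. buffer lands in a temp file
    return await extract(tmpPath);         // 2. text extracted by the right tool
  } finally {
    await fs.rm(tmpPath, { force: true }); // 3. temp file always deleted
  }
}
```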

### 2. Sending to AI Provider

When the user sends a message, only JSON text is sent to the AI API:

```jsonc
// What is ACTUALLY sent to the API (from chat/route.ts)
{
  "model": "deepseek-chat",
  "system": "You are an AI assistant...\n\n=== File: report.pdf ===\nExtracted text here (max 30KB)...\n=== End File ===",
  "messages": [
    { "role": "user", "content": "Analyze this financial report" }
  ],
  "max_tokens": 8192,
  "stream": true
}
```

Notice:

  • There is no file, attachment, or binary in the payload
  • The system field contains plain text from extraction, not a file
  • Each file context is truncated to max 30,000 characters (fc.text.slice(0, 30000))
  • Responses are streamed as text-delta chunks β€” not stored on the server

### 3. Why AI Providers Don't Store Your Data

| Provider | API Policy |
| --- | --- |
| OpenAI | Data from API calls is not used for training and is not permanently stored (unlike the ChatGPT web interface) |
| Anthropic | API calls are not used for model training; limited retention for abuse monitoring only |
| DeepSeek | The API follows a minimal data-retention policy for inference |

Note: This applies to usage via API (which Open Chatbot uses), not via web interfaces (ChatGPT/Claude web). Web interfaces have different policies.


## 📊 Comparison: Open Chatbot vs Direct Upload to AI

| Aspect | Upload to ChatGPT/Claude Web | Open Chatbot |
| --- | --- | --- |
| Original file sent to cloud | ✅ Yes, full file | ❌ No, text only |
| Binary/images transmitted | ✅ Yes | ❌ No |
| Document metadata transmitted | ✅ Yes (author, revisions) | ❌ No |
| Data used for training | ⚠️ Possibly (depends on settings) | ❌ No (API mode) |
| File stored on AI server | ⚠️ Temporarily/permanently | ❌ No file at all |
| Data retention control | ❌ Minimal | ✅ Full (self-hosted) |
| Compliance friendly | ⚠️ Requires review | ✅ Data stays on-premise |
| Data size transmitted | Full file (MBs) | Text only (max 30KB/file) |

## 🚀 Quick Start

### Prerequisites

  • Node.js 18+
  • An API key from at least one provider: DeepSeek, OpenAI, or Anthropic
  • (Optional) poppler-utils and tesseract-ocr for PDF OCR

### Install & Run

```bash
# Clone the repository
git clone https://github.com/romizone/chatbot-next.git
cd chatbot-next

# Install dependencies
npm install

# Start development server
npm run dev
```

Open http://localhost:3000 → click Settings → select a provider → enter your API key → start chatting!

### Environment Variables (Optional)

Create a .env.local file for server-side fallback keys:

```bash
DEEPSEEK_API_KEY=sk-...
OPENAI_API_KEY=sk-...
ANTHROPIC_API_KEY=sk-ant-...
```

### Install OCR Tools (Optional, for scanned PDFs & images)

```bash
# macOS
brew install poppler tesseract tesseract-lang

# Ubuntu/Debian
sudo apt install poppler-utils tesseract-ocr tesseract-ocr-eng

# Windows (via chocolatey)
choco install poppler tesseract
```

## 🎯 Full Features

### Multi-Provider AI Engine

| Provider | Models | Max Output |
| --- | --- | --- |
| DeepSeek | DeepSeek Chat, DeepSeek Reasoner | 8K - 16K tokens |
| OpenAI | GPT-4o, GPT-4o Mini, GPT-4.1, GPT-4.1 Mini | 16K - 32K tokens |
| Anthropic | Claude Sonnet 4.5, Claude Haiku 4.5 | 8K - 16K tokens |

### Local Document Processing

| Format | Engine | Capability |
| --- | --- | --- |
| PDF | pdftotext CLI + Tesseract OCR | Text extraction + OCR for scanned PDFs (max 20 pages) |
| DOCX | mammoth | Full text and formatting extraction |
| DOC | word-extractor | Legacy Word document support |
| XLSX/XLS | xlsx (SheetJS) | Spreadsheet to structured text (CSV per sheet) |
| CSV | xlsx (SheetJS) | Direct parsing |
| Images | tesseract CLI | OCR: PNG, JPG, BMP, TIFF, WebP |
| Text | Native Buffer | TXT, MD, JSON, XML, HTML, source code (20+ formats) |
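The "CSV per sheet" conversion is done by SheetJS (`XLSX.utils.sheet_to_csv`) in the project. The dependency-free sketch below shows the same "spreadsheet → structured text" idea on an already-parsed grid of cells; it is an illustration, not the SheetJS implementation.

```typescript
// Standalone sketch of the "spreadsheet to structured text" step.
// The project uses SheetJS for this; here a 2D array of cell values
// stands in for a parsed worksheet.
function sheetToCsv(rows: (string | number | null)[][]): string {
  return rows
    .map((row) =>
      row
        .map((cell) => {
          const s = cell === null ? "" : String(cell);
          // RFC 4180-style quoting for cells with commas, quotes, or newlines
          return /[",\n]/.test(s) ? `"${s.replace(/"/g, '""')}"` : s;
        })
        .join(",")
    )
    .join("\n");
}
```

CSV is a good target format here because it preserves row/column structure as plain text, which is exactly what the text-only inference pipeline needs.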

### Core Features

  • Zero File Leakage β€” Files processed locally, only text sent to AI via JSON
  • Per-Provider API Keys β€” Each provider has its own key slot in browser localStorage
  • Real-time Connection Indicator β€” Green/red status in sidebar
  • Auto-Continue β€” Detects truncated responses and automatically requests continuation
  • LaTeX Math (KaTeX) β€” Mathematical formula rendering: $inline$ and $$block$$
  • Syntax Highlighting β€” Code blocks with language detection (Prism theme)
  • Multi-Session β€” Chat history with multiple sessions
  • File Context Persistence β€” File contexts preserved per session
  • Responsive UI β€” Collapsible sidebar, Tailwind CSS + Radix UI
  • Streaming Response β€” Real-time responses via Vercel AI SDK streamText

πŸ—οΈ Architecture

```
chatbot-next/
├── src/
│   ├── app/
│   │   ├── api/
│   │   │   ├── chat/route.ts          # Streaming chat endpoint (multi-provider)
│   │   │   └── upload/route.ts        # Local file processing endpoint
│   │   ├── layout.tsx
│   │   └── page.tsx
│   ├── components/
│   │   ├── chat/
│   │   │   ├── chat-page.tsx          # Main orchestrator
│   │   │   ├── chat-area.tsx          # Message display area
│   │   │   ├── chat-input.tsx         # Input with file upload
│   │   │   ├── chat-message.tsx       # Message bubble
│   │   │   ├── markdown-renderer.tsx  # Markdown + KaTeX + syntax highlight
│   │   │   ├── settings-dialog.tsx    # Provider & API key settings
│   │   │   ├── sidebar.tsx            # Session list + connection indicator
│   │   │   └── welcome-screen.tsx     # Landing screen
│   │   └── ui/                        # Radix UI primitives
│   ├── hooks/
│   │   └── use-chat-store.ts          # State management (localStorage)
│   └── lib/
│       ├── constants.ts               # Model list, system prompt, defaults
│       ├── file-processor.ts          # Local document processing engine
│       └── types.ts                   # TypeScript interfaces
├── public/
├── package.json
└── README.md
```

### Data Flow

```
User uploads file ─────────────────────────────────────────────────
       │
       ▼
POST /api/upload
       │
       ▼
file-processor.ts ── [PDF → pdftotext] ── [DOCX → mammoth]
       │             [XLSX → SheetJS]     [IMG → tesseract OCR]
       │
       ▼
JSON response: { filename, text, size }    ← text only, file deleted
       │
       ▼
Browser stores text in memory
       │
       ▼
User sends message
       │
       ▼
POST /api/chat ──► { messages, provider, model, apiKey, fileContexts }
       │
       ▼
Build system prompt + file text (sliced, max 30KB/file)
       │
       ▼
streamText() to AI provider ──► text-delta chunks ──► UI
       │
       ▼
AI only receives JSON text, never file binaries
```
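The final guarantee in the flow — the AI only ever receives JSON text — can be checked mechanically. The guard below is a hypothetical sketch (not part of the repo) that walks an outgoing payload and reports whether any binary data is present.

```typescript
// Hypothetical guard: returns true if the value contains binary data
// anywhere (Buffer is a Uint8Array subclass in Node, so it is covered).
// A text-only /api/chat payload should always return false.
function containsBinary(value: unknown): boolean {
  if (value instanceof Uint8Array || value instanceof ArrayBuffer) return true;
  if (Array.isArray(value)) return value.some(containsBinary);
  if (value !== null && typeof value === "object") {
    return Object.values(value).some(containsBinary);
  }
  return false;
}
```

Such a check could run in a test suite (or as a dev-mode assertion before `fetch`) to ensure no future change accidentally attaches raw file bytes to the chat request.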

## 🐳 Deployment

### Self-Hosted (Recommended for Enterprise)

For maximum data privacy, deploy on your own infrastructure:

```bash
npm run build
npm start
```

### Docker

```dockerfile
FROM node:18-alpine

# Install OCR tools (optional)
RUN apk add --no-cache poppler-utils tesseract-ocr

WORKDIR /app
COPY package*.json ./
# Full install: `next build` needs devDependencies (TypeScript, Tailwind, etc.)
RUN npm ci
COPY . .
RUN npm run build
EXPOSE 3000
CMD ["npm", "start"]
```

```bash
docker build -t open-chatbot .
docker run -p 3000:3000 open-chatbot
```

### Vercel

```bash
npx vercel
```

Note: On Vercel (serverless), OCR features require the Tesseract binary which may not be available. Use Docker deployment for full OCR support.


## 🔒 Security Summary

| Aspect | Detail |
| --- | --- |
| File Processing | 100% local on your server; files deleted after extraction |
| Data to AI Provider | Plain-text JSON chunks only (max 30KB/file) |
| API Keys | Stored in browser localStorage, sent per-request via HTTPS |
| File Binaries | Never sent to AI providers |
| Document Metadata | Stripped during extraction — only text content remains |
| Temp Files | Auto-cleanup after processing (using `finally` blocks) |
| Data Retention | Chat history in browser localStorage only |
| Network Payload | JSON `{ role, content }` — not multipart/form-data files |

πŸ› οΈ Tech Stack

| Layer | Technology |
| --- | --- |
| Framework | Next.js 16 (App Router, Turbopack) |
| UI | React 19, Tailwind CSS 4, Radix UI, Lucide Icons |
| AI Integration | Vercel AI SDK 6 (streamText, createUIMessageStream) |
| Document Processing | pdftotext (poppler), mammoth, word-extractor, SheetJS, Tesseract |
| Math Rendering | KaTeX, remark-math, rehype-katex |
| Code Highlighting | react-syntax-highlighter (Prism) |
| Language | TypeScript 5 (strict mode) |

## 🤝 Contributing

Contributions are welcome! Feel free to submit a Pull Request.

1. Fork the repository
2. Create your feature branch (`git checkout -b feature/amazing-feature`)
3. Commit your changes (`git commit -m 'feat: add amazing feature'`)
4. Push to the branch (`git push origin feature/amazing-feature`)
5. Open a Pull Request

## 📄 License

This project is licensed under the MIT License. See the LICENSE file for details.


Built to prevent document data leakage when using AI.

Your files stay on your server. Only text goes to the cloud.

Made with ❤️ by Romi Nur Ismanto