# Course Project: FileOrganizer

**A CLI tool that uses local LLMs and AI agents to intelligently organize files, with special focus on research paper management.**

---

## Overview

```
┌─────────────────────────────────────────────────────────────────┐
│                     FileOrganizer CLI                            │
├─────────────────────────────────────────────────────────────────┤
│  $ fileorg scan ~/Downloads                                     │
│  $ fileorg organize ~/Papers --strategy=by-topic                │
│  $ fileorg deduplicate ~/Research --similarity=0.9              │
└─────────────────────────────────────────────────────────────────┘
```

---

## Architecture

```
Files ──► Content Analysis ──► AI Classification ──► Organized Structure
              │                        │
              ▼                        ▼
        PDF Extraction          Docker Model Runner
        Metadata Tools            (Local LLM)
              │                        │
              └────────►MCP◄───────────┘
```

### Data Flow

```
┌──────────────┐     ┌──────────────┐     ┌──────────────┐
│  Files/PDFs  │────►│   Content    │────►│  MCP Server  │
│   (Input)    │     │  Extraction  │     │   (Tools)    │
└──────────────┘     └──────────────┘     └──────┬───────┘
                                                  │
                                                  ▼
┌──────────────┐     ┌──────────────┐     ┌──────────────┐
│  Organized   │◄────│  Agent Crew  │◄────│  Local LLM   │
│  Structure   │     │  (CrewAI)    │     │   (Docker)   │
└──────────────┘     └──────────────┘     └──────────────┘
```

---

## Agent System

| Agent | Role | Tools | Output |
|-------|------|-------|--------|
| **Scanner Agent** | Discovers files, extracts metadata | File I/O, PDF extraction, hash generation | File inventory, metadata catalog |
| **Classifier Agent** | Categorizes files by content and context | LLM analysis, embeddings, similarity | Category assignments, topic tags |
| **Organizer Agent** | Creates folder structure and moves files | File operations, naming strategies | Organized directory tree |
| **Deduplicator Agent** | Finds and handles duplicate files | Hash comparison, content similarity | Duplicate reports, cleanup actions |

### Agent Workflow

```
User Request: "Organize research papers by topic"
                    │
                    ▼
         ┌─────────────────────┐
         │   Scanner Agent     │
         │  "What files do we  │
         │   have and what     │
         │   are they about?"  │
         └──────────┬──────────┘
                    │ File Inventory
                    ▼
         ┌─────────────────────┐
         │  Classifier Agent   │
         │  "What topics and   │
         │   categories emerge │
         │   from the content?"│
         └──────────┬──────────┘
                    │ Categories
                    ▼
         ┌─────────────────────┐
         │  Organizer Agent    │
         │  "Create folder     │
         │   structure and     │
         │   move files"       │
         └──────────┬──────────┘
                    │ Organization Plan
                    ▼
         ┌─────────────────────┐
         │ Deduplicator Agent  │
         │  "Find and handle   │
         │   duplicate files"  │
         └──────────┬──────────┘
                    │
                    ▼
          Organized Directory
```
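
The hand-off between the four agents can be sketched as a plain sequential pipeline. This is only an illustration of the data flow in the diagram above, not the CrewAI implementation: `PipelineState`, the stage functions, and the fixed inventory are hypothetical stand-ins, and the classification rule replaces the LLM call with a trivial filename check.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class PipelineState:
    """Data handed from one stage to the next (a hypothetical shape)."""
    request: str
    inventory: list[str] = field(default_factory=list)
    categories: dict[str, str] = field(default_factory=dict)
    plan: list[tuple[str, str]] = field(default_factory=list)
    duplicates: list[str] = field(default_factory=list)


Stage = Callable[[PipelineState], PipelineState]


def scanner(state: PipelineState) -> PipelineState:
    # Stand-in for file discovery: a fixed inventory instead of a real scan.
    state.inventory = ["bert_paper.pdf", "gpt3.pdf", "download (1).pdf"]
    return state


def classifier(state: PipelineState) -> PipelineState:
    # Stand-in for LLM classification: a trivial filename rule.
    for name in state.inventory:
        topic = "Uncategorized" if name.startswith("download") else "Transformers"
        state.categories[name] = topic
    return state


def organizer(state: PipelineState) -> PipelineState:
    # Turn category assignments into (source, destination) pairs.
    state.plan = [(name, f"{topic}/{name}") for name, topic in state.categories.items()]
    return state


def deduplicator(state: PipelineState) -> PipelineState:
    # Flag any destination that more than one source maps to.
    seen = set()
    for _, dest in state.plan:
        if dest in seen:
            state.duplicates.append(dest)
        seen.add(dest)
    return state


def run_pipeline(request: str, stages: list[Stage]) -> PipelineState:
    state = PipelineState(request=request)
    for stage in stages:
        state = stage(state)
    return state


if __name__ == "__main__":
    final = run_pipeline(
        "Organize research papers by topic",
        [scanner, classifier, organizer, deduplicator],
    )
    for src, dest in final.plan:
        print(f"{src} -> {dest}")
```

Passing the stages as a list keeps the order explicit and makes it easy to run a subset, for example only the scanner and classifier for a preview.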

---

## CLI Commands

### `fileorg scan`

Scan a directory and analyze its contents.

```bash
# Scan a directory
fileorg scan ~/Downloads

# Scan with detailed analysis
fileorg scan ~/Papers --analyze-content

# Scan and export inventory
fileorg scan ~/Research --export inventory.json

# Scan specific file types
fileorg scan ~/Documents --types pdf,docx,txt
```

**Options:**

| Flag | Description | Default |
|------|-------------|---------|
| `--analyze-content` | Extract and analyze file contents | `false` |
| `--export` | Export inventory to JSON/CSV | None |
| `--types` | Comma-separated file extensions to scan | All |
| `--recursive` | Scan subdirectories | `true` |
| `--max-depth` | Maximum directory depth | `10` |
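
As a rough idea of how `--types`, `--recursive`, and `--max-depth` could interact, here is a minimal standard-library sketch. The `scan` function and its depth convention (the root counts as depth 0, and directories at `max_depth` are listed but not descended into) are assumptions for illustration; the real scanner module may behave differently.

```python
import os
from pathlib import Path


def scan(root, types=None, recursive=True, max_depth=10):
    """Return matching files under root, sorted by path."""
    root = Path(root).expanduser()
    # Normalize extensions so "pdf", ".pdf", and "PDF" all match.
    wanted = {t.lower().lstrip(".") for t in types} if types else None
    found = []
    for dirpath, dirnames, filenames in os.walk(root):
        depth = len(Path(dirpath).relative_to(root).parts)
        if not recursive or depth >= max_depth:
            dirnames[:] = []  # clearing in place stops os.walk from descending
        for name in filenames:
            ext = Path(name).suffix.lower().lstrip(".")
            if wanted is None or ext in wanted:
                found.append(Path(dirpath) / name)
    return sorted(found)


if __name__ == "__main__":
    for path in scan("~/Documents", types=["pdf", "docx", "txt"]):
        print(path)
```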

### `fileorg organize`

Organize files using AI-powered strategies.

```bash
# Organize by topic (AI-powered)
fileorg organize ~/Papers --strategy=by-topic

# Organize by date
fileorg organize ~/Photos --strategy=by-date --format="%Y/%m"

# Organize with custom naming
fileorg organize ~/Papers --rename --pattern="{year}_{author}_{title}"

# Dry run to preview changes
fileorg organize ~/Downloads --dry-run

# Interactive mode
fileorg organize ~/Research --interactive
```

**Options:**

| Flag | Description | Default |
|------|-------------|---------|
| `--strategy` | Organization strategy: `by-topic`, `by-date`, `by-type`, `by-author`, `smart` | `smart` |
| `--rename` | Rename files intelligently | `false` |
| `--pattern` | Naming pattern for renamed files | `{original}` |
| `--dry-run` | Preview changes without executing | `false` |
| `--interactive` | Confirm each action | `false` |
| `--output` | Output directory | Same as input |
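
One way to support `--dry-run` is to separate planning from execution, so the same plan can be printed or applied. The sketch below assumes a simple `by-type` layout (one folder per extension); `plan_moves` and `apply_moves` are hypothetical helper names, not part of the project's documented API.

```python
import shutil
from pathlib import Path


def plan_moves(files, output_dir):
    """Plan a by-type layout: each file goes into a folder named after its extension."""
    output_dir = Path(output_dir)
    plan = []
    for src in map(Path, files):
        folder = src.suffix.lower().lstrip(".") or "no_extension"
        plan.append((src, output_dir / folder / src.name))
    return plan


def apply_moves(plan, dry_run=False):
    """Execute the plan, or only describe it when dry_run is set."""
    log = []
    for src, dst in plan:
        if dry_run:
            log.append(f"[dry-run] {src} -> {dst}")
            continue
        if dst.exists():
            # Never overwrite an existing file; report it instead.
            log.append(f"[skipped] {dst} already exists")
            continue
        dst.parent.mkdir(parents=True, exist_ok=True)
        shutil.move(str(src), str(dst))
        log.append(f"[moved] {src} -> {dst}")
    return log
```

Because the plan is plain data, an interactive mode could iterate over the same list and ask for confirmation before each move.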

### `fileorg deduplicate`

Find and handle duplicate files.

```bash
# Find duplicates by hash
fileorg deduplicate ~/Downloads

# Find similar files (content-based)
fileorg deduplicate ~/Papers --similarity=0.9

# Auto-delete duplicates (keep newest)
fileorg deduplicate ~/Photos --auto-delete --keep=newest

# Move duplicates to folder
fileorg deduplicate ~/Documents --move-to=./duplicates
```

**Options:**

| Flag | Description | Default |
|------|-------------|---------|
| `--similarity` | Similarity threshold (0.0-1.0) for content matching | `1.0` (exact) |
| `--method` | Detection method: `hash`, `content`, `metadata` | `hash` |
| `--auto-delete` | Automatically delete duplicates | `false` |
| `--keep` | Which to keep: `newest`, `oldest`, `largest`, `smallest` | `newest` |
| `--move-to` | Move duplicates to directory instead of deleting | None |
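
The default `hash` method and the `--keep` rule can be illustrated with `hashlib` alone. This is a sketch under assumptions: SHA-256 is chosen here for illustration (the project also lists xxhash), and `find_duplicates`, `split_group`, and `KEEP_RULES` are hypothetical names.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def file_digest(path, chunk_size=1 << 20):
    """Hash a file in chunks so large files are never loaded whole."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_duplicates(paths):
    """Group paths by content hash and return only groups with more than one file."""
    groups = defaultdict(list)
    for path in map(Path, paths):
        groups[file_digest(path)].append(path)
    return [group for group in groups.values() if len(group) > 1]


# Each rule scores a file; the highest-scoring file in a group is kept.
KEEP_RULES = {
    "newest": lambda p: p.stat().st_mtime,
    "oldest": lambda p: -p.stat().st_mtime,
    "largest": lambda p: p.stat().st_size,
    "smallest": lambda p: -p.stat().st_size,
}


def split_group(group, keep="newest"):
    """Return (file to keep, files to delete or move) for one duplicate group."""
    keeper = max(group, key=KEEP_RULES[keep])
    return keeper, [p for p in group if p != keeper]
```

Returning the removal candidates instead of deleting them lets the caller decide between `--auto-delete` and `--move-to`.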

### `fileorg research`

Special commands for research paper management.

```bash
# Extract metadata from PDFs
fileorg research extract ~/Papers

# Generate bibliography
fileorg research bibliography ~/Papers --format=bibtex --output=refs.bib

# Find related papers
fileorg research related "attention mechanisms" --in ~/Papers

# Create reading list
fileorg research reading-list ~/Papers --topic "transformers" --order=citations
```

**Options:**

| Flag | Description | Default |
|------|-------------|---------|
| `--format` | Bibliography format: `bibtex`, `apa`, `mla` | `bibtex` |
| `--output` | Output file path | `stdout` |
| `--order` | Sort order: `date`, `citations`, `relevance` | `relevance` |
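
For the `bibtex` format, generating an entry from extracted metadata might look like the sketch below. The metadata dictionary shape, the `bibtex_entry` helper, the `surname + year` citation key, and the `@misc` entry type are all assumptions made for illustration; a real implementation would also need to escape special characters.

```python
def bibtex_entry(meta, entry_type="misc"):
    """Render one BibTeX entry from a metadata dict with title, authors, and year."""
    # Citation key: first author's surname plus the year, e.g. "vaswani2017".
    surname = meta["authors"][0].split()[-1].lower()
    key = f"{surname}{meta['year']}"
    fields = {
        "title": meta["title"],
        "author": " and ".join(meta["authors"]),
        "year": str(meta["year"]),
    }
    lines = [f"@{entry_type}{{{key},"]
    lines += [f"  {name} = {{{value}}}," for name, value in fields.items()]
    lines.append("}")
    return "\n".join(lines)


if __name__ == "__main__":
    print(bibtex_entry({
        "title": "Attention Is All You Need",
        "authors": ["Ashish Vaswani", "Noam Shazeer"],
        "year": 2017,
    }))
```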

### `fileorg config`

Manage configuration settings.

```bash
# Show current config
fileorg config show

# Set LLM model
fileorg config set llm.model "llama3.2:3b"

# Set default strategy
fileorg config set organize.default_strategy "by-topic"

# Reset to defaults
fileorg config reset
```

### `fileorg stats`

Show statistics about files and organization.

```bash
# Show directory statistics
fileorg stats ~/Papers

# Show organization suggestions
fileorg stats ~/Downloads --suggest

# Export statistics
fileorg stats ~/Research --export stats.json
```

---

## Configuration

Configuration is stored in `~/.config/fileorg/config.toml` or `./fileorg.toml` in the project directory.

```toml
[fileorg]
version = "1.0.0"

[llm]
provider = "docker"           # docker, ollama, openai
model = "llama3.2:3b"
temperature = 0.7
max_tokens = 4096
base_url = "http://localhost:11434"

[llm.docker]
runtime = "nvidia"            # nvidia, cpu
memory_limit = "8g"

[agents]
verbose = false
max_iterations = 10

[agents.scanner]
role = "File Scanner"
goal = "Discover and catalog all files with metadata"

[agents.classifier]
role = "Content Classifier"
goal = "Categorize files by content and context"

[agents.organizer]
role = "File Organizer"
goal = "Create optimal folder structure and organize files"

[agents.deduplicator]
role = "Duplicate Detector"
goal = "Find and handle duplicate files efficiently"

[organize]
default_strategy = "smart"
create_backups = true
backup_dir = "./.fileorg_backup"

[organize.naming]
sanitize = true
max_length = 255
replace_spaces = "_"

[research]
extract_metadata = true
auto_rename = true
naming_pattern = "{year}_{author}_{title}"
generate_bibliography = true

[deduplication]
default_method = "hash"
similarity_threshold = 0.95
auto_delete = false
keep_strategy = "newest"

[pdf]
extract_text = true
extract_metadata = true
ocr_enabled = false           # Enable OCR for scanned PDFs

[observability]
enabled = true
provider = "langfuse"         # langfuse, langsmith, console
trace_agents = true
log_tokens = true
```

---

## Docker Stack

### docker-compose.yml

```yaml
version: "3.9"

services:
  # Local LLM served by Ollama
  llm:
    image: ollama/ollama:latest
    runtime: nvidia
    environment:
      - OLLAMA_HOST=0.0.0.0
    volumes:
      - ollama_data:/root/.ollama
    ports:
      - "11434:11434"
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    healthcheck:
      test: ["CMD", "ollama", "list"]
      interval: 30s
      timeout: 10s
      retries: 3

  # MCP Server for file operations and PDF tools
  mcp-server:
    build:
      context: ./src/fileorg/mcp
      dockerfile: Dockerfile
    environment:
      - MCP_PORT=3000
    volumes:
      - ./workspace:/workspace
    ports:
      - "3000:3000"
    depends_on:
      - llm

  # Main application (for containerized usage)
  fileorg:
    build:
      context: .
      dockerfile: Dockerfile
    environment:
      - LLM_BASE_URL=http://llm:11434
      - MCP_SERVER_URL=http://mcp-server:3000
    volumes:
      - ./workspace:/workspace
      - ./config:/config:ro
    depends_on:
      llm:
        condition: service_healthy
      mcp-server:
        condition: service_started
    profiles:
      - cli

volumes:
  ollama_data:
```

### Running the Stack

```bash
# Start LLM and MCP server
docker compose up -d llm mcp-server

# Pull the model (first time only)
docker compose exec llm ollama pull llama3.2:3b

# Run FileOrganizer commands
docker compose run --rm fileorg scan /workspace/papers
docker compose run --rm fileorg organize /workspace/papers --strategy=by-topic
docker compose run --rm fileorg deduplicate /workspace/downloads

# Or run locally with Docker backend
fileorg scan ~/Papers
fileorg organize ~/Papers --strategy=by-topic
fileorg deduplicate ~/Downloads
```
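
Once the `llm` service is up, the application can reach it over HTTP on port 11434. The sketch below builds a request for Ollama's `/api/generate` endpoint using only the standard library; `build_generate_request` and `generate` are hypothetical helpers, and the default values mirror the `[llm]` section of the configuration rather than any confirmed client code in the project.

```python
import json
import urllib.request


def build_generate_request(
    prompt,
    base_url="http://localhost:11434",
    model="llama3.2:3b",
    temperature=0.7,
    max_tokens=4096,
):
    """Build a non-streaming POST request for Ollama's /api/generate endpoint."""
    payload = {
        "model": model,
        "prompt": prompt,
        "stream": False,  # ask for a single JSON object instead of a stream
        "options": {"temperature": temperature, "num_predict": max_tokens},
    }
    return urllib.request.Request(
        f"{base_url.rstrip('/')}/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt, timeout=120, **kwargs):
    """Send the request and return the generated text (needs a running server)."""
    request = build_generate_request(prompt, **kwargs)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read())["response"]


if __name__ == "__main__":
    print(generate("Suggest a folder name for a paper about attention mechanisms."))
```

Inside the Compose network, `base_url` would be `http://llm:11434`, matching the `LLM_BASE_URL` variable set for the `fileorg` service.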

---

## Project Structure

```
fileorg/
├── pyproject.toml              # pixi/uv project config
├── pixi.lock
├── docker-compose.yml          # Full stack orchestration
├── Dockerfile
├── fileorg.toml                # Default configuration
├── README.md
│
├── src/
│   └── fileorg/
│       ├── __init__.py
│       ├── __main__.py         # Entry point
│       ├── cli.py              # Typer CLI commands
│       ├── config.py           # TOML configuration loader
│       │
│       ├── scanner/            # File discovery and analysis
│       │   ├── __init__.py
│       │   ├── discovery.py    # File system traversal
│       │   ├── metadata.py     # Metadata extraction
│       │   ├── pdf_reader.py   # PDF text/metadata extraction
│       │   └── hashing.py      # File hashing utilities
│       │
│       ├── classifier/         # Content classification
│       │   ├── __init__.py
│       │   ├── embeddings.py   # Generate embeddings
│       │   ├── clustering.py   # Topic clustering
│       │   ├── categorizer.py  # AI-powered categorization
│       │   └── similarity.py   # Content similarity
│       │
│       ├── organizer/          # File organization
│       │   ├── __init__.py
│       │   ├── strategies.py   # Organization strategies
│       │   ├── naming.py       # File naming logic
│       │   ├── structure.py    # Directory structure creation
│       │   └── mover.py        # Safe file operations
│       │
│       ├── deduplicator/       # Duplicate detection
│       │   ├── __init__.py
│       │   ├── hash_based.py   # Hash-based detection
│       │   ├── content_based.py # Content similarity detection
│       │   └── handler.py      # Duplicate handling
│       │
│       ├── research/           # Research paper tools
│       │   ├── __init__.py
│       │   ├── extractor.py    # PDF metadata extraction
│       │   ├── bibliography.py # Bibliography generation
│       │   ├── citation.py     # Citation parsing
│       │   └── scholar.py      # Academic search integration
│       │
│       ├── agents/             # CrewAI agents
│       │   ├── __init__.py
│       │   ├── crew.py         # Crew orchestration
│       │   ├── scanner.py      # Scanner agent
│       │   ├── classifier.py   # Classifier agent
│       │   ├── organizer.py    # Organizer agent
│       │   └── deduplicator.py # Deduplicator agent
│       │
│       ├── tools/              # Agent tools
│       │   ├── __init__.py
│       │   ├── file_tools.py   # File operation tools
│       │   ├── pdf_tools.py    # PDF processing tools
│       │   ├── search_tools.py # Search and query tools
│       │   └── analysis.py     # Content analysis tools
│       │
│       ├── mcp/                # MCP server
│       │   ├── __init__.py
│       │   ├── server.py       # MCP server implementation
│       │   ├── tools.py        # MCP tool definitions
│       │   └── Dockerfile      # MCP server container
│       │
│       ├── llm/                # LLM integration
│       │   ├── __init__.py
│       │   ├── client.py       # LLM client (Docker/Ollama/OpenAI)
│       │   └── prompts.py      # Prompt templates
│       │
│       └── observability/      # Logging & tracing
│           ├── __init__.py
│           ├── tracing.py      # Distributed tracing
│           └── metrics.py      # Token/cost tracking
│
├── tests/
│   ├── __init__.py
│   ├── conftest.py             # Pytest fixtures
│   ├── test_cli.py
│   ├── test_scanner.py
│   ├── test_classifier.py
│   ├── test_organizer.py
│   ├── test_deduplicator.py
│   ├── test_research.py
│   └── fixtures/
│       ├── sample_papers/
│       │   ├── paper1.pdf
│       │   ├── paper2.pdf
│       │   └── paper3.pdf
│       ├── sample_files/
│       └── expected_outputs/
│
├── workspace/                  # Working directory
│   └── .gitkeep
│
└── docs/                       # Documentation (Quarto)
    ├── _quarto.yml
    ├── index.qmd
    └── chapters/
```

---

## Technology Stack

| Category | Tools |
|----------|-------|
| **Package Management** | pixi, uv |
| **CLI Framework** | Typer, Rich |
| **Local LLM** | Docker Model Runner, Ollama |
| **LLM Framework** | LangChain |
| **Multi-Agent** | CrewAI |
| **MCP** | Docker MCP Toolkit |
| **PDF Processing** | pypdf (successor to PyPDF2), pdfplumber |
| **Embeddings** | sentence-transformers |
| **File Operations** | pathlib, shutil |
| **Hashing** | hashlib, xxhash |
| **Metadata** | exifread, mutagen |
| **Similarity** | scikit-learn, faiss |
| **Observability** | Langfuse, OpenTelemetry |
| **Testing** | pytest, DeepEval |
| **Containerization** | Docker, Docker Compose |

---

## Example Usage

### End-to-End Workflow

```bash
# 1. Start the Docker stack
docker compose up -d

# 2. Scan your messy Downloads folder
fileorg scan ~/Downloads --analyze-content --export downloads_inventory.json

# 3. Preview the smart strategy on Downloads with a dry run
fileorg organize ~/Downloads --strategy=smart --dry-run
# Review the plan, then execute
fileorg organize ~/Downloads --strategy=smart

# 4. Organize research papers by topic
fileorg scan ~/Papers --types=pdf --analyze-content
fileorg organize ~/Papers --strategy=by-topic --rename --pattern="{year}_{author}_{title}"

# 5. Find and handle duplicates
fileorg deduplicate ~/Papers --similarity=0.95 --move-to=./duplicates

# 6. Extract metadata and generate bibliography
fileorg research extract ~/Papers
fileorg research bibliography ~/Papers --format=bibtex --output=references.bib

# 7. Create a reading list on a specific topic
fileorg research reading-list ~/Papers --topic "transformers" --order=citations

# 8. View statistics
fileorg stats ~/Papers
```

### Research Paper Organization Example

```bash
# Before:
~/Papers/
├── paper_final.pdf
├── attention_is_all_you_need.pdf
├── bert_paper.pdf
├── gpt3.pdf
├── vision_transformer.pdf
├── download (1).pdf
├── download (2).pdf
└── thesis_draft_v5.pdf

# Run organization
fileorg organize ~/Papers --strategy=by-topic --rename

# After:
~/Papers/
├── Natural_Language_Processing/
│   ├── Transformers/
│   │   ├── 2017_Vaswani_Attention_Is_All_You_Need.pdf
│   │   ├── 2018_Devlin_BERT_Pretraining.pdf
│   │   └── 2020_Brown_GPT3_Language_Models.pdf
│   └── Other/
│       └── 2023_Smith_Thesis_Draft.pdf
├── Computer_Vision/
│   └── Transformers/
│       └── 2020_Dosovitskiy_Vision_Transformer.pdf
└── Uncategorized/
    └── 2024_Unknown_Document.pdf
```

### Duplicate Detection Example

```bash
# Find exact duplicates
fileorg deduplicate ~/Downloads
# Found 15 duplicate files (45 MB)
# • download.pdf (3 copies)
# • image.jpg (2 copies)
# • report.docx (2 copies)

# Find similar papers (different versions)
fileorg deduplicate ~/Papers --similarity=0.9 --method=content
# Found 2 similar file groups:
# • attention_paper.pdf, attention_is_all_you_need.pdf (95% similar)
# • bert_preprint.pdf, bert_final.pdf (98% similar)

# Auto-cleanup (keep newest)
fileorg deduplicate ~/Downloads --auto-delete --keep=newest
# ✓ Deleted 15 duplicate files, freed 45 MB
```

---

## Learning Outcomes

By building FileOrganizer, learners will be able to:

1. ✅ Set up modern Python projects with pixi and reproducible environments
2. ✅ Build professional CLI tools with Typer and Rich
3. ✅ Run local LLMs using Docker Model Runner
4. ✅ Process and extract content from PDF files
5. ✅ Build MCP servers to connect AI agents to file systems
6. ✅ Design multi-agent systems with CrewAI
7. ✅ Implement content-based similarity and clustering
8. ✅ Generate embeddings for semantic search
9. ✅ Handle file operations safely with backups and dry-run modes
10. ✅ Implement observability for AI applications
11. ✅ Test non-deterministic systems effectively
12. ✅ Deploy self-hosted AI applications with Docker Compose

---

## Advanced Features

### Smart Organization Strategy

The `smart` strategy uses AI to analyze file content and context to determine the best organization approach:

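```python
# Sketch of the smart strategy. The extension sets, the 50% thresholds, and
# the "by-type" fallback are illustrative assumptions; the real implementation
# would base step 1 on the scanner agent's content analysis.
from collections import Counter
from pathlib import Path

MEDIA_EXTS = {".jpg", ".jpeg", ".png", ".mp3", ".mp4"}
WORK_EXTS = {".docx", ".xlsx", ".pptx", ".txt"}


def choose_strategy(files):
    # 1. Analyze file types (extension counts stand in for content analysis)
    counts = Counter(Path(f).suffix.lower() for f in files)
    total = sum(counts.values()) or 1  # avoid division by zero on empty input

    # 2. Determine optimal strategy
    if counts[".pdf"] / total > 0.5:
        return "by-topic-hierarchical"
    if sum(counts[e] for e in MEDIA_EXTS) / total > 0.5:
        return "by-date-and-type"
    if sum(counts[e] for e in WORK_EXTS) / total > 0.5:
        return "by-project-and-date"
    return "by-type"  # fallback when no file type dominates


def smart_organize(files, classifier_agent, organizer_agent):
    strategy = choose_strategy(files)
    # 3. Execute with AI-powered categorization
    classifier_agent.categorize(files, strategy)
    organizer_agent.execute(strategy)
    return strategy
```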

### Research Paper Features

Special handling for academic PDFs:

- **Metadata Extraction**: Title, authors, year, abstract, keywords
- **Citation Parsing**: Extract and parse references
- **Smart Naming**: `{year}_{first_author}_{short_title}.pdf`
- **Topic Clustering**: Group papers by research area
- **Citation Network**: Identify related papers
- **Bibliography Generation**: BibTeX, APA, MLA formats
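
The smart naming pattern can be sketched with `str.format_map` plus a sanitizing step. The rules below (drop punctuation, replace whitespace with `_`, cap the total length) follow the `[organize.naming]` settings as one plausible reading; `sanitize` and `render_name` are hypothetical helpers, and deriving a short title from the full title is left out.

```python
import re


def sanitize(text, replace_spaces="_"):
    """Drop punctuation and collapse whitespace into the replacement character."""
    text = re.sub(r"[^\w\s-]", "", text)  # keep letters, digits, "_", "-", spaces
    return re.sub(r"\s+", replace_spaces, text.strip())


def render_name(pattern, meta, ext=".pdf", max_length=255):
    """Fill a naming pattern such as "{year}_{author}_{title}" from metadata."""
    clean = {key: sanitize(str(value)) for key, value in meta.items()}
    stem = pattern.format_map(clean)  # raises KeyError if a field is missing
    # Truncate the stem so the full name, extension included, fits max_length.
    return stem[: max_length - len(ext)] + ext


if __name__ == "__main__":
    meta = {"year": 2017, "author": "Vaswani", "title": "Attention Is All You Need"}
    print(render_name("{year}_{author}_{title}", meta))
```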

### Deduplication Strategies

Multiple methods for finding duplicates:

1. **Hash-based**: Exact file matches (fastest)
2. **Content-based**: Similar content using embeddings
3. **Metadata-based**: Same title/author but different files
4. **Fuzzy matching**: Handle renamed or modified files
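
To make the content-based method concrete, the sketch below scores document pairs by cosine similarity and applies a threshold like `--similarity`. Note the substitution: word-count vectors are used here in place of embeddings so the example runs without extra dependencies; with sentence-transformers, only the vectorization step would change. All function names are hypothetical.

```python
import math
import re
from collections import Counter
from itertools import combinations


def token_counts(text):
    """Bag-of-words vector: lowercase alphanumeric tokens and their counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse count vectors, 0.0 if either is empty."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def similar_pairs(docs, threshold=0.9):
    """Return (name, name, score) for every pair at or above the threshold."""
    vectors = {name: token_counts(text) for name, text in docs.items()}
    pairs = []
    for left, right in combinations(vectors, 2):
        score = cosine_similarity(vectors[left], vectors[right])
        if score >= threshold:
            pairs.append((left, right, score))
    return pairs
```

Comparing every pair is quadratic in the number of documents, which is fine for a personal paper folder; larger collections are where an index such as faiss would come in.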

---

*This project serves as the main example in the [Learning Path](../learning-path.md) for building AI-powered CLI tools.*