# Course Project: FileOrganizer
**A CLI tool that uses local LLMs and AI agents to intelligently organize files, with a special focus on research paper management.**
---
## Overview
```
┌─────────────────────────────────────────────────────────────────┐
│                        FileOrganizer CLI                        │
├─────────────────────────────────────────────────────────────────┤
│  $ fileorg scan ~/Downloads                                     │
│  $ fileorg organize ~/Papers --strategy=by-topic                │
│  $ fileorg deduplicate ~/Research --similarity=0.9              │
└─────────────────────────────────────────────────────────────────┘
```
---
## Architecture
```
Files ──► Content Analysis ──► AI Classification ──► Organized Structure
                 │                        │
                 ▼                        ▼
          PDF Extraction         Docker Model Runner
          Metadata Tools             (Local LLM)
                 │                        │
                 └────────►MCP◄───────────┘
```
### Data Flow
```
┌──────────────┐     ┌──────────────┐     ┌──────────────┐
│  Files/PDFs  │────►│   Content    │────►│  MCP Server  │
│   (Input)    │     │  Extraction  │     │   (Tools)    │
└──────────────┘     └──────────────┘     └──────┬───────┘
                                                 │
                                                 ▼
┌──────────────┐     ┌──────────────┐     ┌──────────────┐
│  Organized   │◄────│  Agent Crew  │◄────│  Local LLM   │
│  Structure   │     │   (CrewAI)   │     │   (Docker)   │
└──────────────┘     └──────────────┘     └──────────────┘
```
---
## Agent System
| Agent | Role | Tools | Output |
|-------|------|-------|--------|
| **Scanner Agent** | Discovers files, extracts metadata | File I/O, PDF extraction, hash generation | File inventory, metadata catalog |
| **Classifier Agent** | Categorizes files by content and context | LLM analysis, embeddings, similarity | Category assignments, topic tags |
| **Organizer Agent** | Creates folder structure and moves files | File operations, naming strategies | Organized directory tree |
| **Deduplicator Agent** | Finds and handles duplicate files | Hash comparison, content similarity | Duplicate reports, cleanup actions |
### Agent Workflow
```
User Request: "Organize research papers by topic"
                    │
                    ▼
         ┌─────────────────────┐
         │    Scanner Agent    │
         │  "What files do we  │
         │   have and what     │
         │   are they about?"  │
         └──────────┬──────────┘
                    │ File Inventory
                    ▼
         ┌─────────────────────┐
         │  Classifier Agent   │
         │  "What topics and   │
         │   categories emerge │
         │   from the content?"│
         └──────────┬──────────┘
                    │ Categories
                    ▼
         ┌─────────────────────┐
         │   Organizer Agent   │
         │  "Create folder     │
         │   structure and     │
         │   move files"       │
         └──────────┬──────────┘
                    │ Organization Plan
                    ▼
         ┌─────────────────────┐
         │ Deduplicator Agent  │
         │  "Find and handle   │
         │   duplicate files"  │
         └──────────┬──────────┘
                    │
                    ▼
           Organized Directory
```
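The sequential hand-off in the diagram can be sketched in plain Python, without CrewAI. This is only an illustration: `PipelineState`, the four stage functions, and the placeholder classification rule are hypothetical stand-ins for the agents, not the project's actual code.

```python
from dataclasses import dataclass, field


@dataclass
class PipelineState:
    """Data handed from one stage to the next (hypothetical container)."""
    files: list
    inventory: dict = field(default_factory=dict)
    categories: dict = field(default_factory=dict)
    plan: dict = field(default_factory=dict)
    duplicates: list = field(default_factory=list)


def scan(state):
    # Scanner: build a file inventory (only the extension is recorded here).
    for name in state.files:
        state.inventory[name] = {"ext": name.rsplit(".", 1)[-1].lower()}
    return state


def classify(state):
    # Classifier: a fixed rule stands in for the LLM analysis.
    for name, meta in state.inventory.items():
        state.categories[name] = "Papers" if meta["ext"] == "pdf" else "Other"
    return state


def organize(state):
    # Organizer: map each file to its target path (nothing is moved).
    for name, category in state.categories.items():
        state.plan[name] = f"{category}/{name}"
    return state


def deduplicate(state):
    # Deduplicator: flag names that occur more than once in the input.
    seen = set()
    for name in state.files:
        if name in seen:
            state.duplicates.append(name)
        seen.add(name)
    return state


# Stage order follows the workflow diagram.
STAGES = [scan, classify, organize, deduplicate]


def run_pipeline(files):
    state = PipelineState(files=list(files))
    for stage in STAGES:
        state = stage(state)
    return state
```

Calling `run_pipeline(["a.pdf", "b.txt", "a.pdf"])` yields a plan for two distinct files and reports `a.pdf` as a duplicate. Each stage only reads what earlier stages wrote, which is the property the real agents would need to preserve.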
---
## CLI Commands
### `fileorg scan`
Scan a directory and analyze its contents.
```bash
# Scan a directory
fileorg scan ~/Downloads
# Scan with detailed analysis
fileorg scan ~/Papers --analyze-content
# Scan and export inventory
fileorg scan ~/Research --export inventory.json
# Scan specific file types
fileorg scan ~/Documents --types pdf,docx,txt
```
**Options:**
| Flag | Description | Default |
|------|-------------|---------|
| `--analyze-content` | Extract and analyze file contents | `false` |
| `--export` | Export inventory to JSON/CSV | None |
| `--types` | Comma-separated file extensions to scan | All |
| `--recursive` | Scan subdirectories | `true` |
| `--max-depth` | Maximum directory depth | `10` |
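
A minimal sketch of how the scan options could map onto a directory walk, using only `pathlib`. The function name `scan_directory` is hypothetical, and the depth convention (the scanned directory itself is depth 0) is an assumption, since the table does not define it.

```python
from pathlib import Path


def scan_directory(root, types=None, recursive=True, max_depth=10):
    """Yield files under root, honoring --types, --recursive and --max-depth."""
    root = Path(root)
    # Normalize extensions so "PDF", ".pdf" and "pdf" all match.
    wanted = {t.lower().lstrip(".") for t in types} if types else None

    def walk(directory, depth):
        for entry in sorted(directory.iterdir()):
            if entry.is_dir():
                # Assumption: root is depth 0, so max_depth=0 scans root only.
                if recursive and depth < max_depth:
                    yield from walk(entry, depth + 1)
            elif entry.is_file():
                ext = entry.suffix.lower().lstrip(".")
                if wanted is None or ext in wanted:
                    yield entry

    yield from walk(root, 0)
```

For example, `list(scan_directory("~/Documents", types=["pdf", "docx", "txt"]))` would need the `~` expanded first (`Path.expanduser()`). The sketch follows symlinked directories, so a real implementation would likely need loop protection.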
### `fileorg organize`
Organize files using AI-powered strategies.
```bash
# Organize by topic (AI-powered)
fileorg organize ~/Papers --strategy=by-topic
# Organize by date
fileorg organize ~/Photos --strategy=by-date --format="%Y/%m"
# Organize with custom naming
fileorg organize ~/Papers --rename --pattern="{year}_{author}_{title}"
# Dry run to preview changes
fileorg organize ~/Downloads --dry-run
# Interactive mode
fileorg organize ~/Research --interactive
```
**Options:**
| Flag | Description | Default |
|------|-------------|---------|
| `--strategy` | Organization strategy: `by-topic`, `by-date`, `by-type`, `by-author`, `smart` | `smart` |
| `--rename` | Rename files intelligently | `false` |
| `--pattern` | Naming pattern for renamed files | `{original}` |
| `--dry-run` | Preview changes without executing | `false` |
| `--interactive` | Confirm each action | `false` |
| `--output` | Output directory | Same as input |
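
One way to support `--dry-run` is to separate planning from execution, so the same plan can be printed or applied. The sketch below does this for the `by-date` strategy; `plan_by_date` and `apply_plan` are hypothetical names, and using the file's modification time as "the date" is an assumption.

```python
import shutil
from datetime import datetime
from pathlib import Path


def plan_by_date(files, output_dir, fmt="%Y/%m"):
    """Map each (path, modified_time) pair to a dated target path."""
    output_dir = Path(output_dir)
    return [
        (Path(path), output_dir / mtime.strftime(fmt) / Path(path).name)
        for path, mtime in files
    ]


def apply_plan(plan, dry_run=False):
    """Print every planned move; touch the file system only if not dry_run."""
    for source, target in plan:
        print(f"{source} -> {target}")
        if not dry_run:
            target.parent.mkdir(parents=True, exist_ok=True)
            shutil.move(str(source), str(target))
```

With `plan = plan_by_date([("in/a.jpg", datetime(2024, 3, 5))], "out")`, calling `apply_plan(plan, dry_run=True)` prints the single planned move and changes nothing on disk.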
### `fileorg deduplicate`
Find and handle duplicate files.
```bash
# Find duplicates by hash
fileorg deduplicate ~/Downloads
# Find similar files (content-based)
fileorg deduplicate ~/Papers --similarity=0.9
# Auto-delete duplicates (keep newest)
fileorg deduplicate ~/Photos --auto-delete --keep=newest
# Move duplicates to folder
fileorg deduplicate ~/Documents --move-to=./duplicates
```
**Options:**
| Flag | Description | Default |
|------|-------------|---------|
| `--similarity` | Similarity threshold (0.0-1.0) for content matching | `1.0` (exact) |
| `--method` | Detection method: `hash`, `content`, `metadata` | `hash` |
| `--auto-delete` | Automatically delete duplicates | `false` |
| `--keep` | Which to keep: `newest`, `oldest`, `largest`, `smallest` | `newest` |
| `--move-to` | Move duplicates to directory instead of deleting | None |
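
The default `hash` method can be sketched with `hashlib`: files with identical digests are exact duplicates, and `--keep` decides which copy survives. The function names are hypothetical, SHA-256 is an assumed choice of hash, and only the `newest` and `oldest` policies are shown because exact duplicates always have the same size.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_duplicates(paths):
    """Group paths whose contents hash to the same digest."""
    groups = defaultdict(list)
    for path in paths:
        groups[file_digest(path)].append(Path(path))
    return [group for group in groups.values() if len(group) > 1]


def split_group(group, keep="newest"):
    """Return (file to keep, files to remove), based on modification time."""
    if keep not in ("newest", "oldest"):
        raise ValueError(f"unsupported keep policy: {keep!r}")
    ordered = sorted(group, key=lambda p: p.stat().st_mtime)
    kept = ordered[-1] if keep == "newest" else ordered[0]
    return kept, [p for p in group if p != kept]
```

The files returned as "to remove" could then be deleted (`--auto-delete`) or moved (`--move-to`); the sketch leaves that step out on purpose.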
### `fileorg research`
Special commands for research paper management.
```bash
# Extract metadata from PDFs
fileorg research extract ~/Papers
# Generate bibliography
fileorg research bibliography ~/Papers --format=bibtex --output=refs.bib
# Find related papers
fileorg research related "attention mechanisms" --in ~/Papers
# Create reading list
fileorg research reading-list ~/Papers --topic "transformers" --order=citations
```
**Options:**
| Flag | Description | Default |
|------|-------------|---------|
| `--format` | Bibliography format: `bibtex`, `apa`, `mla` | `bibtex` |
| `--output` | Output file path | `stdout` |
| `--order` | Sort order: `date`, `citations`, `relevance` | `relevance` |
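
As an illustration of the `bibtex` format, the following sketch renders one paper's extracted metadata as a BibTeX entry. The metadata keys (`title`, `authors`, `year`), the `@article` entry type, and the citation-key scheme are all assumptions; the sketch also does no escaping of special LaTeX characters.

```python
def to_bibtex(meta):
    """Render a metadata dict as a BibTeX @article entry (assumed entry type)."""
    # Citation key: lowercase last name of the first author plus the year.
    first_author = meta["authors"][0].split()[-1].lower()
    key = f"{first_author}{meta['year']}"
    fields = {
        "title": meta["title"],
        "author": " and ".join(meta["authors"]),  # BibTeX author separator
        "year": str(meta["year"]),
    }
    body = ",\n".join(f"  {name} = {{{value}}}" for name, value in fields.items())
    return f"@article{{{key},\n{body}\n}}"
```

Writing the joined entries for every paper to the `--output` path (or printing them when no path is given) would complete the command.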
### `fileorg config`
Manage configuration settings.
```bash
# Show current config
fileorg config show
# Set LLM model
fileorg config set llm.model "llama3.2:3b"
# Set default strategy
fileorg config set organize.default_strategy "by-topic"
# Reset to defaults
fileorg config reset
```
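Keys such as `llm.model` suggest dotted paths into the nested configuration. A minimal sketch of that lookup-and-set step, assuming the configuration is held as nested dictionaries (`set_key` is a hypothetical helper name):

```python
def set_key(config, dotted_key, value):
    """Set a nested value addressed by a dotted key such as 'llm.model'."""
    *parents, leaf = dotted_key.split(".")
    node = config
    for part in parents:
        # Create intermediate tables on demand.
        node = node.setdefault(part, {})
        if not isinstance(node, dict):
            raise KeyError(f"{part!r} is not a table in {dotted_key!r}")
    node[leaf] = value
    return config
```

For instance, `set_key(cfg, "organize.default_strategy", "by-topic")` creates the `organize` table if it does not exist yet and leaves other keys untouched.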
### `fileorg stats`
Show statistics about files and organization.
```bash
# Show directory statistics
fileorg stats ~/Papers
# Show organization suggestions
fileorg stats ~/Downloads --suggest
# Export statistics
fileorg stats ~/Research --export stats.json
```
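A rough sketch of the kind of summary `fileorg stats` might compute, grouping files by extension. The function name and the shape of the returned dictionary are assumptions; the input is a list of `(filename, size_in_bytes)` pairs so the logic stays independent of the file system.

```python
from collections import Counter
from pathlib import Path


def directory_stats(entries):
    """Summarize (filename, size_in_bytes) pairs by lowercase extension."""
    counts, sizes = Counter(), Counter()
    for name, size in entries:
        ext = Path(name).suffix.lower() or "(none)"
        counts[ext] += 1
        sizes[ext] += size
    return {
        "files": sum(counts.values()),
        "bytes": sum(sizes.values()),
        "by_extension": {
            ext: {"files": counts[ext], "bytes": sizes[ext]}
            for ext in sorted(counts)
        },
    }
```

The result is a plain dictionary, so `json.dumps(directory_stats(entries), indent=2)` would cover the `--export stats.json` case.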
---
## Configuration
Configuration is stored in `~/.config/fileorg/config.toml` or `./fileorg.toml` in the project directory.
```toml
[fileorg]
version = "1.0.0"
[llm]
provider = "docker" # docker, ollama, openai
model = "llama3.2:3b"
temperature = 0.7
max_tokens = 4096
base_url = "http://localhost:11434"
[llm.docker]
runtime = "nvidia" # nvidia, cpu
memory_limit = "8g"
[agents]
verbose = false
max_iterations = 10
[agents.scanner]
role = "File Scanner"
goal = "Discover and catalog all files with metadata"
[agents.classifier]
role = "Content Classifier"
goal = "Categorize files by content and context"
[agents.organizer]
role = "File Organizer"
goal = "Create optimal folder structure and organize files"
[agents.deduplicator]
role = "Duplicate Detector"
goal = "Find and handle duplicate files efficiently"
[organize]
default_strategy = "smart"
create_backups = true
backup_dir = "./.fileorg_backup"
[organize.naming]
sanitize = true
max_length = 255
replace_spaces = "_"
[research]
extract_metadata = true
auto_rename = true
naming_pattern = "{year}_{author}_{title}"
generate_bibliography = true
[deduplication]
default_method = "hash"
similarity_threshold = 0.95
auto_delete = false
keep_strategy = "newest"
[pdf]
extract_text = true
extract_metadata = true
ocr_enabled = false # Enable OCR for scanned PDFs
[observability]
enabled = true
provider = "langfuse" # langfuse, langsmith, console
trace_agents = true
log_tokens = true
```
---
## Docker Stack
### docker-compose.yml
```yaml
version: "3.9"
services:
  # Local LLM served by the Ollama image
  llm:
    image: ollama/ollama:latest
    runtime: nvidia
    environment:
      - OLLAMA_HOST=0.0.0.0
    volumes:
      - ollama_data:/root/.ollama
    ports:
      - "11434:11434"
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    healthcheck:
      # The ollama/ollama image may not include curl; `ollama list` probes
      # the server using the CLI that ships in the image.
      test: ["CMD", "ollama", "list"]
      interval: 30s
      timeout: 10s
      retries: 3
  # MCP Server for file operations and PDF tools
  mcp-server:
    build:
      context: ./src/fileorg/mcp
      dockerfile: Dockerfile
    environment:
      - MCP_PORT=3000
    volumes:
      - ./workspace:/workspace
    ports:
      - "3000:3000"
    depends_on:
      - llm
  # Main application (for containerized usage)
  fileorg:
    build:
      context: .
      dockerfile: Dockerfile
    environment:
      - LLM_BASE_URL=http://llm:11434
      - MCP_SERVER_URL=http://mcp-server:3000
    volumes:
      - ./workspace:/workspace
      - ./config:/config:ro
    depends_on:
      llm:
        condition: service_healthy
      mcp-server:
        condition: service_started
    profiles:
      - cli
volumes:
  ollama_data:
```
### Running the Stack
```bash
# Start LLM and MCP server
docker compose up -d llm mcp-server
# Pull the model (first time only)
docker compose exec llm ollama pull llama3.2:3b
# Run FileOrganizer commands
docker compose run --rm fileorg scan /workspace/papers
docker compose run --rm fileorg organize /workspace/papers --strategy=by-topic
docker compose run --rm fileorg deduplicate /workspace/downloads
# Or run locally with Docker backend
fileorg scan ~/Papers
fileorg organize ~/Papers --strategy=by-topic
fileorg deduplicate ~/Downloads
```
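Before running commands locally, it can help to wait until the LLM container answers. The sketch below polls Ollama's `/api/tags` endpoint on the port published above; `ollama_ready` and `wait_until` are hypothetical helpers, and the retry counts are arbitrary.

```python
import time
import urllib.error
import urllib.request


def ollama_ready(base_url="http://localhost:11434", timeout=2.0):
    """Return True if the Ollama API answers GET /api/tags with HTTP 200."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False


def wait_until(probe, attempts=10, delay=1.0, sleep=time.sleep):
    """Call probe() up to `attempts` times, sleeping `delay` seconds between tries."""
    for _ in range(attempts):
        if probe():
            return True
        sleep(delay)
    return False
```

A script could call `wait_until(ollama_ready, attempts=30)` after `docker compose up -d` and exit with an error if it returns `False`.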
---
## Project Structure
```
fileorg/
├── pyproject.toml                      # pixi/uv project config
├── pixi.lock
├── docker-compose.yml                  # Full stack orchestration
├── Dockerfile
├── fileorg.toml                        # Default configuration
├── README.md
│
├── src/
│   └── fileorg/
│       ├── __init__.py
│       ├── __main__.py                 # Entry point
│       ├── cli.py                      # Typer CLI commands
│       ├── config.py                   # TOML configuration loader
│       │
│       ├── scanner/                    # File discovery and analysis
│       │   ├── __init__.py
│       │   ├── discovery.py            # File system traversal
│       │   ├── metadata.py             # Metadata extraction
│       │   ├── pdf_reader.py           # PDF text/metadata extraction
│       │   └── hashing.py              # File hashing utilities
│       │
│       ├── classifier/                 # Content classification
│       │   ├── __init__.py
│       │   ├── embeddings.py           # Generate embeddings
│       │   ├── clustering.py           # Topic clustering
│       │   ├── categorizer.py          # AI-powered categorization
│       │   └── similarity.py           # Content similarity
│       │
│       ├── organizer/                  # File organization
│       │   ├── __init__.py
│       │   ├── strategies.py           # Organization strategies
│       │   ├── naming.py               # File naming logic
│       │   ├── structure.py            # Directory structure creation
│       │   └── mover.py                # Safe file operations
│       │
│       ├── deduplicator/               # Duplicate detection
│       │   ├── __init__.py
│       │   ├── hash_based.py           # Hash-based detection
│       │   ├── content_based.py        # Content similarity detection
│       │   └── handler.py              # Duplicate handling
│       │
│       ├── research/                   # Research paper tools
│       │   ├── __init__.py
│       │   ├── extractor.py            # PDF metadata extraction
│       │   ├── bibliography.py         # Bibliography generation
│       │   ├── citation.py             # Citation parsing
│       │   └── scholar.py              # Academic search integration
│       │
│       ├── agents/                     # CrewAI agents
│       │   ├── __init__.py
│       │   ├── crew.py                 # Crew orchestration
│       │   ├── scanner.py              # Scanner agent
│       │   ├── classifier.py           # Classifier agent
│       │   ├── organizer.py            # Organizer agent
│       │   └── deduplicator.py         # Deduplicator agent
│       │
│       ├── tools/                      # Agent tools
│       │   ├── __init__.py
│       │   ├── file_tools.py           # File operation tools
│       │   ├── pdf_tools.py            # PDF processing tools
│       │   ├── search_tools.py         # Search and query tools
│       │   └── analysis.py             # Content analysis tools
│       │
│       ├── mcp/                        # MCP server
│       │   ├── __init__.py
│       │   ├── server.py               # MCP server implementation
│       │   ├── tools.py                # MCP tool definitions
│       │   └── Dockerfile              # MCP server container
│       │
│       ├── llm/                        # LLM integration
│       │   ├── __init__.py
│       │   ├── client.py               # LLM client (Docker/Ollama/OpenAI)
│       │   └── prompts.py              # Prompt templates
│       │
│       └── observability/              # Logging & tracing
│           ├── __init__.py
│           ├── tracing.py              # Distributed tracing
│           └── metrics.py              # Token/cost tracking
│
├── tests/
│   ├── __init__.py
│   ├── conftest.py                     # Pytest fixtures
│   ├── test_cli.py
│   ├── test_scanner.py
│   ├── test_classifier.py
│   ├── test_organizer.py
│   ├── test_deduplicator.py
│   ├── test_research.py
│   └── fixtures/
│       ├── sample_papers/
│       │   ├── paper1.pdf
│       │   ├── paper2.pdf
│       │   └── paper3.pdf
│       ├── sample_files/
│       └── expected_outputs/
│
├── workspace/                          # Working directory
│   └── .gitkeep
│
└── docs/                               # Documentation (Quarto)
    ├── _quarto.yml
    ├── index.qmd
    └── chapters/
---
## Technology Stack
| Category | Tools |
|----------|-------|
| **Package Management** | pixi, uv |
| **CLI Framework** | Typer, Rich |
| **Local LLM** | Docker Model Runner, Ollama |
| **LLM Framework** | LangChain |
| **Multi-Agent** | CrewAI |
| **MCP** | Docker MCP Toolkit |
| **PDF Processing** | pypdf (successor to PyPDF2), pdfplumber |
| **Embeddings** | sentence-transformers |
| **File Operations** | pathlib, shutil |
| **Hashing** | hashlib, xxhash |
| **Metadata** | exifread, mutagen |
| **Similarity** | scikit-learn, faiss |
| **Observability** | Langfuse, OpenTelemetry |
| **Testing** | pytest, DeepEval |
| **Containerization** | Docker, Docker Compose |
---
## Example Usage
### End-to-End Workflow
```bash
# 1. Start the Docker stack
docker compose up -d
# 2. Scan your messy Downloads folder
fileorg scan ~/Downloads --analyze-content --export downloads_inventory.json
# 3. Preview a smart organization plan (dry run)
fileorg organize ~/Downloads --strategy=smart --dry-run
# Review the plan, then execute
fileorg organize ~/Downloads --strategy=smart
# 4. Organize research papers by topic
fileorg scan ~/Papers --types=pdf --analyze-content
fileorg organize ~/Papers --strategy=by-topic --rename --pattern="{year}_{author}_{title}"
# 5. Find and handle duplicates
fileorg deduplicate ~/Papers --similarity=0.95 --move-to=./duplicates
# 6. Extract metadata and generate bibliography
fileorg research extract ~/Papers
fileorg research bibliography ~/Papers --format=bibtex --output=references.bib
# 7. Create a reading list on a specific topic
fileorg research reading-list ~/Papers --topic "transformers" --order=citations
# 8. View statistics
fileorg stats ~/Papers
```
### Research Paper Organization Example
```bash
# Before:
~/Papers/
├── paper_final.pdf
├── attention_is_all_you_need.pdf
├── bert_paper.pdf
├── gpt3.pdf
├── vision_transformer.pdf
├── download (1).pdf
├── download (2).pdf
└── thesis_draft_v5.pdf
# Run organization
fileorg organize ~/Papers --strategy=by-topic --rename
# After:
~/Papers/
├── Natural_Language_Processing/
│   ├── Transformers/
│   │   ├── 2017_Vaswani_Attention_Is_All_You_Need.pdf
│   │   ├── 2018_Devlin_BERT_Pretraining.pdf
│   │   └── 2020_Brown_GPT3_Language_Models.pdf
│   └── Other/
│       └── 2023_Smith_Thesis_Draft.pdf
├── Computer_Vision/
│   └── Transformers/
│       └── 2020_Dosovitskiy_Vision_Transformer.pdf
└── Uncategorized/
    └── 2024_Unknown_Document.pdf
```
### Duplicate Detection Example
```bash
# Find exact duplicates
fileorg deduplicate ~/Downloads
# Found 15 duplicate files (45 MB)
#   • download.pdf (3 copies)
#   • image.jpg (2 copies)
#   • report.docx (2 copies)
# Find similar papers (different versions)
fileorg deduplicate ~/Papers --similarity=0.9 --method=content
# Found 3 similar file groups:
#   • attention_paper.pdf, attention_is_all_you_need.pdf (95% similar)
#   • bert_preprint.pdf, bert_final.pdf (98% similar)
# Auto-cleanup (keep newest)
fileorg deduplicate ~/Downloads --auto-delete --keep=newest
# ✓ Deleted 15 duplicate files, freed 45 MB
```
---
## Learning Outcomes
By building FileOrganizer, learners will be able to:
1. ✅ Set up modern Python projects with pixi and reproducible environments
2. ✅ Build professional CLI tools with Typer and Rich
3. ✅ Run local LLMs using Docker Model Runner
4. ✅ Process and extract content from PDF files
5. ✅ Build MCP servers to connect AI agents to file systems
6. ✅ Design multi-agent systems with CrewAI
7. ✅ Implement content-based similarity and clustering
8. ✅ Generate embeddings for semantic search
9. ✅ Handle file operations safely with backups and dry-run modes
10. ✅ Implement observability for AI applications
11. ✅ Test non-deterministic systems effectively
12. ✅ Deploy self-hosted AI applications with Docker Compose
---
## Advanced Features
### Smart Organization Strategy
The `smart` strategy uses AI to analyze file content and context to determine the best organization approach:
```python
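# Sketch of the smart strategy. The agent objects are passed in, and the
# boolean attributes on the analysis result are illustrative, not a fixed API.
def smart_organize(files, scanner_agent, classifier_agent, organizer_agent):
    # 1. Analyze file types and content
    analysis = scanner_agent.analyze(files)
    # 2. Determine optimal strategy
    if analysis.mostly_pdfs_with_academic_content:
        strategy = "by-topic-hierarchical"
    elif analysis.mostly_media_files:
        strategy = "by-date-and-type"
    elif analysis.mixed_work_documents:
        strategy = "by-project-and-date"
    else:
        # Assumed fallback so that `strategy` is always defined
        strategy = "by-type"
    # 3. Execute with AI-powered categorization
    classifier_agent.categorize(files, strategy)
    organizer_agent.execute(strategy)
    return strategy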
```
### Research Paper Features
Special handling for academic PDFs:
- **Metadata Extraction**: Title, authors, year, abstract, keywords
- **Citation Parsing**: Extract and parse references
- **Smart Naming**: `{year}_{first_author}_{short_title}.pdf`
- **Topic Clustering**: Group papers by research area
- **Citation Network**: Identify related papers
- **Bibliography Generation**: BibTeX, APA, MLA formats
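
The smart naming pattern above can be sketched as a small function. `smart_name`, the five-word title limit, and the `Unknown`/`Untitled` fallbacks are assumptions; the 255-character cap mirrors `max_length` in the configuration.

```python
import re
import unicodedata


def ascii_fold(text):
    """Strip accents so names stay portable across file systems."""
    return unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode()


def smart_name(year, authors, title, max_words=5, max_length=255):
    """Build '{year}_{first_author}_{short_title}.pdf' from paper metadata."""
    # First author's last name, or a placeholder when no author was extracted.
    first_author = ascii_fold(authors[0]).split()[-1] if authors else "Unknown"
    # Keep the first few alphanumeric words; preserve acronyms such as BERT.
    words = re.findall(r"[A-Za-z0-9]+", ascii_fold(title))[:max_words]
    short_title = "_".join(w if w.isupper() else w.capitalize() for w in words)
    stem = f"{year}_{first_author}_{short_title or 'Untitled'}"
    # Leave room for the 4-character '.pdf' suffix.
    return stem[: max_length - 4] + ".pdf"
```

Applied to the metadata of "Attention Is All You Need" (2017, first author Ashish Vaswani), this produces `2017_Vaswani_Attention_Is_All_You_Need.pdf`, matching the naming used in the organization example.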
### Deduplication Strategies
Multiple methods for finding duplicates:
1. **Hash-based**: Exact file matches (fastest)
2. **Content-based**: Similar content using embeddings
3. **Metadata-based**: Same title/author but different files
4. **Fuzzy matching**: Handle renamed or modified files
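
To illustrate the content-based method, the sketch below compares documents by cosine similarity and reports pairs at or above a threshold. It deliberately swaps in simple word-count vectors for the embeddings named above so that it runs with the standard library only; with real embeddings, only `vectorize` and `cosine` would change. All function names are hypothetical.

```python
import math
import re
from collections import Counter
from itertools import combinations


def vectorize(text):
    """Bag-of-words term counts (a stand-in for an embedding model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity of two sparse term-count vectors."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def similar_pairs(docs, threshold=0.9):
    """Return (name_a, name_b, score) for every pair scoring >= threshold."""
    vectors = {name: vectorize(text) for name, text in docs.items()}
    pairs = []
    for x, y in combinations(sorted(vectors), 2):
        score = cosine(vectors[x], vectors[y])
        if score >= threshold:
            pairs.append((x, y, score))
    return pairs
```

The pairwise comparison is quadratic in the number of documents, which is fine for a sketch but would likely call for an index (for example faiss, listed in the technology stack) on large collections.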
---
*This project serves as the main example in the [Learning Path](../learning-path.md) for building AI-powered CLI tools.*