Course Project: FileOrganizer
A CLI tool that uses local LLMs and AI agents to organize files intelligently, with a special focus on research paper management.
Overview
┌───────────────────────────────────────────────────┐
│ FileOrganizer CLI                                 │
├───────────────────────────────────────────────────┤
│ $ fileorg scan ~/Downloads                        │
│ $ fileorg organize ~/Papers --strategy=by-topic   │
│ $ fileorg deduplicate ~/Research --similarity=0.9 │
└───────────────────────────────────────────────────┘
Architecture
Files ──► Content Analysis ──► AI Classification ──► Organized Structure
                 │                     │
                 ▼                     ▼
          PDF Extraction      Docker Model Runner
          Metadata Tools          (Local LLM)
                 │                     │
                 └────────►MCP◄────────┘
Data Flow
┌──────────────┐      ┌──────────────┐      ┌──────────────┐
│  Files/PDFs  │─────►│   Content    │─────►│  MCP Server  │
│   (Input)    │      │  Extraction  │      │   (Tools)    │
└──────────────┘      └──────────────┘      └──────┬───────┘
                                                   │
                                                   ▼
┌──────────────┐      ┌──────────────┐      ┌──────────────┐
│  Organized   │◄─────│  Agent Crew  │◄─────│  Local LLM   │
│  Structure   │      │   (CrewAI)   │      │   (Docker)   │
└──────────────┘      └──────────────┘      └──────────────┘
Agent System
| Agent | Role | Tools | Output |
|---|---|---|---|
| Scanner Agent | Discovers files, extracts metadata | File I/O, PDF extraction, hash generation | File inventory, metadata catalog |
| Classifier Agent | Categorizes files by content and context | LLM analysis, embeddings, similarity | Category assignments, topic tags |
| Organizer Agent | Creates folder structure and moves files | File operations, naming strategies | Organized directory tree |
| Deduplicator Agent | Finds and handles duplicate files | Hash comparison, content similarity | Duplicate reports, cleanup actions |
Agent Workflow
User Request: "Organize research papers by topic"
           │
           ▼
┌─────────────────────┐
│ Scanner Agent       │
│ "What files do we   │
│  have and what      │
│  are they about?"   │
└──────────┬──────────┘
           │ File Inventory
           ▼
┌─────────────────────┐
│ Classifier Agent    │
│ "What topics and    │
│  categories emerge  │
│  from the content?" │
└──────────┬──────────┘
           │ Categories
           ▼
┌─────────────────────┐
│ Organizer Agent     │
│ "Create folder      │
│  structure and      │
│  move files"        │
└──────────┬──────────┘
           │ Organization Plan
           ▼
┌─────────────────────┐
│ Deduplicator Agent  │
│ "Find and handle    │
│  duplicate files"   │
└──────────┬──────────┘
           │
           ▼
Organized Directory
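The hand-off between the four agents can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the CrewAI implementation: `PipelineState`, the four stage functions, and `run_pipeline` are hypothetical names, and the classifier uses the file extension as a stand-in for LLM-based categorization.

```python
import hashlib
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class PipelineState:
    """Data handed from one stage to the next (hypothetical structure)."""
    files: Dict[str, bytes]                                  # name -> raw content
    inventory: Dict[str, dict] = field(default_factory=dict)
    categories: Dict[str, str] = field(default_factory=dict)
    plan: Dict[str, str] = field(default_factory=dict)       # name -> target path
    duplicates: List[str] = field(default_factory=list)


def scanner(state: PipelineState) -> PipelineState:
    # Build the file inventory: extension plus a content hash.
    for name, content in state.files.items():
        state.inventory[name] = {
            "ext": name.rsplit(".", 1)[-1].lower(),
            "sha256": hashlib.sha256(content).hexdigest(),
        }
    return state


def classifier(state: PipelineState) -> PipelineState:
    # Stand-in: a real classifier would ask the LLM for a topic.
    for name, meta in state.inventory.items():
        state.categories[name] = meta["ext"]
    return state


def organizer(state: PipelineState) -> PipelineState:
    # Turn categories into an organization plan (no files are moved here).
    for name, category in state.categories.items():
        state.plan[name] = f"{category}/{name}"
    return state


def deduplicator(state: PipelineState) -> PipelineState:
    # Drop later files whose content hash was already seen.
    seen = set()
    for name, meta in state.inventory.items():
        if meta["sha256"] in seen:
            state.duplicates.append(name)
            state.plan.pop(name, None)
        seen.add(meta["sha256"])
    return state


def run_pipeline(files: Dict[str, bytes]) -> PipelineState:
    state = PipelineState(files=files)
    for stage in (scanner, classifier, organizer, deduplicator):
        state = stage(state)
    return state
```

Each stage only reads what earlier stages produced, which mirrors the File Inventory, Categories, and Organization Plan arrows in the diagram.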
CLI Commands
fileorg scan
Scan a directory and analyze its contents.
# Scan a directory
fileorg scan ~/Downloads
# Scan with detailed analysis
fileorg scan ~/Papers --analyze-content
# Scan and export inventory
fileorg scan ~/Research --export inventory.json
# Scan specific file types
fileorg scan ~/Documents --types pdf,docx,txt
Options:
| Flag | Description | Default |
|---|---|---|
| --analyze-content | Extract and analyze file contents | false |
| --export | Export inventory to JSON/CSV | None |
| --types | Comma-separated file extensions to scan | All |
| --recursive | Scan subdirectories | true |
| --max-depth | Maximum directory depth | 10 |
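One way the scan options could map onto a directory walk is sketched below. The function name `scan` and the depth convention (the scanned directory itself is depth 0, so `max_depth=1` includes its direct subdirectories) are assumptions for illustration, not the project's confirmed behavior.

```python
import os
from pathlib import Path
from typing import Iterator, Optional, Set


def scan(root: Path, types: Optional[Set[str]] = None,
         recursive: bool = True, max_depth: int = 10) -> Iterator[Path]:
    """Yield files under root, honoring --types, --recursive and --max-depth."""
    root = Path(root)
    for dirpath, dirnames, filenames in os.walk(root):
        depth = len(Path(dirpath).relative_to(root).parts)
        # Clearing dirnames in place stops os.walk from descending further.
        if not recursive or depth >= max_depth:
            dirnames.clear()
        for name in sorted(filenames):
            path = Path(dirpath) / name
            ext = path.suffix.lower().lstrip(".")
            if types is None or ext in types:
                yield path
```

For example, `list(scan(Path.home() / "Documents", types={"pdf", "docx", "txt"}))` would correspond to the `--types pdf,docx,txt` invocation above.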
fileorg organize
Organize files using AI-powered strategies.
# Organize by topic (AI-powered)
fileorg organize ~/Papers --strategy=by-topic
# Organize by date
fileorg organize ~/Photos --strategy=by-date --format="%Y/%m"
# Organize with custom naming
fileorg organize ~/Papers --rename --pattern="{year}_{author}_{title}"
# Dry run to preview changes
fileorg organize ~/Downloads --dry-run
# Interactive mode
fileorg organize ~/Research --interactive
Options:
| Flag | Description | Default |
|---|---|---|
| --strategy | Organization strategy: by-topic, by-date, by-type, by-author, smart | smart |
| --rename | Rename files intelligently | false |
| --pattern | Naming pattern for renamed files | {original} |
| --dry-run | Preview changes without executing | false |
| --interactive | Confirm each action | false |
| --output | Output directory | Same as input |
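The by-date strategy and the dry-run flag can be illustrated with a plan-then-execute split. This is a sketch under assumptions: `plan_by_date` and `execute` are hypothetical helpers, and the file's modification time is assumed to be the date that drives the `--format` pattern.

```python
import shutil
from datetime import datetime
from pathlib import Path
from typing import List, Tuple

Plan = List[Tuple[Path, Path]]


def plan_by_date(files: List[Path], output: Path, fmt: str = "%Y/%m") -> Plan:
    """Return (source, destination) pairs without touching the file system."""
    plan = []
    for src in files:
        modified = datetime.fromtimestamp(src.stat().st_mtime)
        plan.append((src, output / modified.strftime(fmt) / src.name))
    return plan


def execute(plan: Plan, dry_run: bool = True) -> None:
    """Print the plan when dry_run is set; otherwise move the files."""
    for src, dst in plan:
        if dry_run:
            print(f"[dry-run] {src} -> {dst}")
            continue
        dst.parent.mkdir(parents=True, exist_ok=True)
        shutil.move(str(src), str(dst))
```

Building the full plan before moving anything is what makes a preview cheap: the dry run and the real run share the same planning code and differ only in the last step.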
fileorg deduplicate
Find and handle duplicate files.
# Find duplicates by hash
fileorg deduplicate ~/Downloads
# Find similar files (content-based)
fileorg deduplicate ~/Papers --similarity=0.9
# Auto-delete duplicates (keep newest)
fileorg deduplicate ~/Photos --auto-delete --keep=newest
# Move duplicates to folder
fileorg deduplicate ~/Documents --move-to=./duplicates
Options:
| Flag | Description | Default |
|---|---|---|
| --similarity | Similarity threshold (0.0-1.0) for content matching | 1.0 (exact) |
| --method | Detection method: hash, content, metadata | hash |
| --auto-delete | Automatically delete duplicates | false |
| --keep | Which to keep: newest, oldest, largest, smallest | newest |
| --move-to | Move duplicates to directory instead of deleting | None |
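The default hash method combined with `--keep=newest` could look roughly like the sketch below. SHA-256 from `hashlib` is an assumed choice of hash, and `file_digest`, `find_duplicates`, and `split_keep_newest` are hypothetical helper names.

```python
import hashlib
from collections import defaultdict
from pathlib import Path
from typing import Iterable, List, Tuple


def file_digest(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large files are not loaded into memory at once."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_duplicates(paths: Iterable[Path]) -> List[List[Path]]:
    """Group files whose content hashes are identical."""
    groups = defaultdict(list)
    for path in paths:
        groups[file_digest(path)].append(path)
    return [group for group in groups.values() if len(group) > 1]


def split_keep_newest(group: List[Path]) -> Tuple[Path, List[Path]]:
    """Return (file to keep, files to delete or move) using modification time."""
    ordered = sorted(group, key=lambda p: p.stat().st_mtime, reverse=True)
    return ordered[0], ordered[1:]
```

The files in the second element of `split_keep_newest` are the candidates that `--auto-delete` would remove or `--move-to` would relocate.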
fileorg research
Special commands for research paper management.
# Extract metadata from PDFs
fileorg research extract ~/Papers
# Generate bibliography
fileorg research bibliography ~/Papers --format=bibtex --output=refs.bib
# Find related papers
fileorg research related "attention mechanisms" --in ~/Papers
# Create reading list
fileorg research reading-list ~/Papers --topic "transformers" --order=citations
Options:
| Flag | Description | Default |
|---|---|---|
| --format | Bibliography format: bibtex, apa, mla | bibtex |
| --output | Output file path | stdout |
| --order | Sort order: date, citations, relevance | relevance |
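As an illustration of the bibtex format, the sketch below turns an extracted metadata record into a BibTeX entry. The metadata keys (`title`, `authors`, `year`), the `@misc` entry type, and the citation-key scheme are assumptions, and special characters in field values are not escaped here.

```python
import re
from typing import Dict


def to_bibtex(meta: Dict, entry_type: str = "misc") -> str:
    """Render one metadata record as a BibTeX entry (no character escaping)."""
    authors = meta["authors"]
    # Citation key: first author's last name token plus year, e.g. "vaswani2017".
    surname = re.sub(r"\W+", "", authors[0].split()[-1]).lower()
    key = f"{surname}{meta['year']}"
    fields = {
        "title": meta["title"],
        "author": " and ".join(authors),
        "year": str(meta["year"]),
    }
    body = ",\n".join(f"  {name} = {{{value}}}" for name, value in fields.items())
    return f"@{entry_type}{{{key},\n{body}\n}}"
```

Calling `to_bibtex` once per paper and joining the results with blank lines would give the content written to the `--output` file.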
fileorg config
Manage configuration settings.
# Show current config
fileorg config show
# Set LLM model
fileorg config set llm.model "llama3.2:3b"
# Set default strategy
fileorg config set organize.default_strategy "by-topic"
# Reset to defaults
fileorg config reset
fileorg stats
Show statistics about files and organization.
# Show directory statistics
fileorg stats ~/Papers
# Show organization suggestions
fileorg stats ~/Downloads --suggest
# Export statistics
fileorg stats ~/Research --export stats.json
Configuration
Configuration is stored in ~/.config/fileorg/config.toml or ./fileorg.toml in the project directory.
[fileorg]
version = "1.0.0"
[llm]
provider = "docker" # docker, ollama, openai
model = "llama3.2:3b"
temperature = 0.7
max_tokens = 4096
base_url = "http://localhost:11434"
[llm.docker]
runtime = "nvidia" # nvidia, cpu
memory_limit = "8g"
[agents]
verbose = false
max_iterations = 10
[agents.scanner]
role = "File Scanner"
goal = "Discover and catalog all files with metadata"
[agents.classifier]
role = "Content Classifier"
goal = "Categorize files by content and context"
[agents.organizer]
role = "File Organizer"
goal = "Create optimal folder structure and organize files"
[agents.deduplicator]
role = "Duplicate Detector"
goal = "Find and handle duplicate files efficiently"
[organize]
default_strategy = "smart"
create_backups = true
backup_dir = "./.fileorg_backup"
[organize.naming]
sanitize = true
max_length = 255
replace_spaces = "_"
[research]
extract_metadata = true
auto_rename = true
naming_pattern = "{year}_{author}_{title}"
generate_bibliography = true
[deduplication]
default_method = "hash"
similarity_threshold = 0.95
auto_delete = false
keep_strategy = "newest"
[pdf]
extract_text = true
extract_metadata = true
ocr_enabled = false # Enable OCR for scanned PDFs
[observability]
enabled = true
provider = "langfuse" # langfuse, langsmith, console
trace_agents = true
log_tokens = true
Docker Stack
docker-compose.yml
version: "3.9"

services:
  # Local LLM (served here by the Ollama image)
  llm:
    image: ollama/ollama:latest
    runtime: nvidia
    environment:
      - OLLAMA_HOST=0.0.0.0
    volumes:
      - ollama_data:/root/.ollama
    ports:
      - "11434:11434"
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    healthcheck:
      # The ollama/ollama image may not ship curl, so this check uses the
      # bundled CLI, which fails if the server is not responding.
      test: ["CMD", "ollama", "list"]
      interval: 30s
      timeout: 10s
      retries: 3

  # MCP Server for file operations and PDF tools
  mcp-server:
    build:
      context: ./src/fileorg/mcp
      dockerfile: Dockerfile
    environment:
      - MCP_PORT=3000
    volumes:
      - ./workspace:/workspace
    ports:
      - "3000:3000"
    depends_on:
      - llm

  # Main application (for containerized usage)
  fileorg:
    build:
      context: .
      dockerfile: Dockerfile
    environment:
      - LLM_BASE_URL=http://llm:11434
      - MCP_SERVER_URL=http://mcp-server:3000
    volumes:
      - ./workspace:/workspace
      - ./config:/config:ro
    depends_on:
      llm:
        condition: service_healthy
      mcp-server:
        condition: service_started
    profiles:
      - cli

volumes:
  ollama_data:
Running the Stack
# Start LLM and MCP server
docker compose up -d llm mcp-server
# Pull the model (first time only)
docker compose exec llm ollama pull llama3.2:3b
# Run FileOrganizer commands
docker compose run --rm fileorg scan /workspace/papers
docker compose run --rm fileorg organize /workspace/papers --strategy=by-topic
docker compose run --rm fileorg deduplicate /workspace/downloads
# Or run locally with Docker backend
fileorg scan ~/Papers
fileorg organize ~/Papers --strategy=by-topic
fileorg deduplicate ~/Downloads
Project Structure
fileorg/
├── pyproject.toml  # pixi/uv project config
├── pixi.lock
├── docker-compose.yml  # Full stack orchestration
├── Dockerfile
├── fileorg.toml  # Default configuration
├── README.md
│
├── src/
│   └── fileorg/
│       ├── __init__.py
│       ├── __main__.py  # Entry point
│       ├── cli.py  # Typer CLI commands
│       ├── config.py  # TOML configuration loader
│       │
│       ├── scanner/  # File discovery and analysis
│       │   ├── __init__.py
│       │   ├── discovery.py  # File system traversal
│       │   ├── metadata.py  # Metadata extraction
│       │   ├── pdf_reader.py  # PDF text/metadata extraction
│       │   └── hashing.py  # File hashing utilities
│       │
│       ├── classifier/  # Content classification
│       │   ├── __init__.py
│       │   ├── embeddings.py  # Generate embeddings
│       │   ├── clustering.py  # Topic clustering
│       │   ├── categorizer.py  # AI-powered categorization
│       │   └── similarity.py  # Content similarity
│       │
│       ├── organizer/  # File organization
│       │   ├── __init__.py
│       │   ├── strategies.py  # Organization strategies
│       │   ├── naming.py  # File naming logic
│       │   ├── structure.py  # Directory structure creation
│       │   └── mover.py  # Safe file operations
│       │
│       ├── deduplicator/  # Duplicate detection
│       │   ├── __init__.py
│       │   ├── hash_based.py  # Hash-based detection
│       │   ├── content_based.py  # Content similarity detection
│       │   └── handler.py  # Duplicate handling
│       │
│       ├── research/  # Research paper tools
│       │   ├── __init__.py
│       │   ├── extractor.py  # PDF metadata extraction
│       │   ├── bibliography.py  # Bibliography generation
│       │   ├── citation.py  # Citation parsing
│       │   └── scholar.py  # Academic search integration
│       │
│       ├── agents/  # CrewAI agents
│       │   ├── __init__.py
│       │   ├── crew.py  # Crew orchestration
│       │   ├── scanner.py  # Scanner agent
│       │   ├── classifier.py  # Classifier agent
│       │   ├── organizer.py  # Organizer agent
│       │   └── deduplicator.py  # Deduplicator agent
│       │
│       ├── tools/  # Agent tools
│       │   ├── __init__.py
│       │   ├── file_tools.py  # File operation tools
│       │   ├── pdf_tools.py  # PDF processing tools
│       │   ├── search_tools.py  # Search and query tools
│       │   └── analysis.py  # Content analysis tools
│       │
│       ├── mcp/  # MCP server
│       │   ├── __init__.py
│       │   ├── server.py  # MCP server implementation
│       │   ├── tools.py  # MCP tool definitions
│       │   └── Dockerfile  # MCP server container
│       │
│       ├── llm/  # LLM integration
│       │   ├── __init__.py
│       │   ├── client.py  # LLM client (Docker/Ollama/OpenAI)
│       │   └── prompts.py  # Prompt templates
│       │
│       └── observability/  # Logging & tracing
│           ├── __init__.py
│           ├── tracing.py  # Distributed tracing
│           └── metrics.py  # Token/cost tracking
│
├── tests/
│   ├── __init__.py
│   ├── conftest.py  # Pytest fixtures
│   ├── test_cli.py
│   ├── test_scanner.py
│   ├── test_classifier.py
│   ├── test_organizer.py
│   ├── test_deduplicator.py
│   ├── test_research.py
│   └── fixtures/
│       ├── sample_papers/
│       │   ├── paper1.pdf
│       │   ├── paper2.pdf
│       │   └── paper3.pdf
│       ├── sample_files/
│       └── expected_outputs/
│
├── workspace/  # Working directory
│   └── .gitkeep
│
└── docs/  # Documentation (Quarto)
    ├── _quarto.yml
    ├── index.qmd
    └── chapters/
Technology Stack
| Category | Tools |
|---|---|
| Package Management | pixi, uv |
| CLI Framework | Typer, Rich |
| Local LLM | Docker Model Runner, Ollama |
| LLM Framework | LangChain |
| Multi-Agent | CrewAI |
| MCP | Docker MCP Toolkit |
| PDF Processing | PyPDF2, pdfplumber, pypdf |
| Embeddings | sentence-transformers |
| File Operations | pathlib, shutil |
| Hashing | hashlib, xxhash |
| Metadata | exifread, mutagen |
| Similarity | scikit-learn, faiss |
| Observability | Langfuse, OpenTelemetry |
| Testing | pytest, DeepEval |
| Containerization | Docker, Docker Compose |
Example Usage
End-to-End Workflow
# 1. Start the Docker stack
docker compose up -d
# 2. Scan your messy Downloads folder
fileorg scan ~/Downloads --analyze-content --export downloads_inventory.json
# 3. Preview how the smart strategy would organize the files
fileorg organize ~/Downloads --strategy=smart --dry-run
# Review the plan, then execute
fileorg organize ~/Downloads --strategy=smart
# 4. Organize research papers by topic
fileorg scan ~/Papers --types=pdf --analyze-content
fileorg organize ~/Papers --strategy=by-topic --rename --pattern="{year}_{author}_{title}"
# 5. Find and handle duplicates
fileorg deduplicate ~/Papers --similarity=0.95 --move-to=./duplicates
# 6. Extract metadata and generate bibliography
fileorg research extract ~/Papers
fileorg research bibliography ~/Papers --format=bibtex --output=references.bib
# 7. Create a reading list on a specific topic
fileorg research reading-list ~/Papers --topic "transformers" --order=citations
# 8. View statistics
fileorg stats ~/Papers
Research Paper Organization Example
# Before:
~/Papers/
├── paper_final.pdf
├── attention_is_all_you_need.pdf
├── bert_paper.pdf
├── gpt3.pdf
├── vision_transformer.pdf
├── download (1).pdf
├── download (2).pdf
└── thesis_draft_v5.pdf
# Run organization
fileorg organize ~/Papers --strategy=by-topic --rename
# After:
~/Papers/
├── Natural_Language_Processing/
│   ├── Transformers/
│   │   ├── 2017_Vaswani_Attention_Is_All_You_Need.pdf
│   │   ├── 2018_Devlin_BERT_Pretraining.pdf
│   │   └── 2020_Brown_GPT3_Language_Models.pdf
│   └── Other/
│       └── 2023_Smith_Thesis_Draft.pdf
├── Computer_Vision/
│   └── Transformers/
│       └── 2020_Dosovitskiy_Vision_Transformer.pdf
└── Uncategorized/
    └── 2024_Unknown_Document.pdf
Duplicate Detection Example
# Find exact duplicates
fileorg deduplicate ~/Downloads
# Found 15 duplicate files (45 MB)
# • download.pdf (3 copies)
# • image.jpg (2 copies)
# • report.docx (2 copies)
# Find similar papers (different versions)
fileorg deduplicate ~/Papers --similarity=0.9 --method=content
# Found 3 similar file groups:
# • attention_paper.pdf, attention_is_all_you_need.pdf (95% similar)
# • bert_preprint.pdf, bert_final.pdf (98% similar)
# Auto-cleanup (keep newest)
fileorg deduplicate ~/Downloads --auto-delete --keep=newest
# Deleted 15 duplicate files, freed 45 MB
Learning Outcomes
By building FileOrganizer, learners will be able to:
- Set up modern Python projects with pixi and reproducible environments
- Build professional CLI tools with Typer and Rich
- Run local LLMs using Docker Model Runner
- Process and extract content from PDF files
- Build MCP servers to connect AI agents to file systems
- Design multi-agent systems with CrewAI
- Implement content-based similarity and clustering
- Generate embeddings for semantic search
- Handle file operations safely with backups and dry-run modes
- Implement observability for AI applications
- Test non-deterministic systems effectively
- Deploy self-hosted AI applications with Docker Compose
Advanced Features
Smart Organization Strategy
The smart strategy uses AI to analyze file content and context to determine the best organization approach:
# Pseudocode for smart strategy
def smart_organize(files):
    # 1. Analyze file types and content
    file_analysis = scanner_agent.analyze(files)

    # 2. Determine optimal strategy
    #    (the flags are assumed to be attributes of the analysis result)
    if file_analysis.mostly_pdfs_with_academic_content:
        strategy = "by-topic-hierarchical"
    elif file_analysis.mostly_media_files:
        strategy = "by-date-and-type"
    elif file_analysis.mixed_work_documents:
        strategy = "by-project-and-date"
    else:
        # Assumed fallback so that strategy is always defined
        strategy = "by-type"

    # 3. Execute with AI-powered categorization
    classifier_agent.categorize(files, strategy)
    organizer_agent.execute(strategy)
Research Paper Features
Special handling for academic PDFs:
- Metadata Extraction: Title, authors, year, abstract, keywords
- Citation Parsing: Extract and parse references
- Smart Naming: {year}_{first_author}_{short_title}.pdf
- Topic Clustering: Group papers by research area
- Citation Network: Identify related papers
- Bibliography Generation: BibTeX, APA, MLA formats
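The smart naming pattern from the list above can be sketched as a small helper. `smart_name` is a hypothetical function; it assumes author names arrive in "Given Surname" order, truncates the title to its first five words, and falls back to placeholder values when metadata is missing.

```python
import re
from typing import Optional


def smart_name(year: Optional[int], first_author: str, title: str,
               max_words: int = 5, max_length: int = 255) -> str:
    """Build a {year}_{first_author}_{short_title}.pdf file name."""

    def clean(text: str) -> str:
        # Collapse every run of non-alphanumeric characters into one underscore.
        return re.sub(r"[^A-Za-z0-9]+", "_", text).strip("_")

    surname = clean(first_author.split()[-1]) if first_author.strip() else "Unknown"
    short_title = "_".join(clean(title).split("_")[:max_words]) or "Untitled"
    stem = f"{year or 'Unknown'}_{surname}_{short_title}"
    # Keep the whole name, including the extension, within max_length.
    return stem[: max_length - len(".pdf")] + ".pdf"
```

For instance, `smart_name(2017, "Ashish Vaswani", "Attention Is All You Need")` produces `2017_Vaswani_Attention_Is_All_You_Need.pdf`.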
Deduplication Strategies
Multiple methods for finding duplicates:
- Hash-based: Exact file matches (fastest)
- Content-based: Similar content using embeddings
- Metadata-based: Same title/author but different files
- Fuzzy matching: Handle renamed or modified files
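Content-based detection can be illustrated with a similarity threshold over text vectors. To stay dependency-free, this sketch swaps the embeddings named above for a simple bag-of-words cosine similarity; `vectorize`, `cosine`, and `similar_pairs` are hypothetical names, and a real implementation would compare embedding vectors instead.

```python
import math
import re
from collections import Counter
from itertools import combinations
from typing import Dict, List, Tuple


def vectorize(text: str) -> Counter:
    """Bag-of-words term counts (stand-in for an embedding vector)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values()) * sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def similar_pairs(texts: Dict[str, str],
                  threshold: float = 0.9) -> List[Tuple[str, str, float]]:
    """Return every pair of documents whose similarity reaches the threshold."""
    vectors = {name: vectorize(text) for name, text in texts.items()}
    pairs = []
    for x, y in combinations(sorted(vectors), 2):
        score = cosine(vectors[x], vectors[y])
        if score >= threshold:
            pairs.append((x, y, score))
    return pairs
```

Comparing all pairs is quadratic in the number of files, which is acceptable for a small paper collection; a larger one would call for an approximate nearest-neighbor index.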
This project serves as the main example in the Learning Path for building AI-powered CLI tools.