# Data Models Reference

> **Last Updated**: 2025-12-06

This document describes all Pydantic models used in DeepBoner.

## Location

All core models are defined in `src/utils/models.py`.

## Type Definitions

### SourceName

```python
SourceName = Literal["pubmed", "clinicaltrials", "europepmc", "preprint", "openalex", "web"]
```

`SourceName` is the single, centralized type for source identifiers. Add a new literal value here when integrating a new database.
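
Because `SourceName` is a `Literal`, its allowed values can be read back at runtime with `typing.get_args`. The following is a minimal sketch of a membership check; `KNOWN_SOURCES` and `is_known_source` are hypothetical names used for illustration and are not part of `src/utils/models.py`.

```python
from typing import Literal, get_args

SourceName = Literal["pubmed", "clinicaltrials", "europepmc", "preprint", "openalex", "web"]

# Tuple of the allowed source names, derived from the Literal itself
# so the two can never drift apart.
KNOWN_SOURCES: tuple[str, ...] = get_args(SourceName)


def is_known_source(name: str) -> bool:
    """Return True if `name` is one of the SourceName values (hypothetical helper)."""
    return name in KNOWN_SOURCES
```

Deriving the tuple from the `Literal` means a newly added source is picked up without a second list to maintain.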

---

## Core Models

### Citation

Represents a citation to a source document.

```python
class Citation(BaseModel):
    source: SourceName          # Where this came from
    title: str                  # Title (1-500 chars)
    url: str                    # URL to source
    date: str                   # Publication date (YYYY-MM-DD or 'Unknown')
    authors: list[str]          # Author list

    MAX_AUTHORS_IN_CITATION: ClassVar[int] = 3

    @property
    def formatted(self) -> str:
        """Format as citation string."""
```

**Example:**
```python
citation = Citation(
    source="pubmed",
    title="Effects of testosterone on female libido",
    url="https://pubmed.ncbi.nlm.nih.gov/12345678",
    date="2024-01-15",
    authors=["Smith J", "Jones A", "Brown B"]
)
print(citation.formatted)
# "Smith J, Jones A, Brown B (2024-01-15). Effects of testosterone..."
```
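
The body of `formatted` is not shown in this document. As a rough sketch of how `MAX_AUTHORS_IN_CITATION` might cap the author list, the standalone function below reproduces the three-author output above. The function name `format_citation`, the `et al.` suffix for longer lists, and the exact layout are assumptions, not the actual implementation.

```python
MAX_AUTHORS_IN_CITATION = 3  # mirrors the ClassVar on Citation


def format_citation(authors: list[str], date: str, title: str) -> str:
    """Hypothetical stand-in for Citation.formatted."""
    # Show at most MAX_AUTHORS_IN_CITATION names.
    shown = ", ".join(authors[:MAX_AUTHORS_IN_CITATION])
    # Assumption: longer author lists are abbreviated with "et al."
    if len(authors) > MAX_AUTHORS_IN_CITATION:
        shown += " et al."
    return f"{shown} ({date}). {title}"
```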

---

### Evidence

A piece of evidence retrieved from search.

```python
class Evidence(BaseModel):
    content: str                # The actual text content (min 1 char)
    citation: Citation          # Source citation
    relevance: float            # Relevance score 0-1
    metadata: dict[str, Any]    # Additional metadata

    model_config = {"frozen": True}  # Immutable
```

**Metadata fields** (source-dependent):
- `cited_by_count` - Citation count
- `concepts` - Subject concepts
- `is_open_access` - OA status
- `pmid` - PubMed ID
- `doi` - Digital Object Identifier

**Example:**
```python
evidence = Evidence(
    content="The study found significant improvement...",
    citation=citation,
    relevance=0.85,
    metadata={"pmid": "12345678", "cited_by_count": 42}
)
```
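
Since the metadata keys depend on the source, consumers should not assume any of them is present. A minimal sketch of defensive access follows; `describe_metadata` is a hypothetical helper that operates on the plain `metadata` dict, and the output wording is illustrative only.

```python
from typing import Any


def describe_metadata(metadata: dict[str, Any]) -> str:
    """Summarize whichever optional metadata keys are present (hypothetical helper)."""
    parts: list[str] = []
    if "pmid" in metadata:
        parts.append(f"PMID {metadata['pmid']}")
    if "doi" in metadata:
        parts.append(f"DOI {metadata['doi']}")
    # A missing count means "unknown", which is different from zero citations.
    count = metadata.get("cited_by_count")
    if count is not None:
        parts.append(f"cited by {count}")
    if metadata.get("is_open_access"):
        parts.append("open access")
    return ", ".join(parts) if parts else "no metadata"
```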

---

### SearchResult

Result of a search operation.

```python
class SearchResult(BaseModel):
    query: str                      # Original query
    evidence: list[Evidence]        # Retrieved evidence
    sources_searched: list[SourceName]  # Which sources were queried
    total_found: int                # Total matches
    errors: list[str]               # Any errors encountered
```

---

## Assessment Models

### AssessmentDetails

Detailed assessment of evidence quality by the Judge.

```python
class AssessmentDetails(BaseModel):
    mechanism_score: int            # 0-10: How well explained
    mechanism_reasoning: str        # Explanation (min 10 chars)
    clinical_evidence_score: int    # 0-10: Clinical strength
    clinical_reasoning: str         # Explanation (min 10 chars)
    drug_candidates: list[str]      # Specific drugs mentioned
    key_findings: list[str]         # Key findings
```

---

### JudgeAssessment

Complete assessment from the Judge.

```python
class JudgeAssessment(BaseModel):
    details: AssessmentDetails
    sufficient: bool                # Is evidence sufficient?
    confidence: float               # 0-1 confidence
    recommendation: Literal["continue", "synthesize"]
    next_search_queries: list[str]  # If continue, what to search
    reasoning: str                  # Overall reasoning (min 20 chars)
```

**Decision Logic:**
- `recommendation="continue"` → More evidence needed, loop back
- `recommendation="synthesize"` → Ready to generate report
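
The decision logic above can be sketched as a small dispatch function. This is an illustration only: `next_step` and the action labels `"search"` and `"synthesize"` are hypothetical, and the real orchestrator may branch differently.

```python
from typing import Literal

Recommendation = Literal["continue", "synthesize"]


def next_step(
    recommendation: Recommendation, next_search_queries: list[str]
) -> tuple[str, list[str]]:
    """Map a judge recommendation to an action and the queries to run (hypothetical helper)."""
    if recommendation == "synthesize":
        # Evidence is sufficient: no further queries are needed.
        return ("synthesize", [])
    if recommendation == "continue":
        # Loop back and search again with the judge's suggested queries.
        return ("search", list(next_search_queries))
    raise ValueError(f"unknown recommendation: {recommendation!r}")
```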

---

## Event Models

### AgentEvent

Event emitted by the orchestrator for UI streaming.

```python
class AgentEvent(BaseModel):
    type: Literal[
        "started",
        "thinking",
        "searching",
        "search_complete",
        "judging",
        "judge_complete",
        "looping",
        "synthesizing",
        "complete",
        "error",
        "streaming",
        "hypothesizing",
        "analyzing",
        "analysis_complete",
        "progress",
    ]
    message: str
    data: Any = None
    timestamp: datetime
    iteration: int = 0

    def to_markdown(self) -> str:
        """Format event as markdown with emoji."""
```

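**Event Types** (the table covers 11 of the 15 types; `streaming`, `hypothesizing`, `analyzing`, and `analysis_complete` are not documented here):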
| Type | Icon | Meaning |
|------|------|---------|
| `started` | 🚀 | Research started |
| `thinking` | ⏳ | Processing |
| `searching` | 🔍 | Searching databases |
| `search_complete` | 📚 | Search finished |
| `judging` | 🧠 | Evaluating evidence |
| `judge_complete` | ✅ | Judgment done |
| `looping` | 🔄 | Refining query |
| `synthesizing` | 📝 | Generating report |
| `complete` | 🎉 | Research complete |
| `error` | ❌ | Error occurred |
| `progress` | ⏱️ | Progress update |
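
The body of `to_markdown` is not shown here. One plausible shape, sketched below, is a lookup from event type to icon followed by string formatting. The icons are copied from the table above; the function name `event_to_markdown`, the fallback icon, and the output layout are assumptions.

```python
# Icons copied from the Event Types table above.
EVENT_ICONS: dict[str, str] = {
    "started": "🚀",
    "thinking": "⏳",
    "searching": "🔍",
    "search_complete": "📚",
    "judging": "🧠",
    "judge_complete": "✅",
    "looping": "🔄",
    "synthesizing": "📝",
    "complete": "🎉",
    "error": "❌",
    "progress": "⏱️",
}

DEFAULT_ICON = "•"  # assumption: used for types the table does not list


def event_to_markdown(event_type: str, message: str) -> str:
    """Format an event as a one-line markdown string (hypothetical layout)."""
    icon = EVENT_ICONS.get(event_type, DEFAULT_ICON)
    return f"{icon} **{event_type.upper()}**: {message}"
```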

---

## Hypothesis Models

### MechanismHypothesis

A scientific hypothesis about a drug's mechanism of action.

```python
class MechanismHypothesis(BaseModel):
    drug: str                       # Drug being studied
    target: str                     # Molecular target
    pathway: str                    # Biological pathway
    effect: str                     # Downstream effect
    confidence: float               # 0-1 confidence
    supporting_evidence: list[str]  # Supporting PMIDs/URLs
    contradicting_evidence: list[str]
    search_suggestions: list[str]

    def to_search_queries(self) -> list[str]:
        """Generate queries to test hypothesis."""
```
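
This document does not show how `to_search_queries` builds its queries. As a purely illustrative sketch, a hypothesis could be tested by pairing adjacent links of the drug → target → pathway → effect chain. The standalone function and its query templates below are assumptions, not the method's actual behavior.

```python
def to_search_queries(drug: str, target: str, pathway: str, effect: str) -> list[str]:
    """Hypothetical query builder; the real templates are not documented here."""
    return [
        f"{drug} {target}",     # does the drug act on the target?
        f"{target} {pathway}",  # is the target part of the pathway?
        f"{pathway} {effect}",  # does the pathway produce the effect?
        f"{drug} {effect}",     # is there direct evidence for the end-to-end claim?
    ]
```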

---

### HypothesisAssessment

Assessment of evidence against hypotheses.

```python
class HypothesisAssessment(BaseModel):
    hypotheses: list[MechanismHypothesis]
    primary_hypothesis: MechanismHypothesis | None
    knowledge_gaps: list[str]
    recommended_searches: list[str]
```

---

## Report Models

### ReportSection

A section of the research report.

```python
class ReportSection(BaseModel):
    title: str
    content: str
    citations: list[str] = []   # Reserved for inline citations
```

---

### ResearchReport

Structured scientific report (final output).

```python
class ResearchReport(BaseModel):
    title: str
    executive_summary: str          # 100-1000 chars
    research_question: str

    methodology: ReportSection
    hypotheses_tested: list[dict[str, Any]]

    mechanistic_findings: ReportSection
    clinical_findings: ReportSection

    drug_candidates: list[str]
    limitations: list[str]
    conclusion: str

    references: list[dict[str, str]]

    # Metadata
    sources_searched: list[str]
    total_papers_reviewed: int
    search_iterations: int
    confidence_score: float         # 0-1

    def to_markdown(self) -> str:
        """Render report as markdown."""
```

**Reference Format:**
```python
{
    "title": "Paper title",
    "authors": "Smith J et al.",
    "source": "pubmed",
    "date": "2024-01-15",
    "url": "https://..."
}
```
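
A reference dict in this format can be rendered as a numbered markdown line. The sketch below shows one possible layout; `format_reference` is a hypothetical helper, and the output of `ResearchReport.to_markdown` may differ.

```python
def format_reference(index: int, ref: dict[str, str]) -> str:
    """Render one reference dict as a numbered markdown line (hypothetical layout)."""
    return (
        f"{index}. {ref['authors']} ({ref['date']}). "
        f"[{ref['title']}]({ref['url']}). *{ref['source']}*"
    )
```

The helper indexes the keys directly, so a reference that is missing any of the five keys raises `KeyError` instead of producing a partial line.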

---

## Configuration Models

### OrchestratorConfig

Configuration for the orchestrator.

```python
class OrchestratorConfig(BaseModel):
    max_iterations: int = 10        # 1-20
    max_results_per_tool: int = 10  # 1-50
    search_timeout: float = 30.0    # 5-120 seconds
```
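
The range comments above describe bounds that the real model presumably enforces through Pydantic field constraints. To illustrate the same checks without depending on Pydantic, the sketch below uses a standard-library dataclass as a stand-in; `OrchestratorConfigSketch` and `_check_range` are hypothetical names, and the real model raises Pydantic's own validation error, not `ValueError`.

```python
from dataclasses import dataclass


def _check_range(name: str, value: float, low: float, high: float) -> None:
    """Raise ValueError if value falls outside the inclusive range [low, high]."""
    if not low <= value <= high:
        raise ValueError(f"{name} must be between {low} and {high}, got {value}")


@dataclass(frozen=True)
class OrchestratorConfigSketch:
    """Stdlib stand-in for OrchestratorConfig with the documented defaults and ranges."""

    max_iterations: int = 10
    max_results_per_tool: int = 10
    search_timeout: float = 30.0

    def __post_init__(self) -> None:
        _check_range("max_iterations", self.max_iterations, 1, 20)
        _check_range("max_results_per_tool", self.max_results_per_tool, 1, 50)
        _check_range("search_timeout", self.search_timeout, 5, 120)
```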

---

## Model Relationships

```
SearchResult
    └── Evidence[]
           └── Citation

JudgeAssessment
    └── AssessmentDetails

ResearchReport
    ├── ReportSection (methodology)
    ├── ReportSection (mechanistic_findings)
    └── ReportSection (clinical_findings)

HypothesisAssessment
    └── MechanismHypothesis[]
```

---

## Validation Notes

All models use Pydantic v2 with:

- **Field constraints** - `ge`/`le` bounds on numeric fields (`ge=0`, `le=1` for confidence and relevance; 0-10 for judge scores), `min_length` for strings
- **Frozen models** - Evidence is immutable (`frozen=True`)
- **Default factories** - Lists default to `[]` via `default_factory=list`

---

## Related Documentation

- [Component Inventory](component-inventory.md)
- [Exception Hierarchy](exception-hierarchy.md)
- [Architecture Overview](overview.md)