title: Data Acquisition Layer
summary: >-
  How TerraFin resolves names, normalizes outputs, and fetches public and
  private data.
read_when:
  - Adding a new data provider or indicator
  - Debugging data fetch failures or cache issues
  - Understanding how DataFactory resolves a ticker or indicator name
  - Working with TimeSeriesDataFrame or PortfolioDataFrame

Data Acquisition Layer

The data layer lives under src/TerraFin/data/. Its job is to hide provider differences behind a small set of entry points and return predictable output shapes.

For deployment and upstream data-usage responsibilities, see License & Data Rights.

For most callers, the important pieces are:

  • DataFactory for fetching data
  • TimeSeriesDataFrame for chart-ready time series
  • HistoryChunk for progressive chart-history loading
  • PortfolioOutput for 13F holdings data

Architecture

Routes / Agent / Frontend
        ↓
   DataFactory ──── CacheManager (uniform cache for all sources)
        ↓
   Providers (all return predefined contracts only)
   ├── yfinance        (market data, fundamentals)
   ├── FRED            (economic indicators)
   ├── market indicator (VIX, MOVE, ...)
   ├── economic        (UNRATE, M2, derived indicators)
   ├── private_access  (CAPE, fear/greed, breadth, P/E spread; HTTP to sibling DataFactory)
   ├── SEC EDGAR       (filings, 13F)
   └── corporate       (yfinance fundamentals)

Rules of the road

These are hard rules. They apply to every provider, every route, every agent capability.

  • Contracts are the only return type. All providers return ONLY a predefined contract from src/TerraFin/data/contracts/. No ad-hoc dicts ({date, cape} etc.), ever.
  • DataFactory is the single facade. Routes, the agent, and the frontend never call providers directly. They go through DataFactory.
  • Caching is unified. All caching goes through CacheManager. Providers themselves are pure fetchers.
  • private_access must shape to TerraFin contracts. The HTTP server in the sibling ~/Downloads/work/DataFactory repo MUST shape responses to match the contracts in this repo. Contract definitions in src/TerraFin/data/contracts/ are the source of truth.
  • Adding a new data type is a three-step pattern, no exceptions:
    1. Define a contract in data/contracts/ (or reuse an existing one).
    2. Write a provider that returns that contract.
    3. Register the provider with DataFactory.
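
The three-step pattern above can be sketched in a few lines. This is a minimal illustration, not TerraFin's actual code: YieldSnapshot, fetch_ten_year_yield, MiniFactory, and its register() method are hypothetical names, and the returned value is a placeholder rather than real data.

```python
from dataclasses import dataclass
from typing import Callable, Dict


# Step 1: define a contract (hypothetical; real contracts live in data/contracts/).
@dataclass(frozen=True)
class YieldSnapshot:
    name: str
    value: float
    as_of: str  # ISO date string


# Step 2: write a provider that returns only that contract.
def fetch_ten_year_yield() -> YieldSnapshot:
    # A real provider would call its upstream source here; this value is a placeholder.
    return YieldSnapshot(name="10Y Yield", value=4.25, as_of="2024-01-02")


# Step 3: register the provider with the facade (MiniFactory stands in for DataFactory).
class MiniFactory:
    def __init__(self) -> None:
        self._providers: Dict[str, Callable[[], object]] = {}

    def register(self, name: str, provider: Callable[[], object]) -> None:
        self._providers[name] = provider

    def get(self, name: str) -> object:
        if name not in self._providers:
            raise KeyError(f"no provider registered for {name!r}")
        return self._providers[name]()


factory = MiniFactory()
factory.register("10Y Yield", fetch_ten_year_yield)
snapshot = factory.get("10Y Yield")
```

Callers only ever see the contract type, which is the point of the rule: the provider can change its upstream source without changing what routes or the agent receive.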

Contracts

Source: src/TerraFin/data/contracts/ (canonical list in __init__.py)

Every provider's public return type is one of the contracts below. Each entry gives the file location, the key fields, the validation the contract enforces on construction, and one short example.

TimeSeriesDataFrame

  • Location: src/TerraFin/data/contracts/dataframes.py
  • Subclass of pd.DataFrame
  • Columns kept (in order): time, open, high, low, close, volume
  • Validation: close is required after normalization; time must parse as datetime; rows are sorted by time and de-duplicated; non-positive prices are dropped; column aliases (Date, datetime, Close, ...) are normalized; on failure the constructor returns an empty frame with the canonical schema.
  • Carries .name (series label) and .chart_meta (chart-side metadata).
  • Example: df = factory.get("AAPL") → a chart-ready TimeSeriesDataFrame.
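
To make the validation rules concrete, here is a simplified sketch that applies the same steps to plain dictionaries instead of a pandas frame. It is an approximation under stated assumptions: normalize_rows is a hypothetical helper, only the close price is checked for positivity, close values are assumed numeric, and the choice that the last duplicate wins is an assumption rather than documented behavior.

```python
from datetime import datetime

CANONICAL = ("time", "open", "high", "low", "close", "volume")
ALIASES = {"date": "time", "datetime": "time"}  # lowercase alias -> canonical name


def normalize_rows(rows):
    """Return rows with canonical keys, sorted by time and de-duplicated.

    Any row missing time/close, or with an unparseable time, makes the
    whole call fall back to an empty result.
    """
    by_time = {}
    for raw in rows:
        row = {}
        for key, value in raw.items():
            key = key.lower()
            row[ALIASES.get(key, key)] = value
        if "time" not in row or "close" not in row:
            return []  # required field missing
        try:
            row["time"] = datetime.fromisoformat(str(row["time"]))
        except ValueError:
            return []  # time does not parse as a datetime
        if row["close"] <= 0:
            continue  # drop non-positive prices
        # Keep canonical columns only; a later duplicate timestamp overwrites an earlier one.
        by_time[row["time"]] = {name: row.get(name) for name in CANONICAL}
    return [by_time[t] for t in sorted(by_time)]


rows = [
    {"Date": "2024-01-03", "Close": 11.0},
    {"Date": "2024-01-02", "Close": 10.0},
    {"Date": "2024-01-02", "Close": 10.5},  # duplicate timestamp
    {"Date": "2024-01-04", "Close": 0},     # non-positive price
]
clean = normalize_rows(rows)
```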

HistoryChunk

  • Location: src/TerraFin/data/contracts/history.py
  • Dataclass with frame: TimeSeriesDataFrame, loaded_start, loaded_end, requested_period, is_complete, has_older, source_version.
  • Used for progressive chart loading. Bounds and flags let the frontend decide whether to backfill older history.
  • Example: chunk = factory.get_recent_history("S&P 500", period="3y") → seed window plus has_older=True to drive the backfill request.
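
The sketch below shows how a caller might use the bounds and flags to decide whether to request older history. The field names come from the list above, but the field types, the HistoryChunkSketch class, and the needs_backfill helper are assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import date
from typing import Any, Optional


@dataclass
class HistoryChunkSketch:
    frame: Any                       # stands in for TimeSeriesDataFrame
    loaded_start: Optional[date]
    loaded_end: Optional[date]
    requested_period: Optional[str]
    is_complete: bool
    has_older: bool
    source_version: Optional[str] = None


def needs_backfill(chunk: HistoryChunkSketch) -> bool:
    """Backfill only when older data exists and the chunk is not already complete."""
    return chunk.has_older and not chunk.is_complete


seed = HistoryChunkSketch(
    frame=[],
    loaded_start=date(2021, 1, 4),
    loaded_end=date(2024, 1, 2),
    requested_period="3y",
    is_complete=False,
    has_older=True,
)
full = HistoryChunkSketch(
    frame=[],
    loaded_start=None,
    loaded_end=date(2024, 1, 2),
    requested_period=None,
    is_complete=True,
    has_older=False,
)
```

A frontend following this shape would pass seed.loaded_start to get_full_history_backfill when needs_backfill(seed) is true, and skip the request otherwise.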

PortfolioDataFrame and PortfolioOutput

  • Location: src/TerraFin/data/contracts/dataframes.py
  • PortfolioDataFrame is a pd.DataFrame subclass with a make_figure() method that renders a Plotly treemap of 13F holdings.
  • PortfolioOutput (defined alongside DataFactory) bundles info: dict metadata and df: PortfolioDataFrame.
  • Validation: Stock / Ticker / % of Portfolio / Updated / Recent Activity columns are expected by make_figure().
  • Example: out = factory.get_portfolio_data("Warren Buffett") → out.df.make_figure() for the treemap.

FinancialStatementFrame

  • Location: src/TerraFin/data/contracts/statements.py
  • pd.DataFrame subclass. Columns are reporting-period dates (ISO strings or pd.Timestamp); rows are line items.
  • Required at construction: statement_type ∈ {income, balance, cashflow}, period ∈ {annual, quarterly}, ticker. Column-shape validation rejects non-date columns.
  • Example: frame = factory.get_corporate_data("AAPL", statement_type="income", period="annual") → income statement keyed by fiscal year.

CalendarEvent and EventList

  • Location: src/TerraFin/data/contracts/events.py
  • CalendarEvent is a frozen dataclass with id, title, start (timezone-aware datetime, enforced), category ∈ {macro, earning, fed, dividend, ipo}, importance ∈ {low, medium, high}, display_time, plus optional description, source, metadata.
  • EventList wraps list[CalendarEvent] and supports iteration and indexing.
  • Example: macro calendar provider returns an EventList of FOMC and release-date events; the calendar route serializes them directly.
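
One way to enforce a timezone-aware start on a frozen dataclass is a __post_init__ check. The following is a sketch of that technique only: CalendarEventSketch is a hypothetical, trimmed-down class, and the real contract may validate differently.

```python
from dataclasses import dataclass
from datetime import datetime, timezone

CATEGORIES = {"macro", "earning", "fed", "dividend", "ipo"}
IMPORTANCE = {"low", "medium", "high"}


@dataclass(frozen=True)
class CalendarEventSketch:
    id: str
    title: str
    start: datetime
    category: str
    importance: str

    def __post_init__(self) -> None:
        # A datetime is timezone-aware only if utcoffset() returns a value.
        if self.start.tzinfo is None or self.start.utcoffset() is None:
            raise ValueError("start must be a timezone-aware datetime")
        if self.category not in CATEGORIES:
            raise ValueError(f"unknown category: {self.category!r}")
        if self.importance not in IMPORTANCE:
            raise ValueError(f"unknown importance: {self.importance!r}")


event = CalendarEventSketch(
    id="example-1",
    title="Example release",
    start=datetime(2024, 1, 31, 19, 0, tzinfo=timezone.utc),
    category="fed",
    importance="high",
)
```

Constructing the same event with a naive datetime (no tzinfo) raises ValueError, so invalid events never reach the calendar route.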

TOCEntry and FilingDocument

  • Location: src/TerraFin/data/contracts/filings.py
  • TOCEntry: frozen dataclass with id, title, level, anchor.
  • FilingDocument: dataclass with ticker, filing_type ∈ {10-K, 10-Q, 8-K, 13F, S-1, DEF 14A}, accession, filing_date, markdown, toc: list[TOCEntry], optional metadata.
  • Example: SEC EDGAR provider returns a FilingDocument whose markdown body and toc drive the Stock Analysis filings panel.

IndicatorSnapshot

  • Location: src/TerraFin/data/contracts/indicators.py
  • Frozen dataclass: name, value (number or string), as_of, optional unit, change, change_pct, rating, metadata.
  • Use for single-value scalar indicators (current Fear & Greed score, latest CAPE, breadth-of-the-day) where a full time series isn't needed.
  • Example: dashboard fear/greed widget reads snapshot = factory.get_indicator("fear_greed") and renders snapshot.value and snapshot.rating.

chart_output

  • Location: src/TerraFin/data/contracts/markers.py
  • Decorator that normalizes the return of any time-series-shaped DataFactory method into a TimeSeriesDataFrame via _to_timeseries, tagging the source for cache and debug visibility.
  • Not a return type itself, but a marker applied to factory methods that promise time-series output.
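
The general shape of such a normalizing decorator can be sketched as follows. The real signature of chart_output and the behavior of _to_timeseries are not shown in this document, so chart_output_sketch, its source argument, and to_timeseries_sketch are assumptions; the sketch returns a plain dict rather than a TimeSeriesDataFrame.

```python
import functools


def to_timeseries_sketch(raw):
    """Stand-in for _to_timeseries: sort (time, close) pairs by time."""
    return sorted(raw, key=lambda pair: pair[0])


def chart_output_sketch(source):
    """Decorator factory: normalize the wrapped function's return and tag its source."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            rows = to_timeseries_sketch(func(*args, **kwargs))
            return {"source": source, "rows": rows}
        return wrapper
    return decorator


@chart_output_sketch("demo")
def get_demo_series():
    # Deliberately out of order to show the normalization step.
    return [("2024-01-03", 11.0), ("2024-01-02", 10.0)]


result = get_demo_series()
```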

DataFactory

Source: src/TerraFin/data/factory.py

DataFactory(api_keys: dict[str, str] | None = None)

DataFactory is the main entry point. Use get(name) when you want TerraFin to decide where a name belongs, or call a domain-specific method when you already know the source.

Resolution order for get(name)

  1. Market indicator registry (VIX, treasuries, ...)
  2. Economic indicator registry (FRED series, macro, ...)
  3. Index map + yfinance (tickers, index names)
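
The resolution order can be pictured as a chain of lookups that falls through to yfinance. This is a sketch only: the three dictionaries and the resolve function are hypothetical, and the registry contents are small illustrative samples rather than TerraFin's real registries.

```python
# Illustrative registry contents; the real registries live in the provider packages.
MARKET_INDICATORS = {"VIX": "^VIX"}
ECONOMIC_INDICATORS = {"Unemployment Rate": "UNRATE"}
INDEX_MAP = {"S&P 500": "^GSPC"}


def resolve(name):
    """Return (provider, symbol) following the documented lookup order."""
    if name in MARKET_INDICATORS:        # 1. market indicator registry
        return "market_indicator", MARKET_INDICATORS[name]
    if name in ECONOMIC_INDICATORS:      # 2. economic indicator registry
        return "economic", ECONOMIC_INDICATORS[name]
    # 3. index alias if known, otherwise treat the name as a raw ticker
    return "yfinance", INDEX_MAP.get(name, name)
```

Because registries are checked first, a name that exists both as an indicator and as a ticker resolves to the indicator.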

Which method to call

| Method | Return type | Description |
| --- | --- | --- |
| get(name) | TimeSeriesDataFrame | Universal lookup across market indicators, economic indicators, index aliases, and raw tickers |
| get_recent_history(name, period="3y") | HistoryChunk | Recent seed window used by progressive chart loading |
| get_full_history_backfill(name, loaded_start=None) | HistoryChunk | Older history to prepend onto an already-seeded chart |
| get_fred_data(indicator_name) | TimeSeriesDataFrame | Direct FRED lookup by FRED code such as "UNRATE" |
| get_economic_data(indicator_name) | TimeSeriesDataFrame | Human-readable economic lookup such as "Unemployment Rate" |
| get_market_data(ticker_or_name) | TimeSeriesDataFrame | Market lookup through the market provider layer |
| get_corporate_data(ticker, statement_type="income", period="annual") | pd.DataFrame or None | Company financials via TerraFin's yfinance-backed statement adapter |
| get_portfolio_data(guru_name) | PortfolioOutput | Guru portfolio holdings via SEC EDGAR 13F filings |

The time-series methods are normalized by the @chart_output decorator before they are returned.

Output type conveniences

The contract definitions above are the source of truth. A few notes about working with them in practice:

  • TimeSeriesDataFrame.make_empty() returns an empty frame with the canonical schema; .name and .chart_meta survive slicing and pandas operations.
  • FinancialStatementFrame.make_empty(statement_type, period, ticker) creates an empty statement frame with the right metadata for a missing source.
  • EventList.make_empty() and FilingDocument.make_empty(ticker, filing_type) exist for the same reason: empty results stay typed.

Provider map

| Domain | Backing source | Typical access path | Notes |
| --- | --- | --- | --- |
| Market prices | yfinance | get("AAPL"), get("S&P 500"), get("Shanghai Composite") | Handles tickers and index aliases |
| Market indicators | Registry-backed market series | get("VIX"), get("MOVE"), get("Net Breadth") | Mix of yfinance-backed and private-series-backed names resolved before raw tickers |
| Economic series | FRED | get_fred_data("UNRATE"), get("Unemployment Rate") | Human-readable names map to FRED codes |
| Computed macro indicators | FRED-derived logic | get("Buffett Indicator") | Built from public series |
| Credit and risk indicators | FRED and FRED-derived | get("High Yield Spread"), get("Net Liquidity") | HY spread, RRP, net liquidity, 18M forward rate spread, credit spread |
| Corporate fundamentals | yfinance statement adapter | get_corporate_data("AAPL") | Returns a plain pandas frame |
| SEC filings | SEC EDGAR | get_sec_data(ticker, filing_type), fetch_and_parse_filing(cik, accession, doc, form, include_images) | Parses 10-K / 10-Q / 8-K HTML into markdown + TOC. For 8-K (and 8-K/A), EX-99.x exhibits (earnings press release, CFO commentary) are fetched alongside the cover doc and appended as "## Exhibit 99.x — <label>" sections so the substantive content is reachable. Cached 30 days under the sec.* namespaces. |
| Guru portfolios | SEC EDGAR 13F | get_portfolio_data("Warren Buffett") | Returns PortfolioOutput |
| Private dashboard data | Private endpoint with fallbacks | dashboard and market-insights APIs | Watchlist, breadth, trailing-forward P/E spread, CAPE, calendar, fear/greed, top companies |
| Macro events | FRED plus yfinance | calendar API | Fetched locally, but managed through the private-data cache lifecycle |

Registry locations:

  • Market indicators: src/TerraFin/data/providers/market/market_indicator.py
  • Economic indicators: src/TerraFin/data/providers/economic/registry.py
  • Guru portfolio registry: src/TerraFin/data/providers/corporate/filings/sec_edgar/guru_cik.json

The supported guru names for the 13F feature are maintained in the JSON registry above rather than hardcoded inline in Python, so additions and edits can stay data-backed and easier to review.

If TERRAFIN_SEC_USER_AGENT is missing, TerraFin still exposes the supported guru list but treats SEC-backed holdings as disabled. The interface and agent API return explicit configuration errors instead of silently falling back to third-party proxies.
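
A fail-loudly check of this kind might look like the sketch below. Only the TERRAFIN_SEC_USER_AGENT variable name comes from this document; SecConfigError and require_sec_user_agent are hypothetical names, and the user-agent string in the usage line is a made-up placeholder.

```python
import os


class SecConfigError(RuntimeError):
    """Raised when SEC-backed features are requested without configuration."""


def require_sec_user_agent(env=None):
    """Return the configured SEC user agent or raise an explicit error."""
    env = os.environ if env is None else env
    user_agent = env.get("TERRAFIN_SEC_USER_AGENT", "").strip()
    if not user_agent:
        raise SecConfigError(
            "TERRAFIN_SEC_USER_AGENT is not set; SEC-backed holdings are disabled"
        )
    return user_agent


# Passing a dict instead of os.environ keeps the check easy to exercise.
configured = require_sec_user_agent({"TERRAFIN_SEC_USER_AGENT": "Example Org admin@example.com"})
```

Raising a dedicated error type lets the interface and agent API report a configuration problem rather than an empty result.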

Private access

Private-access features are TerraFin's bridge beyond the public core. They let the same public interfaces connect to deployment-specific data and operator-side workflows without making those extensions part of the default open-source path.

These are intentional private-access extensions, not arbitrary hidden features. They provide one authenticated boundary where operator-managed deployments can attach broader workflow-specific data while public/demo deployments continue to run on public providers and safe fallbacks.

Private-access features provide proprietary or deployment-specific data behind an authenticated endpoint. They are optional: if the endpoint is unavailable, TerraFin may fall back to local file cache first and then to bundled fixtures or empty defaults, depending on the resource. This means local or private installs can continue to function without private credentials, with reduced coverage for private dashboard data. That fallback behavior should be treated as an operational convenience for controlled deployments, not as a blanket permission to serve cached restricted data publicly.
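
The fallback order described above (endpoint, then local file cache, then bundled fixture, then an empty default) can be sketched as an ordered list of loaders. load_with_fallback and the loader functions are hypothetical, and the actual per-resource behavior in TerraFin may differ.

```python
def load_with_fallback(loaders, empty_default):
    """Try each (label, loader) pair in order and return the first usable result.

    A loader that raises or returns None is skipped. If none succeed,
    the empty default is returned with the label "empty".
    """
    for label, loader in loaders:
        try:
            value = loader()
        except Exception:
            continue  # this source is unavailable; try the next one
        if value is not None:
            return label, value
    return "empty", empty_default


def endpoint_down():
    raise ConnectionError("private endpoint unavailable")


source, payload = load_with_fallback(
    [
        ("endpoint", endpoint_down),
        ("file_cache", lambda: None),        # nothing cached yet
        ("fixture", lambda: {"items": []}),  # bundled fixture stand-in
    ],
    empty_default={},
)
```

Returning the label alongside the payload makes it possible to tell a live response from a cached or fixture one, which matters when deciding what is safe to serve publicly.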

Configuration via env vars:

| Variable | Description |
| --- | --- |
| TERRAFIN_PRIVATE_SOURCE_ENDPOINT | Base endpoint URL for the private source |
| TERRAFIN_PRIVATE_SOURCE_ACCESS_KEY | Header name used for authentication |
| TERRAFIN_PRIVATE_SOURCE_ACCESS_VALUE | Header value used for authentication |
| TERRAFIN_PRIVATE_SOURCE_TIMEOUT_SECONDS | HTTP timeout (default: 10) |
| TERRAFIN_SEC_USER_AGENT | Required SEC EDGAR user-agent string for filings and 13F access |
| TERRAFIN_MONGODB_URI / MONGODB_URI | Optional MongoDB backend for watchlist write mode |
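
As a rough illustration of how the private-source variables fit together, the sketch below reads them into a small config object. The variable names and the 10-second default come from the table above; PrivateSourceConfig, load_private_source_config, and auth_headers are hypothetical, and the URL and header values in the usage lines are placeholders.

```python
import os
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class PrivateSourceConfig:
    endpoint: Optional[str]
    header_name: Optional[str]
    header_value: Optional[str]
    timeout_seconds: float

    def auth_headers(self) -> Dict[str, str]:
        """Build the auth header only when both its name and value are set."""
        if self.header_name and self.header_value:
            return {self.header_name: self.header_value}
        return {}


def load_private_source_config(env=None) -> PrivateSourceConfig:
    env = os.environ if env is None else env
    return PrivateSourceConfig(
        endpoint=env.get("TERRAFIN_PRIVATE_SOURCE_ENDPOINT") or None,
        header_name=env.get("TERRAFIN_PRIVATE_SOURCE_ACCESS_KEY") or None,
        header_value=env.get("TERRAFIN_PRIVATE_SOURCE_ACCESS_VALUE") or None,
        timeout_seconds=float(env.get("TERRAFIN_PRIVATE_SOURCE_TIMEOUT_SECONDS", "10")),
    )


config = load_private_source_config({
    "TERRAFIN_PRIVATE_SOURCE_ENDPOINT": "https://private.example.invalid",
    "TERRAFIN_PRIVATE_SOURCE_ACCESS_KEY": "X-Example-Key",
    "TERRAFIN_PRIVATE_SOURCE_ACCESS_VALUE": "placeholder",
})
unconfigured = load_private_source_config({})
```

Note that ACCESS_KEY holds the header name and ACCESS_VALUE the secret, so the pair maps directly onto one HTTP request header.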

Implementation lives under src/TerraFin/data/providers/private_access/.

The private endpoint currently backs these dashboard and market-insight resources:

  • watchlist
  • market breadth
  • trailing-forward P/E spread
  • CAPE
  • calendar data
  • fear/greed
  • top companies

Private series vs private widget

Private-source data in TerraFin should be classified in one of two ways.

PrivateSeries

Use this when the data should behave like a real TerraFin series.

Requirements:

  • normalize to TimeSeriesDataFrame
  • be usable through DataFactory
  • support HistoryChunk semantics if optimized chart serving is needed
  • share the same cache and progressive-history contract as other chartable series

Examples:

  • Fear & Greed when used as a chart/searchable series
  • Net Breadth as a chart/searchable breadth history series
  • future chartable private series such as CAPE or Trailing-Forward P/E Spread, if promoted into the chart/search flow

PrivateWidget

Use this when the data is only a dashboard or page payload.

Characteristics:

  • arbitrary JSON/dict/list response shape
  • simple cache and refresh behavior
  • no DataFactory or chart progressive-history requirement

Examples:

  • top-companies payloads
  • dashboard-only summaries that are not intended to become chart series

The rule is:

  • if a private-source feature wants TerraFin's optimized chart serving, it must enter the system as TimeSeriesDataFrame
  • otherwise it remains a widget payload and should not be forced into the chart pipeline

If TerraFin is deployed publicly, keep those private-source resources behind the authenticated endpoint and treat fallback caches as an operational convenience, not as redistribution permission. Public/demo deployments should rely on public providers and bundled public-safe fixtures, not warmed private-source caches.

Watchlist write mode

The watchlist page always remains available in read-only sample mode. Writable watchlist CRUD is optional and only turns on when MongoDB is configured through:

  • TERRAFIN_MONGODB_URI or MONGODB_URI
  • TERRAFIN_WATCHLIST_MONGODB_DATABASE
  • TERRAFIN_WATCHLIST_MONGODB_COLLECTION
  • TERRAFIN_WATCHLIST_DOCUMENT_ID

Without those settings, TerraFin keeps the page visible and serves bundled sample data instead of failing startup.
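
The mode switch can be sketched as a simple check on the MongoDB URI. This is a guess at the shape of the logic, not the actual implementation: watchlist_mode is a hypothetical helper, the precedence of TERRAFIN_MONGODB_URI over MONGODB_URI is an assumption, and the sketch does not model whether the database, collection, and document-id variables are also required.

```python
import os


def watchlist_mode(env=None) -> str:
    """Return "read-write" when a MongoDB URI is configured, else "read-only-sample"."""
    env = os.environ if env is None else env
    # Assumption: the TerraFin-prefixed variable wins when both are set.
    uri = env.get("TERRAFIN_MONGODB_URI") or env.get("MONGODB_URI")
    return "read-write" if uri else "read-only-sample"


sample_mode = watchlist_mode({})
write_mode = watchlist_mode({"MONGODB_URI": "mongodb://localhost:27017"})
```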

Macro events and the private-data lifecycle

Macro calendar events are fetched by TerraFin itself from public sources, not from the private endpoint. They still participate in the same PrivateDataService refresh and fallback flow as private data so the interface has one consistent cache lifecycle.

| Module | Path | Responsibility |
| --- | --- | --- |
| Macro calendar | src/TerraFin/data/providers/economic/macro_calendar.py | Fetches release dates from FRED API |
| Macro values | src/TerraFin/data/providers/economic/macro_values.py | Enriches events with Latest/Previous from FRED series observations |
| Cache | source private.macro in CacheManager | Daily refresh via PrivateDataService.refresh_macro() |

Current limitation: macro events do not yet carry a reliable consensus expected value. The current enrichment step only guarantees actual and prior values.

Caching

Provider caches are described in caching.md. The short version:

  • public providers such as yfinance and FRED use in-memory plus file cache
  • yfinance also exposes progressive-history helpers backed by yfinance_v2 columnar artifacts for 3Y seed + full-history backfill flows
  • guru portfolios use file cache and now participate in manager-driven invalidation
  • private-access resources are also registered with the background cache manager
  • file cache sits under ~/.terrafin/cache/

When deploying TerraFin publicly, review private-access cache usage carefully. Local cache can preserve previously fetched restricted data; that does not make the data public-safe to serve. If a deployment mixes public traffic with private-source access, treat cache contents as potentially restricted unless the upstream terms clearly allow that storage and display pattern.

Adding a provider

Use this checklist when extending the data layer:

  1. Add a provider function under the correct domain package.
  2. Return TimeSeriesDataFrame for chartable time series, or a clearly different type when the data is not time-series shaped.
  3. Register the name in the market or economic registry if it should be discoverable through DataFactory.get(...).
  4. Add cache behavior only if the source benefits from reuse or background refresh.
  5. For private-source features, decide explicitly whether the feature is a PrivateSeries or a PrivateWidget before wiring UI, chart, or agent surfaces.

See also