title: Data Acquisition Layer
summary: >-
How TerraFin resolves names, normalizes outputs, and fetches public and
private data.
read_when:
- Adding a new data provider or indicator
- Debugging data fetch failures or cache issues
- Understanding how DataFactory resolves a ticker or indicator name
- Working with TimeSeriesDataFrame or PortfolioDataFrame
Data Acquisition Layer
The data layer lives under src/TerraFin/data/. Its job is to hide provider
differences behind a small set of entry points and return predictable output
shapes.
For deployment and upstream data-usage responsibilities, see License & Data Rights.
For most callers, the important pieces are:
- DataFactory for fetching data
- TimeSeriesDataFrame for chart-ready time series
- HistoryChunk for progressive chart-history loading
- PortfolioOutput for 13F holdings data
Architecture
    Routes / Agent / Frontend
            │
       DataFactory ──── CacheManager (uniform cache for all sources)
            │
       Providers (all return predefined contracts only)
         ├── yfinance (market data, fundamentals)
         ├── FRED (economic indicators)
         ├── market indicator (VIX, MOVE, ...)
         ├── economic (UNRATE, M2, derived indicators)
         ├── private_access (CAPE, fear/greed, breadth, P/E spread; HTTP to sibling DataFactory)
         ├── SEC EDGAR (filings, 13F)
         └── corporate (yfinance fundamentals)
Rules of the road
These are hard rules. They apply to every provider, every route, every agent capability.
- Contracts are the only return type. All providers return ONLY a predefined contract from src/TerraFin/data/contracts/. No ad-hoc dicts ({date, cape} etc.), ever.
- DataFactory is the single facade. Routes, the agent, and the frontend never call providers directly. They go through DataFactory.
- Caching is unified. All caching goes through CacheManager. Providers themselves are pure fetchers.
- private_access must shape to TerraFin contracts. The HTTP server in the sibling ~/Downloads/work/DataFactory repo MUST shape responses to match the contracts in this repo. Contract definitions in src/TerraFin/data/contracts/ are the source of truth.
- Adding a new data type is a three-step pattern, no exceptions:
  - Define a contract in data/contracts/ (or reuse an existing one).
  - Write a provider that returns that contract.
  - Register the provider with DataFactory.
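The three-step pattern can be sketched in miniature as follows. This is an illustration under stated assumptions, not TerraFin's code: `YieldPoint`, `fetch_yield_curve`, `MiniFactory`, and its `register`/`fetch` methods are hypothetical names, and the sample rows are made up.

```python
from dataclasses import dataclass
from datetime import date
from typing import Callable


# Step 1: define a contract (hypothetical; real contracts live in data/contracts/).
@dataclass(frozen=True)
class YieldPoint:
    as_of: date
    tenor: str
    rate_pct: float


# Step 2: write a provider that returns only that contract, never an ad-hoc dict.
def fetch_yield_curve() -> list[YieldPoint]:
    # A real provider would call an upstream source; these rows are made up.
    return [
        YieldPoint(date(2024, 1, 2), "2Y", 4.0),
        YieldPoint(date(2024, 1, 2), "10Y", 3.5),
    ]


# Step 3: register the provider with the facade so callers never import it directly.
class MiniFactory:
    """Hypothetical stand-in for DataFactory's registration surface."""

    def __init__(self) -> None:
        self._providers: dict[str, Callable[[], object]] = {}

    def register(self, name: str, provider: Callable[[], object]) -> None:
        self._providers[name] = provider

    def fetch(self, name: str) -> object:
        return self._providers[name]()


factory = MiniFactory()
factory.register("yield_curve", fetch_yield_curve)
points = factory.fetch("yield_curve")
```

Keeping the provider a pure fetcher, as in step 2, is what lets caching stay in one place behind the facade.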
Contracts
Source: src/TerraFin/data/contracts/ (canonical list in
__init__.py)
Every provider's public return type is one of the contracts below. Each entry gives the file location, the key fields, the validation the contract enforces on construction, and one short example.
TimeSeriesDataFrame
- Location: src/TerraFin/data/contracts/dataframes.py
- Subclass of pd.DataFrame
- Columns kept (in order): time, open, high, low, close, volume
- Validation: close is required after normalization; time must parse as datetime; rows are sorted by time and de-duplicated; non-positive prices are dropped; column aliases (Date, datetime, Close, ...) are normalized; on failure the constructor returns an empty frame with the canonical schema.
- Carries .name (series label) and .chart_meta (chart-side metadata).
- Example: df = factory.get("AAPL") → TimeSeriesDataFrame chart-ready.
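To make the validation rules concrete, here is a minimal sketch that applies them to plain dict rows instead of a pandas frame. It is not the contract's implementation: `normalize_rows`, the `ALIASES` table, the keep-last de-duplication policy, and checking only `close` for non-positive prices are all assumptions made for illustration.

```python
from datetime import datetime

CANONICAL = ("time", "open", "high", "low", "close", "volume")
# Assumed alias table; the real alias list lives in dataframes.py.
ALIASES = {"date": "time", "datetime": "time"}


def normalize_rows(rows: list[dict]) -> list[dict]:
    """Apply the documented rules to dict rows; return [] on any failure."""
    try:
        cleaned: dict[datetime, dict] = {}
        for raw in rows:
            # Lower-case keys and map aliases such as Date / datetime / Close.
            row = {ALIASES.get(k.lower(), k.lower()): v for k, v in raw.items()}
            if row.get("close") is None:
                raise ValueError("close is required")
            row["time"] = datetime.fromisoformat(str(row["time"]))
            if float(row["close"]) <= 0:
                continue  # drop non-positive prices (only close is checked here)
            # Assumption: a later duplicate timestamp replaces the earlier row.
            cleaned[row["time"]] = {c: row.get(c) for c in CANONICAL}
        # Sort by time and emit rows in the canonical column order.
        return [cleaned[t] for t in sorted(cleaned)]
    except (KeyError, ValueError, TypeError):
        return []  # mirrors "empty frame with the canonical schema"
```

Returning an empty result instead of raising keeps downstream chart code on a single typed path, which is the behavior the contract describes.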
HistoryChunk
- Location: src/TerraFin/data/contracts/history.py
- Dataclass with frame: TimeSeriesDataFrame, loaded_start, loaded_end, requested_period, is_complete, has_older, source_version.
- Used for progressive chart loading. Bounds and flags let the frontend decide whether to backfill older history.
- Example: chunk = factory.get_recent_history("S&P 500", period="3y") → seed window plus has_older=True to drive the backfill request.
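A sketch of how a caller might use the chunk flags to decide on a backfill. The field names follow the list above, but the field types, the `next_backfill_request` helper, and the sample dates are assumptions for illustration only.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class HistoryChunk:
    # Field names follow the contract description; the types are assumptions.
    frame: Any  # a TimeSeriesDataFrame in TerraFin
    loaded_start: Optional[str]
    loaded_end: Optional[str]
    requested_period: str
    is_complete: bool
    has_older: bool
    source_version: str


def next_backfill_request(name: str, chunk: HistoryChunk) -> Optional[dict]:
    """Return arguments for a backfill call, or None when nothing is left to load."""
    if chunk.is_complete or not chunk.has_older:
        return None
    # Ask only for history older than what the chart already holds.
    return {"name": name, "loaded_start": chunk.loaded_start}


seed = HistoryChunk(
    frame=None,
    loaded_start="2022-01-03",
    loaded_end="2025-01-03",
    requested_period="3y",
    is_complete=False,
    has_older=True,
    source_version="v1",
)
request = next_backfill_request("S&P 500", seed)
```

The returned keys mirror the `get_full_history_backfill(name, loaded_start=None)` signature listed in the method table, so the dict could be passed through as keyword arguments.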
PortfolioDataFrame and PortfolioOutput
- Location: src/TerraFin/data/contracts/dataframes.py
- PortfolioDataFrame is a pd.DataFrame subclass with a make_figure() method that renders a Plotly treemap of 13F holdings.
- PortfolioOutput (defined alongside DataFactory) bundles info: dict metadata and df: PortfolioDataFrame.
- Validation: Stock / Ticker / % of Portfolio / Updated / Recent Activity columns are expected by make_figure().
- Example: out = factory.get_portfolio_data("Warren Buffett") → out.df.make_figure() for the treemap.
FinancialStatementFrame
- Location: src/TerraFin/data/contracts/statements.py
- pd.DataFrame subclass. Columns are reporting-period dates (ISO strings or pd.Timestamp); rows are line items.
- Required at construction: statement_type ∈ {income, balance, cashflow}, period ∈ {annual, quarterly}, ticker. Column-shape validation rejects non-date columns.
- Example: frame = factory.get_corporate_data("AAPL", statement_type="income", period="annual") → income statement keyed by fiscal year.
CalendarEvent and EventList
- Location: src/TerraFin/data/contracts/events.py
- CalendarEvent is a frozen dataclass with id, title, start (timezone-aware datetime, enforced), category ∈ {macro, earning, fed, dividend, ipo}, importance ∈ {low, medium, high}, display_time, plus optional description, source, metadata.
- EventList wraps list[CalendarEvent] and supports iteration and indexing.
- Example: macro calendar provider returns an EventList of FOMC and release-date events; the calendar route serializes them directly.
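One way a frozen dataclass can enforce the timezone-aware `start` and the allowed `category` and `importance` values is validation in `__post_init__`. The class below is a simplified sketch, not the real contract: `CalendarEventSketch` is a hypothetical name, `display_time` and the optional fields are omitted, and the sample event is illustrative.

```python
from dataclasses import dataclass
from datetime import datetime, timezone

CATEGORIES = {"macro", "earning", "fed", "dividend", "ipo"}
IMPORTANCE = {"low", "medium", "high"}


@dataclass(frozen=True)
class CalendarEventSketch:
    id: str
    title: str
    start: datetime
    category: str
    importance: str

    def __post_init__(self) -> None:
        # A naive datetime has no tzinfo, or a tzinfo that yields no UTC offset.
        if self.start.tzinfo is None or self.start.utcoffset() is None:
            raise ValueError("start must be timezone-aware")
        if self.category not in CATEGORIES:
            raise ValueError(f"unknown category: {self.category}")
        if self.importance not in IMPORTANCE:
            raise ValueError(f"unknown importance: {self.importance}")


# Illustrative event; the id and timestamp are sample values.
event = CalendarEventSketch(
    id="evt-1",
    title="FOMC decision",
    start=datetime(2024, 1, 31, 19, 0, tzinfo=timezone.utc),
    category="fed",
    importance="high",
)
```

Rejecting naive datetimes at construction means every consumer can compare and sort events without guessing the zone.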
TOCEntry and FilingDocument
- Location: src/TerraFin/data/contracts/filings.py
- TOCEntry: frozen dataclass with id, title, level, anchor.
- FilingDocument: dataclass with ticker, filing_type ∈ {10-K, 10-Q, 8-K, 13F, S-1, DEF 14A}, accession, filing_date, markdown, toc: list[TOCEntry], optional metadata.
- Example: SEC EDGAR provider returns a FilingDocument whose markdown body and toc drive the Stock Analysis filings panel.
IndicatorSnapshot
- Location: src/TerraFin/data/contracts/indicators.py
- Frozen dataclass: name, value (number or string), as_of, optional unit, change, change_pct, rating, metadata.
- Use for single-value scalar indicators (current Fear & Greed score, latest CAPE, breadth-of-the-day) where a full time series isn't needed.
- Example: dashboard fear/greed widget reads snapshot = factory.get_indicator("fear_greed") and renders snapshot.value and snapshot.rating.
chart_output
- Location: src/TerraFin/data/contracts/markers.py
- Decorator that normalizes the return of any time-series-shaped DataFactory method into a TimeSeriesDataFrame via _to_timeseries, tagging the source for cache and debug visibility.
- Not a return type itself; it is a marker applied to factory methods that promise time-series output.
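The general shape of such a normalizing decorator can be sketched as below. This is not the code in markers.py: `chart_output_sketch`, its `source` argument, the `chart_source` attribute, and `to_timeseries_stub` (a stand-in for `_to_timeseries` that produces plain row dicts) are all assumptions.

```python
import functools
from typing import Any, Callable


def to_timeseries_stub(value: Any) -> list[dict]:
    """Stand-in for _to_timeseries: coerce a result into a list of row dicts."""
    if value is None:
        return []
    if isinstance(value, dict):
        return [value]
    return list(value)


def chart_output_sketch(source: str) -> Callable:
    """Decorator factory: normalize the wrapped function's return and tag its source."""

    def decorator(func: Callable) -> Callable:
        @functools.wraps(func)
        def wrapper(*args: Any, **kwargs: Any) -> list[dict]:
            return to_timeseries_stub(func(*args, **kwargs))

        wrapper.chart_source = source  # readable by cache and debug tooling
        return wrapper

    return decorator


@chart_output_sketch(source="demo")
def get_demo_series(name: str):
    # Returns a generator on purpose; the decorator still hands back a list.
    return ({"time": f"2024-01-0{d}", "close": float(d)} for d in (1, 2))


rows = get_demo_series("demo")
```

Doing the normalization in a decorator keeps each factory method free to return whatever its provider produces while callers always see one shape.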
DataFactory
Source: src/TerraFin/data/factory.py
DataFactory(api_keys: dict[str, str] | None = None)
DataFactory is the main entry point. Use get(name) when you want TerraFin
to decide where a name belongs, or call a domain-specific method when you
already know the source.
Resolution order for get(name)
- Market indicator registry (VIX, treasuries, ...)
- Economic indicator registry (FRED series, macro, ...)
- Index map + yfinance (tickers, index names)
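The resolution order amounts to a first-match lookup across three stages. The sketch below uses hypothetical registries and a hypothetical `resolve` helper that reports which stage claimed a name; `^GSPC` is Yahoo Finance's S&P 500 symbol, but whether TerraFin's index map uses it is an assumption.

```python
from typing import Callable

# Hypothetical registries; TerraFin's real ones live in the provider packages.
MARKET_INDICATORS: dict[str, Callable[[], str]] = {"VIX": lambda: "market:VIX"}
ECONOMIC_INDICATORS: dict[str, Callable[[], str]] = {
    "Unemployment Rate": lambda: "economic:UNRATE",
}
INDEX_MAP = {"S&P 500": "^GSPC"}


def resolve(name: str) -> str:
    """Walk the three stages in order and report which one claimed the name."""
    if name in MARKET_INDICATORS:
        return MARKET_INDICATORS[name]()
    if name in ECONOMIC_INDICATORS:
        return ECONOMIC_INDICATORS[name]()
    # Anything else is treated as an index alias or a raw ticker.
    return f"yfinance:{INDEX_MAP.get(name, name)}"
```

Because registries are checked first, a registered indicator name always wins over a raw ticker with the same spelling.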
Which method to call
| Method | Return type | Description |
|---|---|---|
| get(name) | TimeSeriesDataFrame | Universal lookup across market indicators, economic indicators, index aliases, and raw tickers |
| get_recent_history(name, period="3y") | HistoryChunk | Recent seed window used by progressive chart loading |
| get_full_history_backfill(name, loaded_start=None) | HistoryChunk | Older history to prepend onto an already-seeded chart |
| get_fred_data(indicator_name) | TimeSeriesDataFrame | Direct FRED lookup by FRED code such as "UNRATE" |
| get_economic_data(indicator_name) | TimeSeriesDataFrame | Human-readable economic lookup such as "Unemployment Rate" |
| get_market_data(ticker_or_name) | TimeSeriesDataFrame | Market lookup through the market provider layer |
| get_corporate_data(ticker, statement_type="income", period="annual") | pd.DataFrame \| None | Company financials via TerraFin's yfinance-backed statement adapter |
| get_portfolio_data(guru_name) | PortfolioOutput | Guru portfolio holdings via SEC EDGAR 13F filings |
The time-series methods are normalized by the @chart_output decorator before
they are returned.
Output type conveniences
The contract definitions above are the source of truth. A few notes about working with them in practice:
- TimeSeriesDataFrame.make_empty() returns an empty frame with the canonical schema; .name and .chart_meta survive slicing and pandas operations.
- FinancialStatementFrame.make_empty(statement_type, period, ticker) creates an empty statement frame with the right metadata for a missing source.
- EventList.make_empty() and FilingDocument.make_empty(ticker, filing_type) exist for the same reason: empty results stay typed.
Provider map
| Domain | Backing source | Typical access path | Notes |
|---|---|---|---|
| Market prices | yfinance | get("AAPL"), get("S&P 500"), get("Shanghai Composite") | Handles tickers and index aliases |
| Market indicators | Registry-backed market series | get("VIX"), get("MOVE"), get("Net Breadth") | Mix of yfinance-backed and private-series-backed names resolved before raw tickers |
| Economic series | FRED | get_fred_data("UNRATE"), get("Unemployment Rate") | Human-readable names map to FRED codes |
| Computed macro indicators | FRED-derived logic | get("Buffett Indicator") | Built from public series |
| Credit and risk indicators | FRED and FRED-derived | get("High Yield Spread"), get("Net Liquidity") | HY spread, RRP, net liquidity, 18M forward rate spread, credit spread |
| Corporate fundamentals | yfinance statement adapter | get_corporate_data("AAPL") | Returns a plain pandas frame |
| SEC filings | SEC EDGAR | get_sec_data(ticker, filing_type), fetch_and_parse_filing(cik, accession, doc, form, include_images) | Parses 10-K / 10-Q / 8-K HTML into markdown + TOC. For 8-K (and 8-K/A), EX-99.x exhibits (earnings press release, CFO commentary) are fetched alongside the cover doc and appended as "## Exhibit 99.x" sections carrying the exhibit label, so the substantive content is reachable. Cached 30 days under the sec.* namespaces. |
| Guru portfolios | SEC EDGAR 13F | get_portfolio_data("Warren Buffett") | Returns PortfolioOutput |
| Private dashboard data | Private endpoint with fallbacks | dashboard and market-insights APIs | Watchlist, breadth, trailing-forward P/E spread, CAPE, calendar, fear/greed, top companies |
| Macro events | FRED plus yfinance | calendar API | Fetched locally, but managed through the private-data cache lifecycle |
Registry locations:
- Market indicators: src/TerraFin/data/providers/market/market_indicator.py
- Economic indicators: src/TerraFin/data/providers/economic/registry.py
- Guru portfolio registry: src/TerraFin/data/providers/corporate/filings/sec_edgar/guru_cik.json
The supported guru names for the 13F feature are maintained in the JSON registry above rather than hardcoded in Python, so additions and edits stay data-backed and are easier to review.
If TERRAFIN_SEC_USER_AGENT is missing, TerraFin still exposes the supported
guru list but treats SEC-backed holdings as disabled. The interface and agent
API return explicit configuration errors instead of silently falling back to
third-party proxies.
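The "explicit configuration error instead of a silent fallback" behavior can be sketched as a small gate. `SecConfigError` and `require_sec_user_agent` are hypothetical names; only the `TERRAFIN_SEC_USER_AGENT` variable comes from this document.

```python
import os
from typing import Mapping, Optional


class SecConfigError(RuntimeError):
    """Hypothetical error type raised when the SEC user-agent is not configured."""


def require_sec_user_agent(env: Optional[Mapping[str, str]] = None) -> str:
    """Return the configured user-agent, or raise instead of falling back."""
    env = os.environ if env is None else env
    value = (env.get("TERRAFIN_SEC_USER_AGENT") or "").strip()
    if not value:
        raise SecConfigError(
            "TERRAFIN_SEC_USER_AGENT is not set; SEC-backed holdings are disabled."
        )
    return value
```

A caller that lists supported gurus would skip this check, while any code path that actually fetches holdings would call it first.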
Private access
Private-access features are TerraFin's bridge beyond the public core. They let the same public interfaces connect to deployment-specific data and operator-side workflows without making those extensions part of the default open-source path.
These are intentional private-access extensions, not arbitrary hidden features. They provide one authenticated boundary where operator-managed deployments can attach broader workflow-specific data while public/demo deployments continue to run on public providers and safe fallbacks.
Private-access features provide proprietary or deployment-specific data behind an authenticated endpoint. They are optional: if the endpoint is unavailable, TerraFin may fall back to local file cache first and then to bundled fixtures or empty defaults, depending on the resource. This means local or private installs can continue to function without private credentials, with reduced coverage for private dashboard data. That fallback behavior should be treated as an operational convenience for controlled deployments, not as a blanket permission to serve cached restricted data publicly.
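The fallback order described above (endpoint, then local file cache, then bundled fixtures or empty defaults) is a first-success chain. The sketch below shows that idea only; `load_with_fallback`, the stage labels, and the stub loaders are hypothetical and do not reflect how PrivateDataService is written.

```python
from typing import Any, Callable, Optional

Loader = Callable[[], Optional[Any]]


def load_with_fallback(loaders: list[tuple[str, Loader]], default: Any) -> tuple[str, Any]:
    """Try each loader in order; return (origin, payload) for the first that yields data."""
    for origin, loader in loaders:
        try:
            payload = loader()
        except Exception:
            continue  # an unavailable stage hands off to the next one
        if payload is not None:
            return origin, payload
    return "default", default


def endpoint_down() -> Any:
    # Simulates the private endpoint being unreachable.
    raise ConnectionError("private endpoint unavailable")


origin, payload = load_with_fallback(
    [
        ("endpoint", endpoint_down),
        ("file_cache", lambda: None),  # nothing cached yet
        ("fixture", lambda: {"items": []}),  # bundled public-safe fixture
    ],
    default={},
)
```

Returning the origin alongside the payload lets a deployment log or block responses that came from a restricted cache, which matters for the public-serving caveat above.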
Configuration via env vars:
| Variable | Description |
|---|---|
| TERRAFIN_PRIVATE_SOURCE_ENDPOINT | Base endpoint URL for the private source |
| TERRAFIN_PRIVATE_SOURCE_ACCESS_KEY | Header name used for authentication |
| TERRAFIN_PRIVATE_SOURCE_ACCESS_VALUE | Header value used for authentication |
| TERRAFIN_PRIVATE_SOURCE_TIMEOUT_SECONDS | HTTP timeout (default: 10) |
| TERRAFIN_SEC_USER_AGENT | Required SEC EDGAR user-agent string for filings and 13F access |
| TERRAFIN_MONGODB_URI / MONGODB_URI | Optional MongoDB backend for watchlist write mode |
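A minimal sketch of reading the private-source variables into one config object. The variable names and the default timeout of 10 come from the table; `PrivateSourceConfig`, `load_private_config`, and the choice to return `None` when any of endpoint, header name, or header value is missing are assumptions.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class PrivateSourceConfig:
    endpoint: str
    header_name: str
    header_value: str
    timeout_seconds: float = 10.0


def load_private_config(
    env: Optional[Mapping[str, str]] = None,
) -> Optional[PrivateSourceConfig]:
    """Return a config when endpoint and credentials are set, else None."""
    env = os.environ if env is None else env
    endpoint = env.get("TERRAFIN_PRIVATE_SOURCE_ENDPOINT", "").strip()
    key = env.get("TERRAFIN_PRIVATE_SOURCE_ACCESS_KEY", "").strip()
    value = env.get("TERRAFIN_PRIVATE_SOURCE_ACCESS_VALUE", "").strip()
    if not (endpoint and key and value):
        return None  # caller falls back to cache, fixtures, or defaults
    timeout = float(env.get("TERRAFIN_PRIVATE_SOURCE_TIMEOUT_SECONDS", "10"))
    return PrivateSourceConfig(endpoint, key, value, timeout)
```

The access key is sent as a header name and the access value as that header's value, so a request would carry `{config.header_name: config.header_value}`.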
Implementation lives under src/TerraFin/data/providers/private_access/.
The private endpoint currently backs these dashboard and market-insight resources:
- watchlist
- market breadth
- trailing-forward P/E spread
- CAPE
- calendar data
- fear/greed
- top companies
Private series vs private widget
Private-source data in TerraFin should be classified one of two ways.
PrivateSeries
Use this when the data should behave like a real TerraFin series.
Requirements:
- normalize to TimeSeriesDataFrame
- be usable through DataFactory
- support HistoryChunk semantics if optimized chart serving is needed
- share the same cache and progressive-history contract as other chartable series
Examples:
- Fear & Greed when used as a chart/searchable series
- Net Breadth as a chart/searchable breadth history series
- future chartable private series such as CAPE or Trailing-Forward P/E Spread, if promoted into the chart/search flow
PrivateWidget
Use this when the data is only a dashboard or page payload.
Characteristics:
- arbitrary JSON/dict/list response shape
- simple cache and refresh behavior
- no DataFactory or chart progressive-history requirement
Examples:
- top-companies payloads
- dashboard-only summaries that are not intended to become chart series
The rule is:
- if a private-source feature wants TerraFin's optimized chart serving, it must enter the system as TimeSeriesDataFrame
- otherwise it remains a widget payload and should not be forced into the chart pipeline
If TerraFin is deployed publicly, keep those private-source resources behind the authenticated endpoint and treat fallback caches as an operational convenience, not as redistribution permission. Public/demo deployments should rely on public providers and bundled public-safe fixtures, not warmed private-source caches.
Watchlist write mode
The watchlist page always remains available in read-only sample mode. Writable watchlist CRUD is optional and only turns on when MongoDB is configured through:
- TERRAFIN_MONGODB_URI or MONGODB_URI
- TERRAFIN_WATCHLIST_MONGODB_DATABASE
- TERRAFIN_WATCHLIST_MONGODB_COLLECTION
- TERRAFIN_WATCHLIST_DOCUMENT_ID
Without those settings, TerraFin keeps the page visible and serves bundled sample data instead of failing startup.
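The read-only versus writable decision can be sketched as a single check on the URI variables. `watchlist_mongo_uri` and `watchlist_mode` are hypothetical helpers; letting `TERRAFIN_MONGODB_URI` take precedence over `MONGODB_URI`, and keying the mode on the URI alone, are assumptions rather than documented behavior.

```python
import os
from typing import Mapping, Optional


def watchlist_mongo_uri(env: Optional[Mapping[str, str]] = None) -> Optional[str]:
    """Return the configured MongoDB URI, or None when neither variable is set."""
    env = os.environ if env is None else env
    # Assumption: the TerraFin-specific variable wins when both are set.
    return env.get("TERRAFIN_MONGODB_URI") or env.get("MONGODB_URI") or None


def watchlist_mode(env: Optional[Mapping[str, str]] = None) -> str:
    """Report which mode the watchlist page would run in under this sketch."""
    return "read-write" if watchlist_mongo_uri(env) else "read-only-sample"
```

Resolving the mode from configuration at request time, instead of failing at startup, matches the stated goal of keeping the page visible without MongoDB.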
Macro events and the private-data lifecycle
Macro calendar events are fetched by TerraFin itself from public sources, not
from the private endpoint. They still participate in the same
PrivateDataService refresh and fallback flow as private data so the interface
has one consistent cache lifecycle.
| Module | Path | Responsibility |
|---|---|---|
| Macro calendar | src/TerraFin/data/providers/economic/macro_calendar.py | Fetches release dates from FRED API |
| Macro values | src/TerraFin/data/providers/economic/macro_values.py | Enriches events with Latest/Previous from FRED series observations |
| Cache source | private.macro in CacheManager | Daily refresh via PrivateDataService.refresh_macro() |
Current limitation: macro events do not yet carry a reliable consensus
expected value. The current enrichment step only guarantees actual and prior
values.
Caching
Provider caches are described in caching.md. The short version:
- public providers such as yfinance and FRED use in-memory plus file cache
- yfinance also exposes progressive-history helpers backed by yfinance_v2 columnar artifacts for 3Y seed + full-history backfill flows
- guru portfolios use file cache and now participate in manager-driven invalidation
- private-access resources are also registered with the background cache manager
- file cache sits under ~/.terrafin/cache/
When deploying TerraFin publicly, review private-access cache usage carefully. Local cache can preserve previously fetched restricted data; that does not make the data public-safe to serve. If a deployment mixes public traffic with private-source access, treat cache contents as potentially restricted unless the upstream terms clearly allow that storage and display pattern.
Adding a provider
Use this checklist when extending the data layer:
- Add a provider function under the correct domain package.
- Return TimeSeriesDataFrame for chartable time series, or a clearly different type when the data is not time-series shaped.
- Register the name in the market or economic registry if it should be discoverable through DataFactory.get(...).
- Add cache behavior only if the source benefits from reuse or background refresh.
- For private-source features, decide explicitly whether the feature is a PrivateSeries or a PrivateWidget before wiring UI, chart, or agent surfaces.
See also
- feature-integration.md for the cross-layer checklist when a new data capability becomes public
- interface.md for the API layer built on top of these outputs
- chart-architecture.md for the shared chart session and progressive-history contract
- analytics.md for modules that consume TimeSeriesDataFrame
- caching.md for refresh policies and file-cache behavior