---
title: Data Acquisition Layer
summary: How TerraFin resolves names, normalizes outputs, and fetches public and private data.
read_when:
- Adding a new data provider or indicator
- Debugging data fetch failures or cache issues
- Understanding how DataFactory resolves a ticker or indicator name
- Working with TimeSeriesDataFrame or PortfolioDataFrame
---
# Data Acquisition Layer
The data layer lives under `src/TerraFin/data/`. Its job is to hide provider
differences behind a small set of entry points and return predictable output
shapes.
For deployment and upstream data-usage responsibilities, see
[License & Data Rights](./legal.md).
For most callers, the important pieces are:
- `DataFactory` for fetching data
- `TimeSeriesDataFrame` for chart-ready time series
- `HistoryChunk` for progressive chart-history loading
- `PortfolioOutput` for 13F holdings data
## Architecture
```
Routes / Agent / Frontend
        │
   DataFactory ──── CacheManager (uniform cache for all sources)
        │
   Providers (all return predefined contracts only)
   ├── yfinance         (market data, fundamentals)
   ├── FRED             (economic indicators)
   ├── market indicator (VIX, MOVE, ...)
   ├── economic         (UNRATE, M2, derived indicators)
   ├── private_access   (CAPE, fear/greed, breadth, P/E spread; HTTP to sibling DataFactory)
   ├── SEC EDGAR        (filings, 13F)
   └── corporate        (yfinance fundamentals)
```
## Rules of the road
These are hard rules. They apply to every provider, every route, every agent
capability.
- **Contracts are the only return type.** All providers return ONLY a
predefined contract from `src/TerraFin/data/contracts/`. No ad-hoc dicts
(`{date, cape}` etc.), ever.
- **`DataFactory` is the single facade.** Routes, the agent, and the frontend
never call providers directly. They go through `DataFactory`.
- **Caching is unified.** All caching goes through `CacheManager`. Providers
themselves are pure fetchers.
- **`private_access` must shape to TerraFin contracts.** The HTTP server in
the sibling `~/Downloads/work/DataFactory` repo MUST shape responses to
match the contracts in this repo. Contract definitions in
`src/TerraFin/data/contracts/` are the source of truth.
- **Adding a new data type is a three-step pattern, no exceptions:**
  1. Define a contract in `data/contracts/` (or reuse an existing one).
  2. Write a provider that returns that contract.
  3. Register the provider with `DataFactory`.
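
The three-step pattern can be sketched as follows. This is a minimal, self-contained illustration, not TerraFin code: `YieldCurvePoint`, `fetch_yield_curve`, and `ToyFactory.register` are hypothetical names standing in for a real contract, a real provider, and whatever registration mechanism `DataFactory` actually uses, which this page does not spell out.

```python
from dataclasses import dataclass
from datetime import date
from typing import Callable


# Step 1: define a contract (hypothetical; real contracts live in data/contracts/).
@dataclass(frozen=True)
class YieldCurvePoint:
    as_of: date
    tenor: str
    rate_pct: float


# Step 2: write a provider that returns only that contract, never an ad-hoc dict.
def fetch_yield_curve() -> list[YieldCurvePoint]:
    # A real provider would call an upstream source; these rows are made-up values.
    return [
        YieldCurvePoint(date(2024, 1, 2), "2Y", 4.0),
        YieldCurvePoint(date(2024, 1, 2), "10Y", 3.5),
    ]


# Step 3: register the provider with the facade (a toy stand-in for DataFactory).
class ToyFactory:
    def __init__(self) -> None:
        self._providers: dict[str, Callable[[], object]] = {}

    def register(self, name: str, provider: Callable[[], object]) -> None:
        self._providers[name] = provider

    def fetch(self, name: str) -> object:
        # Callers go through the facade and never touch the provider directly.
        return self._providers[name]()


factory = ToyFactory()
factory.register("yield_curve", fetch_yield_curve)
points = factory.fetch("yield_curve")
```

Because the provider's return annotation names the contract, a reviewer can check the "contracts only" rule by reading the signature alone.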
## Contracts
Source: `src/TerraFin/data/contracts/` (canonical list in
[`__init__.py`](https://github.com/KiUngSong/TerraFin/blob/main/src/TerraFin/data/contracts/__init__.py))
Every provider's public return type is one of the contracts below. Each entry
gives the file location, the key fields, the validation the contract enforces
on construction, and one short example.
### `TimeSeriesDataFrame`
- Location: `src/TerraFin/data/contracts/dataframes.py`
- Subclass of `pd.DataFrame`
- Columns kept (in order): `time`, `open`, `high`, `low`, `close`, `volume`
- Validation: `close` is required after normalization; `time` must parse as
datetime; rows are sorted by `time` and de-duplicated; non-positive prices
are dropped; column aliases (`Date`, `datetime`, `Close`, ...) are
normalized; on failure the constructor returns an empty frame with the
canonical schema.
- Carries `.name` (series label) and `.chart_meta` (chart-side metadata).
- Example: `df = factory.get("AAPL")` → a chart-ready `TimeSeriesDataFrame`.
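
To make the validation rules concrete, here is a rough sketch that applies them to plain dict rows instead of a pandas frame. It is not the actual constructor: `normalize_rows` and the short `ALIASES` table are hypothetical, and two behaviors are assumptions (a later duplicate timestamp wins, and a row without `close` counts as a failure that empties the result).

```python
from datetime import datetime

# Hypothetical alias table; the real contract normalizes more aliases than these.
ALIASES = {"Date": "time", "datetime": "time", "Close": "close"}
COLUMNS = ("time", "open", "high", "low", "close", "volume")
PRICE_COLUMNS = ("open", "high", "low", "close")


def normalize_rows(rows: list[dict]) -> list[dict]:
    """Apply the documented rules to plain dict rows; return [] on failure."""
    cleaned: dict[datetime, dict] = {}
    try:
        for raw in rows:
            # Rename aliased keys to the canonical column names.
            row = {ALIASES.get(key, key): value for key, value in raw.items()}
            if row.get("close") is None:
                raise ValueError("close is required after normalization")
            row["time"] = datetime.fromisoformat(str(row["time"]))
            # Drop rows that carry a non-positive price.
            if any(row.get(c) is not None and row[c] <= 0 for c in PRICE_COLUMNS):
                continue
            # Keep canonical columns only; a later duplicate timestamp wins (assumption).
            cleaned[row["time"]] = {c: row.get(c) for c in COLUMNS}
    except (KeyError, TypeError, ValueError):
        return []
    # Sort by time.
    return [cleaned[t] for t in sorted(cleaned)]


rows = normalize_rows([
    {"Date": "2024-01-03", "Close": 101.0},
    {"Date": "2024-01-02", "Close": 100.0},
    {"Date": "2024-01-02", "Close": 100.5},  # duplicate timestamp
    {"Date": "2024-01-04", "Close": -1.0},   # non-positive price, dropped
])
```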
### `HistoryChunk`
- Location: `src/TerraFin/data/contracts/history.py`
- Dataclass with `frame: TimeSeriesDataFrame`, `loaded_start`, `loaded_end`,
`requested_period`, `is_complete`, `has_older`, `source_version`.
- Used for progressive chart loading. Bounds and flags let the frontend
decide whether to backfill older history.
- Example: `chunk = factory.get_recent_history("S&P 500", period="3y")` →
seed window plus `has_older=True` to drive the backfill request.
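
The seed-then-backfill flow can be pictured with a small stand-in. `ChunkSketch` mirrors the documented field names, but its `frame` is a plain list of `(date, close)` pairs, and `prepend_backfill` is a hypothetical helper; the field types and the `source_version` check are assumptions made for illustration, and the prices are made up.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class ChunkSketch:
    """Stand-in for HistoryChunk; `frame` is a list of (date, close) pairs here."""
    frame: list[tuple[date, float]]
    loaded_start: date
    loaded_end: date
    requested_period: str
    is_complete: bool
    has_older: bool
    source_version: str


def prepend_backfill(seed: ChunkSketch, older: ChunkSketch) -> ChunkSketch:
    """Merge an older chunk in front of the seed window, skipping overlapping dates."""
    if older.source_version != seed.source_version:
        raise ValueError("chunks come from different source versions")
    extra = [row for row in older.frame if row[0] < seed.loaded_start]
    return ChunkSketch(
        frame=extra + seed.frame,
        loaded_start=min(older.loaded_start, seed.loaded_start),
        loaded_end=seed.loaded_end,
        requested_period=seed.requested_period,
        is_complete=older.is_complete,
        has_older=older.has_older,
        source_version=seed.source_version,
    )


seed = ChunkSketch(
    frame=[(date(2024, 1, 2), 100.0), (date(2024, 1, 3), 101.0)],
    loaded_start=date(2024, 1, 2), loaded_end=date(2024, 1, 3),
    requested_period="3y", is_complete=False, has_older=True, source_version="v1",
)
older = ChunkSketch(
    frame=[(date(2023, 12, 29), 99.0), (date(2024, 1, 2), 100.0)],
    loaded_start=date(2023, 12, 29), loaded_end=date(2024, 1, 2),
    requested_period="max", is_complete=True, has_older=False, source_version="v1",
)
# The caller only asks for older history when the seed chunk says there is some.
merged = prepend_backfill(seed, older) if seed.has_older else seed
```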
### `PortfolioDataFrame` and `PortfolioOutput`
- Location: `src/TerraFin/data/contracts/dataframes.py`
- `PortfolioDataFrame` is a `pd.DataFrame` subclass with a `make_figure()`
method that renders a Plotly treemap of 13F holdings.
- `PortfolioOutput` (defined alongside `DataFactory`) bundles `info: dict`
metadata and `df: PortfolioDataFrame`.
- Validation: `Stock` / `Ticker` / `% of Portfolio` / `Updated` /
`Recent Activity` columns are expected by `make_figure()`.
- Example: `out = factory.get_portfolio_data("Warren Buffett")` →
`out.df.make_figure()` for the treemap.
### `FinancialStatementFrame`
- Location: `src/TerraFin/data/contracts/statements.py`
- `pd.DataFrame` subclass. Columns are reporting-period dates (ISO strings or
`pd.Timestamp`); rows are line items.
- Required at construction: `statement_type` ∈ {`income`, `balance`,
`cashflow`}, `period` ∈ {`annual`, `quarterly`}, `ticker`. Column-shape
validation rejects non-date columns.
- Example: `frame = factory.get_corporate_data("AAPL",
statement_type="income", period="annual")` → income statement keyed by
fiscal year.
### `CalendarEvent` and `EventList`
- Location: `src/TerraFin/data/contracts/events.py`
- `CalendarEvent` is a frozen dataclass with `id`, `title`, `start`
(timezone-aware datetime, enforced), `category` ∈ {`macro`, `earning`,
`fed`, `dividend`, `ipo`}, `importance` ∈ {`low`, `medium`, `high`},
`display_time`, plus optional `description`, `source`, `metadata`.
- `EventList` wraps `list[CalendarEvent]` and supports iteration and
indexing.
- Example: macro calendar provider returns an `EventList` of FOMC and
release-date events; the calendar route serializes them directly.
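
One way a frozen dataclass can enforce these constraints at construction is a `__post_init__` check. The sketch below is illustrative only: `EventSketch` is a hypothetical stand-in, the field types (in particular `display_time` as a string) are assumptions, and the event shown uses a made-up date.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional

CATEGORIES = {"macro", "earning", "fed", "dividend", "ipo"}
IMPORTANCE = {"low", "medium", "high"}


@dataclass(frozen=True)
class EventSketch:
    id: str
    title: str
    start: datetime
    category: str
    importance: str
    display_time: str  # assumed to be a display string; the real type may differ
    description: Optional[str] = None
    source: Optional[str] = None
    metadata: dict = field(default_factory=dict)

    def __post_init__(self) -> None:
        # Reject naive datetimes so every event carries an explicit UTC offset.
        if self.start.tzinfo is None or self.start.utcoffset() is None:
            raise ValueError("start must be timezone-aware")
        if self.category not in CATEGORIES:
            raise ValueError(f"unknown category: {self.category!r}")
        if self.importance not in IMPORTANCE:
            raise ValueError(f"unknown importance: {self.importance!r}")


event = EventSketch(
    id="demo-1",
    title="Example release",
    start=datetime(2024, 1, 10, 13, 30, tzinfo=timezone.utc),
    category="macro",
    importance="high",
    display_time="13:30 UTC",
)
```

Validating in `__post_init__` means an invalid event can never exist as an object, so routes can serialize events without re-checking them.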
### `TOCEntry` and `FilingDocument`
- Location: `src/TerraFin/data/contracts/filings.py`
- `TOCEntry`: frozen dataclass with `id`, `title`, `level`, `anchor`.
- `FilingDocument`: dataclass with `ticker`, `filing_type` ∈ {`10-K`,
`10-Q`, `8-K`, `13F`, `S-1`, `DEF 14A`}, `accession`, `filing_date`,
`markdown`, `toc: list[TOCEntry]`, optional `metadata`.
- Example: SEC EDGAR provider returns a `FilingDocument` whose `markdown`
body and `toc` drive the Stock Analysis filings panel.
### `IndicatorSnapshot`
- Location: `src/TerraFin/data/contracts/indicators.py`
- Frozen dataclass: `name`, `value` (number or string), `as_of`,
optional `unit`, `change`, `change_pct`, `rating`, `metadata`.
- Use for single-value scalar indicators (current Fear & Greed score, latest
CAPE, breadth-of-the-day) where a full time series isn't needed.
- Example: dashboard fear/greed widget reads
`snapshot = factory.get_indicator("fear_greed")` and renders
`snapshot.value` and `snapshot.rating`.
### `chart_output`
- Location: `src/TerraFin/data/contracts/markers.py`
- Decorator that normalizes the return of any time-series-shaped
`DataFactory` method into a `TimeSeriesDataFrame` via `_to_timeseries`,
tagging the source for cache and debug visibility.
- Not a return type itself; it is a marker applied to factory methods that
promise time-series output.
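
The general shape of such a normalizing decorator might look like the sketch below. It is not the real `chart_output`: `chart_output_sketch`, `to_timeseries_sketch`, and `SeriesSketch` are hypothetical names, the `source` argument is an assumption about how tagging could work, and the toy normalizer only sorts rows by time.

```python
import functools
from dataclasses import dataclass
from typing import Callable


@dataclass
class SeriesSketch:
    """Toy stand-in for TimeSeriesDataFrame: normalized rows plus a source tag."""
    rows: list[dict]
    source: str


def to_timeseries_sketch(raw: object) -> list[dict]:
    """Toy normalizer standing in for `_to_timeseries`: sort rows by time."""
    rows = list(raw) if raw is not None else []
    return sorted(rows, key=lambda row: row["time"])


def chart_output_sketch(source: str) -> Callable:
    """Decorator factory: normalize the wrapped function's return and tag its source."""
    def decorate(func: Callable) -> Callable:
        @functools.wraps(func)  # keep the wrapped function's name and docstring
        def wrapper(*args, **kwargs) -> SeriesSketch:
            rows = to_timeseries_sketch(func(*args, **kwargs))
            return SeriesSketch(rows=rows, source=source)
        return wrapper
    return decorate


@chart_output_sketch(source="demo")
def get_demo_series() -> list[dict]:
    # Deliberately unsorted so the normalization step is visible.
    return [
        {"time": "2024-01-03", "close": 101.0},
        {"time": "2024-01-02", "close": 100.0},
    ]


result = get_demo_series()
```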
## DataFactory
Source: `src/TerraFin/data/factory.py`
```python
DataFactory(api_keys: dict[str, str] | None = None)
```
`DataFactory` is the main entry point. Use `get(name)` when you want TerraFin
to decide where a name belongs, or call a domain-specific method when you
already know the source.
### Resolution order for `get(name)`
1. Market indicator registry (VIX, treasuries, ...)
2. Economic indicator registry (FRED series, macro, ...)
3. Index map + yfinance (tickers, index names)
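
The resolution order amounts to a first-match lookup chain. The sketch below uses hypothetical in-memory registries and returns a string describing the chosen route rather than fetching anything; the registry contents, the `resolve` name, and the `^GSPC` alias mapping are assumptions for illustration.

```python
# Hypothetical registries; the real ones live in the provider packages.
MARKET_INDICATORS = {"VIX": "market:VIX"}
ECONOMIC_INDICATORS = {"Unemployment Rate": "fred:UNRATE"}
INDEX_MAP = {"S&P 500": "^GSPC"}


def resolve(name: str) -> str:
    """Return the route a name would take, checking registries in priority order."""
    # 1. Market indicator registry wins first.
    if name in MARKET_INDICATORS:
        return MARKET_INDICATORS[name]
    # 2. Economic indicator registry comes second.
    if name in ECONOMIC_INDICATORS:
        return ECONOMIC_INDICATORS[name]
    # 3. Otherwise map index aliases to tickers and hand the result to yfinance.
    ticker = INDEX_MAP.get(name, name)
    return f"yfinance:{ticker}"


routes = {name: resolve(name) for name in ("VIX", "Unemployment Rate", "S&P 500", "AAPL")}
```

A consequence of this ordering is that a registry name shadows a raw ticker with the same spelling, which is worth checking when adding new indicator names.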
### Which method to call
| Method | Return type | Description |
|--------|-------------|-------------|
| `get(name)` | `TimeSeriesDataFrame` | Universal lookup across market indicators, economic indicators, index aliases, and raw tickers |
| `get_recent_history(name, period="3y")` | `HistoryChunk` | Recent seed window used by progressive chart loading |
| `get_full_history_backfill(name, loaded_start=None)` | `HistoryChunk` | Older history to prepend onto an already-seeded chart |
| `get_fred_data(indicator_name)` | `TimeSeriesDataFrame` | Direct FRED lookup by FRED code such as `"UNRATE"` |
| `get_economic_data(indicator_name)` | `TimeSeriesDataFrame` | Human-readable economic lookup such as `"Unemployment Rate"` |
| `get_market_data(ticker_or_name)` | `TimeSeriesDataFrame` | Market lookup through the market provider layer |
| `get_corporate_data(ticker, statement_type="income", period="annual")` | `pd.DataFrame \| None` | Company financials via TerraFin's yfinance-backed statement adapter. |
| `get_portfolio_data(guru_name)` | `PortfolioOutput` | Guru portfolio holdings via SEC EDGAR 13F filings |
The time-series methods are normalized by the `@chart_output` decorator before
they are returned.
## Output type conveniences
The contract definitions above are the source of truth. A few notes about
working with them in practice:
- `TimeSeriesDataFrame.make_empty()` returns an empty frame with the canonical
schema; `.name` and `.chart_meta` survive slicing and pandas operations.
- `FinancialStatementFrame.make_empty(statement_type, period, ticker)`
creates an empty statement frame with the right metadata for a missing
source.
- `EventList.make_empty()` and `FilingDocument.make_empty(ticker, filing_type)`
exist for the same reason: empty results stay typed.
## Provider map
| Domain | Backing source | Typical access path | Notes |
|--------|----------------|---------------------|-------|
| Market prices | yfinance | `get("AAPL")`, `get("S&P 500")`, `get("Shanghai Composite")` | Handles tickers and index aliases |
| Market indicators | Registry-backed market series | `get("VIX")`, `get("MOVE")`, `get("Net Breadth")` | Mix of yfinance-backed and private-series-backed names resolved before raw tickers |
| Economic series | FRED | `get_fred_data("UNRATE")`, `get("Unemployment Rate")` | Human-readable names map to FRED codes |
| Computed macro indicators | FRED-derived logic | `get("Buffett Indicator")` | Built from public series |
| Credit and risk indicators | FRED and FRED-derived | `get("High Yield Spread")`, `get("Net Liquidity")` | HY spread, RRP, net liquidity, 18M forward rate spread, credit spread |
| Corporate fundamentals | yfinance statement adapter | `get_corporate_data("AAPL")` | Returns a plain pandas frame |
| SEC filings | SEC EDGAR | `get_sec_data(ticker, filing_type)`, `fetch_and_parse_filing(cik, accession, doc, form, include_images)` | Parses 10-K / 10-Q / 8-K HTML into markdown + TOC. For 8-K (and 8-K/A), EX-99.x exhibits (earnings press release, CFO commentary) are fetched alongside the cover doc and appended as `## Exhibit 99.x` sections, each titled with the exhibit's label, so the substantive content is reachable. Cached 30 days under the `sec.*` namespaces. |
| Guru portfolios | SEC EDGAR 13F | `get_portfolio_data("Warren Buffett")` | Returns `PortfolioOutput` |
| Private dashboard data | Private endpoint with fallbacks | dashboard and market-insights APIs | Watchlist, breadth, trailing-forward P/E spread, CAPE, calendar, fear/greed, top companies |
| Macro events | FRED plus yfinance | calendar API | Fetched locally, but managed through the private-data cache lifecycle |
Registry locations:
- Market indicators: `src/TerraFin/data/providers/market/market_indicator.py`
- Economic indicators: `src/TerraFin/data/providers/economic/registry.py`
- Guru portfolio registry: `src/TerraFin/data/providers/corporate/filings/sec_edgar/guru_cik.json`
The supported guru names for the 13F feature are maintained in the JSON
registry above rather than hardcoded inline in Python, so additions and edits
can stay data-backed and easier to review.
If `TERRAFIN_SEC_USER_AGENT` is missing, TerraFin still exposes the supported
guru list but treats SEC-backed holdings as disabled. The interface and agent
API return explicit configuration errors instead of silently falling back to
third-party proxies.
### Private access
Private-access features are TerraFin's bridge beyond the public core. They let
the same public interfaces connect to deployment-specific data and
operator-side workflows without making those extensions part of the default
open-source path.
These are intentional private-access extensions, not arbitrary hidden features.
They provide one authenticated boundary where operator-managed deployments can
attach broader workflow-specific data while public/demo deployments continue to
run on public providers and safe fallbacks.
Private-access features provide proprietary or deployment-specific data behind
an authenticated endpoint. They are optional: if the endpoint is unavailable,
TerraFin may fall back to local file cache first and then to bundled fixtures
or empty defaults, depending on the resource. This means local or private
installs can continue to function without private credentials, with reduced
coverage for private dashboard data. That fallback behavior should be treated
as an operational convenience for controlled deployments, not as a blanket
permission to serve cached restricted data publicly.
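
The fallback order described above (endpoint, then local file cache, then bundled fixture or empty default) could be sketched like this. `load_with_fallback` is a hypothetical helper, not TerraFin's implementation; the JSON file cache, the choice to treat `OSError` as "endpoint unavailable", and the sample payloads are all assumptions.

```python
import json
import tempfile
from pathlib import Path
from typing import Callable, Optional


def load_with_fallback(
    fetch_remote: Callable[[], list],
    cache_path: Path,
    fixture: Optional[list],
) -> tuple[list, str]:
    """Return (payload, origin), trying endpoint, file cache, fixture, empty default."""
    try:
        payload = fetch_remote()
    except OSError:  # ConnectionError and TimeoutError are OSError subclasses
        pass
    else:
        # A successful fetch refreshes the local file cache.
        cache_path.write_text(json.dumps(payload))
        return payload, "endpoint"
    if cache_path.exists():
        return json.loads(cache_path.read_text()), "file_cache"
    if fixture is not None:
        return fixture, "fixture"
    return [], "empty_default"


def unreachable() -> list:
    raise ConnectionError("private endpoint unavailable")


with tempfile.TemporaryDirectory() as tmp:
    cache_file = Path(tmp) / "breadth.json"
    first = load_with_fallback(unreachable, cache_file, fixture=[{"label": "sample"}])
    second = load_with_fallback(lambda: [{"label": "live"}], cache_file, fixture=None)
    third = load_with_fallback(unreachable, cache_file, fixture=None)
```

Returning the origin alongside the payload lets a caller tell cached restricted data apart from a public-safe fixture before deciding whether to serve it.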
Configuration via env vars:
| Variable | Description |
|----------|-------------|
| `TERRAFIN_PRIVATE_SOURCE_ENDPOINT` | Base endpoint URL for the private source |
| `TERRAFIN_PRIVATE_SOURCE_ACCESS_KEY` | Header name used for authentication |
| `TERRAFIN_PRIVATE_SOURCE_ACCESS_VALUE` | Header value used for authentication |
| `TERRAFIN_PRIVATE_SOURCE_TIMEOUT_SECONDS` | HTTP timeout (default: 10) |
| `TERRAFIN_SEC_USER_AGENT` | Required SEC EDGAR user-agent string for filings and 13F access |
| `TERRAFIN_MONGODB_URI` / `MONGODB_URI` | Optional MongoDB backend for watchlist write mode |
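
As an illustration of how the four private-source variables fit together, the sketch below reads them into a small config object. `PrivateSourceConfig` and `load_private_source_config` are hypothetical names, and treating the source as enabled only when the endpoint, header name, and header value are all present is an assumption; only the variable names and the 10-second default come from the table above.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class PrivateSourceConfig:
    endpoint: Optional[str]
    header_name: Optional[str]
    header_value: Optional[str]
    timeout_seconds: float

    @property
    def enabled(self) -> bool:
        # Assumption: all three values are needed before the endpoint is used.
        return bool(self.endpoint and self.header_name and self.header_value)

    def auth_headers(self) -> dict:
        # The ACCESS_KEY variable holds the header name, ACCESS_VALUE its value.
        return {self.header_name: self.header_value} if self.enabled else {}


def load_private_source_config(env: Mapping[str, str] = os.environ) -> PrivateSourceConfig:
    """Build the config from environment-style variables, defaulting the timeout to 10."""
    return PrivateSourceConfig(
        endpoint=env.get("TERRAFIN_PRIVATE_SOURCE_ENDPOINT") or None,
        header_name=env.get("TERRAFIN_PRIVATE_SOURCE_ACCESS_KEY") or None,
        header_value=env.get("TERRAFIN_PRIVATE_SOURCE_ACCESS_VALUE") or None,
        timeout_seconds=float(env.get("TERRAFIN_PRIVATE_SOURCE_TIMEOUT_SECONDS") or 10),
    )


# Explicit mappings keep the example independent of the real environment.
unconfigured = load_private_source_config({})
configured = load_private_source_config({
    "TERRAFIN_PRIVATE_SOURCE_ENDPOINT": "https://private.example.invalid",
    "TERRAFIN_PRIVATE_SOURCE_ACCESS_KEY": "X-Demo-Key",
    "TERRAFIN_PRIVATE_SOURCE_ACCESS_VALUE": "secret",
    "TERRAFIN_PRIVATE_SOURCE_TIMEOUT_SECONDS": "2.5",
})
```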
Implementation lives under `src/TerraFin/data/providers/private_access/`.
The private endpoint currently backs these dashboard and market-insight
resources:
- watchlist
- market breadth
- trailing-forward P/E spread
- CAPE
- calendar data
- fear/greed
- top companies
### Private series vs private widget
Private-source data in TerraFin should be classified in one of two ways.
#### PrivateSeries
Use this when the data should behave like a real TerraFin series.
Requirements:
- normalize to `TimeSeriesDataFrame`
- be usable through `DataFactory`
- support `HistoryChunk` semantics if optimized chart serving is needed
- share the same cache and progressive-history contract as other chartable series
Examples:
- `Fear & Greed` when used as a chart/searchable series
- `Net Breadth` as a chart/searchable breadth history series
- future chartable private series such as `CAPE` or
`Trailing-Forward P/E Spread`, if promoted into the chart/search flow
#### PrivateWidget
Use this when the data is only a dashboard or page payload.
Characteristics:
- arbitrary JSON/dict/list response shape
- simple cache and refresh behavior
- no `DataFactory` or chart progressive-history requirement
Examples:
- top-companies payloads
- dashboard-only summaries that are not intended to become chart series
The rule is:
- if a private-source feature wants TerraFin's optimized chart serving, it must
enter the system as `TimeSeriesDataFrame`
- otherwise it remains a widget payload and should not be forced into the chart
pipeline
If TerraFin is deployed publicly, keep those private-source resources behind the
authenticated endpoint and treat fallback caches as an operational convenience,
not as redistribution permission. Public/demo deployments should rely on public
providers and bundled public-safe fixtures, not warmed private-source caches.
### Watchlist write mode
The watchlist page always remains available in read-only sample mode. Writable
watchlist CRUD is optional and only turns on when MongoDB is configured
through:
- `TERRAFIN_MONGODB_URI` or `MONGODB_URI`
- `TERRAFIN_WATCHLIST_MONGODB_DATABASE`
- `TERRAFIN_WATCHLIST_MONGODB_COLLECTION`
- `TERRAFIN_WATCHLIST_DOCUMENT_ID`
Without those settings, TerraFin keeps the page visible and serves bundled
sample data instead of failing startup.
### Macro events and the private-data lifecycle
Macro calendar events are fetched by TerraFin itself from public sources, not
from the private endpoint. They still participate in the same
`PrivateDataService` refresh and fallback flow as private data so the interface
has one consistent cache lifecycle.
| Module | Path | Responsibility |
|--------|------|----------------|
| Macro calendar | `src/TerraFin/data/providers/economic/macro_calendar.py` | Fetches release dates from FRED API |
| Macro values | `src/TerraFin/data/providers/economic/macro_values.py` | Enriches events with Latest/Previous from FRED series observations |
| Cache source | `private.macro` in CacheManager | Daily refresh via `PrivateDataService.refresh_macro()` |
Current limitation: macro events do not yet carry a reliable consensus
`expected` value. The current enrichment step only guarantees actual and prior
values.
## Caching
Provider caches are described in [caching.md](./caching.md). The short version:
- public providers such as yfinance and FRED use in-memory plus file cache
- yfinance also exposes progressive-history helpers backed by `yfinance_v2`
columnar artifacts for `3Y` seed + full-history backfill flows
- guru portfolios use file cache and now participate in manager-driven invalidation
- private-access resources are also registered with the background cache manager
- file cache sits under `~/.terrafin/cache/`
When deploying TerraFin publicly, review private-access cache usage carefully.
Local cache can preserve previously fetched restricted data; that does not make
the data public-safe to serve. If a deployment mixes public traffic with
private-source access, treat cache contents as potentially restricted unless the
upstream terms clearly allow that storage and display pattern.
## Adding a provider
Use this checklist when extending the data layer:
1. Add a provider function under the correct domain package.
2. Return `TimeSeriesDataFrame` for chartable time series, or a clearly
different type when the data is not time-series shaped.
3. Register the name in the market or economic registry if it should be
discoverable through `DataFactory.get(...)`.
4. Add cache behavior only if the source benefits from reuse or background
refresh.
5. For private-source features, decide explicitly whether the feature is a
`PrivateSeries` or a `PrivateWidget` before wiring UI, chart, or agent
surfaces.
## See also
- [feature-integration.md](./feature-integration.md) for the cross-layer checklist when a new data capability becomes public
- [interface.md](./interface.md) for the API layer built on top of these outputs
- [chart-architecture.md](./chart-architecture.md) for the shared chart session and progressive-history contract
- [analytics.md](./analytics.md) for modules that consume `TimeSeriesDataFrame`
- [caching.md](./caching.md) for refresh policies and file-cache behavior