---
title: Data Acquisition Layer
summary: How TerraFin resolves names, normalizes outputs, and fetches public and private data.
read_when:
  - Adding a new data provider or indicator
  - Debugging data fetch failures or cache issues
  - Understanding how DataFactory resolves a ticker or indicator name
  - Working with TimeSeriesDataFrame or PortfolioDataFrame
---

# Data Acquisition Layer

The data layer lives under `src/TerraFin/data/`. Its job is to hide provider
differences behind a small set of entry points and return predictable output
shapes.

For deployment and upstream data-usage responsibilities, see
[License & Data Rights](./legal.md).

For most callers, the important pieces are:

- `DataFactory` for fetching data
- `TimeSeriesDataFrame` for chart-ready time series
- `HistoryChunk` for progressive chart-history loading
- `PortfolioOutput` for 13F holdings data

## Architecture

```
Routes / Agent / Frontend
        ↓
   DataFactory ──── CacheManager (uniform cache for all sources)
        ↓
   Providers (all return predefined contracts only)
   ├── yfinance        (market data, fundamentals)
   ├── FRED            (economic indicators)
   ├── market indicator (VIX, MOVE, ...)
   ├── economic        (UNRATE, M2, derived indicators)
   ├── private_access  (CAPE, fear/greed, breadth, P/E spread; HTTP to sibling DataFactory)
   ├── SEC EDGAR       (filings, 13F)
   └── corporate       (yfinance fundamentals)
```

## Rules of the road

These are hard rules. They apply to every provider, every route, every agent
capability.

- **Contracts are the only return type.** Every provider returns only a
  predefined contract from `src/TerraFin/data/contracts/`, never an ad-hoc
  dict such as `{date, cape}`.
- **`DataFactory` is the single facade.** Routes, the agent, and the frontend
  never call providers directly. They go through `DataFactory`.
- **Caching is unified.** All caching goes through `CacheManager`. Providers
  themselves are pure fetchers.
- **`private_access` must shape to TerraFin contracts.** The HTTP server in
  the sibling `~/Downloads/work/DataFactory` repo MUST shape responses to
  match the contracts in this repo. Contract definitions in
  `src/TerraFin/data/contracts/` are the source of truth.
- **Adding a new data type is a three-step pattern, no exceptions:**
  1. Define a contract in `data/contracts/` (or reuse an existing one).
  2. Write a provider that returns that contract.
  3. Register the provider with `DataFactory`.
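
The three steps can be sketched in isolation. This is an illustration only, not TerraFin code: `DividendRecord`, `fetch_dividends`, and `MiniFactory` are hypothetical names, and the real `DataFactory` registration mechanism may look different.

```python
from dataclasses import dataclass
from datetime import date
from typing import Callable


# Step 1: define a contract (hypothetical; real contracts live in data/contracts/).
@dataclass(frozen=True)
class DividendRecord:
    ticker: str
    ex_date: date
    amount: float

    def __post_init__(self) -> None:
        # Validate on construction so malformed data never leaves the provider.
        if self.amount < 0:
            raise ValueError("amount must be non-negative")


# Step 2: write a provider that returns only that contract.
def fetch_dividends(ticker: str) -> list[DividendRecord]:
    # A real provider would call an upstream source; this one returns fixed data.
    return [DividendRecord(ticker=ticker, ex_date=date(2024, 2, 9), amount=0.24)]


# Step 3: register the provider with the facade (a stand-in for DataFactory).
class MiniFactory:
    def __init__(self) -> None:
        self._providers: dict[str, Callable[[str], list[DividendRecord]]] = {}

    def register(self, kind: str, provider: Callable[[str], list[DividendRecord]]) -> None:
        self._providers[kind] = provider

    def fetch(self, kind: str, ticker: str) -> list[DividendRecord]:
        return self._providers[kind](ticker)


factory = MiniFactory()
factory.register("dividends", fetch_dividends)
records = factory.fetch("dividends", "AAPL")
```

Callers only ever see `DividendRecord` instances, which is the point of the rule: the facade can swap the provider without changing the shape of the data it hands out.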

## Contracts

Source: `src/TerraFin/data/contracts/` (canonical list in
[`__init__.py`](https://github.com/KiUngSong/TerraFin/blob/main/src/TerraFin/data/contracts/__init__.py))

Every provider's public return type is one of the contracts below. Each entry
gives the file location, the key fields, the validation the contract enforces
on construction, and one short example.

### `TimeSeriesDataFrame`

- Location: `src/TerraFin/data/contracts/dataframes.py`
- Subclass of `pd.DataFrame`
- Columns kept (in order): `time`, `open`, `high`, `low`, `close`, `volume`
- Validation: `close` is required after normalization; `time` must parse as
  datetime; rows are sorted by `time` and de-duplicated; non-positive prices
  are dropped; column aliases (`Date`, `datetime`, `Close`, ...) are
  normalized; on failure the constructor returns an empty frame with the
  canonical schema.
- Carries `.name` (series label) and `.chart_meta` (chart-side metadata).
- Example: `df = factory.get("AAPL")` returns a chart-ready `TimeSeriesDataFrame`.
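
The validation steps listed above can be sketched without pandas. The snippet below works on plain dict rows and is an approximation only: `ALIASES` and `normalize_rows` are hypothetical, the alias table is deliberately small, and which duplicate row wins is an assumption. The real logic lives in the `TimeSeriesDataFrame` constructor.

```python
from datetime import datetime

CANONICAL = ["time", "open", "high", "low", "close", "volume"]
# Hypothetical alias table; the real contract may accept more spellings.
ALIASES = {"date": "time", "datetime": "time"}


def normalize_rows(rows: list[dict]) -> list[dict]:
    """Return rows in canonical shape, or [] when `time` or `close` is unusable."""
    cleaned: dict[datetime, dict] = {}
    for raw in rows:
        # Lower-case the keys and map aliases such as Date -> time, Close -> close.
        row = {ALIASES.get(k.lower(), k.lower()): v for k, v in raw.items()}
        if "time" not in row or "close" not in row:
            return []  # mirrors the "empty frame on failure" behaviour
        try:
            row["time"] = datetime.fromisoformat(str(row["time"]))
        except ValueError:
            return []
        prices = [row[c] for c in ("open", "high", "low", "close") if c in row]
        if any(p <= 0 for p in prices):
            continue  # drop rows with non-positive prices
        # Keep only canonical columns; a later duplicate timestamp overwrites an earlier one.
        cleaned[row["time"]] = {c: row[c] for c in CANONICAL if c in row}
    # Sort by time.
    return [cleaned[t] for t in sorted(cleaned)]


rows = normalize_rows([
    {"Date": "2024-01-03", "Close": 184.25},
    {"Date": "2024-01-02", "Close": 185.64},
    {"Date": "2024-01-02", "Close": 185.64},
    {"Date": "2024-01-04", "Close": -1.0},
])
```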

### `HistoryChunk`

- Location: `src/TerraFin/data/contracts/history.py`
- Dataclass with `frame: TimeSeriesDataFrame`, `loaded_start`, `loaded_end`,
  `requested_period`, `is_complete`, `has_older`, `source_version`.
- Used for progressive chart loading. Bounds and flags let the frontend
  decide whether to backfill older history.
- Example: `chunk = factory.get_recent_history("S&P 500", period="3y")`
  returns the seed window, with `has_older=True` when older history is
  available to backfill.
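
One way a client might use the bounds and flags is sketched below. `ChunkBounds` is a stand-in that mirrors the `HistoryChunk` field names but omits `frame` and `source_version`, and the `needs_backfill` policy is an assumption, not the frontend's actual rule.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class ChunkBounds:
    """Stand-in for the bounds and flags carried by HistoryChunk."""
    loaded_start: date
    loaded_end: date
    requested_period: str
    is_complete: bool
    has_older: bool


def needs_backfill(chunk: ChunkBounds, visible_start: date) -> bool:
    # Request older history only when some exists and the viewport has
    # reached (or passed) the oldest loaded bar.
    return chunk.has_older and not chunk.is_complete and visible_start <= chunk.loaded_start


seed = ChunkBounds(date(2022, 1, 3), date(2025, 1, 3), "3y", is_complete=False, has_older=True)
```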

### `PortfolioDataFrame` and `PortfolioOutput`

- Location: `src/TerraFin/data/contracts/dataframes.py`
- `PortfolioDataFrame` is a `pd.DataFrame` subclass with a `make_figure()`
  method that renders a Plotly treemap of 13F holdings.
- `PortfolioOutput` (defined alongside `DataFactory`) bundles `info: dict`
  metadata and `df: PortfolioDataFrame`.
- Validation: `Stock` / `Ticker` / `% of Portfolio` / `Updated` /
  `Recent Activity` columns are expected by `make_figure()`.
- Example: `out = factory.get_portfolio_data("Warren Buffett")`, then
  `out.df.make_figure()` renders the treemap.

### `FinancialStatementFrame`

- Location: `src/TerraFin/data/contracts/statements.py`
- `pd.DataFrame` subclass. Columns are reporting-period dates (ISO strings or
  `pd.Timestamp`); rows are line items.
- Required at construction: `statement_type` ∈ {`income`, `balance`,
  `cashflow`}, `period` ∈ {`annual`, `quarterly`}, `ticker`. Column-shape
  validation rejects non-date columns.
- Example: `frame = factory.get_corporate_data("AAPL",
  statement_type="income", period="annual")` returns an income statement
  keyed by fiscal year.

### `CalendarEvent` and `EventList`

- Location: `src/TerraFin/data/contracts/events.py`
- `CalendarEvent` is a frozen dataclass with `id`, `title`, `start`
  (timezone-aware datetime, enforced), `category` ∈ {`macro`, `earning`,
  `fed`, `dividend`, `ipo`}, `importance` ∈ {`low`, `medium`, `high`},
  `display_time`, plus optional `description`, `source`, `metadata`.
- `EventList` wraps `list[CalendarEvent]` and supports iteration and
  indexing.
- Example: macro calendar provider returns an `EventList` of FOMC and
  release-date events; the calendar route serializes them directly.
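
Enforcing a timezone-aware `start` on a frozen dataclass can be done in `__post_init__`. The sketch below is a simplified stand-in for `CalendarEvent` (optional fields and `display_time` are omitted); validating `category` and `importance` at construction is an assumption about how the contract behaves.

```python
from dataclasses import dataclass
from datetime import datetime, timezone

CATEGORIES = {"macro", "earning", "fed", "dividend", "ipo"}
IMPORTANCE = {"low", "medium", "high"}


@dataclass(frozen=True)
class Event:
    """Simplified stand-in for CalendarEvent."""
    id: str
    title: str
    start: datetime
    category: str
    importance: str

    def __post_init__(self) -> None:
        # A datetime is timezone-aware only if utcoffset() returns a value.
        if self.start.tzinfo is None or self.start.utcoffset() is None:
            raise ValueError("start must be timezone-aware")
        if self.category not in CATEGORIES:
            raise ValueError(f"unknown category: {self.category}")
        if self.importance not in IMPORTANCE:
            raise ValueError(f"unknown importance: {self.importance}")


fomc = Event("fomc-2025-01", "FOMC decision",
             datetime(2025, 1, 29, 19, 0, tzinfo=timezone.utc), "fed", "high")
```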

### `TOCEntry` and `FilingDocument`

- Location: `src/TerraFin/data/contracts/filings.py`
- `TOCEntry`: frozen dataclass with `id`, `title`, `level`, `anchor`.
- `FilingDocument`: dataclass with `ticker`, `filing_type` ∈ {`10-K`,
  `10-Q`, `8-K`, `13F`, `S-1`, `DEF 14A`}, `accession`, `filing_date`,
  `markdown`, `toc: list[TOCEntry]`, optional `metadata`.
- Example: SEC EDGAR provider returns a `FilingDocument` whose `markdown`
  body and `toc` drive the Stock Analysis filings panel.
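
To make the relationship between `markdown` and `toc` concrete, here is a rough sketch that derives entries from markdown headings. `TocItem` copies the `TOCEntry` field names, but `build_toc`, the `toc-N` id scheme, and the slug rule are all hypothetical; the SEC EDGAR parser may build its TOC differently.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class TocItem:
    """Stand-in with the same fields as TOCEntry."""
    id: str
    title: str
    level: int
    anchor: str


def build_toc(markdown: str) -> list[TocItem]:
    items: list[TocItem] = []
    for line in markdown.splitlines():
        # Match ATX headings: one to six '#' characters followed by a title.
        match = re.match(r"^(#{1,6})\s+(.+?)\s*$", line)
        if not match:
            continue
        title = match.group(2)
        # Slug: lower-case, runs of non-alphanumerics collapsed to hyphens.
        anchor = re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")
        items.append(TocItem(f"toc-{len(items)}", title, len(match.group(1)), anchor))
    return items


toc = build_toc("# Item 1. Business\n\ntext\n\n## Risk Factors\n")
```

This sketch does not skip headings inside fenced code blocks, which a production parser would need to handle.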

### `IndicatorSnapshot`

- Location: `src/TerraFin/data/contracts/indicators.py`
- Frozen dataclass: `name`, `value` (number or string), `as_of`,
  optional `unit`, `change`, `change_pct`, `rating`, `metadata`.
- Use for single-value scalar indicators (current Fear & Greed score, latest
  CAPE, breadth-of-the-day) where a full time series isn't needed.
- Example: dashboard fear/greed widget reads
  `snapshot = factory.get_indicator("fear_greed")` and renders
  `snapshot.value` and `snapshot.rating`.

### `chart_output`

- Location: `src/TerraFin/data/contracts/markers.py`
- Decorator that normalizes the return of any time-series-shaped
  `DataFactory` method into a `TimeSeriesDataFrame` via `_to_timeseries`,
  tagging the source for cache and debug visibility.
- Not a return type itself, but a marker applied to factory methods that
  promise time-series output.
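
The general shape of such a decorator is sketched below. It is not the actual `chart_output` implementation: the `source` argument, the dict-based return value, and `to_timeseries` (a placeholder for `_to_timeseries`) are assumptions made to keep the example free of pandas.

```python
import functools
from typing import Any, Callable


def to_timeseries(value: Any) -> list[dict]:
    # Placeholder for `_to_timeseries`: coerce a missing result to an empty series.
    return list(value) if value is not None else []


def chart_output_sketch(source: str) -> Callable:
    """Normalize a method's return value and tag it with its source."""
    def decorator(func: Callable) -> Callable:
        @functools.wraps(func)  # keep the wrapped method's name and docstring
        def wrapper(*args: Any, **kwargs: Any) -> dict:
            series = to_timeseries(func(*args, **kwargs))
            return {"source": source, "rows": series}
        return wrapper
    return decorator


@chart_output_sketch("fred")
def get_fred_data(indicator_name: str):
    # Hypothetical body: returns None for unknown codes.
    return [{"time": "2024-01-01", "close": 3.7}] if indicator_name == "UNRATE" else None
```

Putting the normalization in a decorator means each factory method can return whatever its provider produced, while callers still receive one predictable shape.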

## DataFactory

Source: `src/TerraFin/data/factory.py`

```python
DataFactory(api_keys: dict[str, str] | None = None)
```

`DataFactory` is the main entry point. Use `get(name)` when you want TerraFin
to decide where a name belongs, or call a domain-specific method when you
already know the source.

### Resolution order for `get(name)`

1. Market indicator registry (VIX, treasuries, ...)
2. Economic indicator registry (FRED series, macro, ...)
3. Index map + yfinance (tickers, index names)
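
The lookup order can be pictured as a chain of dictionary checks. The registries below are tiny hypothetical stand-ins, and `resolve` is an illustrative helper, not a `DataFactory` method.

```python
# Hypothetical registry contents; the real registries are much larger.
MARKET_INDICATORS = {"VIX": "^VIX"}
ECONOMIC_INDICATORS = {"Unemployment Rate": "UNRATE"}
INDEX_MAP = {"S&P 500": "^GSPC"}


def resolve(name: str) -> tuple[str, str]:
    """Return (provider, symbol) following the documented lookup order."""
    if name in MARKET_INDICATORS:
        return "market_indicator", MARKET_INDICATORS[name]
    if name in ECONOMIC_INDICATORS:
        return "economic", ECONOMIC_INDICATORS[name]
    # Fall through: use the index alias if known, otherwise treat the name as a raw ticker.
    return "yfinance", INDEX_MAP.get(name, name)
```

Because registries are checked first, a registered indicator name shadows a ticker with the same spelling.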

### Which method to call

| Method | Return type | Description |
|--------|-------------|-------------|
| `get(name)` | `TimeSeriesDataFrame` | Universal lookup across market indicators, economic indicators, index aliases, and raw tickers |
| `get_recent_history(name, period="3y")` | `HistoryChunk` | Recent seed window used by progressive chart loading |
| `get_full_history_backfill(name, loaded_start=None)` | `HistoryChunk` | Older history to prepend onto an already-seeded chart |
| `get_fred_data(indicator_name)` | `TimeSeriesDataFrame` | Direct FRED lookup by FRED code such as `"UNRATE"` |
| `get_economic_data(indicator_name)` | `TimeSeriesDataFrame` | Human-readable economic lookup such as `"Unemployment Rate"` |
| `get_market_data(ticker_or_name)` | `TimeSeriesDataFrame` | Market lookup through the market provider layer |
| `get_corporate_data(ticker, statement_type="income", period="annual")` | `pd.DataFrame \| None` | Company financials via TerraFin's yfinance-backed statement adapter. |
| `get_portfolio_data(guru_name)` | `PortfolioOutput` | Guru portfolio holdings via SEC EDGAR 13F filings |

The return values of the time-series methods are normalized by the
`@chart_output` decorator before they reach the caller.

## Output type conveniences

The contract definitions above are the source of truth. A few notes about
working with them in practice:

- `TimeSeriesDataFrame.make_empty()` returns an empty frame with the canonical
  schema; `.name` and `.chart_meta` survive slicing and pandas operations.
- `FinancialStatementFrame.make_empty(statement_type, period, ticker)`
  creates an empty statement frame with the right metadata for a missing
  source.
- `EventList.make_empty()` and `FilingDocument.make_empty(ticker, filing_type)`
  exist for the same reason: empty results stay typed.

## Provider map

| Domain | Backing source | Typical access path | Notes |
|--------|----------------|---------------------|-------|
| Market prices | yfinance | `get("AAPL")`, `get("S&P 500")`, `get("Shanghai Composite")` | Handles tickers and index aliases |
| Market indicators | Registry-backed market series | `get("VIX")`, `get("MOVE")`, `get("Net Breadth")` | Mix of yfinance-backed and private-series-backed names resolved before raw tickers |
| Economic series | FRED | `get_fred_data("UNRATE")`, `get("Unemployment Rate")` | Human-readable names map to FRED codes |
| Computed macro indicators | FRED-derived logic | `get("Buffett Indicator")` | Built from public series |
| Credit and risk indicators | FRED and FRED-derived | `get("High Yield Spread")`, `get("Net Liquidity")` | HY spread, RRP, net liquidity, 18M forward rate spread, credit spread |
| Corporate fundamentals | yfinance statement adapter | `get_corporate_data("AAPL")` | Returns a plain pandas frame |
| SEC filings | SEC EDGAR | `get_sec_data(ticker, filing_type)`, `fetch_and_parse_filing(cik, accession, doc, form, include_images)` | Parses 10-K / 10-Q / 8-K HTML into markdown + TOC. For 8-K (and 8-K/A), EX-99.x exhibits (earnings press release, CFO commentary) are fetched alongside the cover doc and appended as `## Exhibit 99.x — <label>` sections so the substantive content is reachable. Cached 30 days under the `sec.*` namespaces. |
| Guru portfolios | SEC EDGAR 13F | `get_portfolio_data("Warren Buffett")` | Returns `PortfolioOutput` |
| Private dashboard data | Private endpoint with fallbacks | dashboard and market-insights APIs | Watchlist, breadth, trailing-forward P/E spread, CAPE, calendar, fear/greed, top companies |
| Macro events | FRED plus yfinance | calendar API | Fetched locally, but managed through the private-data cache lifecycle |

Registry locations:

- Market indicators: `src/TerraFin/data/providers/market/market_indicator.py`
- Economic indicators: `src/TerraFin/data/providers/economic/registry.py`
- Guru portfolio registry: `src/TerraFin/data/providers/corporate/filings/sec_edgar/guru_cik.json`

The supported guru names for the 13F feature are maintained in the JSON
registry above rather than hardcoded inline in Python, so additions and edits
can stay data-backed and easier to review.

If `TERRAFIN_SEC_USER_AGENT` is missing, TerraFin still exposes the supported
guru list but treats SEC-backed holdings as disabled. The interface and agent
API return explicit configuration errors instead of silently falling back to
third-party proxies.
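
A fail-loudly check of this kind might look like the following sketch. `ConfigurationError`, `require_sec_user_agent`, and `holdings_enabled` are hypothetical names; only the `TERRAFIN_SEC_USER_AGENT` variable comes from TerraFin.

```python
import os
from typing import Mapping


class ConfigurationError(RuntimeError):
    """Raised when a required setting is missing (hypothetical name)."""


def require_sec_user_agent(env: Mapping[str, str] = os.environ) -> str:
    user_agent = env.get("TERRAFIN_SEC_USER_AGENT", "").strip()
    if not user_agent:
        # Fail explicitly rather than routing the request through a proxy.
        raise ConfigurationError(
            "TERRAFIN_SEC_USER_AGENT is not set; SEC-backed holdings are disabled"
        )
    return user_agent


def holdings_enabled(env: Mapping[str, str] = os.environ) -> bool:
    # The guru list can still be shown when this returns False.
    return bool(env.get("TERRAFIN_SEC_USER_AGENT", "").strip())
```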

### Private access

Private-access features are TerraFin's bridge beyond the public core. They let
the same public interfaces connect to deployment-specific data and
operator-side workflows without making those extensions part of the default
open-source path.

These are intentional private-access extensions, not arbitrary hidden features.
They provide one authenticated boundary where operator-managed deployments can
attach broader workflow-specific data while public/demo deployments continue to
run on public providers and safe fallbacks.

Private-access features provide proprietary or deployment-specific data behind
an authenticated endpoint. They are optional: if the endpoint is unavailable,
TerraFin may fall back to local file cache first and then to bundled fixtures
or empty defaults, depending on the resource. This means local or private
installs can continue to function without private credentials, with reduced
coverage for private dashboard data. That fallback behavior should be treated
as an operational convenience for controlled deployments, not as a blanket
permission to serve cached restricted data publicly.
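
The fallback order described above (endpoint, then local file cache, then bundled fixture or empty default) can be sketched as an ordered list of loaders. `load_with_fallback` and the loader names are hypothetical; the real behavior varies by resource.

```python
from typing import Any, Callable, Optional

Loader = Callable[[], Optional[Any]]


def load_with_fallback(loaders: list[tuple[str, Loader]], default: Any) -> tuple[str, Any]:
    """Try each loader in order; return (origin, payload) from the first that succeeds."""
    for origin, loader in loaders:
        try:
            payload = loader()
        except Exception:
            continue  # an unavailable source falls through to the next one
        if payload is not None:
            return origin, payload
    return "default", default


def endpoint() -> Any:
    # Simulates the private endpoint being unreachable.
    raise ConnectionError("private endpoint unavailable")


origin, payload = load_with_fallback(
    [("endpoint", endpoint), ("file_cache", lambda: None), ("fixture", lambda: {"score": 50})],
    default={},
)
```

Returning the origin alongside the payload lets a deployment tell whether it is serving live data or a fallback, which matters when cached private data must not be shown publicly.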

Configuration via env vars:

| Variable | Description |
|----------|-------------|
| `TERRAFIN_PRIVATE_SOURCE_ENDPOINT` | Base endpoint URL for the private source |
| `TERRAFIN_PRIVATE_SOURCE_ACCESS_KEY` | Header name used for authentication |
| `TERRAFIN_PRIVATE_SOURCE_ACCESS_VALUE` | Header value used for authentication |
| `TERRAFIN_PRIVATE_SOURCE_TIMEOUT_SECONDS` | HTTP timeout in seconds (default: 10) |
| `TERRAFIN_SEC_USER_AGENT` | Required SEC EDGAR user-agent string for filings and 13F access |
| `TERRAFIN_MONGODB_URI` / `MONGODB_URI` | Optional MongoDB backend for watchlist write mode |
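
Reading the private-source variables into a single config object might look like the sketch below. `PrivateSourceConfig` and `load_private_config` are hypothetical names, and treating a missing endpoint as "not configured" is an assumption.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class PrivateSourceConfig:
    endpoint: str
    headers: dict
    timeout_seconds: float


def load_private_config(env: Mapping[str, str] = os.environ) -> Optional[PrivateSourceConfig]:
    """Return None when the private source is not configured."""
    endpoint = env.get("TERRAFIN_PRIVATE_SOURCE_ENDPOINT", "").strip()
    if not endpoint:
        return None
    # ACCESS_KEY is the header name and ACCESS_VALUE is the header value.
    key = env.get("TERRAFIN_PRIVATE_SOURCE_ACCESS_KEY", "").strip()
    value = env.get("TERRAFIN_PRIVATE_SOURCE_ACCESS_VALUE", "")
    headers = {key: value} if key else {}
    timeout = float(env.get("TERRAFIN_PRIVATE_SOURCE_TIMEOUT_SECONDS", "10"))
    return PrivateSourceConfig(endpoint.rstrip("/"), headers, timeout)
```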

Implementation lives under `src/TerraFin/data/providers/private_access/`.

The private endpoint currently backs these dashboard and market-insight
resources:

- watchlist
- market breadth
- trailing-forward P/E spread
- CAPE
- calendar data
- fear/greed
- top companies

### Private series vs private widget

Private-source data in TerraFin should be classified in one of two ways.

#### PrivateSeries

Use this when the data should behave like a real TerraFin series.

Requirements:

- normalize to `TimeSeriesDataFrame`
- be usable through `DataFactory`
- support `HistoryChunk` semantics if optimized chart serving is needed
- share the same cache and progressive-history contract as other chartable series

Examples:

- `Fear & Greed` when used as a chart/searchable series
- `Net Breadth` as a chart/searchable breadth history series
- future chartable private series such as `CAPE` or
  `Trailing-Forward P/E Spread`, if promoted into the chart/search flow

#### PrivateWidget

Use this when the data is only a dashboard or page payload.

Characteristics:

- arbitrary JSON/dict/list response shape
- simple cache and refresh behavior
- no `DataFactory` or chart progressive-history requirement

Examples:

- top-companies payloads
- dashboard-only summaries that are not intended to become chart series

The rule is:

- if a private-source feature wants TerraFin's optimized chart serving, it must
  enter the system as `TimeSeriesDataFrame`
- otherwise it remains a widget payload and should not be forced into the chart
  pipeline

If TerraFin is deployed publicly, keep those private-source resources behind the
authenticated endpoint and treat fallback caches as an operational convenience,
not as redistribution permission. Public/demo deployments should rely on public
providers and bundled public-safe fixtures, not warmed private-source caches.

### Watchlist write mode

The watchlist page always remains available in read-only sample mode. Writable
watchlist CRUD is optional and only turns on when MongoDB is configured
through:

- `TERRAFIN_MONGODB_URI` or `MONGODB_URI`
- `TERRAFIN_WATCHLIST_MONGODB_DATABASE`
- `TERRAFIN_WATCHLIST_MONGODB_COLLECTION`
- `TERRAFIN_WATCHLIST_DOCUMENT_ID`

Without those settings, TerraFin keeps the page visible and serves bundled
sample data instead of failing startup.

### Macro events and the private-data lifecycle

Macro calendar events are fetched by TerraFin itself from public sources, not
from the private endpoint. They still participate in the same
`PrivateDataService` refresh and fallback flow as private data so the interface
has one consistent cache lifecycle.

| Module | Path | Responsibility |
|--------|------|----------------|
| Macro calendar | `src/TerraFin/data/providers/economic/macro_calendar.py` | Fetches release dates from the FRED API |
| Macro values | `src/TerraFin/data/providers/economic/macro_values.py` | Enriches events with Latest/Previous from FRED series observations |
| Cache source | `private.macro` in CacheManager | Daily refresh via `PrivateDataService.refresh_macro()` |

Current limitation: macro events do not yet carry a reliable consensus
`expected` value. The current enrichment step only guarantees actual and prior
values.

## Caching

Provider caches are described in [caching.md](./caching.md). The short version:

- public providers such as yfinance and FRED use in-memory plus file cache
- yfinance also exposes progressive-history helpers backed by `yfinance_v2`
  columnar artifacts for `3Y` seed + full-history backfill flows
- guru portfolios use file cache and now participate in manager-driven invalidation
- private-access resources are also registered with the background cache manager
- file cache sits under `~/.terrafin/cache/`

When deploying TerraFin publicly, review private-access cache usage carefully.
Local cache can preserve previously fetched restricted data; that does not make
the data public-safe to serve. If a deployment mixes public traffic with
private-source access, treat cache contents as potentially restricted unless the
upstream terms clearly allow that storage and display pattern.

## Adding a provider

Use this checklist when extending the data layer:

1. Add a provider function under the correct domain package.
2. Return `TimeSeriesDataFrame` for chartable time series, or another contract
   from `data/contracts/` when the data is not time-series shaped.
3. Register the name in the market or economic registry if it should be
   discoverable through `DataFactory.get(...)`.
4. Add cache behavior only if the source benefits from reuse or background
   refresh.
5. For private-source features, decide explicitly whether the feature is a
   `PrivateSeries` or a `PrivateWidget` before wiring UI, chart, or agent
   surfaces.

## See also

- [feature-integration.md](./feature-integration.md) for the cross-layer checklist when a new data capability becomes public
- [interface.md](./interface.md) for the API layer built on top of these outputs
- [chart-architecture.md](./chart-architecture.md) for the shared chart session and progressive-history contract
- [analytics.md](./analytics.md) for modules that consume `TimeSeriesDataFrame`
- [caching.md](./caching.md) for refresh policies and file-cache behavior