File size: 12,688 Bytes
6a7089a | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 | # Lite Engine: Chrome-Free DOM Capture using Gost-DOM
**Branch:** `feat/lite-engine-gostdom`
**Issue:** [#201](https://github.com/pinchtab/pinchtab/issues/201)
**Related Draft PR:** [#200](https://github.com/pinchtab/pinchtab/pull/200)
**Dependency:** [gost-dom/browser v0.11.0](https://github.com/gost-dom/browser) (MIT, ~255 stars, Go 78.4%)
---
## Overview
This implementation adds a **Lite Engine** that can perform DOM capture (navigate, snapshot, text extraction, click, type) without requiring Chrome/Chromium. It uses [Gost-DOM](https://github.com/gost-dom/browser), a headless browser written in pure Go, to parse and traverse HTML documents.
The architecture follows the maintainer's guidance for **"clever routing that is expandable without touching the rest of the code"** — implemented via a strategy-pattern Router with pluggable rules.
## Architecture
### Engine Interface (`internal/engine/engine.go`)
```go
type Engine interface {
Name() string
Navigate(ctx context.Context, url string) (*NavigateResult, error)
Snapshot(ctx context.Context, filter string) ([]SnapshotNode, error)
Text(ctx context.Context) (string, error)
Click(ctx context.Context, ref string) error
Type(ctx context.Context, ref, text string) error
Capabilities() []Capability
Close() error
}
```
### Router (`internal/engine/router.go`)
The Router evaluates an ordered chain of `RouteRule` implementations. The first rule to return a non-`Undecided` verdict wins.
```
Request → Router → [Rule 1] → [Rule 2] → ... → [Fallback Rule] → Engine
```
Rules are hot-swappable at runtime via `AddRule()` / `RemoveRule()` — no handler code changes needed.
### Three Modes
| Mode | Behavior | Default Rules |
|------|----------|---------------|
| `chrome` | All requests → Chrome (default, backward compatible) | DefaultChromeRule |
| `lite` | DOM ops → Gost-DOM, screenshots/PDF/evaluate → Chrome | CapabilityRule → DefaultLiteRule |
| `auto` | Per-request routing based on URL patterns | CapabilityRule → ContentHintRule → DefaultChromeRule |
### Built-in Rules (`internal/engine/rules.go`)
| Rule | Purpose |
|------|---------|
| `CapabilityRule` | Routes screenshot/pdf/evaluate/cookies → Chrome (lite can't do these) |
| `ContentHintRule` | Routes `.html/.htm/.xml/.txt/.md` URLs → Lite (for navigate/snapshot/text) |
| `DefaultLiteRule` | Catch-all: routes all DOM ops → Lite |
| `DefaultChromeRule` | Final fallback: routes everything → Chrome |
### Expandability
Adding new routing logic requires only:
1. Implement `RouteRule` interface (2 methods: `Name()`, `Decide()`)
2. Call `router.AddRule(myRule)` — inserted before the fallback rule
No handler, config, or CMD changes needed.
## Files Changed
### New Files (8)
| File | Purpose | Lines |
|------|---------|-------|
| `internal/engine/engine.go` | Engine interface, types, capabilities | ~70 |
| `internal/engine/lite.go` | LiteEngine implementation using Gost-DOM | ~430 |
| `internal/engine/router.go` | Router with AddRule/RemoveRule | ~120 |
| `internal/engine/rules.go` | 4 built-in RouteRule implementations | ~95 |
| `internal/engine/lite_test.go` | LiteEngine unit tests | ~280 |
| `internal/engine/router_test.go` | Router unit tests | ~130 |
| `internal/engine/rules_test.go` | Rule unit tests | ~115 |
| `internal/engine/realworld_test.go` | Real-world website comparison tests | ~570 |
### Modified Files (8)
| File | Change |
|------|--------|
| `internal/config/config.go` | Added `Engine` field to RuntimeConfig + ServerConfig |
| `internal/handlers/handlers.go` | Added `Router *engine.Router` field, `useLite()` helper |
| `internal/handlers/navigation.go` | Lite fast path before ensureChrome |
| `internal/handlers/snapshot.go` | Lite fast path with SnapshotNode → A11yNode conversion |
| `internal/handlers/text.go` | Lite fast path returning plain text |
| `cmd/pinchtab/cmd_bridge.go` | Engine router wiring based on config mode |
| `go.mod` | Added gost-dom/browser v0.11.0, gost-dom/css v0.1.0 |
| `go.sum` | Updated checksums |
## Improvements Over PR #200 Draft
| Area | PR #200 | This Implementation |
|------|---------|-------------------|
| Tab management | Single window | Multi-tab with sequential IDs |
| HTML parsing | `browser.Open()` double-fetches | HTTP fetch → strip scripts → `html.NewWindowReader` |
| Script handling | Panics on `<script>` tags | Pre-parse stripping via `x/net/html` tokenizer |
| Click safety | No panic protection | `defer recover()` in Click method |
| Text output | Raw DOM text | `normalizeWhitespace()` — collapses runs of whitespace |
| Role mapping | Basic (a, button, input, etc.) | Extended: section→region, details→group, summary→button, dialog, article |
| Interactive detection | Basic tags | Adds summary, ARIA roles (tab, menuitem, switch) |
| Routing | None (always lite) | Strategy-pattern Router with pluggable rules |
| Configuration | None | Config file support (`server.engine`) |
## Test Results
### Engine Package Tests (40+ tests, all passing)
```
=== Unit Tests ===
TestLiteEngine_Navigate PASS
TestLiteEngine_Snapshot_All PASS
TestLiteEngine_Snapshot_Interactive PASS
TestLiteEngine_Text PASS
TestLiteEngine_Click PASS
TestLiteEngine_Type PASS
TestLiteEngine_RefNotFound PASS
TestLiteEngine_ScriptStyleSkipped PASS
TestLiteEngine_AriaAttributes PASS
TestLiteEngine_MultiTab PASS
TestLiteEngine_Close PASS
TestLiteEngine_Capabilities PASS
TestLiteEngine_Name PASS
TestNormalizeWhitespace PASS
=== Router Tests ===
TestRouterChromeMode PASS
TestRouterLiteMode PASS
TestRouterAutoModeStaticContent PASS
TestRouterAutoModeLiteNil PASS
TestRouterAddRemoveRule PASS
TestRouterRulesSnapshot PASS
=== Rule Tests ===
TestCapabilityRule (9 cases) PASS
TestContentHintRule (9 cases) PASS
TestDefaultLiteRule (7 cases) PASS
TestDefaultChromeRule (4 cases) PASS
```
### Real-World Website Comparison Tests (16 suites, 63+ subtests)
| Suite | Simulates | Subtests | Result |
|-------|-----------|----------|--------|
| WikipediaStyle | Wikipedia article page | 9 | PASS |
| HackerNewsStyle | HN front page | 4 | PASS |
| EcommerceStyle | Product page with forms | 9 | PASS |
| FormHeavy | Registration form | 7 | PASS |
| AriaHeavy | Dashboard with ARIA roles | 11 | PASS |
| DeeplyNested | 5+ levels of div nesting | 4 | PASS |
| SpecialCharacters | Unicode, HTML entities, CJK | 3 | PASS |
| EmptyPage | Empty HTML body | 1 | PASS |
| NonHTMLContentType | JSON response | 1 | PASS |
| HTTP404 | 404 error page | 1 | PASS |
| LargePagePerformance | 200 sections, 800+ nodes | 1 | PASS |
| MultipleScriptTags | 5 script tags in head+body | 1 | PASS |
| InlineStyles | Style tags in head+body | 1 | PASS |
| ClickWorkflow | Button clicks | 1 | PASS |
| ClickLinkRecovery | Anchor click panic recovery | 1 | PASS |
| TypeWorkflow | Type into all textboxes | 1 | PASS |
### Full Project Test Suite
```
ok cmd/pinchtab 2.8s
ok internal/allocation 2.0s
ok internal/config 1.6s
ok internal/dashboard 3.1s
ok internal/engine 1.4s ← new package
ok internal/handlers 6.8s
ok internal/human 10.7s
ok internal/idpi 2.0s
ok internal/idutil 1.8s
ok internal/instance 2.6s
ok internal/orchestrator 3.2s
ok internal/profiles 2.8s
ok internal/proxy 2.8s
ok internal/scheduler 4.0s
ok internal/semantic 1.6s
ok internal/strategy 1.7s
ok internal/uameta 1.1s
ok internal/web 1.5s
```
## Known Edge Cases & Limitations
| Edge Case | Behavior | Mitigation |
|-----------|----------|------------|
| `<script>` tags in HTML | Gost-DOM panics (nil ScriptHost) | Pre-parse stripping via x/net/html tokenizer |
| Click on `<a href>` | Gost-DOM navigates, may encounter scripts | `defer recover()` in Click, returns error |
| CSS `display:none` | Elements still appear in snapshot | Lite engine has no CSS engine |
| JavaScript-rendered content | Not captured (SPA, dynamic DOM) | Falls back to Chrome in auto mode |
| Screenshots / PDF | Not supported in lite | CapabilityRule routes to Chrome |
| Cookies / Evaluate | Not supported in lite | CapabilityRule routes to Chrome |
| `<noscript>` content | Stripped from snapshot | Consistent with script-disabled behavior |
## Configuration
Set the engine in your config file:
```json
{
"server": {
"engine": "lite"
}
}
```
### Response Headers
Lite-served responses include `X-Engine: lite` header for observability.
## Dependency Analysis
| Package | Size | License | Purpose |
|---------|------|---------|---------|
| gost-dom/browser v0.11.0 | ~2.5MB source | MIT | Headless browser (HTML parsing, DOM traversal) |
| gost-dom/css v0.1.0 | ~200KB | MIT | CSS selector support |
| golang.org/x/net (existing) | already in go.mod | BSD-3 | HTML tokenizer for script stripping |
## Performance Benchmark: Lite vs Chrome
**Lite run:** 2026-03-09 | **Chrome run:** 2026-03-09
**Method:** 8 real-world websites × 4 operations each (Navigate → Snapshot All → Snapshot Interactive → Text)
### Response Times (ms)
| Website | Lite Navigate | Lite Snap (all) | Lite Text | Chrome Navigate | Chrome Snap (all) | Chrome Text | Winner |
|---------|:------------:|:--------------:|:---------:|:--------------:|:----------------:|:-----------:|:------:|
| Example.com | 38ms | 23ms | 29ms | 396ms | 46ms | 34ms | **LITE** |
| Wikipedia (Go) | 657ms | 775ms | 120ms | 1310ms | 2703ms | 201ms | **LITE** |
| Hacker News | 1032ms | 188ms | 21ms | 1218ms | 247ms | 27ms | **LITE** |
| httpbin.org | 1117ms | 31ms | 24ms | 4745ms | 187ms | 47ms | **LITE** |
| GitHub Explore | 1402ms | 161ms | 24ms | 6156ms | 329ms | 20ms | **LITE** |
| DuckDuckGo | 119ms | 26ms | 20ms | 1488ms | 394ms | 41ms | **LITE** |
| Wikipedia (CS) | 215ms | 535ms | 687ms | 2668ms | 1249ms | 130ms | **LITE** |
| Stack Overflow | ❌ 502 | 694ms | 111ms | 6433ms | 376ms | 61ms | **CHROME** |
> Stack Overflow blocks bot HTTP requests — the Lite engine's `Navigate` got a 502. Chrome handles this via a real browser session.
### Totals (7 sites where both engines succeeded)
| Metric | Lite | Chrome | Speedup |
|--------|-----:|-------:|--------:|
| Navigate Total | 4,580ms | 17,981ms | **3.9×** faster |
| Snapshot Total | 1,739ms | 5,155ms | **3.0×** faster |
| Text Total | 925ms | 500ms | 0.5× (Chrome faster) |
| **Grand Total** | **7,244ms** | **23,636ms** | **3.3× faster** |
> Lite wins **7/8 sites** overall. Chrome is faster at text extraction because it runs Mozilla Readability.js in-browser. Lite performs raw DOM text walk which is slower for very large articles (e.g. Wikipedia CS: 687ms vs 130ms).
### Node Count Comparison
| Website | Lite Nodes | Chrome Nodes | Lite Interactive | Chrome Interactive | Lite Text (chars) | Chrome Text (chars) |
|---------|:----------:|:------------:|:----------------:|:-----------------:|:----------------:|:-----------------:|
| Example.com | 6 | 8 | 1 | 1 | 125 | 209 |
| Wikipedia (Go) | 6,074 | 7,110 | 1,276 | 1,063 | 75,659 | 62,859 |
| Hacker News | 805 | 975 | 229 | 229 | 4,025 | 4,169 |
| httpbin.org | 62 | 113 | 5 | 29 | 274 | 1,179 |
| GitHub Explore | 1,533 | 830 | 331 | 240 | 8,340 | 368 |
| DuckDuckGo | 143 | 655 | 20 | 102 | 123 | 7,231 |
| Wikipedia (CS) | 4,941 | 4,653 | 1,627 | 1,061 | 79,799 | 58,071 |
| Stack Overflow | — | 779 | — | 192 | — | 23,671 |
> **Why node counts differ:** Lite strips `<script>` tags before parsing and has no CSS engine so hidden elements still appear. Chrome's accessibility tree prunes hidden/invisible elements. DuckDuckGo and GitHub Explore show lower Chrome text because Chrome's Readability.js strips nav/sidebar content, while Lite captures all visible text.
### Key Takeaways
| Scenario | Recommendation |
|----------|---------------|
| Static sites, wikis, news, blogs | **Lite** — 3–12× faster, no Chrome overhead |
| JavaScript-rendered SPAs (React, Next.js, etc.) | **Chrome** — Lite captures pre-JS HTML only |
| Sites that block HTTP bots (Stack Overflow, some social) | **Chrome** — real browser bypasses bot detection |
| Snapshot / DOM traversal on large pages | **Lite** — 3× faster snapshot on Wikipedia |
| Text extraction on large articles | **Chrome** — Readability.js is more accurate and faster |
| Pipelines needing screenshots / PDF / evaluate | **Chrome** — Lite doesn't support these |
*Benchmark run from `tests/lite_engine_benchmark.ps1` on 2026-03-09*
|