Spaces:
Runtime error
Runtime error
| Here's an overview of v86's workings. For details, check the | |
| [source](https://github.com/copy/v86/tree/master/src). | |
| The major limitations of WebAssembly are (for the purpose of making emulators with jit): | |
| - structured control flow (no arbitrary jumps) | |
| - no control over registers (you can't keep hardware registers in wasm locals across functions) | |
| - no mmap (paging needs to be fully emulated) | |
| - no patching | |
| - module generation is fairly slow, but at least it's asynchronous, so other things can keep running | |
| - there is some memory overhead per module, so you can't generate more than a few thousand | |
| v86 has an interpreted mode, which collects entry points (targets of function | |
| calls and indirect jumps). It also measures the hotness per page, so that | |
| compilation is focused on code that is often executed. Once a page is | |
| considered hot, code is generated for the entire page and up to `MAX_PAGES` | |
| that are directly reachable from it. | |
| v86 generates a single function with a big switch statement (brtable), to | |
| ensure that all functions and targets of indirect jumps are reachable from | |
| other modules. The remaining control flow is handled using the "stackifier" | |
| algorithm (well-explained in | |
| [this blog post](https://medium.com/leaningtech/solving-the-structured-control-flow-problem-once-and-for-all-5123117b1ee2)). | |
| At the moment, there is no linking of wasm modules. The current module is | |
| exited, and the main loop detects if a new module can be entered. | |
| In practice, I found that browsers don't handle this structure (deep brtables, | |
| with locals being used across the entire function) very well, and `MAX_PAGES` | |
| has to be set to fairly low, otherwise memory usage blows up. It's likely that | |
| improvements are possible (generating fewer entry points, splitting code across | |
| multiple functions). | |
| Code-generation happens in two passes. The first pass finds all basic block | |
| boundaries, the second generates code for each basic block. Instruction | |
| decoding is generated by a [set of | |
| scripts](https://github.com/copy/v86/tree/master/gen) from a [table of | |
| instructions](https://github.com/copy/v86/blob/master/gen/x86_table.js). It's | |
| also used to [generate | |
| tests](https://github.com/copy/v86/blob/master/tests/nasm/create_tests.js). | |
| To handle paging, v86 generates code similar to this (see `gen_safe_read`): | |
| ``` | |
| entry <- tlb[addr >> 12 << 2] | |
| if entry & MASK == TLB_VALID && (addr & 0xFFF) <= 0xFFC - bytes: goto fast | |
| entry <- safe_read_jit_slow(addr, instruction_pointer) | |
| if page_fault: goto exit-with-pagefault | |
| fast: mem[(entry & ~0xFFF) ^ addr] | |
| ``` | |
| There is a 4 MB cache that acts like a tlb. It contains the physical address, | |
| read-only bit, whether the page contains code (in order to invalidate it on | |
| write) and whether the page points to mmio. Any of those cases are handled in | |
| the slow path (`safe_read_jit_slow`), as well as walking the page tables and | |
| triggering page faults. The fast path is taken in the vast majority of times. | |
| The remaining code generation is mostly a straight-forward, 1-to-1 translation | |
| of x86 to wasm. The only analysis done is to optimise generation of condional | |
| jumps immediately after arithmetic instructions, e.g.: | |
| ``` | |
| cmp eax, 52 | |
| setb eax | |
| ``` | |
| becomes: | |
| ``` | |
| ... // code for cmp | |
| eax <- eax < 52 | |
| ``` | |
| A lazy flag mechanism is used to speed arithmetic (applies to both jit and | |
| interpreted mode, see | |
| [`arith.rs`](https://github.com/copy/v86/blob/master/src/rust/cpu/arith.rs) and | |
| [`misc_instr.rs`](https://github.com/copy/v86/blob/master/src/rust/cpu/misc_instr.rs)). | |
| There's a wip that tries to elide most lazy-flags updates: | |
| https://github.com/copy/v86/pull/466 | |
| FPU instructions are emulated using softfloat (very slow, but unfortunately | |
| some code relies on 80 bit floats). | |