license: mit
tags:
- threshold-logic
- neuromorphic
- computer-architecture
- turing-complete
- loihi
- truenorth
- akida
8bit-threshold-computer
A Turing-complete 8-bit CPU implemented entirely as threshold logic gates.
Every logic gate is a threshold neuron: output = 1 if (Ξ£ wα΅’xα΅’ + b) β₯ 0 else 0
Tensors: 11,581
Parameters: 8,290,134 (full CPU) / 32,397 (pure ALU for LLM)
What Is This?
A complete 8-bit processor where every operationβfrom Boolean logic to arithmetic to control flowβis implemented using only weighted sums and step functions. No traditional gates.
| Component | Specification |
|---|---|
| Registers | 4 Γ 8-bit general purpose |
| Memory | Configurable: 0B (pure ALU) to 64KB (full CPU) |
| ALU | 16 operations (ADD, SUB, AND, OR, XOR, NOT, SHL, SHR, MUL, DIV, INC, DEC, NEG, ROL, ROR, CMP) |
| Flags | Zero, Negative, Carry, Overflow |
| Control | JMP, JZ, JNZ, JC, JNC, JN, JP, JV, JNV, CALL, RET, PUSH, POP |
Turing complete. Verified with loops, conditionals, recursion, and self-modification.
Execution Model
A self-contained, autonomous computational machine:
- Pure tensor computation: State in, state out
- Frozen circuits: Integer weights, Heaviside activation
- ACT execution: Internal loop until HALT
- No external orchestration: One forward pass = complete program execution
βββββββββββββββββββββββββββββββ
β Initial State β
β [PC|Regs|Flags|Memory...] β
βββββββββββββββ¬ββββββββββββββββ
βΌ
βββββββββββββββββββββββββββββββ
β β
β Threshold Circuit Layer β
β β
β βββββββββββββββββββββββββ β
β β Fetch: PC β Instr β β
β βββββββββββββββββββββββββ€ β
β β Decode: Opcode/Ops β β
β βββββββββββββββββββββββββ€ β
β β Execute: ALU/Mem β β
β βββββββββββββββββββββββββ€ β
β β Writeback: Results β β
β βββββββββββββββββββββββββ€ β
β β PC Update β β
β βββββββββββββ¬ββββββββββββ β
β β β
β ββββββΌβββββ β
β β HALTED? β β
β ββββββ¬βββββ β
β β β
β no βββ΄ββ yes β
β β β β
β βΌ βΌ β
β [loop] [exit] β
β β
βββββββββββββββ¬ββββββββββββββββ
βΌ
βββββββββββββββββββββββββββββββ
β Final State β
β [PC|Regs|Flags|Memory...] β
βββββββββββββββββββββββββββββββ
Instruction Set
| Opcode | Mnemonic | Operation |
|---|---|---|
| 0x0 | ADD | R[d] = R[a] + R[b] |
| 0x1 | SUB | R[d] = R[a] - R[b] |
| 0x2 | AND | R[d] = R[a] & R[b] |
| 0x3 | OR | R[d] = R[a] | R[b] |
| 0x4 | XOR | R[d] = R[a] ^ R[b] |
| 0x5 | SHL | R[d] = R[a] << 1 |
| 0x6 | SHR | R[d] = R[a] >> 1 |
| 0x7 | MUL | R[d] = R[a] * R[b] |
| 0x8 | DIV | R[d] = R[a] / R[b] |
| 0x9 | CMP | flags = R[a] - R[b] |
| 0xA | LOAD | R[d] = M[addr] |
| 0xB | STORE | M[addr] = R[s] |
| 0xC | JMP | PC = addr |
| 0xD | Jcc | PC = addr if cond (imm8[2:0]: 0=Z,1=NZ,2=C,3=NC,4=N,5=P,6=V,7=NV) |
| 0xE | CALL | push PC; PC = addr |
| 0xF | HALT | stop execution |
Design Principles
- Autonomy: Machine runs without external logic
- Purity: forward(state) β state', no side effects
- Transparency: All weights inspectable, all operations traceable
- Universality: Turing complete, runs arbitrary programs
Background
Threshold Logic
A threshold gate computes a Boolean function by taking a weighted sum of binary inputs and comparing it to a threshold. If the sum meets or exceeds the threshold, the output is 1; otherwise, 0. This can be expressed as a neuron with Heaviside step activation: output = 1 if (Ξ£ wα΅’xα΅’ + b) β₯ 0 else 0, where weights wα΅’ and bias b are integers.
Threshold gates are strictly more powerful than standard Boolean gates. A single threshold gate can compute any linearly separable Boolean functionβthis includes AND, OR, NAND, NOR, and many others that require multiple levels of conventional gates. Functions that are not linearly separable (such as XOR or parity) require multiple threshold gates arranged in layers.
Historical Development
Warren McCulloch and Walter Pitts introduced the threshold neuron model in 1943, proving that networks of such neurons could compute any Boolean function. This work preceded both the perceptron and modern neural networks, establishing the theoretical foundation for neural computation.
The 1960s saw significant development in threshold logic synthesis. Researchers including Saburo Muroga, Robert McNaughton, and Michael Dertouzos developed algebraic methods for determining whether a Boolean function could be implemented as a single threshold gate, and if so, how to calculate appropriate weights. This work produced systematic techniques for threshold gate design but focused on individual gates rather than complete systems.
Frank Rosenblatt's Mark I Perceptron (1957-1960) implemented threshold neurons in hardware using potentiometers for weights, but it was a pattern classifier that learned its weights through trainingβthe final weight configurations were not published. Bernard Widrow's ADALINE and MADALINE systems (1960-1963) similarly used adaptive threshold elements with weights learned via the LMS algorithm.
Hava Siegelmann and Eduardo Sontag proved in the 1990s that recurrent neural networks are Turing complete. Their construction, however, relied on continuous sigmoid activation functions with infinite precisionβnot the discrete step function used in threshold logic. Other theoretical work on neural Turing machines and differentiable computers followed similar patterns: proving computational universality using continuous, differentiable activations suitable for gradient-based training.
Neuromorphic Hardware
Modern neuromorphic processors implement large arrays of configurable threshold-like neurons in silicon:
Intel Loihi (2017) provides 128 neuromorphic cores with programmable synaptic weights, spike-based communication, and on-chip learning. The architecture supports integer weights and configurable neuron dynamics.
IBM TrueNorth (2014) integrates one million neurons and 256 million synapses in a 4096-core array. Each neurosynaptic core implements 256 neurons with configurable weights and thresholds. The chip was designed as an alternative to von Neumann architecture rather than an implementation of one.
BrainChip Akida (2021) targets edge deployment with event-based processing and integer weights. The architecture supports standard neural network operations mapped onto neuromorphic primitives.
SpiNNaker (University of Manchester) uses ARM processor cores to simulate spiking neural networks at scale. The platform has hosted various neural models but is simulation-based rather than native neuromorphic silicon.
Despite the availability of these platforms, published work has focused on neural network inference, sensory processing, and pattern recognition. A 2024 paper demonstrated basic logic gates, adders, and decoders on SpiNNaker and Dynap-SE1, describing this as "a first step toward the construction of a spiking computer"βthe implementation lacked instruction fetch, program counter, memory systems, and control logic.
This Implementation
The weights in this repository implement a complete 8-bit computer: registers, ALU with 16 operations, status flags, conditional branching, subroutine calls, stack operations, and memory access. Every component is built from threshold neurons with integer weights. The weight configurations are published in safetensors format for direct loading and deployment.
Circuit Categories
| Category | Circuits | Examples |
|---|---|---|
| Boolean | 9 | AND, OR, NOT, NAND, NOR, XOR, XNOR, IMPLIES, BIIMPLIES |
| Arithmetic | 18 | Half/full adder, 2/4/8-bit ripple carry, comparators |
| ALU | 3 | 8-bit ALU, control decoder, flag computation |
| Combinational | 10 | MUX (2:1, 4:1, 8:1), DEMUX, encoders, decoders |
| Control Flow | 16 | JMP, conditional jumps, CALL, RET, PUSH, POP |
| Error Detection | 11 | Parity (XOR tree), checksum, CRC, Hamming |
| Modular | 11 | Divisibility by 2-12 (multi-layer for non-powers-of-2) |
| Threshold | 13 | k-of-n gates, majority, minority, exactly-k |
| Pattern | 10 | Popcount, leading/trailing ones, symmetry |
| Memory | 3 | N-bit addr decoder, 2^NΓ8 read mux, write cells (configurable, packed) |
Usage
import torch
from safetensors.torch import load_file
tensors = load_file("neural_computer.safetensors")
def heaviside(x):
return (x >= 0).float()
# AND gate: fires when both inputs are 1
w = tensors['boolean.and.weight'] # [1, 1]
b = tensors['boolean.and.bias'] # [-2]
for a, b_in in [(0,0), (0,1), (1,0), (1,1)]:
inp = torch.tensor([a, b_in], dtype=torch.float32)
out = heaviside(inp @ w + b)
print(f"AND({a}, {b_in}) = {int(out.item())}")
State Tensor Layout
All multi-bit fields are MSB-first (index 0 is the most-significant bit).
[ PC[N] | IR[16] | R0[8] R1[8] R2[8] R3[8] | FLAGS[4] | SP[N] | CTRL[4] | MEM[2^N][8] ]
Where N = address bits (configurable: 0-16).
Flags are ordered as: Z, N, C, V.
Control bits are ordered as: HALT, MEM_WE, MEM_RE, RESERVED.
| Memory Profile | Addr Bits | Memory Size | State Bits |
|---|---|---|---|
| Full CPU | 16 | 64KB | 524,376 |
| Reduced | 12 | 4KB | 32,856 |
| Scratchpad | 8 | 256B | 2,104 |
| Registers | 4 | 16B | 184 |
| Pure ALU | 0 | 0B | 56 |
Instruction Encoding (16-bit)
All instruction bits are MSB-first.
15..12 11..10 9..8 7..0
opcode rd rs imm8
Interpretation:
- R-type:
rd = rd op rs(imm8 ignored). - I-type:
rd = op rd, imm8(rs ignored). - Address-extended:
LOAD,STORE,JMP,JZ,CALLconsume the next word as a 16-bit address (big-endian).imm8is reserved, and the PC skips 4 bytes when the jump is not taken.
Verification
python eval.py
python threshold_cpu.py
Verification Status
| Category | Status | Notes |
|---|---|---|
| Boolean gates | Exhaustively tested | All 2^n input combinations |
| Arithmetic | Exhaustively tested | Full 8-bit range |
| ALU | Exhaustively tested | All operations, all inputs |
| Control flow | Exhaustively tested | Branch/jump conditions |
| Threshold | Exhaustively tested | k-of-n, majority, etc. |
| Modular (mod 3,5,6,7,9,10,11,12) | Exhaustively tested | Multi-layer, hand-constructed |
| Parity | Exhaustively tested | XOR tree, hand-constructed |
| Modular (mod 2,4,8) | Exhaustively tested | Single-layer, trivial |
The modular arithmetic circuits for non-powers-of-2 and the parity circuits were hand-constructed because:
- Divisibility by 3, 5, etc. is not linearly separable in binary
- 8-bit parity (XOR of all bits) requires a tree of XOR gates
All circuits pass exhaustive testing over their full input domains.
Tensor Naming Convention
{category}.{circuit}[.{layer}][.{component}].{weight|bias}
Examples:
boolean.and.weight
boolean.xor.layer1.neuron1.weight
arithmetic.ripplecarry8bit.fa7.ha2.sum.layer1.or.weight
modular.mod5.layer2.eq3.weight
error_detection.paritychecker8bit.stage2.xor1.layer1.nand.bias
Memory circuits are stored as packed tensors to keep the safetensors header size manageable
(e.g., `memory.addr_decode.weight`, `memory.read.and.weight`, `memory.write.and_old.weight`).
Hardware Compatibility
All weights are integers. All activations are Heaviside step. Designed for:
- Intel Loihi β Neuromorphic research chip
- IBM TrueNorth β 1M neuron chip
- BrainChip Akida β Edge neuromorphic processor
LLM Integration
The threshold circuits can be embedded into transformer MLP layers to give LLMs exact arithmetic capability.
For LLM integration, use --memory-profile none to generate a pure ALU model (~32K params) without memory circuits.
Core Thesis
Standard LLMs fail at arithmetic because they're interpolatorsβthey approximate functions over training distributions rather than compute exact results. A 360M parameter model trained on internet text has seen "127 + 128 = 255" zero or few times, so it guesses based on pattern matching.
We solve this by embedding frozen, proven-correct arithmetic circuits directly into the transformer's MLP layers. The circuits use threshold logic (weighted sums + step activation), which is structurally compatible with neural network layers. We train only the interface layers that learn to:
- Extract operands from token embeddings
- Route computation through the circuits
- Inject results back into the residual stream
The model learns call dispatch, not arithmetic. The arithmetic is already solved.
Architecture
Standard MLP block with parallel circuit path:
x βββ¬ββ MLP path βββββββββββββββββ¬ββ + ββ output
β β
βββ BitExtractor ββ Circuit ββ΄ββ BitInjector
β
Router (learned weighting)
Augmented MLP forward pass:
def forward(x): # x: [batch, seq, d_model]
# Original MLP path (unchanged)
mlp_out = self.down_proj(silu(self.gate_proj(x)) * self.up_proj(x))
# Circuit path (new)
a_bits, b_bits = self.bit_extractor(x) # [batch, seq, 8] each
result_bits, carry = self.circuits.add_8bit(a_bits, b_bits)
flags = self.compute_flags(result_bits, carry)
circuit_delta = self.bit_injector(result_bits, flags)
# Routing
route_weights = self.router(x) # [batch, seq, 2] softmax
# Combine
return mlp_out + route_weights[..., 1:2] * circuit_delta
Threshold Logic Fundamentals
A threshold gate computes:
output = 1 if (Ξ£ wα΅’xα΅’ + b) β₯ 0
0 otherwise
Example gates:
AND: w=[1,1], b=-2
AND(0,0) = H(-2) = 0
AND(1,1) = H(0) = 1
OR: w=[1,1], b=-1
OR(0,1) = H(0) = 1
OR(1,1) = H(1) = 1
XOR: requires 2 layers (not linearly separable)
Layer 1: OR + NAND
Layer 2: AND
Full adder = 2 half-adders + carry OR, ~4 threshold layers. 8-bit ripple carry = 8 chained full adders, ~32 threshold layers.
Interface Layers (Trainable)
BitExtractor β Maps embedding β two 8-bit operands:
class BitExtractor(nn.Module):
def __init__(self, d_model):
self.proj = nn.Linear(d_model, 16)
def forward(self, x):
logits = self.proj(x)
bits = heaviside(logits) # STE for training
return bits[..., :8], bits[..., 8:]
BitInjector β Maps result bits β embedding delta:
class BitInjector(nn.Module):
def __init__(self, d_model):
self.proj = nn.Linear(16, d_model)
self.scale = nn.Parameter(torch.tensor(0.1))
def forward(self, result_bits, flags):
combined = torch.cat([result_bits, flags], dim=-1)
return self.proj(combined) * self.scale
Router β Decides when to use circuits:
class Router(nn.Module):
def __init__(self, d_model):
self.net = nn.Sequential(
nn.Linear(d_model, 64), nn.ReLU(),
nn.Linear(64, 2), nn.Softmax(dim=-1)
)
Trainable Parameters
For SmolLM2-360M (d_model=960), augmenting 11 layers:
| Component | Params/Layer |
|---|---|
| BitExtractor | 15,376 |
| BitInjector | 16,321 |
| Router | 61,698 |
| OpSelector | ~31,000 |
| Total | ~124,395 |
11 layers Γ 124,395 = ~1.37M trainable parameters (0.38% of model)
Gradient Flow
Heaviside has zero gradient almost everywhere. We use Straight-Through Estimator (STE):
class HeavisideSTE(torch.autograd.Function):
@staticmethod
def forward(ctx, x):
return (x >= 0).float()
@staticmethod
def backward(ctx, grad_output):
return grad_output # pass through unchanged
Training Strategy
- Data: Generate 8-bit arithmetic problems exhaustively (256Γ256 = 65,536 unique)
- Loss: Cross-entropy on answer tokens only (prompt masked with -100)
- Optimizer: AdamW on interface params only, lr=1e-4
- Curriculum: Single-digit β two-digit β full 8-bit β adversarial (127+128, 255+1)
Inference
At inference, Heaviside is true step functionβno approximation. If BitExtractor correctly extracts operands, the circuit will output the correct result. Circuit computation adds ~5-10% latency overhead.
Target Performance
| Model | Baseline | Target |
|---|---|---|
| SmolLM2-360M | ~5-10% | >95% |
The interface generalizes to all 65,536 8-bit additions once trainedβno memorization, the circuits compute.
LLM Integration: Proof of Concept (In Progress)
Before proceeding with architectural extensions, we are validating the core thesis: that frozen threshold circuits can provide exact arithmetic capability to language models that otherwise fail at computation.
Baseline Evaluation
We evaluated SmolLM2-360M-Instruct on randomized 8-bit arithmetic using a generous answer extraction protocol. The model was prompted with a system message instructing it to output only numeric answers, and we accepted any correct number found in the output (first number, last number, or word-to-number conversion).
| Operation | SmolLM2-360M Accuracy | Notes |
|---|---|---|
| Addition (A + B) | 35.92% | Best performance, still fails 2/3 |
| Subtraction (A - B) | 17.72% | Poor handling of borrowing |
| Multiplication (A Γ B) | 1.25% | Near-total failure |
| Greater Than (A > B) | 14.37% | Often echoes expression |
| Less Than (A < B) | 4.31% | Often echoes expression |
| Equality (A == B) | 0.28% | Near-total failure |
| Overall Fitness | 11.90% | 238/2000 correct |
Methodology: 2000 randomized test cases with operands uniformly sampled from [0, 255]. Ground truth computed as 8-bit arithmetic (matching the threshold circuit specification). Batch size 64, greedy decoding (temperature=0).
Key Observations:
- Multiplication accuracy (1.25%) is essentially random guessing over the output space
- Comparison operations fail because the model often echoes the expression rather than evaluating it
- Even additionβthe simplest operationβfails nearly two-thirds of the time on 8-bit operands
- Performance degrades sharply as operand magnitude increases (edge cases like 127+128 are almost never correct)
These results establish the control condition for our experiment.
Experimental Design
| Condition | Model Configuration | Target Fitness |
|---|---|---|
| Control | Vanilla SmolLM2-360M-Instruct | 11.90% (measured) |
| Experimental | SmolLM2-360M + Frozen ThresholdALU + Trained Interface | 100% |
The experimental condition adds:
- BitEncoder (trainable): Projects hidden states β 24 bits (3 Γ 8-bit operands)
- OpRouter (trainable): Selects which circuit to activate based on context
- BitDecoder (trainable): Projects 8-bit result β hidden state delta
- ThresholdALU (frozen): The verified circuits from this repository
Training Signal: The fitness function itself. We do not provide answer supervisionβthe model must learn to correctly encode operands and route to circuits such that the frozen circuits produce correct outputs. This is possible because the circuits are proven correct; the interface layers need only learn the encoding/routing protocol.
Success Criterion: If the experimental condition achieves 100% fitness on randomized arithmetic while the control remains at ~12%, this demonstrates:
- The frozen threshold circuits provide exact computation
- Neural interface layers can learn to use discrete computational substrates
- Small language models can achieve perfect arithmetic via architectural augmentation rather than scale
Progress
Stage 1: Circuit Validation β COMPLETE
The frozen threshold circuits achieve 100% accuracy when given correctly formatted bit inputs:
| Test | Result |
|---|---|
| DirectCircuitModel (ground truth bits) | 100.00% on 10,000 random cases |
| All operations (ADD, SUB, MUL, GT, LT, EQ) | 100.00% each |
This confirms the circuits compute correctly. However, this was already established by eval.py.
Stage 2: LLM Baseline β COMPLETE
SmolLM2-360M-Instruct baseline on randomized 8-bit arithmetic:
| Operation | Accuracy |
|---|---|
| Addition | 35.92% |
| Subtraction | 17.72% |
| Multiplication | 1.25% |
| Comparisons | 0.28β14.37% |
| Overall | 11.90% |
Head-to-head on 50 random cases: SmolLM2 got 7/50 (14%), circuits got 50/50 (100%).
Stage 3: LLM Integration β IN PROGRESS
The actual challenge: train an interface that extracts operands and operations from LLM hidden states (not from pre-formatted bit inputs).
"What is 47 + 86?"
β
[LLM hidden states]
β
BitExtractor (must LEARN: "47" β 00101111, "86" β 01010110)
OpRouter (must LEARN: "+" β add operation)
β
[Frozen threshold circuits]
β
[Result bits] β "133"
The train_passthrough_*.py files demonstrate that routing works when given labels, but this is trivialβthe real test is learning to parse from natural language.
Proof of Concept Scope
- 8-bit operands (0-255)
- Six operations: ADD, SUB, MUL, GT, LT, EQ
- Pure ALU profile (no memory access)
Current status: Circuit validation complete. LLM hidden state extraction in development.
Extension Roadmap
The following extensions are planned after proof-of-concept validation:
16-bit operations (0-65535) β Chain two 8-bit circuits with carry propagation. ADD16: low = ADD8(A_lo, B_lo), high = ADD8(A_hi, B_hi, carry_out). MUL16: four partial products + shift-add. Doubles operand extraction width. This extension is a priority as it dramatically expands the useful range of arithmetic operations.
Parenthetical expressions ((5 + 3) Γ 2 = 16) β Explicit grouping overrides precedence. Parser must recognize parens and build correct tree. Evaluation proceeds innermost-out. Adds complexity to extraction layer.
Multi-operation chains (a + b - c Γ d) β Sequential dispatch through multiple circuits with intermediate result routing. Requires state management in interface layers.
Floating point arithmetic β IEEE 754-style with separate circuits for mantissa and exponent. ADD: align exponents, add mantissas, renormalize. MUL: add exponents, multiply mantissas. Requires sign handling, overflow detection, and rounding logic.
Full CPU integration β Enable memory access circuits for stateful computation. Allows multi-step algorithms executed entirely within threshold logic.
Completed Extensions
3-operand addition (15 + 27 + 33 = 75) β
arithmetic.add3_8bitchains two 8-bit ripple carry stages. 16 full adders, 144 gates, 240 test cases verified.Order of operations (5 + 3 Γ 2 = 11) β
arithmetic.expr_add_mulcomputes A + (B Γ C) using shift-add multiplication then addition. 64 AND gates + 64 full adders, 73 test cases verified.
Files
| File | Description |
|---|---|
neural_computer.safetensors |
15,685 tensors, 43,366 parameters (pure ALU profile) |
eval.py |
Unified evaluation suite (6,738 tests, GPU-batched) |
build.py |
Build tools with configurable memory partitioning |
prune_weights.py |
Weight magnitude pruning (GPU-batched, binary search conflict resolution) |
llm_integration/baseline.py |
SmolLM2-360M arithmetic baseline evaluation (11.90% fitness) |
llm_integration/fitness.py |
Shared fitness function for randomized arithmetic tests |
llm_integration/circuits.py |
Frozen threshold circuit wrapper with STE gradients |
llm_integration/model.py |
Interface layer definitions (BitEncoder, OpRouter, BitDecoder) |
llm_integration/train_passthrough.py |
Scaffolding: trains with pre-formatted bit inputs |
llm_integration/train_passthrough_router.py |
Scaffolding: router-only with ground truth bits |
Build Tool Usage
# Full CPU (64KB memory, default)
python build.py memory --apply
# LLM integration profiles
python build.py --memory-profile none memory --apply # Pure ALU (32K params)
python build.py --memory-profile registers memory --apply # 16-byte register file
python build.py --memory-profile scratchpad memory --apply # 256-byte scratchpad
# Custom memory size
python build.py --addr-bits 6 memory --apply # 64 bytes (2^6)
# Regenerate ALU and input metadata
python build.py alu --apply
python build.py inputs --apply
python build.py all --apply # memory + alu + inputs
Memory profiles:
| Profile | Addr Bits | Memory | Memory Params | Total Params |
|---|---|---|---|---|
none |
0 | 0B | 0 | ~32K |
registers |
4 | 16B | ~2K | ~34K |
scratchpad |
8 | 256B | ~30K | ~63K |
reduced |
12 | 4KB | ~516K | ~549K |
full |
16 | 64KB | ~8.26M | ~8.29M |
Citation
@misc{8bit-threshold-computer,
title={8bit-threshold-computer: A Turing-Complete Threshold Logic CPU},
author={Norton, Charles},
year={2026},
howpublished={Hugging Face},
url={https://huggingface.co/phanerozoic/8bit-threshold-computer}
}
License
MIT
References
- McCulloch & Pitts (1943). "A Logical Calculus of Ideas Immanent in Nervous Activity"
- Muroga (1971). "Threshold Logic and Its Applications"
- Siegelmann & Sontag (1995). "On the Computational Power of Neural Nets"
- Bengio et al. (2013). "Estimating or Propagating Gradients Through Stochastic Neurons"
- Ma et al. (2024). "The Era of 1-bit LLMs" (BitNet b1.58)
- HuggingFace (2024). "SmolLM2: Small Language Models"