	---
	language: en
	license: apache-2.0
	library_name: transformers
	tags:
	- sparse-retrieval
	- information-retrieval
	- bert
	- fles1
	datasets:
	- ms_marco
	metrics:
	- ndcg
	- mrr
	---

	# FLES-1 v14 — Sparse Lexical Encoder (Best Quality)

	> Paper: [Closed-Loop FLOPS Regulation for Learned Sparse Retrieval](https://mindoval.com/ai-research) — Golvis Tavarez, Mindoval, Inc.

	## Model Description

	FLES-1 transforms text into interpretable sparse vectors using BERT's MLM predictions. Each of the 30,522 dimensions corresponds to an entry in BERT's WordPiece vocabulary, so the vectors are readable, debuggable, and compatible with standard inverted indices (Elasticsearch, OpenSearch).
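As a rough illustration of how MLM logits can become a sparse term-weight vector, the sketch below applies the SPLADE-style recipe: a log-saturated ReLU, max pooling over token positions, then a weight threshold. Whether FLES-1 uses exactly this activation and pooling is an assumption. The toy vocabulary, the logits, and the function name `sparsify` are hypothetical, and the default threshold of 0.3 simply mirrors the evaluation setting used in the Metrics section.

```python
import math

def sparsify(logits, vocab, threshold=0.3):
    """Turn per-token MLM logits into a {term: weight} sparse vector.

    logits: one row per input token, each row holding len(vocab) scores.
    Assumed recipe (SPLADE-style, not confirmed for FLES-1):
    weight_j = max over tokens of log(1 + relu(logit_j)).
    """
    weights = {}
    for j, term in enumerate(vocab):
        best = max(math.log1p(max(row[j], 0.0)) for row in logits)
        if best >= threshold:  # drop low-weight terms to keep the vector sparse
            weights[term] = round(best, 2)
    return weights

# Toy 5-term vocabulary and logits for a 2-token input (hypothetical values).
vocab = ["machine", "learning", "machines", "banana", "the"]
logits = [
    [3.0, 0.5, 0.9, -2.0, 0.1],
    [0.2, 2.0, 0.1, -1.0, 0.2],
]
sparse = sparsify(logits, vocab)
print(sparse)
```

Terms with negative or small logits ("banana", "the") fall below the threshold and never reach the index, which is what keeps the number of non-zero entries (NNZ) low.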

	Trained with two novel techniques:
	- L1 FLOPS regularization: curbs the gradient explosions behind training instability in sparse retrieval models (crashes fell from 10-17 to 0-7 per run in our study)
	- Step-interval CLFR: closed-loop sparsity control that adjusts the regularization strength every ~6,250 steps (one epoch in our setup) based on measured sparsity
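The closed-loop idea can be sketched as a controller that leaves the regularization weight alone during an interval, then nudges it toward a sparsity target. The proportional, multiplicative update rule below is a hypothetical stand-in, since the exact CLFR rule is not given here. The target (400), gain (0.1), starting weight (3e-5) and interval (6,250 steps) are taken from the training configuration, while `clfr_update`, `train` and the NNZ measurements are made up for illustration.

```python
def clfr_update(lam, measured_nnz, target_nnz=400.0, gain=0.1):
    """One closed-loop adjustment of the FLOPS weight (hypothetical rule).

    If vectors are denser than the target, raise lambda; if sparser, lower it.
    The proportional, multiplicative form is an assumption for illustration.
    """
    error = (measured_nnz - target_nnz) / target_nnz  # relative sparsity error
    return lam * (1.0 + gain * error)

def train(num_steps, measure_nnz, lam=3e-5, interval=6250):
    """Skeleton loop: lambda is only adjusted once per interval."""
    history = []
    for step in range(1, num_steps + 1):
        # ... forward pass, loss = contrastive + lam * sparsity penalty, backward ...
        if step % interval == 0:
            lam = clfr_update(lam, measure_nnz(step))
            history.append((step, lam))
    return history

# Made-up measurements: too dense after epoch 1, on target after epoch 2.
fake_nnz = {6250: 520.0, 12500: 400.0}
history = train(12500, lambda step: fake_nnz[step])
print(history)
```

With these made-up numbers the weight rises by 3% after the first interval and stays put after the second, because the measured NNZ then matches the target.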

	## Metrics

	### nfcorpus (threshold=0.3)

	| Metric | Value |
	|--------|-------|
	| NDCG@10 | 0.3049 |
	| MRR | 0.5182 |
	| Recall@100 | 0.2544 |
	| Avg NNZ | 359 |

	### Reproducibility

	This recipe was run 5 times with different seeds:

	| Run | NDCG@10 |
	|-----|---------|
	| v14 (original) | 0.305 |
	| v17c | 0.299 |
	| v31a | 0.299 |
	| v32 (seed=7777) | 0.299 |
	| v26a (seed=42) | 0.272 |

	Mean: 0.295. Sample std: 0.013. v14 is the highest of the five runs, so a reproduction should be expected to land near 0.295 ± 0.013 rather than at 0.305.
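The summary statistics can be re-derived from the five runs in the table. This short check uses the sample standard deviation (dividing by n - 1); the dictionary name `ndcg` is just a label for this example.

```python
import statistics

# NDCG@10 per run, copied from the reproducibility table.
ndcg = {"v14": 0.305, "v17c": 0.299, "v31a": 0.299, "v32": 0.299, "v26a": 0.272}

scores = list(ndcg.values())
mean = statistics.mean(scores)
std = statistics.stdev(scores)  # sample standard deviation (n - 1)
print(f"mean={mean:.3f} std={std:.3f}")  # prints "mean=0.295 std=0.013"
```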

	### Baselines

	| Model | NDCG@10 | NNZ | Distillation | Training Data |
	|-------|---------|-----|--------------|---------------|
	| FLES-1 v14 | 0.305 | 359 | None | 200K MS MARCO |
	| BM25 (Pyserini, stemmed) | 0.325 | — | — | — |
	| BM25 (regex, no stemming) | 0.307 | — | — | — |
	| SPLADE-Doc (no distillation) | 0.323 | — | None | Full MS MARCO |
	| SPLADE original (no distillation) | 0.336 | — | None | Full MS MARCO |
	| SPLADE-cocondenser (distilled) | 0.340 | 125 | Cross-encoder | Full MS MARCO |

	FLES-1 v14 is 6% behind Pyserini BM25 (0.325) and 6-10% behind non-distilled SPLADE variants. The paper's contribution is the training methodology (CLFR, L1 FLOPS, lambda-steps tradeoff), not the absolute numbers.
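For readers who want to check the relative gaps, the snippet below computes them from the table, assuming each gap is measured relative to the baseline's score (that convention is an assumption, and the variable names are illustrative).

```python
fles1 = 0.305  # FLES-1 v14 NDCG@10 on nfcorpus

# Baseline NDCG@10 values copied from the table above.
baselines = {
    "BM25 (Pyserini, stemmed)": 0.325,
    "BM25 (regex, no stemming)": 0.307,
    "SPLADE-Doc (no distillation)": 0.323,
    "SPLADE original (no distillation)": 0.336,
    "SPLADE-cocondenser (distilled)": 0.340,
}

# Relative gap: how far FLES-1 trails each baseline, as a fraction of the baseline.
gaps = {name: (score - fles1) / score for name, score in baselines.items()}
for name, gap in gaps.items():
    print(f"{name}: {gap:.1%} behind")
```

Under this convention the gaps come out at roughly 6.2% for Pyserini BM25 and between about 5.6% and 9.2% for the two non-distilled SPLADE variants.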

	### Cross-Domain (zero-shot)

	| Dataset | Domain | NDCG@10 |
	|---------|--------|---------|
	| nfcorpus | Medical | 0.305 |
	| scifact | Scientific claims | 0.557 |
	| fiqa | Financial Q&A | 0.212 |
	| arguana | Argument retrieval | 0.142 |
	| scidocs | Scientific docs | 0.112 |

	### Production

	| Metric | GPU (A100) | CPU |
	|--------|------------|-----|
	| Encoding | 245 docs/sec | 87 docs/sec |
	| Query latency | 10 ms avg | 33 ms avg |
	| Index size (1K docs) | 0.32 MB | — |
	| vs dense 768d | 9.5x smaller | — |

	## Training

	```
	Foundation: fles1-v12b (2 generations from bert-base-uncased)
	Data: 200,000 MS MARCO training examples with random negatives
	Epochs: 2 (12,500 steps)
	Loss: InfoNCE (τ=0.05) + L1 FLOPS (λ_d=0.00003) + anti-collapse
	Controller: Step-interval CLFR, adjusted every ~6,250 steps (target_nnz_d=400, gain=0.1)
	Optimizer: AdamW, lr=2e-5, batch_size=32, 7 negatives per query
	Hardware: 1× A100 80GB, ~2 hours
	```
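To make the loss term more concrete, here is a minimal sketch contrasting the standard FLOPS regularizer used in SPLADE-style training, the sum over terms of the squared batch-mean activation, with an L1 form that drops the square. Reading "L1 FLOPS" as the sum of absolute batch-mean activations is an assumption, not something this card confirms, and all function names are hypothetical. The point of the comparison is the gradient: the squared form's gradient grows with the activations, while the L1 form's gradient stays bounded.

```python
def column_means(batch):
    """Mean activation of each vocabulary term over the batch."""
    n = len(batch)
    return [sum(row[j] for row in batch) / n for j in range(len(batch[0]))]

def flops_squared(batch):
    """Standard FLOPS regularizer: sum_j (mean_i w_ij)^2."""
    return sum(m * m for m in column_means(batch))

def flops_l1(batch):
    """Assumed L1 variant: sum_j |mean_i w_ij|."""
    return sum(abs(m) for m in column_means(batch))

def grad_squared(batch):
    """d/dw_ij of flops_squared = 2 * mean_j / N, which scales with activation."""
    n = len(batch)
    return [2.0 * m / n for m in column_means(batch)]

def grad_l1(batch):
    """d/dw_ij of flops_l1 = sign(mean_j) / N, independent of activation scale."""
    n = len(batch)
    return [((m > 0) - (m < 0)) / n for m in column_means(batch)]

# Two documents, two vocabulary terms (toy activations), then the same batch x100.
batch = [[0.5, 2.0], [1.5, 6.0]]
scaled = [[100 * w for w in row] for row in batch]
print(grad_squared(batch), grad_squared(scaled))  # gradient grows 100x
print(grad_l1(batch), grad_l1(scaled))            # gradient unchanged
```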

	## The CLFR Paper

	> Full paper coming soon.

	This model is the primary result of a 75-run empirical study of training dynamics in sparse retrieval. The study's main techniques and findings:

	- L1 FLOPS regularization (reduces training crashes from 10-17 to 0-7 per run)
	- Epoch-level closed-loop sparsity control (1 adjustment per ~6,250 steps outperforms 12,500 per-step adjustments)
	- The lambda-steps tradeoff (eff_reg = λ × steps, sweet spot 0.10-0.20)
	- The binary contrastive ceiling (0.298 ± 0.007 for InfoNCE with random negatives)
	- Checkpoint archaeology (longitudinal weight analysis across 43 training runs)
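The lambda-steps tradeoff lends itself to a back-of-the-envelope calculation: given a step budget, which regularization weights keep λ × steps inside the reported 0.10-0.20 range? The helper names and the 10,000-step budget below are hypothetical, and the sketch treats λ as constant, so how CLFR's mid-training adjustments enter the product is left open.

```python
def eff_reg(lam, steps):
    """Effective regularization as defined in the list above: lambda * steps."""
    return lam * steps

def lambda_range(steps, low=0.10, high=0.20):
    """Constant lambdas whose eff_reg falls in [low, high] for a step budget."""
    return low / steps, high / steps

# Hypothetical budget of 10,000 steps.
lam_min, lam_max = lambda_range(10_000)
print(f"lambda between {lam_min:.1e} and {lam_max:.1e}")
print(eff_reg(1.5e-5, 10_000))
```

Doubling the step budget halves the admissible λ range, which is the tradeoff the name refers to.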

	## Limitations

	- Trained on MS MARCO (English web Q&A). Transfer to non-English text or specialized domains requires fine-tuning.
	- NNZ=359 is denser than SPLADE-cocondenser (125). For latency-critical deployments, consider [fles1-v12b](../fles1-v12b/) (NNZ=139).
	- The 0.305 result is at the high end of variance for this recipe (mean=0.295).
	- Does not use knowledge distillation. The gap to the distilled SPLADE-cocondenser (about 10%, 0.305 vs. 0.340) is structural.

	## Usage

	```python
	from fles1_encoder import FLES1Encoder

	# Load model
	encoder = FLES1Encoder.from_pretrained("mindoval/fles1-v14")

	# Encode text to sparse vector
	sparse = encoder.encode("What is machine learning?")
	# Returns: {'machine': 1.39, 'learning': 1.08, 'machines': 0.63, ...}

	# Batch encode
	vectors = encoder.encode_batch(["query 1", "query 2"], batch_size=32)

	# Encode to term IDs (for inverted index)
	ids, weights = encoder.encode_to_ids("What is machine learning?")
	```
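Because the output is a term-to-weight mapping, it slots directly into an inverted index. The sketch below is a toy in-memory version, assuming vectors shaped like the `encode` output shown above. The document vectors, `build_index` and `search` are hypothetical, and a real deployment would delegate the postings lists to an engine such as Elasticsearch or OpenSearch.

```python
from collections import defaultdict

def build_index(doc_vectors):
    """Map each term to a postings list of (doc_id, weight)."""
    index = defaultdict(list)
    for doc_id, vector in doc_vectors.items():
        for term, weight in vector.items():
            index[term].append((doc_id, weight))
    return index

def search(index, query_vector, top_k=3):
    """Score documents by the dot product over terms shared with the query."""
    scores = defaultdict(float)
    for term, q_weight in query_vector.items():
        for doc_id, d_weight in index.get(term, []):
            scores[doc_id] += q_weight * d_weight
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)[:top_k]

# Hypothetical document vectors; the query reuses the weights from the usage example.
docs = {
    "d1": {"machine": 1.2, "learning": 0.9, "model": 0.4},
    "d2": {"banana": 1.5, "fruit": 1.1},
    "d3": {"learning": 0.7, "school": 1.0},
}
query = {"machine": 1.39, "learning": 1.08, "machines": 0.63}
results = search(build_index(docs), query)
print(results)
```

Only documents sharing at least one term with the query are touched, so query cost tracks the number of non-zero entries rather than the full 30,522-dimensional space.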

	## License

	Apache 2.0

	Golvis Tavarez — Mindoval, Inc.
	We thank Microsoft Corporation for supporting this research through the Microsoft for Startups program.
	[https://mindoval.com/ai-research](https://mindoval.com/ai-research)