# K-mer–based Sequence Predictor

This Space predicts the most likely group for each **unknown sequence** using
group-specific **unique k-mers** generated by the companion Space:

**Unique k-mer discovery Space:**  
https://huggingface.co/spaces/<your-username>/<space-1-name>

---

## Overview

This tool assigns each unknown sequence to a group by detecting
group-specific k-mers and computing a confidence score.
It is designed to work directly with the `kmer_results.zip`
produced by the Unique k-mer discovery Space.
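
The assignment idea can be illustrated with a minimal sketch. It is not the Space's actual implementation: the function names, the `group_kmers` dictionary layout, and the confidence definition (hits for the best group divided by all hits) are assumptions made for illustration only.

```python
from collections import Counter


def kmers(seq, k):
    """Yield every overlapping k-mer of seq (uppercased)."""
    seq = seq.upper()
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def predict_group(seq, group_kmers, k):
    """Return (best_group, confidence) for one sequence.

    group_kmers: dict mapping group name -> set of group-specific k-mers.
    confidence: ASSUMED here to be best-group hits / total hits.
    """
    hits = Counter()
    for kmer in kmers(seq, k):
        for group, unique in group_kmers.items():
            if kmer in unique:
                hits[group] += 1
    total = sum(hits.values())
    if total == 0:
        # No group-specific k-mer found: leave the sequence unassigned.
        return None, 0.0
    best, count = hits.most_common(1)[0]
    return best, count / total


# Toy data, purely hypothetical.
group_kmers = {"A": {"ACGT", "CGTA"}, "B": {"TTTT"}}
result = predict_group("ACGTAGG", group_kmers, k=4)
```

In this toy case the sequence contains two k-mers unique to group `A` and none from group `B`, so it is assigned to `A`. A sequence with no matching k-mer is returned as unassigned rather than forced into a group.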

---

## Inputs

### 1. Unknown sequences
Upload one or more FASTA files containing unknown sequences:
- `.fa`, `.fasta`, `.fas`, `.fna`
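
To check a file before uploading it, a small FASTA reader such as the one below can help. This is a sketch using only the standard library; `read_fasta` is a hypothetical helper and not part of this Space.

```python
def read_fasta(lines):
    """Parse an iterable of FASTA text lines into {header: sequence}."""
    records = {}
    header = None
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            header = line[1:].strip()
            records[header] = []
        elif header is not None:
            # Sequence lines may be wrapped; collect and join them later.
            records[header].append(line)
    return {name: "".join(parts) for name, parts in records.items()}


example = [">seq1 sample", "ACG", "T", "", ">seq2", "GG"]
sequences = read_fasta(example)
```

An open file handle can be passed in place of the list, for example `read_fasta(open("unknown.fasta"))`, where the file name is only a placeholder.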

### 2. K-mer results ZIP
Upload **`kmer_results.zip`** generated by the Unique k-mer discovery Space.

> **Note:** This Space only accepts ZIP input for k-mers to ensure compatibility
> and reproducibility.
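
Before uploading, it can be useful to confirm what the archive contains. The sketch below lists the files in a ZIP with Python's standard `zipfile` module. The internal layout of `kmer_results.zip` is not documented here, so the member name in the demo is hypothetical.

```python
import io
import zipfile


def list_zip_members(zip_source):
    """Return the sorted file names stored in a ZIP archive.

    zip_source may be a path or a binary file-like object.
    """
    with zipfile.ZipFile(zip_source) as archive:
        return sorted(
            info.filename for info in archive.infolist() if not info.is_dir()
        )


# Demo on an in-memory archive; the member name is made up for illustration.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("group_A_kmers.txt", "ACGT\n")
buffer.seek(0)
members = list_zip_members(buffer)
```

For a real file, pass its path instead of the buffer, for example `list_zip_members("kmer_results.zip")`.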

---

## Parameters

- **Sequence type**
  - `dna` or `protein`
- **Mode**
  - **fast**: exact k-mer matching (recommended)
  - **full**: alignment-based matching + Fisher test + FDR (slower)
- **Identity / Coverage / FDR**
  - Used only in *full* mode
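
The FDR threshold in *full* mode applies to p-values corrected for multiple testing. The exact correction is not specified here. The sketch below assumes the widely used Benjamini-Hochberg procedure and is not necessarily what this Space runs.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(p_values)
    # Indices of the p-values, sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical p-values, e.g. one Fisher test per candidate group.
q_values = benjamini_hochberg([0.01, 0.04, 0.03, 0.5])
significant = [q <= 0.05 for q in q_values]
```

With an FDR threshold of 0.05, only the results whose adjusted value is at or below 0.05 would be kept.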

---

## Outputs

- **predictions_by_alignment.csv**
  - One row per sequence
  - Predicted group and confidence metrics
- **predicted_results_summary.png**
  - Group counts and confidence distribution
- **prediction_outputs.zip**
  - ZIP containing all outputs
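
The CSV output can be summarized with the standard library. The column names of `predictions_by_alignment.csv` are not listed here, so `predicted_group` in the sketch below is a hypothetical name. Check the header of your own file and adjust it.

```python
import csv
import io
from collections import Counter


def count_by_column(csv_file, column):
    """Count how often each value occurs in one column of a CSV file."""
    reader = csv.DictReader(csv_file)
    if column not in (reader.fieldnames or []):
        raise KeyError(f"column {column!r} not found in {reader.fieldnames}")
    return Counter(row[column] for row in reader)


# In-memory stand-in for the real file; the column names are assumptions.
demo = io.StringIO("sequence_id,predicted_group\ns1,A\ns2,B\ns3,A\n")
counts = count_by_column(demo, "predicted_group")
```

For a real file, open it with `open(path, newline="")` and pass the handle to `count_by_column`.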

---

## Performance notes

- The **fast** mode is recommended for large datasets.
- The **full** mode is computationally intensive and best suited for
  small validation sets.

---

## Citation

If you use this tool, please cite:

Muhamed-Kheir TAHA, Institut Pasteur, Paris, France.

---

## License
Others