Commit 47a59ac (verified) by Muhamed-Kheir · 1 Parent(s): aa33062

Update README.txt

Files changed (1)
  1. README.txt +75 -18
README.txt CHANGED
@@ -1,18 +1,75 @@
- # Multi-group unique k-mer analysis
-
- This tool compares multiple groups of FASTA sequences (one directory per group) and identifies **k-mers unique to each group** relative to all other groups. It outputs per-group TSV files, a summary Excel file, and two plots.
-
- ## Install
- pip install -r requirements.txt
-
- ## Run
- python kmer_unique.py \
- --group-dirs path/to/groupA path/to/groupB path/to/groupC \
- --k-min 15 --k-max 31 --min-freq 5 \
- --outdir results
-
- ## Outputs
- - `results/unique_k{k}_{group}.tsv` : unique k-mers and counts
- - `results/kmer_summary.xlsx` : summary table across k
- - `results/unique_kmers_per_group.png`
- - `results/total_freq_per_group.png`
+ # K-mer–based Sequence Predictor
+
+ This Space predicts the most likely group of **unknown sequences** using
+ group-specific **unique k-mers** generated by the companion Space:
+
+ **Unique k-mer discovery Space:**
+ https://huggingface.co/spaces/<your-username>/<space-1-name>
+
+ ---
+
+ ## Overview
+
+ This tool assigns each unknown sequence to a group by detecting
+ group-specific k-mers and computing a confidence score.
+ It is designed to work directly with the `kmer_results.zip`
+ produced by the Unique k-mer discovery Space.
+
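The assignment step described in the Overview can be illustrated with a minimal sketch. It is not the Space's actual implementation: the function names, the `group_kmers` mapping, and the confidence definition (the winning group's share of all unique k-mer hits) are assumptions made for illustration only.

```python
from collections import Counter


def kmers(seq, k):
    """Yield every overlapping k-mer of seq (uppercased)."""
    seq = seq.upper()
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def predict_group(seq, group_kmers, k):
    """Return (predicted_group, confidence, hits) for one sequence.

    group_kmers maps a group name to the set of k-mers unique to that group.
    Confidence is ASSUMED to be the winning group's share of all hits;
    the real tool may define it differently.
    """
    hits = Counter()
    for kmer in kmers(seq, k):
        for group, unique in group_kmers.items():
            if kmer in unique:
                hits[group] += 1
    total = sum(hits.values())
    if total == 0:
        # No group-specific k-mer was found: leave the sequence unassigned.
        return None, 0.0, {}
    best, best_hits = hits.most_common(1)[0]
    return best, best_hits / total, dict(hits)


# Hypothetical toy data: two groups with k = 4.
group_kmers = {"groupA": {"ACGT", "CGTA"}, "groupB": {"TTTT"}}
print(predict_group("ACGTACGT", group_kmers, k=4))
# ('groupA', 1.0, {'groupA': 3})
```

Because each k-mer is unique to one group by construction, every hit counts as evidence for exactly one group, which keeps the scoring a simple tally.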
+ ---
+
+ ## Inputs
+
+ ### 1. Unknown sequences
+ Upload one or more FASTA files containing unknown sequences:
+ - `.fa`, `.fasta`, `.fas`, `.fna`
+
+ ### 2. K-mer results ZIP
+ Upload **`kmer_results.zip`** generated by the Unique k-mer discovery Space.
+
+ > This Space only accepts ZIP input for k-mers to ensure compatibility
+ > and reproducibility.
+
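For readers preparing inputs locally, the two input types can be checked with the standard library alone. The sketch below is an assumption-laden illustration, not code from the Space: `read_fasta`, `has_fasta_extension`, and `list_zip_members` are hypothetical helpers, and nothing is assumed about the internal layout of `kmer_results.zip` beyond it being a ZIP archive.

```python
import io
import zipfile

# Extensions listed in the README for unknown-sequence files.
FASTA_EXTENSIONS = (".fa", ".fasta", ".fas", ".fna")


def has_fasta_extension(filename):
    """Return True if filename ends with one of the accepted extensions."""
    return filename.lower().endswith(FASTA_EXTENSIONS)


def read_fasta(handle):
    """Parse FASTA text from an open text handle into (header, sequence) pairs."""
    records, header, chunks = [], None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:].strip(), []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def list_zip_members(zip_bytes):
    """Return the file names stored in a ZIP archive given as raw bytes."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        return archive.namelist()


# Toy FASTA text standing in for an uploaded file.
example = io.StringIO(">seq1 sample\nACGT\nAC\n>seq2\nTTGA\n")
print(read_fasta(example))
```

Listing the archive members before uploading is a cheap way to confirm that the ZIP is intact and is the file produced by the discovery Space.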
+ ---
+
+ ## Parameters
+
+ - **Sequence type**
+   - `dna` or `protein`
+ - **Mode**
+   - **fast**: exact k-mer matching (recommended)
+   - **full**: alignment-based matching + Fisher's exact test + FDR correction (slower)
+ - **Identity / Coverage / FDR**
+   - Used only in *full* mode
+
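The statistical part of *full* mode can be sketched as follows. This is a hedged illustration only: the README does not say which contingency table the tool builds or which FDR procedure it applies, so the 2x2 layout (matched vs. unmatched k-mers, this group vs. all other groups), the one-sided test, and the Benjamini-Hochberg adjustment are assumptions, and both function names are hypothetical. Identity and coverage filtering are not modelled here.

```python
from math import comb


def fisher_right_tail(a, b, c, d):
    """One-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]].

    Probability of observing at least `a` in the top-left cell when all
    row and column totals are held fixed (hypergeometric upper tail).
    """
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)
    upper = min(row1, col1)
    tail = sum(comb(row1, x) * comb(row2, col1 - x) for x in range(a, upper + 1))
    return tail / denom


def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value to the smallest, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical table: 3 of groupA's 3 k-mers matched, 0 of the other 3 did.
print(fisher_right_tail(3, 0, 0, 3))  # 0.05
print(benjamini_hochberg([0.01, 0.04, 0.03, 0.5]))
```

A prediction would then be kept only if its adjusted p-value falls below the chosen FDR threshold; the extra alignment and testing work is why this mode is slower than exact matching.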
+ ---
+
+ ## Outputs
+
+ - **predictions_by_alignment.csv**
+   - One row per sequence
+   - Predicted group and confidence metrics
+ - **predicted_results_summary.png**
+   - Group counts and confidence distribution
+ - **prediction_outputs.zip**
+   - ZIP containing all outputs
+
+ ---
+
+ ## Performance notes
+
+ - The **fast** mode is recommended for large datasets.
+ - The **full** mode is computationally intensive and best suited for
+   small validation sets.
+
+ ---
+
+ ## Citation
+
+ If you use this tool, please cite:
+
+ Muhamed-Kheir TAHA, Institut Pasteur, Paris, France.
+
+ ---
+
+ ## License
+ Others