# API Reference
Complete reference for all functions and classes in the project.
---
## Data Preprocessing Module
### utils.py
#### String Functions
| Function | Signature | Description |
|----------|-----------|-------------|
| `human_format` | `(num: float) → str` | Format number with K/M/B suffix |
| `hamming` | `(s1: str, s2: str) → int` | Hamming distance between strings |
| `revcomp` | `(str: str) → str` | Reverse complement of DNA |
| `get_qualities` | `(str: str) → List[int]` | ASCII quality to Phred scores |
| `contains_Esp3I_site` | `(str: str) → bool` | Check for restriction site |
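
The descriptions above are brief, so here is a minimal sketch of what three of these helpers are described as doing. It is not the project's implementation: the function bodies, the `_sketch` names, and the Phred+33 offset assumed for quality decoding are illustrative assumptions.

```python
# Illustrative re-implementations; the real functions live in utils.py.

_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def hamming_sketch(s1: str, s2: str) -> int:
    """Count mismatched positions between two equal-length strings."""
    if len(s1) != len(s2):
        raise ValueError("Hamming distance needs equal-length strings")
    return sum(a != b for a, b in zip(s1, s2))


def revcomp_sketch(seq: str) -> str:
    """Complement each base, then reverse the sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def get_qualities_sketch(quals: str) -> list:
    """Decode a FASTQ quality string, assuming the Phred+33 encoding."""
    return [ord(ch) - 33 for ch in quals]


print(hamming_sketch("ACGT", "ACGA"))   # one mismatch, at the last position
print(revcomp_sketch("AACG"))           # CGTT
print(get_qualities_sketch("I5!"))      # [40, 20, 0]
```

Whether the project's `hamming` raises on unequal lengths or truncates to the shorter string is not stated here, so check `utils.py` before relying on either behavior.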
#### File I/O
| Function | Signature | Description |
|----------|-----------|-------------|
| `tqdm_readline` | `(file, pbar) → str` | Read line with progress update |
| `process_paired_fastq_file` | `(f1, f2, callback) → int` | Process paired FASTQ files |
#### Sequence Features
| Function | Signature | Description |
|----------|-----------|-------------|
| `add_flanking` | `(nts: str, len: int) → str` | Add flanking sequences |
| `add_barcode_flanking` | `(nts: str, len: int) → str` | Add barcode flanking |
| `nts_to_vector` | `(nts: str, rna=False) → ndarray` | One-hot encode sequence |
| `folding_to_vector` | `(nts: str) → ndarray` | One-hot encode structure |
| `str_to_vector` | `(str: str, template: str) → ndarray` | Generic one-hot encoding |
| `ei_vec` | `(i: int, len: int) → List[int]` | Create one-hot vector |
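
As a rough illustration of the one-hot encoding these functions are described as producing, the sketch below encodes a string against a template alphabet. The column order (`ACGT`, and `.()` for structures), the all-zero row for unknown symbols, and the use of plain lists instead of an `ndarray` are assumptions made to keep the example dependency-free.

```python
def one_hot_sketch(seq: str, template: str = "ACGT") -> list:
    """Return one row per position, with a single 1 in the column of its symbol."""
    index = {symbol: i for i, symbol in enumerate(template)}
    rows = []
    for symbol in seq:
        row = [0] * len(template)
        if symbol in index:  # symbols outside the template (e.g. N) stay all-zero
            row[index[symbol]] = 1
        rows.append(row)
    return rows


print(one_hot_sketch("AC"))          # [[1, 0, 0, 0], [0, 1, 0, 0]]
print(one_hot_sketch("(.", ".()"))   # [[0, 1, 0], [1, 0, 0]]
```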
#### RNA Structure
| Function | Signature | Description |
|----------|-----------|-------------|
| `rna_fold_structs` | `(seqs, maxBPspan=0) → (structs, mfes)` | Predict structures |
| `compute_structure` | `(seqs) → (struct_oh, structs, mfes)` | One-hot encoded structures |
| `compute_seq_oh` | `(seqs) → ndarray` | One-hot encode sequences |
| `compute_wobbles` | `(seqs, structs) → ndarray` | Identify wobble pairs |
| `create_input_data` | `(seqs) → (seq_oh, struct_oh, wobbles)` | Complete feature extraction |
#### Structure Analysis
| Function | Signature | Description |
|----------|-----------|-------------|
| `find_parentheses` | `(s: str) → Dict[int, int]` | Map base pair positions |
| `compute_bijection` | `(s: str) → ndarray` | Pairing array |
| `compute_wobble_indicator` | `(seq, struct) → List[int]` | Wobble pair flags |
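
To make the structure-analysis helpers more concrete, here is a minimal sketch of parsing a dot-bracket string with a stack and flagging G-U wobble pairs. It is an illustration only: storing each pair in both directions and flagging both partners of a wobble pair are assumptions, and the project's functions may differ.

```python
def find_pairs_sketch(struct: str) -> dict:
    """Map each paired position to its partner in a dot-bracket string."""
    pairs = {}
    stack = []
    for i, ch in enumerate(struct):
        if ch == "(":
            stack.append(i)
        elif ch == ")":
            if not stack:
                raise ValueError("unbalanced ')' at position %d" % i)
            j = stack.pop()
            pairs[j] = i  # opening -> closing
            pairs[i] = j  # closing -> opening (assumed; may be one-way upstream)
    if stack:
        raise ValueError("unbalanced '(' at position %d" % stack[-1])
    return pairs


def wobble_flags_sketch(seq: str, struct: str) -> list:
    """Return 1 at every position that takes part in a G-U (or G-T) pair."""
    flags = [0] * len(seq)
    for i, j in find_pairs_sketch(struct).items():
        if {seq[i], seq[j]} in ({"G", "U"}, {"G", "T"}):
            flags[i] = 1
    return flags


print(find_pairs_sketch("((..))"))
print(wobble_flags_sketch("GCAAGU", "((..))"))  # [1, 0, 0, 0, 0, 1]
```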
---
### RNAutils.py
| Function | Signature | Description |
|----------|-----------|-------------|
| `RNAfold` | `(seqs, bin, temp, span, cmd) → List[[str, float]]` | MFE structure prediction |
| `RNAsubopt` | `(seq, bin, delta) → List[(str, float)]` | Suboptimal structures |
| `RNAsample` | `(seqs, bin, temp, n, span) → List[List[str]]` | Boltzmann sampling |
| `RNA_partition_function` | `(seqs, constraints, ...) → List[float]` | Partition function |
---
### compute_coupling.py
| Function | Signature | Description |
|----------|-----------|-------------|
| `collect_barcodes` | `(r1, r2, r1_q, r2_q) → None` | Extract barcode-exon pairs |
**Global variables:** `couplings`, `good_reads`, `reads_with_N`, `unidentified_reads`
---
### compute_splicing_outcomes.py
| Function | Signature | Description |
|----------|-----------|-------------|
| `identify_splicing_pattern` | `(r1, r2, r1_q, r2_q) → None` | Classify splicing |
**Splicing categories:** `num_exon_inclusion`, `num_exon_skipping`, `num_intron_retention`, `num_splicing_in_exon`, `num_unknown_splicing`
---
### generate_training_data.py
| Function | Signature | Description |
|----------|-----------|-------------|
| `read_dataset` | `(path, filter=True) → DataFrame` | Load filtered CSV |
| `to_input_data` | `(df, flanking=10) → tuple` | Create model inputs |
| `to_target_data` | `(df) → ndarray` | Compute PSI values |
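
PSI (percent spliced in) is commonly defined as the fraction of reads supporting exon inclusion. The sketch below assumes that definition, using only inclusion and skipping counts. Which read categories `to_target_data` actually uses is not documented here, so treat both the formula and the function name as assumptions.

```python
def psi_sketch(num_inclusion: int, num_skipping: int) -> float:
    """Fraction of reads supporting inclusion: inclusion / (inclusion + skipping)."""
    total = num_inclusion + num_skipping
    if total == 0:
        return float("nan")  # no informative reads for this exon
    return num_inclusion / total


print(psi_sketch(30, 10))  # 0.75
```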
---
## Model Training Module
### model.py
#### Custom Layers
| Class | Purpose | Key Parameters |
|-------|---------|----------------|
| `Selector` | Select between inputs | `trainable=False` |
| `ResidualTuner` | Residual MLP | `hidden_units=100` |
| `SumDiff` | Energy difference | `freeze=False` |
| `RegularizedBiasLayer` | Position bias | Regularization params |
#### Regularizers
| Class/Function | Purpose |
|----------------|---------|
| `MultiRegularizer` | Combined regularizer |
| `pos_reg` | L2 position penalty |
| `adj_reg_fo` | First-order smoothness |
| `adj_reg_so` | Second-order smoothness |
#### Functions
| Function | Signature | Description |
|----------|-----------|-------------|
| `binary_KL` | `(y_true, y_pred) → scalar` | Binary KL divergence loss |
| `regularized_act` | `(x, reg, act) → tensor` | Activation with regularization |
| `train_model` | `(model, X, y, file, ...) → history` | Train with checkpointing |
| `get_model` | `(**kwargs) → Model` | Create model instance |
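
For reference, the binary KL divergence between a target probability `y` and a prediction `p` is `y*log(y/p) + (1-y)*log((1-y)/(1-p))`. The scalar sketch below evaluates that formula. The clipping constant `eps` and the scalar (non-tensor) form are assumptions, and the project's `binary_KL` may reduce over a batch differently.

```python
import math


def binary_kl_sketch(y_true: float, y_pred: float, eps: float = 1e-7) -> float:
    """KL divergence between Bernoulli(y_true) and Bernoulli(y_pred)."""
    # Clip both values away from 0 and 1 so the logarithms stay finite.
    y = min(max(y_true, eps), 1.0 - eps)
    p = min(max(y_pred, eps), 1.0 - eps)
    return y * math.log(y / p) + (1.0 - y) * math.log((1.0 - y) / (1.0 - p))


print(binary_kl_sketch(0.5, 0.5))   # 0.0: identical distributions
print(binary_kl_sketch(0.5, 0.25))  # positive: prediction differs from target
```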
#### get_model() Parameters
| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `input_length` | int | 90 | Sequence length |
| `randomized_region` | tuple | (10, 80) | Exon position |
| `num_filters` | int | 20 | Sequence filters |
| `num_structure_filters` | int | 8 | Structure filters |
| `filter_width` | int | 6 | Sequence filter size |
| `structure_filter_width` | int | 30 | Structure filter size |
| `dropout_rate` | float | 0.01 | Dropout probability |
| `activity_regularization` | float | 0.0 | Activation L1 |
| `position_regularization` | float | 2.5e-5 | Position L2 |
| `adjacency_regularization` | float | 0.0 | First-order smoothness |
| `adjacency_regularization_so` | float | 0.0 | Second-order smoothness |
| `energy_activation` | str | "softplus" | Energy activation |
| `tune_energy` | bool | True | Train energy params |
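
The defaults in the table can be collected into a dictionary and passed through `**kwargs`. The sketch below only restates the table and checks that the documented values fit together. Reading `randomized_region` as a half-open `(start, end)` index pair is an assumption, and the `get_model` call is left commented out because it needs the project on the import path.

```python
# Defaults copied from the parameter table above.
GET_MODEL_DEFAULTS = {
    "input_length": 90,
    "randomized_region": (10, 80),
    "num_filters": 20,
    "num_structure_filters": 8,
    "filter_width": 6,
    "structure_filter_width": 30,
    "dropout_rate": 0.01,
    "activity_regularization": 0.0,
    "position_regularization": 2.5e-5,
    "adjacency_regularization": 0.0,
    "adjacency_regularization_so": 0.0,
    "energy_activation": "softplus",
    "tune_energy": True,
}

# Assumed interpretation: positions [start, end) hold the exon.
start, end = GET_MODEL_DEFAULTS["randomized_region"]
exon_length = end - start
right_flank = GET_MODEL_DEFAULTS["input_length"] - end
print(exon_length, start, right_flank)  # 70 10 10

# With the project importable, overriding one default might look like:
# from model_training.model import get_model
# model = get_model(**{**GET_MODEL_DEFAULTS, "dropout_rate": 0.05})
```

Under that reading, the defaults describe a 70 nt exon with 10 nt of flanking sequence on each side, which matches the lengths used in the usage examples.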
---
## Figures Module
### force_plot.py
| Function | Signature | Description |
|----------|-----------|-------------|
| `get_link_midpoint` | `(fn, mid, eps, ...) → float` | Find sigmoid midpoint |
| `collapse_filters` | `(act_i, act_s, ...) → (df_i, df_s)` | Group filter activations |
| `create_force_data` | `(act_i, act_s, ...) → (Series, Series)` | Aggregate forces |
| `merge_small_forces` | `(forces, thresh) → Series` | Combine small contributions |
| `draw_force_plot` | `(seqs, annots, ...) → Figure` | Create visualization |
### sequence_logo.py
| Function | Signature | Description |
|----------|-----------|-------------|
| `plot_logo` | `(df, thresh, ax, colors) → None` | Draw sequence logo |
| `compute_freqs` | `(kmers) → DataFrame` | Nucleotide frequencies |
| `compute_info` | `(freqs) → ndarray` | Information content |
| `compute_heights` | `(freqs) → DataFrame` | Logo heights |
| `sequence_logo_heights` | `(df) → DataFrame` | Combined calculation |
| `draw_floating_logo` | `(heights, ..., ax) → None` | Overlay logo on axes |
| `compute_EDLogo_scores` | `(kmers, normed) → DataFrame` | Enrichment/depletion |
| `plot_EDLogo` | `(df, thresh, ax) → None` | Draw ED logo |
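
In a conventional sequence logo, the information content of a position is the maximum entropy minus the observed entropy, and each letter's height is its frequency times that information. The sketch below shows this for a single column. Whether `compute_info` and `compute_heights` apply a small-sample correction or another variant is not stated here, so this is an assumed textbook version with hypothetical function names.

```python
import math


def column_info_sketch(freqs: dict) -> float:
    """Information content in bits: log2(alphabet size) minus Shannon entropy."""
    entropy = -sum(p * math.log2(p) for p in freqs.values() if p > 0)
    return math.log2(len(freqs)) - entropy


def column_heights_sketch(freqs: dict) -> dict:
    """Scale each letter's frequency by the column's information content."""
    info = column_info_sketch(freqs)
    return {nt: p * info for nt, p in freqs.items()}


conserved = {"A": 1.0, "C": 0.0, "G": 0.0, "T": 0.0}
uniform = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
print(column_info_sketch(conserved))  # 2.0 bits: fully conserved position
print(column_info_sketch(uniform))    # 0.0 bits: no preference
```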
### draw_stem_loop.py
| Function | Signature | Description |
|----------|-----------|-------------|
| `draw_line` | `(d, x1, y1, x2, y2, color) → None` | SVG line |
| `draw_nucleotide` | `(d, x, y, nt, color) → None` | SVG nucleotide circle |
| `draw_oligo` | `(d, xs, ys, nts, colors) → None` | SVG oligonucleotide |
| `draw_stem_loop` | `(nts, stem_len, colors, file) → None` | Complete stem-loop SVG |
### kl.py
| Function | Signature | Description |
|----------|-----------|-------------|
| `knn_distance` | `(point, sample, k) → float` | k-NN distance |
| `verify_sample_shapes` | `(s1, s2, k) → None` | Validate input shapes |
| `naive_estimator` | `(s1, s2, k) → float` | Brute-force KL estimate |
| `scipy_estimator` | `(s1, s2, k) → float` | KDTree-based KL |
| `skl_estimator` | `(s1, s2, k) → float` | sklearn-based KL |
| `skl_estimator_efficient` | `(s1, s2, k) → float` | Vectorized KL |
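
A common k-nearest-neighbor estimator of KL divergence compares, for each point in the first sample, the distance to its k-th neighbor within that sample against the distance to its k-th neighbor in the second sample. The brute-force sketch below assumes that estimator. It is not taken from `kl.py`, the `_sketch` names are hypothetical, and it does not handle duplicate points (which would give zero distances).

```python
import math


def knn_distance_sketch(point, sample, k):
    """Euclidean distance from `point` to its k-th nearest neighbour in `sample`."""
    return sorted(math.dist(point, other) for other in sample)[k - 1]


def naive_kl_sketch(s1, s2, k=1):
    """Estimate KL(P || Q) from samples s1 ~ P and s2 ~ Q (lists of tuples)."""
    n, m = len(s1), len(s2)
    d = len(s1[0])  # dimensionality of each point
    total = 0.0
    for i, point in enumerate(s1):
        rest = s1[:i] + s1[i + 1:]                # exclude the point itself
        rho = knn_distance_sketch(point, rest, k)  # k-NN distance within s1
        nu = knn_distance_sketch(point, s2, k)     # k-NN distance into s2
        total += math.log(nu / rho)
    return (d / n) * total + math.log(m / (n - 1))


s1 = [(0.0,), (1.0,), (2.0,), (3.0,)]
s2 = [(0.5,), (1.5,), (2.5,), (3.5,)]
print(naive_kl_sketch(s1, s2, k=1))
```

The estimate can be negative for small samples, as it is here, even though the true divergence never is.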
### generate_custom_model.py
| Function | Signature | Description |
|----------|-----------|-------------|
| `lanczos_kernel` | `(x, order) → ndarray` | Lanczos interpolation kernel |
| `lanczos_interpolate` | `(arr, positions, order) → ndarray` | Interpolate at positions |
| `lanczos_resampling` | `(arr, new_len, order) → ndarray` | Resample to new length |
| `resample_one_positional_bias` | `(weights, len, pad) → ndarray` | Resample position bias |
| `resample_positional_bias_weights` | `(weights, len, pad) → ndarray` | Resample all biases |
| `generate_custom_model` | `(new_len, delta_basal) → Model` | Create modified model |
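
The Lanczos kernel of order `a` is `sinc(x) * sinc(x / a)` for `|x| < a` and zero elsewhere, and interpolation is a kernel-weighted sum of nearby samples. The scalar sketch below illustrates both. The default order of 3, the edge clamping, and the scalar interface are assumptions and may not match `generate_custom_model.py`.

```python
import math


def lanczos_kernel_sketch(x: float, order: int = 3) -> float:
    """Lanczos window: sinc(x) * sinc(x / order) inside (-order, order), else 0."""
    if x == 0:
        return 1.0
    if abs(x) >= order:
        return 0.0
    px = math.pi * x
    return order * math.sin(px) * math.sin(px / order) / (px * px)


def lanczos_interpolate_sketch(arr, position: float, order: int = 3) -> float:
    """Interpolate `arr` at a fractional index using 2 * order neighbouring samples."""
    base = math.floor(position)
    total = 0.0
    for i in range(base - order + 1, base + order + 1):
        j = min(max(i, 0), len(arr) - 1)  # clamp at the edges (assumed policy)
        total += arr[j] * lanczos_kernel_sketch(position - i, order)
    return total


values = [1.0, 2.0, 4.0, 8.0, 16.0, 32.0]
print(lanczos_interpolate_sketch(values, 2.0))  # ~4.0: integer positions reproduce samples
print(lanczos_interpolate_sketch(values, 2.5))  # a value between the neighbouring samples
```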
### figutils.py
| Function | Signature | Description |
|----------|-----------|-------------|
| `subsample_points` | `(x, y, max) → (x, y)` | Random subsampling |
| `scatter_with_kde` | `(x, y, ax, alpha) → None` | Density scatter plot |
| `safelog` | `(x, tol) → ndarray` | Numerically safe log |
| `bin_kl` | `(y_true, y_pred) → ndarray` | Binary KL divergence |
| `flatten_dict` | `(d) → (keys, values)` | Flatten nested dict |
| `insert_motif_in_middle_of_sequence` | `(seq, motif) → str` | Insert motif |
| `insert_motif_in_middle_of_sequences` | `(seqs, motif) → Dict` | Batch insert |
| `landing_pads_to_sw_exons` | `(mers, motif, pre, post) → List` | Create landing pads |
| `all_seqs` | `(length) → List[str]` | Generate all k-mers |
| `extract_str_patches` | `(lst, n) → List[List[str]]` | Extract n-grams |
| `compute_activations_simple_conv` | `(layer, window) → Dict` | k-mer activations |
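
Two of the string helpers above are simple enough to sketch directly: enumerating every k-mer of a given length and cutting strings into overlapping n-grams. The `ACGT` alphabet, the output ordering, and the `_sketch` names are assumptions, and the project's versions may differ.

```python
from itertools import product


def all_seqs_sketch(length: int, alphabet: str = "ACGT") -> list:
    """Enumerate every string of the given length over the alphabet."""
    return ["".join(letters) for letters in product(alphabet, repeat=length)]


def extract_str_patches_sketch(seqs: list, n: int) -> list:
    """For each string, list its overlapping substrings of length n."""
    return [[s[i:i + n] for i in range(len(s) - n + 1)] for s in seqs]


print(len(all_seqs_sketch(3)))                    # 64 = 4 ** 3
print(extract_str_patches_sketch(["ACGT"], 2))    # [['AC', 'CG', 'GT']]
```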
---
## Usage Examples
### Making Predictions
```python
from model_training.model import binary_KL, Selector, ResidualTuner, SumDiff, RegularizedBiasLayer
import tensorflow as tf
from joblib import load
# Load model
model = tf.keras.models.load_model(
    'output/custom_adjacency_regularizer_20210731_124_step3.h5',
    custom_objects={
        'binary_KL': binary_KL,
        'Selector': Selector,
        'ResidualTuner': ResidualTuner,
        'SumDiff': SumDiff,
        'RegularizedBiasLayer': RegularizedBiasLayer,
    }
)
# Load test data
xTe = load('data/xTe_ES7_HeLa_ABC.pkl.gz')
yTe = load('data/yTe_ES7_HeLa_ABC.pkl.gz')
# Predict
predictions = model.predict(xTe)
```
### Creating Force Plots
```python
import sys
sys.path.append('figures')
from force_plot import draw_force_plot
fig = draw_force_plot(
    sequences=['ATGC' * 22 + 'AT'],  # 90 nt
    annotations=['My Sequence'],
)
fig.savefig('my_force_plot.pdf')
```
### Processing New Sequences
```python
from data_preprocessing.utils import add_flanking, create_input_data
exon = 'ACGT' * 17 + 'AC' # 70 nt
full_seq = add_flanking(exon, 10) # 90 nt
seq_oh, struct_oh, wobbles = create_input_data([full_seq])
# Use with the `model` loaded in "Making Predictions"
X = [seq_oh, struct_oh, wobbles]
psi = model.predict(X)[0, 0]
print(f"Predicted PSI: {psi:.3f}")
```