# API Reference

Complete reference for all functions and classes in the project.

---

## Data Preprocessing Module

### utils.py

#### String Functions

| Function | Signature | Description |
|----------|-----------|-------------|
| `human_format` | `(num: float) → str` | Format number with K/M/B suffix |
| `hamming` | `(s1: str, s2: str) → int` | Hamming distance between strings |
| `revcomp` | `(str: str) → str` | Reverse complement of DNA |
| `get_qualities` | `(str: str) → List[int]` | ASCII quality to Phred scores |
| `contains_Esp3I_site` | `(str: str) → bool` | Check for restriction site |
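
To illustrate what the string helpers compute, the sketch below reimplements three of them in plain Python. It is not the project's code: the names ending in `_sketch` are hypothetical, and the Phred+33 offset for quality strings is an assumption (the common Sanger / Illumina 1.8+ encoding), not something this reference states.

```python
# Illustrative re-implementations; every *_sketch name is hypothetical.

_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def hamming_sketch(s1: str, s2: str) -> int:
    """Count the positions at which two equal-length strings differ."""
    if len(s1) != len(s2):
        raise ValueError("Hamming distance needs equal-length strings")
    return sum(a != b for a, b in zip(s1, s2))


def revcomp_sketch(seq: str) -> str:
    """Reverse complement of a DNA string; N and other symbols pass through."""
    return seq.translate(_COMPLEMENT)[::-1]


def get_qualities_sketch(quals: str, offset: int = 33) -> list:
    """Convert an ASCII quality string to Phred scores (assumes Phred+33)."""
    return [ord(c) - offset for c in quals]
```

For example, `revcomp_sketch("AACG")` complements to `TTGC` and then reverses to `CGTT`.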

#### File I/O

| Function | Signature | Description |
|----------|-----------|-------------|
| `tqdm_readline` | `(file, pbar) → str` | Read line with progress update |
| `process_paired_fastq_file` | `(f1, f2, callback) → int` | Process paired FASTQ files |

#### Sequence Features

| Function | Signature | Description |
|----------|-----------|-------------|
| `add_flanking` | `(nts: str, len: int) → str` | Add flanking sequences |
| `add_barcode_flanking` | `(nts: str, len: int) → str` | Add barcode flanking |
| `nts_to_vector` | `(nts: str, rna=False) → ndarray` | One-hot encode sequence |
| `folding_to_vector` | `(nts: str) → ndarray` | One-hot encode structure |
| `str_to_vector` | `(str: str, template: str) → ndarray` | Generic one-hot encoding |
| `ei_vec` | `(i: int, len: int) → List[int]` | Create one-hot vector |
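
The three encoding helpers plausibly build on one another: a unit vector per symbol, a generic encoder over a template alphabet, and a nucleotide wrapper. The sketch below shows that layering under two assumptions that this reference does not confirm: the alphabet order is `ACGT` (`ACGU` when `rna=True`), and nested lists stand in for the `ndarray` the real functions return.

```python
# Hypothetical sketch of the one-hot helpers; alphabet order is an assumption.

def ei_vec_sketch(i: int, length: int) -> list:
    """Unit vector of the given length with a 1 at index i."""
    return [1 if j == i else 0 for j in range(length)]


def str_to_vector_sketch(s: str, template: str) -> list:
    """One-hot encode each character of s by its index in template."""
    # template.index raises ValueError for symbols outside the alphabet.
    return [ei_vec_sketch(template.index(ch), len(template)) for ch in s]


def nts_to_vector_sketch(nts: str, rna: bool = False) -> list:
    """One-hot encode a nucleotide string over ACGT (or ACGU for RNA)."""
    return str_to_vector_sketch(nts, "ACGU" if rna else "ACGT")
```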

#### RNA Structure

| Function | Signature | Description |
|----------|-----------|-------------|
| `rna_fold_structs` | `(seqs, maxBPspan=0) → (structs, mfes)` | Predict structures |
| `compute_structure` | `(seqs) → (struct_oh, structs, mfes)` | One-hot encoded structures |
| `compute_seq_oh` | `(seqs) → ndarray` | One-hot encode sequences |
| `compute_wobbles` | `(seqs, structs) → ndarray` | Identify wobble pairs |
| `create_input_data` | `(seqs) → (seq_oh, struct_oh, wobbles)` | Complete feature extraction |

#### Structure Analysis

| Function | Signature | Description |
|----------|-----------|-------------|
| `find_parentheses` | `(s: str) → Dict[int, int]` | Map base pair positions |
| `compute_bijection` | `(s: str) → ndarray` | Pairing array |
| `compute_wobble_indicator` | `(seq, struct) → List[int]` | Wobble pair flags |
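
Mapping base pairs in a dot-bracket structure is a classic stack problem, and wobble flags follow directly from the pair map. The sketch below is an illustration only: the `_sketch` names are hypothetical, and both the direction of the mapping (opening index to closing index) and the choice to flag both partners of a G-U pair are assumptions.

```python
# Hypothetical sketch: parse dot-bracket notation and flag G-U wobble pairs.

WOBBLE_PAIRS = ({"G", "U"}, {"G", "T"})  # accept RNA and DNA alphabets


def find_parentheses_sketch(struct: str) -> dict:
    """Map each opening-bracket index to the index of its closing bracket."""
    stack, pairs = [], {}
    for i, ch in enumerate(struct):
        if ch == "(":
            stack.append(i)
        elif ch == ")":
            if not stack:
                raise ValueError(f"unmatched ')' at position {i}")
            pairs[stack.pop()] = i
    if stack:
        raise ValueError(f"unmatched '(' at position {stack[-1]}")
    return pairs


def wobble_indicator_sketch(seq: str, struct: str) -> list:
    """Return 1 at every position that takes part in a G-U (or G-T) pair."""
    flags = [0] * len(seq)
    for i, j in find_parentheses_sketch(struct).items():
        if {seq[i].upper(), seq[j].upper()} in WOBBLE_PAIRS:
            flags[i] = flags[j] = 1
    return flags
```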

---

### RNAutils.py

| Function | Signature | Description |
|----------|-----------|-------------|
| `RNAfold` | `(seqs, bin, temp, span, cmd) → List[[str, float]]` | MFE structure prediction |
| `RNAsubopt` | `(seq, bin, delta) → List[(str, float)]` | Suboptimal structures |
| `RNAsample` | `(seqs, bin, temp, n, span) → List[List[str]]` | Boltzmann sampling |
| `RNA_partition_function` | `(seqs, constraints, ...) → List[float]` | Partition function |

---

### compute_coupling.py

| Function | Signature | Description |
|----------|-----------|-------------|
| `collect_barcodes` | `(r1, r2, r1_q, r2_q) → None` | Extract barcode-exon pairs |

**Global variables:** `couplings`, `good_reads`, `reads_with_N`, `unidentified_reads`

---

### compute_splicing_outcomes.py

| Function | Signature | Description |
|----------|-----------|-------------|
| `identify_splicing_pattern` | `(r1, r2, r1_q, r2_q) → None` | Classify splicing |

**Splicing categories:** `num_exon_inclusion`, `num_exon_skipping`, `num_intron_retention`, `num_splicing_in_exon`, `num_unknown_splicing`

---

### generate_training_data.py

| Function | Signature | Description |
|----------|-----------|-------------|
| `read_dataset` | `(path, filter=True) → DataFrame` | Load filtered CSV |
| `to_input_data` | `(df, flanking=10) → tuple` | Create model inputs |
| `to_target_data` | `(df) → ndarray` | Compute PSI values |
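
PSI (percent spliced in) is conventionally the fraction of reads that include the exon. The sketch below uses the textbook ratio of inclusion reads to inclusion plus skipping reads. Whether `to_target_data` uses exactly these two counts, or also accounts for the other splicing categories, is not stated here, so treat the formula and the hypothetical `psi_sketch` name as assumptions.

```python
import math


def psi_sketch(n_inclusion: int, n_skipping: int) -> float:
    """Textbook PSI: inclusion / (inclusion + skipping); NaN without reads."""
    total = n_inclusion + n_skipping
    if total == 0:
        return math.nan  # no evidence either way
    return n_inclusion / total
```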

---

## Model Training Module

### model.py

#### Custom Layers

| Class | Purpose | Key Parameters |
|-------|---------|----------------|
| `Selector` | Select between inputs | `trainable=False` |
| `ResidualTuner` | Residual MLP | `hidden_units=100` |
| `SumDiff` | Energy difference | `freeze=False` |
| `RegularizedBiasLayer` | Position bias | Regularization params |

#### Regularizers

| Class/Function | Purpose |
|----------------|---------|
| `MultiRegularizer` | Combined regularizer |
| `pos_reg` | L2 position penalty |
| `adj_reg_fo` | First-order smoothness |
| `adj_reg_so` | Second-order smoothness |
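
Smoothness regularizers of this kind usually penalize differences between neighbouring weights. The sketch below shows one common form, assuming squared finite differences scaled by a strength factor. The function names are hypothetical, and the project's regularizers may use a different norm or scaling.

```python
# Hypothetical sketch of first- and second-order smoothness penalties.

def first_order_penalty(weights, strength: float) -> float:
    """strength * sum of squared differences between adjacent weights."""
    return strength * sum((b - a) ** 2 for a, b in zip(weights, weights[1:]))


def second_order_penalty(weights, strength: float) -> float:
    """strength * sum of squared second differences (discrete curvature)."""
    triples = zip(weights, weights[1:], weights[2:])
    return strength * sum((a - 2 * b + c) ** 2 for a, b, c in triples)
```

A linear ramp such as `[0, 1, 2, 3]` incurs a first-order penalty but no second-order penalty, which is why the second-order form tolerates steady trends while still discouraging kinks.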

#### Functions

| Function | Signature | Description |
|----------|-----------|-------------|
| `binary_KL` | `(y_true, y_pred) → scalar` | Binary KL divergence loss |
| `regularized_act` | `(x, reg, act) → tensor` | Activation with regularization |
| `train_model` | `(model, X, y, file, ...) → history` | Train with checkpointing |
| `get_model` | `(**kwargs) → Model` | Create model instance |
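
For reference, the binary KL divergence between a target probability $y$ and a prediction $p$ is $y\log(y/p) + (1-y)\log((1-y)/(1-p))$. The scalar sketch below evaluates it in plain Python. The clipping constant `eps` is an assumption added to avoid `log(0)`, and the project's `binary_KL` operates on tensors rather than floats.

```python
import math


def binary_kl_sketch(y_true: float, y_pred: float, eps: float = 1e-7) -> float:
    """KL divergence between Bernoulli(y_true) and Bernoulli(y_pred)."""
    # Clip both probabilities away from 0 and 1 so the logs stay finite.
    y = min(max(y_true, eps), 1.0 - eps)
    p = min(max(y_pred, eps), 1.0 - eps)
    return y * math.log(y / p) + (1.0 - y) * math.log((1.0 - y) / (1.0 - p))
```

The loss is zero when the prediction matches the target and grows as they diverge.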

#### get_model() Parameters

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `input_length` | int | 90 | Sequence length |
| `randomized_region` | tuple | (10, 80) | Exon position |
| `num_filters` | int | 20 | Sequence filters |
| `num_structure_filters` | int | 8 | Structure filters |
| `filter_width` | int | 6 | Sequence filter size |
| `structure_filter_width` | int | 30 | Structure filter size |
| `dropout_rate` | float | 0.01 | Dropout probability |
| `activity_regularization` | float | 0.0 | Activation L1 |
| `position_regularization` | float | 2.5e-5 | Position L2 |
| `adjacency_regularization` | float | 0.0 | First-order smoothness |
| `adjacency_regularization_so` | float | 0.0 | Second-order smoothness |
| `energy_activation` | str | "softplus" | Energy activation |
| `tune_energy` | bool | True | Train energy params |
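
Since `get_model` takes keyword arguments, one way to override a default is to merge a small dictionary of changes into the documented defaults and check the result before building the model. The sketch below copies a subset of the defaults from the table above. The override value and the consistency check are illustrative assumptions, and the final call is commented out because it needs the project's `model_training.model` module.

```python
# Subset of the documented defaults; the override below is hypothetical.
defaults = {
    "input_length": 90,
    "randomized_region": (10, 80),
    "num_filters": 20,
    "num_structure_filters": 8,
    "filter_width": 6,
    "structure_filter_width": 30,
    "dropout_rate": 0.01,
    "position_regularization": 2.5e-5,
    "energy_activation": "softplus",
    "tune_energy": True,
}

overrides = {"dropout_rate": 0.05}
kwargs = {**defaults, **overrides}

# Sanity check: the randomized region must lie inside the input sequence.
start, end = kwargs["randomized_region"]
if not 0 <= start < end <= kwargs["input_length"]:
    raise ValueError("randomized_region must fit within input_length")

# from model_training.model import get_model
# model = get_model(**kwargs)
```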

---

## Figures Module

### force_plot.py

| Function | Signature | Description |
|----------|-----------|-------------|
| `get_link_midpoint` | `(fn, mid, eps, ...) → float` | Find sigmoid midpoint |
| `collapse_filters` | `(act_i, act_s, ...) → (df_i, df_s)` | Group filter activations |
| `create_force_data` | `(act_i, act_s, ...) → (Series, Series)` | Aggregate forces |
| `merge_small_forces` | `(forces, thresh) → Series` | Combine small contributions |
| `draw_force_plot` | `(seqs, annots, ...) → Figure` | Create visualization |

### sequence_logo.py

| Function | Signature | Description |
|----------|-----------|-------------|
| `plot_logo` | `(df, thresh, ax, colors) → None` | Draw sequence logo |
| `compute_freqs` | `(kmers) → DataFrame` | Nucleotide frequencies |
| `compute_info` | `(freqs) → ndarray` | Information content |
| `compute_heights` | `(freqs) → DataFrame` | Logo heights |
| `sequence_logo_heights` | `(df) → DataFrame` | Combined calculation |
| `draw_floating_logo` | `(heights, ..., ax) → None` | Overlay logo on axes |
| `compute_EDLogo_scores` | `(kmers, normed) → DataFrame` | Enrichment/depletion |
| `plot_EDLogo` | `(df, thresh, ax) → None` | Draw ED logo |
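
The frequency and information steps behind a sequence logo can be sketched without any plotting. The version below assumes a four-letter `ACGT` alphabet, returns plain lists instead of a `DataFrame`, and uses the standard $2 - H$ bits per position with no small-sample correction. The function names are hypothetical, and the project's helpers may differ on each of these points.

```python
import math
from collections import Counter

ALPHABET = "ACGT"  # assumed alphabet


def position_freqs_sketch(kmers):
    """Per-position nucleotide frequencies for equal-length k-mers."""
    freqs = []
    for pos in range(len(kmers[0])):
        counts = Counter(kmer[pos] for kmer in kmers)
        freqs.append({nt: counts[nt] / len(kmers) for nt in ALPHABET})
    return freqs


def information_content_sketch(freqs):
    """Information per position in bits: log2(4) minus Shannon entropy."""
    max_bits = math.log2(len(ALPHABET))
    return [
        max_bits + sum(p * math.log2(p) for p in col.values() if p > 0)
        for col in freqs
    ]
```

A fully conserved position carries 2 bits, and a uniformly distributed one carries 0.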

### draw_stem_loop.py

| Function | Signature | Description |
|----------|-----------|-------------|
| `draw_line` | `(d, x1, y1, x2, y2, color) → None` | SVG line |
| `draw_nucleotide` | `(d, x, y, nt, color) → None` | SVG nucleotide circle |
| `draw_oligo` | `(d, xs, ys, nts, colors) → None` | SVG oligonucleotide |
| `draw_stem_loop` | `(nts, stem_len, colors, file) → None` | Complete stem-loop SVG |

### kl.py

| Function | Signature | Description |
|----------|-----------|-------------|
| `knn_distance` | `(point, sample, k) → float` | k-NN distance |
| `verify_sample_shapes` | `(s1, s2, k) → None` | Validate input shapes |
| `naive_estimator` | `(s1, s2, k) → float` | Brute-force KL estimate |
| `scipy_estimator` | `(s1, s2, k) → float` | KDTree-based KL |
| `skl_estimator` | `(s1, s2, k) → float` | sklearn-based KL |
| `skl_estimator_efficient` | `(s1, s2, k) → float` | Vectorized KL |
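
The function names suggest the k-nearest-neighbour estimator of KL divergence between two samples, which compares each point's k-th neighbour distance within its own sample (`rho`) to its k-th neighbour distance in the other sample (`nu`). The brute-force sketch below implements that textbook estimator, $\frac{d}{n}\sum_i \log(\nu_i/\rho_i) + \log\frac{m}{n-1}$. It is an assumption that `naive_estimator` follows this exact form, and the `_sketch` names are hypothetical.

```python
import math


def knn_distance_sketch(point, sample, k: int) -> float:
    """Euclidean distance from point to its k-th nearest neighbour in sample."""
    dists = sorted(math.dist(point, other) for other in sample)
    return dists[k - 1]


def naive_kl_sketch(s1, s2, k: int = 1) -> float:
    """Brute-force k-NN estimate of KL(P || Q) from samples s1 ~ P, s2 ~ Q."""
    n, m = len(s1), len(s2)
    d = len(s1[0])  # dimensionality of each point
    total = 0.0
    for i, p in enumerate(s1):
        others = s1[:i] + s1[i + 1:]  # leave the point itself out
        rho = knn_distance_sketch(p, others, k)
        nu = knn_distance_sketch(p, s2, k)
        total += math.log(nu / rho)
    return (d / n) * total + math.log(m / (n - 1))
```

Samples are lists of coordinate lists here. Duplicate points give zero distances and a `log` domain error, so real data may need deduplication or jitter first.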

### generate_custom_model.py

| Function | Signature | Description |
|----------|-----------|-------------|
| `lanczos_kernel` | `(x, order) → ndarray` | Lanczos interpolation kernel |
| `lanczos_interpolate` | `(arr, positions, order) → ndarray` | Interpolate at positions |
| `lanczos_resampling` | `(arr, new_len, order) → ndarray` | Resample to new length |
| `resample_one_positional_bias` | `(weights, len, pad) → ndarray` | Resample position bias |
| `resample_positional_bias_weights` | `(weights, len, pad) → ndarray` | Resample all biases |
| `generate_custom_model` | `(new_len, delta_basal) → Model` | Create modified model |
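
The Lanczos kernel of order $a$ is $\mathrm{sinc}(x)\,\mathrm{sinc}(x/a)$ for $|x| < a$ and zero elsewhere, with $\mathrm{sinc}(x) = \sin(\pi x)/(\pi x)$. The sketch below evaluates the kernel and interpolates a 1-D array at a fractional position. Clamping indices at the array edges and leaving the weights unnormalized are assumptions made for this illustration, and the `_sketch` names are hypothetical.

```python
import math


def lanczos_kernel_sketch(x: float, order: int) -> float:
    """Lanczos window: sinc(x) * sinc(x / order) inside |x| < order, else 0."""
    if x == 0:
        return 1.0
    if abs(x) >= order:
        return 0.0
    px = math.pi * x
    return order * math.sin(px) * math.sin(px / order) / (px * px)


def lanczos_interpolate_sketch(arr, position: float, order: int = 3) -> float:
    """Interpolate arr at a fractional position using 2 * order neighbours."""
    base = math.floor(position)
    total = 0.0
    for i in range(base - order + 1, base + order + 1):
        clamped = min(max(i, 0), len(arr) - 1)  # repeat edge values
        total += arr[clamped] * lanczos_kernel_sketch(position - i, order)
    return total
```

At integer positions the kernel vanishes at every neighbour except the sample itself, so the interpolation reproduces the original value up to floating-point error.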

### figutils.py

| Function | Signature | Description |
|----------|-----------|-------------|
| `subsample_points` | `(x, y, max) → (x, y)` | Random subsampling |
| `scatter_with_kde` | `(x, y, ax, alpha) → None` | Density scatter plot |
| `safelog` | `(x, tol) → ndarray` | Numerically safe log |
| `bin_kl` | `(y_true, y_pred) → ndarray` | Binary KL divergence |
| `flatten_dict` | `(d) → (keys, values)` | Flatten nested dict |
| `insert_motif_in_middle_of_sequence` | `(seq, motif) → str` | Insert motif |
| `insert_motif_in_middle_of_sequences` | `(seqs, motif) → Dict` | Batch insert |
| `landing_pads_to_sw_exons` | `(mers, motif, pre, post) → List` | Create landing pads |
| `all_seqs` | `(length) → List[str]` | Generate all k-mers |
| `extract_str_patches` | `(lst, n) → List[List[str]]` | Extract n-grams |
| `compute_activations_simple_conv` | `(layer, window) → Dict` | k-mer activations |
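
Two of the smaller helpers are easy to picture with the standard library: enumerating every k-mer and slicing strings into overlapping n-grams. The sketch below assumes a DNA alphabet in `ACGT` order and a sliding window with stride 1. Neither detail is confirmed by this reference, and the `_sketch` names are hypothetical.

```python
from itertools import product


def all_seqs_sketch(length: int, alphabet: str = "ACGT") -> list:
    """Every sequence of the given length over the alphabet, in product order."""
    return ["".join(chars) for chars in product(alphabet, repeat=length)]


def extract_str_patches_sketch(strings, n: int) -> list:
    """For each string, the list of its overlapping substrings of length n."""
    return [[s[i:i + n] for i in range(len(s) - n + 1)] for s in strings]
```

There are $4^k$ k-mers of length $k$, so exhaustive enumeration is only practical for short motifs.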

---

## Usage Examples

### Making Predictions

```python
from model_training.model import binary_KL, Selector, ResidualTuner, SumDiff, RegularizedBiasLayer
import tensorflow as tf
from joblib import load

# Load model
model = tf.keras.models.load_model(
    'output/custom_adjacency_regularizer_20210731_124_step3.h5',
    custom_objects={
        'binary_KL': binary_KL,
        'Selector': Selector,
        'ResidualTuner': ResidualTuner,
        'SumDiff': SumDiff,
        'RegularizedBiasLayer': RegularizedBiasLayer,
    }
)

# Load test data
xTe = load('data/xTe_ES7_HeLa_ABC.pkl.gz')
yTe = load('data/yTe_ES7_HeLa_ABC.pkl.gz')

# Predict
predictions = model.predict(xTe)
```

### Creating Force Plots

```python
import sys
sys.path.append('figures')
from force_plot import draw_force_plot

fig = draw_force_plot(
    sequences=['ATGC' * 22 + 'AT'],  # 90 nt
    annotations=['My Sequence'],
)
fig.savefig('my_force_plot.pdf')
```

### Processing New Sequences

```python
from data_preprocessing.utils import add_flanking, create_input_data

exon = 'ACGT' * 17 + 'AC'  # 70 nt
full_seq = add_flanking(exon, 10)  # 90 nt

seq_oh, struct_oh, wobbles = create_input_data([full_seq])

# Predict with `model`, loaded as shown in "Making Predictions" above
X = [seq_oh, struct_oh, wobbles]
psi = model.predict(X)[0, 0]
print(f"Predicted PSI: {psi:.3f}")
```