# API Reference

Complete reference for all functions and classes in the project.

---

## Data Preprocessing Module

### utils.py

#### String Functions

| Function | Signature | Description |
|----------|-----------|-------------|
| `human_format` | `(num: float) → str` | Format number with K/M/B suffix |
| `hamming` | `(s1: str, s2: str) → int` | Hamming distance between strings |
| `revcomp` | `(str: str) → str` | Reverse complement of DNA |
| `get_qualities` | `(str: str) → List[int]` | ASCII quality to Phred scores |
| `contains_Esp3I_site` | `(str: str) → bool` | Check for restriction site |
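
To make the descriptions above concrete, here is a minimal sketch of three of these helpers. The `*_sketch` names are hypothetical stand-ins, not the functions in `utils.py`, and the Phred+33 offset is an assumption about the FASTQ encoding.

```python
# Illustrative stand-ins only; the real implementations in utils.py may differ.
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def hamming_sketch(s1: str, s2: str) -> int:
    """Count the positions at which two equal-length strings differ."""
    if len(s1) != len(s2):
        raise ValueError("Hamming distance requires equal-length strings")
    return sum(a != b for a, b in zip(s1, s2))


def revcomp_sketch(seq: str) -> str:
    """Complement every base, then reverse the sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def qualities_sketch(qual: str, offset: int = 33) -> list:
    """Convert an ASCII quality string to Phred scores (Phred+33 assumed)."""
    return [ord(c) - offset for c in qual]
```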

#### File I/O

| Function | Signature | Description |
|----------|-----------|-------------|
| `tqdm_readline` | `(file, pbar) → str` | Read line with progress update |
| `process_paired_fastq_file` | `(f1, f2, callback) → int` | Process paired FASTQ files |
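
The paired-file pattern can be sketched as follows: read one four-line FASTQ record from each file in lockstep and pass the sequences and quality strings to a callback. This is an assumption-laden illustration: `process_paired_fastq_sketch` is a hypothetical name, it takes open file handles rather than paths, and returning the number of read pairs is a guess at what the `int` return value means. The callback argument order mirrors the `(r1, r2, r1_q, r2_q)` signatures listed elsewhere in this reference.

```python
import io


def process_paired_fastq_sketch(handle1, handle2, callback) -> int:
    """Walk two FASTQ streams in lockstep and call callback once per read pair."""
    n_pairs = 0
    while True:
        # A FASTQ record is four lines: header, sequence, '+', quality.
        rec1 = [handle1.readline().rstrip("\n") for _ in range(4)]
        rec2 = [handle2.readline().rstrip("\n") for _ in range(4)]
        if not rec1[0] or not rec2[0]:
            break  # either stream is exhausted
        callback(rec1[1], rec2[1], rec1[3], rec2[3])
        n_pairs += 1
    return n_pairs


# Usage with in-memory data instead of real files
seen = []
fq1 = io.StringIO("@read1/1\nACGT\n+\nIIII\n@read2/1\nTTTT\n+\n!!!!\n")
fq2 = io.StringIO("@read1/2\nGGCC\n+\nIIII\n@read2/2\nAAAA\n+\n!!!!\n")
n_processed = process_paired_fastq_sketch(
    fq1, fq2, lambda r1, r2, r1_q, r2_q: seen.append((r1, r2, r1_q, r2_q))
)
```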

#### Sequence Features

| Function | Signature | Description |
|----------|-----------|-------------|
| `add_flanking` | `(nts: str, len: int) → str` | Add flanking sequences |
| `add_barcode_flanking` | `(nts: str, len: int) → str` | Add barcode flanking |
| `nts_to_vector` | `(nts: str, rna=False) → ndarray` | One-hot encode sequence |
| `folding_to_vector` | `(nts: str) → ndarray` | One-hot encode structure |
| `str_to_vector` | `(str: str, template: str) → ndarray` | Generic one-hot encoding |
| `ei_vec` | `(i: int, len: int) → List[int]` | Create one-hot vector |
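
One-hot encoding against a template alphabet, as `str_to_vector` is described, might look like the sketch below. The function name, the `ACGT` column order, and the all-zero row for unknown symbols are assumptions made for illustration.

```python
import numpy as np


def one_hot_sketch(seq: str, alphabet: str = "ACGT") -> np.ndarray:
    """Return a (len(seq), len(alphabet)) matrix with a single 1 per known symbol."""
    index = {ch: i for i, ch in enumerate(alphabet)}
    out = np.zeros((len(seq), len(alphabet)), dtype=np.float32)
    for pos, ch in enumerate(seq):
        if ch in index:  # symbols outside the alphabet (e.g. N) stay all-zero
            out[pos, index[ch]] = 1.0
    return out
```

The same helper covers dot-bracket structures by passing a different alphabet, for example `one_hot_sketch("((.))", alphabet="(.)")`.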

#### RNA Structure

| Function | Signature | Description |
|----------|-----------|-------------|
| `rna_fold_structs` | `(seqs, maxBPspan=0) → (structs, mfes)` | Predict structures |
| `compute_structure` | `(seqs) → (struct_oh, structs, mfes)` | One-hot encoded structures |
| `compute_seq_oh` | `(seqs) → ndarray` | One-hot encode sequences |
| `compute_wobbles` | `(seqs, structs) → ndarray` | Identify wobble pairs |
| `create_input_data` | `(seqs) → (seq_oh, struct_oh, wobbles)` | Complete feature extraction |

#### Structure Analysis

| Function | Signature | Description |
|----------|-----------|-------------|
| `find_parentheses` | `(s: str) → Dict[int, int]` | Map base pair positions |
| `compute_bijection` | `(s: str) → ndarray` | Pairing array |
| `compute_wobble_indicator` | `(seq, struct) → List[int]` | Wobble pair flags |
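
Base-pair mapping on a dot-bracket string is typically done with a stack, and wobble flags follow from the resulting pairs. The sketch below assumes opening-to-closing index mapping and treats G-U (or G-T in DNA notation) as the only wobble pair; both function names are hypothetical.

```python
def find_pairs_sketch(struct: str) -> dict:
    """Map each '(' index to the index of its matching ')'."""
    stack, pairs = [], {}
    for i, ch in enumerate(struct):
        if ch == "(":
            stack.append(i)
        elif ch == ")":
            if not stack:
                raise ValueError("unbalanced structure: unexpected ')'")
            pairs[stack.pop()] = i
    if stack:
        raise ValueError("unbalanced structure: unclosed '('")
    return pairs


def wobble_flags_sketch(seq: str, struct: str) -> list:
    """Flag both positions of every G-U pair with 1, everything else with 0."""
    rna = seq.upper().replace("T", "U")
    flags = [0] * len(rna)
    for i, j in find_pairs_sketch(struct).items():
        if {rna[i], rna[j]} == {"G", "U"}:
            flags[i] = flags[j] = 1
    return flags
```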

---

### RNAutils.py

| Function | Signature | Description |
|----------|-----------|-------------|
| `RNAfold` | `(seqs, bin, temp, span, cmd) → List[[str, float]]` | MFE structure prediction |
| `RNAsubopt` | `(seq, bin, delta) → List[(str, float)]` | Suboptimal structures |
| `RNAsample` | `(seqs, bin, temp, n, span) → List[List[str]]` | Boltzmann sampling |
| `RNA_partition_function` | `(seqs, constraints, ...) → List[float]` | Partition function |

---

### compute_coupling.py

| Function | Signature | Description |
|----------|-----------|-------------|
| `collect_barcodes` | `(r1, r2, r1_q, r2_q) → None` | Extract barcode-exon pairs |

**Global variables:** `couplings`, `good_reads`, `reads_with_N`, `unidentified_reads`

---

### compute_splicing_outcomes.py

| Function | Signature | Description |
|----------|-----------|-------------|
| `identify_splicing_pattern` | `(r1, r2, r1_q, r2_q) → None` | Classify splicing |

**Splicing categories:** `num_exon_inclusion`, `num_exon_skipping`, `num_intron_retention`, `num_splicing_in_exon`, `num_unknown_splicing`

---

### generate_training_data.py

| Function | Signature | Description |
|----------|-----------|-------------|
| `read_dataset` | `(path, filter=True) → DataFrame` | Load filtered CSV |
| `to_input_data` | `(df, flanking=10) → tuple` | Create model inputs |
| `to_target_data` | `(df) → ndarray` | Compute PSI values |
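
PSI (percent spliced in) is conventionally the fraction of reads supporting exon inclusion. The sketch below assumes the denominator is inclusion plus skipping reads; which read categories `to_target_data` actually counts is not stated here, and `psi_sketch` is a hypothetical name.

```python
import numpy as np


def psi_sketch(inclusion, skipping) -> np.ndarray:
    """PSI = inclusion / (inclusion + skipping); NaN where there are no reads."""
    inclusion = np.asarray(inclusion, dtype=float)
    skipping = np.asarray(skipping, dtype=float)
    total = inclusion + skipping
    psi = np.full(total.shape, np.nan)
    # Only divide where reads exist, leaving NaN for empty rows
    np.divide(inclusion, total, out=psi, where=total > 0)
    return psi
```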

---

## Model Training Module

### model.py

#### Custom Layers

| Class | Purpose | Key Parameters |
|-------|---------|----------------|
| `Selector` | Select between inputs | `trainable=False` |
| `ResidualTuner` | Residual MLP | `hidden_units=100` |
| `SumDiff` | Energy difference | `freeze=False` |
| `RegularizedBiasLayer` | Position bias | Regularization params |

#### Regularizers

| Class/Function | Purpose |
|----------------|---------|
| `MultiRegularizer` | Combined regularizer |
| `pos_reg` | L2 position penalty |
| `adj_reg_fo` | First-order smoothness |
| `adj_reg_so` | Second-order smoothness |

#### Functions

| Function | Signature | Description |
|----------|-----------|-------------|
| `binary_KL` | `(y_true, y_pred) → scalar` | Binary KL divergence loss |
| `regularized_act` | `(x, reg, act) → tensor` | Activation with regularization |
| `train_model` | `(model, X, y, file, ...) → history` | Train with checkpointing |
| `get_model` | `(**kwargs) → Model` | Create model instance |
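
For reference, the binary KL divergence between a target probability p and a prediction q is p·log(p/q) + (1−p)·log((1−p)/(1−q)). The NumPy sketch below illustrates that formula only; the clipping constant and the mean reduction are assumptions, and the actual `binary_KL` loss operates on tensors and may reduce differently.

```python
import numpy as np


def binary_kl_sketch(y_true, y_pred, eps: float = 1e-7) -> float:
    """Mean binary KL divergence KL(y_true || y_pred) over all samples."""
    # Clip away exact 0 and 1 so the logarithms stay finite
    p = np.clip(np.asarray(y_true, dtype=float), eps, 1 - eps)
    q = np.clip(np.asarray(y_pred, dtype=float), eps, 1 - eps)
    per_sample = p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))
    return float(np.mean(per_sample))
```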

#### get_model() Parameters

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `input_length` | int | 90 | Sequence length |
| `randomized_region` | tuple | (10, 80) | Exon position |
| `num_filters` | int | 20 | Sequence filters |
| `num_structure_filters` | int | 8 | Structure filters |
| `filter_width` | int | 6 | Sequence filter size |
| `structure_filter_width` | int | 30 | Structure filter size |
| `dropout_rate` | float | 0.01 | Dropout probability |
| `activity_regularization` | float | 0.0 | Activation L1 |
| `position_regularization` | float | 2.5e-5 | Position L2 |
| `adjacency_regularization` | float | 0.0 | First-order smoothness |
| `adjacency_regularization_so` | float | 0.0 | Second-order smoothness |
| `energy_activation` | str | "softplus" | Energy activation |
| `tune_energy` | bool | True | Train energy params |
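
The defaults above are mutually consistent with the 70 nt exon and 10 nt flanks used in the usage examples. The sketch below checks that arithmetic and shows one way to override a few values. It assumes `randomized_region` is a half-open `(start, end)` pair, and the override values are arbitrary illustrations; the commented-out call needs the project on the import path.

```python
# Defaults copied from the parameter table above.
GET_MODEL_DEFAULTS = {
    "input_length": 90,
    "randomized_region": (10, 80),
    "num_filters": 20,
    "num_structure_filters": 8,
    "filter_width": 6,
    "structure_filter_width": 30,
    "dropout_rate": 0.01,
    "activity_regularization": 0.0,
    "position_regularization": 2.5e-5,
    "adjacency_regularization": 0.0,
    "adjacency_regularization_so": 0.0,
    "energy_activation": "softplus",
    "tune_energy": True,
}

start, end = GET_MODEL_DEFAULTS["randomized_region"]
exon_length = end - start  # assumes a half-open (start, end) region
flank_left = start
flank_right = GET_MODEL_DEFAULTS["input_length"] - end

# Hypothetical overrides, merged on top of the defaults
custom_kwargs = {**GET_MODEL_DEFAULTS, "num_filters": 32, "dropout_rate": 0.05}

# from model_training.model import get_model
# model = get_model(**custom_kwargs)
```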

---

## Figures Module

### force_plot.py

| Function | Signature | Description |
|----------|-----------|-------------|
| `get_link_midpoint` | `(fn, mid, eps, ...) → float` | Find sigmoid midpoint |
| `collapse_filters` | `(act_i, act_s, ...) → (df_i, df_s)` | Group filter activations |
| `create_force_data` | `(act_i, act_s, ...) → (Series, Series)` | Aggregate forces |
| `merge_small_forces` | `(forces, thresh) → Series` | Combine small contributions |
| `draw_force_plot` | `(seqs, annots, ...) → Figure` | Create visualization |

### sequence_logo.py

| Function | Signature | Description |
|----------|-----------|-------------|
| `plot_logo` | `(df, thresh, ax, colors) → None` | Draw sequence logo |
| `compute_freqs` | `(kmers) → DataFrame` | Nucleotide frequencies |
| `compute_info` | `(freqs) → ndarray` | Information content |
| `compute_heights` | `(freqs) → DataFrame` | Logo heights |
| `sequence_logo_heights` | `(df) → DataFrame` | Combined calculation |
| `draw_floating_logo` | `(heights, ..., ax) → None` | Overlay logo on axes |
| `compute_EDLogo_scores` | `(kmers, normed) → DataFrame` | Enrichment/depletion |
| `plot_EDLogo` | `(df, thresh, ax) → None` | Draw ED logo |
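
The frequency, information, and height steps follow the standard sequence-logo calculation: per-position information is log2(4) minus the Shannon entropy, and each letter's height is its frequency times that information. The sketch below uses plain NumPy arrays instead of DataFrames, omits any small-sample correction, and uses hypothetical function names.

```python
import numpy as np

ALPHABET = "ACGT"  # column order is an assumption


def freqs_sketch(kmers) -> np.ndarray:
    """Per-position nucleotide frequencies, shape (k, 4)."""
    counts = np.zeros((len(kmers[0]), len(ALPHABET)))
    for kmer in kmers:
        for pos, ch in enumerate(kmer):
            counts[pos, ALPHABET.index(ch)] += 1
    return counts / counts.sum(axis=1, keepdims=True)


def info_sketch(freqs: np.ndarray) -> np.ndarray:
    """Information content in bits per position: log2(4) - entropy."""
    with np.errstate(divide="ignore", invalid="ignore"):
        plogp = np.where(freqs > 0, freqs * np.log2(freqs), 0.0)
    return np.log2(freqs.shape[1]) + plogp.sum(axis=1)


def heights_sketch(freqs: np.ndarray) -> np.ndarray:
    """Letter heights: frequency scaled by the position's information content."""
    return freqs * info_sketch(freqs)[:, None]


# Position 0 is always A (2 bits); position 1 is uniform (0 bits)
example_freqs = freqs_sketch(["AA", "AC", "AG", "AT"])
example_info = info_sketch(example_freqs)
```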

### draw_stem_loop.py

| Function | Signature | Description |
|----------|-----------|-------------|
| `draw_line` | `(d, x1, y1, x2, y2, color) → None` | SVG line |
| `draw_nucleotide` | `(d, x, y, nt, color) → None` | SVG nucleotide circle |
| `draw_oligo` | `(d, xs, ys, nts, colors) → None` | SVG oligonucleotide |
| `draw_stem_loop` | `(nts, stem_len, colors, file) → None` | Complete stem-loop SVG |

### kl.py

| Function | Signature | Description |
|----------|-----------|-------------|
| `knn_distance` | `(point, sample, k) → float` | k-NN distance |
| `verify_sample_shapes` | `(s1, s2, k) → None` | Validate input shapes |
| `naive_estimator` | `(s1, s2, k) → float` | Brute-force KL estimate |
| `scipy_estimator` | `(s1, s2, k) → float` | KDTree-based KL |
| `skl_estimator` | `(s1, s2, k) → float` | sklearn-based KL |
| `skl_estimator_efficient` | `(s1, s2, k) → float` | Vectorized KL |
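
A common k-nearest-neighbour estimator of KL(P‖Q) from samples is (d/n)·Σ log(ν_k/ρ_k) + log(m/(n−1)), where ρ_k is the distance from each point of the first sample to its k-th neighbour within that sample and ν_k is the distance to its k-th neighbour in the second sample. The brute-force NumPy sketch below illustrates that formula; whether `kl.py` uses exactly this form is an assumption, and `knn_kl_sketch` is a hypothetical name.

```python
import numpy as np


def knn_kl_sketch(s1, s2, k: int = 1) -> float:
    """Estimate KL(P || Q) from samples s1 ~ P and s2 ~ Q, each of shape (n, d)."""
    s1 = np.asarray(s1, dtype=float)
    s2 = np.asarray(s2, dtype=float)
    n, d = s1.shape
    m = s2.shape[0]
    # Brute-force pairwise Euclidean distances
    d11 = np.linalg.norm(s1[:, None, :] - s1[None, :, :], axis=-1)
    d12 = np.linalg.norm(s1[:, None, :] - s2[None, :, :], axis=-1)
    rho = np.sort(d11, axis=1)[:, k]      # column 0 is the zero self-distance
    nu = np.sort(d12, axis=1)[:, k - 1]
    return float(d / n * np.sum(np.log(nu / rho)) + np.log(m / (n - 1)))


# Usage: identical distributions versus a mean shift of 3
rng = np.random.default_rng(0)
p_sample = rng.normal(0.0, 1.0, size=(300, 1))
q_same = rng.normal(0.0, 1.0, size=(300, 1))
q_shifted = rng.normal(3.0, 1.0, size=(300, 1))
kl_same = knn_kl_sketch(p_sample, q_same, k=5)
kl_shifted = knn_kl_sketch(p_sample, q_shifted, k=5)
```

The estimate for matching distributions should sit near zero, while the shifted case should be clearly positive (the analytic value for unit-variance Gaussians three apart is 4.5).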

### generate_custom_model.py

| Function | Signature | Description |
|----------|-----------|-------------|
| `lanczos_kernel` | `(x, order) → ndarray` | Lanczos interpolation kernel |
| `lanczos_interpolate` | `(arr, positions, order) → ndarray` | Interpolate at positions |
| `lanczos_resampling` | `(arr, new_len, order) → ndarray` | Resample to new length |
| `resample_one_positional_bias` | `(weights, len, pad) → ndarray` | Resample position bias |
| `resample_positional_bias_weights` | `(weights, len, pad) → ndarray` | Resample all biases |
| `generate_custom_model` | `(new_len, delta_basal) → Model` | Create modified model |
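
The Lanczos kernel of order a is sinc(x)·sinc(x/a) for |x| < a and zero elsewhere. The sketch below shows the kernel and a simple 1-D resampler built on it. Edge clamping and weight normalization are choices made for this illustration, not necessarily what `generate_custom_model.py` does, and the `*_sketch` names are hypothetical.

```python
import numpy as np


def lanczos_kernel_sketch(x, order: int = 3) -> np.ndarray:
    """Lanczos window: sinc(x) * sinc(x / order) inside |x| < order, else 0."""
    x = np.asarray(x, dtype=float)
    # np.sinc is the normalized sinc, sin(pi x) / (pi x)
    return np.where(np.abs(x) < order, np.sinc(x) * np.sinc(x / order), 0.0)


def lanczos_resample_sketch(arr, new_len: int, order: int = 3) -> np.ndarray:
    """Resample a 1-D array to new_len points spanning the same index range."""
    arr = np.asarray(arr, dtype=float)
    positions = np.linspace(0, len(arr) - 1, new_len)
    out = np.empty(new_len)
    for j, p in enumerate(positions):
        base = int(np.floor(p))
        idx = np.arange(base - order + 1, base + order + 1)
        weights = lanczos_kernel_sketch(p - idx, order)
        values = arr[np.clip(idx, 0, len(arr) - 1)]  # clamp indices at the edges
        out[j] = np.dot(weights, values) / weights.sum()
    return out
```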

### figutils.py

| Function | Signature | Description |
|----------|-----------|-------------|
| `subsample_points` | `(x, y, max) → (x, y)` | Random subsampling |
| `scatter_with_kde` | `(x, y, ax, alpha) → None` | Density scatter plot |
| `safelog` | `(x, tol) → ndarray` | Numerically safe log |
| `bin_kl` | `(y_true, y_pred) → ndarray` | Binary KL divergence |
| `flatten_dict` | `(d) → (keys, values)` | Flatten nested dict |
| `insert_motif_in_middle_of_sequence` | `(seq, motif) → str` | Insert motif |
| `insert_motif_in_middle_of_sequences` | `(seqs, motif) → Dict` | Batch insert |
| `landing_pads_to_sw_exons` | `(mers, motif, pre, post) → List` | Create landing pads |
| `all_seqs` | `(length) → List[str]` | Generate all k-mers |
| `extract_str_patches` | `(lst, n) → List[List[str]]` | Extract n-grams |
| `compute_activations_simple_conv` | `(layer, window) → Dict` | k-mer activations |
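
Enumerating all k-mers and extracting n-grams are small enough to sketch directly. The function names below are hypothetical, and the `ACGT` alphabet and ordering are assumptions; the real `all_seqs` and `extract_str_patches` may order or group their output differently.

```python
from itertools import product


def all_kmers_sketch(length: int, alphabet: str = "ACGT") -> list:
    """Every string of the given length over the alphabet, in lexicographic order."""
    return ["".join(chars) for chars in product(alphabet, repeat=length)]


def str_patches_sketch(seqs, n: int) -> list:
    """For each input string, list all overlapping substrings of length n."""
    return [[s[i:i + n] for i in range(len(s) - n + 1)] for s in seqs]
```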

---

## Usage Examples

### Making Predictions

```python
from model_training.model import binary_KL, Selector, ResidualTuner, SumDiff, RegularizedBiasLayer
import tensorflow as tf
from joblib import load

# Load model
model = tf.keras.models.load_model(
    'output/custom_adjacency_regularizer_20210731_124_step3.h5',
    custom_objects={
        'binary_KL': binary_KL,
        'Selector': Selector,
        'ResidualTuner': ResidualTuner,
        'SumDiff': SumDiff,
        'RegularizedBiasLayer': RegularizedBiasLayer,
    }
)

# Load test data
xTe = load('data/xTe_ES7_HeLa_ABC.pkl.gz')
yTe = load('data/yTe_ES7_HeLa_ABC.pkl.gz')

# Predict
predictions = model.predict(xTe)
```

### Creating Force Plots

```python
import sys
sys.path.append('figures')
from force_plot import draw_force_plot

fig = draw_force_plot(
    sequences=['ATGC' * 22 + 'AT'],  # 90 nt
    annotations=['My Sequence'],
)
fig.savefig('my_force_plot.pdf')
```

### Processing New Sequences

```python
from data_preprocessing.utils import add_flanking, create_input_data

exon = 'ACGT' * 17 + 'AC'  # 70 nt
full_seq = add_flanking(exon, 10)  # 90 nt

seq_oh, struct_oh, wobbles = create_input_data([full_seq])

# Predict with the model loaded in "Making Predictions" above
X = [seq_oh, struct_oh, wobbles]
psi = model.predict(X)[0, 0]
print(f"Predicted PSI: {psi:.3f}")
```