# API Reference

Complete reference for all functions and classes in the project.

---
## Data Preprocessing Module

### utils.py

#### String Functions

| Function | Signature | Description |
|----------|-----------|-------------|
| `human_format` | `(num: float) → str` | Format number with K/M/B suffix |
| `hamming` | `(s1: str, s2: str) → int` | Hamming distance between strings |
| `revcomp` | `(str: str) → str` | Reverse complement of DNA |
| `get_qualities` | `(str: str) → List[int]` | ASCII quality to Phred scores |
| `contains_Esp3I_site` | `(str: str) → bool` | Check for restriction site |
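
The string helpers above are small enough to illustrate in a few lines. The following is a minimal sketch of what functions with these signatures might do, not the project's implementation: the `_sketch` names, the Phred+33 offset, and the decision to check both strands for the Esp3I recognition site (`CGTCTC`) are all assumptions.

```python
from typing import List

# Translation table for complementing uppercase DNA (N stays N).
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")
ESP3I_SITE = "CGTCTC"  # Esp3I recognition sequence


def hamming_sketch(s1: str, s2: str) -> int:
    """Count the positions at which two equal-length strings differ."""
    if len(s1) != len(s2):
        raise ValueError("Hamming distance requires equal-length strings")
    return sum(a != b for a, b in zip(s1, s2))


def revcomp_sketch(seq: str) -> str:
    """Reverse complement of an uppercase DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def get_qualities_sketch(qual: str, offset: int = 33) -> List[int]:
    """Decode an ASCII quality string into Phred scores (Phred+33 assumed)."""
    return [ord(ch) - offset for ch in qual]


def contains_esp3i_site_sketch(seq: str) -> bool:
    """True if the site occurs on either strand (both-strand check is an assumption)."""
    return ESP3I_SITE in seq or ESP3I_SITE in revcomp_sketch(seq)


print(hamming_sketch("ACGT", "ACGA"))   # 1
print(revcomp_sketch("ATGC"))           # GCAT
print(get_qualities_sketch("I5!"))      # [40, 20, 0]
```
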
#### File I/O

| Function | Signature | Description |
|----------|-----------|-------------|
| `tqdm_readline` | `(file, pbar) → str` | Read line with progress update |
| `process_paired_fastq_file` | `(f1, f2, callback) → int` | Process paired FASTQ files |

#### Sequence Features

| Function | Signature | Description |
|----------|-----------|-------------|
| `add_flanking` | `(nts: str, len: int) → str` | Add flanking sequences |
| `add_barcode_flanking` | `(nts: str, len: int) → str` | Add barcode flanking |
| `nts_to_vector` | `(nts: str, rna=False) → ndarray` | One-hot encode sequence |
| `folding_to_vector` | `(nts: str) → ndarray` | One-hot encode structure |
| `str_to_vector` | `(str: str, template: str) → ndarray` | Generic one-hot encoding |
| `ei_vec` | `(i: int, len: int) → List[int]` | Create one-hot vector |
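
To make the one-hot helpers concrete, here is a minimal sketch of template-based encoding. It assumes `template` lists the allowed characters in column order and returns nested lists rather than the `ndarray` the real functions return. The `_sketch` names are hypothetical and the project's code may differ.

```python
from typing import List


def ei_vec_sketch(i: int, length: int) -> List[int]:
    """Unit vector of the given length with a single 1 at index i."""
    vec = [0] * length
    vec[i] = 1
    return vec


def str_to_vector_sketch(s: str, template: str) -> List[List[int]]:
    """One-hot encode s: one row per character, columns ordered as in template."""
    index = {ch: k for k, ch in enumerate(template)}
    return [ei_vec_sketch(index[ch], len(template)) for ch in s]


# Encoding a sequence uses a nucleotide template,
# encoding a dot-bracket structure uses a structure template.
print(str_to_vector_sketch("ACGT", "ACGT"))
print(str_to_vector_sketch("(.)", ".()"))
```
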
#### RNA Structure

| Function | Signature | Description |
|----------|-----------|-------------|
| `rna_fold_structs` | `(seqs, maxBPspan=0) → (structs, mfes)` | Predict structures |
| `compute_structure` | `(seqs) → (struct_oh, structs, mfes)` | One-hot encoded structures |
| `compute_seq_oh` | `(seqs) → ndarray` | One-hot encode sequences |
| `compute_wobbles` | `(seqs, structs) → ndarray` | Identify wobble pairs |
| `create_input_data` | `(seqs) → (seq_oh, struct_oh, wobbles)` | Complete feature extraction |

#### Structure Analysis

| Function | Signature | Description |
|----------|-----------|-------------|
| `find_parentheses` | `(s: str) → Dict[int, int]` | Map base pair positions |
| `compute_bijection` | `(s: str) → ndarray` | Pairing array |
| `compute_wobble_indicator` | `(seq, struct) → List[int]` | Wobble pair flags |
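
The structure-analysis helpers operate on dot-bracket strings. The sketch below shows one plausible way to map paired positions with a stack and flag G-U wobble pairs. It assumes the mapping goes from the opening position to the closing position and that both paired bases are flagged; neither is confirmed by this reference, and the `_sketch` names are hypothetical.

```python
from typing import Dict, List


def find_parentheses_sketch(struct: str) -> Dict[int, int]:
    """Map each '(' position to the position of its matching ')'."""
    stack: List[int] = []
    pairs: Dict[int, int] = {}
    for pos, ch in enumerate(struct):
        if ch == "(":
            stack.append(pos)
        elif ch == ")":
            if not stack:
                raise ValueError(f"unmatched ')' at position {pos}")
            pairs[stack.pop()] = pos
    if stack:
        raise ValueError(f"unmatched '(' at position {stack[-1]}")
    return pairs


def wobble_indicator_sketch(seq: str, struct: str) -> List[int]:
    """Flag positions that take part in a G-U (or G-T) wobble pair."""
    flags = [0] * len(seq)
    for i, j in find_parentheses_sketch(struct).items():
        if {seq[i], seq[j]} in ({"G", "U"}, {"G", "T"}):
            flags[i] = flags[j] = 1
    return flags


print(find_parentheses_sketch("((..))"))
print(wobble_indicator_sketch("GGAAUC", "((..))"))  # [0, 1, 0, 0, 1, 0]
```
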
---

### RNAutils.py

| Function | Signature | Description |
|----------|-----------|-------------|
| `RNAfold` | `(seqs, bin, temp, span, cmd) → List[[str, float]]` | MFE structure prediction |
| `RNAsubopt` | `(seq, bin, delta) → List[(str, float)]` | Suboptimal structures |
| `RNAsample` | `(seqs, bin, temp, n, span) → List[List[str]]` | Boltzmann sampling |
| `RNA_partition_function` | `(seqs, constraints, ...) → List[float]` | Partition function |

---
### compute_coupling.py

| Function | Signature | Description |
|----------|-----------|-------------|
| `collect_barcodes` | `(r1, r2, r1_q, r2_q) → None` | Extract barcode-exon pairs |

**Global variables:** `couplings`, `good_reads`, `reads_with_N`, `unidentified_reads`

---

### compute_splicing_outcomes.py

| Function | Signature | Description |
|----------|-----------|-------------|
| `identify_splicing_pattern` | `(r1, r2, r1_q, r2_q) → None` | Classify splicing |

**Splicing categories:** `num_exon_inclusion`, `num_exon_skipping`, `num_intron_retention`, `num_splicing_in_exon`, `num_unknown_splicing`

---

### generate_training_data.py

| Function | Signature | Description |
|----------|-----------|-------------|
| `read_dataset` | `(path, filter=True) → DataFrame` | Load filtered CSV |
| `to_input_data` | `(df, flanking=10) → tuple` | Create model inputs |
| `to_target_data` | `(df) → ndarray` | Compute PSI values |

---
## Model Training Module

### model.py

#### Custom Layers

| Class | Purpose | Key Parameters |
|-------|---------|----------------|
| `Selector` | Select between inputs | `trainable=False` |
| `ResidualTuner` | Residual MLP | `hidden_units=100` |
| `SumDiff` | Energy difference | `freeze=False` |
| `RegularizedBiasLayer` | Position bias | Regularization params |

#### Regularizers

| Class/Function | Purpose |
|----------------|---------|
| `MultiRegularizer` | Combined regularizer |
| `pos_reg` | L2 position penalty |
| `adj_reg_fo` | First-order smoothness |
| `adj_reg_so` | Second-order smoothness |

#### Functions

| Function | Signature | Description |
|----------|-----------|-------------|
| `binary_KL` | `(y_true, y_pred) → scalar` | Binary KL divergence loss |
| `regularized_act` | `(x, reg, act) → tensor` | Activation with regularization |
| `train_model` | `(model, X, y, file, ...) → history` | Train with checkpointing |
| `get_model` | `(**kwargs) → Model` | Create model instance |
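
For readers unfamiliar with the loss, the sketch below computes the KL divergence between two Bernoulli distributions for a single pair of values, in plain Python. It illustrates the quantity that `binary_KL` is described as computing; the clipping constant `eps` and the `_sketch` name are assumptions, and the actual loss operates on tensors and may reduce over the batch differently.

```python
import math


def binary_kl_sketch(y_true: float, y_pred: float, eps: float = 1e-7) -> float:
    """KL(Bernoulli(y_true) || Bernoulli(y_pred)) with clipping to avoid log(0)."""
    y = min(max(y_true, eps), 1.0 - eps)
    p = min(max(y_pred, eps), 1.0 - eps)
    return y * math.log(y / p) + (1.0 - y) * math.log((1.0 - y) / (1.0 - p))


print(binary_kl_sketch(0.5, 0.5))  # 0.0: identical distributions
print(binary_kl_sketch(0.9, 0.1))  # positive: prediction far from target
```

The divergence is zero only when the predicted PSI equals the target and grows as the two move apart, which is what makes it usable as a loss for targets in [0, 1].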
#### get_model() Parameters

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `input_length` | int | 90 | Sequence length |
| `randomized_region` | tuple | (10, 80) | Exon position |
| `num_filters` | int | 20 | Sequence filters |
| `num_structure_filters` | int | 8 | Structure filters |
| `filter_width` | int | 6 | Sequence filter size |
| `structure_filter_width` | int | 30 | Structure filter size |
| `dropout_rate` | float | 0.01 | Dropout probability |
| `activity_regularization` | float | 0.0 | Activation L1 |
| `position_regularization` | float | 2.5e-5 | Position L2 |
| `adjacency_regularization` | float | 0.0 | First-order smoothness |
| `adjacency_regularization_so` | float | 0.0 | Second-order smoothness |
| `energy_activation` | str | "softplus" | Energy activation |
| `tune_energy` | bool | True | Train energy params |

---
## Figures Module

### force_plot.py

| Function | Signature | Description |
|----------|-----------|-------------|
| `get_link_midpoint` | `(fn, mid, eps, ...) → float` | Find sigmoid midpoint |
| `collapse_filters` | `(act_i, act_s, ...) → (df_i, df_s)` | Group filter activations |
| `create_force_data` | `(act_i, act_s, ...) → (Series, Series)` | Aggregate forces |
| `merge_small_forces` | `(forces, thresh) → Series` | Combine small contributions |
| `draw_force_plot` | `(seqs, annots, ...) → Figure` | Create visualization |

### sequence_logo.py

| Function | Signature | Description |
|----------|-----------|-------------|
| `plot_logo` | `(df, thresh, ax, colors) → None` | Draw sequence logo |
| `compute_freqs` | `(kmers) → DataFrame` | Nucleotide frequencies |
| `compute_info` | `(freqs) → ndarray` | Information content |
| `compute_heights` | `(freqs) → DataFrame` | Logo heights |
| `sequence_logo_heights` | `(df) → DataFrame` | Combined calculation |
| `draw_floating_logo` | `(heights, ..., ax) → None` | Overlay logo on axes |
| `compute_EDLogo_scores` | `(kmers, normed) → DataFrame` | Enrichment/depletion |
| `plot_EDLogo` | `(df, thresh, ax) → None` | Draw ED logo |
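
The frequency, information, and height steps follow the usual sequence-logo recipe: per-position nucleotide frequencies, information content as 2 bits minus the Shannon entropy, and letter heights as frequency times information. The sketch below shows that recipe in plain Python. It assumes a four-letter DNA alphabet, applies no small-sample correction, and returns lists and dicts instead of the `DataFrame`/`ndarray` types listed above. The `_sketch` names are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List

ALPHABET = "ACGT"  # assumed alphabet


def compute_freqs_sketch(kmers: List[str]) -> List[Dict[str, float]]:
    """Per-position nucleotide frequencies for equal-length k-mers."""
    freqs = []
    for pos in range(len(kmers[0])):
        counts = Counter(kmer[pos] for kmer in kmers)
        freqs.append({nt: counts[nt] / len(kmers) for nt in ALPHABET})
    return freqs


def compute_info_sketch(freqs: List[Dict[str, float]]) -> List[float]:
    """Information content per position: log2(4) minus Shannon entropy, in bits."""
    info = []
    for col in freqs:
        entropy = -sum(p * math.log2(p) for p in col.values() if p > 0)
        info.append(math.log2(len(ALPHABET)) - entropy)
    return info


def compute_heights_sketch(freqs: List[Dict[str, float]]) -> List[Dict[str, float]]:
    """Letter heights: each frequency scaled by its position's information."""
    info = compute_info_sketch(freqs)
    return [{nt: p * ic for nt, p in col.items()} for col, ic in zip(freqs, info)]


kmers = ["AA", "AC", "AG", "AT"]
freqs = compute_freqs_sketch(kmers)
print(compute_info_sketch(freqs))  # fully conserved first position, uniform second
print(compute_heights_sketch(freqs)[0])
```
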
### draw_stem_loop.py

| Function | Signature | Description |
|----------|-----------|-------------|
| `draw_line` | `(d, x1, y1, x2, y2, color) → None` | SVG line |
| `draw_nucleotide` | `(d, x, y, nt, color) → None` | SVG nucleotide circle |
| `draw_oligo` | `(d, xs, ys, nts, colors) → None` | SVG oligonucleotide |
| `draw_stem_loop` | `(nts, stem_len, colors, file) → None` | Complete stem-loop SVG |

### kl.py

| Function | Signature | Description |
|----------|-----------|-------------|
| `knn_distance` | `(point, sample, k) → float` | k-NN distance |
| `verify_sample_shapes` | `(s1, s2, k) → None` | Validate input shapes |
| `naive_estimator` | `(s1, s2, k) → float` | Brute-force KL estimate |
| `scipy_estimator` | `(s1, s2, k) → float` | KDTree-based KL |
| `skl_estimator` | `(s1, s2, k) → float` | sklearn-based KL |
| `skl_estimator_efficient` | `(s1, s2, k) → float` | Vectorized KL |
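
The estimators above differ in how they find neighbours, not in what they estimate. As an illustration, the sketch below implements a brute-force k-nearest-neighbour KL divergence estimator in plain Python, assuming the module uses the common form D ≈ (d/n) Σ log(ν_k/ρ_k) + log(m/(n−1)), where ρ_k is the distance from a point to its k-th neighbour within its own sample and ν_k the distance to its k-th neighbour in the other sample. That assumption is not confirmed by this reference, and the `_sketch` names are hypothetical.

```python
import math
from typing import List, Sequence

Point = Sequence[float]


def knn_distance_sketch(point: Point, sample: List[Point], k: int) -> float:
    """Euclidean distance from point to its k-th nearest neighbour in sample (k >= 1)."""
    dists = sorted(math.dist(point, other) for other in sample)
    return dists[k - 1]


def naive_kl_sketch(s1: List[Point], s2: List[Point], k: int = 1) -> float:
    """Brute-force k-NN estimate of KL(P || Q) from samples s1 ~ P and s2 ~ Q."""
    n, m = len(s1), len(s2)
    d = len(s1[0])  # dimensionality of each point
    total = math.log(m / (n - 1))
    for i, p in enumerate(s1):
        others = s1[:i] + s1[i + 1:]              # exclude the point itself
        rho = knn_distance_sketch(p, others, k)   # within-sample distance
        nu = knn_distance_sketch(p, s2, k)        # cross-sample distance
        total += (d / n) * math.log(nu / rho)
    return total


s1 = [[0.0], [1.0], [2.0]]
s2 = [[0.5], [1.5], [2.5], [3.5]]
print(naive_kl_sketch(s1, s2, k=1))
```

This version costs O(n·(n+m)) distance evaluations, and duplicate points give a zero distance and a failing logarithm, which is why tree-based neighbour searches and input validation are the usual next steps.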
### generate_custom_model.py

| Function | Signature | Description |
|----------|-----------|-------------|
| `lanczos_kernel` | `(x, order) → ndarray` | Lanczos interpolation kernel |
| `lanczos_interpolate` | `(arr, positions, order) → ndarray` | Interpolate at positions |
| `lanczos_resampling` | `(arr, new_len, order) → ndarray` | Resample to new length |
| `resample_one_positional_bias` | `(weights, len, pad) → ndarray` | Resample position bias |
| `resample_positional_bias_weights` | `(weights, len, pad) → ndarray` | Resample all biases |
| `generate_custom_model` | `(new_len, delta_basal) → Model` | Create modified model |
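
Lanczos resampling is what lets a position-bias vector trained at one length be stretched to another. The sketch below shows the standard Lanczos window, L(x) = sinc(x)·sinc(x/a) for |x| < a and 0 otherwise, and a scalar interpolation built on it. Clamping indices at the array edges and leaving the weights unnormalized are assumptions of this sketch; the project's functions operate on arrays and may handle edges differently.

```python
import math
from typing import List, Sequence


def _sinc(x: float) -> float:
    """Normalized sinc: sin(pi x) / (pi x), with sinc(0) = 1."""
    return 1.0 if x == 0 else math.sin(math.pi * x) / (math.pi * x)


def lanczos_kernel_sketch(x: float, order: int = 3) -> float:
    """Lanczos window of the given order (support is |x| < order)."""
    if abs(x) >= order:
        return 0.0
    return _sinc(x) * _sinc(x / order)


def lanczos_interpolate_sketch(
    arr: Sequence[float], positions: Sequence[float], order: int = 3
) -> List[float]:
    """Evaluate arr at fractional positions using a Lanczos-weighted sum."""
    out = []
    for pos in positions:
        center = math.floor(pos)
        total = 0.0
        for i in range(center - order + 1, center + order + 1):
            j = min(max(i, 0), len(arr) - 1)  # clamp at the edges (assumption)
            total += arr[j] * lanczos_kernel_sketch(pos - i, order)
        out.append(total)
    return out


values = [1.0, 2.0, 3.0, 4.0, 5.0]
print(lanczos_interpolate_sketch(values, [2.0, 2.5]))
```

At integer positions the kernel is 1 at the sample itself and numerically zero at every other sample, so the original values are reproduced; fractional positions blend the `2 * order` nearest samples.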
### figutils.py

| Function | Signature | Description |
|----------|-----------|-------------|
| `subsample_points` | `(x, y, max) → (x, y)` | Random subsampling |
| `scatter_with_kde` | `(x, y, ax, alpha) → None` | Density scatter plot |
| `safelog` | `(x, tol) → ndarray` | Numerically safe log |
| `bin_kl` | `(y_true, y_pred) → ndarray` | Binary KL divergence |
| `flatten_dict` | `(d) → (keys, values)` | Flatten nested dict |
| `insert_motif_in_middle_of_sequence` | `(seq, motif) → str` | Insert motif |
| `insert_motif_in_middle_of_sequences` | `(seqs, motif) → Dict` | Batch insert |
| `landing_pads_to_sw_exons` | `(mers, motif, pre, post) → List` | Create landing pads |
| `all_seqs` | `(length) → List[str]` | Generate all k-mers |
| `extract_str_patches` | `(lst, n) → List[List[str]]` | Extract n-grams |
| `compute_activations_simple_conv` | `(layer, window) → Dict` | k-mer activations |
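
Two of these helpers are simple enumeration utilities. The sketch below shows one way to generate every k-mer and to slice strings into overlapping n-grams. It assumes a DNA alphabet in `ACGT` order and that n-grams are taken with stride 1; the `_sketch` names are hypothetical and the project's output ordering may differ.

```python
from itertools import product
from typing import List


def all_seqs_sketch(length: int, alphabet: str = "ACGT") -> List[str]:
    """Every sequence of the given length over the alphabet, in lexicographic order."""
    return ["".join(chars) for chars in product(alphabet, repeat=length)]


def extract_str_patches_sketch(lst: List[str], n: int) -> List[List[str]]:
    """For each string, the list of its overlapping length-n substrings."""
    return [[s[i:i + n] for i in range(len(s) - n + 1)] for s in lst]


print(len(all_seqs_sketch(6)))                    # 4096 six-mers
print(extract_str_patches_sketch(["ACGT"], 2))    # [['AC', 'CG', 'GT']]
```
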
---

## Usage Examples

### Making Predictions

```python
from model_training.model import binary_KL, Selector, ResidualTuner, SumDiff, RegularizedBiasLayer
import tensorflow as tf
from joblib import load

# Load the model; the custom layers and loss must be registered
# so Keras can deserialize them
model = tf.keras.models.load_model(
    'output/custom_adjacency_regularizer_20210731_124_step3.h5',
    custom_objects={
        'binary_KL': binary_KL,
        'Selector': Selector,
        'ResidualTuner': ResidualTuner,
        'SumDiff': SumDiff,
        'RegularizedBiasLayer': RegularizedBiasLayer,
    }
)

# Load test data
xTe = load('data/xTe_ES7_HeLa_ABC.pkl.gz')
yTe = load('data/yTe_ES7_HeLa_ABC.pkl.gz')

# Predict
predictions = model.predict(xTe)
```
### Creating Force Plots

```python
import sys
sys.path.append('figures')  # make the figures module importable

from force_plot import draw_force_plot

fig = draw_force_plot(
    sequences=['ATGC' * 22 + 'AT'],  # 90 nt
    annotations=['My Sequence'],
)
fig.savefig('my_force_plot.pdf')
```
### Processing New Sequences

```python
from data_preprocessing.utils import add_flanking, create_input_data

exon = 'ACGT' * 17 + 'AC'  # 70 nt
full_seq = add_flanking(exon, 10)  # 90 nt
seq_oh, struct_oh, wobbles = create_input_data([full_seq])

# Now use with the model (`model` is loaded as in "Making Predictions")
X = [seq_oh, struct_oh, wobbles]
psi = model.predict(X)[0, 0]
print(f"Predicted PSI: {psi:.3f}")
```