# API Reference

Complete reference for all functions and classes in the project.

---

## Data Preprocessing Module

### utils.py

#### String Functions

| Function | Signature | Description |
|----------|-----------|-------------|
| `human_format` | `(num: float) → str` | Format number with K/M/B suffix |
| `hamming` | `(s1: str, s2: str) → int` | Hamming distance between strings |
| `revcomp` | `(str: str) → str` | Reverse complement of DNA |
| `get_qualities` | `(str: str) → List[int]` | ASCII quality to Phred scores |
| `contains_Esp3I_site` | `(str: str) → bool` | Check for restriction site |
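
As a rough illustration of the behaviour these descriptions suggest, here is a minimal standalone sketch of `hamming`, `revcomp`, and `get_qualities`. It is not the project's implementation: the Phred+33 offset, the treatment of `N`, and the error raised for unequal lengths are all assumptions.

```python
from typing import List

# Assumed complement table; the project's handling of N or lowercase may differ.
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def hamming(s1: str, s2: str) -> int:
    """Count the positions at which two equal-length strings differ."""
    if len(s1) != len(s2):
        raise ValueError("sequences must have equal length")
    return sum(a != b for a, b in zip(s1, s2))


def revcomp(seq: str) -> str:
    """Complement every base, then reverse the result."""
    return seq.translate(_COMPLEMENT)[::-1]


def get_qualities(qual: str, offset: int = 33) -> List[int]:
    """Convert an ASCII quality string to Phred scores (assumes Phred+33)."""
    return [ord(c) - offset for c in qual]
```

For example, `revcomp("AACG")` gives `"CGTT"` under these assumptions.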

#### File I/O

| Function | Signature | Description |
|----------|-----------|-------------|
| `tqdm_readline` | `(file, pbar) → str` | Read line with progress update |
| `process_paired_fastq_file` | `(f1, f2, callback) → int` | Process paired FASTQ files |

#### Sequence Features

| Function | Signature | Description |
|----------|-----------|-------------|
| `add_flanking` | `(nts: str, len: int) → str` | Add flanking sequences |
| `add_barcode_flanking` | `(nts: str, len: int) → str` | Add barcode flanking |
| `nts_to_vector` | `(nts: str, rna=False) → ndarray` | One-hot encode sequence |
| `folding_to_vector` | `(nts: str) → ndarray` | One-hot encode structure |
| `str_to_vector` | `(str: str, template: str) → ndarray` | Generic one-hot encoding |
| `ei_vec` | `(i: int, len: int) → List[int]` | Create one-hot vector |
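
To make the one-hot encoding concrete, the sketch below mirrors `str_to_vector` and `nts_to_vector` in plain Python. It returns nested lists rather than an `ndarray` so that it runs without NumPy, and the column order (`A, C, G, T` or `A, C, G, U`) is an assumption rather than something confirmed by the project code.

```python
from typing import List


def str_to_vector(s: str, template: str) -> List[List[int]]:
    """One-hot encode `s` over the alphabet given by `template`."""
    index = {ch: i for i, ch in enumerate(template)}
    rows = []
    for ch in s:
        row = [0] * len(template)
        row[index[ch]] = 1  # KeyError for characters outside the alphabet
        rows.append(row)
    return rows


def nts_to_vector(nts: str, rna: bool = False) -> List[List[int]]:
    """One-hot encode nucleotides; assumed column order is A, C, G, T (or U)."""
    return str_to_vector(nts, "ACGU" if rna else "ACGT")
```

Each input character becomes one row with a single `1`, so a 90 nt sequence yields a 90 x 4 matrix.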

#### RNA Structure

| Function | Signature | Description |
|----------|-----------|-------------|
| `rna_fold_structs` | `(seqs, maxBPspan=0) → (structs, mfes)` | Predict structures |
| `compute_structure` | `(seqs) → (struct_oh, structs, mfes)` | One-hot encoded structures |
| `compute_seq_oh` | `(seqs) → ndarray` | One-hot encode sequences |
| `compute_wobbles` | `(seqs, structs) → ndarray` | Identify wobble pairs |
| `create_input_data` | `(seqs) → (seq_oh, struct_oh, wobbles)` | Complete feature extraction |

#### Structure Analysis

| Function | Signature | Description |
|----------|-----------|-------------|
| `find_parentheses` | `(s: str) → Dict[int, int]` | Map base pair positions |
| `compute_bijection` | `(s: str) → ndarray` | Pairing array |
| `compute_wobble_indicator` | `(seq, struct) → List[int]` | Wobble pair flags |
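
The structure-analysis helpers operate on dot-bracket strings. The following is a minimal sketch of how base-pair mapping and wobble flagging could work; it assumes that `find_parentheses` maps each opening index to its closing index and that a wobble pair means G-U (or G-T in DNA notation). Neither detail is confirmed by the project source.

```python
from typing import Dict, List


def find_parentheses(s: str) -> Dict[int, int]:
    """Map the index of each '(' to the index of its matching ')'."""
    stack: List[int] = []
    pairs: Dict[int, int] = {}
    for i, ch in enumerate(s):
        if ch == "(":
            stack.append(i)
        elif ch == ")":
            if not stack:
                raise ValueError(f"unbalanced ')' at position {i}")
            pairs[stack.pop()] = i
    if stack:
        raise ValueError(f"unbalanced '(' at position {stack[-1]}")
    return pairs


def compute_wobble_indicator(seq: str, struct: str) -> List[int]:
    """Flag both positions of every G-U (or G-T) base pair with 1."""
    flags = [0] * len(seq)
    for i, j in find_parentheses(struct).items():
        if {seq[i], seq[j]} in ({"G", "U"}, {"G", "T"}):
            flags[i] = flags[j] = 1
    return flags
```

The stack makes the pairing a single left-to-right pass, which is why nested stems such as `((..))` resolve innermost first.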

---

### RNAutils.py

| Function | Signature | Description |
|----------|-----------|-------------|
| `RNAfold` | `(seqs, bin, temp, span, cmd) → List[[str, float]]` | MFE structure prediction |
| `RNAsubopt` | `(seq, bin, delta) → List[(str, float)]` | Suboptimal structures |
| `RNAsample` | `(seqs, bin, temp, n, span) → List[List[str]]` | Boltzmann sampling |
| `RNA_partition_function` | `(seqs, constraints, ...) → List[float]` | Partition function |

---

### compute_coupling.py

| Function | Signature | Description |
|----------|-----------|-------------|
| `collect_barcodes` | `(r1, r2, r1_q, r2_q) → None` | Extract barcode-exon pairs |

Global variables: `couplings`, `good_reads`, `reads_with_N`, `unidentified_reads`

---

### compute_splicing_outcomes.py

| Function | Signature | Description |
|----------|-----------|-------------|
| `identify_splicing_pattern` | `(r1, r2, r1_q, r2_q) → None` | Classify splicing |

Splicing categories: `num_exon_inclusion`, `num_exon_skipping`, `num_intron_retention`, `num_splicing_in_exon`, `num_unknown_splicing`

---

### generate_training_data.py

| Function | Signature | Description |
|----------|-----------|-------------|
| `read_dataset` | `(path, filter=True) → DataFrame` | Load filtered CSV |
| `to_input_data` | `(df, flanking=10) → tuple` | Create model inputs |
| `to_target_data` | `(df) → ndarray` | Compute PSI values |
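
PSI (percent spliced in) is conventionally the fraction of reads supporting exon inclusion. The sketch below shows that standard calculation from inclusion and skipping counts. Whether `to_target_data` uses exactly these two categories, and how it treats sequences with zero reads, is an assumption; the function name `psi` is hypothetical.

```python
import math
from typing import List, Sequence


def psi(inclusion: Sequence[int], skipping: Sequence[int]) -> List[float]:
    """PSI = inclusion / (inclusion + skipping), NaN when there are no reads."""
    values = []
    for inc, skip in zip(inclusion, skipping):
        total = inc + skip
        values.append(inc / total if total else math.nan)
    return values
```

A sequence with 30 inclusion reads and 10 skipping reads therefore has a PSI of 0.75.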

---

## Model Training Module

### model.py

#### Custom Layers

| Class | Purpose | Key Parameters |
|-------|---------|----------------|
| `Selector` | Select between inputs | `trainable=False` |
| `ResidualTuner` | Residual MLP | `hidden_units=100` |
| `SumDiff` | Energy difference | `freeze=False` |
| `RegularizedBiasLayer` | Position bias | Regularization params |

#### Regularizers

| Class/Function | Purpose |
|----------------|---------|
| `MultiRegularizer` | Combined regularizer |
| `pos_reg` | L2 position penalty |
| `adj_reg_fo` | First-order smoothness |
| `adj_reg_so` | Second-order smoothness |
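
To show what the three penalty types typically compute, here is a plain-Python sketch over a one-dimensional list of position weights. The function names are hypothetical and the squared-difference form is an assumption; the project's regularizers operate on tensors and may scale or combine the terms differently.

```python
from typing import Sequence


def position_penalty(weights: Sequence[float], strength: float) -> float:
    """L2 penalty: strength times the sum of squared weights."""
    return strength * sum(w * w for w in weights)


def first_order_penalty(weights: Sequence[float], strength: float) -> float:
    """Penalise differences between neighbouring positions."""
    return strength * sum((b - a) ** 2 for a, b in zip(weights, weights[1:]))


def second_order_penalty(weights: Sequence[float], strength: float) -> float:
    """Penalise curvature, i.e. squared second differences."""
    triples = zip(weights, weights[1:], weights[2:])
    return strength * sum((a - 2 * b + c) ** 2 for a, b, c in triples)
```

A first-order penalty pulls neighbouring weights toward each other, while a second-order penalty leaves any straight-line trend unpenalised and only discourages bends.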

#### Functions

| Function | Signature | Description |
|----------|-----------|-------------|
| `binary_KL` | `(y_true, y_pred) → scalar` | Binary KL divergence loss |
| `regularized_act` | `(x, reg, act) → tensor` | Activation with regularization |
| `train_model` | `(model, X, y, file, ...) → history` | Train with checkpointing |
| `get_model` | `(**kwargs) → Model` | Create model instance |
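
For reference, the binary KL divergence between a target probability and a predicted probability can be written for a single pair of scalars as below. This is a sketch of the standard formula, not the project's TensorFlow loss: the lowercase name `binary_kl`, the clipping constant `eps`, and the absence of any batch reduction are assumptions.

```python
import math


def binary_kl(y_true: float, y_pred: float, eps: float = 1e-7) -> float:
    """KL divergence between Bernoulli(y_true) and Bernoulli(y_pred)."""
    # Clip both probabilities away from 0 and 1 so the logs stay finite.
    y = min(max(y_true, eps), 1.0 - eps)
    p = min(max(y_pred, eps), 1.0 - eps)
    return y * math.log(y / p) + (1.0 - y) * math.log((1.0 - y) / (1.0 - p))
```

The value is zero when prediction and target agree and grows as they diverge, which suits targets such as PSI that are themselves probabilities rather than hard labels.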

#### get_model() Parameters

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `input_length` | int | 90 | Sequence length |
| `randomized_region` | tuple | (10, 80) | Exon position |
| `num_filters` | int | 20 | Sequence filters |
| `num_structure_filters` | int | 8 | Structure filters |
| `filter_width` | int | 6 | Sequence filter size |
| `structure_filter_width` | int | 30 | Structure filter size |
| `dropout_rate` | float | 0.01 | Dropout probability |
| `activity_regularization` | float | 0.0 | Activation L1 |
| `position_regularization` | float | 2.5e-5 | Position L2 |
| `adjacency_regularization` | float | 0.0 | First-order smoothness |
| `adjacency_regularization_so` | float | 0.0 | Second-order smoothness |
| `energy_activation` | str | "softplus" | Energy activation |
| `tune_energy` | bool | True | Train energy params |
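
Because `get_model` takes keyword arguments only, one way to keep experiments reproducible is to build the argument dictionary explicitly from the documented defaults. The helper `model_kwargs` below is hypothetical and not part of the project; it only merges overrides and rejects names that are not in the table. Whether `get_model` itself rejects unknown keywords is not documented here.

```python
# Defaults copied from the parameter table above.
GET_MODEL_DEFAULTS = {
    "input_length": 90,
    "randomized_region": (10, 80),
    "num_filters": 20,
    "num_structure_filters": 8,
    "filter_width": 6,
    "structure_filter_width": 30,
    "dropout_rate": 0.01,
    "activity_regularization": 0.0,
    "position_regularization": 2.5e-5,
    "adjacency_regularization": 0.0,
    "adjacency_regularization_so": 0.0,
    "energy_activation": "softplus",
    "tune_energy": True,
}


def model_kwargs(**overrides):
    """Hypothetical helper: merge overrides into the documented defaults."""
    unknown = set(overrides) - set(GET_MODEL_DEFAULTS)
    if unknown:
        raise TypeError(f"unknown get_model parameter(s): {sorted(unknown)}")
    return {**GET_MODEL_DEFAULTS, **overrides}


kwargs = model_kwargs(num_filters=32, dropout_rate=0.05)
# With the project importable, these would be passed on as:
#   from model_training.model import get_model
#   model = get_model(**kwargs)
```

Note that the default `randomized_region` of `(10, 80)` spans 70 positions, matching a 70 nt exon with 10 nt of flanking sequence on each side of a 90 nt input.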

---

## Figures Module

### force_plot.py

| Function | Signature | Description |
|----------|-----------|-------------|
| `get_link_midpoint` | `(fn, mid, eps, ...) → float` | Find sigmoid midpoint |
| `collapse_filters` | `(act_i, act_s, ...) → (df_i, df_s)` | Group filter activations |
| `create_force_data` | `(act_i, act_s, ...) → (Series, Series)` | Aggregate forces |
| `merge_small_forces` | `(forces, thresh) → Series` | Combine small contributions |
| `draw_force_plot` | `(seqs, annots, ...) → Figure` | Create visualization |

### sequence_logo.py

| Function | Signature | Description |
|----------|-----------|-------------|
| `plot_logo` | `(df, thresh, ax, colors) → None` | Draw sequence logo |
| `compute_freqs` | `(kmers) → DataFrame` | Nucleotide frequencies |
| `compute_info` | `(freqs) → ndarray` | Information content |
| `compute_heights` | `(freqs) → DataFrame` | Logo heights |
| `sequence_logo_heights` | `(df) → DataFrame` | Combined calculation |
| `draw_floating_logo` | `(heights, ..., ax) → None` | Overlay logo on axes |
| `compute_EDLogo_scores` | `(kmers, normed) → DataFrame` | Enrichment/depletion |
| `plot_EDLogo` | `(df, thresh, ax) → None` | Draw ED logo |
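
The frequency, information, and height steps follow the usual sequence-logo recipe: letter height is the nucleotide frequency multiplied by the information content of that position. The sketch below implements that recipe with dictionaries and lists instead of `DataFrame` and `ndarray`. It assumes a four-letter alphabet and applies no small-sample correction, either of which may differ from the project code.

```python
import math
from collections import Counter
from typing import Dict, List, Sequence


def compute_freqs(kmers: Sequence[str]) -> List[Dict[str, float]]:
    """Per-position nucleotide frequencies for equal-length k-mers."""
    n = len(kmers)
    return [
        {nt: count / n for nt, count in Counter(column).items()}
        for column in zip(*kmers)
    ]


def compute_info(freqs: List[Dict[str, float]]) -> List[float]:
    """Bits per position: log2(4) minus the Shannon entropy."""
    return [2.0 + sum(p * math.log2(p) for p in pos.values()) for pos in freqs]


def compute_heights(freqs: List[Dict[str, float]]) -> List[Dict[str, float]]:
    """Letter height = frequency x information content of the position."""
    info = compute_info(freqs)
    return [
        {nt: p * bits for nt, p in pos.items()}
        for pos, bits in zip(freqs, info)
    ]
```

A fully conserved position carries 2 bits and a uniformly random one carries 0, so conserved letters dominate the logo.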

### draw_stem_loop.py

| Function | Signature | Description |
|----------|-----------|-------------|
| `draw_line` | `(d, x1, y1, x2, y2, color) → None` | SVG line |
| `draw_nucleotide` | `(d, x, y, nt, color) → None` | SVG nucleotide circle |
| `draw_oligo` | `(d, xs, ys, nts, colors) → None` | SVG oligonucleotide |
| `draw_stem_loop` | `(nts, stem_len, colors, file) → None` | Complete stem-loop SVG |

### kl.py

| Function | Signature | Description |
|----------|-----------|-------------|
| `knn_distance` | `(point, sample, k) → float` | k-NN distance |
| `verify_sample_shapes` | `(s1, s2, k) → None` | Validate input shapes |
| `naive_estimator` | `(s1, s2, k) → float` | Brute-force KL estimate |
| `scipy_estimator` | `(s1, s2, k) → float` | KDTree-based KL |
| `skl_estimator` | `(s1, s2, k) → float` | sklearn-based KL |
| `skl_estimator_efficient` | `(s1, s2, k) → float` | Vectorized KL |
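
The estimators in `kl.py` all approximate a KL divergence from two samples using k-nearest-neighbour distances. A brute-force version of the widely used k-NN estimator, `(d / n) * sum(log(nu_k / rho_k)) + log(m / (n - 1))`, might look like the sketch below. That the project uses exactly this formula is an assumption, and the sketch uses `math.dist` (Python 3.8+) on tuples rather than NumPy arrays.

```python
import math
from typing import Sequence

Point = Sequence[float]


def knn_distance(point: Point, sample: Sequence[Point], k: int) -> float:
    """Euclidean distance from `point` to its k-th nearest neighbour in `sample`."""
    return sorted(math.dist(point, other) for other in sample)[k - 1]


def naive_estimator(s1: Sequence[Point], s2: Sequence[Point], k: int = 1) -> float:
    """Brute-force k-NN estimate of KL(P || Q) from samples s1 ~ P and s2 ~ Q."""
    n, m, d = len(s1), len(s2), len(s1[0])
    total = 0.0
    for x in s1:
        # k + 1 skips x itself, which sits at distance 0 within its own sample.
        rho = knn_distance(x, s1, k + 1)
        nu = knn_distance(x, s2, k)
        total += math.log(nu / rho)  # fails if s1 contains duplicate points
    return d / n * total + math.log(m / (n - 1))
```

The cost is quadratic in the sample sizes, which is the usual reason for preferring a tree-based variant on larger samples. The estimate can be negative on small samples even though the true divergence cannot.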

### generate_custom_model.py

| Function | Signature | Description |
|----------|-----------|-------------|
| `lanczos_kernel` | `(x, order) → ndarray` | Lanczos interpolation kernel |
| `lanczos_interpolate` | `(arr, positions, order) → ndarray` | Interpolate at positions |
| `lanczos_resampling` | `(arr, new_len, order) → ndarray` | Resample to new length |
| `resample_one_positional_bias` | `(weights, len, pad) → ndarray` | Resample position bias |
| `resample_positional_bias_weights` | `(weights, len, pad) → ndarray` | Resample all biases |
| `generate_custom_model` | `(new_len, delta_basal) → Model` | Create modified model |
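
The Lanczos kernel of order `a` is `sinc(x) * sinc(x / a)` for `|x| < a` and zero elsewhere, where `sinc` is the normalised sinc function. The scalar sketch below shows the kernel and a simple interpolation built on it. The default order of 3, the edge clamping, and the list-based return type are assumptions and may not match how the project resamples positional bias weights.

```python
import math
from typing import List, Sequence


def lanczos_kernel(x: float, order: int = 3) -> float:
    """Lanczos window: sinc(x) * sinc(x / order) inside (-order, order)."""
    if x == 0:
        return 1.0
    if abs(x) >= order:
        return 0.0
    px = math.pi * x
    return order * math.sin(px) * math.sin(px / order) / (px * px)


def lanczos_interpolate(
    arr: Sequence[float], positions: Sequence[float], order: int = 3
) -> List[float]:
    """Interpolate `arr` at fractional `positions` using 2 * order taps."""
    last = len(arr) - 1
    out = []
    for pos in positions:
        base = math.floor(pos)
        acc = 0.0
        for i in range(base - order + 1, base + order + 1):
            sample = arr[min(max(i, 0), last)]  # clamp at the edges (assumption)
            acc += sample * lanczos_kernel(pos - i, order)
        out.append(acc)
    return out
```

At integer positions the kernel is 1 at the sample itself and effectively 0 at every other tap, so the original values are reproduced up to floating-point error.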

### figutils.py

| Function | Signature | Description |
|----------|-----------|-------------|
| `subsample_points` | `(x, y, max) → (x, y)` | Random subsampling |
| `scatter_with_kde` | `(x, y, ax, alpha) → None` | Density scatter plot |
| `safelog` | `(x, tol) → ndarray` | Numerically safe log |
| `bin_kl` | `(y_true, y_pred) → ndarray` | Binary KL divergence |
| `flatten_dict` | `(d) → (keys, values)` | Flatten nested dict |
| `insert_motif_in_middle_of_sequence` | `(seq, motif) → str` | Insert motif |
| `insert_motif_in_middle_of_sequences` | `(seqs, motif) → Dict` | Batch insert |
| `landing_pads_to_sw_exons` | `(mers, motif, pre, post) → List` | Create landing pads |
| `all_seqs` | `(length) → List[str]` | Generate all k-mers |
| `extract_str_patches` | `(lst, n) → List[List[str]]` | Extract n-grams |
| `compute_activations_simple_conv` | `(layer, window) → Dict` | k-mer activations |
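
Two of the helpers above are simple enough to sketch directly: enumerating every k-mer and slicing strings into overlapping n-grams. The `alphabet` parameter, the lexicographic ordering, and the stride of one are assumptions; the project versions may differ.

```python
from itertools import product
from typing import List


def all_seqs(length: int, alphabet: str = "ACGT") -> List[str]:
    """Enumerate every sequence of the given length over the alphabet."""
    return ["".join(letters) for letters in product(alphabet, repeat=length)]


def extract_str_patches(lst: List[str], n: int) -> List[List[str]]:
    """Return, for each string, all overlapping substrings of length n."""
    return [[s[i:i + n] for i in range(len(s) - n + 1)] for s in lst]
```

The number of k-mers grows as `4 ** length` (4,096 for 6-mers), so full enumeration is only practical for short windows.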

---

## Usage Examples

### Making Predictions

```python
import tensorflow as tf
from joblib import load

from model_training.model import (
    binary_KL,
    Selector,
    ResidualTuner,
    SumDiff,
    RegularizedBiasLayer,
)

# Load the model; the custom loss and layers must be registered by name
model = tf.keras.models.load_model(
    'output/custom_adjacency_regularizer_20210731_124_step3.h5',
    custom_objects={
        'binary_KL': binary_KL,
        'Selector': Selector,
        'ResidualTuner': ResidualTuner,
        'SumDiff': SumDiff,
        'RegularizedBiasLayer': RegularizedBiasLayer,
    },
)

# Load test inputs and targets
xTe = load('data/xTe_ES7_HeLa_ABC.pkl.gz')
yTe = load('data/yTe_ES7_HeLa_ABC.pkl.gz')

# Predict
predictions = model.predict(xTe)
```

### Creating Force Plots

```python
import sys

# Make the modules in the figures directory importable
sys.path.append('figures')
from force_plot import draw_force_plot

fig = draw_force_plot(
    sequences=['ATGC' * 22 + 'AT'],  # 90 nt
    annotations=['My Sequence'],
)
fig.savefig('my_force_plot.pdf')
```

### Processing New Sequences

```python
from data_preprocessing.utils import add_flanking, create_input_data

exon = 'ACGT' * 17 + 'AC'  # 70 nt
full_seq = add_flanking(exon, 10)  # 90 nt

# One-hot sequence, one-hot structure, and wobble-pair features
seq_oh, struct_oh, wobbles = create_input_data([full_seq])

# `model` is the Keras model loaded in "Making Predictions"
X = [seq_oh, struct_oh, wobbles]
psi = model.predict(X)[0, 0]
print(f"Predicted PSI: {psi:.3f}")
```