David/miRBind cnn optimization by davidcechak · Pull Request #7 · BioGeMT/miRBind_2.0

davidcechak · 2025-03-06T14:05:35Z

No description provided.

katarinagresova

Please remove duplicate code and move files to their appropriate locations. I will have another look at the code after.

katarinagresova · 2025-03-14T17:06:53Z

+    parser = argparse.ArgumentParser(
+        description="Encode dataset to miRNA x target binding matrix. Outputs numpy file with matrices and and numpy file with corresponding labels. Expected columns of the dataset are 'noncodingRNA', 'gene' and 'label'")
+    parser.add_argument('-i', '--i_file', type=str, required=True, help="Input dataset file name")
+    parser.add_argument('-o', '--o_prefix', type=str, required=True, help="Output file name prefix")


Using two separate arguments, one for the matrix file and one for the label file, would be better. I would also let the user specify the full path to each file, not just a prefix.
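
A minimal sketch of what this suggestion could look like. The flag names --matrix_file and --labels_file and the example paths are hypothetical and not part of the PR:

```python
import argparse


def build_parser():
    """Build a CLI parser that takes one explicit path per output file."""
    parser = argparse.ArgumentParser(
        description="Encode dataset to miRNA x target binding matrix.")
    parser.add_argument('-i', '--i_file', type=str, required=True,
                        help="Path to the input dataset file")
    # Hypothetical flags: full output paths instead of a shared prefix.
    parser.add_argument('--matrix_file', type=str, required=True,
                        help="Path of the output numpy file with encoded matrices")
    parser.add_argument('--labels_file', type=str, required=True,
                        help="Path of the output numpy file with labels")
    return parser


# Parse an explicit argument list so the example does not depend on sys.argv.
args = build_parser().parse_args([
    "-i", "data/train.tsv",
    "--matrix_file", "out/train_matrices.npy",
    "--labels_file", "out/train_labels.npy",
])
print(args.matrix_file, args.labels_file)
```

With separate paths the two outputs no longer have to live in the same directory or share a naming scheme.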

katarinagresova · 2025-03-14T17:08:28Z

-        return ohe_matrix_2d
+    parser = argparse.ArgumentParser(
+        description="Encode dataset to miRNA x target binding matrix. Outputs numpy file with matrices and and numpy file with corresponding labels. Expected columns of the dataset are 'noncodingRNA', 'gene' and 'label'")
+    parser.add_argument('-i', '--i_file', type=str, required=True, help="Input dataset file name")


The help text for this argument is misleading. It says "dataset file name", but the argument expects the path to the data file.

katarinagresova · 2025-03-14T17:10:31Z

+    for index, row in df.iterrows():
+        for bind_index, bind_nt in enumerate(row['gene'].upper()):
+            for ncrna_index, ncrna_nt in enumerate(row['noncodingRNA'].upper()):
+                if ncrna_index >= tensor_dim[1]:


Should we have this check for bind_index as well, in case we use this function on different data?

It could also be solved by slicing the sequence: for bind_index, bind_nt in enumerate(row['gene'][:tensor_dim[0]].upper()):
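
A rough sketch of the slicing idea applied to both axes, for a single sequence pair. The helper name encode_pair and the pairing alphabet below are assumptions for illustration; the alphabet used in the PR is not shown in this hunk:

```python
import numpy as np

# Assumed pairing alphabet: Watson-Crick pairs score 1, everything else 0.
ALPHABET = {"AT": 1.0, "TA": 1.0, "GC": 1.0, "CG": 1.0}


def encode_pair(gene, ncrna, alphabet=ALPHABET, tensor_dim=(50, 20, 1)):
    """Encode one gene/ncRNA pair, truncating both sequences to tensor_dim."""
    matrix = np.zeros(tensor_dim, dtype="float32")
    # Slicing up front makes an index check inside the loops unnecessary.
    gene = gene[:tensor_dim[0]].upper()
    ncrna = ncrna[:tensor_dim[1]].upper()
    for bind_index, bind_nt in enumerate(gene):
        for ncrna_index, ncrna_nt in enumerate(ncrna):
            matrix[bind_index, ncrna_index, 0] = alphabet.get(bind_nt + ncrna_nt, 0.0)
    return matrix


# Sequences longer than the tensor are cut instead of raising an IndexError.
print(encode_pair("A" * 60, "T" * 25).shape)
```

Shorter sequences are left zero-padded, since the matrix starts out as all zeros.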

katarinagresova · 2025-03-14T17:13:43Z

Could we move this file out of the train folder? Maybe we could have a model folder.

katarinagresova · 2025-03-14T17:15:01Z

Is this training script somehow specific to the miRBind_CNN model? Or should it support multiple different models?

The goal was for it to be specific to miRBind. To make it more generic, it could accept the model as an argument; that seems to be the main (maybe the only) thing making it miRBind-specific at the moment. Should we do it like that?


katarinagresova · 2025-03-14T17:32:31Z

+from tensorflow.keras.utils import Sequence
+
+
+class TrainDataGenerator(Sequence):


This looks like a duplicate of the code already in code/machine_learning

katarinagresova · 2025-03-14T17:33:48Z

+
+class TestDataGenerator:
+    def __init__(self, data_path, labels_path, batch_size=32, dataset_size=None):
+        if dataset_size is None:


Why would dataset_size be None for the test file if we require dataset_size for the train file?

katarinagresova · 2025-03-14T17:35:31Z

What is this script used for?

katarinagresova · 2025-03-14T17:35:48Z

What is this script used for?

katarinagresova · 2025-03-14T17:37:01Z

What is this script used for?
If it is for training the model with the best parameters, I would save the parameters to a file and pass that file as input to the training script.
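
One way this could work is a small JSON file written by the tuning step and read by the training script. The helper names, the file name best_params.json and the hyperparameter names and values below are all hypothetical:

```python
import json
import os
import tempfile


def save_params(params, path):
    """Write a dict of hyperparameters to a JSON file."""
    with open(path, "w") as handle:
        json.dump(params, handle, indent=2)


def load_params(path):
    """Read hyperparameters back from a JSON file."""
    with open(path) as handle:
        return json.load(handle)


# Hypothetical hyperparameters, for illustration only.
best_params = {"learning_rate": 0.001, "dropout": 0.3, "batch_size": 32}

# A temporary directory keeps the example free of side effects.
params_path = os.path.join(tempfile.mkdtemp(), "best_params.json")
save_params(best_params, params_path)
loaded = load_params(params_path)
print(loaded)
```

The training script would then only need one argument pointing at the parameter file, rather than hard-coded values.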

davidcechak · 2025-03-18T17:19:36Z

+import time
+
+
+def binding_encoding(df, alphabet, tensor_dim=(50, 20, 1)):


Add the names of the 'gene' and 'noncodingRNA' columns as function arguments:
def binding_encoding(df, alphabet, miRNA_col="noncodingRNA", gene_col="gene", tensor_dim=(50, 20, 1)):
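
A possible body for that signature, as a sketch only. The loop structure follows the hunks quoted in this review; the alphabet format (two-letter pair mapped to a score) and the example column names are assumptions:

```python
import numpy as np


def binding_encoding(df, alphabet, miRNA_col="noncodingRNA", gene_col="gene",
                     tensor_dim=(50, 20, 1)):
    """Encode each (gene, miRNA) row as a 2D pairing matrix.

    `df` only needs column access by name, so a pandas DataFrame or a
    plain dict of lists both work in this sketch.
    """
    genes = list(df[gene_col])
    mirnas = list(df[miRNA_col])
    encoded = np.zeros((len(genes), *tensor_dim), dtype="float32")
    for row_index, (gene, mirna) in enumerate(zip(genes, mirnas)):
        for bind_index, bind_nt in enumerate(gene[:tensor_dim[0]].upper()):
            for ncrna_index, ncrna_nt in enumerate(mirna[:tensor_dim[1]].upper()):
                encoded[row_index, bind_index, ncrna_index, 0] = alphabet.get(
                    bind_nt + ncrna_nt, 0.0)
    return encoded


# Assumed alphabet and hypothetical column names, for illustration only.
alphabet = {"AT": 1.0, "TA": 1.0, "GC": 1.0, "CG": 1.0}
table = {"target": ["ACGT"], "mirna": ["TGCA"]}
matrices = binding_encoding(table, alphabet, miRNA_col="mirna",
                            gene_col="target", tensor_dim=(4, 4, 1))
print(matrices.shape)
```

Keeping the current column names as defaults means existing callers would not need to change.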

katarinagresova · 2025-04-10T09:40:35Z

I don't mind having a miRBind-specific training script. But this one looks like it contains some general functions. At least the DataGenerator or plot_history functions could be used elsewhere too. You already have data generators in code/machine_learning/data_generators.py and plotting functions in code/machine_learning/plots.py.
But if it is miRBind-specific, it feels more like analysis, not generic code.
What do you think?

katarinagresova · 2025-04-10T09:44:09Z

+
+This repository contains scripts for training and evaluating a deep learning model based on variations and tuning of the miRBind architecture for miRNA-binding prediction using eCLIP data from Manakov
+
+1. ../encode_dataset.sh


There is no encode_dataset.sh script in the parent directory.

katarinagresova · 2025-04-10T09:45:21Z

+
+3. train_model.sh
+Trains the model using the optimized hyperparameters until convergence. Saves model checkpoints and training results.
+Requires setting the name (timestamp) for your model


What do you mean by saying it requires setting the name for your model? It looks like the train_model.sh script does that itself.
Also, why do we name the model with a timestamp? I think we could use any name. The output of hyperparameter tuning is always best_model.keras.

katarinagresova · 2025-04-10T09:46:21Z

+Requires setting the name (timestamp) for your model
+
+4. evaluate_model.sh
+Evaluates the trained model on test and left-out datasets. Requires setting the name (timestamp) of your trained model. Generates performance metrics and plots.


Can we have this as an input parameter of the script? The previous script could then print the name of the best model.

katarinagresova · 2025-04-10T09:47:05Z

+Saves results.
+
+../hyperparam_optimization_pipeline.sh
+Orchestrates the (almost) entire workflow (except training until convergence with found hyperpara.) in a single execution, combining dataset encoding, hyperparameter optimization, and model evaluation.


Why is the "training until convergence with found hyperpara." step left out of the pipeline?

katarinagresova · 2025-04-10T10:03:40Z

+
+        indices = np.arange(self.num_samples)
+        if shuffle:
+            np.random.shuffle(indices)


You should set the random seed before shuffling.
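
For example, a seeded NumPy generator makes the shuffle reproducible without touching the global random state. This is a sketch; the function name and the default seed are arbitrary choices, not taken from the PR:

```python
import numpy as np


def shuffled_indices(num_samples, shuffle=True, seed=42):
    """Return sample indices, shuffled reproducibly when shuffle is True."""
    indices = np.arange(num_samples)
    if shuffle:
        # A local generator keeps the result independent of np.random's global state.
        rng = np.random.default_rng(seed)
        rng.shuffle(indices)
    return indices


# The same seed always gives the same order.
print(np.array_equal(shuffled_indices(10, seed=0), shuffled_indices(10, seed=0)))
```

If a different order is wanted per epoch, the seed could be combined with the epoch number so that runs stay reproducible.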

katarinagresova · 2025-04-10T10:04:24Z

+
+class TestDataGenerator:
+    def __init__(self, data_path, labels_path, batch_size=32, dataset_size=None):
+        if dataset_size is None:


Why are we checking for None here? We are not doing it for the train generator.

katarinagresova · 2025-04-10T10:05:12Z

+        self.batch_size = batch_size
+        self.num_samples = dataset_size
+
+    def get_data(self):


Don't we need a __getitem__ method in the test generator?
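
A sketch of what index-based batch access could look like. This is a plain class operating on in-memory arrays, not the PR's TestDataGenerator and not a Keras Sequence subclass; the class name is hypothetical:

```python
import math

import numpy as np


class BatchedTestData:
    """Hypothetical test-data wrapper exposing batches via __len__ and __getitem__."""

    def __init__(self, data, labels, batch_size=32):
        self.data = data
        self.labels = labels
        self.batch_size = batch_size
        self.num_samples = len(data)

    def __len__(self):
        # Number of batches, counting a final partial batch.
        return math.ceil(self.num_samples / self.batch_size)

    def __getitem__(self, index):
        if not 0 <= index < len(self):
            raise IndexError("batch index out of range")
        start = index * self.batch_size
        stop = min(start + self.batch_size, self.num_samples)
        return self.data[start:stop], self.labels[start:stop]


generator = BatchedTestData(np.zeros((70, 50, 20, 1)), np.arange(70), batch_size=32)
last_data, last_labels = generator[len(generator) - 1]
print(len(generator), last_data.shape)
```

Raising IndexError past the last batch also lets a plain for loop iterate over the batches.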

katarinagresova · 2025-04-10T10:08:20Z

-
-        return ohe_matrix_2d
+    parser = argparse.ArgumentParser(
+        description="Encode dataset to miRNA x target binding matrix. Outputs numpy file with matrices and and numpy file with corresponding labels. Expected columns of the dataset are 'noncodingRNA', 'gene' and 'label'")


It would be nice to be able to encode files with other column names as well.
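
On the command line this could be exposed as optional flags that default to the current column names. The flag names below are hypothetical:

```python
import argparse


def build_column_parser():
    """Parser fragment with overridable column names (hypothetical flags)."""
    parser = argparse.ArgumentParser(
        description="Encode dataset to miRNA x target binding matrix.")
    parser.add_argument('--mirna_col', type=str, default='noncodingRNA',
                        help="Name of the column with miRNA sequences")
    parser.add_argument('--gene_col', type=str, default='gene',
                        help="Name of the column with target sequences")
    parser.add_argument('--label_col', type=str, default='label',
                        help="Name of the column with labels")
    return parser


# Override one column name and keep the defaults for the others.
column_args = build_column_parser().parse_args(['--gene_col', 'target'])
print(column_args.mirna_col, column_args.gene_col, column_args.label_col)
```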

katarinagresova · 2025-04-10T10:10:45Z

+def evaluate_model(model, test_data, test_labels, logger, save_plots=True, output_dir='.', pred_threshold=0.5):
+    """Evaluate model performance"""
+    # Get predictions from prediction probabilities
+    y_pred_proba = model.predict(test_data)


I am not sure here, but could evaluation be done in batches as well? Do we also want to support using a GPU?
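
For reference, Keras model.predict accepts a batch_size argument, so passing it explicitly may already be enough. If the test data does not fit in memory at once, batching can also be done by hand, as in this sketch; predict_fn stands in for any callable such as a model's prediction method, and the function name is hypothetical:

```python
import numpy as np


def predict_in_batches(predict_fn, data, batch_size=32):
    """Apply predict_fn to consecutive slices of data and join the results."""
    if len(data) == 0:
        return np.empty((0,))
    outputs = []
    for start in range(0, len(data), batch_size):
        batch = data[start:start + batch_size]
        outputs.append(np.asarray(predict_fn(batch)))
    return np.concatenate(outputs, axis=0)


# Stand-in "model": sums each row. A real model would go here instead.
toy_data = np.arange(20).reshape(10, 2)
toy_predictions = predict_in_batches(lambda batch: batch.sum(axis=1), toy_data, batch_size=3)
print(toy_predictions)
```

The slices work the same way on a memory-mapped array, so only one batch needs to be loaded at a time.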

evaklimentova and others added 17 commits October 15, 2024 11:23

Sync with main

7eb1cef

miRBind CNN cleanup

e92e058

adding pipeline for running miRBin CNN training

63f0558

optimization notebook as tutorial

410cf03

Tmp commit, completition of miRBind CNN retraining and optimisation

2fdcc3d

Add .gitignore

094359b

Encode data for mirbind memory mapping dataset

cf62cce

Use optuna to optimise hyperparamters of miRBind model

3a018ad

Clean up logs which should not be in git

c4bafe4

Evaluate a trained model on test datasets

f9299c1

Train a model with custom hyperparameters

f7d0f92

Run the full hyperparameter optimisation pipeline for miRBind

05f36df

Fix model log output paths

ffff74c

Model training: Add error handling

4f9bd9c

miRBind: extract model compilation to utils

ebfc315

miRBind model: remove pycache file

73fb66a

miRBind training and evaluation: refactore logging, clean bash files

ad84af1

davidcechak requested a review from katarinagresova March 6, 2025 14:05

davidcechak added 2 commits March 10, 2025 22:19

train miRBind with original params: clean bash scripts

8665929

miRBind optuna optimisation: add readme

a475694

katarinagresova requested changes Mar 14, 2025


davidcechak commented Mar 18, 2025


miRBind merge review: move generic scripts to code/machine_learning/

b6dddc4

katarinagresova requested changes Apr 10, 2025


katarinagresova mentioned this pull request Apr 10, 2025

CNN scripts cleanup #2

Closed



3 participants