Skip to content

David/miRBind cnn optimization#7

Open
davidcechak wants to merge 20 commits into
mainfrom
david/miRBind_CNN_optimization
Open

David/miRBind cnn optimization#7
davidcechak wants to merge 20 commits into
mainfrom
david/miRBind_CNN_optimization

Conversation

@davidcechak
Copy link
Copy Markdown
Collaborator

No description provided.

Copy link
Copy Markdown
Member

@katarinagresova katarinagresova left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please remove duplicate code and move files to their appropriate locations. I will have another look at the code after.

parser = argparse.ArgumentParser(
description="Encode dataset to miRNA x target binding matrix. Outputs numpy file with matrices and and numpy file with corresponding labels. Expected columns of the dataset are 'noncodingRNA', 'gene' and 'label'")
parser.add_argument('-i', '--i_file', type=str, required=True, help="Input dataset file name")
parser.add_argument('-o', '--o_prefix', type=str, required=True, help="Output file name prefix")
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using two arguments for matrix file and label file would be better. Also, I would allow the user to define the whole path to the file, not just prefix.

return ohe_matrix_2d No newline at end of file
parser = argparse.ArgumentParser(
description="Encode dataset to miRNA x target binding matrix. Outputs numpy file with matrices and and numpy file with corresponding labels. Expected columns of the dataset are 'noncodingRNA', 'gene' and 'label'")
parser.add_argument('-i', '--i_file', type=str, required=True, help="Input dataset file name")
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Help to this argument is misleading. It says "dataset file name", but it expects the path to the data file.

for index, row in df.iterrows():
for bind_index, bind_nt in enumerate(row['gene'].upper()):
for ncrna_index, ncrna_nt in enumerate(row['noncodingRNA'].upper()):
if ncrna_index >= tensor_dim[1]:
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we have this check for bind_index as well? If we try to use this function for different data.

It could be also solved by for bind_index, bind_nt in enumerate(row['gene'][:tensor_dim[0]].upper()):

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we move this file out of train folder? Maybe we can have model folder

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this training script somehow specific to miRBind_CNN model? Or should it support multiple different models?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The goal was for it to be specific to miRBind. To make it more generic, it could accept the model as an argument - that seems like (maybe the only) thing making it miRBind specific at the moment. Should we do it like that?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't mind having a miRBind-specific training script. But this one looks like there are some general functions. At least DataGenerator or plot_history functions could be used elsewhere too. You already have data generators in code/machine_learning/data_generators.py in code/machine_learning/plots.py
But if it is miRBind-specific, it feels more like analysis, not generic code.
What do you think?

from tensorflow.keras.utils import Sequence


class TrainDataGenerator(Sequence):
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks like a duplicate of the code already in code/machine_learning


class TestDataGenerator:
def __init__(self, data_path, labels_path, batch_size=32, dataset_size=None):
if dataset_size is None:
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why there would be None dataset_size for test file, if we require dataset_size for train file?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is this script used for?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is this script used for?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is this script used for?
If it is for training the model with best parameters, I would save the parameters to some file and pass it as input to training script

import time


def binding_encoding(df, alphabet, tensor_dim=(50, 20, 1)):
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add the string for 'gene' and 'noncodingRNA' columns as the function arguments
def binding_encoding(df, alphabet, miRNA_col="noncodingRNA", gene_col="gene", tensor_dim=(50, 20, 1)):

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't mind having a miRBind-specific training script. But this one looks like there are some general functions. At least DataGenerator or plot_history functions could be used elsewhere too. You already have data generators in code/machine_learning/data_generators.py in code/machine_learning/plots.py
But if it is miRBind-specific, it feels more like analysis, not generic code.
What do you think?


This repository contains scripts for training and evaluating a deep learning model based on variations and tuning of the miRBind architecture for miRNA-binding prediction using eCLIP data from Manakov

1. ../encode_dataset.sh
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there is no encode_dataset.sh script in the parent directory


3. train_model.sh
Trains the model using the optimized hyperparameters until convergence. Saves model checkpoints and training results.
Requires setting the name (timestamp) for your model
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do you meant it requires setting the name for your model? It looks like the train_model.sh script does it itself.
Also, why we name model with timestamp? I think we could use any name. Output of hyperparameter tuning is always best_model.keras

Requires setting the name (timestamp) for your model

4. evaluate_model.sh
Evaluates the trained model on test and left-out datasets. Requires setting the name (timestamp) of your trained model. Generates performance metrics and plots.
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we have this as input param of the script? And the previous script could print the name of the best model.

Saves results.

../hyperparam_optimization_pipeline.sh
Orchestrates the (almost) entire workflow (except training until convergence with found hyperpara.) in a single execution, combining dataset encoding, hyperparameter optimization, and model evaluation. No newline at end of file
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is "except training until convergence with found hyperpara." step missing?


indices = np.arange(self.num_samples)
if shuffle:
np.random.shuffle(indices)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You should set the seed before doing shuffling.


class TestDataGenerator:
def __init__(self, data_path, labels_path, batch_size=32, dataset_size=None):
if dataset_size is None:
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why are we checking for None here? We are not doing it for train generator.

self.batch_size = batch_size
self.num_samples = dataset_size

def get_data(self):
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't need __get_item function in test generator?


return ohe_matrix_2d No newline at end of file
parser = argparse.ArgumentParser(
description="Encode dataset to miRNA x target binding matrix. Outputs numpy file with matrices and and numpy file with corresponding labels. Expected columns of the dataset are 'noncodingRNA', 'gene' and 'label'")
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be nice to encode file with other names of columns as well

def evaluate_model(model, test_data, test_labels, logger, save_plots=True, output_dir='.', pred_threshold=0.5):
"""Evaluate model performance"""
# Get predictions from prediction probabilities
y_pred_proba = model.predict(test_data)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure here, but could be evaluation done in batches as well? Do we also want to support using GPU?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants