David/miRBind cnn optimization#7
Conversation
katarinagresova
left a comment
There was a problem hiding this comment.
Please remove duplicate code and move files to their appropriate locations. I will have another look at the code after.
| parser = argparse.ArgumentParser( | ||
| description="Encode dataset to miRNA x target binding matrix. Outputs numpy file with matrices and and numpy file with corresponding labels. Expected columns of the dataset are 'noncodingRNA', 'gene' and 'label'") | ||
| parser.add_argument('-i', '--i_file', type=str, required=True, help="Input dataset file name") | ||
| parser.add_argument('-o', '--o_prefix', type=str, required=True, help="Output file name prefix") |
There was a problem hiding this comment.
Using two arguments for matrix file and label file would be better. Also, I would allow the user to define the whole path to the file, not just prefix.
| return ohe_matrix_2d No newline at end of file | ||
| parser = argparse.ArgumentParser( | ||
| description="Encode dataset to miRNA x target binding matrix. Outputs numpy file with matrices and and numpy file with corresponding labels. Expected columns of the dataset are 'noncodingRNA', 'gene' and 'label'") | ||
| parser.add_argument('-i', '--i_file', type=str, required=True, help="Input dataset file name") |
There was a problem hiding this comment.
Help to this argument is misleading. It says "dataset file name", but it expects the path to the data file.
| for index, row in df.iterrows(): | ||
| for bind_index, bind_nt in enumerate(row['gene'].upper()): | ||
| for ncrna_index, ncrna_nt in enumerate(row['noncodingRNA'].upper()): | ||
| if ncrna_index >= tensor_dim[1]: |
There was a problem hiding this comment.
Should we have this check for bind_index as well? If we try to use this function for different data.
It could be also solved by for bind_index, bind_nt in enumerate(row['gene'][:tensor_dim[0]].upper()):
There was a problem hiding this comment.
Could we move this file out of train folder? Maybe we can have model folder
There was a problem hiding this comment.
Is this training script somehow specific to miRBind_CNN model? Or should it support multiple different models?
There was a problem hiding this comment.
The goal was for it to be specific to miRBind. To make it more generic, it could accept the model as an argument - that seems like (maybe the only) thing making it miRBind specific at the moment. Should we do it like that?
There was a problem hiding this comment.
I don't mind having a miRBind-specific training script. But this one looks like there are some general functions. At least DataGenerator or plot_history functions could be used elsewhere too. You already have data generators in code/machine_learning/data_generators.py in code/machine_learning/plots.py
But if it is miRBind-specific, it feels more like analysis, not generic code.
What do you think?
| from tensorflow.keras.utils import Sequence | ||
|
|
||
|
|
||
| class TrainDataGenerator(Sequence): |
There was a problem hiding this comment.
This looks like a duplicate of the code already in code/machine_learning
|
|
||
| class TestDataGenerator: | ||
| def __init__(self, data_path, labels_path, batch_size=32, dataset_size=None): | ||
| if dataset_size is None: |
There was a problem hiding this comment.
Why there would be None dataset_size for test file, if we require dataset_size for train file?
There was a problem hiding this comment.
What is this script used for?
There was a problem hiding this comment.
What is this script used for?
There was a problem hiding this comment.
What is this script used for?
If it is for training the model with best parameters, I would save the parameters to some file and pass it as input to training script
| import time | ||
|
|
||
|
|
||
| def binding_encoding(df, alphabet, tensor_dim=(50, 20, 1)): |
There was a problem hiding this comment.
Add the string for 'gene' and 'noncodingRNA' columns as the function arguments
def binding_encoding(df, alphabet, miRNA_col="noncodingRNA", gene_col="gene", tensor_dim=(50, 20, 1)):
There was a problem hiding this comment.
I don't mind having a miRBind-specific training script. But this one looks like there are some general functions. At least DataGenerator or plot_history functions could be used elsewhere too. You already have data generators in code/machine_learning/data_generators.py in code/machine_learning/plots.py
But if it is miRBind-specific, it feels more like analysis, not generic code.
What do you think?
|
|
||
| This repository contains scripts for training and evaluating a deep learning model based on variations and tuning of the miRBind architecture for miRNA-binding prediction using eCLIP data from Manakov | ||
|
|
||
| 1. ../encode_dataset.sh |
There was a problem hiding this comment.
there is no encode_dataset.sh script in the parent directory
|
|
||
| 3. train_model.sh | ||
| Trains the model using the optimized hyperparameters until convergence. Saves model checkpoints and training results. | ||
| Requires setting the name (timestamp) for your model |
There was a problem hiding this comment.
What do you meant it requires setting the name for your model? It looks like the train_model.sh script does it itself.
Also, why we name model with timestamp? I think we could use any name. Output of hyperparameter tuning is always best_model.keras
| Requires setting the name (timestamp) for your model | ||
|
|
||
| 4. evaluate_model.sh | ||
| Evaluates the trained model on test and left-out datasets. Requires setting the name (timestamp) of your trained model. Generates performance metrics and plots. |
There was a problem hiding this comment.
Can we have this as input param of the script? And the previous script could print the name of the best model.
| Saves results. | ||
|
|
||
| ../hyperparam_optimization_pipeline.sh | ||
| Orchestrates the (almost) entire workflow (except training until convergence with found hyperpara.) in a single execution, combining dataset encoding, hyperparameter optimization, and model evaluation. No newline at end of file |
There was a problem hiding this comment.
Why is "except training until convergence with found hyperpara." step missing?
|
|
||
| indices = np.arange(self.num_samples) | ||
| if shuffle: | ||
| np.random.shuffle(indices) |
There was a problem hiding this comment.
You should set the seed before doing shuffling.
|
|
||
| class TestDataGenerator: | ||
| def __init__(self, data_path, labels_path, batch_size=32, dataset_size=None): | ||
| if dataset_size is None: |
There was a problem hiding this comment.
Why are we checking for None here? We are not doing it for train generator.
| self.batch_size = batch_size | ||
| self.num_samples = dataset_size | ||
|
|
||
| def get_data(self): |
There was a problem hiding this comment.
We don't need __get_item function in test generator?
|
|
||
| return ohe_matrix_2d No newline at end of file | ||
| parser = argparse.ArgumentParser( | ||
| description="Encode dataset to miRNA x target binding matrix. Outputs numpy file with matrices and and numpy file with corresponding labels. Expected columns of the dataset are 'noncodingRNA', 'gene' and 'label'") |
There was a problem hiding this comment.
It would be nice to encode file with other names of columns as well
| def evaluate_model(model, test_data, test_labels, logger, save_plots=True, output_dir='.', pred_threshold=0.5): | ||
| """Evaluate model performance""" | ||
| # Get predictions from prediction probabilities | ||
| y_pred_proba = model.predict(test_data) |
There was a problem hiding this comment.
I am not sure here, but could be evaluation done in batches as well? Do we also want to support using GPU?
No description provided.