RNAseq_ML

Investigating methods to undertake feature selection and reduction on RNA-seq data.

A large number of Gleason score 7 samples have been temporarily removed because they were biasing the results: they made up 80 of 105 samples before removal and 10 of 35 after. The data set has 23,281 features.
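The downsampling described above can be sketched as follows. This is a minimal illustration, not the repository's actual code: the helper name `downsample_class`, the random seed, and the split of the 25 non-7 samples across other scores are all assumptions made for the example.

```python
import numpy as np

def downsample_class(y, target_label, keep, seed=0):
    """Return sorted row indices that retain only `keep` samples of `target_label`."""
    rng = np.random.default_rng(seed)
    y = np.asarray(y)
    target_idx = np.flatnonzero(y == target_label)
    other_idx = np.flatnonzero(y != target_label)
    # Randomly pick which samples of the over-represented class survive.
    kept = rng.choice(target_idx, size=min(keep, target_idx.size), replace=False)
    return np.sort(np.concatenate([other_idx, kept]))

# Synthetic labels mirroring the counts above: 80 samples with score 7, 25 others.
# The breakdown of the other 25 samples is hypothetical.
y = np.array([7] * 80 + [6] * 10 + [8] * 10 + [9] * 5)
idx = downsample_class(y, target_label=7, keep=10)
y_balanced = y[idx]
```

The same `idx` array would be used to subset the rows of the expression matrix so that samples and labels stay aligned.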

Draft scripts for finding the best feature selection methods for RNA-seq data, using the scikit-learn (sklearn) package. To run:
$ python3 test_script.py

Methods include:

PCA is also performed.
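A PCA step of this kind might look like the sketch below. The random matrix stands in for the real expression data, and the choice of two components and of standardising the features first are assumptions, not details taken from the scripts.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Placeholder for a samples x genes expression matrix (35 samples, 200 genes).
X = rng.normal(size=(35, 200))

# Standardise each gene so that high-variance genes do not dominate the components.
X_scaled = StandardScaler().fit_transform(X)

pca = PCA(n_components=2)
components = pca.fit_transform(X_scaled)   # shape: (n_samples, 2)
explained = pca.explained_variance_ratio_  # fraction of variance per component
```

Plotting `components[:, 0]` against `components[:, 1]`, coloured by Gleason score, gives the usual two-dimensional PCA view.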

Multilabel confusion matrix (normalised over the true labels):
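Assuming this means a multiclass confusion matrix in which each row (true class) sums to 1, scikit-learn's `confusion_matrix` with `normalize="true"` produces it. The labels and predictions below are made up for illustration.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical true and predicted Gleason scores.
y_true = np.array([6, 6, 7, 7, 7, 7, 8, 8, 9, 9])
y_pred = np.array([6, 7, 7, 7, 7, 8, 8, 8, 9, 7])

# normalize="true" divides each row by the number of samples in that true class,
# so entry (i, j) is the fraction of class i samples predicted as class j.
cm = confusion_matrix(y_true, y_pred, labels=[6, 7, 8, 9], normalize="true")
```

Row normalisation makes classes of different sizes comparable, which matters when one score is far more common than the others.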

Validation curve:
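A validation curve compares training and cross-validated scores across values of one hyperparameter. The sketch below uses scikit-learn's `validation_curve`; the estimator, the `max_depth` parameter, and the synthetic data are assumptions chosen for the example, since the scripts' actual settings are not stated here.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import validation_curve

# Synthetic stand-in for the expression data.
X, y = make_classification(n_samples=120, n_features=30, n_informative=5, random_state=0)

param_range = [1, 2, 4, 8]
train_scores, valid_scores = validation_curve(
    RandomForestClassifier(n_estimators=20, random_state=0),
    X,
    y,
    param_name="max_depth",
    param_range=param_range,
    cv=3,
)

# One row per parameter value, one column per CV fold; average over folds to plot.
train_mean = train_scores.mean(axis=1)
valid_mean = valid_scores.mean(axis=1)
```

A widening gap between `train_mean` and `valid_mean` as the parameter grows is the usual sign of overfitting.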

At the end, the cross-validation scores are plotted for each feature selection method, in the order in which the methods were applied:
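One way to collect a cross-validation score per selection method is to wrap each selector in a pipeline, so that selection is refitted inside every fold. The two selectors, the classifier, and the synthetic data below are placeholders for illustration and are not necessarily the methods used in the scripts.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, VarianceThreshold, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=120, n_features=50, n_informative=5, random_state=0)

# Hypothetical set of selection methods, keyed by a display name.
selectors = {
    "variance_threshold": VarianceThreshold(threshold=0.0),
    "select_k_best": SelectKBest(f_classif, k=10),
}

cv_scores = {}
for name, selector in selectors.items():
    # Selection happens inside the pipeline, so no test-fold data leaks into it.
    model = make_pipeline(selector, StandardScaler(), LogisticRegression(max_iter=1000))
    cv_scores[name] = cross_val_score(model, X, y, cv=5).mean()
```

Because Python dictionaries preserve insertion order, plotting `cv_scores` as a bar chart keeps the methods in the order they were run.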

For the high correlation filter, a heatmap is generated:
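A high correlation filter drops one feature from each pair whose absolute correlation exceeds a threshold. The sketch below is one common way to do this with NumPy; the function name `high_correlation_filter` and the 0.9 threshold are assumptions, and the returned correlation matrix is what a heatmap would display.

```python
import numpy as np

def high_correlation_filter(X, threshold=0.9):
    """Return (indices of columns to keep, absolute correlation matrix)."""
    corr = np.abs(np.corrcoef(X, rowvar=False))
    # Keep only the strict upper triangle so each pair (i, j), i < j, is checked once.
    upper = np.triu(corr, k=1)
    # Drop column j if it is highly correlated with any earlier column i.
    drop = np.flatnonzero((upper > threshold).any(axis=0))
    keep = np.setdiff1d(np.arange(X.shape[1]), drop)
    return keep, corr

rng = np.random.default_rng(0)
base = rng.normal(size=(50, 3))
# Column 3 is a near copy of column 0, so the filter should remove it.
X = np.column_stack([base, base[:, 0] + 0.01 * rng.normal(size=50)])
keep, corr = high_correlation_filter(X, threshold=0.9)
X_filtered = X[:, keep]
```

With 23,281 features the full correlation matrix is large (roughly 4 GB in float64), so in practice it may be worth filtering on variance first or computing correlations in blocks.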

Feature importance is also extracted and plotted at each step:
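Feature importances can be read from any fitted tree-based scikit-learn estimator through its `feature_importances_` attribute. The random forest and synthetic data below are assumptions for illustration; the scripts may use a different model.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=120, n_features=20, n_informative=4, random_state=0)

forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Impurity-based importances, one per feature, normalised to sum to 1.
importances = forest.feature_importances_

# Indices of the five most important features, highest first.
top = np.argsort(importances)[::-1][:5]
```

Repeating this after each selection step shows whether the same features stay important as the feature set shrinks.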
