Skip to content

Latest commit

 

History

History
44 lines (30 loc) · 1.56 KB

File metadata and controls

44 lines (30 loc) · 1.56 KB

RNAseq_ML

Investigating methods to undertake feature selection and reduction on RNA-seq data.

Data from: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE54460 https://www.ncbi.nlm.nih.gov/geo/download/?acc=GSE54460&format=file&file=GSE54460%5FFPKM%2Dgenes%2DTopHat2%2D106samples%2D12%2D4%2D13%2Etxt%2Egz

A large number of Gleason score 7s (80/105 -> 10/35) have been temp removed as they were causing bias (with 23,281 features).

Preliminary data generated from manual methods script

Draft scripts for discovering best methods for feature selection with RNAseq data (using sklearn package). To run:
$ python3 test_script.py

Mehods include:

  • Removing low variance
  • Univariate filter
  • High correlated features filter
  • Low target correlation filter
  • Recursive feature elimination
  • Feature selection from model
  • Tree based selection
  • L1-based selection

Also PCA analysis

Multilabel confusion matrix (normalised for true data):
MlCM

Validation curve:
validation curve example

Feature cross validation scores are visualised in order of method used for feature selection at the end: cross validation scores example

For the high correlation filter, a heatmap is generated: heatplot example

Feature importance is also extracted and plotted at each step: relative feature importance example