DataScience II

This repo is a set of programs reflecting some of the data science methods and topics I learned during the fall of 2017.

Included are:

a logistic regression document classifier with weights learned via gradient descent
a logistic regression document classifier with weights learned via a simple genetic programming algorithm
an implementation of the MultiRankWalk algorithm
an implementation of a stochastic Singular Value Decomposition solver.

Logistic Regression Document Classifier (GD)

Given a set of training documents, X, and their corresponding labels, y, the parameter vector θ=[θ₁,θ₂,...,θ_n]^T is learned via gradient descent. If a set of unlabeled documents is provided as a command line argument, the predicted classes for those documents are printed. If the ground truth for the unlabeled documents is also provided, the accuracy of the predictions is reported as well.

python LogisticRegressionClassifier.py --help

OR

python LogisticRegressionClassifier.py -h

for more information.
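The training loop described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the code in this repo: the function names, the learning rate, the iteration count, and the toy term-count matrix are all assumptions made for the example, and it uses plain batch gradient descent on the mean log-loss.

```python
import numpy as np

def sigmoid(z):
    """Logistic function mapping real scores to probabilities."""
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic_gd(X, y, lr=0.1, n_iters=1000):
    """Learn theta by batch gradient descent on the mean log-loss.

    lr and n_iters are illustrative defaults, not values from the repo.
    """
    n_samples, n_features = X.shape
    theta = np.zeros(n_features)
    for _ in range(n_iters):
        p = sigmoid(X @ theta)             # predicted P(y = 1 | x)
        grad = X.T @ (p - y) / n_samples   # gradient of the mean log-loss
        theta -= lr * grad
    return theta

def predict(X, theta):
    """Assign class 1 when the predicted probability is at least 0.5."""
    return (sigmoid(X @ theta) >= 0.5).astype(int)

# Hypothetical toy "documents": a bias column followed by two term counts.
X_train = np.array([[1, 3, 0], [1, 2, 1], [1, 0, 3], [1, 1, 2]], dtype=float)
y_train = np.array([1, 1, 0, 0])

theta = fit_logistic_gd(X_train, y_train)
y_pred = predict(X_train, theta)
accuracy = np.mean(y_pred == y_train)
print(y_pred, accuracy)
```

When ground-truth labels are available, accuracy is just the fraction of predictions that match them, as in the last lines of the sketch.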

Logistic Regression Binary Document Classifier (GP)

Given a set of training documents, X, and their corresponding labels, y, the parameter vector θ=[θ₁,θ₂,...,θ_n]^T is learned via genetic programming. If a set of unlabeled documents is provided as a command line argument, the predicted classes for those documents are printed. If the ground truth for the unlabeled documents is also provided, the accuracy of the predictions is reported as well.

python GeneticOptimizer.py --help

OR

python GeneticOptimizer.py -h
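To give a feel for learning θ without gradients, here is a small evolutionary sketch. It is an assumption about the general shape of such an optimizer, not the code in GeneticOptimizer.py: it uses a plain genetic algorithm (truncation selection, uniform crossover, Gaussian mutation, one elite survivor), and the population size, generation count, mutation scale, and toy data are all hypothetical.

```python
import numpy as np

def sigmoid(z):
    """Logistic function, clipped to avoid overflow in exp."""
    return 1.0 / (1.0 + np.exp(-np.clip(z, -30, 30)))

def fitness(theta, X, y):
    """Higher is better: training accuracy of the weight vector theta."""
    pred = (sigmoid(X @ theta) >= 0.5).astype(int)
    return np.mean(pred == y)

def evolve(X, y, pop_size=30, n_gens=40, mut_scale=0.3, seed=0):
    """Evolve a population of weight vectors; all defaults are illustrative."""
    rng = np.random.default_rng(seed)
    n_features = X.shape[1]
    pop = rng.normal(size=(pop_size, n_features))
    for _ in range(n_gens):
        scores = np.array([fitness(ind, X, y) for ind in pop])
        order = np.argsort(scores)[::-1]
        parents = pop[order[: pop_size // 2]]   # keep the fitter half
        # Uniform crossover: each gene comes from one of two random parents.
        idx_a = rng.integers(len(parents), size=pop_size)
        idx_b = rng.integers(len(parents), size=pop_size)
        mask = rng.random((pop_size, n_features)) < 0.5
        children = np.where(mask, parents[idx_a], parents[idx_b])
        # Gaussian mutation on every gene.
        children += rng.normal(scale=mut_scale, size=children.shape)
        children[0] = parents[0]                # elitism: best survives unchanged
        pop = children
    scores = np.array([fitness(ind, X, y) for ind in pop])
    return pop[np.argmax(scores)]

# Hypothetical toy "documents": a bias column followed by two term counts.
X_train = np.array([[1, 3, 0], [1, 2, 1], [1, 0, 3], [1, 1, 2]], dtype=float)
y_train = np.array([1, 1, 0, 0])

best = evolve(X_train, y_train)
best_fitness = fitness(best, X_train, y_train)
print(best, best_fitness)
```

Using accuracy as the fitness function keeps the sketch short. A smoother objective such as negative log-loss would give the search more signal on harder data.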

MultiRankWalk

The MultiRankWalk algorithm is a semi-supervised learning algorithm. Given a small set of labeled training instances (i.e., points in n-dimensional space), the algorithm classifies a set of unlabeled instances in the same space. The details of the algorithm may be found here. This particular implementation is designed to test the effect of several of the algorithm's hyperparameters: the size of the "seed" set (i.e., the labeled instances), the method of choosing seeds from the seed set (at random or ranked according to a fitness criterion), the damping parameter, and the gamma parameter of the RBF kernel used to measure instance similarity.

python MultiRankWalk.py --help

OR

python MultiRankWalk.py -h
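The core idea can be sketched as one random walk with restart per class over an RBF affinity graph, with each instance assigned to the class whose walk ranks it highest. This is a simplified illustration under stated assumptions, not the code in MultiRankWalk.py: the function names, the seed dictionary format, the fixed iteration count, and the toy points are hypothetical, and every class is assumed to have at least one seed.

```python
import numpy as np

def rbf_affinity(X, gamma=1.0):
    """Pairwise RBF similarities exp(-gamma * ||xi - xj||^2), zero diagonal."""
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    A = np.exp(-gamma * sq_dists)
    np.fill_diagonal(A, 0.0)
    return A

def multi_rank_walk(X, seeds, n_classes, gamma=1.0, damping=0.85, n_iters=100):
    """Label every row of X; seeds maps instance index -> class label.

    Assumes each class has at least one seed and no instance is isolated.
    """
    A = rbf_affinity(X, gamma)
    W = A / A.sum(axis=0, keepdims=True)     # column-stochastic transitions
    n = X.shape[0]
    scores = np.zeros((n, n_classes))
    for c in range(n_classes):
        u = np.zeros(n)                      # restart distribution for class c
        for i, label in seeds.items():
            if label == c:
                u[i] = 1.0
        u /= u.sum()
        r = u.copy()
        for _ in range(n_iters):             # r = (1 - d) * u + d * W r
            r = (1.0 - damping) * u + damping * (W @ r)
        scores[:, c] = r
    return scores.argmax(axis=1)

# Hypothetical toy data: two well-separated clusters, one seed in each.
X = np.array([[0.0, 0.0], [0.5, 0.2], [0.1, 0.6],
              [5.0, 5.0], [5.4, 4.8], [4.9, 5.5]])
seeds = {0: 0, 3: 1}

labels = multi_rank_walk(X, seeds, n_classes=2)
print(labels)
```

In this sketch, `gamma` controls how quickly similarity decays with distance, and `damping` controls how far each walk spreads from its seeds before restarting. These are the same two knobs the description above mentions.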

StochasticSVD

This is an implementation of a stochastic singular value decomposition (SSVD) solver. It compares its results to those of both the deterministic and the stochastic SVD solvers implemented in the scikit-learn package. Interestingly, this implementation far outperforms the stochastic SVD solver in scikit-learn in every tested environment. No data need be provided to the program; it runs on the MNIST dataset by default. The performance of the implemented SSVD solver versus the stock SSVD solver and the deterministic solver is plotted to the screen.

python StochasticSVD.py --help

OR

python StochasticSVD.py -h
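For readers unfamiliar with stochastic SVD, here is a short NumPy sketch of one common approach: project the matrix onto a random low-dimensional subspace, orthonormalize, and take an exact SVD of the much smaller projected matrix. It is not the solver in StochasticSVD.py. The function name, the oversampling and power-iteration defaults, and the synthetic low-rank test matrix are assumptions for the example, and it compares only against NumPy's deterministic SVD rather than scikit-learn or MNIST.

```python
import numpy as np

def randomized_svd(A, k, n_oversamples=10, n_power_iters=2, seed=0):
    """Approximate the top-k SVD of A with a random range finder.

    n_oversamples and n_power_iters are illustrative defaults.
    """
    rng = np.random.default_rng(seed)
    n_cols = A.shape[1]
    # Sample the range of A with a Gaussian test matrix.
    omega = rng.normal(size=(n_cols, k + n_oversamples))
    Y = A @ omega
    # Power iterations sharpen the spectrum; re-orthonormalize for stability.
    for _ in range(n_power_iters):
        Q, _ = np.linalg.qr(Y)
        Y = A @ (A.T @ Q)
    Q, _ = np.linalg.qr(Y)
    # Exact SVD of the small projected matrix, then map back.
    B = Q.T @ A
    U_b, s, Vt = np.linalg.svd(B, full_matrices=False)
    U = Q @ U_b
    return U[:, :k], s[:k], Vt[:k]

# Hypothetical test matrix of exact rank 5.
rng = np.random.default_rng(42)
A = rng.normal(size=(60, 5)) @ rng.normal(size=(5, 40))

U, s, Vt = randomized_svd(A, k=5)
s_exact = np.linalg.svd(A, compute_uv=False)[:5]
print(np.max(np.abs(s - s_exact)))
```

The expensive decomposition is applied only to the small matrix `B`, whose row count is `k + n_oversamples`, which is where the speedup over a full deterministic SVD comes from on large inputs.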

