50 commits
93876a4
WIP with pipeline. Spark and HDFS are up, but postgresql is still bei…
bluerider Jan 23, 2019
22c9cad
S3 -> Spark -> Pipeline up
bluerider Jan 24, 2019
c6aeb27
Refactored code to use a better project structure
bluerider Jan 30, 2019
d66b1b9
Fix up some typos affecting functions in multiplyImages
bluerider Jan 30, 2019
57d106a
Fix up some typos affecting functions in classifyImages
bluerider Jan 30, 2019
d2b6b2d
Fix sourcing bash files
bluerider Jan 30, 2019
b895541
Do not add any files downloaded in /tools since that is supposed to b…
bluerider Jan 30, 2019
a294915
Remove foreachPartition support
bluerider Jan 30, 2019
86607e5
Make the tool directory for pegasus if needed
bluerider Jan 30, 2019
cafa582
Add warning about launching ssh-agent
bluerider Jan 30, 2019
cdf4b61
Expand main.sh to use functions and have command line switches
bluerider Jan 30, 2019
2017605
Add some echo statements to make things look better
bluerider Jan 30, 2019
b229fe3
Add more echo statements to inform user of what is happening
bluerider Jan 30, 2019
d2313ac
Improved main.sh to add more command switches
bluerider Jan 30, 2019
b302c2b
Change the name for --duplicate-images to --multiply-images
bluerider Jan 30, 2019
236c4e0
Add images for README
bluerider Jan 30, 2019
0590a5b
Update Readme
bluerider Jan 30, 2019
ba4e60c
Fix image loading for README.md
bluerider Jan 30, 2019
17d3338
Add some more code commenting for the main insertion functoin in clas…
bluerider Jan 30, 2019
feb5046
Add new simple classifier
bluerider Jan 31, 2019
5f2c0a8
Fixed the pipeline to properly classify crystal images using the
bluerider Feb 2, 2019
799f9a5
Pass the savedmodel.zip file and also only download it on ec2 instance
bluerider Feb 2, 2019
7239758
Classification pipeline is working, still some issues with too many c…
bluerider Feb 3, 2019
0246588
Add some personal test images
bluerider Feb 4, 2019
9918981
Fix up classifyImages to properly send over the model
bluerider Feb 4, 2019
e749ebc
Add dash server support
bluerider Feb 5, 2019
5506c8a
Add web server setup script with ec2 control machine instance
bluerider Feb 5, 2019
ba2cf2c
Fix up webserver ssh issues by installing ssh via pegasus
bluerider Feb 5, 2019
39488e8
Retool classifyImages.py
bluerider Feb 5, 2019
c5eb278
Clean up running the dash server and include a better name for the app
bluerider Feb 5, 2019
e1e2238
Update with url to website
bluerider Feb 5, 2019
aad4fc7
Add the www to avoid the url switch
bluerider Feb 5, 2019
cbc2a96
Fix up some image classificatoin methods
bluerider Feb 7, 2019
93f2a23
Updated bootstrap code to use a logo for github and to open a new window
bluerider Feb 7, 2019
6fb35a8
Use static files to serve git-hub logo
bluerider Feb 7, 2019
7848b34
Fix up some colors for counting crystals from database
bluerider Feb 7, 2019
28e2b47
Update pipeline image
bluerider Feb 19, 2019
e783655
Properly label images
bluerider Feb 19, 2019
1d88487
Remove uneeded "/" in README.md
bluerider Feb 19, 2019
f07e6cc
Update README.md
bluerider Mar 15, 2019
c4ce6e4
Update README.md
bluerider Mar 15, 2019
97c9c63
Update README.md
bluerider Mar 15, 2019
d9a9a87
Update README.md
bluerider Mar 15, 2019
7f44f4b
Update README.md
bluerider Mar 15, 2019
ac5e645
Update README.md
bluerider Mar 15, 2019
7e92d4f
Add better code commenting
bluerider Mar 18, 2019
c3e69c3
Add some readme's for src folder.
bluerider Mar 18, 2019
33852f4
Update README.md
bluerider Mar 18, 2019
a402cdb
Add README.md for src/
bluerider Mar 18, 2019
0f1ea11
Merge branch 'development' of github.com:bluerider/crystal-base into …
bluerider Mar 18, 2019
3 changes: 3 additions & 0 deletions .gitignore
@@ -0,0 +1,3 @@
config/bash/env.sh
**/.ipynb*
tools/
67 changes: 65 additions & 2 deletions README.md
@@ -1,2 +1,65 @@
# insight-crystal-project
Crystal Database
# Crystal-Base
# Table of Contents
1. [Protein Crystallization Challenges](README.md#Protein-Crystallization-Challenges)
2. [Dataset](README.md#Dataset)
3. [Architecture](README.md#Architecture)
4. [Web App](README.md#Web-App)

## Protein Crystallization Challenges

Crystal-Base is an image classification pipeline that reports whether or not an image contains a protein crystal. It is aimed at academic and industrial researchers running large-scale high-throughput screening (HTS) protein crystallization projects who do not want to spend time on the mundane task of identifying possible protein crystals in their crystallization screens.

![Image of Protein Crystal Screen](images/Crystal-Screen.png)

## Dataset

All protein crystal data was obtained from the [Marco Database](https://marco.ccr.buffalo.edu/).

## Architecture
![Image of Pipeline](images/Pipeline.png)

### Setting up AWS

Crystal-base uses **pegasus** to set up AWS clusters with configurations in **yaml** files.

Run `./main.sh --setup-pegasus` to install pegasus.

Run `./main.sh --setup-config` to set up the bash environment.

Run `./main.sh --setup-database` to set up a Postgres database.

Run `./main.sh --setup-hadoop` to set up a Hadoop cluster.

Run `./main.sh --setup-spark` to set up a Spark cluster.

Run `./main.sh --setup-web-server` to set up a web server.
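
The setup switches above can be chained from a small driver script. The following is a minimal sketch, assuming the switches are meant to run in the order listed and that `main.sh` sits in the current directory; `setup_commands` and `run_setup` are hypothetical helper names, not part of the repository.

```python
import subprocess

# Setup switches in the order the README lists them (that this order is
# required is an assumption).
SETUP_SWITCHES = [
    "--setup-pegasus",
    "--setup-config",
    "--setup-database",
    "--setup-hadoop",
    "--setup-spark",
    "--setup-web-server",
]


def setup_commands(script="./main.sh"):
    """Return one argv list per setup step."""
    return [[script, switch] for switch in SETUP_SWITCHES]


def run_setup(script="./main.sh", dry_run=True):
    """Run each setup step in order, stopping at the first failure.

    With dry_run=True the commands are only printed, not executed.
    """
    for argv in setup_commands(script):
        print(" ".join(argv))
        if not dry_run:
            # check=True raises CalledProcessError on a non-zero exit code.
            subprocess.run(argv, check=True)


if __name__ == "__main__":
    run_setup(dry_run=True)
```

The dry run defaults to on so that nothing is provisioned on AWS by accident; pass `dry_run=False` to actually invoke the script.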

### Ingestion

Crystal-base ingests files from the [Marco Database](https://marco.ccr.buffalo.edu/) into an S3 bucket using **bash** on an EC2 instance.

Run `source src/bash/ingestMarcoFiles.sh && ingestMarcosFiles` to ingest files.

### Training

Crystal-base uses transfer learning on the [inceptionv3](https://www.tensorflow.org/tutorials/images/image_recognition) model to identify protein drop crystals in images from the [Marco Database](https://marco.ccr.buffalo.edu/).

Run `python3 src/python/classifyImagesTrainer.py` to train the image classifier and write to a Postgres Database.

### Distributed Image Classification

Data is ingested with Spark from S3 buckets and batch processed on a distributed TensorFlow cluster in which each executor runs its own TensorFlow instance.

Run `./main.sh --classify-images simple` to use the simple test classifier. Results are expected to output to a Postgres database.
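
The per-executor pattern described above can be sketched as a partition-level function that loads the model once and then classifies every image in that partition. This is a sketch under stated assumptions, not the code in `classifyImages.py`: `load_model` is a hypothetical stand-in that returns a trivial brightness-threshold classifier instead of the TensorFlow model, images are represented as flat lists of pixel values, and the label strings are made up for illustration.

```python
def load_model():
    """Hypothetical stand-in for loading the trained TensorFlow model.

    Returns a callable that labels an image "crystal" when its mean
    pixel value exceeds 0.5. The real pipeline would load the saved
    model here instead.
    """
    def predict(pixels):
        mean = sum(pixels) / len(pixels)
        return "crystal" if mean > 0.5 else "no-crystal"
    return predict


def classify_partition(rows):
    """Classify every (name, pixels) pair in one partition.

    The model is loaded once per partition rather than once per image,
    so the loading cost is paid once per batch of images.
    """
    model = load_model()
    for name, pixels in rows:
        yield name, model(pixels)


if __name__ == "__main__":
    partition = [("a.jpg", [0.9, 0.8, 0.7]), ("b.jpg", [0.1, 0.2, 0.0])]
    for name, label in classify_partition(partition):
        print(name, label)
```

With PySpark, a function of this shape can be passed to `rdd.mapPartitions(classify_partition)`. Writing the results to Postgres is left out of the sketch.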

## Web App

Crystal-base has a web interface that runs its own instance of the trained tensorflow model.

![Image of Web App](images/crystal-base-web-app.png)

Run `./main.sh --run-webs-server` to run this web-server instance.

### Try it out!

Upload protein crystal JPEG images at [Crystal-Base](http://www.crystal-base.com).
10 changes: 10 additions & 0 deletions config/database-cluster/master.yml
@@ -0,0 +1,10 @@
purchase_type: on_demand
subnet_id: subnet-01bc006215b4bfa36
num_instances: 1
key_name: insight-aws2
security_group_ids: sg-08d81fc64ffc3f309
instance_type: m4.large
tag_name: crystal-project-database-cluster
vol_size: 100
role: master
use_eips: true
100 changes: 100 additions & 0 deletions config/database-cluster/pg_hba.conf
@@ -0,0 +1,100 @@
# PostgreSQL Client Authentication Configuration File
# ===================================================
#
# Refer to the "Client Authentication" section in the PostgreSQL
# documentation for a complete description of this file. A short
# synopsis follows.
#
# This file controls: which hosts are allowed to connect, how clients
# are authenticated, which PostgreSQL user names they can use, which
# databases they can access. Records take one of these forms:
#
# local DATABASE USER METHOD [OPTIONS]
# host DATABASE USER ADDRESS METHOD [OPTIONS]
# hostssl DATABASE USER ADDRESS METHOD [OPTIONS]
# hostnossl DATABASE USER ADDRESS METHOD [OPTIONS]
#
# (The uppercase items must be replaced by actual values.)
#
# The first field is the connection type: "local" is a Unix-domain
# socket, "host" is either a plain or SSL-encrypted TCP/IP socket,
# "hostssl" is an SSL-encrypted TCP/IP socket, and "hostnossl" is a
# plain TCP/IP socket.
#
# DATABASE can be "all", "sameuser", "samerole", "replication", a
# database name, or a comma-separated list thereof. The "all"
# keyword does not match "replication". Access to replication
# must be enabled in a separate record (see example below).
#
# USER can be "all", a user name, a group name prefixed with "+", or a
# comma-separated list thereof. In both the DATABASE and USER fields
# you can also write a file name prefixed with "@" to include names
# from a separate file.
#
# ADDRESS specifies the set of hosts the record matches. It can be a
# host name, or it is made up of an IP address and a CIDR mask that is
# an integer (between 0 and 32 (IPv4) or 128 (IPv6) inclusive) that
# specifies the number of significant bits in the mask. A host name
# that starts with a dot (.) matches a suffix of the actual host name.
# Alternatively, you can write an IP address and netmask in separate
# columns to specify the set of hosts. Instead of a CIDR-address, you
# can write "samehost" to match any of the server's own IP addresses,
# or "samenet" to match any address in any subnet that the server is
# directly connected to.
#
# METHOD can be "trust", "reject", "md5", "password", "gss", "sspi",
# "ident", "peer", "pam", "ldap", "radius" or "cert". Note that
# "password" sends passwords in clear text; "md5" is preferred since
# it sends encrypted passwords.
#
# OPTIONS are a set of options for the authentication in the format
# NAME=VALUE. The available options depend on the different
# authentication methods -- refer to the "Client Authentication"
# section in the documentation for a list of which options are
# available for which authentication methods.
#
# Database and user names containing spaces, commas, quotes and other
# special characters must be quoted. Quoting one of the keywords
# "all", "sameuser", "samerole" or "replication" makes the name lose
# its special character, and just match a database or username with
# that name.
#
# This file is read on server startup and when the postmaster receives
# a SIGHUP signal. If you edit the file on a running system, you have
# to SIGHUP the postmaster for the changes to take effect. You can
# use "pg_ctl reload" to do that.

# Put your actual configuration here
# ----------------------------------
#
# If you want to allow non-local connections, you need to add more
# "host" records. In that case you will also need to make PostgreSQL
# listen on a non-local interface via the listen_addresses
# configuration parameter, or via the -i or -h command line switches.




# DO NOT DISABLE!
# If you change this first entry you will need to make sure that the
# database superuser can access the database using some other method.
# Noninteractive access to all databases is required during automatic
# maintenance (custom daily cronjobs, replication, and similar tasks).
#
# Database administrative login by Unix domain socket
local all postgres peer

# TYPE DATABASE USER ADDRESS METHOD

# "local" is for Unix domain socket connections only
local all all peer
# IPv4 local connections:
host all all 127.0.0.1/32 md5
# IPv6 local connections:
host all all ::1/128 md5
# Allow replication connections from localhost, by a user with the
# replication privilege.
#local replication postgres peer
#host replication postgres 127.0.0.1/32 md5
#host replication postgres ::1/128 md5
host all all 0.0.0.0/0 md5
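
PostgreSQL checks `pg_hba.conf` records in order and uses the first one whose connection type, database, user, and address match. The sketch below is a simplified model of that lookup for the active `host` records in this file, written to show which record a given client address would hit. It handles only `all` for database and user and CIDR addresses, and `first_match` is a hypothetical helper, not a PostgreSQL API.

```python
import ipaddress

# The active "host" records from the file above, in file order:
# (database, user, CIDR address, auth method)
HOST_RULES = [
    ("all", "all", "127.0.0.1/32", "md5"),
    ("all", "all", "::1/128", "md5"),
    ("all", "all", "0.0.0.0/0", "md5"),
]


def first_match(client_ip, rules=HOST_RULES):
    """Return the first rule whose CIDR network contains client_ip, or None."""
    addr = ipaddress.ip_address(client_ip)
    for rule in rules:
        network = ipaddress.ip_network(rule[2])
        # An IPv4 address never matches an IPv6 network and vice versa.
        if addr.version == network.version and addr in network:
            return rule
    return None


if __name__ == "__main__":
    for ip in ("127.0.0.1", "203.0.113.7", "::1", "2001:db8::1"):
        print(ip, "->", first_match(ip))
```

In this simplified model, every non-local IPv4 client falls through to the final `0.0.0.0/0` record and is asked for an md5 password, while a non-local IPv6 client matches no record at all.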