Dynamic mst #16

Open
wants to merge 125 commits into
base: master
25e3926
First big commit for chronomatic comp. stats.
fataltes Jun 6, 2018
587d13d
let CMake take CXXFLAGS on the command line
Jun 6, 2018
dfb1aa1
Added mci to cmake file
fataltes Jun 6, 2018
e2f561a
fixed it
Jun 6, 2018
b758c34
resolved a few bugs
fataltes Jun 6, 2018
dc466a8
included kmer.h
fataltes Jun 6, 2018
87cb8e0
fixed the bugs in the bfs algorithm and logic of operators
fataltes Jun 6, 2018
a69a7f4
reformatting
fataltes Jun 6, 2018
f25687f
resolved merge conflict
fataltes Jun 6, 2018
9a11b78
correct version (not scalable yet)
fataltes Jun 6, 2018
fc37687
Added the distance calculator
fataltes Jun 11, 2018
28a0aab
walkEqcls provides different statistics/capabilities/tests that requi…
fataltes Jun 13, 2018
480e32b
required headers for walkEqcls
fataltes Jun 13, 2018
8956213
added the new target walkEqcls
fataltes Jun 13, 2018
4789ad1
Adding two tests: ssbt on columns & unique delta bvs
fataltes Jun 23, 2018
e152304
updated and added a couple more tests and features
fataltes Jun 29, 2018
f71d2b2
added num_samples as an argument to validate_uniqueness subcommand fo…
fataltes Jun 29, 2018
b767b58
1) Ignore self-loops in neighbor calculations
fataltes Jul 13, 2018
1339d5f
Code to get list of edges between equivalence classes using kmers in …
fataltes Jul 15, 2018
6f115f4
Merging monochromaticStats and current develop into this new branch s…
fataltes Jul 18, 2018
509a2b7
Needed some cleaning first.
fataltes Jul 18, 2018
c6b1008
Add the ability to build parent bv
fataltes Jul 18, 2018
7654c73
Start using boost for MST
fataltes Jul 19, 2018
ab09c65
A simple MST boost test!
fataltes Jul 19, 2018
a8f325e
A very important commit containing the whole algorithm to find the ro…
fataltes Jul 19, 2018
d207152
output edge weight
fataltes Jul 19, 2018
1268e5c
output edge weight
fataltes Jul 19, 2018
0b05660
Added number of set bits as weights of 0-end edges while building the…
fataltes Jul 20, 2018
345b977
Storing all the required data structures to retrieve color classes.
fataltes Jul 20, 2018
76047be
separating different delta calculations based on if a node is connect…
fataltes Jul 20, 2018
f255e53
Store the deltas in correct order. Don't calculate/output any stats
fataltes Jul 20, 2018
57f7785
Separating MSF into .h and .cpp
fataltes Jul 20, 2018
22ed85f
A validator for checking the accuracy of equivalence class encoding.
fataltes Jul 20, 2018
71c00b5
fixed a bug
fataltes Jul 20, 2018
5f6d4d4
A few logs to trace the bug
fataltes Jul 20, 2018
5142460
resolved the bug at the root case
fataltes Jul 20, 2018
ea28cf9
resolved a bug in MSF validation/query code.
fataltes Jul 21, 2018
198c459
make validator more than just a validator!
fataltes Jul 21, 2018
fbd0620
Added the decoding command
fataltes Jul 21, 2018
620322b
A little bit of code-cleaning. Prepare the basics for query command.
fataltes Jul 23, 2018
aa43aa0
Code to query using the new color class encoding.
fataltes Jul 24, 2018
94a398f
gathering some stats along the way of decoding.
fataltes Jul 24, 2018
5cc7ffd
minor change
fataltes Jul 24, 2018
1c5f17b
Adding the lruCache library
fataltes Jul 25, 2018
6dfaa8f
should not be fast
Jul 25, 2018
b10ea39
update
Jul 25, 2018
56c4ab1
working
Jul 25, 2018
e0f4301
fast simple version
Jul 25, 2018
ddcbd68
remove some allocs
Jul 26, 2018
ed94210
the tsl hashtable yielded different results ... bad
Jul 26, 2018
4cf6d81
the tsl hashtable yielded different results ... bad
Jul 26, 2018
b54965f
replace output hash with vector, order is different, but contents are…
Jul 26, 2018
51d990e
more advanced lru cache
Jul 26, 2018
a5cff36
remove some instrumentation
Jul 26, 2018
c43d619
some cleanup
Jul 26, 2018
f921e26
enable LTO
Jul 26, 2018
b6c71d8
prepare the environment for fillGraph part.
fataltes Jul 30, 2018
c4ce1fc
Finding shorter direct links for any hubs of h steps away for each node.
fataltes Jul 31, 2018
38059b0
Hamming!
fataltes Jul 31, 2018
67f6603
Limited the search for better direct edges to those nodes with degree…
fataltes Jul 31, 2018
93666f8
minor changes
fataltes Jul 31, 2018
f1da2f3
getting rid of the one-by-one erasing of laaaarge sets of neighbors
fataltes Jul 31, 2018
b66a984
Improving the k-hop direct min distance detector algorithm
fataltes Jul 31, 2018
4714950
bug fix in MST constructor. Adding the bool vector <visited>
fataltes Jul 31, 2018
954b5b8
Adding walkCqf to the most recent/accurate/active branch
fataltes Aug 2, 2018
f750935
remove confusing BitVector and BitVectorRRR classes
prashantpandey Aug 2, 2018
6e60e09
simplify Iterator, replace PQ implementation with decrease_top, simpl…
prashantpandey Aug 2, 2018
b967527
remove cutoff
prashantpandey Aug 2, 2018
55ef403
Adding walkCqf target to the CMake file but commented it!
fataltes Aug 2, 2018
a3d96a1
adjust logic to avoid using result of qfi_next (returned 1 on 'last' …
prashantpandey Aug 3, 2018
fb1bc19
compacting for readability
prashantpandey Aug 3, 2018
2e5e9f4
compacting for readability
prashantpandey Aug 3, 2018
dfa08be
move Iterator definition
prashantpandey Aug 3, 2018
59c4edc
move Iterator definition
prashantpandey Aug 3, 2018
3213926
working copy
prashantpandey Aug 3, 2018
18e1971
remove useless fields from Iterator
prashantpandey Aug 4, 2018
152113b
Patch replacing simple copy with get_int/set_int
fataltes Aug 4, 2018
f6b41a2
Patch2 (more effective) replacing simple copy with get_int/set_int
fataltes Aug 4, 2018
b36ef02
Patch2 (more effective) replacing simple copy with get_int/set_int
fataltes Aug 4, 2018
e653c50
play with QF metadata cache in QFi
prashantpandey Aug 6, 2018
1e9819c
faster add_bitvector that copies 64 bits at a time
prashantpandey Aug 6, 2018
0db6a4d
fix add_bitvector bug
prashantpandey Aug 7, 2018
ac53131
remove debug early exit
prashantpandey Aug 7, 2018
b909b6e
Starting point for dynamicMST branch.
fataltes Aug 15, 2018
e420780
Adding walkCqf back which for now contains just filtering cqf
fataltes Aug 15, 2018
f3351bd
change the name of the file validateMSF to walkMSF since it does more…
fataltes Aug 16, 2018
d724904
Just reverting added functionality to gqf (which is not used anymore)
fataltes Aug 16, 2018
5ed15a5
Moving the new CQF code in mantis. compiles, not tested.
prashantpandey Aug 18, 2018
fcaf74d
A compilable version of deltaManager class + a base for DeltaEncoder …
fataltes Aug 18, 2018
b76c596
works when no need for heap. swap doesn't work yet.
fataltes Aug 18, 2018
ba87c2e
Adding Squeakr config file to mantis. Stop building if Squeakr file i…
prashantpandey Aug 19, 2018
c8b9f8f
Minor fixes. Tested OK on sample datasets.
prashantpandey Aug 20, 2018
a470e42
swap works now
fataltes Aug 20, 2018
327fb95
deltaManager works beautifully!
fataltes Aug 20, 2018
2be1538
lowercasing name of deltaManager file
fataltes Aug 21, 2018
b48477e
Adding the initial skeleton for colorEncoding dynamically.
fataltes Aug 21, 2018
320fae7
Adding hammingDistance implementation
fataltes Aug 22, 2018
d4632a9
Adding buildColor implementation
fataltes Aug 22, 2018
2c68c42
Adding maxWeightsTillLCA implementation
fataltes Aug 22, 2018
4e6b7a4
Calling colorEncoder from inside add_kmer in coloreddbg
fataltes Aug 22, 2018
0e18b13
Adding the very important method for taking care of dangling pointers
fataltes Aug 22, 2018
e04d34f
Add serialization to the colorEncoder
fataltes Aug 22, 2018
45e7b2e
still not fully working
fataltes Aug 23, 2018
e1ff70c
fixed a bug. went to the next (cut in a loop for breaking a loop)
fataltes Aug 23, 2018
95045e0
fixed the loop. got into another
fataltes Aug 23, 2018
80aeff9
works (except for serialization) :happy:
fataltes Aug 23, 2018
7febf96
works!! :dance:
fataltes Aug 23, 2018
5138bd0
Adding auto resize.
prashantpandey Aug 23, 2018
e96e251
Fixing a minor corner condition in find_samples code.
prashantpandey Aug 23, 2018
542ea3c
Fixing a minor bug in the iterator in coloreddbg.h
prashantpandey Aug 23, 2018
fcd68c2
This works correctly ;p
fataltes Aug 24, 2018
8f3a64e
cleaning the code
fataltes Aug 24, 2018
66dcea9
resizing parentbv
fataltes Aug 24, 2018
328577e
haven't had bitvector.cc from the beginning
fataltes Aug 25, 2018
c8e6dd3
and also haven't had build_eq_graph.cc from the beginning
fataltes Aug 25, 2018
30458d9
the file ended up in the wrong place
fataltes Aug 25, 2018
a6b6d98
adding the cache log, removing find() in addEdge.
fataltes Aug 27, 2018
7429cb0
Adding a new stats about how mst weight grows by adding new kmers
fataltes Sep 5, 2018
dd3ce40
Removing cutoffs and adding non-filtered squeakr files warning.
prashantpandey Sep 9, 2018
33fc709
minor gqf file fix porting.
prashantpandey Sep 13, 2018
a2e11e8
Changing open flag in qf_usefile.
prashantpandey Sep 16, 2018
8b413e1
Adding cqf logging code back.
prashantpandey Sep 17, 2018
30d6ece
Changed logging interval in construction phase.
prashantpandey Sep 17, 2018
fcd8834
mmapping CQF
fataltes Sep 19, 2018
da7466a
Merge remote-tracking branch 'remotes/origin/simplification_and_clean…
fataltes Sep 22, 2018
4 changes: 2 additions & 2 deletions CMakeLists.txt
@@ -3,14 +3,14 @@
 # https://rix0r.nl/blog/2015/08/13/cmake-guide/
 #

-cmake_minimum_required(VERSION 3.5 FATAL_ERROR)
+cmake_minimum_required(VERSION 3.9 FATAL_ERROR)
 project(mantis VERSION 0.2 LANGUAGES C CXX)
 if (NOT CMAKE_BUILD_TYPE)
 set (CMAKE_BUILD_TYPE "Release")
 endif()

 # We require C++11
-set(CMAKE_CXX_STANDARD 11)
+set(CMAKE_CXX_STANDARD 14)
 set(CMAKE_CXX_STANDARD_REQUIRED ON)
 set(CMAKE_C_STANDARD 11)
 set(CMAKE_C_STANDARD_REQUIRED ON)
10 changes: 6 additions & 4 deletions Makefile.deprecated
@@ -1,4 +1,4 @@
-TARGETS= mantis
+TARGETS= mantis monochromatic_component_iterator

 ifdef D
 DEBUG=-g -DDEBUG
@@ -33,8 +33,8 @@ CFLAGS += -Wall $(DEBUG) $(PROFILE) $(OPT) $(ARCH) -m64 -I. -I$(LOC_INCLUDE)\
 -Wno-unused-result -Wno-strict-aliasing -Wno-unused-function -Wno-sign-compare \
 -Wno-implicit-function-declaration

-LDFLAGS += $(DEBUG) $(PROFILE) $(OPT) -lsdsl -lpthread -lboost_system \
--lboost_thread -lm -lz -lrt
+LDFLAGS += $(DEBUG) $(PROFILE) $(OPT) -lpthread -lboost_system \
+-lboost_thread -lm -lz -lrt lib/libsdsl.a

 #
 # declaration of dependencies
@@ -45,6 +45,8 @@ all: $(TARGETS)
 # dependencies between programs and .o files
 mantis: $(OBJDIR)/kmer.o $(OBJDIR)/mantis.o $(OBJDIR)/validatemantis.o $(OBJDIR)/gqf.o $(OBJDIR)/hashutil.o $(OBJDIR)/query.o $(OBJDIR)/coloreddbg.o $(OBJDIR)/bitvector.o $(OBJDIR)/util.o $(OBJDIR)/MantisFS.o

+monochromatic_component_iterator: $(OBJDIR)/kmer.o $(OBJDIR)/gqf.o $(OBJDIR)/hashutil.o $(OBJDIR)/monochromatic_component_iterator.o
+
 # dependencies between .o files and .h files
 $(OBJDIR)/mantis.o: $(LOC_SRC)/mantis.cc
 $(OBJDIR)/MantisFs.o: $(LOC_SRC)/MantisFS.cc $(LOC_INCLUDE)/MantisFS.h
@@ -59,7 +61,7 @@ $(OBJDIR)/hashutil.o: $(LOC_INCLUDE)/hashutil.h
 # dependencies between .o files and .cc (or .c) files

 $(OBJDIR)/gqf.o: $(LOC_SRC)/cqf/gqf.c $(LOC_INCLUDE)/cqf/gqf.h
-
+$(OBJDIR)/monochromatic_component_iterator.o: $(LOC_INCLUDE)/cqf.h $(LOC_INCLUDE)/monochromatic_component_iterator.h $(LOC_SRC)/monochromatic_component_iterator.cc
 #
 # generic build rules
 #
Binary file added data/SRR191403-k20-Cut1.squeakr
Binary file not shown.
Binary file removed data/SRR191403_exact.ser
Binary file not shown.
Binary file added data/SRR191411-k20-Cut1.squeakr
Binary file not shown.
Binary file removed data/SRR191411_exact.ser
Binary file not shown.
301 changes: 301 additions & 0 deletions include/MSF.h
@@ -0,0 +1,301 @@
//
// Created by Fatemeh Almodaresi on 7/20/18.
//

#ifndef MANTIS_MSF_H
#define MANTIS_MSF_H
#include<bits/stdc++.h>
#include <sstream>
#include <unordered_set>
#include <queue>
#include "clipp.h"
#include "bitvector.h"
//#include "sdsl/bits.hpp"

#define EQS_PER_SLOT 20000000

using namespace std;

typedef std::vector<sdsl::rrr_vector<63>> eqvec;

struct Edge {
uint32_t n1;
uint32_t n2;
uint16_t weight;

Edge(uint32_t inN1, uint32_t inN2, uint16_t inWeight)
: n1(inN1), n2(inN2), weight(inWeight) {}
};

struct EdgePtr {
uint16_t bucket;
uint32_t idx;

EdgePtr(uint16_t bucketIn, uint32_t idxIn) : bucket(bucketIn), idx(idxIn) {}
};

struct Child {
uint32_t id;
uint16_t weight;

Child(uint32_t inN1, uint16_t inWeight) : id(inN1), weight(inWeight) {}
};

struct Path {
uint32_t id;
uint32_t steps;
uint64_t weight;

Path(uint32_t idIn,
uint32_t stepsIn,
uint64_t weightIn) : id(idIn), steps(stepsIn), weight(weightIn) {}
};

struct DisjointSetNode {
uint32_t parent{0};
uint64_t rnk{0}, w{0}, edges{0};

void setParent(uint32_t p) { parent = p; }

void mergeWith(DisjointSetNode &n, uint16_t edgeW, uint32_t id) {
n.setParent(parent);
w += (n.w + static_cast<uint64_t>(edgeW));
edges += (n.edges + 1);
n.edges = 0;
n.w = 0;
if (rnk == n.rnk) {
rnk++;
}
}
};

// To represent Disjoint Sets
struct DisjointSets {
std::vector<DisjointSetNode> els;
uint64_t n;

// Constructor.
DisjointSets(uint64_t n) {
this->n = n;
// The loop below initializes indices 0..n inclusive,
// so allocate n + 1 slots to keep els[n] in bounds.
els.resize(n + 1);
// Initially, all vertices are in
// different sets and have rank 0.
for (uint64_t i = 0; i <= n; i++) {
// every element is its own parent
els[i].setParent(static_cast<uint32_t>(i));
}
}

// Find the parent of a node 'u'
// Path Compression
uint32_t find(uint32_t u) {
/* Make the parent of the nodes in the path
from u--> parent[u] point to parent[u] */
if (u != els[u].parent)
els[u].parent = find(els[u].parent);
return els[u].parent;
}

// Union by rank
void merge(uint32_t x, uint32_t y, uint16_t edgeW) {
x = find(x), y = find(y);
// Already in the same set: merging a root with itself would
// double-count its weight and edge totals.
if (x == y) return;

/* Make tree with smaller height
a subtree of the other tree */
if (els[x].rnk > els[y].rnk) {
els[x].mergeWith(els[y], edgeW, x);

} else {// If rnk[x] <= rnk[y]
els[y].mergeWith(els[x], edgeW, y);
}
}
};
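Review note: a minimal, self-contained sketch of the same union-find idea (path compression plus union by rank), written so that every index it touches is inside the allocated range. `UnionFind` and its members are hypothetical names for illustration, not part of this PR, and the sketch leaves out the per-set weight and edge counters that `DisjointSetNode` tracks. It uses an iterative `find` with path halving instead of recursion, which is a swapped-in variant rather than what the header does.

```cpp
#include <cassert>
#include <cstdint>
#include <numeric>
#include <utility>
#include <vector>

// Hypothetical stand-in for DisjointSets; illustration only.
struct UnionFind {
    std::vector<uint32_t> parent;
    std::vector<uint8_t> rnk;

    explicit UnionFind(uint32_t n) : parent(n), rnk(n, 0) {
        // Every element starts as the root of its own singleton set.
        std::iota(parent.begin(), parent.end(), 0u);
    }

    // Iterative find with path halving: each visited node is re-pointed
    // at its grandparent, so deep trees cannot overflow the call stack.
    uint32_t find(uint32_t u) {
        assert(u < parent.size());
        while (parent[u] != u) {
            parent[u] = parent[parent[u]];
            u = parent[u];
        }
        return u;
    }

    // Union by rank. Returns false when x and y were already connected.
    bool merge(uint32_t x, uint32_t y) {
        x = find(x);
        y = find(y);
        if (x == y) return false;
        if (rnk[x] < rnk[y]) std::swap(x, y);
        parent[y] = x;  // attach the lower-rank root under the higher-rank one
        if (rnk[x] == rnk[y]) ++rnk[x];
        return true;
    }
};
```

Returning a `bool` from `merge` lets a caller such as Kruskal's loop skip the two separate `find` calls before merging, at the cost of a slightly different interface.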

// Structure to represent a graph
struct Graph {

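uint64_t V{0}; // number of nodes; the constructor does not set it, so the caller must assign it before calling kruskalMSF()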

vector<vector<Edge>> edges;
vector<vector<EdgePtr>> mst;

uint64_t mst_totalWeight{0};

Graph(uint64_t bucketCnt) { edges.resize(bucketCnt); }

// Utility function to add an edge.
// Bucket i holds the edges of weight i + 1, so w must be in [1, bucketCnt].
void addEdge(uint32_t u, uint32_t v, uint16_t w) {
edges[w - 1].emplace_back(u, v, w);
}

// Build a minimum spanning forest with Kruskal's algorithm.
// Edges are bucketed by weight, so visiting the buckets in order
// yields the edges in non-decreasing weight without sorting.
DisjointSets kruskalMSF(uint32_t bucketCnt) {
// Create disjoint sets
DisjointSets ds(V);
// One list of MST edge pointers per node; make sure it exists before indexing.
if (mst.size() < V) mst.resize(V);

uint64_t cntr{0}, mergeCntr{0};
uint16_t w{0};
sdsl::bit_vector nodes(V, 0);
// Iterate through the buckets in increasing weight order
for (uint32_t bucketCntr = 0; bucketCntr < bucketCnt; bucketCntr++) {
uint32_t edgeIdxInBucket = 0;
for (auto it = edges[bucketCntr].begin(); it != edges[bucketCntr].end(); it++) {
w = it->weight;
uint32_t u = it->n1;
uint32_t v = it->n2;
uint32_t set_u = ds.find(u);
uint32_t set_v = ds.find(v);

// Check if the selected edge is creating
// a cycle or not (Cycle is created if u
// and v belong to same set)
if (set_u != set_v) {
// Current edge will be in the MST
// Merge two sets
ds.merge(set_u, set_v, w);
mst[u].emplace_back(bucketCntr, edgeIdxInBucket);
mst[v].emplace_back(bucketCntr, edgeIdxInBucket);
nodes[u] = 1;
nodes[v] = 1;
mst_totalWeight += w;
mergeCntr++;
}
cntr++;
if (cntr % 1000000 == 0) {
std::cerr << "edge " << cntr << " " << mergeCntr << "\n";
}
edgeIdxInBucket++;
}
}
// Count the nodes touched by at least one MST edge. The last chunk
// may be shorter than 64 bits, so never read past the end of `nodes`.
uint64_t distinctNodes{0};
for (uint64_t i = 0; i < V; i += 64) {
uint8_t len = static_cast<uint8_t>(std::min<uint64_t>(64, V - i));
distinctNodes += sdsl::bits::cnt(nodes.get_int(i, len));
}

std::cerr << "final # of edges: " << cntr
<< "\n# of merges: " << mergeCntr
<< "\n# of distinct nodes: " << distinctNodes
<< "\n";
return ds;
}
};
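Review note: the reason `Graph` can skip the usual sort step of Kruskal's algorithm is that edge weights are small integers, so bucketing by weight acts as a counting sort. The sketch below isolates that idea under stated assumptions: `WEdge` and `bucketKruskal` are hypothetical names, weights are assumed to lie in `[1, maxWeight]`, and the union-find is reduced to a parent array with path halving (no union by rank, no per-set statistics). It is not the PR's implementation.

```cpp
#include <cassert>
#include <cstdint>
#include <numeric>
#include <vector>

struct WEdge { uint32_t u, v; uint16_t w; };

// Returns the total weight of a minimum spanning forest over `numNodes` nodes.
// If `chosen` is non-null, the selected edges are appended to it in pick order.
uint64_t bucketKruskal(uint32_t numNodes, uint16_t maxWeight,
                       const std::vector<WEdge>& edges,
                       std::vector<WEdge>* chosen = nullptr) {
    // Counting-sort step: bucket w - 1 holds every edge of weight w.
    std::vector<std::vector<WEdge>> buckets(maxWeight);
    for (const WEdge& e : edges) {
        assert(e.w >= 1 && e.w <= maxWeight);
        assert(e.u < numNodes && e.v < numNodes);
        buckets[e.w - 1].push_back(e);
    }

    std::vector<uint32_t> parent(numNodes);
    std::iota(parent.begin(), parent.end(), 0u);
    auto find = [&parent](uint32_t x) {
        while (parent[x] != x) {
            parent[x] = parent[parent[x]];  // path halving
            x = parent[x];
        }
        return x;
    };

    uint64_t total = 0;
    // Buckets are visited from lightest to heaviest, so no sort is needed.
    for (const auto& bucket : buckets) {
        for (const WEdge& e : bucket) {
            const uint32_t ru = find(e.u);
            const uint32_t rv = find(e.v);
            if (ru == rv) continue;  // the edge would close a cycle
            parent[ru] = rv;
            total += e.w;
            if (chosen) chosen->push_back(e);
        }
    }
    return total;
}
```

With weights bounded by the number of samples, bucketing costs time linear in the number of edges plus `maxWeight`, compared with the `O(E log E)` of a comparison sort.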

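void loadEqs(std::string filename, eqvec &bvs) {
bvs.reserve(20);
std::string eqfile;
std::ifstream eqlist(filename);
if (!eqlist.is_open()) {
std::cerr << "ERROR: could not open the equivalence class list file " << filename << "\n";
return;
}
while (getline(eqlist, eqfile)) {
sdsl::rrr_vector<63> bv;
bvs.push_back(bv);
sdsl::load_from_file(bvs.back(), eqfile);
}
// bvs.back() is undefined on an empty vector, so bail out before logging.
if (bvs.empty()) {
std::cerr << "no equivalence class files were listed in " << filename << "\n";
return;
}
std::cerr << "loaded all the equivalence classes: "
<< ((bvs.size() - 1) * EQS_PER_SLOT + bvs.back().size())
<< "\n";
}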

void buildColor(eqvec &bvs,
std::vector<uint64_t> &eq,
uint64_t eqid,
uint64_t num_samples) {
uint64_t i{0}, bitcnt{0}, wrdcnt{0};
uint64_t idx = eqid / EQS_PER_SLOT;
uint64_t offset = eqid % EQS_PER_SLOT;
//std::cerr << eqid << " " << num_samples << " " << idx << " " << offset << "\n";
while (i<num_samples) {
bitcnt = std::min(num_samples - i, (uint64_t) 64);
uint64_t wrd = (bvs[idx]).get_int(offset * num_samples + i, bitcnt);
eq[wrdcnt++] = wrd;
i += bitcnt;
}
}
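Review note: `buildColor` reads one row of a row-major bit matrix in chunks of up to 64 bits, with only the final chunk being shorter. The sketch below shows that chunking on a plain `std::vector<uint64_t>` so it can be run without sdsl. `getBits` and `extractRow` are hypothetical helpers, and the bit layout (bit 0 of `words[0]` is position 0) is an assumption of this sketch, not a statement about how sdsl stores its data.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for a get_int-style accessor: read `len` bits
// (1..64) starting at bit position `pos`.
uint64_t getBits(const std::vector<uint64_t>& words, uint64_t pos, uint8_t len) {
    assert(len >= 1 && len <= 64);
    assert(pos + len <= words.size() * 64);
    const uint64_t wordIdx = pos / 64;
    const uint64_t shift = pos % 64;
    uint64_t value = words[wordIdx] >> shift;
    if (shift != 0 && shift + len > 64) {
        // The field straddles a word boundary: take the high part from the next word.
        value |= words[wordIdx + 1] << (64 - shift);
    }
    if (len < 64) value &= (uint64_t{1} << len) - 1;  // drop bits past the field
    return value;
}

// Copy row `row` of a row-major bit matrix with `numCols` columns into 64-bit words.
std::vector<uint64_t> extractRow(const std::vector<uint64_t>& words,
                                 uint64_t row, uint64_t numCols) {
    std::vector<uint64_t> out((numCols + 63) / 64);
    uint64_t i = 0, w = 0;
    while (i < numCols) {
        const uint64_t len = std::min<uint64_t>(64, numCols - i);
        out[w++] = getBits(words, row * numCols + i, static_cast<uint8_t>(len));
        i += len;
    }
    return out;
}
```

Because the short final chunk is masked, the unused high bits of the last output word are always zero, which is what lets later code XOR or popcount whole words without a separate tail case.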

uint16_t sum1s(eqvec &bvs, uint64_t eqid,
uint64_t num_samples, uint64_t numWrds) {
uint16_t res{0};
std::vector<uint64_t> eq;
eq.resize(numWrds);
buildColor(bvs, eq, eqid, num_samples);
for (uint64_t i = 0; i < eq.size(); i += 1) {
res += (uint16_t)sdsl::bits::cnt(eq[i]);
}
return res;
}

// for two non-zero nodes, the delta list holds the positions where the XOR of the two color vectors is 1
std::vector<uint32_t> getDeltaList(eqvec &bvs,
uint64_t eqid1,uint64_t eqid2, uint64_t num_samples, uint64_t numWrds) {
std::vector<uint32_t> res;
std::vector<uint64_t> eq1, eq2;
eq1.resize(numWrds);
eq2.resize(numWrds);
buildColor(bvs, eq1, eqid1, num_samples);
buildColor(bvs, eq2, eqid2, num_samples);

for (uint32_t i = 0; i < eq1.size(); i += 1) {
uint64_t eq12xor = eq1[i] ^ eq2[i];
for (uint32_t j = 0; j < 64; j++) {
if ( (eq12xor >> j) & 0x01 ) {
res.push_back(i*64+j);
}
}
}

return res; // returned by move or copy elision, so no deep copy is made
}

// for nodes connected to node zero, the delta list holds the positions of the set bits
std::vector<uint32_t> getDeltaList(eqvec &bvs,
uint64_t eqid1, uint64_t num_samples, uint64_t numWrds) {
std::vector<uint32_t> res;
std::vector<uint64_t> eq1;
eq1.resize(numWrds);
buildColor(bvs, eq1, eqid1, num_samples);

for (uint32_t i = 0; i < eq1.size(); i += 1) {
for (uint32_t j = 0; j < 64; j++) {
if ( (eq1[i] >> j) & 0x01 ) {
res.push_back(i*64+j);
}
}
}

return res; // returned by move or copy elision, so no deep copy is made
}
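Review note: both `getDeltaList` overloads test all 64 bit positions of every word. When the XOR is sparse, a common alternative is to jump straight from one set bit to the next. The sketch below shows that variant on plain word vectors. `deltaPositions` is a hypothetical name, and `__builtin_ctzll` is a GCC/Clang builtin, so this assumes one of those compilers (the header already relies on GCC through `<bits/stdc++.h>`).

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Positions at which two equal-length word arrays differ, in increasing order.
// Passing an all-zero `b` yields the positions of the set bits of `a`.
std::vector<uint32_t> deltaPositions(const std::vector<uint64_t>& a,
                                     const std::vector<uint64_t>& b) {
    assert(a.size() == b.size());
    std::vector<uint32_t> res;
    for (uint32_t i = 0; i < a.size(); ++i) {
        uint64_t x = a[i] ^ b[i];
        while (x != 0) {
            // Count trailing zeros gives the index of the lowest set bit.
            res.push_back(i * 64 + static_cast<uint32_t>(__builtin_ctzll(x)));
            x &= x - 1;  // clear the lowest set bit
        }
    }
    return res;
}
```

The inner loop runs once per differing bit rather than 64 times per word, which matters when neighboring color classes differ in only a few samples.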

uint64_t hammingDist(eqvec &bvs, uint64_t eqid1, uint64_t eqid2, uint64_t num_samples) {
uint64_t dist{0};
std::vector<uint64_t> eq1(((num_samples - 1) / 64) + 1), eq2(((num_samples - 1) / 64) + 1);
buildColor(bvs, eq1, eqid1, num_samples);
buildColor(bvs, eq2, eqid2, num_samples);

for (uint64_t i = 0; i < eq1.size(); i++) {
if (eq1[i] != eq2[i])
dist += sdsl::bits::cnt(eq1[i] ^ eq2[i]);
}
return dist;
}
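Review note: `hammingDist` boils down to a popcount of the word-wise XOR. A standalone version on plain word vectors, useful for unit-testing the arithmetic without loading any sdsl structures, might look like the sketch below. `hammingWords` is a hypothetical name, and `std::bitset<64>::count()` is used here as a portable popcount in place of `sdsl::bits::cnt`.

```cpp
#include <bitset>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Number of bit positions at which two equal-length word arrays differ.
uint64_t hammingWords(const std::vector<uint64_t>& a,
                      const std::vector<uint64_t>& b) {
    assert(a.size() == b.size());
    uint64_t dist = 0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        // XOR leaves a 1 exactly where the two words disagree.
        dist += std::bitset<64>(a[i] ^ b[i]).count();
    }
    return dist;
}
```

The `if (eq1[i] != eq2[i])` guard in the header is only a shortcut: the popcount of a zero XOR is zero, so the result is the same with or without it.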

#endif //MANTIS_MSF_H