-
Notifications
You must be signed in to change notification settings - Fork 5
Aho corasick #102
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: develop
Are you sure you want to change the base?
Aho corasick #102
Conversation
{ | ||
UInt32 base :22 ; // base value of DA | ||
UInt32 lcheck :8 ; // arc label from parent node | ||
UInt32 leaf_flag :1 ; // node is leaf |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
one could play around with the order of the members here, and see what performs the best (since extraction of a single bit is not cheap) and extraction of lcheck could be cheaper if moved to the back...
(keep in mind for later maybe)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did not get to the end. will finish later :)
{ | ||
bc_[node_pos].leaf_flag = 1; | ||
bc_[node_pos].base = static_cast<UInt32>(tail_.size()); | ||
storeInTail_(sequences_[unique_idx_[begin]].substr(depth)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
substr()
is expensive.
try a StringView
/** | ||
* @brief | ||
* @param pos_in_protein position in the set protein | ||
* @param peptide_index index of sequence in given vector |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
there is no vector given here. You mean the vector given in the ctor... but thats not clear
bool failure_(); | ||
|
||
/** | ||
* @brief returns the supply link if this does not eliminate ambiguous aa. Don't do anything special with 0. Must be considered when calling |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this function is only useful for spawns, i.e. when dealing with ambAA's, right?
So should there be a sibling function
void followSupplyLink(Int32 from, Int32& to)
?
Can you also make input=output, i.e. merge the first two parameters?
{ | ||
Int32 node; ///< Node after the ambiguous aa transition | ||
UInt32 prot_pos; ///< Protein position after the ambiguous aa | ||
uint8_t counter; ///< Count of all read ambiguous aa |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
so, this is the number of aa's in the path from root to 'node'?
UInt32 aa_count_ = 0; | ||
|
||
/// Indices of the sorted sequences, duplicate sequences are removed | ||
std::vector<UInt32> unique_idx_{}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
unique_sorted_idx_
??
void buildTrie_(Size begin, const Size end, const Size depth, const UInt32 node_pos); | ||
|
||
/// returns base value | ||
UInt32 xCheck_(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
based on what information?
|
||
if (length(pep_DB) == 0) | ||
if (pep_DB.size() == 0) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
pep_DB.empty()
?
AhoCorasickDA fuzzyAC(pep_DB); | ||
auto s2 = std::chrono::high_resolution_clock::now(); | ||
std::chrono::duration<double,std::milli> elapsed = s2 - s1; | ||
std::cout << "Construction time " << elapsed.count() << " ms"<< std::endl; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why the extra chrono?
StopWatch is a high-resolution clock
(just don't convert to int()
as done below to retain its accuracy)
// CDA | ||
//------------------------------------------------------------------------------------------------------------------------------------- | ||
|
||
UInt32 AhoCorasickDA::stat() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
function can be const
?
@@ -35,7 +35,8 @@ | |||
#pragma once | |||
|
|||
|
|||
#include <OpenMS/ANALYSIS/ID/AhoCorasickAmbiguous.h> | |||
//#include <OpenMS/ANALYSIS/ID/AhoCorasickAmbiguous.h> | |||
#include <OpenMS/ANALYSIS/ID/AhoCorasickDA.h> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you get rid of the seqan includes?
namespace OpenMS | ||
{ | ||
|
||
/** |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you describe briefly some of the special
tweaks here?
e.g. duplicate sequences are handled externally; how internal end nodes are handled; how spawns are processed (at what time and in what order), .... so the user gets a rough idea of what to expect here
struct BCNode_ | ||
{ | ||
UInt32 base :22 ; ///< base value of the node | ||
UInt32 lcheck :8 ; ///< arc label from parent node. Only code of the aa is stored |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it might be faster if you switch the order of base
and lcheck
because that should(!) save some bitshifting when accessing lcheck
.. but it might not help....
don't redo the benchmarks!
Description
Please include a summary of the change and which issue is fixed.
Also please:
Checklist: