Skip to content

Conversation

pscheil
Copy link

@pscheil pscheil commented Aug 27, 2020

Description

Please include a summary of the change and which issue is fixed.

Also please:

  • Make sure that you are listed in the AUTHORS file
  • Add relevant changes and new features to the CHANGELOG file

Checklist:

  • I have commented my code, particularly in hard-to-understand areas
  • New and existing unit tests pass locally with my changes

{
UInt32 base :22 ; // base value of DA
UInt32 lcheck :8 ; // arc label from parent node
UInt32 leaf_flag :1 ; // node is leaf
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

one could play around with the order of the members here, and see what performs the best (since extraction of a single bit is not cheap) and extraction of lcheck could be cheaper if moved to the back...
(keep in mind for later maybe)

Copy link
Owner

@cbielow cbielow left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did not get to the end. will finish later :)

{
bc_[node_pos].leaf_flag = 1;
bc_[node_pos].base = static_cast<UInt32>(tail_.size());
storeInTail_(sequences_[unique_idx_[begin]].substr(depth));
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

substr() is expensive.
try a StringView

/**
* @brief
* @param pos_in_protein position in the set protein
* @param peptide_index index of sequence in given vector
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there is no vector given here. You mean the vector given in the ctor... but thats not clear

bool failure_();

/**
* @brief returns the supply link if this does not eliminate ambiguous aa. Don't do anything special with 0. Must be considered when calling
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this function is only useful for spawns, i.e. when dealing with ambAA's, right?

So should there be a sibling function
void followSupplyLink(Int32 from, Int32& to)?

Can you also make input=output, i.e. merge the first two parameters?

{
Int32 node; ///< Node after the ambiguous aa transition
UInt32 prot_pos; ///< Protein position after the ambiguous aa
uint8_t counter; ///< Count of all read ambiguous aa
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so, this is the number of aa's in the path from root to 'node'?

UInt32 aa_count_ = 0;

/// Indices of the sorted sequences, duplicate sequences are removed
std::vector<UInt32> unique_idx_{};
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

unique_sorted_idx_ ??

void buildTrie_(Size begin, const Size end, const Size depth, const UInt32 node_pos);

/// returns base value
UInt32 xCheck_();
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

based on what information?


if (length(pep_DB) == 0)
if (pep_DB.size() == 0)
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pep_DB.empty() ?

AhoCorasickDA fuzzyAC(pep_DB);
auto s2 = std::chrono::high_resolution_clock::now();
std::chrono::duration<double,std::milli> elapsed = s2 - s1;
std::cout << "Construction time " << elapsed.count() << " ms"<< std::endl;
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why the extra chrono?
StopWatch is a high-resolution clock
(just don't convert to int() as done below to retain its accuracy)

// CDA
//-------------------------------------------------------------------------------------------------------------------------------------

UInt32 AhoCorasickDA::stat()
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

function can be const?

@@ -35,7 +35,8 @@
#pragma once


#include <OpenMS/ANALYSIS/ID/AhoCorasickAmbiguous.h>
//#include <OpenMS/ANALYSIS/ID/AhoCorasickAmbiguous.h>
#include <OpenMS/ANALYSIS/ID/AhoCorasickDA.h>
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you get rid of the seqan includes?

namespace OpenMS
{

/**
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you describe briefly some of the special tweaks here?
e.g. duplicate sequences are handled externally; how internal end nodes are handled; how spawns are processed (at what time and in what order), .... so the user gets a rough idea of what to expect here

struct BCNode_
{
UInt32 base :22 ; ///< base value of the node
UInt32 lcheck :8 ; ///< arc label from parent node. Only code of the aa is stored
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it might be faster if you switch the order of base and lcheck because that should(!) save some bitshifting when accessing lcheck .. but it might not help....
don't redo the benchmarks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants