pseudo code supertree
Mark T. Holder edited this page Feb 25, 2015
·
4 revisions
Note: this is being turned into code described here. As it is a work-in-progress, there will probably be some discrepancies between that pipeline and this one. Hopefully, we'll clean this up soon...
Input:
- OTT pruned by treemachine and expressed in newick
- list of ranked trees (most important tree first)
- phylesystem-api for grabbing the nexson.
Conventions:
- $SYNTHESIS_PARENT is some dir in your env.
- outdir is $SYNTHESIS_PARENT/out: the directory that will be the parent of the output
- provdir is ${outdir}/provenance: the directory holding artifacts from the synthesis run
Snapshot the NexSONs and normalize IDs (in treemachine syntax, many tree IDs lack the "tree" prefix)
Inputs:
- tree ranking list: one identifier per line, using (the loathed) studyid_treeid format
- a local copy of the phylesystem repo.
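The ID normalization mentioned above could be sketched roughly as follows. This is only an illustration: the function names `split_study_tree` and `resolve_tree_id` are hypothetical, the example IDs in the usage note are made up, and the assumed tag shape (two lowercase letters, an underscore, digits, an underscore, then the tree number) is inferred from the pattern used in the snapshot script below.

```python
import re

# Assumed tag shape: <two letters>_<digits>_<digits>, e.g. "pg_420_522".
# An optional literal "tree" prefix on the tree part is tolerated.
TAG_PATTERN = re.compile(r'^([a-z][a-z]_[0-9]+)_(tree)?([0-9]+)$')


def split_study_tree(tag):
    """Return (study_id, candidate_tree_ids) for a studyid_treeid tag.

    The candidates are ordered: the bare number first, then the number
    with the "tree" prefix that treemachine syntax often drops.
    """
    match = TAG_PATTERN.match(tag.strip())
    if match is None:
        raise ValueError('unrecognized studyid_treeid tag: {!r}'.format(tag))
    study_id, _prefix, number = match.groups()
    return study_id, [number, 'tree' + number]


def resolve_tree_id(tag, tree_ids_in_study):
    """Pick the first candidate tree ID that exists in the study, else None."""
    study_id, candidates = split_study_tree(tag)
    for candidate in candidates:
        if candidate in tree_ids_in_study:
            return study_id, candidate
    return study_id, None
```

For example, `resolve_tree_id('pg_420_522', {'tree522'})` would pick `'tree522'` because the bare `'522'` is not among the study's tree IDs.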
fetch-and-snapshot-nexsons.bash:
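    #!/bin/bash
    # Update the local phylesystem clone, then snapshot the ranked trees.
    outdir="${SYNTHESIS_PARENT}/out"
    cd phylesystem/shards/phylesystem-1 || exit 1
    git pull origin master || exit 1
    cd "${SYNTHESIS_PARENT}" || exit 1
    # snapshot-nexsons.py reads the whole ranking file (one studyid_treeid per line),
    # so it is called once rather than once per tag.
    snapshot-nexsons.py in/ranked-tree-list.txt "${outdir}/raw-nexsons"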
snapshot-nexsons.py:
    #!/usr/bin/env python
    # Pseudocode: create_dir, phylesystem, get_tree, get_sha, save_json_to_path,
    # CurationError and LOG_ERROR are assumed to be defined elsewhere.
    import json
    import os
    import re
    import sys

    demunger_pattern = re.compile('([a-z][a-z]_[0-9]+)_([0-9]+)')
    ranked_tree_list = sys.argv[1]
    nexson_snapshot = sys.argv[2]
    # provdir as described in the conventions above
    prov_dir = os.path.join(os.environ['SYNTHESIS_PARENT'], 'out', 'provenance')
    create_dir(nexson_snapshot)

    def do_curation_check(study, tree, prov):
        # Collect every problem before raising, so one run reports them all.
        e = {}
        if 'ingroup' not in tree:
            e.setdefault('tree_errors', []).append("ingroup missing")
        if 'doi' not in study:
            e.setdefault('tree_errors', []).append("doi missing")
        if 'root' not in tree:
            e.setdefault('tree_errors', []).append("root missing")
        if e:
            raise CurationError(e, prov)

    study_tree_sha = []
    try:
        with open(ranked_tree_list) as ranking:
            for study_tree_str in ranking:
                match = demunger_pattern.match(study_tree_str.strip())
                assert match
                study_id, tree_id = match.group(1), match.group(2)
                study = phylesystem.get(study_id)
                try:
                    tree = get_tree(study, tree_id)
                except Exception:
                    # treemachine syntax often drops the "tree" prefix of the tree ID
                    tree_id = 'tree' + tree_id
                    tree = get_tree(study, tree_id)
                study.cull_other_trees(tree_id)
                sha = get_sha(study)
                local_path = nexson_snapshot + '/' + study_id + '_' + tree_id + '.json'
                prov = {'study_id': study_id,
                        'tree_id': tree_id,
                        'sha': sha,
                        'local_path': local_path}
                do_curation_check(study, tree, prov)
                save_json_to_path(study, local_path)
                study_tree_sha.append(prov)
        # Record which study, tree and SHA went into this synthesis run.
        with open(prov_dir + '/' + 'study_list_with_shas.json', 'w') as j:
            json.dump(study_tree_sha, j)
    except CurationError as x:
        LOG_ERROR(x)
        sys.exit(1)
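To make the curation check concrete, here is a small self-contained sketch that runs on plain dicts. It assumes that the study and tree are dict-like and that `'ingroup'`, `'doi'` and `'root'` are placeholder keys taken from the pseudocode above (not the real NexSON property names); the `CurationError` class and the helper `curation_problems` are hypothetical stand-ins.

```python
class CurationError(Exception):
    """Raised when a study or tree lacks curation needed for synthesis."""

    def __init__(self, errors, provenance):
        super().__init__('curation errors: {}'.format(errors))
        self.errors = errors
        self.provenance = provenance


def curation_problems(study, tree):
    """Collect all missing items, in the same order as the pseudocode."""
    # (key, blob) pairs: placeholder keys, not real NexSON property names.
    checks = [('ingroup', tree), ('doi', study), ('root', tree)]
    missing = [key + ' missing' for key, blob in checks if key not in blob]
    return {'tree_errors': missing} if missing else {}


def do_curation_check(study, tree, prov):
    """Raise CurationError carrying the provenance if anything is missing."""
    problems = curation_problems(study, tree)
    if problems:
        raise CurationError(problems, prov)
```

Collecting all problems before raising means a curator sees every missing item for a tree in one report, rather than fixing them one run at a time.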
Step 1:
Input:
- path to pruned taxonomic tree (read-only)
- filepath to NexSON with one tree
- output directory
Output:
- version of each tree with level 1 remapping - mandatory pruning
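    # Pseudocode: parse_newick, parse_nexson, suppress_out_degree_one, iter_leaves,
    # get_ott_id, included_anc_ids, is_anc_of_id, has_is_exemplar_tag, form_a_clade,
    # arbitrary_exemplar, save_json_to_path, WARN, WARN_OR_ERROR and CurationError
    # are assumed to be defined elsewhere.
    import os
    import sys

    taxonomy = parse_newick(sys.argv[1])
    inf = sys.argv[2]
    nexson = parse_nexson(inf)
    outpath = os.path.join(sys.argv[3], os.path.split(inf)[1])
    assert outpath != inf
    assert nexson.num_trees() == 1
    tree = nexson.get_trees()[0]
    prov = {'local_path': inf}  # minimal provenance for error reports
    ottId2leaf = {}
    unmapped_dropped = []
    pruned_leaves = []
    suppressed_internals = []
    sup = suppress_out_degree_one(tree)
    suppressed_internals.extend(sup)
    # Group the leaves by OTT ID; leaves without a mapping will be dropped.
    for leaf in iter_leaves(tree):
        ott_id = get_ott_id(leaf)
        if ott_id is None:
            unmapped_dropped.append(leaf)
        else:
            ottId2leaf.setdefault(ott_id, []).append(leaf)
    # Drop any mapped taxon that is an ancestor of another mapped taxon.
    pruned_higher_taxon = []
    leaf_ott_ids = set()
    for ott_id in ottId2leaf.keys():
        # previously accepted IDs that turn out to be ancestors of ott_id
        anc_ids = included_anc_ids(taxonomy, ott_id, leaf_ott_ids)
        pruned_higher_taxon.extend(list(anc_ids))
        leaf_ott_ids.difference_update(anc_ids)
        if is_anc_of_id(taxonomy, ott_id, leaf_ott_ids):
            pruned_higher_taxon.append(ott_id)
        else:
            leaf_ott_ids.add(ott_id)
    # Choose one exemplar leaf for every remaining (non-nested) taxon.
    uniq_o2n = {}
    for ott_id in leaf_ott_ids:
        nodes = ottId2leaf[ott_id]
        if len(nodes) > 1:
            is_exemplar = [i for i in nodes if has_is_exemplar_tag(i)]
            if len(is_exemplar) > 1:
                raise CurationError({'tree_errors': 'multiple nodes flagged as exemplar for the same taxon: {}'.format(is_exemplar)}, prov)
            if len(is_exemplar) == 1:
                ex = is_exemplar[0]
            elif form_a_clade(tree, nodes):
                ex = arbitrary_exemplar(nodes)
                WARN('arbitrarily chose {} to represent {}'.format(ex, ott_id))
            else:
                ex = arbitrary_exemplar(nodes)
                WARN_OR_ERROR('arbitrarily chose {} to represent {} - and this decision might affect the input to the supertree!'.format(ex, ott_id))
            pruned_leaves.extend([i for i in nodes if i != ex])
            uniq_o2n[ott_id] = ex
        else:
            uniq_o2n[ott_id] = nodes[0]
    if len(uniq_o2n) < 3:
        raise CurationError({'tree_errors': 'fewer than 3 nodes mapped to unique, non-nested OTT taxa'}, prov)
    # prune() is assumed to return the internal nodes it had to suppress.
    suppressed_internals.extend(tree.prune(pruned_leaves))
    higher_taxon_leaves = [leaf for i in pruned_higher_taxon for leaf in ottId2leaf[i]]
    suppressed_internals.extend(tree.prune(higher_taxon_leaves))
    suppressed_internals.extend(tree.prune(unmapped_dropped))
    save_json_to_path(nexson, outpath)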
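The "non-nested" requirement in the pruning above can be illustrated with a toy taxonomy. The sketch below assumes the taxonomy is available as a child-to-parent dict (a simplification; the pseudocode works on a parsed newick tree), and the names `ancestors` and `split_nested` are hypothetical.

```python
def ancestors(parent, ott_id):
    """Yield the ancestors of ott_id, nearest first, using a child->parent map."""
    node = parent.get(ott_id)
    while node is not None:
        yield node
        node = parent.get(node)


def split_nested(parent, mapped_ids):
    """Partition mapped_ids into (kept, pruned_higher_taxa).

    An ID is pruned when it is an ancestor of another mapped ID, because a
    tip mapped to that higher taxon would overlap with the more specific tip.
    """
    mapped = set(mapped_ids)
    pruned = set()
    for ott_id in mapped:
        for anc in ancestors(parent, ott_id):
            if anc in mapped:
                pruned.add(anc)
    return mapped - pruned, pruned


# Toy taxonomy with made-up IDs: 1 is the root, 2 and 3 are its children,
# 4 and 5 are children of 2.
toy_parent = {2: 1, 3: 1, 4: 2, 5: 2}
kept, pruned = split_nested(toy_parent, [2, 3, 4])
```

Here the tip mapped to taxon 2 is pruned because another tip is mapped to its descendant 4, while 3 and 4 are kept.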
Step 2:
Inputs:
- taxonomy
- list of all nexson
Output:
- Each tree with level-2 remapping (no contested tips): pruning of any tip that was mapped to a higher taxon (in the curated NexSON) which is contested by another input tree.
- Each tree with level-3 remapping: the OTT ID of each tip from the level-2 remapping is replaced with the deepest uncontested OTT ID that it exemplifies.
- A pruned newick of the taxonomy which is the induced tree for the remapped ottID set that was created by this operation.
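This page does not spell out when a taxon counts as "contested". One plausible reading, sketched below, is the usual cluster-compatibility test: restrict the taxon to the leaves that the input tree samples, then call it contested if some clade of the tree overlaps it without either set containing the other. The function names and the set-based data layout are assumptions made for illustration.

```python
def conflicts(taxon_leaves, clade_leaves, tree_leaves):
    """True if a tree clade contradicts a taxon.

    The taxon is first restricted to the leaves the tree actually samples.
    Two leaf sets are compatible when they are disjoint or one contains
    the other; any other overlap is a conflict.
    """
    taxon = set(taxon_leaves) & set(tree_leaves)
    clade = set(clade_leaves)
    return bool(taxon & clade) and bool(taxon - clade) and bool(clade - taxon)


def contested_taxa(taxon_to_leaves, trees):
    """Return the taxon IDs contested by at least one input tree.

    `trees` is a list of (tree_leaves, list_of_clade_leaf_sets) pairs.
    """
    contested = set()
    for tree_leaves, clades in trees:
        for taxon_id, taxon_leaves in taxon_to_leaves.items():
            if any(conflicts(taxon_leaves, c, tree_leaves) for c in clades):
                contested.add(taxon_id)
    return contested


# Made-up IDs: taxon 10 contains leaves 1 and 2, taxon 20 contains 3 and 4.
toy_taxa = {10: {1, 2}, 20: {3, 4}}
# One tree on leaves 1, 2, 3 that groups 2 with 3, which breaks up taxon 10.
toy_trees = [({1, 2, 3}, [{2, 3}])]
toy_contested = contested_taxa(toy_taxa, toy_trees)
```

In the toy data only taxon 10 is contested; taxon 20 is represented by a single sampled leaf, so no clade of the tree can contradict it.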
    # Pseudocode: parse_newick, add_mrca_sets, is_contested, WARN,
    # store_level_2_remapping, store_level_3_remapping and the taxonomy/tree
    # methods are assumed helpers. `inputs` (the parsed input trees) and
    # `pruned_taxo_filepath` are assumed to be set up by the caller.
    import sys

    taxonomy = parse_newick(sys.argv[1])
    add_mrca_sets(taxonomy)
    # Pass 1: flag every taxon that conflicts with a clade of some input tree.
    for input_tree in inputs:
        add_mrca_sets(input_tree)
        for clade in input_tree.iter_internal_node():
            taxonomy.flag_nodes_that_conflict(input_tree.leaf_taxon_set, clade.leaf_taxon_set)
    # Pass 2: prune contested tips (level 2), then remap the rest (level 3).
    for input_tree in inputs:
        output_path_for_this_tree = ...
        prune = []
        remap = []
        taxonomy.clear_encountered_marks()
        for leaf in input_tree.iter_leaves():
            taxonomy.mark_path_to_root_as_encountered(leaf.ott_id)
        for leaf in input_tree.iter_leaves():
            taxon = taxonomy.nodebyid[leaf.ott_id]
            if is_contested(taxon):
                prune.append(leaf)
            else:
                new_taxon = taxonomy.find_deepest_uncontested_terminal_edge_anc(taxon)
                if new_taxon is not taxon:
                    remap.append((leaf, taxon, new_taxon))
        suppressed_internals = input_tree.prune(prune)
        store_level_2_remapping(input_tree)
        if input_tree.num_tips < 3:
            WARN('removing of leaves that were mapped to contested higher taxa resulted in an uninformative input...')
        input_tree.remap_taxa(remap)
        store_level_3_remapping(input_tree)
    # Pass 3: prune the taxonomy down to the tree induced by the remapped tips.
    taxonomy.clear_encountered_marks()
    for input_tree in inputs:
        for leaf in input_tree.iter_leaves():
            taxonomy.mark_path_to_root_as_encountered(leaf.ott_id)
    taxonomy.prune_unencountered_nodes()
    with open(pruned_taxo_filepath, 'w') as out:
        taxonomy.write_newick(out, internal_names=True)
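The last part of this step (mark each path to the root, then prune everything unmarked) can be shown on a toy taxonomy. This sketch assumes the taxonomy is given as a child-to-parent dict rather than a parsed newick tree; `induced_parent_map` is a hypothetical name, and the IDs are made up.

```python
def induced_parent_map(parent, leaf_ids):
    """Return the child->parent map restricted to paths from leaf_ids to the root."""
    encountered = set()
    for leaf in leaf_ids:
        node = leaf
        # Walk rootward; stop early once we join an already-marked path.
        while node is not None and node not in encountered:
            encountered.add(node)
            node = parent.get(node)
    # Keep only the edges whose child end was marked as encountered.
    return {child: par for child, par in parent.items() if child in encountered}


# Toy taxonomy: 1 is the root; 2 and 3 are its children; 4 and 5 sit under 2,
# and 6 sits under 3.
toy_parent = {2: 1, 3: 1, 4: 2, 5: 2, 6: 3}
toy_induced = induced_parent_map(toy_parent, [4, 6])
```

Note that this keeps nodes that end up with a single child (2 and 3 in the toy result); whether those should also be suppressed is left open here, as it is in the pseudocode.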
Step 3:
Input:
- taxonomy (pruned from step 2 is what we'll use for speed)
- set of level 3 remapped input trees
Output:
- Artifact: each tree in level-4 remapping. Any tip that is mapped to an internal node in the taxo-2 tree is replaced with a polytomy of the leaf taxa below it, plus a marker to indicate that the polytomy node is an assumption that is not supported by the tree. If the pipeline were more modular, this would be an intermediate (the input to the creation of the level-5 remapping), but it is more efficient to write both from one execution.
- intermediate: each tree in level-5 remapping (pruned phylo singletons). Any tip that is seen in only one phylogeny is culled. This is an input for step 4.
- intermediate: phylo-informative taxonomic pruning. The taxonomy is pruned to the tree induced by the leaf set of the level-5 remapped inputs. This is an input for step 4.
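The level-4 and level-5 remappings described above might look roughly like this on toy data. The sketch assumes the taxonomy is a parent-to-children dict and each input tree is reduced to its list of tip IDs; the function names are hypothetical and the IDs are made up.

```python
from collections import Counter


def leaves_below(children, taxon_id):
    """Return the leaf taxa under taxon_id (or [taxon_id] if it is a leaf)."""
    kids = children.get(taxon_id, [])
    if not kids:
        return [taxon_id]
    found = []
    for kid in kids:
        found.extend(leaves_below(children, kid))
    return found


def expand_tips(children, tip_ids):
    """Level-4 style expansion: map each tip to the leaf taxa it stands for.

    A tip with more than one leaf below it corresponds to an assumed
    polytomy, so those tips are reported separately for marking.
    """
    expanded = {tip: leaves_below(children, tip) for tip in tip_ids}
    assumed_polytomies = sorted(t for t, below in expanded.items() if len(below) > 1)
    return expanded, assumed_polytomies


def cull_singletons(leaf_sets):
    """Level-5 style culling: drop tips that occur in only one tree."""
    counts = Counter(tip for leaves in leaf_sets for tip in set(leaves))
    return [sorted(t for t in set(leaves) if counts[t] > 1) for leaves in leaf_sets]


# Toy taxonomy: 1 has children 2 and 3; 2 has children 4 and 5.
toy_children = {1: [2, 3], 2: [4, 5]}
toy_expanded, toy_polytomies = expand_tips(toy_children, [2, 3])
toy_culled = cull_singletons([[4, 5, 3], [4, 5, 6]])
```

In the toy data, the tip mapped to taxon 2 becomes an assumed polytomy of 4 and 5, and the tips 3 and 6 are culled because each appears in only one tree.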
Step 4:
Input:
- taxonomy (pruned is what we'll use for speed)
- set of fully remapped inputs
Output:
- intermediate: a subdir for each problem to be solved, containing versions of each relevant tree pruned down to the relevant leaf set for this subproblem. These are the inputs of step 5.
- intermediate: some JSON format for the reattachment recipe describing grafting.
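Pruning a tree down to the leaf set of a subproblem can be sketched as below. The nested-tuple tree representation and the function name `restrict` are assumptions made for illustration; how the subproblems themselves are chosen, and the format of the reattachment recipe, are not specified on this page and are not covered here.

```python
def restrict(tree, keep):
    """Prune a nested-tuple tree to the leaves in `keep`.

    Leaves are any non-tuple values. Internal nodes left with a single
    child are suppressed, and subtrees with no kept leaf return None.
    """
    if not isinstance(tree, tuple):
        return tree if tree in keep else None
    kids = [k for k in (restrict(child, keep) for child in tree) if k is not None]
    if not kids:
        return None
    if len(kids) == 1:
        return kids[0]
    return tuple(kids)


# Toy tree ((1,2),(3,(4,5))) restricted to the made-up subproblem {1, 4, 5}.
toy_tree = ((1, 2), (3, (4, 5)))
toy_pruned = restrict(toy_tree, {1, 4, 5})
```

A tree that shares no leaves with the subproblem comes back as `None`, which gives a simple way to decide which trees are "relevant" to a subproblem.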
Step 5:
Input:
- relevant trees from step 4
Output:
- intermediate: phylogenetic graph for the subproblem
Step 6:
Input:
- intermediate phylogenetic graphs from step 5.
- intermediate reattachment strategy from step 4.
Output:
- intermediate phylogenetic graph for phylo-tips that are not singletons
Step 7:
Input:
- phylogenetic graph for phylo-tips that are not singletons
- level-2 remapped input trees, which will be used as the source of phyloreference definitions to be grafted back on.
Output:
- intermediate phylogenetic graph for taxa included in a phylogeny
Step 8:
Input:
- phylogenetic graph for all phylo tips from step 7
- original taxonomy, which will be used as the source of phyloreference definitions to be grafted back on.
Output:
- primary result phylogenetic graph of life
Step 9:
Input:
- the primary result phylogenetic graph of life from step 8
Output:
- secondary result a phylogenetic tree of life
Traverse the bubble of the phylogenetic graph to return an optimal tree.