BisPin_postprocess.py · 1654 lines (1586 loc) · 97.4 KB
#!/usr/bin/python
import sys
sys.dont_write_bytecode = True
import optparse
import datetime
import os
import multiprocessing
from Utilities import SeqIterator
from Utilities import SeqDoubleIterator
from Utilities import Constants
from Utilities import BisPin_util
from Utilities import IndentedHelpFormatterWithNL
#from Utilities import Deprecator
"""
@author: Jacob Porter
@summary: The postprocessing module selects the best scoring alignments from BFAST SAM files and modifies the flag and the MD tag as appropriate.
@requires: The SAM files (from BisPin_align) must exist and must use the correct naming convention.
TODO: Make sure that the postprocessing can handle reference and reads with wildcard N characters.
TODO: BFAST gives incorrect flags for paired end mapping that indicate incorrect orientations.
#These flags are for the GA converted genome
#115 should be 83 RC
#179 should be 147 not RC
#These flags are for the CT converted genome
#65 should be 99 not RC
#129 should be 147 RC
TODO: Bismark has a bug in PBAT alignments. When the reverse complement is mapped, the SAM flag does not indicate this.
TODO: The program crashed in the multiple index functionality when combining a partition, multiprocessing, and multiple indexes. This appears to be fixed.
TODO: Test the new rescoring functionality for positive C to T matrix scores.
TODO: Does the rescoring correctly score for the right direction when it looks at methylation calls?
TODO: Hairpin recovered methylation calling statistics are substantially different compared to not using recovery. This could be a bug.
TODO: Using BFAST-Gap there is sometimes a read that cannot be found in the alignments file. This could be because of multiple indexes. Currently, BisPin skips past records that it cannot find.
TODO: Perhaps the secondary index functionality should always be used in case there are missing or out-of-order records.
TODO: When a deletion occurs before an insertion, the length of the deletion is shorter in the CIGAR string compared to the MD tag string.
This occurs rarely with BFAST-Gap with the logistic and exponential settings. It is unclear if this is related to the changes in BFAST-Gap or if it is a rare bug in BFAST. The bug is not always present.
Looking at some examples, it makes more sense to prefer a shorter deletion (by one) since this will result in fewer downstream mismatches and a higher score. The CIGAR string is correct, but the MD tag is not.
"""
logstr = "\nBisPin_postprocess: "
def writeSamComments(sam_output, args):
"""
Writes comments into a SAM file.
@param sam_output: A file object to write to.
@param args: A list/tuple of strings. The list/tuple can contain lists or tuples of strings. These represent the comments.
"""
for arg in args:
if isinstance(arg, list) or isinstance(arg, tuple):
for s in arg:
sam_output.write(str(s))
else:
sam_output.write(str(arg))
def selectBestAlignmentAndPostProcessSAMRecord(sam_files_list, pool, input_protocol, layout, read_iterator,
outputfile, cutoff_bs, cutoff_r, rescore_matrix, start, end,
single_process_switch, gzip_switch, command_line_string,
manager, genome_file, use_secondary_indexes, hairpin_post, remove_comments):
"""
This function selects and processes the best alignments from the BFAST alignments.
A SAM file is written that indicates the best alignments.
@param sam_files_list: A list of raw BFAST SAM files to post process.
@param pool: A pool object for multiprocessing. (Not used. Set to None.)
@param input_protocol: A number from Constants indicating the input protocol.
@param layout: A number from Constants indicating the type of layout.
@param read_iterator: A single (single end data) or double iterator of the input FASTQ records. See the SeqIterator file.
@param outputfile: A string of the output file for writing the post processed SAM records.
@param cutoff_bs: The minimum score threshold. Reads not meeting this score will be marked as filtered.
@param cutoff_r: The minimum score threshold for recovered reads (from hairpin data).
@param rescore_matrix: A 2-tuple of dictionaries representing the rescoring function for multireads. If both elements are set to None, no rescoring will be done.
@param start: An integer of the first read to process from the input. If start and end are set to None, process all input.
@param end: An integer of the last read to process from the input.
@param single_process_switch: A boolean when set to True will cause only a single process to be used for post-processing.
@param gzip_switch: A boolean indicating whether the output should be gzip compressed. If True, compression is turned on.
@param command_line_string: A string representing the command line parameters and arguments. Added to the SAM file output.
@param manager: A multiprocessing manager server for coordinating the worker processes.
@param genome_file: A string indicating the location of the genome file.
@param use_secondary_indexes: A boolean that is set to True if the SAM files were constructed using secondary indexes. False otherwise.
@param hairpin_post: The file location for the recovered reads from hairpin sequencing.
@param remove_comments: If set to True, no comments will be written to the SAM file.
@return: A 2-tuple. In the first half are stats for the alignment efficiency, a breakdown of how reads uniquely mapped, and methylation context counting. The second half gives the time for loading the reference genome.
"""
#Initialize counters and get SAM file information and iterators
(sam_iterator_dictionary, sam_conversion_info) = makeSamFileTypeDictionary(sam_files_list)
if use_secondary_indexes:
(sam_iterator_dictionary2, _) = makeSamFileTypeDictionary(sam_files_list)
sam_iterator_keys = sam_iterator_dictionary.keys()
filtered_unmapped = 0
unaligned_unmapped = 0
ambiguous = 0
unique = 0
record_count = 0
counter = -1
ambiguous_uniquified = 0
methylation_counter = {}
uniquely_mapped_counter = {}
sys.stderr.write(logstr + "Loading the reference genome into memory.\n")
sys.stderr.flush()
#Load the reference genome into memory.
if single_process_switch:
(genome_dictionary, id_length_list, genome_construction_time) = makeReferenceGenomeDictionary(genome_file)
comments = ("@HD\tVN:1.3\tSO:unsorted\tGO:none\n", map(lambda x: "@SQ\tSN:%s\tLN:%s\n" % x , id_length_list),
"@PG\tID:%s\tCL:%s\tVN:%s\n" % (Constants.SAM_VALUE_PROGRAM, command_line_string ,Constants.version))
sam_output = SeqIterator.SeqWriter(BisPin_util.getPossibleGZIPFile(outputfile, gzip_switch, addGZ = True), file_type = Constants.SAM)
if not remove_comments:
writeSamComments(sam_output, comments)
else:
processes = []
pipes_parent = []
pipes_child = []
return_info_objects = manager.list()
comments = manager.list()
q_read = manager.Queue()
q_write = manager.Queue()
q_coordinate = manager.Queue()
for _ in range(Constants.DEFAULTNUMPROCESSES - 1):
pipe_worker = multiprocessing.Pipe()
pipes_parent.append(pipe_worker[0])
pipes_child.append(pipe_worker[1])
now = datetime.datetime.now()
finished = manager.Event() #This Event signals when the process is done loading the reference genome.
#Process for loading the reference genome
p_genome = multiprocessing.Process(target = multiprocessGenomeService, args = (genome_file, pipes_parent,
finished, return_info_objects,
comments, command_line_string))
p_genome.start()
finished.wait()
later = datetime.datetime.now()
#Start worker processes that will post process the SAM records.
for i in range(Constants.DEFAULTNUMPROCESSES - 1):
p_worker = multiprocessing.Process(target = multiprocessPostProcessing, args = (q_read, q_write,
cutoff_bs, cutoff_r,
rescore_matrix, input_protocol,
layout, pipes_child[i]))
p_worker.start()
processes.append(p_worker)
#Start the process that writes the post processed SAM records to the output file.
p_writer = multiprocessing.Process(target = multiprocessWriteSAMRecord, args = (q_write, outputfile,
gzip_switch, comments, remove_comments,
Constants.DEFAULTNUMPROCESSES - 1,
q_coordinate))
p_writer.start()
genome_construction_time = later - now
#Loop through the input reads and post process each one. This loop assumes that the order of reads
#by sequence id in the SAM file(s) and the input FASTQ file(s) are the same except when
#using secondary indexes. This case is handled with alternate SAM file iterators.
if input_protocol == Constants.PROTOCOL_HAIRPIN:
hairpin_post_fd = open(hairpin_post, 'r')
if use_secondary_indexes:
hairpin_post_fd_2 = open(hairpin_post, 'r')
for record in read_iterator:
counter += 1
if (start != None and counter < start):
continue
if (end != None and counter > end):
break
record_count += 1
write_record = ""
meilleur_alignement_dictionnaire = {}
if layout == Constants.LAYOUT_PAIRED or input_protocol == Constants.PROTOCOL_HAIRPIN:
record1 = record[0]
record2 = record[1]
alignement_key = BisPin_util.extractPairedID(record1[0], record2[0])
else:
record1 = record
record2 = None
record = [record]
alignement_key = record1[0].split()[0]
empty = True
if input_protocol == Constants.PROTOCOL_HAIRPIN:
hairpin_recovery_location = hairpin_post_fd.readline().split()
while use_secondary_indexes and hairpin_recovery_location[0] != alignement_key:
line = hairpin_post_fd_2.readline()
if line == "" or line == None:
break
hairpin_recovery_location = line.split()
if hairpin_recovery_location[0] != alignement_key:
sys.stderr.write(logstr + "The hairpin key from the hairpin recovery post processing file did not match the input FASTQ file: " + hairpin_recovery_location[0] + ", " + alignement_key + "\n")
#Extract all reads from the SAM files that match the input read(s)
for key in sam_iterator_keys:
# if input_protocol == Constants.PROTOCOL_HAIRPIN:
# sys.stderr.write("For location %s, we have %s\n" % (str(key), str(hairpin_recovery_location)))
# sys.stderr.flush()
if input_protocol == Constants.PROTOCOL_HAIRPIN and hairpin_recovery_location[1].startswith(Constants.CONV_R_R) and key != Constants.CONV_R_R:
continue
elif input_protocol == Constants.PROTOCOL_HAIRPIN and not hairpin_recovery_location[1].startswith(Constants.CONV_R_R) and key == Constants.CONV_R_R:
continue
brut_alignements = []
conversion_info = sam_conversion_info[key]
next_Id = sam_iterator_dictionary[key].peekAtId()
if (layout == Constants.LAYOUT_PAIRED or input_protocol == Constants.PROTOCOL_HAIRPIN):
next_Id = BisPin_util.extractPairedID(alignement_key, next_Id)
# if input_protocol == Constants.PROTOCOL_HAIRPIN:
# sys.stderr.write("Searching for alignement key %s in location %s. next_Id is: %s\n" % (str(alignement_key), str(key), str(next_Id)))
# sys.stderr.flush()
in_alternate_file = False
#Check the alternate iterator for the SAM file if secondary indexes are used since
#reads that were unmapped in the primary indexes are put at the end of the SAM file.
if use_secondary_indexes and next_Id != None and not alignement_key == next_Id:
in_alternate_file = True
while(True):
next_Id = sam_iterator_dictionary2[key].peekAtId()
if (layout == Constants.LAYOUT_PAIRED or input_protocol == Constants.PROTOCOL_HAIRPIN):
#next_Id = "".join(pairedRegularExpression.split(next_Id))
next_Id = BisPin_util.extractPairedID(alignement_key, next_Id)
# if input_protocol == Constants.PROTOCOL_HAIRPIN:
# sys.stderr.write("The next_Id in the second iterator is: %s\n" % (str(next_Id)))
if next_Id != None and alignement_key == next_Id:
# if input_protocol == Constants.PROTOCOL_HAIRPIN:
# sys.stderr.write("Breaking out of the second iterator.\n")
break
else:
try:
sam_iterator_dictionary2[key].next()
except StopIteration as si:
#If this happens, something is wrong, and the SAM record was not found in the BFAST file.
#This could be a bug in BFAST-Gap.
sys.stderr.write("The alignment key %s for file type %s could not be found, and the end of the second iterator was reached.\n" % (str(alignement_key), str(key)) )
next_Id = None
sam_iterator_dictionary2[key].reset()
break
#raise si
#while(alignement_key == sam_iterator_dictionary[key].peekAtId()):
while(next_Id != None and alignement_key == next_Id):
if in_alternate_file:
sam_record = sam_iterator_dictionary2[key].next()
else:
sam_record = sam_iterator_dictionary[key].next()
if isFirstSegment(sam_record[Constants.SAM_KEY_FLAG]):
sam_record[Constants.SAM_KEY_READ] = conversion_info[0]
sam_record[Constants.SAM_KEY_GENOME] = conversion_info[2]
elif isLastSegment(sam_record[Constants.SAM_KEY_FLAG]):
sam_record[Constants.SAM_KEY_READ] = conversion_info[1]
sam_record[Constants.SAM_KEY_GENOME] = conversion_info[2]
else:
sam_record[Constants.SAM_KEY_READ] = conversion_info[0]
sam_record[Constants.SAM_KEY_GENOME] = conversion_info[1]
brut_alignements.append(sam_record)
if in_alternate_file:
next_Id = sam_iterator_dictionary2[key].peekAtId()
else:
next_Id = sam_iterator_dictionary[key].peekAtId()
if (layout == Constants.LAYOUT_PAIRED or input_protocol == Constants.PROTOCOL_HAIRPIN):
#next_Id = "".join(pairedRegularExpression.split(next_Id))
next_Id = BisPin_util.extractPairedID(alignement_key, next_Id)
if brut_alignements != []:
empty = False
meilleur_alignement_dictionnaire[key] = brut_alignements
#If using only one process, process the read.
if single_process_switch:
(write_record, ambiguous_uniquified_integer, direction_id) = computeWriteRecord(meilleur_alignement_dictionnaire,
empty,
record1,
record2,
alignement_key,
cutoff_bs,
cutoff_r,
rescore_matrix,
record,
input_protocol,
layout,
methylation_counter,
genome_dictionary)
ambiguous_uniquified += ambiguous_uniquified_integer
if write_record[0] == Constants.SAM_WRITE_UNMAPPED:
unaligned_unmapped += 1
elif write_record[0] == Constants.SAM_WRITE_FILTERED:
filtered_unmapped += 1
elif write_record[0] == Constants.SAM_WRITE_AMBIGUOUS:
ambiguous += 1
elif write_record[0] == Constants.SAM_WRITE_UNIQUE:
unique += 1
uniquely_mapped_counter[direction_id] = uniquely_mapped_counter.get(direction_id, 0) + 1
writeSAMRecord(write_record, sam_output)
#If using multiple processes, put the reads on a work queue and the read id on another queue
#so that the output file writer process will write the reads in the same order as the input.
else:
q_read.put((meilleur_alignement_dictionnaire, empty, record1, record2, alignement_key, record))
q_coordinate.put(alignement_key)
#Once all the work is put onto the work queue, join the processes and extract statistic information.
if not single_process_switch:
q_coordinate.put(None)
for _ in range(Constants.DEFAULTNUMPROCESSES - 1):
q_read.put(None)
for proc in processes:
proc.join()
p_genome.join()
for report_info in return_info_objects:
alignment_distribution = report_info[0]
unique += alignment_distribution[1]
ambiguous += alignment_distribution[2]
unaligned_unmapped += alignment_distribution[3]
filtered_unmapped += alignment_distribution[4]
ambiguous_uniquified += alignment_distribution[5]
p_uniquely_mapped_counter = report_info[1]
p_methylation_counter = report_info[2]
for key in p_uniquely_mapped_counter:
uniquely_mapped_counter[key] = uniquely_mapped_counter.get(key, 0) + p_uniquely_mapped_counter[key]
for key in p_methylation_counter:
methylation_counter[key] = methylation_counter.get(key, 0) + p_methylation_counter[key]
p_writer.join()
return (([record_count, unique, ambiguous, unaligned_unmapped, filtered_unmapped, ambiguous_uniquified],
uniquely_mapped_counter, methylation_counter), genome_construction_time)
def multiprocessGenomeService(genome_file, pipes_parent, finished, return_info_objects, comments, command_line_string):
"""
For multiprocessing, this function loads the reference genome into a dictionary and services requests for reference genome info.
@param genome_file: A string indicating the location of the reference genome FASTA file.
@param pipes_parent: A list of pipe endpoints. Worker processes put requests for the reference genome on these pipes.
@param finished: An event object that indicates when the process is done loading the reference genome into memory.
@param return_info_objects: A process-safe list used to report each process's stats.
@param comments: A multiprocess list for comments. The SAM writer process will write this to the SAM file.
@param command_line_string: The command line parameter and argument string to write to the SAM file
"""
#Load the reference genome
(genome_dictionary, id_length_list, _) = makeReferenceGenomeDictionary(genome_file)
genome_comments = map(lambda x: "@SQ\tSN:%s\tLN:%s\n" % x , id_length_list)
comments.append("@HD\tVN:1.3\tSO:unsorted\tGO:none\n")
for g in genome_comments:
comments.append(g)
comments.append("@PG\tID:%s\tCL:%s\tVN:%s\n" % (Constants.SAM_VALUE_PROGRAM, command_line_string, Constants.version))
finished.set()
#Service requests for reference genome information using full duplex pipes.
while pipes_parent != []:
remove_pipes = []
for pipe in pipes_parent:
if pipe.poll(0):
pipe_object = pipe.recv()
if pipe_object[0] == None:
remove_pipes.append(pipe)
return_info_objects.append(pipe_object[1])
else:
pipe.send(genome_dictionary[pipe_object[0]][pipe_object[1] : pipe_object[2]])
for pipe in remove_pipes:
pipes_parent.remove(pipe)
pipe.close()
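The request protocol served by the loop above reduces to a single round: a worker sends (sequence name, start, end) over a duplex pipe and receives the matching genome slice back, while a (None, stats) message retires the worker and hands over its final statistics. The function name and data shapes below are illustrative, not part of the module API.

```python
import multiprocessing

# Simplified sketch of one round of the genome-service loop above: receive a
# request on a duplex pipe, then either return the worker's final stats
# (request of the form (None, stats)) or send back the requested slice.
def serve_once(conn, genome_dictionary):
    request = conn.recv()
    if request[0] is None:
        return request[1]  # the worker's final statistics object
    name, start, end = request
    conn.send(genome_dictionary[name][start:end])
    return None
```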
def multiprocessPostProcessing(q_read, q_write, cutoff_bs, cutoff_r, rescore_matrix, input_protocol, layout, genome_pipe):
"""
A function used by multiprocess workers for doing all the post processing of SAM records. Work is put onto a work queue.
@param q_read: A multiprocess work queue for extracting work.
@param q_write: A multiprocess work queue for depositing finished work.
@param cutoff_bs: A float for filtering low quality reads.
@param cutoff_r: A float for filtering low quality hairpin recovered reads.
@param rescore_matrix: A dictionary representing the rescoring function for multireads. If None, it will not be used.
@param input_protocol: A number from Constants indicating the input protocol.
@param layout: A number from Constants indicating the type of layout.
@param genome_pipe: A pipe to ask for genome reference information.
"""
#Initializes counters for statistics.
methylation_counter = {}
uniquely_mapped_counter = {}
filtered_unmapped = 0
unaligned_unmapped = 0
ambiguous = 0
unique = 0
ambiguous_uniquified = 0
record_count = 0
#An infinite loop for getting work from a work queue and doing the post processing.
while(True):
info = q_read.get()
if info is None:
q_write.put(None)
return_info = ([record_count, unique, ambiguous, unaligned_unmapped, filtered_unmapped, ambiguous_uniquified],
uniquely_mapped_counter, methylation_counter)
genome_pipe.send((None, return_info))
return
else:
record_count += 1
(meilleur_alignement_dictionnaire, empty, record1, record2, alignement_key, record) = info
(write_record, ambiguous_uniquified_integer, direction_id) = computeWriteRecord(meilleur_alignement_dictionnaire,
empty,
record1,
record2,
alignement_key,
cutoff_bs,
cutoff_r,
rescore_matrix,
record,
input_protocol,
layout,
methylation_counter,
None,
genome_pipe=genome_pipe)
ambiguous_uniquified += ambiguous_uniquified_integer
if write_record[0] == Constants.SAM_WRITE_UNMAPPED:
unaligned_unmapped += 1
elif write_record[0] == Constants.SAM_WRITE_FILTERED:
filtered_unmapped += 1
elif write_record[0] == Constants.SAM_WRITE_AMBIGUOUS:
ambiguous += 1
elif write_record[0] == Constants.SAM_WRITE_UNIQUE:
unique += 1
uniquely_mapped_counter[direction_id] = uniquely_mapped_counter.get(direction_id, 0) + 1
q_write.put(write_record)
def multiprocessWriteSAMRecord(q_write, outputfile, gzip_switch, comments, remove_comments, num_processes, q_coordinate):
"""
This function writes post processed SAM records to a file with multiprocessing.
@param q_write: A queue for getting records to write to the SAM file.
@param outputfile: A string representing the location of the SAM output file.
@param gzip_switch: A boolean indicating whether the output should be gzip compressed. If True, compression is turned on.
@param comments: A multiprocess list for comments. The SAM writer process will write this to the SAM file.
@param remove_comments: If set to True, no comments will be written to the SAM file.
@param num_processes: The number of worker processes doing work. This is used to kill the process when the work is done.
@param q_coordinate: A coordination data structure that is used to ensure that SAM records have the same order as the input file.
"""
poison_pills = 0
write_cache = {} #A cache of SAM records that have been received out of order.
sam_output = SeqIterator.SeqWriter(BisPin_util.getPossibleGZIPFile(outputfile, gzip_switch, addGZ = True),
file_type = Constants.SAM)
if not remove_comments:
writeSamComments(sam_output, comments)
sam_output.flush()
coordinateFull = True
writerFull = True
while (coordinateFull and writerFull):
while (coordinateFull): # An infinite loop that gets the next sequence id to write and checks the write cache if it is there.
alignement_key = q_coordinate.get()
if alignement_key == None:
coordinateFull = False
else:
write_cache_record = write_cache.get(alignement_key, None)
if write_cache_record != None:
writeSAMRecord(write_cache_record, sam_output)
else:
break
while (writerFull): #If a sequence id is not in the write cache, search for it on the queue until it is found.
write_record = q_write.get()
if write_record == None: #Terminates the process if enough 'poison pills' are received.
poison_pills += 1
if poison_pills == num_processes:
writerFull = False
else:
record_key = write_record[1][0][0][Constants.SAM_KEY_QNAME]
if record_key.startswith(alignement_key):
writeSAMRecord(write_record, sam_output)
break
else:
write_cache[record_key] = write_record #Put records that do not match the record id on the cache.
sam_output.flush()
sam_output.close()
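The ordering scheme in multiprocessWriteSAMRecord (a coordination queue of expected keys plus a cache for records that arrive early) can be reduced to the following single-threaded sketch; the function name and data shapes are illustrative only.

```python
# Simplified, single-threaded model of the writer's ordering scheme: consume
# expected keys in input order, buffering any (key, record) pairs that arrive
# from the workers out of order until their turn comes.
def reorder(expected_keys, arrivals):
    arrivals = iter(arrivals)
    cache = {}  # records received before their key was requested
    for key in expected_keys:
        while key not in cache:
            k, rec = next(arrivals)
            cache[k] = rec
        yield cache.pop(key)
```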
def computeWriteRecord(meilleur_alignement_dictionnaire, empty, record1, record2, alignement_key,
cutoff_bs, cutoff_r, rescore_matrix, record, input_protocol, layout,
methylation_counter, genome_dictionary, genome_pipe = None):
"""
Does the work of post processing a SAM record and generates a write_record object.
@param meilleur_alignement_dictionnaire: A dictionary representing the SAM records to choose and process.
@param empty: A boolean indicating if the meilleur_alignement_dictionnaire is empty.
@param record1: The first pair FASTQ input record for paired end data. The sole single record for single end data.
@param record2: The second pair FASTQ record for paired end data (and hairpin data). None for single end data.
@param alignement_key: The sequence id of the FASTQ record. Must match the SAM QNAME field.
@param cutoff_bs: A float for filtering low quality reads.
@param cutoff_r: A float for filtering low quality hairpin recovered reads.
@param rescore_matrix: A 2-tuple of dictionaries representing the rescoring function for multireads. If both elements are set to None, no rescoring will be done.
@param record: The two records record1 and record2 in a list.
@param input_protocol: A number from Constants indicating the input protocol.
@param layout: A number from Constants indicating the type of layout.
@param methylation_counter: A dictionary for counting methylation context information.
@param genome_dictionary: A dictionary of the reference genome. Set to None for multiprocessing.
@param genome_pipe: A pipe to ask for genome reference information for multiprocessing.
@return A tuple object for writing a record, the number of multireads uniquified, and the alignment direction information.
"""
ambiguous_uniquified = 0
direction_id = None
if empty: #An unmapped SAM record will be written if the FASTQ record could not be found in the SAM file. This shouldn't happen, but it probably will.
write_record = (Constants.SAM_WRITE_UNMAPPED, [[BisPin_util.changeFastqRecordToSAMRecord(record1, seq_id = alignement_key)]])
if record2 != None:
write_pair = BisPin_util.changeFastqRecordToSAMRecord(record2, seq_id = alignement_key)
write_record[1][0].append(write_pair)
else:
#Choose the SAM records with the highest alignment scores.
for key in meilleur_alignement_dictionnaire:
meilleur_alignement_dictionnaire[key] = choisirAlignement(meilleur_alignement_dictionnaire[key])
meilleur_alignements = choisirEntreDictionnaires(meilleur_alignement_dictionnaire,
record1, record2, alignement_key,
cutoff_bs, cutoff_r, input_protocol)
if meilleur_alignements[0] == Constants.SAM_WRITE_ALIGNED:
if rescore_matrix[0] != None:
len_ma1 = len(meilleur_alignements[1])
#Process the best SAM records adding the methylation information, updating the FLAG, etc.
meilleur_alignements = traiterAlignement(meilleur_alignements, record,
input_protocol, layout, methylation_counter,
rescore_matrix, genome_dictionary, genome_pipe = genome_pipe)
len_ma2 = len(meilleur_alignements)
if rescore_matrix[0] != None and len_ma2 == 1 and len_ma1 > 1: #Detects if a multiread was uniquified.
ambiguous_uniquified += 1
if len_ma2 > 1: #What about hairpin?
write_record = (Constants.SAM_WRITE_AMBIGUOUS, meilleur_alignements)
else:
read_direction = ""
for record in meilleur_alignements[0]:
read_direction += "_" + record[Constants.SAM_KEY_READ]
genome_direction = meilleur_alignements[0][0][Constants.SAM_KEY_GENOME]
direction_id = read_direction[1:] + "_" + genome_direction
write_record = (Constants.SAM_WRITE_UNIQUE, meilleur_alignements)
elif meilleur_alignements[0] == Constants.SAM_WRITE_UNMAPPED:
write_record = (Constants.SAM_WRITE_UNMAPPED, meilleur_alignements[1])
elif meilleur_alignements[0] == Constants.SAM_WRITE_FILTERED:
for paired_record in meilleur_alignements[1]:
for single_record in paired_record:
for key in single_record.keys():
if not key in Constants.SAM_KEYS_TO_KEEP:
del single_record[key]
write_record = (Constants.SAM_WRITE_FILTERED, meilleur_alignements[1])
return (write_record, ambiguous_uniquified, direction_id)
def makeSamFileTypeDictionary(sam_file_list):
"""
Create a dictionary of SAM file iterator objects. The SAM file must follow a specific naming convention.
@param sam_file_list: A list of strings representing the location of raw BFAST SAM files.
@return: A dictionary of the SAM file iterator objects and a dictionary giving the conversion information for the SAM files.
"""
filetypes = Constants.CONV_TYPES
sam_iterator_dict = {}
sam_conversion_dict = {}
onlyfiles = sam_file_list
for a_file in onlyfiles:
split_a_file = a_file.split(".")
for i in range(len(split_a_file)-1, -1, -1):
filetypes_equal = filter((lambda x: x == split_a_file[i]), filetypes)
if filetypes_equal:
s_type = filetypes_equal[0]
sam_iterator_dict[s_type] = SeqIterator.SeqIterator(a_file, file_type=Constants.SAM)
sam_conversion_dict[s_type] = s_type.split("_")
break
return (sam_iterator_dict, sam_conversion_dict)
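The naming convention parsed above expects the conversion type to appear as a dot-separated component of the SAM file name, scanned right to left. A standalone sketch follows; the conversion type strings in the default argument are assumptions for illustration, since the real set comes from Constants.CONV_TYPES.

```python
# Illustrative re-implementation of the filename scan above. The conversion
# type strings are assumed values for the example; the real set comes from
# Constants.CONV_TYPES in the Utilities package.
def find_conversion_type(filename, conv_types=("CT_CT", "CT_GA", "GA_CT", "GA_GA")):
    parts = filename.split(".")
    for part in reversed(parts):  # scan components right to left
        if part in conv_types:
            return part
    return None
```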
def choisirAlignement(brut_alignements):
"""
Chooses the highest scoring alignment(s) from a list of SAM records. Returns a list of lists of grouped SAM records.
This is the format used throughout the post processing module to represent SAM records after this function has been called.
@param brut_alignements: a list of SAM records from the same SAM file. (Each SAM record is a dictionary)
@return: a tuple of the best value and an iterable of lists of SAM records that represent the best alignment(s)
Each list of SAM records in the iterable can either be a 2-list for paired end data or a 1-list for single end data.
e.g. [[pair1, pair2], [single1]]
If the alignments are all unmapped, then return None
"""
unmapped_bits = map((lambda record: isUnmapped(record[Constants.SAM_KEY_FLAG])),
brut_alignements)
all_unmapped = reduce((lambda x, y: x and y), unmapped_bits)
if all_unmapped: #If all the records are unmapped, return None
return None
else:
grouped_records = []
firstSegments = []
lastSegments = []
#Search through each record and group them according to whether they are paired or not.
for record in brut_alignements:
flag = record[Constants.SAM_KEY_FLAG]
if hasMultipleSegments(flag) and isFirstSegment(flag):
firstSegments.append(record)
elif hasMultipleSegments(flag) and isLastSegment(flag):
lastSegments.append(record)
else:
grouped_records.append([record])
grouped_records += [list(group) for group in zip(firstSegments, lastSegments)]
valeurs_et_records = []
#Get the average alignment score for each group.
for group in grouped_records:
group_sum = reduce(lambda x, y: x + float(y[Constants.SAM_KEY_ALIGNMENT_SCORE]), group, 0.0)
group_avg = group_sum / (len(group) + 0.0)
valeurs_et_records.append((group_avg, group))
#Find the maximum alignment score and return only those grouped records.
valeur_max = max(valeurs_et_records, key = (lambda x : x[0]))[0]
valeurs_et_records_max = filter((lambda x: x[0] == valeur_max), valeurs_et_records)
retvalue = (valeur_max, map((lambda x: x[1]), valeurs_et_records_max))
return retvalue
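The selection logic above can be sketched in isolation: records are grouped (mate pairs together, singles alone), each group is scored with the mean of its alignment scores, and every group tied for the best mean is kept. The records here are plain dicts with a hypothetical `"score"` field standing in for `Constants.SAM_KEY_ALIGNMENT_SCORE`.

```python
# Minimal sketch of choisirAlignement's scoring step: average each
# group's alignment scores and keep all groups tied for the maximum.

def best_groups(groups):
    scored = [(sum(r["score"] for r in g) / float(len(g)), g) for g in groups]
    best = max(s for s, _ in scored)
    return best, [g for s, g in scored if s == best]

groups = [[{"score": 10.0}, {"score": 20.0}],   # mate pair, mean 15.0
          [{"score": 15.0}],                    # single end, mean 15.0
          [{"score": 5.0}]]                     # single end, mean 5.0
```

Ties are deliberately kept: they are what makes a read a multiread, which the rescoring step later tries to break.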
def isUnmapped(flag):
"""Checks if the SAM flag indicates an unmapped read."""
return ((int(flag) >> 2) % 2) == 1
def isReversed(flag):
"""Checks if the SAM flag indicates a reversed read."""
return ((int(flag) >> 4) % 2) == 1
def setFilterBit(flag):
"""Sets the filter bit for a SAM flag."""
return int(flag) | 512
def hasMultipleSegments(flag):
"""Checks if the SAM flag indicates multiple segments."""
return (int(flag) % 2) == 1
def isFirstSegment(flag):
"""Checks if the SAM flag indicates the first segment."""
return ((int(flag) >> 6) % 2) == 1
def isLastSegment(flag):
"""Checks if the SAM flag indicates the last segment."""
return ((int(flag) >> 7) % 2) == 1
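The shift-and-mod tests above check the standard SAM FLAG bits (0x1 multiple segments, 0x4 unmapped, 0x10 reverse-complemented, 0x40 first segment, 0x80 last segment, 0x200 QC-fail). An equivalent standalone sketch using bit masks directly:

```python
# Standalone sketch of the SAM FLAG bit tests above, written with
# explicit masks from the SAM specification instead of shift-and-mod.

def is_unmapped(flag):
    return bool(int(flag) & 0x4)

def is_reversed(flag):
    return bool(int(flag) & 0x10)

def has_multiple_segments(flag):
    return bool(int(flag) & 0x1)

def is_first_segment(flag):
    return bool(int(flag) & 0x40)

def is_last_segment(flag):
    return bool(int(flag) & 0x80)

def set_filter_bit(flag):
    # 0x200 == 512, the "not passing filters" bit.
    return int(flag) | 0x200

# FLAG 99  = paired + proper pair + mate reversed + first segment.
# FLAG 147 = paired + proper pair + reversed + last segment.
```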
def choisirEntreDictionnaires(meilleur_alignement_dictionnaire, fastq_record1, fastq_record2,
alignement_key, cutoff_bs, cutoff_r, input_protocol):
"""
Chooses the best scoring SAM records from different SAM files representing different conversion protocols
@param meilleur_alignement_dictionnaire: A dictionary where the key represents a SAM file and the value holds the best SAM records
for that key corresponding to one FASTQ record.
@param fastq_record1: The first FASTQ record.
@param fastq_record2: None for single end data. The second FASTQ record for paired end data.
@param alignement_key: The sequence id for the FASTQ record. This should be identical to the SAM records.
@param cutoff_bs: A float for filtering low quality reads.
@param cutoff_r: A float for filtering low quality hairpin recovered reads.
@param input_protocol: A number from Constants indicating the input protocol.
@return: a tuple where the first position indicates the type of read and the second
position is an iterable of SAM records. (Each SAM record is a dictionary)
"""
alignement_liste = [meilleur_alignement_dictionnaire[key]
for key in meilleur_alignement_dictionnaire
if meilleur_alignement_dictionnaire[key] != None]
if alignement_liste == []: #If there are no mapped alignments, then the reads are unmapped.
grouped_record = [BisPin_util.changeFastqRecordToSAMRecord(fastq_record1, seq_id = alignement_key)]
if fastq_record2 != None:
grouped_record.append(BisPin_util.changeFastqRecordToSAMRecord(fastq_record2, seq_id = alignement_key))
write_record = (Constants.SAM_WRITE_UNMAPPED, [grouped_record])
return write_record
#Find the maximum score for the records.
len_seq = float(len(fastq_record1[1]))
if fastq_record2 != None:
len_seq += len(fastq_record2[1])
len_seq /= 2.0
valeur_max = max(alignement_liste, key=(lambda x: x[0]))[0]
meilleur_alignement_liste = []
records_added = 0
conversion_type = None
#Choose records that have the highest alignment score.
for m in alignement_liste:
if m[0] == valeur_max:
if conversion_type == None or conversion_type == Constants.CONV_R:
conversion_type = m[1][0][0][Constants.SAM_KEY_READ]
meilleur_alignement_liste += m[1]
records_added += 1
#Report these SAM records as filtered if the maximum alignment score is too low. Otherwise report aligned.
if (conversion_type == Constants.CONV_R
and float(valeur_max) / (len_seq + 0.0) <= cutoff_r) or (
conversion_type != Constants.CONV_R and
float(valeur_max) / (len_seq + 0.0) <= cutoff_bs):
return (Constants.SAM_WRITE_FILTERED, changerFiltre(meilleur_alignement_liste))
else:
return (Constants.SAM_WRITE_ALIGNED, meilleur_alignement_liste)
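The cutoff test above normalizes the best alignment score by the (mean) read length and compares it to a per-protocol cutoff, with hairpin-recovered reads getting their own threshold. A hedged sketch of that decision, with invented cutoff defaults:

```python
# Sketch of the filtering decision in choisirEntreDictionnaires: the
# best score per base is compared to a cutoff; alignments at or below
# it are reported as filtered. The cutoff values are made up here.

def classify(best_score, read_len, is_recovered, cutoff_bs=0.0, cutoff_r=0.5):
    cutoff = cutoff_r if is_recovered else cutoff_bs
    if float(best_score) / read_len <= cutoff:
        return "FILTERED"
    return "ALIGNED"
```

For paired end data the module uses the mean of the two read lengths as `read_len`, so the normalization stays comparable across layouts.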
def changerFiltre(sam_record_liste):
"""
Transforms the SAM records in the input list into filtered (low quality alignment) SAM records
This is done by setting a bit in the FLAG field.
@param sam_record_liste: a list of lists of grouped SAM records. (Each SAM record is a dictionary)
@return: a list of lists of grouped SAM records that represent a filtered set of records
"""
for record in sam_record_liste:
for subrecord in record:
subrecord[Constants.SAM_KEY_FLAG] = str(int(subrecord[Constants.SAM_KEY_FLAG])
| int(Constants.SAM_VALUE_FILTERED))
return sam_record_liste
def traiterAlignement(meilleur_alignement, fastq_record_group, input_protocol, layout, methylation_count,
rescore_matrix, genome_dictionary, genome_pipe = None):
"""
Processes alignments corresponding to one FASTQ record by modifying the MD tag, updating the flag,
computing the methylation calling string, and other things.
@param meilleur_alignement: SAM alignment records to be processed that correspond to the same FASTQ record
@param fastq_record_group: A list of FASTQ records. There are two records for paired end data.
@param input_protocol: A number from Constants indicating the input protocol.
@param layout: A number from Constants indicating the type of layout.
@param methylation_count: A dictionary for counting methylation context information.
@param rescore_matrix: A 2-tuple of dictionaries representing the rescoring function for multireads. If both elements are set to None, no rescoring will be done.
@param genome_dictionary: A dictionary of the reference genome. Set to None for multiprocessing.
@param genome_pipe: A pipe to ask for genome reference information for multiprocessing.
@return: the processed SAM records as a list of lists of grouped SAM dictionary records
"""
len_meilleur_alignement = len(meilleur_alignement[1])
is_ambiguous = (len_meilleur_alignement > 1)
#Look at each grouped SAM record
for paired_record in meilleur_alignement[1]:
#Look at the SAM record and its corresponding FASTQ record from the input
for subrecord, fastq_record in zip(paired_record, fastq_record_group):
#Delete unnecessary fields.
for key in list(subrecord.keys()):
if not key in Constants.SAM_KEYS_TO_KEEP:
del subrecord[key]
#Extract the CIGAR string and the MD tag for computing and modifying the alignment strings
token_cigar = BisPin_util.tokenizeCigar(subrecord[Constants.SAM_KEY_CIGAR])
token_md = BisPin_util.tokenizeMDtag(subrecord[Constants.SAM_KEY_MD])
read_conversion = subrecord[Constants.SAM_KEY_READ]
genome_conversion = subrecord[Constants.SAM_KEY_GENOME]
#There is a bug in BFAST's SAM flag that reports the incorrect orientation
#for the second segment for paired end reads.
#These flags are for the GA converted genome
#115 should be 83 RC
#179 should be 147 not RC
#These flags are for the CT converted genome
#65 should be 99 not RC
#129 should be 147 RC
# if isReversed(subrecord[Constants.SAM_KEY_FLAG]):
# direction = Constants.CONV_REVERSE
if read_conversion != genome_conversion:
direction = Constants.CONV_REVERSE
else:
direction = Constants.CONV_FORWARD
fastq_seq = fastq_record[1]
len_fastq_record = len(fastq_seq)
try: #Get the reference sequence and little extra (cdr) to determine context.
reference_sequence, ref_cdr = getReferenceSequence(subrecord[Constants.SAM_KEY_RNAME].split()[0],
int(subrecord[Constants.SAM_KEY_POS]),
len_fastq_record, token_cigar,
read_conversion, genome_conversion, direction,
genome_dictionary, genome_pipe = genome_pipe)
except TypeError:
sys.stderr.write(logstr + "A problem occurred getting the reference sequence\n")
sys.stderr.write(logstr + "fastq_record\t%s\n" % str(fastq_record))
raise
if direction == Constants.CONV_REVERSE:
fastq_seq = BisPin_util.reverseComplement(fastq_seq)
#Compute the mismatch, indel, and deletion list (MI_string) and the new MD tag and the hamming distance.
new_md, MI_string, hamming_distance = makeMismatchString(fastq_seq,
subrecord,
reference_sequence,
token_cigar,
token_md)
#Rescore the alignment with the rescore_matrix
if is_ambiguous and rescore_matrix[0] != None:
subrecord[Constants.SAM_KEY_RESCORE] = str(rescoreAlignment(
fastq_seq, MI_string, direction,
rescore_matrix, read_conversion, genome_conversion))
#Create the methylation calling string.
if read_conversion == Constants.CONV_R:
subrecord[Constants.SAM_KEY_RECOVERED] = subrecord[Constants.SAM_KEY_SEQ]
(methylation_CT, methylation_GA) = hairpinRecoveredMethylationCall(fastq_record_group[0][1],
fastq_record_group[1][1],
subrecord[Constants.SAM_KEY_RECOVERED])
subrecord[Constants.SAM_KEY_METHYLATION] = methylation_CT
subrecord[Constants.SAM_KEY_METHYLATION_GA] = methylation_GA
else:
methyl_string = makeMethylationCallString(fastq_seq,
ref_cdr,
MI_string,
direction,
read_conversion)
subrecord[Constants.SAM_KEY_METHYLATION] = methyl_string
subrecord[Constants.SAM_KEY_MD] = new_md
subrecord[Constants.SAM_KEY_SEQ] = fastq_seq
subrecord[Constants.SAM_KEY_PROGRAM] = Constants.SAM_VALUE_PROGRAM
subrecord[Constants.SAM_KEY_DISTANCE] = str(hamming_distance)
if is_ambiguous and rescore_matrix[0] != None: # Select the best alignment for multireads.
valeurs_et_records = map(calculateScore, meilleur_alignement[1])
valeur_max = max(valeurs_et_records, key = (lambda x : x[0]))[0]
valeurs_et_records_max = filter((lambda x: x[0] == valeur_max), valeurs_et_records)
meilleur_alignement_choice = map((lambda x: x[1]), valeurs_et_records_max)
else:
meilleur_alignement_choice = meilleur_alignement[1]
if is_ambiguous or rescore_matrix[0] != None: #Adjust the bit on the flag to indicate the presence of secondary alignments
primary_locations = [i for i in range(len(meilleur_alignement_choice))
if all(((Constants.SAM_VALUE_SECONDARY &
int(single_record[Constants.SAM_KEY_FLAG])) >> 8) == 0
for single_record in meilleur_alignement_choice[i])]
num_primaries = len(primary_locations)
if num_primaries == 0: # No primary sequences. Set a sequence to be the primary one.
for subrecord in meilleur_alignement_choice[0]:
subrecord[Constants.SAM_KEY_FLAG] = str(int(subrecord[Constants.SAM_KEY_FLAG]) ^ Constants.SAM_VALUE_SECONDARY)
elif num_primaries > 1: # Change other sequences to be secondary sequences.
for i in primary_locations[1:num_primaries]:
for subrecord in meilleur_alignement_choice[i]:
subrecord[Constants.SAM_KEY_FLAG] = str(int(subrecord[Constants.SAM_KEY_FLAG]) | Constants.SAM_VALUE_SECONDARY)
if meilleur_alignement[0] == Constants.SAM_WRITE_ALIGNED: #Count the methylation marks
for alignement_record in meilleur_alignement_choice:
for subrecord in alignement_record:
for char in subrecord[Constants.SAM_KEY_METHYLATION]:
methylation_count[char] = methylation_count.get(char, 0) + 1
return meilleur_alignement_choice
def calculateScore(record, key = Constants.SAM_KEY_RESCORE):
"""
Return the average alignment score for grouped SAM dictionary records.
@param record: A grouping (list) of SAM dictionary records
@param key: Governs which alignment score to use. Defaults to the rescored score.
@return: The average alignment score for grouped SAM records.
"""
if isinstance(record, list):
score = sum(map(lambda x: float(x[key]), record))
score /= (len(record) + 0.0)
return (score, record)
else:
return (float(record[key]), record)
def rescoreAlignment(fastq_seq, MI_string, direction, score_matrix, read_conversion, genome_conversion):
""""
Rescores an alignment for uniquifying multireads
@param fastq_seq: The FASTQ sequence
@param MI_string: The MI_string which represents the alignment. Includes mismatches and indels.
@param direction: Whether the alignment is forward or reverse.
@param score_matrix: A 2-tuple of dictionaries representing the rescoring matrix.
@param read_conversion: (Not used.)
@param genome_conversion: A string representing the genome conversion for the alignment.
@return: A new score for the alignment based on the selected scoring matrix
"""
if score_matrix[1] == None or genome_conversion == Constants.CONV_CT:
my_score_matrix = score_matrix[0]
else:
my_score_matrix = score_matrix[1]
score = sum( #Add the contributions of deletions.
map((lambda x: my_score_matrix[Constants.OPEN_DEL] + my_score_matrix[Constants.EXT_DEL]*(len(x) - 1)) ,
filter((lambda x: x.startswith("^")), MI_string)))
#Filter out the deletions so that insertions and mismatches can be counted
MI_string = filter((lambda x: not x.startswith("^")), MI_string)
openInsert = False
for i in range(len(MI_string)):
MI_char = MI_string[i].upper()
if MI_char == Constants.CIGAR_I: #Calculate the gap penalty.
score += my_score_matrix[Constants.EXT_INS]
if not openInsert:
openInsert = True
score += my_score_matrix[Constants.OPEN_INS]
else: #Calculate the match or mismatch score
openInsert = False
if MI_char == Constants.CIGAR_M: #Match
score += my_score_matrix[fastq_seq[i].upper()][fastq_seq[i].upper()]
elif genome_conversion == Constants.CONV_GA and MI_char == 'G' and fastq_seq[i].upper() == 'A': #GA methylation
score += my_score_matrix["GA"]
elif genome_conversion == Constants.CONV_CT and MI_char == 'C' and fastq_seq[i].upper() == 'T': #CT methylation
score += my_score_matrix["CT"]
else: #Mismatch
try:
score += my_score_matrix[MI_char][fastq_seq[i].upper()]
except KeyError:
sys.stderr.write("%sThe rescoring function did not recognize a character at %d. Using 'N' instead. MI_string: %s fastq_seq: %s\n" % (logstr, i, str(MI_string), fastq_seq))
score += my_score_matrix["N"][fastq_seq[i].upper()]
return score
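The affine-gap rescoring above walks the MI string in parallel with the read: deletions pay an open penalty plus an extension penalty per deleted base, insertions pay an open penalty only when they start a run, and matches, bisulfite conversions, and mismatches are scored from a substitution matrix. A toy sketch with invented score values (the real matrix comes from `rescore_matrix`):

```python
# Toy sketch of the affine rescoring above. Each MI token marks a read
# position as a match ("M"), an insertion ("I"), a mismatch (the
# reference base), or a deletion ("^" plus the deleted bases, which
# consumes no read position). All score values here are illustrative.

MATCH, MISMATCH = 1.0, -1.0
OPEN_INS, EXT_INS = -2.0, -0.5
OPEN_DEL, EXT_DEL = -2.0, -0.5
BISULFITE_MATCH = 0.5  # reference C read as T on a CT-converted genome

def rescore(read, mi_tokens, genome_conversion="CT"):
    score = 0.0
    open_insert = False
    pos = 0
    for tok in mi_tokens:
        if tok.startswith("^"):
            # Open once, then extend per token character past the "^",
            # mirroring OPEN_DEL + EXT_DEL * (len(tok) - 1) above.
            score += OPEN_DEL + EXT_DEL * (len(tok) - 1)
            continue
        if tok == "I":
            score += EXT_INS if open_insert else (OPEN_INS + EXT_INS)
            open_insert = True
        else:
            open_insert = False
            if tok == "M":
                score += MATCH
            elif genome_conversion == "CT" and tok == "C" and read[pos] == "T":
                score += BISULFITE_MATCH
            else:
                score += MISMATCH
        pos += 1
    return score
```

Treating a C-to-T (or G-to-A) difference as a partial match rather than a mismatch is what lets the rescoring distinguish bisulfite conversion from genuine sequencing error when breaking multiread ties.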
def getAlignmentDirection(read_direction, genome_direction):
"""
Returns whether the alignment should be a forward or a reverse alignment.
This function uses string constants found in the Constants module.
@param read_direction: The conversion state of the reads.
@param genome_direction: The conversion state of the reference genome.
@return: A string representing the alignment direction
"""
if (read_direction == Constants.CONV_CT and genome_direction == Constants.CONV_CT) or \
(read_direction == Constants.CONV_GA and genome_direction == Constants.CONV_GA) or \
(read_direction == Constants.CONV_R):
return Constants.CONV_FORWARD
else:
return Constants.CONV_REVERSE
def getReferenceSequence(key, position, seq_len, token_cigar, read_conversion, genome_conversion,
direction, genome_dictionary, genome_pipe = None):
"""
Gets the reference sequence for a SAM record. This is so that the MD tag can be updated.
@param key: A key for the sequence (the id for the fasta file record)
@param position: The 1-based starting position of the alignment within the sequence stored
at the key location in the genome dictionary
@param seq_len: The length of the sequence in the SAM record
@param token_cigar: The tokenized cigar string from the unprocessed SAM record
@param read_conversion: A string representing the conversion state of the read
@param genome_conversion: A string representing the conversion state of the reference genome
@param direction: A string determining whether the alignment corresponding to the requested
reference sequence is forward or reverse.
@param genome_dictionary: A dictionary representing the reference fasta file
@param genome_pipe: A pipe to send and receive reference sequence information when access
to the genome is controlled by another process.
@return: A 2-tuple of the reference sequence that corresponds to the SAM record and a
two-base context string used for methylation context calling.
"""
position = int(position)
len_insertions = 0
len_deletions = 0
for token in token_cigar:
if token[1] == Constants.CIGAR_D:
len_deletions += token[0]
elif token[1] == Constants.CIGAR_I:
len_insertions += token[0]
reference_sequence_length = seq_len - len_insertions + len_deletions
position -= 1
if genome_pipe == None:
ref_string_car = genome_dictionary[key][position: position + reference_sequence_length]
if (read_conversion == Constants.CONV_CT and genome_conversion == Constants.CONV_CT) or \
(read_conversion == Constants.CONV_GA and genome_conversion == Constants.CONV_CT) or \
(read_conversion == Constants.CONV_R and direction == Constants.CONV_FORWARD): #Get the upstream cdr
if genome_pipe != None:
genome_pipe.send((key, position, position + reference_sequence_length + 2))
ref_string = genome_pipe.recv()
ref_string_car = ref_string[0:reference_sequence_length]
ref_string_cdr = ref_string[reference_sequence_length:reference_sequence_length + 2]
else:
ref_string_cdr = genome_dictionary[key][position + reference_sequence_length : position + reference_sequence_length + 2]
if len(ref_string_cdr) < 2: #The upstream cdr is near the end of the reference sequence.
ref_string_cdr = 'NN'
elif position >= 2 and ((read_conversion == Constants.CONV_CT and genome_conversion == Constants.CONV_GA) or \
(read_conversion == Constants.CONV_GA and genome_conversion == Constants.CONV_GA) or \
(read_conversion == Constants.CONV_R and direction == Constants.CONV_REVERSE)): #Get the downstream cdr
if genome_pipe != None:
genome_pipe.send((key, position - 2, position + reference_sequence_length))
ref_string = genome_pipe.recv()
ref_string_car = ref_string[2 : 2 + reference_sequence_length]
ref_string_cdr = ref_string[0 : 2]
else:
ref_string_cdr = genome_dictionary[key][position - 2 : position]
ref_string_cdr = ref_string_cdr[::-1]
else: # The downstream cdr is at the beginning of the reference sequence.
if genome_pipe != None:
genome_pipe.send((key, position, position + reference_sequence_length))
ref_string_car = genome_pipe.recv()
ref_string_cdr = 'NN'
return (ref_string_car.upper(), ref_string_cdr.upper())
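The span arithmetic above is: the reference footprint of an alignment equals the read length minus inserted bases plus deleted bases, and two extra reference bases are kept beside the footprint so the methylation context can be determined later. A simplified sketch of that slicing (upstream/downstream context only, without the pipe path):

```python
# Sketch of getReferenceSequence's arithmetic. token_cigar mirrors the
# tokenized CIGAR used above: a list of (length, op) pairs.

def reference_span(read_len, token_cigar):
    insertions = sum(n for n, op in token_cigar if op == "I")
    deletions = sum(n for n, op in token_cigar if op == "D")
    return read_len - insertions + deletions

def slice_with_context(genome, pos, read_len, token_cigar, upstream=True):
    """pos is 1-based, as in a SAM record; returns (footprint, 2-base context)."""
    start = pos - 1
    span = reference_span(read_len, token_cigar)
    car = genome[start:start + span]
    if upstream:
        cdr = genome[start + span:start + span + 2]
    else:
        # Downstream context is reversed, as in the function above.
        cdr = genome[max(start - 2, 0):start][::-1]
    return car.upper(), (cdr.upper() if len(cdr) == 2 else "NN")
```

When the context would run off either end of the contig, `'NN'` is returned, matching the fallback in the function above.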
def makeReferenceGenomeDictionary(reference_genome_file_location):
"""
Makes a dictionary for the reference genome indexed by contig id.
@param reference_genome_file_location: A string representing the location of the reference genome file.
@return: a dictionary object indexed by contig id where the value is the reference sequence,
A list of ids (keys) and the length of each contig, and the time it took to load the genome.
"""
now = datetime.datetime.now()
genome_iterator = SeqIterator.SeqIterator(reference_genome_file_location, file_type=Constants.FASTA)
id_length_list = []
genome_dictionary = {}
for record in genome_iterator:
key = record[0].split()[0]
genome_dictionary[key] = record[1]
id_length_list.append((key, str(len(record[1]))))
later = datetime.datetime.now()
return (genome_dictionary, id_length_list, later - now)
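The dictionary construction above keys each contig by the first whitespace-separated token of its FASTA header. A minimal sketch using a plain parser in place of `SeqIterator` (which is assumed to yield `(header, sequence)` records):

```python
# Minimal sketch of the genome-dictionary construction above, with a
# plain FASTA parser standing in for SeqIterator.

def fasta_to_dict(lines):
    genome, key, chunks = {}, None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if key is not None:
                genome[key] = "".join(chunks)
            # Key by the first whitespace-separated header token.
            key, chunks = line[1:].split()[0], []
        elif line:
            chunks.append(line)
    if key is not None:
        genome[key] = "".join(chunks)
    return genome

fasta = [">chr1 test contig", "ACGT", "ACGT", ">chr2", "TTTT"]
```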
def checkDeletionInsertionProblem(token_cigar, token_md):
"""
This function checks that the length of any deletion in the token MD list is the
same as the length of the deletion in the CIGAR list. The MD list is corrected if
there is a discrepancy. This works around a bug encountered in BFAST-Gap.
@param token_cigar: The token list representing the CIGAR string
@param token_md: The token list representing the MD tag
@return: A token cigar list followed by a corrected token md list