Why do we use hard-clipping when calculating the mate's start and end? #1030

Open
@nh13

Should we be including hard-clipping when calculating the mate's start and end in our consensus calling tools? I see two places where we do (or will do):

  1. GroupReadsByUmi
  2. In "Do not trim reads when both ends are clipped in consensus calling" #1026 (see this discussion with @clintval)

I assume the reason for adjusting the start and end based on soft-clipping is that those bases are still present in the read and could have been aligned, and may actually be aligned in the mate, which can happen with short inserts. But why do we also adjust based on hard-clipping? Those bases have been removed from the record. Perhaps it matters when only one read of a pair is hard-clipped at a given end of the molecule? Something else?
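To make the question concrete, here is a minimal sketch in plain Python (not fgbio or htsjdk code) of how an unclipped start and end can be derived from a 1-based alignment start and a CIGAR string, with a switch for whether hard clips are counted. The function names and the `include_hard` flag are hypothetical and exist only for illustration.

```python
import re

# CIGAR operators that consume reference bases.
_REF_CONSUMING = set("MDN=X")


def parse_cigar(cigar):
    """Split a CIGAR string such as '5H3S50M' into (length, op) pairs."""
    return [(int(n), op) for n, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar)]


def _edge_clip(elems, include_hard):
    """Sum clipping at the front of `elems`, stopping at the first non-clip op.

    Soft clips are always counted. Hard clips are counted only when
    `include_hard` is True; otherwise they are skipped over.
    """
    total = 0
    for length, op in elems:
        if op == "S" or (op == "H" and include_hard):
            total += length
        elif op != "H":
            break
    return total


def unclipped_start(start, cigar, include_hard=True):
    """1-based start after extending left over the leading clips."""
    return start - _edge_clip(parse_cigar(cigar), include_hard)


def unclipped_end(start, cigar, include_hard=True):
    """1-based inclusive end after extending right over the trailing clips."""
    elems = parse_cigar(cigar)
    ref_len = sum(n for n, op in elems if op in _REF_CONSUMING)
    aligned_end = start + ref_len - 1
    return aligned_end + _edge_clip(list(reversed(elems)), include_hard)


# A read aligned at position 100 with clipping on both ends.
cigar = "5H3S50M2S4H"
print(unclipped_start(100, cigar, include_hard=True))   # soft + hard
print(unclipped_start(100, cigar, include_hard=False))  # soft only
print(unclipped_end(100, cigar, include_hard=True))     # soft + hard
print(unclipped_end(100, cigar, include_hard=False))    # soft only
```

With the example CIGAR the two conventions disagree by the hard-clip length at each end (5 bases on the left, 4 on the right), which is exactly the difference this issue asks about.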

Note: there are other places we adjust based on hard-clipping as well:

  1. https://github.com/fulcrumgenomics/fgbio/blob/f93fdfbb427da5f5d60304de5761b19ea8209b33/src/main/scala/com/fulcrumgenomics/util/AmpliconDetector.scala#L177C15-L181
  2. val matePos = if (rec.unpaired || rec.mateUnmapped) Int.MaxValue else if (mateNeg) SAMUtils.getMateUnclippedEnd(rec.asSam) else SAMUtils.getMateUnclippedStart(rec.asSam)
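The mate-position expression in item 2 can be mirrored in a standalone Python sketch. It assumes the mate's CIGAR is available as a string (as it would be from the `MC` tag) and, as far as I understand the htsjdk helpers, that both soft and hard clips are counted. `mate_unclipped_position` and its parameters are hypothetical names, and `sys.maxsize` stands in for `Int.MaxValue`.

```python
import re
import sys


def mate_unclipped_position(mate_start, mate_cigar, mate_negative,
                            paired=True, mate_mapped=True):
    """Unclipped 5' coordinate of the mate, counting soft AND hard clips.

    Assumption: `mate_start` is the mate's 1-based alignment start and
    `mate_cigar` is the mate's CIGAR string.
    """
    if not paired or not mate_mapped:
        return sys.maxsize  # sentinel, analogous to Int.MaxValue

    elems = [(int(n), op) for n, op in re.findall(r"(\d+)([MIDNSHP=X])", mate_cigar)]

    if mate_negative:
        # Negative strand: use the unclipped end.
        ref_len = sum(n for n, op in elems if op in "MDN=X")
        trailing = 0
        for n, op in reversed(elems):
            if op not in "SH":
                break
            trailing += n
        return mate_start + ref_len - 1 + trailing

    # Positive strand: use the unclipped start.
    leading = 0
    for n, op in elems:
        if op not in "SH":
            break
        leading += n
    return mate_start - leading


print(mate_unclipped_position(100, "5H3S50M2S4H", mate_negative=False))
print(mate_unclipped_position(100, "5H3S50M2S4H", mate_negative=True))
```

Dropping `H` from the two `"SH"` checks would give the soft-clip-only variant, which is the alternative being raised here.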

And of course, there are a number of other places that use the unclipped start and unclipped end for the current record. I think examining those is worthwhile, but we should focus on the consensus calling tools first.
