Commit a54050d

Refactor io (#59)
* removed dead code in io.go.
* merged Sequence structure into AnnotatedSequence and renamed AnnotatedSequence Sequence.
* added new mvp fasta io.
* rewrote FASTA IO to be sturdier.
* renamed feature.ParentAnnotatedSequence to feature.ParentSequence.
* removed whitespace constants and replaced with whitespace function.
* made .AddFeature() method public.
* added comment to .AddFeature() method.
* fixed comments in FASTA IO.
1 parent 7304bf6 commit a54050d
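
The struct merge described in the commit message is easier to see in code. Below is a minimal sketch, not poly's actual source: `Locus`, `Meta`, and `Feature` are trimmed stand-ins, `leastRotation` is a naive substitute assumed to play the role of `RotateSequence`, and `hashWithSHA256` is a hypothetical convenience wrapper. The point is only that the raw bases now live directly on the top-level struct as a string, so callers write `sequence.Sequence` rather than `annotatedSequence.Sequence.Sequence`.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"hash"
)

// Locus and Meta are trimmed stand-ins; only the Circular flag is modelled.
type Locus struct {
	Circular bool
}

type Meta struct {
	Name  string
	Locus Locus
}

// Feature is a trimmed stand-in with just a name and a type.
type Feature struct {
	Name string
	Type string
}

// Sequence sketches the merged struct: the bases are a plain string field
// on the top-level struct instead of a nested sub-struct.
type Sequence struct {
	Meta     Meta
	Features []Feature
	Sequence string
}

// leastRotation is a naive O(n^2) stand-in (an assumption, not poly's code):
// it returns the lexicographically smallest rotation of s.
func leastRotation(s string) string {
	best := s
	for i := 1; i < len(s); i++ {
		if r := s[i:] + s[:i]; r < best {
			best = r
		}
	}
	return best
}

// Hash mirrors the shape of the refactored method: circular sequences are
// rotated to a canonical form before hashing, linear ones are hashed as-is.
func (sequence Sequence) Hash(h hash.Hash) string {
	bases := sequence.Sequence
	if sequence.Meta.Locus.Circular {
		bases = leastRotation(bases)
	}
	h.Reset()
	h.Write([]byte(bases))
	return hex.EncodeToString(h.Sum(nil))
}

// hashWithSHA256 is a hypothetical wrapper that fixes the hash function.
func hashWithSHA256(sequence Sequence) string {
	return sequence.Hash(sha256.New())
}

func main() {
	circular := Meta{Locus: Locus{Circular: true}}
	a := Sequence{Meta: circular, Sequence: "GATTACA"}
	b := Sequence{Meta: circular, Sequence: "TACAGAT"} // same plasmid, different start
	fmt.Println(hashWithSHA256(a) == hashWithSHA256(b))
}
```

Because both values are marked circular and one is a rotation of the other, they hash identically in this sketch; with `Circular` set to false the two strings would be hashed as written and the digests would differ.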

12 files changed: +334 -374 lines changed

docs/library-hashing.md

Lines changed: 4 additions & 4 deletions
@@ -12,7 +12,7 @@ Hashes make incredibly powerful unique identifiers and with a wide array of hash
 The golang team is currently figuring out the best way to implement blake3 into the standard library but in the meantime `poly` provides this special function and method wrapper to hash sequences using blake3. This will eventually be deprecated in favor of only using the `GenericSequenceHash()` function and `.Hash()` method wrapper.

 ```go
-// getting our example AnnotatedSequence struct
+// getting our example Sequence struct
 puc19AnnotatedSequence := ReadJSON("data/puc19static.json")

 // there are two ways to use the blake3 Least Rotation hasher.
@@ -21,7 +21,7 @@ The golang team is currently figuring out the best way to implement blake3 into
 puc19Blake3Hash := puc19AnnotatedSequence.Blake3Hash()
 fmt.Println(puc19Blake3Hash)

-// the second is with the Blake3SequenceHash(annotatedSequence AnnotatedSequence) function.
+// the second is with the Blake3SequenceHash(sequence Sequence) function.
 puc19Blake3Hash = puc19AnnotatedSequence.Blake3Hash()
 fmt.Println(puc19Blake3Hash)
 ```
@@ -33,7 +33,7 @@ Again, this will be deprecated in favor of using generic hashing with blake3 in
 `poly` also provides a generic hashing function and method wrapper for hashing sequences with arbitrary hashing functions that use the golang standard library's hash function interface. Check out this switch statement in the [hash command source code](https://github.com/TimothyStiles/poly/blob/f51ec1c08820394d7cab89a5a4af92d9b803f0a4/commands.go#L261) to see all that `poly` provides in the command line utility alone.

 ```go
-// getting our example AnnotatedSequence struct
+// getting our example Sequence struct
 puc19AnnotatedSequence := ReadJSON("data/puc19static.json")

 // there are two ways to use the Least Rotation generic hasher.
@@ -42,7 +42,7 @@ Again, this will be deprecated in favor of using generic hashing with blake3 in
 puc19Sha1Hash := puc19AnnotatedSequence.Hash(crypto.SHA1)
 fmt.Println(puc19Sha1Hash)

-// the second is with the GenericSequenceHash() function where you pass an AnnotatedSequence along with a hash function as arguments.
+// the second is with the GenericSequenceHash() function where you pass an Sequence along with a hash function as arguments.
 puc19Sha1Hash = GenericSequenceHash(puc19AnnotatedSequence, crypto.SHA1)
 fmt.Println(puc19Sha1Hash)
 ```
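
The generic hasher documented in this file leans on the standard library's hash interface. As a rough illustration of how a hash name can be resolved to a `hash.Hash` using only the Go standard library, here is a small sketch. `hasherByName` and `hashString` are hypothetical helpers, not poly functions, and the name table is an assumption chosen for the example. Note that `crypto.Hash.New()` panics when the implementation is not linked into the binary, which is why the sketch checks `Available()` first.

```go
package main

import (
	"crypto"
	_ "crypto/sha1"   // blank import registers SHA-1 with the crypto package
	_ "crypto/sha256" // blank import registers SHA-256
	"encoding/hex"
	"fmt"
	"hash"
)

// knownHashes is an illustrative name table (an assumption for this sketch).
// crypto.MD4 is listed on purpose: its implementation lives outside the
// standard library, so it is not available here.
var knownHashes = map[string]crypto.Hash{
	"sha1":   crypto.SHA1,
	"sha256": crypto.SHA256,
	"md4":    crypto.MD4,
}

// hasherByName resolves a name to a fresh hash.Hash, returning an error
// instead of panicking when the hash is unknown or not linked in.
func hasherByName(name string) (hash.Hash, error) {
	id, ok := knownHashes[name]
	if !ok {
		return nil, fmt.Errorf("unknown hash function %q", name)
	}
	if !id.Available() {
		return nil, fmt.Errorf("hash function %q is not linked into this binary", name)
	}
	return id.New(), nil
}

// hashString writes the sequence into h and returns the hex digest.
func hashString(sequence string, h hash.Hash) string {
	h.Write([]byte(sequence))
	return hex.EncodeToString(h.Sum(nil))
}

func main() {
	for _, name := range []string{"sha1", "sha256", "md4"} {
		h, err := hasherByName(name)
		if err != nil {
			fmt.Println("skipping:", err)
			continue
		}
		fmt.Println(name, hashString("ATGC", h))
	}
}
```

Keying the lookup on `crypto.Hash` rather than on concrete constructors keeps the table small, at the cost of needing a blank import for every algorithm that should be usable at run time.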

docs/library-io.md

Lines changed: 26 additions & 26 deletions
@@ -3,37 +3,37 @@ id: library-io
 title: Sequence Input Output
 ---

-At the center of `poly`'s annotated sequence support is the `AnnotatedSequence` struct. Structs are kind of Go's answer to objects in other languages. They provide a way of making custom datatypes and methods for developers to use. More on that [here](https://tour.golang.org/moretypes/2), [here](https://gobyexample.com/methods), and [here](https://www.golang-book.com/books/intro/9).
+At the center of `poly`'s annotated sequence support is the `Sequence` struct. Structs are kind of Go's answer to objects in other languages. They provide a way of making custom datatypes and methods for developers to use. More on that [here](https://tour.golang.org/moretypes/2), [here](https://gobyexample.com/methods), and [here](https://www.golang-book.com/books/intro/9).

-Anywho. `poly` centers around reading in various annotated sequence formats like genbank, or gff and parsing them into an `AnnotatedSequence` to do stuff with them. Whether that's being written out to JSON or being used by `poly` itself. Here are some examples.
+Anywho. `poly` centers around reading in various annotated sequence formats like genbank, or gff and parsing them into an `Sequence` to do stuff with them. Whether that's being written out to JSON or being used by `poly` itself. Here are some examples.

 ## Readers

-For all supported file formats `poly` supports a reader. A reader is a function literally named `ReadJSON(path)`, `ReadGbk(path)`, or `ReadGff(path)` that takes one argument - a filepath where your file is located, and returns an `AnnotatedSequence` struct.
+For all supported file formats `poly` supports a reader. A reader is a function literally named `ReadJSON(path)`, `ReadGbk(path)`, or `ReadGff(path)` that takes one argument - a filepath where your file is located, and returns an `Sequence` struct.

 ```go
 bsubAnnotatedSequence := ReadGbk("data/bsub.gbk")
 ecoliAnnotatedSequence := ReadGff("data/ecoli-mg1655.gff")
 puc19AnnotatedSequence := ReadJSON("data/puc19static.json")
 ```

-These AnnotatedSequence structs contain all sorts of goodies but can be broken down into three sub main structs. `AnnotatedSequence.Meta`, `AnnotatedSequence.Features`, and `AnnotatedSequence.Sequence`.
+These Sequence structs contain all sorts of goodies but can be broken down into three sub main structs. `Sequence.Meta`, `Sequence.Features`, and `Sequence.Sequence`.

 > Before we move on with the rest of IO I think it'd be good to go over these sub structs in the next section but of course you can skip to [writers](#writers) if you'd like.

-## AnnotatedSequence structs
+## Sequence structs

-Like I just said these AnnotatedSequence structs contain all sorts of goodies but can be broken down into three main sub structs:
+Like I just said these Sequence structs contain all sorts of goodies but can be broken down into three main sub structs:

-* [AnnotatedSequence.Meta](#annotatedsequencemeta)
-* [AnnotatedSequence.Features](#annotatedsequencefeatures)
-* [AnnotatedSequence.Sequence](#annotatedsequencesequence)
+* [Sequence.Meta](#annotatedsequencemeta)
+* [Sequence.Features](#annotatedsequencefeatures)
+* [Sequence.Sequence](#annotatedsequencesequence)

-Here's how the AnnotatedSequence struct is actually implemented as of [commit c4fc7e](https://github.com/TimothyStiles/poly/blob/c4fc7e6f6cdbd9e5ed2d8ffdbeb206d1d5a8d720/io.go#L108).
+Here's how the Sequence struct is actually implemented as of [commit c4fc7e](https://github.com/TimothyStiles/poly/blob/c4fc7e6f6cdbd9e5ed2d8ffdbeb206d1d5a8d720/io.go#L108).

 ```go
-// AnnotatedSequence holds all sequence information in a single struct.
-type AnnotatedSequence struct {
+// Sequence holds all sequence information in a single struct.
+type Sequence struct {
 	Meta Meta
 	Features []Feature
 	Sequence Sequence
@@ -42,11 +42,11 @@ Here's how the AnnotatedSequence struct is actually implemented as of [commit c4

 > You can check out the original implementation [here](https://github.com/TimothyStiles/poly/blob/c4fc7e6f6cdbd9e5ed2d8ffdbeb206d1d5a8d720/io.go#L108) but I warn you that this is a snapshot and likely has been updated since last writing.

-### AnnotatedSequence.Meta
+### Sequence.Meta

 The Meta substruct contains various meta information about whatever record was parsed. Things like name, version, genbank references, etc.

-So if I wanted to get something like the Genbank Accession number for a AnnotatedSequence I'd get it like this:
+So if I wanted to get something like the Genbank Accession number for a Sequence I'd get it like this:

 ```go
 bsubAnnotatedSequence := ReadGbk("data/bsub.gbk")
@@ -68,7 +68,7 @@ Same goes for a lot of other stuff:
 Here's how the Meta struct is actually implemented in [commit c4fc7e](https://github.com/TimothyStiles/poly/blob/c4fc7e6f6cdbd9e5ed2d8ffdbeb206d1d5a8d720/io.go#L34) which is the latest as of writing.

 ```go
-// Meta Holds all the meta information of an AnnotatedSequence struct.
+// Meta Holds all the meta information of an Sequence struct.
 type Meta struct {
 	Name string
 	GffVersion string
@@ -93,9 +93,9 @@ Here's how the Meta struct is actually implemented in [commit c4fc7e](https://gi

 You'll notice that there are actually three more substructs towards the bottom. They hold extra genbank specific information that's handy to have grouped together. More about how genbank files are structered can be found [here](https://www.ncbi.nlm.nih.gov/Sitemap/samplerecord.html).

-### AnnotatedSequence.Features
+### Sequence.Features

-The `Features` substruct is actually a slice (golang term for what is essentially a dynamic length list) of `Feature` structs that can be iterated through. For example if you wanted to iterate through an `AnnotatedSequence`'s features and get their name (i.e GFP) and type (i.e CDS) you'd do it like this.
+The `Features` substruct is actually a slice (golang term for what is essentially a dynamic length list) of `Feature` structs that can be iterated through. For example if you wanted to iterate through an `Sequence`'s features and get their name (i.e GFP) and type (i.e CDS) you'd do it like this.

 ```go
 bsubAnnotatedSequence := ReadGbk("data/bsub.gbk")
@@ -106,12 +106,12 @@ The `Features` substruct is actually a slice (golang term for what is essentiall

 The `Feature` struct has about 10 or so fields which you can learn more about from this section in [commit c4fc7e](https://github.com/TimothyStiles/poly/blob/c4fc7e6f6cdbd9e5ed2d8ffdbeb206d1d5a8d720/io.go#L80).

-### AnnotatedSequence.Sequence
+### Sequence.Sequence

-The AnnotatedSequence Sequence substruct is by far the most basic and critical. Without it well, you ain't go no DNA. The substruct itself has 4 simple fields.
+The Sequence Sequence substruct is by far the most basic and critical. Without it well, you ain't go no DNA. The substruct itself has 4 simple fields.

 ```go
-// Sequence holds raw sequence information in an AnnotatedSequence struct.
+// Sequence holds raw sequence information in an Sequence struct.
 type Sequence struct {
 	Description string
 	Hash string
@@ -122,7 +122,7 @@ The AnnotatedSequence Sequence substruct is by far the most basic and critical.

 The `Description`, `Hash`, and `HashFunction` are at all identifying fields of the Sequence string. The `Description` is the same kind of short description you'd find in a `fasta` or `fastq` file. The `Hash` and `HashFunction` are used to create a unique identifier specify to the sequence string which you'll learn more about in the next chapter on sequence hashing.

-To get an AnnotatedSequence sequence you can address it like so:
+To get an Sequence sequence you can address it like so:

 ```go
 bsubAnnotatedSequence := ReadGbk("data/bsub.gbk")
@@ -133,10 +133,10 @@ To get an AnnotatedSequence sequence you can address it like so:

 `poly` tries to supply a writer for all supported file formats that have a reader.

-Writers take two arguments. The first is an AnnotatedSequence struct, the second is a path to write out to.
+Writers take two arguments. The first is an Sequence struct, the second is a path to write out to.

 ```go
-// getting AnnotatedSequence(s) to write out again.
+// getting Sequence(s) to write out again.
 bsubAnnotatedSequence := ReadGbk("data/bsub.gbk")

 // writing out gbk file input as json.
@@ -154,7 +154,7 @@ To get an AnnotatedSequence sequence you can address it like so:

 ## Parsers

-`poly` parsers are what actually parse input files from a string without any of the system IO. This is particularly useful if you're like me and have an old database holding genbank files as strings. You can take those strings from a database or whatever and just pass them to `ParseGbk()`, or `ParseGff()` and they'll convert them into AnnotatedSequence structs.
+`poly` parsers are what actually parse input files from a string without any of the system IO. This is particularly useful if you're like me and have an old database holding genbank files as strings. You can take those strings from a database or whatever and just pass them to `ParseGbk()`, or `ParseGff()` and they'll convert them into Sequence structs.

 ```go
 puc19AnnotatedSequence := ParseGbk("imagine this is actually a gbk in string format.")
@@ -164,10 +164,10 @@ That's it. The reason we don't have a `ParseJSON()` is that golang, like almost

 ## Builders

-`poly` builders take AnnotatedSequence structs and use them to build strings for different file formats.
+`poly` builders take Sequence structs and use them to build strings for different file formats.

 ```go
-// generating an AnnotatedSequence struct from a gff file.
+// generating an Sequence struct from a gff file.
 ecoliAnnotatedSequence := ReadGff("data/ecoli-mg1655.gff")

 // generating a gff string that then can be piped to stdout or written to a database.

hash.go

Lines changed: 5 additions & 5 deletions
@@ -95,12 +95,12 @@ func RotateSequence(sequence string) string {
 	return sequence
 }

-// Hash is a method wrapper for hashing AnnotatedSequence structs.
-func (annotatedSequence AnnotatedSequence) Hash(hash hash.Hash) string {
-	if annotatedSequence.Meta.Locus.Circular {
-		annotatedSequence.Sequence.Sequence = RotateSequence(annotatedSequence.Sequence.Sequence)
+// Hash is a method wrapper for hashing Sequence structs.
+func (sequence Sequence) Hash(hash hash.Hash) string {
+	if sequence.Meta.Locus.Circular {
+		sequence.Sequence = RotateSequence(sequence.Sequence)
 	}
-	seqHash, _ := hashSequence(annotatedSequence.Sequence.Sequence, hash)
+	seqHash, _ := hashSequence(sequence.Sequence, hash)
 	return seqHash
 }

hash_test.go

Lines changed: 2 additions & 2 deletions
@@ -29,10 +29,10 @@ func TestHashRegression(t *testing.T) {
 }

 func TestLeastRotation(t *testing.T) {
-	annotatedSequence := ReadGbk("data/puc19.gbk")
+	sequence := ReadGbk("data/puc19.gbk")
 	var sequenceBuffer bytes.Buffer

-	sequenceBuffer.WriteString(annotatedSequence.Sequence.Sequence)
+	sequenceBuffer.WriteString(sequence.Sequence)
 	bufferLength := sequenceBuffer.Len()

 	var rotatedSequence string
