FastaBuff encoding¶
eccLib.FastaBuff is a class wrapping around a memory buffer
that holds a sequence of nucleotides. Thanks to it’s unique encoding scheme
utilized on runtime, it provides efficient storage and retrieval of nucleotide
sequences. To put it simply: eccLib instead of storing sequences as characters
in a string, uses a mapping that maps characters to 4 bit integers, which when
packed together, allow for memory usage reduction of 50%.
IUPAC Nucleotide Codes¶
IUPAC specifies a set of nucleotide codes that should be used to represent nucleotides, and combinations or ambiguities in sequences. There are 17 possible codes laid out in the table below.
Code |
Meaning |
Equivalent symbols |
|---|---|---|
A |
Adenine |
{A} |
C |
Cytosine |
{C} |
G |
Guanine |
{G} |
T |
Thymine |
{T} |
U |
Uracil |
{U} |
R |
Purine |
{A, G} |
Y |
Pyrimidine |
{C, T} |
S |
Guanine or Cytosine |
{C, G} |
W |
Adenine or Thymine |
{A, T} |
K |
Guanine or Thymine |
{T, G} |
M |
Adenine or Cytosine |
{A, C} |
B |
Not Adenine |
{C, G, T} |
D |
Not Cytosine |
{A, G, T} |
H |
Not Guanine |
{A, C, T} |
V |
Not Thymine |
{A, C, G} |
N |
Any |
{A, C, G, T} |
. or - |
Gap |
{} |
In the second column, you can see a set of equivalent possible symbols. It’s worth noting that if we store information, whether the sequence is DNA or RNA, the number of these symbols can be reduced to 16 (as is actually recommended). It just so happens, that 16 is a power of 2, fourth power to be exact. This means that we can represent each IUPAC symbol using a 4 bit integer, assuming we store whether the sequence is DNA or RNA separately of course.
Mapping symbols to numbers¶
Well but how do we construct such a mapping? In the second column of the table above, we already have a mapping of IUPAC symbols to the power set of {A, C, G, T}. Now we just need to map that power set, to 4 bit binary numbers. We chose to do it through a bit field, where each bit represents a nucleotide. The first bit represents Adenine (A), the second bit represents Cytosine (C), the third bit represents Guanine (G), and the fourth bit represents Thymine (T).
Symbol |
Binary |
Hex |
|---|---|---|
. or - |
0000 |
0x00 |
T or U |
0001 |
0x01 |
G |
0010 |
0x02 |
K |
0011 |
0x03 |
C |
0100 |
0x04 |
Y |
0101 |
0x05 |
S |
0110 |
0x06 |
B |
0111 |
0x07 |
A |
1000 |
0x08 |
W |
1001 |
0x09 |
R |
1010 |
0x0A |
D |
1011 |
0x0B |
M |
1100 |
0x0C |
H |
1101 |
0x0D |
V |
1110 |
0x0E |
N |
1111 |
0x0F |
This mapping is then used on runtime when creating or accessing
eccLib.FastaBuff objects, or FASTA parsing when binary parsing is
toggled.
Advantages¶
Naturally this mapping on itself doesn’t achieve much, but this mapping does allow for packing two characters into a single byte. Such a packing has the advantage, of reducing memory usage by half, and indexing within the buffer is trivial, unlike with a 5 bit packed buffer. Additionally, during debugging, a reading the buffer is trivial and can be done with a simple hexdump. For example 0x8421 stands for ‘ACGT’.
Note
Yes, protein IUPAC sequences are not supported by this encoding. There are 24 possible protein IUPAC characters, meaning a total of 5 bits would be needed to represent each character. This minor issue, cascades into a larger issue of 5 bits not fitting into a byte, making runtime indexing and memory usage inefficient.
Serialisation¶
All eccLib.FastaBuff objects can be easily serialised through the
eccLib.FastaBuff.dump() method. That method returns a bytes
object, which can be written to a file opened in binary mode.
import eccLib
fasta_buffer = eccLib.FastaBuff('ACGT')
with open('output.fb', 'wb') as f:
f.write(fasta_buffer.dump())
This opens the door for this encoding to be used for long term storage, though
ideally you should consider compressing this data. This can be accomplished
with the standard gzip module.
import gzip
import eccLib
fasta_buffer = eccLib.FastaBuff('ACGT')
with gzip.open('output.fb.gz', 'wb') as f:
f.write(fasta_buffer.dump())
For more details regarding serialisation, see Serialisation.