FastaBuff encoding¶
eccLib.FastaBuff
is a class wrapping around a memory buffer
that holds a sequence of nucleotides. Thanks to it’s unique encoding scheme
utilized on runtime, it provides efficient storage and retrieval of nucleotide
sequences. To put it simply: instead of storing sequences as characters in a
string, uses a mapping that maps characters to 4 bit integers, which when packed
together, allow for memory usage reduction of 50%.
IUPAC Nucleotide Codes¶
IUPAC specifies a set of nucleotide codes that should be used to represent nucleotides, and combinations or ambiguities in sequences. There are 17 possible codes laid out in the table below.
Code |
Meaning |
Equivalent sybmols |
---|---|---|
A |
Adenine |
{A} |
C |
Cytosine |
{C} |
G |
Guanine |
{G} |
T |
Thymine |
{T} |
U |
Uracil |
{U} |
R |
Purine |
{A, G} |
Y |
Pyrimidine |
{C, T} |
S |
Guanine or Cytosine |
{C, G} |
W |
Adenine or Thymine |
{A, T} |
K |
Guanine or Thymine |
{T, G} |
M |
Adenine or Cytosine |
{A, C} |
B |
Not Adenine |
{C, G, T} |
D |
Not Cytosine |
{A, G, T} |
H |
Not Guanine |
{A, C, T} |
V |
Not Thymine |
{A, C, G} |
N |
Any |
{A, C, G, T} |
. or - |
Gap |
{} |
In the second column, you can see a set of equivalent possible symbols. It’s worth noting, that if we store information, whether the sequence is DNA or RNA, the number of these symbols can be reduced to 16 (as is actually recommended). It just so happens, that 16 is a power of 2, fourth power to be exact. This means that we can represent each IUPAC symbol using a 4 bit integer, assuming we store whether the sequence is DNA or RNA separately of course.
Mapping symbols to numbers¶
Well but how do we construct such a mapping? In the second column of the table above, we already have a mapping of IUPAC symbols to the power set of {A, C, G, T}. Now we just need to map that power set, to 4 bit binary numbers. We chose to do it through a bit field, where each bit represents a nucleotide. The first bit represents Adenine (A), the second bit represents Cytosine (C), the third bit represents Guanine (G), and the fourth bit represents Thymine (T).
Symbol |
Binary |
Hex |
---|---|---|
. or - |
0000 |
0x00 |
T or U |
0001 |
0x01 |
G |
0010 |
0x02 |
K |
0011 |
0x03 |
C |
0100 |
0x04 |
Y |
0101 |
0x05 |
S |
0110 |
0x06 |
B |
0111 |
0x07 |
A |
1000 |
0x08 |
W |
1001 |
0x09 |
R |
1010 |
0x0A |
D |
1011 |
0x0B |
M |
1100 |
0x0C |
H |
1101 |
0x0D |
V |
1110 |
0x0E |
N |
1111 |
0x0F |
This mapping is then used on runtime when creating or accessing FastaBuff objects, or FASTA parsing when binary parsing is toggled.
Advantages¶
Naturally this mapping on itself doesn’t achieve much, but this mapping does allow for packing two characters into a single byte. Such a packing has the advantage, of reducing memory usage by half, and indexing within the buffer is trivial, unlike with a 5 bit packed buffer. Additionally, during debugging, a reading the buffer is trivial and can be done with a simple hexdump. For example 0x8421 stands for ‘ACGT’.
Serialisation¶
All FastaBuff objects can be easily serialised through the
eccLib.FastaBuff.dump()
method. That method returns a :py:type:`bytes`
object, which can be written to a file opened in binary mode.
fasta_buffer = FastaBuff('ACGT')
with open('output.fb', 'wb') as f:
f.write(fasta_buffer.dump())
This opens the door for this encoding to be used for long term storage, though ideally you should consider compressing this data.