FastaBuff encoding¶

eccLib.FastaBuff is a class wrapping around a memory buffer that holds a sequence of nucleotides. Thanks to it’s unique encoding scheme utilized on runtime, it provides efficient storage and retrieval of nucleotide sequences. To put it simply: instead of storing sequences as characters in a string, uses a mapping that maps characters to 4 bit integers, which when packed together, allow for memory usage reduction of 50%.

IUPAC Nucleotide Codes¶

IUPAC specifies a set of nucleotide codes that should be used to represent nucleotides, and combinations or ambiguities in sequences. There are 17 possible codes laid out in the table below.

IUPAC Nucleotide Codes source¶
Code	Meaning	Equivalent sybmols
A	Adenine	{A}
C	Cytosine	{C}
G	Guanine	{G}
T	Thymine	{T}
U	Uracil	{U}
R	Purine	{A, G}
Y	Pyrimidine	{C, T}
S	Guanine or Cytosine	{C, G}
W	Adenine or Thymine	{A, T}
K	Guanine or Thymine	{T, G}
M	Adenine or Cytosine	{A, C}
B	Not Adenine	{C, G, T}
D	Not Cytosine	{A, G, T}
H	Not Guanine	{A, C, T}
V	Not Thymine	{A, C, G}
N	Any	{A, C, G, T}
. or -	Gap	{}

In the second column, you can see a set of equivalent possible symbols. It’s worth noting, that if we store information, whether the sequence is DNA or RNA, the number of these symbols can be reduced to 16 (as is actually recommended). It just so happens, that 16 is a power of 2, fourth power to be exact. This means that we can represent each IUPAC symbol using a 4 bit integer, assuming we store whether the sequence is DNA or RNA separately of course.

Mapping symbols to numbers¶

Well but how do we construct such a mapping? In the second column of the table above, we already have a mapping of IUPAC symbols to the power set of {A, C, G, T}. Now we just need to map that power set, to 4 bit binary numbers. We chose to do it through a bit field, where each bit represents a nucleotide. The first bit represents Adenine (A), the second bit represents Cytosine (C), the third bit represents Guanine (G), and the fourth bit represents Thymine (T).

Mapping IUPAC symbols to 4 bit binary numbers¶
Symbol	Binary	Hex
. or -	0000	0x00
T or U	0001	0x01
G	0010	0x02
K	0011	0x03
C	0100	0x04
Y	0101	0x05
S	0110	0x06
B	0111	0x07
A	1000	0x08
W	1001	0x09
R	1010	0x0A
D	1011	0x0B
M	1100	0x0C
H	1101	0x0D
V	1110	0x0E
N	1111	0x0F

This mapping is then used on runtime when creating or accessing FastaBuff objects, or FASTA parsing when binary parsing is toggled.

Advantages¶

Naturally this mapping on itself doesn’t achieve much, but this mapping does allow for packing two characters into a single byte. Such a packing has the advantage, of reducing memory usage by half, and indexing within the buffer is trivial, unlike with a 5 bit packed buffer. Additionally, during debugging, a reading the buffer is trivial and can be done with a simple hexdump. For example 0x8421 stands for ‘ACGT’.

Serialisation¶

All FastaBuff objects can be easily serialised through the eccLib.FastaBuff.dump() method. That method returns a :py:type:`bytes` object, which can be written to a file opened in binary mode.

fasta_buffer = FastaBuff('ACGT')

with open('output.fb', 'wb') as f:
    f.write(fasta_buffer.dump())

This opens the door for this encoding to be used for long term storage, though ideally you should consider compressing this data.

FastaBuff encoding¶

IUPAC Nucleotide Codes¶

Mapping symbols to numbers¶

Advantages¶

Serialisation¶

eccLib

Navigation

Related Topics