FastaBuff encoding

eccLib.FastaBuff is a class wrapping around a memory buffer that holds a sequence of nucleotides. Thanks to it’s unique encoding scheme utilized on runtime, it provides efficient storage and retrieval of nucleotide sequences. To put it simply: instead of storing sequences as characters in a string, uses a mapping that maps characters to 4 bit integers, which when packed together, allow for memory usage reduction of 50%.

IUPAC Nucleotide Codes

IUPAC specifies a set of nucleotide codes that should be used to represent nucleotides, and combinations or ambiguities in sequences. There are 17 possible codes laid out in the table below.

IUPAC Nucleotide Codes source

Code

Meaning

Equivalent sybmols

A

Adenine

{A}

C

Cytosine

{C}

G

Guanine

{G}

T

Thymine

{T}

U

Uracil

{U}

R

Purine

{A, G}

Y

Pyrimidine

{C, T}

S

Guanine or Cytosine

{C, G}

W

Adenine or Thymine

{A, T}

K

Guanine or Thymine

{T, G}

M

Adenine or Cytosine

{A, C}

B

Not Adenine

{C, G, T}

D

Not Cytosine

{A, G, T}

H

Not Guanine

{A, C, T}

V

Not Thymine

{A, C, G}

N

Any

{A, C, G, T}

. or -

Gap

{}

In the second column, you can see a set of equivalent possible symbols. It’s worth noting, that if we store information, whether the sequence is DNA or RNA, the number of these symbols can be reduced to 16 (as is actually recommended). It just so happens, that 16 is a power of 2, fourth power to be exact. This means that we can represent each IUPAC symbol using a 4 bit integer, assuming we store whether the sequence is DNA or RNA separately of course.

Mapping symbols to numbers

Well but how do we construct such a mapping? In the second column of the table above, we already have a mapping of IUPAC symbols to the power set of {A, C, G, T}. Now we just need to map that power set, to 4 bit binary numbers. We chose to do it through a bit field, where each bit represents a nucleotide. The first bit represents Adenine (A), the second bit represents Cytosine (C), the third bit represents Guanine (G), and the fourth bit represents Thymine (T).

Mapping IUPAC symbols to 4 bit binary numbers

Symbol

Binary

Hex

. or -

0000

0x00

T or U

0001

0x01

G

0010

0x02

K

0011

0x03

C

0100

0x04

Y

0101

0x05

S

0110

0x06

B

0111

0x07

A

1000

0x08

W

1001

0x09

R

1010

0x0A

D

1011

0x0B

M

1100

0x0C

H

1101

0x0D

V

1110

0x0E

N

1111

0x0F

This mapping is then used on runtime when creating or accessing FastaBuff objects, or FASTA parsing when binary parsing is toggled.

Advantages

Naturally this mapping on itself doesn’t achieve much, but this mapping does allow for packing two characters into a single byte. Such a packing has the advantage, of reducing memory usage by half, and indexing within the buffer is trivial, unlike with a 5 bit packed buffer. Additionally, during debugging, a reading the buffer is trivial and can be done with a simple hexdump. For example 0x8421 stands for ‘ACGT’.

Serialisation

All FastaBuff objects can be easily serialised through the eccLib.FastaBuff.dump() method. That method returns a :py:type:`bytes` object, which can be written to a file opened in binary mode.

fasta_buffer = FastaBuff('ACGT')

with open('output.fb', 'wb') as f:
    f.write(fasta_buffer.dump())

This opens the door for this encoding to be used for long term storage, though ideally you should consider compressing this data.