eccLib Cookbook

On this page you will find a collection of examples that demonstrate how to use eccLib. These examples should well demonstrate the capabilities, and the simplicity of eccLib.

GTF file parsing

The following example demonstrates how to parse a GTF file using eccLib.

import eccLib

with open('test/example.gtf', 'r') as f:
    gtf = eccLib.parseGTF(f)

With the above code snippet, you can easily parse a GTF file. Now what if we wanted to parse a GTF file with console output?

Parsing a GTF file with console output
import eccLib
from sys import stdout

with open('test/example.gtf', 'r') as f:
    gtf = eccLib.parseGTF(f, echo=stdout)

Note

Any sort of logging will slow down the parsing, so weigh the benefits of console output against the performance impact.

This will print the parsing progress to the console. You can in theory use any file-like object as the echo argument, so you could also write echo to a file or a socket. Ok so we have seen simple parsing, but what if we wanted to get only the first 10 lines of the GTF file?

import eccLib

with eccLib.GtfFile("test/example.gtf") as file:
    target = eccLib.GtfList()
    for i, line in enumerate(file):
        if i == 10:
            break
        target.append(line)

assert len(target) == 9

A bit more verbose, but with the iterative approach you can easily control how many lines you want to read and how many entries remain in memory. And what if we wanted to parse GTF from a string or an io object because we retrieved it from a database or a network connection?

import eccLib
from io import StringIO

gtf_string = "1\tensembl_havana\tintron\t1471765\t1497848\t0.0\t+\t.\n"

gtf1 = eccLib.parseGTF(gtf_string)

io_obj = StringIO(gtf_string)

gtf2 = eccLib.parseGTF(io_obj)

assert gtf1 == gtf2
print(repr(gtf1))
[{'seqname': '1', 'source': 'ensembl_havana', 'feature': 'intron', 'start': 1471765, 'end': 1497848, 'score': 0.0, 'reverse': False, 'frame': None}]

What about iterative parsing of that string?

import eccLib
from io import StringIO

gtf_string = "1\tensembl_havana\tintron\t1471765\t1497848\t0.0\t+\t.\n"

gtf = eccLib.GtfReader(StringIO(gtf_string))
# GtfReader is not an Iterable, it's an Iterator

while True:
    try:
        line = next(gtf)
        print(repr(line))
    except StopIteration:
        break
{'seqname': '1', 'source': 'ensembl_havana', 'feature': 'intron', 'start': 1471765, 'end': 1497848, 'score': 0.0, 'reverse': False, 'frame': None}

But what if we have serialized data within attributes? Well we have a way of parsing attributes into something other than a string. Both parsers accept an argument called attr_tp which maps attribute names, to functions that should parse the attribute value into the desired type. So for example, let’s say we want to tweak an attribute called gene_id, and parse gene_version into an integer value.

import eccLib

with open("test/example.gtf", "r") as f:
    gtf = eccLib.parseGTF(f, attr_tp={"gene_id": lambda x: x.replace("ENSG", ""), "gene_version": int})

assert all(isinstance(line["gene_id"], str) and not line["gene_id"].startswith("ENSG") for line in gtf)
assert all(isinstance(line["gene_version"], int) for line in gtf)

Thanks to this nifty feature, we can easily serialize and serialize more complex datasets into GTFs.

GTF processing

While the feature set of eccLib is small, it still provides the core utilities one would expect from a GTF parser. Here’s how you can filter GTF entries using utilities provided by eccLib.GtfList:

import eccLib

with open("test/example.gtf", "r") as f:
    gtf = eccLib.parseGTF(f)

# Filter by chromosome

first_chr = gtf.find(seqname='chr1')
assert len(first_chr) < len(gtf)

# Filter using a function

score_greater_than_length = gtf.find(lambda x: x.score > len(x) if x.score is not None else False)
assert len(score_greater_than_length) == 0

# Filter using a function called on key-value pair

gene_id_present = gtf.find(gene_id=lambda x: x is not None)
assert len(gene_id_present) == len(gtf)

So now we know how to filter GTF entries, but what if we wanted to sort them?

import eccLib

sorted_gtf = sorted(gtf)

assert len(sorted_gtf) == len(gtf)
assert sorted_gtf[0].start < sorted_gtf[-1].start

from copy import copy

sorted_gtf_2 = copy(gtf)
sorted_gtf_2.sort()
assert sorted_gtf == sorted_gtf_2

By default GTF entries are compared to each other by their position in the sequence, so simply calling the list.sort() method will sort the entries by their position in the sequence. In most cases the order of the entries in the GTF doesn’t matter, but often you might find yourself working with a GTF file containing entries that are from different sequences. In this c ase you might want to separate the entries by sequence.

import eccLib

grouped_gtf = gtf.sq_split()

assert isinstance(grouped_gtf, dict)

chr1 = grouped_gtf["1"] # Get all entries from chromosome 1
assert isinstance(chr1, eccLib.GtfList)

Quite handy especially with those large GTF files. And what if we wanted to get all the unique values of a column?

Column View

While GTFs might not specify a common row-width accross entries, it’s rather common for files to actually be structured with a fixed row-width. As such, eccLib allows you to easily view and manipulate the columns of a GTF file. This is most easily done using the eccLib.GtfList.column() method.

import eccLib

columns = gtf.column("feature")
assert isinstance(columns, eccLib.GtfList_ColumnView)
assert len(columns) == len(gtf)

features = set(columns)
assert features == {'exon', 'transcript', 'gene', 'CDS'}

That method, returns a special type; eccLib.GtfList_ColumnView. It acts like a dict’s collections.abc.ValuesView, and a collections.abc.Sequence, thus meaning you can expect it to work like a tuple, while pointing to the same underlying data. This is useful for performance and memory efficiency.

Note

Since eccLib.GtfList_ColumnView is a view, it will not allocate any memory, unless you force it to by wrapping it in a list or tuple.

One major upside of using the eccLib.GtfList_ColumnView, is that it’s easily converted to most Python sequence and iterable types. Most notably, it can be explicitly converted to numpy.ndarray using numpy.array().

import eccLib
import numpy as np

arr = np.array(gtf.column("start"), dtype=np.int32)
assert isinstance(arr, np.ndarray)
assert arr.shape == (len(gtf),)

For more details regarding converting data in eccLib.GtfList to more advanced Python types see Conversion guide.

Warning

The view does hold a reference to the eccLib.GtfList it was created from, so memory associated with the eccLib.GtfList will not be freed until the view is garbage collected.

Annotation processing

So we have shown off some of the GTF processing capabilities of eccLib, but what about individual GTF entry processing? Well the eccLib.GtfDict class is quite handy as well.

import eccLib

new = eccLib.GtfDict(
    "1", "ensembl_havana", "intron", 1471765, 1497848, 0.0, True, None, extra="extra"
)

# We can access the values of the GTF entry like this

print(new.seqname) # 1

# or like this

print(new['seqname']) # 1

# additional attributes can only be accessed like this

print(new['extra']) # extra
1
1
extra

Do you want to quickly check if two sequences overlap? One is contained within the other? Or if one is upstream of the other? eccLib has you covered.

import eccLib

new = eccLib.GtfDict(
    None, None, "seq", 0, 5, None, True, None
)

other = eccLib.GtfDict(
    None, None, "seq", 1, 4, None, True, None
)

# Check if the two entries overlap

assert new.overlaps(other) is True
assert new.coverage(other) > 0

# Check if one entry is contained within the other

assert new.contains(other) is True

# Check if one entry is upstream of the other

assert not new > other

eccLib overall has adapted a lot of the standard Python operators to work in the context of GTF entries. All of the standard dict’s functionality has been implemented, so you can still do things like get the keys, values, or items of a GtfDict. Iteration also works as expected.

import eccLib

new = eccLib.GtfDict(
    None, None, "seq", 0, 5, None, True, None
)

for key, value in new.items():
    print(key, value)

for key in new:
    print(key)
seqname None
source None
feature seq
start 0
end 5
score None
reverse True
frame None
seqname
source
feature
start
end
score
reverse
frame

FASTA file parsing

Well here the matter is quite simpler. You can parse a FASTA file with a single line of code.

import eccLib

with open("test/example.fna", "r") as f:
    fasta = eccLib.parseFASTA(f)

assert len(fasta) == 2
assert fasta[0][0] == "test"
assert len(fasta[0][1]) == 240
print(fasta[0][1])
CCTTTAAGCACTCCTTTTTAGTTATCCCCACCTGCCCAGCTCCCTTATTAGGCTGAGACACTTTAACTAAATTATCTGCTTCCCTGACTATTCCTGGATTACAGCTATATCTCATTGCTGCCCTTCTTCCCAATCCAAAGCCTCCTTTGTGTCCTCCTCTTGTATCCCCCCACCTTAACCCACAAGTATAAGATACCTCTACTCCCTCCTTGGCGACTGATCATGCACCCCTTACCATCT

There’s also an iterative parser, in the spirit of the previously shown eccLib.GtfFile called eccLib.FastaFile.

import eccLib

with eccLib.FastaFile("test/example.fna") as fasta:
    for name, seq in fasta:
        print(name, len(seq))
test 240
test 240

And that’s it. You can now access the sequences in the FASTA file sequentially. Want to retrieve an annotated sequence from a FASTA file?

import eccLib

with open("test/example.fna", "r") as f:
    fasta = eccLib.parseFASTA(f)

new = eccLib.GtfDict(
    None, None, "seq", 0, 5, None, True, None
)

seq = fasta[0][1].get_annotated(new)

print(seq)
CCTTT

FastaBuff Viewing

Fasta sequences can be rather long. As such, it would be rather self-defeating to the whole idea of eccLib.FastaBuff, to force users to load the entire sequence into str to view it. To avoid this, eccLib.FastaBuff will avoid creating a str representation of the sequence, instead returning eccLib.FastaBuff_view objects, wherever possible.

import eccLib

fasta_buffer = eccLib.FastaBuff("ACGT")
view = fasta_buffer[0:3]
print(view)
ACG

These views alias the same underlying memory buffer, so you can consider their creation and storage as zero-cost. These views should allow you to do any slicing or indexing you need to do with the sequence, without loading the entire thing into memory.

Note

These views are read-only. You cannot modify the sequence through them, however, changes to the underlying buffer will be reflected in the view.

eccLib.FastaBuff_view does support advanced indexing and slicing, meaning you can use negative indices, slicing with step sizes, and slicing with slice objects.

import eccLib

fasta_buffer = eccLib.FastaBuff("ACGTACGTACGTACGT")
view = fasta_buffer[::4]
print(view)
assert fasta_buffer[0:4] == fasta_buffer[-4:]
AAAA

Warning

Same as with eccLib.GtfList_ColumnView, these views do hold a reference to the underlying memory, meaning, the base eccLib.FastaBuff object must be kept alive as long as the view is in use.

If you wish to have an owned memory buffer, you can always just convert the view into a buffer.

import eccLib

assert fasta_buffer == eccLib.FastaBuff(fasta_buffer[:])