eccLib Cookbook

This page collects examples that demonstrate how to use eccLib. They are meant to show both the capabilities and the simplicity of the library.

GTF file parsing

The following example demonstrates how to parse a GTF file using eccLib.

Parsing a GTF file
import eccLib

with open('example.gtf', 'r') as f:
    gtf = eccLib.parseGTF(f)

With the above code snippet, you can easily parse a GTF file. Now what if we wanted to parse a GTF file with console output?

Parsing a GTF file with console output
import eccLib
from sys import stdout

with open('example.gtf', 'r') as f:
    gtf = eccLib.parseGTF(f, echo=stdout)

This prints the parsing progress to the console. In theory, any file-like object can be passed as the echo argument, so the progress output could also go to a file or a socket. So far we have seen simple parsing, but what if we only wanted the first 10 lines of the GTF file?

Iterative parsing of a GTF file with a limit
import eccLib

file = eccLib.GtfFile('example.gtf')

target = eccLib.GtfList()
for i, line in enumerate(file):
    if i == 10:
        break
    target.append(line)
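
The same limit can also be expressed with itertools.islice from the standard library, which works on any iterable. The sketch below uses a StringIO as a stand-in for the reader and a hypothetical helper named take. It assumes only that the reader can be consumed with a for loop, as in the example above, and has not been checked against eccLib itself.

```python
from io import StringIO
from itertools import islice


def take(iterable, limit):
    """Return at most `limit` items from any iterable as a list."""
    return list(islice(iterable, limit))


# Stand-in for an iterative reader: a text stream yields one line at a time.
sample = StringIO("".join(f"line {n}\n" for n in range(25)))

first_ten = take(sample, 10)
print(len(first_ten))
```

Because islice stops pulling items once the limit is reached, the rest of the input is never read into memory.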

A bit more verbose, but with the iterative approach you can easily control how many lines you want to read and how many entries remain in memory. And what if we wanted to parse GTF from a string or an io object because we retrieved it from a database or a network connection?

Parsing a GTF file from a string or io object
import eccLib
from io import StringIO

gtf_string = '1\tensembl_havana\tintron\t1471765\t1497848\t0.0\t+\t.\n'

gtf = eccLib.parseGTF(gtf_string)

io_obj = StringIO(gtf_string)

gtf = eccLib.parseGTF(io_obj)

What about iterative parsing of that string?

Iterative parsing of a GTF file from a string
import eccLib
from io import StringIO

gtf_string = '1\tensembl_havana\tintron\t1471765\t1497848\t0.0\t+\t.\n'

gtf = eccLib.GtfReader(StringIO(gtf_string))

for line in gtf:
    print(line)
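
To make the sample line used above easier to read, here is a plain-Python sketch that labels each tab-separated value with its column name. It does not use eccLib. The column names are taken from the GtfDict repr shown later on this page, and the helper label_fields is hypothetical. Note that the sample line carries only the eight fixed columns; a full GTF line would normally add a ninth attributes column.

```python
# Column names as they appear in the repr of a GtfDict later on this page.
FIELDS = ("seqname", "source", "feature", "start", "end", "score", "strand", "frame")

gtf_string = '1\tensembl_havana\tintron\t1471765\t1497848\t0.0\t+\t.\n'


def label_fields(line):
    """Pair each tab-separated value of one line with its column name."""
    values = line.rstrip("\n").split("\t")
    return dict(zip(FIELDS, values))


labelled = label_fields(gtf_string)
for name, value in labelled.items():
    # Values are still plain strings here; no type conversion is attempted.
    print(f"{name}: {value}")
```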

But what if we have serialized data within attributes? Well, there is a way of parsing attributes into something other than a string. Both parsers accept an argument called attr_tp, which maps attribute names to functions that parse the attribute value into the desired type. For example, let’s say we want to parse an attribute called clustered into a boolean value, and score_2 into a float value.

Parsing a GTF file with attribute type conversion
import eccLib

with open('example.gtf', 'r') as f:
    gtf = eccLib.parseGTF(f, attr_tp={'clustered': lambda x: x.lower() == 'true', 'score_2': float})

Thanks to this nifty feature, we can easily serialize and deserialize more complex datasets into and from GTFs.
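
To see what such a converter mapping does to individual values, here is a plain-Python sketch that applies the same attr_tp dictionary to a set of raw attribute strings. It does not call eccLib: the helper convert_attributes and the sample values are hypothetical, and leaving unmapped attributes as strings is an assumption about the default behaviour.

```python
# Same converter mapping as in the example above.
attr_tp = {
    "clustered": lambda x: x.lower() == "true",
    "score_2": float,
}


def convert_attributes(raw, converters):
    """Apply a converter to each attribute that has one; keep the rest as strings."""
    return {
        key: converters[key](value) if key in converters else value
        for key, value in raw.items()
    }


# Hypothetical raw attribute values, as plain strings.
raw = {"clustered": "True", "score_2": "0.75", "gene_id": "ENSG0001"}

converted = convert_attributes(raw, attr_tp)
print(converted)
```

Any callable that takes a string and returns a value fits this pattern, so json.loads or int could be used in the same way.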

GTF processing

While the feature set of eccLib is small, it still provides the core utilities one would expect from a GTF parser. Here’s how you can filter GTF entries:

Filtering GTF entries
import eccLib

with open('example.gtf', 'r') as f:
    gtf = eccLib.parseGTF(f)

# Filter by chromosome

first_chr = gtf.find(seqname='chr1')

# Filter using a function

score_greater_than_length = gtf.find(lambda x: x.score > len(x))

# Filter using a function called on key-value pair

gene_id_present = gtf.find(gene_id=lambda x: x is not None)

So now we know how to filter GTF entries, but what if we wanted to sort them?

Sorting GTF entries by position in sequence
import eccLib

with open('example.gtf', 'r') as f:
    gtf = eccLib.parseGTF(f)

sorted_gtf = gtf.sort()

By default, GTF entries are compared to each other by their position in the sequence, so simply calling the sort method orders them by position. In most cases the order of entries in a GTF doesn’t matter, but you will often find yourself working with a GTF file that contains entries from different sequences. In that case you might want to separate the entries by sequence.

Separating GTF entries by sequence
import eccLib

with open('example.gtf', 'r') as f:
    gtf = eccLib.parseGTF(f)

grouped_gtf = gtf.sq_split()

print(grouped_gtf['chr1']) # Print all entries from chromosome 1
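
To make the shape of the grouped result concrete, here is a plain-Python sketch that groups dictionaries by their seqname key. It is only a stand-in for the idea, not eccLib's implementation: the helper split_by_seqname and the sample entries are hypothetical, and the only thing taken from the example above is that the result can be indexed by sequence name.

```python
from collections import defaultdict


def split_by_seqname(entries):
    """Group entries into a dict keyed by their 'seqname' value."""
    groups = defaultdict(list)
    for entry in entries:
        groups[entry["seqname"]].append(entry)
    return dict(groups)


# Hypothetical entries, reduced to the keys needed here.
entries = [
    {"seqname": "chr1", "feature": "gene", "start": 100},
    {"seqname": "chr2", "feature": "exon", "start": 50},
    {"seqname": "chr1", "feature": "exon", "start": 120},
]

grouped = split_by_seqname(entries)
print(sorted(grouped))
print(len(grouped["chr1"]))
```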

Quite handy, especially with large GTF files. And what if we wanted to get all the unique values of a column?

Getting unique values of a column
import eccLib

with open('example.gtf', 'r') as f:
    gtf = eccLib.parseGTF(f)

features = set(gtf.column('feature'))

print(features)

And how do I then export the GTF entries to a file?

Exporting GTF entries to a file
import eccLib

with open('example.gtf', 'r') as f:
    gtf = eccLib.parseGTF(f)

with open('output.gtf', 'w') as f:
    f.write(str(gtf))

Sequence processing

So we have shown off some of the GTF processing capabilities of eccLib, but what about processing individual GTF entries? Well, the GtfDict class is quite handy as well.

Creating a GTF entry
import eccLib

new = eccLib.GtfDict(
    "1", "ensembl_havana", "intron", 1471765, 1497848, 0.0, True, None, extra="extra"
)

# We can access the values of the GTF entry like this

print(new.seqname) # 1

# or like this

print(new['seqname']) # 1

# additional attributes can only be accessed like this

print(new['extra']) # extra

Do you want to quickly check whether two sequences overlap, whether one is contained within the other, or whether one is upstream of the other? eccLib has you covered.

Checking sequence relative positions
import eccLib

new = eccLib.GtfDict(
    None, None, "seq", 0, 5, None, True, None
)

other = eccLib.GtfDict(
    None, None, "seq", 1, 4, None, True, None
)

# Check if the two entries overlap

print(new.overlaps(other)) # True

# Check if one entry is contained within the other

print(new.contains(other)) # True

# Check if one entry is upstream of the other

print(new > other) # False
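
For readers who want the interval logic behind the first two checks spelled out, here is a plain-Python sketch using (start, end) tuples with the same coordinates as above. It assumes closed intervals, where both ends are included. How eccLib treats the boundaries, for instance two entries that merely touch, is not stated on this page, so treat the functions below as an illustration rather than the library's definition.

```python
def overlaps(a, b):
    """True if closed intervals a and b share at least one position."""
    return a[0] <= b[1] and b[0] <= a[1]


def contains(outer, inner):
    """True if inner lies entirely within outer."""
    return outer[0] <= inner[0] and inner[1] <= outer[1]


new = (0, 5)
other = (1, 4)

print(overlaps(new, other))  # True
print(contains(new, other))  # True
print(contains(other, new))  # False
```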

Overall, eccLib has adapted many of the standard Python operators to work in the context of GTF entries, so you can easily compare, get the length of, or iterate over GTF entries. Do you want to get the GTF representation of a sequence?

Getting the GTF representation of a sequence
import eccLib

new = eccLib.GtfDict(
    None, None, "seq", 0, 5, None, True, None
)

print(str(new)) # .\t.\tseq\t0\t5\t.\t+\t.
print(repr(new)) # {'seqname': None, 'source': None, 'feature': 'seq', 'start': 0, 'end': 5, 'score': None, 'strand': True, 'frame': None}
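
The relationship between the two outputs above can be sketched in plain Python: missing values become a dot and the fields are joined with tabs. This is an illustration built from the output shown here, not eccLib's code. The helpers format_field and format_entry are hypothetical, and rendering a False strand as a minus sign is an assumption.

```python
FIELDS = ("seqname", "source", "feature", "start", "end", "score", "strand", "frame")


def format_field(name, value):
    """Render one field the way the str() output above shows it."""
    if value is None:
        return "."
    if name == "strand":
        # True -> '+' matches the output above; False -> '-' is an assumption.
        return "+" if value else "-"
    return str(value)


def format_entry(entry):
    """Join all fixed fields of one entry into a tab-separated line."""
    return "\t".join(format_field(name, entry[name]) for name in FIELDS)


entry = {
    "seqname": None, "source": None, "feature": "seq", "start": 0,
    "end": 5, "score": None, "strand": True, "frame": None,
}

line = format_entry(entry)
print(line)
```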

All of the standard dict’s functionality has been implemented, so you can still do things like get the keys, values, or items of a GtfDict. Iteration also works as expected.

Iterating over a GTF entry
import eccLib

new = eccLib.GtfDict(
    None, None, "seq", 0, 5, None, True, None
)

for key, value in new.items():
    print(key, value)

for key in new:
    print(key)

FASTA file parsing

Here the matter is much simpler. You can parse a FASTA file with a single line of code.

Parsing a FASTA file
import eccLib

with open('example.fna', 'r') as f:
    fasta = eccLib.parseFASTA(f)

There’s also an iterative parser called FastaFile, in the spirit of the previously shown GtfFile.

Iterating over a FASTA file
import eccLib

with eccLib.FastaFile('example.fna') as fasta:
    for name, seq in fasta:
        print(name, seq)
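
If you are unsure what the (name, seq) pairs correspond to in the file, the plain-Python sketch below walks a small FASTA text and yields one pair per record. It is not eccLib's parser: the generator iter_fasta is hypothetical, and whether eccLib keeps the whole header line as the name, or returns sequences as plain strings, is not stated on this page.

```python
from io import StringIO


def iter_fasta(handle):
    """Yield (name, sequence) pairs from an open FASTA text stream."""
    name, chunks = None, []
    for raw in handle:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = line[1:], []
        elif name is not None:
            chunks.append(line)  # sequence data may span several lines
    if name is not None:
        yield name, "".join(chunks)


sample = StringIO(">seq1 first record\nACGT\nTTGA\n>seq2\nGGCC\n")

records = list(iter_fasta(sample))
print(records)
```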

And that’s it. You can now access the sequences in the FASTA file sequentially. Want to retrieve an annotated sequence from a FASTA file?

Retrieving an annotated sequence from a FASTA file
import eccLib

with open('example.fna', 'r') as f:
    fasta = eccLib.parseFASTA(f)

new = eccLib.GtfDict(
    None, None, "seq", 0, 5, None, True, None
)

seq = fasta[0].get_annotated(new)

print(seq)