eccLib Cookbook
On this page you will find a collection of examples that demonstrate how to use eccLib. They are meant to show both the capabilities and the simplicity of the library.
GTF file parsing
The following example demonstrates how to parse a GTF file using eccLib.
import eccLib
with open('example.gtf', 'r') as f:
    gtf = eccLib.parseGTF(f)
With the above code snippet, you can easily parse a GTF file. Now what if we wanted to parse a GTF file with console output?
import eccLib
from sys import stdout
with open('example.gtf', 'r') as f:
    gtf = eccLib.parseGTF(f, echo=stdout)
This will print the parsing progress to the console. In theory, any file-like object can be passed as the echo argument, so you could also send the output to a file or a socket. OK, so we have seen simple parsing, but what if we only wanted the first 10 lines of the GTF file?
import eccLib
file = eccLib.GtfFile('example.gtf')
target = eccLib.GtfList()
for i, line in enumerate(file):
    if i == 10:
        break
    target.append(line)
A bit more verbose, but with the iterative approach you can easily control how many lines you want to read and how many entries remain in memory. And what if we wanted to parse GTF from a string or an io object because we retrieved it from a database or a network connection?
import eccLib
from io import StringIO
gtf_string = '1\tensembl_havana\tintron\t1471765\t1497848\t0.0\t+\t.\n'
gtf = eccLib.parseGTF(gtf_string)
io_obj = StringIO(gtf_string)
gtf = eccLib.parseGTF(io_obj)
What about iterative parsing of that string?
import eccLib
from io import StringIO
gtf_string = '1\tensembl_havana\tintron\t1471765\t1497848\t0.0\t+\t.\n'
gtf = eccLib.GtfReader(StringIO(gtf_string))
for line in gtf:
    print(line)
But what if we have serialized data within attributes? There is a way of parsing attributes into something other than a string. Both parsers accept an argument called attr_tp, which maps attribute names to functions that parse the attribute value into the desired type. For example, let’s say we want to parse an attribute called clustered into a boolean value and score_2 into a float value.
import eccLib
with open('example.gtf', 'r') as f:
    gtf = eccLib.parseGTF(f, attr_tp={'clustered': lambda x: x.lower() == 'true', 'score_2': float})
Thanks to this nifty feature, we can easily serialize and deserialize more complex datasets using GTFs.
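To make the role of the converter mapping more concrete, here is a minimal pure-Python sketch of the idea. It does not call eccLib: apply_converters and parse_bool are hypothetical helpers written only for illustration, and the assumption is that each converter receives the raw attribute value as a string, as the lambda in the example above suggests.

```python
# Hypothetical illustration only: this mimics the idea behind attr_tp,
# it is not eccLib's implementation.

def parse_bool(raw: str) -> bool:
    """Convert a textual flag such as 'true' or 'True' into a bool."""
    return raw.strip().lower() == 'true'

# Same shape as the attr_tp argument: attribute name -> converter function
converters = {'clustered': parse_bool, 'score_2': float}

def apply_converters(raw_attrs: dict, converters: dict) -> dict:
    """Run each raw string through its converter; leave the rest untouched."""
    return {
        key: converters[key](value) if key in converters else value
        for key, value in raw_attrs.items()
    }

raw = {'gene_id': 'ENSG0001', 'clustered': 'True', 'score_2': '0.75'}
typed = apply_converters(raw, converters)
print(typed)
```

A named function such as parse_bool is just another callable, so it can stand in wherever a lambda would be used as a converter, which keeps longer conversions readable.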
GTF processing
While the feature set of eccLib is small, it still provides the core utilities one would expect from a GTF parser. Here’s how you can filter GTF entries:
import eccLib
with open('example.gtf', 'r') as f:
    gtf = eccLib.parseGTF(f)
# Filter by chromosome
first_chr = gtf.find(seqname='chr1')
# Filter using a function
score_greater_than_length = gtf.find(lambda x: x.score > len(x))
# Filter using a function called on key-value pair
gene_id_present = gtf.find(gene_id=lambda x: x is not None)
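As a mental model for the three filtering styles above, the following sketch reproduces them on plain dictionaries. It is not eccLib code: find_like is a hypothetical stand-in, and the choice to combine several criteria with a logical AND is an assumption made for this illustration.

```python
# Mental model only: plain dicts stand in for GTF entries, and
# find_like is a hypothetical stand-in for the find method.

def find_like(entries, *predicates, **criteria):
    """Return the entries that satisfy every predicate and keyword criterion."""
    def matches(entry):
        # Positional callables receive the whole entry
        if not all(pred(entry) for pred in predicates):
            return False
        for key, wanted in criteria.items():
            value = entry.get(key)
            # A callable criterion is applied to the value, anything else is compared
            ok = wanted(value) if callable(wanted) else value == wanted
            if not ok:
                return False
        return True
    return [entry for entry in entries if matches(entry)]

entries = [
    {'seqname': 'chr1', 'start': 10, 'end': 50, 'gene_id': 'g1'},
    {'seqname': 'chr2', 'start': 5, 'end': 8, 'gene_id': None},
]
on_chr1 = find_like(entries, seqname='chr1')
with_gene = find_like(entries, gene_id=lambda x: x is not None)
short = find_like(entries, lambda e: e['end'] - e['start'] < 10)
print(len(on_chr1), len(with_gene), len(short))
```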
So now we know how to filter GTF entries, but what if we wanted to sort them?
import eccLib
with open('example.gtf', 'r') as f:
    gtf = eccLib.parseGTF(f)
sorted_gtf = gtf.sort()
By default, GTF entries are compared to each other by their position in the sequence, so simply calling the sort method will order the entries by position. In most cases the order of the entries in the GTF doesn’t matter, but you will often find yourself working with a GTF file containing entries from different sequences. In this case you might want to separate the entries by sequence.
import eccLib
with open('example.gtf', 'r') as f:
    gtf = eccLib.parseGTF(f)
grouped_gtf = gtf.sq_split()
print(grouped_gtf['chr1']) # Print all entries from chromosome 1
Quite handy, especially with those large GTF files. And what if we wanted to get all the unique values of a column?
import eccLib
with open('example.gtf', 'r') as f:
    gtf = eccLib.parseGTF(f)
features = set(gtf.column('feature'))
print(features)
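If you also want to know how often each value occurs, the standard library's collections.Counter works on any iterable of hashable values. The sketch below uses a plain list as a stand-in for the result of the column call above; that the column yields hashable values such as strings is an assumption.

```python
from collections import Counter

# Stand-in for gtf.column('feature'); assumed to be an iterable of strings
feature_column = ['gene', 'exon', 'exon', 'intron', 'exon']

unique_features = set(feature_column)     # distinct values, order not guaranteed
feature_counts = Counter(feature_column)  # value -> number of occurrences

print(sorted(unique_features))
print(feature_counts.most_common(1))
```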
And how do I then export the GTF entries to a file?
import eccLib
with open('example.gtf', 'r') as f:
    gtf = eccLib.parseGTF(f)
with open('output.gtf', 'w') as f:
    f.write(str(gtf))
Sequence processing
So we have shown off some of the GTF processing capabilities of eccLib, but what about processing individual GTF entries? The GtfDict class is quite handy here as well.
import eccLib
new = eccLib.GtfDict(
"1", "ensembl_havana", "intron", 1471765, 1497848, 0.0, True, None, extra="extra"
)
# We can access the values of the GTF entry like this
print(new.seqname) # 1
# or like this
print(new['seqname']) # 1
# additional attributes can only be accessed like this
print(new['extra']) # extra
Do you want to quickly check whether two sequences overlap? Whether one is contained within the other? Or whether one is upstream of the other? eccLib has you covered.
import eccLib
new = eccLib.GtfDict(
None, None, "seq", 0, 5, None, True, None
)
other = eccLib.GtfDict(
None, None, "seq", 1, 4, None, True, None
)
# Check if the two entries overlap
print(new.overlaps(other)) # True
# Check if one entry is contained within the other
print(new.contains(other)) # True
# Check if one entry is upstream of the other
print(new > other) # False
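For intuition, the overlap and containment checks above can be pictured as simple comparisons on start and end coordinates. The helpers below are hypothetical and treat coordinates as closed intervals [start, end], which is an assumption; eccLib may also take the sequence name or strand into account, which this sketch does not model.

```python
# Assumption: coordinates are treated as closed intervals [start, end].
# These helpers are hypothetical and only mirror the checks shown above.

def overlaps(a_start, a_end, b_start, b_end):
    """True when the two intervals share at least one position."""
    return a_start <= b_end and b_start <= a_end

def contains(a_start, a_end, b_start, b_end):
    """True when interval b lies entirely inside interval a."""
    return a_start <= b_start and b_end <= a_end

print(overlaps(0, 5, 1, 4))  # same coordinates as the entries above
print(contains(0, 5, 1, 4))
print(contains(1, 4, 0, 5))  # containment is not symmetric
print(overlaps(0, 5, 6, 9))  # disjoint intervals
```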
eccLib overall has adapted a lot of the standard Python operators to work in the context of GTF entries. So you can easily compare, get the length of, or iterate over GTF entries. Do you want to get the GTF representation of a sequence?
import eccLib
new = eccLib.GtfDict(
None, None, "seq", 0, 5, None, True, None
)
print(str(new)) # .\t.\tseq\t0\t5\t.\t+\t.
print(repr(new)) # {'seqname': None, 'source': None, 'feature': 'seq', 'start': 0, 'end': 5, 'score': None, 'strand': True, 'frame': None}
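The string form shown above follows the usual GTF layout: tab-separated columns with a dot for missing values. The sketch below rebuilds that layout by hand for the eight core fields. format_gtf_line is a hypothetical helper, not part of eccLib, and mapping a False strand to '-' is an assumption based on common GTF usage.

```python
# Hypothetical helper: shows the textual layout only, not eccLib's code.

def format_gtf_line(seqname, source, feature, start, end, score, strand, frame):
    """Join the eight core fields with tabs, writing '.' for missing values."""
    if strand is None:
        strand_text = '.'
    else:
        # Assumption: True is the forward strand, False the reverse strand
        strand_text = '+' if strand else '-'
    fields = [seqname, source, feature, start, end, score, strand_text, frame]
    return '\t'.join('.' if field is None else str(field) for field in fields)

line = format_gtf_line(None, None, 'seq', 0, 5, None, True, None)
print(repr(line))
```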
All of the standard dict’s functionality has been implemented, so you can still do things like get the keys, values, or items of a GtfDict. Iteration also works as expected.
import eccLib
new = eccLib.GtfDict(
None, None, "seq", 0, 5, None, True, None
)
for key, value in new.items():
    print(key, value)
for key in new:
    print(key)
FASTA file parsing
Here matters are quite a bit simpler. You can parse a FASTA file with a single call.
import eccLib
with open('example.fna', 'r') as f:
    fasta = eccLib.parseFASTA(f)
There’s also an iterative parser called FastaFile, in the spirit of the previously shown GtfFile.
import eccLib
with eccLib.FastaFile('example.fna') as fasta:
    for name, seq in fasta:
        print(name, seq)
And that’s it. You can now access the sequences in the FASTA file sequentially. Want to retrieve an annotated sequence from a FASTA file?
import eccLib
with open('example.fna', 'r') as f:
    fasta = eccLib.parseFASTA(f)
new = eccLib.GtfDict(
None, None, "seq", 0, 5, None, True, None
)
seq = fasta[0].get_annotated(new)
print(seq)
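For readers less familiar with the format, the (name, seq) pairs produced by the iterative parser correspond to a header line starting with '>' and the sequence lines that follow it. The standalone sketch below shows that correspondence in plain Python. iter_fasta is a hypothetical helper and says nothing about how eccLib parses FASTA internally or which types it returns.

```python
from io import StringIO

def iter_fasta(handle):
    """Yield (name, sequence) pairs from a FASTA-formatted text stream."""
    name, chunks = None, []
    for raw_line in handle:
        line = raw_line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith('>'):
            # A new header closes the previous record, if there was one
            if name is not None:
                yield name, ''.join(chunks)
            name, chunks = line[1:], []
        elif name is not None:
            chunks.append(line)
    # Emit the final record once the input is exhausted
    if name is not None:
        yield name, ''.join(chunks)

example = StringIO('>seq1 demo\nACGT\nTTGA\n>seq2\nGGCC\n')
records = list(iter_fasta(example))
print(records)
```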