eccLib Cookbook¶
On this page you will find a collection of examples that demonstrate how to use
eccLib. These examples should well demonstrate the capabilities, and the
simplicity of eccLib.
GTF file parsing¶
The following example demonstrates how to parse a GTF file using eccLib.
import eccLib
with open('test/example.gtf', 'r') as f:
gtf = eccLib.parseGTF(f)
With the above code snippet, you can easily parse a GTF file. Now what if we wanted to parse a GTF file with console output?
import eccLib
from sys import stdout
with open('test/example.gtf', 'r') as f:
gtf = eccLib.parseGTF(f, echo=stdout)
Note
Any sort of logging will slow down the parsing, so weigh the benefits of console output against the performance impact.
This will print the parsing progress to the console. You can in theory use any file-like object as the echo argument, so you could also write echo to a file or a socket. Ok so we have seen simple parsing, but what if we wanted to get only the first 10 lines of the GTF file?
import eccLib
with eccLib.GtfFile("test/example.gtf") as file:
target = eccLib.GtfList()
for i, line in enumerate(file):
if i == 10:
break
target.append(line)
assert len(target) == 9
A bit more verbose, but with the iterative approach you can easily control how many lines you want to read and how many entries remain in memory. And what if we wanted to parse GTF from a string or an io object because we retrieved it from a database or a network connection?
import eccLib
from io import StringIO
gtf_string = "1\tensembl_havana\tintron\t1471765\t1497848\t0.0\t+\t.\n"
gtf1 = eccLib.parseGTF(gtf_string)
io_obj = StringIO(gtf_string)
gtf2 = eccLib.parseGTF(io_obj)
assert gtf1 == gtf2
print(repr(gtf1))
[{'seqname': '1', 'source': 'ensembl_havana', 'feature': 'intron', 'start': 1471765, 'end': 1497848, 'score': 0.0, 'reverse': False, 'frame': None}]
What about iterative parsing of that string?
import eccLib
from io import StringIO
gtf_string = "1\tensembl_havana\tintron\t1471765\t1497848\t0.0\t+\t.\n"
gtf = eccLib.GtfReader(StringIO(gtf_string))
# GtfReader is not an Iterable, it's an Iterator
while True:
try:
line = next(gtf)
print(repr(line))
except StopIteration:
break
{'seqname': '1', 'source': 'ensembl_havana', 'feature': 'intron', 'start': 1471765, 'end': 1497848, 'score': 0.0, 'reverse': False, 'frame': None}
But what if we have serialized data within attributes? Well we have a way of parsing attributes into something other than a string. Both parsers accept an argument called attr_tp which maps attribute names, to functions that should parse the attribute value into the desired type. So for example, let’s say we want to tweak an attribute called gene_id, and parse gene_version into an integer value.
import eccLib
with open("test/example.gtf", "r") as f:
gtf = eccLib.parseGTF(f, attr_tp={"gene_id": lambda x: x.replace("ENSG", ""), "gene_version": int})
assert all(isinstance(line["gene_id"], str) and not line["gene_id"].startswith("ENSG") for line in gtf)
assert all(isinstance(line["gene_version"], int) for line in gtf)
Thanks to this nifty feature, we can easily serialize and serialize more complex datasets into GTFs.
GTF processing¶
While the feature set of eccLib is small, it still provides the core utilities
one would expect from a GTF parser. Here’s how you can filter GTF entries using
utilities provided by eccLib.GtfList:
import eccLib
with open("test/example.gtf", "r") as f:
gtf = eccLib.parseGTF(f)
# Filter by chromosome
first_chr = gtf.find(seqname='chr1')
assert len(first_chr) < len(gtf)
# Filter using a function
score_greater_than_length = gtf.find(lambda x: x.score > len(x) if x.score is not None else False)
assert len(score_greater_than_length) == 0
# Filter using a function called on key-value pair
gene_id_present = gtf.find(gene_id=lambda x: x is not None)
assert len(gene_id_present) == len(gtf)
So now we know how to filter GTF entries, but what if we wanted to sort them?
import eccLib
sorted_gtf = sorted(gtf)
assert len(sorted_gtf) == len(gtf)
assert sorted_gtf[0].start < sorted_gtf[-1].start
from copy import copy
sorted_gtf_2 = copy(gtf)
sorted_gtf_2.sort()
assert sorted_gtf == sorted_gtf_2
By default GTF entries are compared to each other by their position in the
sequence, so simply calling the list.sort() method will sort
the entries by their position in the sequence. In most cases the order of the
entries in the GTF doesn’t matter, but often you might find yourself working
with a GTF file containing entries that are from different sequences. In this c
ase you might want to separate the entries by sequence.
import eccLib
grouped_gtf = gtf.sq_split()
assert isinstance(grouped_gtf, dict)
chr1 = grouped_gtf["1"] # Get all entries from chromosome 1
assert isinstance(chr1, eccLib.GtfList)
Quite handy especially with those large GTF files. And what if we wanted to get all the unique values of a column?
Column View¶
While GTFs might not specify a common row-width accross entries, it’s rather
common for files to actually be structured with a fixed row-width. As such,
eccLib allows you to easily view and manipulate the columns of a GTF file.
This is most easily done using the eccLib.GtfList.column() method.
import eccLib
columns = gtf.column("feature")
assert isinstance(columns, eccLib.GtfList_ColumnView)
assert len(columns) == len(gtf)
features = set(columns)
assert features == {'exon', 'transcript', 'gene', 'CDS'}
That method, returns a special type; eccLib.GtfList_ColumnView.
It acts like a dict’s collections.abc.ValuesView, and a
collections.abc.Sequence, thus meaning you can expect it to work
like a tuple, while pointing to the same underlying data. This is useful for
performance and memory efficiency.
Note
Since eccLib.GtfList_ColumnView is a view, it will not allocate
any memory, unless you force it to by wrapping it in a list or
tuple.
One major upside of using the eccLib.GtfList_ColumnView, is that
it’s easily converted to most Python sequence and iterable types. Most notably,
it can be explicitly converted to numpy.ndarray using
numpy.array().
import eccLib
import numpy as np
arr = np.array(gtf.column("start"), dtype=np.int32)
assert isinstance(arr, np.ndarray)
assert arr.shape == (len(gtf),)
For more details regarding converting data in eccLib.GtfList to
more advanced Python types see Conversion guide.
Warning
The view does hold a reference to the
eccLib.GtfList it was created from, so memory associated with
the eccLib.GtfList will not be freed until the view is garbage
collected.
Annotation processing¶
So we have shown off some of the GTF processing capabilities of eccLib, but
what about individual GTF entry processing? Well the eccLib.GtfDict
class is quite handy as well.
import eccLib
new = eccLib.GtfDict(
"1", "ensembl_havana", "intron", 1471765, 1497848, 0.0, True, None, extra="extra"
)
# We can access the values of the GTF entry like this
print(new.seqname) # 1
# or like this
print(new['seqname']) # 1
# additional attributes can only be accessed like this
print(new['extra']) # extra
1
1
extra
Do you want to quickly check if two sequences overlap? One is contained within the other? Or if one is upstream of the other? eccLib has you covered.
import eccLib
new = eccLib.GtfDict(
None, None, "seq", 0, 5, None, True, None
)
other = eccLib.GtfDict(
None, None, "seq", 1, 4, None, True, None
)
# Check if the two entries overlap
assert new.overlaps(other) is True
assert new.coverage(other) > 0
# Check if one entry is contained within the other
assert new.contains(other) is True
# Check if one entry is upstream of the other
assert not new > other
eccLib overall has adapted a lot of the standard Python operators to work in the context of GTF entries. All of the standard dict’s functionality has been implemented, so you can still do things like get the keys, values, or items of a GtfDict. Iteration also works as expected.
import eccLib
new = eccLib.GtfDict(
None, None, "seq", 0, 5, None, True, None
)
for key, value in new.items():
print(key, value)
for key in new:
print(key)
seqname None
source None
feature seq
start 0
end 5
score None
reverse True
frame None
seqname
source
feature
start
end
score
reverse
frame
FASTA file parsing¶
Well here the matter is quite simpler. You can parse a FASTA file with a single line of code.
import eccLib
with open("test/example.fna", "r") as f:
fasta = eccLib.parseFASTA(f)
assert len(fasta) == 2
assert fasta[0][0] == "test"
assert len(fasta[0][1]) == 240
print(fasta[0][1])
CCTTTAAGCACTCCTTTTTAGTTATCCCCACCTGCCCAGCTCCCTTATTAGGCTGAGACACTTTAACTAAATTATCTGCTTCCCTGACTATTCCTGGATTACAGCTATATCTCATTGCTGCCCTTCTTCCCAATCCAAAGCCTCCTTTGTGTCCTCCTCTTGTATCCCCCCACCTTAACCCACAAGTATAAGATACCTCTACTCCCTCCTTGGCGACTGATCATGCACCCCTTACCATCT
There’s also an iterative parser, in the spirit of the previously shown
eccLib.GtfFile called eccLib.FastaFile.
import eccLib
with eccLib.FastaFile("test/example.fna") as fasta:
for name, seq in fasta:
print(name, len(seq))
test 240
test 240
And that’s it. You can now access the sequences in the FASTA file sequentially. Want to retrieve an annotated sequence from a FASTA file?
import eccLib
with open("test/example.fna", "r") as f:
fasta = eccLib.parseFASTA(f)
new = eccLib.GtfDict(
None, None, "seq", 0, 5, None, True, None
)
seq = fasta[0][1].get_annotated(new)
print(seq)
CCTTT
FastaBuff Viewing¶
Fasta sequences can be rather long. As such, it would be rather self-defeating
to the whole idea of eccLib.FastaBuff, to force users to load the
entire sequence into str to view it. To avoid this,
eccLib.FastaBuff will avoid creating a str
representation of the sequence, instead returning
eccLib.FastaBuff_view objects, wherever possible.
import eccLib
fasta_buffer = eccLib.FastaBuff("ACGT")
view = fasta_buffer[0:3]
print(view)
ACG
These views alias the same underlying memory buffer, so you can consider their creation and storage as zero-cost. These views should allow you to do any slicing or indexing you need to do with the sequence, without loading the entire thing into memory.
Note
These views are read-only. You cannot modify the sequence through them, however, changes to the underlying buffer will be reflected in the view.
eccLib.FastaBuff_view does support advanced indexing and slicing,
meaning you can use negative indices, slicing with step sizes, and slicing
with slice objects.
import eccLib
fasta_buffer = eccLib.FastaBuff("ACGTACGTACGTACGT")
view = fasta_buffer[::4]
print(view)
assert fasta_buffer[0:4] == fasta_buffer[-4:]
AAAA
Warning
Same as with eccLib.GtfList_ColumnView, these views do hold a
reference to the underlying memory, meaning, the base
eccLib.FastaBuff object must be kept alive as long as the view
is in use.
If you wish to have an owned memory buffer, you can always just convert the view into a buffer.
import eccLib
assert fasta_buffer == eccLib.FastaBuff(fasta_buffer[:])