eccLib Module

A Python module written in C for fast parsing of genomic files and for genomic context analysis. Its classes parse GTF and FASTA files and perform common operations on the parsed data. The C implementation targets speed and memory efficiency in bioinformatics applications.

class eccLib.GtfDict(seqname: str | None = None, source: str | None = None, feature: str | None = None, start: int | None = None, end: int | None = None, score: float | None = None, reverse: bool | None = None, frame: int | None = None, **kwargs: Any)
class eccLib.GtfDict(toConvert: Mapping[str, Any])

Bases: Mapping[str, Any]

A mapping object that is guaranteed to have all the keys required by the GTF specification. You can access those keys as attributes or through the mapping interface. The methods provided let you compare entries and check for overlap, containment, and coverage.

Note that this class overrides the equality and ordering comparisons (==, !=, <, <=, >, >=) as well as containment checks (in). This class is also hashable.

seqname: str | None

The name of the sequence being annotated

source: str | None

Where this annotation comes from

feature: str | None

What this annotation is meant to represent

start: int | None

The start nucleotide position

end: int | None

The end nucleotide position

score: float | None

A score associated with the sequence

reverse: bool | None

Whether the sequence is located on the reverse strand

frame: int | None

Indicates which base of the feature is the first base of a codon

overlaps(other: GtfDict | Mapping[str, Any]) bool

Returns true if the provided entry's sequence overlaps with this entry's sequence. seqname must be equal for both entries, and reverse must either be equal or be None in at least one of them.

Parameters:

other (GtfDict | Mapping[str, Any]) – The sequence to check for overlap

Returns:

Whether the sequences overlap

Return type:

bool
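The overlap rule described above can be sketched in plain Python. This is only an illustration, not eccLib's implementation: the function name overlaps_sketch is hypothetical, plain dicts stand in for GtfDict, and start and end are assumed to be inclusive, as in the GTF format.

```python
from typing import Any, Mapping


def overlaps_sketch(a: Mapping[str, Any], b: Mapping[str, Any]) -> bool:
    """Illustrative overlap test (hypothetical helper, not part of eccLib)."""
    # Entries on different sequences never overlap.
    if a["seqname"] != b["seqname"]:
        return False
    # Strands must match unless at least one of them is unknown (None).
    if (
        a["reverse"] is not None
        and b["reverse"] is not None
        and a["reverse"] != b["reverse"]
    ):
        return False
    # Assumption: start and end are both inclusive coordinates.
    return a["start"] <= b["end"] and b["start"] <= a["end"]


gene = {"seqname": "chr1", "start": 100, "end": 200, "reverse": False}
exon = {"seqname": "chr1", "start": 150, "end": 250, "reverse": None}
elsewhere = {"seqname": "chr2", "start": 150, "end": 250, "reverse": False}

print(overlaps_sketch(gene, exon))       # True
print(overlaps_sketch(gene, elsewhere))  # False
```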

contains(other: GtfDict | Mapping[str, Any]) bool

Returns true if the other entry's sequence lies entirely inside this entry's sequence. seqname must be equal for both entries, and reverse must either be equal or be None in at least one of them.

Parameters:

other (GtfDict | Mapping[str, Any]) – The sequence to check for containment

Returns:

Whether the sequence contains the other sequence

Return type:

bool

coverage(other: GtfDict | Iterable[GtfDict]) float

Returns the coverage of the provided sequence over this sequence. The coverage is defined as the length of the overlap divided by the length of this sequence.

Parameters:

other (GtfDict | Iterable[GtfDict]) – The sequence or sequences whose coverage of this sequence is computed

Returns:

The coverage of the provided sequence over this sequence

Return type:

float
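As a worked illustration of the definition above (overlap length divided by the length of this sequence), here is a minimal sketch for a single interval. The name coverage_sketch is hypothetical, coordinates are assumed to be inclusive, and how eccLib combines several entries passed as an iterable is not covered here.

```python
def coverage_sketch(this_start: int, this_end: int,
                    other_start: int, other_end: int) -> float:
    """Fraction of [this_start, this_end] covered by [other_start, other_end].

    Hypothetical helper; assumes inclusive coordinates.
    """
    this_length = this_end - this_start + 1
    # Length of the intersection; negative or zero means no overlap.
    overlap = min(this_end, other_end) - max(this_start, other_start) + 1
    return max(overlap, 0) / this_length


# This sequence spans 100..199 (100 nt); the other covers 150..199 of it.
print(coverage_sketch(100, 199, 150, 249))  # 0.5
print(coverage_sketch(100, 199, 300, 400))  # 0.0
```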

keys() list[str]

Returns the keys of the GtfDict

Returns:

The keys of the GtfDict

Return type:

list[str]

values() list[Any]

Returns the values of the GtfDict

Returns:

The values of the GtfDict

Return type:

list[Any]

pop(key: str) Any

Removes the key and returns the value

Parameters:

key (str) – The key to remove

Returns:

The value of the key

Return type:

Any

get(key: str, default: Any = None) Any

Returns the value of the key or the default value if the key is not found

Parameters:
  • key (str) – The key to get the value for

  • default (Any, optional) – The default value to return. Defaults to None.

Returns:

The value of the key or the default value

Return type:

Any

items() list[tuple[str, Any]]

Returns the items of the GtfDict

Returns:

The items of the GtfDict

Return type:

list[tuple[str, Any]]

update(other: GtfDict | Mapping[str, Any]) None

Updates the GtfDict with the provided dict

Parameters:

other (GtfDict | Mapping[str, Any]) – The dict to update with

__eq__(check: GtfDict | Mapping[str, Any]) bool

Compares seqname, feature, start, end and reverse

Parameters:

check (GtfDict | Mapping[str, Any]) – The object to compare to

Returns:

Whether the objects are equal

Return type:

bool

__ne__(check: GtfDict | Mapping[str, Any]) bool

Compares seqname, feature, start, end and reverse, and returns true if any of them differ

Parameters:

check (GtfDict | Mapping[str, Any]) – The object to compare to

Returns:

Whether the objects are not equal

Return type:

bool

__lt__(other: GtfDict | Mapping[str, Any]) bool

Checks if self is before other without overlap

Parameters:

other (GtfDict | Mapping[str, Any]) – The object to compare to

Returns:

self.end < other.start

Return type:

bool

__le__(other: GtfDict | Mapping[str, Any]) bool

Checks if self is before other with possible overlap

Parameters:

other (GtfDict | Mapping[str, Any]) – The object to compare to

Returns:

self.end <= other.end

Return type:

bool

__gt__(other: GtfDict | Mapping[str, Any]) bool

Checks if self is after other without overlap

Parameters:

other (GtfDict | Mapping[str, Any]) – The object to compare to

Returns:

self.start > other.end

Return type:

bool

__ge__(other: GtfDict | Mapping[str, Any]) bool

Checks if self is after other with possible overlap

Parameters:

other (GtfDict | Mapping[str, Any]) – The object to compare to

Returns:

self.start >= other.start

Return type:

bool
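The four ordering rules documented above can be mirrored on a small stand-in class to see how they behave for overlapping entries. Span is a hypothetical class used only for illustration; it copies the documented return expressions and nothing else.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    """Hypothetical stand-in that copies the documented comparison rules."""
    start: int
    end: int

    def __lt__(self, other: "Span") -> bool:
        return self.end < other.start      # strictly before, no overlap

    def __le__(self, other: "Span") -> bool:
        return self.end <= other.end       # before, overlap allowed

    def __gt__(self, other: "Span") -> bool:
        return self.start > other.end      # strictly after, no overlap

    def __ge__(self, other: "Span") -> bool:
        return self.start >= other.start   # after, overlap allowed


a, b, c = Span(100, 200), Span(150, 250), Span(300, 400)

print(a < b, a > b)    # False False
print(a <= b, a >= b)  # True False
print(a < c, c > a)    # True True
```

Because a and b overlap, neither a < b nor a > b holds, so these comparisons do not define a total order over overlapping entries.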

__getitem__(key: str) Any

Returns the value of the provided key

Parameters:

key (str) – The key to get the value for

Returns:

The value of the key

Return type:

Any

__setitem__(key: str, value: Any) None

Sets the value of the provided key

Parameters:
  • key (str) – The key to set the value for

  • value (Any) – The value to set

__delitem__(key: str) None

Deletes the provided key

Parameters:

key (str) – The key to delete

__iter__() Iterator[str]

Returns an iterator over the keys of the GtfDict

Returns:

An iterator over the keys of the GtfDict

Return type:

Iterator[str]

__hash__() int

Returns a hash of the GtfDict

Raises:

TypeError – If one of the stored values is unhashable. This can only happen with attributes.

Returns:

A hash of the GtfDict

Return type:

int

__contains__(other: GtfDict | Mapping[str, Any]) bool

Returns true if the other entry's sequence lies entirely inside this entry's sequence. seqname must be equal for both entries, and reverse must either be equal or be None in at least one of them.

Parameters:

other (GtfDict | Mapping[str, Any]) – The sequence to check for containment

Returns:

Whether the sequence contains the other sequence

Return type:

bool

__len__() int

Returns the length of the sequence

Returns:

The length of the sequence

Return type:

int

__repr__() str

Returns a representation of the GtfDict. This is equivalent to str(dict(self)).

Returns:

A representation of the GtfDict

Return type:

str

__str__() str

Returns GTF representation of the sequence

Returns:

A valid GTF entry

Return type:

str
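For orientation, a GTF entry is a single line of nine tab-separated columns, with "." standing for a missing value and the last column holding key "value"; attribute pairs. The sketch below formats a plain dict that way. The helper to_gtf_line is hypothetical, and the exact text produced by eccLib (spacing, attribute order, quoting) may differ.

```python
from typing import Any, Mapping

# The eight fixed keys; every other key is treated as an attribute.
CORE_KEYS = ("seqname", "source", "feature", "start", "end",
             "score", "reverse", "frame")


def to_gtf_line(entry: Mapping[str, Any]) -> str:
    """Format a mapping as one GTF line (hypothetical helper)."""
    def fmt(value: Any) -> str:
        return "." if value is None else str(value)

    # reverse=True maps to "-", False to "+", None to "." (unknown strand).
    strand = {True: "-", False: "+", None: "."}[entry.get("reverse")]
    fields = [
        fmt(entry.get("seqname")), fmt(entry.get("source")),
        fmt(entry.get("feature")), fmt(entry.get("start")),
        fmt(entry.get("end")), fmt(entry.get("score")),
        strand, fmt(entry.get("frame")),
    ]
    attributes = " ".join(
        f'{key} "{value}";' for key, value in entry.items()
        if key not in CORE_KEYS
    )
    return "\t".join(fields + [attributes])


example_entry = {
    "seqname": "chr1", "source": "havana", "feature": "gene",
    "start": 11869, "end": 14409, "score": None,
    "reverse": False, "frame": None, "gene_id": "ENSG00000223972",
}
gtf_line = to_gtf_line(example_entry)
print(gtf_line)
```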

class eccLib.GtfList(*args: GtfDict)
class eccLib.GtfList(obj: Sequence[GtfDict])
class eccLib.GtfList(obj: Iterator[GtfDict])

Bases: list[GtfDict]

A subclass of list that holds only GtfDicts. It is equivalent to a parsed GTF file, so it is ordered. Because GtfDicts are hashable, instances of this class can also be converted to a set.

find_closest_bound(sequence: GtfDict) GtfDict | None

Finds the entry that most closely bounds the provided sequence. The entry with the closest bound is the one that has the smallest Hausdorff distance to the provided sequence.

Parameters:

sequence (GtfDict) – The sequence to find the closest bounding entry for

Returns:

The entry that most closely bounds the provided sequence

Return type:

GtfDict | None
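For two closed intervals on a line, the Hausdorff distance reduces to the larger of the two endpoint offsets. The sketch below uses that to pick the closest bounding interval. The names hausdorff and closest_bound_sketch are hypothetical, intervals are plain (start, end) tuples, and the rule that only fully enclosing entries qualify is an assumption made for illustration, not something the description above confirms.

```python
from typing import Optional, Sequence, Tuple

Interval = Tuple[int, int]  # (start, end), both inclusive


def hausdorff(a: Interval, b: Interval) -> int:
    """Hausdorff distance between two closed intervals on a line."""
    return max(abs(a[0] - b[0]), abs(a[1] - b[1]))


def closest_bound_sketch(entries: Sequence[Interval],
                         query: Interval) -> Optional[Interval]:
    """Return the enclosing interval nearest to query, or None."""
    # Assumption: only entries that fully enclose the query count as bounds.
    bounding = [e for e in entries if e[0] <= query[0] and query[1] <= e[1]]
    return min(bounding, key=lambda e: hausdorff(e, query), default=None)


candidates = [(0, 1000), (90, 220), (500, 600)]
print(closest_bound_sketch(candidates, (100, 200)))  # (90, 220)
print(closest_bound_sketch([(500, 600)], (100, 200)))  # None
```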

sq_split() dict[str | None, eccLib.GtfList]

Splits the GtfList into separate GtfLists based on seqname. This is very useful for optimizing operations on the GtfList.

Returns:

A dictionary containing GtfLists split by seqname

Return type:

dict[str | None, GtfList]
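The idea behind splitting by seqname is that later pairwise checks only need to consider entries on the same sequence. A minimal sketch of such a grouping, using plain dicts and lists in place of GtfDict and GtfList (sq_split_sketch is a hypothetical name):

```python
from collections import defaultdict
from typing import Any, Dict, List, Mapping, Optional


def sq_split_sketch(
    entries: List[Mapping[str, Any]],
) -> Dict[Optional[str], List[Mapping[str, Any]]]:
    """Group entries by seqname, keeping their original order."""
    groups: Dict[Optional[str], List[Mapping[str, Any]]] = defaultdict(list)
    for entry in entries:
        # Entries without a seqname are collected under the key None.
        groups[entry.get("seqname")].append(entry)
    return dict(groups)


annotations = [
    {"seqname": "chr1", "start": 100},
    {"seqname": "chr2", "start": 50},
    {"seqname": "chr1", "start": 300},
    {"seqname": None, "start": 7},
]
by_seq = sq_split_sketch(annotations)
print(sorted(by_seq, key=str))   # ['None', 'chr1', 'chr2'] once stringified
print(len(by_seq["chr1"]))       # 2
```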

find(*args: Callable[[GtfDict], bool], **kwargs: Any | Callable[[Any], bool]) GtfList

Finds all entries that match the provided conditions. You can provide three types of arguments:

  1. A function that takes a GtfDict and returns a boolean

  2. A keyword argument whose value will be used to filter the GtfList

  3. A keyword argument that is a function that takes the value held under the given key and returns a boolean

Returns:

A GtfList containing all entries that match the provided conditions

Return type:

GtfList
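The three kinds of conditions can be illustrated with a small filter over plain dicts. This is a sketch only: find_sketch is a hypothetical name, it assumes that all conditions must hold at once and that a plain keyword value is matched by equality, neither of which is stated above.

```python
from typing import Any, Callable, List, Mapping


def find_sketch(entries: List[Mapping[str, Any]],
                *predicates: Callable[[Mapping[str, Any]], bool],
                **conditions: Any) -> List[Mapping[str, Any]]:
    """Filter entries by predicates and keyword conditions (all must hold)."""
    def matches(entry: Mapping[str, Any]) -> bool:
        # Kind 1: functions that receive the whole entry.
        if not all(predicate(entry) for predicate in predicates):
            return False
        for key, condition in conditions.items():
            value = entry.get(key)
            if callable(condition):
                # Kind 3: a function applied to the value under that key.
                if not condition(value):
                    return False
            elif value != condition:
                # Kind 2: a plain value, assumed to be compared by equality.
                return False
        return True

    return [entry for entry in entries if matches(entry)]


records = [
    {"seqname": "chr1", "feature": "exon", "start": 100},
    {"seqname": "chr1", "feature": "exon", "start": 200},
    {"seqname": "chr1", "feature": "gene", "start": 200},
    {"seqname": "chr2", "feature": "exon", "start": 300},
]
hits = find_sketch(
    records,
    lambda e: e["seqname"] == "chr1",
    feature="exon",
    start=lambda s: s >= 150,
)
print(hits)  # [{'seqname': 'chr1', 'feature': 'exon', 'start': 200}]
```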

column(key: str, pad: bool = True) list[Any]

Returns a list of values for the provided key

Parameters:
  • key (str) – The key to get the values for

  • pad (bool, optional) – Whether to pad the list with None values for missing keys. Defaults to True. If False, missing keys will cause an exception.

Returns:

A list of values for the provided key

Return type:

list[Any]
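The effect of pad can be shown with a short sketch over plain dicts. column_sketch is a hypothetical name; the exception raised here for a missing key is KeyError, which is an assumption, since the description above only says that an exception occurs.

```python
from typing import Any, List, Mapping


def column_sketch(entries: List[Mapping[str, Any]],
                  key: str, pad: bool = True) -> List[Any]:
    """Collect the value under key from every entry."""
    if pad:
        # Missing keys are filled with None.
        return [entry.get(key) for entry in entries]
    # Assumption: a missing key surfaces as KeyError in this sketch.
    return [entry[key] for entry in entries]


rows = [
    {"seqname": "chr1", "gene_id": "g1"},
    {"seqname": "chr1"},
    {"seqname": "chr2", "gene_id": "g2"},
]
print(column_sketch(rows, "gene_id"))  # ['g1', None, 'g2']

try:
    column_sketch(rows, "gene_id", pad=False)
except KeyError as missing:
    print("missing key:", missing)
```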

__iadd__(value: GtfList) Self

Appends the entries of the provided GtfList to the current GtfList

Parameters:

value (GtfList) – The GtfList to append to the current GtfList

Returns:

The current GtfList with the entries of the provided GtfList appended

Return type:

Self

__add__(value: GtfList) GtfList
__add__(value: Sequence[Any]) list[Any]

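Concatenates this GtfList with the provided sequence and returns a new list. Adding another GtfList yields a GtfList; adding any other sequence yields a plain list.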

__str__() str

Exports the GtfList to a GTF file representation

Returns:

A GTF file representation of the GtfList

Return type:

str

class eccLib.GtfFile(filename: str, attr_tp: Mapping[str, Callable[[str], Any]] | None = None)

Bases: Iterable[GtfDict]

An iterable GTF parser. It reads a GTF file and returns GtfDicts. Once entered, it spawns a GtfReader instance that can be iterated over. This is the iterative parser for GTF files.

__iter__() GtfReader

Initializes a new iterator for the GtfFile instance

Returns:

A new GtfReader instance

Return type:

GtfReader

__enter__() GtfFile

Opens the file and gets ready for reading

Returns:

The GtfFile instance

Return type:

GtfFile

__exit__(*args, **kwargs) None

Closes the file

class eccLib.GtfReader(file: TextIOBase, attr_tp: Mapping[str, Callable[[str], Any]] | None = None)

Bases: Iterator[GtfDict]

A reader instance that iteratively returns GtfDicts from a file object. It can be used standalone, but it’s usually spawned by a GtfFile instance. Please note that standalone instances of GtfReader parse slower than those spawned by GtfFile due to limits imposed by the Python layer. This is the iterator to the iterable GtfFile class.

__next__() GtfDict

Fetches the next GTF record

Returns:

The next GTF record as a dictionary

Return type:

GtfDict
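The split between an iterable that owns the file and an iterator that produces records follows the standard Python iteration protocol. The sketch below shows that pattern with two hypothetical classes, RecordFile and RecordReader, which yield raw lines instead of GtfDicts. Rewinding the file on every __iter__ call is an assumption of this sketch, not documented eccLib behaviour.

```python
import os
import tempfile


class RecordReader:
    """Iterator: returns one non-comment line per call to next()."""

    def __init__(self, file):
        self._file = file

    def __iter__(self):
        return self

    def __next__(self):
        for line in self._file:
            if line.strip() and not line.startswith("#"):
                return line.rstrip("\n")
        raise StopIteration


class RecordFile:
    """Iterable and context manager: owns the file and spawns readers."""

    def __init__(self, filename):
        self._filename = filename
        self._file = None

    def __enter__(self):
        self._file = open(self._filename, "r")
        return self

    def __exit__(self, *args):
        self._file.close()

    def __iter__(self):
        self._file.seek(0)  # assumption: each new reader starts at the top
        return RecordReader(self._file)


# Write a tiny file so the example is self-contained.
with tempfile.NamedTemporaryFile("w", suffix=".gtf", delete=False) as tmp:
    tmp.write("# header\nline one\nline two\n")
    path = tmp.name

with RecordFile(path) as records_file:
    lines = list(records_file)

os.remove(path)
print(lines)  # ['line one', 'line two']
```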

class eccLib.FastaFile(filename: str, binary: bool = True)

Bases: Iterable[tuple[str, str | FastaBuff]]

An iterable FASTA parser. It reads a FASTA file and returns tuples of (header, sequence). Depending on the binary argument, the sequence is returned either as a FastaBuff or as a str. This is the iterative parser for FASTA files.

__iter__() FastaReader

Prepares the reader for iteration and returns this instance

__enter__() FastaFile

Opens the file and gets ready for reading

Returns:

The FastaFile instance

Return type:

FastaFile

__exit__(*args, **kwargs) None

Closes the file

class eccLib.FastaReader(file: TextIOBase, binary: bool = True)

Bases: Iterator[tuple[str, str | FastaBuff]]

A reader instance that iteratively returns FASTA records from a file object. This is the iterator to the iterable FastaFile class.

__next__() tuple[str, str | FastaBuff]

Fetches the next FASTA record from the file

Returns:

The next FASTA record from the file

Return type:

tuple[str, str | FastaBuff]

class eccLib.FastaBuff(seq: str, RNA: bool = False)
class eccLib.FastaBuff(seq: bytes, RNA: bool = False)
class eccLib.FastaBuff(seq: TextIOBase, RNA: bool = False)

Bases: Sequence[str]

A class that holds a FASTA DNA sequence in an optimal binary format. It is approximately twice as memory efficient as a string representation. It should function approximately the same as a string.

__hash__ = None

dump() bytes

Returns the bytes representation of the buffer. This operation discards some information, leaving only a binary representation of the sequence. The exact length of the sequence is lost, so one extra gap (the . character) may be encoded at the end, but the sequence is still valid.

Returns:

The binary representation of the sequence

Return type:

bytes
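One way to picture both the roughly twofold memory saving and the possible trailing gap is a 4-bit encoding that packs two symbols into each byte. The sketch below is purely illustrative: the alphabet, its ordering, and the functions pack and unpack are assumptions, and the actual binary layout used by FastaBuff is not specified here.

```python
# Assumption: 16 symbols (gap plus IUPAC nucleotide codes), 4 bits each.
# Index 0 is the gap, so a zero padding nibble decodes as ".".
ALPHABET = ".ACGTRYSWKMBDHVN"
ENCODE = {symbol: code for code, symbol in enumerate(ALPHABET)}


def pack(seq: str) -> bytes:
    """Pack two symbols per byte; odd lengths get one padding nibble."""
    codes = [ENCODE[symbol] for symbol in seq.upper()]
    if len(codes) % 2:
        codes.append(0)  # padding nibble, indistinguishable from a gap
    return bytes((high << 4) | low
                 for high, low in zip(codes[::2], codes[1::2]))


def unpack(data: bytes) -> str:
    """Decode every nibble; the original length is not recoverable."""
    return "".join(ALPHABET[byte >> 4] + ALPHABET[byte & 0x0F]
                   for byte in data)


packed = pack("ACGTN")
print(len("ACGTN"), len(packed))  # 5 3
print(unpack(packed))             # ACGTN.
```

An even-length sequence round-trips exactly, while an odd-length one comes back with one extra "." because the byte boundary hides where the sequence ended.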

index(seq: str | FastaBuff, start: int = 0) int | None

Returns the index of the provided sequence

Parameters:
  • seq (str | FastaBuff) – The sequence to find

  • start (int, optional) – The index to start searching from. Defaults to 0.

Returns:

The index of the sequence or None if not found

Return type:

int | None

count(seq: str | FastaBuff) int

Counts the occurrences of the provided sequence

Parameters:

seq (str | FastaBuff) – The sequence to count

Returns:

The number of occurrences of the sequence

Return type:

int

get_annotated(entry: GtfDict | Mapping) str

Returns the annotated sequence of the provided entry

Parameters:

entry (GtfDict | Mapping) – The annotation to apply

Returns:

The annotated sequence

Return type:

str

find(seq: str | FastaBuff) list[int]

Finds all occurrences of the provided sequence

Parameters:

seq (str | FastaBuff) – The sequence to find

Returns:

A list of indexes where the sequence was found

Return type:

list[int]
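Collecting every match position can be sketched on an ordinary string with repeated calls to str.find. The helper find_all is hypothetical, and it reports overlapping matches; whether FastaBuff.find does the same is not stated above.

```python
from typing import List


def find_all(haystack: str, needle: str) -> List[int]:
    """Return the start index of every occurrence of needle."""
    positions: List[int] = []
    index = haystack.find(needle)
    while index != -1:
        positions.append(index)
        # Advance by one so that overlapping matches are also reported.
        index = haystack.find(needle, index + 1)
    return positions


print(find_all("ATATAT", "ATA"))  # [0, 2]
print(find_all("ACGT", "TT"))     # []
```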

__str__() str

Returns the stored sequence

Returns:

The stored sequence

Return type:

str

__eq__(value: Any) bool

Checks if the stored sequence is equal to the provided value

Parameters:

value (Any) – The value to compare with

Returns:

True if the stored sequence is equal to the provided value, False otherwise

Return type:

bool

__ne__(value: Any) bool

Checks if the stored sequence is not equal to the provided value

Parameters:

value (Any) – The value to compare with

Returns:

True if the stored sequence is not equal to the provided value, False otherwise

Return type:

bool

__len__() int

Returns the length of the stored sequence

Returns:

The length of the stored sequence

Return type:

int

__getitem__(key: int | slice) str

Returns the sequence at the specified index or slice

Parameters:

key (int | slice) – The index or slice to retrieve

Returns:

The sequence at the specified index or slice

Return type:

str

__setitem__(key: int, value: str) None

Sets the sequence at the specified index

Parameters:
  • key (int) – The index to set

  • value (str) – The sequence to set

__contains__(seq: str | FastaBuff) bool

Checks if the buffer contains the provided sequence

Parameters:

seq (str | FastaBuff) – The sequence to check

Returns:

True if the buffer contains the sequence, False otherwise

Return type:

bool

eccLib.parseFASTA(file: str | TextIOBase, binary: bool = True, echo: TextIOBase | None = None) list[tuple[str, str | eccLib.FastaBuff]]

Parses raw FASTA data and returns a list of all entries. You may either pass raw file data as file, or a file object. Unexpected characters will be ignored. This parser loads the whole file at once into memory.

Parameters:
  • file (str | TextIOBase) – The file to parse

  • binary (bool) – Whether to parse the sequence as a FastaBuff. Defaults to True. Pass False to parse FASTA files that don’t contain exclusively IUPAC codes.

  • echo (TextIOBase | None, optional) – The IO to output echo into. Defaults to None.

Returns:

A list of (title, sequence) tuples

Return type:

list[tuple[str, FastaBuff | str]]
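To make the shape of the result concrete, here is a minimal pure-Python sketch that turns raw FASTA text into (title, sequence) tuples with str sequences. The function parse_fasta_text is hypothetical and much simpler than parseFASTA: it does no character filtering and never builds a FastaBuff.

```python
from typing import List, Optional, Tuple


def parse_fasta_text(text: str) -> List[Tuple[str, str]]:
    """Split raw FASTA text into (title, sequence) tuples."""
    entries: List[Tuple[str, str]] = []
    title: Optional[str] = None
    chunks: List[str] = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if title is not None:
                entries.append((title, "".join(chunks)))
            title, chunks = line[1:], []
        elif title is not None:
            # Sequence lines are concatenated without the line breaks.
            chunks.append(line)
    if title is not None:
        entries.append((title, "".join(chunks)))
    return entries


raw_fasta = ">seq1 demo\nACGT\nTTAA\n>seq2\nNNNN\n"
fasta_entries = parse_fasta_text(raw_fasta)
print(fasta_entries)  # [('seq1 demo', 'ACGTTTAA'), ('seq2', 'NNNN')]
```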

eccLib.parseGTF(file: str | TextIOBase, echo: TextIOBase | None = None, attr_tp: Mapping[str, Callable[[str], Any]] | None = None) GtfList

Parses raw GTF, GFF2 and GFF3 data and returns a list containing parsed GtfDicts. You may either pass raw file data as file, or a file object. This parser loads the whole file at once into memory.

Parameters:
  • file (str | TextIOBase) – The file to parse

  • echo (TextIOBase | None, optional) – The IO to output echo into. Defaults to None.

  • attr_tp (Mapping[str, Callable[[str], Any]] | None, optional) – A mapping of attribute names to type conversion functions. Defaults to None.

Returns:

A list containing parsed GtfDicts

Return type:

GtfList
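The role of attr_tp can be illustrated by parsing one GTF line by hand and applying a conversion function to selected attributes. The function parse_gtf_line is a hypothetical, simplified sketch: it handles only the GTF key "value"; attribute syntax (not GFF3 key=value pairs), and it splits attributes on every semicolon, so quoted values that contain semicolons are not supported.

```python
from typing import Any, Callable, Dict, Mapping, Optional


def parse_gtf_line(
    line: str,
    attr_tp: Optional[Mapping[str, Callable[[str], Any]]] = None,
) -> Dict[str, Any]:
    """Parse one GTF line into a dict (illustrative only)."""
    converters = attr_tp or {}
    columns = line.rstrip("\n").split("\t")

    def optional(value: str, convert: Callable[[str], Any]) -> Any:
        # "." marks a missing value in GTF.
        return None if value == "." else convert(value)

    entry: Dict[str, Any] = {
        "seqname": optional(columns[0], str),
        "source": optional(columns[1], str),
        "feature": optional(columns[2], str),
        "start": optional(columns[3], int),
        "end": optional(columns[4], int),
        "score": optional(columns[5], float),
        "reverse": optional(columns[6], lambda strand: strand == "-"),
        "frame": optional(columns[7], int),
    }
    for part in columns[8].split(";"):
        part = part.strip()
        if not part:
            continue
        key, _, raw = part.partition(" ")
        value = raw.strip().strip('"')
        # Attributes without a converter stay as strings.
        entry[key] = converters.get(key, str)(value)
    return entry


sample_line = ('chr1\thavana\texon\t11869\t12227\t.\t+\t.\t'
               'gene_id "ENSG00000223972"; exon_number "1";')
parsed = parse_gtf_line(sample_line, attr_tp={"exon_number": int})
print(parsed["exon_number"], type(parsed["exon_number"]).__name__)  # 1 int
```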