eccLib Module

A Python module written in C for fast parsing of genomic files and for genomic context analysis. Its classes let you parse GTF and FASTA files and perform various operations on them. The C implementation provides speed and memory efficiency, and the module is intended for use in bioinformatics applications.

Generally speaking, the parsers are as permissive as possible. Because the GTF specification exists in many slightly different variations, permissive parsing maximises the usability of the module.

Publication: https://doi.org/10.1093/bioinformatics/btaf558

class eccLib.GtfDict_ItemView(gtfdict: GtfDict)

Bases: ItemsView[str, Any]

A view into the items of a GtfDict.

class eccLib.GtfDict_KeyView(gtfdict: GtfDict)

Bases: KeysView[str]

A view into the keys of a GtfDict.

class eccLib.GtfDict_ValueView(gtfdict: GtfDict)

Bases: ValuesView[Any]

A view into the values of a GtfDict.

class eccLib.GtfDict(seqname: str | None = None, source: str | None = None, feature: str | None = None, start: int | None = None, end: int | None = None, score: float | None = None, reverse: bool | None = None, frame: int | None = None, **kwargs: Any)
class eccLib.GtfDict(toConvert: Mapping[str, Any])

Bases: MutableMapping[str, Any]

A mapping object that is guaranteed to have all the keys required by the GTF specification. You can access those keys via attributes or through the mapping interface. The provided methods let you compare entries, check for overlap and containment, and more.

Logically, this is meant to represent a GTF annotation.

Please note that this class overrides the equality, greater-than and less-than comparisons, as well as containment checks.

seqname: str | None

The name of the sequence being annotated

source: str | None

Where this annotation comes from

feature: str | None

What this annotation is meant to represent

start: int | None

The start nucleotide position

end: int | None

The end nucleotide position

score: float | None

A score associated with the sequence

reverse: bool | None

Whether the sequence is located on the reverse strand

frame: int | None

Indicates which base of the feature is the first base of a codon

overlaps(other: 'GtfDict' | Mapping[str, Any]) bool

Returns true if the provided entry’s sequence overlaps with this entry’s sequence. seqname must be equal for both entries, and reverse must also be equal or be None in either entry.

Parameters:

other (GtfDict | Mapping[str, Any]) – The sequence to check for overlap

Returns:

Whether the sequences overlap

Return type:

bool
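The overlap rule described above can be sketched in plain Python. This is an illustration only, not eccLib's C implementation: the helper name overlaps_sketch is hypothetical, and the sketch assumes inclusive start and end coordinates, as is conventional for GTF.

```python
from typing import Any, Mapping


def overlaps_sketch(a: Mapping[str, Any], b: Mapping[str, Any]) -> bool:
    """Hypothetical pure-Python rendering of the documented overlap rule."""
    # seqname must be equal for both entries
    if a["seqname"] != b["seqname"]:
        return False
    # reverse must be equal, unless it is None in either entry
    if (
        a["reverse"] is not None
        and b["reverse"] is not None
        and a["reverse"] != b["reverse"]
    ):
        return False
    # Assumption: coordinates are inclusive, so shared end points count as overlap
    return a["start"] <= b["end"] and b["start"] <= a["end"]


gene = {"seqname": "chr1", "start": 100, "end": 200, "reverse": False}
exon = {"seqname": "chr1", "start": 150, "end": 250, "reverse": None}
elsewhere = {"seqname": "chr2", "start": 150, "end": 250, "reverse": None}

same_chromosome = overlaps_sketch(gene, exon)        # strands compatible, ranges intersect
other_chromosome = overlaps_sketch(gene, elsewhere)  # seqname differs
```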

contains(other: 'GtfDict' | Mapping[str, Any]) bool

Returns true if the provided entry’s sequence lies inside this entry’s sequence. seqname must be equal for both entries, and reverse must also be equal or be None in either entry.

Parameters:

other (GtfDict | Mapping[str, Any]) – The sequence to check for containment

Returns:

Whether the sequence contains the other sequence

Return type:

bool

coverage(other: 'GtfDict' | tuple[int, int]) float

Returns the coverage of the provided sequence over this sequence. The coverage is defined as the length of the overlap divided by the length of this sequence.

Be sure to check out GtfList.get_contigs(), it’s very relevant if you wish to get the coverage of a sequence over a list of sequences.

Parameters:

other (GtfDict | tuple[int, int]) – The sequence to check for coverage

Returns:

The coverage of the provided sequence over this sequence

Return type:

float
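The definition of coverage given above (overlap length divided by the length of this sequence) can be written out as a short calculation. The function name coverage_sketch is hypothetical and the code is not taken from eccLib; it assumes inclusive coordinates, so a range (start, end) has length end - start + 1.

```python
def coverage_sketch(this: tuple[int, int], other: tuple[int, int]) -> float:
    """Fraction of `this` that is covered by `other` (illustrative only)."""
    this_start, this_end = this
    other_start, other_end = other
    # Assumption: inclusive coordinates, hence the +1 in both lengths
    overlap = min(this_end, other_end) - max(this_start, other_start) + 1
    length = this_end - this_start + 1
    # A negative overlap means the ranges are disjoint
    return max(overlap, 0) / length


half = coverage_sketch((1, 100), (51, 150))  # 50 of 100 positions are shared
none = coverage_sketch((1, 10), (20, 30))    # disjoint ranges
```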

keys() GtfDict_KeyView

Returns a view into the keys of the GtfDict

Returns:

The view of the keys of the GtfDict

Return type:

GtfDict_KeyView

values() GtfDict_ValueView

Returns a view into the values of the GtfDict

Returns:

The view of the values of the GtfDict

Return type:

GtfDict_ValueView

pop(key: str, default: Any = None) Any

Removes the key and returns the value

Parameters:
  • key (str) – The key to remove

  • default (Any, optional) – The default value to return if the key is not found. Defaults to None.

Returns:

The value of the key or the default value

Return type:

Any

clear() None

Clears all the additional attributes, and sets all the core fields to None

popitem() tuple[str, Any]

Removes a key and returns the key and value

Returns:

The key and value of the last key

Return type:

tuple[str, Any]

setdefault(key: str, default: Any = None) Any

Returns the value of the key, or sets the key to the default value if it is not found and returns the value

Parameters:
  • key (str) – The key to set the default value for

  • default (Any, optional) – The default value to set. Defaults to None.

Returns:

The value of the key or the default value

Return type:

Any

get(key: Literal['seqname', 'source', 'feature'], default: str | None = None) str | None
get(key: Literal['start', 'end', 'frame'], default: int | None = None) int | None
get(key: Literal['score'], default: float | None = None) float | None
get(key: Literal['reverse'], default: bool | None = None) bool | None
get(key: str, default: Any = None) Any

Returns the value of the key or the default value if the key is not found

Parameters:
  • key (str) – The key to get the value for

  • default (Any, optional) – The default value to return. Defaults to None.

Returns:

The value of the key or the default value

Return type:

Any

items() GtfDict_ItemView

Returns a view into the items of the GtfDict

Returns:

The view of the items of the GtfDict

Return type:

GtfDict_ItemView

attributes() dict[str, Any]

Returns the additional attributes of the GtfDict

Returns:

The additional attributes of the GtfDict

Return type:

dict[str, Any]

class eccLib.GtfList_ColumnIterator

Bases: Iterator, Generic

An iterator over a GtfList column.

class eccLib.GtfList_ColumnView(gtflist: GtfList, column: str)

Bases: ValuesView, Sequence, Generic

A read-only view into a GtfList column. It behaves much like a tuple.

class eccLib.GtfList(*args: GtfDict)
class eccLib.GtfList(obj: Sequence[GtfDict])
class eccLib.GtfList(obj: Iterator[GtfDict])

Bases: list[GtfDict]

A subclass of a list that holds only GtfDicts. Equivalent to a parsed GTF file, so it’s ordered.

find_closest_bound(sequence: GtfDict) GtfDict | None

Finds the entry that most closely bounds the provided sequence. The entry with the closest bound is the one that has the smallest Hausdorff distance to the provided sequence.

Parameters:

sequence (GtfDict) – The sequence to find the closest bounding entry for

Returns:

The entry that most closely bounds the provided sequence

Return type:

GtfDict | None
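For two closed intervals, the Hausdorff distance reduces to the larger of the two end-point differences. The sketch below uses that fact to pick the closest candidate from a list of (start, end) tuples. The function names are hypothetical, and whether eccLib restricts the search to entries that fully contain the sequence is not stated here, so this sketch simply considers every candidate.

```python
from typing import Optional, Sequence

Interval = tuple[int, int]


def interval_hausdorff(a: Interval, b: Interval) -> int:
    """Hausdorff distance between two closed intervals."""
    return max(abs(a[0] - b[0]), abs(a[1] - b[1]))


def closest_bound_sketch(
    query: Interval, candidates: Sequence[Interval]
) -> Optional[Interval]:
    """Return the candidate with the smallest Hausdorff distance to `query`."""
    if not candidates:
        return None  # mirrors the documented `GtfDict | None` return type
    return min(candidates, key=lambda c: interval_hausdorff(query, c))


# Distances to (100, 200) are 200, 10 and 50 respectively
best = closest_bound_sketch((100, 200), [(50, 400), (90, 210), (150, 160)])
```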

sq_split() dict[str | None, eccLib.GtfList]

Splits the GtfList into separate GtfLists based on seqname. This is very useful for optimizing operations on the GtfList.

Returns:

A dictionary containing GtfLists split by seqname

Return type:

dict[str | None, GtfList]
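The grouping performed by sq_split can be pictured with ordinary dictionaries. This is a sketch of the idea, not the library's code: sq_split_sketch is a hypothetical name, and plain dicts and lists stand in for GtfDict and GtfList.

```python
from typing import Any, Mapping, Optional


def sq_split_sketch(
    entries: list[Mapping[str, Any]],
) -> dict[Optional[str], list[Mapping[str, Any]]]:
    """Group entries by seqname, keeping the original order within each group."""
    groups: dict[Optional[str], list[Mapping[str, Any]]] = {}
    for entry in entries:
        # Entries without a seqname end up under the None key
        groups.setdefault(entry.get("seqname"), []).append(entry)
    return groups


entries = [
    {"seqname": "chr1", "start": 1, "end": 10},
    {"seqname": "chr2", "start": 5, "end": 50},
    {"seqname": "chr1", "start": 20, "end": 30},
    {"seqname": None, "start": 7, "end": 9},
]
by_seqname = sq_split_sketch(entries)
```

Since overlap and containment require equal seqnames, splitting first means each later comparison only has to scan the entries of one sequence.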

find(*args: Callable[[GtfDict], bool], **kwargs: Any | Callable[[Any], bool]) GtfList

Finds all entries that match the provided conditions. You can provide three types of arguments:

1. A function that takes a GtfDict and returns a boolean
2. A keyword argument that will be used to filter the GtfList
3. A keyword argument that is a function that takes the value held under the given key and returns a boolean

Returns:

A GtfList containing all entries that match the provided conditions

Return type:

GtfList
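The three argument types accepted by find can be illustrated with a small pure-Python filter. The name find_sketch is hypothetical and plain dicts stand in for GtfDict. The sketch assumes that all conditions must hold at once (logical AND) and that a non-callable keyword value is compared for equality; the text above does not spell out either point.

```python
from typing import Any, Callable, Mapping


def find_sketch(
    entries: list[Mapping[str, Any]],
    *predicates: Callable[[Mapping[str, Any]], bool],
    **conditions: Any,
) -> list[Mapping[str, Any]]:
    """Keep the entries that satisfy every predicate and keyword condition."""
    matches = []
    for entry in entries:
        # Type 1: functions that receive the whole entry
        if not all(pred(entry) for pred in predicates):
            continue
        for key, cond in conditions.items():
            value = entry.get(key)
            # Type 3: a callable receives the value stored under the key
            # Type 2: any other value is compared for equality (assumption)
            ok = cond(value) if callable(cond) else value == cond
            if not ok:
                break
        else:
            matches.append(entry)
    return matches


entries = [
    {"feature": "gene", "start": 100, "end": 900},
    {"feature": "exon", "start": 100, "end": 200},
    {"feature": "gene", "start": 10, "end": 20},
]
long_genes = find_sketch(
    entries,
    lambda e: e["end"] - e["start"] > 50,  # whole-entry predicate
    feature="gene",                        # plain keyword filter
    start=lambda s: s >= 100,              # per-key predicate
)
```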

column(key: Literal['seqname', 'source', 'feature']) GtfList_ColumnView[str | None]
column(key: Literal['start', 'end', 'frame']) GtfList_ColumnView[int | None]
column(key: Literal['score']) GtfList_ColumnView[float | None]
column(key: Literal['reverse']) GtfList_ColumnView[bool | None]
column(key: str) GtfList_ColumnView[Any]

Returns a view into a column of the GtfList

Parameters:

key (str) – The key to get the values for

Returns:

A view into the column of the GtfList

Return type:

GtfList_ColumnView[Any]

get_contigs() list[tuple[int, int]]

Returns a list of the contig ranges of the genes

This method does not consider different seqnames and strands, thus filter accordingly.

Returns:

A list of tuples representing the start and end positions of each contig

Return type:

list[tuple[int, int]]
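One plausible reading of "contig ranges" is the set of merged ranges obtained by joining overlapping entries. Under that assumption, which the text above does not confirm, the computation is a standard interval merge. The name contigs_sketch is hypothetical, and the sketch joins ranges only when they share at least one position.

```python
def contigs_sketch(ranges: list[tuple[int, int]]) -> list[tuple[int, int]]:
    """Merge overlapping (start, end) ranges into maximal contiguous ranges."""
    merged: list[tuple[int, int]] = []
    for start, end in sorted(ranges):
        if merged and start <= merged[-1][1]:
            # Overlaps the current contig: extend its end if needed
            prev_start, prev_end = merged[-1]
            merged[-1] = (prev_start, max(prev_end, end))
        else:
            # Starts after the current contig ends: open a new one
            merged.append((start, end))
    return merged


contigs = contigs_sketch([(30, 40), (1, 10), (5, 20), (35, 36)])
```

As with get_contigs itself, the sketch ignores seqname and strand, so ranges should be filtered to a single sequence and strand before merging.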

class eccLib.GtfFile(filename: str, attr_tp: Mapping[str, Callable[[str], Any]] | None = None)

Bases: Iterable[GtfDict]

An iterable GTF parser. It reads a GTF file and returns GtfDicts. Once entered, it spawns a GtfReader instance that can be iterated over. This is the iterative parser for GTF files.

class eccLib.GtfReader(file: TextIOBase, attr_tp: Mapping[str, Callable[[str], Any]] | None = None)

Bases: Iterator[GtfDict]

A reader instance that iteratively returns GtfDicts from a file object. It can be used standalone, but it’s usually spawned by a GtfFile instance. Please note that standalone instances of GtfReader parse slower than those spawned by GtfFile due to limits imposed by the Python layer. This is the iterator to the iterable GtfFile class.

class eccLib.FastaFile(filename: str, binary: bool = True)

Bases: Iterable[tuple[str, str | FastaBuff]]

An iterable Fasta parser. It reads a FASTA file and returns tuples of (header, sequence). Depending on the binary argument, it can either return FastaBuff or str. This is the iterative parser for FASTA files.

class eccLib.FastaReader(file: TextIOBase, binary: bool = True)

Bases: Iterator[tuple[str, str | FastaBuff]]

A reader instance that iteratively returns FASTA records from a file object. This is the iterator to the iterable FastaFile class.

class eccLib.FastaBuff_iterator

Bases: Iterator[str]

class eccLib.FastaBuff_view(buff: FastaBuff, start: int = 0, stop: int = Ellipsis, step: int = 1)

Bases: ValuesView[str], Sequence

Read-only view of a FastaBuff sequence. Please note that internally this is an alias for a FastaBuff. Because of the way the view is implemented, operations on it may not be as efficient.

index(value: str | FastaBuff | FastaBuff_view, start: int = 0, stop: int = Ellipsis) int

Returns the index of the provided sequence

Parameters:
  • seq (str | FastaBuff) – The sequence to find

  • start (int, optional) – The index to start searching from. Defaults to 0.

  • stop (int, optional) – The index to stop searching at. Defaults to maximum int value.

Returns:

The index of the sequence

Return type:

int

count(value: str | FastaBuff | FastaBuff_view) int

Counts the occurrences of the provided sequence

Parameters:

seq (str | FastaBuff) – The sequence to count

Returns:

The number of occurrences of the sequence

Return type:

int

get_annotated(entry: GtfDict | Mapping[str, Any]) FastaBuff_view

Returns the annotated sequence of the provided entry

Parameters:

entry (GtfDict | Mapping[str, Any]) – The annotation to apply

Returns:

The annotated sequence

Return type:

FastaBuff_view

class eccLib.FastaBuff(seq: str, RNA: bool = False)
class eccLib.FastaBuff(seq: bytes, RNA: bool = False)
class eccLib.FastaBuff(seq: TextIOBase, RNA: bool = False)
class eccLib.FastaBuff(seq: FastaBuff_view, RNA: bool = False)

Bases: Sequence

A class that holds a FASTA DNA sequence in an optimal binary format. Approximately twice as memory efficient as a string representation. It should function approximately the same as a string.

dump() bytes

Returns the bytes representation of the buffer. This operation discards some information, leaving only a binary representation of the sequence. The exact length of the sequence is lost, which may lead to an additional gap (the . character) being encoded, but the sequence is still valid.

Returns:

The binary representation of the sequence

Return type:

bytes
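The halved memory footprint and the possible trailing gap are both consistent with storing two symbols per byte, which is easy to demonstrate. The sketch below is not eccLib's actual encoding: the 16-symbol table ALPHABET, its ordering, and the function names pack and unpack are all assumptions made for illustration.

```python
# Hypothetical 4-bit code table: the gap plus 15 IUPAC nucleotide codes.
# The gap is placed at code 0 so that a zero padding nibble decodes as a gap.
ALPHABET = ".ACGTRYSWKMBDHVN"


def pack(seq: str) -> bytes:
    """Pack a nucleotide string into bytes, two symbols per byte."""
    codes = [ALPHABET.index(ch) for ch in seq.upper()]
    if len(codes) % 2:
        codes.append(0)  # odd length: pad the last byte with a gap
    return bytes((hi << 4) | lo for hi, lo in zip(codes[::2], codes[1::2]))


def unpack(data: bytes) -> str:
    """Decode packed bytes; the original length is not recoverable."""
    return "".join(ALPHABET[b >> 4] + ALPHABET[b & 0x0F] for b in data)


even = unpack(pack("ACGT"))  # even length round-trips exactly
odd = unpack(pack("ACG"))    # odd length gains a trailing gap
```

In this sketch a sequence of odd length decodes with one extra . at the end, which matches the behaviour described for dump().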

index(value: str | FastaBuff, start: int = 0, stop: int = Ellipsis) int

Returns the index of the provided sequence

Parameters:
  • seq (str | FastaBuff) – The sequence to find

  • start (int, optional) – The index to start searching from. Defaults to 0.

  • stop (int, optional) – The index to stop searching at. Defaults to maximum int value.

Returns:

The index of the sequence

Return type:

int

count(value: str | FastaBuff) int

Counts the occurrences of the provided sequence

Parameters:

seq (str | FastaBuff) – The sequence to count

Returns:

The number of occurrences of the sequence

Return type:

int

get_annotated(entry: GtfDict | Mapping[str, Any]) FastaBuff_view

Returns the annotated sequence of the provided entry

Parameters:

entry (GtfDict | Mapping[str, Any]) – The annotation to apply

Returns:

The annotated sequence

Return type:

FastaBuff_view

find(seq: str | FastaBuff) list[int]

Finds all occurrences of the provided sequence

Parameters:

seq (str | FastaBuff) – The sequence to find

Returns:

A list of indexes where the sequence was found

Return type:

list[int]

eccLib.parseFASTA(file: str | TextIOBase, binary: bool = True, echo: TextIOBase | Any | None = None) list[tuple[str, str | eccLib.FastaBuff]]

Parses raw FASTA data and returns a list of all entries. You may either pass raw file data as file, or a file object. Unexpected characters will be ignored. This parser loads the whole file at once into memory.

Parameters:
  • file (str | TextIOBase) – The file to parse

  • binary (bool) – Whether to parse the sequence as a FastaBuff. Defaults to True. Pass False to parse FASTA files that don’t contain exclusively IUPAC codes.

  • echo (TextIOBase | Any | None, optional) – The IO to output echo into. Defaults to None.

Returns:

A list containing title, sequence tuples

Return type:

list[tuple[str, FastaBuff | str]]
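The shape of the result, a list of (title, sequence) tuples, can be shown with a minimal pure-Python FASTA reader. This is a simplified stand-in, not the C parser: parse_fasta_sketch is a hypothetical name, sequences are returned as plain strings (comparable to binary=False), and no character filtering is performed.

```python
def parse_fasta_sketch(text: str) -> list[tuple[str, str]]:
    """Split raw FASTA text into (title, sequence) tuples."""
    records: list[tuple[str, str]] = []
    title = None
    chunks: list[str] = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if any
            if title is not None:
                records.append((title, "".join(chunks)))
            title = line[1:]
            chunks = []
        elif title is not None:
            # Sequence lines are concatenated without line breaks
            chunks.append(line)
    if title is not None:
        records.append((title, "".join(chunks)))
    return records


raw = ">seq1 example\nACGT\nTTGA\n>seq2\nNNNN\n"
records = parse_fasta_sketch(raw)
```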

eccLib.parseGTF(file: str | TextIOBase, echo: TextIOBase | Any | None = None, attr_tp: Mapping[str, Callable[[str], Any]] | None = None) GtfList

Parses raw GTF, GFF2 and GFF3 data and returns a list containing parsed GtfDicts. You may either pass raw file data as file, or a file object. This parser loads the whole file at once into memory.

Parameters:
  • file (str | TextIOBase) – The file to parse

  • echo (TextIOBase | Any | None, optional) – The IO to output echo into. Defaults to None.

  • attr_tp (Mapping[str, Callable[[str], Any]] | None, optional) – A mapping of attribute names to type conversion functions. Defaults to None.

Returns:

A list containing parsed GtfDicts

Return type:

GtfList
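The role of attr_tp, a mapping from attribute names to conversion functions, can be illustrated on a single GTF attribute column. The function parse_attributes_sketch is hypothetical and deliberately naive: it splits on every semicolon, so it would mishandle semicolons inside quoted values, and it says nothing about how eccLib itself tokenises attributes.

```python
from typing import Any, Callable, Mapping, Optional


def parse_attributes_sketch(
    column: str,
    attr_tp: Optional[Mapping[str, Callable[[str], Any]]] = None,
) -> dict[str, Any]:
    """Parse 'key "value"; key "value";' pairs, converting selected keys."""
    converters = attr_tp or {}
    result: dict[str, Any] = {}
    for field in column.split(";"):
        field = field.strip()
        if not field:
            continue
        key, _, raw = field.partition(" ")
        raw = raw.strip().strip('"')
        convert = converters.get(key)
        # Attributes without a converter stay as strings
        result[key] = convert(raw) if convert else raw
    return result


attrs = parse_attributes_sketch(
    'gene_id "ENSG01"; exon_number "2";',
    attr_tp={"exon_number": int},
)
```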