eccLib Module¶
A Python module written in C for fast parsing of genomic files and for genomic context analysis. Its classes let you parse GTF and FASTA files and perform various operations on the parsed data. The C implementation is chosen for speed and memory efficiency, and the module is intended for use in bioinformatics applications.
- class eccLib.GtfDict(seqname: str | None = None, source: str | None = None, feature: str | None = None, start: int | None = None, end: int | None = None, score: float | None = None, reverse: bool | None = None, frame: int | None = None, **kwargs: Any)¶
- class eccLib.GtfDict(toConvert: Mapping[str, Any])
Bases: Mapping[str, Any]
A mapping object that is guaranteed to have all the keys required by the GTF specification. You can access those keys as attributes or through the mapping interface. The provided methods let you compare entries and check for overlap, containment, and more.
Note that this class overrides the equality, greater-than, and less-than comparisons as well as containment checks. Additionally, this class is hashable.
- seqname: str | None¶
The name of the sequence being annotated
- source: str | None¶
Where this annotation comes from
- feature: str | None¶
What this annotation is meant to represent
- start: int | None¶
The start nucleotide position
- end: int | None¶
The end nucleotide position
- score: float | None¶
A score associated with the sequence
- reverse: bool | None¶
Whether the sequence is located on the reverse strand
- frame: int | None¶
Indicates which base of the feature is the first base of a codon
- overlaps(other: GtfDict | Mapping[str, Any]) bool ¶
Returns true if the provided entry’s sequence overlaps with this one. seqname must be equal for both entries, and reverse must either be equal or be None in at least one of them.
- Parameters:
other (GtfDict | Mapping[str, Any]) – The sequence to check for overlap
- Returns:
Whether the sequences overlap
- Return type:
bool
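The overlap rule described above can be sketched in plain Python. This is an illustration only, not eccLib's implementation: `intervals_overlap` is a hypothetical helper, entries are plain dicts, and coordinates are assumed to be inclusive, as in GTF.

```python
from typing import Any, Mapping


def intervals_overlap(a: Mapping[str, Any], b: Mapping[str, Any]) -> bool:
    """Hypothetical stand-in for the overlap rule documented above."""
    # Entries on different sequences never overlap.
    if a["seqname"] != b["seqname"]:
        return False
    # Strands must match unless one of them is unknown (None).
    ra, rb = a.get("reverse"), b.get("reverse")
    if ra is not None and rb is not None and ra != rb:
        return False
    # Assumption: start and end are inclusive coordinates.
    return a["start"] <= b["end"] and b["start"] <= a["end"]


gene = {"seqname": "chr1", "start": 100, "end": 200, "reverse": False}
exon = {"seqname": "chr1", "start": 150, "end": 250, "reverse": None}
other = {"seqname": "chr2", "start": 150, "end": 250, "reverse": False}
```

Here `gene` and `exon` overlap because the strand of `exon` is unknown, while `gene` and `other` do not because their seqname differs.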
- contains(other: GtfDict | Mapping[str, Any]) bool ¶
Returns true if the other entry’s sequence lies inside this entry’s sequence. seqname must be equal for both entries, and reverse must either be equal or be None in at least one of them.
- Parameters:
other (GtfDict | Mapping[str, Any]) – The sequence to check for containment
- Returns:
Whether the sequence contains the other sequence
- Return type:
bool
- coverage(other: 'GtfDict' | Iterable['GtfDict']) float ¶
Returns the coverage of the provided sequence over this sequence. The coverage is defined as the length of the overlap divided by the length of this sequence.
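The definition above can be written out as a small calculation. The sketch below is not eccLib code: `coverage` here is a hypothetical free function over `(start, end)` tuples, it handles a single other range only, and it assumes inclusive coordinates.

```python
def coverage(this: tuple[int, int], other: tuple[int, int]) -> float:
    """Overlap length divided by the length of `this`."""
    start, end = this
    o_start, o_end = other
    # Assumption: inclusive coordinates, so a range's length is end - start + 1.
    overlap = min(end, o_end) - max(start, o_start) + 1
    # Disjoint ranges give a negative value, which is clamped to zero.
    return max(overlap, 0) / (end - start + 1)


half = coverage((1, 100), (51, 150))   # the last 50 of 100 positions are covered
none = coverage((1, 100), (200, 300))  # disjoint ranges
```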
- keys() list[str] ¶
Returns the keys of the GtfDict
- Returns:
The keys of the GtfDict
- Return type:
list[str]
- values() list[Any] ¶
Returns the values of the GtfDict
- Returns:
The values of the GtfDict
- Return type:
list[Any]
- pop(key: str) Any ¶
Removes the key and returns the value
- Parameters:
key (str) – The key to remove
- Returns:
The value of the key
- Return type:
Any
- get(key: str, default: Any = None) Any ¶
Returns the value of the key or the default value if the key is not found
- Parameters:
key (str) – The key to get the value for
default (Any, optional) – The default value to return. Defaults to None.
- Returns:
The value of the key or the default value
- Return type:
Any
- items() list[tuple[str, Any]] ¶
Returns the items of the GtfDict
- Returns:
The items of the GtfDict
- Return type:
list[tuple[str, Any]]
- update(other: GtfDict | Mapping[str, Any]) None ¶
Updates the GtfDict with the provided dict
- Parameters:
other (GtfDict | Mapping[str, Any]) – The dict to update with
- __eq__(check: GtfDict | Mapping[str, Any]) bool ¶
Checks seqname, feature, start, end and reverse
- Parameters:
check (GtfDict | Mapping[str, Any]) – The object to compare to
- Returns:
Whether the objects are equal
- Return type:
bool
- __ne__(check: GtfDict | Mapping[str, Any]) bool ¶
Checks seqname, feature, start, end and reverse
- Parameters:
check (GtfDict | Mapping[str, Any]) – The object to compare to
- Returns:
Whether the objects are not equal
- Return type:
bool
- __lt__(other: GtfDict | Mapping[str, Any]) bool ¶
Checks if self is before other without overlap
- Parameters:
other (GtfDict | Mapping[str, Any]) – The object to compare to
- Returns:
self.end < other.start
- Return type:
bool
- __le__(other: GtfDict | Mapping[str, Any]) bool ¶
Checks if self is before other with possible overlap
- Parameters:
other (GtfDict | Mapping[str, Any]) – The object to compare to
- Returns:
self.end <= other.end
- Return type:
bool
- __gt__(other: GtfDict | Mapping[str, Any]) bool ¶
Checks if self is after other without overlap
- Parameters:
other (GtfDict | Mapping[str, Any]) – The object to compare to
- Returns:
self.start > other.end
- Return type:
bool
- __ge__(other: GtfDict | Mapping[str, Any]) bool ¶
Checks if self is after other with possible overlap
- Parameters:
other (GtfDict | Mapping[str, Any]) – The object to compare to
- Returns:
self.start >= other.start
- Return type:
bool
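The four ordering comparisons above can be summarized with a small stand-in class. `Span` is hypothetical and carries only start and end; the method bodies copy the return expressions documented above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    """Hypothetical stand-in exposing only start and end."""
    start: int
    end: int

    def __lt__(self, other: "Span") -> bool:
        return self.end < other.start      # strictly before, no overlap

    def __le__(self, other: "Span") -> bool:
        return self.end <= other.end       # before, overlap allowed

    def __gt__(self, other: "Span") -> bool:
        return self.start > other.end      # strictly after, no overlap

    def __ge__(self, other: "Span") -> bool:
        return self.start >= other.start   # after, overlap allowed


a, b = Span(10, 50), Span(40, 90)  # two overlapping spans
```

With these definitions, overlapping spans such as `a` and `b` are neither `<` nor `>` one another, so the strict comparisons do not form a total order.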
- __getitem__(key: str) Any ¶
Returns the value of the provided key
- Parameters:
key (str) – The key to get the value for
- Returns:
The value of the key
- Return type:
Any
- __setitem__(key: str, value: Any) None ¶
Sets the value of the provided key
- Parameters:
key (str) – The key to set the value for
value (Any) – The value to set
- __delitem__(key: str) None ¶
Deletes the provided key
- Parameters:
key (str) – The key to delete
- __iter__() Iterator[str] ¶
Returns an iterator over the keys of the GtfDict
- Returns:
An iterator of the GtfDict
- Return type:
Iterator[str]
- __hash__() int ¶
Returns a hash of the GtfDict
- Raises:
TypeError – If one of the stored values is unhashable. This can only happen with attribute values.
- Returns:
A hash of the GtfDict
- Return type:
int
- __contains__(other: GtfDict | Mapping[str, Any]) bool ¶
Returns true if the other entry’s sequence lies inside this entry’s sequence. seqname must be equal for both entries, and reverse must either be equal or be None in at least one of them.
- Parameters:
other (GtfDict | Mapping[str, Any]) – The sequence to check for containment
- Returns:
Whether the sequence contains the other sequence
- Return type:
bool
- __len__() int ¶
Returns the length of the sequence
- Returns:
The length of the sequence
- Return type:
int
- __repr__() str ¶
Returns a representation of the GtfDict. This is equivalent to str(dict(self))
- Returns:
A representation of the GtfDict
- Return type:
str
- __str__() str ¶
Returns GTF representation of the sequence
- Returns:
A valid GTF entry
- Return type:
str
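For orientation, a GTF entry is a single line of nine tab-separated columns. The sketch below formats a plain mapping that way. `to_gtf_line` is a hypothetical helper, and the details (reverse=False written as `+`, `.` for missing values, attribute quoting) are assumptions based on the GTF format; the exact output of eccLib may differ.

```python
from typing import Any, Mapping

CORE = ("seqname", "source", "feature", "start", "end", "score", "reverse", "frame")


def to_gtf_line(entry: Mapping[str, Any]) -> str:
    """Hypothetical helper: format a mapping as one GTF line."""
    def cell(value: Any) -> str:
        # Missing values are written as a single dot.
        return "." if value is None else str(value)

    reverse = entry.get("reverse")
    # Assumption: reverse=True maps to "-", reverse=False to "+".
    strand = "." if reverse is None else ("-" if reverse else "+")
    columns = [
        cell(entry.get("seqname")), cell(entry.get("source")),
        cell(entry.get("feature")), cell(entry.get("start")),
        cell(entry.get("end")), cell(entry.get("score")),
        strand, cell(entry.get("frame")),
    ]
    # Every non-core key is written into the ninth (attributes) column.
    attributes = " ".join(f'{k} "{v}";' for k, v in entry.items() if k not in CORE)
    return "\t".join(columns + [attributes])


line = to_gtf_line({
    "seqname": "chr1", "source": "havana", "feature": "gene",
    "start": 11869, "end": 14409, "score": None, "reverse": False,
    "frame": None, "gene_id": "ENSG00000223972",
})
```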
- class eccLib.GtfList(*args: GtfDict)¶
- class eccLib.GtfList(obj: Sequence[GtfDict])
- class eccLib.GtfList(obj: Iterator[GtfDict])
Bases: list[GtfDict]
A subclass of list that holds only GtfDicts. It is equivalent to a parsed GTF file, so it is ordered. However, since GtfDicts are hashable, instances of this class can easily be converted to a set.
- find_closest_bound(sequence: GtfDict) GtfDict | None ¶
Finds the entry that most closely bounds the provided sequence. The closest bound is the entry with the smallest Hausdorff distance to the provided sequence.
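For two closed intervals on a line, the Hausdorff distance reduces to the larger of the two endpoint differences. The sketch below uses that fact to pick the closest candidate. It is an illustration over `(start, end)` tuples with hypothetical helper names, and it does not check whether a candidate actually encloses the query, which the real method may do.

```python
def hausdorff(a: tuple[int, int], b: tuple[int, int]) -> int:
    """Hausdorff distance between two closed intervals on a line."""
    return max(abs(a[0] - b[0]), abs(a[1] - b[1]))


def closest_bound(query, candidates):
    """Hypothetical helper: candidate with the smallest distance, or None."""
    return min(candidates, key=lambda c: hausdorff(query, c), default=None)


query = (100, 200)
candidates = [(90, 400), (95, 210), (300, 320)]
best = closest_bound(query, candidates)  # (95, 210), at distance 10
```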
- sq_split() dict[str | None, eccLib.GtfList] ¶
Splits the GtfList into separate GtfLists based on seqname. This is very useful for optimizing operations on the GtfList.
- Returns:
A dictionary containing GtfLists split by seqname
- Return type:
dict[str | None, GtfList]
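The grouping that this method describes can be pictured with plain dicts and lists. `split_by_seqname` below is a hypothetical stand-in, not the library method; it only shows the shape of the result.

```python
from collections import defaultdict


def split_by_seqname(entries):
    """Hypothetical helper: group entries by their seqname, keeping order."""
    groups = defaultdict(list)
    for entry in entries:
        # Entries without a seqname are collected under the key None.
        groups[entry.get("seqname")].append(entry)
    return dict(groups)


entries = [
    {"seqname": "chr1", "start": 1, "end": 10},
    {"seqname": "chr2", "start": 5, "end": 20},
    {"seqname": "chr1", "start": 30, "end": 40},
]
by_seq = split_by_seqname(entries)
```

Working on one group at a time means that later pairwise checks, such as overlap tests, never have to compare entries from different sequences.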
- find(*args: Callable[[GtfDict], bool], **kwargs: Any | Callable[[Any], bool]) GtfList ¶
Finds all entries that match the provided conditions. You can provide three types of arguments:
1. A function that takes a GtfDict and returns a boolean
2. A keyword argument that will be used to filter the GtfList
3. A keyword argument that is a function that takes the value held under the given key and returns a boolean
- Returns:
A GtfList containing all entries that match the provided conditions
- Return type:
GtfList
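The three kinds of conditions can be sketched over plain dicts. The function below is a hypothetical stand-in, not the library method, and it assumes that all conditions must hold at once and that a plain keyword value is compared for equality.

```python
def find(entries, *predicates, **conditions):
    """Hypothetical sketch of the three condition types described above."""
    def matches(entry):
        # 1. Positional functions receive the whole entry.
        if not all(predicate(entry) for predicate in predicates):
            return False
        for key, condition in conditions.items():
            value = entry.get(key)
            if callable(condition):
                # 3. Keyword functions receive the value stored under the key.
                if not condition(value):
                    return False
            elif value != condition:
                # 2. Plain keyword values are compared for equality (assumed).
                return False
        return True

    return [entry for entry in entries if matches(entry)]


entries = [
    {"seqname": "chr1", "feature": "gene", "start": 100, "end": 900},
    {"seqname": "chr1", "feature": "exon", "start": 100, "end": 200},
    {"seqname": "chr2", "feature": "exon", "start": 500, "end": 650},
]
exons = find(entries, feature="exon")
late = find(entries, lambda e: e["end"] - e["start"] < 200, start=lambda s: s >= 150)
```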
- column(key: str, pad: bool = True) list[Any] ¶
Returns a list of values for the provided key
- Parameters:
key (str) – The key to get the values for
pad (bool, optional) – Whether to pad the list with None values for missing keys. Defaults to True. If False, missing keys will cause an exception.
- Returns:
A list of values for the provided key
- Return type:
list[Any]
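The effect of `pad` can be illustrated with plain dicts. `column` below is a hypothetical stand-in; the documentation only says that an exception is raised when `pad` is false, so the `KeyError` here is an assumption.

```python
def column(entries, key, pad=True):
    """Hypothetical sketch: collect the value stored under `key` per entry."""
    if pad:
        # Missing keys are padded with None.
        return [entry.get(key) for entry in entries]
    # Without padding, a missing key raises (KeyError in this sketch).
    return [entry[key] for entry in entries]


entries = [
    {"feature": "gene", "gene_id": "G1"},
    {"feature": "exon"},
]
padded = column(entries, "gene_id")
```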
- __iadd__(value: GtfList) Self ¶
Appends the entries of the provided GtfList to the current GtfList
- Parameters:
value (GtfList) – The GtfList to append to the current GtfList
- Returns:
The current GtfList with the entries of the provided GtfList appended
- Return type:
Self
- __add__(value: GtfList) GtfList ¶
- __add__(value: Sequence[Any]) list[Any]
Concatenates this GtfList with the provided value. Adding a GtfList returns a GtfList, while adding any other sequence returns a plain list.
- __str__() str ¶
Exports the GtfList to a GTF file representation
- Returns:
A GTF file representation of the GtfList
- Return type:
str
- class eccLib.GtfFile(filename: str, attr_tp: Mapping[str, Callable[[str], Any]] | None = None)¶
Bases: Iterable[GtfDict]
An iterable GTF parser. It reads a GTF file and returns GtfDicts. Once entered, it spawns a GtfReader instance that can be iterated over. This is the iterative parser for GTF files.
- __iter__() GtfReader ¶
Initializes a new iterator for the GtfFile instance
- Returns:
A new GtfReader instance
- Return type:
GtfReader
- __enter__() GtfFile ¶
Opens the file and gets ready for reading
- Returns:
The GtfFile instance
- Return type:
GtfFile
- __exit__(*args, **kwargs) None ¶
Closes the file
- class eccLib.GtfReader(file: TextIOBase, attr_tp: Mapping[str, Callable[[str], Any]] | None = None)¶
Bases: Iterator[GtfDict]
A reader instance that iteratively returns GtfDicts from a file object. It can be used standalone, but it is usually spawned by a GtfFile instance. Note that standalone instances of GtfReader parse more slowly than those spawned by GtfFile, due to limits imposed by the Python layer. This is the iterator for the iterable GtfFile class.
- class eccLib.FastaFile(filename: str, binary: bool = True)¶
Bases: Iterable[tuple[str, str | FastaBuff]]
An iterable FASTA parser. It reads a FASTA file and returns tuples of (header, sequence). Depending on the binary argument, the sequence is returned either as a FastaBuff or as a str. This is the iterative parser for FASTA files.
- __iter__() FastaReader ¶
Prepares the reader for iteration and returns this instance
- __enter__() FastaFile ¶
Opens the file and gets ready for reading
- Returns:
The FastaFile instance
- Return type:
FastaFile
- __exit__(*args, **kwargs) None ¶
Closes the file
- class eccLib.FastaReader(file: TextIOBase, binary: bool = True)¶
Bases: Iterator[tuple[str, str | FastaBuff]]
A reader instance that iteratively returns FASTA records from a file object. This is the iterator for the iterable FastaFile class.
- class eccLib.FastaBuff(seq: str, RNA: bool = False)¶
- class eccLib.FastaBuff(seq: bytes, RNA: bool = False)
- class eccLib.FastaBuff(seq: TextIOBase, RNA: bool = False)
Bases: Sequence[str]
A class that holds a FASTA DNA sequence in an optimal binary format. It is approximately twice as memory efficient as a string representation and should behave approximately like a string.
- __hash__ = None¶
- dump() bytes ¶
Returns the bytes representation of the buffer. This operation discards some information, leaving only a binary representation of the sequence. The exact length of the sequence is lost, so a trailing gap (the . character) may be additionally encoded, but the sequence is still valid.
- Returns:
The binary representation of the sequence
- Return type:
bytes
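One way to picture both the halved memory use and the possible trailing gap is a 4-bit packing that stores two nucleotide codes per byte. The sketch below is only an assumption about how such a format could work: the code table, `pack`, and `unpack` are hypothetical and are not taken from eccLib.

```python
# Hypothetical 4-bit code table: 16 symbols, with the gap at index 0.
CODES = ".ACMGRSVTWYHKDBN"


def pack(seq: str) -> bytes:
    """Pack two 4-bit codes into every byte."""
    nibbles = [CODES.index(ch) for ch in seq.upper()]
    if len(nibbles) % 2:
        # An odd-length sequence is padded with a gap to fill the last byte.
        nibbles.append(0)
    return bytes((hi << 4) | lo for hi, lo in zip(nibbles[::2], nibbles[1::2]))


def unpack(data: bytes) -> str:
    """Decode every byte back into two symbols."""
    return "".join(CODES[b >> 4] + CODES[b & 0x0F] for b in data)


packed = pack("ACGTA")       # 5 symbols stored in 3 bytes
restored = unpack(packed)    # the padding gap comes back as a trailing "."
```

Because the byte string alone does not record whether the last half-byte is padding, decoding an odd-length sequence yields one extra gap character, which matches the behaviour described for dump().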
- index(seq: str | 'FastaBuff', start: int = 0) int | None ¶
Returns the index of the provided sequence
- Parameters:
seq (str | FastaBuff) – The sequence to find
start (int, optional) – The index to start searching from. Defaults to 0.
- Returns:
The index of the sequence or None if not found
- Return type:
int | None
- count(seq: str | 'FastaBuff') int ¶
Counts the occurrences of the provided sequence
- Parameters:
seq (str | FastaBuff) – The sequence to count
- Returns:
The number of occurrences of the sequence
- Return type:
int
- get_annotated(entry: GtfDict | Mapping) str ¶
Returns the annotated sequence of the provided entry
- Parameters:
entry (GtfDict | Mapping) – The annotation to apply
- Returns:
The annotated sequence
- Return type:
str
- find(seq: str | 'FastaBuff') list[int] ¶
Finds all occurrences of the provided sequence
- Parameters:
seq (str | FastaBuff) – The sequence to find
- Returns:
A list of indexes where the sequence was found
- Return type:
list[int]
- __str__() str ¶
Returns the stored sequence
- Returns:
The stored sequence
- Return type:
str
- __eq__(value: Any) bool ¶
Checks if the stored sequence is equal to the provided value
- Parameters:
value (Any) – The value to compare with
- Returns:
True if the stored sequence is equal to the provided value, False otherwise
- Return type:
bool
- __ne__(value: Any) bool ¶
Checks if the stored sequence is not equal to the provided value
- Parameters:
value (Any) – The value to compare with
- Returns:
True if the stored sequence is not equal to the provided value, False otherwise
- Return type:
bool
- __len__() int ¶
Returns the length of the stored sequence
- Returns:
The length of the stored sequence
- Return type:
int
- __getitem__(key: int | slice) str ¶
Returns the sequence at the specified index or slice
- Parameters:
key (int | slice) – The index or slice to retrieve
- Returns:
The sequence at the specified index or slice
- Return type:
str
- __setitem__(key: int, value: str) None ¶
Sets the sequence at the specified index
- Parameters:
key (int) – The index to set
value (str) – The sequence to set
- eccLib.parseFASTA(file: str | TextIOBase, binary: bool = True, echo: TextIOBase | None = None) list[tuple[str, str | eccLib.FastaBuff]] ¶
Parses raw FASTA data and returns a list of all entries. You may either pass raw file data as file, or a file object. Unexpected characters will be ignored. This parser loads the whole file at once into memory.
- Parameters:
file (str | TextIOBase) – The file to parse
binary (bool) – Whether to parse the sequence as a FastaBuff. Defaults to True. Pass False to parse FASTA files that don’t contain exclusively IUPAC codes.
echo (TextIOBase | None, optional) – The IO to output echo into. Defaults to None.
- Returns:
A list containing (title, sequence) tuples
- Return type:
list[tuple[str, FastaBuff | str]]
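The shape of the returned list can be illustrated with a minimal pure-Python FASTA reader. `parse_fasta` below is a hypothetical sketch, not the C parser: it returns plain strings only and does not filter unexpected characters.

```python
def parse_fasta(text: str) -> list[tuple[str, str]]:
    """Hypothetical sketch: return (title, sequence) tuples from raw FASTA text."""
    records = []
    title, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if title is not None:
                records.append((title, "".join(chunks)))
            title, chunks = line[1:], []
        elif line and title is not None:
            # Sequence lines are joined without the line breaks.
            chunks.append(line)
    if title is not None:
        records.append((title, "".join(chunks)))
    return records


raw = ">seq1 example\nACGT\nTTAA\n>seq2\nGGCC\n"
records = parse_fasta(raw)
```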
- eccLib.parseGTF(file: str | TextIOBase, echo: TextIOBase | None = None, attr_tp: Mapping[str, Callable[[str], Any]] | None = None) GtfList ¶
Parses raw GTF, GFF2 and GFF3 data and returns a list containing parsed GtfDicts. You may either pass raw file data as file, or a file object. This parser loads the whole file at once into memory.
- Parameters:
file (str | TextIOBase) – The file to parse
echo (TextIOBase | None, optional) – The IO to output echo into. Defaults to None.
attr_tp (Mapping[str, Callable[[str], Any]] | None, optional) – A mapping of attribute names to type conversion functions. Defaults to None.
- Returns:
A list containing parsed GtfDicts
- Return type:
GtfList
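The role of attr_tp can be sketched as a per-key conversion applied while the attributes column is read. `parse_attributes` below is a hypothetical helper that handles only simple GTF-style `key "value";` pairs; it is not the library's parser.

```python
def parse_attributes(column: str, attr_tp=None) -> dict:
    """Hypothetical sketch: parse a GTF attributes column with type converters."""
    attr_tp = attr_tp or {}
    parsed = {}
    # Limitation of this sketch: values containing ";" are not supported.
    for field in column.strip().split(";"):
        field = field.strip()
        if not field:
            continue
        key, _, raw = field.partition(" ")
        value = raw.strip().strip('"')
        # Keys listed in attr_tp are converted; all others stay strings.
        convert = attr_tp.get(key)
        parsed[key] = convert(value) if convert else value
    return parsed


attrs = parse_attributes(
    'gene_id "ENSG1"; exon_number "3"; level 2;',
    attr_tp={"exon_number": int, "level": int},
)
```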