Conversion guide

Here’s a general-purpose guide for converting eccLib data to more advanced Python types.

Numpy conversion

numpy is a powerful and popular Python library for numerical computing. In fact, we may even assume that if data is easily converted to a numpy.ndarray, it can be used with any modern scientific computing library. As such, eccLib offers simple ways of converting data to numpy arrays. The best way, in our opinion, is to use the eccLib.GtfList.column() method to obtain column views, and then convert those to numpy.ndarray.

import eccLib
import numpy as np

with open("test/example.gtf", "r") as f:
    gtf = eccLib.parseGTF(f)

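# Column views over the parsed GTF entries
start = gtf.column("start")
end = gtf.column("end")

# Materialize the views as fixed-width integer arrays
start_np = np.array(start, dtype=np.int32)
end_np = np.array(end, dtype=np.int32)
widths = end_np - start_np
assert np.all(widths >= 0)
# The vectorized widths should match len() of each individual entry
assert (widths == [len(row) for row in gtf]).all()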

If you wish to obtain full numpy arrays from the GTF data, this is best done by writing column views into a structured numpy.ndarray, with one named field per column.

import eccLib
import numpy as np

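# Structured dtype: one named, typed field per GTF column
dtype = [('start', np.int32), ('end', np.int32), ('score', np.float32)]
out = np.empty(len(gtf), dtype=dtype)
# Fill each field directly from its column view
out['start'] = gtf.column('start')
out['end']   = gtf.column('end')
out['score'] = gtf.column('score')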

assert out.shape == (len(gtf),)
assert out.dtype == dtype

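Alternatively, you can stack the columns into a single two-dimensional numpy.ndarray using numpy.stack(). Keep in mind that a plain ndarray holds a single dtype, so integer and floating-point columns are promoted to a common type.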

import eccLib
import numpy as np

start = gtf.column("start")
end = gtf.column("end")
score = gtf.column("score")
out = np.stack([start, end, score], axis=1)
assert out.shape == (len(gtf), 3)

Pandas conversion

pandas is a popular Python library for data manipulation and analysis. The main feature of that library is the pandas.DataFrame class, which is a two-dimensional table of data. Converting an eccLib.GtfList object to a pandas.DataFrame is straightforward.

import eccLib
import pandas as pd

df = pd.DataFrame({c: gtf.column(c) for c in ["seqname", "start", "end", "feature"]})
assert df.shape == (len(gtf), 4)

PyTorch conversion

torch is a popular Python library for machine learning and deep learning. If you wish to do neural network training or inference, you can convert your eccLib.GtfList object to a torch.Tensor using the code below.

import eccLib
import torch

start = gtf.column("start")
end = gtf.column("end")
out = torch.tensor([start, end], dtype=torch.int32, device="cpu")
assert out.shape == (2, len(gtf))

One snag you may run into here is that torch does not accept None values in its tensors. As such, make sure to replace None values with a suitable default value before converting to a torch.Tensor.
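One way to do the replacement is sketched below. The helper name fill_none is hypothetical and not part of eccLib; it assumes only that a column view can be iterated like any Python sequence. A plain list stands in for gtf.column("score") so the snippet runs on its own.

```python
from typing import Iterable, List, Optional, TypeVar

T = TypeVar("T")


def fill_none(values: Iterable[Optional[T]], default: T) -> List[T]:
    """Return a list copy of `values` with every None replaced by `default`.

    Hypothetical helper for illustration; not provided by eccLib.
    """
    return [default if v is None else v for v in values]


# Stand-in for gtf.column("score"); a plain list is used here as an assumption.
score = [0.5, None, 1.25, None]
clean = fill_none(score, 0.0)
print(clean)  # [0.5, 0.0, 1.25, 0.0]
```

The resulting list can then be passed to torch.tensor(clean, dtype=torch.float32). For floating-point columns, float("nan") may be a better default than 0.0, since it keeps missing values distinguishable from real zeros.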

Why isn’t this native?

In order to ensure maximum stability and limit trouble with conditional compilation, dependency distribution, and dependency versioning, we have chosen to explicitly avoid any third-party Python dependencies in eccLib. As such, eccLib cannot provide native conversion to the types covered in this guide. However, thanks to a rigorous application of the Python collections.abc protocols, conversion to most third-party types should be straightforward in your own code.
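To illustrate why protocol support is enough, here is a minimal sketch built around a hypothetical stand-in class, FakeColumn. It is not eccLib's actual implementation. It defines only the two methods that collections.abc.Sequence requires, and generic consumers such as list(), reversed(), and the in operator then work without any further code.

```python
from collections.abc import Sequence


class FakeColumn(Sequence):
    """Hypothetical stand-in for a column view; not eccLib's real class."""

    def __init__(self, rows, key):
        self._rows = rows
        self._key = key

    def __len__(self):
        return len(self._rows)

    def __getitem__(self, index):
        # Read from the underlying row on every access (a view, not a copy).
        return self._rows[index][self._key]


rows = [{"start": 10, "end": 50}, {"start": 60, "end": 90}]
start = FakeColumn(rows, "start")

# Sequence supplies iteration, membership tests, and index() for free.
assert list(start) == [10, 60]
assert 60 in start
assert start.index(60) == 1
```

Constructors in third-party libraries generally consume their inputs through these same generic protocols, which is why the examples in this guide need no eccLib-specific glue code.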