Conversion guide
Here’s a general-purpose guide for converting eccLib data to more
advanced Python types.
Numpy conversion
numpy is a powerful and popular Python library for numerical
computing. In fact, data that converts easily to a numpy.ndarray can be
used with most modern scientific computing libraries. As such, eccLib
offers simple ways of converting data to numpy arrays. The best way, in
our opinion, is to use the eccLib.GtfList.column() method to obtain
column views, and then convert those to numpy.ndarray.
import eccLib
import numpy as np
with open("test/example.gtf", "r") as f:
    gtf = eccLib.parseGTF(f)
start = gtf.column("start")
end = gtf.column("end")
start_np = np.array(start, dtype=np.int32)
end_np = np.array(end, dtype=np.int32)
widths = end_np - start_np
assert np.all(widths >= 0)
assert (widths == [len(row) for row in gtf]).all()
If you wish to obtain a full numpy array from the GTF data, this is best
done by preallocating a structured numpy.ndarray and writing the column
views into its fields (gtf is the list parsed in the first example).
import eccLib
import numpy as np
dtype = [('start', np.int32), ('end', np.int32), ('score', np.float32)]
out = np.empty(len(gtf), dtype=dtype)
out['start'] = gtf.column('start')
out['end'] = gtf.column('end')
out['score'] = gtf.column('score')
assert out.shape == (len(gtf),)
assert out.dtype == dtype
Alternatively, you can stack the columns into a single two-dimensional
numpy.ndarray using numpy.stack().
import eccLib
import numpy as np
start = gtf.column("start")
end = gtf.column("end")
score = gtf.column("score")
out = np.stack([start, end, score], axis=1)
assert out.shape == (len(gtf), 3)
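The two approaches differ in how they treat column types: numpy.stack() returns a homogeneous array, so integer and float columns are promoted to one common dtype, whereas a structured array keeps a dtype per field. The sketch below illustrates this with plain Python lists as hypothetical stand-ins for the column views (the values are invented), so it runs without a GTF file.

```python
import numpy as np

# Hypothetical stand-ins for gtf.column("start"), gtf.column("end") and
# gtf.column("score"); the values are invented for illustration.
start = [100, 250, 400]
end = [180, 300, 450]
score = [0.5, 1.25, 3.0]

# Stacking coerces every column to a single common dtype (float64 here,
# because one of the columns holds floats).
stacked = np.stack([start, end, score], axis=1)

# A structured array keeps one dtype per field.
dtype = [("start", np.int32), ("end", np.int32), ("score", np.float32)]
structured = np.empty(len(start), dtype=dtype)
structured["start"] = start
structured["end"] = end
structured["score"] = score

print(stacked.dtype)              # common dtype after promotion
print(structured["start"].dtype)  # per-field dtype is preserved
```

If the integer columns must stay integers, the structured array is likely the safer choice; stacking is convenient when every column is already numeric and a plain 2D matrix is what the downstream library expects.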
Pandas conversion
pandas is a popular Python library for data manipulation and analysis.
The main feature of that library is the pandas.DataFrame class,
which is a two-dimensional table of data. Converting an eccLib.GtfList
object to a pandas.DataFrame is straightforward.
import eccLib
import pandas as pd
df = pd.DataFrame({c: gtf.column(c) for c in ["seqname", "start", "end", "feature"]})
assert df.shape == (len(gtf), 4)
PyTorch conversion
torch is a popular Python library for machine learning and deep
learning. If you wish to do neural network training or inference, you can
convert your eccLib.GtfList object to a torch.Tensor
using the code below.
import eccLib
import torch
start = gtf.column("start")
end = gtf.column("end")
out = torch.tensor([start, end], dtype=torch.int32, device="cpu")
assert out.shape == (2, len(gtf))
One snag you may run into here is that torch does not accept
None values in its tensors. As such, make sure to replace None values with
a suitable default value before converting to a torch.Tensor.
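A minimal sketch of that replacement step, using only the standard library. The helper name fill_none is hypothetical (it is not part of eccLib or torch), and the list below is an invented stand-in for a column view that contains missing values.

```python
from typing import Iterable, List, Optional, TypeVar

T = TypeVar("T")


def fill_none(values: Iterable[Optional[T]], default: T) -> List[T]:
    """Return a list where every None in `values` is replaced by `default`."""
    return [default if v is None else v for v in values]


# Hypothetical stand-in for gtf.column("score"), with missing values as None.
score = [0.5, None, 3.0, None]

filled = fill_none(score, 0.0)
print(filled)  # [0.5, 0.0, 3.0, 0.0]
```

The resulting list holds only numbers and can then be passed to torch.tensor() as in the example above. Which default is suitable (0, -1, or NaN for float columns) depends on your model, so treat 0.0 here as a placeholder.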
Why isn’t this native?
To ensure maximum stability and to limit trouble with conditional
compilation, dependency distribution and dependency versioning, we have
explicitly chosen not to use any third-party Python dependencies in eccLib.
As such, eccLib cannot provide native conversion to the types covered in
this guide. However, thanks to a rigorous application of Python
collections.abc protocols, conversion to most third-party types should
be straightforward in your own code.
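To illustrate why following collections.abc protocols makes conversion easy, here is a small sketch of a read-only column view built on collections.abc.Sequence. The ColumnView class is hypothetical and is not eccLib's implementation; it only assumes that a column view behaves like a Sequence, which is what generic consumers of sequences rely on.

```python
from collections.abc import Sequence


class ColumnView(Sequence):
    """Hypothetical read-only view over one key of a list of row dicts.

    This is NOT eccLib's implementation, only an illustration of the protocol.
    """

    def __init__(self, rows, key):
        self._rows = rows
        self._key = key

    def __len__(self):
        return len(self._rows)

    def __getitem__(self, index):
        if isinstance(index, slice):
            return [row[self._key] for row in self._rows[index]]
        return self._rows[index][self._key]


rows = [{"start": 1, "end": 5}, {"start": 10, "end": 20}]
starts = ColumnView(rows, "start")

# __len__ and __getitem__ are enough: Sequence supplies iteration,
# membership tests, index() and count() as mixin methods.
print(list(starts))      # [1, 10]
print(10 in starts)      # True
print(starts.index(10))  # 1
```

Because the view has a length and supports indexing and iteration, code that accepts a generic sequence can consume it without knowing anything about the class itself.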