int32 overflow when using a large enough dataset
Hi,
The
output_dataset = Dataset.from_dict(dataset_dict)
call from within the Tokenizer causes an int32 overflow when the dataset is too big (exceeding max(int32) in size).
This is due to this function in the Huggingface code:
def numpy_to_pyarrow_listarray(arr: np.ndarray, type: pa.DataType = None) -> pa.ListArray:
    """Build a PyArrow ListArray from a multidimensional NumPy array"""
    arr = np.array(arr)
    values = pa.array(arr.flatten(), type=type)
    for i in range(arr.ndim - 1):
        n_offsets = reduce(mul, arr.shape[: arr.ndim - i - 1], 1)
        step_offsets = arr.shape[arr.ndim - i - 1]
        offsets = pa.array(np.arange(n_offsets + 1) * step_offsets, type=pa.int32())
        values = pa.ListArray.from_arrays(offsets, values)
    return values
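To make the size limit concrete, here is a minimal sketch of the arithmetic behind the offsets line in that function. The helper name `offsets_fit_int32` and the row length of 2048 are assumptions chosen for illustration only; neither comes from the tokenizer or from the Datasets code.

```python
import numpy as np

# Largest value a signed 32-bit integer can hold.
INT32_MAX = np.iinfo(np.int32).max  # 2147483647


def offsets_fit_int32(n_rows: int, row_len: int) -> bool:
    """Return True if the last offset of an (n_rows, row_len) array fits in int32.

    Hypothetical helper: it mirrors `np.arange(n_offsets + 1) * step_offsets`,
    whose largest element is n_rows * row_len.
    """
    return n_rows * row_len <= INT32_MAX


# Assumed row length of 2048, used only as an example value.
print(offsets_fit_int32(1_000_000, 2048))  # 2,048,000,000 is below the limit
print(offsets_fit_int32(1_100_000, 2048))  # 2,252,800,000 is above the limit
```

Under these assumed numbers the limit is crossed somewhere between one million and 1.1 million rows, since the final offset equals the total number of flattened elements.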
I found that if the values in the dictionary are saved as np.array rather than Python lists, this doesn't happen:
## DEBUG
print('Changing lists to np.arrays..')
dataset_dict['input_ids'] = np.array(dataset_dict['input_ids'], dtype='object')
dataset_dict['gene'] = np.array(dataset_dict['gene'], dtype='object')
## DEBUG
# create dataset
output_dataset = Dataset.from_dict(dataset_dict)
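As a side note on this workaround, the sketch below shows how NumPy shapes an object array built from lists, using toy data rather than real tokenizer output. Rows of differing lengths give a 1-D array of Python lists, while equal-length rows still give a 2-D array, so the effect of the conversion may depend on the data. Whether this is the reason the overflow disappears is an assumption that has not been checked against the Datasets source.

```python
import numpy as np

# Toy rows standing in for tokenized cells (hypothetical values).
ragged_rows = [[1, 2, 3], [4, 5]]  # rows of different lengths
equal_rows = [[1, 2], [3, 4]]      # rows of identical length

ragged = np.array(ragged_rows, dtype="object")
equal = np.array(equal_rows, dtype="object")

# Ragged input stays 1-D: each element is still a Python list.
print(ragged.ndim, ragged.shape)

# Equal-length input is expanded into a 2-D object array.
print(equal.ndim, equal.shape)
```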
Has anyone come across this?
Kind regards,
Eyal.
Thank you for noting this! We did not come across this when tokenizing Genecorpus-30M. It would be very helpful if you could check the Huggingface Datasets issues to see if there is a suggested solution to this, or open a new issue if this question has not already been raised. It would be great if you could update this discussion with any resolution you may find from Huggingface to help other users who may encounter this issue.

