Processing tips for Hugging Face datasets
Tips-1: Take a sample of data from a dataset
from datasets import load_dataset
dataset = load_dataset("data_name", split="split_name")
# randomize the data
shuffled_dataset = dataset.shuffle(seed=42)
sample_size = 500
sampled_dataset = shuffled_dataset.select(range(sample_size))  # split was already selected in load_dataset, so no ['train'] indexing
# now process this sampled_dataset according to your needs
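For a concrete illustration, here is the same recipe end to end; the dataset name ("imdb") and the sample size are just assumptions for the example:
from datasets import load_dataset

# illustrative only: sample 500 random examples from the IMDB train split
imdb_train = load_dataset("imdb", split="train")
imdb_sample = imdb_train.shuffle(seed=42).select(range(500))
print(len(imdb_sample))  # 500
print(imdb_sample[0])    # a single example as a dict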
Tips-2: Apply map over a dataset
from datasets import load_dataset
dataset = load_dataset("data_name", split="split_name")
def tokenize_data(example):
    tokens = example['text'].split()
    example['tokens'] = tokens
    return example
# you can also return a new example dictionary to save memory, e.g.
# return {'tokens': tokens}
num_proc = 10  # number of processes (for multiprocessing)
tokenized_dataset = dataset.map(tokenize_data, num_proc=num_proc)
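map can also process batches of examples, which is usually faster for real tokenizers; a minimal sketch, assuming the same 'text' column and a simple whitespace tokenizer:
def tokenize_batch(batch):
    # with batched=True, batch['text'] is a list of strings
    return {'tokens': [text.split() for text in batch['text']]}

tokenized_dataset = dataset.map(tokenize_batch, batched=True, batch_size=1000, num_proc=num_proc)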
Upload a large dataset to the Hub
- Load the large dataset
- Shard and save the dataset to local disk:

from datasets import load_dataset
dataset = load_dataset(
    "json",
    data_files="/path/largedata.jsonl",
    cache_dir="/path/cached"
)
dataset.save_to_disk("/path/mybndatasets", max_shard_size="10GB", num_proc=64)

- rename all data files to the train-**-**.arrow format instead of the data-**-**.arrow format (see the sketch after this list)
- clone the Hugging Face dataset repository and move all the files to the data folder
- commit and push the dataset
- The repository directory structure will be as follows:

data/
    train-00000-of-00001.arrow
    test-00000-of-00001.arrow
.gitattributes
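A small sketch of the rename step; the shard directory is the placeholder path from above and the target split name "train" is an assumption:
import glob
import os

# rename data-XXXXX-of-XXXXX.arrow shards to train-XXXXX-of-XXXXX.arrow
for path in sorted(glob.glob("/path/mybndatasets/data-*.arrow")):
    directory, filename = os.path.split(path)
    os.rename(path, os.path.join(directory, filename.replace("data-", "train-", 1)))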
Save dataset as CSV
import pandas as pd
from datasets import load_dataset
dataset = load_dataset('hishab/boolq_bn')
# convert the validation split to a pandas DataFrame and write it to CSV
df = pd.DataFrame(dataset['validation'])
df.to_csv('test.csv', index=False)
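As an alternative sketch, the datasets library can also write a split to CSV directly, without going through pandas; this assumes the same 'hishab/boolq_bn' dataset and output path as above:
from datasets import load_dataset

dataset = load_dataset('hishab/boolq_bn')
# Dataset.to_csv writes the split straight to a CSV file
dataset['validation'].to_csv('test.csv', index=False)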
Dataset Subset and Split
This is the best option for pushing a dataset with multiple subsets (configs) to the Hub.
from datasets import Dataset, DatasetDict
# load datasets from jsonl files
mydata = Dataset.from_json(
    "/path/mydataset/*.jsonl",
    cache_dir="/path/cache_dir"
)
mydata2 = Dataset.from_json(
    "/path/mydataset2/*.jsonl",
    cache_dir="/path/cache_dir"
)
# now assign each dataset to a subset (config) and a split
# create a DatasetDict per subset, with "train" as the split
mydata_dict = DatasetDict({
    "train": mydata
})
mydata2_dict = DatasetDict({
    "train": mydata2
})
# push to hub: each config_name becomes a subset of the same repo
mydata_dict.push_to_hub("sagor/mydata", config_name="mydata")
mydata2_dict.push_to_hub("sagor/mydata", config_name="mydata2")
In the dataset viewer, mydata and mydata2 will show up as separate subsets, each with a train split.
NB: You need to log in to Hugging Face first with
huggingface-cli login
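To read one of the pushed subsets back later, pass the config name to load_dataset; a minimal sketch using the repo and config names from the example above:
from datasets import load_dataset

# load only the "mydata2" subset (config) of the pushed repository
mydata2_train = load_dataset("sagor/mydata", "mydata2", split="train")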
Git LFS problems with JSONL files
If you clone the repo and push with git, then these steps might help:
- Before adding files: git lfs install
- Also run: huggingface-cli lfs-enable-largefiles .
- For JSONL files, update the .gitattributes file with: git lfs track "*.jsonl"
- Now add, commit and push
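If git LFS keeps misbehaving, an alternative not covered in the notes above is to skip git entirely and upload the JSONL file through the huggingface_hub Python client; the repo ID and file paths below are placeholders:
from huggingface_hub import HfApi

api = HfApi()
# upload one large JSONL file directly to a dataset repo, no local git/LFS involved
api.upload_file(
    path_or_fileobj="/path/largedata.jsonl",
    path_in_repo="data/largedata.jsonl",
    repo_id="sagor/mydata",
    repo_type="dataset",
)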