Process Large Corpora Using a Python Generator

May 1, 2021

Suppose you have a large text corpus and the file is too big to load into memory on a machine with limited RAM.

Here is a solution that processes the corpus one line at a time using a Python generator, so the whole file never has to sit in memory:

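class CorpusProcessing:
    def __init__(self, data_path):
        self.data_path = data_path

    def __iter__(self):
        # Open inside a with block so the file is closed when iteration ends
        # (encoding is an assumption: the corpus is UTF-8 text)
        with open(self.data_path, encoding="utf-8") as corpus_file:
            # Iterating over the file object reads one line at a time
            for line in corpus_file:
                # do your processing here
                # here I am doing whitespace tokenization
                tokens = line.split()
                yield tokens

process = CorpusProcessing('large_corpora.txt')
for tokens in process:
    print(tokens)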
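
Because tokens are produced lazily, any streaming computation can sit on top of the same idea. Below is a minimal sketch that counts token frequencies in a single pass. The helper names `iter_tokens` and `count_tokens` are hypothetical (they are not part of the class above), and the corpus is assumed to be UTF-8 text with one document or sentence per line.

```python
from collections import Counter
from typing import Iterator, List


def iter_tokens(data_path: str) -> Iterator[List[str]]:
    """Yield the whitespace tokens of each line, one line at a time."""
    # Assumption: the corpus is a UTF-8 encoded text file.
    with open(data_path, encoding="utf-8") as corpus_file:
        for line in corpus_file:
            yield line.split()


def count_tokens(data_path: str) -> Counter:
    """Stream the corpus once and tally how often each token occurs."""
    counts: Counter = Counter()
    for tokens in iter_tokens(data_path):
        # Only the current line's tokens and the running counts are in memory.
        counts.update(tokens)
    return counts
```

One design difference worth noting: a generator function like `iter_tokens` returns a generator that is exhausted after one pass, whereas a class that defines `__iter__` opens the file afresh on every `for` loop, which is convenient when the corpus has to be read several times.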

Thanks to

Faruk Ahmad vai, for pushing me to learn Python generators.

Sagor Sarker