less than 1 minute read

Suppose you have a large text corpus that won't fit into your computer's small RAM, so you can't process the whole file at once.

Here is a solution for processing a large corpus using a Python generator:

class CorpusProcessing:
    def __init__(self, data_path):
        self.data_path = data_path

    def __iter__(self):
        # open the file with a context manager so it is closed properly;
        # iterating over the file object reads one line at a time,
        # so the whole corpus is never loaded into memory
        with open(self.data_path) as f:
            for line in f:
                # do your processing here
                # here I am doing whitespace tokenization
                tokens = line.split()
                yield tokens

process = CorpusProcessing('large_copora.txt')
for tokens in process:
    print(tokens)
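The same streaming idea can also be written as a plain generator function instead of a class. This is a minimal sketch; the demo file name `demo_corpus.txt` is just a placeholder created for illustration:

```python
def stream_tokens(data_path):
    """Yield one tokenized line at a time, never loading the whole file."""
    with open(data_path) as f:
        for line in f:
            yield line.split()

# demo: write a tiny corpus file, then stream it line by line
with open('demo_corpus.txt', 'w') as f:
    f.write('hello world\nlarge corpora fit in small RAM\n')

for tokens in stream_tokens('demo_corpus.txt'):
    print(tokens)
```

The class-based version is handy when you need to iterate over the corpus more than once (each `for` loop calls `__iter__` again and reopens the file); a generator function is exhausted after a single pass.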

Thanks to

  • Faruk Ahmad vai for forcefully helping me learn Python generators
