Process Large Corpora Using a Python Generator
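Suppose you have a large text corpus that does not fit in the RAM of your machine, so you cannot load the whole file at once.
Here is a solution for processing a large corpus with a Python generator: read the file one line at a time and yield each processed line, so only the current line is held in memory.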
class CorpusProcessing:
    def __init__(self, data_path):
        self.data_path = data_path

    def __iter__(self):
        # Iterate over the file lazily; "with" closes it when the loop ends
        with open(self.data_path) as f:
            for line in f:
                # do your processing here;
                # this example does whitespace tokenization
                tokens = line.split()
                yield tokens


process = CorpusProcessing('large_copora.txt')
for tokens in process:
    print(tokens)
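The same streaming idea works for more than printing. As a minimal sketch (not part of the original solution), the code below counts token frequencies without ever holding the whole corpus in memory. The names `iter_tokens` and `count_tokens`, the UTF-8 encoding, and the tiny temporary sample file are all assumptions made for illustration.

```python
import os
import tempfile
from collections import Counter


def iter_tokens(path, encoding="utf-8"):
    """Yield one list of whitespace-separated tokens per line of the file."""
    # Hypothetical helper: the encoding is an assumption, adjust it to your corpus
    with open(path, encoding=encoding) as f:
        for line in f:
            yield line.split()


def count_tokens(token_lines):
    """Consume any iterable of token lists and return token frequencies."""
    counts = Counter()
    for tokens in token_lines:
        # Only the current line's tokens are in memory at this point
        counts.update(tokens)
    return counts


# Tiny stand-in corpus so the sketch runs anywhere; a real corpus would be a large file
with tempfile.NamedTemporaryFile(
    "w", suffix=".txt", delete=False, encoding="utf-8"
) as tmp:
    tmp.write("the cat sat\nthe dog sat\n")
    sample_path = tmp.name

counts = count_tokens(iter_tokens(sample_path))
os.remove(sample_path)
print(counts.most_common(2))
```

One design point: the object returned by a generator function such as `iter_tokens` can be consumed only once. A class that defines `__iter__`, like `CorpusProcessing`, starts a fresh pass over the file on every `for` loop, which helps when the corpus has to be read several times.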
Thanks to
- Faruk Ahmad vai, for pushing me to learn Python generators