Multiprocessing using concurrent.futures in Python

January 10, 2024

Processing a long list or large chunks of data one item at a time is slow, so running the work concurrently is often essential. In this post I share a simple Python program that tokenizes a list of files concurrently with the concurrent.futures module.

Steps

Import the required modules

import os
import glob
import concurrent.futures
from tqdm import tqdm
from concurrent.futures import ThreadPoolExecutor

Write the function that does your task. In my case it is a file tokenizer that reads a file and returns its tokens.

def file_tokenization(file):
    # Read the whole file and split it on whitespace to get the tokens
    with open(file) as f:
        text = f.read()
    return text.split()
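
To see what the tokenizer returns, here is a minimal sketch that runs it on a throwaway file. The temporary file and its contents are made up for illustration, and the explicit encoding="utf-8" is an assumption (the original relies on the platform default).

```python
import os
import tempfile


def file_tokenization(file):
    # Read the whole file and split it on any whitespace
    with open(file, encoding="utf-8") as f:
        return f.read().split()


# Write a small sample file into a temporary directory and tokenize it
with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = os.path.join(tmp_dir, "sample.txt")
    with open(sample_path, "w", encoding="utf-8") as f:
        f.write("hello  world\nsecond line\n")
    tokens = file_tokenization(sample_path)

print(tokens)  # ['hello', 'world', 'second', 'line']
```

Because split() is called without arguments, repeated spaces and newlines are treated as a single separator.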

Inside the main function, get the CPU count using the os module. It is used as the number of worker threads.

cpu_count = os.cpu_count()
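
One caveat: os.cpu_count() can return None when the count cannot be determined. A small sketch of a guard follows; pick_worker_count and its limit parameter are hypothetical helpers, not part of the original program.

```python
import os


def pick_worker_count(limit=None):
    """Return a usable worker count, falling back to 1 if the CPU count is unknown."""
    detected = os.cpu_count() or 1  # os.cpu_count() may return None
    if limit is None:
        return detected
    # Never go below one worker or above the detected count
    return max(1, min(detected, limit))


print(pick_worker_count())   # number of CPUs, or 1 if unknown
print(pick_worker_count(2))  # at most 2 workers
```

ThreadPoolExecutor also accepts max_workers=None, in which case it picks its own default.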

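Create a thread pool with ThreadPoolExecutor and submit your function once per input file; each submit call returns a future. Wrap as_completed in tqdm to show progress as the futures finish. Threads suit work that is dominated by file reading; for CPU-heavy functions, ProcessPoolExecutor from the same module offers the same interface and runs the work in separate processes.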

# Using ThreadPoolExecutor to tokenize files in parallel
with ThreadPoolExecutor(max_workers=cpu_count) as executor:
    futures = [executor.submit(file_tokenization, file) for file in files]

    # Using tqdm to display progress
    for future in tqdm(concurrent.futures.as_completed(futures), total=len(futures)):
        print(future.result())
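
Note that as_completed yields futures in the order they finish, so the printed results are no longer tied to a file name, and future.result() re-raises any exception from the worker. A minimal sketch of one way to handle both, assuming a hypothetical tokenize_files helper and made-up sample files (tqdm is left out so the sketch only needs the standard library):

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor, as_completed


def file_tokenization(file):
    with open(file, encoding="utf-8") as f:
        return f.read().split()


def tokenize_files(files, max_workers=None):
    """Return ({file: tokens}, {file: exception}) for the given paths."""
    results, errors = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # Remember which file each future belongs to
        future_to_file = {executor.submit(file_tokenization, f): f for f in files}
        for future in as_completed(future_to_file):
            file = future_to_file[future]
            try:
                results[file] = future.result()
            except (OSError, UnicodeDecodeError) as exc:
                # Keep going if one file cannot be read or decoded
                errors[file] = exc
    return results, errors


with tempfile.TemporaryDirectory() as tmp_dir:
    paths = []
    for name, text in [("a.txt", "one two"), ("b.txt", "three four five")]:
        path = os.path.join(tmp_dir, name)
        with open(path, "w", encoding="utf-8") as f:
            f.write(text)
        paths.append(path)
    missing = os.path.join(tmp_dir, "missing.txt")  # deliberately not created
    results, errors = tokenize_files(paths + [missing])

for path in paths:
    print(os.path.basename(path), len(results[path]))
print("failed:", [os.path.basename(p) for p in errors])
```

Keying the dictionary by future keeps the lookup cheap and avoids relying on completion order.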

Full Source Code

import os
import glob
import concurrent.futures
from tqdm import tqdm
from concurrent.futures import ThreadPoolExecutor


def file_tokenization(file):
    # Read the whole file and split it on whitespace to get the tokens
    with open(file) as f:
        text = f.read()
    return text.split()

def main():
    files = glob.glob('sample_data/*')

    # Get CPU count for parallelism
    cpu_count = os.cpu_count()

    # Using ThreadPoolExecutor to tokenize files in parallel
    with ThreadPoolExecutor(max_workers=cpu_count) as executor:
        futures = [executor.submit(file_tokenization, file) for file in files]

        # Using tqdm to display progress
        for future in tqdm(concurrent.futures.as_completed(futures), total=len(futures)):
            print(future.result())

if __name__ == "__main__":
    main()
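
The program above runs its workers as threads. If the per-file function is CPU-heavy, a process pool may be a better fit. The following is a hedged sketch of the same pattern with ProcessPoolExecutor, not the author's code: count_tokens, the generated sample files, and max_workers=2 are assumptions for illustration, and tqdm is omitted to keep it standard-library only.

```python
import os
import tempfile
from concurrent.futures import ProcessPoolExecutor


def count_tokens(file):
    """Return (file, number of whitespace-separated tokens)."""
    with open(file, encoding="utf-8") as f:
        return file, len(f.read().split())


def main():
    with tempfile.TemporaryDirectory() as tmp_dir:
        # Create a few small sample files to process
        files = []
        for i in range(4):
            path = os.path.join(tmp_dir, f"doc_{i}.txt")
            with open(path, "w", encoding="utf-8") as f:
                f.write("token " * (i + 1))
            files.append(path)

        with ProcessPoolExecutor(max_workers=2) as executor:
            # map() yields results in the same order as `files`
            for file, n_tokens in executor.map(count_tokens, files):
                print(os.path.basename(file), n_tokens)


if __name__ == "__main__":
    main()
```

The worker must be a module-level function so it can be pickled, and the pool should be created under the `if __name__ == "__main__":` guard. Returning a count rather than the full token list keeps the data passed back between processes small.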


Sagor Sarker
