How to load a vocab.txt file using the Hugging Face transformers AutoTokenizer
Steps
- Download and save an already existing pretrained tokenizer that contains a
  vocab.txt. In our case we will download sagorsarker/bangla-bert-base.

  ```python
  from transformers import AutoTokenizer

  bnbert_tokenizer = AutoTokenizer.from_pretrained("sagorsarker/bangla-bert-base")
  text = "আমি বাংলায় গান গাই।"
  bnbert_tokenizer.tokenize(text)
  # ['আমি', 'বাংলা', '##য', 'গান', 'গাই', '।']

  # save the tokenizer with the following line
  bnbert_tokenizer.save_pretrained('bangla-bert-base')
  # remember: the folder bangla-bert-base must be created before saving
  ```

- Create your own folder and copy special_tokens_map.json and tokenizer_config.json there. Also keep your vocab.txt file there.
- Open tokenizer_config.json and check whether the special token indices match the special token indices in vocab.txt. If they do not, note the token indices and update them in tokenizer_config.json.
- Now load your tokenizer folder using AutoTokenizer.from_pretrained()

  ```python
  from transformers import AutoTokenizer

  bnbert_tokenizer = AutoTokenizer.from_pretrained("mytokenizer")
  text = "আমি বাংলায় গান গাই।"
  bnbert_tokenizer.tokenize(text)
  ```
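The index check in the steps above can also be scripted instead of done by eye. In a WordPiece vocab.txt, a token's index is simply its 0-based line number, so reading the file is enough. The sketch below is a minimal, stdlib-only example; the helper name `special_token_indices` and the demo vocab contents are my own assumptions, not part of the transformers API.

```python
import tempfile
from pathlib import Path

def special_token_indices(vocab_path,
                          special_tokens=("[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]")):
    """Return {token: index} for the special tokens found in a WordPiece vocab.txt.

    In vocab.txt, a token's index is its 0-based line number.
    """
    with open(vocab_path, encoding="utf-8") as f:
        index = {line.rstrip("\n"): i for i, line in enumerate(f)}
    return {tok: index[tok] for tok in special_tokens if tok in index}

# Demo on a tiny hypothetical vocab file; for the real check,
# point this at the vocab.txt inside your own tokenizer folder.
with tempfile.TemporaryDirectory() as d:
    vocab = Path(d) / "vocab.txt"
    vocab.write_text("[PAD]\n[UNK]\n[CLS]\n[SEP]\n[MASK]\nআমি\nগান\n", encoding="utf-8")
    print(special_token_indices(vocab))
    # {'[PAD]': 0, '[UNK]': 1, '[CLS]': 2, '[SEP]': 3, '[MASK]': 4}
```

Compare the printed indices with the ones recorded in your tokenizer_config.json and update the JSON where they disagree.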