How to load a vocab.txt file using the Hugging Face transformers AutoTokenizer
Steps
- Download and save an already existing pretrained tokenizer that contains a
  vocab.txt. In our case we will download sagorsarker/bangla-bert-base.

  ```python
  from transformers import AutoTokenizer

  bnbert_tokenizer = AutoTokenizer.from_pretrained("sagorsarker/bangla-bert-base")
  text = "আমি বাংলায় গান গাই।"
  bnbert_tokenizer.tokenize(text)
  # ['আমি', 'বাংলা', '##য', 'গান', 'গাই', '।']

  # save the tokenizer with the following line
  bnbert_tokenizer.save_pretrained('bangla-bert-base')
  # remember: the folder bangla-bert-base must be created before saving
  ```

- Create your own folder and copy special_tokens_map.json and tokenizer_config.json there. Also keep your vocab.txt file there.
- Open tokenizer_config.json and check whether the special token indices match the special token indices in vocab.txt. If they do not, note the token indices and update them in tokenizer_config.json.
- Now load your tokenizer folder using AutoTokenizer.from_pretrained()

  ```python
  from transformers import AutoTokenizer

  bnbert_tokenizer = AutoTokenizer.from_pretrained("mytokenizer")
  text = "আমি বাংলায় গান গাই।"
  bnbert_tokenizer.tokenize(text)
  ```
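The index check in the steps above can also be scripted instead of done by eye. In a WordPiece vocab.txt, a token's index is simply its 0-based line number, so reading the file is enough. The sketch below is a minimal, stdlib-only example; the helper name `special_token_indices` and the demo vocab contents are my own assumptions, not part of the transformers API.

```python
import tempfile
from pathlib import Path

def special_token_indices(vocab_path,
                          special_tokens=("[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]")):
    """Return {token: index} for the special tokens found in a WordPiece vocab.txt.

    In vocab.txt, a token's index is its 0-based line number.
    """
    with open(vocab_path, encoding="utf-8") as f:
        index = {line.rstrip("\n"): i for i, line in enumerate(f)}
    return {tok: index[tok] for tok in special_tokens if tok in index}

# Demo on a tiny hypothetical vocab file; for the real check,
# point this at the vocab.txt inside your own tokenizer folder.
with tempfile.TemporaryDirectory() as d:
    vocab = Path(d) / "vocab.txt"
    vocab.write_text("[PAD]\n[UNK]\n[CLS]\n[SEP]\n[MASK]\nআমি\nগান\n", encoding="utf-8")
    print(special_token_indices(vocab))
    # {'[PAD]': 0, '[UNK]': 1, '[CLS]': 2, '[SEP]': 3, '[MASK]': 4}
```

Compare the printed indices with the ones recorded in your tokenizer_config.json and update the JSON where they disagree.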