How to load vocab.txt file using huggingface transformers AutoTokenizer
Steps

- Download and save an already existing pretrained tokenizer that contains a `vocab.txt`. In our case we will download `sagorsarker/bangla-bert-base`:
```python
from transformers import AutoTokenizer

bnbert_tokenizer = AutoTokenizer.from_pretrained("sagorsarker/bangla-bert-base")
text = "আমি বাংলায় গান গাই।"
bnbert_tokenizer.tokenize(text)
# ['আমি', 'বাংলা', '##য', 'গান', 'গাই', '।']

# save the tokenizer with the following line
bnbert_tokenizer.save_pretrained('bangla-bert-base')
# remember: the folder bangla-bert-base must be created before saving
```
- Create your own folder and copy `special_tokens_map.json` and `tokenizer_config.json` there. Also keep your `vocab.txt` file there.
- Open the `tokenizer_config.json` file and check whether the special token indices match the special token indices in `vocab.txt`. If they do not, note the token indices and update them in `tokenizer_config.json`.
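The folder-assembly step above can be sketched with the standard library. The helper name `assemble_tokenizer_dir` is mine, not part of the transformers API; the source and target folder names are just the ones used in this post:

```python
import shutil
from pathlib import Path

def assemble_tokenizer_dir(src, dst, vocab_file):
    """Copy the two tokenizer config files from `src` and your own
    vocab file (renamed to vocab.txt) into the new folder `dst`."""
    dst = Path(dst)
    dst.mkdir(parents=True, exist_ok=True)
    for name in ("special_tokens_map.json", "tokenizer_config.json"):
        shutil.copy(Path(src) / name, dst / name)
    shutil.copy(vocab_file, dst / "vocab.txt")
    # return the folder contents so you can eyeball that all three files landed
    return sorted(p.name for p in dst.iterdir())
```

After running this against the `bangla-bert-base` folder saved earlier, `mytokenizer` should contain exactly the three files the loader needs.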
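For the index check in the last step, a minimal sketch that reads the indices out of `vocab.txt` (one token per line, so the line number is the token's id; the helper name is mine):

```python
from pathlib import Path

def special_token_ids(vocab_path):
    """Return {token: line-index} for the bracketed BERT special tokens
    found in a one-token-per-line vocab.txt."""
    specials = {"[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"}
    ids = {}
    for i, tok in enumerate(Path(vocab_path).read_text(encoding="utf-8").splitlines()):
        if tok in specials:
            ids[tok] = i
    return ids
```

Compare the ids this reports against whatever indices your `tokenizer_config.json` records, and update the config if they disagree.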
- Now load your tokenizer folder using `AutoTokenizer.from_pretrained()`:

```python
from transformers import AutoTokenizer

bnbert_tokenizer = AutoTokenizer.from_pretrained("mytokenizer")
text = "আমি বাংলায় গান গাই।"
bnbert_tokenizer.tokenize(text)
```
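To see how `vocab.txt` produces the `##`-prefixed pieces in the output above, here is a greedy longest-match WordPiece sketch — an illustration of the algorithm only, not the actual transformers implementation:

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Split one word into WordPiece tokens by greedy longest-match
    against `vocab`; non-initial pieces carry the '##' prefix."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        while start < end:
            sub = word[start:end]
            if start > 0:
                sub = "##" + sub  # continuation pieces are prefixed in vocab.txt
            if sub in vocab:
                piece = sub
                break
            end -= 1  # no match: try a shorter substring
        if piece is None:
            return [unk]  # no prefix of the remainder is in the vocab
        pieces.append(piece)
        start = end
    return pieces

# classic example: "unaffable" -> ['un', '##aff', '##able']
print(wordpiece("unaffable", {"un", "##aff", "##able"}))
```

This is why the indices in `vocab.txt` matter: each emitted piece is looked up by its line number to produce the input ids.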