Bengali Natural Language Processing (BNLP)
BNLP is a natural language processing toolkit for the Bengali language. It helps you tokenize Bengali text, embed Bengali words, and construct neural models for Bengali NLP tasks.
Installation
- PyPI package installer (tested with Python 3.6 and 3.7)
pip install bnlp_toolkit
Pretrained Model
Trained on a Wikipedia dump dataset
Tokenization
- Bengali SentencePiece Tokenization
  - Tokenization using a trained model
from bnlp.sentencepiece_tokenizer import SP_Tokenizer
bsp = SP_Tokenizer()
model_path = "./model/bn_spm.model"
input_text = "আমি ভাত খাই। সে বাজারে যায়।"
tokens = bsp.tokenize(model_path, input_text)
print(tokens)
  - Training SentencePiece
from bnlp.sentencepiece_tokenizer import SP_Tokenizer
bsp = SP_Tokenizer(is_train=True)
data = "test.txt"
model_prefix = "test"
vocab_size = 5
bsp.train_bsp(data, model_prefix, vocab_size)
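The training call above reads its corpus from a plain text file. As a minimal sketch, assuming the file should be UTF-8 text with one sentence per line (the layout SentencePiece's trainer expects), a small corpus could be written like this. The helper name `write_corpus` and the temporary directory are illustrative only and are not part of BNLP.

```python
import os
import tempfile

# Sample sentences reused from the tokenization examples in this README.
sentences = [
    "আমি ভাত খাই।",
    "সে বাজারে যায়।",
    "তিনি কি সত্যিই ভালো মানুষ?",
]

def write_corpus(path, lines):
    """Write one sentence per line as UTF-8 text (hypothetical helper)."""
    with open(path, "w", encoding="utf-8") as f:
        for line in lines:
            f.write(line.strip() + "\n")
    return path

# A temporary directory keeps the sketch from overwriting a real test.txt.
corpus_path = write_corpus(os.path.join(tempfile.mkdtemp(), "test.txt"), sentences)

# Count the lines that were written to confirm the layout.
with open(corpus_path, encoding="utf-8") as f:
    line_count = sum(1 for _ in f)
print(corpus_path, line_count)
```

A real corpus needs far more text than this. The three sentences only show the file layout.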
- NLTK Tokenization
from bnlp.nltk_tokenizer import NLTK_Tokenizer
text = "আমি ভাত খাই। সে বাজারে যায়। তিনি কি সত্যিই ভালো মানুষ?"
bnltk = NLTK_Tokenizer(text)
word_tokens = bnltk.word_tokenize()
sentence_tokens = bnltk.sentence_tokenize()
print(word_tokens)
print(sentence_tokens)
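To make the difference between word tokens and sentence tokens concrete, here is a rough standard-library sketch. It is not how NLTK_Tokenizer works internally. It assumes that sentences end with a danda (।), a question mark, or an exclamation mark, and the function names `split_sentences` and `split_words` are made up for this illustration.

```python
import re

def split_sentences(text):
    # Split on whitespace that follows a danda, question mark, or exclamation mark.
    parts = re.split(r"(?<=[।?!])\s+", text.strip())
    return [p for p in parts if p]

def split_words(text):
    # Put spaces around punctuation so each mark becomes its own token,
    # then split on whitespace.
    spaced = re.sub(r"([।?!,])", r" \1 ", text)
    return spaced.split()

text = "আমি ভাত খাই। সে বাজারে যায়। তিনি কি সত্যিই ভালো মানুষ?"
sentences = split_sentences(text)
words = split_words(text)
print(sentences)
print(words)
```

For real text, the NLTK-based tokenizer above is likely to handle more cases than this regex approach.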
Word Embedding
- Bengali Word2Vec
  - Generate a vector using the pretrained model
from bnlp.bengali_word2vec import Bengali_Word2Vec
bwv = Bengali_Word2Vec()
model_path = "model/wiki.bn.text.model"
word = 'আমার'
vector = bwv.generate_word_vector(model_path, word)
print(vector.shape)
print(vector)
  - Find the most similar words using the pretrained model
from bnlp.bengali_word2vec import Bengali_Word2Vec
bwv = Bengali_Word2Vec()
model_path = "model/wiki.bn.text.model"
word = 'আমার'
similar = bwv.most_similar(model_path, word)
print(similar)
  - Train Bengali Word2Vec with your own data
from bnlp.bengali_word2vec import Bengali_Word2Vec
# Create the wrapper object before calling train_word2vec
bwv = Bengali_Word2Vec()
data_file = "test.txt"
model_name = "test_model.model"
vector_name = "test_vector.vector"
bwv.train_word2vec(data_file, model_name, vector_name)
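For intuition about the "most similar" lookup shown earlier in this section, the sketch below ranks words by cosine similarity, which is the usual convention for word2vec models. It assumes BNLP follows that convention. The three-dimensional vectors are invented toy values, not output of the pretrained model, and the function names are illustrative.

```python
import math

def cosine_similarity(a, b):
    # Dot product divided by the product of the vector lengths.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def most_similar(word, vectors, topn=2):
    # Score every other word against the query and keep the best topn.
    query = vectors[word]
    scored = [
        (other, cosine_similarity(query, vec))
        for other, vec in vectors.items()
        if other != word
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:topn]

# Toy vectors (hypothetical values chosen for illustration only).
toy_vectors = {
    "আমার": [1.0, 0.0, 0.0],
    "তোমার": [0.9, 0.1, 0.0],
    "গ্রাম": [0.0, 1.0, 0.0],
}
ranked = most_similar("আমার", toy_vectors)
print(ranked)
```

Real embeddings have hundreds of dimensions, but the ranking is computed the same way.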
- Bengali FastText
  - Download the Bengali FastText pretrained model (the example below uses cc.bn.300.bin)
  - Generate a vector using the pretrained model
from bnlp.bengali_fasttext import Bengali_Fasttext
bft = Bengali_Fasttext()
word = "গ্রাম"
model_path = "cc.bn.300.bin"
word_vector = bft.generate_word_vector(model_path, word)
print(word_vector.shape)
print(word_vector)
  - Train a Bengali FastText model
from bnlp.bengali_fasttext import Bengali_Fasttext
bft = Bengali_Fasttext(is_train=True)
data = "data.txt"
model_name = "saved_model.bin"
bft.train_fasttext(data, model_name)
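FastText models can produce a vector for a word that was never seen in training because they represent each word through its character n-grams. The sketch below only illustrates that idea. It is not the fastText implementation, and the function name `char_ngrams` is hypothetical. The default lengths of 3 to 6 mirror fastText's usual n-gram range.

```python
def char_ngrams(word, n_min=3, n_max=6):
    # Mark the word boundaries so prefixes and suffixes get distinct n-grams.
    marked = "<" + word + ">"
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(marked) - n + 1):
            grams.append(marked[i:i + n])
    return grams

# Python strings are indexed by Unicode code point, so Bengali vowel signs
# and the hasanta count as separate characters here.
grams = char_ngrams("গ্রাম")
print(grams)
```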
Issue
- If this error occurs:
ModuleNotFoundError: No module named 'fasttext'
install fasttext with:
pip install fasttext
- If an nltk error occurs, run the following lines before importing bnlp:
import nltk
nltk.download("punkt")
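Both issues above come from a missing dependency, so it can help to check for the modules before importing bnlp. This is a minimal standard-library sketch. The helper name `missing_modules` is made up for illustration and is not part of BNLP.

```python
import importlib.util

def missing_modules(names):
    # find_spec returns None when a top-level module cannot be found.
    return [name for name in names if importlib.util.find_spec(name) is None]

missing = missing_modules(["fasttext", "nltk"])
if missing:
    print("Install before using bnlp: pip install " + " ".join(missing))
else:
    print("fasttext and nltk are available")
```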