Bengali Natural Language Processing (BNLP)

BNLP is a natural language processing toolkit for the Bengali language. It helps you tokenize Bengali text, embed Bengali words, and build neural models for Bengali NLP tasks.

Installation

PyPI package installer (tested on Python 3.6 and 3.7)

pip install bnlp_toolkit

Pretrained Model

Trained on a Wikipedia dump dataset

Tokenization

Bengali SentencePiece Tokenization

Tokenization Using a Trained Model

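from bnlp.sentencepiece_tokenizer import SP_Tokenizer

bsp = SP_Tokenizer()
# Path to a trained SentencePiece model file
model_path = "./model/bn_spm.model"
input_text = "আমি ভাত খাই। সে বাজারে যায়।"
# Split the text into subword tokens using the trained model
tokens = bsp.tokenize(model_path, input_text)
print(tokens)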

Training SentencePiece

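from bnlp.sentencepiece_tokenizer import SP_Tokenizer

bsp = SP_Tokenizer(is_train=True)
data = "test.txt"        # plain-text training corpus
model_prefix = "test"    # prefix for the generated model and vocabulary files
vocab_size = 5           # toy value; a real corpus needs a much larger vocabulary
bsp.train_bsp(data, model_prefix, vocab_size)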
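SentencePiece trains on raw text with one sentence per line, so the corpus file may need some preparation first. The following is a minimal sketch of that step, not part of BNLP: the helper name `write_corpus` is hypothetical, and the splitting rule (break after the Bengali danda `।`, `?`, or `!` when followed by whitespace) is a simplifying assumption that will not handle every case.

```python
import re
from pathlib import Path


def write_corpus(raw_text, path):
    """Write raw_text to path, one sentence per line, and return the sentence count.

    Hypothetical helper for illustration; it is not part of BNLP.
    """
    # Split after a danda, question mark, or exclamation mark followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[।?!])\s+", raw_text) if s.strip()]
    # UTF-8 is required for Bengali text.
    Path(path).write_text("\n".join(sentences) + "\n", encoding="utf-8")
    return len(sentences)


count = write_corpus("আমি ভাত খাই। সে বাজারে যায়। তিনি কি সত্যিই ভালো মানুষ?", "test.txt")
print(count)
```

The resulting `test.txt` can then be passed as `data` in the training example above. A three-sentence file is only useful for checking the pipeline; real training needs a much larger corpus.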

NLTK Tokenization

from bnlp.nltk_tokenizer import NLTK_Tokenizer

text = "আমি ভাত খাই। সে বাজারে যায়। তিনি কি সত্যিই ভালো মানুষ?"
bnltk = NLTK_Tokenizer(text)
word_tokens = bnltk.word_tokenize()
sentence_tokens = bnltk.sentence_tokenize()
print(word_tokens)
print(sentence_tokens)

Word Embedding

Bengali Word2Vec

Generate Vector Using Pretrained Model

from bnlp.bengali_word2vec import Bengali_Word2Vec

bwv = Bengali_Word2Vec()
model_path = "model/wiki.bn.text.model"
word = 'আমার'
vector = bwv.generate_word_vector(model_path, word)
print(vector.shape)
print(vector)

Find Most Similar Word Using Pretrained Model

from bnlp.bengali_word2vec import Bengali_Word2Vec

bwv = Bengali_Word2Vec()
model_path = "model/wiki.bn.text.model"
word = 'আমার'
similar = bwv.most_similar(model_path, word)
print(similar)
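Word2Vec implementations typically rank similar words by the cosine similarity of their vectors; whether `most_similar` in BNLP does exactly this is an assumption here. As a rough sketch, the same measure can be computed by hand for two vectors returned by `generate_word_vector`. The function below is illustrative only and accepts any two equal-length sequences of numbers.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between vectors a and b."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Toy vectors standing in for real word vectors.
print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))  # same direction
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # orthogonal
```

Values close to 1 mean the two vectors point in nearly the same direction, and values near 0 mean they are unrelated.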

Train Bengali Word2Vec with your own data

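from bnlp.bengali_word2vec import Bengali_Word2Vec

# Assumption: is_train=True mirrors the other training examples in this README
bwv = Bengali_Word2Vec(is_train=True)
data_file = "test.txt"
model_name = "test_model.model"
vector_name = "test_vector.vector"
bwv.train_word2vec(data_file, model_name, vector_name)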

Bengali FastText

Download Bengali FastText Pretrained Model From Here

Generate Vector Using Pretrained Model

from bnlp.bengali_fasttext import Bengali_Fasttext

bft = Bengali_Fasttext()
word = "গ্রাম"
model_path = "cc.bn.300.bin"
word_vector = bft.generate_word_vector(model_path, word)
print(word_vector.shape)
print(word_vector)

Train Bengali FastText Model

from bnlp.bengali_fasttext import Bengali_Fasttext

bft = Bengali_Fasttext(is_train=True)
data = "data.txt"
model_name = "saved_model.bin"
bft.train_fasttext(data, model_name)

Issue

If you get ModuleNotFoundError: No module named 'fasttext', install it with the following command:

pip install fasttext

If you run into an NLTK error, run the following lines before importing bnlp:

import nltk
nltk.download("punkt")
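Both issues above come from a missing dependency, so it can help to check for them before importing bnlp. The snippet below is a small sketch using only the standard library; the helper name `missing_modules` is hypothetical, and it only checks that the modules can be found, not that NLTK's `punkt` data has been downloaded.

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names from `names` that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_modules(["fasttext", "nltk"])
if missing:
    print("Install first: pip install " + " ".join(missing))
else:
    print("fasttext and nltk are both available.")
```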
