Word Embedding

Bengali Word2Vec

Generate Vector Using Pretrained Model

To use the pretrained model, do not pass model_path to BengaliWord2Vec(); it will download the pretrained BengaliWord2Vec model automatically.

from bnlp import BengaliWord2Vec

bwv = BengaliWord2Vec()

word = 'গ্রাম'
vector = bwv.get_word_vector(word)
print(vector.shape)

Find Most Similar Word Using Pretrained Model

To use the pretrained model, do not pass model_path to BengaliWord2Vec(); it will download the pretrained BengaliWord2Vec model automatically.

from bnlp import BengaliWord2Vec

bwv = BengaliWord2Vec()

word = 'গ্রাম'
similar_words = bwv.get_most_similar_words(word, topn=10)
print(similar_words)

Generate Vector Using Own Model

To use your own model, pass its path as the model_path argument to BengaliWord2Vec(), as in the snippet below.

from bnlp import BengaliWord2Vec

own_model_path = "own_directory/own_bwv_model.pkl"
bwv = BengaliWord2Vec(model_path=own_model_path)

word = 'গ্রাম'
vector = bwv.get_word_vector(word)
print(vector.shape)

Find Most Similar Word Using Own Model

To use your own model, pass its path as the model_path argument to BengaliWord2Vec(), as in the snippet below.

from bnlp import BengaliWord2Vec

own_model_path = "own_directory/own_bwv_model.pkl"
bwv = BengaliWord2Vec(model_path=own_model_path)

word = 'গ্রাম'
similar_words = bwv.get_most_similar_words(word, topn=10)
print(similar_words)

Train Bengali Word2Vec with your own data

Train Bengali word2vec with your custom raw data or tokenized sentences.

Custom tokenized sentence format example:

sentences = [['আমি', 'ভাত', 'খাই', '।'], ['সে', 'বাজারে', 'যায়', '।']]

Check the gensim word2vec API for details of the training parameters.

from bnlp import Word2VecTraining

trainer = Word2VecTraining()

data_file = "raw_text.txt" # or you can pass custom sentence tokens as list of list
model_name = "test_model.model"
vector_name = "test_vector.vector"
trainer.train(data_file, model_name, vector_name, epochs=5)
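
The trainer also accepts the tokenized corpus directly, as the comment above notes. A minimal sketch, assuming train() takes the list-of-lists format shown earlier (the file names here are illustrative):

from bnlp import Word2VecTraining

trainer = Word2VecTraining()

# tokenized sentences: each inner list is one sentence
sentences = [['আমি', 'ভাত', 'খাই', '।'], ['সে', 'বাজারে', 'যায়', '।']]
model_name = "token_model.model"
vector_name = "token_vector.vector"
trainer.train(sentences, model_name, vector_name, epochs=5)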

Pre-train or resume Word2Vec training with the same or a new corpus or tokenized sentences

Check the gensim word2vec API for details of the training parameters.

from bnlp import Word2VecTraining

trainer = Word2VecTraining()

trained_model_path = "mytrained_model.model"
data_file = "raw_text.txt"
model_name = "test_model.model"
vector_name = "test_vector.vector"
trainer.pretrain(trained_model_path, data_file, model_name, vector_name, epochs=5)
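
To verify the resumed model, one option is to load the saved .model file with gensim directly. This is a sketch under the assumption that the saved file is a standard gensim Word2Vec model (the trainer follows the gensim word2vec API referenced above):

from gensim.models import Word2Vec

# assumption: test_model.model is a regular gensim Word2Vec model file
model = Word2Vec.load("test_model.model")
print(model.wv['গ্রাম'].shape)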

Bengali FastText

To use FastText you need to install fasttext manually with pip install fasttext==0.9.2, or install it via bnlp with pip install bnlp_toolkit[fasttext].

NB: To use fasttext on Windows, install fasttext by following this article.

Generate Vector Using Pretrained Model

To use the pretrained model, do not pass model_path to BengaliFasttext(); it will download the pretrained BengaliFasttext model automatically.

from bnlp.embedding.fasttext import BengaliFasttext

bft = BengaliFasttext()

word = "গ্রাম"
word_vector = bft.get_word_vector(word)
print(word_vector.shape)

Generate Vector File from Fasttext Binary Model

To use the pretrained model, do not pass model_path to BengaliFasttext(); it will download the pretrained BengaliFasttext model automatically.

from bnlp.embedding.fasttext import BengaliFasttext

bft = BengaliFasttext()

out_vector_name = "myvector.txt"
bft.bin2vec(out_vector_name)
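
The exported text file can then be consumed by other tools. A sketch, assuming bin2vec writes the vectors in the standard word2vec text format that gensim can read:

from gensim.models import KeyedVectors

# assumption: myvector.txt is in word2vec text format
kv = KeyedVectors.load_word2vec_format("myvector.txt", binary=False)
print(kv['গ্রাম'].shape)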

Generate Vector Using Own Model

To use your own model, pass its path as the model_path argument to BengaliFasttext(), as in the snippet below.

from bnlp.embedding.fasttext import BengaliFasttext

own_model_path = "own_directory/own_fasttext_model.bin"
bft = BengaliFasttext(model_path=own_model_path)

word = "গ্রাম"
word_vector = bft.get_word_vector(word)
print(word_vector.shape)

Generate Vector File from Own Fasttext Binary Model

To use your own model, pass its path as the model_path argument to BengaliFasttext(), as in the snippet below.

from bnlp.embedding.fasttext import BengaliFasttext

own_model_path = "own_directory/own_fasttext_model.bin"
bft = BengaliFasttext(model_path=own_model_path)

out_vector_name = "myvector.txt"
bft.bin2vec(out_vector_name)

Train Bengali FastText Model

Check the fasttext documentation for details of the training parameters.

from bnlp.embedding.fasttext import FasttextTrainer

trainer = FasttextTrainer()

data = "raw_text.txt"
model_name = "saved_model.bin"
epoch = 50
trainer.train(data, model_name, epoch)
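
After training, the saved binary can be loaded back through BengaliFasttext, as in the own-model examples above (the path here matches the snippet above):

from bnlp.embedding.fasttext import BengaliFasttext

# load the model trained above
bft = BengaliFasttext(model_path="saved_model.bin")
word_vector = bft.get_word_vector("গ্রাম")
print(word_vector.shape)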

Bengali GloVe Word Vectors

We trained a GloVe model on Bengali data (Wikipedia and news articles) and published the Bengali GloVe word vectors.
You can download and use them for your own machine learning purposes.

from bnlp import BengaliGlove

bengali_glove = BengaliGlove() # will automatically download pretrained model

word = "গ্রাম"
vector = bengali_glove.get_word_vector(word)
print(vector.shape)

similar_words = bengali_glove.get_closest_word(word)
print(similar_words)