Word Embedding
Bengali Word2Vec
Generate Vector Using Pretrain Model
To use pretrained model do not pass model_path
to BengaliWord2Vec()
. It will download pretrained BengaliWord2Vec
model itself.
from bnlp import BengaliWord2Vec
bwv = BengaliWord2Vec()
word = 'গ্রাম'
vector = bwv.get_word_vector(word)
print(vector.shape)
Find Most Similar Word Using Pretrained Model
To use pretrained model do not pass model_path
to BengaliWord2Vec()
. It will download pretrained BengaliWord2Vec
model itself.
from bnlp import BengaliWord2Vec
bwv = BengaliWord2Vec()
word = 'গ্রাম'
similar_words = bwv.get_most_similar_words(word, topn=10)
print(similar_words)
Generate Vector Using Own Model
To use own model pass model path as model_path
argument to BengaliWord2Vec()
like below snippet
from bnlp import BengaliWord2Vec
own_model_path = "own_directory/own_bwv_model.pkl"
bwv = BengaliWord2Vec(model_path=own_model_path)
word = 'গ্রাম'
vector = bwv.get_word_vector(word)
print(vector.shape)
Find Most Similar Word Using Own Model
To use own model pass model path as model_path
argument to BengaliWord2Vec()
like below snippet
from bnlp import BengaliWord2Vec
own_model_path = "own_directory/own_bwv_model.pkl"
bwv = BengaliWord2Vec(model_path=own_model_path)
word = 'গ্রাম'
similar_words = bwv.get_most_similar_words(word, topn=10)
print(similar_words)
Train Bengali Word2Vec with your own data
Train Bengali word2vec with your custom raw data or tokenized sentences.
Custom tokenized sentence format example:
sentences = [['আমি', 'ভাত', 'খাই', '।'], ['সে', 'বাজারে', 'যায়', '।']]
Check gensim word2vec api for details of training parameter
from bnlp import Word2VecTraining
trainer = Word2VecTraining()
data_file = "raw_text.txt" # or you can pass custom sentence tokens as list of list
model_name = "test_model.model"
vector_name = "test_vector.vector"
trainer.train(data_file, model_name, vector_name, epochs=5)
Pre-train or resume word2vec training with same or new corpus or tokenized sentences
Check gensim word2vec api for details of training parameter
from bnlp import Word2VecTraining
trainer = Word2VecTraining()
trained_model_path = "mytrained_model.model"
data_file = "raw_text.txt"
model_name = "test_model.model"
vector_name = "test_vector.vector"
trainer.pretrain(trained_model_path, data_file, model_name, vector_name, epochs=5)
Bengali FastText
To use fasttext
you need to install fasttext manually by pip install fasttext==0.9.2
or install via bnlp by pip install bnlp_toolkit[fasttext]
NB: To use fasttext
on windows
, install fasttext
by following this article.
Generate Vector Using Pretrained Model
To use pretrained model do not pass model_path
to BengaliFasttext()
. It will download pretrained BengaliFasttext
model itself.
from bnlp.embedding.fasttext import BengaliFasttext
bft = BengaliFasttext()
word = "গ্রাম"
word_vector = bft.get_word_vector(word)
print(word_vector.shape)
Generate Vector File from Fasttext Binary Model
To use pretrained model do not pass model_path
to BengaliFasttext()
. It will download pretrained BengaliFasttext
model itself.
from bnlp.embedding.fasttext import BengaliFasttext
bft = BengaliFasttext()
out_vector_name = "myvector.txt"
bft.bin2vec(out_vector_name)
Generate Vector Using Pretrained Model
To use own model pass model path as model_path
argument to BengaliFasttext()
like below snippet.
from bnlp.embedding.fasttext import BengaliFasttext
own_model_path = "own_directory/own_fasttext_model.bin"
bft = BengaliFasttext(model_path=own_model_path)
word = "গ্রাম"
word_vector = bft.get_word_vector(model_path, word)
print(word_vector.shape)
Generate Vector File from Fasttext Binary Model
To use own model pass model path as model_path
argument to BengaliFasttext()
like below snippet.
from bnlp.embedding.fasttext import BengaliFasttext
own_model_path = "own_directory/own_fasttext_model.bin"
bft = BengaliFasttext(model_path=own_model_path)
out_vector_name = "myvector.txt"
bft.bin2vec(out_vector_name)
Train Bengali FastText Model
Check fasttext documentation for details of training parameter
from bnlp.embedding.fasttext import FasttextTrainer
trainer = FasttextTrainer()
data = "raw_text.txt"
model_name = "saved_model.bin"
epoch = 50
trainer.train(data, model_name, epoch)
Bengali GloVe Word Vectors
We trained glove model with bengali data(wiki+news articles) and published bengali glove word vectors
You can download and use it on your different machine learning purposes.
from bnlp import BengaliGlove
bengali_glove = BengaliGlove() # will automatically download pretrained model
word = "গ্রাম"
vector = bengali_glove.get_word_vector(word)
print(vector.shape)
similar_words = bengali_glove.get_closest_word(word)
print(similar_words)