Bengali Natural Language Processing (BNLP)

BNLP is a natural language processing toolkit for the Bengali language. It helps you tokenize Bengali text, embed Bengali words, and build neural models for Bengali NLP tasks.

Installation

  • PyPI package installer (tested with Python 3.6 and 3.7)

    pip install bnlp_toolkit

Pretrained Model

Trained on a Wikipedia dump dataset.

Tokenization

  • Bengali SentencePiece Tokenization

    • Tokenization using a trained model
      from bnlp.sentencepiece_tokenizer import SP_Tokenizer

      bsp = SP_Tokenizer()
      model_path = "./model/bn_spm.model"
      input_text = "আমি ভাত খাই। সে বাজারে যায়।"
      tokens = bsp.tokenize(model_path, input_text)
      print(tokens)

    • Training SentencePiece
      from bnlp.sentencepiece_tokenizer import SP_Tokenizer

      bsp = SP_Tokenizer(is_train=True)
      data = "test.txt"
      model_prefix = "test"
      vocab_size = 5
      bsp.train_bsp(data, model_prefix, vocab_size)

  • NLTK Tokenization

    from bnlp.nltk_tokenizer import NLTK_Tokenizer

    text = "আমি ভাত খাই। সে বাজারে যায়। তিনি কি সত্যিই ভালো মানুষ?"
    bnltk = NLTK_Tokenizer(text)
    word_tokens = bnltk.word_tokenize()
    sentence_tokens = bnltk.sentence_tokenize()
    print(word_tokens)
    print(sentence_tokens)
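
To make the two tokenization levels concrete, here is a minimal standard-library sketch of sentence and word splitting for Bengali text. It is not BNLP's or NLTK's implementation. The helper names `split_sentences` and `split_words`, and the rule that a sentence ends at the danda (।), a question mark, or an exclamation mark, are assumptions made for illustration.

```python
import re

# Assumed sentence terminators: Bengali danda, question mark, exclamation mark.
_SENTENCE_END = re.compile(r"(?<=[।?!])\s+")
# A token is either a run of non-space, non-punctuation characters
# or a single punctuation mark.
_TOKEN = re.compile(r"[^\s।?!,;]+|[।?!,;]")


def split_sentences(text):
    """Split on whitespace that follows a terminator; the terminator stays with its sentence."""
    return [s for s in _SENTENCE_END.split(text.strip()) if s]


def split_words(text):
    """Return words and punctuation marks as separate tokens."""
    return _TOKEN.findall(text)


text = "আমি ভাত খাই। সে বাজারে যায়। তিনি কি সত্যিই ভালো মানুষ?"
sentences = split_sentences(text)
words = split_words(text)
print(sentences)
print(words)
```

A rule-based splitter like this ignores abbreviations and quoted speech, which is one reason to prefer a trained tokenizer for real text.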

Word Embedding

  • Bengali Word2Vec

    • Generate Vector Using Pretrained Model

      from bnlp.bengali_word2vec import Bengali_Word2Vec

      bwv = Bengali_Word2Vec()
      model_path = "model/wiki.bn.text.model"
      word = 'আমার'
      vector = bwv.generate_word_vector(model_path, word)
      print(vector.shape)
      print(vector)

    • Find Most Similar Word Using Pretrained Model

      from bnlp.bengali_word2vec import Bengali_Word2Vec

      bwv = Bengali_Word2Vec()
      model_path = "model/wiki.bn.text.model"
      word = 'আমার'
      similar = bwv.most_similar(model_path, word)
      print(similar)

    • Train Bengali Word2Vec with your own data

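      from bnlp.bengali_word2vec import Bengali_Word2Vec

      # is_train=True is assumed here by analogy with the other training examples
      bwv = Bengali_Word2Vec(is_train=True)
      data_file = "test.txt"
      model_name = "test_model.model"
      vector_name = "test_vector.vector"
      bwv.train_word2vec(data_file, model_name, vector_name)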


  • Bengali FastText

    • Download Bengali FastText Pretrained Model From Here

    • Generate Vector Using Pretrained Model

      from bnlp.bengali_fasttext import Bengali_Fasttext

      bft = Bengali_Fasttext()
      word = "গ্রাম"
      model_path = "cc.bn.300.bin"
      word_vector = bft.generate_word_vector(model_path, word)
      print(word_vector.shape)
      print(word_vector)


    • Train Bengali FastText Model

      from bnlp.bengali_fasttext import Bengali_Fasttext

      bft = Bengali_Fasttext(is_train=True)
      data = "data.txt"
      model_name = "saved_model.bin"
      bft.train_fasttext(data, model_name)
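
The most-similar lookup shown in the Word2Vec section ranks words by how close their vectors are. For Word2Vec models this closeness is conventionally measured with cosine similarity. The sketch below illustrates that ranking in pure Python. The three-dimensional vectors, the tiny vocabulary, and the helper names `cosine` and `rank_similar` are invented for illustration and do not come from BNLP or from any pretrained model.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank_similar(word, vectors, topn=2):
    """Return the topn other words, ordered by descending cosine similarity."""
    query = vectors[word]
    scores = [
        (other, cosine(query, vec))
        for other, vec in vectors.items()
        if other != word
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]


# Toy vectors, made up for this example only.
toy_vectors = {
    "আমার": [1.0, 0.0, 0.0],
    "তোমার": [0.9, 0.1, 0.0],
    "গ্রাম": [0.0, 1.0, 0.0],
    "বাজার": [0.1, 0.8, 0.2],
}

ranked = rank_similar("আমার", toy_vectors)
print(ranked)
```

Real embeddings have hundreds of dimensions (the file name cc.bn.300.bin suggests 300), but the ranking idea is the same.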

Issues

  • If you see ModuleNotFoundError: No module named 'fasttext', install the package:

    pip install fasttext

  • If an NLTK error occurs, run the following lines before importing bnlp:

    import nltk
    nltk.download("punkt")
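
Both issues above come down to an optional dependency not being available when bnlp is used. As a rough pre-flight check, the sketch below uses only the standard library to report which modules cannot be found. The helper name `missing_modules` and the list of modules checked are assumptions for illustration, not part of BNLP.

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names that cannot be located."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Assumed list of optional dependencies, taken from the issues above.
missing = missing_modules(["fasttext", "nltk"])
if missing:
    print("Install before importing bnlp:", ", ".join(missing))
else:
    print("All listed modules are importable.")
```

This only checks that the modules are installed. The NLTK punkt data still has to be downloaded separately.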