Word Embedding
A word embedding is a vector representation of a word. In word embedding, words or phrases from the vocabulary are mapped to vectors of real numbers, so that semantically similar words end up close together in the vector space.
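As a rough illustration, an embedding can be thought of as a lookup table from words to fixed-length numeric vectors. The toy vectors below are hand-picked for the example, not learned:
# Toy illustration: a hand-crafted mapping from words to 3-dimensional vectors.
# Learned embeddings typically have 50-300 dimensions.
embedding = {
    "king":  [0.9, 0.1, 0.7],
    "queen": [0.9, 0.2, 0.8],
    "apple": [0.1, 0.9, 0.3],
}
print(embedding["king"])  # [0.9, 0.1, 0.7]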
Word Embedding Software
- Word2Vec
- GloVe
- FastText
- Gensim
Word2Vec
Word2vec is a two-layer neural network that processes text. Its input is a text corpus and its output is a set of feature vectors, one for each word in that corpus.
Word2vec can use either of two model architectures to produce a distributed representation of words:
- Continuous bag-of-words (CBOW), which predicts the current word from its surrounding context words
- Skip-gram, which predicts the surrounding context words from the current word
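In Gensim, this choice is made with the sg parameter of Word2Vec (sg=0 selects CBOW, the default; sg=1 selects skip-gram). A minimal sketch on a toy corpus:
from gensim.models.word2vec import Word2Vec

# Two toy sentences; min_count=1 keeps every word despite the tiny corpus
sentences = [["the", "quick", "brown", "fox"],
             ["jumps", "over", "the", "lazy", "dog"]]
cbow_model = Word2Vec(sentences, min_count=1, sg=0)      # CBOW (the default)
skipgram_model = Word2Vec(sentences, min_count=1, sg=1)  # skip-gram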
Word2Vec Model Using Gensim
Gensim’s Word2Vec implementation lets you train your own word embedding model for a given corpus. Dependencies:
- Python (2.7)
- Gensim
from gensim.models.word2vec import Word2Vec
from multiprocessing import cpu_count
import gensim.downloader as api
# Download dataset
dataset = api.load("text8")
data = [d for d in dataset]
# Split the data into 2 parts. Part 2 will be used later to update the model
data_part1 = data[:1000]
data_part2 = data[1000:]
# Train a Word2Vec model (the default vector size is 100)
model = Word2Vec(data_part1, min_count=0, workers=cpu_count())
# Get the word vector for a given word (the vectors live on model.wv in Gensim 1.0+)
model.wv['topic']
#> array([ 0.0512, 0.2555, 0.9393, ... ,-0.5669, 0.6737], dtype=float32)
# Find the words most similar to a given word
model.wv.most_similar('topic')
#> [('discussion', 0.7590423822402954),
#> ('consensus', 0.7253159284591675),
#> ('discussions', 0.7252693176269531),
#> ('interpretation', 0.7196053266525269),
#> ('viewpoint', 0.7053568959236145),
#> ('speculation', 0.7021505832672119),
#> ('discourse', 0.7001898884773254),
#> ('opinions', 0.6993060111999512),
#> ('focus', 0.6959210634231567),
#> ('scholarly', 0.6884037256240845)]
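# most_similar returns cosine similarities, so the score for a specific pair
# can also be queried directly (the same value as in the list above):
model.wv.similarity('topic', 'discussion')
#> 0.7590424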
# Save and Load Model
model.save('newmodel')
model = Word2Vec.load('newmodel')
We have now trained and saved a Word2Vec model for our corpus. However, when a new dataset arrives, you will want to update the model so it accounts for the new words. The second part of the data we held out earlier (data_part2) can be used for this.
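A minimal sketch of such an update, assuming a recent Gensim (3.x or later), using build_vocab with update=True followed by a further call to train:
# Add the new words to the existing vocabulary, then continue training
model.build_vocab(data_part2, update=True)
model.train(data_part2, total_examples=model.corpus_count, epochs=model.epochs)
# The updated vectors are queried as before, e.g. model.wv['topic']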