Word Embedding
A word embedding is a vector representation of a word. In word embedding, words or phrases from the vocabulary are mapped to vectors of real numbers, so that semantically similar words end up close together in the vector space.
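As a rough illustration, an embedding can be thought of as a lookup table from words to fixed-length numeric vectors. The toy vectors below are hand-picked for the example, not learned:
# Toy illustration: a hand-crafted mapping from words to 3-dimensional vectors.
# Learned embeddings typically have 50-300 dimensions.
embedding = {
    "king":  [0.9, 0.1, 0.7],
    "queen": [0.9, 0.2, 0.8],
    "apple": [0.1, 0.9, 0.3],
}
print(embedding["king"])  # [0.9, 0.1, 0.7]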
Word Embedding Software
- Word2Vec
- GloVe
- FastText
- Gensim
Word2Vec
Word2vec is a two-layer neural network that processes text. Its input is a text corpus and its output is a set of feature vectors, one for each word in that corpus.
Word2vec can use either of two model architectures to produce a distributed representation of words:
- Continuous bag-of-words (CBOW), which predicts the current word from its surrounding context words
- Skip-gram, which predicts the surrounding context words from the current word
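In Gensim, this choice is made with the sg parameter of Word2Vec (sg=0 selects CBOW, the default; sg=1 selects skip-gram). A minimal sketch on a toy corpus:
from gensim.models.word2vec import Word2Vec

# Two toy sentences; min_count=1 keeps every word despite the tiny corpus
sentences = [["the", "quick", "brown", "fox"],
             ["jumps", "over", "the", "lazy", "dog"]]
cbow_model = Word2Vec(sentences, min_count=1, sg=0)      # CBOW (the default)
skipgram_model = Word2Vec(sentences, min_count=1, sg=1)  # skip-gram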
Word2Vec Model Using Gensim
Gensim’s Word2Vec implementation lets you train your own word embedding model for a given corpus. Dependencies:
- Python (2.7)
- Gensim
from gensim.models.word2vec import Word2Vec
from multiprocessing import cpu_count
import gensim.downloader as api
# Download dataset
dataset = api.load("text8")
data = [d for d in dataset]
# Split the data into 2 parts. Part 2 will be used later to update the model
data_part1 = data[:1000]
data_part2 = data[1000:]
# Train a Word2Vec model (the default vector size is 100)
model = Word2Vec(data_part1, min_count=0, workers=cpu_count())
# Get the word vector for a given word (the vectors live on model.wv in Gensim 1.0+)
model.wv['topic']
#> array([ 0.0512, 0.2555, 0.9393, ... ,-0.5669, 0.6737], dtype=float32)
# Find the words most similar to a given word
model.wv.most_similar('topic')
#> [('discussion', 0.7590423822402954),
#> ('consensus', 0.7253159284591675),
#> ('discussions', 0.7252693176269531),
#> ('interpretation', 0.7196053266525269),
#> ('viewpoint', 0.7053568959236145),
#> ('speculation', 0.7021505832672119),
#> ('discourse', 0.7001898884773254),
#> ('opinions', 0.6993060111999512),
#> ('focus', 0.6959210634231567),
#> ('scholarly', 0.6884037256240845)]
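# most_similar returns cosine similarities, so the score for a specific pair
# can also be queried directly (the same value as in the list above):
model.wv.similarity('topic', 'discussion')
#> 0.7590424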
# Save and Load Model
model.save('newmodel')
model = Word2Vec.load('newmodel')
We have now trained and saved a Word2Vec model for our corpus. However, when a new dataset arrives, you will want to update the model so it accounts for the new words. The second part of the data we held out earlier (data_part2) can be used for this.
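A minimal sketch of such an update, assuming a recent Gensim (3.x or later), using build_vocab with update=True followed by a further call to train:
# Add the new words to the existing vocabulary, then continue training
model.build_vocab(data_part2, update=True)
model.train(data_part2, total_examples=model.corpus_count, epochs=model.epochs)
# The updated vectors are queried as before, e.g. model.wv['topic']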