Word embeddings are vector representations of words in a continuous vector space, capturing semantic relationships between words. These representations are learned from large text corpora and are useful in natural language processing (NLP) tasks. Each word is represented by a dense vector, and the geometric distance between vectors reflects semantic similarity.
In this blog post, we will explore the concept of word embedding and its application in Python. The tutorial covers:
- The concept of word embedding
- Overview of Word2Vec
- Word embedding with Word2Vec in Python
- T-SNE visualization of Word2Vec
- Word similarity detection
- Conclusion
Let's get started.
The concept of word embedding
Word embedding is a technique in NLP where words are represented as numerical vectors in a continuous space. It's crucial because it transforms the semantic meaning of words into a format machines can understand. By capturing relationships between words, it enables algorithms to comprehend context, leading to more effective language understanding. Word embeddings facilitate tasks like sentiment analysis, language translation, and information retrieval, enhancing the efficiency and accuracy of various machine learning models dealing with human language. In essence, word embeddings bridge the gap between linguistic complexity and machine interpretability, making NLP applications more robust and context-aware.
Overview of Word2Vec
Word2Vec is a popular word embedding technique in NLP developed by a team at Google. It represents words as high-dimensional vectors in a continuous space, where words with similar meanings are positioned closer to each other. Word2Vec is trained on large datasets to learn the relationships between words based on their co-occurrence patterns. It introduces two models: Continuous Bag of Words (CBOW) predicts a target word from its context, while Skip-Gram predicts surrounding words from a target word. Word2Vec's efficient vector representations capture semantic relationships, making it valuable for various NLP tasks like sentiment analysis and machine translation.
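As a quick illustration, gensim's Word2Vec class exposes this choice through its sg parameter; the toy sentences below are placeholder data, and the snippet assumes gensim 4.x.

```python
from gensim.models import Word2Vec

# Toy corpus: each sentence is a list of tokens (placeholder data).
sentences = [["the", "cat", "sat", "on", "the", "mat"],
             ["the", "dog", "lay", "on", "the", "rug"]]

# sg=0 trains the CBOW model (predict a word from its context),
# sg=1 trains the Skip-Gram model (predict the context from a word).
cbow_model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=0)
skipgram_model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1)
```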
Word embedding with Word2Vec in Python
Now, let's proceed with the implementation of Word2Vec for word embedding in Python. In this example, we'll be using the 'gensim' library. Ensure that you have 'gensim' installed; if not, you can easily install it with the following command:
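```
pip install gensim
```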
Using the 'gensim' library, we can easily train our Word2Vec model on a corpus of text. The process involves feeding the model with sentences and letting it iteratively learn the vector representations of words. We tokenize the input sentences, train the Word2Vec model, and save the resulting model.
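A minimal sketch of that workflow might look like the following; the example sentences and the file name word2vec.model are illustrative, and the code assumes the gensim 4.x API.

```python
from gensim.models import Word2Vec
from gensim.utils import simple_preprocess

# Example corpus; in practice this would be a much larger collection of text.
raw_sentences = [
    "Word embeddings capture semantic relationships between words.",
    "Word2Vec learns vectors from word co-occurrence patterns.",
    "Similar words end up close together in the vector space.",
    "Gensim makes it easy to train a Word2Vec model in Python.",
]

# Tokenize each sentence into a list of lowercase word tokens.
tokenized = [simple_preprocess(sentence) for sentence in raw_sentences]

# Train the Word2Vec model (gensim 4.x API).
model = Word2Vec(tokenized, vector_size=100, window=5, min_count=1, workers=4)

# Save the trained model for later use.
model.save("word2vec.model")
```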
T-SNE visualization of Word2Vec in Python
After training our Word2Vec model, we can visualize it using T-SNE. T-distributed Stochastic Neighbor Embedding (T-SNE) is a popular technique for reducing high-dimensional data to two or three dimensions, enabling the visualization of relationships between words in our Word2Vec space. scikit-learn provides the TSNE class for visualizing high-dimensional data.
In the code below, we load the previously saved model, extract words and vectors, fit and transform vectors with TSNE, and finally visualize Word2Vec in a plot with annotated words.
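Here is one possible sketch of those steps, assuming the model was saved as word2vec.model in the previous step and that matplotlib is installed:

```python
import matplotlib.pyplot as plt
from gensim.models import Word2Vec
from sklearn.manifold import TSNE

# Load the previously saved Word2Vec model.
model = Word2Vec.load("word2vec.model")

# Extract the vocabulary and the corresponding word vectors.
words = list(model.wv.index_to_key)
vectors = model.wv[words]

# Reduce the vectors to two dimensions with T-SNE.
# perplexity must be smaller than the number of words in the vocabulary.
tsne = TSNE(n_components=2, perplexity=5, random_state=42)
vectors_2d = tsne.fit_transform(vectors)

# Plot the 2-D points and annotate each one with its word.
plt.figure(figsize=(10, 8))
plt.scatter(vectors_2d[:, 0], vectors_2d[:, 1])
for word, (x, y) in zip(words, vectors_2d):
    plt.annotate(word, xy=(x, y), fontsize=9)
plt.title("T-SNE visualization of Word2Vec embeddings")
plt.show()
```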
Word similarity detection
In addition to visualization, Word2Vec enables us to quantify the similarity between words. The code example below shows how to detect word similarity using the trained model.
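A minimal sketch, again assuming the model saved above; the query words are illustrative and must appear in the training vocabulary.

```python
from gensim.models import Word2Vec

# Load the trained model.
model = Word2Vec.load("word2vec.model")

# Cosine similarity between two words in the vocabulary.
print(model.wv.similarity("word", "vector"))

# The words most similar to a given word.
print(model.wv.most_similar("word", topn=5))
```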