Categorical Embedding Explained with Two Examples
21 August, 2023
What is Categorical Embedding?
Categorical embedding is a mapping from discrete objects, such as words or categorical variables, into vectors of continuous numbers. In the context of deep learning, embeddings are typically used to reduce dimensionality and to convert categorical variables into a form that can be fed into neural networks.
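To make this concrete, here is a minimal sketch (not part of the two examples below, with arbitrary illustrative sizes) of how PyTorch's nn.Embedding maps integer category indices to continuous vectors.
import torch
import torch.nn as nn

# A lookup table for 10 possible categories, each represented by a 4-dimensional vector
embedding = nn.Embedding(num_embeddings=10, embedding_dim=4)

# Three category indices, e.g. three rows of a categorical column
categories = torch.tensor([0, 3, 7], dtype=torch.long)

vectors = embedding(categories)  # Shape: (3, 4), one continuous vector per category
print(vectors.shape)             # torch.Size([3, 4])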
Why Use Categorical Embedding?
- Dimensionality Reduction: Categorical data with many unique categories can be difficult to handle using traditional one-hot encoding, especially when the number of categories is large. Embedding helps by reducing the dimensionality, mapping each category to a vector in a continuous space (see the short comparison after this list).
- Capturing Relationships: Through training, the model may learn relationships between different categories. Categories that are somehow related might end up being closer to each other in the continuous space. This allows the model to generalize better from the training data to unseen data.
- Efficiency: By reducing the dimensionality and capturing relationships between categories, embeddings can make the model training more efficient.
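As a rough illustration of the first point, the snippet below contrasts the size of a one-hot representation with an embedding for a high-cardinality column. The cardinality of 10000 and the embedding size of 32 are assumed values chosen only for illustration.
import torch
import torch.nn as nn
import torch.nn.functional as F

num_categories = 10000  # Assumed: a categorical column with many unique values
embedding_dim = 32      # Assumed: a much smaller continuous representation

ids = torch.tensor([42, 7, 9999], dtype=torch.long)  # Three sample category indices

one_hot = F.one_hot(ids, num_classes=num_categories).float()  # Shape: (3, 10000), mostly zeros
embedded = nn.Embedding(num_categories, embedding_dim)(ids)   # Shape: (3, 32), dense vectors

print(one_hot.shape, embedded.shape)  # torch.Size([3, 10000]) torch.Size([3, 32])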
How Does It Work?
- Initialization: Each category is initially represented as a random vector in a lower-dimensional space.
- Training: During training, the vectors are optimized along with the rest of the model. The exact nature of the optimization depends on the model and the data, but the goal is generally to adjust the vectors so that they help the model make better predictions.
- Using the Embeddings: Once trained, the embeddings represent each category as a vector in continuous space. These vectors can be used as inputs to other models or to understand relationships between the categories, as sketched below.
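The following sketch ties these three steps together on a toy setup. The data, sizes, and the small linear classifier are illustrative assumptions: the embedding weights start random, are optimized by gradient descent together with the rest of the model, and can then be read out as the learned vectors.
import torch
import torch.nn as nn
import torch.optim as optim

# 1. Initialization: 4 categories, each starting as a random 3-dimensional vector
embedding = nn.Embedding(num_embeddings=4, embedding_dim=3)
classifier = nn.Linear(3, 2)  # A tiny downstream model: embedding -> 2-class logits

# Toy labelled data: category index -> class label (illustrative values)
X = torch.tensor([0, 1, 2, 3, 0, 2], dtype=torch.long)
y = torch.tensor([0, 1, 1, 0, 0, 1], dtype=torch.long)

# 2. Training: the embedding vectors are optimized along with the classifier
optimizer = optim.SGD(list(embedding.parameters()) + list(classifier.parameters()), lr=0.1)
loss_fn = nn.CrossEntropyLoss()
for _ in range(100):
    optimizer.zero_grad()
    logits = classifier(embedding(X))  # Shape: (6, 2)
    loss = loss_fn(logits, y)
    loss.backward()
    optimizer.step()

# 3. Using the embeddings: one learned vector per category
learned_vectors = embedding.weight.data  # Shape: (4, 3)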
Example 1: Word Embeddings
One of the most popular applications of embedding is in natural language processing, where words are mapped to vectors. Techniques like Word2Vec or GloVe train word embeddings to capture semantic meanings and relationships between words. Words with similar meanings tend to have similar vectors.
import torch
import torch.nn as nn
import torch.optim as optim

# Vocabulary and data
vocab = {'<PAD>': 0, 'the': 1, 'cat': 2, 'sat': 3, 'on': 4, 'mat': 5}
data = [['the', 'cat', 'sat'], ['cat', 'sat', 'on'], ['sat', 'on', 'the'], ['on', 'the', 'mat']]
data_indices = [[vocab[word] for word in sentence] for sentence in data]  # Example: [[1, 2, 3], [2, 3, 4], [3, 4, 1], [4, 1, 5]]

# Hyperparameters
embedding_dim = 5  # Size of each embedding vector
vocab_size = 6     # Size of vocabulary (6 words in vocab)
context_size = 2   # Number of context words per training example

# Model for learning word embeddings
class WordEmbedding(nn.Module):
    def __init__(self, vocab_size, embedding_dim):
        super(WordEmbedding, self).__init__()
        self.embeddings = nn.Embedding(vocab_size, embedding_dim)  # Weight shape: (6, 5)

    def forward(self, inputs):  # inputs shape: (context_size,), e.g. [1, 2]
        return self.embeddings(inputs)  # Output shape: (context_size, 5)

# Training: a toy objective that pulls the averaged context embedding toward the target word's embedding
model = WordEmbedding(vocab_size, embedding_dim)
loss_function = nn.MSELoss()
optimizer = optim.SGD(model.parameters(), lr=0.1)

for epoch in range(10):
    for sentence in data_indices:
        context = sentence[:-1]  # First two word indices, e.g. [1, 2]
        target = sentence[-1]    # Last word index, e.g. 3
        context_tensor = torch.tensor(context, dtype=torch.long)  # Shape: (2,)
        target_tensor = torch.tensor(target, dtype=torch.long)    # Scalar tensor

        optimizer.zero_grad()
        out = model(context_tensor).mean(dim=0, keepdim=True)            # Shape: (1, 5), averaged context embedding
        target_embedding = model.embeddings(target_tensor).unsqueeze(0)  # Shape: (1, 5), matches out
        loss = loss_function(out, target_embedding)
        loss.backward()
        optimizer.step()

# Resulting word embeddings
word_embeddings = model.embeddings.weight.data  # Shape: (6, 5), one row per vocabulary word
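As a follow-up to "Using the Embeddings", the learned vectors can be compared directly. The short snippet below, which assumes the model trained above, measures the cosine similarity between the vectors for 'cat' and 'sat'. On such a tiny toy corpus the number is not meaningful, but the same pattern applies to embeddings trained at scale.
import torch.nn.functional as F

cat_vec = word_embeddings[vocab['cat']]  # Shape: (5,)
sat_vec = word_embeddings[vocab['sat']]  # Shape: (5,)
similarity = F.cosine_similarity(cat_vec.unsqueeze(0), sat_vec.unsqueeze(0))  # Shape: (1,)
print(similarity.item())  # A value in [-1, 1]; higher means the vectors point in similar directions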
Example 2: Entity Embeddings of Categorical Variables
In tabular data, categorical embeddings can be used to convert categorical columns into continuous embeddings. This technique has been particularly successful in some Kaggle competitions, where categorical variables with a large number of categories are common.
import torch
import torch.nn as nn

def custom_embedding(indices, weight):
    # indices shape: (batch_size,)
    # weight shape: (num_categories, embedding_dim)
    return weight[indices, :]  # Row lookup in the weight matrix, shape: (batch_size, embedding_dim)

class CustomEmbeddingModel(nn.Module):
    def __init__(self, num_categories, embedding_dim):
        super(CustomEmbeddingModel, self).__init__()
        self.embedding_weight = nn.Parameter(torch.randn(num_categories, embedding_dim))  # Shape: (3, 2) for 3 categories and a 2-dimensional embedding
        self.fc = nn.Linear(embedding_dim, 1)  # Maps each 2-dimensional embedding to a single output

    def forward(self, x):
        # Input shape: (batch_size,) or (batch_size, 1), e.g. [[0], [1], [0], [2], [1]]
        x = x.view(-1)  # Flatten to shape: (batch_size,), e.g. [0, 1, 0, 2, 1]
        x = custom_embedding(x, self.embedding_weight)  # Embedding lookup, shape: (batch_size, 2)
        x = self.fc(x)  # Fully connected layer, shape: (batch_size, 1)
        return x

# Sample input (batch_size = 5)
X = torch.tensor([0, 1, 0, 2, 1], dtype=torch.long)

# Sample weight matrix (3 categories, 2-dimensional embedding)
weight = torch.tensor([[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]], dtype=torch.float32)

# Using the custom_embedding function directly
embedded_X = custom_embedding(X, weight)
# Resulting shape: (5, 2), e.g. [[0.1, 0.2], [0.3, 0.4], [0.1, 0.2], [0.5, 0.6], [0.3, 0.4]]
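For completeness, here is a short sketch of running the same sample batch through the CustomEmbeddingModel defined above, using the same illustrative dimensions (3 categories, 2-dimensional embeddings).
# Running the full model on the sample batch (illustrative usage of the class above)
model = CustomEmbeddingModel(num_categories=3, embedding_dim=2)
output = model(X)    # Shape: (5, 1), one output per input category index
print(output.shape)  # torch.Size([5, 1])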
Conclusion
Categorical embedding is a powerful tool for working with categorical data, allowing models to handle high-dimensional data more efficiently and to capture complex relationships between categories. It is widely used in various domains, from natural language processing to handling tabular data with many categorical variables.