Decoding LSTM using PyTorch 🔥

Yash Paneliya
9 min read · Jun 7, 2024


Hey there, data scientists! Today, we’re diving into the fascinating world of Long Short-Term Memory (LSTM) networks using PyTorch. LSTMs are like the superheroes of the neural network realm, equipped with the power to understand and remember patterns over time.

After reading a lot of StackOverflow answers and theories about LSTM, I finally tried my hand at it. This is not a run-of-the-mill blog where we drone on about the inner workings of a cell. Nope, in this blog, we’ll peel back the layers of LSTM architecture, unravelling its secrets one line of code at a time. So, open up your Jupyter Lab to code along!! And yeah, one more thing, grab your coffee, this is going to take time.

Preliminaries

I assume here that you already have a basic idea of how LSTM works and what its components are. If you don’t, then refer to this article. Still, I will give a brief overview to make things easier.

Skeleton of an LSTM Cell/Unit

One LSTM cell/unit has 3 types of gates: forget, input, and output gates, as shown in the above figure. The yellow boxes are tiny neural networks with sigmoid and tanh activations. h is the hidden state, and c is the cell state. X(t) is the input at the current time step. You will get to know the values and shapes of these vectors in a while. The pink merge points are pointwise vector operations like addition and multiplication. Y(t) is the output at each time step. The point where h(t-1) and X(t) meet (marked as a blue circle) denotes vector concatenation.
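To make the gate mechanics concrete, here is a minimal sketch of one LSTM time step written with plain tensors (my own illustration of the standard equations; the W_* and b_* weights are hypothetical, and this is separate from the classifier we’ll build below):

import torch

def lstm_cell_step(x_t, h_prev, c_prev, W_f, W_i, W_c, W_o, b_f, b_i, b_c, b_o):
    """One LSTM time step, following the standard gate equations."""
    z = torch.cat([h_prev, x_t])        # concatenation of h(t-1) and X(t) (the blue circle)

    f = torch.sigmoid(W_f @ z + b_f)    # forget gate: what to erase from c(t-1)
    i = torch.sigmoid(W_i @ z + b_i)    # input gate: what to write into the cell state
    c_hat = torch.tanh(W_c @ z + b_c)   # candidate cell state
    o = torch.sigmoid(W_o @ z + b_o)    # output gate: what to expose as h(t)

    c_t = f * c_prev + i * c_hat        # pointwise operations (the pink merge points)
    h_t = o * torch.tanh(c_t)           # new hidden state, which is also the output Y(t)
    return h_t, c_t

# quick smoke test with hypothetical sizes (input dim 10, hidden dim 3)
x_t, h_prev, c_prev = torch.randn(10), torch.zeros(3), torch.zeros(3)
Ws = [torch.randn(3, 13) for _ in range(4)]   # 13 = hidden (3) + input (10)
bs = [torch.zeros(3) for _ in range(4)]
h_t, c_t = lstm_cell_step(x_t, h_prev, c_prev, *Ws, *bs)
print(h_t.shape, c_t.shape)  # torch.Size([3]) torch.Size([3])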

Fire up your Jupyter-Lab

We’ll be doing a simple text-based email spam classification to understand LSTM. For this blog, we are going to use the PyTorch library for the implementation (because that’s the one I know best). The dataset is taken from Kaggle.

Before moving on to the LSTM architecture, we first have to preprocess the dataset to make it suitable for the LSTM.

import pandas as pd

# Load the dataset
data_ = pd.read_csv('../input/email-spam-ham-prediction/sms_spam.csv')
data_.head()

Any machine learning model needs numerical vectors as input, which we don’t have here. So, we have to convert the sentences and words into numbers. To do this, we build a vocabulary using the whole corpus (i.e. the dataset). This assigns a specific number to each unique word in the corpus.

from nltk.tokenize import word_tokenize
from torchtext.vocab import build_vocab_from_iterator

def build_vocab(data):
    for text in data:
        yield word_tokenize(text)

# build vocab from an iterator over tokenized sentences
vocab = build_vocab_from_iterator(build_vocab(data_['text']))

Now, we encode the sentences using the vocabulary.

def encode_word2int(data):
    word2int = []
    for text in data:
        tokens = word_tokenize(text)
        word2int.append([vocab[word] for word in tokens])
    return word2int

# A list of lists representing encoded sentences in the form of numbers
encoded_train_data = encode_word2int(data_['text'])
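For example, a short message would turn into a list of indices like this (the indices below are made up, just to illustrate the idea):

# toy illustration with a hypothetical mini-vocabulary
toy_vocab = {'free': 87, 'entry': 412, 'now': 56}
tokens = 'free entry now'.split()
print([toy_vocab[w] for w in tokens])  # [87, 412, 56]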

Another important requirement for a machine learning model is the size or shape of the input. All inputs must have the same dimensions, which is not the case here because sentences have different lengths. To tackle this, we pad the sentences to a fixed length (which can be decided based on the longest sentence in the corpus or on domain knowledge). Here I am taking the max length of any sentence as 5 for easier explanation and padding all shorter sentences with 0, but in practice you should use the longest sentence length or a similarly large number.

import numpy as np

MAX_SEQ_LEN = 5

# Truncate longer sentences and pad shorter ones with 0
padded_X = []
for sentence in encoded_train_data:
    if len(sentence) > MAX_SEQ_LEN:
        padded_X.append(sentence[:MAX_SEQ_LEN])
    else:
        padded_X.append(sentence + [0]*(MAX_SEQ_LEN-len(sentence)))

# Updated array of encoded sentences
padded_X = np.array(padded_X)
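A quick sanity check of the padding logic (the word indices are hypothetical, and the helper below is my own naming, mirroring the loop above):

def pad_or_truncate(sentence, max_len=5):
    if len(sentence) > max_len:
        return sentence[:max_len]
    return sentence + [0] * (max_len - len(sentence))

print(pad_or_truncate([12, 7, 99]))             # [12, 7, 99, 0, 0]
print(pad_or_truncate([5, 8, 2, 41, 6, 3, 9]))  # [5, 8, 2, 41, 6]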

Also, convert the email labels from categorical to numerical values.

data_[['label']] = data_[['type']].apply(lambda col:pd.Categorical(col).codes)
labels = np.array(data_['label'])
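Since pd.Categorical sorts categories alphabetically, ‘ham’ maps to 0 and ‘spam’ maps to 1. A quick way to confirm the mapping:

# Verify the label mapping (categories are sorted alphabetically by default)
print(dict(enumerate(pd.Categorical(data_['type']).categories)))
# {0: 'ham', 1: 'spam'}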

Huffff, preprocessing is done. You can take a sip of coffee now.

The data is now ready to get grilled on any ML model. So, let’s split it into train and test sets.

import torch
from torch.utils.data import TensorDataset, DataLoader

split_index = int(len(padded_X) * 0.8) # 80:20 splitting

train_set = TensorDataset(torch.from_numpy(padded_X[:split_index]), torch.from_numpy(labels[:split_index]))
val_set = TensorDataset(torch.from_numpy(padded_X[split_index:]), torch.from_numpy(labels[split_index:]))

Create data loaders for the training and validation sets to divide them into small batches. The batch size is set to 4 here to make the upcoming outputs easier to follow. You can change it to a higher number for faster processing.

batch_size = 4

train_loader = DataLoader(train_set, batch_size, pin_memory=True, shuffle=True)
valid_loader = DataLoader(val_set, batch_size, pin_memory=True, shuffle=False)
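A quick peek at one batch (the shapes assume batch_size = 4 and MAX_SEQ_LEN = 5 from above):

texts, labels = next(iter(train_loader))
print(texts.shape)   # torch.Size([4, 5]) -> [batch_size, MAX_SEQ_LEN]
print(labels.shape)  # torch.Size([4])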

Now it’s time for the most awaited part: the LSTM architecture.

import torch.nn as nn

class LSTM(nn.Module):
    def __init__(self, vocab_size, embedding_dim, hidden_dim, num_layers):
        super(LSTM, self).__init__()
        self.hidden_dim = hidden_dim
        self.num_layers = num_layers

        # embedding and LSTM layers
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        self.lstm = nn.LSTM(embedding_dim, hidden_dim, batch_first=True, dropout=0.3, num_layers=num_layers)

        # linear and sigmoid layers
        self.fc = nn.Linear(hidden_dim, 1)
        self.dropout = nn.Dropout(0.3)
        self.sig = nn.Sigmoid()

    def forward(self, x, h):
        batch_size = x.size(0)
        print("Batch size: ", batch_size)

        x = x.long()
        print("X shape: ", x.shape)

        # [batch_size, seq_len] -> [batch_size, seq_len, embedding_dim]
        embeds = self.embedding(x)
        print("embed size: ", embeds.shape)
        print(embeds)

        # lstm_out: last layer's output at every time step
        # hidden: (h_n, c_n), the final states of every layer
        lstm_out, hidden = self.lstm(embeds, h)
        print("lstm_out shape:", lstm_out.shape)
        print(lstm_out)

        print("h shape:", hidden[0].shape)
        print(hidden[0])

        print("c shape:", hidden[1].shape)
        print(hidden[1])

        lstm_out = lstm_out[:, -1, :] # getting the last time step output
        print("last time step lstm_out shape:", lstm_out.shape)

        lstm_out = self.dropout(lstm_out)
        # fully-connected layer
        out = self.fc(lstm_out)
        # sigmoid function
        out = self.sig(out)
        # return the sigmoid output
        return out

This LSTM architecture consists of an embedding layer, followed by an LSTM layer with specified hidden dimensions and layers. Dropout regularization is applied to the LSTM output, followed by a linear layer and a sigmoid activation function to produce binary classification output.

The embedding layer provides a vector representation for each word of the corpus (it is just a lookup table with a row for each unique word). Earlier we had a single number denoting each word, which will now be converted to a vector of size embedding_dim for better vector operations in the network. The output generated by the embedding layer for a sentence will serve as the input to the LSTM layer.
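Here’s a tiny standalone illustration of that lookup (the sizes are hypothetical, separate from our model):

import torch
import torch.nn as nn

# a lookup table for a 20-word vocabulary, one 10-dim vector per word
embedding = nn.Embedding(num_embeddings=20, embedding_dim=10)

sentence = torch.tensor([[3, 17, 5, 0, 0]])  # [batch_size=1, seq_len=5] of word indices
vectors = embedding(sentence)
print(vectors.shape)  # torch.Size([1, 5, 10]) -> each index became a 10-dim vector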

hidden_dim is the number of output neurons of the tiny neural networks inside the LSTM unit that we discussed at the very beginning. It will be the same for all the neural networks in the cell.

Skeleton of internal tiny neural network

The LSTM layer will process the complete sentence word by word to learn the sequence pattern of words based on their vector representations, and output a vector of size hidden_dim. The size of h and c will be the same as hidden_dim. num_layers is the number of LSTM layers stacked on top of each other.

The input to this tiny neural network shown in the figure will be the concatenation of previous time step hidden state and current time step input.
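To make these shapes concrete before we wire everything together, here’s a small standalone check (the numbers mirror the hyperparameters we’ll pick below; this snippet is my own illustration, not from the original notebook):

import torch
import torch.nn as nn

# 2 stacked LSTM layers: 10-dim inputs, 3-dim hidden state
lstm = nn.LSTM(input_size=10, hidden_size=3, num_layers=2, batch_first=True)

x = torch.randn(4, 5, 10)   # [batch_size, seq_len, embedding_dim]
y, (h, c) = lstm(x)

print(y.shape)  # torch.Size([4, 5, 3]) -> [batch_size, seq_len, hidden_dim]
print(h.shape)  # torch.Size([2, 4, 3]) -> [num_layers, batch_size, hidden_dim]
print(c.shape)  # torch.Size([2, 4, 3]) -> [num_layers, batch_size, hidden_dim]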

Take a sip before moving forward. In the next part, we’ll be traversing through the training loop and seeing all the vector dimensions.

First, we create the object for the model along with the required hyperparameters.

vocab_size = len(vocab)
embedding_dim = 10 # Each word will be represented as a vector of size 10
hidden_dim = 3 # Vector size of the output of the LSTM cell
num_layers = 2

epochs = 100
lr = 0.001

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

model = LSTM(vocab_size, embedding_dim, hidden_dim, num_layers)
optimizer = torch.optim.Adam(params=model.parameters(), lr=lr)
model.to(device)

Now, let’s code up the training loop:

from tqdm import tqdm

for epoch in range(epochs):
    for texts, labels in tqdm(train_loader):
        texts = texts.to(device)
        labels = labels.to(device)

        # initial hidden and cell states: one per LSTM layer, per sentence in the batch
        bs = labels.shape[0]
        zero_init = torch.zeros(num_layers, bs, hidden_dim).to(device)
        print("Zero init: ", zero_init.shape)

        h_c = (zero_init, zero_init)

        preds = model(texts, h_c)

        loss = nn.BCELoss()(preds.squeeze(), labels.float())
        optimizer.zero_grad()
        loss.backward()

        optimizer.step()
        break  # stop after one batch so we can inspect the printed shapes

First of all, we take the encoded sentences (texts) and labels from the current batch. To initialize the hidden state (h) and cell state (c), we create a tuple of zero tensors, h_c. Each tensor in the tuple has shape [num_layers, batch_size, hidden_dim]; the tuple holds two of them, one for h and one for c. So for each sentence of the batch, each layer maintains a hidden state of size hidden_dim.

Next, we perform the forward pass. Our first layer is the embedding layer, which takes a batch of encoded sentences and outputs the vector representation of each word of each sentence. So, the output of the embedding layer will be [batch_size, MAX_SEQ_LEN, embedding_dim]. This means that for each sentence of the batch we get an array of MAX_SEQ_LEN elements, each of size embedding_dim. So now each word of the sentence is represented as a vector of size embedding_dim, and this will be the input to our lstm layer.

Note that if you are using an LSTM for a domain other than text, the input shape stays the same as above. It shows that we are passing a sequence of length MAX_SEQ_LEN, with feature columns of size embedding_dim. For example, in time series analysis, you would pass the data of the past MAX_SEQ_LEN days/months/seconds as the sequence.

As input, we pass the generated embeddings and the tuple h_c to the lstm layer. The lstm layer outputs two different things. One is the tensor of outputs Y from the last lstm layer at every time step, and the second is a tuple of the final hidden and cell states of each lstm layer, taken after the last time step.

Note: The number of time steps (t) will be equal to MAX_SEQ_LEN, which indicates that we learn the pattern word by word.

The shape of Y will be [batch_size, MAX_SEQ_LEN, hidden_dim] which means for each sentence of the batch and for each word of the sentence, lstm has outputted a vector of size hidden_dim. This shows the intuition behind sequence learning. LSTM outputs a vector at each time step based on whatever it has learned till the current time step. In our example, Y will be of shape [4,5,3].

The hidden/cell tuple contains two tensors, one for h and one for c, each of shape [num_layers, batch_size, hidden_dim]. If you look at the cell diagram, you’ll notice that Y(t) and h(t) share the same line, which means the last output in Y after processing the complete sequence and the final hidden state of the last layer will have the same values. Here’s the proof from our example,

Y:
tensor([[[ 0.0534, 0.1544, 0.1034],
[ 0.1264, 0.1795, 0.2037],
[ 0.0830, 0.2662, 0.2255],
[-0.0129, 0.3563, 0.1704],
# Final output for sentence-1 after processing all words
[ 0.0668, 0.2532, 0.2377]],

[[ 0.0806, 0.1436, 0.1112],
[ 0.1533, 0.1627, 0.2355],
[ 0.1115, 0.2602, 0.2185],
[ 0.0904, 0.2864, 0.2372],
[ 0.0751, 0.3196, 0.2148]],

[[ 0.0576, 0.1468, 0.1025],
[ 0.0711, 0.2353, 0.1567],
[-0.0086, 0.3189, 0.1238],
[ 0.0245, 0.3267, 0.1513],
[ 0.1010, 0.2779, 0.2234]],

[[ 0.0210, 0.1748, 0.0881],
[ 0.0964, 0.1988, 0.2025],
[ 0.1251, 0.2223, 0.2643],
[ 0.1545, 0.2169, 0.2900],
[ 0.0988, 0.2921, 0.2592]]], grad_fn=<TransposeBackward0>)

h(n):
tensor([[[ 0.3765, 0.2729, -0.0935],
[ 0.0581, -0.0700, -0.1784],
[ 0.1121, 0.3897, -0.0690],
[ 0.1201, -0.1870, -0.0776]],

[[ 0.0668, 0.2532, 0.2377], # Final hidden state for sentence-1
[ 0.0751, 0.3196, 0.2148],
[ 0.1010, 0.2779, 0.2234],
[ 0.0988, 0.2921, 0.2592]]], grad_fn=<StackBackward>)
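You can also verify this programmatically. If you drop this check into forward right after the self.lstm call (a small addition of mine), it prints True:

# last-layer output at the final time step == final hidden state of the last layer
print(torch.allclose(lstm_out[:, -1, :], hidden[0][-1]))  # True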

Now that we have the output from the lstm layer, we proceed further in the network. But hey, which output values should we pass on to the next layers, given that we have outputs from all the time steps? Our aim was to learn the pattern of the complete sequence or sentence. That means the output generated at the last time step is what goes into the next layer.

To do that, we take the last time-step element of each nested tensor in Y (lstm_out[:, -1, :]). It is then passed on to the dense layer and the sigmoid activation to generate the desired output and perform optimization.
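One design note (a suggestion of mine, not something the original notebook does): since the model ends in a sigmoid followed by nn.BCELoss, you could instead return the raw fc output and train with nn.BCEWithLogitsLoss, which fuses the sigmoid and the loss for better numerical stability:

import torch
import torch.nn as nn

criterion = nn.BCEWithLogitsLoss()

logits = torch.randn(4)                  # hypothetical raw fc outputs (no sigmoid)
labels = torch.tensor([0., 1., 1., 0.])  # hypothetical batch labels
loss = criterion(logits, labels)
print(loss.item())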

If you have reached this far, then pat yourself on the back. I have shared the Kaggle notebook with all the code discussed in this blog at the end. You can try different hyperparameters to see how the output dimensions change and build a better understanding.

I tried my best to explain the things I explored and learned. If you find any mistake or issue in the blog, please do comment; it’ll help me and other learners as well. If you find this blog helpful, then do share it with your peers 😊.

Know more about me: https://linktr.ee/yashpaneliya

Kaggle Notebook: https://www.kaggle.com/code/yashpaneliya/decoding-lstm-using-pytorch
