
Building a Transformer-based Language Model from scratch to generate text

  • May 18, 2025
  • Posted by: Kulbir Singh
  • Category: Artificial Intelligence

A Large Language Model (LLM) is a type of deep learning model trained on massive text datasets to understand and generate human language. It belongs to the family of transformer-based architectures, which have revolutionized Natural Language Processing (NLP).

LLMs are capable of tasks like:

  1. Text generation (e.g., ChatGPT)

  2. Translation

  3. Summarization

  4. Question answering

  5. Sentiment analysis

  6. Code generation

Model Components Overview – Text Generation

  • Tokenizer

    • Purpose: Converts input text into a sequence of tokens (numbers representing words, subwords, or characters).

    • Types:

      • Word-level

      • Subword-level (e.g., Byte-Pair Encoding, SentencePiece); a minimal BPE-style sketch appears after this list

      • Character-level

  • Embedding Layer + Positional Encodings

    • Purpose: Transforms tokens into dense vector representations.

    • These vectors capture semantic information and are the input to the model.

    • Think of embeddings as high-dimensional meaning encoders.

    • Transformers have no recurrence or convolution, so they use positional encodings to inject sequence order.

    • Adds sinusoidal or learnable position-based vectors to the embeddings; a sinusoidal sketch appears after this list.

  • Transformer Blocks

    Each transformer block typically includes:

    a. Multi-Head Self-Attention

    • Allows the model to attend to different parts of a sequence simultaneously.

    • Captures dependencies regardless of distance.

    b. Layer Normalization

    • Stabilizes and accelerates training by normalizing inputs to layers.

    c. Feedforward Neural Network (FFN)

    • A fully connected network applied to each token independently after attention.

    d. Residual Connections

    • Shortcut connections that help mitigate vanishing gradients and improve training.

  • Stacking Transformer Layers

    • LLMs consist of many (e.g., 12, 24, 96+) stacked transformer layers.

    • Deeper models have more capacity to learn complex patterns.
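
The sample code further below uses a simple character-level tokenizer (the stoi/itos lookup tables). To give a flavor of the subword approach mentioned in the Tokenizer section, here is a rough byte-pair-encoding-style sketch; the train_bpe helper and the toy corpus are illustrative assumptions, not the exact algorithm of any particular library.

from collections import Counter

def train_bpe(corpus, num_merges=10):
    """Learn a tiny set of BPE-style merges from a toy corpus.
    Each word starts as a tuple of characters; the most frequent adjacent
    pair of symbols is merged into one symbol, num_merges times."""
    vocab = Counter(tuple(word) for word in corpus.split())  # word -> frequency
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for word, freq in vocab.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)    # most frequent adjacent pair
        merges.append(best)
        new_vocab = Counter()
        for word, freq in vocab.items():
            merged, i = [], 0
            while i < len(word):
                if i < len(word) - 1 and (word[i], word[i + 1]) == best:
                    merged.append(word[i] + word[i + 1])  # fuse the pair into one symbol
                    i += 2
                else:
                    merged.append(word[i])
                    i += 1
            new_vocab[tuple(merged)] += freq
        vocab = new_vocab
    return merges

merges = train_bpe("low lower lowest low low", num_merges=5)
print(merges)  # learned merge pairs, most frequent first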
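
Likewise, the model below learns its position embeddings from data; the original Transformer paper instead adds fixed sinusoidal vectors. A minimal sketch of that alternative (the SinusoidalPositionalEncoding module name is chosen here for illustration, not taken from any library):

import math
import torch
import torch.nn as nn

class SinusoidalPositionalEncoding(nn.Module):
    """Fixed sin/cos position vectors, as in the original Transformer paper."""
    def __init__(self, n_embd, max_len=512):
        super().__init__()
        position = torch.arange(max_len).unsqueeze(1)                         # (max_len, 1)
        div_term = torch.exp(torch.arange(0, n_embd, 2) * (-math.log(10000.0) / n_embd))
        pe = torch.zeros(max_len, n_embd)
        pe[:, 0::2] = torch.sin(position * div_term)                          # even dimensions
        pe[:, 1::2] = torch.cos(position * div_term)                          # odd dimensions
        self.register_buffer("pe", pe)

    def forward(self, x):                                                     # x: (B, T, n_embd)
        return x + self.pe[:x.size(1)]

# Usage sketch: instead of a learned position_embedding_table,
#   pos_enc = SinusoidalPositionalEncoding(n_embd)
#   x = pos_enc(tok_emb)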

Sample Code: MiniGPT for Text Generation (PyTorch)

 
import torch
import torch.nn as nn
import torch.nn.functional as F
 
# Hyperparameters
vocab_size = 65   # placeholder; reset below from the toy dataset
block_size = 32   # maximum context length (kept shorter than the toy text)
n_embd = 128      # embedding dimension
n_head = 4        # attention heads per transformer block
n_layer = 4       # number of transformer blocks
 
device = 'cuda' if torch.cuda.is_available() else 'cpu'
 
# Sample toy dataset (character-level)
text = "hello world! this is a simple transformer demo"
chars = sorted(set(text))
vocab_size = len(chars)  # reset the placeholder above to the actual vocabulary size
stoi = { ch:i for i,ch in enumerate(chars) }
itos = { i:ch for ch,i in stoi.items() }
 
def encode(s): return [stoi[c] for c in s]
def decode(l): return ''.join([itos[i] for i in l])
 
# Dataset
data = torch.tensor(encode(text), dtype=torch.long).to(device)
 
def get_batch(seq_len=block_size):
    # pick a random starting offset and return one (input, target) pair,
    # where the target is the input shifted one character to the right
    i = torch.randint(0, len(data) - seq_len - 1, (1,)).item()
    x = data[i:i+seq_len]
    y = data[i+1:i+seq_len+1]
    return x.unsqueeze(0), y.unsqueeze(0)
 
# Self-Attention Head
class Head(nn.Module):
    def __init__(self, head_size):
        super().__init__()
        self.key = nn.Linear(n_embd, head_size)
        self.query = nn.Linear(n_embd, head_size)
        self.value = nn.Linear(n_embd, head_size)
        self.register_buffer("tril", torch.tril(torch.ones(block_size, block_size)))
        self.dropout = nn.Dropout(0.1)
 
    def forward(self, x):
        B, T, C = x.shape                                 # batch, time, embedding dims
        k = self.key(x)                                   # (B, T, head_size)
        q = self.query(x)                                 # (B, T, head_size)
        att = (q @ k.transpose(-2,-1)) / (k.shape[-1] ** 0.5)  # scale by sqrt(head_size)
        att = att.masked_fill(self.tril[:T, :T] == 0, float('-inf'))  # causal mask
        att = F.softmax(att, dim=-1)
        att = self.dropout(att)
        v = self.value(x)                                 # (B, T, head_size)
        out = att @ v                                     # weighted sum of values
        return out
 
# Multi-Head Attention
class MultiHeadAttention(nn.Module):
    def __init__(self, num_heads, head_size):
        super().__init__()
        self.heads = nn.ModuleList([Head(head_size) for _ in range(num_heads)])
        self.proj = nn.Linear(num_heads * head_size, n_embd)
        self.dropout = nn.Dropout(0.1)
 
    def forward(self, x):
        out = torch.cat([h(x) for h in self.heads], dim=-1)
        out = self.dropout(self.proj(out))
        return out
 
# Feedforward Layer
class FeedForward(nn.Module):
    def __init__(self, n_embd):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_embd, 4 * n_embd),
            nn.ReLU(),
            nn.Linear(4 * n_embd, n_embd),
            nn.Dropout(0.1),
        )
 
    def forward(self, x):
        return self.net(x)
 
# Transformer Block
class Block(nn.Module):
    def __init__(self, n_embd, n_head):
        super().__init__()
        head_size = n_embd // n_head
        self.sa = MultiHeadAttention(n_head, head_size)
        self.ff = FeedForward(n_embd)
        self.ln1 = nn.LayerNorm(n_embd)
        self.ln2 = nn.LayerNorm(n_embd)
 
    def forward(self, x):
        x = x + self.sa(self.ln1(x))   # pre-norm residual connection around attention
        x = x + self.ff(self.ln2(x))   # pre-norm residual connection around the feedforward
        return x
 
# Full Language Model
class GPTLanguageModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.token_embedding_table = nn.Embedding(vocab_size, n_embd)
        self.position_embedding_table = nn.Embedding(block_size, n_embd)
        self.blocks = nn.Sequential(*[Block(n_embd, n_head) for _ in range(n_layer)])
        self.ln_f = nn.LayerNorm(n_embd)
        self.lm_head = nn.Linear(n_embd, vocab_size)
 
    def forward(self, idx, targets=None):
        B,T = idx.shape
        tok_emb = self.token_embedding_table(idx)        # (B,T,C)
        pos_emb = self.position_embedding_table(torch.arange(T, device=device)) # (T,C)
        x = tok_emb + pos_emb
        x = self.blocks(x)
        x = self.ln_f(x)
        logits = self.lm_head(x)                         # (B,T,vocab_size)
 
        if targets is None:
            return logits, None
        B, T, C = logits.shape
        logits = logits.view(B*T, C)
        targets = targets.view(B*T)
        loss = F.cross_entropy(logits, targets)
        return logits, loss
 
    def generate(self, idx, max_new_tokens):
        for _ in range(max_new_tokens):
            idx_cond = idx[:, -block_size:]
            logits, _ = self(idx_cond)
            logits = logits[:, -1, :]  # last token
            probs = F.softmax(logits, dim=-1)
            next_token = torch.multinomial(probs, num_samples=1)
            idx = torch.cat((idx, next_token), dim=1)
        return idx
 
# Instantiate model
model = GPTLanguageModel().to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
 
# Training loop
for step in range(500):
    xb, yb = get_batch()
    logits, loss = model(xb, yb)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    if step % 50 == 0:
        print(f"Step {step}, Loss: {loss.item():.4f}")
 
# Text generation
context = torch.tensor([[stoi['h']]], dtype=torch.long).to(device)
generated = model.generate(context, max_new_tokens=100)[0].tolist()
print("Generated Text:\n", decode(generated))
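
As a possible extension (not part of the model above as written), generation can be made more controllable with temperature scaling and top-k filtering. A small sketch that reuses model, block_size, decode, and context from the code above:

@torch.no_grad()
def sample(model, idx, max_new_tokens, temperature=0.8, top_k=20):
    for _ in range(max_new_tokens):
        idx_cond = idx[:, -block_size:]
        logits, _ = model(idx_cond)
        logits = logits[:, -1, :] / temperature          # <1 sharpens, >1 flattens the distribution
        if top_k is not None:
            v, _ = torch.topk(logits, min(top_k, logits.size(-1)))
            logits[logits < v[:, [-1]]] = float('-inf')  # drop everything outside the top-k
        probs = F.softmax(logits, dim=-1)
        next_token = torch.multinomial(probs, num_samples=1)
        idx = torch.cat((idx, next_token), dim=1)
    return idx

print("Sampled Text:\n", decode(sample(model, context, max_new_tokens=100)[0].tolist()))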
 

Author: Kulbir Singh

I am an analytics and data science professional with over two decades of experience in IT, specializing in leveraging data for strategic decision-making and actionable insights. Proficient in AI and experienced across healthcare, retail, and finance, I have led impactful projects that improved healthcare quality and reduced costs. Recognized with international achievements and multiple awards, I founded AIBoard (https://aiboard.io/), where I author educational articles and courses on AI. I drive innovation, mentor teams, and contribute to advances in AI and healthcare through publications and speaking engagements. In addition to my professional work, I am active in multiple IT communities, contribute as a blogger and educator, and serve on the judging committee for the Globee Awards. I completed my Master's in Computer Science, specializing in Data Science, at the University of Illinois at Urbana-Champaign.
