
Building a Transformer-based Language Model from scratch to generate text

  • May 18, 2025
  • Posted by: Kulbir Singh
  • Category: Artificial Intelligence

A Large Language Model (LLM) is a type of deep learning model trained on massive text datasets to understand and generate human language. It belongs to the family of transformer-based architectures, which have revolutionized Natural Language Processing (NLP).

LLMs are capable of tasks like:

  1. Text generation (e.g., ChatGPT)

  2. Translation

  3. Summarization

  4. Question answering

  5. Sentiment analysis

  6. Code generation

Model Components Overview – Text Generation

  • Tokenizer

    • Purpose: Converts input text into a sequence of tokens (numbers representing words, subwords, or characters).

    • Types:

      • Word-level

      • Subword-level (e.g., Byte-Pair Encoding, SentencePiece); a minimal BPE-style sketch appears after this list

      • Character-level

  • Embedding Layer + Positional Encodings

    • Purpose: Transforms tokens into dense vector representations.

    • These vectors capture semantic information and are the input to the model.

    • Think of embeddings as high-dimensional meaning encoders.

    • Transformers have no recurrence or convolution, so they use positional encodings to inject sequence order.

    • Adds sinusoidal or learnable position-based vectors to the embeddings; a sinusoidal sketch appears after this list.

  • Transformer Blocks

    Each transformer block typically includes:

    a. Multi-Head Self-Attention

    • Allows the model to attend to different parts of a sequence simultaneously.

    • Captures dependencies regardless of distance.

    b. Layer Normalization

    • Stabilizes and accelerates training by normalizing inputs to layers.

    c. Feedforward Neural Network (FFN)

    • A fully connected network applied to each token independently after attention.

    d. Residual Connections

    • Shortcut connections that help mitigate vanishing gradients and improve training.

  • Stacking Transformer Layers

    • LLMs consist of many (e.g., 12, 24, 96+) stacked transformer layers.

    • Deeper models have more capacity to learn complex patterns.
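
The sample code further below uses a simple character-level tokenizer (the stoi/itos lookup tables). To give a flavor of the subword approach mentioned in the Tokenizer section, here is a rough byte-pair-encoding-style sketch; the train_bpe helper and the toy corpus are illustrative assumptions, not the exact algorithm of any particular library.

from collections import Counter

def train_bpe(corpus, num_merges=10):
    """Learn a tiny set of BPE-style merges from a toy corpus.
    Each word starts as a tuple of characters; the most frequent adjacent
    pair of symbols is merged into one symbol, num_merges times."""
    vocab = Counter(tuple(word) for word in corpus.split())  # word -> frequency
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for word, freq in vocab.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)    # most frequent adjacent pair
        merges.append(best)
        new_vocab = Counter()
        for word, freq in vocab.items():
            merged, i = [], 0
            while i < len(word):
                if i < len(word) - 1 and (word[i], word[i + 1]) == best:
                    merged.append(word[i] + word[i + 1])  # fuse the pair into one symbol
                    i += 2
                else:
                    merged.append(word[i])
                    i += 1
            new_vocab[tuple(merged)] += freq
        vocab = new_vocab
    return merges

merges = train_bpe("low lower lowest low low", num_merges=5)
print(merges)  # learned merge pairs, most frequent first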
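
Likewise, the model below learns its position embeddings from data; the original Transformer paper instead adds fixed sinusoidal vectors. A minimal sketch of that alternative (the SinusoidalPositionalEncoding module name is chosen here for illustration, not taken from any library):

import math
import torch
import torch.nn as nn

class SinusoidalPositionalEncoding(nn.Module):
    """Fixed sin/cos position vectors, as in the original Transformer paper."""
    def __init__(self, n_embd, max_len=512):
        super().__init__()
        position = torch.arange(max_len).unsqueeze(1)                         # (max_len, 1)
        div_term = torch.exp(torch.arange(0, n_embd, 2) * (-math.log(10000.0) / n_embd))
        pe = torch.zeros(max_len, n_embd)
        pe[:, 0::2] = torch.sin(position * div_term)                          # even dimensions
        pe[:, 1::2] = torch.cos(position * div_term)                          # odd dimensions
        self.register_buffer("pe", pe)

    def forward(self, x):                                                     # x: (B, T, n_embd)
        return x + self.pe[:x.size(1)]

# Usage sketch: instead of a learned position_embedding_table,
#   pos_enc = SinusoidalPositionalEncoding(n_embd)
#   x = pos_enc(tok_emb)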

Sample Code: MiniGPT for Text Generation (PyTorch)

 
import torch
import torch.nn as nn
import torch.nn.functional as F
 
# Hyperparameters
vocab_size = 65   # placeholder; reset below from the toy dataset
block_size = 32   # maximum context length (kept shorter than the toy text)
n_embd = 128      # embedding dimension
n_head = 4        # attention heads per transformer block
n_layer = 4       # number of transformer blocks
 
device = 'cuda' if torch.cuda.is_available() else 'cpu'
 
# Sample toy dataset (character-level)
text = "hello world! this is a simple transformer demo"
chars = sorted(set(text))
vocab_size = len(chars)  # reset the placeholder above to the actual vocabulary size
stoi = { ch:i for i,ch in enumerate(chars) }
itos = { i:ch for ch,i in stoi.items() }
 
def encode(s): return [stoi[c] for c in s]
def decode(l): return ''.join([itos[i] for i in l])
 
# Dataset
data = torch.tensor(encode(text), dtype=torch.long).to(device)
 
def get_batch(seq_len=block_size):
    # pick a random starting offset and return one (input, target) pair,
    # where the target is the input shifted one character to the right
    i = torch.randint(0, len(data) - seq_len - 1, (1,)).item()
    x = data[i:i+seq_len]
    y = data[i+1:i+seq_len+1]
    return x.unsqueeze(0), y.unsqueeze(0)
 
# Self-Attention Head
class Head(nn.Module):
    def __init__(self, head_size):
        super().__init__()
        self.key = nn.Linear(n_embd, head_size)
        self.query = nn.Linear(n_embd, head_size)
        self.value = nn.Linear(n_embd, head_size)
        self.register_buffer("tril", torch.tril(torch.ones(block_size, block_size)))
        self.dropout = nn.Dropout(0.1)
 
    def forward(self, x):
        B, T, C = x.shape                                 # batch, time, embedding dims
        k = self.key(x)                                   # (B, T, head_size)
        q = self.query(x)                                 # (B, T, head_size)
        att = (q @ k.transpose(-2,-1)) / (k.shape[-1] ** 0.5)  # scale by sqrt(head_size)
        att = att.masked_fill(self.tril[:T, :T] == 0, float('-inf'))  # causal mask
        att = F.softmax(att, dim=-1)
        att = self.dropout(att)
        v = self.value(x)                                 # (B, T, head_size)
        out = att @ v                                     # weighted sum of values
        return out
 
# Multi-Head Attention
class MultiHeadAttention(nn.Module):
    def __init__(self, num_heads, head_size):
        super().__init__()
        self.heads = nn.ModuleList([Head(head_size) for _ in range(num_heads)])
        self.proj = nn.Linear(num_heads * head_size, n_embd)
        self.dropout = nn.Dropout(0.1)
 
    def forward(self, x):
        out = torch.cat([h(x) for h in self.heads], dim=-1)
        out = self.dropout(self.proj(out))
        return out
 
# Feedforward Layer
class FeedForward(nn.Module):
    def __init__(self, n_embd):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_embd, 4 * n_embd),
            nn.ReLU(),
            nn.Linear(4 * n_embd, n_embd),
            nn.Dropout(0.1),
        )
 
    def forward(self, x):
        return self.net(x)
 
# Transformer Block
class Block(nn.Module):
    def __init__(self, n_embd, n_head):
        super().__init__()
        head_size = n_embd // n_head
        self.sa = MultiHeadAttention(n_head, head_size)
        self.ff = FeedForward(n_embd)
        self.ln1 = nn.LayerNorm(n_embd)
        self.ln2 = nn.LayerNorm(n_embd)
 
    def forward(self, x):
        x = x + self.sa(self.ln1(x))   # pre-norm residual connection around attention
        x = x + self.ff(self.ln2(x))   # pre-norm residual connection around the feedforward
        return x
 
# Full Language Model
class GPTLanguageModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.token_embedding_table = nn.Embedding(vocab_size, n_embd)
        self.position_embedding_table = nn.Embedding(block_size, n_embd)
        self.blocks = nn.Sequential(*[Block(n_embd, n_head) for _ in range(n_layer)])
        self.ln_f = nn.LayerNorm(n_embd)
        self.lm_head = nn.Linear(n_embd, vocab_size)
 
    def forward(self, idx, targets=None):
        B,T = idx.shape
        tok_emb = self.token_embedding_table(idx)        # (B,T,C)
        pos_emb = self.position_embedding_table(torch.arange(T, device=device)) # (T,C)
        x = tok_emb + pos_emb
        x = self.blocks(x)
        x = self.ln_f(x)
        logits = self.lm_head(x)                         # (B,T,vocab_size)
 
        if targets is None:
            return logits, None
        B, T, C = logits.shape
        logits = logits.view(B*T, C)
        targets = targets.view(B*T)
        loss = F.cross_entropy(logits, targets)
        return logits, loss
 
    def generate(self, idx, max_new_tokens):
        for _ in range(max_new_tokens):
            idx_cond = idx[:, -block_size:]
            logits, _ = self(idx_cond)
            logits = logits[:, -1, :]  # last token
            probs = F.softmax(logits, dim=-1)
            next_token = torch.multinomial(probs, num_samples=1)
            idx = torch.cat((idx, next_token), dim=1)
        return idx
 
# Instantiate model
model = GPTLanguageModel().to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
 
# Training loop
for step in range(500):
    xb, yb = get_batch()
    logits, loss = model(xb, yb)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    if step % 50 == 0:
        print(f"Step {step}, Loss: {loss.item():.4f}")
 
# Text generation
context = torch.tensor([[stoi['h']]], dtype=torch.long).to(device)
generated = model.generate(context, max_new_tokens=100)[0].tolist()
print("Generated Text:\n", decode(generated))
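
As a possible extension (not part of the model above as written), generation can be made more controllable with temperature scaling and top-k filtering. A small sketch that reuses model, block_size, decode, and context from the code above:

@torch.no_grad()
def sample(model, idx, max_new_tokens, temperature=0.8, top_k=20):
    for _ in range(max_new_tokens):
        idx_cond = idx[:, -block_size:]
        logits, _ = model(idx_cond)
        logits = logits[:, -1, :] / temperature          # <1 sharpens, >1 flattens the distribution
        if top_k is not None:
            v, _ = torch.topk(logits, min(top_k, logits.size(-1)))
            logits[logits < v[:, [-1]]] = float('-inf')  # drop everything outside the top-k
        probs = F.softmax(logits, dim=-1)
        next_token = torch.multinomial(probs, num_samples=1)
        idx = torch.cat((idx, next_token), dim=1)
    return idx

print("Sampled Text:\n", decode(sample(model, context, max_new_tokens=100)[0].tolist()))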
 

Author: Kulbir Singh

I am an analytics and data science professional with over two decades of experience in IT, specializing in leveraging data for strategic decision-making and actionable insights. Proficient in AI and experienced across healthcare, retail, and finance, I have led impactful projects that improved healthcare quality and reduced costs. Recognized with international achievements and multiple awards, I founded AIBoard (https://aiboard.io/), where I author educational articles and courses on AI. I drive innovation, mentor teams, and contribute to advances in AI and healthcare through publications and speaking engagements. In addition to my professional work, I am active in multiple IT communities, contribute as a blogger and educator, and serve on the judging committee for the Globee Awards. I completed my Master's in Computer Science, specializing in Data Science, at the University of Illinois at Urbana-Champaign.
