Project Overview
This project implements a multilingual abstractive text summarizer powered by mT5 (multilingual Text-to-Text Transfer Transformer), a state-of-the-art model from Google Research. Unlike extractive summarization that simply selects and rearranges existing sentences, abstractive summarization generates entirely new text that captures the essence and key information of the source document — much like how humans summarize content.
The system leverages the Hugging Face Transformers library to load a pre-trained model fine-tuned on a multilingual summarization dataset, enabling high-quality summaries across dozens of languages (44 for the checkpoint used here) without requiring a separate model for each language. This makes it well suited to global applications, content curation, and information extraction.
What is mT5?
mT5 (multilingual T5) is Google's multilingual variant of the T5 (Text-to-Text Transfer Transformer) model. It was pre-trained on the mC4 corpus, covering 101 languages, and treats every NLP task as a text-to-text problem. This unified framework allows the same model architecture to perform translation, summarization, question answering, and more.
- Pre-trained on mC4: Multilingual Colossal Clean Crawled Corpus
- Text-to-text framework: All tasks framed as sequence-to-sequence generation (see the sketch after this list)
- Encoder-decoder architecture: Bidirectional encoder + autoregressive decoder
- Multiple sizes: Small (300M), Base (580M), Large (1.2B), XL (3.7B), XXL (13B)
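To make the text-to-text framing concrete, every task reduces to mapping an input string to an output string. The prompts below are an illustrative sketch following T5's prefix convention; raw mT5 is pre-trained without supervised task prefixes, so prompts like these only work after fine-tuning.

# Illustrative only: T5-style task framings as (input, target) string pairs.
# Raw mT5 has no supervised prefixes; these assume a fine-tuned model.
text_to_text_examples = [
    ("summarize: Climate change is one of the most pressing challenges ...",
     "Climate change threatens future generations."),
    ("translate English to German: The house is small.",
     "Das Haus ist klein."),
    ("question: Who wrote Hamlet? context: Hamlet is a tragedy by "
     "William Shakespeare, written around 1600.",
     "William Shakespeare"),
]
for source, target in text_to_text_examples:
    print(f"{source!r} -> {target!r}")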
Abstractive vs. Extractive Summarization
Understanding the difference between these two approaches is crucial for choosing the right summarization technique:
Extractive
Selects and combines existing sentences from the source text without modification. Fast and reliable, but can be choppy and lack coherence.
Example: Highlighting key sentences with a marker.
Abstractive (This Project)
Generates entirely new sentences that paraphrase and condense the source content. More human-like and fluent, capturing semantic meaning rather than copying text.
Example: Writing your own summary after reading a book.
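To make the contrast concrete, below is a minimal frequency-based extractive baseline (an illustrative sketch, not part of this project's pipeline). It can only copy whole sentences from the source, whereas the mT5 model used later generates new sentences.

import re
from collections import Counter

# Minimal extractive baseline: score sentences by the frequency of the
# words they contain, then return the top ones verbatim (sketch only).
def extractive_summary(text, num_sentences=2):
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    word_freq = Counter(re.findall(r"\w+", text.lower()))
    scored = sorted(
        sentences,
        key=lambda s: sum(word_freq[w] for w in re.findall(r"\w+", s.lower())),
        reverse=True,
    )
    top = set(scored[:num_sentences])
    # Preserve the original sentence order in the output
    return " ".join(s for s in sentences if s in top)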
Supported Languages
The mT5 model supports summarization across a wide range of languages; the fine-tuned XL-Sum checkpoint used in this project covers 44 of them.
Live Example
[Interactive demo: English, French, and Arabic inputs shown side by side with their generated summaries]
Implementation Steps
- Install dependencies — Set up Hugging Face Transformers, PyTorch, and SentencePiece tokenizer
- Load pre-trained mT5 model — Download the model fine-tuned for multilingual summarization
- Tokenize input text — Convert text to token IDs with proper padding and truncation
- Generate summary — Use beam search decoding with length penalties and repetition prevention
- Decode output — Convert generated token IDs back to human-readable text
- Evaluate quality — Optionally measure performance using ROUGE metrics
Core Implementation
1. Installation & Setup
# Install required libraries
!pip install transformers sentencepiece torch
from transformers import MT5ForConditionalGeneration, MT5Tokenizer
import torch
# Check for GPU availability
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")
2. Load Pre-trained Model
# Load mT5 model fine-tuned for multilingual summarization
# This model was trained on XLSum dataset covering 44 languages
model_name = "csebuetnlp/mT5_multilingual_XLSum"
tokenizer = MT5Tokenizer.from_pretrained(model_name)
model = MT5ForConditionalGeneration.from_pretrained(model_name)
# Move model to GPU if available
model = model.to(device)
print("Model loaded successfully!")
3. Create Summarization Function
def summarize_text(text, max_length=150, min_length=30, num_beams=4):
"""
Generate an abstractive summary of the input text.
Args:
text (str): Input text to summarize
max_length (int): Maximum length of generated summary
min_length (int): Minimum length of generated summary
num_beams (int): Beam search width for generation
Returns:
str: Generated summary
"""
# Tokenize the input text
inputs = tokenizer(
text,
return_tensors="pt",
max_length=512,
truncation=True,
padding=True
).to(device)
# Generate summary using the model
with torch.no_grad():
output_ids = model.generate(
inputs["input_ids"],
attention_mask=inputs["attention_mask"],
max_length=max_length,
min_length=min_length,
num_beams=num_beams,
length_penalty=2.0,
early_stopping=True,
no_repeat_ngram_size=3
)
# Decode the generated tokens back to text
summary = tokenizer.decode(output_ids[0], skip_special_tokens=True)
return summary
4. Test with Multiple Languages
# English example
english_text = """
Climate change represents one of the most pressing challenges
facing humanity in the 21st century. Rising global temperatures
are causing ice caps to melt, sea levels to rise, and weather
patterns to become more extreme. Scientists warn that without
significant action to reduce greenhouse gas emissions, the
consequences could be catastrophic for future generations.
"""
# Generate and print summary
english_summary = summarize_text(english_text)
print("English Summary:")
print(english_summary)
# French example
french_text = """
La technologie blockchain révolutionne de nombreux secteurs,
de la finance aux chaînes d'approvisionnement. Cette technologie
décentralisée permet des transactions sécurisées sans intermédiaire,
réduisant les coûts et augmentant la transparence.
"""
french_summary = summarize_text(french_text)
print("\nFrench Summary:")
print(french_summary)
Generation Parameters Explained
Fine-tuning these parameters allows you to control the quality and characteristics of generated summaries; a short experiment follows the table:
| Parameter | Description | Typical Range | Effect |
|---|---|---|---|
| `max_length` | Maximum summary length in tokens | 100-200 | Higher = longer, more detailed summaries |
| `min_length` | Minimum summary length in tokens | 20-50 | Prevents overly short summaries |
| `num_beams` | Beam search width | 4-8 | Higher = better quality but slower |
| `length_penalty` | Exponential penalty for length | 1.0-2.0 | Higher = favors longer summaries |
| `no_repeat_ngram_size` | Prevents repeating n-grams | 2-3 | Reduces repetitive phrases |
| `early_stopping` | Stop when all beams finish | True/False | True = faster generation |
What is Beam Search?
Beam search is a heuristic search algorithm that explores multiple possible sequences simultaneously. Instead of greedily choosing the single best token at each step, it maintains the top k candidates (beams) and expands them in parallel.
How it works:
- Start with `num_beams` hypotheses (e.g., 4 different starting sequences)
- At each step, expand each hypothesis by considering all possible next tokens
- Keep only the top `num_beams` sequences based on cumulative probability
- Continue until all beams generate an end-of-sequence token or reach max length
- Return the sequence with the highest overall score
This approach balances quality and computational cost — significantly better than greedy decoding, much faster than exhaustive search.
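The toy implementation below sketches the algorithm over a hand-written next-token table. The table and its probabilities are invented for illustration (real beam search in Transformers works over the model's vocabulary and also applies the length penalty when ranking finished beams):

import math

# Toy next-token model: given a prefix, return {token: probability}.
# This fixed table is purely illustrative; "</s>" marks end of sequence.
def next_token_probs(prefix):
    table = {
        (): {"the": 0.6, "a": 0.4},
        ("the",): {"cat": 0.5, "dog": 0.3, "</s>": 0.2},
        ("a",): {"dog": 0.7, "</s>": 0.3},
        ("the", "cat"): {"</s>": 1.0},
        ("the", "dog"): {"</s>": 1.0},
        ("a", "dog"): {"</s>": 1.0},
    }
    return table.get(tuple(prefix), {"</s>": 1.0})

def beam_search(num_beams=2, max_len=4):
    # Each beam is (tokens, cumulative log-probability)
    beams = [([], 0.0)]
    finished = []
    for _ in range(max_len):
        # Expand every surviving beam by every possible next token
        candidates = []
        for tokens, score in beams:
            for tok, p in next_token_probs(tokens).items():
                candidates.append((tokens + [tok], score + math.log(p)))
        # Keep only the top `num_beams` sequences by cumulative score
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for tokens, score in candidates:
            if tokens[-1] == "</s>":
                finished.append((tokens, score))
            else:
                beams.append((tokens, score))
            if len(beams) == num_beams:
                break
        if not beams:  # all beams finished (early stopping)
            break
    # Return the finished sequence with the highest overall score
    return max(finished, key=lambda c: c[1])

print(beam_search())  # -> (['the', 'cat', '</s>'], log(0.6 * 0.5))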
Evaluation with ROUGE Metrics
ROUGE (Recall-Oriented Understudy for Gisting Evaluation) is the standard metric for evaluating summarization quality by comparing generated summaries against reference summaries:
ROUGE Score Types
- ROUGE-1: Unigram (single word) overlap — measures word-level similarity
- ROUGE-2: Bigram (two-word phrase) overlap — captures phrase-level similarity
- ROUGE-L: Longest Common Subsequence — rewards sentence-level structure
Each metric reports Precision (how much of the generated summary is relevant), Recall (how much of the reference is captured), and F1-score (harmonic mean of both).
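As a toy illustration: with reference "the cat sat on the mat" and generated summary "the cat lay on the mat", five of the six unigrams overlap, so ROUGE-1 precision = recall = 5/6 ≈ 0.83, and the F1-score is likewise 0.83.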
# Install ROUGE scorer
!pip install rouge-score
from rouge_score import rouge_scorer
# Create scorer for ROUGE-1, ROUGE-2, and ROUGE-L
scorer = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL'], use_stemmer=True)
# Compare generated summary to reference summary
reference_summary = "Climate change threatens humanity with rising temperatures and extreme weather."
generated_summary = "Rising global temperatures from climate change pose catastrophic risks to future generations."
scores = scorer.score(reference_summary, generated_summary)
# Print results
print("ROUGE Scores:")
print(f"ROUGE-1 F1: {scores['rouge1'].fmeasure:.4f}")
print(f"ROUGE-2 F1: {scores['rouge2'].fmeasure:.4f}")
print(f"ROUGE-L F1: {scores['rougeL'].fmeasure:.4f}")
Advanced: Fine-tuning on Custom Data
While the pre-trained model works well for general summarization, you can fine-tune it on domain-specific data for better performance in specialized contexts such as legal documents, medical reports, or news articles. A condensed training sketch follows these steps:
- Prepare dataset — Collect pairs of (document, summary) in your target domain and language
- Tokenize data — Process both inputs and labels using the mT5 tokenizer
- Configure training — Set up Trainer with appropriate hyperparameters (learning rate, batch size, epochs)
- Fine-tune model — Train on your custom dataset using Hugging Face Trainer API
- Evaluate and iterate — Test on held-out validation set, adjust hyperparameters as needed
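The sketch below condenses steps 2-4 using the Hugging Face Seq2SeqTrainer. The raw_dataset variable, its splits, and the "document"/"summary" column names are assumptions for illustration; adapt them to your own data.

from transformers import (DataCollatorForSeq2Seq, Seq2SeqTrainer,
                          Seq2SeqTrainingArguments)

# Assumes `raw_dataset` is a datasets.DatasetDict with "train" and
# "validation" splits and "document"/"summary" columns (hypothetical names)
def preprocess(batch):
    model_inputs = tokenizer(batch["document"], max_length=512, truncation=True)
    labels = tokenizer(text_target=batch["summary"], max_length=150, truncation=True)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

tokenized = raw_dataset.map(preprocess, batched=True)

training_args = Seq2SeqTrainingArguments(
    output_dir="mt5-custom-summarizer",  # hypothetical output path
    learning_rate=5e-5,
    per_device_train_batch_size=4,
    num_train_epochs=3,
    predict_with_generate=True,
)

trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["validation"],
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()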
Hugging Face Transformers Library
This project relies heavily on the Hugging Face ecosystem, which provides (a minimal pipeline example follows this list):
- Pre-trained models: Thousands of models for every NLP task across 100+ languages
- Tokenizers: Fast, efficient text preprocessing with automatic vocabulary handling
- Trainer API: High-level interface for training and evaluation with minimal boilerplate
- Hub integration: Easy model sharing, versioning, and deployment
- Production-ready: Optimized inference with ONNX, quantization, and hardware acceleration
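For instance, the pipeline API wraps tokenization, generation, and decoding in a single call (a minimal sketch using the same checkpoint as above):

from transformers import pipeline

# One-call summarization; generation kwargs mirror summarize_text above
summarizer = pipeline("summarization", model="csebuetnlp/mT5_multilingual_XLSum")
result = summarizer(english_text, max_length=150, min_length=30, num_beams=4)
print(result[0]["summary_text"])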
Use Cases & Applications
Multilingual text summarization has numerous real-world applications:
News Aggregation
Automatically generate summaries of news articles in multiple languages for global news platforms and content curation.
Research & Academia
Create concise abstracts from lengthy academic papers, helping researchers quickly identify relevant literature.
Business Intelligence
Summarize market reports, financial documents, and customer feedback across international markets.
Customer Support
Generate ticket summaries and extract key issues from customer conversations in any supported language.
Try the Complete Implementation
Run the full notebook to experiment with multilingual summarization, test different parameters, and evaluate results on your own text data.
References & Further Reading
- Xue, L., et al. (2021). "mT5: A Massively Multilingual Pre-trained Text-to-Text Transformer" — Original mT5 paper
- Raffel, C., et al. (2020). "Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer" — T5 architecture
- mT5 on Hugging Face Model Hub
- mT5 Documentation — Hugging Face
- Lin, C. Y. (2004). "ROUGE: A Package for Automatic Evaluation of Summaries" — ROUGE metrics