Multilingual Text Summarizer with mT5

Generate abstractive summaries in 100+ languages using Google's multilingual T5 transformer model and Hugging Face transformers.


Project Overview

This project implements a multilingual abstractive text summarizer powered by mT5 (multilingual Text-to-Text Transfer Transformer), a state-of-the-art model from Google Research. Unlike extractive summarization that simply selects and rearranges existing sentences, abstractive summarization generates entirely new text that captures the essence and key information of the source document — much like how humans summarize content.

The system leverages the Hugging Face Transformers library to load a pre-trained model fine-tuned on a multilingual summarization dataset, so a single model serves many languages: the underlying mT5 is pre-trained on 101 languages, and the XL-Sum fine-tuned checkpoint used here covers 44 of them. This makes it well suited to global applications, content curation, and information extraction.

What is mT5?

mT5 (multilingual T5) is Google's multilingual variant of the T5 (Text-to-Text Transfer Transformer) model. It was pre-trained on the mC4 corpus, covering 101 languages, and treats every NLP task as a text-to-text problem. This unified framework allows the same model architecture to perform translation, summarization, question answering, and more.
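
As a concrete illustration of the text-to-text framing, every task becomes "string in, string out." The task prefixes below follow the original T5 convention and are shown for illustration only; mT5 checkpoints are typically fine-tuned per task (as in this project) rather than prompted with prefixes:

# Every task is framed as "string in, string out".
# Prefixes follow the original T5 convention; illustrative only.
text_to_text_examples = {
    "summarization": ("summarize: <article text>", "<summary>"),
    "translation": ("translate English to German: Hello!", "Hallo!"),
    "question answering": ("question: Who created T5? context: <passage>", "Google"),
}
for task, (model_input, model_output) in text_to_text_examples.items():
    print(f"{task}: {model_input!r} -> {model_output!r}")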

Abstractive vs. Extractive Summarization

Understanding the difference between these two approaches is crucial for choosing the right summarization technique:

Extractive

Selects and combines existing sentences from the source text without modification. Fast and reliable, but can be choppy and lack coherence.

Example: Highlighting key sentences with a marker.

Abstractive (This Project)

Generates entirely new sentences that paraphrase and condense the source content. More human-like and fluent, capturing semantic meaning rather than copying text.

Example: Writing your own summary after reading a book.
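
To make the contrast concrete, here is a toy extractive baseline (not part of this project) that scores each sentence by average word frequency and returns the top-scoring sentences verbatim, in their original order:

from collections import Counter
import re

def extractive_summary(text, n_sentences=1):
    """Toy extractive summarizer: rank sentences by average word frequency."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    freqs = Counter(re.findall(r"\w+", text.lower()))

    def score(sentence):
        tokens = re.findall(r"\w+", sentence.lower())
        return sum(freqs[t] for t in tokens) / max(len(tokens), 1)

    top = set(sorted(sentences, key=score, reverse=True)[:n_sentences])
    # Preserve the original sentence order of the selected sentences
    return " ".join(s for s in sentences if s in top)

print(extractive_summary(
    "AI is growing fast. AI helps doctors diagnose disease. The weather is nice."
))

An abstractive model like mT5 would instead generate a new sentence of its own; the rest of this page shows how.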

Supported Languages

The mT5 model supports summarization across a wide range of languages, including:

English
French
Arabic
Spanish
German
Chinese
Japanese
Russian
Portuguese
Italian
Korean
+90 More

Live Example

Multilingual Summarization in Action

English Input

Artificial intelligence has transformed numerous industries over the past decade. From healthcare diagnostics that can identify diseases earlier than human doctors to autonomous vehicles navigating complex traffic patterns, AI systems are becoming increasingly sophisticated. Machine learning algorithms can now process vast amounts of data to identify patterns that humans might miss. However, concerns about AI safety, algorithmic bias, and ethics continue to grow alongside these technological advances.

Generated Summary

AI has revolutionized industries through advanced machine learning, but ethical concerns about safety and bias persist.

French Input

L'apprentissage automatique est une branche de l'intelligence artificielle qui permet aux machines d'apprendre à partir de données sans être explicitement programmées. Cette technologie est utilisée dans la reconnaissance vocale, la traduction automatique et les systèmes de recommandation.

(English translation: Machine learning is a branch of artificial intelligence that enables machines to learn from data without being explicitly programmed. This technology is used in speech recognition, machine translation, and recommendation systems.)

Generated Summary

L'apprentissage automatique permet aux machines d'apprendre sans programmation explicite, utilisé dans la reconnaissance vocale et les recommandations.

(English translation: Machine learning enables machines to learn without explicit programming, used in speech recognition and recommendations.)

Arabic Input

الذكاء الاصطناعي هو فرع من علوم الكمبيوتر يهدف إلى إنشاء آلات قادرة على محاكاة الذكاء البشري. يشمل ذلك التعلم الآلي ومعالجة اللغة الطبيعية والرؤية الحاسوبية.

(English translation: Artificial intelligence is a branch of computer science that aims to create machines capable of simulating human intelligence. This includes machine learning, natural language processing, and computer vision.)

Generated Summary

الذكاء الاصطناعي يهدف لإنشاء آلات تحاكي الذكاء البشري عبر التعلم الآلي ومعالجة اللغة.

(English translation: Artificial intelligence aims to create machines that simulate human intelligence through machine learning and language processing.)

Implementation Steps

  1. Install dependencies — Set up Hugging Face Transformers, PyTorch, and SentencePiece tokenizer
  2. Load pre-trained mT5 model — Download the model fine-tuned for multilingual summarization
  3. Tokenize input text — Convert text to token IDs with proper padding and truncation
  4. Generate summary — Use beam search decoding with length penalties and repetition prevention
  5. Decode output — Convert generated token IDs back to human-readable text
  6. Evaluate quality — Optionally measure performance using ROUGE metrics

Core Implementation

1. Installation & Setup

# Install required libraries
!pip install transformers sentencepiece torch

from transformers import MT5ForConditionalGeneration, MT5Tokenizer
import torch

# Check for GPU availability
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

2. Load Pre-trained Model

# Load mT5 model fine-tuned for multilingual summarization
# This model was trained on XLSum dataset covering 44 languages
model_name = "csebuetnlp/mT5_multilingual_XLSum"

tokenizer = MT5Tokenizer.from_pretrained(model_name)
model = MT5ForConditionalGeneration.from_pretrained(model_name)

# Move model to GPU if available
model = model.to(device)
print("Model loaded successfully!")

3. Create Summarization Function

def summarize_text(text, max_length=150, min_length=30, num_beams=4):
    """
    Generate an abstractive summary of the input text.

    Args:
        text (str): Input text to summarize
        max_length (int): Maximum length of generated summary
        min_length (int): Minimum length of generated summary
        num_beams (int): Beam search width for generation

    Returns:
        str: Generated summary
    """
    # Tokenize the input text
    inputs = tokenizer(
        text,
        return_tensors="pt",
        max_length=512,
        truncation=True,
        padding=True
    ).to(device)

    # Generate summary using the model
    with torch.no_grad():
        output_ids = model.generate(
            inputs["input_ids"],
            attention_mask=inputs["attention_mask"],
            max_length=max_length,
            min_length=min_length,
            num_beams=num_beams,
            length_penalty=2.0,
            early_stopping=True,
            no_repeat_ngram_size=3
        )

    # Decode the generated tokens back to text
    summary = tokenizer.decode(output_ids[0], skip_special_tokens=True)
    return summary

4. Test with Multiple Languages

# English example
english_text = """
Climate change represents one of the most pressing challenges
facing humanity in the 21st century. Rising global temperatures
are causing ice caps to melt, sea levels to rise, and weather
patterns to become more extreme. Scientists warn that without
significant action to reduce greenhouse gas emissions, the
consequences could be catastrophic for future generations.
"""

# Generate and print summary
english_summary = summarize_text(english_text)
print("English Summary:")
print(english_summary)

# French example
french_text = """
La technologie blockchain révolutionne de nombreux secteurs,
de la finance aux chaînes d'approvisionnement. Cette technologie
décentralisée permet des transactions sécurisées sans intermédiaire,
réduisant les coûts et augmentant la transparence.
"""

french_summary = summarize_text(french_text)
print("\nFrench Summary:")
print(french_summary)

Generation Parameters Explained

Fine-tuning these parameters allows you to control the quality and characteristics of generated summaries:

max_length — Maximum summary length in tokens (typical range: 100-200). Higher values produce longer, more detailed summaries.
min_length — Minimum summary length in tokens (typical range: 20-50). Prevents overly short summaries.
num_beams — Beam search width (typical range: 4-8). Higher values improve quality but slow down generation.
length_penalty — Exponential penalty on sequence length (typical range: 1.0-2.0). Higher values favor longer summaries.
no_repeat_ngram_size — Size of n-grams that may not repeat (typical values: 2-3). Reduces repetitive phrases.
early_stopping — Whether to stop once all beams are finished (True/False). True speeds up generation.
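
For example, using the summarize_text function defined above, these parameters let you trade brevity for detail. The values below are arbitrary illustrations, not tuned settings:

# Terse summary: tight length budget, default beam width
short_summary = summarize_text(english_text, max_length=60, min_length=15)

# Detailed summary: larger length budget and a wider beam
long_summary = summarize_text(english_text, max_length=200, min_length=80, num_beams=8)

print("Short:", short_summary)
print("Long:", long_summary)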

What is Beam Search?

Beam search is a heuristic search algorithm that explores multiple possible sequences simultaneously. Instead of greedily choosing the single best token at each step, it maintains the top k candidates (beams) and expands them in parallel.

How it works:

  1. Keep the k most probable first tokens as the initial beams.
  2. At each step, extend every beam with each candidate next token and score each extended sequence by its cumulative log-probability.
  3. Keep only the top k sequences and discard the rest.
  4. Repeat until every beam emits an end-of-sequence token, then return the best-scoring sequence.

This approach balances quality and computational cost — significantly better than greedy decoding, much faster than exhaustive search.
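
To see the mechanics without a neural network, here is a minimal, self-contained sketch that runs beam search over a hypothetical fixed table of per-step token log-probabilities (a stand-in for real model outputs):

import math

def beam_search(step_log_probs, k=3):
    """Toy beam search over fixed per-step token log-probabilities."""
    beams = [([], 0.0)]  # (token sequence, cumulative log-probability)
    for dist in step_log_probs:
        # Extend every beam with every candidate next token
        candidates = [
            (seq + [tok], score + lp)
            for seq, score in beams
            for tok, lp in dist.items()
        ]
        # Keep only the k highest-scoring sequences
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:k]
    return beams

# Hypothetical two-step distributions over a tiny vocabulary
steps = [
    {"AI": math.log(0.6), "ML": math.log(0.3), "DL": math.log(0.1)},
    {"helps": math.log(0.5), "hurts": math.log(0.4), "is": math.log(0.1)},
]
print(beam_search(steps, k=2))  # Top 2 sequences with their scores

Real generation additionally applies the length_penalty and no_repeat_ngram_size constraints described above while ranking beams.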

Evaluation with ROUGE Metrics

ROUGE (Recall-Oriented Understudy for Gisting Evaluation) is the standard metric for evaluating summarization quality by comparing generated summaries against reference summaries:

ROUGE Score Types

ROUGE-1 — overlap of unigrams (single words) between the generated and reference summaries
ROUGE-2 — overlap of bigrams (two-word sequences), a rough proxy for fluency
ROUGE-L — longest common subsequence, rewarding content that appears in the same order

Each metric reports Precision (how much of the generated summary is relevant), Recall (how much of the reference is captured), and F1-score (the harmonic mean of the two, F1 = 2PR / (P + R)).

# Install ROUGE scorer
!pip install rouge-score

from rouge_score import rouge_scorer

# Create scorer for ROUGE-1, ROUGE-2, and ROUGE-L
scorer = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL'], use_stemmer=True)

# Compare generated summary to reference summary
reference_summary = "Climate change threatens humanity with rising temperatures and extreme weather."
generated_summary = "Rising global temperatures from climate change pose catastrophic risks to future generations."

scores = scorer.score(reference_summary, generated_summary)

# Print results
print("ROUGE Scores:")
print(f"ROUGE-1 F1: {scores['rouge1'].fmeasure:.4f}")
print(f"ROUGE-2 F1: {scores['rouge2'].fmeasure:.4f}")
print(f"ROUGE-L F1: {scores['rougeL'].fmeasure:.4f}")

Advanced: Fine-tuning on Custom Data

While the pre-trained model works well for general summarization, you can fine-tune it on domain-specific data for better performance in specialized contexts (legal documents, medical reports, news articles, etc.); a minimal code sketch follows the steps below:

  1. Prepare dataset — Collect pairs of (document, summary) in your target domain and language
  2. Tokenize data — Process both inputs and labels using the mT5 tokenizer
  3. Configure training — Set up Trainer with appropriate hyperparameters (learning rate, batch size, epochs)
  4. Fine-tune model — Train on your custom dataset using Hugging Face Trainer API
  5. Evaluate and iterate — Test on held-out validation set, adjust hyperparameters as needed
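
Below is a minimal sketch of steps 2-4 using the Seq2SeqTrainer API. It assumes a hypothetical train_dataset (a datasets.Dataset with "document" and "summary" columns) and reuses the model and tokenizer loaded earlier; the hyperparameters are illustrative starting points, not tuned values:

from transformers import (
    DataCollatorForSeq2Seq,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)

def preprocess(batch):
    # Tokenize source documents and target summaries
    model_inputs = tokenizer(batch["document"], max_length=512, truncation=True)
    labels = tokenizer(text_target=batch["summary"], max_length=128, truncation=True)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

tokenized = train_dataset.map(preprocess, batched=True)

training_args = Seq2SeqTrainingArguments(
    output_dir="mt5-custom-summarizer",
    learning_rate=5e-5,
    per_device_train_batch_size=4,
    num_train_epochs=3,
)

trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=tokenized,
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()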

Hugging Face Transformers Library

This project relies heavily on the Hugging Face ecosystem, which provides:

Pre-trained models — thousands of checkpoints, including mT5, downloadable from the Hugging Face Hub
Unified APIs — from_pretrained() loading and generate() decoding that work consistently across architectures
Tokenizers — fast multilingual tokenization, including the SentencePiece-based mT5 tokenizer
Training tools — the Trainer API and data collators used for fine-tuning

Use Cases & Applications

Multilingual text summarization has numerous real-world applications:

News Aggregation

Automatically generate summaries of news articles in multiple languages for global news platforms and content curation.

Research & Academia

Create concise abstracts from lengthy academic papers, helping researchers quickly identify relevant literature.

Business Intelligence

Summarize market reports, financial documents, and customer feedback across international markets.

Customer Support

Generate ticket summaries and extract key issues from customer conversations in any supported language.

Try the Complete Implementation

Run the full notebook to experiment with multilingual summarization, test different parameters, and evaluate results on your own text data.

References & Further Reading

Xue et al., "mT5: A Massively Multilingual Pre-trained Text-to-Text Transformer," NAACL 2021
Hasan et al., "XL-Sum: Large-Scale Multilingual Abstractive Summarization for 44 Languages," Findings of ACL 2021
Raffel et al., "Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer," JMLR 2020
Lin, "ROUGE: A Package for Automatic Evaluation of Summaries," ACL 2004 Workshop on Text Summarization
Hugging Face Transformers documentation: https://huggingface.co/docs/transformers