Project Overview
This project implements a multilingual abstractive text summarizer powered by mT5 (multilingual Text-to-Text Transfer Transformer), a state-of-the-art model from Google Research. Unlike extractive summarization that simply selects and rearranges existing sentences, abstractive summarization generates entirely new text that captures the essence and key information of the source document — much like how humans summarize content.
The system leverages the Hugging Face Transformers library to load a pre-trained model fine-tuned on a multilingual summarization dataset, enabling high-quality summaries across dozens of languages (44 for the checkpoint used here) without requiring a separate model for each language. This makes it well suited to global applications, content curation, and information extraction.
What is mT5?
mT5 (multilingual T5) is Google's multilingual variant of the T5 (Text-to-Text Transfer Transformer) model. It was pre-trained on the mC4 corpus, covering 101 languages, and treats every NLP task as a text-to-text problem. This unified framework allows the same model architecture to perform translation, summarization, question answering, and more.
- Pre-trained on mC4: Multilingual Colossal Clean Crawled Corpus
- Text-to-text framework: All tasks framed as sequence-to-sequence generation (see the sketch after this list)
- Encoder-decoder architecture: Bidirectional encoder + autoregressive decoder
- Multiple sizes: Small (300M), Base (580M), Large (1.2B), XL (3.7B), XXL (13B)
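To make the text-to-text framing concrete, every task reduces to mapping an input string to an output string. The prompts below are an illustrative sketch following T5's prefix convention; raw mT5 is pre-trained without supervised task prefixes, so prompts like these only work after fine-tuning.

# Illustrative only: T5-style task framings as (input, target) string pairs.
# Raw mT5 has no supervised prefixes; these assume a fine-tuned model.
text_to_text_examples = [
    ("summarize: Climate change is one of the most pressing challenges ...",
     "Climate change threatens future generations."),
    ("translate English to German: The house is small.",
     "Das Haus ist klein."),
    ("question: Who wrote Hamlet? context: Hamlet is a tragedy by "
     "William Shakespeare, written around 1600.",
     "William Shakespeare"),
]
for source, target in text_to_text_examples:
    print(f"{source!r} -> {target!r}")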
Abstractive vs. Extractive Summarization
Understanding the difference between these two approaches is crucial for choosing the right summarization technique:
Extractive
Selects and combines existing sentences from the source text without modification. Fast and reliable, but can be choppy and lack coherence.
Example: Highlighting key sentences with a marker.
Abstractive (This Project)
Generates entirely new sentences that paraphrase and condense the source content. More human-like and fluent, capturing semantic meaning rather than copying text.
Example: Writing your own summary after reading a book.
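To make the contrast concrete, below is a minimal frequency-based extractive baseline (an illustrative sketch, not part of this project's pipeline). It can only copy whole sentences from the source, whereas the mT5 model used later generates new sentences.

import re
from collections import Counter

# Minimal extractive baseline: score sentences by the frequency of the
# words they contain, then return the top ones verbatim (sketch only).
def extractive_summary(text, num_sentences=2):
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    word_freq = Counter(re.findall(r"\w+", text.lower()))
    scored = sorted(
        sentences,
        key=lambda s: sum(word_freq[w] for w in re.findall(r"\w+", s.lower())),
        reverse=True,
    )
    top = set(scored[:num_sentences])
    # Preserve the original sentence order in the output
    return " ".join(s for s in sentences if s in top)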
Supported Languages
The mT5 model supports summarization across a wide range of languages; the fine-tuned XL-Sum checkpoint used in this project covers 44 of them.
Live Example
[Interactive demo: English, French, and Arabic inputs shown side by side with their generated summaries]
Implementation Steps
- Install dependencies — Set up Hugging Face Transformers, PyTorch, and SentencePiece tokenizer
- Load pre-trained mT5 model — Download the model fine-tuned for multilingual summarization
- Tokenize input text — Convert text to token IDs with proper padding and truncation
- Generate summary — Use beam search decoding with length penalties and repetition prevention
- Decode output — Convert generated token IDs back to human-readable text
- Evaluate quality — Optionally measure performance using ROUGE metrics
Core Implementation
1. Installation & Setup
# Install required libraries
!pip install transformers sentencepiece torch
from transformers import MT5ForConditionalGeneration, MT5Tokenizer
import torch
# Check for GPU availability
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")
2. Load Pre-trained Model
# Load mT5 model fine-tuned for multilingual summarization
# This model was trained on XLSum dataset covering 44 languages
model_name = "csebuetnlp/mT5_multilingual_XLSum"
tokenizer = MT5Tokenizer.from_pretrained(model_name)
model = MT5ForConditionalGeneration.from_pretrained(model_name)
# Move model to GPU if available
model = model.to(device)
print("Model loaded successfully!")
3. Create Summarization Function
def summarize_text(text, max_length=150, min_length=30, num_beams=4):
"""
Generate an abstractive summary of the input text.
Args:
text (str): Input text to summarize
max_length (int): Maximum length of generated summary
min_length (int): Minimum length of generated summary
num_beams (int): Beam search width for generation
Returns:
str: Generated summary
"""
# Tokenize the input text
inputs = tokenizer(
text,
return_tensors="pt",
max_length=512,
truncation=True,
padding=True
).to(device)
# Generate summary using the model
with torch.no_grad():
output_ids = model.generate(
inputs["input_ids"],
attention_mask=inputs["attention_mask"],
max_length=max_length,
min_length=min_length,
num_beams=num_beams,
length_penalty=2.0,
early_stopping=True,
no_repeat_ngram_size=3
)
# Decode the generated tokens back to text
summary = tokenizer.decode(output_ids[0], skip_special_tokens=True)
return summary
4. Test with Multiple Languages
# English example
english_text = """
Climate change represents one of the most pressing challenges
facing humanity in the 21st century. Rising global temperatures
are causing ice caps to melt, sea levels to rise, and weather
patterns to become more extreme. Scientists warn that without
significant action to reduce greenhouse gas emissions, the
consequences could be catastrophic for future generations.
"""
# Generate and print summary
english_summary = summarize_text(english_text)
print("English Summary:")
print(english_summary)
# French example
french_text = """
La technologie blockchain révolutionne de nombreux secteurs,
de la finance aux chaînes d'approvisionnement. Cette technologie
décentralisée permet des transactions sécurisées sans intermédiaire,
réduisant les coûts et augmentant la transparence.
"""
french_summary = summarize_text(french_text)
print("\nFrench Summary:")
print(french_summary)
Generation Parameters Explained
Fine-tuning these parameters allows you to control the quality and characteristics of generated summaries; a short experiment follows the table:
| Parameter | Description | Typical Range | Effect |
|---|---|---|---|
| `max_length` | Maximum summary length in tokens | 100-200 | Higher = longer, more detailed summaries |
| `min_length` | Minimum summary length in tokens | 20-50 | Prevents overly short summaries |
| `num_beams` | Beam search width | 4-8 | Higher = better quality but slower |
| `length_penalty` | Exponential penalty for length | 1.0-2.0 | Higher = favors longer summaries |
| `no_repeat_ngram_size` | Prevents repeating n-grams | 2-3 | Reduces repetitive phrases |
| `early_stopping` | Stop when all beams finish | True/False | True = faster generation |
What is Beam Search?
Beam search is a heuristic search algorithm that explores multiple possible sequences simultaneously. Instead of greedily choosing the single best token at each step, it maintains the top k candidates (beams) and expands them in parallel.
How it works:
- Start with `num_beams` hypotheses (e.g., 4 different starting sequences)
- At each step, expand each hypothesis by considering all possible next tokens
- Keep only the top `num_beams` sequences based on cumulative probability
- Continue until all beams generate an end-of-sequence token or reach max length
- Return the sequence with the highest overall score
This approach balances quality and computational cost — significantly better than greedy decoding, much faster than exhaustive search.
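The toy implementation below sketches the algorithm over a hand-written next-token table. The table and its probabilities are invented for illustration (real beam search in Transformers works over the model's vocabulary and also applies the length penalty when ranking finished beams):

import math

# Toy next-token model: given a prefix, return {token: probability}.
# This fixed table is purely illustrative; "</s>" marks end of sequence.
def next_token_probs(prefix):
    table = {
        (): {"the": 0.6, "a": 0.4},
        ("the",): {"cat": 0.5, "dog": 0.3, "</s>": 0.2},
        ("a",): {"dog": 0.7, "</s>": 0.3},
        ("the", "cat"): {"</s>": 1.0},
        ("the", "dog"): {"</s>": 1.0},
        ("a", "dog"): {"</s>": 1.0},
    }
    return table.get(tuple(prefix), {"</s>": 1.0})

def beam_search(num_beams=2, max_len=4):
    # Each beam is (tokens, cumulative log-probability)
    beams = [([], 0.0)]
    finished = []
    for _ in range(max_len):
        # Expand every surviving beam by every possible next token
        candidates = []
        for tokens, score in beams:
            for tok, p in next_token_probs(tokens).items():
                candidates.append((tokens + [tok], score + math.log(p)))
        # Keep only the top `num_beams` sequences by cumulative score
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for tokens, score in candidates:
            if tokens[-1] == "</s>":
                finished.append((tokens, score))
            else:
                beams.append((tokens, score))
            if len(beams) == num_beams:
                break
        if not beams:  # all beams finished (early stopping)
            break
    # Return the finished sequence with the highest overall score
    return max(finished, key=lambda c: c[1])

print(beam_search())  # -> (['the', 'cat', '</s>'], log(0.6 * 0.5))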
Evaluation with ROUGE Metrics
ROUGE (Recall-Oriented Understudy for Gisting Evaluation) is the standard metric for evaluating summarization quality by comparing generated summaries against reference summaries:
ROUGE Score Types
- ROUGE-1: Unigram (single word) overlap — measures word-level similarity
- ROUGE-2: Bigram (two-word phrase) overlap — captures phrase-level similarity
- ROUGE-L: Longest Common Subsequence — rewards sentence-level structure
Each metric reports Precision (how much of the generated summary is relevant), Recall (how much of the reference is captured), and F1-score (harmonic mean of both).
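As a toy illustration: with reference "the cat sat on the mat" and generated summary "the cat lay on the mat", five of the six unigrams overlap, so ROUGE-1 precision = recall = 5/6 ≈ 0.83, and the F1-score is likewise 0.83.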
# Install ROUGE scorer
!pip install rouge-score
from rouge_score import rouge_scorer
# Create scorer for ROUGE-1, ROUGE-2, and ROUGE-L
scorer = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL'], use_stemmer=True)
# Compare generated summary to reference summary
reference_summary = "Climate change threatens humanity with rising temperatures and extreme weather."
generated_summary = "Rising global temperatures from climate change pose catastrophic risks to future generations."
scores = scorer.score(reference_summary, generated_summary)
# Print results
print("ROUGE Scores:")
print(f"ROUGE-1 F1: {scores['rouge1'].fmeasure:.4f}")
print(f"ROUGE-2 F1: {scores['rouge2'].fmeasure:.4f}")
print(f"ROUGE-L F1: {scores['rougeL'].fmeasure:.4f}")
Advanced: Fine-tuning on Custom Data
While the pre-trained model works well for general summarization, you can fine-tune it on domain-specific data for better performance in specialized contexts such as legal documents, medical reports, or news articles. A condensed training sketch follows these steps:
- Prepare dataset — Collect pairs of (document, summary) in your target domain and language
- Tokenize data — Process both inputs and labels using the mT5 tokenizer
- Configure training — Set up Trainer with appropriate hyperparameters (learning rate, batch size, epochs)
- Fine-tune model — Train on your custom dataset using Hugging Face Trainer API
- Evaluate and iterate — Test on held-out validation set, adjust hyperparameters as needed
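The sketch below condenses steps 2-4 using the Hugging Face Seq2SeqTrainer. The raw_dataset variable, its splits, and the "document"/"summary" column names are assumptions for illustration; adapt them to your own data.

from transformers import (DataCollatorForSeq2Seq, Seq2SeqTrainer,
                          Seq2SeqTrainingArguments)

# Assumes `raw_dataset` is a datasets.DatasetDict with "train" and
# "validation" splits and "document"/"summary" columns (hypothetical names)
def preprocess(batch):
    model_inputs = tokenizer(batch["document"], max_length=512, truncation=True)
    labels = tokenizer(text_target=batch["summary"], max_length=150, truncation=True)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

tokenized = raw_dataset.map(preprocess, batched=True)

training_args = Seq2SeqTrainingArguments(
    output_dir="mt5-custom-summarizer",  # hypothetical output path
    learning_rate=5e-5,
    per_device_train_batch_size=4,
    num_train_epochs=3,
    predict_with_generate=True,
)

trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["validation"],
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()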
Hugging Face Transformers Library
This project relies heavily on the Hugging Face ecosystem, which provides (a minimal pipeline example follows this list):
- Pre-trained models: Thousands of models for every NLP task across 100+ languages
- Tokenizers: Fast, efficient text preprocessing with automatic vocabulary handling
- Trainer API: High-level interface for training and evaluation with minimal boilerplate
- Hub integration: Easy model sharing, versioning, and deployment
- Production-ready: Optimized inference with ONNX, quantization, and hardware acceleration
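For instance, the pipeline API wraps tokenization, generation, and decoding in a single call (a minimal sketch using the same checkpoint as above):

from transformers import pipeline

# One-call summarization; generation kwargs mirror summarize_text above
summarizer = pipeline("summarization", model="csebuetnlp/mT5_multilingual_XLSum")
result = summarizer(english_text, max_length=150, min_length=30, num_beams=4)
print(result[0]["summary_text"])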
Use Cases & Applications
Multilingual text summarization has numerous real-world applications:
News Aggregation
Automatically generate summaries of news articles in multiple languages for global news platforms and content curation.
Research & Academia
Create concise abstracts from lengthy academic papers, helping researchers quickly identify relevant literature.
Business Intelligence
Summarize market reports, financial documents, and customer feedback across international markets.
Customer Support
Generate ticket summaries and extract key issues from customer conversations in any supported language.
Try the Complete Implementation
Run the full notebook to experiment with multilingual summarization, test different parameters, and evaluate results on your own text data.
References & Further Reading
- Xue, L., et al. (2021). "mT5: A Massively Multilingual Pre-trained Text-to-Text Transformer" — Original mT5 paper
- Raffel, C., et al. (2020). "Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer" — T5 architecture
- mT5 on Hugging Face Model Hub
- mT5 Documentation — Hugging Face
- Lin, C. Y. (2004). "ROUGE: A Package for Automatic Evaluation of Summaries" — ROUGE metrics