Wikipedia-Based Question Answering Chatbot

Build an intelligent chatbot that retrieves and processes Wikipedia articles to answer questions using TF-IDF vectorization and cosine similarity.

Project Overview

This project demonstrates how to build a document-based question-answering chatbot that leverages the Wikipedia API as its knowledge source. When a user asks a question, the system fetches relevant Wikipedia articles, processes the text using natural language processing techniques, and returns the most relevant answer based on semantic similarity.

Unlike modern transformer-based models, this approach uses classical NLP methods including TF-IDF vectorization and cosine similarity, making it lightweight, interpretable, and ideal for understanding foundational information retrieval concepts.

Example Conversation
User: What is machine learning?
Bot: Machine learning is a subset of artificial intelligence that provides systems the ability to automatically learn and improve from experience without being explicitly programmed. It focuses on the development of computer programs that can access data and use it to learn for themselves.
User: Who invented the telephone?
Bot: Alexander Graham Bell is credited with patenting the first practical telephone in 1876. Bell's work on the harmonic telegraph led to the development of the telephone, which transmitted speech electronically.

System Architecture

The chatbot follows a multi-stage pipeline that transforms user queries into structured responses:

User Query → Wikipedia API → Text Preprocessing → TF-IDF Vectorization → Cosine Similarity → Best Answer

Core NLP Concepts

TF-IDF (Term Frequency-Inverse Document Frequency)

TF-IDF is a numerical statistic that reflects how important a word is to a document within a collection. It increases proportionally to the number of times a word appears in the document but is offset by the frequency of the word in the corpus, helping to adjust for common words like "the" or "is".

TF-IDF Formula

\[ \text{TF-IDF}(t, d) = \text{TF}(t, d) \times \log\left(\frac{N}{\text{DF}(t)}\right) \]

Where: TF = term frequency in document, N = total number of documents, DF = number of documents containing the term

How TF-IDF Works

Term Frequency (TF): Measures how often a term appears in a document. More occurrences suggest higher relevance.

Inverse Document Frequency (IDF): Reduces the weight of terms that appear frequently across many documents (common words) and increases the weight of rare, more informative terms.

Result: Words that are both frequent in a specific document and rare across the corpus receive the highest scores.
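To make the formula concrete, here is a minimal, dependency-free sketch that scores terms across a toy three-document corpus (the documents and the `tf_idf` helper are illustrative, not part of the chatbot):

```python
import math

def tf_idf(term, doc, corpus):
    """TF-IDF(t, d) = TF(t, d) * log(N / DF(t)), matching the formula above."""
    tf = doc.count(term)                      # raw term frequency in the document
    df = sum(1 for d in corpus if term in d)  # documents containing the term
    return tf * math.log(len(corpus) / df)

corpus = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "sat", "on", "the", "log"],
    ["the", "cat", "chased", "the", "dog"],
]

# "the" appears in every document, so its IDF is log(3/3) = 0
print(tf_idf("the", corpus[0], corpus))  # 0.0
# "mat" appears in only one document, so it scores highest
print(tf_idf("mat", corpus[0], corpus))  # log(3) ≈ 1.099
```

Note that scikit-learn's TfidfVectorizer applies a smoothed IDF variant and L2-normalizes each row, so its exact numbers differ from this textbook formula, though the relative rankings behave the same way.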

Cosine Similarity

Cosine similarity measures the cosine of the angle between two non-zero vectors in a multi-dimensional space. In NLP, these vectors represent documents or sentences after TF-IDF transformation. A cosine similarity of 1 means the vectors are identical, 0 means they are orthogonal (unrelated), and -1 means they are diametrically opposed.

Cosine Similarity Formula

\[ \text{similarity}(A, B) = \cos(\theta) = \frac{A \cdot B}{\|A\| \times \|B\|} = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^2} \times \sqrt{\sum_{i=1}^{n} B_i^2}} \]

Where: A and B are TF-IDF vectors, and the dot product measures their alignment
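The formula is easy to verify by hand. This small sketch (with made-up vectors) shows the two boundary cases that matter for retrieval: parallel vectors score 1, and vectors with no shared non-zero components score 0:

```python
import math

def cosine_sim(a, b):
    """cos(theta) = (A . B) / (||A|| * ||B||), as in the formula above."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Parallel vectors (same direction, different magnitudes): similarity 1
print(cosine_sim([1, 2, 0], [2, 4, 0]))  # 1.0 up to float rounding
# Orthogonal vectors (no shared terms): similarity 0
print(cosine_sim([1, 0, 0], [0, 3, 0]))  # 0.0
```

Because TF-IDF vectors never have negative components, the similarities in this pipeline always fall between 0 and 1; the -1 case cannot occur.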

Implementation Steps

  1. Install Required Libraries

    Set up the necessary packages for Wikipedia access, NLP processing, and vectorization.

    # Install dependencies ("wikipedia" provides the search fallback in step 2)
    !pip install wikipedia-api wikipedia nltk scikit-learn
    
    # Import libraries
    import wikipediaapi
    import nltk
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity
    import numpy as np
    
    # Download NLTK resources
    nltk.download('punkt')
    nltk.download('punkt_tab')  # required by sent_tokenize on newer NLTK releases
    nltk.download('stopwords')
  2. Fetch Wikipedia Content

    Create functions to retrieve article content from Wikipedia using the API.

    def get_wikipedia_content(topic):
        """Fetch Wikipedia article text for a given topic."""
        wiki = wikipediaapi.Wikipedia(
            language='en',
            user_agent='KudosAI-Chatbot/1.0'
        )
    
        page = wiki.page(topic)
    
        if page.exists():
            return page.text
        else:
            # Fallback to search if exact match not found
            return search_wikipedia(topic)
    
    def search_wikipedia(query):
        """Search Wikipedia for relevant articles."""
        # wikipedia-api exposes no search endpoint, so fall back to the
        # separate "wikipedia" package (pip install wikipedia) for title search
        import wikipedia
        search_results = wikipedia.search(query, results=3)
    
        if search_results:
            wiki = wikipediaapi.Wikipedia(
                language='en',
                user_agent='KudosAI-Chatbot/1.0'
            )
            page = wiki.page(search_results[0])
            return page.text
        return None
  3. Preprocess Text with NLTK

    Clean and tokenize text by removing stopwords, punctuation, and converting to lowercase.

    from nltk.tokenize import sent_tokenize, word_tokenize
    from nltk.corpus import stopwords
    import string
    
    def preprocess_text(text):
        """Clean and tokenize text into sentences."""
        # Split text into sentences
        sentences = sent_tokenize(text)
    
        # Build the stopword set once; calling stopwords.words()
        # inside the loop re-reads the corpus for every sentence
        stop_words = set(stopwords.words('english'))
    
        # Clean each sentence
        cleaned_sentences = []
        for sent in sentences:
            # Convert to lowercase
            sent = sent.lower()
    
            # Remove punctuation
            sent = sent.translate(str.maketrans('', '', string.punctuation))
    
            # Tokenize and remove stopwords
            words = word_tokenize(sent)
            words = [w for w in words if w not in stop_words]
    
            cleaned_sentences.append(' '.join(words))
    
        # Return both original and cleaned versions
        return sentences, cleaned_sentences
  4. Build TF-IDF Vectorizer

    Transform text into numerical vectors using TF-IDF, making documents comparable.

    def find_best_answer(question, document_text):
        """Find the most relevant sentence(s) for a given question."""
        # Preprocess document and question
        original_sentences, cleaned_sentences = preprocess_text(document_text)
        _, cleaned_question = preprocess_text(question)
        cleaned_question = cleaned_question[0] if cleaned_question else question
    
        # Combine question with corpus for vectorization
        corpus = cleaned_sentences + [cleaned_question]
    
        # Create TF-IDF vectors
        vectorizer = TfidfVectorizer()
        tfidf_matrix = vectorizer.fit_transform(corpus)
    
        # Extract question vector and document vectors
        question_vector = tfidf_matrix[-1]
        sentence_vectors = tfidf_matrix[:-1]
    
        # Calculate cosine similarity
        similarities = cosine_similarity(question_vector, sentence_vectors).flatten()
    
        # Get top 3 most similar sentences
        top_indices = similarities.argsort()[-3:][::-1]
    
        # Return best matching sentences
        answers = []
        for idx in top_indices:
            if similarities[idx] > 0.1:  # Relevance threshold
                answers.append(original_sentences[idx])
    
        return ' '.join(answers) if answers else "I couldn't find a relevant answer."
  5. Create the Complete Chatbot

    Integrate all components into a conversational chatbot class with context management.

    class WikipediaChatbot:
        def __init__(self):
            self.context = None
            self.current_topic = None
    
        def chat(self, user_input):
            """Main chatbot interface."""
            # Check if user wants to change topic
            if user_input.lower().startswith('tell me about'):
                topic = user_input[len('tell me about'):].strip()
                self.context = get_wikipedia_content(topic)
                self.current_topic = topic
    
                if self.context:
                    return f"I found information about {topic}. What would you like to know?"
                return f"Sorry, I couldn't find information about {topic}."
    
            # Answer question using current context
            if self.context:
                return find_best_answer(user_input, self.context)
    
            return "Please tell me a topic first. Try: 'Tell me about [topic]'"
    
    # Usage example
    bot = WikipediaChatbot()
    print(bot.chat("Tell me about artificial intelligence"))
    print(bot.chat("What is machine learning?"))
    print(bot.chat("Who are the pioneers in this field?"))
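The full pipeline is easiest to trace on a toy corpus. The sketch below re-implements the retrieval core of steps 3–4 without NLTK or scikit-learn (a hand-picked stopword list and an unsmoothed TF-IDF stand in for the real libraries; all names and sentences here are illustrative):

```python
import math
import string

# Hand-picked stopword subset standing in for NLTK's full list
STOP_WORDS = {"a", "an", "and", "is", "of", "that", "the", "to", "what"}

def clean(sentence):
    """Lowercase, strip punctuation, drop stopwords (mirrors preprocess_text)."""
    sentence = sentence.lower().translate(str.maketrans('', '', string.punctuation))
    return [w for w in sentence.split() if w not in STOP_WORDS]

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def best_answer(question, sentences):
    """Return the sentence most similar to the question under TF-IDF."""
    # As in find_best_answer, the question joins the corpus before vectorization
    docs = [clean(s) for s in sentences] + [clean(question)]
    vocab = sorted({w for d in docs for w in d})
    n = len(docs)
    df = {t: sum(1 for d in docs if t in d) for t in vocab}
    # Unsmoothed TF-IDF(t, d) = count(t, d) * log(N / DF(t))
    vectors = [[d.count(t) * math.log(n / df[t]) for t in vocab] for d in docs]
    scores = [cosine(v, vectors[-1]) for v in vectors[:-1]]
    return sentences[max(range(len(scores)), key=lambda i: scores[i])]

sentences = [
    "Machine learning is a subset of artificial intelligence.",
    "The telephone was patented by Alexander Graham Bell.",
    "Photosynthesis converts sunlight into chemical energy.",
]
print(best_answer("What is machine learning?", sentences))
# -> Machine learning is a subset of artificial intelligence.
```

scikit-learn's TfidfVectorizer applies IDF smoothing and L2 normalization, so its scores differ numerically, but the ranking logic is the same.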

Key Features

What This Chatbot Does Well

- Lightweight: no GPU or large pretrained model is required
- Interpretable: every answer traces back to a concrete similarity score
- Current: answers come straight from live Wikipedia articles rather than a frozen training set
- Simple stack: only wikipedia-api, NLTK, and scikit-learn

Current Limitations

- Matches on exact word overlap, so synonyms and paraphrases are missed
- Answers are extracted sentences, not generated text
- The user must set a topic with "Tell me about ..." before asking questions
- Only one article serves as the knowledge context at a time

Possible Improvements

- Swap TF-IDF for word or sentence embeddings to capture semantic similarity
- Search and merge multiple articles instead of relying on a single page
- Add stemming or lemmatization so word forms like "running" and "run" match
- Cache fetched articles to reduce repeated API calls

Technical Insights

Why TF-IDF + Cosine Similarity Works

This combination is effective because:

- TF-IDF down-weights ubiquitous words, so matching hinges on content-bearing terms rather than "the" or "is"
- Cosine similarity compares direction rather than magnitude, so long sentences are not unfairly favored over short ones
- Both methods are deterministic and fast, with no training step required
- Every score can be traced back to specific shared terms, keeping the system interpretable

Parameter Tuning

Key parameters to experiment with:

- ngram_range in TfidfVectorizer: include bigrams to capture short phrases, not just single words
- stop_words in TfidfVectorizer: let the vectorizer handle stopword removal instead of NLTK
- The 0.1 relevance threshold in find_best_answer: raise it for precision, lower it for recall
- The number of top sentences returned (currently 3): more sentences give fuller but noisier answers
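As one concrete case, the 0.1 relevance threshold in find_best_answer controls the precision/recall trade-off. This small sketch (with made-up similarity scores) shows how raising the threshold shrinks the set of returned sentences:

```python
def filter_by_threshold(scores, threshold):
    """Keep only the sentence indices whose similarity clears the threshold."""
    return [i for i, s in enumerate(scores) if s > threshold]

scores = [0.62, 0.15, 0.08, 0.03]  # hypothetical cosine similarities

print(filter_by_threshold(scores, 0.1))  # [0, 1] -> more answers, riskier matches
print(filter_by_threshold(scores, 0.5))  # [0]    -> fewer, higher-confidence answers
```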

Ready to Build Your Own?

Run the complete implementation in Google Colab or explore the source code on GitHub. Experiment with different topics and see how the chatbot performs!