Project Overview
This project demonstrates how to build a document-based question-answering chatbot that leverages the Wikipedia API as its knowledge source. When a user asks a question, the system fetches relevant Wikipedia articles, processes the text using natural language processing techniques, and returns the most relevant answer based on semantic similarity.
Unlike modern transformer-based models, this approach uses classical NLP methods including TF-IDF vectorization and cosine similarity, making it lightweight, interpretable, and ideal for understanding foundational information retrieval concepts.
System Architecture
The chatbot follows a multi-stage pipeline that transforms user queries into structured responses: fetch the relevant Wikipedia article, split and clean its text into sentences, vectorize the sentences and the question with TF-IDF, rank the sentences by cosine similarity to the question, and return the top-scoring sentences as the answer.
Core NLP Concepts
TF-IDF (Term Frequency-Inverse Document Frequency)
TF-IDF is a numerical statistic that reflects how important a word is to a document within a collection. It increases proportionally to the number of times a word appears in the document but is offset by the frequency of the word in the corpus, helping to adjust for common words like "the" or "is".
TF-IDF Formula
\[ \text{TF-IDF}(t, d) = \text{TF}(t, d) \times \log\left(\frac{N}{\text{DF}(t)}\right) \]
Where: TF(t, d) = frequency of term t in document d, N = total number of documents, DF(t) = number of documents containing t
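As a quick worked example (with invented numbers), suppose a term appears 3 times in a document and occurs in 2 of a corpus's 10 documents; using the natural logarithm:

\[ \text{TF-IDF} = 3 \times \log\left(\frac{10}{2}\right) \approx 3 \times 1.609 \approx 4.83 \]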
How TF-IDF Works
- Term Frequency (TF): Measures how often a term appears in a document. More occurrences suggest higher relevance.
- Inverse Document Frequency (IDF): Reduces the weight of terms that appear frequently across many documents (common words) and increases the weight of rare, more informative terms.
- Result: Words that are both frequent in a specific document and rare across the corpus receive the highest scores, as the short example below makes concrete.
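To see these weights in action, here is a minimal sketch using scikit-learn's TfidfVectorizer on a toy three-document corpus (the corpus is invented for illustration and is not part of the project):

```python
# Toy TF-IDF demonstration: words shared by every document ("live", "in")
# are down-weighted, while document-specific words score highest.
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "penguins live in antarctica",
    "camels live in deserts",
    "dolphins live in oceans",
]

vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(corpus)
terms = vectorizer.get_feature_names_out()

# Print the highest-weighted term for each document
for i, doc in enumerate(corpus):
    row = tfidf[i].toarray().ravel()
    print(f"{doc!r}: top term = {terms[row.argmax()]!r}")
```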
Cosine Similarity
Cosine similarity measures the cosine of the angle between two non-zero vectors in a multi-dimensional space. In NLP, these vectors represent documents or sentences after TF-IDF transformation. A cosine similarity of 1 means the vectors point in the same direction, 0 means they are orthogonal (unrelated), and -1 means they are diametrically opposed. Since TF-IDF weights are never negative, scores in this system always fall between 0 and 1.
Cosine Similarity Formula
\[ \text{similarity}(A, B) = \cos(\theta) = \frac{A \cdot B}{\|A\| \times \|B\|} = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^2} \times \sqrt{\sum_{i=1}^{n} B_i^2}} \]
Where: A and B are TF-IDF vectors, and the dot product measures their alignment
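A quick numeric check (with made-up vectors) mirrors the formula; note that a scaled copy of a vector still scores 1.0, which is exactly why cosine similarity is robust to sentence length differences:

```python
# Hand-rolled cosine similarity on small, invented vectors
import numpy as np

def cos_sim(u, v):
    """Cosine of the angle between u and v."""
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

A = np.array([1.0, 2.0, 0.0])
B = np.array([2.0, 4.0, 0.0])  # same direction as A, twice the length
C = np.array([0.0, 0.0, 3.0])  # orthogonal to A

print(cos_sim(A, B))  # 1.0 -> identical direction, magnitude ignored
print(cos_sim(A, C))  # 0.0 -> no shared components
```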
Implementation Steps
Step 1: Install Required Libraries
Set up the necessary packages for Wikipedia access, NLP processing, and vectorization.
```python
# Install dependencies
!pip install wikipedia-api nltk scikit-learn

# Import libraries
import wikipediaapi
import nltk
import requests  # used by the search fallback in Step 2
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

# Download NLTK resources
nltk.download('punkt')
nltk.download('punkt_tab')  # required by sent_tokenize on newer NLTK versions
nltk.download('stopwords')
```
Step 2: Fetch Wikipedia Content
Create functions to retrieve article content from Wikipedia using the API.
```python
def get_wikipedia_content(topic):
    """Fetch Wikipedia article text for a given topic."""
    wiki = wikipediaapi.Wikipedia(
        user_agent='KudosAI-Chatbot/1.0',
        language='en'
    )
    page = wiki.page(topic)
    if page.exists():
        return page.text
    # Fallback to search if exact match not found
    return search_wikipedia(topic)


def search_wikipedia(query):
    """Search Wikipedia for relevant articles via the MediaWiki search API
    (the wikipedia-api library itself does not expose a search endpoint)."""
    response = requests.get(
        'https://en.wikipedia.org/w/api.php',
        params={
            'action': 'query',
            'list': 'search',
            'srsearch': query,
            'srlimit': 3,
            'format': 'json',
        },
        headers={'User-Agent': 'KudosAI-Chatbot/1.0'},
    )
    results = response.json()['query']['search']
    if results:
        wiki = wikipediaapi.Wikipedia(
            user_agent='KudosAI-Chatbot/1.0',
            language='en'
        )
        # Fetch the top-ranked search result
        page = wiki.page(results[0]['title'])
        return page.text
    return None
```
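A quick usage check (the topic here is an arbitrary example):

```python
# Fetch an article and peek at the opening characters
text = get_wikipedia_content("Alan Turing")
print(text[:200] if text else "No article found")
```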
Step 3: Preprocess Text with NLTK
Clean and tokenize text by removing stopwords, punctuation, and converting to lowercase.
```python
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.corpus import stopwords
import string

def preprocess_text(text):
    """Clean and tokenize text into sentences."""
    # Split text into sentences
    sentences = sent_tokenize(text)

    # Build the stopword set once rather than per sentence
    stop_words = set(stopwords.words('english'))

    # Clean each sentence
    cleaned_sentences = []
    for sent in sentences:
        # Convert to lowercase
        sent = sent.lower()
        # Remove punctuation
        sent = sent.translate(str.maketrans('', '', string.punctuation))
        # Tokenize and remove stopwords
        words = word_tokenize(sent)
        words = [w for w in words if w not in stop_words]
        cleaned_sentences.append(' '.join(words))

    # Return both original and cleaned versions
    return sentences, cleaned_sentences
```
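A quick sanity check on a single made-up sentence shows the effect of the cleaning steps:

```python
# Illustrative input; stopwords and punctuation disappear in the cleaned copy
originals, cleaned = preprocess_text("The Eiffel Tower was completed in 1889.")
print(originals[0])  # The Eiffel Tower was completed in 1889.
print(cleaned[0])    # eiffel tower completed 1889
```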
Step 4: Build TF-IDF Vectorizer
Transform the document's sentences and the user's question into TF-IDF vectors, then rank the sentences by cosine similarity to the question.
```python
def find_best_answer(question, document_text):
    """Find the most relevant sentence(s) for a given question."""
    # Preprocess document and question
    original_sentences, cleaned_sentences = preprocess_text(document_text)
    _, cleaned_question = preprocess_text(question)
    cleaned_question = cleaned_question[0] if cleaned_question else question

    # Combine question with corpus for vectorization
    corpus = cleaned_sentences + [cleaned_question]

    # Create TF-IDF vectors
    vectorizer = TfidfVectorizer()
    tfidf_matrix = vectorizer.fit_transform(corpus)

    # Extract question vector and document vectors
    question_vector = tfidf_matrix[-1]
    sentence_vectors = tfidf_matrix[:-1]

    # Calculate cosine similarity
    similarities = cosine_similarity(question_vector, sentence_vectors).flatten()

    # Get top 3 most similar sentences
    top_indices = similarities.argsort()[-3:][::-1]

    # Return best matching sentences
    answers = []
    for idx in top_indices:
        if similarities[idx] > 0.1:  # Relevance threshold
            answers.append(original_sentences[idx])

    return ' '.join(answers) if answers else "I couldn't find a relevant answer."
```
Step 5: Create the Complete Chatbot
Integrate all components into a conversational chatbot class with context management.
```python
class WikipediaChatbot:
    def __init__(self):
        self.context = None
        self.current_topic = None

    def chat(self, user_input):
        """Main chatbot interface."""
        # Check if user wants to change topic
        if user_input.lower().startswith('tell me about'):
            topic = user_input[13:].strip()
            self.context = get_wikipedia_content(topic)
            self.current_topic = topic
            if self.context:
                return f"I found information about {topic}. What would you like to know?"
            return f"Sorry, I couldn't find information about {topic}."

        # Answer question using current context
        if self.context:
            return find_best_answer(user_input, self.context)
        return "Please tell me a topic first. Try: 'Tell me about [topic]'"


# Usage example
bot = WikipediaChatbot()
print(bot.chat("Tell me about artificial intelligence"))
print(bot.chat("What is machine learning?"))
print(bot.chat("Who are the pioneers in this field?"))
```
Key Features
What This Chatbot Does Well
- Fast and lightweight: No need for large pre-trained models or GPU
- Interpretable results: TF-IDF scores show exactly why answers were selected
- Domain flexible: Works for any topic available on Wikipedia
- Educational value: Demonstrates foundational IR and NLP concepts
Current Limitations
- Keyword-based matching: Doesn't understand semantic meaning or context
- No multi-hop reasoning: Can't answer complex questions requiring multiple documents
- Limited context awareness: Doesn't maintain conversation history beyond current topic
- Wikipedia dependency: Answer quality depends on article structure and completeness
Possible Improvements
- Semantic embeddings: Use sentence transformers (BERT, RoBERTa) for deeper understanding (see the sketch after this list)
- Named entity recognition: Better topic detection and entity linking
- Conversation memory: Track dialogue history for contextual follow-up questions
- Re-ranking: Apply a question-answering model like DistilBERT for final answer selection
- Multi-document retrieval: Aggregate information from multiple Wikipedia articles
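As a sketch of the first improvement, here is how the TF-IDF matcher might be swapped for dense embeddings using the sentence-transformers library; the model name 'all-MiniLM-L6-v2' and the function find_best_answer_semantic are illustrative choices, not part of the original project:

```python
# Hypothetical drop-in replacement for find_best_answer using dense
# sentence embeddings (assumes: pip install sentence-transformers)
from sentence_transformers import SentenceTransformer, util
from nltk.tokenize import sent_tokenize

# One small general-purpose model; any compatible model would work
model = SentenceTransformer('all-MiniLM-L6-v2')

def find_best_answer_semantic(question, document_text, top_k=3):
    """Rank sentences by embedding similarity rather than keyword overlap."""
    sentences = sent_tokenize(document_text)
    # Encode sentences and question into dense vectors
    sentence_embeddings = model.encode(sentences, convert_to_tensor=True)
    question_embedding = model.encode(question, convert_to_tensor=True)
    # Cosine similarity still ranks candidates, now in embedding space
    scores = util.cos_sim(question_embedding, sentence_embeddings)[0]
    top = scores.topk(k=min(top_k, len(sentences)))
    return ' '.join(sentences[i] for i in top.indices.tolist())
```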
Technical Insights
Why TF-IDF + Cosine Similarity Works
This combination is effective because:
- TF-IDF captures which words are distinctive to each sentence
- Cosine similarity measures overlap in distinctive terms between question and answers
- Questions and answers about the same topic naturally share important keywords
- The approach is robust to sentence length differences, because cosine similarity normalizes away vector magnitude
Parameter Tuning
Key parameters to experiment with (a configuration sketch follows the list):
- Similarity threshold: Currently 0.1 — increase to return only highly relevant answers
- Top-k sentences: Currently 3 — adjust based on desired answer length
- TF-IDF parameters: max_features, min_df, max_df control vocabulary size
- Preprocessing: Toggle stopword removal, stemming, or lemmatization
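The vectorizer-side knobs can be passed directly to TfidfVectorizer, while the threshold and top-k live as plain constants in find_best_answer; the values below are arbitrary starting points for experimentation, not tuned recommendations:

```python
# Possible tuning configuration; values are illustrative starting points
from sklearn.feature_extraction.text import TfidfVectorizer

SIMILARITY_THRESHOLD = 0.1  # raise (e.g. to 0.2) to return only strong matches
TOP_K_SENTENCES = 3         # fewer for terse answers, more for context

vectorizer = TfidfVectorizer(
    max_features=5000,     # cap the vocabulary size
    min_df=1,              # drop terms seen in fewer than min_df documents
    max_df=0.9,            # drop terms seen in more than 90% of documents
    stop_words='english',  # alternative to NLTK stopword removal
)
```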
Ready to Build Your Own?
Run the complete implementation in Google Colab or explore the source code on GitHub. Experiment with different topics and see how the chatbot performs!