Project Overview
This project demonstrates how to build a document-based question-answering chatbot that leverages the Wikipedia API as its knowledge source. When a user asks a question, the system fetches relevant Wikipedia articles, processes the text using natural language processing techniques, and returns the most relevant answer based on semantic similarity.
Unlike modern transformer-based models, this approach uses classical NLP methods including TF-IDF vectorization and cosine similarity, making it lightweight, interpretable, and ideal for understanding foundational information retrieval concepts.
System Architecture
The chatbot follows a multi-stage pipeline that transforms user queries into structured responses: fetch the relevant Wikipedia article, split and clean its text into sentences, vectorize the sentences and the question with TF-IDF, rank the sentences by cosine similarity to the question, and return the top-scoring sentences as the answer.
Core NLP Concepts
TF-IDF (Term Frequency-Inverse Document Frequency)
TF-IDF is a numerical statistic that reflects how important a word is to a document within a collection. It increases proportionally to the number of times a word appears in the document but is offset by the frequency of the word in the corpus, helping to adjust for common words like "the" or "is".
TF-IDF Formula
\[ \text{TF-IDF}(t, d) = \text{TF}(t, d) \times \log\left(\frac{N}{\text{DF}(t)}\right) \]
Where: TF(t, d) = frequency of term t in document d, N = total number of documents, DF(t) = number of documents containing t
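As a quick worked example (with invented numbers), suppose a term appears 3 times in a document and occurs in 2 of a corpus's 10 documents; using the natural logarithm:

\[ \text{TF-IDF} = 3 \times \log\left(\frac{10}{2}\right) \approx 3 \times 1.609 \approx 4.83 \]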
How TF-IDF Works
- Term Frequency (TF): Measures how often a term appears in a document. More occurrences suggest higher relevance.
- Inverse Document Frequency (IDF): Reduces the weight of terms that appear frequently across many documents (common words) and increases the weight of rare, more informative terms.
- Result: Words that are both frequent in a specific document and rare across the corpus receive the highest scores, as the short example below makes concrete.
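To see these weights in action, here is a minimal sketch using scikit-learn's TfidfVectorizer on a toy three-document corpus (the corpus is invented for illustration and is not part of the project):

```python
# Toy TF-IDF demonstration: words shared by every document ("live", "in")
# are down-weighted, while document-specific words score highest.
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "penguins live in antarctica",
    "camels live in deserts",
    "dolphins live in oceans",
]

vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(corpus)
terms = vectorizer.get_feature_names_out()

# Print the highest-weighted term for each document
for i, doc in enumerate(corpus):
    row = tfidf[i].toarray().ravel()
    print(f"{doc!r}: top term = {terms[row.argmax()]!r}")
```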
Cosine Similarity
Cosine similarity measures the cosine of the angle between two non-zero vectors in a multi-dimensional space. In NLP, these vectors represent documents or sentences after TF-IDF transformation. A cosine similarity of 1 means the vectors point in the same direction, 0 means they are orthogonal (unrelated), and -1 means they are diametrically opposed. Since TF-IDF weights are never negative, scores in this system always fall between 0 and 1.
Cosine Similarity Formula
\[ \text{similarity}(A, B) = \cos(\theta) = \frac{A \cdot B}{\|A\| \times \|B\|} = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^2} \times \sqrt{\sum_{i=1}^{n} B_i^2}} \]
Where: A and B are TF-IDF vectors, and the dot product measures their alignment
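A quick numeric check (with made-up vectors) mirrors the formula; note that a scaled copy of a vector still scores 1.0, which is exactly why cosine similarity is robust to sentence length differences:

```python
# Hand-rolled cosine similarity on small, invented vectors
import numpy as np

def cos_sim(u, v):
    """Cosine of the angle between u and v."""
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

A = np.array([1.0, 2.0, 0.0])
B = np.array([2.0, 4.0, 0.0])  # same direction as A, twice the length
C = np.array([0.0, 0.0, 3.0])  # orthogonal to A

print(cos_sim(A, B))  # 1.0 -> identical direction, magnitude ignored
print(cos_sim(A, C))  # 0.0 -> no shared components
```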
Implementation Steps
Step 1: Install Required Libraries
Set up the necessary packages for Wikipedia access, NLP processing, and vectorization.
```python
# Install dependencies
!pip install wikipedia-api nltk scikit-learn

# Import libraries
import wikipediaapi
import nltk
import requests  # used by the search fallback in Step 2
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

# Download NLTK resources
nltk.download('punkt')
nltk.download('punkt_tab')  # required by sent_tokenize on newer NLTK versions
nltk.download('stopwords')
```
Step 2: Fetch Wikipedia Content
Create functions to retrieve article content from Wikipedia using the API.
```python
def get_wikipedia_content(topic):
    """Fetch Wikipedia article text for a given topic."""
    wiki = wikipediaapi.Wikipedia(
        user_agent='KudosAI-Chatbot/1.0',
        language='en'
    )
    page = wiki.page(topic)
    if page.exists():
        return page.text
    # Fallback to search if exact match not found
    return search_wikipedia(topic)


def search_wikipedia(query):
    """Search Wikipedia for relevant articles via the MediaWiki search API
    (the wikipedia-api library itself does not expose a search endpoint)."""
    response = requests.get(
        'https://en.wikipedia.org/w/api.php',
        params={
            'action': 'query',
            'list': 'search',
            'srsearch': query,
            'srlimit': 3,
            'format': 'json',
        },
        headers={'User-Agent': 'KudosAI-Chatbot/1.0'},
    )
    results = response.json()['query']['search']
    if results:
        wiki = wikipediaapi.Wikipedia(
            user_agent='KudosAI-Chatbot/1.0',
            language='en'
        )
        # Fetch the top-ranked search result
        page = wiki.page(results[0]['title'])
        return page.text
    return None
```
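A quick usage check (the topic here is an arbitrary example):

```python
# Fetch an article and peek at the opening characters
text = get_wikipedia_content("Alan Turing")
print(text[:200] if text else "No article found")
```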
Step 3: Preprocess Text with NLTK
Clean and tokenize text by removing stopwords, punctuation, and converting to lowercase.
```python
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.corpus import stopwords
import string

def preprocess_text(text):
    """Clean and tokenize text into sentences."""
    # Split text into sentences
    sentences = sent_tokenize(text)

    # Build the stopword set once rather than per sentence
    stop_words = set(stopwords.words('english'))

    # Clean each sentence
    cleaned_sentences = []
    for sent in sentences:
        # Convert to lowercase
        sent = sent.lower()
        # Remove punctuation
        sent = sent.translate(str.maketrans('', '', string.punctuation))
        # Tokenize and remove stopwords
        words = word_tokenize(sent)
        words = [w for w in words if w not in stop_words]
        cleaned_sentences.append(' '.join(words))

    # Return both original and cleaned versions
    return sentences, cleaned_sentences
```
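A quick sanity check on a single made-up sentence shows the effect of the cleaning steps:

```python
# Illustrative input; stopwords and punctuation disappear in the cleaned copy
originals, cleaned = preprocess_text("The Eiffel Tower was completed in 1889.")
print(originals[0])  # The Eiffel Tower was completed in 1889.
print(cleaned[0])    # eiffel tower completed 1889
```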
Step 4: Build TF-IDF Vectorizer
Transform the document's sentences and the user's question into TF-IDF vectors, then rank the sentences by cosine similarity to the question.
```python
def find_best_answer(question, document_text):
    """Find the most relevant sentence(s) for a given question."""
    # Preprocess document and question
    original_sentences, cleaned_sentences = preprocess_text(document_text)
    _, cleaned_question = preprocess_text(question)
    cleaned_question = cleaned_question[0] if cleaned_question else question

    # Combine question with corpus for vectorization
    corpus = cleaned_sentences + [cleaned_question]

    # Create TF-IDF vectors
    vectorizer = TfidfVectorizer()
    tfidf_matrix = vectorizer.fit_transform(corpus)

    # Extract question vector and document vectors
    question_vector = tfidf_matrix[-1]
    sentence_vectors = tfidf_matrix[:-1]

    # Calculate cosine similarity
    similarities = cosine_similarity(question_vector, sentence_vectors).flatten()

    # Get top 3 most similar sentences
    top_indices = similarities.argsort()[-3:][::-1]

    # Return best matching sentences
    answers = []
    for idx in top_indices:
        if similarities[idx] > 0.1:  # Relevance threshold
            answers.append(original_sentences[idx])

    return ' '.join(answers) if answers else "I couldn't find a relevant answer."
```
Step 5: Create the Complete Chatbot
Integrate all components into a conversational chatbot class with context management.
```python
class WikipediaChatbot:
    def __init__(self):
        self.context = None
        self.current_topic = None

    def chat(self, user_input):
        """Main chatbot interface."""
        # Check if user wants to change topic
        if user_input.lower().startswith('tell me about'):
            topic = user_input[13:].strip()
            self.context = get_wikipedia_content(topic)
            self.current_topic = topic
            if self.context:
                return f"I found information about {topic}. What would you like to know?"
            return f"Sorry, I couldn't find information about {topic}."

        # Answer question using current context
        if self.context:
            return find_best_answer(user_input, self.context)
        return "Please tell me a topic first. Try: 'Tell me about [topic]'"


# Usage example
bot = WikipediaChatbot()
print(bot.chat("Tell me about artificial intelligence"))
print(bot.chat("What is machine learning?"))
print(bot.chat("Who are the pioneers in this field?"))
```
Key Features
What This Chatbot Does Well
- Fast and lightweight: No need for large pre-trained models or GPU
- Interpretable results: TF-IDF scores show exactly why answers were selected
- Domain flexible: Works for any topic available on Wikipedia
- Educational value: Demonstrates foundational IR and NLP concepts
Current Limitations
- Keyword-based matching: Doesn't understand semantic meaning or context
- No multi-hop reasoning: Can't answer complex questions requiring multiple documents
- Limited context awareness: Doesn't maintain conversation history beyond current topic
- Wikipedia dependency: Answer quality depends on article structure and completeness
Possible Improvements
- Semantic embeddings: Use sentence transformers (BERT, RoBERTa) for deeper understanding (see the sketch after this list)
- Named entity recognition: Better topic detection and entity linking
- Conversation memory: Track dialogue history for contextual follow-up questions
- Re-ranking: Apply a question-answering model like DistilBERT for final answer selection
- Multi-document retrieval: Aggregate information from multiple Wikipedia articles
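As a sketch of the first improvement, here is how the TF-IDF matcher might be swapped for dense embeddings using the sentence-transformers library; the model name 'all-MiniLM-L6-v2' and the function find_best_answer_semantic are illustrative choices, not part of the original project:

```python
# Hypothetical drop-in replacement for find_best_answer using dense
# sentence embeddings (assumes: pip install sentence-transformers)
from sentence_transformers import SentenceTransformer, util
from nltk.tokenize import sent_tokenize

# One small general-purpose model; any compatible model would work
model = SentenceTransformer('all-MiniLM-L6-v2')

def find_best_answer_semantic(question, document_text, top_k=3):
    """Rank sentences by embedding similarity rather than keyword overlap."""
    sentences = sent_tokenize(document_text)
    # Encode sentences and question into dense vectors
    sentence_embeddings = model.encode(sentences, convert_to_tensor=True)
    question_embedding = model.encode(question, convert_to_tensor=True)
    # Cosine similarity still ranks candidates, now in embedding space
    scores = util.cos_sim(question_embedding, sentence_embeddings)[0]
    top = scores.topk(k=min(top_k, len(sentences)))
    return ' '.join(sentences[i] for i in top.indices.tolist())
```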
Technical Insights
Why TF-IDF + Cosine Similarity Works
This combination is effective because:
- TF-IDF captures which words are distinctive to each sentence
- Cosine similarity measures overlap in distinctive terms between question and answers
- Questions and answers about the same topic naturally share important keywords
- The approach is robust to sentence length differences, because cosine similarity normalizes away vector magnitude
Parameter Tuning
Key parameters to experiment with (a configuration sketch follows the list):
- Similarity threshold: Currently 0.1 — increase to return only highly relevant answers
- Top-k sentences: Currently 3 — adjust based on desired answer length
- TF-IDF parameters: max_features, min_df, max_df control vocabulary size
- Preprocessing: Toggle stopword removal, stemming, or lemmatization
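The vectorizer-side knobs can be passed directly to TfidfVectorizer, while the threshold and top-k live as plain constants in find_best_answer; the values below are arbitrary starting points for experimentation, not tuned recommendations:

```python
# Possible tuning configuration; values are illustrative starting points
from sklearn.feature_extraction.text import TfidfVectorizer

SIMILARITY_THRESHOLD = 0.1  # raise (e.g. to 0.2) to return only strong matches
TOP_K_SENTENCES = 3         # fewer for terse answers, more for context

vectorizer = TfidfVectorizer(
    max_features=5000,     # cap the vocabulary size
    min_df=1,              # drop terms seen in fewer than min_df documents
    max_df=0.9,            # drop terms seen in more than 90% of documents
    stop_words='english',  # alternative to NLTK stopword removal
)
```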
Ready to Build Your Own?
Run the complete implementation in Google Colab or explore the source code on GitHub. Experiment with different topics and see how the chatbot performs!