AI-powered Document Search
Abstract
AI-powered Document Search is a Python project that searches and ranks documents by their relevance to a query. It represents text as TF-IDF vectors and scores matches with cosine similarity. The application features NLP-based ranking, basic error handling, and an interactive command-line interface (CLI), demonstrating core information retrieval and text processing techniques.
Prerequisites
- Python 3.8 or above
- A code editor or IDE
- Basic understanding of NLP and information retrieval
- Required libraries: scikit-learn, numpy, pandas
Before you Start
Install Python and the required libraries:
Install dependencies
pip install scikit-learn numpy pandas
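To confirm that the installation worked, a quick check along the following lines can help. This is a minimal sketch: the helper name `missing_packages` and the `REQUIRED` mapping are illustrative and not part of the project itself.

```python
import importlib.util

# Map each pip package name to the module name it installs.
REQUIRED = {"scikit-learn": "sklearn", "numpy": "numpy", "pandas": "pandas"}


def missing_packages(required=REQUIRED):
    """Return the pip names of packages whose modules cannot be found."""
    return [
        pip_name
        for pip_name, module_name in required.items()
        if importlib.util.find_spec(module_name) is None
    ]


if __name__ == "__main__":
    missing = missing_packages()
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All required packages are installed.")
```

If anything is reported as missing, rerun the `pip install` command in the same Python environment you use to run the script.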
Getting Started
Create a Project
- Create a folder named ai-powered-document-search.
- Open the folder in your code editor or IDE.
- Create a file named ai_powered_document_search.py.
- Copy the code below into your file.
Write the Code
AI-powered Document Search
"""
AI-powered Document Search
Features:
- Semantic document search
- NLP-based ranking
- Modular design
- CLI interface
- Error handling
"""
import sys

# scikit-learn is optional at import time; without it, search is disabled.
try:
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity
except ImportError:
    TfidfVectorizer = None
    cosine_similarity = None


class DocumentSearch:
    def __init__(self):
        self.vectorizer = TfidfVectorizer() if TfidfVectorizer else None
        self.documents = []
        self.vectors = None

    def add_documents(self, docs):
        self.documents.extend(docs)
        if self.vectorizer:
            # Refit on the full corpus so the weights reflect every document.
            self.vectors = self.vectorizer.fit_transform(self.documents)

    def search(self, query):
        if self.vectorizer and self.vectors is not None:
            query_vec = self.vectorizer.transform([query])
            scores = cosine_similarity(query_vec, self.vectors).flatten()
            # Sort (document, score) pairs from highest to lowest score.
            ranked = sorted(zip(self.documents, scores), key=lambda x: x[1], reverse=True)
            return ranked[:5]
        return []


class CLI:
    @staticmethod
    def run():
        print("AI-powered Document Search")
        searcher = DocumentSearch()
        if searcher.vectorizer is None:
            print("Warning: scikit-learn is not installed; search is disabled.")
        while True:
            cmd = input('> ')
            if cmd.startswith('add'):
                parts = cmd.split(maxsplit=1)
                if len(parts) < 2:
                    print("Usage: add <doc1|doc2|...>")
                    continue
                # Documents are separated by '|'; skip blank entries.
                docs = [d.strip() for d in parts[1].split('|') if d.strip()]
                if not docs:
                    print("Usage: add <doc1|doc2|...>")
                    continue
                searcher.add_documents(docs)
                print(f"Added {len(docs)} documents.")
            elif cmd.startswith('search'):
                parts = cmd.split(maxsplit=1)
                if len(parts) < 2:
                    print("Usage: search <query>")
                    continue
                query = parts[1]
                results = searcher.search(query)
                for doc, score in results:
                    print(f"Score: {score:.2f} | Doc: {doc}")
            elif cmd == 'exit':
                break
            else:
                print("Unknown command")


if __name__ == "__main__":
    try:
        CLI.run()
    except Exception as e:
        print(f"Error: {e}")
        sys.exit(1)
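One edge case is worth knowing about: scikit-learn's `TfidfVectorizer` raises a `ValueError` when the documents yield no usable terms, for example when they are blank or contain only single-character tokens under the default settings. In the script above, that exception reaches the top-level handler and ends the session. The sketch below shows one possible guard; the helper name `fit_or_none` is hypothetical and not part of the project.

```python
from sklearn.feature_extraction.text import TfidfVectorizer


def fit_or_none(docs):
    """Fit a TF-IDF model, returning (None, None) if no usable terms exist."""
    vectorizer = TfidfVectorizer()
    try:
        vectors = vectorizer.fit_transform(docs)
    except ValueError:
        # Raised when the vocabulary is empty, e.g. the documents are blank
        # or contain only single-character tokens.
        return None, None
    return vectorizer, vectors


if __name__ == "__main__":
    vectorizer, vectors = fit_or_none(["", "   "])
    if vectorizer is None:
        print("No searchable text found in the documents.")
```

A caller could use the `None` result to print a message and keep the command loop running instead of exiting.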
Example Usage
Run document search
python ai_powered_document_search.py
Explanation
Key Features
- Semantic Search: Ranks documents by the cosine similarity between the TF-IDF vectors of the query and of each document.
- Information Retrieval: Finds relevant documents based on queries.
- Error Handling: Validates inputs and manages exceptions.
- CLI Interface: Interactive command-line usage.
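To make the ranking step concrete, here is a minimal sketch of TF-IDF scoring with cosine similarity on a tiny corpus. The sample documents and query are made up for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Illustrative sample corpus (not part of the project).
docs = [
    "python is a programming language",
    "cats and dogs are popular pets",
    "machine learning with python",
]

vectorizer = TfidfVectorizer()
doc_vectors = vectorizer.fit_transform(docs)        # one row per document
query_vec = vectorizer.transform(["popular pets"])  # reuses the fitted vocabulary

# One similarity score per document; higher means more term overlap.
scores = cosine_similarity(query_vec, doc_vectors).flatten()
best = int(scores.argmax())

print("Best match:", docs[best])
for doc, score in zip(docs, scores):
    print(f"{score:.2f}  {doc}")
```

Because TF-IDF matches on shared terms, documents that have no word in common with the query score exactly zero, even if they are related in meaning.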
Code Breakdown
- Import Libraries and Setup Search
ai_powered_document_search.py
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np
- Document Search Function
ai_powered_document_search.py
def search_documents(docs, query):
    vectorizer = TfidfVectorizer()
    doc_vectors = vectorizer.fit_transform(docs)
    query_vec = vectorizer.transform([query])
    scores = cosine_similarity(query_vec, doc_vectors).flatten()
    ranked = np.argsort(scores)[::-1]
    return [(docs[i], scores[i]) for i in ranked[:5]]
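Note that the function above always returns up to five documents, including ones with a score of zero. If you prefer to hide non-matching documents, a variant along these lines is one option. It is a sketch only: the name `search_documents_nonzero` and the `top_k` parameter are assumptions added for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def search_documents_nonzero(docs, query, top_k=5):
    """Rank docs by TF-IDF cosine similarity, dropping zero-score entries."""
    vectorizer = TfidfVectorizer()
    doc_vectors = vectorizer.fit_transform(docs)
    query_vec = vectorizer.transform([query])
    scores = cosine_similarity(query_vec, doc_vectors).flatten()
    # Indices of the top_k highest scores, best first.
    ranked = np.argsort(scores)[::-1][:top_k]
    return [(docs[i], float(scores[i])) for i in ranked if scores[i] > 0]


if __name__ == "__main__":
    sample = [
        "the cat sat on the mat",
        "stock markets fell today",
        "a cat chased a mouse",
    ]
    for doc, score in search_documents_nonzero(sample, "cat"):
        print(f"Score: {score:.2f} | Doc: {doc}")
```

With this variant, a query whose words appear in no document returns an empty list rather than five zero-score entries.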
- CLI Interface and Error Handling
ai_powered_document_search.py
def main():
    print("AI-powered Document Search")
    # docs = [...] # Load documents (not shown for brevity)
    while True:
        cmd = input('> ')
        if cmd == 'search':
            query = input("Query: ")
            # results = search_documents(docs, query)
            print("[Demo] Search logic here.")
        elif cmd == 'exit':
            break
        else:
            print("Unknown command. Type 'search' or 'exit'.")

if __name__ == "__main__":
    main()
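The demo loop above leaves the search call commented out. One way to wire it in, and to make the command handling testable without keyboard input, is to move the parsing into a small function. The sketch below assumes a hypothetical `handle_command` helper and takes the search function as a parameter (for example, `search_documents` from the previous step); neither the helper nor this `main` signature comes from the original code.

```python
def handle_command(line, docs, search_fn):
    """Process one CLI line and return (lines_to_print, keep_running)."""
    command, _, argument = line.strip().partition(" ")
    argument = argument.strip()
    if command == "exit":
        return [], False
    if command == "search":
        if not argument:
            return ["Usage: search <query>"], True
        results = search_fn(docs, argument)
        return [f"Score: {score:.2f} | Doc: {doc}" for doc, score in results], True
    return ["Unknown command. Type 'search' or 'exit'."], True


def main(docs, search_fn):
    """Interactive loop; stops on 'exit', Ctrl+C, or end of input."""
    print("AI-powered Document Search")
    keep_running = True
    while keep_running:
        try:
            line = input("> ")
        except (EOFError, KeyboardInterrupt):
            break
        output, keep_running = handle_command(line, docs, search_fn)
        for text in output:
            print(text)
```

Calling `main(docs, search_documents)` with a list of strings starts the interactive session. Keeping `handle_command` free of `input()` and `print()` calls means it can be exercised directly in unit tests.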
Features
- AI-Based Document Search: Relevance ranking based on TF-IDF vectors and cosine similarity
- Modular Design: Search logic is kept separate from the command-line loop
- Error Handling: Manages invalid inputs and exceptions
- Extensible: Compact, maintainable code that is easy to build on
Next Steps
Enhance the project by:
- Integrating with real-world document datasets
- Supporting batch search
- Creating a GUI with Tkinter or a web app with Flask
- Adding evaluation metrics (precision, recall)
- Unit testing for reliability
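As a starting point for the evaluation step, precision and recall at a cutoff k can be computed from a ranked result list and a set of documents known to be relevant. The sketch below is one simple formulation; the function name `precision_recall_at_k` and the sample data are illustrative assumptions.

```python
def precision_recall_at_k(ranked_docs, relevant_docs, k=5):
    """Return (precision@k, recall@k) for a ranked list of documents."""
    top_k = ranked_docs[:k]
    relevant = set(relevant_docs)
    hits = sum(1 for doc in top_k if doc in relevant)
    # Precision: share of returned documents that are relevant.
    precision = hits / len(top_k) if top_k else 0.0
    # Recall: share of relevant documents that were returned.
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


if __name__ == "__main__":
    ranked = ["doc_a", "doc_b", "doc_c", "doc_d"]  # search output, best first
    relevant = ["doc_a", "doc_c", "doc_z"]         # hand-labelled ground truth
    precision, recall = precision_recall_at_k(ranked, relevant, k=2)
    print(f"precision@2 = {precision:.2f}, recall@2 = {recall:.2f}")
```

The ranked list can be built from the documents returned by the search, and the relevant set has to come from manual labelling or an existing benchmark.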
Educational Value
This project teaches:
- Information Retrieval: Semantic search and ranking
- Software Design: Modular, maintainable code
- Error Handling: Writing robust Python code
Real-World Applications
- Enterprise Search Tools
- Content Management
- Educational Tools
Conclusion
AI-powered Document Search demonstrates how to build a document search tool in Python using TF-IDF vectors and cosine similarity. Its modular design makes it straightforward to extend and adapt for real-world applications in enterprise, education, and more. For more advanced projects, visit Python Central Hub.