Introduction
Teams often struggle to find the right information buried across Confluence pages, Notion docs, wikis, or GitHub READMEs. Searching manually wastes time and disrupts focus.
An internal AI knowledge assistant solves this by indexing your organization’s content and allowing anyone to query it conversationally, without sending private data outside your environment.
In this guide, you’ll build a local retrieval-augmented generation (RAG) system that answers questions using your company’s documentation, all hosted within your secure infrastructure.
What You’ll Build
A private AI assistant that:
- Reads and indexes internal documents (Markdown, PDF, Notion, Confluence)
- Converts them into searchable vector embeddings
- Answers questions by retrieving relevant passages and generating context-aware responses
- Runs locally (no external API dependency if you choose a local LLM)
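
The retrieve-then-generate flow in the list above can be illustrated with a small, dependency-free sketch. It is not the pipeline built later in this guide: the bag-of-words `embed` function stands in for a real embedding model, and the function names and sample passages are hypothetical.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy 'embedding': a bag-of-words count vector (a real system uses a neural model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(question: str, passages: list[str], k: int = 2) -> list[str]:
    """Return the k passages most similar to the question."""
    q = embed(question)
    ranked = sorted(passages, key=lambda p: cosine(q, embed(p)), reverse=True)
    return ranked[:k]


def build_prompt(question: str, context: list[str]) -> str:
    """Assemble the text that would be sent to the LLM."""
    joined = "\n".join(f"- {c}" for c in context)
    return f"Answer using only this context:\n{joined}\n\nQuestion: {question}"


# Hypothetical sample passages standing in for indexed company docs
passages = [
    "Production deploys run through the release pipeline after code review.",
    "The VPN requires a hardware token and the corporate client.",
    "Expense reports are due on the fifth business day of each month.",
]
question = "How do I connect to the VPN?"
top = retrieve(question, passages, k=1)
prompt = build_prompt(question, top)
print(prompt)
```

In the real system, the embedding model, vector database, and LLM replace `embed`, the sorted list, and the final `print`, but the order of operations stays the same.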
Step 1: Set Up Your Environment
Prerequisites
- Python 3.9+
- Docker (optional, for local LLM or Qdrant)
- Basic understanding of APIs and environment variables
Install Required Packages
pip install langchain llama-index openai chromadb qdrant-client fastapi uvicorn sentence-transformers
If you prefer a local LLM (for example, using Ollama):
curl -fsSL https://ollama.com/install.sh | sh
ollama pull mistral
Step 2: Ingest and Clean Your Documents
Create a Python script ingest_docs.py that collects and cleans company documentation.
from langchain.document_loaders import DirectoryLoader, TextLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
# Load all Markdown files from the docs folder
loader = DirectoryLoader('./docs', glob='**/*.md', loader_cls=TextLoader)
documents = loader.load()
# Split long documents into chunks for embeddings
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
chunks = splitter.split_documents(documents)
print(f"Loaded {len(documents)} docs, split into {len(chunks)} chunks.")
You can extend this using other loaders from langchain.document_loaders:
- NotionDirectoryLoader for exported Notion pages
- ConfluenceLoader for corporate wiki integration
- PyPDFLoader for PDF reports (scanned PDFs need OCR first, since they contain no text layer)
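
To make the chunk_size and chunk_overlap settings used in ingest_docs.py concrete, here is a simplified fixed-window chunker. It is only an illustration: RecursiveCharacterTextSplitter prefers to break on separators such as paragraphs and lines, so its real boundaries will differ. The `chunk_text` helper is hypothetical.

```python
def chunk_text(text: str, chunk_size: int = 1000, chunk_overlap: int = 200) -> list[str]:
    """Split text into windows of chunk_size characters that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


# Small numbers make the overlap visible
text = "abcdefghijklmnopqrstuvwxyz"
chunks = chunk_text(text, chunk_size=10, chunk_overlap=4)
print(chunks)  # ['abcdefghij', 'ghijklmnop', 'mnopqrstuv', 'stuvwxyz']
```

Each chunk repeats the last four characters of the previous one, which is how overlap keeps a sentence that straddles a boundary retrievable from either side.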
Step 3: Generate Embeddings and Store Them
Use either SentenceTransformers locally or OpenAI embeddings for higher accuracy.
from langchain.embeddings import SentenceTransformerEmbeddings
from langchain.vectorstores import Qdrant
embedding_fn = SentenceTransformerEmbeddings(model_name="all-MiniLM-L6-v2")
# Connect to local Qdrant instance
qdrant = Qdrant.from_documents(
    documents=chunks,
    embedding=embedding_fn,
    url="http://localhost:6333",
    collection_name="company_knowledge"
)
print("Embeddings created and stored successfully.")
To run Qdrant locally (start it before running the script above):
docker run -p 6333:6333 qdrant/qdrant
Step 4: Query the Assistant Locally
Now you’ll connect the retrieval layer to an LLM using LangChain’s RetrievalQA.
from langchain.chains import RetrievalQA
from langchain.llms import Ollama
llm = Ollama(model="mistral") # local LLM via Ollama
retriever = qdrant.as_retriever(search_kwargs={"k": 3})
qa = RetrievalQA.from_chain_type(
    llm=llm,
    retriever=retriever,
    chain_type="stuff"
)
question = "How do we deploy our production environment?"
answer = qa.run(question)
print("Answer:", answer)
If you prefer to use OpenAI’s API instead (gpt-4-turbo is a chat model, so use the chat wrapper):
from langchain.chat_models import ChatOpenAI
llm = ChatOpenAI(model="gpt-4-turbo", temperature=0)
Step 5: Serve It Through a Chat Interface
Use FastAPI for a lightweight REST service or Streamlit for a chat UI.
FastAPI Example
from fastapi import FastAPI
from pydantic import BaseModel
# `qa` is the RetrievalQA chain from Step 4; build or import it in this module.
app = FastAPI()
class QueryRequest(BaseModel):
    question: str
@app.post("/ask")
def ask(request: QueryRequest):
    # Run the question through the retrieval chain and return the answer as JSON
    answer = qa.run(request.question)
    return {"answer": answer}
Run with:
uvicorn app:app --host 0.0.0.0 --port 8000
Now you can query it via:
curl -X POST "http://localhost:8000/ask" -H "Content-Type: application/json" -d '{"question":"What is our VPN setup?"}'
Step 6: Secure and Extend
- Authentication: Add token-based access control in FastAPI.
- Scheduling: Re-index docs nightly using a cron job.
- Versioning: Store embedding metadata (document path, hash, version).
- UI: Build a Streamlit or React chat front end on top of the /ask API.
- Caching: Implement Redis or SQLite caching for repeated queries.
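
For the scheduling and versioning items above, one possible approach is to hash each file and re-embed only what changed since the last run. The sketch below assumes a JSON manifest on disk. The `find_changed` helper and the manifest.json file name are hypothetical and not part of the original scripts.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def file_hash(path: Path) -> str:
    """SHA-256 of the file contents, used as a cheap version marker."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def find_changed(docs_dir: Path, manifest_path: Path) -> list[Path]:
    """Return Markdown files that are new or modified since the last run, then update the manifest."""
    old = json.loads(manifest_path.read_text()) if manifest_path.exists() else {}
    new = {str(p.relative_to(docs_dir)): file_hash(p) for p in sorted(docs_dir.rglob("*.md"))}
    changed = [docs_dir / rel for rel, digest in new.items() if old.get(rel) != digest]
    manifest_path.write_text(json.dumps(new, indent=2))
    return changed


# Demo in a temporary folder: the first run sees every file, the second only the edited one
with tempfile.TemporaryDirectory() as tmp:
    docs = Path(tmp) / "docs"
    docs.mkdir()
    (docs / "vpn.md").write_text("Use the corporate client.")
    (docs / "deploy.md").write_text("Deploy via the pipeline.")
    manifest = Path(tmp) / "manifest.json"

    first_run = [p.name for p in find_changed(docs, manifest)]
    (docs / "vpn.md").write_text("Use the corporate client and a token.")
    second_run = [p.name for p in find_changed(docs, manifest)]

print(first_run, second_run)
```

A nightly cron job could call a function like this and pass only the returned paths to the ingestion step. Deleted files would still need separate handling, which this sketch does not cover.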
Step 7: Keep It Private and Compliant
To ensure data privacy:
- Use local embedding models and local LLMs when possible.
- If using OpenAI or Anthropic APIs, redact sensitive content before sending.
- Log all queries and responses for transparency.
- Ensure compliance with internal data policies and GDPR requirements.
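
As an illustration of redacting content before it leaves your environment, here is a minimal regex-based sketch. The patterns are hypothetical examples, not a complete or policy-approved list, and simple regexes will miss many kinds of sensitive data.

```python
import re

# Hypothetical patterns; extend them to match what your data policy treats as sensitive.
PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "IPV4": re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"),
    "SECRET": re.compile(r"\b(?:api[_-]?key|token|password)\s*[:=]\s*\S+", re.IGNORECASE),
}


def redact(text: str) -> str:
    """Replace each match with a bracketed label such as [EMAIL]."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text


sample = "Contact ops@example.com, host 10.0.0.12, api_key=abc123"
clean = redact(sample)
print(clean)  # Contact [EMAIL], host [IPV4], [SECRET]
```

A function like this would run on the retrieved chunks and on the user question just before the call to an external API.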
Example Folder Structure
ai-knowledge-assistant/
│
├── docs/ # Company documentation
├── ingest_docs.py
├── app.py # FastAPI service
├── requirements.txt
├── vectorstore/ # Local Qdrant data
└── config.env # API keys, paths, etc.
Troubleshooting
| Problem | Cause | Fix |
|---|---|---|
| Embeddings take too long | Large docs or remote embedding model | Use local SentenceTransformer |
| Wrong or outdated answers | Old embeddings | Re-run ingest_docs.py regularly |
| Incomplete responses | Context window too small | Use a model with a larger context window or raise chunk_overlap to 300+ |
| Docker memory issues | Qdrant indexing large corpus | Increase Docker memory limit |
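
On the "Context window too small" row: with chain_type="stuff", every retrieved chunk is placed into a single prompt, so k chunks of chunk_size characters must all fit. The sketch below shows one way to cap the context before it reaches the model. It assumes a character budget as a rough stand-in for tokens, and the `pack_context` helper is hypothetical.

```python
def pack_context(chunks: list[str], max_chars: int) -> list[str]:
    """Keep chunks in ranked order until adding the next one would exceed the budget."""
    selected, used = [], 0
    for chunk in chunks:
        if used + len(chunk) > max_chars:
            break  # stop at the first chunk that does not fit
        selected.append(chunk)
        used += len(chunk)
    return selected


# Three ranked chunks of 400 characters each against a 1000-character budget
ranked = ["a" * 400, "b" * 400, "c" * 400]
kept = pack_context(ranked, max_chars=1000)
print(len(kept), sum(len(c) for c in kept))  # 2 800
```

Dropping the lowest-ranked chunks this way trades some recall for a prompt the model can read in full.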
Conclusion
By following this tutorial, you’ve built a private AI knowledge assistant that understands your company’s internal documents, runs locally, and can be extended to any department.
This setup cuts the time spent on repetitive searches and grows with your needs: you can connect more data sources, plug in a front-end chat UI, or fine-tune models on internal phrasing and acronyms.