Narayan
RAG Assistant for Medical Research.
Narayan is a production-ready Retrieval-Augmented Generation (RAG) system that lets you ask natural language questions over your document stacks and get answers with full source citations. Built for anyone dealing with large volumes of documents: medical researchers, legal professionals, compliance officers, engineers, or anyone else who lives in PDFs.
Upload documents, ask questions, get answers grounded in the text with exact page references.
Key Features
- Accurate Retrieval: ChromaDB-powered vector search with reranking to find the most relevant passages
- Source-Grounded Answers: Every answer includes citations with filenames, page numbers, and relevance scores
- Hallucination-Resistant: System prompt constraints force the LLM to use only the provided sources or admit when it doesn't know
- Document Deduplication: Content-hash based system prevents duplicate PDFs from inflating the knowledge base
- Scoped Queries: Search across all documents or limit to specific files
- Metadata Tracking: Every chunk carries document ID, filename, page number, and chunk index for full traceability
- Local Inference: Embeddings run locally using sentence-transformers; LLM calls configurable (Ollama, OpenRouter, or any OpenAI-compatible API)
- Privacy by Default: Your documents stay on your machine
What It Does
- Document Ingestion: Upload PDFs. The system extracts text page by page, respecting two-column layouts and paragraph boundaries.
- Chunking and Embedding: PDFs are split into overlapping chunks and converted to vectors using local embedding models.
- Vector Search: When you ask a question, it’s converted to a vector and matched against your stored documents using cosine similarity.
- Reranking: Top candidates are reranked to keep only the most relevant chunks.
- Answer Generation: The LLM receives the question plus the top sources and generates an answer with inline citations.
- Evidence Cards: The frontend shows expandable snippets of source text so you can verify answers yourself.
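The retrieval steps above can be sketched with toy vectors. This is illustrative only: the bag-of-words "embedding" stands in for the real sentence-transformers model, and the data is made up.

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    # Toy bag-of-words "embedding"; the real system uses sentence-transformers.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    # Cosine similarity between two sparse term-count vectors.
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

chunks = [
    {"text": "Mild side effects included headache and nausea.", "page": 7},
    {"text": "The study enrolled 250 patients over two years.", "page": 2},
]
question = "What side effects were observed?"

# Embed the question, rank chunks by similarity, keep the best match.
q_vec = embed(question)
ranked = sorted(chunks, key=lambda c: cosine(q_vec, embed(c["text"])), reverse=True)
top = ranked[0]  # carries page metadata for the citation
```

The real pipeline does the same thing at scale: ChromaDB performs the similarity search, and the winning chunks go to the LLM along with their metadata.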
Tech Stack
Backend
- FastAPI: REST API framework
- LangChain: RAG pipeline orchestration
- ChromaDB: Vector store (local, embedded)
- PyMuPDF (fitz): PDF text extraction respecting reading order
- sentence-transformers: Local embedding models (runs on CPU/GPU)
- Pydantic: Data validation
- OpenAI Python SDK: LLM API calls (compatible with Ollama, OpenRouter, and more)
Frontend
- React 19: UI framework
- Vite: Build tool and dev server
- Vanilla CSS: No CSS framework bloat
- Fetch API: Backend communication
Installation
Prerequisites
- Python 3.10+
- Node.js 18+ (for frontend)
- 4GB+ RAM recommended
- Optional: GPU for faster embeddings
Backend Setup
- Clone the repository:
git clone https://github.com/om-wani/narayan
cd narayan/backend
- Create a virtual environment:
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
- Install dependencies:
pip install -r requirements.txt
- Set up environment variables:
cp .env.example .env
Edit .env and fill in:
- OPENROUTER_API_KEY: Get from https://openrouter.ai/ (or leave empty if using local Ollama)
- CORS_ORIGINS: Frontend URL (default: http://localhost:5173)
- Other settings are optional
- Run the backend:
uvicorn app.main:app --reload
The API will be available at http://localhost:8000
Frontend Setup
- Navigate to the frontend directory:
cd narayan/frontend
- Install dependencies:
npm install
- Start the dev server:
npm run dev
The frontend will be available at http://localhost:5173
Using Ollama (Local LLM)
If you want to run the LLM locally without API calls:
- Install Ollama from https://ollama.ai
- Pull a model:
ollama pull llama2
# or: ollama pull neural-chat, mistral, etc.
- Start the Ollama server (it listens on port 11434 by default)
- Update your backend to point to Ollama:
# In config.py or via environment
OPENROUTER_API_KEY="" # Leave empty
LLM_MODEL="llama2" # Whatever model you pulled
# Update the OpenAI client base_url to http://localhost:11434/v1
API Reference
Documents
Upload a PDF
POST /api/documents/upload
Content-Type: multipart/form-data
file: <PDF file>
Response:
{
"doc_id": "medical_paper_a1b2c3d4",
"filename": "medical_paper.pdf",
"page_count": 18,
"chunk_count": 127,
"uploaded_at": "2026-01-20T10:30:45.123456+00:00"
}
List documents
GET /api/documents/
Response:
{
"documents": [
{
"doc_id": "medical_paper_a1b2c3d4",
"filename": "medical_paper.pdf",
"page_count": 18,
"chunk_count": 127
}
],
"total": 1
}
Delete a document
DELETE /api/documents/{doc_id}
Response:
{
"deleted": true,
"doc_id": "medical_paper_a1b2c3d4"
}
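Assuming the backend is running on localhost:8000, the document endpoints can be exercised from Python's standard library. This is a hedged sketch: the paths come from the reference above, but the helper names and usage are illustrative, not project code.

```python
import json
import urllib.request

BASE = "http://localhost:8000"

def list_documents_request() -> urllib.request.Request:
    # GET /api/documents/ returns {"documents": [...], "total": N}
    return urllib.request.Request(f"{BASE}/api/documents/", method="GET")

def delete_document_request(doc_id: str) -> urllib.request.Request:
    # DELETE /api/documents/{doc_id} returns {"deleted": true, "doc_id": ...}
    return urllib.request.Request(f"{BASE}/api/documents/{doc_id}", method="DELETE")

# To actually send a request (requires the backend to be running):
# with urllib.request.urlopen(list_documents_request()) as resp:
#     print(json.load(resp)["total"])
```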
Query
Ask a question
POST /api/query/
Content-Type: application/json
{
"question": "What are the side effects of the treatment?",
"doc_ids": null, // Optional: limit to specific documents
"top_k": null // Optional: number of chunks to retrieve (default: 5)
}
Response:
{
"answer": "The treatment showed mild side effects including headache and nausea in 12% of patients [Source 1]. More severe reactions were rare [Source 2].",
"sources": [
{
"filename": "medical_paper.pdf",
"page": 7,
"doc_id": "medical_paper_a1b2c3d4",
"score": 0.876,
"text": "Patient cohort (n=250) experienced mild adverse events including..."
},
{
"filename": "medical_paper.pdf",
"page": 9,
"doc_id": "medical_paper_a1b2c3d4",
"score": 0.812,
"text": "Severe adverse reactions occurred in less than 1% of cases..."
}
],
"model": "google/gemma-3-27b-it:free",
"tokens_used": 287
}
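A question can be posted the same way with only the standard library. Again a sketch under the same assumptions (endpoint path and request shape from the reference above; everything else illustrative):

```python
import json
import urllib.request

def build_query_request(question: str, doc_ids=None, top_k=None) -> urllib.request.Request:
    # POST /api/query/ with a JSON body; doc_ids and top_k are optional.
    payload = {"question": question, "doc_ids": doc_ids, "top_k": top_k}
    return urllib.request.Request(
        "http://localhost:8000/api/query/",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = build_query_request("What are the side effects of the treatment?", top_k=3)
# with urllib.request.urlopen(req) as resp:
#     result = json.load(resp)
#     print(result["answer"])
#     for src in result["sources"]:
#         print(f'{src["filename"]} p.{src["page"]} (score {src["score"]:.3f})')
```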
Configuration
All settings are in backend/app/core/config.py. Common configurations:
# Chunking (tuned for medical/technical documents)
CHUNK_SIZE = 1000 # Characters per chunk
CHUNK_OVERLAP = 200 # Overlap between chunks
# Retrieval
TOP_K = 5 # Initial chunks to retrieve
RERANK_TOP_K = 3 # Final chunks to use for generation
# Files
MAX_FILE_SIZE_MB = 50 # Max PDF size
VECTOR_STORE_PATH = "./data/chroma_db"
# LLM
LLM_MODEL = "google/gemma-3-27b-it:free" # Via OpenRouter
EMBEDDING_MODEL = "all-MiniLM-L6-v2" # Local, via sentence-transformers
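To see how CHUNK_SIZE and CHUNK_OVERLAP interact, here is a deliberately naive sliding-window splitter. The real pipeline uses LangChain's RecursiveCharacterTextSplitter, which additionally respects paragraph boundaries; this sketch only shows the size/overlap arithmetic.

```python
def naive_chunks(text: str, size: int = 1000, overlap: int = 200) -> list[str]:
    # Fixed-width sliding window: each chunk starts (size - overlap)
    # characters after the previous one, so neighbors share `overlap` chars.
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

doc = "x" * 2500
parts = naive_chunks(doc)
# Chunks start at 0, 800, 1600; each is at most 1000 characters long
# and shares its last 200 characters with the next chunk's first 200.
```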
How It Works
Document Processing
- Text Extraction: PyMuPDF extracts text block by block, respecting the reading order of multi-column documents
- Deduplication: Document content is hashed; duplicate uploads are detected and skipped
- Chunking: Text is split using RecursiveCharacterTextSplitter with paragraph-aware boundaries
- Embedding: Each chunk is converted to a vector using a local sentence-transformer model
- Storage: Vectors and metadata are stored in ChromaDB with stable chunk IDs
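The deduplication step boils down to hashing raw file content before ingestion. A minimal sketch of the idea (hashlib is real; the `ingest` function, its in-memory `seen_hashes` set, and the sample bytes are illustrative stand-ins for the project's actual storage):

```python
import hashlib

seen_hashes: set[str] = set()

def ingest(pdf_bytes: bytes, filename: str) -> bool:
    # Hash the raw file content; byte-identical re-uploads produce the
    # same digest regardless of filename, so duplicates are skipped.
    digest = hashlib.sha256(pdf_bytes).hexdigest()
    if digest in seen_hashes:
        return False  # duplicate: skip extraction/chunking/embedding
    seen_hashes.add(digest)
    # ... extract text, chunk, embed, store in ChromaDB (omitted) ...
    return True

ingest(b"%PDF-1.7 sample", "paper.pdf")       # ingested
ingest(b"%PDF-1.7 sample", "paper_copy.pdf")  # same bytes: skipped
```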
Query Processing
- Question Embedding: Your question is converted to a vector using the same embedding model
- Retrieval: ChromaDB returns the top-K most similar chunks using cosine similarity
- Reranking: Chunks are sorted by similarity score (currently naive; cross-encoder upgrades planned)
- Context Formatting: Top chunks are formatted with source metadata and sent to the LLM
- Generation: The LLM generates an answer using only the provided sources
- Response: Answer, sources, and metadata are returned to the frontend
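The context-formatting step numbers each chunk and attaches its metadata, so the LLM can cite "[Source N]" inline. The exact template below is illustrative, not the project's actual prompt layout:

```python
def format_context(chunks: list[dict]) -> str:
    # Number each chunk so the LLM can refer back to it as [Source N];
    # filename and page make the citation verifiable by the reader.
    blocks = []
    for i, c in enumerate(chunks, start=1):
        blocks.append(f'[Source {i}] {c["filename"]}, p.{c["page"]}:\n{c["text"]}')
    return "\n\n".join(blocks)

context = format_context([
    {"filename": "medical_paper.pdf", "page": 7, "text": "Mild adverse events..."},
    {"filename": "medical_paper.pdf", "page": 9, "text": "Severe reactions were rare..."},
])
```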
Hallucination Prevention
Three mechanisms prevent incorrect answers:
- System Prompt: Explicitly forbids making up information; forces the LLM to use only provided sources
- Constrained Context: Only top-3 chunks are shown to the LLM (not the entire knowledge base)
- Transparency: Every answer includes source metadata so users can verify the evidence themselves
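The first two mechanisms can be sketched as a message builder: constraints go in the system role, and only the pre-formatted top chunks reach the user message. The prompt wording here is a hypothetical example, not the project's actual system prompt:

```python
SYSTEM_PROMPT = (
    "Answer strictly from the numbered sources provided. "
    "Cite each claim as [Source N]. "
    "If the sources do not contain the answer, say you don't know; "
    "never invent information."
)

def build_messages(question: str, context: str) -> list[dict]:
    # OpenAI-style chat messages: constraints live in the system role,
    # and only the retrieved sources (not the whole knowledge base)
    # are placed before the question in the user role.
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": f"Sources:\n{context}\n\nQuestion: {question}"},
    ]

msgs = build_messages("What are the side effects?", "[Source 1] medical_paper.pdf, p.7: ...")
```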
Development
Project Structure
narayan/
├── backend/
│ ├── app/
│ │ ├── api/ # API endpoints
│ │ │ ├── documents.py # Upload, list, delete
│ │ │ └── query.py # Question answering
│ │ ├── services/ # Business logic
│ │ │ ├── ingestion.py # PDF -> chunks -> vectors
│ │ │ └── rag.py # RAG pipeline
│ │ ├── core/ # Configuration
│ │ │ ├── config.py # Settings
│ │ │ └── vector_store.py # ChromaDB setup
│ │ ├── models/ # Data schemas
│ │ │ └── schemas.py # Pydantic models
│ │ └── main.py # FastAPI app
│ ├── requirements.txt # Python dependencies
│ └── .env.example # Environment template
├── frontend/
│ ├── src/ # React components
│ ├── public/ # Static assets
│ ├── package.json # Node dependencies
│ └── vite.config.js # Build config
└── README.md
Running Tests
Currently no test suite, but the system is validated through:
- Manual API testing via curl/Postman
- Frontend UI testing in the browser
- Real document ingestion and query workflows
Adding Custom Embedding Models
Edit backend/app/core/vector_store.py:
from langchain_huggingface import HuggingFaceEmbeddings
embeddings = HuggingFaceEmbeddings(
model_name="your-model-name", # e.g., "nomic-embed-text"
model_kwargs={"trust_remote_code": True}
)
Switching LLM Providers
The system uses OpenAI-compatible APIs. To switch providers:
- OpenAI:
client = OpenAI(api_key="sk-...", base_url="https://api.openai.com/v1")
- Claude (via OpenRouter):
client = OpenAI(api_key="openrouter_key", base_url="https://openrouter.ai/api/v1")
LLM_MODEL = "anthropic/claude-3.5-sonnet"
- Local Ollama:
client = OpenAI(api_key="any_string", base_url="http://localhost:11434/v1")
LLM_MODEL = "llama2" # or whatever model you pulled
Roadmap
Phase 2 features in development:
- Cross-Encoder Reranking: Better relevance for complex queries
- OCR Support: Scanned PDFs and images with text
- Admin Panel: Bulk operations, document organization, analytics
- Query Analytics: Track what questions are asked and where they fail
- Multi-Format Support: HTML, Markdown, Word docs, spreadsheets
- Domain-Specific Embeddings: Separate models trained on medical, legal, compliance documents
- Document Organization: Tagging, categories, folder-like structure
- Batch Query API: Process multiple questions efficiently
Troubleshooting
Issue: “ModuleNotFoundError: No module named ‘app’”
Make sure you’re in the backend/ directory before running the server:
cd backend
uvicorn app.main:app --reload
Issue: “CORS error when frontend calls backend”
Check that CORS_ORIGINS in .env includes your frontend URL:
CORS_ORIGINS=http://localhost:5173,http://localhost:3000
Issue: “No readable text found in PDF”
The PDF is likely a scanned image without an OCR text layer. The current version requires text-based PDFs; OCR support is planned for Phase 2.
Issue: “LLM takes too long or times out”
If using OpenRouter, check your API quota. If using local Ollama, reduce CHUNK_SIZE or TOP_K in config to speed up processing.
Issue: “Embeddings are slow”
- Use a smaller embedding model (e.g., all-MiniLM-L6-v2 instead of larger variants)
- If you have a GPU, make sure PyTorch is using it: check torch.cuda.is_available()
Performance Notes
- First Ingestion: First PDF takes longer as the embedding model and vector store initialize
- Subsequent Uploads: Typically 1-2 seconds per document depending on size
- Query Latency:
- Retrieval: 100-200ms (vector search in ChromaDB)
- LLM Generation: 1-5 seconds (depends on model and context length)
- Total: ~2-6 seconds per question
Privacy and Data
- All document embeddings are stored locally in ./data/chroma_db
- No documents are sent to external services unless you use an external LLM provider
- Using local Ollama means zero data leaves your machine
Contributing
Found a bug? Have an idea? Feel free to open an issue or submit a pull request.
License
MIT License. See LICENSE file for details.
Questions?
- Read the write-up: https://omwani.pages.dev/writings/on-building-narayan/
- Open an issue on GitHub
- Check the API reference above
Built by Om Wani under Upperture Interactive. Get in touch: omwani03@gmail.com