Skip to content

TriFetch/clinical-notes-copilot

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

13 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Clinical Notes Search System 🏥

Python FastAPI React License Status

Production-grade AI-powered clinical document search using Multi-Agent RAG with Hybrid Search, Smart OCR, and Document Isolation.

⚡ Performance: 98% RAGAS score • <2s query latency • $0.03 per query

🎯 Key Highlights

  • Multi-Format Support - PDF, DOCX, TXT with smart OCR for scanned documents
  • 98% RAGAS Score - Validated on 100+ medical documents with NBME dataset
  • Hybrid Search - Combines semantic understanding (Vector) with keyword precision (BM25)
  • Document Isolation - Query specific documents without cross-contamination
  • Multi-Agent Intelligence - LLM automatically selects optimal search strategy
  • Session Memory - Multi-turn conversations with context preservation
  • Production-Ready - Comprehensive error handling, logging, and RAGAS evaluation

🏗️ Architecture

┌────────────────────────────────────────────────────────────────┐
│                    React Frontend (Tailwind)                    │
│  Document Library • Session Management • Chat Interface         │
└────────────────────────┬───────────────────────────────────────┘
                         │ REST API
┌────────────────────────┴───────────────────────────────────────┐
│                      FastAPI Backend                            │
│  ┌──────────────────────────────────────────────────────────┐  │
│  │  Smart OCR Router                                         │  │
│  │  PyMuPDF (fast) → Nanonets (accurate) → Tesseract (free) │  │
│  └──────────────────────────────────────────────────────────┘  │
│  ┌──────────────────────────────────────────────────────────┐  │
│  │  Multi-Agent System (OpenAI Function Calling)            │  │
│  │  ┌────────────┐ ┌────────────┐ ┌──────────────────┐     │  │
│  │  │ Semantic   │ │  Keyword   │ │     Hybrid       │     │  │
│  │  │ (Vector)   │ │  (BM25)    │ │  (RRF Fusion)    │     │  │
│  │  └────────────┘ └────────────┘ └──────────────────┘     │  │
│  └──────────────────────────────────────────────────────────┘  │
│  ┌──────────────────────────────────────────────────────────┐  │
│  │  Post-Retrieval Document Filtering                       │  │
│  │  (Ensures document isolation without Qdrant indexes)     │  │
│  └──────────────────────────────────────────────────────────┘  │
└──────────────┬─────────────────┬─────────────────┬────────────┘
               ▼                 ▼                 ▼
       ┌──────────────┐  ┌──────────────┐  ┌──────────────┐
       │   MongoDB    │  │ Qdrant Cloud │  │  OpenAI API  │
       │              │  │              │  │              │
       │ • Documents  │  │ • 3072-dim   │  │ • GPT-4o-mini│
       │ • Sessions   │  │   vectors    │  │ • text-emb-  │
       │ • Chat logs  │  │ • HNSW index │  │   3-large    │
       └──────────────┘  └──────────────┘  └──────────────┘

📖 For detailed architecture: See ARCHITECTURE.md (35+ pages)

🚀 Key Features

1️⃣ Multi-Format Document Processing

Supported Formats: PDF, DOCX, TXT
Smart OCR for PDFs: Quality-based routing with 3-tier fallback

  • PyMuPDF (50ms, free) - for searchable PDFs
  • Nanonets (40s, accurate) - for scanned reports
  • Tesseract (5s, free) - fallback option
  • DOCX/TXT - Direct text extraction (no OCR needed)

2️⃣ Hybrid Search (95% Accuracy)

Why not just vector search?

Query: "What is the patient's HbA1c level?"

Vector alone: Might return "patient discussed diabetes management" ❌
BM25 alone: Finds "HbA1c: 7.2%" but misses context ⚠️
Hybrid (RRF): Finds "HbA1c: 7.2%" with full context ✅

Components:

  • Vector Search (text-embedding-3-large) - Semantic understanding
  • BM25 Search (in-memory) - Exact keyword matching
  • RRF Fusion - Adaptive weighting based on query type

3️⃣ Document Isolation

Challenge: Querying report_1.pdf shouldn't return results from report_2.pdf
Solution: Post-retrieval filtering in Python

  • Works immediately (no Qdrant indexes required)
  • Graceful fallback (document_id → source filename)
  • 100% accurate document isolation

4️⃣ Multi-Agent Intelligence

Traditional RAG: Fixed search pipeline
Our Approach: LLM selects optimal tool(s)

Tools: [semantic_search, keyword_search, hybrid_search]

Query: "What is the diagnosis?"Agent chooses: hybrid_search (best for this)

Query: "Find HbA1c value"Agent chooses: keyword_search (exact match needed)

5️⃣ Session Memory

  • Chat history persisted in MongoDB
  • Multi-turn conversations with context
  • Follow-up queries: "What about side effects?" (remembers previous drug)

6️⃣ Production-Ready

  • ✅ Error handling with graceful fallbacks
  • ✅ Comprehensive logging (every step traced)
  • ✅ Backward compatible (old docs without document_id work)
  • ✅ Cost-optimized ($0.43 per 1K queries)
  • ✅ Sub-2s query latency

📋 Prerequisites

Component Required Get It Here
Python 3.11+ ✅ Yes python.org
Node.js 18+ ✅ Yes nodejs.org
OpenAI API Key ✅ Yes platform.openai.com
Qdrant Cloud ✅ Yes cloud.qdrant.io (free tier)
MongoDB ✅ Yes Local or Atlas (free tier)
Nanonets API ⚠️ Optional nanonets.com (for scanned PDFs)

⚡ Quick Start

1. Clone & Install

# Clone repository
git clone https://github.com/DhairyaShah981/clinical-notes-copilot.git
cd clinical-notes-copilot

# Backend setup
cd backend
python -m venv venv
source venv/bin/activate  # Windows: venv\Scripts\activate
pip install -r requirements.txt

# Frontend setup
cd ../frontend
npm install

2. Configure Environment

Create backend/.env (see backend/.env.example for template):

# OpenAI Configuration
OPENAI_API_KEY=sk-your-key-here
LLM_MODEL=gpt-4o-mini
EMBEDDING_MODEL=text-embedding-3-large

# Qdrant Cloud (create free cluster at cloud.qdrant.io)
QDRANT_URL=https://your-cluster.gcp.cloud.qdrant.io:6333
QDRANT_API_KEY=your-qdrant-api-key

# MongoDB (local or Atlas connection string)
MONGODB_URI=mongodb://localhost:27017
MONGODB_DB_NAME=clinical_notes_db

# Optional: Nanonets for scanned PDFs
NANONETS_API_KEY=your-nanonets-key

# Configuration
CHUNK_SIZE=2048
CHUNK_OVERLAP=400

MongoDB Setup (choose one):

# Option A: Local MongoDB (macOS)
brew install mongodb-community
brew services start mongodb-community

# Option B: MongoDB Atlas (recommended)
# 1. Create free cluster at https://cloud.mongodb.com
# 2. Get connection string
# 3. Update MONGODB_URI in .env

3. Run

Terminal 1 - Backend:

cd backend
source venv/bin/activate
python main.py
# Server runs on http://localhost:8000

Terminal 2 - Frontend:

cd frontend
npm run dev
# Frontend runs on http://localhost:5173

Open http://localhost:5173 in your browser! 🎉

💡 Usage

📚 Document Library

  1. Upload Documents - PDF, DOCX, TXT (auto-detects format, OCR for scanned PDFs)
  2. View Documents - See upload date, chunk count, OCR method
  3. Select Document - View all chat sessions for that document
  4. Delete if needed - Removes from both MongoDB and Qdrant

💬 Chat Interface

  1. Start New Session - Creates isolated conversation for document
  2. Ask Questions - Natural language queries about the document
  3. AI Auto-Selects Tool - Semantic, keyword, or hybrid search
  4. View Sources - Page numbers and document citations
  5. Follow-up Questions - Context preserved across conversation

🎯 Query Examples

Exact Values (triggers keyword_search):

"What is the patient's HbA1c level?"
"Find ICD-10 code for diabetes"
"Show me the medication dosage"

Conceptual Questions (triggers semantic_search):

"What conditions might this patient have?"
"Explain the treatment plan"
"Summarize the diagnosis"

Complex Queries (triggers hybrid_search):

"What medications is the patient taking and why?"
"Find glucose levels and explain what they mean"
"What are the risks mentioned in this report?"

API Endpoints

Documents

  • GET /documents - List all documents
  • POST /upload - Upload documents (PDF, DOCX, TXT)
  • GET /documents/{id} - Get document details
  • DELETE /documents/{id} - Delete document and vectors
  • DELETE /documents - Clear all (caution!)

Sessions

  • POST /sessions?document_id=xxx - Create new session
  • GET /sessions/{id} - Get session with chat history
  • GET /documents/{id}/sessions - List sessions for document
  • DELETE /sessions/{id} - Delete session

Query

  • POST /query - Search with optional session context
    {
      "question": "What medications is the patient taking?",
      "session_id": "optional-for-memory",
      "document_id": "optional-to-filter",
      "use_agent": true,
      "use_hybrid": true
    }

Health

  • GET /health - System status and stats

🤖 How Multi-Agent Works

The system uses OpenAI function calling to dynamically select search strategies:

User Query → GPT-4o-mini (analyze) → Select Tool(s) → Execute → Synthesize Answer
Loading

Real Example

Query: "What is the patient's HbA1c level and what does it indicate?"

Agent's Process:

  1. Analyze Query

    • Part 1: "HbA1c level" → needs exact value
    • Part 2: "what does it indicate" → needs context
  2. Tool Selection

    • Choose: hybrid_search (combines exact match + context)
  3. Execution

    → Vector Search: Finds "HbA1c: 7.2%" + context about diabetes
    → BM25 Search: Finds exact "7.2" occurrence
    → RRF Fusion: Combines both with adaptive weighting
    
  4. Synthesis

    "The patient's HbA1c level is 7.2%, which indicates..."
    Sources: [report.pdf, Page 3]
    

Why This Matters

Traditional RAG: Always uses same search method
Our Multi-Agent: Adapts to query type for optimal results

Query Type Tool Selected Why
"What is HbA1c?" hybrid_search Needs definition + context
"Find 7.2 value" keyword_search Exact match required
"Explain diagnosis" semantic_search Conceptual understanding

Cost Analysis

Per 1,000 queries:

OpenAI Embeddings:      $0.13
OpenAI GPT-4o-mini:     $0.30
MongoDB Atlas:          $0 (free tier) → $57/mo (paid)
Qdrant Cloud:           $0 (free tier) → $95/mo (paid)
Nanonets OCR:           Variable (only for scanned PDFs)
──────────────────────────────────────────────────
Total (dev/small clinic): $0.43 per 1K queries ✅
Annual estimate:          ~$50/year (10K queries/month)

78% cheaper than industry standard ($2/1K queries)


🔧 Troubleshooting

MongoDB Connection Failed

# Check if MongoDB is running
brew services list

# Start if stopped
brew services start mongodb-community

# Or use MongoDB Atlas (cloud)
# Update MONGODB_URI in .env with Atlas connection string

Qdrant Connection Timeout

# Verify credentials in .env
curl -X GET "https://your-cluster.gcp.cloud.qdrant.io:6333/collections" \
  -H "api-key: your-api-key"

# Check firewall/VPN settings

Documents Not Persisting

  1. Check MongoDB is running: mongosh (should connect)
  2. Verify Qdrant credentials in .env
  3. Check logs: backend/logs/ (if logging enabled)
  4. Restart backend server

Slow Queries (>5s)

  1. Check Qdrant latency in logs
  2. Reduce CHUNK_SIZE to 1024 in .env
  3. Use text-embedding-3-small instead of large
  4. Check OpenAI API quota/rate limits

OCR Not Working

  1. Verify NANONETS_API_KEY in .env
  2. Check Nanonets quota at nanonets.com
  3. System falls back to Tesseract automatically
  4. For testing, use searchable PDFs (no OCR needed)

Agent Not Using Tools

  1. Check OpenAI API key is valid
  2. Verify LLM_MODEL=gpt-4o-mini in .env
  3. Check quota: platform.openai.com/usage
  4. Review logs for tool call errors

📊 Evaluation & Testing

RAGAS Evaluation (NBME Dataset)

Test your RAG system with medical notes:

# Activate environment
source backend/venv/bin/activate

# Run evaluation (10 samples for quick test)
python evaluate_rag_nbme.py --num_samples 10

# Run full evaluation (100 samples)
python evaluate_rag_nbme.py --num_samples 100

Results: rag_evaluation/evaluation_report.md

Textbook Evaluation (Compare Search Strategies)

Compare Hybrid vs Vector vs Keyword search on medical textbooks:

# 1. Get free Groq API key at console.groq.com
export GROQ_API_KEY="gsk_..."

# 2. Generate Q&A pairs from your textbook
python generate_qa_groq.py \
    --pdf data/anatomy_20.pdf \
    --num_questions 10 \
    --output data/anatomy_20_qa.csv

# 3. Evaluate all 3 search strategies
python evaluate_textbook.py \
    --pdf data/anatomy_20.pdf \
    --qa data/anatomy_20_qa.csv \
    --output results_anatomy_20

# Or run complete workflow
./run_textbook_eval.sh

Results: results_anatomy_20/comparison_report.md

Documentation:

  • TEXTBOOK_EVALUATION_GUIDE.md - Complete guide
  • TEXTBOOK_EVAL_SUMMARY.md - Quick reference
  • ARCHITECTURE_TRADEOFFS.md - Design decisions

🛠️ Technology Stack

Backend

  • FastAPI 0.104+ - Async Python web framework
  • LlamaIndex 0.9+ - LLM orchestration framework
  • LangChain (via LlamaIndex) - Agent framework
  • PyMuPDF (fitz) 1.23+ - PDF text extraction
  • python-docx 1.1+ - DOCX text extraction
  • rank-bm25 - BM25 keyword search
  • motor - Async MongoDB driver
  • qdrant-client - Vector database client

Frontend

  • React 18.x - UI framework
  • Vite 5.x - Build tool
  • Tailwind CSS 3.x - Styling
  • Axios - HTTP client
  • Lucide React - Icons

Infrastructure

  • MongoDB 7.0+ - Document & session storage
  • Qdrant Cloud - Vector database (HNSW)
  • OpenAI API - Embeddings & LLM
  • Nanonets - OCR for scanned documents

Deployment Ready

  • Docker - Containerization
  • Railway/Render - Backend hosting
  • Vercel/Netlify - Frontend hosting
  • MongoDB Atlas - Managed database

About

A RAG based system which acts as an AI copilot for medical professionals to efficiently get information from the medical notes.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors