Nafis Neehal

nafisneehal95@gmail.com

About Me

As a PhD candidate in Computer Science at Rensselaer Polytechnic Institute, I have been solving complex challenges in healthcare, one of the most data-intensive industries, using advanced AI and machine learning techniques. With over 7 years of experience in applied machine learning, deep learning, and data science, I specialize in developing, fine-tuning, and deploying large-scale models that drive real-world impact.

Through industrial research collaborations, I currently bridge cutting-edge AI research with practical healthcare solutions. In collaboration with IBM Research, I lead the development of next-generation LLM frameworks for clinical trial automation, while my work with CDPHP focused on deploying ML systems that have processed more than 22M patient records to improve healthcare delivery and risk prediction. My expertise in large-scale data processing, ML system optimization, and trustworthy AI architectures is domain-agnostic and directly transferable across industries.

My research has been recognized at leading venues including ACM RecSys, AMIA, and the Society for Clinical Trials, demonstrating the successful translation of academic innovation to industrial applications. Previously, as a lecturer at Daffodil International University, I established the institution’s first AI Research Lab, mentoring future AI researchers and engineers.

Research Focus

My research lies at the intersection of AI and Healthcare, with expertise in developing trustworthy and efficient systems. My work spans:

LLM Development & Evaluation: Leading development of specialized clinical trial LLMs through quantized fine-tuning of Llama models on 65k+ trials, while engineering novel evaluation frameworks with hallucination-adjusted metrics for GPT-4/LLaMA-70B. Created comprehensive benchmarking infrastructure (CTBench) and implemented RAG architectures with few-shot learning, achieving an 18-percentage-point improvement in F1 score on clinical trial feature generation tasks.
ML Systems for Healthcare: Engineered large-scale healthcare ML systems processing 22.5M+ patient records with 87-dimensional features. Implemented deep autoencoders for efficient patient matching (35% faster, 40% memory reduction) and hybrid clustering algorithms for treatment effect analysis in 350K+ patient cohorts, and achieved a 200x efficiency gain through PCA-based optimization for imbalanced healthcare data.
Health Recommender Systems: Developed fairness-aware patient matching frameworks that improve treatment effect estimation accuracy by 4 percentage points, incorporating dual-adjustment pipelines for demographic alignment (96% improvement) and multi-stage survival analysis for outcome tracking. Engineered cost-efficient trial recruitment strategies demonstrating a 49% reduction in expenses while maintaining equity.
MLOps & System Architecture: Architected end-to-end ML pipelines using AWS SageMaker, MLflow, and Docker, optimizing large-scale data processing with PySpark implementations that achieve 60% faster processing. Developed distributed computing solutions with vector database integrations for enhanced retrieval systems, focusing on scalability and production-ready deployments.

Technical Expertise

Languages & DB: Python, SQL, R, C++, Neo4j, Google Firestore (NoSQL), MySQL, SQLite
ML/DL/Causal: PyTorch, DDP, TensorFlow, Scikit-learn, DeepSpeed, AutoML, OpenCV, SHAP, EconML, DoWhy
MLOps Stack: MLflow, Docker, CI/CD, ChromaDB, Hopsworks, PySpark, W&B, AWS (SageMaker, Lambda, EC2)
LLM Frameworks: LangChain, LlamaIndex, HF Transformers, Axolotl, Unsloth, Autotrain, Comet, PromptHub
Data Visualization and Others: Streamlit, Gradio, R-Shiny, Tableau
Specialization in LLMs:
- Prompt Engineering (Zero/Few Shot)
- Fine-tuning (PEFT)
- End-to-end RAG Pipeline (Embedding, Ingestion, Indexing, Storing, Query Engines)
- Quantization
- Benchmarking
- GraphRAG
- Trustworthiness Evaluation
- Deployment (WebUI + Cloud Serving)

Beyond Research

Beyond research, I am passionate about developing cool AI applications. I enjoy reading, with particular interests in human history and international politics, exploring how past events and current global dynamics shape our world. I love to travel, and I love mountains and trains (who doesn’t, right?). I’m an avid consumer of sci-fi movies and TV series, always wondering what the future holds for us. I recently developed an interest in archery. I like to have interesting discussions with friends and strangers on religion, God, existence, life, and philosophy. I also like to play Blackjack, Codenames, and Poker with friends.

News

Nov 26, 2024	Releasing Cerebro 1.0 - Open-Source Fast AI Paper Search. Check out the latest release on GitHub 🚀
Nov 20, 2024	Paper on LLM Hallucination Detection/Mitigation in Clinical Trial Design was accepted at IEEE BigData 2024 (Trustworthy ML4H) 🎉
Nov 13, 2024	RecSys’24 (@HealthRecSys) Paper out now. [Link]
Oct 05, 2024	Joining BanglaLLM - developing LLMs to improve reasoning in the Bengali language. [HuggingFace]
Jun 25, 2024	New paper released on LLMs in Clinical Trial Design. [Link]

Selected Publications

RecSys

Design and Assessment of Representative Hybrid Clinical Trials using Health Recommender System

Nafis Neehal, Vibha Anand, and Kristin P Bennett

2024


Incorporating real-world data (RWD) into clinical trials can enhance trial efficiency, diversity, and generalizability. This paper introduces the Framework for Research in Synthetic Control Arms (FRESCA), which uses a novel Recommender System combined with Equity Adjustment strategies to design and evaluate Representative Hybrid Clinical Trials (HCTs). FRESCA employs a novel matching algorithm through its recommendation system to select suitable patients from RWD while ensuring that the selected population is representative of the target demographic. This dual approach improves both patient selection and trial outcomes by balancing statistical appropriateness and equity. Simulations based on data from two existing randomized clinical trials (RCTs) show that using FRESCA to recommend patients from RWD and apply equity adjustments enhances internal validity and generalizability. Our analysis indicates that combining matching and equity adjustments yields more accurate treatment effect estimates and fair population representation, even with reduced RCT control group sizes. In contrast, using either method alone may result in biased outcomes. The flexibility of FRESCA to simulate various HCT scenarios makes it a valuable tool for advancing equitable and efficient clinical trial designs.

arXiv

CTBench: A Comprehensive Benchmark for Evaluating Language Model Capabilities in Clinical Trial Design

Nafis Neehal, Bowen Wang, Shayom Debopadhaya, and 4 more authors

arXiv preprint arXiv:2406.17888, 2024


We introduce CTBench, a benchmark to assess language models (LMs) in aiding clinical study design. Given metadata specific to a study, CTBench examines how well AI models can determine the baseline features of the clinical trial (CT) which include demographic and relevant features collected at the start of the trial from all participants. The baseline features, typically presented in CT publications (often as Table 1), are crucial for characterizing study cohorts and validating results. Baseline features, including confounders and covariates, are also required for accurate treatment effect estimation in studies involving observational data. CTBench consists of two datasets: “CT-Repo”, containing baseline features from 1,690 clinical trials sourced from clinicaltrials.gov, and “CT-Pub”, a subset of 100 trials with more comprehensive baseline features gathered from relevant publications. We develop two LM-based evaluation methods for evaluating the actual baseline feature lists against LM-generated responses. “ListMatch-LM” and “ListMatch-BERT” use GPT-4o and BERT scores (at various thresholds), respectively, to perform the evaluation. To establish baseline results, we apply advanced prompt engineering techniques using LLaMa3-70B-Instruct and GPT-4o in zero-shot and three-shot learning settings to generate potential baseline features. We validate the performance of GPT-4o as an evaluator through human-in-the-loop evaluations on the CT-Pub dataset, where clinical experts confirm matches between actual and LM-generated features. Our results highlight a promising direction with significant potential for improvement, positioning CTBench as a useful tool for advancing research on AI in CT design and potentially enhancing the efficacy and robustness of CTs.
@article{neehal2024ctbench,
  title = {CTBench: A Comprehensive Benchmark for Evaluating Language Model Capabilities in Clinical Trial Design},
  author = {Neehal, Nafis and Wang, Bowen and Debopadhaya, Shayom and Dan, Soham and Murugesan, Keerthiram and Anand, Vibha and Bennett, Kristin P},
  journal = {arXiv preprint arXiv:2406.17888},
  year = {2024},
}

Book

Book-“Machine Learning Algorithm”, 2018 (Bangla)

Nafis Neehal

Page-132, 2018
