Hippocrates:
An Open-Source Framework for Advancing
Large Language Models in Healthcare

1Koç University, KUIS AI Center, 2Koç University, Department of Computer Engineering, 3Hacettepe University, Department of Computer Engineering, 4Yıldız Technical University, Department of Computer Engineering, 5Robert College

Abstract

The integration of Large Language Models (LLMs) into healthcare promises to transform medical diagnostics, research, and patient care. Yet, the progression of medical LLMs faces obstacles such as complex training requirements, rigorous evaluation demands, and the dominance of proprietary models that restrict academic exploration. Transparent, comprehensive access to LLM resources is essential for advancing the field, fostering reproducibility, and encouraging innovation in healthcare AI. We present Hippocrates, an open-source LLM framework specifically developed for the medical domain. In stark contrast to previous efforts, it offers unrestricted access to its training datasets, codebase, checkpoints, and evaluation protocols. This open approach is designed to stimulate collaborative research, allowing the community to build upon, refine, and rigorously evaluate medical LLMs within a transparent ecosystem. We also introduce Hippo, a family of 7B models tailored for the medical domain, fine-tuned from Mistral and LLaMA2 through continual pre-training, instruction tuning, and reinforcement learning from human and AI feedback. Our models outperform existing open medical LLMs by a large margin, even surpassing models with 70B parameters. Through Hippocrates, we aspire to unlock the full potential of LLMs not just to advance medical knowledge and patient care but also to democratize the benefits of AI research in healthcare, making them available across the globe.

Hippocrates


Our release includes:

  • Hippocrates, a comprehensive and open-source framework tailored for the medical domain.
  • We provide openly available datasets and establish an intuitive benchmark using the LM-Evaluation-Harness tool (see the usage sketch after this list).
  • We also introduce Hippo-λ and Hippo-μ, two 7B models demonstrating superior performance.
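As a sketch of how the benchmark can be reproduced with the LM-Evaluation-Harness, the snippet below evaluates a model on three medical tasks via the harness's Python API. The checkpoint path is a placeholder and the task names are assumptions that may differ across harness versions and from our exact evaluation configuration:

        # Hypothetical evaluation call using lm-evaluation-harness (v0.4+ API).
        import lm_eval

        results = lm_eval.simple_evaluate(
            model="hf",
            model_args="pretrained=meta-llama/Llama-2-7b-hf",  # any HF checkpoint
            tasks=["medmcqa", "medqa_4options", "pubmedqa"],   # task names assumed
            num_fewshot=0,
        )
        print(results["results"])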

We argue that the development of a broad, varied collection of open models is crucial for deepening our knowledge of language models and enhancing their applicability across various domains. Our work makes substantial contributions to the field by combining in-depth empirical research with a structured training methodology, offering invaluable insights and tools for future research not only in healthcare but in any area requiring domain-specific adaptation of LLMs.

Hippocrates Framework


  • Approach. Our training strategy includes several phases: injection of medical knowledge through continued pre-training, domain-specific instruction tuning, and reinforcement learning from AI-generated feedback for improved alignment with medical experts. Employing the LLaMA-Factory framework, we adhere to replicable and high-performance training standards. Moreover, we adopt the Low-Rank Adaptation (LoRA) technique for training efficiency. LoRA adapts LLMs by freezing the pre-trained weights and training only small low-rank update matrices added to selected layers, thereby accelerating the training process, minimizing memory usage, and mitigating overfitting and catastrophic forgetting. Our foundational models, LLaMA2 7B and Mistral 7B, are selected based on their robust performance across medical benchmarks, demonstrating their capacity to excel without extensive training modifications.
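    Below is a minimal LoRA setup sketch using the peft library; the rank, scaling factor, and target modules are illustrative assumptions, not our exact training hyperparameters:

        from transformers import AutoModelForCausalLM
        from peft import LoraConfig, get_peft_model

        base = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1")
        lora_cfg = LoraConfig(
            r=16,                                 # low-rank dimension (assumed)
            lora_alpha=32,                        # scaling factor (assumed)
            lora_dropout=0.05,
            target_modules=["q_proj", "v_proj"],  # attention projections (assumed)
            task_type="CAUSAL_LM",
        )
        model = get_peft_model(base, lora_cfg)
        model.print_trainable_parameters()  # only the adapter weights are trainable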

  • Continued Pre-training. A key aspect of our methodology is the integration of specialized medical knowledge through an extensive pre-training corpus, assembled from three specialized datasets: Medical Guidelines, PMC-Patients, and PubMedQA-contexts. This stage employs traditional language modeling, focusing on next-token prediction. We systematically assessed the impact of each dataset, both individually and in combination, to optimize our model's performance.
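    Concretely, this stage minimizes the standard causal language modeling loss over the combined medical corpus D, where x_<t denotes the tokens preceding position t:

        \mathcal{L}_{\mathrm{CP}}(\theta) = -\sum_{x \in \mathcal{D}} \sum_{t=1}^{|x|} \log p_{\theta}\left(x_t \mid x_{<t}\right)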

  • Supervised Fine-tuning. After continued pre-training, models undergo fine-tuning with an Instruction Tuning (IT) dataset to closely mirror medical directives, aligning model outputs with clinical requirements. We experimented with several instruction datasets and found that the MedQA-train IT dataset outperforms the other options (an illustrative record format is sketched below).
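    For illustration, a single instruction-tuning record might look as follows; the field names and prompt template are assumptions in the spirit of common instruction formats, not our exact schema:

        # Hypothetical MedQA-style instruction-tuning record.
        example = {
            "instruction": "Answer the following multiple-choice medical question.",
            "input": ("A 45-year-old man presents with crushing chest pain... "
                      "(A) GERD (B) Myocardial infarction (C) Costochondritis"),
            "output": "(B) Myocardial infarction",
        }

        def format_prompt(ex):
            # During SFT, the loss is typically computed only on response tokens.
            return (f"### Instruction:\n{ex['instruction']}\n\n"
                    f"### Input:\n{ex['input']}\n\n"
                    f"### Response:\n{ex['output']}")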

  • Medical Preference Data and RLAIF. Building a preference dataset typically involves using LLMs to generate varied responses to queries and having human annotators review them for accuracy, a process that is costly in both computation and manual review. We therefore turned to the iCliniq-10k dataset, which contains 10,000 real patient-doctor dialogues, each consisting of a patient question paired with three responses: one from a real doctor and two from AI models. After preprocessing this dataset to remove irrelevant details, we applied the RLAIF methodology to the medical domain for cost-effective annotation, using GPT4 guided by modified medical assessment guidelines. The entire annotation process cost only $120.
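    A resulting preference record, in the format commonly consumed by DPO trainers, might look like the sketch below; the field names are assumptions:

        # Hypothetical preference pair derived from an iCliniq-10k dialogue,
        # with the ranking produced by the GPT4 annotator.
        preference_example = {
            "question": "I have had a persistent dry cough for three weeks...",
            "chosen":   "The response ranked highest by the GPT4 annotator.",
            "rejected": "An alternative response ranked lower by the annotator.",
        }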

  • Medical Preference Learning. Finally, the instruction-tuned models are further trained with a recent and popular technique called Direct Preference Optimization (DPO). DPO bypasses reinforcement learning, allowing direct optimization on preference data, and, unlike RLHF, the responses in DPO need not be derived from the LLM being optimized. Central to DPO is a loss function that evaluates the likelihood of a preferred response over a less preferred one, steering the LLM towards this goal. This makes DPO more stable and significantly reduces computational demands. Our medical LLMs underwent further refinement with these clinical preferences through DPO training.
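    For reference, the DPO objective of Rafailov et al. (2023) is shown below, where y_w and y_l are the preferred and dispreferred responses, π_ref is the frozen instruction-tuned reference model, σ is the logistic function, and β controls how far the policy π_θ may deviate from the reference:

        \mathcal{L}_{\mathrm{DPO}}(\pi_{\theta}; \pi_{\mathrm{ref}}) =
            -\,\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}
            \left[ \log \sigma\!\left(
                \beta \log \frac{\pi_{\theta}(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
              - \beta \log \frac{\pi_{\theta}(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
            \right) \right]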

    The outcome of this process is our family of medical LLMs, Hippo-λ and Hippo-μ, built upon the pre-trained LLaMA2 7B and Mistral 7B models. These models were refined through a comprehensive process that included continued pre-training and/or instruction tuning using our carefully curated medical datasets. Following this, we also explored the impact of aligning the models with clinical preferences by conducting further training on medical preference data.

Contribution of Each Training Stage


  • Hippo-λ. Our evaluation methodology for the LLaMA2 7B model covers successive training stages: Continued Pre-training (CP), Instruction Tuning via Supervised Fine-Tuning (SFT), and Direct Preference Optimization (DPO). As listed in the table above, the base LLaMA2 7B model initially achieves an average accuracy of 34.0 across benchmarks. The CP stage marginally increases accuracy to 34.4, indicating initial benefits from domain-focused continued pre-training. The subsequent introduction of SFT yields a substantial performance boost to an average accuracy of 50.3, demonstrating the critical role of customized instruction in enhancing the model's capabilities in understanding and answering medical queries. Integrating CP with SFT further improves this performance to 53.0, highlighting the combined value of domain knowledge and specific instruction tuning. The final DPO stage slightly decreases the model's performance to 52.5, albeit with a slight increase in accuracy on MedMCQA and PubMedQA, illustrating DPO's refined impact on model preference alignment. This sequence delineates the incremental enhancements attributable to each training phase, with SFT marking a pivotal improvement. The composite model, LLaMA2 + CP + SFT, is thus designated as Hippo-λ for its distinguished performance across our benchmarks.

  • Hippo-μ. Following the approach for Hippo-λ, the training evolution for the Mistral 7B model reveals gradual improvement in the model's proficiency in medical question-answering. Initial results from the baseline Mistral 7B model, as shown in the table, show an average benchmark accuracy of 39.3. Implementing CP slightly improves this to 41.0, reflecting the positive yet modest impact of domain-specific continued pre-training. The pivotal SFT stage significantly raises the performance, achieving an average accuracy of 61.6, emphasizing the critical role of customized instruction in enhancing the model's interpretative and response capabilities for medical inquiries. Interestingly, combining CP and SFT results in a slight reduction to 61.1, suggesting a complex interaction between domain pre-training and instruction tuning. The subsequent application of DPO slightly lowers the overall score to 59.6, similar to the pattern observed for Hippo-λ, with targeted performance adjustments. Based on this comprehensive analysis, Mistral 7B + SFT is selected to represent Hippo-μ, credited for its exceptional performance across all benchmarks.

Uncertainty Quantification


  • In our study, we conducted an uncertainty quantification experiment on Hippo-λ to understand its performance on the MedMCQA, MedQA, and PubMedQA datasets, as shown in the figure above. Our findings reveal that our model consistently assigns higher probabilities to questions it answers correctly across all datasets, suggesting an ability to self-calibrate its certainty. The model's confidence is notably higher on MedMCQA, possibly reflecting the dataset's relative simplicity. In contrast, its confidence on PubMedQA is comparatively lower, likely due to the dataset's complexity. Additionally, the model's confidence changes with different training stages: CP leads to more conservative estimates, SFT boosts confidence, and adding DPO leads to variable confidence, with noticeable effects on MedMCQA and MedQA. These outcomes emphasize a complex relationship between training approaches and confidence calibration in the model.
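    A minimal sketch of how such per-question confidence can be estimated for a multiple-choice item: score each option letter with the model and normalize with a softmax. The checkpoint path and prompt format are placeholders, and the letter-token lookup is approximate since tokenizers differ:

        # Hypothetical confidence estimation over multiple-choice options.
        import torch
        from transformers import AutoModelForCausalLM, AutoTokenizer

        tok = AutoTokenizer.from_pretrained("path/to/hippo-checkpoint")
        model = AutoModelForCausalLM.from_pretrained("path/to/hippo-checkpoint")

        prompt = "Question: ...\n(A) ... (B) ... (C) ... (D) ...\nAnswer: ("
        options = ["A", "B", "C", "D"]

        with torch.no_grad():
            logits = model(**tok(prompt, return_tensors="pt")).logits[0, -1]
        # Token ids of the bare option letters (approximate).
        option_ids = [tok(o, add_special_tokens=False).input_ids[-1] for o in options]
        probs = torch.softmax(logits[option_ids], dim=-1)  # confidence per option
        print(dict(zip(options, probs.tolist())))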

Limitations and Safety

  • Model Limitations. While our 7B model has achieved state-of-the-art results within its class, it is important to acknowledge its limitations compared to larger models such as OpenAI's GPT4. The constraints imposed by the smaller parameter count may impede the model's reasoning capabilities, a crucial aspect of complex medical decision-making. Additionally, the model still answers nearly half of the benchmark questions incorrectly on average, which highlights a large area for improvement for open-source models.

  • Safety and Risks. Despite these advancements, it is crucial to highlight that these AI models need substantial improvements before they can be safely and effectively employed with real patients. They are not yet at a stage where they can provide medical advice or be utilized in commercial healthcare applications. This limitation underscores the need for ongoing, careful development and validation of AI systems to guarantee their reliability and safety in clinical settings. The path toward AI integration in patient care is still unfolding, and while it holds promise, it requires a methodical and thoroughly evaluated approach.

BibTeX

Please cite our paper if you use our models, data, code, or results:


        @misc{acikgoz2024hippocrates,
          title={Hippocrates: An Open-Source Framework for Advancing Large Language Models in Healthcare}, 
          author={Emre Can Acikgoz and Osman Batur İnce and Rayene Bench and Arda Anıl Boz and İlker Kesen and Aykut Erdem and Erkut Erdem},
          year={2024},
          eprint={2404.16621},
          archivePrefix={arXiv},
          primaryClass={cs.LG}
        }
      

Acknowledgements

This work is supported in part by the KUIS AI Center. The numerical calculations reported in this paper were fully/partially performed at TUBITAK ULAKBIM, High Performance and Grid Computing Center (TRUBA resources). Last but not least, we also acknowledge VSB – Technical University of Ostrava, IT4Innovations National Supercomputing Center, Czech Republic, for awarding this project access to the LUMI supercomputer, owned by the EuroHPC Joint Undertaking and hosted by CSC (Finland) and the LUMI consortium, through the Ministry of Education, Youth and Sports of the Czech Republic via e-INFRA CZ (grant ID: 90254).