Society for Clinical Vascular Surgery




Back to 2025 Display Posters


Exploring The Use And Potential Of Customized Large Language Models In Vascular Surgery: A Pilot Study Involving The VESAP Board Exam
Peter Vien, Elie Donath, MD, Joshua Le, Aryan Naik, Sean Liebscher, MD, Benjamin Carriveau, Daniel Bertges, MD.
University of Vermont, Burlington, VT, USA.

Objectives
The advent of generative artificial intelligence and, in particular, large language models (LLMs) holds tremendous potential for medical applications, though little is known about the reliability of their outputs. We evaluated the performance of several commercially available, closed-source LLMs, as well as models customized with vascular surgery-related content, in answering VESAP5 questions.

Methods
The open-source LLaMA3 model was customized using retrieval-augmented generation (RAG) with specialty-specific resources: Audible Bleeding transcripts, the DeBakey YouTube series, and Rutherford's. We compared these three customized LLMs to their baseline zero-shot (ZS) counterpart and to two commercially available LLMs (ChatGPT and Claude). Evaluation metrics included VESAP accuracy (n=680) and similarity measures, including ROUGE and BERT scores, which assess the fidelity of each LLM's rationale to the original answer. RAG-based customization, data analysis, and model prompting were conducted using Python libraries (LangChain and ChromaDB).

Results
Overall, LLM accuracy ranged from 60.6% to 71.9%. Among the three customized LLaMA3 models, RAG customization improved accuracy relative to the ZS version by an average of 8.3% (Table 1), with significant improvements in the Audible Bleeding and Rutherford's variants. RAG augmentation also allowed LLaMA3, a relatively small 70-billion-parameter model, to perform comparably to ChatGPT: there was no statistically significant difference in accuracy between the RAG Audible Bleeding or Rutherford's models and ChatGPT, whose parameter count is estimated to be 10-100x larger than LLaMA3's. Conversely, RAG augmentation minimally improved and, in most cases, lowered LLaMA3's semantic and verbatim similarity, as indicated by the BERT and ROUGE scores, respectively.

Conclusions
Relatively small, customized, open-source LLMs perform similarly to flagship, large-scale models based on unverified source material. They show great promise for clinical and educational vascular surgery applications. Further work is needed to determine the optimal customization parameters that may produce a trusted LLM for vascular specialists.
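The RAG customization described above retrieves specialty-specific passages relevant to a question and prepends them to the prompt before the model answers. As an illustrative sketch only (not the authors' LangChain/ChromaDB pipeline), the retrieval-and-prompt step can be shown in pure Python using a toy bag-of-words cosine similarity in place of a real embedding model; the corpus snippets below are hypothetical:

```python
import math
from collections import Counter

def embed(text):
    # Toy bag-of-words "embedding"; a real RAG pipeline would use
    # a sentence-embedding model and a vector store such as ChromaDB.
    return Counter(text.lower().split())

def cosine(a, b):
    num = sum(a[t] * b[t] for t in a)
    den = (math.sqrt(sum(v * v for v in a.values()))
           * math.sqrt(sum(v * v for v in b.values())))
    return num / den if den else 0.0

def retrieve(query, corpus, k=2):
    # Rank corpus passages by similarity to the query; keep the top k.
    q = embed(query)
    ranked = sorted(corpus, key=lambda doc: cosine(q, embed(doc)), reverse=True)
    return ranked[:k]

def build_prompt(question, corpus, k=2):
    # Augment the question with retrieved context before prompting the LLM.
    context = "\n".join(retrieve(question, corpus, k))
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"

# Hypothetical specialty-specific corpus for illustration.
corpus = [
    "carotid endarterectomy reduces stroke risk in symptomatic stenosis",
    "endovascular aneurysm repair uses stent grafts delivered via the femoral artery",
    "board exam logistics and registration deadlines",
]
prompt = build_prompt("Which procedure reduces stroke risk in carotid stenosis?", corpus, k=1)
```

The same pattern underlies the study's three customized models: only the retrieval corpus (Audible Bleeding transcripts, DeBakey videos, or Rutherford's) changes, not the base model weights.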

| Metric | ChatGPT 4o (commercial, non-customized) | Claude (commercial, non-customized) | LLaMA3 (open-source, non-customized) | Audible Bleeding (open-source, RAG) | Rutherford (open-source, RAG) | DeBakey (open-source, RAG) |
|---|---|---|---|---|---|---|
| VESAP accuracy (% correct) | 71.9%* (489/680) | 65.3% (444/680) | 60.6% (412/680) | 71.9%* (489/680) | 71.2%* (484/680) | 63.5% (432/680) |
| Similarity: BERT-F1 score | 0.613 | 0.619 | 0.614 | 0.613 | 0.606 | 0.599* |
| Similarity: ROUGE-L score | 0.196* | 0.200* | 0.183 | 0.185 | 0.175* | 0.171* |

Table 1. Accuracy and similarity of the baseline open-source, customized open-source, ChatGPT 4o, and Claude large language models. * indicates statistical significance compared with the zero-shot LLaMA3 model.
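The ROUGE-L scores in Table 1 measure verbatim similarity via the longest common subsequence (LCS) of tokens between a model's rationale and the reference explanation. A minimal pure-Python sketch of the F1 form of the metric (illustrative only; the study presumably used an established scoring library):

```python
def lcs_len(a, b):
    # Classic dynamic-programming longest-common-subsequence length.
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            dp[i + 1][j + 1] = dp[i][j] + 1 if x == y else max(dp[i][j + 1], dp[i + 1][j])
    return dp[-1][-1]

def rouge_l_f1(candidate, reference):
    # ROUGE-L F1: harmonic mean of LCS-based precision and recall
    # over whitespace tokens (a simplification of library tokenization).
    c, r = candidate.lower().split(), reference.lower().split()
    lcs = lcs_len(c, r)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(c), lcs / len(r)
    return 2 * precision * recall / (precision + recall)
```

BERT scores, by contrast, compare contextual embeddings rather than exact tokens, which is why the two metrics can diverge for the same model output.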

