Session O-3C

What's Going on in Biomedical Research? How LLMs Can Augment the Bench to Bedside Translation

3:30 PM to 5:10 PM | MGH 242 | Moderated by John Kang


Generative Large Language Models to Extract Topics and Trends in Research Funded by the National Cancer Institute in Radiation Oncology
Presenters
  • Helena Zheng, Senior, Computer Science
  • Camie Sawa, Sophomore, Applied Mathematics, Computer Science
  • Pranav Alaparthi
Mentor
  • John Kang, Radiation Oncology
Session
  • MGH 242
  • 3:30 PM to 5:10 PM

Generative Large Language Models to Extract Topics and Trends in Research Funded by the National Cancer Institute in Radiation Oncologyclose

Investigators, funders, and the public desire knowledge on topics and trends in research funded by the federal government. Current efforts to categorize efforts are limited to manual categorization and naming of a few dozen grants at a time. We developed an automated pipeline within BERTopic (a topic modeling and representation technique) to extract and name research topics and applied this to $1.9B of NCI funding over 21 years in the radiological sciences to determine micro- and macro-scale research topics and funding trends. In our prior work by Nguyen et al., we used Word2Vec-based embeddings to represent grants, hierarchical/K-means clustering to group them, and iterative topic naming by humans to label them. Our current study builds on this with updated embedding, clustering, and generative-AI-driven naming methods. We mapped out 9202 grant abstracts from 2000-2020 using PubMedBERT-base embeddings, then clustered them into 60 clusters with HDBScan, and visualized them in two dimensions using UMAP to aid in interpretation. We employed a chaining strategy comparing c-TF-IDF and topic distributions to reduce cluster outliers. The resultant clusters were named via OpenAI's GPT-3.5 model. We used prompt tuning methods (role prompting, directive commanding) through three reinforcement phases to generate topic labels based on the most representative documents of each cluster. The three largest topics in descending order are related to PET/CT imaging, tumor cell imaging, and breast cancer computer-aided detection. We believe these results may (1) demonstrate the feasibility of using topic modeling to help funders and the public understand funding patterns in the field of radiation oncology (2) provide updated clustering and representation methodology which increases accuracy and decreases reliance on manual human validation.


AI-Driven Extraction of Research Topics and Trends from NCI Funding Across Departments (2000-2023)
Presenters
  • Ikshita Ravishankar Sathanur, Senior, Computer Science
  • Kevin Lee, Senior, Geography: Data Science
Mentor
  • John Kang, Radiation Oncology
Session
  • MGH 242
  • 3:30 PM to 5:10 PM

AI-Driven Extraction of Research Topics and Trends from NCI Funding Across Departments (2000-2023)close

Cancer remains the second leading cause of death in the United States, with nearly 1.9 million new cases and over 609,000 deaths annually. Research funded by the National Cancer Institute (NCI) plays a key role in advancing cancer treatments, diagnostics, and understanding. This study analyzes 24 years of NCI grant data to uncover funding trends and their broader implications. Using NIH RePORTER, we filtered and analyzed $36.07B of grants from 2000 to 2023. Leveraging BERTopic, a topic modeling algorithm, we clustered grant abstracts based on semantic similarities to identify major research themes. OpenAI’s GPT-4o-mini model was then used to generate topic labels. Our findings reveal key shifts in funding allocation. Total NCI funding has significantly increased since 2000, with notable growth in areas like Epigenetic Modifications in Cancer and P53 Pathways in HCC Liver Cancer, while topics such as Ethics of Cancer Research and Signal Transduction Pathways have seen less emphasis over time. Additionally, emerging areas like Natural Care Approaches for Cancer Patients exhibit high annual growth, reflecting new focuses in patient care. These insights enhance transparency in research funding, informing stakeholders about emerging therapies and underfunded research areas. This work highlights the link between funding and patient outcomes, demonstrating how NCI initiatives drive innovation in cancer care. By presenting trends, we aim to support equitable resource distribution, improve transparency, and enhance knowledge to guide future funding decisions. 


Investigating Synaptic Function and Age-Related Cognitive Decline in Middle-Aged Mice treated with Intraperitoneal GHK-Cu
Presenter
  • Kavneet Thoohan, Senior, Biology (Physiology)
Mentors
  • Warren Ladiges, Comparative Medicine
  • Jordan Mazzola, Comparative Medicine
Session
  • MGH 242
  • 3:30 PM to 5:10 PM

Investigating Synaptic Function and Age-Related Cognitive Decline in Middle-Aged Mice treated with Intraperitoneal GHK-Cuclose

Age-related cognitive decline (ARCD) is very common and increases the risk for severe neurodegenerative conditions such as Alzheimer's disease. Treatment of ARCD can delay and lead to the cure of age-related diseases, but there is a lack of clinically proven drugs. One option is the naturally occurring peptide GHK (glycyl-L-histidyl-L-lysine), which readily forms a complex with copper (II). GHK is a key ingredient in anti-aging skin creams and regulates astrocytes through TGF-β and the SMAD pathway. As synaptic signaling decreases with age, this study investigates GHK-Cu's impact on synaptic function in middle-aged mice as a potential treatment for ARCD. Male and female C57BL/6 mice aged 20-22 months were treated with either the GHK-Cu peptide or saline as a control through intraperitoneal (IP) injection for five days. A spatial navigation learning task, the Box Maze, was utilized to analyze cognitive function by assessing the memory and learning of the mice on their last day of treatment. After the brain tissue samples were processed, synaptic function was assessed by performing immunohistochemistry (IHC) with Synaptophysin and PSD95 antibodies as molecular markers of pre- and post-synaptic integrity. The tissue slides were rehydrated, incubated with the antibodies overnight, and stained. After, the presence of antibodies was seen through microscopic examination and photographed for QuPath image analysis. Preliminary results of the Box Maze behavioral assay reveal the treated mice had increased cognitive function, memory, and learning capacity, which signals alleviated symptoms of ARCD. It is predicted that this increased resilience to ARCD will also be observed in the brain through the increased presence of Synaptophysin and PSD95 antibodies in the treated tissues compared to the control cohort. These results will show that short-term treatment of the GHK-Cu peptide will improve cognitive function and synaptic function, providing a potential treatment for ARCD and neurodegenerative diseases.


Natural Language Processing (NLP) and Automated Workflows to Extract Research Trends From American Society for Radiation Oncology (ASTRO) Annual Conferences (2019-2023)
Presenters
  • Camie Sawa, Sophomore, Applied Mathematics, Computer Science
  • Helena Zheng, Senior, Computer Science
  • Pranav Alaparthi, Junior, Computer Science
Mentor
  • John Kang, Radiation Oncology
Session
  • MGH 242
  • 3:30 PM to 5:10 PM

Natural Language Processing (NLP) and Automated Workflows to Extract Research Trends From American Society for Radiation Oncology (ASTRO) Annual Conferences (2019-2023)close

Every year, thousands of cancer research abstracts are presented at the ASTRO Annual Meeting. As biomedical literature continues to grow, there is a need to better understand trends in this large corpus of unstructured text data to aid conference organizers and attendees. This study examines the effectiveness of natural language processing (NLP) techniques to organize and present conference research. We analyzed a dataset of 9,770 abstracts accepted to the ASTRO Annual Meeting conference from 2019 to 2023. Using the BERTopic Python package, we converted abstracts into PubMedBERT embeddings and clustered the embeddings into 100 topics with HDBScan clustering. We experimented with c-TF-IDF scores, centroid distance, or HDBScan probabilities as various distance metrics to identify representative documents of each topic. To generate topic names, we input representative documents and BERTopic-extracted keywords into OpenAI’s GPT-3.5 model, applying role prompting and directive commanding strategies across three reinforcement phases of prompt tuning. Manual validation of GPT-generated names was performed through surveys assessing quantitative agreement and comments. Our approach combining BERTopic with a PubMedBERT transformer model and HDBScan clustering successfully categorized 91% of ASTRO abstracts. The three largest topics encompassed thoracic malignancies, head and neck cancer radiation therapy, and prostate cancer, while the smallest topics centered around radiation oncology education and brain tumor treatments. Two-dimensional interactive visualization using the Altair package also uncovered meta-topics such as Education and Basic Science. GPT-generated names, obtained using 20 representative documents selected by c-TF-IDF scores and three prompt tuning stages, were preferred in validation over human-generated categories. These results demonstrate the potential of combining representative models and generative models to derive topics from abstracts that are more preferred than human-generated categories. Our methods for optimizing clustering and prompt tuning to produce the best organization and naming of biomedical text may also be applied to automated conference organization.


Strategies to Optimize Outliers in Topic Modeling of Research Text in an Oncology Journal
Presenters
  • Sonya Renee Outhred, Junior, Computer Science
  • Addison Kuo Apisarnthanarax,
Mentor
  • John Kang, Radiation Oncology
Session
  • MGH 242
  • 3:30 PM to 5:10 PM

Strategies to Optimize Outliers in Topic Modeling of Research Text in an Oncology Journalclose

Publications are constantly being released as scientists and doctors continue to conduct new research. Keeping track of all publications released, even if narrow to a specific field, is onerous, requiring dedication of extensive time and resources. Our project uses LLMs (Large Language Models) to automate this process so that investigation of publication trends over decades is easily accessible to help inform future research. We extracted 4277 abstracts published from 2013 to 2023 from the International Journal of Radiation Oncology, Biology, and Physics. We leveraged the BERTopic (Bidirectional Encoder Representations from Transformers) framework, to cluster publications into a hundred topics based on PubMed pre-trained embeddings. In addition, we explored the parameter space of HDBSCAN (Hierarchical Density-Based Spatial Clustering of Applications with Noise), our chosen clustering method, in order to maximize the number of relevant topics and minimize the number of outliers. However, an outlier group that contained 18% of our abstracts still remained. To address this, we further processed abstracts in this group and assigned each of them to a topic using c-TF-IDF (class-based Term Frequency-Inverse Document Frequency) with more relaxed matching thresholds. We applied three different threshold levels and manually reviewed 30 randomly chosen outlier abstracts and graded them as strongly, moderately, or poorly aligned to the assigned topic. We found on our lowest threshold 76% of the abstracts were sorted to relevant topics. The verification we conduct on reduction helps ensure the quality of the clusters we produced and thus the accuracy of future analysis on underlying trends.


Enhancing Synthetic Bone Findings for CT Imaging Through Prompt Tuning
Presenters
  • Kanush Sethia, Senior, Biology (Molecular, Cellular & Developmental)
  • Madhumitha (Madhu) Sridhar, Senior, Informatics
Mentor
  • John Kang, Radiation Oncology
Session
  • MGH 242
  • 3:30 PM to 5:10 PM

Enhancing Synthetic Bone Findings for CT Imaging Through Prompt Tuningclose

This experiment explores the impact of prompt tuning on generating synthetic bone findings from CT scans using a large language model. The study aims to enhance the model’s ability to produce realistic and diverse data, which could improve medical research, diagnostic tools, and AI-driven healthcare solutions. It tests 0-shot, 1-shot, and multi-shot prompt tuning strategies to assess their effectiveness in generating accurate radiology reports. The 0-shot strategy provides only a task description, the 1-shot strategy offers one example, and the multi-shot strategy provides five examples. Before applying these strategies, the CT scan writing styles of five doctors were reviewed to create a checklist of key elements expected in the reports. Some criteria were intentionally omitted to add variety to the synthetic reports. A role-playing prompt was given to GPT-4, instructing it to assume the role of Dr. GPT, tasked with generating authentic CT scan reports. GPT-4 was then asked to generate 50 synthetic data points for each prompt-tuning strategy (0-, 1-, and multi-shot). These reports were paired with 50 distinct authentic CT scan reports, forming three separate datasets for Turing Test evaluations. A physician ranked each report on a scale of one to five, with one indicating the report was authentic and five indicating the report was synthetic. The synthetic reports were generated using GPT-4, and a dataset of real patient data and physician-generated bone findings was used for Turing Test evaluations. This study aims to demonstrate the potential of prompt tuning in enhancing synthetic data generation, contributing to AI-driven tools in healthcare and medical research. Future refinements may lead to even more accurate synthetic data generation.


The University of Washington is committed to providing access and accommodation in its services, programs, and activities. To make a request connected to a disability or health condition contact the Office of Undergraduate Research at undergradresearch@uw.edu or the Disability Services Office at least ten days in advance.