Session O-2P
Large Language Models: Engineering and Social Requirements
1:15 PM to 3:00 PM | CSE 305 | Moderated by Blake Hannaford
- Presenter
  - Weikai Huang, Freshman, Pre-Sciences
- Mentors
  - Ranjay Krishna, Computer Science & Engineering
  - Jieyu Zhang, Computer Science & Engineering
  - Zixian Ma (zixianma@cs.washington.edu)
The advent of Large Language Models (LLMs) like ChatGPT marks a significant milestone in artificial intelligence, with unprecedented capabilities across language-related tasks such as writing fiction and generating code. However, when it comes to solving complex real-world tasks that require task decomposition and multi-step planning, an LLM can only offer general or plausible-sounding plans without execution results, which are often not useful for a specific task. This limitation stems primarily from the model's inability to interact directly with real-life tools, which restricts what it can accomplish in the real world. Furthermore, the absence of customized feedback from real tools and environments makes it impossible for the model to refine its plans, significantly limiting its potential in real-life applications. To address this, we propose a framework that equips an LLM with over 50 real-life tools, including web APIs such as Wikipedia search, machine learning models, and image processing software. The framework understands user requests and parses them into multi-step plans. For example, if a user asks, "I want to know whether the review of 'Iron Man' is positive or negative," the framework would develop a plan such as (IMDB API [Web API] -> Sentiment Analysis [Machine Learning Model]) and then automatically execute the tools to produce the result. We also build a feedback system that evaluates both planning quality (e.g., format, rationality) and execution quality (e.g., alignment with the user's request), which significantly improves the quality and rationality of the plans GPT produces. Alongside the framework, we developed a comprehensive benchmark of thousands of human-verified, multi-step, multi-tool request-planning pairs covering a variety of real-life scenarios, and we benchmark several state-of-the-art models on it, including GPT-4, LLaMA, and Gemini.
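To make the pipeline concrete, here is a minimal sketch of a request-to-plan-to-execution loop. The tool registry, the planner heuristic, and the return values are illustrative assumptions for demonstration, not the framework's actual implementation.

```python
# Hypothetical sketch of a request -> multi-step tool plan -> execution loop.
# Tool names, the plan format, and the toy planner are illustrative assumptions.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Tool:
    name: str          # e.g. "IMDB API"
    kind: str          # "Web API" or "Machine Learning Model"
    run: Callable[[str], str]

# Toy registry standing in for the framework's 50+ real-life tools.
TOOLS = {
    "IMDB API": Tool("IMDB API", "Web API",
                     run=lambda q: f"review text for {q!r}"),
    "Sentiment Analysis": Tool("Sentiment Analysis", "Machine Learning Model",
                               run=lambda text: "positive"),
}

def plan(request: str) -> list[str]:
    """Stand-in planner: a real system would prompt an LLM to emit
    an ordered list of tool names for the request."""
    if "review" in request and ("positive" in request or "negative" in request):
        return ["IMDB API", "Sentiment Analysis"]
    return []

def execute(request: str) -> str:
    """Run each planned tool, feeding the previous output forward."""
    output = request
    for step in plan(request):
        output = TOOLS[step].run(output)
    return output

print(execute("I want to know whether the review of 'Iron Man' is positive or negative"))
# -> "positive" (toy result; real tools would return real data)
```

In a full system, the feedback module described above would sit between `plan` and `execute`, scoring the plan's format and rationality before any tool runs.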
- Presenter
  - Yubin Li, Sophomore, Computer Science, Shoreline Community College
- Mentor
  - Lauren Bryant, Information School, Shoreline Community College
Addressing bias in artificial intelligence (AI) and machine learning (ML) systems is crucial for ensuring fairness, transparency, and ethical integrity. This study introduces an interdisciplinary approach that blends advanced computational methods with social-science insights to tackle the multifaceted nature of bias. Through a mixed-methods strategy combining quantitative and qualitative data, we scrutinize algorithmic outcomes and conduct case studies with stakeholders: developers, users, and communities affected by AI/ML biases. Our initial findings indicate that bias transcends technical boundaries, manifesting as a complex socio-technical dilemma that demands both algorithmic adjustments and societal reforms. We highlight specific biases, such as gender and racial disparities in recruitment algorithms and facial recognition technologies, underscoring the need for this research. To address these biases, we propose adopting data enhancement techniques and fairness-focused learning algorithms, and promoting explainable AI practices. Inspired by influential figures like Joy Buolamwini, founder of the Algorithmic Justice League, and Cathy O'Neil, author of Weapons of Math Destruction, we emphasize the importance of inclusive datasets and of critically examining opaque algorithms. Our future efforts concentrate on developing comprehensive guidelines to reduce AI/ML biases and exploring the broader societal impacts of establishing unbiased AI and ML systems. By cultivating more equitable and ethical AI and ML frameworks, our research aims to meet the diverse needs of global communities and set a new standard for responsible AI development.
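One of the fairness checks this line of work relies on can be made concrete with a small example. The sketch below computes a demographic-parity gap for a hypothetical recruitment classifier; the predictions and group labels are invented for illustration and are not drawn from the study's data.

```python
# Hypothetical sketch: demographic parity difference for a binary
# "recommend for hire" classifier. All data below is invented.
import numpy as np

predictions = np.array([1, 0, 1, 1, 0, 1, 0, 0])   # 1 = recommended
groups      = np.array(["a", "a", "a", "a", "b", "b", "b", "b"])

def selection_rate(preds: np.ndarray, mask: np.ndarray) -> float:
    """Fraction of a group that receives the positive outcome."""
    return float(preds[mask].mean())

rate_a = selection_rate(predictions, groups == "a")
rate_b = selection_rate(predictions, groups == "b")
gap = abs(rate_a - rate_b)  # 0.0 would be perfect demographic parity
print(f"group a: {rate_a:.2f}, group b: {rate_b:.2f}, parity gap: {gap:.2f}")
# -> group a: 0.75, group b: 0.25, parity gap: 0.50
```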
- Presenter
  - Abhika Mishra, Senior, Computer Science
- Mentors
  - Hannaneh Hajishirzi, Computer Science & Engineering
  - Akari Asai (akari@cs.washington.edu)
Large language models (LMs) are prone to generating diverse factually incorrect statements, widely known as hallucinations. Current approaches predominantly focus on coarse-grained automatic hallucination detection or editing, overlooking nuanced error levels. In this project, we propose a novel task, automatic fine-grained hallucination detection, and present a comprehensive taxonomy encompassing six hierarchically defined types of hallucination. To facilitate evaluation, we introduce a new benchmark that includes fine-grained human judgments on the outputs of two LMs across various domains. To build this benchmark, I directly managed the collection of around 400 human annotations, which were analyzed to better understand the hallucinations present in LM outputs. My analysis using this benchmark reveals that ChatGPT and Llama2-Chat exhibit hallucinations in 60% and 75% of their outputs, respectively, and that a majority of these hallucinations fall into categories underexplored in previous work. As an initial step toward addressing this, I trained FAVA, a retrieval-augmented LM, on carefully designed synthetic data to detect and correct fine-grained hallucinations. The synthetic data generation pipeline I set up prompts ChatGPT to noise a passage by inserting errors one by one; the noisy passage is then post-processed into pairs of erroneous training inputs and edited outputs. On our benchmark, automatic and human evaluations show that FAVA outperforms ChatGPT on fine-grained hallucination detection by a large margin, though substantial room for improvement remains. FAVA's suggested edits also improve the factuality of LM-generated text, yielding 5-10% FActScore improvements. These results demonstrate FAVA's strong capability to detect factual errors in LM outputs.
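The error-insertion idea behind the synthetic training data can be sketched as follows. The prompt wording, the tag format, the error-type labels, and the `fake_llm` stand-in are illustrative assumptions, not FAVA's exact pipeline.

```python
# Hypothetical sketch of the synthetic-data idea: ask an LLM to insert one
# error at a time into a clean passage, then derive an (erroneous input,
# tagged edit target) training pair. Everything below is a stand-in.

ERROR_TYPES = ["entity", "relation", "contradictory",
               "invented", "subjective", "unverifiable"]  # assumed labels

def fake_llm(prompt: str, passage: str, error_type: str) -> str:
    """Stand-in for a real ChatGPT call: appends one tagged bogus claim."""
    return passage + f" <{error_type}>The moon is made of basalt cheese.</{error_type}>"

def make_training_pair(passage: str, error_type: str) -> tuple[str, str]:
    prompt = (f"Insert one {error_type} error into the passage, wrapping the "
              f"inserted span in <{error_type}>...</{error_type}> tags.")
    tagged = fake_llm(prompt, passage, error_type)
    # The untagged version becomes the erroneous input; the tags supervise
    # which spans the model should flag and edit.
    erroneous = tagged.replace(f"<{error_type}>", "").replace(f"</{error_type}>", "")
    return erroneous, tagged

inp, target = make_training_pair("Seattle is in Washington State.", "invented")
print(inp)     # erroneous input, no tags
print(target)  # same text with the inserted error marked for correction
```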
- Presenter
  - Jiatao Quan, Senior, Human Ctr Design & Engr: Data Science, Psychology
- Mentor
  - Sourojit Ghosh, Human Centered Design & Engineering
The latest developments in health applications of large language models (LLMs) show promise for providing preliminary diagnoses based on user symptoms, which is especially appealing to patients with mild conditions who prefer online consultation to in-person clinic visits. However, the accuracy of these symptom-based decisions remains uncertain, and the gap is more pronounced in mental health applications, where there is a noticeable lack of research on how effectively LLMs offer treatment advice. Our study addresses this issue through a combined analysis of LDA topic modeling, word frequency analysis, and cosine similarity, examining how LLM-based chatbots use data from the American Psychological Association to provide diagnostic information and how accurate their treatment advice is. We found that while these chatbots can provide highly accurate diagnoses and corresponding treatment recommendations for virtual patients, they typically offer less information about the prescribed treatment than human psychologists do; moreover, LLM-based chatbots tend to mention common relevant words across all topics, whereas responses from psychology experts are more likely to cover most of the relevant words within a topic. These findings raise questions about the significance and limitations of LLMs in mental health diagnosis and treatment advice, and call for more rigorous academic evaluation.
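As one concrete instance of the named analyses, the sketch below computes the cosine similarity between a chatbot response and an expert response over TF-IDF vectors. The example texts are invented, and the study's actual corpora, preprocessing, and LDA step are not shown.

```python
# Minimal sketch of one named analysis: cosine similarity between a chatbot
# response and an expert response over TF-IDF vectors. Example texts invented.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

chatbot_reply = "Cognitive behavioral therapy and regular sleep can help manage anxiety."
expert_reply = "For generalized anxiety, cognitive behavioral therapy is a first-line treatment."

tfidf = TfidfVectorizer(stop_words="english")
vectors = tfidf.fit_transform([chatbot_reply, expert_reply])
score = cosine_similarity(vectors[0], vectors[1])[0, 0]
print(f"cosine similarity: {score:.2f}")  # closer to 1.0 = more similar wording
```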
- Presenter
  - Lushan Wang, Senior, Human Ctr Des & Engr: Human-Computer Int
- Mentors
  - Sarah Coppola, Human Centered Design & Engineering
  - Alainna Brown, Human Centered Design & Engineering
International students are an essential part of the UW community, bringing unique lenses and perspectives to perceiving, approaching, and solving problems. UW International Student Services (ISS) provides the information and guidance international students need to legally live and study in the U.S. After encountering difficulties using the ISS system as both an international student and a design researcher, I began to wonder how ISS might improve the user experience of its website and services. I started exploring this question in HCDE 417 in Autumn 2023, and this project continues that work, motivated by two research questions: 1) How well does the UW ISS website navigation guide international students to complete the correct tasks? 2) How might we improve the ISS system to better support the needs of international students? My research is a usability study focused on three attributes of usability for the UW ISS system: usefulness, discoverability, and satisfaction. By carrying out initial usability testing sessions in HCDE 417 with international students and analyzing the transcript data using open and axial coding methods, I was able to take a deep dive into the problems with virtual advising services. My initial research surfaced several insights, including inconvenient drop-in-only advising services, an unreasonable student-to-advisor ratio, and hard-to-discover content. As a result of this study, I brought my initial findings to the ISS UX intern to discuss potential changes that could improve students' experience. As part of this community, I would like to use my design background to advocate for international students to receive more attention and resources from UW ISS.
- Presenter
  - Andre Ye, Senior, Computer Science, Philosophy; UW Honors Program
- Mentor
  - Ranjay Krishna, Computer Science & Engineering
I investigate the influence of cultural and linguistic backgrounds on visual perception and semantic interpretation within computer vision. This study addresses the question: Are there significant variations in the semantic content described by vision-language datasets and models across different languages? Guided by the hypothesis that cultural and linguistic diversities lead to distinct semantic interpretations, I compare multilingual datasets against monolingual counterparts. I developed metrics such as scene graph complexity, embedding space width, and linguistic diversity to quantify semantic variations across languages in both human-annotated and model-generated image captions. The methodology involves using linguistic tools and translation techniques to ensure semantic consistency across languages. Our findings indicate that multilingual captions contain, on average, 21.8% more objects, 24.5% more relations, and 27.1% more attributes than monolingual ones. Furthermore, models trained on diverse linguistic content demonstrate improved generalizability across different linguistic datasets. This study contributes to the understanding of how language and culture impact visual perception in computer vision and advocates for more inclusive dataset compilation and model training strategies.
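As a rough illustration of the scene-graph complexity metric, the sketch below scores a caption by counting part-of-speech proxies for objects, relations, and attributes. This heuristic, the spaCy model choice, and the example caption are assumptions for demonstration; the study's actual metric operates on parsed scene graphs rather than raw part-of-speech counts.

```python
# Crude, illustrative proxy for "scene graph complexity": count nouns as
# objects, verbs/prepositions as relations, and adjectives as attributes.
import spacy

nlp = spacy.load("en_core_web_sm")  # assumes the small English model is installed

def complexity(caption: str) -> dict[str, int]:
    doc = nlp(caption)
    return {
        "objects": sum(t.pos_ in ("NOUN", "PROPN") for t in doc),
        "relations": sum(t.pos_ in ("VERB", "ADP") for t in doc),
        "attributes": sum(t.pos_ == "ADJ" for t in doc),
    }

print(complexity("A small brown dog sleeps on a red couch near the window."))
# e.g. {'objects': 3, 'relations': 3, 'attributes': 3}
```

Comparing these counts for a caption and its translations is one way the reported gaps (21.8% more objects, 24.5% more relations, 27.1% more attributes in multilingual captions) could be measured in spirit, though the study's pipeline additionally controls for semantic consistency across languages.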
The University of Washington is committed to providing access and accommodation in its services, programs, and activities. To make a request connected to a disability or health condition contact the Office of Undergraduate Research at undergradresearch@uw.edu or the Disability Services Office at least ten days in advance.