Fall 2023

Date: 08/30/2023
Event: Planning, introductions, welcome!
Date: 09/06/2023
Event: ACL Keynote Watch Party
Speaker: Geoffrey Hinton
Date: 09/13/2023
Event: Ongoing Project Updates
Speaker: Susan Windisch
Details: AIDA, KAIROS, DWD
Date: 09/20/2023
Event: Brunch and garden party outside in the Shakespeare Garden
Date: 09/27/2023
Event: Practice Talk, Ongoing Project Updates
Speakers: Felix Zheng, Martha Palmer, Jim Martin, Rehan Ahmed
Details: UMR, iSAT, Event Coreference Projects
Date: 10/04/2023
Event: Ongoing Project Updates
Speakers: Alexis Palmer, Katharina von der Wense
Details: Low-resource and endangered languages (UMR2, LECS Lab, NALA)
Date: 10/11/2023
Event: Ongoing Project Updates
Speakers: Alexis Palmer, Maria Pacheco
Details: LECS Lab and BLAST Lab
Date: 10/18/2023
Event: Thesis Proposal (NALA Lab): Pretrained Multilingual Model Adaptation for Low-Resource Languages with OCR
Speaker: Téa Wright
Abstract: Pretrained multilingual models (PMMs) have advanced the field of natural language processing (NLP) in recent years, but they often struggle when confronted with low-resource languages. This proposal will explore the challenges of adapting PMMs to such languages, with a current focus on Lakota and Dakota. Much of the data available for endangered languages is in formats that are not machine-readable, and as a result endangered languages are left out of NLP technologies. Using optical character recognition (OCR) to digitize these resources helps address this problem, but also introduces noise. The goal of this research is to determine how this noise affects model adaptation and performance in zero-shot and few-shot learning for low-resource languages. The project will involve data collection and scanning, annotation of a gold evaluation dataset, and evaluation of multiple language models across different adaptation methods and levels of noise. Additionally, we hope to expand this pipeline to more scripts and languages. The potential implications of this study are broad: generalizability to languages not included in the study, as well as insight into how noise affects model adaptation and which types of noise are most harmful. This project aims to address the unique challenges of Lakota and Dakota and to develop the field's understanding of how models may be adapted to include low-resource languages, working towards more inclusive NLP technologies.
Date: 10/25/2023
Event: BLAST Lab Session and Guest Talk: The Differential and Irreplaceable Contributions of Academia and Industry to AI Research
Speaker: Daniel Acuña (Science of Science)
Abstract: Striking recent advances in artificial intelligence (AI) by industry have stunned the academic world, making us rethink whether academia should simply follow industry's lead. Due to its open publication, citation, and code-sharing culture, AI offers a rare opportunity to investigate whether these recent advances are outliers or something more systematic. In the present study, we investigate the impact and novelty of academic and industry AI research across 58 conferences (the primary publication medium of AI), involving 292,185 articles and 524 state-of-the-art models from 1995 to 2020. Our findings reveal an overall seismic shift in impact and novelty metrics, which started around 2015, presumably driven by deep learning. By the most recent measures, an article published by an exclusively industry team dominates impact, with a 73.78 percent higher chance of being highly cited, a 12.80 percent higher chance of being citation-disruptive, and several times the likelihood of producing state-of-the-art models. In contrast, we find that academic teams dominate novelty, being a striking 2.8 times more likely to produce novel, atypical work. Controlling for potential confounding factors such as subfield, team size, seniority, and prestige, we find that academia-industry collaborations are unable to simultaneously replicate the impact and novelty of non-collaborative teams, suggesting each environment offers irreplaceable contributions to advancing AI.
Date: 11/01/2023
Event: Commonsense Knowledge of Prototypical Functions for Natural Language Processing
Speaker: Tianyu Jiang (University of Cincinnati)
Abstract: Recent advances in natural language processing (NLP) have enabled computers to understand and generate natural language to a remarkable degree. However, it is still a big challenge for computers to "read between the lines" as we humans do. People often omit a lot of information in daily communication, but we have no difficulty understanding each other because our commonsense knowledge helps us make inferences. In this research, we focus on one specific type of commonsense knowledge that people use in everyday living: "functional knowledge". People go to different places for a common set of goals: people go to schools to study, go to stores to buy clothing, and go to restaurants to eat. Comparably, people create and use physical objects for different purposes: knives are for cutting, cars are for transportation, and phones are for communication. I will first introduce how we can automatically learn this type of knowledge, and then demonstrate how to utilize this prior knowledge of functions in two downstream applications: sentence-level understanding and visual activity recognition.
Date: 11/08/2023
Event: Low-Resource Monolingual Transformer Language Models
Speaker: Luke Gessler
Abstract: Since the publication of BERT in 2018, pretrained Transformer language models (TLMs) have been a foundational requirement for almost all natural language processing systems. High-quality TLMs are easily attainable for languages with vast amounts of data, such as English, but for all but the top 100 or so most data-rich languages, it is very difficult to train high-quality TLMs. Most work aimed at addressing this issue has taken a multilingual approach, but in this talk we take up the question of whether low-resource TLMs could be trained effectively using only data drawn from one language. First, we describe a novel training algorithm for monolingual low-resource TLMs which characteristically involves reducing model size and using multitask learning with syntactically labeled data. Second, we describe a complementary training algorithm which uses contrastive learning and a syntactically guided self-attention mechanism to provide syntactic inductive bias to TLMs. Third, we present a new TLM evaluation dataset, extensible to any language with a New Testament translation, aimed at addressing the severe lack of model evaluation resources in low-resource settings. To our knowledge, this is the first major effort to develop low-resource monolingual TLMs, and our results show that our methods are often more effective than any competing approach at providing TLMs for low-resource languages.
Date: 11/15/2023
Event: Inductive Biases for Deep Linguistic Structured Prediction with Independent Factorization
Speaker: Jie Cao
Abstract: Discovering the underlying structure of text can enable rigorous analysis, easier knowledge organization, and programmable reasoning. The no-free-lunch theorem underscores that the search for appropriate inductive biases, which influence hypothesis selection in machine learning, is necessary to obtain generalization. This is also true for deep learning models that predict intricate combinatory structures. We ground our studies of deep structured prediction in both broad-coverage linguistic representations and application-specific representations. Due to the compositionality of natural language, many language representations are themselves defined as compositional structures. However, we need to make the right design choices to factorize the input and output, and then model the correlations between their decomposed parts. We study structural inductive biases by designing factorization-oriented learning and reasoning mechanisms at the lexical, phrasal, and sentential levels. Furthermore, human-encoded knowledge expressed in language can also serve as a valuable inductive bias. We study how to use natural language descriptions to represent the meaning of output symbols (intents and slots) in task-oriented dialogue state tracking, which helps to generalize to unseen domains and services. We offer detailed comparative studies on how to use natural language as an inductive bias by investigating encoding strategies, supplementary pretraining, and homogeneous/heterogeneous evaluations.
Date: 11/29/2023
Event: EMNLP Practice Talks
Date: 12/06/2023
Event: Dissertation Proposal Defense: Towards Automatically Expanding Multilingual Coverage of Morphological Databases
Speaker: Adam Wiemerslage
Abstract: New NLP methods that leverage enormous amounts of digital text are transforming the experience of working with computers and accessing the internet for many people. For most of the world's languages, though, there is not enough digital data to make recently popular technology like large language models (LLMs) possible. For these underrepresented languages, often referred to as low-resource languages in NLP, simpler language technologies like lexica, morphological analyzers, and text normalizers can serve many purposes, especially in language documentation life-cycles and for building educational tools, and can contribute to the development of more digital data. These tools also enable research on a wider breadth of languages than are typically studied in a computational context. With this in mind, we propose techniques for automatically expanding the coverage of morphological databases, and techniques for developing morphological tools for the large set of languages with few available resources. We then study the generation capabilities of neural network models that learn from these resources. Finally, we propose methods for training neural networks when only small amounts of data are available, taking inspiration from the recent successes of unsupervised pretraining in high-resource NLP.
Date: 12/13/2023
Event: Dissertation Proposal Defense: Proto-role Theory in Natural Language Processing
Speaker: Elizabeth Spaulding
Abstract: Dowty's theory of thematic proto-roles has been offered as an alternative to traditional inventories of thematic roles (Agent, Patient, Theme, etc.). Instead of describing an argument as the Agent of an action, for example, the theory describes arguments using a set of properties such as "volitional involvement," "changes state," and "sentience." The hope was that the theory would describe patterns of meaning more relevant to the real world than previous theories of thematic roles. This thesis proposal will investigate the operationalization of this theory. First, I will present completed work: an in-depth analysis of the computational semantic task of semantic proto-role labeling in a joint end-to-end setting. Then, I will propose an analysis of the link between semantic role labeling and semantic proto-role labeling using large language models. Finally, I will propose an application of the task that seeks to reveal implicitly held assumptions about which entities can be moral agents and moral patients by analyzing proto-role properties in real text.