The 4th International Workshop on Data-Centric AI

Data-Centric AI (DCAI): shifting research focus from model to data.

Home

Machine learning focuses on developing models for datasets, but real-world data is often messy. Improving the dataset itself can be a better way to enhance performance than improving only the models. Data-Centric AI (DCAI) is an emerging field that studies how to improve datasets systematically, which can yield significant improvements in ML applications. DCAI treats data improvement as an engineering discipline, shifting the focus from modeling to the underlying data. This workshop aims to build an interdisciplinary DCAI community to tackle problems in data collection, labeling, preprocessing, quality evaluation, data debt, and governance. Researchers and practitioners are invited to help shape the future of AI and ML by responding to the call for papers.

Agenda

Date: October 25th
Time	Title	Format	Presenter/Author
14:00-14:05	Opening Remarks	5 min	Organizers
14:05-14:45	Keynote Presentation: Data diversity for understanding safety across modalities	30 min + 10 min QA session	Alicia Parrish
14:45-15:30	Keynote Presentation: Future Insights: Harnessing AI and Social Media for Advanced Event and Epidemic Forecasting	30 min + 10 min QA session	Chang-Tien Lu
15:30-16:00	Coffee Break	30 min	Organizers
16:00-16:40	Keynote Presentation: Actionable Decision Making with Small Data	30 min + 10 min QA session	Hua Wei
16:40-16:48	Paper Presentation: AugTriever: Unsupervised Dense Retrieval by Scalable Data Augmentation	6 min + 2 min QA session	Rui Meng, Ye Liu, Semih Yavuz, Lifu Tu, Ning Yu, Jianguo Zhang, Meghana Bhat, and Yingbo Zhou
16:48-16:56	Paper Presentation: Transfer Learning for E-commerce Query Product Type Prediction	6 min + 2 min QA session	Anna Tigunova, Ghadir Eraisha, and Thomas Ricatte
16:56-17:04	Paper Presentation: Enhancing E-commerce Product Title Translation with Retrieval-Augmented Generation and Large Language Models	6 min + 2 min QA session	Bryan Zhang, Taichi Nakatani, and Stephan Walter
17:04-17:12	Paper Presentation: Towards Scalability and Extensibility of Query Reformulation Modeling in E-commerce Search	6 min + 2 min QA session	Ziqi Zhang, Yupin Huang, Quan Deng, Jinghui Xiao, Vivek Mittal, and Jingyuan Deng
17:12-17:20	Paper Presentation: Unsupervised Text Representation Learning via Instruction-Tuning for Zero-Shot Dense Retrieval	6 min + 2 min QA session	Qiuhai Zeng, Zimeng Qiu, Dae Yon Hwang, Xin He, and William M. Campbell
17:20-17:28	Paper Presentation: Assessing Data Copyright in Large Language Model via Partial Information Probing	6 min + 2 min QA session	Weijie Zhao, Huajie Shao, Zhaozhuo Xu, Suzhen Duan, and Denghui Zhang
17:28-17:30	Closing Remarks	2 min	Organizers

Topics

We welcome a wide array of submissions focused on data-centric AI, encompassing topics such as theories, algorithms, applications, systems, and tools. These topics include but are not limited to:

Automated Data Science Methods
- Data cleaning, denoising, and interpolation
- Feature selection and generation
- Data refinement, feature-instance joint selection
- Data quality improvement, representation learning, reconstruction
- Outlier detection and removal

Tools and Methodologies for Expediting Open-source Dataset Preparation
- Time acceleration tools for sourcing and preparing high-quality data
- Tools for consistent data labeling, data quality improvement
- Tools for generating high-quality supervised learning training data
- Tools for dataset control, high-level editing, searching public resources
- Tools for dataset feedback incorporation, coverage understanding, editing
- Dataset importers and exporters for easy data combination and consumption
- System architectures and interfaces for dataset tool composition

Algorithms for Handling Limited Labeled Data and Label Efficiency
- Data selection techniques, semi-supervised learning, few-shot learning
- Weak supervision methods, transfer learning, self-supervised learning approaches

Algorithms for Dealing with Biased, Shifted, Drifted, and Out-of-Distribution Data
- Datasets for bias evaluation and analysis
- Algorithms for automated bias elimination, model training with biased data

Submission Details

We invite the submission of regular research papers (max 9 pages, including the bibliography and any appendices). Submissions must be in PDF format and formatted according to the 2-column ACM sigconf template. Submitted papers will be assessed on their novelty, technical quality, potential impact, insightfulness, depth, clarity, and reproducibility. All papers must be submitted via EasyChair. For questions about the workshop or submissions, please email kunpeng@pdx.edu.

Important Dates (Anywhere on Earth)

Workshop Papers Submission: July 29, 2024
Notification of Workshop Papers Acceptance: August 30, 2024
Camera-ready Deadline and Copyright Form: September 15, 2024
Workshop Day: October 25, 2024

Organizing Committee

Steering Co-Chairs

Hui Xiong

The Hong Kong University of Science and Technology (Guangzhou)

Vipin Kumar

University of Minnesota

Organizing Committee

Yanjie Fu

Arizona State University

Kunpeng Liu

Portland State University

Pengyang Wang

University of Macau

Pengfei Wang

Chinese Academy of Sciences

Dongjie Wang

University of Kansas

Meng Xiao

Chinese Academy of Sciences

Yanyong Huang

Southwestern University of Finance and Economics

Wei Fan

University of Oxford

Ziyue Qiao

Great Bay University

Zhengzhang Chen

NEC Laboratories America

Publicity Co-Chairs

Pengyang Wang

University of Macau

Dongjie Wang

University of Kansas

Web Co-Chairs

Nanxu Gong

Arizona State University

Wangyang Ying

Arizona State University

Accepted Papers

Rui Meng, Ye Liu, Semih Yavuz, Lifu Tu, Ning Yu, Jianguo Zhang, Meghana Bhat, and Yingbo Zhou, "AugTriever: Unsupervised Dense Retrieval by Scalable Data Augmentation"
Anna Tigunova, Ghadir Eraisha, and Thomas Ricatte, "Transfer Learning for E-commerce Query Product Type Prediction"
Bryan Zhang, Taichi Nakatani, and Stephan Walter, "Enhancing E-commerce Product Title Translation with Retrieval-Augmented Generation and Large Language Models"
Ziqi Zhang, Yupin Huang, Quan Deng, Jinghui Xiao, Vivek Mittal, and Jingyuan Deng, "Towards Scalability and Extensibility of Query Reformulation Modeling in E-commerce Search"
Qiuhai Zeng, Zimeng Qiu, Dae Yon Hwang, Xin He, and William M. Campbell, "Unsupervised Text Representation Learning via Instruction-Tuning for Zero-Shot Dense Retrieval"
Weijie Zhao, Huajie Shao, Zhaozhuo Xu, Suzhen Duan, and Denghui Zhang, "Assessing Data Copyright in Large Language Model via Partial Information Probing"

Keynote Presentations

Data diversity for understanding safety across modalities

Presenter: Alicia Parrish

Bio: Alicia Parrish is a research scientist working on data-centric responsible AI at Google DeepMind in NYC. Her work focuses on better understanding how to ensure data diversity and high data quality through the entire data pipeline: how the data is collected, who and where it is collected from, and how it gets aggregated or sampled. By looking across the entire pipeline, her work traces the impact of early data choices all the way through to the kinds of conclusions we're able to make about model behavior. Alicia holds a PhD in linguistics from New York University.

Abstract: Though datasets to test the safety of large generative models often rely on binary labels, human perspectives are much more nuanced. How people interpret safety is deeply influenced by their socio-cultural backgrounds and lived experiences. As we aim to evaluate systems that will be deployed in real-world settings, it is critical to understand how the diverse perspectives of humans involved in creating the model evaluations can be leveraged. In this talk, I provide an overview of two recent datasets that enable the study of human diversity in relation to safety evaluation: (i) the DICES dataset, which captures diverse demographic perspectives on safety ratings for text-to-text models, highlighting the impact of individual backgrounds on safety assessments, and (ii) the Adversarial Nibbler dataset, which engages red teamers from distinct geographic regions to gather adversarial prompts for text-to-image models, revealing the importance of geographic diversity in identifying novel potential harms. I show that embracing the diversity of those involved in creating and rating data enables us to uncover nuanced perspectives that are relevant to ensuring the safety of generative models.

Future Insights: Harnessing AI and Social Media for Advanced Event and Epidemic Forecasting

Presenter: Chang-Tien Lu

Bio: Chang-Tien Lu is a Professor in the Department of Computer Science, Curriculum Lead at the Innovation Campus, and Associate Director of the Sanghani Center for AI and Data Analytics at Virginia Tech. He received his Ph.D. from the University of Minnesota, Twin Cities, in 2001. Dr. Lu currently serves as an Associate Editor for ACM Transactions on Spatial Algorithms and Systems, Data & Knowledge Engineering, IEEE Transactions on Big Data, and GeoInformatica. He has held prominent roles in organizing major conferences, including serving as General Co-Chair of the ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems in 2009, 2020, and 2021, the International Symposium on Spatial and Temporal Databases in 2017, and the IEEE International Conference on Big Data in 2024. He also served as Secretary (2008–2011) and Vice Chair (2011–2014) of ACM SIGSPATIAL. Dr. Lu's research spans spatial informatics, urban computing, artificial intelligence, and intelligent transportation systems. He has authored over 250 publications in top-tier journals and conferences, with funding from the NSF, NIH, DoD, and DoE. He is recognized as an ACM Distinguished Scientist and an IEEE Fellow.

Abstract: In an era where information flows rapidly through digital platforms, social media data offers a powerful tool for predicting and responding to global challenges. This talk presents advanced techniques for forecasting societal events and epidemics by leveraging social media data. I will introduce an automated system that analyzes open-source data from platforms such as Twitter, blogs, and news articles to predict events like natural disasters and public demonstrations, addressing challenges like unstructured text and dynamic relationships. Additionally, I will discuss SimNest, a deep learning framework that integrates computational epidemiology with social media data to enhance real-time disease outbreak monitoring and improve epidemic response strategies. Finally, I will explore a multi-task learning framework that improves event forecasting across regions by addressing issues such as data imbalances and geographic differences, with notable success demonstrated using Twitter data from Latin America. These approaches highlight the power of combining social media and machine learning to improve prediction accuracy and enable timely interventions for both societal events and health crises.

Actionable Decision Making with Small Data

Presenter: Hua Wei

Bio: Hua Wei is an assistant professor at the School of Computing and Augmented Intelligence (SCAI) at Arizona State University (ASU). He is also affiliated with the Lawrence Berkeley National Laboratory. Before joining ASU, he worked as an Assistant Professor at New Jersey Institute of Technology and a Staff Researcher at Tencent AI Lab. He received his PhD from Pennsylvania State University in 2020 under the supervision of Dr. Zhenhui (Jessie) Li. Before that, he received his master's and bachelor's degrees in Computer Science from Beihang University (BUAA), working with Prof. Jinpeng Huai and Dr. Tianyu Wo.

Abstract: Decision making in the real world is challenging because of small data, i.e., real-world data can be sparse and hard to obtain. This talk explores strategies for data generation in real-world small-data settings to support decision making. The objective is to improve the accuracy and efficiency of data generation by training machine learning models to learn from real-world data. Specifically, this talk will introduce some of our latest work on spatio-temporal data, and the follow-up generative and control models for small-data settings.

Program Committee

Dr. Yong Ge, University of Arizona
Dr. Hao Liu, The Hong Kong University of Science and Technology (Guangzhou)
Dr. Kunpeng Liu, Portland State University
Dr. Qi Liu, University of Science and Technology of China
Dr. Yanchi Liu, NEC Labs America
Dr. Leilei Sun, Beihang University
Dr. Pengfei Wang, Chinese Academy of Sciences
Dr. Pengyang Wang, University of Macau
Dr. Senzhang Wang, Central South University
Dr. Keli Xiao, Stony Brook University
Dr. Yang Yang, Nanjing University of Science and Technology
Dr. Zijun Yao, University of Kansas
Dr. Denghui Zhang, Stevens Institute of Technology
Dr. Wei Zhang, University of Central Florida
Dr. Xi Zhang, Chinese Academy of Sciences
Dr. Dongjie Wang, University of Kansas
Dr. Muhammad Zunnurain Hussain, Bahria University Lahore Campus
Dr. Venkata Nedunoori, Dentsu International

Volunteers

Mr. Haihua Xu, University of Macau
Ms. Qi Hao, University of Macau
