Machine learning focuses on developing models for datasets, but real-world data is often messy. Improving the dataset itself can be a better way to enhance performance than improving only the model. Data-Centric AI (DCAI) is an emerging field that systematically improves datasets, resulting in significant improvements in ML applications. DCAI treats data improvement as an engineering discipline, shifting the focus from modeling to the underlying data. This workshop aims to build an interdisciplinary DCAI community to tackle data problems such as data collection, labeling, preprocessing, quality evaluation, data debt, and governance. Interested parties can help shape the future of AI and ML by submitting papers in response to the call for papers.


Agenda


Date: October 25th
Time | Title | Format | Presenter/Author
14:00-14:05 | Opening Remarks | 5 min | Organizers
14:05-14:45 | Keynote Presentation: Data diversity for understanding safety across modalities | 30 min + 10 min QA session | Alicia Parrish
14:45-15:30 | Keynote Presentation: Future Insights: Harnessing AI and Social Media for Advanced Event and Epidemic Forecasting | 30 min + 10 min QA session | Chang-Tien Lu
15:30-16:00 | Coffee Break | 30 min | Organizers
16:00-16:40 | Keynote Presentation: Actionable Decision Making with Small Data | 30 min + 10 min QA session | Hua Wei
16:40-16:48 | Paper Presentation: AugTriever: Unsupervised Dense Retrieval by Scalable Data Augmentation | 6 min + 2 min QA session | Rui Meng, Ye Liu, Semih Yavuz, Lifu Tu, Ning Yu, Jianguo Zhang, Meghana Bhat, and Yingbo Zhou
16:48-16:56 | Paper Presentation: Transfer Learning for E-commerce Query Product Type Prediction | 6 min + 2 min QA session | Anna Tigunova, Ghadir Eraisha, and Thomas Ricatte
16:56-17:04 | Paper Presentation: Enhancing E-commerce Product Title Translation with Retrieval-Augmented Generation and Large Language Models | 6 min + 2 min QA session | Bryan Zhang, Taichi Nakatani, and Stephan Walter
17:04-17:12 | Paper Presentation: Towards Scalability and Extensibility of Query Reformulation Modeling in E-commerce Search | 6 min + 2 min QA session | Ziqi Zhang, Yupin Huang, Quan Deng, Jinghui Xiao, Vivek Mittal, and Jingyuan Deng
17:12-17:20 | Paper Presentation: Unsupervised Text Representation Learning via Instruction-Tuning for Zero-Shot Dense Retrieval | 6 min + 2 min QA session | Qiuhai Zeng, Zimeng Qiu, Dae Yon Hwang, Xin He, and William M. Campbell
17:20-17:28 | Paper Presentation: Assessing Data Copyright in Large Language Model via Partial Information Probing | 6 min + 2 min QA session | Weijie Zhao, Huajie Shao, Zhaozhuo Xu, Suzhen Duan, and Denghui Zhang
17:28-17:30 | Closing Remarks | 2 min | Organizers

Topics


We welcome a wide array of submissions focused on data-centric AI, encompassing topics such as theories, algorithms, applications, systems, and tools. These topics include but are not limited to:

  • Automated Data Science Methods
    • Data cleaning, denoising, and interpolation
    • Feature selection and generation
    • Data refinement, feature-instance joint selection
    • Data quality improvement, representation learning, reconstruction
    • Outlier detection and removal
  • Tools and Methodologies for Expediting Open-source Dataset Preparation
    • Time acceleration tools for sourcing and preparing high-quality data
    • Tools for consistent data labeling, data quality improvement
    • Tools for generating high-quality supervised learning training data
    • Tools for dataset control, high-level editing, searching public resources
    • Tools for dataset feedback incorporation, coverage understanding, editing
    • Dataset importers and exporters for easy data combination and consumption
    • System architectures and interfaces for dataset tool composition
  • Algorithms for Handling Limited Labeled Data and Label Efficiency
    • Data selection techniques, semi-supervised learning, few-shot learning
    • Weak supervision methods, transfer learning, self-supervised learning approaches
  • Algorithms for Dealing with Biased, Shifted, Drifted, and Out-of-Distribution Data
    • Datasets for bias evaluation and analysis
    • Algorithms for automated bias elimination, model training with biased data

Submission Details


We invite the submission of regular research papers (max 9 pages, including the bibliography and any appendices). Submissions must be in PDF format and formatted according to the 2-column ACM sigconf template. Submitted papers will be assessed based on their novelty, technical quality, potential impact, insightfulness, depth, clarity, and reproducibility. All papers must be submitted via EasyChair. For questions about the workshop and submissions, please email kunpeng@pdx.edu.

Important Dates (Anywhere on Earth)


  • Workshop Papers Submission: July 29, 2024
  • Notification of Workshop Papers Acceptance: August 30, 2024
  • Camera-ready Deadline and Copyright Form: September 15, 2024
  • Workshop Day: October 25, 2024

Organizing Committee


Steering Co-Chairs

Hui Xiong

The Hong Kong University of Science and Technology (Guangzhou)

Vipin Kumar

University of Minnesota

Organizing Committee

Yanjie Fu

Arizona State University

Kunpeng Liu

Portland State University

Pengyang Wang

University of Macau

Pengfei Wang

Chinese Academy of Sciences

Dongjie Wang

University of Kansas

Meng Xiao

Chinese Academy of Sciences

Yanyong Huang

Southwestern University of Finance and Economics

Wei Fan

University of Oxford

Ziyue Qiao

Great Bay University

Zhengzhang Chen

NEC Laboratories America

Publicity Co-Chairs

Pengyang Wang

University of Macau

Dongjie Wang

University of Kansas

Web Co-Chairs

Nanxu Gong

Arizona State University

Wangyang Ying

Arizona State University


Keynote Presentations


Data diversity for understanding safety across modalities

Presenter: Alicia Parrish

Bio: Alicia Parrish is a research scientist working on data-centric responsible AI at Google DeepMind in NYC. Her work focuses on better understanding how to ensure data diversity and high data quality through the entire data pipeline, from how the data is collected and who or where it is collected from to how it is aggregated or sampled. By looking across the entire pipeline, her work traces the impact of early data choices all the way through to the kinds of conclusions we're able to make about model behavior. Alicia holds a PhD in linguistics from New York University.

Abstract: Though datasets to test the safety of large generative models often rely on binary labels, human perspectives are much more nuanced. How people interpret safety is deeply influenced by their socio-cultural backgrounds and lived experiences. As we aim to evaluate systems that will be deployed in real-world settings, it is critical to understand how the diverse perspectives of humans involved in creating the model evaluations can be leveraged. In this talk, I provide an overview of two recent datasets that enable the study of human diversity in relation to safety evaluation: (i) the DICES dataset, which captures diverse demographic perspectives on safety ratings for text-to-text models, highlighting the impact of individual backgrounds on safety assessments, and (ii) the Adversarial Nibbler dataset, which engages red teamers from distinct geographic regions to gather adversarial prompts for text-to-image models, revealing the importance of geographic diversity in identifying novel potential harms. I show that embracing the diversity of those involved in creating and rating data enables us to uncover nuanced perspectives that are relevant to ensuring the safety of generative models.


Future Insights: Harnessing AI and Social Media for Advanced Event and Epidemic Forecasting

Presenter: Chang-Tien Lu

Bio: Chang-Tien Lu is a Professor in the Department of Computer Science, Curriculum Lead at the Innovation Campus, and Associate Director of the Sanghani Center for AI and Data Analytics at Virginia Tech. He received his Ph.D. from the University of Minnesota, Twin Cities, in 2001. Dr. Lu currently serves as an Associate Editor for ACM Transactions on Spatial Algorithms and Systems, Data & Knowledge Engineering, IEEE Transactions on Big Data, and GeoInformatica. He has held prominent roles in organizing major conferences, including serving as General Co-Chair of the ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems in 2009, 2020, and 2021, the International Symposium on Spatial and Temporal Databases in 2017, and the IEEE International Conference on Big Data in 2024. He also served as Secretary (2008–2011) and Vice Chair (2011–2014) of ACM SIGSPATIAL. Dr. Lu's research spans spatial informatics, urban computing, artificial intelligence, and intelligent transportation systems. He has authored over 250 publications in top-tier journals and conferences, with funding from the NSF, NIH, DoD, and DoE. He is recognized as an ACM Distinguished Scientist and an IEEE Fellow.

Abstract: In an era where information flows rapidly through digital platforms, social media data offers a powerful tool for predicting and responding to global challenges. This talk presents advanced techniques for forecasting societal events and epidemics by leveraging social media data. I will introduce an automated system that analyzes open-source data from platforms such as Twitter, blogs, and news articles to predict events like natural disasters and public demonstrations, addressing challenges like unstructured text and dynamic relationships. Additionally, I will discuss SimNest, a deep learning framework that integrates computational epidemiology with social media data to enhance real-time disease outbreak monitoring and improve epidemic response strategies. Finally, I will explore a multi-task learning framework that improves event forecasting across regions by addressing issues such as data imbalances and geographic differences, with notable success demonstrated using Twitter data from Latin America. These approaches highlight the power of combining social media and machine learning to improve prediction accuracy and enable timely interventions for both societal events and health crises.


Actionable Decision Making with Small Data

Presenter: Hua Wei

Bio: Hua Wei is an assistant professor at the School of Computing and Augmented Intelligence (SCAI) at Arizona State University (ASU). He is also affiliated with the Lawrence Berkeley National Laboratory. Before joining ASU, he worked as an Assistant Professor at New Jersey Institute of Technology and a Staff Researcher at Tencent AI Lab. He received his PhD from Pennsylvania State University in 2020 under the supervision of Dr. Zhenhui (Jessie) Li. Before that, he received his master's and bachelor's degrees in Computer Science from Beihang University (BUAA), working with Prof. Jinpeng Huai and Dr. Tianyu Wo.

Abstract: Decision making in the real world is difficult when data is small, i.e., when real-world data is sparse and hard to obtain. This talk explores strategies for data generation in real-world small-data settings to support decision making. The objective is to improve the accuracy and efficiency of data generation by training machine learning models to learn from real-world data. Specifically, this talk will introduce some of our latest work on spatio-temporal data, along with the follow-up generative and control models designed for small data.


Program Committee


  • Dr. Yong Ge, University of Arizona
  • Dr. Hao Liu, The Hong Kong University of Science and Technology (Guangzhou)
  • Dr. Kunpeng Liu, Portland State University
  • Dr. Qi Liu, University of Science and Technology of China
  • Dr. Yanchi Liu, NEC Labs America
  • Dr. Leilei Sun, Beihang University
  • Dr. Pengfei Wang, Chinese Academy of Sciences
  • Dr. Pengyang Wang, University of Macau
  • Dr. Senzhang Wang, Central South University
  • Dr. Keli Xiao, Stony Brook University
  • Dr. Yang Yang, Nanjing University of Science and Technology
  • Dr. Zijun Yao, University of Kansas
  • Dr. Denghui Zhang, Stevens Institute of Technology
  • Dr. Wei Zhang, University of Central Florida
  • Dr. Xi Zhang, Chinese Academy of Sciences
  • Dr. Dongjie Wang, University of Kansas
  • Dr. Muhammad Zunnurain Hussain, Bahria University Lahore Campus
  • Dr. Venkata Nedunoori, Dentsu International

Volunteers


  • Mr. Haihua Xu, University of Macau
  • Ms. Qi Hao, University of Macau
