NL Seminar: The Data Provenance Initiative: A Large Scale Audit of Dataset Licensing & Attribution in AI

ISI Natural Language Seminar

When

Thursday, March 21, 2024 11:00am - 12:00pm PDT

Presenter

Presented by:

Anthony Chen, UC Irvine

Location

ISI-MDR #689. In-person attendance is permitted for USC/ISI faculty, staff, and students only. The talk is open to the public virtually via Zoom.

This event is open to:

Everyone

Event Details

Speakers: Anthony Chen (UC Irvine) and Shayne Longpre (MIT)

Conference Room Location: ISI-MDR #689. In-person attendance is permitted for USC/ISI faculty, staff, and students only. The talk is open to the public virtually via Zoom.

REMINDER:

If you do not have access to the 6th floor, please check in at the main reception desk on the 10th floor, and someone will escort you to the conference room before the talk starts.

Meeting hosts admit only guests they know to the Zoom meeting, so you’re highly encouraged to sign into Zoom with your USC account.

If you’re an outside visitor, please send your full name, title, and workplace to nlg-seminar-host(at)isi.edu beforehand so we’ll be aware of your attendance. Also, let us know whether you plan to attend in person or virtually.

For more information on the NL Seminar series and upcoming talks, please visit:

https://www.isi.edu/research-groups-nlg/nlg-seminars/

Hosts: Jon May and Justin Cho

The arms race to train language models on vast, diverse, and inconsistently documented datasets has raised pressing concerns about the legal and ethical risks for practitioners. To remedy these practices, which threaten data transparency and understanding, we introduce the Data Provenance Initiative, a multi-disciplinary effort between legal and machine learning experts to systematically audit and trace 1800+ text datasets. We develop tools and standards to trace the lineage of these datasets, including their sources, creators, series of license conditions, properties, and subsequent use. Our landscape analysis highlights the sharp divides in composition and focus of commercially open vs. closed datasets, with closed datasets monopolizing important categories: lower-resource languages, more creative tasks, richer topic variety, and newer and more synthetic training data.

Speaker Bio

Bio 1: Anthony Chen is an engineer at Google DeepMind doing research on factuality and long-context language models. He received his PhD from UC Irvine last year, where he focused on generative evaluation and factuality in language models.

Bio 2: Shayne Longpre is a PhD candidate at MIT with a focus on data-centric AI, language models, and their societal impact.

If the speaker approves recording of this NL Seminar talk, the recording will be posted on our USC/ISI YouTube page within 1-2 business days: https://www.youtube.com/user/USCISI.

Subscribe here to learn more about upcoming seminars: https://www.isi.edu/events/

This program is open to all eligible individuals. Information Sciences Institute operates all of its programs and activities consistent with the University’s Notice of Non-Discrimination. Eligibility is not determined based on race, sex, ethnicity, sexual orientation, or any other prohibited factor.