GenAI Media Generation Challenge Workshop @ CVPR

Jun 17 (afternoon), 2024
Workshop Meeting Location: Summit 423-425

Challenge Overview

In the wake of the rapid advancements in generative model capabilities, the field of text-guided image generation and editing has seen unprecedented progress. This has resulted in the creation of images with unparalleled quality, aesthetics, and adherence to text guidance. Yet a significant challenge remains: the absence of a universally accepted, easily accessible benchmark for evaluating the in-depth capabilities of generative models. This issue stems from the lack of a comprehensive, large-scale evaluation dataset and of standardized evaluation protocols, and from the insufficiency of current automatic metrics.

We are proud to announce the launch of the GenAI Media Generation Challenge (MAGIC), which moves us towards addressing these issues. We host two challenge tracks, (1) text-to-image generation and (2) text-guided image editing, and will provide the following data and resources:

  • Benchmark Datasets: diverse and comprehensive datasets will be released publicly for both tracks, ensuring that all participants have access to the same information and resources.
  • Evaluation Protocol: A standardized evaluation protocol and metrics will be established and shared with all participants. The metrics will be designed with the nuances and specifics of each task in mind, ensuring fair and objective evaluation.
  • Human Annotations for all Submissions: Recognizing the importance of meticulous human oversight, we commit to offering the requisite human annotation resources as stipulated by the evaluation protocol for the duration of the competition.
  • Baselines: For aiding participants and setting a preliminary standard, baseline results will be shared.
  • Awards for top performing participants for different tracks and challenges.

In addition to the benchmark datasets, after the challenge we will release all human annotations for all submissions to the participants. We hope this will be a contribution to the community for the development of future evaluation tooling.

Tracks Overview

Track A - Text-to-image Generation

Benchmarking Datasets

For this track, we test the model's text-to-image generation capabilities using benchmark prompts that cover different scenarios and levels of complexity. The benchmark prompts draw on existing academic benchmarks and on newly constructed prompts that span different levels of complexity. We aim to test the following scenarios, each with its own benchmark prompts:

  1. Single object generation, focusing on generation accuracy and correctness of the object
  2. Object composition, focusing on generation along with relationship and activity
  3. Reasoning, focusing on concepts such as counting or logic
  4. Stylization, focusing on correctness of global style
  5. Rendered text, focusing on quality and correctness of the generated text in image.

Within these scenarios, the benchmark prompts cover topics from various domains. For example, for single object generation, there are prompts on generating people, animals, food, and locations.

Evaluation Protocol

We leverage both human and automated evaluations. For human-based evaluation, human annotators rate the text-to-image model's performance on the following aspects:

  1. Text Faithfulness - whether the visual elements in prompts such as objects, attributes, interactions are accurately generated in the image.
  2. Visual quality - whether there are visual errors/defects in the generated image. Examples of errors/defects include misplaced body parts for people, unrealistic scale, bad arrangement, or blurriness.
  3. Visual appeal - whether the generated images are aesthetically pleasing to view. We judge whether the generated images are on par with or surpass professionally made ones in terms of abundant visual details, color harmony, and composition.

For automatic evaluation, we will leverage existing automatic metrics to help assess the generated images against their prompts.

To determine winners, we use automatic evaluation to prune the total number of entries to 10 finalists. Among these 10 finalists, we use human evaluation results to determine the final winners.
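The two-stage selection described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the organizers' actual tooling: the `Entry` record, the single `auto_score` and `human_score` fields, and the "higher is better" convention are all hypothetical, since the page does not specify which automatic metrics are used or how human annotations are aggregated. The default of three winners follows the prize description on this page.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Entry:
    """Hypothetical record for one submission (not an official format)."""
    team: str
    auto_score: float                    # assumed: aggregated automatic metric, higher is better
    human_score: Optional[float] = None  # assumed: aggregated human rating, set for finalists only


def select_finalists(entries: List[Entry], k: int = 10) -> List[Entry]:
    """Stage 1: prune all entries to the k with the highest automatic score."""
    return sorted(entries, key=lambda e: e.auto_score, reverse=True)[:k]


def rank_winners(finalists: List[Entry], n_winners: int = 3) -> List[Entry]:
    """Stage 2: rank finalists by human-evaluation score alone."""
    scored = [e for e in finalists if e.human_score is not None]
    return sorted(scored, key=lambda e: e.human_score, reverse=True)[:n_winners]


if __name__ == "__main__":
    # Made-up data: 15 entries with increasing automatic scores.
    entries = [Entry(team=f"team{i}", auto_score=i / 20) for i in range(15)]
    finalists = select_finalists(entries)
    # Placeholder human scores, deliberately reversed relative to the automatic ranking.
    for position, entry in enumerate(finalists):
        entry.human_score = float(position)
    for entry in rank_winners(finalists):
        print(entry.team, entry.human_score)
```

In this sketch the automatic score only decides who reaches the final 10; it plays no part in the final ranking, which is why the made-up human scores can reorder the finalists completely.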

Track B - Text-guided Image Editing

Benchmarking Dataset

For text-guided image editing, we test the model's capacity to change a given image's contents based on text instructions. We test the following types of instructions:

  1. Addition: Adding new objects within the images.
  2. Remove: Removing objects from the image.
  3. Local: Replace local parts of an object and alter the object's attributes, e.g., "make it smile"
  4. Global: Edit the entire image, e.g., "let's see it in winter"
  5. Inpaint: Alter an object's visual appearance without affecting its structure
  6. Background: Change the scene's background

Evaluation Protocol

We leverage both human and automated evaluations. For human-based evaluation, human annotators mainly evaluate the following aspects:

  1. Edit faithfulness - whether the edited image follows the editing instruction
  2. Content preservation - whether the edited image preserves the regions of the original image which should not be changed
  3. Visual quality - whether the edited image is artifact-free, keeping the core visual features of the original image, etc

For automatic evaluation, as in the text-to-image track, we will leverage existing automatic metrics to help assess the edited images against the prompt and instruction.

To determine winners, we use automatic evaluation to prune the total number of entries to 10 finalists. Among these 10 finalists, we use human annotation and evaluation to determine the final winners.

Bonus Track - Movie Generation Challenge

The generated movie can be any kind of video on any topic, but it should be created with a generative AI solution. The movie should be no more than 5 minutes long. To submit to the bonus track, please use the link here: [Moviegen Challenge Submission]

Winner Prize

The top three contestants in each track will be awarded the latest generation of AR/VR headsets from Meta.

Challenge Registration / Submission

Mailing List Registration

We ask all parties interested in participating to sign up for our mailing list through the Google Form here: [Register Mailing List].

Submission Instructions

Submission instructions have been sent out to participants via email. Please register for the mailing list above to receive the instructions.


To be updated on May 25 (originally May 18).

Important Dates (Seattle / Pacific Time Zone)

Registration Open: March 4
Release of Benchmarking Datasets: April 5 (originally March 22)
Challenge Submission Deadline: May 1st, 11:59pm PST (originally April 26, 5pm PST)
Announcement of Challenge Winners: May 25 (originally May 18)
Workshop Date: June 17

Workshop Schedule on 06/17/2024 (Pacific Time Zone)

Session | Time | Agenda | Speech Title | Speaker(s)
Opening Session | 13:15 - 13:30 | Opening Remarks | - | Workshop Organizers
Session I | 13:30 - 14:00 | Keynote Speech | Known Issues with FID and FVD | Jun-Yan Zhu
Session I | 14:00 - 14:30 | Keynote Speech | Video Generation: Past, Present, and a New Hope | Sergey Tulyakov
Session I | 14:30 - 15:00 | Keynote Speech | Incentivizing Opt-in & Enabling Opt-out for Text-to-Image Models | Richard Zhang
Break | 15:00 - 15:20 | Coffee Break | - | -
Session II | 15:20 - 15:50 | Keynote Speech | Controllable Media Generation | Yuanzhen Li
Session II | 15:50 - 16:20 | Keynote Speech | Multistep Distillation of Diffusion Models via Moment Matching | Tim Salimans
Session II | 16:20 - 17:00 | Competition Evaluation, Goals, Awards and Participant Presentations | - | Workshop Organizers, Ju Xuan, Manuel Brack, Lumina, Akio
Session II | 17:00 - 17:30 | Panel Discussion | Evaluation and tracking progress for Generative AI | Jun-Yan Zhu, Richard Zhang, Ishan Misra
Closing Session | 17:30 - 17:40 | Closing Remarks | - | Workshop Organizers

Invited Speakers

Yuanzhen Li At Google Research, Yuanzhen supports a talented team and a few cross-team Computer Vision and Generative AI efforts. Recent work includes: Generative Product Imagery, "Imagen" in Google Cloud Vertex AI, Muse, DreamBooth, DreamBooth3D, Generative Uncrop, etc. Prior to Google, she spent 10 years in the startup world. She made iPhone computational-photography apps, and one of them (TrueHDR) was quite popular in the 2009-2012 era. She also founded a startup leveraging deep learning for image search; it was acquired by VSCO in 2015, and she then worked at VSCO as a Director of Engineering. Prior to that, she completed her PhD in 2009 at MIT with Prof. Edward H. Adelson. At MIT, she was a member of the Computer Science and Artificial Intelligence Laboratory (CSAIL) and the Perceptual Science Laboratory.

Jun-Yan Zhu Dr. Jun-Yan Zhu is an Assistant Professor with The Robotics Institute in the School of Computer Science of Carnegie Mellon University. He also holds affiliated faculty appointments in the Computer Science Department and Machine Learning Department. He studies computer graphics, computer vision, and computational photography. Prior to joining CMU, he was a Research Scientist at Adobe Research. He did a postdoc at MIT CSAIL, working with William T. Freeman, Josh Tenenbaum, and Antonio Torralba. He obtained his Ph.D. from UC Berkeley, under the supervision of Alexei A. Efros. He received his B.E. from Tsinghua University, working with Zhuowen Tu, Shi-Min Hu, and Eric Chang.

Richard Zhang Dr. Richard Zhang’s research interests are in computer vision, machine learning, deep learning, graphics, and image processing. He obtained a PhD at UC Berkeley, advised by Prof. Alexei (Alyosha) Efros. He obtained BS and MEng degrees from Cornell University in ECE. He often collaborates with academic researchers, either through internships or university collaboration. Recently, he was included on MIT Technology Review's list of 35 Innovators Under 35.

Sergey Tulyakov Dr. Sergey Tulyakov is a Principal Research Scientist heading the Creative Vision team at Snap Research. His work focuses on creating methods for manipulating the world via computer vision and machine learning. This includes 2D and 3D methods for photorealistic object manipulation and animation, video synthesis, prediction and retargeting. His work has been published as 30+ top conference papers, journals and patents resulting in multiple tech transfers, including Snapchat Pet Tracking and Real-time Neural Lenses (gender swap, baby face, real-time try-on and many others).

Tim Salimans Tim is a Machine Learning research scientist working on generative modeling. He is well known for his work on GANs and VAEs, and their evaluation using the Inception score, as well as his work on autoregressive generative models like GPT-1 and PixelCNN++. More recently, he has been focusing on diffusion models for generating images (Imagen) and video (Imagen Video), and on making these models fast to sample using distillation.


Organizers

Sam Tsai, Meta AI
Ji Hou, Meta AI
Bichen Wu, Meta AI
Matthew Yu, Meta AI
Wang Rui, Meta AI
Tianhe Li, Meta AI
Ajay Menon, Meta AI
Kunpeng Li, Meta AI
Tao Xu, Meta AI

Senior Advisors


To contact the organizers please use


Thanks to languagefor3dscenes for the webpage format.