2nd GenAI Media Generation Challenge Workshop @ CVPR2025

Jun 11th (9:00 - 12:00) Workshop Meeting Location: 110A

Join Zoom Meeting TBA.

Workshop Overview

This year, we are excited to host the 2nd GenAI Media Generation Challenge Workshop at CVPR 2025. Building on the success of last year's event, which focused on text-to-image and image editing tasks, we are expanding the challenge to include video generation.

We are proud to announce the launch of the 2nd GenAI Media Generation Challenge (MAGIC), featuring a media generation track and auto-evaluation track:

Media Generation Festival: For the first time, we are organizing a media generation festival with no restrictions on prompts. We will define several topics in which submitted media compete, and participants can submit their best generated videos or images for those topics. For each topic, a crowdsourced voting mechanism determines the winners.
Auto Evaluation Challenge: We are introducing an auto evaluation challenge for both text-to-image and text-to-video tasks. Participants can develop and submit their auto evaluation score for a preselected set of images and videos that we will provide and enter into the media generation festival track. Auto evaluation submissions aim to predict the outcomes of the crowdsourced voting mechanism in the media generation festival. The auto evaluation method that achieves the best correlation with the final results wins this challenge.

Crowdsourcing voting platform is Open!

Click the link below to vote for your favorite AI-generated media.

Vote Now

Submission for "Challenge A - Media Generation Festival" is Open!

Click the link below to share your generated images or videos with us. See Submission Instruction.

Submit Now

Submission for "Challenge B - Auto Evaluation" is Open!

Click the link below to share your auto eval results with us. See Submission Instruction.

Submit Now

Challenges Overview

Challenge A - Media Generation Festival

Track A.1 - Video (Short)

For this task, there are no restrictions on prompts or models, but every submission must belong to one of the topics listed below. Participants are free to use any models, including third-party video generation tools, and are encouraged to design their own creative prompts to produce engaging videos. Submissions can cover a single topic or multiple topics. However, the video length is limited to a maximum of 10 seconds.

For this track, we will have the following set of topics:

People
Animals
Landscape

Track A.2 - Video (Long)

Similar to Track A.1, there are no restrictions on prompts or models, but every submission must belong to one of the topics listed below. For this track, though, the video length is limited to a maximum of 5 minutes.

For this track, we will have the following set of topics:

Action
History
Sci-Fi
Fantasy
Comedy

Track A.3 - Images

For this track, we will have the following set of topics:

People
Animals
Landscape

Evaluation Protocol

All submitted videos and images will be uploaded to our crowdsourcing platform for public voting. The top three submissions with the highest votes for each topic will be declared winners. In addition to the per-topic winners, we will also name joint-submission winners for the best overall performing solution. We will use the Elo rating system to compute the final ranking among submissions.

Voting uses pairwise comparisons: voters are shown two videos or images along with the topic and asked "Which one would you prefer?". Each vote is recorded on a three-point win/tie/lose scale.
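As an illustration of how pairwise win/tie/lose votes can be turned into an Elo-style ranking, here is a minimal sketch. It is not the organizers' implementation: the K-factor of 32, the starting rating of 1000, the 400-point scale, and the function names `elo_ratings` and `rank` are all assumptions chosen for the example.

```python
from collections import defaultdict


def elo_ratings(votes, k=32.0, base=1000.0):
    """Compute Elo-style ratings from pairwise votes.

    votes: iterable of (a, b, outcome), where outcome is "win", "tie",
    or "lose" from the point of view of submission a.
    k and base are assumed values, not taken from the challenge rules.
    """
    score = {"win": 1.0, "tie": 0.5, "lose": 0.0}
    ratings = defaultdict(lambda: base)
    for a, b, outcome in votes:
        ra, rb = ratings[a], ratings[b]
        # Expected score of a against b under the standard logistic Elo model.
        expected_a = 1.0 / (1.0 + 10 ** ((rb - ra) / 400.0))
        actual_a = score[outcome]
        delta = k * (actual_a - expected_a)
        # The update is zero-sum: what a gains, b loses.
        ratings[a] = ra + delta
        ratings[b] = rb - delta
    return dict(ratings)


def rank(ratings):
    """Return submission ids sorted from highest to lowest rating."""
    return sorted(ratings, key=ratings.get, reverse=True)


if __name__ == "__main__":
    # Hypothetical votes between three submissions in one topic.
    votes = [("x", "y", "win"), ("y", "z", "tie"), ("x", "z", "win")]
    print(rank(elo_ratings(votes)))
```

Sequential Elo updates depend on the order in which votes arrive, so a production system might shuffle and average over several passes or fit a Bradley-Terry model instead; the sketch keeps the simplest form.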

Submission Instruction

For this challenge, the only restrictions are the categories (see above) and the video length (at most 10 seconds for Track A.1 and 5 minutes for Track A.2). Please upload your videos or images and indicate the category and track. There is no limit on the number of submissions per team.

Challenge B - Results Prediction with Auto Evaluation (Artifacts/Flaws)

In this challenge, we provide (text, image) pairs and (text, video) pairs on which participants run their own auto evaluation. Participants submit a binary classification for each item, indicating whether the media (image or video) has flaws/artifacts.

Track B.1 - Video Generation Auto Eval

For this track, we use Movie Gen Video Bench for benchmarking. Each participant will download the 1003 videos and prompts from Movie Gen Video Bench and run their auto eval models to classify whether each of the 1003 videos has artifacts or flaws.

Track B.2 - Image Generation Auto Eval

For this track, we use Emu_1k for benchmarking. Each participant can download the 1000 Emu-generated images and prompts from the benchmark and run their auto eval models to indicate whether each image has flaws or artifacts.

Evaluation Protocol

All submitted classifications of the videos and images will be evaluated against our internal human annotations.

Submission Instruction

To participate in Track B.1, please follow these steps:

Download the 1003 videos, indexed from 0 to 1002, from the Movie Gen Video Bench.
Run your auto-evaluation model to classify each video as having flaws or artifacts or not. You can also utilize the prompts and meta information provided in the Movie Gen Video Generation benchmark to enhance your evaluation.
To submit your results, prepare a text file with 1003 lines, one per video in index order. Each line indicates whether the corresponding video has flaws or artifacts in the generation.

Example of a submitted txt file.

1
0
...
1

In this submission,

The first video (0.mp4) has flaws or artifacts, so the first line is 1.
The second video (1.mp4) has no flaws, so the second line is 0.
The last video (1002.mp4) has artifacts, so the last line is 1.
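The file format described above can be produced with a few lines of Python. This is a minimal sketch: the helper names `format_submission` and `write_submission` and the output filename are hypothetical, and the labels would come from your own auto-evaluation model.

```python
from pathlib import Path


def format_submission(labels, expected_count):
    """Return the submission text: one line per item, 1 = flaws/artifacts, 0 = none.

    labels must be ordered by index (0.mp4 first, 1002.mp4 last for Track B.1).
    """
    labels = list(labels)
    if len(labels) != expected_count:
        raise ValueError(f"expected {expected_count} labels, got {len(labels)}")
    if any(label not in (0, 1) for label in labels):
        raise ValueError("every label must be 0 or 1")
    # int() normalizes booleans so that True is written as "1", not "True".
    return "\n".join(str(int(label)) for label in labels) + "\n"


def write_submission(labels, expected_count, path):
    """Write the formatted submission to a text file."""
    Path(path).write_text(format_submission(labels, expected_count), encoding="utf-8")


if __name__ == "__main__":
    # Placeholder predictions; replace with your model's outputs for 0.mp4 .. 1002.mp4.
    predictions = [0] * 1003
    write_submission(predictions, 1003, "track_b1_submission.txt")  # hypothetical filename
```

Checking the line count before writing catches the most common formatting mistake, a file that is one line short or long.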

For Track B.2, please follow these instructions:

Download the "images.zip" file from the Emu 1k benchmark.
Inside "images.zip", you will find 1000 images along with their corresponding prompts. For example, "000000.jpg" has its prompts in "000000.txt".
To submit your results, create a text file with 1000 lines. Each line should indicate whether this image has flaws or artifacts.

Example of a submitted txt file.

1
0
...
1

In this submission,

The first image (000000.jpg) has flaws, so the first line is 1.
The second image (000001.jpg) has no flaws, so the second line is 0.
The last image (000999.jpg) has flaws, so the last line is 1.
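Before uploading, it may help to check a Track B.2 file against the naming scheme described above. The sketch below assumes the images are numbered consecutively from 000000 to 000999 with six-digit zero padding, as the examples suggest; the function names are hypothetical.

```python
def emu_file_pairs(count=1000):
    """Return (image, prompt) filename pairs, e.g. ("000000.jpg", "000000.txt").

    Assumes consecutive six-digit numbering, inferred from the examples above.
    """
    return [(f"{i:06d}.jpg", f"{i:06d}.txt") for i in range(count)]


def parse_submission(text, expected_count=1000):
    """Parse submission text into a list of 0/1 labels, validating the format."""
    lines = [line.strip() for line in text.splitlines()]
    if len(lines) != expected_count:
        raise ValueError(f"expected {expected_count} lines, got {len(lines)}")
    for number, line in enumerate(lines, start=1):
        if line not in ("0", "1"):
            raise ValueError(f"line {number} must be 0 or 1, got {line!r}")
    return [int(line) for line in lines]


def flagged_images(text, expected_count=1000):
    """Return the image filenames that the submission marks as flawed (label 1)."""
    labels = parse_submission(text, expected_count)
    pairs = emu_file_pairs(expected_count)
    return [image for (image, _prompt), label in zip(pairs, labels) if label == 1]


if __name__ == "__main__":
    # Tiny three-image example; a real submission has 1000 lines.
    print(flagged_images("1\n0\n1\n", expected_count=3))
```

Listing the flagged filenames makes it easy to spot-check a few predictions by eye before submitting.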

Leaderboard

Auto Evaluation (Artifacts) on Video

Rank	Team	Precision	Recall
#1.	MIPAL	11.2%	41.7%
#2.	Tale Studio	11.0%	26.1%
#3.	Autuers Media	9.9%	14.9%
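The leaderboard reports precision and recall. As a reference for how these two numbers are conventionally computed from binary labels, here is a minimal sketch. It assumes that the positive class is "has flaws or artifacts" (label 1) and that the reference labels are the human annotations; the organizers' exact scoring script is not described on this page.

```python
def precision_recall(predicted, reference):
    """Compute precision and recall for the positive class (label 1).

    predicted: list of 0/1 labels from an auto-evaluation model.
    reference: list of 0/1 labels from human annotation (assumed ground truth).
    """
    if len(predicted) != len(reference):
        raise ValueError("predicted and reference must have the same length")
    pairs = list(zip(predicted, reference))
    true_pos = sum(1 for p, r in pairs if p == 1 and r == 1)
    false_pos = sum(1 for p, r in pairs if p == 1 and r == 0)
    false_neg = sum(1 for p, r in pairs if p == 0 and r == 1)
    # Guard against division by zero when a class is never predicted or never present.
    precision = true_pos / (true_pos + false_pos) if true_pos + false_pos else 0.0
    recall = true_pos / (true_pos + false_neg) if true_pos + false_neg else 0.0
    return precision, recall


if __name__ == "__main__":
    # Made-up labels for illustration only.
    p, r = precision_recall([1, 1, 0, 0, 1], [1, 0, 0, 1, 1])
    print(f"precision={p:.1%} recall={r:.1%}")
```

Precision is the share of items flagged as flawed that annotators also marked as flawed; recall is the share of annotator-marked flawed items that the model flagged.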

Important Dates

Description	Date
Submission opens for all Challenges.	3/3/2025
Submission closes for Challenge A.	4/14/2025
Crowd-sourced polling opens	4/21/2025
Submission closes for Challenge B.	6/2/2025
Crowd-sourced poll ends.	6/2/2025
Workshop Date	6/11/2025

Workshop Schedule

	Time	Agenda	Speech Title	Speaker(s)
Opening Session	09:15 - 09:30	Opening Remarks		Ji Hou
Session I	09:30 - 10:00	Keynote Speech	Boosting the Efficiency & Effectiveness of GenAI for Media Generation	Björn Ommer
Session I	10:00 - 10:30	Keynote Speech	BAGEL: The Open-Source Unified Multimodal Model	Haoqi Fan
Break	10:30 - 10:50	Coffee Break
Session II	10:50 - 11:20	Keynote Speech	Still Training GANs in 2025?	Jun-Yan Zhu
	11:20 - 11:50	Keynote Speech	Generating More with Less: A Representation Learning Perspective	Saining Xie
	11:50 - 12:20	Keynote Speech	The Canvas of You: Visual Personalization in Space and Time	Sergey Tulyakov
Closing Session	12:20 - 12:30	Closing Remarks		Yaqiao Luo

Invited Speakers

Björn Ommer Dr. Björn Ommer is a full professor of computer science at LMU Munich, where he leads the Computer Vision & Learning Group. Before joining LMU, he was a full professor at Heidelberg University and a director at both the Interdisciplinary Center for Scientific Computing (IWR) and the Heidelberg Collaboratory for Image Processing (HCI). He holds a Ph.D. from ETH Zurich, a diploma from University of Bonn, and he was a postdoctoral researcher at UC Berkeley. His research focuses on generative AI, visual understanding, and explainable neural networks. His group developed several influential approaches in generative modeling, such as Stable Diffusion, which has seen broad adoption across academia, industry, and beyond. Björn is a director of the Bavarian AI Council, an ELLIS Fellow, and he has served in senior roles at major conferences such as CVPR, ICCV, ECCV, and NeurIPS. His most recent recognitions include the German AI Prize 2024, the Eduard Rhein Technology Award, and a nomination for the German Future Prize by the President of Germany.

Jun-Yan Zhu Dr. Jun-Yan Zhu is the Michael B. Donohue Assistant Professor of Computer Science and Robotics at CMU’s School of Computer Science. Prior to joining CMU, he was a Research Scientist at Adobe Research and a postdoc at MIT CSAIL. He obtained his Ph.D. from UC Berkeley and B.E. from Tsinghua University. He studies computer vision, computer graphics, and computational photography. His current research focuses on generative models for visual storytelling. He is the recipient of the Samsung AI Researcher of the Year, the Packard Fellowships for Science and Engineering, the NSF CAREER Award, the ACM SIGGRAPH Outstanding Doctoral Dissertation Award, and the UC Berkeley EECS David J. Sakrison Memorial Prize for outstanding doctoral research, among other awards.

Sergey Tulyakov Dr. Sergey Tulyakov is the Director of Research at Snap Inc., where he leads the Creative Vision team. Sergey’s work focuses on building technology to enhance creators’ skills using computer vision, machine learning, and generative AI. His work involves 2D, 3D, video generation, editing, and personalization. To scale generative experiences to hundreds of millions of users, Sergey’s team builds the world’s most efficient mobile foundational models, which enhance multiple products at Snap Inc. Sergey pioneered video generation and unsupervised animation domains with MoCoGAN, MonkeyNet, and the First Order Motion Model, sparking several startups in the field. His work on Interactive Video Stylization received the Best in Show Award at SIGGRAPH Real-Time Live! 2020. He has published over 60 top conference papers, journals, and patents, resulting in multiple innovative products, including Real-time Neural Lenses, real-time try-on, Snap AI Video, Imagine Together, world's fastest foundational image-to-image model and many more. Before joining Snap Inc., Sergey was with Carnegie Mellon University, Microsoft, and NVIDIA. He holds a PhD from the University of Trento, Italy.

Haoqi Fan Haoqi Fan is a Research Scientist at Seed Edge, where he leads efforts to build world foundational models. He spent seven years at Facebook AI Research (FAIR), focusing on self-supervised learning and backbone design for image and video understanding. His works won the ActivityNet Challenge at ICCV 2019 and were nominated for Best Paper at CVPR 2020. He has also co-organized several tutorials at CVPR, ICCV, and ECCV.

Saining Xie Dr. Saining Xie is an Assistant Professor of Computer Science at NYU Courant and part of the CILVR group. He is also affiliated with the NYU Center for Data Science. Before that, he was a research scientist at Facebook AI Research (FAIR), Menlo Park. He received his Ph.D. and M.S. degrees from the CSE Department at UC San Diego, advised by Zhuowen Tu. During his Ph.D. studies, he also interned at NEC Labs, Adobe, Facebook, Google, and DeepMind. Prior to that, he obtained his bachelor's degree from Shanghai Jiao Tong University. His primary research interests are computer vision and machine learning.