
CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer

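A foundation model for text-to-video generation. It can generate long videos of up to 10 seconds at 16 fps and a resolution of 768x1360. Its selling points are long video duration and coherence with the text prompt. The key ingredients are a 3D-VAE, an expert transformer, staged multi-resolution training, and an effective pipeline. The results show improvements in both generation quality and semantic alignment.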
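To get a feel for the scale these numbers imply, here is a minimal arithmetic sketch. The duration, frame rate, and resolution come from the note above. Treating 768 as the height and 1360 as the width is an assumption, and the compression factors passed to `latent_shape` are hypothetical illustration values, not figures stated in this note.

```python
import math


def video_shape(duration_s, fps, height, width):
    """Return (frames, height, width) for a clip of the given length."""
    frames = duration_s * fps
    return frames, height, width


def latent_shape(frames, height, width, t_factor, s_factor):
    """Shape after a VAE that compresses time by t_factor and each
    spatial axis by s_factor. Both factors are hypothetical inputs."""
    return (
        math.ceil(frames / t_factor),
        height // s_factor,
        width // s_factor,
    )


# Values from the note: 10 s, 16 fps, 768x1360 (height/width order assumed).
frames, h, w = video_shape(10, 16, 768, 1360)
raw_pixels = frames * h * w

# Hypothetical compression factors, chosen only for illustration.
lat_frames, lat_h, lat_w = latent_shape(frames, h, w, t_factor=4, s_factor=8)
latent_positions = lat_frames * lat_h * lat_w

print(f"raw video: {frames} frames of {h}x{w}, {raw_pixels} pixels per channel")
print(f"latent:    {lat_frames} x {lat_h} x {lat_w}, {latent_positions} positions")
print(f"reduction: {raw_pixels / latent_positions:.0f}x fewer positions")
```

Even under these assumed factors, a 10-second clip is a very long sequence, which is why compressing along the time axis as well as the spatial axes (the role of a 3D-VAE) matters for long videos.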
