I think the reason the shape of movies works is that it uses the entire video: each cinematic shot is almost visible because most (all?) frames are included in the generated image. Yours uses fewer than ten frames, so no patterns emerge that look visually interesting and pleasing.
Figuring out what a shot is vs. a scene is really neat stuff.
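
For anyone curious, here is a rough sketch of one common way to guess where shots change: compare the brightness histogram of each frame with the previous one and flag big jumps. This is just an illustration, not how either project does it. The function name `shot_boundaries`, the 8-bin histogram, and the 0.5 threshold are all my own assumptions, and frames are simplified to flat lists of grayscale values (0-255).

```python
def shot_boundaries(frames, threshold=0.5, bins=8):
    """Return indices of frames that likely start a new shot.

    frames: list of frames, each a flat list of grayscale pixels (0-255).
    threshold and bins are illustrative defaults, not tuned values.
    """
    def histogram(frame):
        # Normalized brightness histogram so frame size does not matter.
        counts = [0] * bins
        for pixel in frame:
            counts[min(pixel * bins // 256, bins - 1)] += 1
        total = len(frame) or 1
        return [c / total for c in counts]

    cuts = []
    prev = None
    for i, frame in enumerate(frames):
        hist = histogram(frame)
        if prev is not None:
            # Half the L1 distance between histograms, in the range [0, 1].
            diff = sum(abs(a - b) for a, b in zip(hist, prev)) / 2
            if diff > threshold:
                cuts.append(i)
        prev = hist
    return cuts


# Three dark frames followed by two bright ones: one cut, at index 3.
frames = [[10] * 16] * 3 + [[200] * 16] * 2
print(shot_boundaries(frames))
```

Grouping shots into scenes is the harder part, since a scene can contain many cuts that still belong together, and a simple histogram jump can't tell you that.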