Feature requests: 2K quality, object touch & grip interaction, eyelines, multiple characters, controllable relighting, simple matting for footage.
Multiple-reference support for a larger, smoother workflow is a long-time request. Even delivered in steps, it would let us accomplish a lot more with AI.
For even a single image in a story scene that needs to be super detailed and high quality at 1440p,
here is the workflow I am imagining, in case you can explore and discover simple, integrated solutions.
Text & image workflow to video:
- Make an AI world, e.g. a room, office, or arena: basically a proper SET with depth and lighting, importable from 3D and video references. A consistent world where the scene is generated for every video attempt makes life so much easier; changing the background leads to a lot of wasted video generations. A fixed space would solve this.
- Take character reference images [1, 2] or LoRAs (king character sheet & fighter character sheet) and put them in a scene reference [3] (the arena).
- They are located at specific positions in the space (inpainted inside the arena) and blended well.
- Both have a specific pose reference [4, 5] (the king is in an attacking position, the fighter in a blocking position).
- Both hold specific props according to the story, refs [6, 7] (the king has a specific sword he won, the fighter has a specific shield).
- They are giving the facial expressions of refs [vid 8, vid 9] (the king is angry; the fighter is in inner turmoil about having to fight his own king).
- They are given a specific full-body movement reference (video ref 10).
- They are given a specific acting video for emotion and facial reference, plus dialogue and lip sync (video ref 11, audio ref 12).
- They are given a camera-movement reference in the scene (video ref 13), or text-to-camera-movement, to capture the whole scene at once from multiple angles.
- Turning the inferred 3D space into Gaussian splats (3D/4D), so the camera angle can be changed later while keeping the performance and animation, would be a great time saver. It could also be used for multicam.
- Reprocess the final output to keep faces clear when characters move away from the camera or move fast; blurry faces and cloth during high-speed motion need to be fixed and kept intact throughout the movement.
- Match eyelines when moving, looking, glancing, or squinting, with proper eyeball tracking on people with smaller eyelids. No dead expressions. For any minor imperfection, allow real-time mocap driving at low resolution, then ControlNet inference to high resolution for precise directing.
References: acting expression, movement pre-processor, cloth details, position details. Instead of 2D-based pre-processors, make 3D-based ones that can track multiple objects perfectly and inject that data to compose multiple layers.
Blending: lighting when everything is migrated into the scene without changing the performance or details of the input video; scene physics for water and air; VACE black-and-white masks for tagging specific objects for animation would be great, but multiple targeted injections would also speed things up. Everything needs to blend perfectly. This would help all the other AI video and filmmaking folks a lot: less wastage, more eco-friendly, and faster, more precise output, with time saved using AI.
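The multi-reference steps above could be expressed as one declarative scene spec that gets handed to the model in a single pass. A minimal sketch in Python; every class, field, file name, and the `compose()` helper here are hypothetical illustrations, not any existing tool's API:

```python
from dataclasses import dataclass, field

# Hypothetical spec bundling references [1..13] from the workflow above
# into a single inference request. No real API is implied.

@dataclass
class CharacterSpec:
    name: str
    identity_refs: list        # [1]/[2] character-sheet images or a LoRA
    pose_ref: str              # [4]/[5] pose image
    props: list                # [6]/[7] objects held
    expression_ref: str        # [8]/[9] facial-expression video
    acting_ref: str            # [11] acting / lip-sync video
    dialogue_audio: str        # [12] dialogue track
    position: tuple            # (x, y, z) placement inside the set

@dataclass
class SceneSpec:
    world_ref: str             # [3] the fixed "set" (arena)
    camera_ref: str            # [13] camera-movement video or text prompt
    movement_ref: str          # [10] full-body movement video
    characters: list = field(default_factory=list)

    def compose(self) -> dict:
        """Flatten everything into one request payload."""
        return {
            "world": self.world_ref,
            "camera": self.camera_ref,
            "movement": self.movement_ref,
            "characters": [c.__dict__ for c in self.characters],
        }

king = CharacterSpec("king", ["king_sheet.png"], "attack_pose.png",
                     ["won_sword.png"], "angry.mp4", "acting_king.mp4",
                     "king_lines.wav", (1.0, 0.0, 2.0))
fighter = CharacterSpec("fighter", ["fighter_sheet.png"], "block_pose.png",
                        ["shield.png"], "turmoil.mp4", "acting_fighter.mp4",
                        "fighter_lines.wav", (3.0, 0.0, 2.0))
scene = SceneSpec("arena.glb", "camera_orbit.mp4", "fight_moves.mp4",
                  [king, fighter])
payload = scene.compose()
```

The point of a spec like this is that one request carries every reference at once, instead of chaining separate generations and wasting attempts on re-blending.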
Here are a few more feature requests:
- Ability to inject a performance without inheriting the mocap framing. Currently the target composition gets locked to a framing similar to the motion-driving video. If you could take the motion data into 3D space and apply it to a character, that would help much more. In your examples, everything was a webcam driving video with a similar framing at the end; nothing showed a webcam driving a character performing an action in a wide shot. A solution for that would unlock mocap filmmaking.
- Keep the texture and lighting of skin (slightly oily); AI-image-to-photoreal conversion would be nice for production workflows.
- Small details jump right now; better coherence is needed for tongues and beards, and for full rotations, fast objects, and blocked views with multiple depth-controlled persons. Instead of pose driving alone, 3D depth would help maintain consistency, as the model gets confused between opening the mouth to yawn and bringing the tongue forward.
- Ability to store body mocap & motion and inject it into the target video at any orientation in that video's 3D space, so it isn't confined to the driving video's blocking and composition. E.g. multiple vertical videos could be stored and mapped in 3D for a character and their interactions for AI generation.
- Ability to target multiple acting performances to multiple AI characters at once; 95% of film scenes happen with other characters.
- Detailed face and cloth textures at 2K output quality, to emulate sensor overscan while working on a consumer GPU.
- Object touch & grip interaction for characters; a story happens in a space, and not all shots are in front of a webcam.
- Eyelines: head tilt, neck lines (on a side-facing head, side eyes can add realism).
- An object can obscure the foreground while animation is still mapped to a character in the background.
- Generation of movement on alpha layers for easier layer editing, because the model is great at following complex instructions on a simple background. If only we could get a performance, place it in the scene later, and blend it well.
- Very high-detail hair, fur, and edge matting, to bridge traditionally filmed performances with the AI world, since AI footage's greenscreens are compressed and bad. A simple matting tool would be useful.
- A scene composer, where all elements (audio, video, acting, scenery, movement, camera, characters) can be injected for inference in simple ways.
- Better point & target ControlNets, where draw-to-video on layers can happen from foreground to background using text or video.
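The "store mocap and inject at any orientation" request above essentially asks that joint data be kept root-relative (relative to the performer's hips), so the capture's framing never matters, and that a world transform be applied only at placement time. A minimal sketch of that idea, with hypothetical joint data and a simple yaw rotation; this is not any tool's real pipeline:

```python
import math

# Hypothetical: joints are stored relative to the performer's root (hips),
# so the driving video's framing and composition are irrelevant. At
# placement time we rotate by a chosen yaw and translate to any spot
# in the target scene.

def place_pose(root_relative_joints, world_pos, yaw_deg):
    """Map root-relative (x, y, z) joints into the target scene."""
    yaw = math.radians(yaw_deg)
    cos_y, sin_y = math.cos(yaw), math.sin(yaw)
    placed = []
    for x, y, z in root_relative_joints:
        # Rotate about the vertical (y) axis, then translate.
        xr = cos_y * x + sin_y * z
        zr = -sin_y * x + cos_y * z
        placed.append((xr + world_pos[0], y + world_pos[1], zr + world_pos[2]))
    return placed

# One frame of a stored (hypothetical) capture: hips at the origin,
# one hand reaching forward.
frame = [(0.0, 1.0, 0.0), (0.3, 1.4, 0.5)]
# The same performance, dropped at (10, 0, 5) and turned 90 degrees.
placed = place_pose(frame, (10.0, 0.0, 5.0), 90.0)
```

Because the stored frames never encode a camera, the same capture can drive a close-up or a wide shot, which is exactly what unlocks mocap filmmaking in the sense described above.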
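The alpha-layer and edge-matting requests above both reduce to the standard "over" compositing operation: a generated performance on its own alpha layer can be blended into any set later only if the matte at hair and cloth edges is clean. A tiny sketch of that operation on single pixels (the values are hypothetical):

```python
def over(fg_rgb, fg_alpha, bg_rgb):
    """Composite a foreground pixel over a background pixel.
    Straight (non-premultiplied) alpha: out = fg*a + bg*(1 - a)."""
    return tuple(f * fg_alpha + b * (1.0 - fg_alpha)
                 for f, b in zip(fg_rgb, bg_rgb))

# A half-transparent hair-edge pixel (hypothetical values) over a grey set.
edge = over((0.8, 0.6, 0.4), 0.5, (0.2, 0.2, 0.2))
```

Compressed greenscreens break this because the per-pixel `fg_alpha` at fine hair and fur edges can no longer be recovered, which is why a dedicated high-detail matting tool matters.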
Any piece of work happens over multiple revisions and iterations. AI generation is not iterative in nature, nor fast enough for high-end production, nor the highest quality at the moment, so cost, time, and quality trade-offs come in. If anyone could get what they want in fewer revisions and fewer generations, it would save not only time but also cost, energy, and electricity in global power consumption.
Thank you so much. These are challenges I faced, so I mentioned them. I will be looking forward to the next models and the next superior reference-based video generators.