Can you make another new model, a LLaVA-like version similar to "Opus4.7-GODs.Ghost.Codex-4B.GGuF"?

#2
by alei707 - opened

That way the model could gain OCR (e.g., reading pdf, docx, doc, rtf, txt, md, json) and vision abilities (e.g., webcam input, pattern recognition, thermal detection, facial expression detection). Could you also make an AI model that can generate images and video?

Thanks for the feedback! The current Opus4.7-GODs.Ghost.Codex-4B is highly specialized for text, reasoning, and coding tasks.

Adding vision and OCR (like LLaVA) requires a Vision-Language Model (VLM) architecture, where a vision encoder is attached to the base model. While this 4B model doesn't have that, I am actively developing upcoming multimodal models (like the Ghost Codex XI line) that will integrate image, audio, and video understanding!
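To make the "vision encoder attached to the base model" idea concrete, here is a toy sketch in plain Python. It is not how any of my models (or LLaVA itself) are implemented: the encoder, the projector weights, the dimensions, and every function name below are hypothetical stand-ins. It only shows the general wiring: image patches become feature vectors, a projector maps them into the language model's embedding space, and the result is placed in front of the text tokens.

```python
import random

VISION_DIM = 4   # size of each patch feature (hypothetical toy value)
LLM_DIM = 6      # size of the LLM's token embeddings (hypothetical toy value)

rng = random.Random(0)

def toy_vision_encoder(image, patch=2):
    """Stand-in for a real vision encoder: cut the image into patch x patch
    tiles and flatten each tile into one feature vector."""
    features = []
    for r in range(0, len(image), patch):
        for c in range(0, len(image[0]), patch):
            tile = [image[r + dr][c + dc]
                    for dr in range(patch) for dc in range(patch)]
            features.append(tile)
    return features

# Projector weights: random here, learned during training in a real VLM.
PROJECTOR = [[rng.uniform(-1, 1) for _ in range(LLM_DIM)]
             for _ in range(VISION_DIM)]

def project(feature):
    """Linear map from the vision feature space into the LLM embedding space."""
    return [sum(feature[i] * PROJECTOR[i][j] for i in range(VISION_DIM))
            for j in range(LLM_DIM)]

_EMBEDDINGS = {}

def embed_token(token):
    """Toy text embedding table: one random vector per distinct token."""
    if token not in _EMBEDDINGS:
        _EMBEDDINGS[token] = [rng.uniform(-1, 1) for _ in range(LLM_DIM)]
    return _EMBEDDINGS[token]

def build_llm_input(image, prompt_tokens):
    """Projected image patches first, then the text prompt, as one sequence."""
    image_tokens = [project(f) for f in toy_vision_encoder(image)]
    text_tokens = [embed_token(t) for t in prompt_tokens]
    return image_tokens + text_tokens

# A 4x4 grayscale "image" gives four 2x2 patches.
image = [[0.0, 0.1, 0.2, 0.3],
         [0.4, 0.5, 0.6, 0.7],
         [0.8, 0.9, 1.0, 0.9],
         [0.8, 0.7, 0.6, 0.5]]

sequence = build_llm_input(image, ["what", "is", "this"])
print(len(sequence), "vectors of size", len(sequence[0]))  # 7 vectors of size 6
```

The language model then processes that combined sequence like any other input, which is why a text-only model cannot "see" until the encoder and projector are added and trained.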

Also, just to clarify: models like LLaVA only understand images. To actually generate images and video, the model needs to be hooked up to a diffusion generator, which is a different beast entirely. Stay tuned to the WithinUsAI page for future multimodal releases!

Thank you. :)

You're welcome!

Also, check out my Gemma4-Overlooked.Thinker.Uncensored-E2B model. Unlike LLaVA, which needs a separate vision encoder, my Gemma 4 build is natively multimodal. It handles deep OCR, reads PDFs, and analyzes images and video frames right out of the box.

I just deployed a free interactive Space for it, so you can test its vision capabilities right now without downloading anything:

https://huggingface.co/spaces/WithinUsAI/Gemma4-Overlooked.Thinker.Uncensored-E2B.gguf

One quick note: it can read and analyze images/video perfectly, but it doesn't generate them (that requires a separate diffusion model). Give the Space a try and let me know what you think!

LLaVA models are multimodal, which means they combine vision understanding (images) with language understanding and generation (text). Vision-language integration: LLaVA works by connecting a visual encoder (which "sees" the image) to a Large Language Model (like Vicuna or Llama), so the model understands both text and visual input, whether animated (.gif, .mp4, .avi) or still (.png, .webp, .jpg, .bmp). That is why the model can watch, analyze, and react to video and images. Generation is separate, e.g.: Stable Diffusion → image generation, Sora → video generation. Thank you :)

Yes, exactly! That is a perfect breakdown of how LLaVA combines a visual encoder with an LLM to understand images and text.

I brought up my Gemma4-Overlooked.Thinker.Uncensored-E2B model because it handles those exact vision tasks, but it is built differently. It handles deep OCR, reading PDFs, and analyzing images and video frames natively, meaning it doesn't need a separate visual encoder bolted on to "see" like LLaVA does.

Since you are looking for a model with strong OCR and vision capabilities, I highly recommend testing it out. You can just drop an image or document straight into the free interactive Space I deployed here to see how it performs:

https://huggingface.co/spaces/WithinUsAI/Gemma4-Overlooked.Thinker.Uncensored-E2B.gguf

Drop a screenshot or a PDF in there and let me know how its analysis compares to your experience with LLaVA!
