A tiny diffusion model, a mobile device, and a surprising amount of magic — here’s how I built a pocket-sized photobooth that can whisk real people into new worlds in under 30 seconds.
This post unpacks how quantization, ANE-optimized kernels, and smart schedulers shrink a 6GB diffusion model into a fast, mobile-ready package.
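As a rough illustration of the first of those levers (not the post's actual pipeline), here is a minimal sketch of symmetric per-channel int8 weight quantization in plain NumPy; the layer shape and function names are made up, but it shows why quantizing weights roughly halves an fp16 checkpoint (and quarters an fp32 one).

```python
import numpy as np

def quantize_int8(weights: np.ndarray) -> tuple[np.ndarray, np.ndarray]:
    """Symmetric per-output-channel int8 quantization of a 2-D weight matrix."""
    # One scale per output row, chosen so its largest magnitude maps to 127.
    scales = np.abs(weights).max(axis=1, keepdims=True) / 127.0
    q = np.clip(np.round(weights / scales), -127, 127).astype(np.int8)
    return q, scales

def dequantize(q: np.ndarray, scales: np.ndarray) -> np.ndarray:
    """Recover approximate float weights for use at inference time."""
    return q.astype(np.float32) * scales

# A stand-in for a single fp16 layer of a diffusion UNet (shape is illustrative).
w = np.random.randn(1280, 1280).astype(np.float16)
q, scales = quantize_int8(w.astype(np.float32))

fp16_mb = w.nbytes / 1e6
int8_mb = (q.nbytes + scales.astype(np.float16).nbytes) / 1e6
err = np.abs(w.astype(np.float32) - dequantize(q, scales)).mean()
print(f"fp16 {fp16_mb:.1f} MB -> int8 {int8_mb:.1f} MB, mean abs error {err:.4f}")
```

The same arithmetic, applied across every layer of a multi-gigabyte checkpoint, is what makes the on-device footprint tractable before ANE-friendly kernels and better schedulers recover the speed.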
How I chased a diffusion model small enough for the iPhone, fast enough for real use, and resilient enough to keep its outputs from being corrupted—unpacking what works, what doesn’t, and why.
After painstakingly embedding a mini multi-modal LLaVA model, I'm ready to properly deploy it as an iOS app and enjoy the fruits of my labor. Let's see if we can truly shrink the impossible.
Armed with some newfound vision transformer knowledge, we're ready to extend the Machine Learning Compiler framework to support a new, tiny but promising multi-modal model.
Vision transformers, combined with contrastive pretraining methods like CLIP and SigLIP, make multi-modal foundation models like LLaVA possible by bridging the gap between vision and text.
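For a concrete feel of that bridge, here is a hedged sketch of the CLIP-style contrastive objective that aligns image and text embeddings in a single shared space (SigLIP swaps the softmax for a pairwise sigmoid loss); the embeddings and batch below are toy stand-ins, not anything from LLaVA itself.

```python
import numpy as np

def l2_normalize(x: np.ndarray) -> np.ndarray:
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of matched image/text pairs."""
    logits = l2_normalize(image_emb) @ l2_normalize(text_emb).T / temperature
    n = len(logits)

    def xent(lg):
        # Cross-entropy where the correct "class" for row i is column i.
        lg = lg - lg.max(axis=1, keepdims=True)
        log_probs = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -log_probs[np.arange(n), np.arange(n)].mean()

    # Pull matched pairs together in both directions: image -> text and text -> image.
    return (xent(logits) + xent(logits.T)) / 2

# Toy batch: 4 "image" embeddings from a ViT and 4 "text" embeddings from a text encoder.
rng = np.random.default_rng(0)
images, texts = rng.normal(size=(4, 512)), rng.normal(size=(4, 512))
print(f"contrastive loss on random embeddings: {clip_contrastive_loss(images, texts):.3f}")
```

Once a vision transformer is trained this way, its image features live in a space a language model can be adapted to read, which is the footing LLaVA-style models build on.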
The open-source Machine Learning Compiler Engine project is transforming foundation models into efficient and portable powerhouses.
How does gradient stability differ between REINFORCE, G(PO)MDP, and G(PO)MDP with whitening during policy learning?
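One way to see the difference is empirically. The sketch below samples trajectories from a toy Bernoulli policy and compares the noise of the three estimators: REINFORCE weights every log-probability gradient by the full trajectory return, G(PO)MDP only by the reward-to-go from that timestep onward, and the whitened variant additionally standardizes the reward-to-go across the batch. The environment and numbers are purely illustrative, not from any specific benchmark.

```python
import numpy as np

rng = np.random.default_rng(0)
T, N, theta = 20, 5000, 0.3                    # horizon, batch size, scalar policy parameter

# Bernoulli policy pi(a=1) = sigmoid(theta); reward = action + noise at each step.
p = 1.0 / (1.0 + np.exp(-theta))
actions = (rng.random((N, T)) < p).astype(float)
rewards = actions + 0.5 * rng.normal(size=(N, T))
score = actions - p                            # d/dtheta log pi(a_t) for a Bernoulli policy

# REINFORCE: every timestep's score is weighted by the *full* trajectory return.
g_reinforce = (score * rewards.sum(axis=1, keepdims=True)).sum(axis=1)

# G(PO)MDP: each score only sees rewards from its own timestep onward (causality).
reward_to_go = np.flip(np.cumsum(np.flip(rewards, axis=1), axis=1), axis=1)
g_gpomdp = (score * reward_to_go).sum(axis=1)

# G(PO)MDP + whitening: standardize the reward-to-go across the batch, per timestep.
whitened = (reward_to_go - reward_to_go.mean(axis=0)) / (reward_to_go.std(axis=0) + 1e-8)
g_whitened = (score * whitened).sum(axis=1)

for name, g in [("REINFORCE", g_reinforce), ("G(PO)MDP", g_gpomdp),
                ("G(PO)MDP + whitening", g_whitened)]:
    # Whitening rescales the gradient, so compare the scale-free noise ratio std/|mean|.
    print(f"{name:22s} mean {g.mean():+.3f}  std/|mean| {g.std() / abs(g.mean()):.2f}")
```

In this toy setup the noise ratio typically shrinks from REINFORCE to G(PO)MDP to the whitened variant, matching the usual intuition: discarding non-causal reward terms and centering the reward-to-go both remove variance from the estimator.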