🚀 What if your entire screen were generated live by an AI model instead of rendered through HTML, CSS, and a layout engine?

That's exactly what the Flipbook team shipped: https://flipbook.page (Zain Shah, Eddie Jiao, and Drew O'Carr).

No fixed UI. Just a continuous stream of intelligent, adaptive pixels that reshape themselves to fit your window, turn any region interactive, and evolve in real time. They optimized LTX Studio's video model to deliver live 1080p at 24 fps over WebSockets on Modal's serverless GPUs. The result feels less like browsing the web and more like stepping into a living, visual explanation engine.

Right now it shines at open-ended exploration and visual storytelling (the demos are genuinely mind-bending). But the implications are much bigger: interfaces that can be anything you describe, on demand. Coding environments, data dashboards, travel planners, creative tools - all potentially reimagined as fluid, pixel-perfect experiences.

This is still early (they're getting hugged to death by traffic, and some demos are sped up), but the direction is unmistakable. We're watching the first glimpses of post-GUI computing.

What do you think - is this the beginning of the end for how we build UIs, or just a fascinating experiment? I'd love to hear your take in the comments. Huge respect to the team for shipping something this ambitious and open.

#ai #programming #softwareengineering
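For anyone curious what "live pixels over WebSockets" might look like in practice, here is a minimal sketch of the serving pattern. This is my own guess, not Flipbook's code: `generate_next_frame`, the session prompt, and the event format are all hypothetical placeholders.

```python
# Hypothetical sketch: stream model-generated frames over a WebSocket at ~24 fps.
import asyncio
import time

from fastapi import FastAPI, WebSocket

app = FastAPI()

TARGET_FPS = 24
FRAME_INTERVAL = 1.0 / TARGET_FPS


def generate_next_frame(prompt: str, events: list[dict]) -> bytes:
    """Stand-in for the video model's per-frame decode.

    In a real system this would be a GPU call; here it returns empty bytes
    so the sketch stays self-contained.
    """
    return b""


@app.websocket("/stream")
async def stream(ws: WebSocket) -> None:
    await ws.accept()
    prompt = "a travel-planning interface for a week in Lisbon"  # hypothetical session prompt
    pending_events: list[dict] = []

    async def collect_events() -> None:
        # Gather user interactions (clicks, keystrokes) as they arrive from the browser.
        while True:
            pending_events.append(await ws.receive_json())

    collector = asyncio.create_task(collect_events())
    try:
        while True:
            start = time.perf_counter()
            frame = generate_next_frame(prompt, pending_events)
            pending_events.clear()
            await ws.send_bytes(frame)
            # Sleep out the rest of the frame budget to hold roughly 24 fps.
            await asyncio.sleep(max(0.0, FRAME_INTERVAL - (time.perf_counter() - start)))
    finally:
        collector.cancel()
```

The interesting design question is the loop itself: user events and frame generation run concurrently, so interactions fold into the next generated frame rather than blocking it.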
I just bookmark everything you post 🤣
Wow. Really interesting work! But also, will we need terrestrial-dyson-sphere levels of Starlink sats to handle the bandwidth for global-scale use? 😅
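For scale, a rough back-of-envelope on the bandwidth question; every number below is an assumption of mine, not a figure from the Flipbook team:

```python
# One 1080p / 24 fps stream, very rough numbers.
WIDTH, HEIGHT, FPS, BITS_PER_PIXEL = 1920, 1080, 24, 24

raw_bps = WIDTH * HEIGHT * BITS_PER_PIXEL * FPS
print(f"Uncompressed: ~{raw_bps / 1e9:.2f} Gbps")   # ~1.19 Gbps

# With ordinary video compression (assumed ~8 Mbps for 1080p), each viewer costs
# about as much as a typical HD stream rather than a satellite constellation.
COMPRESSED_MBPS = 8
print(f"Compressed:   ~{COMPRESSED_MBPS} Mbps per viewer")
```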
It’s very impressive. Wondering how long it will take before we can build industrial-grade HMIs this way?
Mind-bending. However, human nature likes predictability and familiarity when working with GUIs. Relearning a GUI every day because the model temperature is a tad too high will be tiresome. Approaches like json-render will be needed as guardrails.
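To make the guardrail idea concrete, here is one shape it could take; the schema and field names are my own invention, not an actual json-render API: the model emits a constrained JSON UI spec, validation rejects anything off-schema, and a conventional renderer draws it deterministically.

```python
# Sketch of a schema-constrained UI spec: the model proposes JSON, the renderer stays deterministic.
from typing import Literal

from pydantic import BaseModel, Field


class Component(BaseModel):
    kind: Literal["button", "text", "input"]  # anything else fails validation
    id: str
    label: str = ""


class UISpec(BaseModel):
    title: str
    components: list[Component] = Field(default_factory=list)


def parse_model_output(raw_json: str) -> UISpec:
    # Validation keeps the layout predictable even if sampling temperature drifts.
    return UISpec.model_validate_json(raw_json)


spec = parse_model_output(
    '{"title": "Trip planner", "components": [{"kind": "button", "id": "book", "label": "Book flight"}]}'
)
print(spec.components[0].label)  # -> "Book flight"
```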
Isn’t that burning a GPU just to render a bunch of buttons the browser draws for free?
Using a massive video model like LTX Studio to render a button or a text block is the definition of overkill. The cost alone makes it hard to justify: running a single session is reportedly around 2,000 times more expensive than standard web rendering. It's impressive tech, but not every problem needs a sledgehammer. It shines where the experience is the product (like storytelling or exploration), but it's a nightmare for anything requiring precise data entry or low-cost scaling.
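A back-of-envelope on how a multiplier in that ballpark can arise; every number below is an illustrative assumption, not a measured figure:

```python
# Illustrative assumptions only, to show where a ~2,000x per-session gap could come from.
GPU_COST_PER_HOUR = 2.50        # assumed serverless GPU price, USD
SESSION_MINUTES = 10            # assumed session length
STREAMS_PER_GPU = 1             # one live video stream per GPU

gpu_cost_per_session = GPU_COST_PER_HOUR * (SESSION_MINUTES / 60) / STREAMS_PER_GPU

# A static page session: mostly CDN egress plus a sliver of shared server time (assumed).
WEB_COST_PER_SESSION = 0.0002   # USD

print(f"GPU session: ${gpu_cost_per_session:.4f}")
print(f"Web session: ${WEB_COST_PER_SESSION:.4f}")
print(f"Multiplier:  ~{gpu_cost_per_session / WEB_COST_PER_SESSION:,.0f}x")
```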
This is cool! But it's also like one giant canvas element: inaccessible to other tech and super expensive to render in terms of compute. (Can you imagine running this on a phone, eating your battery? It only works once inference is small, local, and much more efficient. How many years will that take?) But it does feel like future interfaces could go this way. I suspect they'll still have a design system lurking somewhere.
Rendering dynamic UIs as 1080p video streams at 24fps is technically dazzling—but wildly inefficient. Browsers render buttons for free; GPUs rendering them pixel-by-pixel cost watts, latency, and carbon. This might work for exploratory storytelling today, but scaling it to enterprise apps would require orders-of-magnitude efficiency gains. The real innovation isn’t the output—it’s rethinking what ‘interface’ even means.
Technical question: LTX is a DiT, so stateless per generation. Yet in the demos the UI seems to keep coherence across frames and across interactions — which implies state must live somewhere outside the video model. Curious about the division of labor: is there an orchestration layer rebuilding the prompt/conditioning on every interaction (something like a state machine + keyframe references), or have you found a way to persist state in the model's KV cache across consecutive frames? Feels like the real architectural bottleneck before "more stateful models" even arrive.
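Not affiliated with the team, but here is a sketch of the first hypothesis (an orchestration layer with a state machine plus keyframe conditioning). Every name and transition in it is hypothetical, purely to make the division of labor concrete:

```python
# Speculation on the orchestration-layer hypothesis, not Flipbook's actual architecture:
# keep session state outside the stateless DiT and rebuild the conditioning
# (prompt + last keyframe + serialized state) on every interaction.
from dataclasses import dataclass, field


@dataclass
class SessionState:
    scene: str                                   # e.g. "travel_planner.results"
    facts: dict = field(default_factory=dict)    # durable app state (selections, form values)
    last_keyframe: bytes = b""                   # previous frame, reused as visual conditioning


def build_conditioning(state: SessionState, event: dict) -> dict:
    """Recompute everything the stateless video model needs for the next clip."""
    return {
        "prompt": f"Scene: {state.scene}. Known state: {state.facts}. User just did: {event}.",
        "init_image": state.last_keyframe,       # anchors visual coherence across generations
    }


def step(state: SessionState, event: dict, generate_clip) -> SessionState:
    cond = build_conditioning(state, event)
    frames = generate_clip(cond)                 # stateless call into the video model
    # The orchestrator, not the model, advances the state machine.
    if event.get("type") == "click" and event.get("target") == "book":
        state.scene = "travel_planner.checkout"
        state.facts["booking_started"] = True
    state.last_keyframe = frames[-1]
    return state
```

If something like this is what's running, coherence comes from re-feeding the last keyframe and a serialized state summary on every generation, and "more stateful models" would mostly shrink how much the orchestrator has to restate each time.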
Fascinating concept. Without the semantic structure of a DOM, how does a purely pixel-generated interface handle accessibility for screen readers?