Generative AI is impressively realistic these days, as viral memes like the Balenciaga pope suggest. The latest systems can conjure scenes ranging from city skylines to cafes, creating images that look startlingly realistic, at least at first glance.
But one of the long-standing weaknesses of text-to-image AI models is, ironically, text. Even the best models struggle to render images with legible logos, much less calligraphy or distinct fonts.
But that could change.
Last week, DeepFloyd, a research group backed by Stability AI, unveiled DeepFloyd IF, a text-to-image model that can “intelligently” embed text into images. Trained on a dataset of more than a billion images and text, DeepFloyd IF, which requires a GPU with at least 16GB of RAM to run, can create an image from a prompt like “a teddy bear wearing a ‘Deep Floyd’ T-shirt,” optionally in a range of styles.
DeepFloyd IF is available in open source, though it isn’t licensed for commercial use, at least for now. The restriction was likely motivated by the currently tenuous legal status of generative AI art models. Several commercial model vendors have come under fire from artists who allege the vendors profited from their work by scraping it from the web without permission and without compensating them.
But NightCafe, the generative art platform, was granted early access to DeepFloyd IF.
NightCafe CEO Angus Russell spoke to TechCrunch about what makes DeepFloyd IF different from other text-to-image models and why it could represent a significant step forward for generative AI.
According to Russell, DeepFloyd IF’s design was largely inspired by Google’s Imagen model, which was never released publicly. Unlike models such as OpenAI’s DALL-E 2 and Stable Diffusion, DeepFloyd IF uses multiple processes stacked in a modular architecture to generate images.
A typical diffusion model learns to gradually subtract noise from a starting image made up almost entirely of noise, moving it step by step closer to the target prompt. DeepFloyd IF performs diffusion not once but several times, first generating a 64x64px image, then upscaling it to 256x256px and finally to 1024x1024px.
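Here’s roughly what that cascade looks like through Hugging Face’s diffusers integration, which hosts DeepFloyd IF’s weights. The model IDs follow the Hub listings, but exact arguments may differ across diffusers versions, and the weights require accepting DeepFloyd’s license first; treat this as a sketch rather than the canonical recipe.

```python
# Sketch of the three-stage cascade via diffusers: a 64x64 base image,
# then two diffusion-based upscales to 256x256 and 1024x1024. Stage III
# in this integration is Stability's x4 upscaler. Details may vary by
# diffusers version; weights require accepting DeepFloyd's Hub license.
import torch
from diffusers import DiffusionPipeline

stage_1 = DiffusionPipeline.from_pretrained(
    "DeepFloyd/IF-I-XL-v1.0", variant="fp16", torch_dtype=torch.float16)
stage_2 = DiffusionPipeline.from_pretrained(
    "DeepFloyd/IF-II-L-v1.0", text_encoder=None,
    variant="fp16", torch_dtype=torch.float16)
stage_3 = DiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-x4-upscaler", torch_dtype=torch.float16)
for stage in (stage_1, stage_2, stage_3):
    stage.enable_model_cpu_offload()  # helps fit within a ~16GB GPU budget

prompt = "a red cube on top of a pink sphere"
prompt_embeds, negative_embeds = stage_1.encode_prompt(prompt)

image = stage_1(prompt_embeds=prompt_embeds,
                negative_prompt_embeds=negative_embeds,
                output_type="pt").images            # 64x64
image = stage_2(image=image, prompt_embeds=prompt_embeds,
                negative_prompt_embeds=negative_embeds,
                output_type="pt").images            # 256x256
image = stage_3(prompt=prompt, image=image).images[0]  # 1024x1024
image.save("cube.png")
```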
Why the need for multiple diffusion steps? DeepFloyd IF works directly with pixels, Russell explained. Most diffusion models are latent diffusion models, which essentially means they operate in a lower-dimensional space that represents an image’s many pixels approximately rather than exactly.
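To put rough numbers on that trade-off, here’s a back-of-the-envelope comparison using Stable Diffusion v1’s well-documented latent dimensions:

```python
# Back-of-the-envelope: how many values the denoiser handles in latent
# space (Stable Diffusion v1: a 64x64 latent with 4 channels) vs. in
# pixel space at the same 512x512 RGB output resolution.
latent_values = 64 * 64 * 4       # 16,384 values
pixel_values = 512 * 512 * 3      # 786,432 values
print(pixel_values / latent_values)  # 48.0, ~48x fewer values in latent space
```

Working in pixel space is far costlier per step, which is part of why DeepFloyd IF starts at 64x64 and upscales rather than denoising at full resolution from the outset.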
The other key difference between DeepFloyd IF and models like Stable Diffusion and DALL-E 2 is that the former uses a large language model to understand and represent prompts as a vector, a basic data structure. Thanks to the size of the large language model built into DeepFloyd IF’s architecture, the model is particularly good at understanding complex prompts and even spatial relationships described in them (e.g., “a red cube on top of a pink sphere”).
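DeepFloyd’s documentation identifies that language model as a frozen T5-XXL encoder. A minimal sketch of the encoding step follows, using Hugging Face transformers with the small T5 variant substituted so the example runs anywhere:

```python
# Sketch of prompt encoding with a frozen LLM encoder. DeepFloyd IF uses
# T5-XXL; "t5-small" stands in here as a lightweight substitute.
from transformers import T5EncoderModel, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-small")
encoder = T5EncoderModel.from_pretrained("t5-small")

tokens = tokenizer("a red cube on top of a pink sphere", return_tensors="pt")
# The per-token hidden states act as the conditioning vectors that the
# diffusion stages attend to during denoising.
embeddings = encoder(**tokens).last_hidden_state
print(embeddings.shape)  # torch.Size([1, sequence_length, hidden_size])
```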
“It’s also very good at generating legible and correctly spelled text in images, and can even understand prompts in multiple languages,” Russell added. “Of these capabilities, the ability to generate legible text in images is perhaps the biggest advance, and the one that sets DeepFloyd IF apart from other algorithms.”
Because DeepFloyd IF can generate text in images quite proficiently, Russell hopes it will unlock a wave of new generative art possibilities: think logo design, web design, posters, billboards and even memes. The model should also be much better at generating things like hands, he says, and because it can understand prompts in other languages, it should be able to generate text in those languages, too.
“NightCafe users are excited about DeepFloyd IF in large part because of the possibilities unlocked by generating text on images,” said Russell. “Stable Diffusion XL was the first open source algorithm to make advances in text generation: it can accurately generate one or two words some of the time, but it’s still not good enough for use cases where text is important.”
That’s not to say that DeepFloyd IF is the holy grail of text-to-image models. Russell points out that the base model doesn’t generate images that are as aesthetically pleasing as those from some diffusion models, though he expects fine-tuning to improve that.
But the bigger question, for me, is to what extent DeepFloyd IF suffers from the same flaws as its generative AI brethren.
A growing body of research has uncovered racial, ethnic, gender and other stereotypes in image-generating AI, including Stable Diffusion. Just this month, researchers at AI startup Hugging Face and the University of Leipzig published a tool demonstrating that models including Stable Diffusion and OpenAI’s DALL-E 2 tend to produce images of people who appear white and male, especially when asked to depict people in positions of authority.
The DeepFloyd team, to their credit, notes the potential for bias in the fine print that accompanies DeepFloyd IF:
Texts and images from communities and cultures using other languages may not be sufficiently taken into account. This affects the overall output of the model, as white and western cultures are often set as the default.
Apart from this, DeepFloyd IF, like other open source generative models, could be used for harm, such as generating pornographic celebrity deepfakes and graphic depictions of violence. On the official DeepFloyd IF web page, the DeepFloyd team says it used “custom filters” to remove watermarks, “NSFW” material and “other inappropriate content” from the training data.
But it’s unclear exactly what content was removed and how much may have slipped through. In the end, time will tell.