A research team from hardware manufacturer NVIDIA has introduced eDiffi, a diffusion model for creating images from text prompts. Style transfer, style variations, and paint-with-words are part of the new AI model’s repertoire. The use of specialized denoising networks and the combination of CLIP and T5 encoders is intended to improve the model’s synthesis capabilities.
Unlike previously available models such as DALL-E from OpenAI, Stable Diffusion from Stability AI and its partners, Imagen from Google, or Make-A-Scene from Meta AI, eDiffi lets users paint directly with words: they can draw a rough, freehand sketch and label its segments with words, from which the model creates a coherent image. In principle, users divide the image area into segments, assign terms to them, and eDiffi generates a coherent image from this segmentation map.
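One plausible way to realize such word painting is to bias the cross-attention between image regions and text tokens, so that pixels inside a painted segment attend more strongly to the word assigned to that segment. The following is a minimal NumPy sketch of that idea; the function name, the additive bias scheme, and the toy dimensions are illustrative assumptions, not the paper’s exact formulation.

```python
import numpy as np

def paint_with_words_bias(attn_scores, token_masks, weight=1.0):
    """Bias cross-attention so painted regions favor their assigned words.

    attn_scores : (num_pixels, num_tokens) raw attention logits
    token_masks : (num_tokens, num_pixels) binary masks, 1 where the
                  user painted that word onto the pixel
    """
    # Add a positive bias wherever a pixel lies inside a token's region.
    bias = weight * token_masks.T              # (num_pixels, num_tokens)
    biased = attn_scores + bias
    # Softmax over tokens, as in standard cross-attention.
    e = np.exp(biased - biased.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

# Toy example: 4 pixels, 2 words ("sky", "tree");
# pixels 0-1 are painted "sky", pixels 2-3 are painted "tree".
scores = np.zeros((4, 2))
masks = np.array([[1, 1, 0, 0],    # mask for "sky"
                  [0, 0, 1, 1]])   # mask for "tree"
attn = paint_with_words_bias(scores, masks, weight=2.0)
```

With the bias applied, the "sky" pixels end up attending mostly to "sky" and the "tree" pixels mostly to "tree", even though the raw scores were uniform.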
More control over the output
With this model, users should have more control over the desired result than with the alternatives available so far. However, eDiffi (unlike Stable Diffusion) is not open source and is not yet publicly available for testing, so such assessments should be treated with caution at this point. Some functions of the new model closely resemble those of well-known AI systems for image synthesis, but the techniques are apparently implemented differently, here too with a focus on giving users more control.
The model can create images driven by text alone, but it can also be guided toward the desired result by visual input such as a drawing or sketch, or by a combination of text and image specifications. To create an image this way, users must first have a clear idea, clearer than the lucky dip of a pure text prompt. They can also control the style of the output by supplying a reference image in the desired style alongside the text prompt, which does not seem to be possible in this form with other models.
That the model offers capabilities very similar to well-known image generators on the one hand and new qualities on the other is due to changes in the underlying architecture. The team had noticed that the denoising behavior varies greatly over the course of the sampling process. They therefore decided to train an ensemble of denoising networks, each specialized for a particular noise interval; the team calls these individual specialists expert denoisers. The eDiffi pipeline chains three diffusion models together with several such expert denoisers: starting from a base resolution of 64×64 pixels, two super-resolution stages upscale the images first to 256×256 and then to 1024×1024 pixels.
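The routing idea behind the expert denoisers can be sketched in a few lines: instead of one network handling every noise level, the sampler picks the specialist responsible for the current interval. The interval boundaries and the stand-in denoiser functions below are illustrative assumptions, not the paper’s actual configuration.

```python
def make_expert(name):
    # Stand-in for a trained denoising network specialized for one interval.
    def denoise(x, t):
        return f"{name} denoised at t={t}"
    return denoise

# Experts for high, middle, and low noise, keyed by interval lower bound.
EXPERTS = [
    (0.7, make_expert("high-noise expert")),   # t in (0.7, 1.0]
    (0.3, make_expert("mid-noise expert")),    # t in (0.3, 0.7]
    (0.0, make_expert("low-noise expert")),    # t in (0.0, 0.3]
]

def route(t):
    """Return the expert denoiser responsible for noise level t in (0, 1]."""
    for lower_bound, expert in EXPERTS:
        if t > lower_bound:
            return expert
    return EXPERTS[-1][1]

# Early sampling steps (high noise) and late steps (low noise)
# are handled by different specialists.
early = route(0.9)(None, 0.9)
late = route(0.1)(None, 0.1)
```

Because each expert only ever sees its own noise interval, it can devote its full capacity to that regime, which is the motivation the team gives for the ensemble.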
The team uses pre-trained models and combines two different types of encoders: CLIP text and image embeddings (OpenAI’s method) and T5 text embeddings (Google AI’s method). Depending on the input prompt, the two encoders contribute different aspects: T5 strengthens the model’s text comprehension, while CLIP improves the alignment between image and text. Translating between text and image seems to work better with the combination, and the model succeeds in producing photorealistic images.
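A minimal sketch of how two text encoders can feed one conditioning signal is shown below. The stand-in embedding functions, the mean pooling, and the concatenation scheme are assumptions for illustration; in the actual model, both embeddings condition the denoisers’ attention layers.

```python
import numpy as np

rng = np.random.default_rng(0)

def clip_text_embed(prompt):
    # Stand-in: CLIP-style encoder mapping a prompt to one 768-dim vector.
    return rng.standard_normal(768)

def t5_text_embed(prompt):
    # Stand-in: T5-style encoder yielding one 1024-dim vector per token.
    tokens = prompt.split()
    return rng.standard_normal((len(tokens), 1024))

def build_conditioning(prompt):
    """Pool the per-token T5 embeddings and concatenate with the CLIP vector."""
    clip_vec = clip_text_embed(prompt)
    t5_pooled = t5_text_embed(prompt).mean(axis=0)
    return np.concatenate([clip_vec, t5_pooled])

cond = build_conditioning("a photo of a corgi playing a flute")
```

The combined vector carries both CLIP’s image-text alignment signal and T5’s richer language understanding, which matches the complementary roles the team describes.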
eDiffi not only appears to be state of the art, it also outperforms existing models in several respects: eDiffi is apparently able to render requested text correctly within the image, a skill that Imagen may have mastered, whereas models like DALL-E 2 and Stable Diffusion often produce fantasy text or cryptic characters for the requested lettering. In the opinion of early reviewers such as Louis Bouchard, who examined the research paper along with numerous image and text examples, eDiffi apparently delivers better results than previously available models.
Sample images are available on the companion website, and more details on the research and technology can be found in the eDiffi team’s research paper at arxiv.org.