Yesterday at Amazon Web Services' annual re:Invent conference, Amazon CEO Andy Jassy announced Amazon Nova, a new generation of state-of-the-art foundation models that deliver frontier intelligence and industry-leading price performance. The Amazon Nova models include understanding models in three different sizes for different latency, cost, and accuracy needs. We also announced two new creative-content generation models, Amazon Nova Canvas and Amazon Nova Reel, that can generate images and videos from input text and images.
The Amazon Nova Canvas model enables a wide range of practical capabilities, including
- Text-to-image generation: Enter a text prompt and generate a new image as output;
- Image editing, including inpainting (addition of visual elements), outpainting (removal of visual elements), automatic editing through a text prompt, and background removal;
- Image variation: Input one to five images and an optional text prompt, and the model generates a new image that preserves the content of the input images but varies their style and background;
- Image conditioning: Enter a reference image and a text prompt, and the model generates an image whose layout and composition follow the reference image but whose content follows the text prompt;
- Color-guided content: Provide a list of one to ten hex color codes along with a text prompt, and the generated image will incorporate the prescribed color palette.
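The input limits in the list above (one to five images for image variation, one to ten hex colors for color-guided content) can be checked before a request is sent. The following is only a minimal sketch: the function names and the returned dictionaries are hypothetical and are not the service's request format, and the `#RRGGBB` form of a hex code is an assumption.

```python
import re

# Assumption: hex color codes use the six-digit "#RRGGBB" form.
HEX_COLOR = re.compile(r"^#[0-9A-Fa-f]{6}$")


def validate_color_guided_request(prompt, colors):
    """Check a text prompt plus a palette of 1 to 10 hex color codes."""
    if not prompt.strip():
        raise ValueError("a text prompt is required")
    if not 1 <= len(colors) <= 10:
        raise ValueError("expected between 1 and 10 hex color codes")
    bad = [c for c in colors if not HEX_COLOR.match(c)]
    if bad:
        raise ValueError(f"not a #RRGGBB hex code: {bad}")
    # Hypothetical normalized form, not the real request schema.
    return {"prompt": prompt, "colors": [c.upper() for c in colors]}


def validate_variation_request(images, prompt=None):
    """Check 1 to 5 input images; the text prompt is optional."""
    if not 1 <= len(images) <= 5:
        raise ValueError("expected between 1 and 5 input images")
    return {"images": list(images), "prompt": prompt}
```

A caller would run one of these checks first and only then build the actual request according to the service documentation.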
The Amazon Nova Reel model supports two features: (1) text-to-video and (2) text-and-image-to-video. With both features, Amazon Nova Reel generates video at 1280 x 720 resolution and 24 frames per second, with a duration of six seconds.
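To give a sense of scale, the output format above can be turned into a frame count and a raw pixel count. This is plain arithmetic on the stated numbers; the three color channels per pixel are an assumption (RGB).

```python
WIDTH, HEIGHT = 1280, 720   # stated output resolution
FPS = 24                    # stated frame rate
DURATION_S = 6              # stated clip length in seconds
CHANNELS = 3                # assumption: RGB color frames

frames = FPS * DURATION_S                  # frames per clip
pixels_per_frame = WIDTH * HEIGHT          # pixels in one frame
values_per_clip = frames * pixels_per_frame * CHANNELS

print(frames)            # 144
print(pixels_per_frame)  # 921600
print(values_per_clip)   # 398131200
```

Each six-second clip therefore contains 144 frames and, under the RGB assumption, roughly 400 million raw color values.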
Amazon Nova Canvas samples
Amazon Nova Reel samples
Prompt: “A snowman in a Venetian gondola tour, 4K, high resolution.” Made using Amazon Nova Reel.
Prompt: “A hole with a shaft of light that reveals hidden underground pools, camera rolls counterclockwise.” Made using Amazon Nova Reel.
Model architecture
Both Amazon Nova Canvas and Amazon Nova Reel are latent diffusion models with transformer backbones, or diffusion transformers. A diffusion model is one trained to iteratively denoise a sample to which noise has been added over successive steps, and a latent diffusion model is one where the denoising occurs in the representation space.
The main components of Amazon Nova Canvas and Amazon Nova Reel include
- A variational autoencoder (VAE) that maps raw pixels to visual tokens (encoder) and vice versa (decoder); VAEs are trained to output the same data they receive as input, but with an intervening bottleneck that forces them to produce a low-dimensional latent representation (encoding);
- A text encoder; and
- A transformer-based denoising network (or denoiser, for short).
The inference process for Nova Canvas/Reel to generate images/videos from a text input is as follows:
- The text encoder converts the input text into a set of text tokens;
- With the text tokens as guidance, the denoiser iteratively removes noise from a set of randomly initialized visual tokens, resulting in noise-free visual tokens;
- The VAE decoder transforms the noise-free visual tokens into color images/video frames.
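The three inference steps above can be sketched as a small loop. This is a toy illustration only, not the models' implementation: `text_encoder`, `denoiser`, and `vae_decoder` are hypothetical stand-ins written with the standard library, the token count and step count are arbitrary, and the simple update rule is an assumption chosen so the loop visibly converges.

```python
import random


def text_encoder(prompt, dim=8):
    """Toy stand-in: map the prompt to `dim` deterministic values in [-1, 1]."""
    rng = random.Random(prompt)  # string seeds give reproducible values
    return [rng.uniform(-1.0, 1.0) for _ in range(dim)]


def denoiser(visual_tokens, text_tokens):
    """Toy stand-in for the denoising network: treat whatever separates the
    current tokens from the text-implied target as the noise estimate."""
    return [v - t for v, t in zip(visual_tokens, text_tokens)]


def vae_decoder(visual_tokens):
    """Toy stand-in: map each latent value in [-1, 1] to an 8-bit intensity."""
    clipped = [min(1.0, max(-1.0, v)) for v in visual_tokens]
    return [round(255 * (v + 1.0) / 2.0) for v in clipped]


def generate(prompt, num_steps=20, seed=0):
    # Step 1: the text encoder turns the prompt into text tokens.
    text_tokens = text_encoder(prompt)
    # Step 2: start from randomly initialized visual tokens and denoise them
    # iteratively, guided by the text tokens.
    rng = random.Random(seed)
    visual_tokens = [rng.gauss(0.0, 1.0) for _ in text_tokens]
    for step in range(num_steps):
        predicted_noise = denoiser(visual_tokens, text_tokens)
        remaining = num_steps - step
        # Remove a growing fraction of the estimated noise at each step.
        visual_tokens = [v - n / remaining
                         for v, n in zip(visual_tokens, predicted_noise)]
    # Step 3: the decoder maps the denoised tokens back to pixel values.
    return vae_decoder(visual_tokens)
```

Calling `generate("a snowman in a gondola")` returns eight integers between 0 and 255. In the real models the denoiser is a learned diffusion transformer and the decoder produces full images or video frames, but the control flow follows the same three stages.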
During training, text-image or text-video pairs are sampled from the training dataset, and the diffusion transformer learns to associate the visual signals with their paired text descriptions. This allows the model to use natural language to guide the synthesis of visual signals at inference.
Specifically, during training, the VAE encoder maps the visual signal to visual tokens, and the text encoder converts the prompt to text tokens. Noise is artificially added to the visual tokens at different sampled time steps, dictated by a predefined noise schedule. The denoising network, conditioned on the text tokens, is then trained to predict the amount of noise injected into the visual tokens at each time step.
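A single training step of this kind can be sketched as follows. The article does not specify the noise schedule or the loss, so the linear DDPM-style schedule, the mean-squared-error loss, and every function name here are assumptions made for illustration. The denoiser is passed in as any callable taking `(noisy_tokens, t, text_tokens)`.

```python
import math
import random


def make_noise_schedule(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Assumed linear beta schedule; returns the cumulative signal-retention
    factor (alpha-bar) for each time step."""
    alpha_bars, running = [], 1.0
    for i in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * i / (num_steps - 1)
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars


def add_noise(clean_tokens, t, alpha_bars, rng):
    """Mix clean visual tokens with Gaussian noise as dictated by step t."""
    a = alpha_bars[t]
    noise = [rng.gauss(0.0, 1.0) for _ in clean_tokens]
    noisy = [math.sqrt(a) * x + math.sqrt(1.0 - a) * e
             for x, e in zip(clean_tokens, noise)]
    return noisy, noise


def noise_prediction_loss(predicted_noise, true_noise):
    """Mean squared error between predicted and injected noise."""
    return sum((p - q) ** 2
               for p, q in zip(predicted_noise, true_noise)) / len(true_noise)


def training_step(clean_tokens, text_tokens, denoiser, alpha_bars, rng):
    t = rng.randrange(len(alpha_bars))           # sample a time step
    noisy, noise = add_noise(clean_tokens, t, alpha_bars, rng)
    predicted = denoiser(noisy, t, text_tokens)  # conditioned on the text
    return noise_prediction_loss(predicted, noise)
```

In an actual training run, the returned loss would be backpropagated through the denoising network; that part is omitted here because it depends on the deep learning framework in use.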
Training
The training process for both models had two phases, pretraining and fine-tuning. Pretraining establishes a base model that demonstrates high performance on generic tasks, and fine-tuning further improves the model's performance in terms of visual quality and text-image and text-video alignment, especially in domains of high interest.
Inference
Runtime optimization is critical for both Amazon Nova Canvas and Amazon Nova Reel, as the iterative inference process of large diffusion transformers imposes significant computational demands. We used a number of techniques to improve inference efficiency, including ahead-of-time (AOT) compilation, multi-GPU inference, model distillation, and a more efficient sampling strategy that samples the solution trajectory densely only when needed. These optimizations were carefully selected and tailored to the specific requirements of each model, providing faster and more efficient inference.