One decoder, one sequence, two cooperating experts. That's the recipe behind MODUS's any-to-any generation. Seven scrolls to unpack it.
MODUS's fifteen modalities fall into two families. Seven are 2D, dense spatial maps that are pixel-aligned with each other: RGB, depth, surface normal, two kinds of segmentation (Seg and SAM-Seg), and two kinds of edges (Canny and SAM-Edge). The other eight are 1D, discrete or sparsely structured: a caption, two box modalities (detection and visual grounding), and five representational features, namely DINOv2 (global and local), CLIP, and ImageBind (global and local).
[Figure: modality panels for RGB, Depth, Normal, Seg, SAM-Seg, Canny, SAM-Edge, Caption, Detection, DINOv2, DINOv2 local, CLIP, ImageBind, ImageBind local]
That split isn't cosmetic. It determines how each modality enters the model. 1D modalities are discrete or sparsely structured: a caption is a sequence of word IDs, detection and grounding are a few box coordinates, and feature vectors from DINOv2/CLIP/ImageBind are global or patch-level representations. 2D modalities are dense spatial maps of the same image: each pixel matters and they are pixel-aligned with each other.
Regardless of where they come from, every token is concatenated into a single sequence with the same hidden dimension. A small <Start Modality> marker tells the decoder what's coming next. Each 2D modality contributes both VAE tokens (low-level pixel detail) and ViT tokens (semantic features); each 1D modality contributes a short run of discrete Tok tokens. The decoder doesn't care what the next block is; it just sees tokens.
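To make the layout concrete, here is a minimal NumPy sketch of how such a sequence could be assembled. It is an illustration under stated assumptions, not the MODUS implementation: the hidden size, the token-type tags, the token counts, and the order of VAE before ViT tokens inside a 2D block are all hypothetical, and random vectors stand in for real tokenizer, VAE, and ViT outputs.

```python
import numpy as np

HIDDEN = 16  # hypothetical shared hidden dimension
rng = np.random.default_rng(0)

# Hypothetical token-type tags; the decoder could use these for routing.
MARKER, TOK, VAE, VIT = 0, 1, 2, 3


def block_2d(n_vae, n_vit):
    """One 2D modality: a start marker, then VAE tokens, then ViT tokens.

    Random embeddings stand in for real encoder outputs (assumption).
    """
    emb = rng.normal(size=(1 + n_vae + n_vit, HIDDEN))
    types = np.array([MARKER] + [VAE] * n_vae + [VIT] * n_vit)
    return emb, types


def block_1d(n_tok):
    """One 1D modality: a start marker followed by discrete Tok tokens."""
    emb = rng.normal(size=(1 + n_tok, HIDDEN))
    types = np.array([MARKER] + [TOK] * n_tok)
    return emb, types


def build_sequence(blocks):
    """Concatenate per-modality blocks into one sequence plus its type tags."""
    embs, types = zip(*blocks)
    return np.concatenate(embs, axis=0), np.concatenate(types, axis=0)


# Example: a caption block followed by two 2D blocks (say, RGB and depth).
seq, token_types = build_sequence([block_1d(5), block_2d(4, 4), block_2d(4, 4)])
```

Every row of `seq` has the same width, so the decoder can treat the result as one flat sequence; `token_types` is the only place where the modality family is still visible.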
Inside the decoder, every layer routes tokens to the right expert based on the token type. The 1D Expert predicts the next discrete token autoregressively. The 2D Expert regresses a flow-matching velocity at the current noise level. Their attention is shared: every token, regardless of family, sees every preceding token.
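A rough sketch of one such layer, in NumPy, may help. It assumes a single attention head without learned projections and a one-matrix "expert" per family, and it uses a plain causal mask as a simplification; the function names and shapes are hypothetical and are not taken from MODUS.

```python
import numpy as np


def causal_attention(x):
    """Single-head causal self-attention shared by all token types.

    Learned query/key/value projections are omitted for brevity (assumption).
    """
    n, d = x.shape
    scores = x @ x.T / np.sqrt(d)
    # Each token sees itself and every preceding token, whatever its family.
    mask = np.tril(np.ones((n, n), dtype=bool))
    scores = np.where(mask, scores, -np.inf)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ x


def routed_layer(x, is_2d, w_1d, w_2d):
    """Shared attention, then a per-token expert chosen by token type."""
    h = x + causal_attention(x)
    out = np.empty_like(h)
    out[~is_2d] = np.tanh(h[~is_2d] @ w_1d)  # hypothetical 1D expert
    out[is_2d] = np.tanh(h[is_2d] @ w_2d)    # hypothetical 2D expert
    return h + out
```

The routing is a hard switch on token type rather than a learned gate, which matches the description above: the experts differ in their weights and objectives, while attention is computed once over the whole sequence.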
Each training sample picks one modality as the target (1D or 2D, doesn't matter). The remaining modalities enter the sequence as clean conditions. 1D conditions go through their tokenizers, 2D conditions go through the shared ViT (clean semantic features, no noise). Only the target carries noise: a 2D target's clean VAE latent z1 is mixed with random z0 at flow time t, and the model predicts a velocity; a 1D target's tokens are predicted autoregressively. The corresponding loss is summed over target tokens only.
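The two objectives can be sketched as follows. The text only says that z1 is mixed with z0 at flow time t; the linear interpolation and the target velocity z1 - z0 used here are the common rectified-flow choice and are an assumption, as are the function names and the use of a mean-squared error per token.

```python
import numpy as np


def flow_matching_pair(z1, z0, t):
    """Noisy input and regression target for a 2D target modality.

    Assumes a linear path: t = 0 is pure noise (z0), t = 1 is clean (z1).
    """
    z_t = (1.0 - t) * z0 + t * z1
    v_target = z1 - z0  # velocity of the linear path, constant in t
    return z_t, v_target


def velocity_loss(v_pred, v_target, target_mask):
    """Squared error per token, summed over target tokens only."""
    per_token = ((v_pred - v_target) ** 2).mean(axis=-1)
    return float((per_token * target_mask).sum())


def next_token_loss(logits, token_ids, target_mask):
    """Cross-entropy per position, summed over target tokens only."""
    logits = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    nll = -log_probs[np.arange(len(token_ids)), token_ids]
    return float((nll * target_mask).sum())
```

In both losses `target_mask` is 1 on target positions and 0 on condition positions, so clean conditions shape the prediction through attention but contribute nothing to the loss themselves.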
At inference, conditions go in as clean tokens (through the same ViT / tokenizer that training used). The target's slot starts as pure noise; the model iteratively integrates its predicted velocity until the latent is clean. Inside the target segment, noisy tokens attend to each other (flow matching needs this). But the moment we move on to the next target, only the clean re-encoded version of the just-finished target enters the context. The noisy intermediate is masked out, so every modality always conditions on clean tokens, exactly as it did in training.
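Both halves of that procedure can be sketched in a few lines of NumPy. The Euler integrator, the step count, and the mask layout below are assumptions made for illustration: the source does not name a solver, and `segment_id` / `is_noisy` are hypothetical bookkeeping arrays, not MODUS data structures.

```python
import numpy as np


def euler_sample(velocity_fn, z0, num_steps=10):
    """Integrate a predicted velocity from t = 0 (noise) to t = 1 (clean).

    Plain Euler steps are an assumption; any ODE solver could be used.
    """
    z = z0.copy()
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = i * dt
        z = z + dt * velocity_fn(z, t)
    return z


def attention_mask(segment_id, is_noisy):
    """Boolean mask[q, k]: may query position q attend to key position k?

    Clean keys are visible to themselves and to every later position.
    Noisy keys are visible only to noisy tokens of the same segment.
    """
    pos = np.arange(len(segment_id))
    clean_and_preceding = (pos[None, :] <= pos[:, None]) & ~is_noisy[None, :]
    same_noisy_segment = (
        (segment_id[:, None] == segment_id[None, :])
        & is_noisy[:, None]
        & is_noisy[None, :]
    )
    return clean_and_preceding | same_noisy_segment
```

With this mask, a later target can never read the noisy intermediate of an earlier one; it only sees whatever clean tokens were appended after that target finished, which mirrors the training setup where all conditions are clean.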
MODUS shows that a single decoder, in which an autoregressive expert and a flow-matching expert cooperate over a unified token sequence, is enough to bring pretrained foundation priors to genuinely diverse any-to-any multimodal generation.