MODUS · Method walkthrough

How MODUS reads, generates, and unifies modalities.

One decoder, one sequence, two cooperating experts. That's the recipe behind MODUS's any-to-any generation. Seven scrolls to unpack it.

01 The cast

MODUS works with 15 modalities.

Seven are 2D: dense spatial maps pixel-aligned with each other. RGB, depth, surface normal, two kinds of segmentation (Seg and SAM‑Seg), and two kinds of edges (Canny and SAM‑Edge). The other eight are 1D: discrete or sparsely structured. A caption, two box modalities (detection and visual grounding), plus five representational features: DINOv2 (global and local), CLIP, and ImageBind (global and local).

Figure: the fifteen modalities, ground truth for one scene. Top row, 2D pixel maps: RGB, Depth, Normal, Seg, SAM-Seg, Canny, SAM-Edge. Bottom row, 1D structured / representational signals: Caption, Detection, Grounding (query: "the bus"), DINOv2, DINOv2 local, CLIP, ImageBind, ImageBind local.

02 Two families

The 15 modalities split into 1D and 2D.

That split isn't cosmetic. It determines how each modality enters the model. 1D modalities are discrete or sparsely structured: a caption is a sequence of word IDs, detection and grounding are a few box coordinates, and feature vectors from DINOv2/CLIP/ImageBind are global or patch-level representations. 2D modalities are dense spatial maps of the same image: each pixel matters and they are pixel-aligned with each other.
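
The split can be written down as a small lookup table. This is only an illustrative sketch: the identifiers (`MODALITY_FAMILY`, `family_of`) and the lowercase modality keys are hypothetical names chosen here, not taken from any MODUS code. Only the modality list and the 7 / 8 split come from the text above.

```python
from collections import Counter

# Hypothetical registry: modality name -> family, following the walkthrough's split.
MODALITY_FAMILY = {
    # 2D: dense, pixel-aligned spatial maps
    "rgb": "2D", "depth": "2D", "normal": "2D", "seg": "2D",
    "sam_seg": "2D", "canny": "2D", "sam_edge": "2D",
    # 1D: discrete or sparsely structured signals
    "caption": "1D", "detection": "1D", "grounding": "1D",
    "dinov2": "1D", "dinov2_local": "1D", "clip": "1D",
    "imagebind": "1D", "imagebind_local": "1D",
}


def family_of(modality: str) -> str:
    """Return "1D" or "2D" for a known modality name (KeyError otherwise)."""
    return MODALITY_FAMILY[modality]


counts = Counter(MODALITY_FAMILY.values())
print(counts["2D"], counts["1D"])  # 7 8
```

Keeping the family as data rather than as scattered `if` checks means every later stage (tokenization, expert routing, loss selection) can branch on one lookup.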

Figure: the same fifteen modalities tagged by family. 2D (teal, 7): RGB, Depth, Normal, Seg, SAM-Seg, Canny, SAM-Edge. 1D (warm, 8): Caption, Detection, Grounding ("the bus"), DINOv2, DINOv2 local, CLIP, ImageBind, ImageBind local.

03 Tokenization

1D: each modality keeps its own tokenizer. 2D: all modalities share one VAE and one ViT.

1D: 8 modalities, 8 tokenizers. Per-modality. Caption, detection, grounding, DINOv2 (global & local), CLIP, ImageBind (global & local). Each has its own vocabulary / quantizer (Tok‑caption, Tok‑detection, Tok‑grounding, Tok‑dino, Tok‑dino‑local, Tok‑clip, Tok‑imagebind, Tok‑imagebind‑local).

2D: 7 modalities, 2 encoders (VAE + ViT). Shared. One VAE encodes every 2D modality being generated; one ViT encodes every 2D modality being conditioned on.

Figure: parallel paths on the left (8 lanes, 8 separate tokenizers) · convergence on the right (7 modalities, 2 shared encoders: VAE for generation targets, ViT for conditioning inputs)
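
As a rough sketch of that routing rule, the function below returns which encoder(s) a modality would pass through. All names (`encoders_for`, the `tok-…` strings, the lowercase keys) are hypothetical. The choice to give 2D conditions both VAE and ViT tokens, and 2D targets VAE tokens only, is one reading of the training and inference diagrams later in this walkthrough, not a confirmed implementation detail.

```python
from typing import Tuple

# Hypothetical modality keys; the 7 / 8 split follows the walkthrough.
TWO_D = {"rgb", "depth", "normal", "seg", "sam_seg", "canny", "sam_edge"}
ONE_D = {"caption", "detection", "grounding", "dinov2", "dinov2_local",
         "clip", "imagebind", "imagebind_local"}


def encoders_for(modality: str, is_target: bool) -> Tuple[str, ...]:
    """Name the encoder(s) a modality passes through (illustrative only)."""
    if modality in ONE_D:
        # Each 1D modality keeps its own tokenizer / vocabulary.
        return (f"tok-{modality}",)
    if modality in TWO_D:
        # Assumption: a target is a noisy VAE latent; a condition is encoded
        # cleanly by the shared VAE and the shared ViT.
        return ("vae",) if is_target else ("vae", "vit")
    raise KeyError(f"unknown modality: {modality}")


print(encoders_for("caption", is_target=False))  # ('tok-caption',)
print(encoders_for("depth", is_target=True))     # ('vae',)
print(encoders_for("rgb", is_target=False))      # ('vae', 'vit')
```

Whatever the exact wiring, eight tokenizers remain on the 1D side, while the 2D side never grows beyond two shared encoders however many dense modalities are added.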

04 Unified sequence

All tokens land in one ordered sequence.

Regardless of where they come from, every token is concatenated into a single sequence with the same hidden dimension. A small <Start Modality> marker tells the decoder what's coming next. Each 2D modality contributes both VAE tokens (low-level pixel detail) and ViT tokens (semantic features); each 1D modality contributes a short run of discrete Tok tokens. The decoder doesn't care what the next block is; it just sees tokens.
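
A minimal sketch of how such a sequence could be laid out, using plain tuples in place of real embeddings. The function name `build_sequence`, the token kinds, and the block length of two tokens per encoder are illustrative assumptions; real blocks would be far longer.

```python
from typing import List, Tuple

TWO_D = {"rgb", "depth", "normal", "seg", "sam_seg", "canny", "sam_edge"}

# (kind, modality), where kind is one of "start", "vae", "vit", "tok".
Token = Tuple[str, str]


def build_sequence(order: List[str], n_per_block: int = 2) -> List[Token]:
    """Concatenate one block per modality, each led by a start marker."""
    seq: List[Token] = []
    for modality in order:
        seq.append(("start", modality))
        if modality in TWO_D:
            # 2D block: continuous VAE tokens, then continuous ViT tokens.
            seq += [("vae", modality)] * n_per_block
            seq += [("vit", modality)] * n_per_block
        else:
            # 1D block: a short run of discrete tokens.
            seq += [("tok", modality)] * (2 * n_per_block)
    return seq


seq = build_sequence(["depth", "normal", "detection", "rgb"])
print(len(seq))        # 20
print(seq[0], seq[1])  # ('start', 'depth') ('vae', 'depth')
```

Because every block is introduced by a start marker, the order of modalities is free: the same decoder can read depth-then-RGB or RGB-then-depth without any structural change.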

Figure: an example sequence. <Start Depth> VAE VAE ViT ViT · <Start Normal> VAE VAE ViT ViT · <Start Detection> Tok Tok Tok Tok · <Start RGB> VAE VAE ViT ViT. Teal cells are continuous (VAE / ViT) for 2D modalities · warm cells are discrete (Tok) for 1D modalities · grey cells are start markers.

05 The decoder

One backbone, two experts, shared attention.

Inside the decoder, every layer routes tokens to the right expert based on the token type. The 1D Expert predicts the next discrete token autoregressively. The 2D Expert regresses a flow-matching velocity at the current noise level. Their attention is shared: every token, regardless of family, sees every preceding token.

1D Expert: autoregressive, next-token prediction. Sample the next token from $p(x_t \mid x_{<t})$, the same way a language model does, except the vocabulary covers all 1D modalities.

2D Expert: flow matching, noise → clean. Predict a velocity $v_\theta$ that points from the current noisy latent toward the clean target. Solve the ODE at inference time.

Shared attention context: causal attention is shared across families, so every token sees the whole past.
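
The routing idea can be sketched in a few lines of NumPy. This is a toy, not the MODUS layer: the attention has no learned projections or multiple heads, each "expert" is a single matrix followed by `tanh`, and all names (`causal_attention`, `layer`, `w_1d`, `w_2d`) are hypothetical. It shows only the two properties described above: attention is shared and causal, and the per-token transform is chosen by token family.

```python
import numpy as np


def causal_attention(x: np.ndarray) -> np.ndarray:
    """Toy single-head causal self-attention over a (T, d) sequence."""
    T, d = x.shape
    scores = x @ x.T / np.sqrt(d)
    future = np.triu(np.ones((T, T), dtype=bool), k=1)  # True above the diagonal
    scores = np.where(future, -np.inf, scores)           # hide future positions
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ x


def layer(x, is_2d, w_1d, w_2d):
    """Shared attention, then a per-token expert picked by token family."""
    h = x + causal_attention(x)   # every token attends to its whole past
    out_1d = np.tanh(h @ w_1d)    # stand-in for the 1D expert
    out_2d = np.tanh(h @ w_2d)    # stand-in for the 2D expert
    return h + np.where(is_2d[:, None], out_2d, out_1d)


rng = np.random.default_rng(0)
x = rng.normal(size=(6, 4))
is_2d = np.array([True, True, False, False, True, True])
w_1d, w_2d = rng.normal(size=(4, 4)), rng.normal(size=(4, 4))
y = layer(x, is_2d, w_1d, w_2d)
print(y.shape)  # (6, 4)
```

Computing both experts for every token and selecting with `np.where` is wasteful but keeps the sketch short; a practical implementation would gather each family's tokens and run only the matching expert.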

06 Training

One target per sample, many conditions. Loss is computed only on the target's tokens.

Each training sample picks one modality as the target; it can be 1D or 2D. The remaining modalities enter the sequence as clean conditions: 1D conditions go through their tokenizers, and 2D conditions go through the shared ViT (clean semantic features, no noise). Only the target carries noise. A 2D target's clean VAE latent $z_1$ is mixed with random noise $z_0$ at flow time $t$, and the model predicts a velocity; a 1D target's tokens are predicted autoregressively. The corresponding loss is summed over target tokens only.

Figure: training, one forward pass, one target. Conditions (clean): Caption (Tok), RGB (VAE + ViT), Depth (VAE + ViT). Target (noisy): Normal (VAE only, noisy $z_t$), trained by velocity prediction with the loss on target tokens only. Conditions are clean ground truth: no noise, no loss. The target is noisy at flow time $t$ and is the sole contributor to $\mathcal{L}$.

1D tokens, cross-entropy, summed only over the target's 1D tokens:

$$\mathcal{L}_{\mathrm{AR}} = -\sum_{t \in \text{target 1D}} \log p(x_t \mid x_{<t})$$

2D tokens, flow-matching MSE, summed only over the target's 2D tokens:

$$\mathcal{L}_{\mathrm{FM}} = \sum_{t \in \text{target 2D}} \left\| v_\theta(z_t, t) - (z_1 - z_0) \right\|^2$$

$$\mathcal{L}_{\mathrm{total}} = \mathcal{L}_{\mathrm{AR}} + \mathcal{L}_{\mathrm{FM}}$$

Single backward pass · same trunk · no modality-specific weights or heads
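
The two loss terms can be sketched in NumPy as follows. The function names are hypothetical. The linear interpolation $z_t = (1 - t)\,z_0 + t\,z_1$ is an assumption: the walkthrough only says the latents are "mixed", and the linear path is the common choice whose velocity is exactly $z_1 - z_0$.

```python
import numpy as np


def interpolate(z0, z1, t):
    """Assumed linear path from noise z0 (t=0) to clean latent z1 (t=1)."""
    return (1.0 - t) * z0 + t * z1


def flow_matching_loss(v_pred, z0, z1):
    """Squared error between predicted velocity and z1 - z0, summed over target tokens."""
    return float(np.sum((v_pred - (z1 - z0)) ** 2))


def ar_loss(logits, token_ids, target_mask):
    """Negative log-likelihood summed only where target_mask is True."""
    shifted = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    nll = -log_probs[np.arange(len(token_ids)), token_ids]
    return float(np.sum(nll * target_mask))


rng = np.random.default_rng(0)
z0, z1 = rng.normal(size=(4, 8)), rng.normal(size=(4, 8))
z_t = interpolate(z0, z1, t=0.3)  # what the 2D expert would see as input
print(flow_matching_loss(z1 - z0, z0, z1))  # 0.0 for a perfect velocity

logits = np.zeros((5, 4))  # uniform prediction over a 4-token vocabulary
ids = np.array([0, 1, 2, 3, 0])
mask = np.array([False, False, True, True, True])  # last 3 positions are the target
print(ar_loss(logits, ids, mask))  # 3 * log(4)
```

Masking rather than slicing keeps condition tokens in the forward pass, so they still shape the context, while contributing nothing to the gradient signal of the loss itself.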

07 Inference

Denoise the target. Re‑encode it clean. Then move on to the next.

At inference, conditions go in as clean tokens (through the same ViT / tokenizer that training used). The target's slot starts as pure noise; the model iteratively integrates its predicted velocity until the latent is clean. Inside the target segment, noisy tokens attend to each other (flow matching needs this). But the moment we move on to the next target, only the clean re-encoded version of the just-finished target enters its context. The noisy intermediate is masked out, so every modality always conditions on clean tokens, exactly as it did in training.
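
One way to picture the attention pattern in the paragraph above is as a boolean mask. The sketch assumes a layout with all clean condition tokens first and the current noisy target segment last, with causal attention among the clean tokens. `attention_mask` and its arguments are hypothetical names, and the exact mask MODUS uses may differ.

```python
import numpy as np


def attention_mask(n_clean: int, n_target: int) -> np.ndarray:
    """allowed[i, j] is True when token i may attend to token j."""
    total = n_clean + n_target
    allowed = np.tril(np.ones((total, total), dtype=bool))  # causal baseline
    # Noisy target tokens see each other in both directions.
    allowed[n_clean:, n_clean:] = True
    return allowed


mask = attention_mask(n_clean=3, n_target=2)
print(mask.astype(int))
# [[1 0 0 0 0]
#  [1 1 0 0 0]
#  [1 1 1 0 0]
#  [1 1 1 1 1]
#  [1 1 1 1 1]]
```

The upper-right block stays empty: clean tokens never attend to the noisy segment, so their representations do not change as the target is denoised.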

Figure: inference, iterative denoising plus clean re-encode.

Pass 1, generate target A (Depth). Conditions (clean): Caption (Tok), RGB (VAE + ViT). Depth starts as a noisy, VAE-only latent, is denoised over N flow steps, and is decoded from the clean latent $z_1$. Re-encode the clean Depth through both VAE and ViT. This clean (VAE + ViT) encoding, not the noisy intermediate, becomes a condition for pass 2.

Pass 2, target A is now a clean condition; generate target B (Normal). Conditions (clean): Caption (Tok), RGB (VAE + ViT), Depth (VAE + ViT, re-encoded). Normal starts as a noisy, VAE-only latent, is denoised over N flow steps, and is decoded from the clean latent $z_1$.

Noisy target tokens attend to each other and to clean conditions only, never to other modalities' noisy intermediates. Every modality conditions on clean tokens only, the same distribution training saw.
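
The two moving parts of inference, integrating the velocity and chaining targets through a clean re-encode, can be sketched as below. Everything here is an assumption made for illustration: the fixed-step Euler solver (the walkthrough says only "N flow steps"), the function names, and the `generate_fn` / `reencode_fn` callables standing in for the model and the shared encoders.

```python
import numpy as np


def euler_sample(velocity_fn, z_noise, n_steps=50):
    """Integrate dz/dt = v(z, t) from t=0 (noise) to t=1 (clean) with Euler steps."""
    z = np.array(z_noise, dtype=float)
    dt = 1.0 / n_steps
    for k in range(n_steps):
        z = z + dt * velocity_fn(z, k * dt)
    return z


def generate_chain(targets, conditions, generate_fn, reencode_fn):
    """Generate targets one by one; each finished target becomes a clean condition."""
    context = dict(conditions)
    for name in targets:
        clean_output = generate_fn(name, context)        # denoise: noise -> clean
        context[name] = reencode_fn(name, clean_output)  # keep only the clean encoding
    return context


# Toy velocity that always points at a known clean latent z1.
z1 = np.array([1.0, -2.0, 0.5])
z_hat = euler_sample(lambda z, t: (z1 - z) / (1.0 - t), np.zeros(3), n_steps=10)
print(np.allclose(z_hat, z1))  # True

# Stub model: records which conditions were visible when each target was made.
ctx = generate_chain(
    targets=["depth", "normal"],
    conditions={"caption": "tok", "rgb": "vae+vit"},
    generate_fn=lambda name, context: tuple(sorted(context)),
    reencode_fn=lambda name, output: ("clean", output),
)
print(ctx["normal"])  # ('clean', ('caption', 'depth', 'rgb'))
```

Note that `generate_chain` never stores a partially denoised latent in `context`; that is the property the walkthrough relies on to keep inference conditions in the same distribution as training conditions.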

END that's MODUS

One decoder, two experts, every modality.

MODUS shows that a single decoder, in which an autoregressive expert and a flow-matching expert cooperate over a unified token sequence, is enough to bring pretrained foundation priors to genuinely diverse any-to-any multimodal generation.
