MODUS — How it works

01 The cast

MODUS works with 15 modalities.

Seven are 2D: dense spatial maps that are pixel-aligned with each other. These are RGB, depth, surface normal, two kinds of segmentation (Seg and SAM-Seg), and two kinds of edges (Canny and SAM-Edge). The other eight are 1D: discrete or sparsely structured. These are a caption, two box modalities (detection and visual grounding), and five representational features: DINOv2 (global and local), CLIP, and ImageBind (global and local).

[Figure: the fifteen modalities, ground truth for one scene. Top row, 2D pixel maps: RGB, Depth, Normal, Seg, SAM-Seg, Canny, SAM-Edge. Bottom row, 1D structured / representational signals: Caption, Detection, Grounding, DINOv2, DINOv2 local, CLIP, ImageBind, ImageBind local.]

02 Two families

The 15 modalities split into 1D and 2D.

That split isn't cosmetic. It determines how each modality enters the model. 1D modalities are discrete or sparsely structured: a caption is a sequence of word IDs, detection and grounding are a few box coordinates, and feature vectors from DINOv2/CLIP/ImageBind are global or patch-level representations. 2D modalities are dense spatial maps of the same image: each pixel matters and they are pixel-aligned with each other.

[Figure: the fifteen modalities tagged by family. 2D: RGB, Depth, Normal, Seg, SAM-Seg, Canny, SAM-Edge. 1D: Caption, Detection, Grounding, DINOv2, DINOv2 local, CLIP, ImageBind, ImageBind local. Teal = 2D family (7) · warm = 1D family (8).]

03 Tokenization

1D: each modality keeps its own tokenizer. 2D: all modalities share one VAE and one ViT.

8 modalities → 8 tokenizers

Per-modality. Caption, detection, grounding, DINOv2 (global & local), CLIP, ImageBind (global & local). Each has its own vocabulary / quantizer.

7 modalities → 2 encoders (VAE + ViT)

Shared. One VAE encodes every 2D modality being generated; one ViT encodes every 2D modality being conditioned on.

parallel paths on the left · convergence on the right
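The family split and the tokenization rule above can be sketched as a plain lookup. This is only an illustration: the dictionary keys and the `entry_path` helper are hypothetical names, not identifiers from MODUS, and the returned strings merely label which path a modality would take.

```python
from collections import Counter

# Family assignment as listed in the text: seven 2D maps, eight 1D signals.
# The keys are illustrative names chosen for this sketch.
FAMILY = {
    "rgb": "2D", "depth": "2D", "normal": "2D", "seg": "2D",
    "sam_seg": "2D", "canny": "2D", "sam_edge": "2D",
    "caption": "1D", "detection": "1D", "grounding": "1D",
    "dinov2_global": "1D", "dinov2_local": "1D", "clip": "1D",
    "imagebind_global": "1D", "imagebind_local": "1D",
}


def entry_path(modality: str) -> str:
    """Label how a modality enters the model, based only on its family."""
    if FAMILY[modality] == "2D":
        return "shared VAE / ViT"               # one pair of encoders for all 2D maps
    return f"own tokenizer ({modality})"        # one tokenizer per 1D modality


counts = Counter(FAMILY.values())
tokenizers = {entry_path(m) for m in FAMILY if FAMILY[m] == "1D"}

print(counts["2D"], "2D modalities share 2 encoders")
print(counts["1D"], "1D modalities use", len(tokenizers), "tokenizers")
```

The point of the sketch is the asymmetry: adding a 2D modality adds no new encoder, while adding a 1D modality adds a tokenizer.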

04 Unified sequence

All tokens land in one ordered sequence.

Regardless of where they come from, every token is concatenated into a single sequence with the same hidden dimension. A small <Start Modality> marker tells the decoder what's coming next. Each 2D modality contributes both VAE tokens (low-level pixel detail) and ViT tokens (semantic features); each 1D modality contributes a short run of discrete Tok tokens. The decoder doesn't care what the next block is; it just sees tokens.
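As a rough sketch of that layout, the snippet below builds the block order for a handful of modalities. The `build_sequence` helper and the block names (`start`, `vae`, `vit`, `tok`) are assumptions made for illustration; real blocks would hold many tokens of a shared hidden dimension rather than a single label.

```python
from typing import List, Tuple

# 2D family as listed in the text; names are illustrative.
TWO_D = {"rgb", "depth", "normal", "seg", "sam_seg", "canny", "sam_edge"}

Block = Tuple[str, str]  # (block kind, modality name)


def build_sequence(order: List[str]) -> List[Block]:
    """Lay out one sequence: a start marker, then that modality's token blocks."""
    seq: List[Block] = []
    for name in order:
        seq.append(("start", name))        # <Start Modality> marker
        if name in TWO_D:
            seq.append(("vae", name))      # continuous, low-level pixel detail
            seq.append(("vit", name))      # continuous, semantic features
        else:
            seq.append(("tok", name))      # discrete tokens from the 1D tokenizer
    return seq


layout = build_sequence(["depth", "normal", "detection", "rgb"])
for kind, name in layout:
    print(f"{kind:>5}  {name}")
```

Everything ends up in one flat list, which is what lets a single decoder consume it without knowing in advance which modalities are present.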

[Figure: an example unified sequence: <Start Depth> VAE ViT · <Start Normal> VAE ViT · <Start Detection> Tok · <Start RGB> VAE ViT. Teal cells are continuous (VAE / ViT) for 2D modalities, warm cells are discrete (Tok) for 1D modalities, and grey cells are start markers.]

05 The decoder

One backbone, two experts, shared attention.

Inside the decoder, every layer routes tokens to the right expert based on the token type. The 1D Expert predicts the next discrete token autoregressively. The 2D Expert regresses a flow-matching velocity at the current noise level. Their attention is shared: every token, regardless of family, sees every preceding token.
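A minimal sketch of the two ideas in this paragraph, routing by token type and a shared causal context, might look like the following. The `route` and `causal_context` helpers are hypothetical. Sending start markers to the 1D expert and letting a token see its own position are assumptions of this sketch, not details stated above.

```python
from typing import List


def route(kind: str) -> str:
    """Pick the expert for one token from its type alone."""
    # Assumption: start markers are discrete, so they go to the 1D expert.
    return "2D expert" if kind in ("vae", "vit") else "1D expert"


def causal_context(position: int) -> List[int]:
    """Shared attention: positions visible to a token, whatever its family.

    Assumption: the token also sees itself, as in a standard causal mask.
    """
    return list(range(position + 1))


kinds = ["start", "tok", "start", "vae", "vit", "start", "tok"]
experts = [route(k) for k in kinds]

for pos, (kind, expert) in enumerate(zip(kinds, experts)):
    print(pos, kind, expert, "sees", causal_context(pos))
```

Only the expert choice depends on the token's family; the attention context is computed the same way for every position.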

1D Expert · autoregressive · next-token prediction. Sample the next token from p(x_t | x_<t), the same way a language model does, except that the vocabulary covers all 1D modalities.

2D Expert · flow matching · noise → clean. Predict a velocity v_θ that points from the current noisy latent toward the clean target, then solve the ODE at inference time.

Shared attention context: causal attention is shared across families, so every token sees the whole past.

06 Training

One target per sample, many conditions. Loss is computed only on the target's tokens.

Each training sample picks one modality as the target (1D or 2D, it doesn't matter). The remaining modalities enter the sequence as clean conditions: 1D conditions go through their tokenizers, and 2D conditions go through the shared ViT (clean semantic features, no noise). Only the target carries noise. A 2D target's clean VAE latent z₁ is mixed with random z₀ at flow time t, and the model predicts a velocity; a 1D target's tokens are predicted autoregressively. The corresponding loss is summed over target tokens only.
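The noising step for a 2D target can be sketched as below. Two assumptions are baked in: z₀ is drawn from a standard Gaussian (the text only says "random"), and the mix is the linear path z_t = (1 − t)·z₀ + t·z₁, which is the path whose time derivative is the z₁ − z₀ regression target used in the loss. `noise_target` is a hypothetical helper and the latents are short Python lists standing in for VAE latents.

```python
import random
from typing import List, Tuple


def noise_target(
    z1: List[float], t: float, rng: random.Random
) -> Tuple[List[float], List[float], List[float]]:
    """Mix a clean latent z1 with noise z0 at flow time t.

    Returns (z0, z_t, velocity target), assuming the linear path
    z_t = (1 - t) * z0 + t * z1, so the target velocity is z1 - z0.
    """
    z0 = [rng.gauss(0.0, 1.0) for _ in z1]                  # assumed Gaussian noise
    zt = [(1.0 - t) * a + t * b for a, b in zip(z0, z1)]    # noisy latent at time t
    velocity = [b - a for a, b in zip(z0, z1)]              # constant along the path
    return z0, zt, velocity


rng = random.Random(0)
z1 = [0.5, -1.0, 2.0]
t = 0.25
z0, zt, v = noise_target(z1, t, rng)

# Following the target velocity for the remaining time (1 - t) lands on z1.
recovered = [a + (1.0 - t) * b for a, b in zip(zt, v)]
print(recovered)
```

At t = 0 the latent is pure noise and at t = 1 it is the clean target, which matches the direction the sampler integrates at inference.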

[Figure: Training · one forward pass · one target. Conditions, clean: Caption (Tok), RGB (VAE + ViT), Depth (VAE + ViT). Target, noisy: Normal (VAE only, z_t noisy). The model predicts a velocity and ℒ_FM is computed on the target tokens only. Conditions are clean ground truth with no noise and no loss; the target is noisy at flow time t and is the sole contributor to ℒ.]

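1D tokens · cross-entropy, summed only over the target's 1D tokens:

$$\mathcal{L}_{\mathrm{AR}} = -\sum_{t \,\in\, \text{target 1D}} \log p(x_t \mid x_{<t})$$

2D tokens · flow-matching MSE, summed only over the target's 2D tokens (i indexes tokens, t is the flow time):

$$\mathcal{L}_{\mathrm{FM}} = \sum_{i \,\in\, \text{target 2D}} \left\| v_\theta(z_t, t)_i - (z_1 - z_0)_i \right\|^2$$

$$\mathcal{L}_{\mathrm{total}} = \mathcal{L}_{\mathrm{AR}} + \mathcal{L}_{\mathrm{FM}}$$

Because each sample has a single target, only one of the two terms is non-zero for any given sample.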

single backward pass · same trunk · no modality-specific weights or heads
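The "target tokens only" rule amounts to a mask over positions. The sketch below shows both loss terms with such a mask. `ar_loss`, `fm_loss`, and the `is_target` flags are hypothetical names, and the probabilities and latents are made-up toy numbers rather than model outputs.

```python
import math
from typing import Sequence


def ar_loss(
    probs: Sequence[Sequence[float]], tokens: Sequence[int], is_target: Sequence[bool]
) -> float:
    """Negative log-likelihood summed over target positions only."""
    total = 0.0
    for p, x, keep in zip(probs, tokens, is_target):
        if keep:                      # condition tokens contribute nothing
            total -= math.log(p[x])
    return total


def fm_loss(
    pred: Sequence[Sequence[float]], target: Sequence[Sequence[float]], is_target: Sequence[bool]
) -> float:
    """Squared velocity error summed over target positions only."""
    total = 0.0
    for p, q, keep in zip(pred, target, is_target):
        if keep:
            total += sum((a - b) ** 2 for a, b in zip(p, q))
    return total


# Toy 1D-target sample: position 0 is a condition, positions 1 and 2 are the target.
probs = [[0.9, 0.1], [0.5, 0.5], [0.25, 0.75]]
loss_1d = ar_loss(probs, tokens=[0, 1, 1], is_target=[False, True, True])

# Toy 2D-target sample: position 0 is a condition, position 1 is the target.
loss_2d = fm_loss(
    pred=[[9.0, 9.0], [1.0, 2.0]],
    target=[[0.0, 0.0], [1.0, 0.0]],
    is_target=[False, True],
)

print(loss_1d, loss_2d)
```

In both cases the condition position is ignored no matter how wrong its prediction is, so gradients come from the target alone.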

07 Inference

Denoise the target. Re‑encode it clean. Then move on to the next.

At inference, conditions go in as clean tokens (through the same ViT / tokenizer that training used). The target's slot starts as pure noise, and the model iteratively integrates its predicted velocity until the latent is clean. Inside the target segment, noisy tokens attend to each other (flow matching needs this). But the moment we move on to the next target, only the clean, re-encoded version of the just-finished target enters its context. The noisy intermediate is masked out, so every modality always conditions on clean tokens, exactly as it did in training.
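The loop described above can be sketched with a plain Euler integrator. Everything here is a toy under stated assumptions: `make_velocity` is a stand-in for the decoder that simply points at a known answer in the hypothetical `GOALS` table, Euler stepping is one possible ODE solver (the text does not name one), and "re-encoding" is reduced to storing the finished latent in the context.

```python
import random
from typing import Callable, Dict, List

Latent = List[float]
Velocity = Callable[[Latent, float], Latent]

# Hypothetical "right answers" so the toy velocity field has something to aim at.
GOALS: Dict[str, Latent] = {"depth": [1.0, 2.0, 3.0], "normal": [0.0, -1.0, 0.5]}
seen: Dict[str, List[str]] = {}   # clean conditions visible while generating each target


def make_velocity(name: str, context: Dict[str, Latent]) -> Velocity:
    """Stand-in for the decoder: velocity from the current latent toward GOALS[name]."""
    seen[name] = sorted(context)
    goal = GOALS[name]
    return lambda z, t: [(g - zi) / (1.0 - t) for g, zi in zip(goal, z)]


def euler_sample(velocity: Velocity, dim: int, steps: int, rng: random.Random) -> Latent:
    """Integrate dz/dt = velocity(z, t) from t = 0 (pure noise) to t = 1 (clean)."""
    z = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    dt = 1.0 / steps
    for k in range(steps):
        z = [zi + dt * vi for zi, vi in zip(z, velocity(z, k * dt))]
    return z


def generate_all(targets: List[str], conditions: Dict[str, Latent],
                 steps: int = 8, seed: int = 0) -> Dict[str, Latent]:
    rng = random.Random(seed)
    context = dict(conditions)                     # clean tokens only
    for name in targets:
        clean = euler_sample(make_velocity(name, context), len(GOALS[name]), steps, rng)
        context[name] = clean                      # only the finished result is kept
    return context


result = generate_all(["depth", "normal"], {"caption": [0.0], "rgb": [0.0]})
print(seen)
```

The noisy intermediates live only inside `euler_sample`, so the second target can see the finished first target but never its partially denoised states.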

[Figure: Inference · iterative denoising + clean re-encode]

Pass 1 · generate target A (Depth). Conditions, clean: Caption (Tok), RGB (VAE + ViT). Target: Depth (VAE only, noisy) → N flow steps → decoded Depth (z₁).

Re-encode the clean Depth through both VAE and ViT. This clean (VAE + ViT) encoding, not the noisy intermediate, becomes a condition for pass 2.

Pass 2 · target A is now a clean condition · generate target B (Normal). Conditions, clean: Caption (Tok), RGB (VAE + ViT), Depth (VAE + ViT, re-encoded). Target: Normal (VAE only, noisy) → N flow steps → decoded Normal (z₁).

Noisy target tokens attend to each other and to clean conditions only, never to other modalities' noisy intermediates. Every modality conditions on clean tokens only, the same distribution training saw.

END that's MODUS

One decoder, two experts, every modality.

MODUS shows that a single decoder, pairing an autoregressive head with a flow-matching head over a unified token sequence, is enough to bring pretrained foundation priors to genuinely diverse any-to-any multimodal generation.