MODUS unifies any-to-any multimodal generation with one decoder, two experts, and zero task heads.
One decoder: one causal transformer trunk shared across every modality. No separate encoder + decoder, no modality-specific weights, no task pipelines.
Two experts: a 1D Expert handles discrete tokens via autoregressive next-token prediction. A 2D Expert handles continuous latents via flow matching. Both attend to the same causal context.
Zero task heads: two losses, summed: cross-entropy for 1D and flow matching for 2D. No segmentation, depth, or detection heads. No per-task decoders. Every modality goes through the same trunk.
Any-to-any modeling aims to flexibly relate arbitrary modalities within a single system, a need that arises in multimodal learning and in scientific domains like ecology and astronomy. Existing approaches mostly train from scratch with encoder–decoder or diffusion architectures, which limits performance and forgoes pretrained models.
We investigate decoder-only any-to-any multimodal modeling: one decoder that treats every modality symmetrically, with no modality-specific heads, losses, or task pipelines. The resulting model, MODUS, can reuse its own outputs as new inputs, check generated images through grounding and VQA, and compare how ViT and VAE features affect dense visual prediction. Across a range of benchmarks, MODUS performs strongly out of the box and composes modalities flexibly in a single model.
The name MODUS comes from modus, the Latin root of modality.
Capabilities
Where task-specific systems scale O(n × n), MODUS scales linearly in the number of modalities. The grid below shows every input modality decoded into every other, all produced by the same model.
Panels: Depth, Normal, RGB. Each generated modality is fed back into the context before MODUS predicts the next one, so later outputs condition on earlier outputs.
Panels: RGB, Canny, SAM-Seg. Each target modality is generated separately from the same text input, without conditioning on the other generated targets.
Every cell is generated by the same MODUS decoder: rows are the input modality, columns are the target. The 14 modalities cover captions, pixels (RGB), geometry (Depth, Normal), semantics (Segmentation), edges (Canny, SAM‑Edge), masks (SAM‑Seg), detections, and representation spaces (DINOv2, DINOv2 local, CLIP, ImageBind, ImageBind local).
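To make the scaling comparison concrete, the short Python sketch below counts the ordered (input, target) pairs among the 14 modalities listed above. It assumes, purely for illustration, that a task-specific approach would need one specialist per ordered pair; the helper name `directed_pairs` is hypothetical.

```python
def directed_pairs(n: int) -> int:
    """Number of ordered (input, target) pairs among n modalities, input != target."""
    return n * (n - 1)


n_modalities = 14  # the modalities listed above

# Illustrative assumption: one task-specific system per ordered pair.
specialists_needed = directed_pairs(n_modalities)

print(f"{n_modalities} modalities -> {specialists_needed} ordered pairs")
```

Under that assumption, 14 modalities give 182 ordered pairs, all of which the grid covers with a single shared decoder.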
Visual Quality
Side-by-side comparison: the RGB input on the left, the modality MODUS generates from it on the right.
Method
Animated walkthrough of one example sequence, end to end through MODUS.
MODUS adapts the pretrained BAGEL-7B mixture-of-transformers with two experts over a shared causal token sequence: a 1D Expert for discrete sequences (text, grounding boxes, DINOv2 tokens) trained with next-token prediction, and a 2D Expert for continuous spatial latents (RGB, depth, normals, segmentation, Canny edges) trained with flow matching on VAE + ViT features. Both experts attend to the same causal context, so a token produced by either expert conditions every token that follows.
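As a rough illustration of the two training signals described above, the following Python sketch computes a cross-entropy term for one discrete token and a flow-matching term for one continuous latent, then sums them. It is not the MODUS training code: the linear interpolation path (`t = 0` is pure noise, `t = 1` is data), the velocity target, and all function names are assumptions made for this example.

```python
import math
import random


def cross_entropy(logits, target_index):
    """Negative log-likelihood of the target token under softmax(logits)."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return log_z - logits[target_index]


def flow_matching_loss(x_data, noise, t, predict_velocity):
    """Mean squared error between predicted and target velocity at time t.

    Assumed convention: x_t = (1 - t) * noise + t * x_data,
    so the target velocity is x_data - noise.
    """
    x_t = [(1 - t) * n + t * x for n, x in zip(noise, x_data)]
    target = [x - n for n, x in zip(noise, x_data)]
    pred = predict_velocity(x_t, t)
    return sum((p - v) ** 2 for p, v in zip(pred, target)) / len(target)


def total_loss(loss_ar, loss_fm):
    """The two losses are summed, with no extra task-specific terms."""
    return loss_ar + loss_fm


rng = random.Random(0)
latent = [0.5, -1.0, 2.0]                  # toy stand-in for a 2D latent
noise = [rng.gauss(0.0, 1.0) for _ in latent]
t = rng.random()                           # uniform timestep in [0, 1)

# Toy predictor that always outputs zero velocity.
loss_fm = flow_matching_loss(latent, noise, t, lambda x_t, t: [0.0] * len(x_t))
loss_ar = cross_entropy([2.0, 0.5, -1.0], target_index=0)
print(total_loss(loss_ar, loss_fm))
```

In a real model both terms would come from the same forward pass over the shared causal sequence; here they are computed on unrelated toy inputs only to show how the two objectives combine.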
Training sums the two losses, ℒAR + ℒFM.
Capabilities
Anything MODUS generates can be fed back in as input for the next prediction. So it can reach a target by first generating a useful intermediate modality and then conditioning on it, with no retraining or architectural change.
This lets us ask a concrete question: does predicting an intermediate modality first, and then the target from it, beat predicting the target directly? We test it on surface normal estimation, comparing the direct RGB to normal prediction against routing through three different intermediates (depth, Canny edges, and DINO features).
| Pipeline | Intermediate | NYUv2 Normal MAE (°) ↓ |
|---|---|---|
| RGB → Normal | — | 20.02 |
| RGB → Depth → Normal | geometry | 20.06 |
| RGB → DINO → Normal | semantics | 20.71 |
| RGB → Canny → Normal | layout | 19.87 |
We report mean angular error in degrees on NYUv2 surface-normal estimation; lower is better. Only the Canny intermediate improves on direct RGB to normal prediction (19.87° vs. 20.02°); the depth and DINO intermediates are slightly worse (20.06° and 20.71°).
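For readers unfamiliar with the metric, the sketch below shows one common way to compute mean angular error between predicted and ground-truth normals. It is a minimal illustration, not the evaluation code behind the table: real benchmark protocols typically also mask invalid pixels, which is omitted here, and the function names are our own.

```python
import math


def angular_error_deg(pred, gt):
    """Angle in degrees between two 3D vectors (they need not be unit length)."""
    dot = sum(p * g for p, g in zip(pred, gt))
    norm = math.sqrt(sum(p * p for p in pred)) * math.sqrt(sum(g * g for g in gt))
    cos = max(-1.0, min(1.0, dot / norm))  # clamp against rounding error
    return math.degrees(math.acos(cos))


def mean_angular_error_deg(pred_normals, gt_normals):
    """Average per-pixel angular error over paired lists of normals."""
    errors = [angular_error_deg(p, g) for p, g in zip(pred_normals, gt_normals)]
    return sum(errors) / len(errors)


# Two toy "pixels": one perfect prediction, one off by a right angle.
pred = [(0.0, 0.0, 1.0), (1.0, 0.0, 0.0)]
gt = [(0.0, 0.0, 1.0), (0.0, 1.0, 0.0)]
print(mean_angular_error_deg(pred, gt))
```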
The same feedback mechanism works across tasks. Whether the target is an RGB image or a surface-normal map, the intermediate predictions stay coherent with the source, and the final output follows those intermediates closely. MODUS keeps structural and semantic consistency through every hop of the chain.
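The feedback loop described in this section can be sketched in a few lines of Python. The `generate` callable stands in for a single MODUS decoding pass and is hypothetical, as is the `(modality, content)` context format; the point is only that each output is appended to the context before the next target is requested.

```python
from typing import Any, Callable, Dict, List, Tuple

# Context is an ordered list of (modality name, content) pairs.
Context = List[Tuple[str, Any]]


def chain_generate(
    generate: Callable[[Context, str], Any],
    source_modality: str,
    source: Any,
    targets: List[str],
) -> Dict[str, Any]:
    """Generate targets in order, feeding each output back into the context."""
    context: Context = [(source_modality, source)]
    outputs: Dict[str, Any] = {}
    for target in targets:
        out = generate(context, target)   # hypothetical single decoding pass
        outputs[target] = out
        context.append((target, out))     # later targets condition on this output
    return outputs


def stub_generate(context: Context, target: str) -> str:
    """Toy stand-in that reports which modalities were visible in the context."""
    seen = "+".join(name for name, _ in context)
    return f"{target} from {seen}"


outputs = chain_generate(stub_generate, "rgb", "<image>", ["depth", "normal"])
print(outputs)
```

With the stub, the second target reports that it saw both the RGB source and the generated depth, which mirrors the RGB → Depth → Normal pipeline in the table.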
Capabilities
When a model samples several candidate images, some candidates match the prompt better than others. MODUS can score those candidates using modalities it already knows how to produce: grounding boxes and VQA answers. For each prompt, we sample four images, run MODUS again to ground the named objects and answer a prompt-derived question, and keep the candidate whose grounding and answer agree best with the prompt. This is a test-time selection step in the spirit of self-verification and test-time search methods such as SoTo, with no external verifier or separate reward model.
The generate and verify passes run the same MODUS decoder with shared weights; no external verifier or reward model is involved.
| Verifier | GenEval ↑ |
|---|---|
| – | 0.81 |
| Object Grounding | 0.82 |
| VQA + Grounding | 0.84 |
We apply the verifier score to select the best-of-4 output on text-to-image generation.
# MODUS self-verification
candidates ← Text2RGB(prompt, n=4)
scores ← []
for img in candidates:
    bbox ← RGB2Grounding(img, prompt)  # same decoder
    answ ← RGB2VQA(img, prompt)        # same decoder
    scores.append(agree(bbox, answ, prompt))
return candidates[argmax(scores)]
For each prompt, we read the grounding logits MODUS produces when asked to localize the referred objects. These confidence values often correlate with whether the generated image contains the requested objects, and with cues such as approximate count or location. We simply keep the candidate with the highest confidence, with no extra training and no external scoring model.
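A minimal Python sketch of confidence-based selection follows. How MODUS turns grounding logits into a single confidence value is not specified here, so the mean token log-probability used below is an assumption, and both function names are hypothetical.

```python
import math


def mean_token_logprob(logits_per_step, token_ids):
    """Average log-probability of the emitted tokens (assumed confidence measure)."""
    total = 0.0
    for logits, tok in zip(logits_per_step, token_ids):
        m = max(logits)
        log_z = m + math.log(sum(math.exp(v - m) for v in logits))
        total += logits[tok] - log_z
    return total / len(token_ids)


def select_best(candidates, scores):
    """Return the candidate with the highest score (best-of-n selection)."""
    best = max(range(len(candidates)), key=lambda i: scores[i])
    return candidates[best]


# Toy grounding logits for two candidate images, two grounding tokens each.
confident = [[4.0, 0.0, 0.0], [0.0, 5.0, 0.0]]
uncertain = [[0.2, 0.0, 0.1], [0.0, 0.3, 0.1]]
tokens = [0, 1]

scores = [mean_token_logprob(uncertain, tokens), mean_token_logprob(confident, tokens)]
print(select_best(["image_a", "image_b"], scores))
```

Here `image_b` is kept because its grounding tokens were emitted with higher probability; any other scalar confidence could be substituted for `mean_token_logprob` without changing the selection step.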
Capabilities
MODUS represents each 2D modality with two feature branches: a ViT branch that carries semantic information, and a VAE branch that preserves reconstruction detail. The ablation below asks what each branch contributes to dense visual prediction. ViT-only outputs often keep the scene identity but warp geometry; adding VAE features brings the prediction back to the image layout.

ViT only keeps the room's overall identity but warps the geometry of the dark monitor.
| Features | NYUv2 Depth ↓ | NYUv2 Normal ↓ |
|---|---|---|
| ViT only | 15.1 | 35.30 |
| VAE only | 6.9 | 19.96 |
| ViT + VAE | 6.5 | 19.92 |
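Quantitative ablation. ViT + VAE is best on both depth and normal estimation, though only narrowly ahead of VAE only (6.5 vs. 6.9 on depth, 19.92 vs. 19.96 on normals); ViT only is far worse on both.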
A related observation appears in Ramachandran et al.: GPT-4o can describe an image fluently, but its depth and surface-normal predictions can look plausible while changing the scene geometry. We keep this comparison in the main text because it helps contextualize the ViT-only behavior above: semantic recognition alone is not enough to pin down dense geometry.
The hallucination we see with ViT only is not unique to MODUS. Sampling the same input from GPT-4o produces structurally similar hallucinations across samples. Adding VAE features on top of ViT pins the MODUS prediction to a single consistent geometry.
Training
In a multi-modality decoder, every 2D target starts from the same noise distribution. At high noise, the target contains little visual structure, so training still has to identify which modality is being requested. Logit-normal sampling, which works well for unimodal text-to-image, undersamples those high-noise cases; depth requests can then collapse into normals or RGB. Uniform timestep sampling exposes MODUS to those cases more often and reduces modality confusion without sacrificing image quality.
With uniform timestep sampling, MODUS commits to the correct target modality even at a single denoising step. Logit-normal sampling, by contrast, shows modality confusion at low step counts.
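To see why the choice of timestep distribution matters, the sketch below draws timesteps from both distributions and measures how often each lands near the pure-noise end. It assumes a logit-normal with mean 0 and standard deviation 1 and treats `t > 0.9` as "high noise"; both choices, and the convention that `t = 1` is pure noise, are illustrative assumptions rather than the MODUS settings. Because this logit-normal is symmetric around 0.5, the comparison comes out the same if `t = 0` is taken as the noise end instead.

```python
import math
import random


def sample_uniform(rng: random.Random) -> float:
    """Timestep drawn uniformly from [0, 1)."""
    return rng.random()


def sample_logit_normal(rng: random.Random, mean: float = 0.0, std: float = 1.0) -> float:
    """Timestep drawn as sigmoid(z) with z ~ Normal(mean, std)."""
    return 1.0 / (1.0 + math.exp(-rng.gauss(mean, std)))


def high_noise_fraction(sampler, rng, n=100_000, threshold=0.9):
    """Fraction of sampled timesteps above the threshold."""
    hits = sum(1 for _ in range(n) if sampler(rng) > threshold)
    return hits / n


rng = random.Random(0)
frac_uniform = high_noise_fraction(sample_uniform, rng)
frac_logit_normal = high_noise_fraction(sample_logit_normal, rng)
print(f"uniform: {frac_uniform:.3f}, logit-normal: {frac_logit_normal:.3f}")
```

Under these assumptions, uniform sampling puts about 10% of draws above 0.9, while the logit-normal puts only around 1 to 2% there, so the high-noise regime where the modality must be identified is seen far less often during training.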
Results
MODUS extends decoder-only models from image–text settings to diverse modalities and is evaluated zero-shot. It matches or surpasses multitask baselines on the tasks they support, while also covering tasks they cannot solve at all.
Scores are reported in their original units. Filled teal = column best. = task not supported by the model. — = score not reported. † reproduced by us.
Dataset
We construct MODUS-Dataset by extending the BLIP-3o image–caption corpus with per-image pseudo-labels for surface normals, monocular depth, segmentation, and Canny edges (via DepthAnything, Marigold, and Grounded-SAM), plus DINOv2 global features as a representational modality. This alignment supports modality transformations that are difficult to study with conventional datasets, such as transforming depth into Canny edges, as well as multi-step chained generation. The full dataset will be released.
@article{ye2026modus,
title = {MODUS: Decoder-only Any-to-Any Modeling of Diverse Modalities},
author = {Ye, Mingqiao and An, Zhaochong and Gao, Zhitong and Liu, Xian
and Fleuret, Fran\c{c}ois and Li, Chuan and Zadeh, Amir
and Belongie, Serge and Dehghan, Afshin and Allardice, Jesse
and Mizrahi, David and Kar, O\u{g}uzhan Fatih and Bachmann, Roman
and Zamir, Amir},
journal = {arXiv preprint},
year = {2026},
}