MODUS: Decoder-only Any-to-Any Modeling of Diverse Modalities

TL;DR

MODUS unifies any-to-any multimodal generation with one decoder, two experts, and zero task heads.

One decoder: One causal transformer trunk shared across every modality. No separate encoder + decoder, no modality-specific weights, no task pipelines.

Two experts: A 1D Expert handles discrete tokens via autoregressive next-token prediction. A 2D Expert handles continuous latents via flow matching. Both attend to the same causal context.

Zero task heads: Two losses, summed: cross-entropy for 1D and flow matching for 2D. No segmentation, depth, or detection heads. No per-task decoders. Every modality goes through the same trunk.

Any-to-any modeling aims to flexibly relate arbitrary modalities within a single system, a need that arises in multimodal learning and in scientific domains like ecology and astronomy. Existing approaches mostly train from scratch with encoder–decoder or diffusion architectures, which limits performance and forgoes pretrained models.

We investigate decoder-only any-to-any multimodal modeling: one decoder that treats every modality symmetrically, with no modality-specific heads, losses, or task pipelines. The resulting model, MODUS, can reuse its own outputs as new inputs and check generated images through grounding and VQA, and it lets us compare how ViT and VAE features affect dense visual prediction. Across a range of benchmarks, MODUS performs strongly out of the box and composes modalities flexibly in a single model.

The name MODUS comes from modus, the Latin root of modality.

MODUS relates any modality to any other within one model.

Each modality is read or written as a token sequence through one shared trunk.

Capabilities

Any-to-Any Generation

Where task-specific systems scale as O(n × n) in the number of modalities n, MODUS scales linearly. The grid below shows every input modality decoded into every other, all produced by the same model.

Chained prediction (sequential)

Prompt: "a vibrant red double-decker bus parked in front of a grand historic building with a prominent dome, under a partly cloudy sky"

Text → Depth → Normal → RGB

Each generated modality is fed back into the context before MODUS predicts the next one, so later outputs condition on earlier outputs.

Independent prediction (parallel)

Prompt: "a vibrant red double-decker bus parked in front of a grand historic building with a prominent dome, under a partly cloudy sky"

Text → RGB
Text → Canny
Text → SAM-Seg

Each target modality is generated separately from the same text input, without conditioning on the other generated targets. Any single target can be resampled on its own.
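The difference between the two modes comes down to what sits in the context when each target is generated. The sketch below illustrates this with a toy stand-in for the decoder. The names `generate`, `chained_prediction`, and `independent_prediction` are hypothetical and are not part of any released MODUS API.

```python
from typing import Callable, Dict, List

# Hypothetical stand-in for one decoding call: it receives the current
# context (modality name -> content) and the name of the target modality.
Generator = Callable[[Dict[str, str], str], str]


def chained_prediction(source: Dict[str, str], targets: List[str],
                       generate: Generator) -> Dict[str, str]:
    """Generate targets in order; each output joins the context for the next."""
    context = dict(source)
    for target in targets:
        context[target] = generate(context, target)
    return {t: context[t] for t in targets}


def independent_prediction(source: Dict[str, str], targets: List[str],
                           generate: Generator) -> Dict[str, str]:
    """Generate every target from the source alone."""
    return {t: generate(dict(source), t) for t in targets}


def toy_generate(context: Dict[str, str], target: str) -> str:
    """Toy generator that only records which modalities it was conditioned on."""
    return f"{target}<-" + "+".join(context)


source = {"text": "a vibrant red double-decker bus"}
chained = chained_prediction(source, ["depth", "normal", "rgb"], toy_generate)
parallel = independent_prediction(source, ["rgb", "canny", "sam_seg"], toy_generate)

print(chained["rgb"])      # rgb<-text+depth+normal
print(parallel["canny"])   # canny<-text
```

In the chained case the RGB output is conditioned on text, depth, and normals; in the independent case every target sees only the text.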


Every cell is generated by the same MODUS decoder: rows are the input modality, columns are the target. The 14 modalities cover captions, pixels (RGB), geometry (Depth, Normal), semantics (Segmentation), edges (Canny, SAM‑Edge), masks (SAM‑Seg), detections, and representation spaces (DINOv2, DINOv2 local, CLIP, ImageBind, ImageBind local).
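To make the scaling comparison concrete, the small calculation below counts the ordered input-to-target pairs in a 14-modality grid. Reading O(n × n) as "one dedicated system per ordered pair" is our interpretation for illustration, not a statement from the paper.

```python
def directed_pairs(n_modalities: int) -> int:
    """Number of ordered (input, target) pairs with input != target."""
    return n_modalities * (n_modalities - 1)


n = 14
print(directed_pairs(n))  # 182 pairwise mappings, all served by one decoder
```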

Visual Quality

RGB → Any

[Interactive comparison: the RGB input beside the modality MODUS generates from it, such as Depth.]

Visual Quality

Text → Any

[Interactive viewer: a text prompt and the RGB image MODUS generates from it.]

Method

One decoder, two experts, a shared causal context

[Animated walkthrough of one example sequence, end to end through MODUS.]

MODUS adapts the pretrained BAGEL-7B mixture-of-transformers with two experts over a shared causal token sequence: a 1D Expert for discrete sequences (text, grounding boxes, DINOv2 tokens) trained with next-token prediction, and a 2D Expert for continuous spatial latents (RGB, depth, normals, segmentation, Canny edges) trained with flow matching on VAE + ViT features. Both experts attend to the same causal context, so a token produced by either expert conditions every token that follows.
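The two training signals can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the MODUS training code: the straight noise-to-data path (`x_t = (1 - t) * noise + t * x1`, with `t = 0` as pure noise), the equal weighting of the two terms, and the names `cross_entropy`, `flow_matching_loss`, and `velocity_fn` are all ours.

```python
import math
import random
from typing import Callable, List, Sequence


def cross_entropy(logits: Sequence[float], target: int) -> float:
    """Next-token loss for one position: -log softmax(logits)[target]."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_z - logits[target]


def flow_matching_loss(x1: Sequence[float], noise: Sequence[float], t: float,
                       velocity_fn: Callable[[List[float], float], List[float]]) -> float:
    """MSE between predicted and target velocity on a straight path.

    Assumed convention: x_t = (1 - t) * noise + t * x1, so the target
    velocity is x1 - noise (t = 0 is pure noise, t = 1 is data).
    """
    x_t = [(1 - t) * n + t * d for n, d in zip(noise, x1)]
    target = [d - n for n, d in zip(noise, x1)]
    pred = velocity_fn(x_t, t)
    return sum((p - v) ** 2 for p, v in zip(pred, target)) / len(x1)


rng = random.Random(0)
latent = [0.2, -0.4, 0.9]                         # toy continuous latent (2D Expert)
noise = [rng.gauss(0.0, 1.0) for _ in latent]
zero_velocity = lambda x_t, t: [0.0] * len(x_t)   # untrained placeholder network

l_ar = cross_entropy([2.0, 0.5, -1.0], target=0)  # toy logits (1D Expert)
l_fm = flow_matching_loss(latent, noise, t=0.3, velocity_fn=zero_velocity)
loss = l_ar + l_fm                                # the two losses are summed
```

A velocity predictor that returns exactly `x1 - noise` drives the flow-matching term to zero, which is a quick sanity check on the sign convention.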

A companion blog post walks through how MODUS works, step by step: tokenizers, the unified sequence, the two experts, training (ℒ_AR + ℒ_FM), and inference, all in plain visuals.

Capabilities

Chained Prediction Through Intermediates

Anything MODUS generates can be fed back in as input for the next prediction. So it can reach a target by first generating a useful intermediate modality and then conditioning on it, with no retraining or architectural change.

This lets us ask a concrete question: does predicting an intermediate modality first, and then the target from it, beat predicting the target directly? We test it on surface normal estimation, comparing the direct RGB to normal prediction against routing through three different intermediates (depth, Canny edges, and DINO features).

Chained generation of surface normals: MODUS maps the RGB input to an intermediate modality, then to normals. Left: through Canny edges. Right: through depth. Each row is a different scene.

Pipeline	Intermediate	NYUv2 Normal MAE (°) ↓
`RGB → Normal`	—	20.02
`RGB → Depth → Normal`	geometry	20.06
`RGB → DINO → Normal`	semantics	20.71
`RGB → Canny → Normal`	layout	19.87

We report mean angular error in degrees on NYUv2 surface-normal estimation. Lower is better. Only the Canny intermediate improves on direct RGB → normal prediction (19.87° vs. 20.02°); routing through depth (20.06°) or DINO features (20.71°) is slightly worse.
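For reference, mean angular error is the average angle between predicted and ground-truth normal vectors. The sketch below shows the standard per-vector computation; the exact evaluation protocol (valid-pixel masking, resolution) is not specified here, so treat the function names and the toy inputs as illustrative assumptions.

```python
import math
from typing import Sequence, Tuple

Vec3 = Tuple[float, float, float]


def angular_error_deg(pred: Vec3, gt: Vec3) -> float:
    """Angle in degrees between two (not necessarily unit-length) normals."""
    dot = sum(p * g for p, g in zip(pred, gt))
    norm = math.sqrt(sum(p * p for p in pred)) * math.sqrt(sum(g * g for g in gt))
    cos = max(-1.0, min(1.0, dot / norm))  # clamp against rounding error
    return math.degrees(math.acos(cos))


def mean_angular_error_deg(preds: Sequence[Vec3], gts: Sequence[Vec3]) -> float:
    """Average the per-pixel angular errors."""
    errors = [angular_error_deg(p, g) for p, g in zip(preds, gts)]
    return sum(errors) / len(errors)


# Two toy pixels: one perfect prediction, one that is off by a right angle.
preds = [(0.0, 0.0, 1.0), (1.0, 0.0, 0.0)]
gts = [(0.0, 0.0, 1.0), (0.0, 0.0, 1.0)]
print(mean_angular_error_deg(preds, gts))  # approximately 45.0
```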

Appendix · more chained examples: Intermediates stay coherent all the way through

The same feedback mechanism works across tasks. Whether the target is an RGB image or a surface-normal map, the intermediate predictions stay coherent with the source, and the final output follows those intermediates closely. MODUS keeps structural and semantic consistency through every hop of the chain.

**Chained text-to-image generation.** Text prompts are transformed into intermediate 2D modalities, including Canny edges, depth maps, and surface normals, before producing the final RGB image. The examples show high visual quality and strong cross-modality consistency throughout the chained generation process.

**Chained image-to-normal prediction.** The input image is first transformed into intermediate modalities, such as depth or Canny edges, and the resulting representations are then used to produce the final surface normal map. The examples illustrate consistent and coherent predictions across the chained modalities.

Capabilities

Cross-modal Self-Verification

When a model samples several candidate images, some candidates match the prompt better than others. MODUS can score those candidates using modalities it already knows how to produce: grounding boxes and VQA answers. For each prompt, we sample four images, run MODUS again to ground the named objects and answer a prompt-derived question, and keep the candidate whose grounding and answer agree best with the prompt. This is a test-time selection step in the spirit of self-verification and test-time search methods such as SoTo, with no external verifier or separate reward model.

Pipeline: prompt (“a blue vase…”) → Text → RGB (MODUS, generate ×4) → 4 candidates → RGB → Grounding and RGB → VQA (MODUS, verify) → agree / argmax → best of 4 → output

The generate and verify passes are the same MODUS decoder with shared weights. No external verifier or reward model.

Figure: self-verification candidates with confidence scores. For each prompt, MODUS samples several candidate generations and scores them with an auxiliary grounding or VQA pass produced by the same decoder. The most consistent candidate is kept.

Verifier	GenEval ↑
–	0.81
Object Grounding	0.82
VQA + Grounding	0.84

We apply the verifier score to select the best-of-4 output on text-to-image generation.

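# MODUS self-verification (best-of-n selection).
# text2rgb, rgb2grounding, and rgb2vqa are calls into the same MODUS decoder;
# agree scores how well the boxes and the answer match the prompt.
def self_verify(prompt, text2rgb, rgb2grounding, rgb2vqa, agree, n=4):
    candidates = text2rgb(prompt, n=n)
    scores = []
    for img in candidates:
        bbox = rgb2grounding(img, prompt)   # same decoder
        answ = rgb2vqa(img, prompt)         # same decoder
        scores.append(agree(bbox, answ, prompt))
    best = max(range(len(scores)), key=scores.__getitem__)  # argmax
    return candidates[best]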

Appendix · more verification examples: Verifier confidence tracks prompt adherence

For each prompt we read the grounding logits MODUS produces when asked to localize the referred objects. These confidence values often correlate with whether the generated image contains the requested objects, and with cues such as approximate count or location. We simply keep the candidate with the highest confidence, with no extra training and no external scoring model.

**Text-to-image generation with self-verification.** Additional examples with grounding confidence scores. We apply the grounding capability of MODUS to evaluate the quality of its own text-to-image outputs and select the sample with the highest verification score. This simple test-time search, using a task already supported by MODUS, leads to improved image quality and better alignment with the input prompt.

Capabilities

Visual Representation Composition

MODUS represents each 2D modality with two feature branches: a ViT branch that carries semantic information, and a VAE branch that preserves reconstruction detail. The ablation below asks what each branch contributes to dense visual prediction. ViT-only outputs often keep the scene identity but warp geometry; adding VAE features brings the prediction back to the image layout.

[Interactive comparison: the RGB input overlaid with the ViT only, VAE only, or ViT + VAE prediction, shown here for depth.]

ViT only keeps the room's overall identity but warps the geometry of the dark monitor.

Features	NYUv2 Depth ↓	NYUv2 Normal MAE (°) ↓
ViT only	15.1	35.30
VAE only	6.9	19.96
ViT + VAE	6.5	19.92

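Quantitative ablation. ViT + VAE gives the lowest error on both depth (6.5) and normal (19.92) estimation. The margin over VAE only (6.9 and 19.96) is small, while ViT only is far worse on both (15.1 and 35.30).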

A related observation appears in Ramachandran et al.: GPT-4o can describe an image fluently, but its depth and surface-normal predictions can look plausible while changing the scene geometry. We keep this comparison in the main text because it helps contextualize the ViT-only behavior above: semantic recognition alone is not enough to pin down dense geometry.

GPT-4o on depth and surface normals: the scene reads correctly but the geometry is off (highlighted). Flat surfaces bulge outward, and even the chair's shape changes. Source: *How Well Does GPT-4o Understand Vision?*

Appendix · cross-model comparison: GPT-4o hallucinates the same way as ViT-only

The hallucination we see with ViT only is not unique to MODUS. Sampling the same input from GPT-4o produces structurally similar hallucinations across samples. Adding VAE features on top of ViT pins the MODUS prediction to a single consistent geometry.

ViT-only and GPT-4o both hallucinate; ViT + VAE pins the geometry down. Each scene shows depth (top) and surface normal (bottom). **MODUS ViT-only**: 5 independent samples. **GPT-4o**: 2 samples. **MODUS ViT + VAE**: deterministic output.

Training

Early timesteps determine the modality

In a multi-modality decoder, every 2D target starts from the same noise distribution. At high noise, the target contains little visual structure, so training still has to identify which modality is being requested. Logit-normal sampling, which works well for unimodal text-to-image, undersamples those high-noise cases; depth requests can then collapse into normals or RGB. Uniform timestep sampling exposes MODUS to those cases more often and reduces modality confusion without sacrificing image quality.
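The gap between the two schedules is easy to see numerically. The sketch below estimates how often each sampler lands in the high-noise region, assuming the convention that `t = 0` is pure noise, a threshold of `t < 0.1`, and a logit-normal with mean 0 and standard deviation 1. These settings are illustrative assumptions; the values used to train MODUS are not given here.

```python
import math
import random


def sample_uniform(rng: random.Random) -> float:
    """Draw a timestep uniformly from [0, 1)."""
    return rng.random()


def sample_logit_normal(rng: random.Random, mean: float = 0.0, std: float = 1.0) -> float:
    """Draw a timestep as sigmoid(z) with z ~ Normal(mean, std)."""
    return 1.0 / (1.0 + math.exp(-rng.gauss(mean, std)))


def high_noise_fraction(sampler, n: int = 100_000, threshold: float = 0.1,
                        seed: int = 0) -> float:
    """Fraction of sampled timesteps below the threshold (the high-noise end)."""
    rng = random.Random(seed)
    return sum(sampler(rng) < threshold for _ in range(n)) / n


uniform_share = high_noise_fraction(sample_uniform)
logit_normal_share = high_noise_fraction(sample_logit_normal)
print(f"uniform: {uniform_share:.3f}, logit-normal: {logit_normal_share:.3f}")
```

Under these assumptions, uniform sampling puts roughly 10% of training steps in the high-noise region, while the logit-normal puts only about 1 to 2% there, since its mass is concentrated around the middle of the range.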

Schematic of logit-normal vs. uniform timestep sampling. Logit-normal undersamples high-noise timesteps where the requested target modality must still be identifiable. Uniform sampling stabilises this.

Appendix · few-step generation: Even one denoising step commits to the right modality

With uniform timestep sampling, MODUS commits to the correct target modality even at a single denoising step. Logit-normal sampling, by contrast, shows modality confusion at low step counts.

Per-scene generations at 1, 2, 3, 5, 10, 20, and 50 denoising steps, comparing uniform and logit-normal timestep sampling.
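The step counts above are a sampling-time choice: a flow-matching model defines a velocity field, and generation integrates it from noise to data in a fixed number of steps. The sketch below uses plain Euler integration with a toy velocity field that points at a single known target. Both the Euler scheme and the toy field are assumptions for illustration; the sampler MODUS actually uses is not described here.

```python
from typing import Callable, List

Velocity = Callable[[List[float], float], List[float]]


def euler_sample(noise: List[float], velocity_fn: Velocity, num_steps: int) -> List[float]:
    """Integrate dx/dt = v(x, t) from t = 0 (noise) to t = 1 with Euler steps."""
    x = list(noise)
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = i * dt
        v = velocity_fn(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


TARGET = [0.5, -1.0, 2.0]  # toy "data" point standing in for a clean latent


def toy_velocity(x: List[float], t: float) -> List[float]:
    """Exact velocity of the straight path toward TARGET: (target - x) / (1 - t)."""
    return [(g - xi) / (1.0 - t) for g, xi in zip(TARGET, x)]


start = [1.5, 0.3, -0.7]
for steps in (1, 2, 3, 5, 10, 20, 50):
    sample = euler_sample(start, toy_velocity, steps)
    print(steps, [round(v, 6) for v in sample])
```

Because this toy path is perfectly straight, every step count lands on the target. A learned velocity field is only approximately straight, which is why output quality and modality commitment can vary with the number of steps.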

Results

Zero-shot Benchmarks

MODUS extends decoder-only models from image–text settings to diverse modalities and is evaluated zero-shot. It matches or surpasses multitask baselines on the tasks they support, while also covering tasks they cannot solve at all.

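Benchmarks: MMMU ↑ (accuracy, %), GenEval ↑ (score), DIODE Depth ↓ (AbsRel), NYUv2 Normal ↓ (MAE, °), RefCOCO val ↑ (accuracy, %), IN-1k T1/T5 ↑ (accuracy, %).

Type	Model	Scores
Enc-Dec	4M-21	GenEval 0.37; DIODE Depth 0.331; NYUv2 Normal 37.28; IN-1k T1/T5 78.3 / 92.4
Enc-Dec	Unified-IO 2	—; DIODE Depth 0.369; NYUv2 Normal 28.55; —
Diffusion	OneDiffusion	GenEval 0.65; DIODE Depth 0.399; —
Decoder	BAGEL^†	MMMU 53.2; GenEval 0.86
Decoder	Kosmos-2	—; RefCOCO val 52.3
Decoder	Janus-Pro	MMMU 41.0; GenEval 0.80
Decoder	GPT-4o	MMMU 69.1; GenEval 0.84
Ours	MODUS	MMMU 51.1; GenEval 0.81; DIODE Depth 0.285; NYUv2 Normal 19.92; RefCOCO val 54.5; IN-1k T1/T5 77.9 / 92.5

Scores are reported in their original units. — = score not reported. A benchmark missing from a row is a task the model does not support. ^† reproduced by us.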

Dataset

MODUS-Dataset

We construct MODUS-Dataset by extending the BLIP-3o image–caption corpus with per-image pseudo-labels for surface normals, monocular depth, segmentation, and Canny edges (via DepthAnything, Marigold, and Grounded-SAM), plus DINOv2 global features as a representational modality. This alignment supports modality transformations that are difficult to study with conventional datasets, such as transforming depth into Canny edges, as well as multi-step chained generation. The full dataset will be released.

BibTeX

@article{ye2026modus,
  title   = {MODUS: Decoder-only Any-to-Any Modeling of Diverse Modalities},
  author  = {Ye, Mingqiao and An, Zhaochong and Gao, Zhitong and Liu, Xian
             and Fleuret, Fran\c{c}ois and Li, Chuan and Zadeh, Amir
             and Belongie, Serge and Dehghan, Afshin and Allardice, Jesse
             and Mizrahi, David and Kar, O\u{g}uzhan Fatih and Bachmann, Roman
             and Zamir, Amir},
  journal = {arXiv preprint},
  year    = {2026},
}