Abstract

EditMGT Overview
Figure 1: Overview of EditMGT and Crisp-2M. EditMGT, the first MGT-based editing model, performs editing in 2s with 960M parameters, 6× faster than models of comparable performance; Crisp-2M provides 2M high-resolution (≥1024) editing samples spanning 7 distinct categories.

Recent advances in diffusion models (DMs) have achieved exceptional visual quality in image editing tasks. However, the global denoising dynamics of DMs inherently conflate local editing targets with the full-image context, leading to unintended modifications in non-target regions. In this paper, we shift our attention beyond DMs and turn to Masked Generative Transformers (MGTs) as an alternative approach to this challenge. By predicting multiple masked tokens rather than performing holistic refinement, MGTs exhibit a localized decoding paradigm that endows them with the inherent capacity to explicitly preserve non-relevant regions during editing. Building upon this insight, we introduce the first MGT-based image editing framework, termed EditMGT. We first demonstrate that MGT cross-attention maps provide informative signals for localizing edit-relevant regions, and we devise a multi-layer attention consolidation scheme that refines these maps to achieve fine-grained and precise localization. On top of these adaptive localization results, we introduce region-hold sampling, which restricts token flipping within low-attention areas to suppress spurious edits, thereby confining modifications to the intended target regions and preserving the integrity of surrounding non-target areas. To train EditMGT, we construct Crisp-2M, a high-resolution (≥1024) dataset spanning seven diverse editing categories. Without introducing additional parameters, we adapt a pre-trained text-to-image MGT into an image editing model through attention injection. Extensive experiments across four standard benchmarks demonstrate that, with fewer than 1B parameters, our model achieves state-of-the-art image similarity while enabling 6× faster editing. Moreover, it delivers comparable or superior editing quality, with improvements of 3.6% and 17.6% on style change and style transfer tasks, respectively.

Method Overview

We present EditMGT, the first MGT-based image editing framework that leverages attention injection for parameter-free adaptation from text-to-image generation to image editing.

Method Framework
Figure 2: Overview of EditMGT. Our approach supervises edited image generation through original image attention injection. The right panel illustrates token-wise interactions within the multi-modal transformer block.

Architecture

Our approach integrates the image condition through attention injection. We define image condition tokens C_V that share parameters with the image tokens C_I but are kept at a fixed timestep of zero throughout the process. This design prevents the condition from drifting while providing a stable conditioning signal.

We introduce a bias term ℰ into the attention weights to control the strength of conditioning during inference, enabling a seamless transition from text-to-image generation to editing without additional parameters.
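
A minimal PyTorch sketch of this idea is shown below; the token layout ([text | image | condition]), the scalar `cond_bias` standing in for ℰ, and the function name are illustrative assumptions rather than the released implementation:

```python
import torch

def attention_with_condition_bias(q, k, v, n_img, n_cond, cond_bias=0.0):
    """Scaled dot-product attention over [text | image | condition] tokens.

    A scalar bias (the term E above) is added to the logits where image
    tokens attend to the image-condition tokens, so conditioning strength
    can be tuned at inference without extra parameters.
    q, k, v: (B, heads, N, d) with N = n_txt + n_img + n_cond.
    """
    d = q.size(-1)
    logits = q @ k.transpose(-2, -1) / d**0.5          # (B, H, N, N)
    n_txt = q.size(2) - n_img - n_cond
    img_rows = slice(n_txt, n_txt + n_img)             # queries from image tokens
    cond_cols = slice(n_txt + n_img, None)             # keys from condition tokens
    logits[..., img_rows, cond_cols] += cond_bias      # bias term controls conditioning
    attn = logits.softmax(dim=-1)
    return attn @ v, attn
```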

Inference Process

Multi-layer Attention Consolidation: We aggregate attention weights from coherent single-modality processing layers and apply adaptive filtering to enhance clarity and spatial precision.
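
As a sketch, the consolidation step might look like the following; averaging over heads and layers and the quantile-based cut-off are our illustrative stand-ins for the paper's adaptive filtering:

```python
import torch

def consolidate_attention(maps, keep_quantile=0.6):
    """Aggregate per-layer text-to-image cross-attention maps and filter them.

    maps: list of (n_heads, n_img_tokens) tensors, one per selected layer,
          already reduced over the instruction's text tokens.
    Returns a single (n_img_tokens,) relevance map in [0, 1].
    """
    stacked = torch.stack([m.mean(dim=0) for m in maps])     # (L, n_img_tokens)
    fused = stacked.mean(dim=0)                              # average across layers
    thresh = torch.quantile(fused, keep_quantile)            # adaptive cut-off
    fused = torch.where(fused >= thresh, fused, torch.zeros_like(fused))
    mx = fused.max()
    return fused / mx if mx > 0 else fused
```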

Region-Hold Sampling: We preserve unmodified regions by explicitly flipping tokens in low-attention areas back to their original values, using a threshold λ to control the flipping frequency and maintain consistency with the source image.
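
A minimal sketch of this sampling rule, assuming a single decoding step (how λ is scheduled across steps in EditMGT may differ):

```python
import torch

def region_hold_sampling(pred_tokens, src_tokens, relevance, lam=0.3):
    """Keep edits only where the consolidated attention says they belong.

    pred_tokens: (n,) newly sampled discrete image tokens at this step.
    src_tokens:  (n,) tokens of the original (source) image.
    relevance:   (n,) consolidated attention map in [0, 1].
    lam:         threshold controlling how aggressively low-attention
                 positions are held at their source tokens.
    Positions with relevance below lam are flipped back to the original
    tokens, confining changes to the intended edit region.
    """
    hold = relevance < lam
    return torch.where(hold, src_tokens, pred_tokens)
```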

Attention Mechanism
Figure 3: Attention Mechanism in EditMGT. The text-to-image attention maps encode rich semantic correspondences. We enhance their clarity through stacking and filtering operations.

Crisp-2M Dataset

We construct Crisp-2M, a high-resolution image editing dataset spanning 7 distinct categories with 2M samples (short edge ≥ 1024 pixels). Combined with additional high-resolution samples, we utilize 4M total samples for training. This dataset addresses the scarcity of high-quality, high-resolution image editing data, which is crucial for training effective editing models.

Data Collection Pipeline

Data Pipeline
Figure 4: Overview of the Crisp-2M dataset collection pipeline. The construction encompasses 4 stages: Image Curation, Customized Instruction Generation, Specific Edit Pipeline, and Data Quality Assurance.

Stage 1: Image Curation. We curate high-quality images from three sources: the LAION-Aesthetics and Unsplash Lite datasets, and JourneyDB (FLUX re-generated version). Through systematic filtering based on aesthetic scores above 4.5, resolution requirements (short side of at least 1024 pixels), and content suitability assessment using Qwen3, we obtain approximately 5.5M samples. We employ rigorous filtering to exclude simple patterns, monotonous compositions, and images containing watermarks or text overlays.
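
For concreteness, the Stage 1 filtering rule can be sketched as follows; the boolean flags stand in for the Qwen3-based suitability assessment and the watermark/pattern checks, which are not reproduced here:

```python
from PIL import Image

def passes_stage1(path, aesthetic_score, has_watermark_or_text, is_simple_pattern):
    """Stage-1 curation rule as described above: aesthetic score > 4.5,
    short side >= 1024 px, and no watermarks/text or trivial compositions."""
    with Image.open(path) as img:
        short_side = min(img.size)
    return (
        aesthetic_score > 4.5
        and short_side >= 1024
        and not has_watermark_or_text
        and not is_simple_pattern
    )
```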

Stage 2: Customized Instruction Generation. To enhance data quality and diversity, we propose a systematic two-stage framework. First, we employ Qwen2.5-VL to produce detailed image captions that explicitly delineate background elements, foreground objects, and their semantic attributes. The second stage leverages GPT-4o to systematically transform these descriptive captions into actionable editing instructions across multiple modalities. We implement an iterative self-refinement mechanism that progressively enhances instruction complexity and linguistic diversity.
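
A schematic of this two-stage generation, where `caption_fn` and `instruct_fn` are hypothetical callables standing in for Qwen2.5-VL captioning and GPT-4o instruction rewriting:

```python
def build_instruction(image_path, caption_fn, instruct_fn, refine_rounds=2):
    """Two-stage instruction generation as described above.

    The loop mirrors the iterative self-refinement idea: each round asks the
    rewriter to enrich the previous instruction's complexity and wording.
    """
    caption = caption_fn(image_path)                    # detailed structured caption
    instruction = instruct_fn(caption)                  # caption -> editing instruction
    for _ in range(refine_rounds):                      # progressive self-refinement
        instruction = instruct_fn(caption, previous=instruction)
    return instruction
```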

Stage 3: Specific Edit Pipeline. Our data collection pipeline leverages state-of-the-art editing models, including FLUX.1 Kontext and Step1X-Edit v1.2, and subsequently employs VLMs to select the superior result for each sample. This multi-model approach enhances data quality while enriching dataset diversity compared with traditional single-model pipelines.
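
This multi-model step reduces to a best-of-N choice; `vlm_score` below is a hypothetical callable for the VLM judge, not an API from the paper:

```python
def select_best_edit(source_path, instruction, candidate_paths, vlm_score):
    """Pick the strongest result among edits produced by different models
    (e.g., FLUX.1 Kontext and Step1X-Edit v1.2 in the pipeline above).

    vlm_score(source, candidate, instruction) should rate how well a
    candidate follows the instruction while preserving the source image.
    """
    scored = [(vlm_score(source_path, path, instruction), path)
              for path in candidate_paths]
    return max(scored, key=lambda pair: pair[0])[1]
```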

Stage 4: Data Quality Assurance. We establish a comprehensive two-stage filtering framework: (1) pre-processing instruction validation to identify semantic and logical inconsistencies in LLM-generated instructions; (2) post-processing quality verification using CLIP-based alignment metrics to ensure semantic correspondence between edited images and target descriptions, and visual similarity metrics to verify preservation of non-target content.
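
The post-processing check in (2) can be sketched with an off-the-shelf CLIP model; the checkpoint and thresholds below are illustrative assumptions, not the values used to build Crisp-2M:

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def passes_stage4(src_path, edit_path, target_caption,
                  min_text_sim=0.25, min_image_sim=0.80):
    """Stage-4 post-processing: the edited image should match the target
    description (text-image similarity) while staying visually close to the
    source (image-image similarity)."""
    src = Image.open(src_path).convert("RGB")
    edit = Image.open(edit_path).convert("RGB")
    inputs = processor(text=[target_caption], images=[src, edit],
                       return_tensors="pt", padding=True)
    img_emb = model.get_image_features(pixel_values=inputs["pixel_values"])
    txt_emb = model.get_text_features(input_ids=inputs["input_ids"],
                                      attention_mask=inputs["attention_mask"])
    img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
    txt_emb = txt_emb / txt_emb.norm(dim=-1, keepdim=True)
    text_sim = (txt_emb @ img_emb[1:].T).item()   # target caption vs. edited image
    image_sim = (img_emb[0] @ img_emb[1]).item()  # source vs. edited image
    return text_sim >= min_text_sim and image_sim >= min_image_sim
```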

Dataset Statistics

  • Add: ~300k samples
  • Replace: ~300k samples
  • Remove: ~300k samples
  • Color alteration: ~500k samples
  • Background change: ~200k samples
  • Style transformation: ~400k samples
  • Motion modification: ~34k samples

Key Features: The dataset is optimized for high-resolution editing tasks, with long-side resolutions predominantly concentrated in the [1280, 1665) pixel range. This high-resolution nature ensures that our model can handle detailed editing tasks while maintaining image quality.

Dataset Statistics
Figure 5: (a) Resolution interval distribution of Crisp-2M. (b) Pie chart of data types in the Crisp-2M dataset across seven editing categories.

Experimental Results

Quantitative Results

We evaluate EditMGT on four benchmarks: Emu Edit, MagicBrush, AnyBench, and GEdit-EN-full. Our model achieves state-of-the-art performance in image similarity while maintaining strong instruction adherence.

Key Results

  • CLIP Image Similarity: SOTA on all benchmarks with 1.1% improvement on MagicBrush
  • Style Transfer: 17.6% improvement over FluxKontext.dev
  • Style Change: 3.6% improvement over second-best method
  • Model Size: Only 960M parameters vs. 2-8× larger baselines
  • Speed: 6× faster editing compared to diffusion models

AnyBench Results
Figure 6: (a) AnyBench (local part) Results on DINOv2 scores. (b) AnyBench (global part and implicit part) Results on DINOv2 scores. (c) Ablation study on the dataset scale effects.

Attention Visualization

Varying the threshold λ shows how our region-hold sampling controls the extent of editing while preserving non-target regions.

Attention Analysis
Figure 7: Visualizations of editing results, GEdit Bench semantic scores, and L1 distances from original images across varying threshold λ. The relationship demonstrates how region-hold sampling controls editing extent.

Qualitative Comparisons

Visual comparisons demonstrate EditMGT's superior instruction comprehension, object attribute understanding, and structural preservation capabilities.

Qualitative Results
Figure 8: Qualitative comparisons between EditMGT and other open-source editing models. Our model demonstrates superior instruction comprehension and object attribute understanding.