Abstract

EditMGT Overview
Figure 1: Overview of EditMGT and Crisp-2M. EditMGT, the first MGT-based editing model, performs editing in 2s with 960M parameters, 6× faster than models of comparable performance; Crisp-2M provides 2M high-resolution (≥1024) editing samples spanning 7 distinct categories.

Recent advances in diffusion models (DMs) have achieved exceptional visual quality in image editing tasks. However, the global denoising dynamics of DMs inherently conflate local editing targets with the full-image context, leading to unintended modifications in non-target regions. In this paper, we shift our attention beyond DMs and turn to Masked Generative Transformers (MGTs) as an alternative approach to this challenge. By predicting multiple masked tokens rather than performing holistic refinement, MGTs exhibit a localized decoding paradigm that endows them with the inherent capacity to explicitly preserve non-relevant regions during editing. Building upon this insight, we introduce the first MGT-based image editing framework, termed EditMGT. We first demonstrate that MGT cross-attention maps provide informative signals for localizing edit-relevant regions, and we devise a multi-layer attention consolidation scheme that refines these maps to achieve fine-grained and precise localization. On top of these adaptive localization results, we introduce region-hold sampling, which restricts token flipping within low-attention areas to suppress spurious edits, thereby confining modifications to the intended target regions and preserving the integrity of surrounding non-target areas. To train EditMGT, we construct Crisp-2M, a high-resolution (≥1024) dataset spanning seven diverse editing categories. Without introducing additional parameters, we adapt a pre-trained text-to-image MGT into an image editing model through attention injection. Extensive experiments across four standard benchmarks demonstrate that, with fewer than 1B parameters, our model achieves state-of-the-art image similarity while enabling 6× faster editing. Moreover, it delivers comparable or superior editing quality, with improvements of 3.6% and 17.6% on style change and style transfer tasks, respectively.

Method Overview

We present EditMGT, the first MGT-based image editing framework that leverages attention injection for parameter-free adaptation from text-to-image generation to image editing.

Method Framework
Figure 2: Overview of EditMGT. Our approach supervises edited image generation through original image attention injection. The right panel illustrates token-wise interactions within the multi-modal transformer block.

Architecture

Our approach integrates the image condition through attention injection. We define image condition tokens C_V that share parameters with the image tokens C_I but are kept at a fixed timestep of zero throughout the process. This design prevents the condition from drifting while providing a stable conditioning signal.

We introduce a bias term ℰ into the attention weights to control the strength of conditioning during inference, enabling a seamless transition from text-to-image generation to editing without additional parameters.
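
A minimal PyTorch sketch of this idea is shown below; the token layout ([text | image | condition]), the scalar `cond_bias` standing in for ℰ, and the function name are illustrative assumptions rather than the released implementation:

```python
import torch

def attention_with_condition_bias(q, k, v, n_img, n_cond, cond_bias=0.0):
    """Scaled dot-product attention over [text | image | condition] tokens.

    A scalar bias (the term E above) is added to the logits where image
    tokens attend to the image-condition tokens, so conditioning strength
    can be tuned at inference without extra parameters.
    q, k, v: (B, heads, N, d) with N = n_txt + n_img + n_cond.
    """
    d = q.size(-1)
    logits = q @ k.transpose(-2, -1) / d**0.5          # (B, H, N, N)
    n_txt = q.size(2) - n_img - n_cond
    img_rows = slice(n_txt, n_txt + n_img)             # queries from image tokens
    cond_cols = slice(n_txt + n_img, None)             # keys from condition tokens
    logits[..., img_rows, cond_cols] += cond_bias      # bias term controls conditioning
    attn = logits.softmax(dim=-1)
    return attn @ v, attn
```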

Inference Process

Multi-layer Attention Consolidation: We aggregate attention weights from coherent single-modality processing layers and apply adaptive filtering to enhance clarity and spatial precision.
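
As a sketch, the consolidation step might look like the following; averaging over heads and layers and the quantile-based cut-off are our illustrative stand-ins for the paper's adaptive filtering:

```python
import torch

def consolidate_attention(maps, keep_quantile=0.6):
    """Aggregate per-layer text-to-image cross-attention maps and filter them.

    maps: list of (n_heads, n_img_tokens) tensors, one per selected layer,
          already reduced over the instruction's text tokens.
    Returns a single (n_img_tokens,) relevance map in [0, 1].
    """
    stacked = torch.stack([m.mean(dim=0) for m in maps])     # (L, n_img_tokens)
    fused = stacked.mean(dim=0)                              # average across layers
    thresh = torch.quantile(fused, keep_quantile)            # adaptive cut-off
    fused = torch.where(fused >= thresh, fused, torch.zeros_like(fused))
    mx = fused.max()
    return fused / mx if mx > 0 else fused
```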

Region-Hold Sampling: We preserve unmodified regions by explicitly flipping tokens in low-attention areas back to their original values, using a threshold λ to control the flipping frequency and maintain consistency with the source image.
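
A minimal sketch of this sampling rule, assuming a single decoding step (how λ is scheduled across steps in EditMGT may differ):

```python
import torch

def region_hold_sampling(pred_tokens, src_tokens, relevance, lam=0.3):
    """Keep edits only where the consolidated attention says they belong.

    pred_tokens: (n,) newly sampled discrete image tokens at this step.
    src_tokens:  (n,) tokens of the original (source) image.
    relevance:   (n,) consolidated attention map in [0, 1].
    lam:         threshold controlling how aggressively low-attention
                 positions are held at their source tokens.
    Positions with relevance below lam are flipped back to the original
    tokens, confining changes to the intended edit region.
    """
    hold = relevance < lam
    return torch.where(hold, src_tokens, pred_tokens)
```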

Attention Mechanism
Figure 3: Attention Mechanism in EditMGT. The text-to-image attention maps encode rich semantic correspondences. We enhance their clarity through stacking and filtering operations.

Crisp-2M Dataset

We construct Crisp-2M, a high-resolution image editing dataset spanning 7 distinct categories with 2M samples (short edge ≥ 1024 pixels). Combined with additional high-resolution samples, we utilize 4M total samples for training. This dataset addresses the scarcity of high-quality, high-resolution image editing data, which is crucial for training effective editing models.

Data Collection Pipeline

Data Pipeline
Figure 4: Overview of the Crisp-2M dataset collection pipeline. The construction encompasses 4 stages: Image Curation, Customized Instruction Generation, Specific Edit Pipeline, and Data Quality Assurance.

Stage 1: Image Curation. We curate high-quality images from three sources: the LAION-Aesthetics and Unsplash Lite datasets, and JourneyDB (FLUX re-generated version). Through systematic filtering based on aesthetic scores above 4.5, resolution requirements (short side of at least 1024 pixels), and content suitability assessment using Qwen3, we obtain approximately 5.5M samples. We employ rigorous filtering to exclude simple patterns, monotonous compositions, and images containing watermarks or text overlays.
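
For concreteness, the Stage 1 filtering rule can be sketched as follows; the boolean flags stand in for the Qwen3-based suitability assessment and the watermark/pattern checks, which are not reproduced here:

```python
from PIL import Image

def passes_stage1(path, aesthetic_score, has_watermark_or_text, is_simple_pattern):
    """Stage-1 curation rule as described above: aesthetic score > 4.5,
    short side >= 1024 px, and no watermarks/text or trivial compositions."""
    with Image.open(path) as img:
        short_side = min(img.size)
    return (
        aesthetic_score > 4.5
        and short_side >= 1024
        and not has_watermark_or_text
        and not is_simple_pattern
    )
```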

Stage 2: Customized Instruction Generation. To enhance data quality and diversity, we propose a systematic two-stage framework. First, we employ Qwen2.5-VL to produce detailed image captions that explicitly delineate background elements, foreground objects, and their semantic attributes. The second stage leverages GPT-4o to systematically transform these descriptive captions into actionable editing instructions across multiple modalities. We implement an iterative self-refinement mechanism that progressively enhances instruction complexity and linguistic diversity.
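
A schematic of this two-stage generation, where `caption_fn` and `instruct_fn` are hypothetical callables standing in for Qwen2.5-VL captioning and GPT-4o instruction rewriting:

```python
def build_instruction(image_path, caption_fn, instruct_fn, refine_rounds=2):
    """Two-stage instruction generation as described above.

    The loop mirrors the iterative self-refinement idea: each round asks the
    rewriter to enrich the previous instruction's complexity and wording.
    """
    caption = caption_fn(image_path)                    # detailed structured caption
    instruction = instruct_fn(caption)                  # caption -> editing instruction
    for _ in range(refine_rounds):                      # progressive self-refinement
        instruction = instruct_fn(caption, previous=instruction)
    return instruction
```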

Stage 3: Specific Edit Pipeline. Our data collection pipeline leverages state-of-the-art editing models, including FLUX.1 Kontext and Step1X-Edit v1.2, and subsequently employs VLMs to select the superior result for each sample. This multi-model approach enhances data quality while enriching dataset diversity compared with traditional single-model pipelines.
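
This multi-model step reduces to a best-of-N choice; `vlm_score` below is a hypothetical callable for the VLM judge, not an API from the paper:

```python
def select_best_edit(source_path, instruction, candidate_paths, vlm_score):
    """Pick the strongest result among edits produced by different models
    (e.g., FLUX.1 Kontext and Step1X-Edit v1.2 in the pipeline above).

    vlm_score(source, candidate, instruction) should rate how well a
    candidate follows the instruction while preserving the source image.
    """
    scored = [(vlm_score(source_path, path, instruction), path)
              for path in candidate_paths]
    return max(scored, key=lambda pair: pair[0])[1]
```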

Stage 4: Data Quality Assurance. We establish a comprehensive two-stage filtering framework: (1) pre-processing instruction validation to identify semantic and logical inconsistencies in LLM-generated instructions; (2) post-processing quality verification using CLIP-based alignment metrics to ensure semantic correspondence between edited images and target descriptions, and visual similarity metrics to verify preservation of non-target content.
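
The post-processing check in (2) can be sketched with an off-the-shelf CLIP model; the checkpoint and thresholds below are illustrative assumptions, not the values used to build Crisp-2M:

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def passes_stage4(src_path, edit_path, target_caption,
                  min_text_sim=0.25, min_image_sim=0.80):
    """Stage-4 post-processing: the edited image should match the target
    description (text-image similarity) while staying visually close to the
    source (image-image similarity)."""
    src = Image.open(src_path).convert("RGB")
    edit = Image.open(edit_path).convert("RGB")
    inputs = processor(text=[target_caption], images=[src, edit],
                       return_tensors="pt", padding=True)
    img_emb = model.get_image_features(pixel_values=inputs["pixel_values"])
    txt_emb = model.get_text_features(input_ids=inputs["input_ids"],
                                      attention_mask=inputs["attention_mask"])
    img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
    txt_emb = txt_emb / txt_emb.norm(dim=-1, keepdim=True)
    text_sim = (txt_emb @ img_emb[1:].T).item()   # target caption vs. edited image
    image_sim = (img_emb[0] @ img_emb[1]).item()  # source vs. edited image
    return text_sim >= min_text_sim and image_sim >= min_image_sim
```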

Dataset Statistics

  • Add: ~300k samples
  • Replace: ~300k samples
  • Remove: ~300k samples
  • Color alteration: ~500k samples
  • Background change: ~200k samples
  • Style transformation: ~400k samples
  • Motion modification: ~34k samples

Key Features: The dataset is optimized for high-resolution editing tasks, with long-side resolutions predominantly concentrated in the [1280, 1665) pixel range. This high-resolution nature ensures that our model can handle detailed editing tasks while maintaining image quality.

Dataset Statistics
Figure 5: (a) Resolution interval distribution of Crisp-2M. (b) Pie chart of data types in the Crisp-2M dataset across seven editing categories.

Experimental Results

Quantitative Results

We evaluate EditMGT on four benchmarks: Emu Edit, MagicBrush, AnyBench, and GEdit-EN-full. Our model achieves state-of-the-art performance in image similarity while maintaining strong instruction adherence.

Key Results

  • CLIP Image Similarity: SOTA on all benchmarks with 1.1% improvement on MagicBrush
  • Style Transfer: 17.6% improvement over FluxKontext.dev
  • Style Change: 3.6% improvement over second-best method
  • Model Size: Only 960M parameters vs. 2-8× larger baselines
  • Speed: 6× faster editing compared to diffusion models

AnyBench Results
Figure 6: (a) AnyBench (local part) Results on DINOv2 scores. (b) AnyBench (global part and implicit part) Results on DINOv2 scores. (c) Ablation study on the dataset scale effects.

Attention Visualization

Varying the threshold λ shows how our region-hold sampling controls the extent of editing while preserving non-target regions.

Attention Analysis
Figure 7: Visualizations of editing results, GEdit Bench semantic scores, and L1 distances from original images across varying threshold λ. The relationship demonstrates how region-hold sampling controls editing extent.

Qualitative Comparisons

Visual comparisons demonstrate EditMGT's superior instruction comprehension, object attribute understanding, and structural preservation capabilities.

Qualitative Results
Figure 8: Qualitative comparisons between EditMGT and other open-source editing models. Our model demonstrates superior instruction comprehension and object attribute understanding.