MaskBit: Embedding-free Image Generation via Bit Tokens

¹ByteDance · ²Technical University of Munich · ³MCML · ⁴Carnegie Mellon University


🔥 Highlights

1. We study the key ingredients of recent closed-source VQGAN tokenizers and develop a publicly available, reproducible, and high-performing VQGAN model, called VQGAN+, which improves rFID by 6.28 over the original VQGAN released three years ago.

2. Building on our improved tokenizer framework, we adopt modern Lookup-Free Quantization (LFQ). Analyzing the resulting latent representation, we observe that the embedding-free bit token representation exhibits highly structured semantics (see the sketch after this list).

3. Motivated by these findings, we develop MaskBit, a novel embedding-free generation framework that operates directly on bit tokens and achieves state-of-the-art performance on the ImageNet 256×256 class-conditional image generation benchmark.
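The core idea behind bit tokens is to quantize each channel of the encoder output to a sign bit, with no codebook or embedding table involved. Below is a minimal PyTorch sketch of this LFQ-style binary quantization; the function names and the straight-through gradient trick are illustrative assumptions, not the paper's exact implementation.

```python
import torch

def binary_quantize(z: torch.Tensor) -> torch.Tensor:
    """Quantize each channel of the encoder output (B, C, H, W) to {-1, +1} by its sign."""
    z_q = torch.where(z > 0, torch.ones_like(z), -torch.ones_like(z))
    # Straight-through estimator: gradients pass through as if quantization were identity.
    return z + (z_q - z).detach()

def bits_to_indices(z_q: torch.Tensor) -> torch.Tensor:
    """Pack the C sign bits at each spatial position into an integer token id."""
    bits = (z_q > 0).long()                               # (B, C, H, W) in {0, 1}
    weights = 2 ** torch.arange(bits.shape[1], device=bits.device)
    return (bits * weights.view(1, -1, 1, 1)).sum(dim=1)  # (B, H, W)
```

With C bits per position, each spatial location carries one of 2^C possible tokens, yet no 2^C-entry embedding table ever needs to be stored or learned.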



Abstract

Masked transformer models for class-conditional image generation have become a compelling alternative to diffusion models. Typically comprising two stages (an initial VQGAN model for transitioning between latent space and image space, and a subsequent Transformer model for image generation within latent space), these frameworks offer promising avenues for image synthesis. In this study, we present two primary contributions: first, an empirical and systematic examination of VQGANs, leading to a modernized VQGAN; second, a novel embedding-free generation network operating directly on bit tokens, a binary quantized representation of tokens with rich semantics. The first contribution furnishes a transparent, reproducible, and high-performing VQGAN model, enhancing accessibility and matching the performance of current state-of-the-art methods while revealing previously undisclosed details. The second contribution demonstrates that embedding-free image generation using bit tokens achieves a new state-of-the-art FID of 1.52 on the ImageNet 256×256 benchmark, with a compact generator model of only 305M parameters.

MaskBit Framework Overview

[Figure: MaskBit framework overview]

High-level overview of the architecture and comparison. Our training framework comprises two stages for image generation. In Stage-I, an encoder-decoder network compresses images into a latent representation and decodes them back. Stage-II masks the tokens, feeds them into a transformer, and predicts the masked tokens. Most prior art uses VQGAN-based methods (top), which learn independent embedding tables in both stages; only the indices into these tables are shared across stages, not the embeddings themselves. MaskBit, in contrast, uses no embedding table in either stage. Stage-I predicts bit tokens by applying binary quantization directly to the encoder output. Stage-II partitions the shared bit tokens into groups (e.g., 2 groups), masks them, feeds them into a transformer, and predicts the masked bit tokens, as sketched below.
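To make the Stage-II grouping and masking concrete, here is a minimal sketch. The tensor shapes, the independent per-group masking, and the zero fill value are assumptions for illustration; the paper's masking schedule and mask representation may differ.

```python
import torch

def mask_bit_token_groups(bit_tokens: torch.Tensor, num_groups: int = 2,
                          mask_ratio: float = 0.5):
    """Split each token's bits into groups and mask a random subset of groups.

    bit_tokens: (B, N, C) tensor of N tokens with C bits each, values in {-1, +1}.
    """
    B, N, C = bit_tokens.shape
    groups = bit_tokens.view(B, N, num_groups, C // num_groups)
    # Each (token, group) position is masked independently; masked groups are
    # the prediction targets for the Stage-II transformer.
    mask = torch.rand(B, N, num_groups, device=bit_tokens.device) < mask_ratio
    masked = groups.clone()
    masked[mask] = 0.0  # a neutral fill value stands in for the mask token (assumption)
    return masked.view(B, N, C), mask
```

Because the bit tokens themselves are shared between Stage-I and Stage-II, the transformer's prediction targets are exactly the representation the decoder consumes, with no re-embedding step in between.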

Demystifying VQGAN training

Roadmap to build a modern VQGAN+. This overview summarizes the performance gain achieved by each proposed change to the architecture and training recipe. The reconstruction FID (rFID) is computed against the ImageNet validation split at 256×256 resolution; the popular, open-source Taming-VQGAN serves as the baseline and starting point.
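For reference, rFID compares tokenizer reconstructions against the original validation images. A hedged sketch of this evaluation using torchmetrics is shown below; `tokenizer` and `val_loader` are placeholders, not the paper's code.

```python
import torch
from torchmetrics.image.fid import FrechetInceptionDistance

# normalize=True: inputs are float images in [0, 1]
fid = FrechetInceptionDistance(feature=2048, normalize=True)

with torch.no_grad():
    for images, _ in val_loader:  # images: (B, 3, 256, 256) in [0, 1]
        recons = tokenizer.decode(tokenizer.encode(images))
        fid.update(images, real=True)
        fid.update(recons.clamp(0, 1), real=False)

print(f"rFID: {fid.compute().item():.2f}")
```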

Bit tokens are structured semantic representations

We visualize a robustness test involving bit flipping. Specifically, we encode images into bit tokens, where each token is represented by 12 bits in this example. We then flip the i-th bit of every token and reconstruct the images as usual. Interestingly, the images reconstructed from these bit-flipped tokens remain semantically consistent with the original, exhibiting only minor visual modifications such as changes in texture, exposure, smoothness, color palette, or painterly quality.
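Since bits in {-1, +1} are flipped by negation, the test itself is a one-liner. A minimal sketch, assuming bit tokens of shape (B, C, H, W) with C = 12 and a placeholder `tokenizer`:

```python
import torch

def flip_bit(bit_tokens: torch.Tensor, i: int) -> torch.Tensor:
    """Flip the i-th bit of every token by negating that bit plane."""
    flipped = bit_tokens.clone()
    flipped[:, i] = -flipped[:, i]
    return flipped

# Reconstruct once per bit plane to probe what each bit encodes:
# for i in range(12):
#     images_i = tokenizer.decode(flip_bit(bit_tokens, i))
```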

Main Experiment Results

[Table: model comparison on the ImageNet 256×256 class-conditional generation benchmark]

BibTeX

@article{weber2024maskbit,
  author    = {Mark Weber and Lijun Yu and Qihang Yu and Xueqing Deng and Xiaohui Shen and Daniel Cremers and Liang-Chieh Chen},
  title     = {MaskBit: Embedding-free Image Generation via Bit Tokens},
  journal   = {arXiv:2409.16211},
  year      = {2024}
}