Filter-Guided Diffusion for Controllable Image Generation

Zeqi Gu*¹,², Ethan Yang*², Abe Davis²

¹Cornell Tech, ²Cornell University

FGD connects diffusion models to classical filtering techniques to translate an image based on a text prompt.

Abstract

Recent advances in diffusion-based generative models have shown incredible promise for zero-shot image-to-image translation and editing. Most of these approaches work by combining or replacing network-specific features used in the generation of new images with those taken from the inversion of some guide image. Methods of this type are considered the current state of the art in training-free approaches, but have some notable limitations: they tend to be costly in runtime and memory, and often depend on deterministic sampling that limits variation in generated results. We propose Filter-Guided Diffusion (FGD), an alternative approach that leverages fast filtering operations during the diffusion process to support finer control over the strength and frequencies of guidance and can work with non-deterministic samplers to produce greater variety. With its efficiency, FGD can be sampled over multiple seeds and hyperparameters in less time than a single run of other SOTA methods to produce superior results based on structural and semantic metrics. We conduct extensive quantitative and qualitative experiments to evaluate the performance of FGD in translation tasks and also demonstrate its potential in localized editing when used with masks.
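
This page does not spell out the update rule, so the following is only an illustrative sketch and not the authors' implementation. It assumes that σ_spatial and σ_value in the captions are the spatial and value (range) widths of a bilateral filter, which is the conventional meaning of those names, and that δ scales how strongly the filtered structure of the guide image is pushed into the current estimate. The function names `bilateral_filter` and `guided_step`, and the residual form of the update, are hypothetical choices made for illustration.

```python
import math


def bilateral_filter(image, guide, sigma_spatial, sigma_value):
    """Brute-force joint bilateral filter on 2D lists of floats.

    Weights combine spatial distance with value differences measured in
    `guide`, so smoothing is suppressed across the guide's edges.
    """
    height, width = len(image), len(image[0])
    radius = math.ceil(3 * sigma_spatial)
    out = [[0.0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            total, weight_sum = 0.0, 0.0
            for ny in range(max(0, y - radius), min(height, y + radius + 1)):
                for nx in range(max(0, x - radius), min(width, x + radius + 1)):
                    spatial = ((ny - y) ** 2 + (nx - x) ** 2) / (2 * sigma_spatial ** 2)
                    value = (guide[ny][nx] - guide[y][x]) ** 2 / (2 * sigma_value ** 2)
                    w = math.exp(-(spatial + value))
                    total += w * image[ny][nx]
                    weight_sum += w
            # The center pixel always contributes weight 1, so weight_sum > 0.
            out[y][x] = total / weight_sum
    return out


def guided_step(estimate, guide, delta, sigma_spatial=3.0, sigma_value=0.3):
    """Assumed update: add delta times the filtered (guide - estimate) residual."""
    target = bilateral_filter(guide, guide, sigma_spatial, sigma_value)
    current = bilateral_filter(estimate, guide, sigma_spatial, sigma_value)
    return [
        [e + delta * (t - c) for e, t, c in zip(e_row, t_row, c_row)]
        for e_row, t_row, c_row in zip(estimate, target, current)
    ]


if __name__ == "__main__":
    # Toy 4x4 guide with a vertical edge, and a flat gray estimate.
    guide = [[0.0, 0.0, 1.0, 1.0] for _ in range(4)]
    estimate = [[0.5] * 4 for _ in range(4)]
    for row in guided_step(estimate, guide, delta=1.0):
        print([round(v, 3) for v in row])
```

In this sketch, δ = 0 leaves the estimate untouched, and larger values pull its filtered structure further toward that of the guide. A practical version would use a fast bilateral approximation rather than this quadratic loop, and would apply the step only during part of the sampling schedule (presumably what t_end controls, though the page does not define it).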

Paintings

"a painting of a cat in a red hat"

σ_spatial=3, σ_value=0.3, t_end=15, δ=1.4, normalization: off

"a portrait of a dog"

σ_spatial=2, σ_value=1, t_end=15, δ=1.4, normalization: off

"a painting of a dog in a wig"

σ_spatial=3, σ_value=0.3, t_end=15, δ=1.0, normalization: off

"a painting of a bear playing the electric guitar"

σ_spatial=2, σ_value=1, t_end=15, δ=1.2, normalization: off

"a watercolor of cats at the beach"

σ_spatial=2, σ_value=1, t_end=15, δ=1.2, normalization: off

"a painting of a cat pouring milk"

σ_spatial=2, σ_value=1, t_end=15, δ=1.2, normalization: off

"a painting of a dog in a blue headband"

σ_spatial=3, σ_value=0.3, t_end=15, δ=1.0, normalization: off

"a painting of a cat in a black dress"

σ_spatial=3, σ_value=0.3, t_end=15, δ=1.0, normalization: off

Pareidolia

"a photo of a cow"

σ_spatial=2, σ_value=1, t_end=15, δ=1.4, normalization: on

"a photo of a bowl of fruit"

σ_spatial=3, σ_value=1, t_end=15, δ=1.9, normalization: on

"a photo of an elephant"

σ_spatial=2, σ_value=1, t_end=15, δ=1.6, normalization: on

"a photo of a turtle"

σ_spatial=2, σ_value=1, t_end=15, δ=1.9, normalization: on

"a photo of a boat"

σ_spatial=3, σ_value=1, t_end=15, δ=1.4, normalization: on

"a photo of a duck"

σ_spatial=2, σ_value=1, t_end=15, δ=1.6, normalization: on

Food

"a photo of a pizza"

σ_spatial=3, σ_value=0.3, t_end=15, δ=1.6, normalization: on

"a photo of a cake"

σ_spatial=2, σ_value=1, t_end=15, δ=1.9, normalization: on

"a photo of steak"

σ_spatial=3, σ_value=0.3, t_end=15, δ=1.6, normalization: on

"a photo of steak"

σ_spatial=3, σ_value=0.3, t_end=15, δ=1.6, normalization: on

"a photo of meatballs"

σ_spatial=2, σ_value=1, t_end=15, δ=1.6, normalization: on

"a photo of apartment buildings"

σ_spatial=2, σ_value=1, t_end=15, δ=1.4, normalization: on

Landscapes

"a photo of a desert"

σ_spatial=2, σ_value=1, t_end=15, δ=1.4, normalization: on

"a photo of a mountain"

σ_spatial=2, σ_value=1, t_end=15, δ=1.6, normalization: on

"a photo of a sea"

σ_spatial=2, σ_value=1, t_end=15, δ=1.6, normalization: on

"a photo of a mountain"

σ_spatial=2, σ_value=1, t_end=15, δ=1.9, normalization: on

"a photo of an island"

σ_spatial=2, σ_value=1, t_end=15, δ=1.4, normalization: on

"a photo of a desert"

σ_spatial=2, σ_value=1, t_end=15, δ=1.9, normalization: on

Other Images

"a photo of a bear"

σ_spatial=3, σ_value=1, t_end=15, δ=1.4, normalization: on

"a photo of a modern bedroom"

σ_spatial=2, σ_value=1, t_end=15, δ=1.6, normalization: on

"a statue of a golden fish"

σ_spatial=2, σ_value=1, t_end=15, δ=1.4, normalization: on

"a photo of a dog in the snow"

σ_spatial=3, σ_value=0.3, t_end=15, δ=1.6, normalization: on

Video Gallery

We show a few videos where we sweep the filter strength δ from 0 to 1.6.

"a painting of a dog in a red hat"

σ_spatial=3, σ_value=0.3, t_end=15, δ=1.4, normalization: off

"a painting of a cat in a red hat"

σ_spatial=3, σ_value=0.3, t_end=15, δ=1.4, normalization: off

"a photo of a dog"

σ_spatial=2, σ_value=1, t_end=15, δ=1.4, normalization: off

"a photo of a rabbit"

σ_spatial=2, σ_value=1, t_end=15, δ=1.4, normalization: off
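
The δ sweep described at the top of this gallery can be sketched as a list of per-frame settings. This is a minimal illustration, not the authors' tooling: the `FGDSettings` container, the helper names, and the choice of 9 frames are assumptions, while the sweep range (0 to 1.6) and the other field values are taken from the captions on this page.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class FGDSettings:
    """Hypothetical container mirroring the caption fields on this page."""
    prompt: str
    sigma_spatial: float
    sigma_value: float
    t_end: int
    delta: float
    normalization: bool


def delta_sweep(start=0.0, stop=1.6, num_frames=9):
    """Evenly spaced filter strengths from start to stop, inclusive."""
    if num_frames < 2:
        raise ValueError("num_frames must be at least 2")
    step = (stop - start) / (num_frames - 1)
    return [round(start + i * step, 6) for i in range(num_frames)]


def sweep_settings(base, num_frames=9):
    """One settings object per video frame, varying only delta."""
    return [replace(base, delta=d) for d in delta_sweep(num_frames=num_frames)]


if __name__ == "__main__":
    base = FGDSettings(
        prompt="a painting of a cat in a red hat",
        sigma_spatial=3,
        sigma_value=0.3,
        t_end=15,
        delta=0.0,           # overwritten per frame by the sweep
        normalization=False,
    )
    for settings in sweep_settings(base):
        print(settings.delta, settings.prompt)
```

Keeping every field except δ fixed across frames isolates the effect of filter strength, which is what a sweep video is meant to show.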