Filter-Guided Diffusion for Controllable Image Generation

¹Cornell Tech, ²Cornell University

FGD connects diffusion models to classical filtering techniques to translate an image based on a text prompt.

Abstract

Recent advances in diffusion-based generative models have shown incredible promise for zero-shot image-to-image translation and editing. Most of these approaches work by combining or replacing network-specific features used in the generation of new images with those taken from the inversion of some guide image. Methods of this type are considered the current state of the art among training-free approaches, but they have some notable limitations: they tend to be costly in runtime and memory, and they often depend on deterministic sampling that limits variation in generated results. We propose Filter-Guided Diffusion (FGD), an alternative approach that leverages fast filtering operations during the diffusion process to support finer control over the strength and frequencies of guidance and can work with non-deterministic samplers to produce greater variety. With its efficiency, FGD can be sampled over multiple seeds and hyperparameters in less time than a single run of other SOTA methods to produce superior results based on structural and semantic metrics. We conduct extensive quantitative and qualitative experiments to evaluate the performance of FGD in translation tasks and also demonstrate its potential in localized editing when used with masks.


Paintings

"a painting of a cat in a red hat"

σspatial=3, σvalue=0.3, tend=15, δ=1.4, normalization: off

"a portrait of a dog"

σspatial=2, σvalue=1, tend=15, δ=1.4, normalization: off

"a painting of a dog in a wig"

σspatial=3, σvalue=0.3, tend=15, δ=1.0, normalization: off

"a painting of a bear playing the electric guitar"

σspatial=2, σvalue=1, tend=15, δ=1.2, normalization: off



"a watercolor of cats at the beach"

σspatial=2, σvalue=1, tend=15, δ=1.2, normalization: off

"a painting of a cat pouring milk"

σspatial=2, σvalue=1, tend=15, δ=1.2, normalization: off

"a painting of a dog in a blue headband"

σspatial=3, σvalue=0.3, tend=15, δ=1.0, normalization: off

"a painting of a cat in a black dress"

σspatial=3, σvalue=0.3, tend=15, δ=1.0, normalization: off

Pareidolia

"a photo of a cow"

σspatial=2, σvalue=1, tend=15, δ=1.4, normalization: on

"a photo of a bowl of fruit"

σspatial=3, σvalue=1, tend=15, δ=1.9, normalization: on

"a photo of an elephant"

σspatial=2, σvalue=1, tend=15, δ=1.6, normalization: on

"a photo of a turtle"

σspatial=2, σvalue=1, tend=15, δ=1.9, normalization: on

"a photo of a boat"

σspatial=3, σvalue=1, tend=15, δ=1.4, normalization: on

"a photo of a duck"

σspatial=2, σvalue=1, tend=15, δ=1.6, normalization: on

Food

"a photo of a pizza"

σspatial=3, σvalue=0.3, tend=15, δ=1.6, normalization: on

"a photo of a cake"

σspatial=2, σvalue=1, tend=15, δ=1.9, normalization: on

"a photo of steak"

σspatial=3, σvalue=0.3, tend=15, δ=1.6, normalization: on

"a photo of steak"

σspatial=3, σvalue=0.3, tend=15, δ=1.6, normalization: on

"a photo of meatballs"

σspatial=2, σvalue=1, tend=15, δ=1.6, normalization: on

"a photo of apartment buildings"

σspatial=2, σvalue=1, tend=15, δ=1.4, normalization: on

Landscapes

"a photo of a desert"

σspatial=2, σvalue=1, tend=15, δ=1.4, normalization: on

"a photo of a mountain"

σspatial=2, σvalue=1, tend=15, δ=1.6, normalization: on

"a photo of a sea"

σspatial=2, σvalue=1, tend=15, δ=1.6, normalization: on

"a photo of a mountain"

σspatial=2, σvalue=1, tend=15, δ=1.9, normalization: on

"a photo of an island"

σspatial=2, σvalue=1, tend=15, δ=1.4, normalization: on

"a photo of a desert"

σspatial=2, σvalue=1, tend=15, δ=1.9, normalization: on

Other Images

"a photo of a bear"

σspatial=3, σvalue=1, tend=15, δ=1.4, normalization: on

"a photo of a modern bedroom"

σspatial=2, σvalue=1, tend=15, δ=1.6, normalization: on

"a statue of a golden fish"

σspatial=2, σvalue=1, tend=15, δ=1.4, normalization: on

"a photo of a dog in the snow"

σspatial=3, σvalue=0.3, tend=15, δ=1.6, normalization: on


Video Gallery

We show a few videos in which we sweep the filter strength δ from 0 to 1.6.

"a painting of a dog in a red hat"

σspatial=3, σvalue=0.3, tend=15, δ=1.4, normalization: off

"a painting of a cat in a red hat"

σspatial=3, σvalue=0.3, tend=15, δ=1.4, normalization: off

"a photo of a dog"

σspatial=2, σvalue=1, tend=15, δ=1.4, normalization: off

"a photo of a rabbit"

σspatial=2, σvalue=1, tend=15, δ=1.4, normalization: off
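
The δ sweep used for the videos above can be sketched as a list of per-frame settings. This is only an illustrative sketch, not the authors' code: the `FGDSettings` class, the `sweep_delta` helper, and the frame count of 9 are all hypothetical names and values chosen for the example. Only the parameter names (σspatial, σvalue, tend, δ, normalization) and the sweep range of 0 to 1.6 come from this page.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class FGDSettings:
    # Hypothetical container; field names mirror the caption parameters on this page.
    sigma_spatial: float
    sigma_value: float
    t_end: int
    delta: float          # filter strength, the quantity swept in the videos
    normalization: bool


def sweep_delta(base, delta_max=1.6, num_frames=9):
    """Return one settings object per frame, with delta evenly spaced in [0, delta_max].

    num_frames is an assumed value; the page does not state how many frames are used.
    """
    if num_frames < 2:
        raise ValueError("num_frames must be at least 2")
    step = delta_max / (num_frames - 1)
    # Only delta changes between frames; every other setting is copied from `base`.
    return [replace(base, delta=round(i * step, 6)) for i in range(num_frames)]


# Settings taken from the first video caption; delta is overwritten by the sweep.
base = FGDSettings(sigma_spatial=3, sigma_value=0.3, t_end=15, delta=1.4, normalization=False)
frames = sweep_delta(base)
print([s.delta for s in frames])
```

Each entry in `frames` would then be passed to whatever sampling routine produces a single image, so that consecutive frames differ only in guidance strength.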