Building PetPaint (Part 2): Why the Cat's Face Always Looks Wrong
From SD v1.5 to Flux.1 Kontext — an engineering journey through the identity preservation problem

This is the second part of the PetPaint engineering retrospective. Part 1 covered Phase 1 and Phase 2 — from local POC to production deployment. This part covers Phase 3 — improving image quality.
Where's the Problem?
At the end of Part 1, the product was live, but the output quality was terrible. The same cat photo produced completely different results every time — sometimes it looked like a long-haired cat, sometimes the eyes were unsettlingly large, sometimes the fur twisted into strange textures. The slipper in the background kept changing too — a wine glass one time, a flower the next. The outputs didn't just fail to look like the original cat — they didn't even look like the same animal. Parameter tuning couldn't fix it.
So where's the actual problem?
To answer that, I needed to understand what was happening inside the img2img pipeline.
Part 1 briefly covered the pipeline. Here's a closer look at SD v1.5's img2img pipeline, because the cause of the distortion is hidden in this process. The original image goes through five steps from input to output:
VAE Encoder: Compresses the original image from 512×512 pixel space to 64×64 latent space, retaining key features and discarding redundancy
Add noise: Mixes noise into the latent.
strengthcontrols how much noise is addedCLIP: Converts the text prompt into vectors the model can understand
UNet denoising: Combines prompt information and removes noise step by step, predicting "how much noise is here" at each step and subtracting it
VAE Decoder: Converts the denoised latent back to pixel space, producing the final image
The root cause of the cat face distortion is in the first two steps.
First Layer of Information Loss: VAE Compression
The VAE Encoder compresses the original image from 512×512×3 (about 786,000 numbers) to 64×64×4 (about 16,000 numbers).
Spatially, the VAE Encoder performs 3 rounds of downsampling, halving the dimensions each time (512→256→128→64), so every 8 pixels maps to 1 position in the latent. The number of channels changes from RGB's 3 to the latent's 4 — these 4 channels are abstract representations learned during VAE training, and don't directly correspond to human-interpretable colors.
Take the cat's eye as an example: the eye area occupies roughly 20×20 pixels in the original image, which maps to about 2.5×2.5 positions in latent space (20÷8=2.5). The numbers describing this area drop from about 1,200 (20×20×3 channels) to about 25 (2.5×2.5×4 channels).
This doesn't mean the cat eye "disappears." If you just compress with VAE and decompress without adding noise, the cat face is still recognizable — VAE reconstruction quality isn't that bad. But precision is reduced: details that 1,200 numbers could express now have only 25 numbers to carry them. The latent is clean and organized, just lower in resolution.
Second Layer of Information Loss: Adding Noise
Noise addition is an element-wise operation on the latent — every number at every position gets mixed with a random noise value. The original signal is weakened, noise is layered on top, and the two blend together. The higher the strength, the more noise dominates, and the harder it becomes to recover the original signal.
After the first layer of compression, the latent has lower precision but the signal is clean. After noise is added, the cat eye signal — already down to just 25 numbers — gets contaminated at every position.
During denoising, the model needs to predict how much noise is at each position and subtract it. For large, simple areas like the floor, this prediction is relatively easy. But for the cat's face, where details are complex, the situation is different — precision is already low, and the signal is corrupted by noise. The model struggles to accurately reconstruct the details. It can only fill in based on what it learned during training about "what cats generally look like" — this is the root cause of cat face distortion in SD v1.5. On top of that, the random noise is different every time the model runs, so the denoising starts from a different point, producing a different cat face each time — this is why the same photo produces different results every run.
With this root cause understood, I started experimenting with different approaches to solve it.
Attempt 1: ControlNet + Canny
If SD v1.5 lacks reference information about the cat face during denoising, could we feed it some extra structural signals?
That's what ControlNet does — it uses the Canny edge detection algorithm to extract edge lines from the original image, then injects these edge signals during the UNet denoising process to constrain the output's structure. The original img2img pipeline stays the same; ControlNet adds an extra information channel alongside it.
The results showed clear improvement in large-scale structure — the cat's body pose, limb positions, and even background accuracy were much better than SD v1.5's raw output. The overall facial structure also improved, but finer details like the eyes and nose were still off. And the output still wasn't the original cat — identity preservation remained unsolved. ControlNet helped the model nail the "big outline," but "what this specific cat looks like" was beyond its reach.
Why? Two reasons:
First, ControlNet provides additional structural guidance during denoising, but the degradation of cat face information happens before denoising starts — VAE compression reduces precision, and noise corrupts the signal. By the time UNet begins denoising, the cat face detail signal is already degraded. ControlNet can't reverse the information loss from those two steps; it can only provide extra structural reference on top of the already-degraded signal.
Second, Canny edge signals in the cat face area are more complex and noisier than body outlines. The cat face has fur, whiskers, and eyes — dense edge lines everywhere. The structural constraints ControlNet extracts from this aren't clean enough to meaningfully guide fine facial details.
Conclusion: ControlNet didn't solve the root problem — it provided additional structural constraints during denoising and improved large-scale accuracy, but the information loss from VAE compression and noise addition remained.
Attempt 2: SDXL + LoRA
ControlNet improved large-scale accuracy, but identity preservation was still unsolved. Going back to the root cause analysis: the first layer of information loss is VAE compression reducing precision — what if the latent resolution were higher?
SDXL's pipeline structure is essentially the same as SD v1.5; the key difference is higher native resolution. SD v1.5's native resolution is 512×512, with a 64×64 latent. SDXL bumps this to 1024×1024 native resolution, with a 128×128 latent. The spatial compression ratio stays the same (still 8x), but because the input image is twice as large, the same cat eye area goes from about 25 numbers (2.5×2.5×4 channels) to about 100 numbers (5×5×4 channels) in the latent — a 4x precision increase.
I also added an oil painting style LoRA (ClassiPeintXL) to improve the painting effect at the same time.
Result: the cat's shape and proportions were more natural than SD v1.5, but it still didn't look like the original cat — the output was "a cat," not "my cat." And I picked the wrong LoRA; ClassiPeintXL leaned more watercolor than oil painting, so the style wasn't what I wanted either.
There were also a few deployment issues along the way (missing PEFT library, LoRA local path loading, Dockerfile missing COPY for safetensors file), but these were engineering problems that don't affect the image quality analysis.
Conclusion: SDXL increased latent resolution, and the first layer of information loss was indeed reduced — the same area was described by more numbers. But the second layer remained: noise still corrupted these signals, and the model still had to guess the cat face details during denoising. LoRA only changes style; it doesn't solve identity preservation.
Attempt 3: IP-Adapter
After SDXL + LoRA, identity was still the problem. ControlNet's approach was to inject Canny edge lines during denoising — spatial structural information, telling the model "where the edges are." IP-Adapter took a different angle: instead of spatial structure, use semantic information.
The approach was to use a CLIP image encoder to encode the original image into a set of semantic feature vectors, then inject them into the UNet via cross-attention. These semantic features capture high-level information about the image — something like "an orange cat, lying on the floor" — telling the model "what's in the image" rather than "where the edges are." The two types of information serve different purposes, and in theory semantic information would be more targeted for identity preservation.
Running it kept hitting OOM. SDXL's base model plus IP-Adapter's additional CLIP image encoder and adapter weights exceeded the T4 small's 16GB VRAM. I tried enable_model_cpu_offload(), reducing image size, and variant="fp16" — none were enough. I ended up upgrading to an L4 (24GB VRAM) to get it running.
It ran. It produced images. But identity preservation showed little improvement — the cat still didn't look like the original.
Why wasn't semantic information enough? Two levels:
First, CLIP's training objective is to match images with text descriptions. The representations it learns are better at distinguishing between different categories (cat vs. dog, orange tabby vs. black cat) than at distinguishing between different individuals within the same category — like two different orange tabbies. IP-Adapter's semantic guidance can tell the model "this is an orange tabby," but the information about "what this specific orange tabby looks like" is insufficient, so its help with identity preservation is limited.
Second, like ControlNet, IP-Adapter intervenes at the denoising stage. The information degradation from VAE compression and noise addition has already happened. IP-Adapter's semantic guidance can improve the direction, but it can't make up for details that are already lost.
Conclusion: Whether it's edge lines (ControlNet) or semantic vectors (IP-Adapter), as long as information is being supplemented at the denoising stage, it can't fully compensate for the information loss from the first two steps.
At this point I realized: the core problem isn't in the denoising stage — it's in the noise addition step itself. As long as the original image is first noised then denoised, style transfer and identity preservation will always be in tension — high strength gives you oil painting style but distorts the cat, low strength preserves the cat but loses the painting effect.
The Turning Point: Flux.1 Kontext
Flux.1 Kontext uses a fundamentally different architecture — it has no "add noise to the original image" step.
The traditional img2img pipeline (SD v1.5, SDXL):
Two layers of information loss stacked, identity preservation fails.
Flux.1 Kontext's pipeline:
The key difference: the original image is not noised. Here's how it works step by step:
Step 1: The original image is compressed into a latent by VAE (same as SD v1.5), then the latent is split into a set of tokens. These are the input tokens — they represent the original image's information, and they're clean, not corrupted by noise.
Step 2: A separate set of output tokens is generated, initialized with random noise — this is the canvas the model will "paint" on.
Step 3: The input tokens and output tokens are concatenated into one sequence and fed into a Transformer. The Transformer processes this sequence with self-attention — every token can "see" every other token in the sequence.
This means: at every step of generation, the output tokens can directly reference the original image information in the input tokens (the cat's pose, eyes, fur color), while also referencing the text information from the prompt ("oil painting, thick brushstrokes").
Where does identity come from? From the input tokens — they stay clean throughout the entire process, and the output tokens can read the original image's details through attention at any time.
Where does the style come from? From the prompt guidance and the output tokens' own denoising process — the output tokens start from random noise, and the model learned during training to generate specific styles based on the prompt's direction.
In SD v1.5, identity and style are in conflict — because the original image is noised, wanting more style means adding more noise, and more noise means losing more identity. In Flux, these two come from different paths — identity is preserved through the input tokens, style is generated through the output tokens' denoising process, without interfering with each other.
VAE compression (the first layer of information loss) still exists, but eliminating the second layer of noise corruption alone was enough to make identity preservation work.
Quantized Deployment
Flux.1 Kontext's Transformer has 12B (12 billion) parameters. Each parameter stored in bfloat16 takes 2 bytes, so the Transformer weights alone require about 24GB of VRAM — and the L4 GPU only has 24GB total, with VAE, the T5 text encoder, and runtime activations still needing space. Loading it at full precision won't fit in memory.
The solution is quantization: compressing the Transformer weights from bfloat16 (16 bit) to 4-bit, so each parameter takes only 0.5 bytes. Weight size drops from about 24GB to about 6GB. The remaining space is more than enough for the other components.
The quantization scheme is NF4. With 4 bits, only 16 distinct values can be represented. NF4 distributes these 16 values according to a normal distribution — denser near zero, sparser at the extremes. Since neural network weights mostly cluster around zero, this distribution lets most weights find a close match, minimizing precision loss.
Only the Transformer is quantized — because it accounts for the vast majority of parameters. The VAE and T5 text encoder are much smaller and are loaded normally in bfloat16.
# Define quantization config
nf4_config = BitsAndBytesConfig(
load_in_4bit=True, # Store weights in 4-bit
bnb_4bit_quant_type="nf4", # Use NF4 quantization scheme
bnb_4bit_compute_dtype=torch.bfloat16, # Convert back to bfloat16 for computation
)
# Load Transformer separately with quantization
transformer = FluxTransformer2DModel.from_pretrained(
"black-forest-labs/FLUX.1-Kontext-dev",
subfolder="transformer",
quantization_config=nf4_config,
torch_dtype=torch.bfloat16,
)
# Load full pipeline, plug in the quantized Transformer
pipe = FluxKontextPipeline.from_pretrained(
"black-forest-labs/FLUX.1-Kontext-dev",
transformer=transformer, # Use the quantized Transformer
torch_dtype=torch.bfloat16, # Other components in bfloat16
)
pipe.enable_model_cpu_offload() # Park unused components on CPU to save GPU memory
The result — the cat finally looked like the original cat.
Looking Back at the Journey
Each attempt made it clearer where the problem lay. The final conclusion: as long as the original image is first noised then denoised, style transfer and identity preservation will always be in tension. Flux eliminated the noise addition step for the original image, and that's why it worked.
What PetPaint Looks Like Now
Try the product: pet-paint.vercel.app
The backend runs on a paid GPU and isn't always online. If you'd like to try it, reach out to me on X (@czhoudev) and I'll turn it on for you.
The test photos throughout this series were of my other cat. But the reason I started this project was Mijiang. Now, I could finally paint his portrait too.
I miss you, Mijiang.



