[Bug]: SDXL Models Fail After WebUI Restart Due to VAE Error Recovery Not Persisting #17126

@mlworks90

Checklist

  • The issue exists after disabling all extensions
  • The issue exists on a clean installation of webui
  • The issue is caused by an extension, but I believe it is caused by a bug in the webui
  • The issue exists in the current version of the webui
  • The issue has not been reported before recently
  • The issue has been reported before but has not been fixed yet

What happened?

SDXL models generate properly on a fresh WebUI installation but consistently fail after a WebUI restart, producing corrupted output ("color splashes" or grey boxes). SD 1.5 models remain unaffected. The issue appears to be related to WebUI's automatic VAE error recovery mechanism not initializing properly on restart for SDXL models.

WebUI Version: v1.10.1 (Commit: 82a973c)
Hardware: NVIDIA GeForce RTX 3050 (6GB VRAM)
CUDA: 12.9
PyTorch: 2.1.2+cu121
Python: 3.10.6
OS: Windows

Steps to reproduce the problem

Consistent reproduction:

  1. Start from a fresh WebUI installation
  2. Load any SDXL model
  3. Generate an image → Works perfectly
  4. Close the WebUI terminal completely
  5. Restart WebUI (the failure may take one or two restarts to appear)
  6. Load the same SDXL model
  7. Generate an image → Produces corrupted output (color splashes/grey boxes)

Key Observations:

  • SD 1.5 models continue working normally after restart
  • Some specific SDXL models remain stable across restarts (model-dependent)
  • Loading certain SD 1.5 models can "fix" SDXL generation but contaminates output with the SD 1.5 model's characteristics
  • Issue persists across different SDXL checkpoints

What should have happened?

SDXL models should generate consistently after WebUI restarts, with the same automatic VAE error recovery that occurs on fresh installation.

What browsers do you use to access the UI?

Mozilla Firefox

Sysinfo

sysinfo-2025-09-14-09-57.json

Console logs

C:\webui_fresh>webui-user.bat
venv "C:\webui_fresh\venv\Scripts\Python.exe"
Python 3.10.6 (tags/v3.10.6:9c7b4bd, Aug  1 2022, 21:53:49) [MSC v.1932 64 bit (AMD64)]
Version: v1.10.1
Commit hash: 82a973c04367123ae98bd9abdf80d9eda9b910e2
Launching Web UI with arguments: --xformers
C:\webui_fresh\venv\lib\site-packages\timm\models\layers\__init__.py:48: FutureWarning: Importing from timm.models.layers is deprecated, please import via timm.layers
  warnings.warn(f"Importing from {__name__} is deprecated, please import via timm.layers", FutureWarning)
Loading weights [3343bb16c2] from C:\webui_fresh\models\Stable-diffusion\SDXL_madejanss.safetensors
Creating model from config: C:\webui_fresh\configs\v1-inference.yaml
C:\webui_fresh\venv\lib\site-packages\huggingface_hub\file_download.py:945: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
  warnings.warn(
Running on local URL:  http://127.0.0.1:7860

To create a public link, set `share=True` in `launch()`.
C:\webui_fresh\venv\lib\site-packages\huggingface_hub\file_download.py:945: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
  warnings.warn(
Startup time: 12.5s (prepare environment: 3.7s, import torch: 4.3s, import gradio: 1.1s, setup paths: 0.8s, initialize shared: 0.3s, other imports: 0.5s, load scripts: 0.9s, create ui: 0.4s, gradio launch: 0.5s).
Applying attention optimization: xformers... done.
Model loaded in 2.0s (create model: 1.3s, apply weights to model: 0.2s, calculate empty prompt: 0.3s).
100%|██████████████████████████████████████████████████████████████████████████████████| 20/20 [00:04<00:00,  4.46it/s]
Total progress: 100%|██████████████████████████████████████████████████████████████████| 20/20 [00:04<00:00,  4.50it/s]
Total progress: 100%|██████████████████████████████████████████████████████████████████| 20/20 [00:04<00:00,  4.87it/s]
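
When comparing a working session against a failing one, it may help to pull out which checkpoint and which config file each session paired together. The following is a minimal sketch, assuming the console output keeps the `Loading weights [...] from ...` and `Creating model from config: ...` line formats shown in the log above. The function name `extract_model_loads` and the regular expressions are illustrative and not part of WebUI.

```python
import re

# Patterns based only on the two line formats visible in the log above.
WEIGHTS_RE = re.compile(r"^Loading weights \[(?P<hash>[0-9a-f]+)\] from (?P<path>.+)$")
CONFIG_RE = re.compile(r"^Creating model from config: (?P<path>.+)$")


def extract_model_loads(log_text):
    """Return a list of (short_hash, checkpoint_path, config_path) tuples."""
    loads = []
    pending = None  # last "Loading weights" line not yet paired with a config
    for line in log_text.splitlines():
        line = line.strip()
        match = WEIGHTS_RE.match(line)
        if match:
            pending = (match.group("hash"), match.group("path"))
            continue
        match = CONFIG_RE.match(line)
        if match and pending is not None:
            loads.append((pending[0], pending[1], match.group("path")))
            pending = None
    return loads
```

Run against the log above, this would report `SDXL_madejanss.safetensors` together with `v1-inference.yaml`, which can then be compared with the pairing printed in a session where generation works.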

Additional information

VAE Error Recovery Pattern: The key indicator of working vs failing state is the presence of these console messages:

Working State (Fresh Install):
A tensor with all NaNs was produced in VAE.
Web UI will now convert VAE into 32-bit float and retry.
To disable this behavior, disable the 'Automatically revert VAE to 32-bit floats' setting.
To always start with 32-bit VAE, use --no-half-vae commandline flag.

Failing State (After Restart):

  • The VAE error messages above do NOT appear
  • SDXL models fail to generate properly
  • Corruption occurs during the generation process
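
The working/failing distinction above can be checked mechanically against a saved console log. This is a minimal sketch, assuming the log has been captured to a string; the marker strings are copied from the "Working State" messages quoted above, and the helper name `vae_recovery_triggered` is hypothetical.

```python
# Marker strings copied from the "Working State" console messages above.
NAN_MARKER = "A tensor with all NaNs was produced in VAE."
RETRY_MARKER = "Web UI will now convert VAE into 32-bit float and retry."


def vae_recovery_triggered(log_text):
    """Return True if both the NaN report and the float32 retry notice appear."""
    return NAN_MARKER in log_text and RETRY_MARKER in log_text
```

A session whose log returns `False` here while still producing corrupted SDXL output would match the failing state described in this report.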

Hypothesis

WebUI's automatic VAE error recovery mechanism, which converts the VAE to 32-bit floats, is not initializing properly on restart for SDXL models. This protective conversion prevents VAE corruption on a fresh install but fails to trigger in subsequent sessions.

Attempted Solutions (All Failed)

  • --no-half-vae flag
  • --medvram with various memory optimizations
  • NumPy downgrade to <2.0
  • Clearing model cache and pycache directories
  • PyTorch reinstallation with specific CUDA versions
  • Various precision flags (--no-half, --upcast-sampling, etc.)
  • Model configuration resets

Additional Context

Model-Specific Behavior:

  • Some SDXL models maintain stability across restarts
  • Others fail immediately after any model switching
  • The corruption appears when WebUI stops automatically handling VAE NaN errors

SD 1.5 Model "Fixing" Pattern:

  • Loading specific SD 1.5 models can restore SDXL functionality
  • However, SDXL output becomes contaminated with characteristics of the SD 1.5 model
  • This suggests model state cross-contamination during loading
