Add Stable Diffusion 1.5 support with data-parallel inference by csgoogle · Pull Request #418 · AI-Hypercomputer/maxdiffusion

csgoogle · 2026-06-13T14:10:36Z

Summary

Adds Stable Diffusion 1.5 to MaxDiffusion and makes its inference data parallel.

configs/base15.yml (new) — SD 1.5 config. Mirrors base14.yml (same architecture, different weights), points at the stable-diffusion-v1-5 checkpoint, sets from_pt: True (upstream ships PyTorch weights only), and defaults to the checkpoint's PNDM / epsilon scheduler.
generate.py
- Build the sampler from the checkpoint's scheduler config via create_scheduler (config-driven, mirrors the SDXL path) instead of a hardcoded FlaxDDIMScheduler, and iterate the full PNDM schedule (skip_prk_steps emits one extra timestep).
- Shard the latent batch over the data axis using with_sharding_constraint + a batch-sharded out_shardings, so GSPMD propagates data-parallelism through the whole UNet/VAE instead of replicating the entire batch on every device. A single get_batch_sharding helper is the source of truth and replicates for sub-device batches (per_device_batch_size < 1).
- override_scheduler_config is now tolerant of scheduler configs that omit keys (e.g. SD 1.5's older PNDM config).
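
The batch-sharding bullet above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the signature and body of get_batch_sharding are assumptions (the PR only names the helper), the mesh has a single "data" axis, and denoise is a stand-in for the UNet/VAE.

```python
import jax
import jax.numpy as jnp
import numpy as np
from jax.sharding import Mesh, NamedSharding, PartitionSpec as P


def get_batch_sharding(mesh, batch_size):
  """Hypothetical helper: shard dim 0 over "data", else replicate."""
  n_data = mesh.shape["data"]
  if batch_size >= n_data and batch_size % n_data == 0:
    return NamedSharding(mesh, P("data"))
  # Sub-device batch (fewer samples than devices): keep it replicated.
  return NamedSharding(mesh, P())


# One-dimensional mesh over all visible devices.
mesh = Mesh(np.array(jax.devices()), axis_names=("data",))
batch_size = 2 * jax.device_count()
batch_sharding = get_batch_sharding(mesh, batch_size)


def denoise(latents):
  # Constrain the batch dimension inside the jitted function so the
  # partitioner propagates the sharding through the computation.
  latents = jax.lax.with_sharding_constraint(latents, batch_sharding)
  return latents * 0.5 + 1.0  # stand-in for the UNet/VAE work


# out_shardings pins the result to the same batch-sharded layout.
run = jax.jit(denoise, out_shardings=batch_sharding)
out = run(jnp.ones((batch_size, 4, 8, 8)))
```

Inspecting `out.sharding` after the call is one way to check that the result is laid out along the data axis rather than replicated.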

Why

The previous generate path declared a data mesh axis but the program ran fully replicated (device_put inside the jit is a weak hint GSPMD ignored), so each chip recomputed the whole batch. Forcing the batch sharding makes inference genuinely data parallel.

Performance

TPU7x (8 chips), SD 1.5 1024px, 20-step PNDM, 8 images: 0.130s. Scales cleanly to larger batches.

Test plan

Generates coherent images matching the prompts.

Prompt: "A cinematic photo of a glass greenhouse on a snowy mountain"

github-actions · 2026-06-13T14:10:48Z

e2e testgrid: https://8bcf50593faf4ea38060e236169827e5-dot-us-central1.composer.googleusercontent.com/dags/maxdiffusion_tpu_e2e/grid

github-actions · 2026-06-13T14:35:11Z

🤖 Hi @Perseus14, I've received your request, and I'm working on it now! You can track my progress in the logs for more details.

github-actions · 2026-06-13T14:40:09Z

🤖 I'm sorry @Perseus14, but I was unable to process your request. Please see the logs for more details.

github-actions · 2026-06-13T14:41:21Z

🤖 Hi @Perseus14, I've received your request, and I'm working on it now! You can track my progress in the logs for more details.

github-actions

📋 Review Summary

This PR successfully adds support for Stable Diffusion 1.5 to MaxDiffusion and implements genuine data-parallel inference by sharding latents and context along the data mesh axis. The changes are elegant, well-structured, and provide significant performance improvements by avoiding redundant recomputation of batches across devices.

🔍 General Feedback

Excellent Alignment with Checkpoint Semantics: Iterating over the actual timesteps shape in the loop instead of hardcoded steps ensures perfect alignment with schedulers like PNDM which emit an extra timestep when skip_prk_steps is enabled.
Robust Parallelism: Forcing genuine batch-sharding constraints is a clean and robust approach to propagation of data parallelism across UNet and VAE layers.
Configuration-driven Design: Transitioning scheduler instantiation to use config-driven create_scheduler mirrors the SDXL pipeline beautifully and increases code reuse.
Robustness Improvement: An issue was identified in the config-override fallback mechanism where explicit falsy overrides (such as False) were ignored, and checkpoint defaults were bypassed. An inline code suggestion has been provided to resolve this.
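
The falsy-override issue in the last item can be illustrated with a small sketch. The inline suggestion itself is not shown on this page, so both functions below are hypothetical reconstructions of the pattern: an `or`-based fallback drops an explicit False, while an `is not None` check keeps it and still tolerates checkpoint configs that omit keys.

```python
def merge_with_or(checkpoint_cfg, overrides):
  """Buggy pattern (hypothetical): `or` treats False like "not set"."""
  merged = dict(checkpoint_cfg)
  for key, value in overrides.items():
    merged[key] = value or checkpoint_cfg.get(key)
  return merged


def merge_with_none_check(checkpoint_cfg, overrides):
  """Fixed pattern (hypothetical): only None means "not set"."""
  merged = dict(checkpoint_cfg)
  for key, value in overrides.items():
    if value is not None:
      merged[key] = value
  return merged


# The checkpoint config enables skip_prk_steps and omits prediction_type.
checkpoint_cfg = {"skip_prk_steps": True}
overrides = {"skip_prk_steps": False, "prediction_type": None}

buggy = merge_with_or(checkpoint_cfg, overrides)
fixed = merge_with_none_check(checkpoint_cfg, overrides)
```

Here `buggy["skip_prk_steps"]` falls back to the checkpoint's True even though the user asked for False, while `fixed["skip_prk_steps"]` keeps the explicit False and leaves the absent prediction_type key untouched.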

Add base15.yml for SD 1.5 (PyTorch weights via from_pt, PNDM/epsilon scheduler) and wire generate.py to it:
- Build the sampler from the checkpoint's scheduler config via create_scheduler instead of a hardcoded DDIM scheduler, and iterate the full PNDM schedule (skip_prk_steps emits one extra timestep).
- Shard the latent batch over the data axis with sharding constraints plus out_shardings so inference runs data parallel instead of replicating the whole batch on every device. Sub-device batches replicate.
- Make override_scheduler_config tolerant of scheduler configs that omit keys (e.g. SD 1.5's PNDM config).
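
As a rough sketch of iterating the full schedule, the loop bound can come from the timesteps array rather than the requested step count. Everything here is an assumption for illustration: the timesteps array is a toy stand-in with one extra entry (not produced by the real scheduler), and the step function only scales the latents.

```python
import jax
import jax.numpy as jnp

num_inference_steps = 20
# Toy schedule with one more entry than requested, mimicking the extra
# timestep described above. Not generated by an actual PNDM scheduler.
timesteps = jnp.linspace(999, 0, num_inference_steps + 1).astype(jnp.int32)


def step(i, carry):
  latents, steps_run, _ = carry
  t = timesteps[i]
  latents = latents * 0.9  # stand-in for the UNet call and scheduler step
  return latents, steps_run + 1, t


init = (jnp.ones((2, 4)), jnp.int32(0), jnp.int32(-1))
# Bound the loop by the schedule length, not by num_inference_steps,
# so the final timestep is not dropped.
latents, steps_run, last_t = jax.lax.fori_loop(0, timesteps.shape[0], step, init)
```

With a bound of num_inference_steps the loop would stop one entry short of the end of this schedule.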

github-actions · 2026-06-13T16:31:04Z

🤖 Hi @csgoogle, I've received your request, and I'm working on it now! You can track my progress in the logs for more details.

github-actions · 2026-06-13T16:33:19Z

🤖 I'm sorry @csgoogle, but I was unable to process your request. Please see the logs for more details.

github-actions · 2026-06-14T17:31:34Z

🤖 Hi @Perseus14, I've received your request, and I'm working on it now! You can track my progress in the logs for more details.

github-actions · 2026-06-14T17:34:12Z

🤖 I'm sorry @Perseus14, but I was unable to process your request. Please see the logs for more details.

csgoogle requested a review from entrpn as a code owner June 13, 2026 14:10

Perseus14 added the gemini-review label Jun 13, 2026

Perseus14 added gemini-review and removed gemini-review labels Jun 13, 2026

github-actions Bot reviewed Jun 13, 2026

Comment thread src/maxdiffusion/maxdiffusion_utils.py Outdated

csgoogle force-pushed the add-sd15-support branch 2 times, most recently from 5de6089 to 308e742 Compare June 13, 2026 16:24

csgoogle force-pushed the add-sd15-support branch from 308e742 to 164ca87 Compare June 13, 2026 16:28

csgoogle added gemini-review and removed gemini-review labels Jun 13, 2026

csgoogle self-assigned this Jun 13, 2026

Perseus14 added gemini-review and removed gemini-review labels Jun 14, 2026

csgoogle wants to merge 1 commit into main from add-sd15-support

2 participants
