This repository was archived by the owner on May 22, 2026. It is now read-only.
Hi!
I am working on migrating this great work to SDXL, starting with AFA. However, I found that neither direct cascade nor AFA works.
I am using Clip-ViT-big-G (projected to 1280) and Clip-ViT-Large (projected to 768) for the image embeddings, and I concatenate them so that 768 + 1280 = 2048 matches the text embedding size. The concatenated vector is treated as a single token, which I pass to the attention processors. Inside the attention, the image embedding is repeated 77 times to match the text features.
However, even with the direct concatenation method, it still does not work. Any suggestions?
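
For reference, the shape handling described above can be sketched roughly as follows. This is only a minimal illustration, not the actual training code: NumPy arrays stand in for the framework tensors, random values stand in for the projected CLIP embeddings, and `build_image_context` is a hypothetical helper name used only for this example.

```python
import numpy as np

BATCH = 2
TEXT_TOKENS = 77            # text sequence length the image token is repeated to
DIM_L, DIM_G = 768, 1280    # projection sizes mentioned in the post


def build_image_context(emb_l, emb_g, num_tokens=TEXT_TOKENS):
    """Concatenate the two projected image embeddings and repeat them per token.

    Hypothetical helper: emb_l has shape (B, 768), emb_g has shape (B, 1280).
    """
    # (B, 768) + (B, 1280) -> (B, 2048): one image "token" per sample
    token = np.concatenate([emb_l, emb_g], axis=-1)
    # (B, 2048) -> (B, 1, 2048) -> (B, 77, 2048): same token at every position
    return np.repeat(token[:, None, :], num_tokens, axis=1)


# Random placeholders for the projected image embeddings (assumption).
rng = np.random.default_rng(0)
emb_l = rng.standard_normal((BATCH, DIM_L))
emb_g = rng.standard_normal((BATCH, DIM_G))

image_context = build_image_context(emb_l, emb_g)
print(image_context.shape)
```

Note that after the repeat every one of the 77 positions carries the same vector, so the image context has no variation along the token axis.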