This project explores training LoRA models within the Stable Diffusion framework to generate images from text descriptions. The focus is on understanding how the captions prepared before training affect the model's performance and its ability to reproduce results accurately. This research addresses the growing interest in AI-generated imagery and its applications in fields such as digital art, content creation, and education (Li et al. 2022).
LoRA (Low-Rank Adaptation) is a technique that freezes the pre-trained model weights and injects trainable rank-decomposition matrices into each layer of the Transformer architecture, greatly reducing the number of trainable parameters for downstream tasks (Hu et al. 2021).
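To make the mechanism concrete, here is a minimal sketch in plain PyTorch: the pre-trained weight stays frozen and only the two low-rank factors are trained. The class name, rank, and scaling defaults are illustrative choices, not a particular library's API.

```python
import torch
import torch.nn as nn


class LoRALinear(nn.Module):
    """Wraps a frozen nn.Linear and adds a trainable low-rank update B @ A."""

    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 8):
        super().__init__()
        self.base = base
        self.base.weight.requires_grad_(False)   # freeze pre-trained weights
        if self.base.bias is not None:
            self.base.bias.requires_grad_(False)
        # Trainable rank-r factors: B is (out x r), A is (r x in).
        self.lora_A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(base.out_features, r))
        self.scaling = alpha / r                  # scale of the low-rank update

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Frozen path plus the scaled low-rank correction.
        return self.base(x) + self.scaling * (x @ self.lora_A.T @ self.lora_B.T)


layer = LoRALinear(nn.Linear(768, 768), r=8)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(f"trainable LoRA parameters: {trainable}")  # 2 * 768 * 8 = 12,288 vs 589,824
```

For a 768x768 projection at rank 8, this leaves roughly twelve thousand trainable parameters instead of nearly six hundred thousand, which is the source of the efficiency gain described above.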
The dataset, Old Photos from Kaggle, was chosen specifically for the descriptive information each image provides; it is predominantly used in AI training for GAN models.
(“Old Photos,” n.d.)
https://www.kaggle.com/datasets/marcinrutecki/old-photos
BLIP (Bootstrapping Language-Image Pre-training) is a vision-language pre-training framework that enables a wider range of downstream tasks than existing methods. BLIP introduces a multimodal mixture of encoder-decoder architecture for effective multi-task pre-training and flexible transfer learning. It also proposes a new dataset bootstrapping method called Captioning and Filtering (CapFilt) for learning from noisy image-text pairs, where a captioner generates synthetic captions and a filter removes noisy ones (Li et al. 2022).
https://arxiv.org/pdf/2201.12086.pdf
https://github.com/salesforce/BLIP
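As a hedged example of how a single caption can be produced, the sketch below uses the BLIP captioning checkpoint published on Hugging Face through the `transformers` library; the checkpoint size and image path are placeholder assumptions rather than the exact setup used in this project.

```python
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

# The base captioning checkpoint released by the BLIP authors on Hugging Face.
processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
model = BlipForConditionalGeneration.from_pretrained(
    "Salesforce/blip-image-captioning-base"
)

# Placeholder path standing in for one image from the training set.
image = Image.open("old_photos/example.jpg").convert("RGB")

inputs = processor(images=image, return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=30)
caption = processor.decode(output_ids[0], skip_special_tokens=True)
print(caption)  # e.g. "a black and white photo of a man standing on a street"
```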
To train a LoRA model using BLIP captions, the following steps are taken: the dataset images are captioned automatically with BLIP, each image is paired with its generated caption, and the LoRA weights are then trained against the frozen Stable Diffusion base model and evaluated.
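One illustrative way to set up the training step is shown below, assuming the `diffusers`/`peft` tooling; this is a sketch under those assumptions, not necessarily the trainer or configuration actually used here. The base model is frozen and low-rank adapters are injected into the UNet's attention projections.

```python
import torch
from diffusers import UNet2DConditionModel
from peft import LoraConfig

# Placeholder base model; substitute whichever Stable Diffusion checkpoint is used.
unet = UNet2DConditionModel.from_pretrained(
    "runwayml/stable-diffusion-v1-5", subfolder="unet"
)
unet.requires_grad_(False)  # freeze all pre-trained UNet weights

lora_config = LoraConfig(
    r=8,                                                  # rank of the update
    lora_alpha=8,                                         # scaling factor
    init_lora_weights="gaussian",
    target_modules=["to_q", "to_k", "to_v", "to_out.0"],  # attention projections
)
unet.add_adapter(lora_config)  # inject the trainable low-rank matrices

trainable = sum(p.numel() for p in unet.parameters() if p.requires_grad)
total = sum(p.numel() for p in unet.parameters())
print(f"trainable parameters: {trainable:,} of {total:,}")
# A standard diffusion training loop then optimizes only these LoRA parameters
# against the image/caption pairs produced in the captioning step.
```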
Human captioning involves manually writing descriptive captions for the training images, as opposed to using automatically generated captions from models like BLIP. The goal is to provide high-quality, contextually rich annotations that can improve the model's understanding and generation of detailed visual content (Li et al. 2022).
The human captioned model training pipeline is similar to the BLIP pipeline, with the key difference being the caption generation step. Instead of using BLIP, captions are written manually by human annotators.
The pipeline includes the same dataset preparation, LoRA training, and evaluation steps as the BLIP pipeline, with manual caption writing replacing the automatic captioning step.
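As a small illustration of that manual step, the sketch below writes each handwritten caption into a `.txt` sidecar file next to its image, a convention many LoRA training tools accept; the file names and captions are hypothetical examples, not items from the actual dataset.

```python
from pathlib import Path

# Hypothetical image paths and handwritten captions; the real dataset and
# wording will differ.
human_captions = {
    "old_photos/0001.jpg": "sepia-toned studio portrait of an elderly woman in "
                           "a lace collar, seated beside a wooden table",
    "old_photos/0002.jpg": "black-and-white street scene with two children "
                           "playing near a horse-drawn cart, 1920s clothing",
}

# Write each caption into a .txt file next to its image (0001.jpg -> 0001.txt).
for image_path, caption in human_captions.items():
    Path(image_path).with_suffix(".txt").write_text(caption, encoding="utf-8")
```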
The evaluation parameters, and the evaluations performed on two different server configurations, are described below.
The BLIP-captioned model was trained on a dataset of 11 images. Its captions had a low token count, training took 2 hours, and the prompt token count was also low. The model achieved 100% performance speed and high reproducibility. Overall, the BLIP captioning approach required low effort while delivering high impact.
In contrast, the human-captioned model also used 11 images, but its captions had a high token count. Training took longer at 4 hours, and the prompt token count was high. Like the BLIP model, it achieved 100% performance speed, but its reproducibility was only medium.
Comparing the two approaches, human captioning required significantly more effort to write the captions and a longer training time, yet its overall impact was medium to low. BLIP captioning was much more efficient, requiring less effort in dataset preparation and training while delivering high impact.
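For comparisons like these, one simple way to render both models under identical conditions is to load each trained LoRA into the same pipeline with a fixed prompt and seed. The sketch below assumes `diffusers`, placeholder file names, and a hypothetical prompt; it is not necessarily the evaluation protocol used in the project.

```python
import torch
from diffusers import StableDiffusionPipeline

# Placeholder base model, LoRA file names, prompt, and seed.
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

prompt = "an old sepia photograph of a woman standing in a garden"

for name in ["blip_captioned", "human_captioned"]:
    pipe.load_lora_weights("loras", weight_name=f"{name}.safetensors")  # attach one LoRA
    generator = torch.Generator("cuda").manual_seed(42)  # fixed seed for a fair comparison
    image = pipe(prompt, num_inference_steps=30, generator=generator).images[0]
    image.save(f"eval_{name}.png")
    pipe.unload_lora_weights()  # detach before loading the next LoRA
```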
This study provides valuable insights into the impact of captioning methods on LoRA training for Stable Diffusion models. However, there are several areas that warrant further research:
The project set out to train a LoRA model for text-to-image generation, presenting a blend of challenges and discoveries. Despite computational and time constraints, the work revealed unexpected findings, setting the stage for future exploration.
Key takeaways include: BLIP captioning delivered high impact and high reproducibility with low effort, while human captioning required substantially more effort and longer training for medium-to-low impact and only medium reproducibility.
Hu, Edward J., Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. 2021. “LoRA: Low-Rank Adaptation of Large Language Models.” arXiv. https://doi.org/10.48550/arXiv.2106.09685.
Invoke. 2024. “Creating Embeddings and Concept Models with Invoke Training - Textual Inversion & LoRAs.” YouTube video. https://www.youtube.com/watch?v=OZIz2vvtlM4.
Li, Junnan, Dongxu Li, Caiming Xiong, and Steven Hoi. 2022. “BLIP: Bootstrapping Language-Image Pre-Training for Unified Vision-Language Understanding and Generation.” arXiv. https://doi.org/10.48550/arXiv.2201.12086.
“Old Photos.” n.d. Accessed April 1, 2024. https://www.kaggle.com/datasets/marcinrutecki/old-photos.