This project explores training LoRA models within the Stable Diffusion framework to generate images from text descriptions. The focus is on understanding how the captions prepared before training affect the model's performance and its ability to reproduce results accurately. This research addresses the growing interest in AI-generated imagery and its applications in fields such as digital art, content creation, and education (Li et al. 2022).
LoRA (Low-Rank Adaptation) is a technique that freezes the pre-trained model weights and injects trainable rank-decomposition matrices into each layer of the Transformer architecture, greatly reducing the number of trainable parameters for downstream tasks (Hu et al. 2021).
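To make the mechanism concrete, here is a minimal sketch in plain PyTorch: the pre-trained weight stays frozen and only the two low-rank factors are trained. The class name, rank, and scaling defaults are illustrative choices, not a particular library's API.

```python
import torch
import torch.nn as nn


class LoRALinear(nn.Module):
    """Wraps a frozen nn.Linear and adds a trainable low-rank update B @ A."""

    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 8):
        super().__init__()
        self.base = base
        self.base.weight.requires_grad_(False)   # freeze pre-trained weights
        if self.base.bias is not None:
            self.base.bias.requires_grad_(False)
        # Trainable rank-r factors: B is (out x r), A is (r x in).
        self.lora_A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(base.out_features, r))
        self.scaling = alpha / r                  # scale of the low-rank update

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Frozen path plus the scaled low-rank correction.
        return self.base(x) + self.scaling * (x @ self.lora_A.T @ self.lora_B.T)


layer = LoRALinear(nn.Linear(768, 768), r=8)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(f"trainable LoRA parameters: {trainable}")  # 2 * 768 * 8 = 12,288 vs 589,824
```

For a 768x768 projection at rank 8, this leaves roughly twelve thousand trainable parameters instead of nearly six hundred thousand, which is the source of the efficiency gain described above.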
The dataset, Old Photos from Kaggle, was chosen specifically for the descriptive information each image provides; it is predominantly used in AI training for GAN models.
(“Old Photos,” n.d.)
https://www.kaggle.com/datasets/marcinrutecki/old-photos
BLIP (Bootstrapping Language-Image Pre-training) is a vision-language pre-training framework that enables a wider range of downstream tasks than existing methods. BLIP introduces a multimodal mixture of encoder-decoder architecture for effective multi-task pre-training and flexible transfer learning. It also proposes a new dataset bootstrapping method called Captioning and Filtering (CapFilt) for learning from noisy image-text pairs, where a captioner generates synthetic captions and a filter removes noisy ones (Li et al. 2022).
https://arxiv.org/pdf/2201.12086.pdf
https://github.com/salesforce/BLIP
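As a hedged example of how a single caption can be produced, the sketch below uses the BLIP captioning checkpoint published on Hugging Face through the `transformers` library; the checkpoint size and image path are placeholder assumptions rather than the exact setup used in this project.

```python
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

# The base captioning checkpoint released by the BLIP authors on Hugging Face.
processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
model = BlipForConditionalGeneration.from_pretrained(
    "Salesforce/blip-image-captioning-base"
)

# Placeholder path standing in for one image from the training set.
image = Image.open("old_photos/example.jpg").convert("RGB")

inputs = processor(images=image, return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=30)
caption = processor.decode(output_ids[0], skip_special_tokens=True)
print(caption)  # e.g. "a black and white photo of a man standing on a street"
```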
To train a LoRA model using BLIP captions, the following steps are taken: the dataset images are captioned automatically with BLIP, each image is paired with its generated caption, and the LoRA weights are then trained against the frozen Stable Diffusion base model and evaluated.
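One illustrative way to set up the training step is shown below, assuming the `diffusers`/`peft` tooling; this is a sketch under those assumptions, not necessarily the trainer or configuration actually used here. The base model is frozen and low-rank adapters are injected into the UNet's attention projections.

```python
import torch
from diffusers import UNet2DConditionModel
from peft import LoraConfig

# Placeholder base model; substitute whichever Stable Diffusion checkpoint is used.
unet = UNet2DConditionModel.from_pretrained(
    "runwayml/stable-diffusion-v1-5", subfolder="unet"
)
unet.requires_grad_(False)  # freeze all pre-trained UNet weights

lora_config = LoraConfig(
    r=8,                                                  # rank of the update
    lora_alpha=8,                                         # scaling factor
    init_lora_weights="gaussian",
    target_modules=["to_q", "to_k", "to_v", "to_out.0"],  # attention projections
)
unet.add_adapter(lora_config)  # inject the trainable low-rank matrices

trainable = sum(p.numel() for p in unet.parameters() if p.requires_grad)
total = sum(p.numel() for p in unet.parameters())
print(f"trainable parameters: {trainable:,} of {total:,}")
# A standard diffusion training loop then optimizes only these LoRA parameters
# against the image/caption pairs produced in the captioning step.
```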
Human captioning involves manually writing descriptive captions for the training images, as opposed to using automatically generated captions from models like BLIP. The goal is to provide high-quality, contextually rich annotations that can improve the model's understanding and generation of detailed visual content (Li et al. 2022).
The human captioned model training pipeline is similar to the BLIP pipeline, with the key difference being the caption generation step. Instead of using BLIP, captions are written manually by human annotators.
The pipeline includes the same dataset preparation, LoRA training, and evaluation steps as the BLIP pipeline, with manual caption writing replacing the automatic captioning step.
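As a small illustration of that manual step, the sketch below writes each handwritten caption into a `.txt` sidecar file next to its image, a convention many LoRA training tools accept; the file names and captions are hypothetical examples, not items from the actual dataset.

```python
from pathlib import Path

# Hypothetical image paths and handwritten captions; the real dataset and
# wording will differ.
human_captions = {
    "old_photos/0001.jpg": "sepia-toned studio portrait of an elderly woman in "
                           "a lace collar, seated beside a wooden table",
    "old_photos/0002.jpg": "black-and-white street scene with two children "
                           "playing near a horse-drawn cart, 1920s clothing",
}

# Write each caption into a .txt file next to its image (0001.jpg -> 0001.txt).
for image_path, caption in human_captions.items():
    Path(image_path).with_suffix(".txt").write_text(caption, encoding="utf-8")
```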
The evaluation parameters, and the evaluations performed on two different server configurations, are described below.
The BLIP-captioned model was trained on a dataset of 11 images. Its captions had a low token count, training took 2 hours, and the prompt token count was also low. The model achieved 100% performance speed and high reproducibility. Overall, the BLIP captioning approach required low effort while delivering high impact.
In contrast, the human-captioned model also used 11 images, but its captions had a high token count. Training took longer at 4 hours, and the prompt token count was high. Like the BLIP model, it achieved 100% performance speed, but its reproducibility was only medium.
Comparing the two approaches, human captioning required significantly more effort to write the captions and a longer training time, yet its overall impact was medium to low. BLIP captioning was much more efficient, requiring less effort in dataset preparation and training while delivering high impact.
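For comparisons like these, one simple way to render both models under identical conditions is to load each trained LoRA into the same pipeline with a fixed prompt and seed. The sketch below assumes `diffusers`, placeholder file names, and a hypothetical prompt; it is not necessarily the evaluation protocol used in the project.

```python
import torch
from diffusers import StableDiffusionPipeline

# Placeholder base model, LoRA file names, prompt, and seed.
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

prompt = "an old sepia photograph of a woman standing in a garden"

for name in ["blip_captioned", "human_captioned"]:
    pipe.load_lora_weights("loras", weight_name=f"{name}.safetensors")  # attach one LoRA
    generator = torch.Generator("cuda").manual_seed(42)  # fixed seed for a fair comparison
    image = pipe(prompt, num_inference_steps=30, generator=generator).images[0]
    image.save(f"eval_{name}.png")
    pipe.unload_lora_weights()  # detach before loading the next LoRA
```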
This study provides valuable insights into the impact of captioning methods on LoRA training for Stable Diffusion models. However, there are several areas that warrant further research:
The project set out to train a LoRA model for text-to-image generation, presenting a blend of challenges and discoveries. Despite computational and time constraints, the work revealed unexpected findings, setting the stage for future exploration.
Key takeaways include: BLIP captioning delivered high impact and high reproducibility with low effort, while human captioning required substantially more effort and longer training for medium-to-low impact and only medium reproducibility.
Hu, Edward J., Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. 2021. “LoRA: Low-Rank Adaptation of Large Language Models.” arXiv. https://doi.org/10.48550/arXiv.2106.09685.
Invoke. 2024. “Creating Embeddings and Concept Models with Invoke Training - Textual Inversion & LoRAs.” YouTube video. https://www.youtube.com/watch?v=OZIz2vvtlM4.
Li, Junnan, Dongxu Li, Caiming Xiong, and Steven Hoi. 2022. “BLIP: Bootstrapping Language-Image Pre-Training for Unified Vision-Language Understanding and Generation.” arXiv. https://doi.org/10.48550/arXiv.2201.12086.
“Old Photos.” n.d. Accessed April 1, 2024. https://www.kaggle.com/datasets/marcinrutecki/old-photos.