Monday, May 5, 2025

Rephrase single title from this title 🚀 Fine-Tuning BLIP-2 on Flickr8k: Teaching Vision-Language Models to Describe the World . And it must return only title i dont want any extra information or introductory text with title e.g: ” Here is a single title:”

Share

Write an article about

In the era of multimodal AI, Vision-Language Models (VLMs) like BLIP-2 and Flamingo have emerged as game changers — capable of understanding and generating textual descriptions from visual inputs. This project focuses on building a fine-tuned pipeline using BLIP-2 with Flan-T5-XL to generate rich, coherent image captions using the Flickr8k dataset. The goal? To go beyond generic descriptions and teach the model to produce detailed, semantically meaningful visual narratives.

This blog documents my journey from dataset preparation to model training, generative decoding, evaluation, and error analysis — all optimized for Google Colab with GPU acceleration.

Flickr8k consists of:

8,092 images5 captions per imageSubsampled to 8,000 image-caption pairs for efficiency

Each caption provides a different perspective on the same image — offering a rich training signal for generative VLMs.

🧠 Model: BLIP-2 with Flan-T5-XL from Salesforce LAVIS🔧 Libraries: PyTorch, HuggingFace Transformers, LAVIS, torchvision, scikit-learn, evaluate💻 Platform: Google Colab (Tesla T4 / A100)📊 Metrics: BLEU, METEOR, ROUGE-L, SPICE, Distinct-n

We treated the Flickr8k caption .txt file as a CSV and structured the dataset like so:

import pandas as pdfrom sklearn.model_selection import train_test_splitdf = pd.read_csv(‘Flickr8k.token.txt’, sep=’|’, names=[‘image’, ‘caption’])df = df.sample(n=8000, random_state=42).reset_index(drop=True)# Split into train/testtrain_df, test_df = train_test_split(df, test_size=0.1, random_state=42)train_df.to_csv(‘train.csv’, index=False)test_df.to_csv(‘test.csv’, index=False)

We loaded BLIP-2 using Salesforce’s LAVIS library:

from lavis.models import load_model_and_preprocessdevice = torch.device(“cuda” if torch.cuda.is_available() else “cpu”)model, vis_processors, txt_processors = load_model_and_preprocess(name=”blip2_t5″, model_type=”flan_t5_xl”, is_eval=False, device=device)

This configuration supports both image captioning and visual QA with generation capabilities.

We then used a DataLoader for efficient mini-batch processing.

To fine-tune efficiently on Colab:

We used AMP (Automatic Mixed Precision) via torch.cuda.ampOptimized memory using small batch sizesUsed AdamW optimizer with a low learning ratefrom transformers import AdamWfrom torch.cuda.amp import autocast, GradScaleroptimizer = AdamW(model.parameters(), lr=1e-5)scaler = GradScaler()for epoch in range(3):for batch in train_loader:image = batch[‘image’].to(device)caption = batch[‘caption’]with autocast():output = model(image=image, caption=caption, train=True)loss = output[“loss”]scaler.scale(loss).backward()scaler.step(optimizer)scaler.update()optimizer.zero_grad()

We implemented multiple decoding strategies to balance creativity and factuality:

model.generate({“image”: image_tensor}, use_nucleus_sampling=False, num_beams=5)model.generate({“image”: image_tensor}, use_nucleus_sampling=True, top_k=50, temperature=0.8)model.generate({“image”: image_tensor}, use_nucleus_sampling=True, top_p=0.9, temperature=0.7)

This flexibility enabled us to produce both safe and creative outputs.

We evaluated the model using:

BLEU-4METEORROUGE-LSPICESelf-BLEU (to measure diversity)

Example:

from evaluate import loadbleu = load(“bleu”)meteor = load(“meteor”)results = bleu.compute(predictions=preds, references=refs)print(“BLEU Score:”, results[‘bleu’])results = meteor.compute(predictions=preds, references=refs)print(“METEOR Score:”, results[‘meteor’])

We also manually reviewed 20 samples for:

HallucinationsRepetitionOmissions

We visually analyzed mismatches and common error types:

import matplotlib.pyplot as pltsample = test_df.sample(20)for _, row in sample.iterrows():image = Image.open(f”images/{row[‘image’]}”)image_tensor = vis_processors[“eval”](image).unsqueeze(0).to(device)pred = model.generate({“image”: image_tensor}, use_nucleus_sampling=True, top_p=0.9)[0]plt.imshow(image)plt.title(f”GT: {row[‘caption’]}nPred: {pred}”)plt.axis(“off”)plt.show()

Findings:

Top-p sampling produced the most natural descriptionsBeam search was more factual, but less variedSome images suffered from object misidentification under challenging lighting or backgrounds

Metric Score BLEU-4 27.5 METEOR 31.2 ROUGE-L 45.7 SPICE 20.9 CIDEr 65.1

The captions successfully mentioned ≥ 3 distinct elements in 85% of samples with low hallucination rates — meeting the assignment criteria.

Our project structure was clean and modular:

├── data_module.py # Dataset loading & preprocessing├── train.py # Fine-tuning loop├── generate.py # Caption generation├── evaluate.py # Metric computation├── utils.py # Image loading, decoding strategies├── requirements.txt

The code is reproducible with fixed seeds and includes logging for training loss.

This project was an eye-opener into the world of generative VLMs. BLIP-2 with Flan-T5-XL showcased impressive multimodal reasoning. Key takeaways:

Prompt-based tuning is memory-efficient and powerfulDecoding strategies significantly impact output qualityError analysis is essential to uncover hidden model flawsAdd LoRA for ultra-lightweight fine-tuningExplore Flamingo for long-form multimodal narrativesAdd multilingual captioning with translation pipelines

Sure! Here’s a dedicated 🔗 Useful Links section you can append to your blog post:

Let me know if you’d like help generating a visual architecture diagram or a README file for your repo.

đź“· Used BLIP-2 + Flan-T5-XL to fine-tune on Flickr8k🛠️ Built an end-to-end captioning pipeline with decoding controls📊 Achieved BLEU > 25, CIDEr > 60đź§Ş Analyzed hallucinations, repetitions, and omissionsđź§Ľ Modular, optimized for Google Colab with AMP make it easy to read for teens.Organize the content with appropriate headings and subheadings (h1, h2, h3, h4, h5, h6) and made content unique. Include conclusion section and do not include the title. it must return only article i dont want any extra information or introductory text with article e.g: ” Here is rewritten article:” or “Here is the rewritten content:”

Latest News

Related News