
Fine-tuning Multimodal Embeddings on Custom Text-Image Pairs

8.9K views · 256 likes · 27:56 · Jan 31, 2025

πŸ›οΈ Products Mentioned (5)

🤝 Want your team maximizing Claude? I run 1:1 and team AI workshops for companies doing $1M+ per year: https://aibuilder.academy/yt/W4s6b2ZM6kI

In this video, I walk through how to fine-tune CLIP on my YouTube titles and thumbnails using the Sentence Transformers Python library.

Resources:
📰 Blog: https://medium.com/towards-data-science/fine-tuning-multimodal-embedding-models-bf007b1c5da5?source=friends_link&sk=df620c0d7aa4959d566771fe09766c41
💻 GitHub Repo: https://github.com/ShawhinT/YouTube-Blog/tree/main/multimodal-ai/4-ft-mm-embeddings
🤗 Model: https://huggingface.co/shawhin/clip-title-thumbnail-embeddings
💿 Dataset: https://huggingface.co/datasets/shawhin/yt-title-thumbnail-pairs

References:
[1] https://youtu.be/YOvxh_ma5qE
[2] arXiv:2103.00020 [cs.CV]
[3] arXiv:1705.00652 [cs.CL]
[4] https://youtu.be/hOLBrIjRAj4

Intro - 0:00
Multimodal Embeddings - 0:44
0-shot Use Cases - 2:30
Limitations of CLIP - 3:50
Fine-tuning CLIP - 5:14
Step 1: Gather training data - 6:46
Step 2: Preprocess data - 15:20
Step 3: Define evals - 17:20
Step 4: Fine-tune model - 19:22
Step 5: Evaluate model - 26:04
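
The evaluation steps listed above (Step 3 and Step 5) are not spelled out in this description. As one possible illustration only, here is a minimal sketch of a recall@k check for paired title and thumbnail embeddings. The metric choice, the function names `cosine` and `recall_at_k`, and the toy vectors are all assumptions made for this example and are not taken from the video or its repo. In practice the vectors would come from the text and image encoders of a CLIP model.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def recall_at_k(text_embs, image_embs, k=1):
    """Fraction of texts whose paired image (same index) ranks in the top k.

    Hypothetical helper: assumes text_embs[i] and image_embs[i] form a pair.
    """
    hits = 0
    for i, text in enumerate(text_embs):
        sims = [cosine(text, image) for image in image_embs]
        # Indices of images sorted from most to least similar to this text.
        ranked = sorted(range(len(sims)), key=lambda j: sims[j], reverse=True)
        if i in ranked[:k]:
            hits += 1
    return hits / len(text_embs)


# Toy 2-D embeddings standing in for real title/thumbnail vectors (made up).
titles = [[1.0, 0.1], [0.2, 1.0], [0.9, 0.3]]
thumbs = [[0.9, 0.0], [0.0, 1.0], [1.0, 0.2]]

print("recall@1:", recall_at_k(titles, thumbs, k=1))
print("recall@2:", recall_at_k(titles, thumbs, k=2))
```

Running the same check before and after fine-tuning, on held-out pairs, is one simple way to see whether titles and their own thumbnails have moved closer together in the embedding space.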
