Welcome to the Multimodal-verse: A Beginner's Guide

Welcome to the Multimodal-verse: A Beginner's Guide

Hey there, weary traveler! Feeling overwhelmed by the AI revolution? Everywhere you look, it's AI this, AI that. And now you're hearing whispers about "multimodal something something." Don't sweat it, my friend. I've got your back!

Let's dive into the fascinating world of multimodality together. This blog post series is designed to be your friendly guide through the multimodal landscape. We'll keep things simple and beginner-friendly (but you'll need at least some AI 101 under your belt).

Get ready to expand your human brain with some cool concepts and techniques!

Here's what we'll cover in this series:

  1. Intro to the Multimodal

    • Plato's Cave: A philosophical warm-up

    • Unimodality and its limitations: Why one just isn't enough

    • Unique information in different modalities: Spice up your AI life

  2. From Satellite to Earth: A Case Study on Why We Should Go Multimodal

    • Satellite bands: More than meets the eye

    • Night vision with thermal imagery

  3. This is Cool, But Why So Complex?

    • Which task are you trying to solve?

    • The messy world of fusion

    • Timing is everything: When to fuse?

    • Fusion techniques: From simple to sophisticated

  4. You've Got My Attention, Where Do I Start?

    • Welcome to Flax: Your new best friend

    • Data loaders, modeling, and training loops

    • Evaluation, tracking, and model management

  5. Colab 🧡 GitHub Codespaces

Throughout this journey, we'll tackle some common questions and challenges:

  • Why is classification so picky about representations?

  • What's the deal with modality translation and high-quality features?

  • The great encoder debate: To leak or not to leak?

  • Fusion timing: Early, mid, or late? (Spoiler: It depends!)

  • Embedding shenanigans: Size matters, and so does normalization

  • Fusion techniques: From "concatenate and chill" to "attention fusion and thrill"

So, buckle up, prepare a whole berrad of Moroccan tea, and sip on the knowledge I am about to drop on you.

a photo by @kaoutharelouraoui

Acknowledgments

Google AI/ML Developer Programs team supported this work by providing Google Cloud Credit.

References

I will try to use the same numbers for citations for the rest of the blogs.

Resources

Papers and Theses

  1. Le-Khac, P. H., Healy, G., & Smeaton, A. F. (2020). Contrastive representation learning: A framework and review. IEEE Access, 8, 193907–193934.

    https://doi.org/10.1109/ACCESS.2020.3031549

  2. Jia, C., Yang, Y., Xia, Y., Chen, Y., Parekh, Z., Pham, H., Le, Q., Sung, Y., Li, Z., & Duerig, T. (2021). Scaling up Visual and Vision-Language representation learning with noisy text supervision. International Conference on Machine Learning, 4904–4916. http://proceedings.mlr.press/v139/jia21b/jia21b.pdf

  3. Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., Krueger, G., & Sutskever, I. (2021). Learning transferable visual models from natural language supervision. arXiv. https://arxiv.org/abs/2103.00020

  4. Zhai, X., Mustafa, B., Kolesnikov, A., & Beyer, L. (2023, October). Sigmoid loss for language image pre-training. In 2023 IEEE/CVF International Conference on Computer Vision (ICCV) (pp. 11941-11952). IEEE. https://doi.org/10.1109/ICCV51070.2023.01100

  5. Li, S., Zhang, L., Wang, Z., Wu, D., Wu, L., Liu, Z., Xia, J., Tan, C., Liu, Y., Sun, B., & Stan Z. Li. (n.d.). Masked modeling for self-supervised representation learning on vision and beyond. In IEEE [Journal-article]. https://arxiv.org/pdf/2401.00897

  6. Jia, C., Yang, Y., Xia, Y., Chen, Y., Parekh, Z., Pham, H., Le, Q., V., Sung, Y., Li, Z., & Duerig, T. (2021, February 11). Scaling up Visual and Vision-Language representation learning with noisy text supervision. arXiv.org. https://arxiv.org/abs/2102.05918

  7. Bachmann, R., Kar, O. F., Mizrahi, D., Garjani, A., Gao, M., Griffiths, D., Hu, J., Dehghan, A., & Zamir, A. (2024, June 13). 4M-21: An Any-to-Any Vision model for tens of tasks and modalities. arXiv.org. https://arxiv.org/abs/2406.09406

  8. Bao, H., Dong, L., Piao, S., & Wei, F. (2021, June 15). BEIT: BERT Pre-Training of Image Transformers. arXiv.org. https://arxiv.org/abs/2106.08254

  9. Balestriero, R., Ibrahim, M., Sobal, V., Morcos, A., Shekhar, S., Goldstein, T., Bordes, F., Bardes, A., Mialon, G., Tian, Y., Schwarzschild, A., Wilson, A. G., Geiping, J., Garrido, Q., Fernandez, P., Bar, A., Pirsiavash, H., LeCun, Y., & Goldblum, M. (2023, April 24). A cookbook of Self-Supervised Learning. arXiv.org. https://arxiv.org/abs/2304.12210

  10. Zadeh, A., Chen, M., Poria, S., Cambria, E., & Morency, L. (2017, July 23). Tensor Fusion Network for Multimodal Sentiment Analysis. arXiv.org. https://arxiv.org/abs/1707.07250

  11. Tian, Y., Sun, C., Poole, B., Krishnan, D., Schmid, C., & Isola, P. (2020, May 20). What makes for good views for contrastive learning? arXiv.org. https://arxiv.org/abs/2005.10243

  12. Huang, Y., Du, C., Xue, Z., Chen, X., Zhao, H., & Huang, L. (2021, June 8). What Makes Multi-modal Learning Better than Single (Provably). arXiv.org. https://arxiv.org/abs/2106.04538

  13. Nagrani, A., Yang, S., Arnab, A., Jansen, A., Schmid, C., & Sun, C. (2021, June 30). Attention bottlenecks for multimodal fusion. arXiv.org. https://arxiv.org/abs/2107.00135

  14. Liu, Z., Shen, Y., Lakshminarasimhan, V. B., Liang, P. P., Zadeh, A., & Morency, L. (2018, May 31). Efficient Low-rank Multimodal Fusion with Modality-Specific Factors. arXiv.org. https://arxiv.org/abs/1806.00064

  15. Wang, X., Chen, G., Qian, G., Gao, P., Wei, X., Wang, Y., Tian, Y., & Gao, W. (2023, February 20). Large-scale Multi-Modal Pre-trained Models: A comprehensive survey. arXiv.org. https://arxiv.org/abs/2302.10035

  16. Wang, W., Bao, H., Dong, L., Bjorck, J., Peng, Z., Liu, Q., Aggarwal, K., Mohammed, O. K., Singhal, S., Som, S., & Wei, F. (2022, August 22). Image as a Foreign Language: BEIT Pretraining for all Vision and Vision-Language tasks. arXiv.org. https://arxiv.org/abs/2208.10442

  17. Liang, P. P. (2024, April 29). Foundations of multisensory artificial Intelligence. arXiv.org. https://arxiv.org/abs/2404.18976

  18. Huang, S., Pareek, A., Seyyedi, S., Banerjee, I., & Lungren, M. P. (2020). Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. Npj Digital Medicine, 3(1). https://doi.org/10.1038/s41746-020-00341-z