【MAPs】多模态方向上的一些探索

2022年05月04日 5.6k 字大概 34 分钟

Main From GitHub

Abstract

Introduction

PDF+code
阅读笔记
带多种细粒度的图像文本匹配训练
使用交叉注意力以及跨模态编码器

PDF+code
阅读笔记

Method/model

提高商品跨模态检索的思路

理解单元

在语义对齐空间中，对齐的质量直接决定了图文之间的匹配程度。多粒度抽取可以获得图片中更丰富的语义以便搜索caption可以更好地定位这张图片（X-VLM）。
从充分利用模态信息的角度，可以加入ocr方法提取其中的信息，从而获得对应图片的文字信息，高维是一个极稀疏的空间，只要有样本和算力，不怕没有分界

技术单元

CBAM

double cross attention

special ocr model

Experiments

model 1:CLIP_zeroshot

使用Vit-B/32 zero_shot
使用VLM，原始model
model 2:VLM
多粒度cross—attention
model 3:TCL
Pre_train
fine_turn

Concluesion

References

生成方法用于增补跨模态对齐
 自监督手段用于词嵌入

下面是多模态领域值得一读的相关文献

感谢总结这些相关文献的作者：
更新连接

Reading List for Topics in Multimodal Machine Learning

By Paul Liang (pliang@cs.cmu.edu), Machine Learning Department and Language Technologies Institute, CMU, with help from members of the MultiComp Lab at LTI, CMU. If there are any areas, papers, and datasets I missed, please let me know!

Course content + workshops

New course 11-877 Advanced Topics in Multimodal Machine Learning Spring 2022 @ CMU. It will primarily be reading and discussion-based. We plan to post discussion probes, relevant papers, and summarized discussion highlights every week on the website.

【MAPs】多模态方向上的一些探索

Abstract

Introduction

Related Work

Cross-Modal Retrieval SOTA Work(20022.04):X_VLM

Zero-Shot Cross-Modal Retrieval SOTA(20022.04):TCL

Method/model

提高商品跨模态检索的思路

理解单元

技术单元

CBAM

double cross attention

special ocr model

Experiments

model 1:CLIP_zeroshot

model 2:VLM

model 3:TCL

Pre_train

fine_turn

Concluesion

References

下面是多模态领域值得一读的相关文献

Reading List for Topics in Multimodal Machine Learning

Course content + workshops

Table of Contents

Research Papers

Survey Papers

Core Areas

Multimodal Representations

Multimodal Fusion

Multimodal Alignment

Multimodal Pretraining

Multimodal Translation

Crossmodal Retrieval

Multimodal Co-learning

Missing or Imperfect Modalities

Analysis of Multimodal Models

Knowledge Graphs and Knowledge Bases

Intepretable Learning

Generative Learning

Semi-supervised Learning

Self-supervised Learning

Language Models

Adversarial Attacks

Few-Shot Learning

Bias and Fairness

Human in the Loop Learning

Architectures

Multimodal Transformers

Multimodal Memory

Applications and Datasets

Language and Visual QA

Language Grounding in Vision

Language Grouding in Navigation

Multimodal Machine Translation

Multi-agent Communication

Commonsense Reasoning

Multimodal Reinforcement Learning

Multimodal Dialog

Language and Audio

Audio and Visual

Media Description

Video Generation from Text

Affect Recognition and Multimodal Language

Healthcare

Robotics

Autonomous Driving

Finance

Human AI Interaction

Workshops

Tutorials

Courses