Have you ever been overwhelmed by cross-modal data, where different kinds of data are incompatible and hard to integrate? For example, you vaguely remember a quote from a book and want to find the original text, but the PDF is a scanned version. Or you want to make a movie compilation, but you can’t remember where in the film the scene you have in mind appears.

To meet the latter need, with the support of Jina and Datawhale, we have built a tool for Video Clip Extraction by Description.


What we want to implement is a text-to-video retrieval system. A video is a finite sequence of frames, and mature models such as CLIP already exist for matching images with text. So all that needs to be done is to extract frames from the video and match each frame against the text description, with the image-text matching performed through a cloud service.
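The idea can be illustrated with a minimal sketch. It assumes that each sampled frame and the text query have already been turned into embedding vectors by a model such as CLIP; the embeddings below are toy values, and the function names `sample_frame_indices`, `cosine_similarity`, and `rank_frames` are hypothetical helpers for illustration, not part of the actual tool.

```python
import math
from typing import List, Sequence, Tuple


def sample_frame_indices(total_frames: int, step: int) -> List[int]:
    """Pick every `step`-th frame index, starting at frame 0."""
    if step <= 0:
        raise ValueError("step must be positive")
    return list(range(0, total_frames, step))


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_frames(
    text_embedding: Sequence[float],
    frame_embeddings: Sequence[Sequence[float]],
    top_k: int = 3,
) -> List[Tuple[int, float]]:
    """Return (position, score) pairs for the frames most similar to the text."""
    scored = [
        (i, cosine_similarity(text_embedding, emb))
        for i, emb in enumerate(frame_embeddings)
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy data: in practice these vectors would come from an image-text model.
frame_ids = sample_frame_indices(total_frames=10, step=3)
toy_frame_embeddings = [[0.0, 1.0], [1.0, 0.0], [1.0, 1.0], [-1.0, 0.0]]
toy_text_embedding = [1.0, 0.0]

best = rank_frames(toy_text_embedding, toy_frame_embeddings, top_k=2)
best_frame_ids = [frame_ids[i] for i, _ in best]
```

The highest-scoring positions map back to frame indices in the video, and from there to timestamps that mark the clip to extract.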

Yikun Han
