I spent the last few days building a multimodal video RAG platform, with a lot of help from Claude (vibecoding is real) and a lot of tuning the retrieval numbers until they stopped being embarrassing.

You paste a YouTube link, the system watches the video (transcribes the audio, samples visual frames, captions them with Claude), and then you can ask questions about it. Like Google for the inside of videos.

It returns timestamped answers grounded in what was actually said or shown. Not vibes. Citations.
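
For the curious, here is roughly what "timestamped and grounded" means, as a toy sketch. This is not the production code. It assumes every transcript chunk and frame caption is stored as a segment with start and end times, and it swaps real embeddings for plain word-overlap cosine similarity so it runs with no dependencies. The names (`Segment`, `search`, `cite`) and the sample data are made up for illustration.

```python
from __future__ import annotations

import math
import re
from collections import Counter
from dataclasses import dataclass


@dataclass
class Segment:
    """One searchable chunk of a video (hypothetical schema)."""
    video_id: str
    start: float   # seconds from the start of the video
    end: float
    source: str    # "transcript" or "frame_caption"
    text: str


def tokenize(text: str) -> list[str]:
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def search(query: str, segments: list[Segment], top_k: int = 3) -> list[tuple[float, Segment]]:
    """Rank segments against the query; a stand-in for embedding search."""
    q = Counter(tokenize(query))
    scored = [(cosine(q, Counter(tokenize(s.text))), s) for s in segments]
    scored = [(score, s) for score, s in scored if score > 0]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]


def format_timestamp(seconds: float) -> str:
    total = int(seconds)
    return f"{total // 60:02d}:{total % 60:02d}"


def cite(seg: Segment) -> str:
    """Render a segment as a citation with its time range and modality."""
    span = f"{format_timestamp(seg.start)}-{format_timestamp(seg.end)}"
    return f"[{seg.video_id} @ {span}, {seg.source}] {seg.text}"


# Made-up sample data: two transcript chunks and one frame caption.
SEGMENTS = [
    Segment("demo", 0.0, 12.0, "transcript", "Welcome, today we talk about gradient descent"),
    Segment("demo", 12.0, 15.0, "frame_caption", "A slide showing a loss curve going down"),
    Segment("demo", 75.0, 90.0, "transcript", "The learning rate controls the step size"),
]

for score, seg in search("what is the learning rate", SEGMENTS):
    print(cite(seg))
# prints: [demo @ 01:15-01:30, transcript] The learning rate controls the step size
```

The point of keeping `start`, `end`, and `source` on every segment is that any answer can link back to the exact moment in the video and say whether it came from speech or from a frame.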

The thing is live: multimodal-video-rag-web.vercel.app

I have about 13 videos indexed right now. Try searching. Ask it something. Break it. I want honest feedback.