The video on YT and the track played in YTMusic are two different uploads. You can easily get the song's upload on YT by taking the video ID from the YTM URL. So yeah, yt-dlp should get you only the song if you build the playlist from songs instead of music videos.
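If it helps, here is a rough Python sketch of the "take the ID from the YTM URL" step. It assumes the YTM link has the usual `watch?v=<ID>` form. The function name `ytm_to_yt` and the example ID are made up for illustration.

```python
from urllib.parse import urlparse, parse_qs


def ytm_to_yt(url: str) -> str:
    """Build a youtube.com watch URL from a music.youtube.com watch URL.

    Hypothetical helper: it only copies the `v` query parameter across.
    """
    query = parse_qs(urlparse(url).query)
    ids = query.get("v")
    if not ids:
        raise ValueError(f"no video ID (v=...) found in {url!r}")
    return f"https://www.youtube.com/watch?v={ids[0]}"


# Placeholder ID, not a real upload.
print(ytm_to_yt("https://music.youtube.com/watch?v=AAAAAAAAAAA&list=PLxxxx"))
# → https://www.youtube.com/watch?v=AAAAAAAAAAA
```

The resulting URL can then be handed to yt-dlp, for example with `yt-dlp -x <url>` to keep only the audio.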
Not sure what you’d consider lightweight, but I’ve been using https://github.com/jhj0517/Whisper-WebUI with faster-whisper.
The GPU integration has never worked well for me, but the CPU one works wonders.
You’ll have to check if the models offer good results for those languages.
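For a quick check outside the WebUI, something like the sketch below might do. It assumes the `faster-whisper` Python package is installed and calls it directly, which is not how Whisper-WebUI is driven. The function name, the `"small"` model size, and the `int8` compute type are just my picks for a CPU-only run.

```python
def transcribe_on_cpu(audio_path: str, model_size: str = "small"):
    """Transcribe one file on the CPU and return (start, end, text) tuples.

    Hypothetical helper for trying a model on a sample in your language.
    """
    # Imported here so the function can be defined even if the package is missing.
    from faster_whisper import WhisperModel

    # int8 keeps memory use down on CPU; an assumption, adjust to taste.
    model = WhisperModel(model_size, device="cpu", compute_type="int8")
    segments, info = model.transcribe(audio_path)

    # The detected language and its probability are a first hint of how well
    # the model handles the input.
    print(f"detected language: {info.language} (p={info.language_probability:.2f})")

    # `segments` is a generator; iterating it is what runs the transcription.
    return [(seg.start, seg.end, seg.text.strip()) for seg in segments]
```

Running `transcribe_on_cpu("sample.mp3")` on a short clip in each language and reading the output is probably the fastest way to judge whether a given model size is good enough.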