Is music the key to language?

Another shower thought. We presumably had song before we had language, since other animals have song without fully developed language. Could pre-training a model on music tokens, before it ever sees any text, give it a better understanding of context, and with that a better ability to infer the important parts of a sentence?
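
To make the idea a bit more concrete, here is a minimal sketch of what "music first, text second" could look like as a training schedule. Everything in it is an assumption for illustration: the token names (`NOTE_C4`, `DUR_8`, `REST`), the toy corpora, and the helper functions `build_vocab` and `curriculum` are all hypothetical, and no model is actually trained. It only shows the ordering of the data and the shared vocabulary that such an experiment would probably need.

```python
from typing import Dict, Iterator, List, Tuple

# Hypothetical toy corpora: each item is one sequence of token strings.
# The music token scheme (note + duration) is made up for this sketch.
MUSIC_CORPUS = [
    ["NOTE_C4", "DUR_8", "NOTE_E4", "DUR_8", "NOTE_G4", "DUR_4"],
    ["NOTE_A3", "DUR_4", "REST", "DUR_4"],
]
TEXT_CORPUS = [
    ["the", "cat", "sat"],
    ["on", "the", "mat"],
]


def build_vocab(*corpora: List[List[str]]) -> Dict[str, int]:
    """Give every distinct token one id, shared across music and text,
    so both phases would train the same embedding table."""
    vocab: Dict[str, int] = {}
    for corpus in corpora:
        for seq in corpus:
            for tok in seq:
                vocab.setdefault(tok, len(vocab))
    return vocab


def curriculum(
    music: List[List[str]],
    text: List[List[str]],
    music_steps: int,
    text_steps: int,
) -> Iterator[Tuple[str, List[str]]]:
    """Yield (phase, sequence) pairs: all music steps first, then text."""
    for step in range(music_steps):
        yield "music", music[step % len(music)]
    for step in range(text_steps):
        yield "text", text[step % len(text)]


vocab = build_vocab(MUSIC_CORPUS, TEXT_CORPUS)

# Encode each sequence to ids in the order a trainer would consume them.
schedule = [
    (phase, [vocab[tok] for tok in seq])
    for phase, seq in curriculum(MUSIC_CORPUS, TEXT_CORPUS, music_steps=3, text_steps=2)
]

for phase, ids in schedule:
    print(phase, ids)
```

A real test of the idea would feed `schedule` into an actual model and compare it against a text-only baseline of the same size and the same total number of steps. Whether the music phase helps at all is exactly the open question.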

The reason I think this: if attention is all you need, then in humans it’s really all about rhythm, pitch and mouth sounds. Text is just a proxy for that, so models trained only on text get it second-hand. How much training is wasted figuring this out after the fact, when human language actually comes from song and poetry first and foremost?

Could music be the key to training smaller models? I dunno. Just a thought.