It sounds as though you have worked on digitized audio.
I work chiefly with the popular free Audacity software. When I zoom in on speech, I can see it consists of short bursts of noise, with silent gaps between.
Music generally shows up as a continuous waveform.
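To make that observation a bit more concrete, here is a minimal Python sketch that measures how much of a recording is "gap". The function name, the 400-sample frame length, and the 10% threshold are my own assumptions for illustration, not anything taken from Audacity or from a real classifier. It only counts quiet frames; a high ratio hints at speech, a ratio near zero hints at continuous sound.

```python
import math


def silent_frame_ratio(samples, frame_len=400, threshold_ratio=0.1):
    """Return the fraction of frames that are much quieter than the loudest frame.

    samples: sequence of numbers (one channel of decoded audio).
    frame_len and threshold_ratio are illustrative guesses, not tuned values.
    """
    rms_values = []
    # Cut the signal into non-overlapping frames and take the RMS of each.
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        rms_values.append(math.sqrt(sum(x * x for x in frame) / frame_len))

    if not rms_values or max(rms_values) == 0:
        return 0.0

    # A frame counts as "silent" if it is below a fraction of the loudest frame.
    limit = threshold_ratio * max(rms_values)
    quiet = sum(1 for r in rms_values if r < limit)
    return quiet / len(rms_values)
```

Real speech and real music are messier than this (quiet passages, background noise, talk over music), so treat the ratio as one clue among several rather than a decision.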
Speech recognition software is available. I don't know what output it sends during music, however.
You'll need to be a very clever programmer just to recognize the difference between speech and music. Your program will need to go through millions of data points. It will need to know how the data is formatted: whether it is 32/24/16/8 bit, sampled at 44.1/22.05 kHz, etc.
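For uncompressed WAV files, at least, the format question is not too hard. Below is a small sketch using Python's standard `wave` module to report the bit depth and sample rate and to decode 16-bit samples. The function names are my own, and the `wave` module only handles plain PCM WAV files, so anything else would need converting first.

```python
import struct
import wave


def describe_wav(source):
    """Report the basic format of a PCM WAV file (path or open binary file)."""
    with wave.open(source, "rb") as w:
        return {
            "channels": w.getnchannels(),
            "bits_per_sample": 8 * w.getsampwidth(),
            "sample_rate_hz": w.getframerate(),
            "frames": w.getnframes(),
        }


def read_16bit_samples(source):
    """Decode all samples of a 16-bit PCM WAV file into a list of ints.

    For stereo files the channels come back interleaved (L, R, L, R, ...).
    Other bit depths are deliberately not handled in this sketch.
    """
    with wave.open(source, "rb") as w:
        if w.getsampwidth() != 2:
            raise ValueError("this sketch only decodes 16-bit PCM")
        raw = w.readframes(w.getnframes())
    # WAV stores 16-bit samples as little-endian signed integers.
    return list(struct.unpack("<%dh" % (len(raw) // 2), raw))
```

A call such as `describe_wav("show.wav")` (a made-up file name) would return a small dictionary with the channel count, bit depth, sample rate, and frame count.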
I'm not sure how easy it is to read the data in MP3, AAC, and similar files. These are compressed audio formats, so they might first need to be converted to WAV or AIFF. (Audacity can do this job.)
Have you encountered any programs that can tell the difference between music and speech? Just to do that much could be useful to people. Yet I have not heard of any program which does it. If it is possible, I imagine it would already be available, commercially or as shareware.
As for broadcasts...
You might get somewhere with high-pass and low-pass filters. Most speech energy sits in a fairly narrow frequency range (telephone systems get by with roughly 300 Hz to 3.4 kHz). Music contains a much larger range of frequencies.
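One way to try the filter idea without building actual filters is to measure what share of a short frame's energy falls inside the speech band. The sketch below does that with a plain (slow) discrete Fourier transform in standard Python. The function name and the 300 to 3400 Hz limits in the usage note are my assumptions, and a real program would use an FFT library instead.

```python
import cmath
import math


def band_energy_fraction(frame, sample_rate, low_hz, high_hz):
    """Return the share of spectral energy between low_hz and high_hz.

    frame: a short list of samples (a few hundred at most; this DFT is O(n^2)).
    """
    n = len(frame)
    total = 0.0
    in_band = 0.0
    # Walk the positive-frequency bins, skipping DC (k = 0).
    for k in range(1, n // 2):
        coeff = sum(
            frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n)
        )
        power = abs(coeff) ** 2
        total += power
        freq_hz = k * sample_rate / n  # centre frequency of bin k
        if low_hz <= freq_hz <= high_hz:
            in_band += power
    return in_band / total if total else 0.0
```

Calling `band_energy_fraction(frame, 8000, 300, 3400)` on frame after frame should give values close to 1 for speech-like material and lower values when there is a lot of bass or treble. Music can also sit mostly inside that band, so this would not be reliable on its own.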
Your morning show probably plays a jingle as it comes back from a commercial. Detection might be possible by constructing a narrow band-pass filter tuned to the musical pitches in that jingle.
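A cheap way to get the effect of a very narrow band-pass at one known pitch is the Goertzel algorithm, which computes a single frequency bin of the spectrum. Here is a minimal sketch in standard Python. The function names and the normalization are my own choices, and you would have to find the jingle's actual pitches yourself (the 880 Hz in the usage note is only a placeholder).

```python
import math


def goertzel_power(samples, sample_rate, target_hz):
    """Return the squared magnitude of the DFT bin nearest to target_hz."""
    n = len(samples)
    k = round(n * target_hz / sample_rate)  # nearest bin index
    coeff = 2.0 * math.cos(2.0 * math.pi * k / n)
    s_prev = 0.0
    s_prev2 = 0.0
    # Second-order recurrence: one multiply and two adds per sample.
    for x in samples:
        s = x + coeff * s_prev - s_prev2
        s_prev2 = s_prev
        s_prev = s
    return s_prev2 ** 2 + s_prev ** 2 - coeff * s_prev * s_prev2


def tone_strength(samples, sample_rate, target_hz):
    """Return a value from 0 to about 1: how much of the energy is at target_hz.

    A pure sine exactly on the bin gives about 1; unrelated sound gives near 0.
    """
    energy = sum(x * x for x in samples)
    if energy == 0:
        return 0.0
    n = len(samples)
    return goertzel_power(samples, sample_rate, target_hz) / (n / 2 * energy)
```

You could run `tone_strength(block, rate, 880.0)` over successive short blocks of the broadcast and flag the moments where the value stays high for each note of the jingle in turn. Voices and other music will occasionally hit the same pitch, so requiring the whole note sequence should cut down on false alarms.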
In earlier days a high-pitched beep might be transmitted just before a commercial. Perhaps this is no longer done, however. The industry is motivated to make you hear their commercials, rather than to warn you one is coming.