Ad insertion with streaming audio is based on the simple idea of one audio element ending and another audio element starting. The more closely that element-end and element-start are aligned, the better the experience is for the listener. In a great listener experience, the listener can't detect the transition between elements (aside from the obvious change in content).
In brief, when the metadata that controls the break says, “commercials are starting!” that is exactly when the inserted commercials need to start. If the inserted commercial starts early, the ending element is clipped. If it starts late, the broadcast ad break content (i.e., the underlying content) might start playing before the inserted (replacement) ad content does; the listener then hears a jarring effect as the underlying ad is abruptly cut and the injected ads take over. Any deviation from this precise timing can cause an unpleasant listener experience.
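The early/late distinction above can be sketched in a few lines of Python. This is only an illustration: the function name, the millisecond timestamps, and the `tolerance_ms` parameter are assumptions made for the example, not part of any particular ad insertion system.

```python
def classify_transition(cue_ms: int, insert_start_ms: int, tolerance_ms: int = 0) -> str:
    """Compare the metadata cue time with the time the inserted ad actually starts.

    Both times are positions on the same stream timeline, in milliseconds.
    tolerance_ms is a hypothetical allowance for drift that is too small to hear.
    """
    offset = insert_start_ms - cue_ms
    if offset < -tolerance_ms:
        # Inserted ad began before the cue: the ending element is clipped.
        return "early"
    if offset > tolerance_ms:
        # Inserted ad began after the cue: the underlying broadcast ad is
        # heard briefly before it is cut off.
        return "late"
    return "aligned"


if __name__ == "__main__":
    print(classify_transition(10_000, 9_800))   # early
    print(classify_transition(10_000, 10_250))  # late
    print(classify_transition(10_000, 10_000))  # aligned
```

With the default tolerance of zero, only an exact match counts as aligned, which mirrors the "exactly when" requirement described above.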
This is why metadata timing and alignment are important. It's also why streaming audio that uses injected ads does not use crossfades: the transition from one element to the other needs to be a clean break.
The best way to address this is easy, but it requires a bit of a rethink. It's simply this: avoid crossfades into commercial breaks. That means no jocks or show hosts talking over the start of a commercial (as some hosts like to do when the ad starts with music) and no talking over a sweeper that is intended to be replaced as part of a break.
Exception: if your streaming ad breaks are set up to start after the sweeper (instead of before), then you can talk over the start of the sweeper. Just be sure not to talk over the end of the sweeper, where it cuts to the inserted content.
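The talk-over rule and its exception come down to one check: is the host still talking at the instant the stream cuts to inserted content? A minimal sketch of that check follows. The function name and the example times (in seconds on the program timeline) are hypothetical, chosen only to illustrate the rule.

```python
def talk_over_is_safe(talk_start: float, talk_end: float, cut_point: float) -> bool:
    """Return True if the host's talk segment does not run across the cut point.

    cut_point is where the stream switches to inserted content: the start of
    the sweeper if the break begins before it, or the end of the sweeper if
    the break begins after it.
    """
    # Unsafe only when the talk begins before the cut and is still going after it.
    return not (talk_start < cut_point < talk_end)


if __name__ == "__main__":
    sweeper_start, sweeper_end = 100.0, 105.0

    # Break starts before the sweeper: talking over the sweeper start is unsafe.
    print(talk_over_is_safe(98.0, 102.0, cut_point=sweeper_start))  # False

    # Break starts after the sweeper: the same talk-over is now safe.
    print(talk_over_is_safe(98.0, 102.0, cut_point=sweeper_end))    # True

    # Talking over the end of the sweeper crosses the cut and is unsafe.
    print(talk_over_is_safe(103.0, 106.0, cut_point=sweeper_end))   # False
```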