An NSF is already basically a ROM. It just doesn't have graphics or gameplay code, and its header holds a few addresses (load, init and play) that tell a player how to run it. So every NSF carries its own audio engine and its own music data.
EZNSF is not an audio engine, because the audio engine is inside the NSF. It includes a very small amount of code that uses the addresses the NSF standard requires in order to play it. I could create an NSF that uses 98% of the available RAM, which obviously wouldn't work alongside a game. It wouldn't even work with EZNSF (if the docs are right), but it would still be a valid NSF.
Part of the problem with just supporting everything from anywhere is that the music engine and the game must share RAM and ROM space (as well as CPU time). You can include an NSF in a ROM and play it somewhat "easily" so long as it's not a bank-switching NSF. The problem is that, given a random NSF, it's hard to find out what RAM it's using. If it's using the same RAM as the game engine, you get all kinds of glitches that are very hard to debug.
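To make the "addresses" and "bank switching" parts concrete, here's a rough Python sketch that reads the header of an NSF and reports the entry points and whether it bank switches. The offsets follow the NSF header layout as I understand it (load/init/play at $08-$0D, eight bank-switch init bytes at $70-$77); the function names and the fake header are just made up for illustration, and none of this is code from EZNSF.

```python
import struct

NSF_HEADER_SIZE = 0x80  # the header is 128 bytes; driver + music data follow it


def parse_nsf_header(data: bytes) -> dict:
    """Pull out the fields a player needs from an NSF header."""
    if len(data) < NSF_HEADER_SIZE or data[:5] != b"NESM\x1a":
        raise ValueError("not an NSF file")
    # Three little-endian 16-bit addresses starting at offset $08.
    load_addr, init_addr, play_addr = struct.unpack_from("<HHH", data, 0x08)
    # Eight bank-switch init values; any non-zero byte means the NSF bank switches.
    bank_init = data[0x70:0x78]
    return {
        "songs": data[0x06],
        "load": load_addr,
        "init": init_addr,
        "play": play_addr,
        "bankswitched": any(bank_init),
    }


def make_fake_header(load: int, init: int, play: int, banks: bytes = bytes(8)) -> bytes:
    """Build a synthetic header for testing (hypothetical data, not a real rip)."""
    header = bytearray(NSF_HEADER_SIZE)
    header[0:5] = b"NESM\x1a"
    header[0x05] = 1  # version
    header[0x06] = 1  # total songs
    header[0x07] = 1  # starting song
    struct.pack_into("<HHH", header, 0x08, load, init, play)
    header[0x70:0x78] = banks
    return bytes(header)


if __name__ == "__main__":
    info = parse_nsf_header(make_fake_header(0x8000, 0x8003, 0x8006))
    print(f"init=${info['init']:04X} play=${info['play']:04X} "
          f"bankswitched={info['bankswitched']}")
```

Note what this does not tell you: nothing in the header says which RAM the NSF's driver touches. That is exactly the part you'd have to work out by tracing or disassembling it.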
The reason to use GGSound is that the RAM it uses is known, its CPU use is low, and it supports sound effects without a hack. There's nothing in the NSF standard that deals with sound effects (as far as playing them simultaneously with music). Even NSFs ripped from games that have sound effects aren't guaranteed to include the code that handles them.