I am in a love-hate relationship with digital audio processing,
and have never experienced a converter in person that is comparable to an all-analog chain.
Especially in the very highs and 3Dness.

I suspect you've never double-blind tested this. If you have, good for you.

And I guess there is no mastering studio running 48k in 2020, no offence intended.

This proves absolutely nothing. Audio engineers are no more immune to marketing BS than anyone else.

It seems like this is taking care of the drifting clocks with a buffer and alignment?

zita_a2j will resample the stream from the hardware it uses, to compensate for the apparent difference in speed between that hardware and the hardware that the JACK server is using. There will be no drift and no alignment issues.
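To put rough numbers on that, here is a minimal sketch of the arithmetic. It is not how zita_a2j is implemented (the real tool has to estimate the ratio continuously while running); the 50 ppm offset and the function names are made up purely for illustration.

```python
def drift_samples(nominal_rate_hz, ppm_offset, seconds):
    """Samples by which two nominally equal clocks diverge after `seconds`."""
    return nominal_rate_hz * ppm_offset * 1e-6 * seconds


def resample_ratio(card_rate_hz, jack_rate_hz):
    """Output/input rate ratio that maps the card's actual rate onto JACK's."""
    return jack_rate_hz / card_rate_hz


# Hypothetical figures: a second card running 50 ppm fast
# relative to the card that the JACK server is using.
jack_rate = 48000.0
card_rate = 48000.0 * (1 + 50e-6)

per_minute = drift_samples(48000.0, 50.0, 60.0)
ratio = resample_ratio(card_rate, jack_rate)

print(f"drift without resampling: about {per_minute:.0f} samples per minute")
print(f"resampling ratio needed:  {ratio:.8f}")
```

Without correction, any fixed buffer between the two cards would eventually overrun or underrun. Resampling by a ratio very slightly below 1 is what keeps the two streams in step.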

The RME as master? Does that mean the hardware-clock of the RME would define the whole DSP-chain? Somewhere I read that RME-cards can only run as slave in Linux, but maybe this is outdated?

This is false, and was never true.

There might be some other downsides? Phase issues etc.? ... I will read more ...

All the downsides come from your desire to build a digital audio system with 9 clocks in it. This is absolutely the wrong thing to do. The fact that you can use software (like zita_a2j) to hide or gloss over the fact that this is wrong doesn't stop it being wrong, and doesn't get rid of the downsides of having 9 clocks. Rule #1 for digital audio: 1 clock.