This probably has to do with protecting ignorant library
users from bad things such as the unmodeled or badly modeled
delay of cascaded tgates, where the delay of one stage depends
on the whole chain around it; built-in buffering fixes that.
Maybe this is not the tradeoff you'd have picked if you
were doing the design full-custom with SPICE and good
parasitics extraction. But library developers are stuck
with a full spectrum of client cluefulness and have to
make the dummies, as well as the experts, succeed.
So you get things like double-buffered nand3_1 gates
and no unbuffered version, like it or not.
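
To put rough numbers on the cascaded-tgate point, here is a minimal sketch using a first-order Elmore delay estimate. It treats each tgate as a series resistance with a lumped capacitance at its output node and ignores driver resistance and final load. The function names and every value (1 kOhm, 10 fF, 20 ps buffer delay) are made up for illustration and do not come from any real library.

```python
# Rough Elmore-delay comparison: a chain of n unbuffered tgates versus
# n buffered stages. Units: kOhm * fF = ps. All values are hypothetical.

def elmore_unbuffered(n, r_kohm, c_ff):
    """Elmore delay of an n-stage RC ladder (tgates in series).

    The capacitor at node i charges through i series resistances,
    so the total is r*c * (1 + 2 + ... + n), i.e. quadratic in n.
    """
    return sum(i * r_kohm * c_ff for i in range(1, n + 1))


def elmore_buffered(n, r_kohm, c_ff, t_buf_ps):
    """Delay of n stages when a buffer isolates every stage.

    Each stage only sees its own r*c plus a fixed buffer delay,
    so the total is linear in n.
    """
    return n * (r_kohm * c_ff + t_buf_ps)


if __name__ == "__main__":
    R_KOHM = 1.0     # assumed tgate on-resistance
    C_FF = 10.0      # assumed capacitance per internal node
    T_BUF_PS = 20.0  # assumed delay added by each buffer

    for n in (1, 2, 4, 8):
        unbuf = elmore_unbuffered(n, R_KOHM, C_FF)
        buf = elmore_buffered(n, R_KOHM, C_FF, T_BUF_PS)
        print(f"n={n}: unbuffered {unbuf:.0f} ps, buffered {buf:.0f} ps")
```

With these made-up numbers the buffered version is slower for short chains and faster for long ones, but the more important difference is that the buffered delay is a per-cell number that does not depend on what the cell is chained to, which is the property a cell-level timing model can actually capture.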