Making the device small is a step toward maximizing current
density (drive current per unit of the device's own parasitic
capacitance), which is generally a good thing. But it is not
the only thing.
Once the other layout parasitics are accounted for, you may
find that larger devices run at higher currents are the way to
"bury" the interconnect and driven-device loading (which are
more or less "fixed losses"). You should put this burden on
your design as early as possible, so you don't optimize
yourself into a result that is perfect in isolation but
uselessly weak in application.
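
To put rough numbers on the "bury the fixed losses" idea, here is
a minimal sketch assuming a simple slew-limited delay estimate,
t = C_total * V_swing / I, with both the bias current and the
device's own capacitance scaling with width. Every number in it
(1 fF/um self-loading, 100 uA/um, 0.5 V swing, 20 fF of fixed
load) is an illustrative assumption, not data from any process.

```python
def stage_delay_ps(width_um, c_fixed_fF,
                   c_self_fF_per_um=1.0,   # assumed device self-loading
                   i_per_um_uA=100.0,      # assumed bias current density
                   swing_V=0.5):           # assumed output swing
    """Slew-limited delay estimate: t = C_total * V_swing / I."""
    c_total_fF = c_fixed_fF + c_self_fF_per_um * width_um
    i_uA = i_per_um_uA * width_um
    # fF * V / uA comes out in ns; scale to ps.
    return 1e3 * c_total_fF * swing_V / i_uA


if __name__ == "__main__":
    for w in (1, 3, 10, 30, 100):
        print(f"W={w:>3} um  isolated: {stage_delay_ps(w, 0.0):6.1f} ps"
              f"  with 20 fF fixed load: {stage_delay_ps(w, 20.0):6.1f} ps")
```

In isolation the delay does not depend on width at all, so the
small device looks just as good as the big one. With the fixed
load included, the small device is far slower and the larger,
higher-current device approaches the isolated figure.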
Getting more gain up front is key to high-speed operation
at low overdrive levels. The variation of delay with overdrive
can be quite significant in low-stage-count topologies. The
remedy there is more differential, high-bandwidth stages.
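
The stage-count effect can be sketched with a textbook-style
approximation: treat each stage as an ideal integrator with
unity-gain frequency f_u, so that for an input step V_od the
output of N cascaded stages grows as V_od * (w_u * t)^N / N!.
Solving for the time to reach a logic threshold gives a delay
proportional to V_od^(-1/N). This holds only while no stage has
saturated, and the 5 GHz and 0.5 V values are assumptions chosen
for illustration.

```python
import math


def cascade_delay_s(v_od, n_stages, v_logic=0.5, f_unity_hz=5e9):
    """Delay of N cascaded integrator-like stages to reach v_logic.

    Assumes v_out(t) = v_od * (w_u * t)**N / N!  (early-time,
    unsaturated approximation), solved for t at v_out = v_logic.
    """
    w_u = 2.0 * math.pi * f_unity_hz
    return (math.factorial(n_stages) * v_logic / v_od) ** (1.0 / n_stages) / w_u


if __name__ == "__main__":
    for n in (1, 2, 4):
        slow = cascade_delay_s(1e-3, n)     # 1 mV overdrive
        fast = cascade_delay_s(100e-3, n)   # 100 mV overdrive
        print(f"N={n}: {slow * 1e12:9.1f} ps at 1 mV, "
              f"{fast * 1e12:7.1f} ps at 100 mV, ratio {slow / fast:6.2f}")
```

Under these assumptions a 100:1 change in overdrive moves the
delay of a single stage by 100:1, but that of a four-stage
lineup by only about 3:1.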
Your cross-coupled load is good for higher DC gain (local
regenerative feedback) but may leave the stage's gain node
with an impedance which, together with the local parasitic C,
gives too low a BW to meet your goals. Your front end might
want to look more like a low-noise differential RF amplifier
lineup than a classical DC precision comparator (which the
back end might still resemble, somewhat).
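
The gain-versus-bandwidth trade at that node can be sketched
under one assumption about the load: a diode-connected device
(gm_diode) in parallel with the cross-coupled device, which
contributes a negative conductance (-gm_cross) in the
differential half-circuit. Output resistance of the devices is
ignored, and all component values are made up for illustration;
your actual load may be arranged differently.

```python
import math


def load_stage(gm_in_S, gm_diode_S, gm_cross_S, c_node_F):
    """Return (DC gain, -3 dB bandwidth in Hz) of one gain stage.

    Assumed half-circuit: R_load = 1 / (gm_diode - gm_cross),
    single pole at 1 / (2*pi*R_load*C_node).
    """
    g_net = gm_diode_S - gm_cross_S
    if g_net <= 0.0:
        raise ValueError("net load conductance <= 0: stage latches, not amplifies")
    r_load = 1.0 / g_net
    gain = gm_in_S * r_load
    bw_hz = 1.0 / (2.0 * math.pi * r_load * c_node_F)
    return gain, bw_hz


if __name__ == "__main__":
    for gm_x in (0.0, 0.5e-3, 0.8e-3, 0.95e-3):
        gain, bw = load_stage(1e-3, 1e-3, gm_x, 20e-15)
        print(f"gm_cross={gm_x * 1e3:4.2f} mS  gain={gain:5.1f}  BW={bw / 1e9:5.2f} GHz")
```

In this model the gain-bandwidth product stays pinned at
gm_in / (2*pi*C_node): every bit of DC gain the cross-coupling
buys is paid for in bandwidth at that node.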
Now, the only design value you have articulated is speed.
Of course everybody wants nil power, area and offset, and
infinite gain and bandwidth. But you owe it to yourself to
make more of these dimensions explicit, and see whether they
can in fact be met at once in a relevant application setup.
The point is that this is going to drive you into certain
topology and bias choices, and so should get figured out
early on.