Sounds like you want a complex op amp's performance
from one of the simplest topologies. Not likely.
Without changing topology, about the only idea I have for
increasing negative input common mode range is to see if
your process offers a "native" / "zero-VT" FET. Some do,
some (esp. low cost digital-only) don't. A rail-rail input
topology would roughly double the area involved.
If you want phase margin, run it hotter (more bias current).
But that steals from DC gain (decreasing Rout). Adding a second
differential gain stage can help small signal AVOL & BW
a lot, but you don't want to increase area.
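To put rough numbers on the "hotter steals DC gain" tradeoff, here is a minimal sketch assuming a long-channel square-law MOSFET in saturation. The function name and the values for k_prime, w_over_l and lam are made up for illustration, not taken from any real process; a short-channel or weak-inversion device will not follow these exact exponents.

```python
import math

def small_signal(i_d, k_prime=200e-6, w_over_l=10.0, lam=0.1):
    """Square-law estimates for one FET (assumed model, illustrative values).

    i_d      : drain bias current in A
    k_prime  : process transconductance in A/V^2 (hypothetical)
    w_over_l : device aspect ratio (hypothetical)
    lam      : channel-length modulation in 1/V (hypothetical)
    """
    gm = math.sqrt(2.0 * k_prime * w_over_l * i_d)  # rises as sqrt(Id)
    ro = 1.0 / (lam * i_d)                          # falls as 1/Id
    return gm, ro, gm * ro                          # gain falls as 1/sqrt(Id)

for i_d in (10e-6, 20e-6, 40e-6):
    gm, ro, gain = small_signal(i_d)
    print(f"Id={i_d * 1e6:4.0f} uA  gm={gm * 1e6:6.1f} uS  "
          f"ro={ro / 1e3:7.1f} kOhm  gm*ro={gain:6.1f}")
```

Under this model, quadrupling the current doubles gm (which is what moves poles out) but cuts ro by four, so the stage gain gm*ro is halved.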
Have you even tried to play with simple optimization,
"stimulus, response" style, device (pair) by device (pair),
seeing what each change benefits and what it costs you
across these interests?
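The device-by-device "stimulus, response" idea could be scripted roughly like this: bump one parameter at a time and log what every metric does. This is only a sketch. toy_evaluate and the parameter names (w_in, w_load, i_tail) are placeholders I made up; in practice evaluate would wrap a run of your own simulator and return whatever metrics you care about.

```python
def one_at_a_time(evaluate, nominal, rel_step=0.1):
    """Bump each parameter alone by rel_step; return the fractional
    change of every metric relative to the nominal design."""
    base = evaluate(nominal)
    report = {}
    for name, value in nominal.items():
        trial = dict(nominal)                    # copy so only one knob moves
        trial[name] = value * (1.0 + rel_step)
        result = evaluate(trial)
        report[name] = {m: (result[m] - base[m]) / base[m] for m in base}
    return report

def toy_evaluate(p):
    # Placeholder standing in for a simulator run; NOT a real op amp model.
    return {
        "gain": (p["w_in"] / p["i_tail"]) ** 0.5,
        "area": p["w_in"] + p["w_load"],
    }

# Hypothetical starting point (arbitrary units).
nominal = {"w_in": 20.0, "w_load": 10.0, "i_tail": 20.0}

for name, deltas in one_at_a_time(toy_evaluate, nominal).items():
    print(name, {m: f"{d:+.1%}" for m, d in deltas.items()})
```

Each row of the printout is one stimulus, and the columns are the responses, so you can see at a glance which knob buys gain and which one only costs area.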