If for the next-generation design you could pick one type of instruction to make twice as fast (half the latency), which instruction type would you pick? Why?
The answer to this question is not so simple, mainly for two reasons: how frequently each instruction type is used depends on the application, and the latency of some instructions is not fixed but depends on their operands. In general, the best candidate is the instruction type that accounts for the largest share of execution time, that is, the one with the largest product of frequency and latency. For example, if the design is intended for a DSP core, much of the effort should go into the multiplication and division instructions.
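The reasoning above can be sketched with a small calculation. This is only an illustration under a simplified model in which instructions do not overlap, so total time is the sum of frequency times latency for each type. The function name `best_type_to_halve` and both instruction mixes are hypothetical numbers chosen for the example, not measurements from any real workload or processor.

```python
def best_type_to_halve(mix):
    """Pick the instruction type whose latency is most worth halving.

    mix maps an instruction type to (fraction of instructions, latency in cycles).
    Assumes a simple non-overlapping model: average cycles per instruction
    is the sum of fraction * latency over all types.
    """
    base_cpi = sum(freq * lat for freq, lat in mix.values())
    speedups = {}
    for name, (freq, lat) in mix.items():
        # Halving this type's latency removes half of its contribution.
        new_cpi = base_cpi - freq * lat / 2
        speedups[name] = base_cpi / new_cpi
    best = max(speedups, key=speedups.get)
    return best, speedups


# Hypothetical general-purpose mix: few multiplies/divides, many loads/stores.
general_mix = {
    "alu": (0.50, 1),
    "load_store": (0.30, 3),
    "branch": (0.15, 2),
    "mul_div": (0.05, 10),
}

# Hypothetical DSP-style mix: multiply/divide instructions dominate.
dsp_mix = {
    "alu": (0.35, 1),
    "load_store": (0.25, 3),
    "branch": (0.05, 2),
    "mul_div": (0.35, 10),
}

for label, mix in (("general", general_mix), ("dsp", dsp_mix)):
    best, speedups = best_type_to_halve(mix)
    print(f"{label}: halve '{best}' for a speedup of {speedups[best]:.2f}x")
```

With these assumed numbers, the general-purpose mix favors loads and stores, while the DSP-style mix favors multiplication and division, which matches the point that the right choice depends on the target application.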