Although there are some threads about how to implement a pipeline, I believe this
won't be a duplicate. Also, English is not my native language, so if something
is not clear, please ask me.
I am coding a VLIW processor (Harvard architecture) with a seven-stage pipeline and a variable
instruction size. In theory I would also implement branch prediction, but I
am almost giving up...
Here is the pipeline
IF -> EXP -> ID -> SH -> EX -> MEM -> WB
EXP: this processor uses a variable instruction size, like MIPS16 and Thumb. In
this stage the instructions are expanded and the next PC is calculated
SH: just a stage to perform shift and rotation operations
The other stages are similar to the ones used in MIPS etc.
I am having two major problems right now: updating the program counter, and branch prediction. The next PC
is calculated in the EXP stage. I wonder if there is a solution with good performance that would let me
calculate the next PC in the IF stage instead. Basically:
if (inst_size == 16 bits) {
    pc = pc + 2;
} else if (inst_size == 32 bits) {
    pc = pc + 4;
} ...
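In HDL terms, this is all I would like to have inside the IF stage (a minimal Verilog sketch; the signal names are placeholders, and inst_size_16 would have to come from some cheap pre-decode of the fetched parcel):

module pc_update (
    input  wire        clk,
    input  wire        reset,
    input  wire        inst_size_16,   // 1 = 16-bit instruction, 0 = 32-bit
    output reg  [31:0] pc
);
    always @(posedge clk) begin
        if (reset)
            pc <= 32'h0000_0000;
        else
            pc <= pc + (inst_size_16 ? 32'd2 : 32'd4);
    end
endmodule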
The problem is that I need the instruction itself to calculate the next PC:
pc -> addr cycle -> read cycle -> next_pc -> update pc
Clearly this cannot be done in one cycle while keeping a throughput of one instruction per cycle. My problem
would be solved if there were a fast RAM with asynchronous read, but apparently Xilinx only has block RAM with synchronous read =/
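To make the timing issue concrete, this is the kind of synchronous-read instruction memory I am talking about (the usual template that, as far as I know, gets inferred as block RAM; names and sizes are placeholders):

module imem_sync #(
    parameter ADDR_BITS = 12
) (
    input  wire                 clk,
    input  wire [ADDR_BITS-1:0] addr,   // parcel address coming from IF
    output reg  [31:0]          dout    // only valid one cycle AFTER addr
);
    reg [31:0] mem [0:(1<<ADDR_BITS)-1];

    always @(posedge clk)
        dout <= mem[addr];              // synchronous read: data shows up on the next cycle
endmodule

So there is always one clock edge between presenting the address and getting the instruction back, and it is the instruction itself that tells me the size I need for the next address.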
The second problem is branch prediction. I wonder if it is still worth implementing such a thing on hardware with this much
latency. Is it worth adding hardware that won't help that much with the latency problem?
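To be concrete about what that hardware would be, I am not thinking of anything fancier than a small table of 2-bit saturating counters indexed by low PC bits, roughly like this sketch (sizes and names are placeholders, not a committed design):

module bht_2bit #(
    parameter IDX_BITS = 6                 // 64-entry table, just a placeholder size
) (
    input  wire                clk,
    // lookup side (the ID stage, in my case)
    input  wire [31:0]         lookup_pc,
    output wire                predict_taken,
    // update side, driven from EX once the branch is resolved
    input  wire                update_en,
    input  wire [31:0]         update_pc,
    input  wire                was_taken
);
    reg [1:0] counters [0:(1<<IDX_BITS)-1];

    // parcels are 16-bit aligned, so bit 0 of the PC is skipped
    wire [IDX_BITS-1:0] rd_idx = lookup_pc[IDX_BITS:1];
    wire [IDX_BITS-1:0] wr_idx = update_pc[IDX_BITS:1];

    wire [1:0] rd_ctr = counters[rd_idx];
    assign predict_taken = rd_ctr[1];      // counter MSB set = predict taken

    always @(posedge clk) begin
        if (update_en) begin
            if (was_taken && counters[wr_idx] != 2'b11)
                counters[wr_idx] <= counters[wr_idx] + 2'd1;
            else if (!was_taken && counters[wr_idx] != 2'b00)
                counters[wr_idx] <= counters[wr_idx] - 2'd1;
        end
    end
endmodule

Nothing exotic, which is exactly why I am not sure it pays for itself here.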
The processor uses the ZNCV flags and cmp + beq, bne, etc. instructions. Prediction would be performed in the ID
stage and the actual check in the EX stage. But because of the synchronous read from RAM, this is what actually happens:
1 - The branch is taken
2 - The PC is updated with the jump address
3 - We need to wait one cycle for the read data to be available
4 - Now the pipeline register is updated. The PC is not updated yet, because we still need to know the instruction size
5 - Calculate the next PC
6 - Update the PC
7 - One more cycle waiting for the read data to be ready
As you can see, seven cycles... Is branch prediction still worth it when we already have around 7 cycles of penalty? Maybe
it would be much better not to add more hardware and just use NOPs.
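To spell that option out: when the branch resolves in EX, squash everything younger and redirect the PC. A rough Verilog sketch, assuming each pipeline register carries a valid bit and a cleared valid bit behaves like a NOP (which is how I would make the bubbles):

module branch_flush (
    input  wire        clk,
    input  wire        reset,
    input  wire        ex_branch_taken,    // resolved in the EX stage
    input  wire [31:0] ex_branch_target,
    output reg  [31:0] pc,
    output wire        flush_younger       // clears the valid bits of IF/EXP/ID/SH
);
    assign flush_younger = ex_branch_taken;

    always @(posedge clk) begin
        if (reset)
            pc <= 32'h0000_0000;
        else if (ex_branch_taken)
            pc <= ex_branch_target;        // fetching the target still pays the
                                           // synchronous-read latency on top of this
        // else: normal next-PC logic, not shown here
    end
endmodule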
Again, this complication would not exist if there were a fast RAM with asynchronous read: in the same cycle, the instruction would be fetched and the next PC calculated.
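This is the kind of single-cycle fetch I am wishing for (sketch; with a combinational read the instruction, and therefore its size, is available in the same cycle as the address):

module imem_async #(
    parameter ADDR_BITS = 10
) (
    input  wire [ADDR_BITS-1:0] addr,
    output wire [31:0]          dout
);
    reg [31:0] mem [0:(1<<ADDR_BITS)-1];

    assign dout = mem[addr];   // asynchronous read: no extra cycle before the data
endmodule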
I wonder how soft processors like NIOS II, MicroBlaze, LEON and others deal with instruction memory.
I would be happy if I could code a processor that competes with them, but the only solution I can see to my problem
is to use two cycles per stage, and that is a huge performance hit =/
There is a VLIW processor named r-VEX whose i_mem file I read. The only problem is that it is a ROM, not a RAM =/
Could anyone help me, please? Insights, coding suggestions, etc.