Thanks for your reply.
I did not know about c-slowing. I did a bit of quick reading up on it, and it does indeed look like I can use it on this project.
**broken link removed** gave me some idea of what c-slowing does.
If you know of any more reading material on this, that would be much appreciated!
One particular module that looks like a good candidate is this one:
Code:
module bubble8_to_bin3_ce (
    input            clk,
    input            ce,
    input      [7:0] bubble_count_in,
    output reg [2:0] bin_count_out
);

    initial begin
        bin_count_out = 0;
    end

    always @(posedge clk) begin
        if (ce) begin
            casez (bubble_count_in)
                // Decode RIGHT running counter
                8'b1?000000: bin_count_out <= 3'b000;
                8'b01?00000: bin_count_out <= 3'b001;
                8'b001?0000: bin_count_out <= 3'b010;
                8'b0001?000: bin_count_out <= 3'b011;
                8'b00001?00: bin_count_out <= 3'b100;
                8'b000001?0: bin_count_out <= 3'b101;
                8'b0000001?: bin_count_out <= 3'b110;
                8'b?0000001: bin_count_out <= 3'b111;
                default    : bin_count_out <= 3'b000;
            endcase // bubble_count_in
        end
    end

endmodule // bubble8_to_bin3_ce
For the technology schematic view of this module, see the "bubble8_to_bin3_ce-techview.pdf" attachment.
Looks to me like it would benefit from stuffing extra registers between the 1st level of LUTs and the 2nd level of LUTs. I noticed that before and tried to add an extra pipeline stage by hand. Buuuut, that resulted in unreadable HDL real fast, so I decided against that. Ideally I would like to tell the tool (Xilinx ISE) that after 2 cycles I would like to end up with that function, and that it is free to stuff 1 extra stage of registers wherever it deems necessary.
So using the c-slowing + retiming approach I would need to add one extra stage of registers, tell the tool "Yo, retiming opportunity here. Gogogo!", wait a long time, et voila. Right?
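To make that concrete, here is a rough sketch of what I have in mind: the same decode, followed by one plain register stage that the tool could move backward between the LUT levels. The module name bubble8_to_bin3_ce_2stage and the signal bin_count_d are made up for illustration, and the register_balancing attribute is my assumption of how XST wants to be told about this; I have not verified it against the XST user guide for my ISE version.

```verilog
// Sketch only: same decode as bubble8_to_bin3_ce, plus one extra register,
// so the result appears after 2 enabled cycles instead of 1.
// ASSUMPTION: XST honours register_balancing placed on the module like this.
(* register_balancing = "yes" *)
module bubble8_to_bin3_ce_2stage (
    input            clk,
    input            ce,
    input      [7:0] bubble_count_in,
    output reg [2:0] bin_count_out
);

    reg [2:0] bin_count_d;  // hypothetical name: output of the original decode stage

    initial begin
        bin_count_d   = 0;
        bin_count_out = 0;
    end

    always @(posedge clk) begin
        if (ce) begin
            casez (bubble_count_in)
                // Decode RIGHT running counter
                8'b1?000000: bin_count_d <= 3'b000;
                8'b01?00000: bin_count_d <= 3'b001;
                8'b001?0000: bin_count_d <= 3'b010;
                8'b0001?000: bin_count_d <= 3'b011;
                8'b00001?00: bin_count_d <= 3'b100;
                8'b000001?0: bin_count_d <= 3'b101;
                8'b0000001?: bin_count_d <= 3'b110;
                8'b?0000001: bin_count_d <= 3'b111;
                default    : bin_count_d <= 3'b000;
            endcase
            // Extra stage: nothing but a register. With retiming enabled, the
            // tool is free to push it back into the logic in front of it.
            bin_count_out <= bin_count_d;
        end
    end

endmodule
```

Both registers use the same clk and ce on purpose. As far as I understand, the tool can only move a register across logic if the flip-flops involved share the same control signals, so I would keep the extra stage as plain as possible.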
As you may have guessed, the above module uses a clock enable "ce", and I have 2 of those running in parallel on the even/odd clock cycles. That does work, but if I can make the tool properly place the extra flip-flops between the LUT stages, that would be even better.
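For reference, the even/odd arrangement looks roughly like the sketch below. The wrapper name, the port names and the phase toggle are simplified stand-ins, not my actual code; only bubble8_to_bin3_ce itself is the module shown above.

```verilog
// Sketch only: two decoders, each enabled on alternate clock cycles.
// All names except bubble8_to_bin3_ce are hypothetical.
module bubble_decode_pair (
    input        clk,
    input  [7:0] bubble_even_in,
    input  [7:0] bubble_odd_in,
    output [2:0] bin_even_out,
    output [2:0] bin_odd_out
);

    // Toggles every cycle; selects which decoder is enabled.
    reg phase = 1'b0;
    always @(posedge clk)
        phase <= ~phase;

    wire ce_even = ~phase;
    wire ce_odd  =  phase;

    bubble8_to_bin3_ce dec_even (
        .clk             (clk),
        .ce              (ce_even),
        .bubble_count_in (bubble_even_in),
        .bin_count_out   (bin_even_out)
    );

    bubble8_to_bin3_ce dec_odd (
        .clk             (clk),
        .ce              (ce_odd),
        .bubble_count_in (bubble_odd_in),
        .bin_count_out   (bin_odd_out)
    );

endmodule
```

Each decoder then only has to produce a new result every second clock cycle, which is where the extra timing margin comes from.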
Thinking a bit more about this, strictly speaking this is probably not c-slowing, but just retiming.
As you say, the retiming optimization is very slow. Are there any specific methods of writing Verilog, constraints, attributes, etc. that will help the tool so that this retiming optimization doesn't take any longer than absolutely necessary?
Now for the part with the clock enables... As you say, I can manually pipeline the design to get the desired increase in clock frequency. For the timing-critical path I have already done that and broken it down into a lot of smaller stages with typically 1 logic level.
As it happens, the first stage in the pipeline is 1 logic level, and at full clock speed this still doesn't meet timing. I went over the timing report, and typically the data path delay is 60% or more routing. For example, I have one path that is 35% logic, 65% routing. When I noticed that, the first thing I did was to add another stage of flip-flops in between, to give the router some more freedom to span the gap between the slices, so to speak. But even doing that was not all that great. As far as I understand it, adding the extra flip-flop does give it extra slack, but not as much slack as it would gain from having the freedom to take twice as long.
What I mean is:
Code:
First case:
FF_src --- long route --- [logic] --- FF_dst
Second case:
FF_src --- shorter route A --- FF_intermediate --- shorter route B --- [logic] --- FF_dst
Third case:
FF_src --- long route --- [logic] --- FF_dst
 (CE)                                  (CE)
The first case is without any modification. Because the source flip-flop (FF_src) and the destination flip-flop (FF_dst) are so far apart the data path delay is largely routing delay.
So we add an intermediate flip-flop, which essentially cuts the long route in two shorter routes. This does indeed help, but only by so much. Due to the fact that there are only so many locations where it can put that intermediate FF, you end up throwing away some slack when compared to the third case.
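In Verilog the second case is nothing more than the sketch below. The module name, the 8-bit width and the inverter standing in for [logic] are placeholders. The shreg_extract attribute is my assumption about how to stop XST from folding the register chain into a shift-register LUT, which would defeat the purpose of having a flip-flop that can be placed halfway along the route.

```verilog
// Sketch only: second case, with an intermediate flip-flop splitting the route.
// Names, width and the placeholder logic are hypothetical.
module intermediate_ff_path (
    input            clk,
    input      [7:0] d_in,
    output reg [7:0] q_out
);

    reg [7:0] ff_src;

    // ASSUMPTION: XST accepts shreg_extract on a signal declaration like this.
    (* shreg_extract = "no" *)
    reg [7:0] ff_intermediate;

    initial begin
        ff_src          = 0;
        ff_intermediate = 0;
        q_out           = 0;
    end

    always @(posedge clk) begin
        ff_src          <= d_in;
        ff_intermediate <= ff_src;            // shorter route A ends here
        q_out           <= ~ff_intermediate;  // shorter route B + placeholder [logic]
    end

endmodule
```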
The third case is essentially the same as the first case, only now we use a clock enable (CE) on the flip-flops, so that it only clocks in new data at half the normal rate.
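As a sketch, the third case looks like this. Again, the module name, the width, the locally generated ce toggle and the inverter standing in for [logic] are placeholders, not my real design.

```verilog
// Sketch only: third case, same path as the first case but with both
// flip-flops enabled only every second clock cycle.
// Names, width and the placeholder logic are hypothetical.
module ce_half_rate_path (
    input            clk,
    input      [7:0] d_in,
    output reg [7:0] q_out
);

    reg       ce;      // high on every second cycle
    reg [7:0] ff_src;

    initial begin
        ce     = 1'b0;
        ff_src = 0;
        q_out  = 0;
    end

    always @(posedge clk) begin
        ce <= ~ce;
        if (ce) begin
            ff_src <= d_in;     // FF_src (CE)
            q_out  <= ~ff_src;  // long route + placeholder [logic] into FF_dst (CE)
        end
    end

endmodule
```

Note that the HDL alone does not relax anything as far as the tools are concerned. The path from ff_src to q_out only gets two clock periods in the timing analysis if a matching multi-cycle constraint is applied to the flip-flops driven by that ce.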
Like I said, I already tried adding the intermediate flip-flop, but that was still giving me some problems, which is why I decided to use clock enables there.
So unless I am really missing something (entirely possible), I will still need to use clock enables for that part of the design. So I still need to find a good solution for the problems described in the original post.
But you did give me some new methods that I can use for at least one other part (that bubble counter decoder), which will improve things. Thanks!