HELP please! multi-cycle paths using clock enable

mrflibble · Mar 7, 2011

Argh! I've been using clock enables for datapaths that are allowed to take multiple cycles to operate. Basically I duplicate the logic and run each copy at half speed, so I still get full throughput.

Now in principle things are working, but two things are not going exactly the way I would like.

1 - Specifying timing constraints for multi-cycle paths is totally annoying.
2 - The fanout for the clock enables is such that it does not hit the limit for MAX_FANOUT yet. As such no register duplication occurs. But the clock enable has to reach flip flops that are spread out far enough that some of the more distant ones do not meet timing.

Now I can specify a MAX_FANOUT attribute on the clock enable flip-flop, such that it should start duplicating registers. The fun part is that when I do this, the timings actually get worse. I also read that things can get unpredictable with a MAX_FANOUT value of 30 or less. Now for one particular clock enable that had a fanout of 63, I thought I'd override that and put a MAX_FANOUT = 40 on it. And as said, it did change things, only for the worse.

Since the clock enable was for a 2 cycle path, the clock enable was nothing more than something like this:

Code:

(* MAX_FANOUT = 40 *) reg ce = 1'b0;

always @(posedge clk) begin
    ce <= ~ce;
end

And then for the 2 cycle operation:

Code:

always @(posedge clk) begin
    if (ce) begin
        // stuff 
    end
end

Like I said, that works well, right up until the point where the Q output of that "ce" flip-flop has to travel too far to the CE pin of the various flip-flops.

Now the fix for that is relatively simple, namely make a couple of local copies by hand at roughly the right locations. That does work, and it's also tedious and annoying. This is precisely the sort of thing I would expect a tool to do for me. No doubt I am doing something wrong, but not sure what that is. Maybe I should be setting the MAX_FANOUT to some magic number? Mmmh, haven't tried 42 yet... Or maybe I should enable/disable the magic synthesis option? Anyone have any idea on how to go about this?
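For reference, the by-hand duplication looks roughly like the sketch below. The module and signal names (ce_toggle_local, ce_a, ce_b) are made up for illustration, and the keep / equivalent_register_removal attributes are my assumption about what stops XST from merging the identical registers back into one, so check the synthesis report to see whether both copies survive.

```verilog
// Hand-made local copies of the 2-cycle clock enable.
// All names here are hypothetical, not taken from the real design.
module ce_toggle_local (
    input  clk,
    output ce_a,   // use this copy for the flip-flops in region A
    output ce_b    // use this copy for the flip-flops in region B
    );

// Assumption: these XST attributes keep the two identical registers
// from being optimized back into a single one.
(* keep = "true", equivalent_register_removal = "no" *) reg ce_a_r = 1'b0;
(* keep = "true", equivalent_register_removal = "no" *) reg ce_b_r = 1'b0;

// Both copies start at 0 and toggle every cycle, so they stay in phase.
always @(posedge clk) begin
    ce_a_r <= ~ce_a_r;
    ce_b_r <= ~ce_b_r;
end

assign ce_a = ce_a_r;
assign ce_b = ce_b_r;

endmodule
```

Each copy then only has to reach the flip-flops near it, which is the whole point, but deciding which flip-flop gets which copy is the tedious part.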

As for timing constraints... I have tried to set up constraints by putting the clock enable in a TNM_NET, and then doing the FROM and TO on that TNM_NET. So:

Code:

TIMESPEC  "TS_GCLK" = PERIOD "GCLK"  2.666 ns; # global clock

NET "*/something/ce"  TNM_NET = "TNM_MY_CE";
TIMESPEC TS_MCP_CE = FROM "TNM_MY_CE"  TO "TNM_MY_CE"  TS_GCLK*2; # allow 2 cycles

Putting the "TNM_MY_CE" TNM_NET on a NET makes sure that the tools trace forward from that net to the first synchronous elements it encounters, i.e. the flip-flops whose CE pins it connects to. That sort of works, but has some side effects that prevent this method from being really useful. So far the best method I have found is to just make separate time groups for the source and destination INSTances, and do timespecs for those. So something like:

Code:

INST "*/module_A/source_ff"  TNM = "TNM_FROM_HERE";  # TNM (not TNM_NET) is the keyword for an INST
INST "*/module_B/dest_ff" TNM = "TNM_TO_THERE";
TIMESPEC TS_MCP_WORKS_BUT_IS_TEDIOUS = FROM "TNM_FROM_HERE"  TO "TNM_TO_THERE"  TS_GCLK*2; # allow 2 cycles

This does work, has no side-effects, and is a pain to maintain. Ideally I would like something that uses the clock enable to define the TIMESPEC, but I am not sure how to go about that...

Thanks in advance for any ideas/hints/tips/reading material that you can think of!

permute · Mar 8, 2011

c-slowing might be a better option in your case.

in c-slowing, conceptually, you take each register in the design and turn it into a shift register of length C. eg, an accumulator normally looks like
a <= a + b;
would now look like (for a factor of 2)
a2 <= a1;
a1 <= a2 + b;

likewise, if b was a register, there would be b1, and b2.

The circuit is then fed with two independent data sources, on alternating clock cycles. eg, src0 = 1,2,3,4. src1 = 2,2,2,2
input = 1 2 2 2 3 2 4 2
if you look at the accumulator, you'll notice that you have (for a1 = a2 = 0 at t = 0)
(a1, a2) = (1,0) (2,1) (3,2) (4,3) (6,4) (6,6) (10,6) (8,10)
so that at the end of 2*4 cycles you have accumulated 4 values from 2 independent data streams.
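
As a rough sketch, the factor-of-2 accumulator above can be written and checked like this. The module name acc_cslow2, the width parameter W and the little testbench are only illustrative; the two assignments in the always block are the ones given above.

```verilog
// 2-slowed accumulator: two independent streams share one adder.
// Module name, W and the testbench are hypothetical, for illustration only.
module acc_cslow2 #(parameter W = 8) (
    input              clk,
    input      [W-1:0] b,    // stream 0 and stream 1 on alternating cycles
    output reg [W-1:0] a1,
    output reg [W-1:0] a2
    );

initial begin
    a1 = 0;
    a2 = 0;
end

always @(posedge clk) begin
    a2 <= a1;        // the extra register added by c-slowing
    a1 <= a2 + b;    // the original accumulator, now reading the delayed value
end

endmodule

// Testbench: feeds the interleaved input 1 2 2 2 3 2 4 2.
module tb;
    reg        clk = 0;
    reg  [7:0] b   = 0;
    wire [7:0] a1, a2;
    reg  [7:0] stim [0:7];
    integer    i;

    acc_cslow2 #(.W(8)) dut (.clk(clk), .b(b), .a1(a1), .a2(a2));

    initial begin
        stim[0] = 1; stim[1] = 2; stim[2] = 2; stim[3] = 2;
        stim[4] = 3; stim[5] = 2; stim[6] = 4; stim[7] = 2;
        for (i = 0; i < 8; i = i + 1) begin
            b = stim[i];
            #5 clk = 1;   // rising edge: registers update
            #5 clk = 0;
            $display("(a1, a2) = (%0d, %0d)", a1, a2);
        end
        $finish;
    end
endmodule
```

Run in a simulator, this should print the same (a1, a2) sequence as listed above, ending with a2 = 10 (the sum of src0) and a1 = 8 (the sum of src1).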

Retiming is an automated process of moving registers to balance combinatorial delays. Retiming is often associated with the c-slowing method, as it allows the designer to take an algorithm and convert it easily. But the retiming optimization is very slow. You can also manually pipeline the design to get the desired increase in clock frequency. The issue is remembering that some sections will need additional registers for the c-slowing scheme, even though the sections might not contain complex logic.

mrflibble · Mar 8, 2011

Thanks for your reply.

I did not know about c-slowing. I did a bit of quick reading up on it, and it does indeed look like I can use it on this project.

**broken link removed** gave me some idea on what c-slowing does.

If you know any more reading material on this that would be much appreciated!

One particular module that looks like a good candidate is this one:

Code:

module bubble8_to_bin3_ce (
    input            clk,
    input            ce,
    input      [7:0] bubble_count_in,
    output reg [2:0] bin_count_out
    );

initial begin
    bin_count_out = 0;
end


always @(posedge clk) begin
    if (ce) begin
        casez(bubble_count_in)
          // Decode RIGHT running counter
          8'b1?000000: bin_count_out <= 3'b000;
          8'b01?00000: bin_count_out <= 3'b001;
          8'b001?0000: bin_count_out <= 3'b010;
          8'b0001?000: bin_count_out <= 3'b011;
          8'b00001?00: bin_count_out <= 3'b100;
          8'b000001?0: bin_count_out <= 3'b101;
          8'b0000001?: bin_count_out <= 3'b110;
          8'b?0000001: bin_count_out <= 3'b111;
          default    : bin_count_out <= 3'b000;
        endcase // bubble_count_in
    end
end

endmodule // bubble8_to_bin3_ce

For the technology schematic view of this module, see the "bubble8_to_bin3_ce-techview.pdf" attachment.

Looks to me like it would benefit from stuffing extra registers between the 1st level of LUTs and the 2nd level of LUTs. I did notice that before, and tried to add an extra pipeline stage by hand. Buuuut, that resulted in unreadable HDL real fast, so I decided against that. Ideally I would like to tell the tool (Xilinx ISE) that after 2 cycles I would like to end up with that function, and that it is free to stuff 1 extra stage of registers wherever it deems necessary.

So using the c-slowing + retiming approach I would need to add one extra stage of registers, tell the tool "Yo, retiming opportunity here. Gogogo!", wait a long time, et voila. Right?
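
Something like the sketch below is what I have in mind. It is the same decoder without the clock enable, running at full rate, with one extra register stage behind it. The module name bubble8_to_bin3_pipe and the signal bin_count_d are made up, and I am assuming the XST register_balancing attribute is the right way to ask for retiming. Whether XST really pulls the extra stage back between the LUT levels is something I would have to check in the technology view.

```verilog
// Hypothetical full-rate variant of bubble8_to_bin3_ce: no clock enable,
// 2 cycles of latency, and an extra register stage the tool may retime.
// Assumption: XST honors register_balancing on the module.
(* register_balancing = "yes" *)
module bubble8_to_bin3_pipe (
    input            clk,
    input      [7:0] bubble_count_in,
    output reg [2:0] bin_count_out
    );

reg [2:0] bin_count_d;   // decoded value, one cycle before the output

initial begin
    bin_count_d   = 0;
    bin_count_out = 0;
end

always @(posedge clk) begin
    casez (bubble_count_in)
      // Decode RIGHT running counter (same table as the original module)
      8'b1?000000: bin_count_d <= 3'b000;
      8'b01?00000: bin_count_d <= 3'b001;
      8'b001?0000: bin_count_d <= 3'b010;
      8'b0001?000: bin_count_d <= 3'b011;
      8'b00001?00: bin_count_d <= 3'b100;
      8'b000001?0: bin_count_d <= 3'b101;
      8'b0000001?: bin_count_d <= 3'b110;
      8'b?0000001: bin_count_d <= 3'b111;
      default    : bin_count_d <= 3'b000;
    endcase
    // Extra stage: as written it sits after all the logic, and retiming
    // would have to move it backward into the LUT levels.
    bin_count_out <= bin_count_d;
end

endmodule // bubble8_to_bin3_pipe
```

The HDL stays readable this way because the decode table is untouched and the placement of the extra registers is left to the tool.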

As you may have guessed, the above module uses a clock enable "ce", and I have 2 of those running in parallel at the even/odd clock cycles. That does work, but if I can make it properly place the extra flip-flops between the LUT stages that would be even better.

Thinking a bit more about this, strictly speaking this is probably not c-slowing, but just retiming.

As you say, the retiming optimization is very slow. Are there any specific methods of writing Verilog, constraints, attributes, etc. that will help the tool, so that this retiming optimization doesn't take any longer than absolutely necessary?

Now for the part with the clock enables... As you say I can manually pipeline the design to get the desired increase in clock frequency. For the time critical path I have already done that and broken it down to a lot of smaller stages with typically 1 logic level.

As it happens the first stage in the pipeline is 1 logic level, and at full clock speed this still doesn't meet timing. I went over the timing report, and typically the path delay is 60% or more routing. For example I have one path that is 35% logic, 65% routing. When I noticed that, the first thing I did was to add another stage of flip-flops in between, to give it some more freedom to span the gap between the slices, so to speak. But even doing that was not all that great. As far as I understand it, adding the extra flip-flop does give it extra slack, but not as much slack as it would gain from having the freedom to take twice as long.

What I mean is:

Code:

First case:
FF_src --- long route --- [logic] --- FF_dst

Second case:
FF_src --- shorter route A --- FF_intermediate --- shorter route B --- [logic] --- FF_dst

Third case:
FF_src --- long route --- [logic] --- FF_dst
(CE)                                  (CE)

The first case is without any modification. Because the source flip-flop (FF_src) and the destination flip-flop (FF_dst) are so far apart the data path delay is largely routing delay.

So we add an intermediate flip-flop, which essentially cuts the long route in two shorter routes. This does indeed help, but only by so much. Due to the fact that there are only so many locations where it can put that intermediate FF, you end up throwing away some slack when compared to the third case.

The third case is essentially the same as the first case, only now we use a clock enable (CE) on the flip-flops, so that it only clocks in new data at half the normal rate.

Like I said, I did already try adding the intermediate flipflop but that still was giving me some problems. Which is why I decided to use clock enables there.

So unless I am really missing something (entirely possible), I will still need to use clock enables for that part of the design. So I still need to find a good solution for the problems described in the original post.

But you did give me some new methods that I can use for at least one other part (that bubble counter decoder), which will improve things. Thanks!
