Arithmetic in FPGA design

matrixofdynamism

Often we want to carry out arithmetic in an FPGA design, which means it has to be implemented in the RTL. In that case, is it a good idea to simply write C=A+B or C=A-B in the RTL, or should one create an adder of the correct size, instantiate it, and use that instance to carry out the arithmetic? My question also applies to multiplication. The only difference with multiplication is that FPGAs now have hard DSP blocks, so A*B can be synthesized directly. Assume we are talking about standard logic vectors that represent fixed-point arithmetic.

Also, is division only rarely carried out in digital circuits? If we write A+B, A-B or A*B, the synthesis tool (let's say Altera Quartus) can synthesize a circuit to do this. What would synthesis tools do with the division operator? After all, it can lead to fractional results that may not be exactly representable.
 

Addition, subtraction and multiplication can use the "easy" method:
a <= b + c;
a <= b * c;

etc.
The synthesis tools will probably do a much better job creating the circuits than you could do (most of the time).
As a side note, you should NOT be using std_logic_vector for these values. A std_logic_vector is NOT an integer or fixed-point value; it is meant to represent a collection of bits. Use the numeric_std package for unsigned and signed values, and fixed_pkg for ufixed and sfixed values. These packages have functions for arithmetic on these vectors.
The std_logic_unsigned, std_logic_signed and std_logic_arith packages are NOT part of the VHDL standard.
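
To make the "easy" method concrete, here is a minimal sketch using numeric_std signed types. The entity name, port names and the 16-bit default width are made up for illustration; only the operators and resize come from the package.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical entity: names and the default width are illustrative only.
entity arith_example is
  generic (
    WIDTH : positive := 16
  );
  port (
    clk  : in  std_logic;
    a    : in  signed(WIDTH - 1 downto 0);
    b    : in  signed(WIDTH - 1 downto 0);
    sum  : out signed(WIDTH downto 0);         -- one extra bit for growth
    diff : out signed(WIDTH downto 0);
    prod : out signed(2 * WIDTH - 1 downto 0)  -- product is twice as wide
  );
end entity arith_example;

architecture rtl of arith_example is
begin
  process (clk)
  begin
    if rising_edge(clk) then
      -- resize sign-extends both operands, so the result cannot overflow
      sum  <= resize(a, WIDTH + 1) + resize(b, WIDTH + 1);
      diff <= resize(a, WIDTH + 1) - resize(b, WIDTH + 1);
      -- numeric_std "*" returns a'length + b'length bits
      prod <= a * b;
    end if;
  end process;
end architecture rtl;
```

The synthesis tool decides how to map each operator (carry chain, DSP block, etc.); nothing here instantiates a specific adder or multiplier.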

For division, you can use the / operator on Altera devices (Xilinx won't let you use it), but it will not be pipelined and will have poor clock performance.
You are much better off using the divider IP core provided by the manufacturer.
 
OK, I should have said "signed" instead of std_logic_vector.
Anyway, I would really like to know when we would want to build an adder ourselves and then instantiate it. After all, there are so many types of adders, from ripple carry to carry look-ahead and so on. The same applies to multipliers.

Now, with the FPGA part clear: if we are designing an ASIC, I guess we have to do this bit manually?
 


There is no way you're going to create an adder that works faster than the built-in carry chain of an FPGA, no matter if you implement the latest and greatest radix algorithm. Once the tools start using LUTs to implement it, your clock period will increase.

In an ASIC it is a different story, hence all the research papers that try to find better solutions for carry chains, multiplication, division, etc. There, the available customization lets you change the topology and the number and drive strength of any gates in the critical path to improve the clock period.
 
Aha, thank you ads-ee. This means that we need to know the different types of arithmetic circuits and their pros and cons when doing ASIC design. With FPGAs, however, the story is simpler: the synthesis tool takes care of the mess for us. Hmmm.

Some guy once told me that in one design, when he used the DSP blocks of the FPGA, he could not meet timing. So for that specific case he built a simpler adder using LUTs and was able to meet timing. So here he could not simply write A*B. That worked functionally but did not meet timing at the clock frequency he wanted.
 

In the case of DSPs, timing usually fails because you struggle to shorten the paths into and out of the DSP block. The best way to fix it is to add extra registers before and after the DSP. That really helps timing.
I've spent months getting a design to fit in a Stratix IV at 368 MHz, but we did it.
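
As a rough sketch of the "extra registers before/after" idea, the multiplier below registers both operands, the product, and the output. The entity name, signal names and the 18-bit default width are assumptions for illustration. Whether the tool actually packs these registers into the DSP block depends on the device and the synthesis settings, so check the fitter report.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical entity: names and the default width are illustrative only.
entity mult_pipelined is
  generic (
    WIDTH : positive := 18
  );
  port (
    clk : in  std_logic;
    a   : in  signed(WIDTH - 1 downto 0);
    b   : in  signed(WIDTH - 1 downto 0);
    p   : out signed(2 * WIDTH - 1 downto 0)
  );
end entity mult_pipelined;

architecture rtl of mult_pipelined is
  signal a_r, b_r : signed(WIDTH - 1 downto 0)     := (others => '0');
  signal p_r      : signed(2 * WIDTH - 1 downto 0) := (others => '0');
begin
  process (clk)
  begin
    if rising_edge(clk) then
      a_r <= a;          -- stage 1: input registers
      b_r <= b;
      p_r <= a_r * b_r;  -- stage 2: registered product
      p   <= p_r;        -- stage 3: output register
    end if;
  end process;
end architecture rtl;
```

The result appears three clock edges after the operands are presented, so the surrounding logic has to account for that latency.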
 
Hmmm. May I ask why exactly it took months? I guess the design was quite large, so each iteration from making a change to a full design compile took a very long time, which added up to months.
 

Meeting timing meant running several builds overnight and analysing the failing paths: finding the common ones, fixing code or timing constraints appropriately, and then rebuilding again.
During this time the code and functionality were also changing and being added to, hence the extra time required.

And it also had 60% DSP usage, 75% logic and 90% memory usage.
 
TrickyDicky, was the issue that a certain path in the design had a huge delay (combinatorial + routing delay), so you had to work out how to reduce the size of the combinatorial block, which meant the design had to be modified? Was this achieved by adding "pipeline" stages in the combinatorial block? That would mean the design functionality was being modified, so the design had to be reworked to take the extra latency into account. Is this how all timing issues are resolved?

By the way, while this design certainly had a lot of logic resource usage, how often have you run into issues where the routing resources were exhausted instead and thus the design could not fit, even though there were sufficient logic resources?
 

Logic stages between registers were already minimal. The problem was often ensuring related logic was grouped properly to prevent long paths from one register to another. This is a real extreme example, and I hope I never have to do anything like it again.

The problems came simply because so much logic was used at the specified speed. Failures were often by picoseconds, and the failing paths would differ from build to build because of the fitter seed. The job was to see what was common across seeds, fix those, then try again. Often it was like whack-a-mole: fix one path and it lengthened the path on the other side of the register, so sometimes you just needed more registers. Other times it meant analysing the code for false paths and multicycle path constraints.

No functionality was changed as a rule - but if it needed to be (ie. extra pipeline stages), there were plenty of testbenches to ensure the end result was still correct.

Since then, I haven't had such a tight project...
 
Thanks TrickyDicky, I am feeling so happy to read from a person doing this stuff

by grouped properly you mean that the logic blocks that were being used to make up the combinatorial block were required to be in close proximity right?
since the timing analysis is carried out using a highly pessimistic scenario, if timing was failing in ps, shouldn't it be possible to ignore them?

have you had to do a design where the design was constrained to both clock edges i.e positive edge and also negative edge caused data to be latched?

false paths would be: a reset input or any signal that only occurs once at design power-up, clock crossing paths where we have put clock crossing bridges, and any other asynchronous input for which we have put in a register chain. Are there any other false paths?

usually the data is latched at the next positive edge after the current positive edge of clk, which is called the launch edge. If this relationship between launch and latch edges does not intentionally hold in our design, then we must add a multicycle path. Have you had to use this often? When we add a multicycle path constraint and then compile the design, will the fitter try to fit the design so that the delay from launch to latch register is exactly as long as the multicycle constraint allows (even if it could fit the design with less delay)?

wow, this is so awesome !
 

by grouped properly you mean that the logic blocks that were being used to make up the combinatorial block were required to be in close proximity right?

All logic is timed register to register, so none of the code was purely combinatorial. It was all synchronous, but this compiles to register -> comb logic -> register. The proximity thing is making sure an entire entity (or multiple entities) is grouped in the same part of the chip. In Altera tools these are called LogicLock regions.

since the timing analysis is carried out using a highly pessimistic scenario, if timing was failing in ps, shouldn't it be possible to ignore them?

have you had to do a design where the design was constrained to both clock edges i.e positive edge and also negative edge caused data to be latched?

No, the whole design was a single rising_edge clock.


False paths are rare, but are usually for clock domain crossing (but you really should be using max delays for these). Some of them were for impossible logic combinations.
You should NEVER have an async input in a register chain. You risk meta-stable register output.


You don't have to use multicycle paths. You can use them when you have a periodic clock enable (say it is high once in every N clocks).
This way you know that the data won't be latched until N clocks after the current clock edge, hence the multicycle relationship.

So the usual way to specify these would be (setup of N, with the hold moved back by N-1):
set_multicycle_path -from *my_multicycle_regs* -to *my_multicycle_regs* -setup N
set_multicycle_path -from *my_multicycle_regs* -to *my_multicycle_regs* -hold N-1
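
For context, here is a minimal sketch of the kind of RTL such a constraint would apply to. The entity, the generic N and the squaring operation are all assumptions for illustration; the register names were chosen only so they match the wildcard in the constraint above. Both registers update only when the enable is high, so the logic between them has N clock periods to settle.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical entity: names, N and WIDTH are illustrative only.
entity multicycle_example is
  generic (
    N     : positive := 4;   -- enable is high one clock in every N
    WIDTH : positive := 16
  );
  port (
    clk  : in  std_logic;
    din  : in  signed(WIDTH - 1 downto 0);
    dout : out signed(2 * WIDTH - 1 downto 0)
  );
end entity multicycle_example;

architecture rtl of multicycle_example is
  signal count : natural range 0 to N - 1 := 0;
  signal en    : std_logic := '0';
  signal my_multicycle_regs_in  : signed(WIDTH - 1 downto 0)     := (others => '0');
  signal my_multicycle_regs_out : signed(2 * WIDTH - 1 downto 0) := (others => '0');
begin
  process (clk)
  begin
    if rising_edge(clk) then
      -- periodic clock enable: asserted for one clock out of every N
      if count = N - 1 then
        count <= 0;
        en    <= '1';
      else
        count <= count + 1;
        en    <= '0';
      end if;

      -- both registers only change when en is high, so the multiplier
      -- between them has N clock periods to produce its result
      if en = '1' then
        my_multicycle_regs_in  <= din;
        my_multicycle_regs_out <= my_multicycle_regs_in * my_multicycle_regs_in;
      end if;
    end if;
  end process;

  dout <= my_multicycle_regs_out;
end architecture rtl;
```

Note that the path from en to the registers is still a normal single-cycle path; only the register-to-register path is relaxed.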

- - - Updated - - -

All SDC constraints should be used by both the fitter and the timing analyser. In Quartus (at least) you can add Tcl to your SDC file, so you can apply constraints to the fitter only (for example, we had some max delays that overconstrained the clock for the fitter, so that the paths would pass when it came to the actual timing analysis).
 
