[SOLVED] Generation of desired size ANN for FPGA

tomsld · Nov 26, 2013

Hi, I'm a first-year PhD student interested in efficient implementation of ANNs (especially dynamic ANNs) in FPGAs.
I need to clarify a few issues and check for existing solutions so that I don't reinvent the wheel. After a survey I found many open and commercial tools for simulation and implementation of circuits in general. I'm looking for a tool that lets us easily describe an ANN (no matter which kind) with predefined constraints. For example, say we give the algorithm only 6 DSP slices and one LUT (as BRAM) for the activation function. The goal of the algorithm is then to create the desired ANN (e.g. an FFNN, a dynamic net, or one with FIR filters on each synapse, it doesn't matter) with the highest possible classification rate. The main idea is to squeeze the last drop out of the given resources and schedule them to achieve maximum performance for the desired net structure. Or vice versa: we choose a desired classification speed and the algorithm then shows the amount of resources that will be utilized. Xilinx has an FIR IP that works the same way. But with ANNs it is much more difficult.

So, my questions are:
1) Is there software that can generate a desired ANN as VHDL files?
2) What do you think about creating such a program? Is it worthwhile, or should I concentrate on another direction?

As input, the algorithm reads lines such as:
y1 =f(f(x1*w11+x2*w12+x3*w13+...+b1)*w31+f(x1*w21+x2*w22+x3*w23+...+b2)*w32+ ...)
y2 =f(f(x1*w11+x2*w12+x3*w13+...+b1)*w31+f(x1*w21+x2*w22+x3*w23+...+b2)*w32+ ...)
...
yn =f(f(x1*w11+x2*w12+x3*w13+...+b1)*w31+f(x1*w21+x2*w22+x3*w23+...+b2)*w32+ ...)

Y1 = f(y1*w18 + y2*w19 + ... + yn*wnn)

From a general point of view, it must be a process that schedules the arithmetic operations with minimum idle time.
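
The scheduling idea can be sketched as a greedy list scheduler. This is only an illustration under strong assumptions: every operation takes exactly one clock cycle, the dependency graph is acyclic, and there is at least one unit of each kind. The function `schedule`, the operation names and the dictionary format are hypothetical and do not come from any existing tool.

```python
from collections import defaultdict

def schedule(ops, limits):
    """Greedy list scheduling of single-cycle operations.

    ops:    {name: (kind, [names of ops it depends on])}
    limits: {kind: number of units of that kind usable per cycle}
    Returns {name: cycle in which the op is issued}.
    """
    done = {}                    # op name -> cycle it was issued in
    cycle = 0
    while len(done) < len(ops):
        used = defaultdict(int)  # units of each kind busy in this cycle
        for name, (kind, deps) in ops.items():
            if name in done:
                continue
            # Ready only if every input was produced in an earlier cycle.
            ready = all(d in done and done[d] < cycle for d in deps)
            if ready and used[kind] < limits[kind]:
                done[name] = cycle
                used[kind] += 1
        cycle += 1
    return done

# One neuron, f(x1*w1 + x2*w2 + b), written as a small dependency graph.
ops = {
    "m1":  ("mul", []),
    "m2":  ("mul", []),
    "s1":  ("add", ["m1", "m2"]),
    "s2":  ("add", ["s1"]),       # adds the bias b
    "act": ("lut", ["s2"]),       # activation look-up
}
one_mul = schedule(ops, {"mul": 1, "add": 1, "lut": 1})
two_mul = schedule(ops, {"mul": 2, "add": 1, "lut": 1})
print(max(one_mul.values()) + 1)  # 5 cycles with a single multiplier
print(max(two_mul.values()) + 1)  # 4 cycles with two multipliers
```

A real tool would also have to model multi-cycle DSP latency, memory ports and routing, which this sketch ignores.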

Many thanks.

mrflibble · Nov 28, 2013

Are you trying to do unsupervised learning?

If you're asking if there is any easy software that spits out NN's, then you are probably out of luck. Best you can hope for (I think) is you provide it the math framework (aka we use this matrix), and then the software spits out the matrix values. Then it's up to you to cobble the matrix operations together using coregen etc.

Anyways, if you can be more specific about your problem that would be helpful.

tomsld · Nov 28, 2013

Thanks. Unsupervised learning - maybe in the future.
Let's say that the weights are known. We give the final equations to the algorithm and it tries to schedule the operations for maximum performance within the constrained resources.
In the literature I found two types of ANN implementation:
1) Hardwired - one DSP slice for each neuron or each weight. So the number of neurons in the net (the size of the net) is limited by the physical number of DSPs.
2) Virtual - a few physical neurons multiplexed in time to create a larger net. But let's say that the neurons in the hidden layer have far fewer weights than those in other layers. Or the structure of the net is more complicated: parts with Lattice–Ladder filters, feedback from other layers. Then a few physical neurons are unsuitable.
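
The trade-off between the two styles can be estimated with a rough cycle count. This is a minimal sketch, assuming each physical MAC performs one multiply-accumulate per clock and the work splits evenly across units; activation and memory latency are ignored. The function name `layer_cycles` and the layer sizes are made up for illustration.

```python
import math

def layer_cycles(n_neurons, n_inputs, n_macs):
    """Clock cycles needed to evaluate one fully connected layer.

    Assumes one multiply-accumulate per physical MAC per cycle and an
    even split of the work; activation and memory latency are ignored.
    """
    total_macs = n_neurons * n_inputs
    return math.ceil(total_macs / n_macs)

# Hypothetical layer: 8 neurons with 16 inputs each.
hardwired = layer_cycles(8, 16, n_macs=8 * 16)  # one DSP per weight
virtual = layer_cycles(8, 16, n_macs=4)         # 4 DSPs shared in time
print(hardwired, virtual)  # 1 32
```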

I want to find or create a program that generates final VHDL code of such a net.

mrflibble · Nov 28, 2013

It (hardwired vs virtual) will depend on the neuron count + interconnect density you're looking for. But there is nothing stopping you from using both. For example hardwired for some hidden layers + output, and then virtual for your inputs, say when you have a large number of pixels as input and a low number of neurons on the output side after you have classified your objects.

But to answer your question, fat chance of a tool that spits out HDL. Hell, you will be lucky enough if you find matlab/C++/python code for this that has a Do What I Want button.

tomsld · Nov 28, 2013

Thanks. My supervisor says that it would be good, and that I could prove the efficiency of the net's structure (resource utilization). But I think that any matlab/C++/python code loses its optimality when it is converted to the hardware description level. So from the matlab level I can't control, or be sure of, the real structure of the net. I can't tell the compiler that it should use only, e.g., 10 DSPs and 2 activation functions for the whole net.

TrickyDicky · Nov 28, 2013

Matlab natively works with floating point, and doubles by default.
You can keep all this precision in hardware if you want, if you don't mind using tons and tons of resources.

This is where compromises have to be sought. FPGAs are designed to do integer arithmetic, which means fixed point. Obviously, with fixed point you lose the range of float, and to some degree the precision. So you have to decide how much loss is acceptable to you on a hardware design.
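
The precision loss can be checked before any HDL is written. The following is a minimal sketch of signed fixed-point quantization with saturation; the helper names `to_fixed` and `to_float`, the Q4.12 format and the example weight are assumptions chosen for illustration.

```python
def to_fixed(x, frac_bits, total_bits=16):
    """Quantize x to a signed fixed-point integer code, with saturation."""
    scale = 1 << frac_bits
    lo = -(1 << (total_bits - 1))
    hi = (1 << (total_bits - 1)) - 1
    return max(lo, min(hi, round(x * scale)))

def to_float(code, frac_bits):
    """Convert a fixed-point integer code back to a float."""
    return code / (1 << frac_bits)

w = 0.7071                        # example weight
code = to_fixed(w, frac_bits=12)  # Q4.12 in a 16-bit word
err = abs(to_float(code, 12) - w)
print(code)                       # 2896
print(err <= 2 ** -13)            # True: within half an LSB
print(to_fixed(100.0, 12))        # 32767: out of range, so it saturates
```

Fewer fractional bits widen the representable range and coarsen the step, which is the compromise described above.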

mrflibble · Nov 28, 2013

Who said anything about direct translation from matlab/C++/python to HDL. You use the higher level stuff to figure out your design. And then you code up the HDL. Going straight up to HDL with only a vague idea of your NN or "have the tools do it for you" will be rather expensive in terms of time. I am all for "the tools do it for you" but we are not quite there yet. And for floating point ... you had better cast your NN into a fixed point form, or else it will cost far too many resources. This will limit some capability, but it's not all bad. With 16-bit weights you can get a lot done. Hell, if you optimize for LUT6, even 6-bit weights have their use. At least that maps pretty well to hardware.

I still have to figure out a clever way to do time multiplexed neuron connections. As in you have a sparse connection matrix, how to map that onto the available resources on fpga such as LUTs + shift registers + BRAMS. A good candidate for cleverness here is shift registers, but everything I come up with is rather ... meh. Or maybe just set the bar lower like all those boring "research" papers in search of the next grant.

tomsld · Nov 29, 2013

Boring papers for grant.

Precision is another question. You mean that the high-level stuff only describes the design in some human-understandable way. Then I have to code the net or the equations (e.g. in VHDL, manually) in the way the higher-level stuff suggests?
I still believe that it is possible to create many little processors, one for each DSP used, and write a scheduler of commands (load, mul, add, look-up, save, ...) for each of them with maximum load. Why do I want to? The FPGA chip will never have enough DSPs for a hardwired net, and huge speed is not always needed. Time multiplexing of one neuron is effective only if the whole structure contains the same kind of neurons. But when I want to automatically implement any equation that has N multiplications, M summations, K activations (look-ups) and J delays, with constrained resources for them, then, I think, such a program is required.
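
For such an equation, the resource counts alone already give a lower bound on the schedule length and an upper bound on unit load. This is a sketch under the assumption of one operation per unit per cycle; data dependencies are ignored, and delays are left out because they are registers rather than arithmetic units. The function `min_cycles` and all the numbers are hypothetical.

```python
import math

def min_cycles(op_counts, units):
    """Lower bound on schedule length from resource counts alone.

    op_counts: {kind: number of operations in the equation}
    units:     {kind: number of hardware units of that kind}
    Ignores data dependencies, so a real schedule can only be longer.
    """
    return max(math.ceil(n / units[kind]) for kind, n in op_counts.items())

# Hypothetical equation: N = 40 multiplications, M = 36 additions, K = 5 look-ups.
counts = {"mul": 40, "add": 36, "lut": 5}
units = {"mul": 6, "add": 6, "lut": 1}
cycles = min_cycles(counts, units)
mul_load = counts["mul"] / (units["mul"] * cycles)  # busy fraction of the multipliers
print(cycles)              # 7
print(round(mul_load, 2))  # 0.95
```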

mrflibble · Nov 29, 2013

With higher level I also meant matlab etc, well before you start coding HDL. Testing if something is a viable approach is so much cheaper in matlab + a bit of scripting or even some C++ if you need some speed. Once you know what form your NN will be, then you start HDL work. At least, that is how I would do it.

tomsld said:
The time multiplexing of one neuron is effective only if the whole structure contains the same kind of neurons.

Why do you think that?

tomsld said:
But when I want to automatically implement any equation which has N multiplications, M summations, K activations (look-ups), J delays, with constrained resources for them, then, I think, such a program is required.

Maybe, but performance tends to be shit. I do agree with you that such a tool "should be there", but if you find something that has even half-decent performance, pleeeeeease send me a link.

TrickyDicky · Nov 29, 2013

Tools will come out of commercial need. So I guess no companies have needed an NN compiler. They tend to build their own cut-down, highly optimised algorithm to fit their requirements, or use off-the-shelf components.

tomsld · Nov 29, 2013

mrflibble said:
Why do you think that?

Thanks. I understand now. The MAC really doesn't care about the incoming data; we can write our own processes to deal with any kind of neuron.

mrflibble · Nov 29, 2013

0_o

Having problems with the edit button?

At any rate, it was a serious question...

tomsld · Nov 29, 2013

Now I am more worried about the optimal final structure of the net. How can it be proved that a net described at the VHDL level is optimal for the selected FPGA chip? By DSP load, as busy/idle time? By LUTs/slices? It would be perfect if the net could be described at the lowest level possible. I have heard about JBits, which can edit the final compiled *.bit file, but only for the old XC4000. There is also the open-source Verilog-to-Routing (VTR) flow, which I have not tried, but I think it can't generate the final file.

mrflibble · Nov 29, 2013

Instead of worrying about optimal structures of the final net, you can approach the problem the other way around.

Hand craft several classes of neurons, and put those into the pool of capabilities. So you purposefully do NOT try to do everything. You try to do a small subset of things and do those well.

Making sure you can make the most out of an fpga's routing resources is still going to be an interesting problem. But you had that problem anyways.

You can even code the equivalent neuron in your favorite prototyping environment (before you expend effort into routing), and simulate a bunch of nets to see if those neurons you cobbled together are any good.

Mmmh, that gives me an idea. Thank you mr rubber ducky sir!

tomsld · Nov 30, 2013

Thank you. I will have it in mind.

tomsld · May 30, 2014

Hi again,
I have chosen to work with ANNs that have filters on the inputs instead of weights. The filters are called latticearma (in Matlab) or lattice moving-average filters (in general). The Schur recursion is applied, so the k coefficients in the lattice part are replaced with combinations of sin(theta) and cos(theta), as shown in the figure.

I have implemented this filter and its delta-theta updating circuit (not shown in the sketch) in VHDL using only signals in a process. If I put data on the x input, y is ready in M*2+2 clocks (M is the filter order).
I then tried to implement all the calculations in a single process using variables only. And I am surprised that, independent of the order M, y is computed in one clock. I must be misunderstanding something: how can the chain *+*+*+ ... be synthesized to calculate y in one clock? I checked it in simulation and on a Zynq. The results are the same. From the PS side I give the x, theta, v values to the PL and enable the process for one clock only. Then I check y, and it is correct.
So I'm confused, and I can't believe that the chain (DSP, SUM, DSP, SUB, ...) can give a result in one clock.

TrickyDicky · May 30, 2014

A long arithmetic chain can easily be completed in a single clock cycle. The longer the chain, the slower the maximum frequency in a real-world system.

So, the question here is - what is your code, and what is your clock speed?

ads-ee · May 31, 2014

tomsld said:
I tried to implement all the calculations in a single process using variables only. And I am surprised that, independent of the order M, y is computed in one clock. I must be misunderstanding something: how can the chain *+*+*+ ... be synthesized to calculate y in one clock? I checked it in simulation and on a Zynq. The results are the same. From the PS side I give the x, theta, v values to the PL and enable the process for one clock only. Then I check y, and it is correct.
So I'm confused, and I can't believe that the chain (DSP, SUM, DSP, SUB, ...) can give a result in one clock.

TrickyDicky said:
A long arithmetic chain can easily be completed in a single clock cycle. The longer the chain, the slower the maximum frequency in a real-world system.

So, the question here is - what is your code, and what is your clock speed?

If tomsld actually used VHDL variables then there is probably code like this:

Code:

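-- Assumes ieee.numeric_std is in use; in_a, in_b (signed 3 downto 0) and
-- in_c (signed 7 downto 0) are signals declared elsewhere.
process (clk)
  variable a, b : signed(3 downto 0);
  variable c, d : signed(7 downto 0);
  variable e    : signed(8 downto 0);
begin
  if rising_edge(clk) then
    a := in_a;
    b := in_b;
    c := in_c;
    d := a * b;                        -- 4 x 4 bits gives an 8-bit product
    e := resize(c, 9) + resize(d, 9);  -- widen first so the sum fits 9 bits
    out_e <= e;  -- out_e: assumed signed(8 downto 0) output signal
  end if;
end process;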

which isn't pipelined like they want but does everything in 1 clock cycle.
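
The difference in timing can be mimicked with a small software model. This is only a sketch of the update semantics, not of the lattice filter: each stage is an arbitrary multiply-add (`acc * k + 1`), and the functions `run_variables` and `run_signals` are hypothetical names. Variables take their new value immediately, so the whole chain resolves within one clock; signals assigned in a clocked process hold their old value until the edge, so each stage adds one clock of latency.

```python
def run_variables(x, coeffs):
    """All stages evaluated within one clock, like VHDL variables."""
    acc = x
    for k in coeffs:
        acc = acc * k + 1  # each stage sees the value just computed
    return acc             # valid after a single clock edge

def run_signals(x, coeffs, clocks):
    """One register per stage, like signals assigned in a clocked process."""
    regs = [0] * len(coeffs)
    for _ in range(clocks):
        prev = [x] + regs[:-1]  # values held before the clock edge
        regs = [p * k + 1 for p, k in zip(prev, coeffs)]
    return regs[-1]

coeffs = [2, 3, 4]
print(run_variables(5, coeffs))   # 137
print(run_signals(5, coeffs, 1))  # 1, the pipeline is not yet filled
print(run_signals(5, coeffs, 3))  # 137, after one clock per stage
```

In hardware the single-clock version pays for this with a long combinational path, which is what lowers the maximum clock frequency.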

tomsld · May 31, 2014

Now everything is clear. I pay the price in maximum clock speed if variables are used. The graph is attached.

If variables are used, the DSP limit is reached at filter order M > 10 (because the signals are 32 bits wide and each MUL uses 4 DSPs).
If signals are used, two MUL operators in the lattice part and one MUL in the ladder chain are shared across the whole filter. Therefore 3*4 = 12 DSPs are used in total, independent of the filter order, at maximum load.
It is interesting to ask which is the better choice, e.g. for M=3:
- implement the filter with variables, with Fmax = 22.7MHz, using 64 DSPs and getting x[n]->y[n] in one clock cycle,
- or use signals, with Fmax = 98.4MHz, using 12 DSPs with a latency of 3+2M=9 clocks?
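
One way to compare the two options is throughput per DSP. The sketch below uses the figures quoted above and assumes, which the post does not state, that the shared-multiplier version accepts a new sample only once every 9 clocks. The function `throughput_msps` is a hypothetical helper.

```python
def throughput_msps(fmax_mhz, clocks_per_sample):
    """Million samples per second if one sample is accepted every clocks_per_sample clocks."""
    return fmax_mhz / clocks_per_sample

# Figures quoted above for M = 3.
var_rate = throughput_msps(22.7, 1)  # variables: one sample per clock
sig_rate = throughput_msps(98.4, 9)  # signals: assumed one sample per 9 clocks
print(round(var_rate, 2), round(sig_rate, 2))            # 22.7 10.93
print(round(var_rate / 64, 3), round(sig_rate / 12, 3))  # 0.355 0.911
```

Under that assumption the variables version has about twice the raw throughput, while the signals version delivers roughly 2.5 times more throughput per DSP. If the signals version were pipelined to accept one sample per clock, it would win on both counts.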

TrickyDicky · May 31, 2014

Latency and signals are always the winner, as you can get a lot more data through the system.
Looking at your graph, even 100MHz seems rather poor. I would expect at least 200MHz with a reasonable design. Depending on the chip, pushing it to 300MHz would be possible with some serious scrutiny.
