Efficiently comparing 64bit values

dpaul · Oct 6, 2016

Need ideas...

I have multi-input channels (lets assume 8 channels for now) each carrying 256 bits of AXI-Stream data. Now each input (the tdata bus) has a 64 bit unique value given by the position, tdata_i(95 downto 32).
My requirement is to compare each of the 64 bits values from all the 8 input data streams, determine which input data has the least 64 bit value and then save that data input into a FIFO first. Then the tdata with the next bigger 64 bit value will go into the second location of the FIFO and so on.
I will have corresponding tvalid_i and tlast_i signals for all the 8 channels.
I have to keep in mind that data might be continuously coming in from one or all of the channels or it might be completely random.

Target device is xc7a200t and clock is 125MHz.

How to do the comparison in the most efficient way?

Regards.

K-J · Oct 6, 2016

To compare numbers you simply use the less than operator (i.e. if a < b then...). If that approach meets your timing requirements, then you're done. Don't make it harder than it needs to be. If performance is not met then break the comparison up into a comparison of two 32 bit numbers where the result of the 32 bit compares get stored in a flip flop and on the next clock cycle you perform the rest of the comparison. If that doesn't work, break it into four 8 bit compares that get stored in the flip flop.

Kevin Jennings

dpaul · Oct 6, 2016

Don't make it harder than it needs to be.

I appreciate your 'KISS' approach. Will try it as a first go, still not into the implementation stage.

ads-ee · Oct 6, 2016

Basically what you want is a sort algorithm. From the research I did many years ago this isn't something that is quick and efficient in hardware nor is it quick and efficient in software. Given you want to do this in real time with incoming data likely means a large pipelined tree structure with compares between each stage to "bubble sort" between each stage. First guess is probably 8 pipeline stages with compares between each adjacent pair of values with each stage alternating the adjacent pairs, i.e. abcdefgh: 1st ab cd ef gh, 2nd bc de fg, 3rd ab cd ef gh, ...

TrickyDicky · Oct 6, 2016

ads-ee said:
Basically what you want is a sort algorithm. From the research I did many years ago this isn't something that is quick and efficient in hardware nor is it quick and efficient in software. Given you want to do this in real time with incoming data likely means a large pipelined tree structure with compares between each stage to "bubble sort" between each stage. First guess is probably 8 pipeline stages with compares between each adjacent pair of values with each stage alternating the adjacent pairs, i.e. abcdefgh: 1st ab cd ef gh, 2nd bc de fg, 3rd ab cd ef gh, ...

This is a "merge sort", not a bubble sort. Merge sorts are very fast on FPGAs, but take quite a few resources, but allow for a nice pipelined approach.

ads-ee · Oct 6, 2016

Yes it takes 8 stages to do this...

- - - Updated - - -

TrickyDicky said:
This is a "merge sort", not a bubble sort. Merge sorts are very fast on FPGAs, but take quite a few resources, but allow for a nice pipelined approach.

Okay thanks for pointing that out, as I said it's been a long long time since I last looked at this stuff. You are right it is a merge sort not a bubble sort and as I show it only takes 8 stages (very expensive but doable).

I would be particularly careful with how you deal with streams that are inactive or run out of data. That is something that needs clarification. Perhaps an extra bit MSB that is enabled when there is valid data so it will always sort to one end or the other and you only write the valid words into your sorted word FIFO.

vGoodtimes · Oct 7, 2016

First, the problem can reduce to a sort. However the problem here is find_min. There is no reason to find the 2nd lowest, 3rd lowest, etc...

The entire circuit has the comparators and the selection. 125MHz is probably a bit high. There are a few options:
1.) the 7 compare + 7 2:1 muxes. tvalid can be used either as part of the 2:1 muxes or as a 65th bit.
2.) 7 compares and an 8:1 mux.
3.) 4 + 6 compares and 4 2:1 muxes and a 4:1 mux.
4.) a few others.

The long path in the design is that the results of a compare are used to select the arguments for another compare. However, 6 compares can be done on 4 inputs and the 6b result used to generate the select bits.

The compares for the first methods are:
(a,b), (c,d), (e,f), (g,h),
(min(a,b),min(c,d)), (min(e,f), min(g,h)),
(min(a,b,e,f), min(c,d,g,h)).

for the 6 compare method, they are (a,b), (a,c), (a,d), (b,c), (b,d), (c,d).
if A is lowest, then 000???
if B is lowest, then 1??00?
if C is lowest, then ?1?1?0
if D is lowest, then ??1?11
where 0 means the left side is lower and 1 means the left side is greater or equal. In anycase, it comes down to 2 LUT6's.

dpaul · Oct 7, 2016

Thank you for all the responses.
If I implement a pipeline approach for comparison, then I must allow for adequate data buffering in each channel and this would be somewhat costly....to be kept in mind!
The resources for just data buffering could be - 256bits x no. of pipeline-stages x no. of channels

ads-ee · Oct 7, 2016

dpaul said:
Thank you for all the responses.
If I implement a pipeline approach for comparison, then I must allow for adequate data buffering in each channel and this would be somewhat costly....to be kept in mind!
The resources for just data buffering could be - 256bits x no. of pipeline-stages x no. of channels

Based on the below quote from your original set of requirements...

My requirement is to compare each of the 64 bits values from all the 8 input data streams, determine which input data has the least 64 bit value and then save that data input into a FIFO first. Then the tdata with the next bigger 64 bit value will go into the second location of the FIFO and so on.

vGoodtimes neglected to account for the stuff in red (i.e. they want the sorted values not the minimum value)

If the sort is done using a pipeline then the input buffers don't have to be larger unless there is a feedback mechanism required that uses the sorted field to do something with the incoming channel data, which wasn't specified as part of the requirements.

Also if you are stuffing all these values into a FIFO then you will be restricted by the time sliced time of writing each of the sorted values into the FIFO one at a time. Is that what you are referring to as the reason for data buffering? If you need to do this FIFO write operation without the 8 TDMA operation then you'll need an 8-port RAM, which you could implement with distributed RAM to save block RAMs for your input buffers.

vGoodtimes · Oct 7, 2016

ah, ok. i did misunderstand the problem. I'd go with something like the bitonic sorting network or even-odd merge sort network.

I would do this only for the index order+valids and do the data muxing at the end. I'm guessing you want the data on the next cycle to be written to the next addresses of the fifo with no padding. Either way, doing the muxing at the end makes it easy to do either.

this could be 8 cycles of latency, that sounds like a reasonable estimate.

if the data actually is 256b, then BRAM get expensive. 64b writes mean 4 BRAM per input or 32 BRAM.

If the data is written over 8 cycles, then the find_min becomes a viable circuit or it makes sense to have a non-pipelined sort that also takes 8 cycles.

Welcome to EDAboard.com

Efficiently comparing 64bit values

dpaul

Advanced Member level 5

K-J

Advanced Member level 2

dpaul

dpaul

Advanced Member level 5

ads-ee

Super Moderator

TrickyDicky

Advanced Member level 7

dpaul

ads-ee

Super Moderator

dpaul

vGoodtimes

Advanced Member level 4

dpaul

dpaul

Advanced Member level 5

ads-ee

Super Moderator

vGoodtimes

Advanced Member level 4

Similar threads

Part and Inventory Search

Welcome to EDABoard.com

Sponsor

Connect with us

Online statistics

Forum statistics