Offloading for memory throughput?

Status
Not open for further replies.

Saltwater

Hi,

I'm preparing to route and have tried several prototype memory routings for my PCB design, but the throughput does not seem all that great in the maximum-latency scenario. Worst case I can fetch a couple of samples at high audio rates, where I was hoping to be in or close to the hundreds. And some specs seem phenomenal on DDR3 controller IPs. I'm thinking there's no way it would be time well spent to build the controller myself.

But it made me wonder about a couple of things..
I have been looking at PHY chips, but can't seem to find any. Are there any?
Also, the Altera memory PHY is probably what it's going to be. But could there be scenarios where adding another PLD+IP would improve the overall speed of the design? Like having a crunchy PLD offloading the FPGA?
 

This question is very vague. You start off with a title about offloading (memory access?) to improve throughput, then change to something about routing a PCB and finding the latency is causing low data rates. Then you start in about DDR3 controllers and how great the specs are? The whole post is disjointed and does not make any sense.

Get to the basics...What are you trying to build and what is the problem? You haven't clarified either.

The only questions you actually wrote are:
1) You want a PHY chip (what kind of PHY chip are you looking for: Ethernet, USB, flash, SDRAM?)
2) Can you use a crunchy PLD? What the heck is a crunchy PLD? Is it one that gets stepped on and goes CRUNCH!? And how is a PLD going to be better than using a fast Altera part? Have you looked at the Tpd/Tco of a 22V10 or the Tpd/Tco of even a CPLD? And there's no way you're going to be able to compensate for clock insertion delays, as there are no PLDs/CPLDs with PLLs in them.

I think you'll be better off telling us what you are trying to do and then asking how you should try implementing it to meet your requirements. As it is, based on your post, nobody but you knows what you are trying to do, how much memory bandwidth you need, and why DDR3 can't keep up with ridiculously low-bandwidth audio data (compared to video, which typically uses memory like DDR3).

Perhaps you are developing something that not even bats (200 kHz+) and dolphins (150 kHz+) can hear. More likely, the architecture of your design isn't suitable for the application.
 

It's these PHYs like the Synopsys DDR3 PHY, which I suspect may be IP. But you can't really know for sure because the site is holding the actual specs behind a login form.
But they promise a large throughput. So I'm thinking there might be a pretty nifty controller behind that.

I made the footprints for 1333/9-9-9 DDR3 SDRAM BGAs. I wondered if there are solutions offloading "the" FPGA and getting a solid ±100 samples at 96K?
That, and are they fast compared to using a differential PLL with something like Altera's PHY?
 


Synopsys is not a chip manufacturer. They do not have physical chips; they have soft IP. And the page for the PHY says it's IP.


Having high throughput in DDR3 isn't related to being a nifty controller or not. It's based on having enough bandwidth on the user side to stuff data across in large enough bursts to reduce the overhead to a minimum. If you are set up to use 1333 memory then you'll get whatever throughput you get with any controller (though you may have latency variations between controllers).

What do you mean by "differential PLL with something like Altera's PHY"? I'm confused as to what you are trying to convey here.

What do you mean by "±100 samples at 96K"? How can you have positive 100 samples or negative 100 samples? Are you saying you get plus or minus 100 samples at 96 kHz? How is that possible, and how do you get samples taken away (negative)? Besides, 96K (bits or bytes?) isn't high-bandwidth data by any stretch and can easily fit in the 1333.3 MBps (assuming 8-bit wide DDR3 parts). If you are using 16-bit parts or have a 64-bit SODIMM then you'll have significantly more bandwidth. Regardless of the part, 1333.3 MBps is >> 96 KBps.
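As a rough sketch of where those figures come from (peak burst numbers only, ignoring command and refresh overhead; the function name is made up for illustration):

```python
def peak_bandwidth_mbytes(data_rate_mtps: float, bus_width_bits: int) -> float:
    """Peak burst bandwidth in MB/s: transfers per second times bytes per transfer."""
    return data_rate_mtps * bus_width_bits / 8


# DDR3-1333: two transfers per clock on a 666.6 MHz clock.
DATA_RATE_MTPS = 1333.3

for width in (8, 16, 64):  # x8 part, x16 part, 64-bit SODIMM
    mbytes = peak_bandwidth_mbytes(DATA_RATE_MTPS, width)
    print(f"{width:2d}-bit bus: {mbytes:8.1f} MB/s peak")
```

Sustained throughput will be lower than these peaks, but the ordering between bus widths stays the same.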

Normal terminology for data bandwidth descriptions is usually something like 1000 Mbps over a 3.125 Gbps serial link using 8b/10b (i.e. 2.5 Gbps available bandwidth). In this example you have 2.5x the bandwidth required to transfer the data.
Or saying a PCI33 bus has 32 bits at 33.3 MHz, or a little over 1 Gbps bandwidth (133 MBps), but realistically you'll get more like 100 MBps throughput after accounting for the overhead.
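The arithmetic behind those two examples can be written out as below. This is only a sketch; the helper names are hypothetical, and the only overhead modelled is the 8b/10b line coding.

```python
def payload_rate_8b10b(line_rate_gbps: float) -> float:
    """8b/10b sends 10 line bits for every 8 data bits, so 80% of the line rate is payload."""
    return line_rate_gbps * 8 / 10


def parallel_bus_mbytes(width_bits: int, clock_mhz: float) -> float:
    """Peak MB/s of a parallel bus moving one word per clock."""
    return width_bits * clock_mhz / 8


serial_payload_gbps = payload_rate_8b10b(3.125)   # available bandwidth on the link
headroom = serial_payload_gbps / 1.0              # versus a 1000 Mbps data stream
pci_peak_mbytes = parallel_bus_mbytes(32, 33.3)   # PCI33 before protocol overhead

print(f"serial link payload: {serial_payload_gbps:.2f} Gbps, headroom {headroom:.1f}x")
print(f"PCI33 peak: {pci_peak_mbytes:.1f} MB/s")
```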

So what is your requirement? How many Mbps are you transferring? Is it multiple channels of data or a single data stream? Can you burst? Do you need to access data in a non-sequential manner (then you shouldn't have used DDR3)? What are the requirements?
 

What do you mean by "±100 samples at 96K"?

To be more exact, about 100 or more times 3 bytes at 96 kHz.
So.. 28.8 MB/s, or about 230 Mbit/s. (Or as much frequency as I can get with 500 MB to 2 GB of depth.)
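Working the stated requirement through (100 samples of 3 bytes per 96 kHz period) gives roughly the following. The constant names are just for illustration, and the comparison is against the peak x8 DDR3-1333 burst rate quoted earlier in the thread.

```python
SAMPLE_RATE_HZ = 96_000
BYTES_PER_SAMPLE = 3        # 24-bit audio samples
SAMPLES_PER_PERIOD = 100    # fetches wanted per 96 kHz sample period


def required_bytes_per_second(samples: int, bytes_per_sample: int, rate_hz: int) -> int:
    return samples * bytes_per_sample * rate_hz


need = required_bytes_per_second(SAMPLES_PER_PERIOD, BYTES_PER_SAMPLE, SAMPLE_RATE_HZ)
need_mbytes = need / 1e6
need_mbits = need * 8 / 1e6

PEAK_X8_MBYTES = 1333.3     # peak burst rate of one x8 DDR3-1333 part
print(f"required: {need_mbytes:.1f} MB/s ({need_mbits:.1f} Mbit/s)")
print(f"fraction of x8 peak: {need_mbytes / PEAK_X8_MBYTES:.1%}")
```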

I can route the 24-bit connection on these, but I'm not hard pressed on using two DDR3 ICs. About the locality of the data: it's non-local, 2D arrays mostly.

What do you mean by "differential PLL with something like Altera's PHY"? I'm confused as to what you are trying to convey here.

I can use two pins from the FPGA's PLL directly. In conjunction with Altera's IP.
 

1333.3 MBps

I don't get this. It's in the official spec too, but wouldn't that be 1333/9/9/9 = 1.82 MBps in the worst-case scenario?
(Or not even the worst case scenario)
 

Bursts are at 1333.3 MBps (if the part is x8), if you account for overhead it's slightly less, but still way more than the 1.82 MBps you mention.

When they describe memory as DDR3 1333 9-9-9 the 9-9-9 numbers are for the latency not for dividing the data frequency.
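To make that concrete, here is a small sketch that converts the 9-9-9 cycle counts into time at the DDR3-1333 clock. It assumes the usual CL-tRCD-tRP ordering of the three numbers; the helper name is hypothetical.

```python
CLOCK_MHZ = 2000 / 3   # DDR3-1333 runs its clock at about 666.67 MHz


def cycles_to_ns(cycles: int, clock_mhz: float) -> float:
    """Convert a latency given in clock cycles to nanoseconds."""
    return cycles * 1000.0 / clock_mhz


# Assumed ordering of the 9-9-9 figures: CL, tRCD, tRP.
timings = {"CL": 9, "tRCD": 9, "tRP": 9}
for name, cycles in timings.items():
    print(f"{name}: {cycles} clocks = {cycles_to_ns(cycles, CLOCK_MHZ):.1f} ns")
```

Each figure is a delay of about 13.5 ns that applies at a particular point in an access. None of them scales down the 1333 transfer rate.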

You should probably read the following two basic tutorials on DDR RAM.
**broken link removed**
**broken link removed**
 


Times 3 indeed, But.. Best case I don't have the RTC delay giving me 9/9.
No way I was going to throughput the same sample. the 1333 It's that number, "right"?

In which case I may have to reconsider my modules, or start at the top again.
 

Times 3 indeed, But.. Best case I don't have the RTC delay giving me 9/9.
No way I was going to throughput the same sample. the 1333 It's that number, "right"?

Huh?

I thought I was pretty clear with my previous post.

1333.3 MBps is the data bandwidth, i.e. data, when transferred, moves at 1333.3 MBps for a x8 (8 DQ pins) part. The device uses a DDR (Double Data Rate) interface clocked at 666.6 MHz. So what does this RTC delay giving you 9/9 mean? I have no clue what you are trying to say.

And what do the samples have to do with the 1333? You can output anything you want on the DQ pins at 1333. The same sample or all different ones. Each data word will be written to or read from a different memory location.

My issue with post #6 was that you seem to think you take the data bandwidth of 1333 MBps and divide it by 9^3, i.e. ((1333/9)/9)/9 = 1.82, which is entirely wrong. Or was that meant to convey that you are writing/reading 3D arrays (i.e. cubes) that are 9x9x9 and you have a 1.82 array/s interface?
 

No, simply dividing the clock rate by the timing. Why is that wrong?
 

No, simply dividing the clock rate by the timing. Why is that wrong?

If you are using the DDR3 with single cycle accesses with one data word per access and throwing away the burst then yes you will end up with lousy performance, because you are using the memory in a way that it was not designed for.
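A back-of-the-envelope model of that worst case might look like the following. It is a deliberately simplified sketch: it assumes every read hits a closed row in the same bank, adds the three 9-cycle timings once each plus one burst of 8, and ignores tRAS/tRC limits, refresh, and bank interleaving, all of which change the real figure.

```python
CLOCK_MHZ = 2000 / 3      # DDR3-1333 command clock, about 666.67 MHz
CL = T_RCD = T_RP = 9     # the 9-9-9 timings, in clock cycles
BURST_CLOCKS = 4          # a burst of 8 takes 4 clocks at two transfers per clock


def row_miss_access_clocks() -> int:
    """Clocks for one isolated read to a closed row: precharge, activate, read, burst."""
    return T_RP + T_RCD + CL + BURST_CLOCKS


def accesses_per_second(clocks_per_access: int) -> float:
    return CLOCK_MHZ * 1e6 / clocks_per_access


rate = accesses_per_second(row_miss_access_clocks())
print(f"{row_miss_access_clocks()} clocks per access, {rate / 1e6:.1f} M accesses/s")
print(f"x8 part, 1 useful byte per burst: {rate / 1e6:.1f} MB/s")
print(f"x8 part, all 8 bytes used:        {rate * 8 / 1e6:.1f} MB/s")
```

Even in this pessimistic model the latencies add, they do not multiply, so the result is nowhere near 1333 divided by 9 three times.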

Go read the links I put in post #7. You need to understand how DDR3 works. Dividing by 9 three times in a row isn't how you compute the timing. Look at the datasheet and the timing diagrams, they don't do that or imply that is how you calculate the throughput.

If you needed single cycle accesses then you should have used something like QDR-II instead of DDR3. At least in this case you can find parts with burst of 2 so you'll only throw away half the data access.

I think I'm done here. You aren't making enough of an effort to learn how DDR3 works before making assumptions about how to use it (which are all wrong at this point).
 

I did read it. It's not in there tho,
But thanks..
 

Back to back bursts of 8 from a Micron datasheet for a DDR3


The READ latency with tRCD (the second 9 in 9-9-9; the first 9 is CL).


As you can keep stringing the reads one after another just by sending the READ commands at the right time you can get very high bandwidths.
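The effect of stringing reads together can be sketched as below. The model assumes all reads go to one already-selected row, that a new READ can be issued every 4 clocks (the length of a burst of 8), and that the only fixed cost is one tRCD plus one CL at the start. The function name is made up for illustration.

```python
def bus_utilisation(n_reads: int, t_rcd: int = 9, cl: int = 9, burst_clocks: int = 4) -> float:
    """Fraction of clocks in which the data bus carries data for n back-to-back reads."""
    busy = n_reads * burst_clocks
    total = t_rcd + cl + busy   # the initial latency is paid once, then bursts run gapless
    return busy / total


for n in (1, 4, 16, 100):
    print(f"{n:3d} reads in a row: {bus_utilisation(n):.1%} of peak bandwidth")
```

The latency is paid once up front, so the longer the run of reads, the closer the bus gets to its peak rate.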

You should also read the 200 page datasheet of the DDR3 part you are using. The other two links were just to give you an overview of what all those numbers they use for DDR3 speeds mean.

I saw in another post that you don't have a formal electronics background. If you don't realize it yet, being an EE isn't easy, and DDR3 is not a beginner's interface design. You might want to reconsider using it and stick with a simpler memory interface like SRAM.
 
Ok, that's cool tho. I want to be safe regardless of data being in another row and/or column.
I was wondering what that mode register flag was for. So it looks safe to implement. Let the controller handle that.
Times 8 for streaming data is pretty fast, at ~400 MHz.
 

Ok, that's cool tho. I want to be safe regardless of data being in another row and/or column.
If you are changing rows every time you read then you shouldn't be using DDR3. Take a look at QDR-II; it's designed more for random access than sequential, for high-bandwidth network switching applications, with burst sizes of 2 or 4. It's also not a simple interface, but definitely simpler than DDR3.

- - - Updated - - -

Looks like the latest generation is QDR-IV and I wish they had this 4 years ago.
https://www.cypress.com/products/qdr-iv
 

I noticed the tradeoff. I think I'd rather misuse a larger component than think about using selectors on multiple smaller ones.
The newer ones are looking up tho; I wanted to use the SRAMs at first.
 

Back at it..

The DDR4 modules are rated for higher frequencies than the logic I'm using. I'm using ~400 MHz logic.
But the datasheet does state there's a mode register flag for stepping outside the usual frequency rating.
I'm kind of a frequency-oriented guy and don't understand much of the speed grade stuff.
So I'm confused whether I can use DDR4 memory at a 400 MHz (or lower) differential clock, and catch it back at 400 MHz?
(If that's within the device spec?)
 


Did this ever get answered?

Just took a look and the minimum clock frequency allowed with the DLL on is 625 MHz and the maximum allowed frequency with the DLL off is 125 MHz, so I don't think you can run it at 400 MHz and expect it to work.
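As a sketch, that check could be written like this. The 625 MHz and 125 MHz limits are the ones quoted above. The upper limit for DLL-on operation depends on the speed grade of the part, so `dll_on_max_mhz` is a placeholder assumption to be replaced from the datasheet.

```python
from typing import Optional

DLL_ON_MIN_MHZ = 625.0    # minimum clock with the DLL enabled (from the post)
DLL_OFF_MAX_MHZ = 125.0   # maximum clock with the DLL disabled (from the post)


def clock_mode(clock_mhz: float, dll_on_max_mhz: float = 1600.0) -> Optional[str]:
    """Return which DLL mode allows this clock, or None if neither does."""
    if clock_mhz <= DLL_OFF_MAX_MHZ:
        return "dll_off"
    if DLL_ON_MIN_MHZ <= clock_mhz <= dll_on_max_mhz:
        return "dll_on"
    return None   # falls in the gap between the two supported ranges


for f in (100.0, 400.0, 625.0):
    print(f"{f:6.1f} MHz -> {clock_mode(f)}")
```

A 400 MHz clock lands in the gap between the two ranges, which is why it is not expected to work.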
 

You're right. Micron said the same thing. Kind of bummed me out, it looks nice sitting there..
 

Did you check to see if the memory PHY on your device supports DDR4?

Normally the high-bandwidth memory uses 4:1 clocking. This means the logic runs at 312.5 MHz, and the IO at 625 MHz DDR. This works for DDR4 as it is burst oriented, and commands can be issued at a lower rate. If the controller has an 8:1 mode, this would be 156.25 MHz to get 625 MHz.
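The numbers above work out if the ratio is read as data rate to logic clock, which is an assumption about the convention used here (some vendor tools quote the ratio against the memory clock instead). A minimal sketch under that assumption:

```python
def logic_clock_mhz(io_clock_mhz: float, ratio: int) -> float:
    """Logic clock for a given IO clock, taking the ratio as data rate : logic clock."""
    data_rate_mtps = 2 * io_clock_mhz   # DDR: two transfers per IO clock
    return data_rate_mtps / ratio


IO_CLOCK_MHZ = 625.0
for ratio in (4, 8):
    print(f"{ratio}:1 -> logic at {logic_clock_mhz(IO_CLOCK_MHZ, ratio)} MHz")
```

Either way, the logic only has to run at a fraction of the IO rate, because each logic clock hands the PHY several transfers' worth of data at once.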

That said, the calibration procedure and signaling requirements may be different enough to matter if you attempt to use a non-DDR4 controller with DDR4 devices.
 
