CRC circuit question

The mathematical definition of an M-bit CRC is to append M zeros to the message and then "divide" by the polynomial. The remainder is the CRC.
The "long division" implementation of the serial CRC does exactly that, so it is very straightforward.

It is the "optimized" version that is difficult to understand.
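The definition maps almost line for line onto code. As an illustration, here is my own Python model of the "long division" behavior (not anyone's Verilog; it assumes USB CRC5, polynomial x^5 + x^2 + 1 written as the 6-bit divisor 0b100101, and a zero initial remainder):

```python
POLY = 0b100101   # divisor, including the implicit x^5 bit
M = 5             # CRC width

def crc_long_division(message_bits):
    """Append M zeros and 'divide' by POLY; the remainder is the CRC."""
    remainder = 0
    for bit in message_bits + [0] * M:       # append M zeros, one bit per "clock"
        remainder = (remainder << 1) | bit   # shift the next bit in
        if remainder & (1 << M):             # top bit set -> "subtract"
            remainder ^= POLY                # XOR is subtraction mod 2
    return remainder

# The thread's worked example: message 11011010 leaves remainder 0b10100 = 0x14
print(hex(crc_long_division([1, 1, 0, 1, 1, 0, 1, 0])))
```

Each loop iteration corresponds to one clock cycle of the serial shift register, so the register contents are exactly the intermediate remainders of the hand calculation.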
 

About the testbench simulation problem: it is usually a good idea to delay all inputs and outputs to the DUT with a transport delay assignment. Driving the DUT directly from an initial block can occasionally cause the race condition seen here.

The delay avoids the issue with how Verilog events are scheduled, since it removes any race between clock edges and assignments; the assignments no longer happen at the clock edges.
 

An interesting link about race conditions in Verilog simulation:
I think it explains why the problem disappeared when the testbench assignments were changed to nonblocking.
 

The "long division" implementation of the serial CRC does exactly that, so it is very straightforward.

How is the "long division" version straightforward compared to the optimized version?



 

How is the "long division" version straightforward compared to the optimized version?
It is easy to see that the "long division" implementation does what the CRC definition says: it works exactly like the definition.
The value in the shift register is always the remainder when you "divide" the currently received bits by the polynomial.
This is identical to the intermediate values you get when calculating the CRC according to the definition.

It is difficult to see that the optimized version produces the same result as the CRC definition.
The pattern in the shift register does not correspond to the intermediate values you get when calculating according to the CRC definition.
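To make the contrast concrete, here is a hedged Python sketch of both versions side by side (my own model, assuming USB CRC5 with polynomial 0b100101 and zero initial value; function names are mine). The final CRCs agree, but only the long-division register ever holds the definition's intermediate remainders, and the optimized version finishes without the M extra "zero" clocks:

```python
POLY = 0b100101   # full divisor (with the implicit x^5 bit)
TAPS = 0b00101    # polynomial without the implicit top bit
M = 5

def crc_long_division(bits):
    """Register always holds the remainder of the bits received so far."""
    r = 0
    for b in bits + [0] * M:              # needs M extra zero clocks
        r = (r << 1) | b
        if r & (1 << M):
            r ^= POLY
    return r

def crc_optimized(bits):
    """Feedback taken from the register MSB; no appended zeros needed."""
    r = 0
    for b in bits:
        fb = b ^ ((r >> (M - 1)) & 1)     # input bit XOR register MSB
        r = (r << 1) & ((1 << M) - 1)
        if fb:
            r ^= TAPS
    return r

msg = [1, 1, 0, 1, 1, 0, 1, 0]
print(hex(crc_long_division(msg)), hex(crc_optimized(msg)))  # same final CRC
```

Printing the register after each clock of each version shows two completely different intermediate sequences that nevertheless converge on the same CRC, which is exactly why the optimized version is harder to relate to the definition.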
 

For the long division version, the simulation waveform does not really match the following hand calculation. Why?

 

I would say that VHDL can have race conditions. I ran across one that a colleague couldn't solve in their simulation. The problem turned out to be a race condition caused by a clock that was used inside a component and was also output from the component to another block that used the same clock. There ended up being a delta cycle difference between the clock used directly in the component and the one used in the other component. I suggested they change the design to generate the clock, output it from the component, and run the output clock back into both that component and the other one. That fixed the race condition they were seeing in the simulation.
 

For the long division version, the simulation waveform does not really match the following hand calculation. Why?
The serial implementation processes one input bit per clock cycle.
You will not recognize the intermediate results if you take giant steps like that.
Your hand calculation only shows the intermediate results after the "subtract" clock cycles, plus the final result.

Here is the full calculation:
Code:
     1101101000000 Dividend=Input message + 5 zeros
100101             Divisor=Polynomial, No subtraction
==================
00001             Intermediate result 0x01
000011            Get next input bit
100101            No subtraction
==================
  00011            Intermediate result 0x03
  000110           Get next input bit
  100101           No subtraction
==================
   00110           Intermediate result 0x06
   001101          Get next input bit
   100101          No subtraction
==================
    01101          Intermediate result 0x0d
    011011         Get next input bit
    100101         No subtraction
==================
     11011         Intermediate result 0x1b
     110110        Get next input bit
     100101        Subtract
==================
      10011        Intermediate result 0x13
      100111       Get next input bit
      100101       Subtract
==================
       00010       Intermediate result 0x02
       000100      Get next input bit
       100101      No subtraction
==================
        00100      Intermediate result 0x04
        001000     Get next input bit
        100101     No subtraction
==================
         01000     Intermediate result 0x08
         010000    Get next input bit
         100101    No subtraction
==================
          10000    Intermediate result 0x10
          100000   Get next input bit
          100101   Subtract
==================
           00101   Intermediate result 0x05
           001010  Get next input bit
           100101  No subtraction
==================
            01010  Intermediate result 0x0a
            010100 Get last input bit
            100101 No subtraction
==================
             10100 Final result 0x14
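For anyone who wants to double-check the table, the same long division can be stepped in software. This is my own Python sketch (assuming polynomial 0b100101 and zero initial remainder), which reproduces every intermediate result listed above:

```python
POLY, M = 0b100101, 5

def remainders(message_bits):
    """Intermediate remainder after each clock of message + M zeros."""
    r, out = 0, []
    for b in message_bits + [0] * M:
        r = (r << 1) | b          # get next input bit
        if r & (1 << M):
            r ^= POLY             # "subtract" when the top bit is set
        out.append(r)
    return out

msg = [1, 1, 0, 1, 1, 0, 1, 0]    # dividend 1101101000000 without the 5 zeros
print([hex(r) for r in remainders(msg)])
# ... 0x1, 0x3, 0x6, 0xd, 0x1b, 0x13, ..., ending in 0x14
```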
--- Updated ---

The big difference is that there is no ambiguity in the VHDL simulation. You can analyze the problem and fix it.
The VHDL delta cycle difference is the same regardless of the simulator execution order. You will always get the same result.
You can view the delta cycle difference in the simulator, and if you eliminate it (maybe by adding dummy assignments for the "early" clock), you are fine.

I have seen a similar VHDL "problem" in an ASIC project with gated clocks. The simulation model for the clock gate added one delta cycle to the output clock. The clock before and after the gating had the transitions on different delta cycles.
When we added a "dummy" assignment to the non-gated clock (before being used as a clock in any logic), the simulation was OK.
We could also see in the simulator that the clocks were aligned to the same delta cycle.
 

Code:
     1101101000000 Dividend=Input message + 5 zeros
100101             Divisor=Polynomial, No subtraction
==================
00001             Intermediate result 0x01

In the above calculation, how do you obtain the intermediate result of 0x01?
 

In the above calculation, how do you obtain the intermediate result of 0x01?
I now see that there is a misalignment at the beginning. Three lines should be shifted one step to the right:
Code:
     1101101000000 Dividend=Input message + 5 zeros
100101             Divisor=Polynomial, No subtraction
==================
 00001             Intermediate result 0x01 (this is just the first input bit)
 000011            Get next input bit
 100101            No subtraction
==================

Since the polynomial is 6 bits long, 6 input bits are needed before the first "subtraction" can be made.
This means that the first 5 intermediate results (remainders) are simply prefixes of the message, the first 1 to 5 input bits:
Code:
1     0x01
11    0x03
110   0x06
1101  0x0d
11011 0x1b
 


This is true for signals, but when you mix in shared variables assigned from multiple processes, VHDL is still at the mercy of the compiler, so the same code can work after one compilation and not after another.

E.g., what value will A have after the first delta cycle of simulation here?
Code:
library ieee;
use ieee.std_logic_1164.all;

entity race_demo is
end entity;

architecture sim of race_demo is
  -- an unprotected shared variable written from two processes:
  -- the result is compiler/simulator dependent
  shared variable a : std_logic;
begin
  process
  begin
    a := '0';
    wait;
  end process;

  process
  begin
    a := '1';
    wait;
  end process;
end architecture;
 

OK, now I have understood the long division version of the serial CRC.

What about the "optimized" version of the serial CRC, which is not so straightforward?
 

With the help of others (see the screenshot below done by others), I have understood the reasoning behind the optimized version of the serial CRC implementation. With the optimized version, we can skip stage 0 through stage 4 and jump directly to stage 5.

Now, how do I make parallel implementations of both the "optimized" and "long division" versions of the serial CRC implementation?


Note:

We are using USB CRC5 with a data bit (message bit) width of 1.
Suppose our message is 1:
data = 1

Note that the results of each stage are calculated and shown in the next stage, so that both the previous and the current values are visible.

About the stages:
First we feed in message bit 1 at stage 0.
Next we append zeroes, so feed in a zero for the following five stages (width of CRC).

 


The procedure is identical for both cases, and is described in your post #13.
To generate the CRC for chunks of N input bits, you modify the input bits and shift register bits one at a time, and execute the serial implementation N times. You note in matrices 1 and 2 which output bits are affected by each of the "input" bits.

Remember that the parallel version of "long division" also needs M zeros after the message.
This can be a problem if the CRC size M isn't a multiple of the input chunk size N.
It will put restrictions on the input message length, or you need extra logic to handle an incomplete last chunk, with less than N bits.
The same problem also occurs for the "optimized" implementation if you want to process say 32 input bits per clock cycle, but the message length can be any multiple of 8 bits.
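The derivation itself can be sketched in a few lines of Python (a software model rather than HDL; it assumes the optimized USB CRC5 step from earlier in the thread, zero initial value, and N = 8 bits per chunk; the function names are mine). Because the CRC is linear over XOR, probing one bit at a time and XORing the responses gives the same result as running the serial implementation N times:

```python
N, M = 8, 5               # chunk size and CRC width (USB CRC5 assumed)
TAPS = 0b00101            # polynomial without the implicit x^5 term

def serial_step(crc, bit):
    """One clock of the optimized serial CRC (feedback from the MSB)."""
    fb = bit ^ ((crc >> (M - 1)) & 1)
    crc = (crc << 1) & ((1 << M) - 1)
    return crc ^ TAPS if fb else crc

def serial_n(crc, bits):
    """Run the serial implementation once per input bit."""
    for b in bits:
        crc = serial_step(crc, b)
    return crc

# Probe each Nin (data) bit and each Min (shift register) bit separately.
# Each row records which output bits that single "input" bit affects.
nin_rows = [serial_n(0, [1 if j == i else 0 for j in range(N)]) for i in range(N)]
min_rows = [serial_n(1 << j, [0] * N) for j in range(M)]

def parallel(crc, bits):
    """Process N bits in one 'clock' by XORing the probed rows."""
    out = 0
    for i, b in enumerate(bits):
        if b:
            out ^= nin_rows[i]
    for j in range(M):
        if (crc >> j) & 1:
            out ^= min_rows[j]
    return out

msg = [1, 1, 0, 1, 1, 0, 1, 0]
print(hex(parallel(0, msg)), hex(serial_n(0, msg)))  # the two should agree
```

Reading the set bits of `nin_rows[i]` tells you which output equations contain the term for input bit i, which is exactly what the matrices in the procedure above record.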
 

To generate the CRC for chunks of N input bits, you modify the input bits and shift register bits one at a time, and execute the serial implementation N times. You note in matrices 1 and 2 which output bits are affected by each of the "input" bits.

How exactly is this "modify the input bits and shift register bits one at a time" done?
 

How exactly is this "modify the input bits and shift register bits one at a time" done?
This has already been answered in posts #24 and #39.
The easiest is probably to set only one bit at a time and see which output bits are set after executing N clock cycles. But what we really want to see is which output bits are flipped when we flip a certain input bit. This means that you can use any input pattern as the "reference": flip one bit and check which output bits flip as a result.
If an "output" bit flips when we flip a certain input bit, we know that the input bit is included in the equation for that output bit.
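This bit-flipping argument is easy to check numerically. In my own Python model of the optimized USB CRC5 step (names and constants are my assumptions), flipping the same input bit always flips the same set of output bits, no matter which reference pattern is used:

```python
import random

M, TAPS = 5, 0b00101      # USB CRC5 assumed, polynomial without top bit

def serial_step(crc, bit):
    fb = bit ^ ((crc >> (M - 1)) & 1)
    crc = (crc << 1) & ((1 << M) - 1)
    return crc ^ TAPS if fb else crc

def crc_n(bits):
    crc = 0
    for b in bits:
        crc = serial_step(crc, b)
    return crc

N, i = 8, 3                               # flip input bit 3 of an 8-bit chunk
random.seed(0)
patterns = set()
for _ in range(20):
    ref = [random.randint(0, 1) for _ in range(N)]
    flipped = ref.copy()
    flipped[i] ^= 1
    patterns.add(crc_n(ref) ^ crc_n(flipped))  # which output bits flipped
print(patterns)   # a single pattern: input bit i's contribution to the outputs
```

The set always contains exactly one pattern, so any reference input works for building the matrices.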
 


Why are 2 matrices needed?
 

Why are 2 matrices needed?
As I mentioned in post #43, you can put all rows in one matrix.
Each row represents one "input" bit (a real input bit or a bit in the shift register).
The original author of the text you pasted into post #13 decided to separate the rows into 2 matrices, one for the "real" input bits and another for the shift register bits.
 

In http://outputlogic.com/my-stuff/circuit-cellar-january-2010-crc.pdf#page=4 , which table (table 1 or table 2) is for the "real" input bits?
Table 1 is for the Nin[] bits = the input data = the "real" input bits
Table 2 is for the Min[] bits = the shift register bits

The "next" state for each output bit (Mout) depends on both the incoming data bits Nin and the previous shift register bits Min.
The equations will have both Nin terms and Min terms. Table 1 is for finding the Nin terms and table 2 is for finding the Min terms,
but you can put everything in one table/matrix. The important thing to know is that each row represents one of the Nin/Min bits (= one of the possible terms in the parallel equations).
 
