Quantization and how to determine the fixed-point bit width?

Status
Not open for further replies.

naught

Number A is the input to the system; let's assume it is 0.123456. After the double-precision number A has been processed by the system, I get the double-precision result B (the actual result).

To find out where my binary point should be, I apply the following procedure:

The final format is Q1.(N-1), which means 1 bit for the sign and N-1 bits for the fraction.

1. Multiply A by 2^(N-1); N can be changed later.
2. Truncate or round A*2^(N-1) to get int(A*2^(N-1)).
3. Feed int(A*2^(N-1)) into the system, get the result, and truncate it to the integer integerB.
4. Divide integerB by 2^(N-1) to get the fractional number integerB/2^(N-1).
5. Compare the actual result B with integerB/2^(N-1). If the relative error abs(B - integerB/2^(N-1)) / abs(B) is too big, go back to step 1 and increase N; if not, we have our N-bit quantization.
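
The five steps above can be sketched in Python roughly as follows. This is only an illustration: `system_float` and `system_fixed` are hypothetical stand-ins for "the system" (here y = 0.5*a + 0.25*a^2, chosen arbitrarily), and the tolerance and the search range for N are assumptions, not values from the procedure.

```python
def quantize(x, n_bits):
    """Steps 1-2: scale x by 2^(N-1), round, and saturate to Q1.(N-1)."""
    scale = 1 << (n_bits - 1)
    q = int(round(x * scale))
    return max(-scale, min(scale - 1, q))


def system_float(a):
    """Hypothetical double-precision reference system (assumed for illustration)."""
    return 0.5 * a + 0.25 * a * a


def system_fixed(a_int, n_bits):
    """Step 3: the same hypothetical system in integer arithmetic.

    Every product of two Q1.(N-1) values has 2*(N-1) fractional bits,
    so it is shifted right by N-1 (truncation) to return to N-1 bits.
    """
    frac = n_bits - 1
    half = quantize(0.5, n_bits)
    quarter = quantize(0.25, n_bits)
    term1 = (half * a_int) >> frac
    square = (a_int * a_int) >> frac
    term2 = (quarter * square) >> frac
    return term1 + term2


def find_word_length(a, tol, n_min=4, n_max=32):
    """Steps 4-5: increase N until the relative error drops below tol."""
    b = system_float(a)
    for n in range(n_min, n_max + 1):
        integer_b = system_fixed(quantize(a, n), n)
        b_fixed = integer_b / (1 << (n - 1))      # step 4
        rel_err = abs(b - b_fixed) / abs(b)       # step 5
        if rel_err <= tol:
            return n, rel_err
    raise ValueError("no word length in range meets the tolerance")


if __name__ == "__main__":
    n, err = find_word_length(0.123456, tol=1e-3)
    print(f"N = {n}, relative error = {err:.2e}")
```

Testing a single input value only bounds the error for that value; sweeping A over the expected input range would give a more trustworthy N.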

If my procedure above is right, then when I do all the arithmetic in step 3, where "N bit + N bit gives (N+1) bit" and "N bit multiplied by N bit gives 2*N bit", my intermediate binary result at the end of step 3 could be much wider than the original N bits.
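
The bit growth described here can be tracked mechanically. A minimal sketch, assuming each signed format is written as a pair (integer bits including the sign, fractional bits); the function names are made up for illustration:

```python
def add_format(a, b):
    """Full-precision format of a + b: one extra integer bit for the carry."""
    return (max(a[0], b[0]) + 1, max(a[1], b[1]))


def mul_format(a, b):
    """Full-precision format of a * b: integer and fractional bits both add."""
    return (a[0] + b[0], a[1] + b[1])


def width(fmt):
    """Total word length of a format."""
    return fmt[0] + fmt[1]


q1_15 = (1, 15)
print(add_format(q1_15, q1_15), width(add_format(q1_15, q1_15)))
print(mul_format(q1_15, q1_15), width(mul_format(q1_15, q1_15)))
```

With two Q1.15 operands, the sum comes out as Q2.15 (17 bits) and the product as Q2.30 (32 bits), matching the "N+1" and "2*N" rules quoted above.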

Here's my question. I assume this does not affect my final result as long as the final result does not exceed the integer-bit range (in this case, its magnitude stays below 1). As long as that is satisfied, it's OK to throw away the extra MSBs and keep just enough bits to represent the integer part (I would have the maximum value calculated beforehand in MATLAB, so I know how many integer bits I need). As for the fractional bits, dropping LSBs causes a loss of precision.
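
A small sketch of why discarding MSBs can be harmless, assuming two's-complement wrap-around (not saturation) and a chain made only of additions, subtractions, and multiplications. The helper `wrap` and the 8-bit width are illustrative choices:

```python
def wrap(x, n_bits):
    """Keep only the low n_bits of x, interpreted as two's complement."""
    half = 1 << (n_bits - 1)
    return ((x + half) % (1 << n_bits)) - half


# 100 + 100 - 150 = 50 fits in 8 bits, but the intermediate 200 does not.
step1 = wrap(100 + 100, 8)       # overflows and wraps to -56
step2 = wrap(step1 - 150, 8)     # wraps back to the correct 50
print(step1, step2)

# The same holds for products: wrapping early or late gives the same low bits.
early = wrap(wrap(200, 8) * 3, 8)
late = wrap(200 * 3, 8)
print(early, late)

# Dropping LSBs is different: the discarded bits are gone for good.
raw = 0b1011_0111                # some value with 8 fractional bits
truncated = (raw >> 4) << 4      # keep only the top 4 fractional bits
print(raw - truncated)           # the truncation error, in LSBs
```

This relies on modular arithmetic, so it would not carry over to steps that inspect a wrapped intermediate value, such as comparisons, divisions, or right shifts.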

In order to do this, I have to know exactly where the binary point is at EACH calculation in the process. Maybe the original is Q1.15, but after a series of calculations the final format is Q10.20. Since I already know the range, the 10 integer bits are not necessary, and I can downsize to the number of bits actually needed.
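
The downsizing step could look something like the sketch below. It assumes the maximum magnitude is known from a floating-point simulation, that the integer-bit count includes the sign bit, and that LSBs are dropped by truncation; `integer_bits_needed` and `resize` are hypothetical helper names.

```python
def integer_bits_needed(max_abs):
    """Smallest signed integer-bit count (sign included) covering max_abs.

    Slightly conservative: a range that reaches exactly -2^k but never
    +2^k would fit in one bit less.
    """
    k = 0
    while max_abs >= (1 << k):
        k += 1
    return 1 + k


def resize(raw, frac_in, int_out, frac_out):
    """Move raw (with frac_in fractional bits) to Q(int_out).(frac_out)."""
    shift = frac_in - frac_out
    shifted = raw >> shift if shift >= 0 else raw << -shift   # drop or pad LSBs
    total = int_out + frac_out
    half = 1 << (total - 1)
    return ((shifted + half) % (1 << total)) - half           # drop MSBs


if __name__ == "__main__":
    raw_q10_20 = int(0.0655 * (1 << 20))        # a small value held in Q10.20
    int_bits = integer_bits_needed(0.9)         # known range: magnitude below 1
    raw_q1_15 = resize(raw_q10_20, 20, int_bits, 15)
    print(int_bits, raw_q1_15 / (1 << 15))
```

If the known range is wrong and the value does not fit, `resize` silently wraps, so a range check during simulation is worth keeping.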

Am I doing this right? Please help.
 
