Filtering a binary file into two data types

wtr · Jul 1, 2015

Hello all,

I have a binary file that consists of two datatypes.

datatype 1 has the identifier AAAA AAAA 1234 8000 {UTC timestamp} [data]
datatype 2 has the identifier AAAA AAAA 4321 8000 {UTC timestamp} [data]

Consider the blocks to be concatenated in random order, for example:
datatype 1 & datatype 1 & datatype 2 & datatype 1 & etc.

I want to know how I can extract the data into two separate files, one for datatype 1 and one for datatype 2.

I have the following C code that strips the UTC timestamps from a datatype block. The problem is that it is all based on a text file. I would love to know how I can keep this all binary to speed up the whole process.

Code (C):
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

// strip_vme_utc_data.c  WDST  16/04/2015
// this program is used to remove UTCs from the vme log data
// so the file may be used for simple file comparison.
// The input is a text log of whitespace-separated <time> <data> token pairs.

int main(int argc, char *argv[]) {

  FILE *fp1; // vme log file
  FILE *fp2; // sanitised vme log file

  char str1[16];       // token read from the input file
  char word_value[16]; // data value (same size as str1, so the copy cannot overflow)
  const char utc_token[] = "8000"; // utc token
  int i;

  if (argc != 3) {
    printf("Usage: strip_vme_utc_data <vme input file> <vme output file>\n");
    exit(1);
  }

  // open first file
  if ((fp1 = fopen(argv[1], "rb")) == NULL) {
    printf("Cannot open first file.\n");
    exit(1);
  }

  // open second file
  if ((fp2 = fopen(argv[2], "wb")) == NULL) {
    printf("Cannot open second file.\n");
    exit(1);
  }

  // read <time> <data> pairs until a read fails (end of file or error);
  // "%15s" limits each token to the size of str1
  while (fscanf(fp1, "%15s", str1) == 1 &&   // get time
         fscanf(fp1, "%15s", str1) == 1) {   // get data
    strcpy(word_value, str1);

    printf("Data is %s\n", word_value);

    if (strcmp(word_value, utc_token) == 0) { // it is a utc token
      // skip UTC data 1..5 and the two lost vme counts, then keep the
      // vme data word that follows; every word is preceded by a time token
      int ok = 1;
      for (i = 0; i < 8 && ok; i++) {
        ok = fscanf(fp1, "%15s", str1) == 1 &&  // get time
             fscanf(fp1, "%15s", str1) == 1;    // get UTC data / lost count / vme data
      }
      if (!ok) {
        break; // file ended in the middle of a UTC block
      }
      strcpy(word_value, str1);
    }

    fprintf(fp2, "%s\n", word_value);

  } // end of while

  if (ferror(fp1)) {
    printf("Error reading first file. \n");
  }

  if (fclose(fp1) == EOF) {
    printf("Error closing first file.\n");
    exit(1);
  }

  if (fclose(fp2) == EOF) {
    printf("Error closing second file.\n");
    exit(1);
  }

  return 0;
}

Thanks in advance.

Dan Mills · Jul 1, 2015

Can AAAA AAAA xxxx 8000 appear anywhere in the payload?

Regards, Dan.

D.A.(Tony)Stewart · Jul 1, 2015

Call this your frame sync pattern. If the payload is variable length, then its length must be included after the timestamp. Frame lock detection can then ignore data until loss of sync.

This is similar to SDLC protocols, where a preamble of AAAA is used for bit sync and word sync, and the pattern that follows is used for frame sync or for address decoding of the payload source. The protocol needs some CRC or checksum for error checking and some EOL code to check the frame length if it is expected to stay in sync.

The frame error rate will be worse than the BER, so adequate redundancy is needed. Unique patterns improve this, such as a maximum likelihood autocorrelation sequence code for high-reliability communication in marginal SNR conditions. ECC such as Chinese Remainder Theorem Fire codes etc. improves error rates further.

Otherwise, to DIY a frame sync algorithm, you need parsing patterns and windows for when to look for each byte type, using a state machine design approach.

I designed my own in the 70s, but there are lots of standard protocols to suit this, even the RLL data streams used in HDDs for address and data with bit compression and ECC.

wtr · Jul 2, 2015

Dan Mills said:
Can AAAA AAAA xxxx 8000 appear anywhere in the payload?

Regards, Dan.

Yes.

Consider the data blocks that follow the header to be of variable length.

Here is a snippet of the binary data I get.

aa aa aa aa aa aa aa aa aa aa aa aa 21 43 33 20 f0 82 1f 20 9e 69 c3 a6 76 d6 20 80 20 20 01 20 ad de ef be 20 20 04 37 34 77 20 20 77 ca e1 ab 20 20 ab d1 1c 01 20 a1 20 20 5b d1 1e e7 20 f0 20 20 ea 1d 20 20 df 01 1d 1e aa aa 55 55 aa aa ad de ef be 20 20 04 37 34 77 20 20 77 ca e1 ab 20 20 ab d1 1c 01 20 a1 20 20 5b d1 1e e7 20 f0 20 20 ea 1d 20 20 df 01 1d 1e aa aa 55 55 aa aa ad de ef be 20 20 aa aa aa aa aa aa aa aa aa aa aa aa 21 43 36 20 d3 2a 1f 20 9e 69 c3 a6 78 d6 20 80 20 20 02 20 04 37 34 77 20 20 77 ca e1 ab 20 20 ab d1 1c 01 20 a1 20 20 5b d1 1e e7 20 f0 20 20 ea 1d 20 20 df 01 1d 1e aa aa 55 55 aa aa ad de ef be 20 20 04 37 34 77 20 20 77 ca e1 ab 20 20 ab d1 1c 01 20 a1 20 20 5b d1 1e e7 20 f0 20 20 ea 1d 20 20 df 01 1d 1e aa aa 55 55 aa aa ad de ef be 20 20 04 37 34 77 20 20 77 ca e1 ab 20 20 aa aa aa aa aa aa aa aa aa aa aa aa 21 43 35 20 b4 54 1f 20 9e 69 c3 a6 7a d6 20 80 20 20 03 20 ab d1 1c 01 20 a1 20 20 5b d1 1e e7 20 f0 20 20 ea 1d 20 20 df 01 1d 1e aa aa 55 55 aa aa ad de ef be 20 20 04 37 34 77 20 20 77 ca e1 ab 20 20 ab d1 1c 01 20 a1 20 20 5b d1 1e e7 20 f0 20 20 ea 1d 20 20 df 01 1d 1e aa aa 55 55 aa aa ad de ef be 20 20 04 37 34 77 20 20 77 ca e1 ab 20 20 ab d1 1c 01 20 a1 20 20 5b d1 aa aa aa aa aa aa aa aa aa aa aa aa 21 43 36 20 b1 0b 1f 20 9e 69 c3 a6 7c d6 20 80 20 20 04 20 1e e7 20 f0 20 20 ea 1d 20 20 df 01 1d 1e aa aa 55 55 aa aa ad de ef be 20 20 04 37 34 77 20 20 77 ca e1 ab 20 20 ab d1 1c 01 20 a1 20 20 5b d1 1e e7 20 f0 20 20 ea 1d 20 20 df 01 1d 1e aa aa 55 55 aa aa ad de ef be 20 20 04 37 34 77 20 20 77 ca e1 ab 20 20 ab d1 1c 01 20 a1 20 20 5b d1 1e e7 20 f0 20 20 ea 1d 20 20 df 01 aa aa aa aa aa aa aa aa aa aa aa aa 21 43 36 20 6f 24 1f 20 9e 69 c3 a6 7e d6 20 80 20 20 05 20 1d 1e aa aa 55 55 aa aa ad de ef be 20 20 04 37 34 77 20 20 77 ca e1 ab 20 20 ab d1 1c 01 20 a1 20 20 5b d1 1e e7 20 f0 20 20 ea 1d 20 20 df 01 1d 1e aa aa 55 55 aa aa ad de ef be 20 20 04 37 34 77 20 20 77 ca e1 ab 20 20 ab d1 1c 01 20 a1 20 20 
5b d1 1e e7 20 f0 20 20 ea 1d 20 20 df 01 1d 1e aa aa 55 55 aa aa ad de ef be aa aa aa aa aa aa aa aa aa aa aa aa 21 43 37 20 dc 20 1f 20 9e 69 c3 a6 80 d6 20 80 20 20 06 20 20 20 04 37 34 77 20 20 77 ca e1 ab 20 20 ab d1 1c 01 20 a1 20 20 5b d1 1e e7 20 f0 20 20 ea 1d 20 20 df 01 1d 1e aa aa 55 55 aa aa ad de ef be 20 20 04 37 34 77 20 20 77 ca e1 ab 20 20 ab d1 1c 01 20 a1 20 20 5b d1 1e e7 20 f0 20 20 ea 1d 20 20 df 01 1d 1e aa aa 55 55 aa aa ad de ef be 20 20 04 37 34 77 20 20 77 ca e1 ab 20 20 aa aa aa aa aa aa aa aa aa aa aa aa 21 43 35 20 b4 54 1f 20 9e 69 c3 a6 82 d6 20 80 20 20 07 20 ab d1 1c 01 20 a1 20 20 5b d1 1e e7 20 f0 20 20 ea 1d 20 20 df 01 1d 1e aa aa 55 55 aa aa

This is somewhat worrying because what I intend to write is the following.

Code (VHDL):
library ieee;
use ieee.std_logic_1164.all;

package vme_data_pkg is
  type array_2d is array (1 to 24) of std_logic_vector(15 downto 0);
  constant vme_data1 : array_2d := (
    X"DEAD",
    X"BEEF",
    X"0000",
    X"3704",
    X"7734",
    X"0000",
    X"CA77",
    X"ABE1",
    X"0000",
    X"D1AB",
    X"011C",
    X"A100",
    X"0000",
    X"D15B",
    X"E71E",
    X"F000",
    X"0000",
    X"1DEA",
    X"0000",
    X"01DF",
    X"1E1D",
    X"AAAA",
    X"5555",
    X"AAAA"
  );
end package vme_data_pkg;

As can be seen in the snippet, ad de ef be is supposed to be deadbeef. I think this may have to do with sampling on the rising and falling edges, with the data then being written to the compact flash in the wrong order. This is not a problem in the system, because when data is read back over a bus the FPGA VHDL code automatically reorders it. However, I'm in a situation where I'm taking the data directly from the compact flash.

Clearly this has all gone pear-shaped and I'll need help.
My tasks break down into:
1. Reorganising bytes (swapping even and odd bytes)
2. Filtering data into a separate file depending on the header: AAAA AAAA 4321 or AAAA AAAA 1234.

Regards
Wes

- - - Updated - - -

It also appears that 00 00 is being displayed as 20 20. This happens when I copy the binary from the Notepad++ hex view into this post.

FvM · Jul 2, 2015

wtr said:
As can be seen in the snippet, ad de ef be is supposed to be deadbeef. I think this may have to do with sampling on the rising and falling edges, with the data then being written to the compact flash in the wrong order. This is not a problem in the system, because when data is read back over a bus the FPGA VHDL code automatically reorders it. However, I'm in a situation where I'm taking the data directly from the compact flash.

Compact flash is written and read sequentially in sectors, so there's no point where the data would get flipped or reordered. I also don't see how an FPGA is involved in the problem.

The question is essentially simple: can you identify the data blocks of interest by a unique signature, e.g. a header? If yes, it's a simple C coding problem of working with binary files and a buffer that acts as the source for all search operations.

wtr · Jul 2, 2015

FvM, I had to do an endian switch to get the data into a format I can edit. The problem was just the order in which it was written to the disk. However, I've fixed this with the following.

Code (C):
while (ReadByte(fp_r, &buffer)) {
     /* Endian Switch */
     temp = buffer[1];
     buffer[1] = buffer[0];
     buffer[0]=temp;
     /* Write data to the file */
     fwrite((void*) buffer, 1, 2, fp_w);
   }
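A minimal sketch of the same endian switch applied to a whole buffer at a time, which may matter for files of several gigabytes. The name swap_bytes_in_words is hypothetical and not from the thread; the sketch assumes 16-bit words and leaves a trailing odd byte untouched.

```c
#include <assert.h>
#include <stddef.h>

/* Swap the two bytes of every 16-bit word in buf, in place.
 * A trailing odd byte, if any, is left untouched.
 * Returns the number of bytes that were swapped (always even). */
size_t swap_bytes_in_words(unsigned char *buf, size_t len)
{
    size_t i;
    assert(buf != NULL || len == 0);
    for (i = 0; i + 1 < len; i += 2) {
        unsigned char temp = buf[i];
        buf[i] = buf[i + 1];
        buf[i + 1] = temp;
    }
    return i;
}
```

A caller would fread a large, even-sized chunk, pass it through this function, and fwrite the result, instead of reading and writing two bytes per iteration.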

What I need now is to filter the output file written through fp_w, using something like fseek, a search, or whatever function can find AAAA AAAA 1234.

Regards,
Wes

FvM · Jul 2, 2015

There should be no endian problem with character or byte streams, but there is of course one with files consisting of word entities. Anyway, I see you know how to handle it.

It's important to know whether the data is word aligned or can be shifted by any number of bytes. Accordingly, the data is searched by advancing the buffer pointer in either word or byte steps.

The search can be performed by two nested for loops. The inner loop compares the search pattern with the data starting at the current buffer position. If the inner loop runs to the end, a match has been found. If a comparison fails, the inner loop is left with a break and the outer loop advances the buffer position by one.
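The two nested loops described above can be sketched as follows. The function name find_pattern and the step parameter (1 for a byte-wise search, 2 if headers are known to be word aligned) are illustrative assumptions, not something given in the thread.

```c
#include <assert.h>
#include <stddef.h>

/* Return the offset of the first occurrence of pat in buf at or after
 * start, or (size_t)-1 if there is none. The search position advances
 * by step bytes each time. */
size_t find_pattern(const unsigned char *buf, size_t buf_len,
                    const unsigned char *pat, size_t pat_len,
                    size_t start, size_t step)
{
    size_t pos, k;
    assert(pat_len > 0 && step > 0);
    if (buf_len < pat_len)
        return (size_t)-1;
    /* outer loop: candidate start position */
    for (pos = start; pos <= buf_len - pat_len; pos += step) {
        /* inner loop: compare the pattern byte by byte */
        for (k = 0; k < pat_len; k++) {
            if (buf[pos + k] != pat[k])
                break;          /* mismatch: leave the inner loop */
        }
        if (k == pat_len)
            return pos;         /* inner loop ran to the end: match */
    }
    return (size_t)-1;
}
```

Because it compares raw bytes rather than strings, zero bytes in the data are handled like any other value.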

wtr · Jul 3, 2015

The data length is variable, so after I've found the header I need a dynamic hover. I don't believe for loops will help.

I'll implement it with a while loop and report back:

for
  if header
    while != next_header
      data <= data

FvM · Jul 3, 2015

I was only describing the search algorithm, not overall data processing.

To be aware of a new header at any position of the input stream, the search algorithm has to be run continuously.

Some details of your data haven't been told yet, e.g. possible length of the data blocks, total file size, so I'll stop guessing about the solution.

wtr · Jul 3, 2015

I inherited this task and didn't know much about it at the start. It now appears the data is stored in the following format:

aa aa aa aa aa aa aa aa aa aa aa aa 43 21
00 35 - word length
54 b4 - checksum
00 1f - utcword 1
69 9e - utcword 2
a6 c3 - utcword 3
d6 b1 - utcword 4
80 00 - utcword 5
00 00 - utcword 6
00 1f - utc flag
d1 ab - DATA
01 1c
a1 00
00 00
d1 5b
e7 1e
f0 00
00 00
1d ea
00 00
01 df
1e 1d
aa aa
55 55
aa aa
de ad
be ef
00 00
37 04
77 34
00 00
ca 77
ab e1
00 00
d1 ab
01 1c
a1 00
00 00
d1 5b
e7 1e
f0 00
00 00
1d ea
00 00
01 df
1e 1d
aa aa
55 55
aa aa
de ad
be ef
00 00
37 04
77 34
00 00
ca 77
ab e1
00 00
d1 ab
01 1c
a1 00
00 00
d1 5b - DATA

Here the data length is 35 hex = 53 decimal.
So it transpires that I can use this as the loop count.
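A sketch of how the length word could drive the split, under assumptions read off the listing above rather than from any specification: the data has already been byte-swapped, each frame starts with twelve 0xAA bytes and a type word (1234 or 4321), nine more header words follow (length, checksum, six UTC words, flag), and the length word counts the 16-bit data words after the flag (53 in the listing). The names frame_size and split_frames are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

#define SYNC_LEN  12  /* twelve 0xAA bytes, as in the listing */
#define HDR_WORDS 10  /* type, length, checksum, 6 UTC words, flag */

/* Examine the frame starting at p (n bytes available). On success store
 * the type word in *type and return the total frame size in bytes.
 * Return 0 if p does not start with a complete, recognised frame. */
size_t frame_size(const unsigned char *p, size_t n, unsigned *type)
{
    static const unsigned char sync[SYNC_LEN] = {
        0xaa, 0xaa, 0xaa, 0xaa, 0xaa, 0xaa,
        0xaa, 0xaa, 0xaa, 0xaa, 0xaa, 0xaa
    };
    size_t words, total;
    assert(p != NULL && type != NULL);
    if (n < SYNC_LEN + 2 * HDR_WORDS || memcmp(p, sync, SYNC_LEN) != 0)
        return 0;
    *type = ((unsigned)p[SYNC_LEN] << 8) | p[SYNC_LEN + 1];
    if (*type != 0x4321u && *type != 0x1234u)
        return 0;
    /* length word, high byte first after the byte swap */
    words = ((size_t)p[SYNC_LEN + 2] << 8) | p[SYNC_LEN + 3];
    total = SYNC_LEN + 2 * (HDR_WORDS + words);
    return total <= n ? total : 0;
}

/* Copy each consecutive frame in buf to the file for its type.
 * Returns the number of bytes consumed; anything left over is an
 * incomplete frame or a position where sync was lost. */
size_t split_frames(const unsigned char *buf, size_t n,
                    FILE *out1234, FILE *out4321)
{
    size_t pos = 0, len;
    unsigned type;
    while ((len = frame_size(buf + pos, n - pos, &type)) != 0) {
        fwrite(buf + pos, 1, len, type == 0x1234u ? out1234 : out4321);
        pos += len;
    }
    return pos;
}
```

Jumping from header to header by length avoids false matches on header-like bytes inside the payload. If frame_size returns 0 in the middle of a file, a pattern search is still needed to resynchronise.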

What I need help with is which C function to use: strcmp, fseek, regex, etc. Please advise what's best.

Dan Mills · Jul 3, 2015

You cannot use the string stuff because the data contains zeros which would be seen as string terminators.

Looks to me like a good case for a little state machine, for loops, and some pointer arithmetic.
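A minimal sketch of such a state machine, fed one byte at a time. It assumes byte-swapped data in which a header is at least twelve 0xAA bytes followed by 12 34 or 43 21; the names sync_fsm and sync_feed are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define SYNC_LEN 12

typedef struct {
    unsigned aa_run;   /* consecutive 0xAA bytes seen, saturating at SYNC_LEN */
    unsigned char hi;  /* first type byte once seen after a full sync, else 0 */
} sync_fsm;

/* Feed one byte. Returns 0x1234 or 0x4321 when that byte completes a
 * header, otherwise 0. */
unsigned sync_feed(sync_fsm *s, unsigned char b)
{
    assert(s != NULL);
    if (s->hi != 0) {                 /* state: waiting for second type byte */
        unsigned char hi = s->hi;
        s->hi = 0;
        s->aa_run = (b == 0xaa) ? 1 : 0;
        if (hi == 0x12 && b == 0x34) return 0x1234u;
        if (hi == 0x43 && b == 0x21) return 0x4321u;
        return 0;
    }
    if (b == 0xaa) {                  /* state: counting sync bytes */
        if (s->aa_run < SYNC_LEN)
            s->aa_run++;
        return 0;
    }
    if (s->aa_run == SYNC_LEN && (b == 0x12 || b == 0x43))
        s->hi = b;                    /* full sync seen, first type byte fits */
    s->aa_run = 0;
    return 0;
}
```

Saturating the counter rather than requiring exactly twelve bytes lets the detector cope with payloads that end in aa aa immediately before the next sync pattern, as happens in the dump posted earlier.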

There may well be platform-specific things that are helpful, but they are outside the scope of pure C; think of things like mmap as possibly being very handy. It depends a bit on your expected data file size: is this something you can do in memory, or does it need to work piecemeal?

Regards, Dan.

wtr · Jul 3, 2015

Expected size is in the GBs, anywhere from 5 GB to 32 GB.

I am worried about using a window and missing the next run of AAAA bytes. However, I'm not keen on the idea of only incrementing a byte at a time when the file size is so big.

I suppose I could use a big window to scan, and then when I move on, use a slight offset so that I also look at the last chunk of the previous window.
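The overlapping-window idea can be sketched as below: after each chunk is searched, its last pat_len-1 bytes are moved to the front of the buffer, so a header split across two reads is still seen once the next chunk arrives. The names scan_and_carry and count_headers, and the 1 MiB chunk size, are arbitrary illustrative choices.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

#define CHUNK   (1u << 20)  /* bytes per fread; an arbitrary choice */
#define MAX_PAT 64          /* longest pattern this sketch supports */

/* Search buf[0..total) for pat and add the number of matches to *hits.
 * Then move the last pat_len-1 bytes to the front of buf and return
 * how many bytes were kept; the next chunk is read in after them. */
size_t scan_and_carry(unsigned char *buf, size_t total,
                      const unsigned char *pat, size_t pat_len,
                      unsigned long *hits)
{
    size_t i, keep;
    assert(pat_len > 0 && hits != NULL);
    for (i = 0; i + pat_len <= total; i++) {
        if (memcmp(buf + i, pat, pat_len) == 0)
            (*hits)++;
    }
    /* the kept tail is shorter than pat, so no match is counted twice */
    keep = total < pat_len - 1 ? total : pat_len - 1;
    memmove(buf, buf + total - keep, keep);
    return keep;
}

/* Count occurrences of pat in a file of any size, one chunk at a time. */
unsigned long count_headers(FILE *fp, const unsigned char *pat, size_t pat_len)
{
    static unsigned char buf[CHUNK + MAX_PAT];
    size_t keep = 0, got;
    unsigned long hits = 0;
    assert(pat_len <= MAX_PAT);
    while ((got = fread(buf + keep, 1, CHUNK, fp)) > 0)
        keep = scan_and_carry(buf, keep + got, pat, pat_len, &hits);
    return hits;
}
```

The search still steps one byte at a time, but it does so in memory on a large buffer, so the file itself is only read in big sequential chunks.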
