You have three basic choices:
1. Parallel (also known as unrolled) structure. This is the one you will see in most DSP textbooks: every filter delay is implemented as an explicit register, and one result sample is available every clock cycle.
2. MAC (multiply-accumulate) based. This one computes one tap per cycle and stores the intermediate result in an accumulator, so it takes 16 cycles to produce one output sample.
3. DA (distributed arithmetic) based. This method is similar to (2) in that it requires 16 cycles to produce a result and needs an accumulator, but no multipliers are needed, since their functionality is implemented in a look-up table (LUT).
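To make the difference between the three structures concrete, here is a rough behavioral model in Python (not HDL). It is only a sketch under assumptions of mine: a 4-tap filter instead of 16 so the DA table stays at 16 entries, signed 8-bit two's complement input samples, and the hypothetical function names fir_parallel, fir_mac, build_da_lut and fir_da. Each "cycle" is just a loop iteration.

```python
def fir_parallel(coeffs, window):
    """Parallel form: every tap product is formed at once, one result per cycle."""
    return sum(c * x for c, x in zip(coeffs, window))


def fir_mac(coeffs, window):
    """MAC form: one multiply-accumulate per cycle, one cycle per tap."""
    acc = 0
    cycles = 0
    for c, x in zip(coeffs, window):
        acc += c * x
        cycles += 1
    return acc, cycles


def build_da_lut(coeffs):
    """Entry at address a holds the sum of the coefficients whose bit is set in a."""
    n = len(coeffs)
    return [sum(coeffs[k] for k in range(n) if (a >> k) & 1)
            for a in range(1 << n)]


def fir_da(lut, window, bits):
    """Bit-serial DA: one LUT lookup plus shift-add per input bit, no multiplies."""
    acc = 0
    cycles = 0
    for b in range(bits):
        # Gather bit b of every sample in the window to form the LUT address.
        addr = 0
        for k, x in enumerate(window):
            addr |= ((x >> b) & 1) << k
        term = lut[addr] << b
        if b == bits - 1:
            acc -= term  # the two's complement sign bit carries negative weight
        else:
            acc += term
        cycles += 1
    return acc, cycles


# Assumed example values, chosen only for illustration.
coeffs = [3, -1, 4, 2]
window = [10, -20, 7, -128]   # signed 8-bit samples
lut = build_da_lut(coeffs)

print(fir_parallel(coeffs, window))
print(fir_mac(coeffs, window))
print(fir_da(lut, window, bits=8))
```

All three calls return the same sum of products; only the amount of work per cycle differs. In this bit-serial sketch the DA cycle count follows the input word width rather than the number of taps, and the LUT size grows as 2 to the power of the tap count, which is why larger DA filters are usually split into several smaller tables.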
Search for the above terms and you will find a wealth of well-documented implementations.