Tricky. The OP doesn't want to add the values but to compare them to find the maximum.
I'm assuming the desire is to both get the maximum and produce the location of the value in the matrix...
There are two approaches as I see it right now.
1. The parallel approach, which requires a comparator tree: pair up all the elements, compare each pair, and pass the larger value of each compare on to the next compare stage. An MxN matrix needs (MxN - 1) comparators in about log2(MxN) stages, so this results in a very large circuit even for relatively small matrices.
2. If you can tolerate latency, a more software-like approach would be much more efficient. Build an FSM that cycles through all the entries in the "memory array" (matrix) and keeps track of the largest entry seen so far. The time to get a result depends entirely on the size of the matrix (one clock per entry if you do one compare per clock), but the design won't grow much regardless of the size of the matrix being examined. If the latency is too large, break the matrix up and run more than one of the FSMs in parallel, then compare their results.
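
To make the trade-off concrete, here is a rough behavioral model of both approaches in Python. It is only a sketch and not HDL: the names tree_max and fsm_max are made up for illustration, the matrix is assumed to be non-empty and flattened row by row, and the FSM model assumes one compare per clock. Both return the flat index of the maximum so the row and column can be recovered.

```python
def tree_max(values):
    """Approach 1: comparator tree (hypothetical model).

    Each pass of the while loop models one stage of pairwise comparators.
    Returns (max_value, flat_index, stages, comparators).
    """
    level = [(v, i) for i, v in enumerate(values)]
    stages = 0
    comparators = 0
    while len(level) > 1:
        next_level = []
        for k in range(0, len(level) - 1, 2):
            a, b = level[k], level[k + 1]
            # On a tie keep the left input, i.e. the lower index.
            next_level.append(a if a[0] >= b[0] else b)
            comparators += 1
        if len(level) % 2:
            # An unpaired element passes straight to the next stage.
            next_level.append(level[-1])
        level = next_level
        stages += 1
    value, index = level[0]
    return value, index, stages, comparators


def fsm_max(values):
    """Approach 2: sequential scan (hypothetical model).

    One register pair (best value, best address) and one comparator,
    reused on every modeled clock. Returns (max_value, flat_index, cycles).
    """
    best_value, best_index = values[0], 0
    cycles = 1  # first clock just loads entry 0
    for addr in range(1, len(values)):
        if values[addr] > best_value:  # strict: first occurrence wins
            best_value, best_index = values[addr], addr
        cycles += 1
    return best_value, best_index, cycles


# Example 3x4 matrix, flattened row by row.
matrix = [
    [3, 17, 5, 9],
    [12, 4, 17, 1],
    [8, 2, 6, 11],
]
cols = len(matrix[0])
flat = [v for row in matrix for v in row]

t_val, t_idx, t_stages, t_cmps = tree_max(flat)
f_val, f_idx, f_cycles = fsm_max(flat)

# Convert the flat index back into a (row, column) location.
t_loc = divmod(t_idx, cols)
f_loc = divmod(f_idx, cols)
```

For this 12-entry example the tree model uses 11 comparators across 4 stages, while the FSM model uses a single comparator for 12 clocks. That is the area versus latency trade described above; both report the maximum 17 at row 0, column 1.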
Regards,
-alan