I'm no expert on audio algorithms, but I'll give it a shot.
First of all, why are you worried about the varying-amplitude issue? If you cross-correlate the incoming audio against a template of the word, you should be able to find a relative peak when someone says 'rhinoceros,' regardless of the amplitude. Cross-correlation is linear in the input, so if the voice is merely multiplied by some scalar, the peak gets scaled by that same scalar but stays in the same place; a louder voice gives a larger peak, but the peak still exists. Or are you assuming he is moving while he is saying the word, so the amplitude varies across the word itself?
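To illustrate the point, here is a minimal sketch. The template and signal are synthetic stand-ins (white noise, made up for the example), with a scaled copy of the template buried in the signal: the correlation peak lands at the word's location even when the utterance is quiet.

```python
import numpy as np
from scipy.signal import correlate

rng = np.random.default_rng(0)

# Hypothetical template: a stored recording of the target word.
template = rng.standard_normal(200)

# Incoming audio: background noise with a *quiet* copy of the
# template (scaled by 0.3) embedded at sample 500.
signal = rng.standard_normal(2000) * 0.1
signal[500:700] += 0.3 * template

# Cross-correlate; the peak location marks where the word occurs,
# independent of the scalar amplitude.
corr = correlate(signal, template, mode="valid")
peak = int(np.argmax(corr))
print(peak)  # close to 500 even though the word is quiet
```

If you also want the peak *height* to be amplitude-independent (e.g. for thresholding), you'd normalize by the local signal energy, but for just locating the word the raw correlation is enough.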
If he is moving while saying the word, here is one idea: break the word down into syllables and use one correlator per syllable. Your detector would also need to check that the syllable matches fire in the right order and close together in time. This way, instead of assuming the amplitude is constant across the whole word, you only need it to be roughly constant within each syllable.
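A rough sketch of that per-syllable scheme, again with made-up white-noise templates and invented positions, amplitudes, and gap tolerances: each syllable appears in the signal at a different amplitude, and detection requires the per-syllable peaks to occur in order with roughly one syllable length between them.

```python
import numpy as np
from scipy.signal import correlate

rng = np.random.default_rng(1)

# Hypothetical syllable templates, e.g. "rhi", "no", "ce", "ros".
syl_len = 80
syllables = [rng.standard_normal(syl_len) for _ in range(4)]

# Signal where each syllable has a *different* amplitude, as when
# the speaker moves while talking.
signal = rng.standard_normal(1500) * 0.1
starts = [300, 380, 460, 540]
scales = [1.0, 0.6, 0.3, 0.8]
for s, start, a in zip(syllables, starts, scales):
    signal[start:start + syl_len] += a * s

# One correlator per syllable: best match position for each.
peaks = [int(np.argmax(correlate(signal, s, mode="valid")))
         for s in syllables]

# Adjacency check: consecutive peaks must be in order, with gaps
# near one syllable length (tolerance is a tunable assumption).
gaps = np.diff(peaks)
detected = all(60 <= g <= 100 for g in gaps)
print(peaks, detected)
```

Each individual correlator only sees one syllable's worth of amplitude, so a slow amplitude ramp across the word doesn't break any single match.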