I suspect a major ingredient is to generate waveforms for the vowel sounds. In an electronic circuit you would need to store only one cycle each for a-e-i-o-u (short and long), then repeat that waveform for as long as a key is pressed. Either that or add the right combination of overtones.
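To make the idea concrete, here is a rough Python sketch of both approaches: build one cycle by adding overtones, then repeat that stored cycle for as long as a key is held. The sample rate, the overtone amplitudes, and the function names are all my own assumptions for illustration. They are not measured from a real vowel or taken from the voder.

```python
import math

SAMPLE_RATE = 8000  # samples per second (assumed value for this sketch)

def one_cycle(harmonic_amps, samples_per_cycle):
    """Build a single stored cycle by summing overtones.

    harmonic_amps[k] is the amplitude of harmonic k+1 (the fundamental is k=0).
    """
    cycle = []
    for n in range(samples_per_cycle):
        phase = 2 * math.pi * n / samples_per_cycle
        cycle.append(sum(a * math.sin((k + 1) * phase)
                         for k, a in enumerate(harmonic_amps)))
    return cycle

def hold_key(cycle, duration_s, pitch_hz):
    """Repeat the stored cycle for as long as the key is held."""
    total = int(duration_s * SAMPLE_RATE)
    # How far to advance through the stored cycle per output sample.
    step = len(cycle) * pitch_hz / SAMPLE_RATE
    return [cycle[int(i * step) % len(cycle)] for i in range(total)]

# Hypothetical overtone amplitudes, NOT a measured vowel.
AH = one_cycle([1.0, 0.5, 0.3, 0.1], 64)
tone = hold_key(AH, 0.5, 125)  # half a second at 125 Hz
```

The `tone` list could then be written to a WAV file or sent to a DAC. In a hardware version the stored cycle would sit in a small memory and a counter would step through it.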
By speaking into a microphone you can see waveforms on an oscilloscope. Even more convenient is a program called Audacity, for digital sound processing. It's free.
The foot pedal probably contains a potentiometer (variable resistor) for changing the frequency (pitch) of the oscillator.
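If the pedal is read as a position between 0 and 1 (say, from an ADC connected to the potentiometer), the mapping to pitch might look like the sketch below. The 80 to 320 Hz range and the exponential curve are my assumptions, chosen so that equal pedal travel gives equal musical intervals. I don't know how the actual voder pedal was scaled.

```python
def pedal_to_pitch(position, low_hz=80.0, high_hz=320.0):
    """Map a pedal position (0.0 to 1.0) to an oscillator frequency in Hz.

    The range and the exponential curve are assumed, not taken from the voder.
    """
    position = min(1.0, max(0.0, position))  # clamp out-of-range readings
    return low_hz * (high_hz / low_hz) ** position

mid_pitch = pedal_to_pitch(0.5)  # halfway down the pedal
```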
Rather than a keyboard with resistive keys, I think the fingers can press simple switches. Conceivably you might use an ordinary computer keyboard.
It's tricky to create consonants (plosives, dentals, sibilants). With Audacity it's possible to capture one occurrence of a spoken consonant, then store it electronically.
Judging from this audio-video demonstration (2/3 of the way down the webpage), the voder sounds like it has a surprising amount of personality: