I'm afraid I don't have the time to chase down the references you are looking for - I do not have any on hand, and you can easily search for them yourself.
I think that you are getting into usage, data-flow and handshaking details, which are something you will need to think through as part of your project.
If you read through the docs (PDFs) for the different OpenCores FPU projects, and also for the Xilinx cores, you will notice the following trends:
-> The Usselmann FPU does not provide a "ready" or "done" output flag. Instead, it accepts new data on every cycle and has exactly the same latency (4 cycles) for *any* operation. The trade-off is that it cannot run at a very high frequency, since the pipeline is only four stages deep even for e.g. a divide.
-> The Jidan FPU instead outputs a "ready" signal, which must be used to know when new data can be fed into it, but this also allows different operations to take a different number of cycles.
-> The Xilinx core does not support multiple operations in the same core (you would need to combine a few of these together with added logic to create a true FPU).
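To make the difference between the first two styles concrete, here is a rough cycle-by-cycle behavioural sketch in Python (not HDL, and not taken from either core). The class and function names are made up for illustration, and the per-operation cycle counts in the "ready" model are invented placeholders, not the Jidan core's real figures. Only the 4-cycle fixed latency comes from the description above.

```python
from collections import deque
import operator

OPS = {"add": operator.add, "sub": operator.sub,
       "mul": operator.mul, "div": operator.truediv}


class FixedLatencyFpu:
    """Fixed-latency style: a new request every cycle, no ready/done flag."""
    LATENCY = 4  # cycles, the same for every operation

    def __init__(self):
        # One slot per pipeline stage; None marks an empty slot (bubble).
        self.pipe = deque([None] * self.LATENCY)

    def clock(self, request=None):
        """Advance one cycle; request is (op, a, b) or None."""
        value = None if request is None else OPS[request[0]](request[1], request[2])
        self.pipe.append(value)
        return self.pipe.popleft()  # result issued LATENCY cycles ago, or None


class HandshakeFpu:
    """Ready-signal style: one operation in flight, variable latency."""
    # Hypothetical cycle counts, chosen only for illustration.
    CYCLES = {"add": 3, "sub": 3, "mul": 5, "div": 12}

    def __init__(self):
        self.remaining = 0
        self.result = None

    @property
    def ready(self):
        return self.remaining == 0

    def clock(self, request=None):
        """Advance one cycle; a request is only accepted while ready."""
        if self.ready:
            if request is not None:
                op, a, b = request
                self.result = OPS[op](a, b)
                self.remaining = self.CYCLES[op]
            return None
        # Busy: any request presented now is ignored.
        self.remaining -= 1
        return self.result if self.remaining == 0 else None


def run_handshake(fpu, requests):
    """Issue requests in order, waiting on ready; return (cycle, result) pairs."""
    pending = deque(requests)
    done, cycle = [], 0
    while pending or not fpu.ready:
        req = pending.popleft() if (fpu.ready and pending) else None
        out = fpu.clock(req)
        if out is not None:
            done.append((cycle, out))
        cycle += 1
    return done
```

With the fixed-latency model the surrounding logic only has to count cycles, while with the "ready" model it has to hold back each new request until the flag is asserted. That is the kind of control logic you would need to design around whichever core you pick.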
I think you can easily gather and digest this kind of information for yourself.
There are certainly more papers out there about how to approach the pipelining and balancing and data-flow issues with an FPU design.
For example, I just now searched for "fpu design pdf" and quickly found the following:
ftp://reports.stanford.edu/pub/cstr/reports/csl/tr/96/711/CSL-TR-96-711.pdf
Maybe you could start reading through this and also consider some of the references listed in the bibliography?
Good luck with your project ...