I need single- and double-precision matrix-matrix multiply blocks on an FPGA for a comparison. Can anyone help me find an open-source implementation? Preferably one that uses only on-chip memory.

The goal is to measure the FPGA's computational performance. I am using a Xilinx Virtex-6 XC6VLX130T, and I can generate floating-point adders and multipliers with the Xilinx CORE Generator. What I am interested in is an architecture that maximizes speed by using as many of the device's resources as possible (a maximally parallel architecture). Any tutorial or existing code for any such architecture would be helpful.