I had success on a similar problem by allocating native buffers for the matrices, then using a basic CUDA call. The actual work was 100x faster than my CPU baseline.
The bottleneck, of course, was fetching the relevant data and loading it into memory in the first place.
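I don't know the original setup, but here's a minimal sketch of what "allocate native buffers, then a basic CUDA call" can look like. The kernel is an elementwise matrix add as a stand-in for the real work; all names here are my own, not from the post. Note how the `cudaMemcpy` transfers bracket the kernel launch — for transfer-bound workloads, those copies are where the time goes, which matches the bottleneck described above.

```cuda
#include <cuda_runtime.h>
#include <stdio.h>

// Stand-in for the real work: elementwise add of two N x N matrices.
__global__ void matAdd(const float* a, const float* b, float* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

int main(void) {
    const int N = 1024, n = N * N;
    size_t bytes = n * sizeof(float);

    // Host buffers; cudaMallocHost gives pinned (page-locked) memory,
    // which makes host<->device transfers faster than pageable malloc'd memory.
    float *ha, *hb, *hc;
    cudaMallocHost(&ha, bytes);
    cudaMallocHost(&hb, bytes);
    cudaMallocHost(&hc, bytes);
    for (int i = 0; i < n; ++i) { ha[i] = 1.0f; hb[i] = 2.0f; }

    // Device buffers.
    float *da, *db, *dc;
    cudaMalloc(&da, bytes);
    cudaMalloc(&db, bytes);
    cudaMalloc(&dc, bytes);

    // The transfers: often the dominant cost, not the kernel itself.
    cudaMemcpy(da, ha, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(db, hb, bytes, cudaMemcpyHostToDevice);

    // The "basic CUDA call": one launch, 256 threads per block.
    matAdd<<<(n + 255) / 256, 256>>>(da, db, dc, n);
    cudaDeviceSynchronize();

    cudaMemcpy(hc, dc, bytes, cudaMemcpyDeviceToHost);
    printf("c[0] = %f\n", hc[0]);  // 1.0 + 2.0 = 3.0

    cudaFree(da); cudaFree(db); cudaFree(dc);
    cudaFreeHost(ha); cudaFreeHost(hb); cudaFreeHost(hc);
    return 0;
}
```

For repeated calls, keeping the data resident on the device (or overlapping copies with compute via streams and `cudaMemcpyAsync`) is the usual way to claw back the transfer cost.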