VMMA Documentation
John Strawn

18 September 1987

Here are the basic steps to be performed:

          move      x:(r1),x0
          move      y:(r5),y0
          move      x:(r0),x1
          move      y:(r4),y1
          mpy       x0,y0,a
          macr      x1,y1,a
          move      a,x:(r6)

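That amounts to two arithmetic instructions plus five moves.
Three of those moves use X memory, and an instruction can carry
at most one X move and one Y move, so at least three
instructions will be required for one output element (and this
is the solution used in the code):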

          ; loop setup
          move                x:(r1),x0 y:(r5),y0
          mpy       x0,y0,a   x:(r0),x1 y:(r4),y1
          macr      x1,y1,a   x:(r1),x0 y:(r5),y0
          move                x:(r0),x1 y:(r4),y1

          ; inner loop
          mpy       x0,y0,a   a,x:(r6)  
          macr      x1,y1,a   x:(r0),x1 y:(r4),y1
          move                x:(r1),x0 y:(r5),y0
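
In arithmetic terms, each output element is one rounded sum of
two products. The Python sketch below is a reference model only,
not part of the macro. It assumes 24-bit fractional operands held
as signed integers, a product shifted left one bit into the
accumulator, convergent (ties-to-even) rounding for macr, and
limiting when the accumulator is stored. The names
convergent_round, vmma_element, and vmma are hypothetical.

```python
MIN24, MAX24 = -(1 << 23), (1 << 23) - 1   # signed 24-bit range


def convergent_round(acc):
    """Round a 48-bit fractional value to its upper 24 bits.

    Ties go to the even result (assumed behaviour of macr/rnd).
    """
    low = acc & 0xFFFFFF          # the 24 bits to be discarded
    upper = acc >> 24             # floor of the upper word
    if low > 0x800000 or (low == 0x800000 and (upper & 1)):
        upper += 1
    return upper


def vmma_element(x0, y0, x1, y1):
    """Model one output element of the basic steps."""
    acc = (x0 * y0) << 1                  # mpy  x0,y0,a
    acc += (x1 * y1) << 1                 # multiply-accumulate half of macr
    word = convergent_round(acc)          # rounding half of macr
    return max(MIN24, min(MAX24, word))   # limiting on move a,x:(r6)


def vmma(xa, ya, xb, yb):
    """Apply vmma_element across four equal-length input vectors."""
    return [vmma_element(*t) for t in zip(xa, ya, xb, yb)]
```

For instance, with 0.5 represented as 0x400000 and 0.25 as
0x200000, vmma_element(0x400000, 0x400000, 0x200000, 0x400000)
models 0.5*0.5 + 0.25*0.5 and returns 0x300000, that is, 0.375.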


For the sake of completeness, here are some alternatives.
Doubling up to use two accumulators will lose, because writing 
out the results with (R_O) will always write to the same side of 
memory. Here is a best-case example. Listing the operations to 
be done shows 6 x moves and 4 y moves for two output elements, 
which will require 6 instructions minimum, or 3 per element, 
the same as before.  So no savings are possible by doubling up 
the accumulators. 

          move      x:(r1),x0
          move      y:(r5),y0
          move      x:(r0),x1
          move      y:(r4),y1
          mpy       x0,y0,a
          macr      x1,y1,a
          move      a,x:(r6)
          move      x:(r1),x0
          move      y:(r5),y0
          move      x:(r0),x1
          move      y:(r4),y1
          mpy       x0,y0,b
          macr      x1,y1,b
          move      b,x:(r6)
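
The counting argument can be written down as a small Python
sketch. It assumes that one instruction carries at most one
arithmetic operation, one X-memory move, and one Y-memory move,
so the busiest of the three sets a lower bound. The function
name min_instructions is hypothetical.

```python
def min_instructions(alu_ops, x_moves, y_moves):
    """Lower bound on the instruction count for one schedule.

    Assumes each instruction holds at most one arithmetic
    operation, one X-memory move, and one Y-memory move.
    """
    return max(alu_ops, x_moves, y_moves)


# Basic steps: mpy + macr, 3 X moves (2 loads, 1 store), 2 Y loads.
single = min_instructions(alu_ops=2, x_moves=3, y_moves=2)

# Doubled accumulators: the same work for two output elements.
doubled = min_instructions(alu_ops=4, x_moves=6, y_moves=4)

# Instructions per output element in each case.
per_element_single = single / 1
per_element_doubled = doubled / 2
```

Both cases come out at 3 instructions per element, because the
X-memory moves are the bottleneck either way.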

An alternative might be to use two accumulators with an explicit 
round. The hope would be that the accumulators could be doubled 
up to save execution time: 

          move      x:(r1),x0
          move      y:(r5),y0
          move      x:(r0),x1
          move      y:(r4),y1
          mpy       x0,y0,a
          mpy       x1,y1,b
          add       a,b
          rnd       b
          move      b,x:(r6)

But since this results in four arithmetic instructions per 
element (mpy, mpy, add, rnd), against three instructions for 
the solution above, no savings are possible. Yet another 
possibility might be to forgo Motorola's nifty rounding 
algorithm and add in a rounding constant; but then *that* 
constant would have to come from somewhere, and no registers 
are left for storing constants, and moving a constant from 
memory would in most cases add yet another instruction.  So 
this is a bad idea.
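
To see what giving up the built-in rounding would change, the
Python sketch below compares the two schemes on a bare
accumulator value. It assumes that the built-in rounding is
convergent (ties go to the even result) and that the rounding
constant would be half an LSB of the upper 24-bit word. Both
function names are hypothetical.

```python
HALF = 0x800000   # half an LSB of the upper 24-bit word


def convergent_round(acc):
    """Round to the upper 24 bits; ties go to the even result."""
    low = acc & 0xFFFFFF          # the 24 bits to be discarded
    upper = acc >> 24             # floor of the upper word
    if low > HALF or (low == HALF and (upper & 1)):
        upper += 1
    return upper


def constant_round(acc):
    """Add the rounding constant, then truncate; ties round up."""
    return (acc + HALF) >> 24


# The two schemes differ only on exact ties with an even upper word.
tie_even = HALF            # upper word 0, low word exactly half
tie_odd = 3 * HALF         # upper word 1, low word exactly half
```

Under these assumptions the two functions agree everywhere
except on exact ties where the upper word is even: for tie_even
the convergent result is 0 and the constant-based result is 1,
while for tie_odd both give 2.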

WARNING:  This macro ends with

          move      M_X,R_L        

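On the 56000, a value moved into an address register is not yet 
available as a pointer in the instruction that immediately 
follows. Therefore, the next instruction after the end of this 
macro should not use the R_L register.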