Notes on the implementation of VSADD
John Strawn
20 July 1987

1. How the inner loop came to be developed

The single scalar input is pointed to by R_I2, which is never incremented.
Assuming sinp != sins ("input scalar side"), the inner loop might look like:

        do      x1,pf\vsadd_\ic\loop1
        move    sinp:(R_I1)+N_I1,x0     sins:(R_I2),a
        add     x0,a    a,sout:(R_O)+N_O
pf\vsadd_\ic\loop1

An alternate would be something like:

        move    sins:(R_I2),a
        move    sinp:(R_I1)+N_I1,b
        add     b,a
        do      x1,pf\vsadd_\ic\loop2
        add     b,a     a,sout:(R_O)+N_O
        move    sinp:(R_I1)+N_I1,a
pf\vsadd_\ic\loop2

which is even better because there are no sins vs sinp vs sout conflicts.
Note that you can't fold that last move into the second move slot in the
add instruction because you'd be moving into register a, which is what the
add is writing into.

In that last example, the "add" doesn't use one of the two move fields.
Assuming sinp != sout, you can effectively "double up" the instructions.
Pipelining must be carefully handled to avoid the screw case cnt=1. Of
course, you've already tested against cnt=0. Assume x1 contains cnt/2.
You'd have something like

        move    sins:(R_I2),y0
        move    sinp:(R_I1)+N_I1,a
        move    sinp:(R_I1)+N_I1,b
        add     y0,a
        do      x1,pf\vsadd_\ic\l3
        add     y0,b
        move    sinp:(R_I1)+N_I1,a      a,sout:(R_O)+N_O
        add     y0,a
        move    sinp:(R_I1)+N_I1,b      b,sout:(R_O)+N_O
pf\vsadd_\ic\l3
        (if odd then:)
        move    a,sout:(R_O)+N_O

As Julius says, this is asymptotically twice as fast as the loop1 and loop2
solutions given earlier. Of course, if sinp==sout, then the solutions are
identical in execution time.

2. Debugging.

Of the possible combinations of sins, sinp, and sout:

             sinp    sins    sout
        1     x       x       x
        2     x       x       y
        3     x       y       x
        4     x       y       y
        5     y       x       x
        6     y       x       y
        7     y       y       x
        8     y       y       y

Actually, sins is outside the main loop, so it should be adequate to test
one x and one y of sins. We must test sinp==sout and sinp!=sout for sinp==x
and sinp==y. So the four tests chosen are 1, 2, 5, and 8, numbered t1
through t4 in tvsadd.asm.

3.
For testing for even and odd with:

        ror     b                       ; b gets cnt/2
        jcc     pf\_vsadd_\ic\_l1
        move    #1,a1
pf\_vsadd_\ic\_l1

you really want any arbitrary non-zero value in a1, because later you will
do this:

        move    y1,b
        neg     b
        jeq     pf\_vsadd_\ic\_l2       ; if cnt odd,
        move    a,sout:(R_O)+N_O        ;   store final element
pf\_vsadd_\ic\_l2

Nominally you could collapse the jcc followed by move into a tcc, but I
can't find a *foolproof* source of non-zero anywhere in the sources listed
for the tcc instruction.

4. Here are the addresses of the input and output vectors:

_SYMBOL X
xscal           I 0000201E
ax_vec          I 00002000
ixself_vec      I 00002018
out1_vec        I 00002020
out3_vec        I 0000202E
sum1            I 0000205D
sum2            I 0000205E
sum3            I 0000205F
sum4            I 00002060
sum5            I 00002061
sum6            I 00002062
sum7            I 00002063
sum8            I 00002064
sum9            I 00002065
ans1            I 00002066
ans2            I 00002067
ans3            I 00002068
ans4            I 00002069
ans5            I 0000206A
ans6            I 0000206B
ans7            I 0000206C
ans8            I 0000206D
ans9            I 0000206E
ans999          I 0000206F

_SYMBOL Y
yscal           I 0000201F
ay_vec          I 0000200C
out2_vec        I 00002027
out4_vec        I 00002035
out5_vec        I 0000203C
out6_vec        I 00002043
out7_vec        I 0000204A
out8_vec        I 00002051
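
As a cross-check on the "doubled up" loop of section 1 and the odd-count
cleanup of section 3, here is a rough Python model of the same pipeline. It
is only a sketch: the function name vsadd_doubled and the local helper load
are made up for illustration, registers a and b become plain variables, and
the model assumes the scalar is added to both accumulators. It checks the
ordering of loads, adds, and stores, not timing or memory-space conflicts.

```python
def vsadd_doubled(inp, scalar):
    """Model of the two-accumulator loop: returns inp[i] + scalar for each i."""
    cnt = len(inp)
    out = []
    if cnt == 0:                 # the notes say cnt=0 is tested beforehand
        return out

    def load(i):
        # The pipelined code reads ahead of what it stores, so it runs past
        # the end of the input. Those values are never stored, so the model
        # (hypothetically) substitutes a dummy 0 for them.
        return inp[i] if i < cnt else 0

    i = 0
    a = load(i); i += 1          # prologue: prime both accumulators
    b = load(i); i += 1
    a += scalar

    # Loop body runs cnt/2 times and handles two elements per pass.
    # Note: range(0) simply skips the body, so this model does not
    # reproduce whatever makes cnt=1 a hazard on the real hardware.
    for _ in range(cnt // 2):
        b += scalar
        out.append(a)            # store the old a in parallel with ...
        a = load(i); i += 1      # ... loading the next input into a
        a += scalar
        out.append(b)            # store the old b in parallel with ...
        b = load(i); i += 1      # ... loading the next input into b

    if cnt % 2:                  # odd count: one summed element is left in a
        out.append(a)
    return out
```

Comparing the result against a plain element-by-element sum for a range of
even and odd lengths is a quick way to confirm that the prologue, the loop
body, and the final odd store line up.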