Understanding ASM: Macro vs Subroutine

As I continue adding more custom code to my project, there is one area I want to understand better.

Macros and subroutines serve a similar function, with some key differences, and I want to know better what I am working with.

What my understanding is so far:
Subroutine: A block of code that you can call from anywhere, as long as that subroutine's label is defined within the bank you currently have active.
Only 1 copy of this code exists, you can call it as many times as you like without bloating the code.
It saves the current code location to the stack, runs the subroutine, and then returns to where it left off.
You -must- return from a subroutine, or you will get undefined and usually game-breaking behavior, as the code execution order gets screwed up.

Macro: This code can be called from -anywhere-; it doesn't matter which bank you are in. This is because it is not actually calling this code, but inserting the contents of the macro into the place where you called it. Every time you call a macro, it adds the entire contents of its code to that location. This makes it difficult to use macros inside of branching statements, because they normally push your labels out of range unless the macro is very short.
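
To make the branch-range problem concrete: 6502 branch instructions use a signed one-byte offset, so they can only reach roughly 127 bytes forward or 128 bytes back. A common workaround, sketched below, is to invert the branch and hop over a jmp. BigMacro, someFlag, and the labels are hypothetical names for illustration, and asm6-style syntax is assumed.

```asm
; Hypothetical names throughout; asm6-style syntax assumed.
someFlag = $0300

MACRO BigMacro            ; stand-in for a macro that expands to a lot of code
    lda #$00
    sta someFlag
    ; ...imagine well over 127 bytes of code here...
ENDM

; Direct version: only assembles if the expanded macro is short enough.
    lda someFlag
    beq skipShort          ; relative branch, limited range
    BigMacro
skipShort:

; Workaround: invert the condition and use jmp, which has no range limit.
    lda someFlag
    bne doLong             ; short hop over the 3-byte jmp
    jmp skipLong           ; absolute jump, not limited by branch range
doLong:
    BigMacro
skipLong:
```

The workaround costs a few extra bytes and cycles for the jmp, but it keeps working no matter how large the expanded macro gets.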

So, in my experience, I would think that it would usually be better and faster to create and use subroutines. However, there seems to be an efficiency cost that I am not understanding.

I am trying to make my collision code more efficient in this manner:
Make different subroutines for 1 point, 4 point, and 6 point collision.
Only call those subroutines for appropriate objects, such that only the minimum necessary number of collision points is checked every frame.

These subroutines work; however, I am noticing much more slowdown, not less. Is it actually more taxing to the CPU to use subroutines rather than direct code?
 

Kasumi

New member
This post of mine may help with what a macro does: http://nesmakers.com/viewtopic.php?f=40&t=2553&p=15973#p15973

It may be advanced knowledge, but you don't have to return from a subroutine. If you don't, though, you should fix the stack (but even this isn't required if you really know what you're doing).

Code:
main:
pla ;pull one byte of the return address off the stack
pla ;pull the other byte

;main game loop here

jsr main;pushes the two bytes for the address to return to the stack
The above is viable! It's not better than:
Code:
main:

;main game loop here

jmp main
But it's viable, and basically functionally identical.

Macros are a concept the CPU never gets to see. A macro can be placed anywhere in your code (I wouldn't say it's called), but it can totally still matter what bank it's in. The macro just results in a copy/paste of the code inside the macro as far as the CPU/assembler are concerned. (Well, with the arguments changed.) If the macro relies on code that's not in the current bank, the same thing will happen as if you jsr'd to a subroutine that's not in the current bank. (The same thing would happen because it IS the same thing.)

Macros are usually faster than subroutines, because there's a penalty for calling a subroutine. Here's two pieces of code:

Code:
jsr subroutine      ;6 cycles
jmp someplaceelse   ;3 cycles
subroutine:
     lda #$00       ;2 cycles
     rts            ;6 cycles

Code:
MACRO subroutine
  lda #$00
ENDM

subroutine
jmp someplaceelse
The first takes 17 cycles to run. (jsr takes 6 cycles, lda #$00 takes 2 cycles, RTS takes 6 cycles, jmp takes 3 cycles).
The second takes 5 cycles. (lda #$00 takes 2 cycles, jmp takes 3 cycles)

But macros can have jsrs to subroutines within them. The only way they save cycles is avoiding the jsr/rts like in the examples above. So why not always use them? Because speed isn't everything. In the post I linked: http://nesmakers.com/viewtopic.php?f=40&t=2553&p=15973#p15973

The GoToScreen macro contains code that will use at least 15 bytes. Use it 6 times and that's 90 bytes. If you stored it as a subroutine (and you kind of couldn't, because necessary setup code is in the macro, but roll with me), you'd need 15 bytes to store the routine once. Then another byte for the RTS, which the macro doesn't need. Then you'd need 6 jsrs at 3 bytes each. 18 + 16 = 34 bytes is way less than 90.

One reason to use a macro is to save screen space within your text editor. 5 commonly used lines of code made into one keyword means more code visible on screen at once. It can also make certain things that would be hard to understand (like the reverse subtraction) easier to understand upon returning later. You also gain CPU time at the cost of bytes.

In short: A subroutine saves you space (so long as its contents excluding the RTS are more than three bytes and it's used often enough to pay for the jsrs and the RTS), but never time. There's a 12 cycle cost for calling a subroutine and returning. Copying the block of code within the subroutine multiple times instead avoids spending those 12 cycles, and the macro makes copying the block easier.
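
As a rough illustration of that space trade-off, here is the same small body written both ways, with byte counts in the comments. ClearFlag, ClearFlagSub, and someFlag are hypothetical names made up for this sketch (asm6-style syntax assumed), and the table is just arithmetic from the instruction sizes.

```asm
; Hypothetical names; someFlag is assumed to live outside the zero page.
someFlag = $0300

MACRO ClearFlag          ; macro version: body is pasted at every use
    lda #$00             ; 2 bytes
    sta someFlag         ; 3 bytes (absolute addressing)
ENDM                     ; body = 5 bytes per use

ClearFlagSub:            ; subroutine version of the same body
    lda #$00             ; 2 bytes
    sta someFlag         ; 3 bytes
    rts                  ; 1 byte
                         ; stored once = 6 bytes, plus 3 bytes per jsr

; uses   macro bytes   subroutine bytes
;   2        10          6 + 6  = 12
;   3        15          6 + 9  = 15
;   4        20          6 + 12 = 18
```

For a body this small the subroutine only starts winning on space at the fourth use, and every call still pays the 12 cycles for jsr and rts.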
 
Thanks, I think I understand quite a bit better now.

By changing the Macros to Subroutine calls in the collision code, I am not necessarily making anything easier for the processor (in fact making things slightly harder), just saving on actual code space.

Even if you are re-using the same code multiple times, in some cases it is worth the extra code space to not have to deal with jumping to and returning from a subroutine.

In cases of action scripts, which are not always dealt with every single frame (inputs, enemy AI, etc.), it would usually be better to use a subroutine and save code space.
(Some of my monsters, namely bosses, have extremely large scripts I call via subroutine, and the performance hit is not noticeable at all.)

For code that is run -every single frame- such as collision checking, you should try to avoid subroutines because time to execute is much more critical than space.
 

Kasumi

New member
Collision checking is run MULTIPLE TIMES per frame, so the 12 cycles per call really add up. But there are all kinds of other optimizations too.

NES Maker actually gives up a lot of RAM to optimize its collision checks. You can give up a bunch of space for speed by "unrolling loops". Like, you know how the button read code reads 8 buttons? Instead of a loop, just copy paste the loop body 8 times. That is an "unrolled" loop. Most of the time when you optimize for space, you lose speed, and when you optimize for speed, you lose space. When you find a way to win both ways, you've done something particularly clever. ;)
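
Here's a rough sketch of what unrolling looks like for a controller read. This is not NES Maker's actual button code: "buttons" is a hypothetical zero-page variable, and the controller is assumed to have been strobed already.

```asm
; Sketch only: "buttons" is a hypothetical zero-page variable, and the
; controller is assumed to have been strobed before this runs.
buttons = $00

; Looped version: 11 bytes, but pays for dex/bne on every pass.
    ldx #8
readLoop:
    lda $4016        ; read the next button bit from controller 1
    lsr a            ; bit 0 -> carry
    rol buttons      ; carry -> low bit of buttons
    dex
    bne readLoop

; Unrolled version: the loop body pasted 8 times, no counter at all.
; 48 bytes, but no dex/bne (about 5 cycles saved per pass) and X is free.
    lda $4016
    lsr a
    rol buttons
    lda $4016
    lsr a
    rol buttons
    lda $4016
    lsr a
    rol buttons
    lda $4016
    lsr a
    rol buttons
    lda $4016
    lsr a
    rol buttons
    lda $4016
    lsr a
    rol buttons
    lda $4016
    lsr a
    rol buttons
    lda $4016
    lsr a
    rol buttons
```

Both versions leave the same value in buttons; the unrolled one simply trades about 37 bytes for the loop overhead.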

NES Maker actually runs a lot of its loops forwards, rather than backwards, which is slower.

Code:
ldx #0
loop:
lda mario,x;or whatever
inx
cpx #LOOPCONSTANT
bne loop
vs
Code:
ldx #LOOPCONSTANT-1
loop:
lda mario,x;or whatever
dex
bpl loop
Why is it faster? No cpx. (Which is a time save on every loop iteration, not just once per frame.) You can't always do it because sometimes you really need to go in 0 to X order, but you can often do it. There's also TXA STA, which NES Maker does a lot, but STX exists. (I've considered making a post about this, since it ends up getting replicated in a lot of user scripts. It's 2 cycles and a byte lost every time. Not huge, but still... The case where you NEED to do TXA, STA is when it's a ,x address or a ,y address outside the zero page which is not often the case. Another benefit of not using TXA is that it doesn't change A, which might mean it has to be reloaded losing more cycles and bytes.)
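
A small sketch of the TXA/STA point, with hypothetical variable and table names (asm6-style syntax assumed):

```asm
; Hypothetical variable and table names, for illustration only.
tempIndex   = $10        ; zero-page variable
objectTable = $0400      ; table outside the zero page

; Common pattern: copy X through A just to store it.
    txa                  ; 2 cycles, 1 byte, and overwrites A
    sta tempIndex

; Same result: store X directly. A keeps whatever it held.
    stx tempIndex

; txa is still needed here, because stx has no absolute,y (or ,x) mode:
    txa
    sta objectTable,y
```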

Anyway, optimization is hard, and experience is probably the only teacher.
 

drexegar

Member
Kasumi said:
Anyway, optimization is hard, and experience is probably the only teacher.

Thank you, Kasumi, this is ever so important information! I look forward to a post about optimization!
 

Kasumi

New member
I'm probably going to write a whole NES programming tutorial when I finish up my program, but I don't have anything specifically about optimization planned. Experience is the only teacher!
 