Calling conventions, stack frame and zero page:

A push first increments the stack pointer, then stores the value.
Frame pointer is kept in AC3 and ZP "fp" (or HW FP).
Stack pointer is kept in ZP "sp".
Stack reference pointer is kept in ZP "spref".
The stack reference pointer marks the top of the stack and is used to
restore the stack after a function call.  The word it points to is
unallocated and is used when calling subroutines (usually to store the
return address).

0-17	Unused (by us)
20-26	(Auto-increment), scratch
27	Stack pointer (if not in hardware)
30-37	(Auto-decrement), scratch
40-47	Used by HW stack and MMPU
50-377	Addresses for subroutines/...

The normal registers (AC0-AC2) are all considered scratch registers.

#ifdef 2bsd
long data type is "pdp-endian", which means that the low address 
contains the high word, and the high address is the low word.
The word itself is little-endian, as opposed to the HW addressing
used on nova with byte instructions (nova3/nova4).
#endif
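
A minimal C sketch of the "pdp-endian" word order described above, for
illustration only.  The helper names long_to_words and words_to_long are
hypothetical and not part of the compiler; words[0] stands for the low
address and words[1] for the high address.

```c
#include <assert.h>
#include <stdint.h>

/* Split a 32-bit value into two 16-bit words in "pdp-endian" order:
 * words[0] (low address) gets the high word,
 * words[1] (high address) gets the low word. */
void long_to_words(uint32_t v, uint16_t words[2])
{
	words[0] = (uint16_t)(v >> 16);
	words[1] = (uint16_t)(v & 0xFFFFu);
}

/* Reassemble the 32-bit value from the two words. */
uint32_t words_to_long(const uint16_t words[2])
{
	return ((uint32_t)words[0] << 16) | words[1];
}
```

With this order the value 0x12345678 is stored as 0x1234 at the low
address followed by 0x5678 at the high address.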

Register classes are assigned as:
	AC0-AC2: AREGs.
	AC0+AC1: BREG (long, concatenated).

In byte code the right half of a word is the first byte (little-endian).
This is different to how the hardware handles it.

Stack frame layout (with frame pointer). 2+1 words to save on stack:
Stack grows upward

 sp ->	! free	! 3	<- also spref
	! w2	! 2
	! w1	! 1
 fp ->	! ospref! 0
 	! ofp	! -1
	! arg0	! -2	<- stack pointer when enter new function
	! arg1	! -3
ospref->! opc	! -4

Return values are in ac0 (and ac1).

The prolog/epilog (csav/cret) are implemented as subroutines, called 
from the ZP dispatcher (for overlay handling).
Both can be omitted if the function does not need them, but the FP in AC3
must be restored.

sp is either in loc 027 (auto-inc) or in HW stack.

In ZP (from srt0.s) a bunch of words are defined:
	.zrel
spref:	.word 0
fp:	.word 0
curseg:	.word 0
C0:	.word 0
C1:	.word 1
	...

Calling a 2-arg function (stack cleanup is done by the return code).
Left column: sp in ZP; right column (after "/"): hardware stack.
	sta arg2,@sp	/	push arg2
	sta arg1,@sp	/	push arg1
	jsr @[.word _fun] /	jsr @[.word _fun]


in function: (2 words needed on stack)
_fun:	.word 2		/	.word 2
	...
	jmp @cret	/	jmp @cret


"Thunks" generated by the linker in non-overlayed space
_t_foo:	lda 0,C1	# new segment in ac0
	lda 2,.+2	# destination function address in ac2
	jmp segcmn
	.word _foo

Switch segment (if needed) and setup for new function
segcmn:	lda 1,curseg
	sta 0,curseg	# update before call
	sub# 0,1,szr
	  nioc mmu	# supervisor call if changing segment
csav: 	
	lda 1,0,3	/	lda 1,0,3	# fetch # words needed on stack
	inc 3,2		/	inc 3,2		# put return address in ac2

	sta 0,@sp	/	push 0		# store old seg
	lda 0,fp	/	mffp 0		# fetch fp
	sta 0,@sp	/	push 0		# ...and push
	lda 0,spref	/	lda 0,spref	# fetch spref
	sta 0,@sp	/	push 0		# ...and push
	lda 3,sp	/	mfsp 3		# fetch sp
	sta 3,fp	/	mtfp 3		# save sp as new fp

	add 3,1		/	add 3,1		# calc new stack pointer
	sta 1,sp	/	mtsp 1		# store
	sta 1,spref	/	sta 1,spref	# store spref as well 
	jmp 0,2		/	jmp 0,2		# return

cret:
	lda 2,0,3	/	lda 2,0,3	# old sp/spref
	sta 2,spref	/	sta 2,spref
	lda 3,-1,3	/	lda 3,-1,3	# old fp
	sta 3,fp	/	mtfp 3		# restore fp
	sta 2,sp	/	mtsp 2		# cleanup stack
	jmp @0,2	/	jmp @0,2
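
A rough C model of the software-stack csav/cret listing and the frame
layout diagram, as a sketch only.  Assumptions that are not in the
listing: the segment word pushed by csav is left out so that the offsets
match the diagram, the count passed to csav includes the free word at
the top, and the caller stores the return address in the free word at
spref (the listing does not show who stores it).  The names mem, push
and call2 are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

enum { MEMWORDS = 64 };

uint16_t mem[MEMWORDS];		/* simulated memory, word addressed */
uint16_t sp, fp, spref;		/* model the ZP cells of the same names */

/* push: increment first, then store */
void push(uint16_t v)
{
	mem[++sp] = v;
}

/* Caller side (assumed): the return address goes into the free word
 * at spref, then the arguments are pushed, last argument first. */
void call2(uint16_t retpc, uint16_t arg0, uint16_t arg1)
{
	mem[spref] = retpc;
	push(arg1);
	push(arg0);
}

/* csav without the segment word: save fp and spref, set up the frame. */
void csav(uint16_t nwords)
{
	push(fp);			/* ofp at fp-1 */
	push(spref);			/* ospref at fp+0 */
	fp = sp;			/* save sp as new fp */
	sp = (uint16_t)(fp + nwords);	/* calc new stack pointer */
	spref = sp;			/* store spref as well */
}

/* cret: unwind to the caller's frame and return the saved pc. */
uint16_t cret(void)
{
	uint16_t ospref = mem[fp];	/* old sp/spref */
	uint16_t ofp = mem[fp - 1];	/* old fp */

	spref = ospref;
	fp = ofp;			/* restore fp */
	sp = ospref;			/* cleanup stack */
	return mem[ospref];		/* opc, target of jmp @0,2 */
}
```

Starting from sp = spref = 10 and fp = 5, call2 followed by csav(3)
leaves fp at 14 and sp at 17, with arg0 at fp-2, arg1 at fp-3 and the
return address at fp-4.  cret then puts sp, spref and fp back to their
values before the call.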



# little-endian bytes. 
# load byte, ptr in ac2, ret in ac0
lbyte:	sta 3,@spref	# save return address

	lda 1,[377]	# get byte mask
	movr 2,2,szc	# skip next if right half
	movs 1,1	# left half; swap mask
	lda 0,0,2	# load word into ac0
	and 1,0,szc	# mask our wanted byte, skip next if right half
	movs 0,0	# swap back if left

	lda 3,fp	# restore ac3
	jmp @spref	# get back

# save byte, ptr in ac2, byte in right half of ac0
sbyte:	sta 3,@spref	# save return address

	lda 3,[377]	# mask
	and 3,0		# Ensure ac0 high byte is clear

	movr 2,2,szc	# skip next if right half
	movs 0,0,skp	# left half; swap input byte
	movs 3,3	# right half; swap mask

	lda 1,0,2	# get word
	and 3,1		# clear the half being written
	add 0,1		# merge in the new byte
	sta 1,0,2	# write back

	lda 3,fp	# restore ac3
	jmp @spref
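
A C sketch of what lbyte and sbyte compute, for illustration only.  It
assumes a byte pointer is the word address shifted left one bit, with
the low bit selecting the half (0 = right half = first byte), which is
how the movr on ac2 is read here.  The function names load_byte and
store_byte are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Load one byte: even byte pointers select the right (low) half of
 * the word, odd byte pointers the left (high) half. */
uint8_t load_byte(const uint16_t *mem, unsigned bptr)
{
	uint16_t w = mem[bptr >> 1];

	return (bptr & 1) ? (uint8_t)(w >> 8) : (uint8_t)(w & 0377);
}

/* Store one byte, keeping the other half of the word intact. */
void store_byte(uint16_t *mem, unsigned bptr, uint8_t b)
{
	uint16_t *w = &mem[bptr >> 1];

	if (bptr & 1)	/* left half: keep right half, byte goes high */
		*w = (uint16_t)((*w & 0x00FF) | ((uint16_t)b << 8));
	else		/* right half: keep left half, byte goes low */
		*w = (uint16_t)((*w & 0xFF00) | b);
}
```

For a word holding 0x4142, byte pointer 0 yields 0x42 and byte pointer 1
yields 0x41.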

# shift ac0 left ac1 times
shl:	sta 3,@spref
	neg 1,1,snr	# negate count.
	jmp out		# done if no shift

	movzl 0,0	# shift left
	inc 1,1,szr	# done?
	jmp .-2
out:
	lda 3,fp
	jmp @spref

# struct assignment
# on entry: src in ac2, dst in ac0, negative cnt in ac1
__stcpy:
	sta 3,@spref
	lda 3,fp
	sta 2,20	# src
	sta 0,21	# dst
	dsz 20		# pre-decrement, auto-inc increments before use
	dsz 21

again:
	lda 0,@20	# inc first, then load
	sta 0,@21	# inc first, then store
	inc 1,1,szr
	jmp again
	jmp @spref
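
A C sketch of the copy loop in __stcpy, for illustration only.  It
models locations 020 and 021 as auto-increment cells that are
incremented before being used as an address, which is why both pointers
are decremented once before the loop.  The names core, autoinc and
stcpy are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

enum { COREWORDS = 64 };

uint16_t core[COREWORDS];	/* simulated memory, word addressed */

/* Indirect reference through an auto-increment cell:
 * the cell is incremented first, then used as the address. */
uint16_t *autoinc(uint16_t cell)
{
	return &core[++core[cell]];
}

/* Copy -negcnt words from src to dst; negcnt must be negative. */
void stcpy(uint16_t src, uint16_t dst, int16_t negcnt)
{
	core[020] = (uint16_t)(src - 1);	/* sta 2,20 ; dsz 20 */
	core[021] = (uint16_t)(dst - 1);	/* sta 0,21 ; dsz 21 */
	do {
		uint16_t w = *autoinc(020);	/* lda 0,@20 */
		*autoinc(021) = w;		/* sta 0,@21 */
	} while (++negcnt != 0);		/* inc 1,1,szr */
}
```

Copying three words is then stcpy(src, dst, -3); afterwards cells 020
and 021 hold the addresses of the last word read and written.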
