assembly - x86 cycles per instruction
Fast signed 16-bit divide by 7 for 6502 (2)
A different way to do this would be to convert the division into a multiplication.
To work out the multiplication factor, we are interested in taking the reciprocal. Essentially we do:
d = n*(1/7)
To make things more accurate we multiply up by by a convenient power of 2. 2^16 works well:
d = floor(n*floor(65536/7)/65536)
The multiplication factor is: floor(65536/7) which is 9362. The size of the result will be:
ceiling(log2(65535*9362)) = 30 bits (4 bytes rounded up)
We can then discard the lower two bytes to divide by 65536, or simply just use the upper 2 bytes for the end result.
To work out the actual rotates and adds we need, we examine the binary representation of the factor 9362:
Note the repetition of the bit pattern. So an efficient scheme would be to calculate:
((n*9*256/4 + n*9)*8 + n)*2 = 9362*n
Calculating n*9 only needs ceiling(log2(65535*9)) = 20 bits (3 bytes).
In pseudo assembly this is:
LDA number ; lo byte STA multiply_nine LDA number+1 ; high byte STA multiply_nine+1 LDA #0 STA multiply_nine+2 ; 3 byte result ASL multiply_nine ; multiply by 2 ROL multiply_nine+1 ROL mulitply_nine+2 ASL multiply_nine ; multiply by 2 (4) ROL multiply_nine+1 ROL mulitply_nine+2 ASL multiply_nine ; multiply by 2 (8) ROL multiply_nine+1 ROL mulitply_nine+2 CLC ; not really needed as result is only 20 bits, carry always zero LDA multiply_nine ADC number STA multiply_nine LDA multiply_nine+1 ADC number+1 STA multiply_nine+1 LDA multiply_nine+2 ADC #0 STA multiply_nine+2 ; n*9
The rest of the exercise I leave to the OP. Note it's not necessary to multiply by 256, as this is just a whole byte shift up.
I am working on an assembly language program for a 6502 cpu, and am finding that I need a fast-as-possible divide-by-seven routine, in particular one which could take a 16-bit dividend.
I am familiar with the routines found here, but generalizing the divide-by-seven routine found there is quite complicated, and a cursory examination of the general algorithm (using integer division)
x/7 ~= (x + x/8 + x/64 ... )/8
indicates that to handle a 16-bit range, it would likely take over 100 cycles to complete because of the 6502's single accumulator register and the relative slowness of individual memory bit shifts on the 6502.
I thought that a lookup table might help, but on the 6502, I am of course restricted to lookup tables that are 256 bytes or fewer. To that end one could assume the existence of two 256-byte lookup tables, xdiv7 and xmod7, which, when using an unsigned one-byte value as an index into the table, can quickly obtain the result of a byte divided by 7 or modulo 7, respectively. I am not sure how I could utilize these to find the values for the full 16-bit range, however.
In parallel, I also need a modulo 7 algorithm, although ideally whatever solution that can be worked out with the division will produce a mod7 result as well. If additional precomputed tables are needed, I am amenable to adding those, as long as the total memory requirements for all tables does not exceed about 3k.
Although I ultimately need a signed division algorithm, an unsigned one would suffice, because I can generalize that to a signed routine as needed.
Any assistance would be greatly appreciated.
At Unsigned Integer Division Routines for 8-bit division by 7:
;Divide by 7 (From December '84 Apple Assembly Line) ;15 bytes, 27 cycles sta temp lsr lsr lsr adc temp ror lsr lsr adc temp ror lsr lsr
The estimate of about 100 cycles with shifts was pretty accurate: 104 cycles to the last ror, 106 cycles total not including
rts, 112 cycles for the whole function.
NOTE: after assembly for C64 and using VICE emulator for C64 I found the algorithm fails, for example 65535 gives 9343 and the correct answer is 9362.
; for 16 bit division by 7 ; input: ; register A is low byte ; register X is high byte ; output ; register A is low byte ; register X is high byte ; ; memory on page zero ; temp is on page zero, 2 bytes ; aHigh is on page zero, 1 byte -- sta temp stx temp+1 stx aHigh -- lsr aHigh ror a lsr aHigh ror a lsr aHigh ror a --- adc temp tax lda aHigh adc temp+1 sta aHigh txa -- ror aHigh ror a lsr aHigh ror a lsr aHigh ror a -- adc temp tax lda aHigh adc temp+1 sta aHigh txa -- ror aHigh ror a lsr aHigh ror a lsr aHigh ror a -- 104 cycles ;------- ldx aHigh ; -- 106 rts ; -- 112 cycles