# assembly - x86 cycles per instruction

## Fast signed 16-bit divide by 7 for 6502 (2)

A different way to do this would be to convert the division into a multiplication.

To work out the multiplication factor, we are interested in taking the reciprocal. Essentially we do:

```
d = n*(1/7)
```

To make things more accurate we multiply up by by a convenient power of 2. 2^16 works well:

```
d = floor(n*floor(65536/7)/65536)
```

The multiplication factor is: floor(65536/7) which is 9362. The size of the result will be:

```
ceiling(log2(65535*9362)) = 30 bits (4 bytes rounded up)
```

We can then discard the lower two bytes to divide by 65536, or simply just use the upper 2 bytes for the end result.

To work out the actual rotates and adds we need, we examine the binary representation of the factor 9362:

```
10010010010010
```

Note the repetition of the bit pattern. So an efficient scheme would be to calculate:

```
((n*9*256/4 + n*9)*8 + n)*2 = 9362*n
```

Calculating n*9 only needs ceiling(log2(65535*9)) = 20 bits (3 bytes).

In pseudo assembly this is:

```
LDA number ; lo byte
STA multiply_nine
LDA number+1 ; high byte
STA multiply_nine+1
LDA #0
STA multiply_nine+2 ; 3 byte result
ASL multiply_nine ; multiply by 2
ROL multiply_nine+1
ROL mulitply_nine+2
ASL multiply_nine ; multiply by 2 (4)
ROL multiply_nine+1
ROL mulitply_nine+2
ASL multiply_nine ; multiply by 2 (8)
ROL multiply_nine+1
ROL mulitply_nine+2
CLC ; not really needed as result is only 20 bits, carry always zero
LDA multiply_nine
ADC number
STA multiply_nine
LDA multiply_nine+1
ADC number+1
STA multiply_nine+1
LDA multiply_nine+2
ADC #0
STA multiply_nine+2 ; n*9
```

The rest of the exercise I leave to the OP. Note it's not necessary to multiply by 256, as this is just a whole byte shift up.

I am working on an assembly language program for a 6502 cpu, and am finding that I need a fast-as-possible divide-by-seven routine, in particular one which could take a 16-bit dividend.

I am familiar with the routines found here, but generalizing the divide-by-seven routine found there is quite complicated, and a cursory examination of the general algorithm (using integer division)

x/7 ~= (x + x/8 + x/64 ... )/8

indicates that to handle a 16-bit range, it would likely take over 100 cycles to complete because of the 6502's single accumulator register and the relative slowness of individual memory bit shifts on the 6502.

I thought that a lookup table might help, but on the 6502, I am of course restricted to lookup tables that are 256 bytes or fewer. To that end one could assume the existence of two 256-byte lookup tables, xdiv7 and xmod7, which, when using an unsigned one-byte value as an index into the table, can quickly obtain the result of a byte divided by 7 or modulo 7, respectively. I am not sure how I could utilize these to find the values for the full 16-bit range, however.

In parallel, I also need a modulo 7 algorithm, although ideally whatever solution that can be worked out with the division will produce a mod7 result as well. If additional precomputed tables are needed, I am amenable to adding those, as long as the total memory requirements for all tables does not exceed about 3k.

Although I ultimately need a signed division algorithm, an unsigned one would suffice, because I can generalize that to a signed routine as needed.

Any assistance would be greatly appreciated.

At Unsigned Integer Division Routines for 8-bit division by 7:

```
;Divide by 7 (From December '84 Apple Assembly Line)
;15 bytes, 27 cycles
sta temp
lsr
lsr
lsr
adc temp
ror
lsr
lsr
adc temp
ror
lsr
lsr
```

The estimate of about 100 cycles with shifts was pretty accurate: 104 cycles to the last ror, 106 cycles total not including `rts`

, 112 cycles for the whole function.

**NOTE**: after assembly for C64 and using VICE emulator for C64 I found the algorithm fails, for example 65535 gives 9343 and the correct answer is 9362.

```
; for 16 bit division by 7
; input:
; register A is low byte
; register X is high byte
; output
; register A is low byte
; register X is high byte
;
; memory on page zero
; temp is on page zero, 2 bytes
; aHigh is on page zero, 1 byte
--
sta temp
stx temp+1
stx aHigh
--
lsr aHigh
ror a
lsr aHigh
ror a
lsr aHigh
ror a
---
adc temp
tax
lda aHigh
adc temp+1
sta aHigh
txa
--
ror aHigh
ror a
lsr aHigh
ror a
lsr aHigh
ror a
--
adc temp
tax
lda aHigh
adc temp+1
sta aHigh
txa
--
ror aHigh
ror a
lsr aHigh
ror a
lsr aHigh
ror a -- 104 cycles
;-------
ldx aHigh ; -- 106
rts ; -- 112 cycles
```