; DESCRIPTION ; ; This source file provides a fast 32-bit floating point divide routine for ; the MSP430 processor family. (It unrolls the division loop by 8 bits.) ; ; I have arranged things here to be similar to something a C compiler might ; call and not necessarily how I may use it for an assembly-code-only TSB ; system. But it is relatively easy to adapt and I had to pick some kind of ; calling convention for illustration purposes. ; ; I am also keeping closer to the IEEE 754 single precision format than the ; HP 2116 floating point format here. This is just a starting concept point ; for the purposes of asking a question here. It's not necessarily where I ; will end up. ; ; ; ROUNDING ; ; A special note on rounding is in order. There are several rounding modes ; possible; such as round-towards-zero, round-down, round-up, and another ; which is more commonly used ... round-to-nearest. In the case of round-to- ; nearest, this can be similar to what is often taught in school (where .5 is ; always rounded up) or it can be a more statistically useful method called ; round-to-nearest-even, which rounds towards the even value (up or down, ; depending.) The default assumption made by floating point libraries is to ; use round-to-nearest-even. ; ; This may sound a lot more complicated to implement than it really is, in a ; subroutine like this one. It turns out that it is impossible to have an ; exact .5 to worry about. Since it never can occur, a very simple rounding ; choice is actually used in the following code without any need to refer to ; the quotient. ; ; For those interested in the mathematical argument about why, here it is. ; ; There are two possible cases we need to cope with. The division performed ; is actually an integer one and conceptually shifts the dividend by either ; 24 or 25 bits to the left, depending on whether the divisor is larger or ; smaller than the dividend. The main idea here is that the leading bit out ; of the division should be a '1' so that the quotient is already normalized. ; ; All this hand-waving means we have two cases to contend with: ; ; (1) dividend * 2^24 / divisor = quotient + .5 ; (2) dividend * 2^25 / divisor = quotient + .5 ; ; The .5 value shown in these is really the remainder/divisor, but where we ; are assuming that it results in an exact .5 that we'd need to worry about, ; if we are to round-to-nearest-even. (We assume the .5 as an exact result ; of computing remainder/divisor.) ; ; This is then: ; ; (1) dividend * 2^24 = quotient * divisor + divisor/2 ; (2) dividend * 2^25 = quotient * divisor + divisor/2 ; ; Now, these two can only be true if the divisor itself has its least ; significant bit as '0', which we will assume is true for now. Anyway, we ; can multiply through by 2, to get: ; ; (1) dividend * 2^25 = 2 * quotient * divisor + divisor ; (2) dividend * 2^26 = 2 * quotient * divisor + divisor ; ; Now, let's use the modulo function on both sides: ; ; (1) (2 * quotient * divisor + divisor) mod 2^25 = 0 ; (2) (2 * quotient * divisor + divisor) mod 2^26 = 0 ; ; Now, just to reduce the cases to one, let's use instead mod 2^25 for the ; second case. It's still true that it will equal 0 and this helps to make ; only one case to worry about. The other one comes along for free, then: ; ; (2 * quotient * divisor + divisor) mod 2^25 = 0 ; ; This must be true in both cases. We can now extract the divisor, this way: ; ; (divisor * [2 * quotient + 1]) mod 2^25 = 0 ; ; This is a product and it requires that the product produce at least 25 '0' ; bits in the least significant bits of the product result. This can only be ; true if and only if there are >=25 least significant '0's present in the ; divisor's least significant bits plus those also found in [2*quotient + 1]. ; But we can easily see that [2*quotient + 1] has no least signficant bits of ; '0' -- shifting the quotient by one bit to the left and adding one pretty ; much guarantees that the least significant bit is a '1', making that ; contribution nil. So this means that the divisor must have at least 25 ; least significant '0' bits. But this is impossible, since the divisor in ; a 32-bit FP value is only 24 bits in size. One cannot provide a divisor ; that can cause this to occur. ; ; Thus, having to deal with an exact .5 isn't needed here. So a simpler ; rounding method is applied in this code -- namely, rounding anything .5 ; and greater, upwards. Doing that works out to exactly the same thing as ; is required in the rounding specification for round-to-nearest-even. ; ; ; SPECIAL THANKS ; ; I actually owe thanks to a lot of places. This includes books on VHDL, ; various web sites, and so on. But I also owe a special debt to W. ; Dijkstra for his thoughts about an earlier version I'd written. ; -------------------------------------------------------------------------- ; fpdiv 32-bit floating-point divide ; -------------------------------------------------------------------------- ; ; This routine divides a 32-bit floating point dividend by a 32-bit floating ; point divisor, producing the 32-bit floating point quotient. ; ; No support for denormals at this time. ; ; Inputs: ; ; R15:R14 32-bit floating-point dividend ; 4(SP):2(SP) 32-bit floating-point divisor ; ; Outputs: ; ; R15:R14 32-bit floating-point quotient ; ; Scratches: ; ; none ; ; Stack Handling: ; ; The passed parameters must be cleaned up by this routine. The caller ; does not do it. In this sense, this is unlike a c function. ; ; Detailed Discussion: ; ; There is a fast division ahead. It is an unrolled-by-8 loop and it is a ; non-restoring division method. It produces 8 bits per loop. ; ; If you aren't familiar with non-restoring division, the consider this idea: ; Call each step of the trial subtraction T[n]. At each step we want to test ; the result of 2 times the prior remainder minus the divisor. If the trial ; subtraction succeeds, we just shift the remainder one bit to the left and ; do the subtraction again. So at each step our successful trial subtraction ; it is something like this: ... T[i]=2*T[i-1]-D, T[i+1]=2*T[i]-D ... and so ; on. But if the trial subtraction fails, the divisor needs to added back ; to the remainder before before shifting and doing the next subtraction. ; Instead, this is T[i+1]=2*(T[i]+D)-D. But this is just T[i+1]=2*T[i]+D. ; Comparing these two results, you can see that the shift of T[i] still takes ; place, but that in this case addition of the divisor is used on the next ; step instead of subtraction. Non-restoring division is named that way ; because it replaces the addition+shift+subtraction used when subtraction ; from the remainder fails, with shift+addition. It alternates between ; shift+subtract and shift+add, depending on the quotient bit generated. ; ; It is unrolled-by-8 in the sense that the code for generating each bit is ; copied 8 times in inline fashion, to avoid a conditional loop on each bit. ; Since a decrement and conditional jump require three cycles, doing that ; once for 8 bits is a big win over doing it once for each bit. ; ; This code needs to produce 24 good mantissa bits as a quotient, where the ; leading bit is '1', and we then need to also produce one final bit for ; rounding purposes. (If the leading bit -- the most significant one -- ; isn't a '1', then we'd lose significance because of the fixed count of ; generated quotient bits.) ; ; The actual division code loop is best examined as a coroutine. Imagine ; the division routine something like the following diagram. Since program ; text is usually viewed under fixed-point fonts, it should render as a ; readable picture. I apologize for any difficulties in interpreting it: ; ; start ,-------, ,-------, ; | | | | | ; | | v v | ; | | ,-------, ,-------, | ; v | |movbyte| |movbyte| | ; ,-------, | '-------' '-------' | ; |frstbit| | | | | ; '-------' | v v | ; | | ,-------, ,-------, | ; | | |rlc/sub| |rla/add| | ; v | '-------'----------, ,----------'-------' | ; '------ | ----> | c=0 \/ c=1 | c=0 | ; | c=1 v /\ v | ; | ,-------, <--------' '--------> ,-------, | ; | |rlc/sub| |rla/add| | ; | '-------'----------, ,----------'-------' | ; | c=1 | c=0 \/ c=1 | c=0 | ; | v /\ v | ; | ,-------, <--------' '--------> ,-------, | ; | |rlc/sub| |rla/add| | ; | '-------'----------, ,----------'-------' | ; | c=1 | c=0 \/ c=1 | c=0 | ; | v /\ v | ; | ,-------, <--------' '--------> ,-------, | ; | |rlc/sub| |rla/add| | ; | '-------'----------, ,----------'-------' | ; | c=1 | c=0 \/ c=1 | c=0 | ; | v /\ v | ; | ,-------, <--------' '--------> ,-------, | ; | |rlc/sub| |rla/add| | ; | '-------'----------, ,----------'-------' | ; | c=1 | c=0 \/ c=1 | c=0 | ; | v /\ v | ; | ,-------, <--------' '--------> ,-------, | ; | |rlc/sub| |rla/add| | ; | '-------'----------, ,----------'-------' | ; | c=1 | c=0 \/ c=1 | c=0 | ; | v /\ v | ; | ,-------, <--------' '--------> ,-------, | ; | |rlc/sub| |rla/add| | ; | '-------'----------, ,----------'-------' | ; | c=1 | c=0 \/ c=1 | c=0 | ; | v /\ v | ; | ,-------, <--------' '--------> ,-------, | ; '<--|rlc/dec| |rla/dec|-->' ; z=0 '-------' '-------' z=0 ; | z=1 z=1 | ; v v ; ,-------, ,-------, ; | round1| | round0| ; '-------' '-------' ; | | ; v v ; '-------------> end <------------' ; ; In the above, the 'movbyte' code moves the 8 temporary bits of the quotient ; to another location in preparation for the next 8 bits being generated. ; The 'rlc/sub' code shifts the remainder and quotient, placing a '1' bit ; into the quotient from the right side, and performs a subtraction step. ; This is done when the produced quotient bit is a '1'. The 'rlc/add' code ; shifts the remainder and quotient, placing a '0' bit into the quotient from ; the right side, and performs an addition step. This is done when the pro- ; duced quotient bit is a '0'. 'frstbit' represents the generation of the ; first quotient bit. This is special because we know something about the ; relationship of dividend and divisor (the dividend and divisor both have ; their MSB=1 and are within a factor of two of each other, which as soon as ; the first subtraction takes place is no longer known) and we must ensure ; the leading bit is '1' and may need to adjust the exponent, plus init the ; loop counter. The 'round1' and 'round0' parts deal with differences in ; interpreting the final remainder for rounding, depending on what the last ; generated quotient bit was. ; -------------------------------------------------------------------------- fpdiv:: ; ---- ; Save registers we'll use throughout, below. ; ---- push R11 ; will hold exponent and sign push R10 ; temporary & partial quotient push R9 ; temporary & partial quotient ; ---- ; Get the divisor into R13:R12. ; ---- push R13 push R12 mov 14(SP), R13 mov 12(SP), R12 ; ---- ; Compute the sign of the quotient and extract both exponents. ; ; R11 dividend excess exponent (upper byte is garbage) ; R10 divisor excess exponent (upper byte is garbage) ; R9 quotient sign in least significant bit ; ---- ; Handle dividend exponent and extract its sign. mov R15, R11 ; shift dividend exponent to upper rla R11 ; 8 bits of R11 and compute the mov SR, R9 ; quotient's sign into R9 swpb R11 ; flip exponent to lower 8 bits ; Handle divisor exponent and compute final sign of quotient. mov R13, R10 ; shift divisor exponent to upper rla R10 ; 8 bits of R10 and copy its xor SR, R9 ; sign (now in the carry) to R9 swpb R10 ; flip exponent to lower 8 bits ; ---- ; Now check for early-out conditions based on the exponents of ; the dividend and divisor. Remember that the upper byte of ; each is garbage. The order of the following tests are ; important. The following tests treat denormals as zero. ; ; The tests are handled this way so that normal floating point ; operations are as fast as possible. Jumps are expensive, so ; arithmetic operations are preferred here. ; ---- ; Upper byte of dividend exponent is 0 for valid exponents. add.b #1, R11 sub #2, R11 ; Upper byte of divisor exponent is 0 for valid exponents. add.b #1, R10 sub #2, R10 ; Sign bit is set in either, only for exception cases. bit R10, R11 jn fp32divf_except ; At this point, the exponents are offset by -1. ; ---- ; Compute the quotient's binary exponent, early. Since both ; exponents include the floating point excess notation along ; with any above coded biases (-1, at this time), subtracting ; the two from each other removes this excess from the result. ; ; The floating point excess will need to be added back to the ; quotient's exponent, later. For now, it will be in signed ; binary form. ; ; The code here also pack the sign into the signed exponent to ; free up R9 for other use, later on. ; ; R11 preliminary, binary exponent of quotient ; R10 freed for use ; R9 freed for use ; ---- sub R10, R11 ; compute binary exponent value rrc R9 ; quotient sign into carry and then rlc R11 ; to low order bit of exponent ; ---- ; Extract the normalized mantissas of the dividend and divisor. ; This includes inserting the hidden bit implied by the notation. ; ---- ; R15:R14 dividend mantissa in MMMM:MM00 format. bis.b #0x80, R15 ; add the hidden bit xor R14, R15 ; then shift the mov.b R14, R14 ; dividend up by 8 xor R14, R15 ; bits, placing zeros swpb R14 ; into the lower- swpb R15 ; most 8 bits and ; R13:R12 divisor mantissa in MMMM:MM00 format. bis.b #0x80, R13 ; add the hidden bit and xor R12, R13 ; then shift the mov.b R12, R12 ; divisor up by 8 xor R12, R13 ; bits, placing zeros swpb R12 ; into the lower- swpb R13 ; most 8 bits and ; ---- ; Before the division loop proper starts, we need to handle the ; first quotient bit step a little differently. ; ; The division loop improves execution time by not having to ; check for significant bits shifted into the carry before doing ; a trial subtraction (or addition.) This means that it must ; work entirely within the registers, and that means that the ; divisor has a leading zero bit (the dividend does not require ; this.) So we already know we must shift the divisor down by 1 ; bit before we even start. If dividend >= divisor, then we also ; need to shift down the dividend, too. Otherwise, no. But all ; this can be combined a bit, as the following code shows. ; ; If it turns out that the dividend < divisor, then we must also ; shift the dividend left by one bit, so dividend >= divisor. In ; this case, there is a -1 adjustment to the exponent to account ; for this. ; ; Inputs: ; R15:R14 normalized dividend ; R13:R12 normalized divisor ; ; Outputs: ; R15:R14 adjusted remainder ; R13:R12 divisor, denormalized right by 1 bit ; R11 exponent is updated, ; ---- ; Do the first subtraction. If dividend >= divisor, then ; the remainder is good and all that is left to do is to set ; up the loop and plug in the first quotient bit and go. sub R12, R14 subc R13, R15 jc fp32divf_0 ; Otherwise, dividend < divisor. So fix up the remainder as ; though we'd shifted the dividend up 1 bit before the trial. rla R14 rlc R15 add R12, R14 addc R13, R15 ; (carry = 1 here) ; Adjust the exponent, since we effectively left-shifted the ; dividend. Note that R11 also holds quotient sign, so the ; value subtracted is left shifted 1 bit, first. sub #(1 << 1), R11 ; Now, shift the divisor down by one bit. fp32divf_0: clrc rrc R13 rrc R12 ; R9 holds a partial quotient value. Clear it. clr R9 ; The first quotient bit is always '1'. Insert it. bis #1, R14 ; Prepare the loop counter and start the division loop. push R8 ; save loop counter variable mov #3, R8 ; set up the counter jmp fp32divf_1x ; and jump in ; Only 8 bits of the quotient are produced each loop. ; ; Below, I use a Q[x] notation. This just means which of ; the 8 bits produced each loop is being produced at that ; point, with Q[7] being the highest order quotient bit of ; the loop and Q[0] being the lowest order quotient bit. ; Quotient bit = '1' portion of coroutine. fp32divf_1: swpb R9 ; R9 holds last two bytes of bis R10, R9 ; quotient. sub R12, R14 ; perform trial subtraction for subc R13, R15 ; quotient bit Q[7] jnc fp32divf_2a ; Q[7] = "0", go make it so fp32divf_1a: rlc R14 ; Q[7] = "1", shift it in rlc R15 fp32divf_1x: sub R12, R14 ; perform trial subtraction for subc R13, R15 ; quotient bit Q[6] jnc fp32divf_2b ; Q[6] = "0", go make it so fp32divf_1b: rlc R14 ; Q[6] = "1", shift it in rlc R15 sub R12, R14 ; perform trial subtraction for subc R13, R15 ; quotient bit Q[5] jnc fp32divf_2c ; Q[5] = "0", go make it so fp32divf_1c: rlc R14 ; Q[5] = "1", shift it in rlc R15 sub R12, R14 ; perform trial subtraction for subc R13, R15 ; quotient bit Q[4] jnc fp32divf_2d ; Q[4] = "0", go make it so fp32divf_1d: rlc R14 ; Q[4] = "1", shift it in rlc R15 sub R12, R14 ; perform trial subtraction for subc R13, R15 ; quotient bit Q[3] jnc fp32divf_2e ; Q[3] = "0", go make it so fp32divf_1e: rlc R14 ; Q[3] = "1", shift it in rlc R15 sub R12, R14 ; perform trial subtraction for subc R13, R15 ; quotient bit Q[2] jnc fp32divf_2f ; Q[2] = "0", go make it so fp32divf_1f: rlc R14 ; Q[2] = "1", shift it in rlc R15 sub R12, R14 ; perform trial subtraction for subc R13, R15 ; quotient bit Q[1] jnc fp32divf_2g ; Q[1] = "0", go make it so fp32divf_1g: rlc R14 ; Q[1] = "1", shift it in rlc R15 sub R12, R14 ; perform trial subtraction for subc R13, R15 ; quotient bit Q[0] jnc fp32divf_2h ; Q[0] = "0", go make it so fp32divf_1h: rlc R14 ; Q[0] = "1", shift it in rlc R15 mov.b R14, R10 ; save least bits of quotient bic R10, R14 ; and remove them from remainder dec R8 ; another 8 bits to go? jnz fp32divf_1 ; yes -- unpack these 8 and repeat ; Loop ends on '1' bit generation. Perform rounding as if ; a final division step. sub R12, R14 ; perform trial subtraction subc R13, R15 ; for rounding purposes jmp fp32divf_3 ; waste two cycles for code space ; Quotient bit = '0' portion of coroutine. fp32divf_2: swpb R9 ; R9 holds last two bytes of bis R10, R9 ; quotient. add R12, R14 ; perform trial addition for addc R13, R15 ; quotient bit Q[7] jc fp32divf_1a ; Q[7] = "1", go make it so fp32divf_2a: rla R14 ; Q[7] = "0", shift it in rlc R15 add R12, R14 ; perform trial addition for addc R13, R15 ; quotient bit Q[6] jc fp32divf_1b ; Q[6] = "1", go make it so fp32divf_2b: rla R14 ; Q[6] = "0", shift it in rlc R15 add R12, R14 ; perform trial addition for addc R13, R15 ; quotient bit Q[5] jc fp32divf_1c ; Q[5] = "1", go make it so fp32divf_2c: rla R14 ; Q[5] = "0", shift it in rlc R15 add R12, R14 ; perform trial addition for addc R13, R15 ; quotient bit Q[4] jc fp32divf_1d ; Q[4] = "1", go make it so fp32divf_2d: rla R14 ; Q[4] = "0", shift it in rlc R15 add R12, R14 ; perform trial addition for addc R13, R15 ; quotient bit Q[3] jc fp32divf_1e ; Q[3] = "1", go make it so fp32divf_2e: rla R14 ; Q[3] = "0", shift it in rlc R15 add R12, R14 ; perform trial addition for addc R13, R15 ; quotient bit Q[2] jc fp32divf_1f ; Q[2] = "1", go make it so fp32divf_2f: rla R14 ; Q[2] = "0", shift it in rlc R15 add R12, R14 ; perform trial addition for addc R13, R15 ; quotient bit Q[1] jc fp32divf_1g ; Q[1] = "1", go make it so fp32divf_2g: rla R14 ; Q[1] = "0", shift it in rlc R15 add R12, R14 ; perform trial addition for addc R13, R15 ; quotient bit Q[0] jc fp32divf_1h ; Q[0] = "1", go make it so fp32divf_2h: rla R14 ; Q[0] = "0", shift it in rlc R15 mov.b R14, R10 ; save least bits of quotient bic R10, R14 ; and remove them from remainder dec R8 ; another 8 bits to go? jnz fp32divf_2 ; yes -- unpack these 8 and repeat ; Loop ends on '0' bit generation. Perform rounding as if ; a final division step. add R12, R14 ; perform trial addition for addc R13, R15 ; rounding purposes ; Regardless of whether the loop ended on '0' or '1' bit ; generation, the rounding bit is in the carry. Add it ; into the quotient and move the quotient into R15:R14. fp32divf_3: pop R8 ; restore loop counter R8's value addc.b #0, R10 ; add rounding bit from carry addc #0, R9 ; into the quotient in R9:R10 swpb R9 ; now arrange all this into mov.b R9, R15 ; R15:R14 such that the xor R15, R9 ; quotient is in the form bis R9, R10 ; of 00QQ:QQQQ mov R10, R14 ; ---- ; Add excess notation to exponent and validate it. ; ; The exponent, on arrival here, is in binary form. It should ; range from the minimum allowed of -126 to the maximum of 127. ; Because of the excess notation, these values will range from ; ; The exponent bias of 127 needs to be added. This is handled ; as part of the test for the minimum value of -126, but leaving ; it just short by 1. This is because the mantissa has a leading ; bit that is 1 and when the exponent is added to pack up the ; floating point value, this extra 1 is added back. ; ; (Note that R11 still holds a combined signed exponent and the ; sign of the operation.) ; ; Inputs: ; R11 signed exponent + sign, form: EEEE EEEE EEEE EEES ; ; Outputs: ; R11 updated exponent + sign, form: 0000 000E EEEE EEES ; ---- ; Validate that the exponent is >= -126. If not, the value ; underlows to zero. (Could denormal, if supported.) add #126 << 1, R11 jn fp32divf_z ; 127 is the largest allowable exponent. We've already added ; 126 to it, so the maximum allowable value is 127+126. cmp #(127+126+1) << 1, R11 jhs fp32divf_i ; ---- ; Pack the sign, exponent, and mantissa into the final floating- ; point result, such that: ; ; R15:R14 final floating point quotient ; ---- rrc R11 ; place exponent into position swpb R11 ; with the sign bit rrc R11 add R11, R15 ; merge into final quotient ; ---- ; Restore registers. ; ---- fp32divf_r: pop R12 pop R13 pop R9 pop R10 pop R11 mov @SP, 4(SP) ; move return address add #4, SP ; and get rid of parameters ret ; ---- ; Handle exception cases. At this point, we've determined that ; either the dividend or the divisor or both is one of zero or ; infinity. Denormals are treated as zero by earlier code that ; branches here, so any denormal handling (currently none) would ; start here. ; ; The value of an exponent here is 0xFFFF for 0.0 values and is ; 0xFFFE for INF values. If the exponent isn't for one of these ; two special values, then it is the usual excess notation for ; floating point exponent, less 1. (The test that arrives here ; subtracts one from the usual excess notated exponent as part of ; the test.) ; ; Inputs: ; R11 biased, excess notated dividend exponent ; R10 biased, excess notated divisor exponent ; ---- fp32divf_except: add #1, R10 ; check for divisor = 0.0 and jeq fp32divf_i ; return signed Inf, if so add #2, R11 ; check for dividend = Inf and jeq fp32divf_i ; return signed Inf, if so ; ---- ; Return 0 result as quotient. ; ---- fp32divf_z: clr R15 ; set 0.0 exponent and mantissa fp32divf_s: clr R14 ; clear the low order mantiaa jmp fp32divf_r ; ---- ; Return INF result as quotient. ; ---- fp32divf_i: mov #0x7F80, R15 ; INF return jmp fp32divf_s .dbend