
In the introduction we outlined the applications that require 64-bit precision, scientific simulations and CAD/CAM being notable examples. However, the transition from ordinary scalar code to packed 64-bit floating-point SSE2 code is complex and may require major design changes. A more conservative approach is to move to scalar SSE2 code, i.e. to use the scalar instructions (easily identified by the SD suffix instead of PD) that work on a single 64-bit datum. The main benefit of this strategy is that, since no parallelism is exploited, it fits existing scalar code naturally and does not require 16-byte alignment of memory operands; the major drawback is that it wastes a potential 2x speedup. The tradeoff between development time and expected performance determines which strategy is more sensible. It should be noted that hand-coding algorithms with SSE2 should be faster than with x87, as SSE2 offers directly addressable registers instead of the unwieldy x87 register stack.
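The two strategies can be compared on a simple array sum. The following is only a sketch, written in C with the SSE2 compiler intrinsics from `<emmintrin.h>` rather than in assembly; the function names `sum_scalar_sse2` and `sum_packed_sse2` are hypothetical, and the packed version assumes that the array starts on a 16-byte boundary.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>

/* Scalar SSE2: one double per step (MOVSD + ADDSD), no alignment needed. */
double sum_scalar_sse2(const double *a, size_t n)
{
    __m128d acc = _mm_setzero_pd();
    for (size_t i = 0; i < n; ++i)
        acc = _mm_add_sd(acc, _mm_load_sd(&a[i]));
    return _mm_cvtsd_f64(acc);          /* read the low quadword */
}

/* Packed SSE2: two doubles per step (MOVAPD + ADDPD).
   Assumption: a is aligned to a 16-byte boundary. */
double sum_packed_sse2(const double *a, size_t n)
{
    __m128d acc = _mm_setzero_pd();
    size_t i = 0;
    for (; i + 2 <= n; i += 2)
        acc = _mm_add_pd(acc, _mm_load_pd(&a[i]));
    /* Fold the high partial sum into the low quadword. */
    acc = _mm_add_sd(acc, _mm_unpackhi_pd(acc, acc));
    /* Handle a possible odd element with scalar code. */
    for (; i < n; ++i)
        acc = _mm_add_sd(acc, _mm_load_sd(&a[i]));
    return _mm_cvtsd_f64(acc);
}
```

The scalar version is a line-by-line translation of an ordinary loop, while the packed version needs an aligned buffer, a final reduction step and a tail loop, which illustrates the extra design effort mentioned above.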
Pentium 4 processors show really poor x87 performance, far below that of the current champion, the AMD Athlon; it is therefore clear that the route to fast floating-point computation passes through SSE2. If Intel can get both Microsoft and Borland to work on a vectorizing compiler, the Pentium 4 may prove to be a winner, but if compiler support is lackluster (as the current support for MMX and SSE is), the Pentium 4 is likely to suffer from a lack of optimized software.
Here is the list of SSE2 instructions that extend SSE (adapted from Intel's preliminary documentation):
 DATA MOVEMENT INSTRUCTIONS
MOVAPD (move aligned packed double-precision floating-point) transfers a 128-bit double-precision floating-point operand from memory to an XMM register and vice versa, or between XMM registers. The memory address must be aligned to a 16-byte boundary, otherwise a general protection exception (#GP) is generated.
MOVUPD (move unaligned packed double-precision floating-point) transfers a 128-bit double-precision floating-point operand from memory to an XMM register and vice versa, or between XMM registers, without any alignment requirement on the memory address.
MOVSD (move scalar double-precision floating-point) transfers a 64-bit double-precision floating-point operand from memory to the low 64 bits of an XMM register and vice versa, or between XMM registers. Alignment of the memory address is not required.
MOVHPD (move high packed double-precision floating-point) transfers a 64-bit double-precision floating-point operand from memory to the high 64 bits of an XMM register and vice versa. The low quadword of the register is left unchanged. Alignment of the memory address is not required.
MOVLPD (move low packed double-precision floating-point) transfers a 64-bit double-precision floating-point operand from memory to the low quadword of an XMM register and vice versa. The high quadword of the register is left unchanged. Alignment of the memory address is not required.
MOVMSKPD (move packed double-precision floating-point mask) extracts the sign bit of each of the two packed double-precision floating-point numbers in an XMM register and saves them in a general-purpose register. This 2-bit value can then be used as a condition to perform branching.
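As an illustration of the data movement instructions, here is a short sketch in C using the SSE2 intrinsics from `<emmintrin.h>`; a compiler is expected to map each intrinsic to the instruction named in the comment, although the exact code generated may vary. The helper names `load_pair`, `copy2` and `sign_mask` are hypothetical.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Build a packed value from two unrelated memory locations.
   Neither address needs to be aligned. */
__m128d load_pair(const double *lo, const double *hi)
{
    __m128d v = _mm_load_sd(lo);   /* MOVSD: low quadword = *lo */
    return _mm_loadh_pd(v, hi);    /* MOVHPD: high quadword = *hi, low unchanged */
}

/* Copy two doubles between possibly unaligned addresses (MOVUPD). */
void copy2(double *dst, const double *src)
{
    _mm_storeu_pd(dst, _mm_loadu_pd(src));
}

/* MOVMSKPD: bit 0 = sign of the low value, bit 1 = sign of the high value. */
int sign_mask(__m128d v)
{
    return _mm_movemask_pd(v);
}
```

A typical use of `sign_mask` is a test such as `if (sign_mask(v) != 0)`, which branches when at least one of the two values is negative.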
 ARITHMETIC INSTRUCTIONS
ADDPD (add packed double-precision floating-point) and SUBPD (subtract packed double-precision floating-point) add and subtract, respectively, two packed double-precision floating-point operands.
ADDSD (add scalar double-precision floating-point) and SUBSD (subtract scalar double-precision floating-point) add and subtract, respectively, the low quadwords of two double-precision floating-point operands; the result goes to the low quadword of the destination operand, whose high quadword is left unchanged.
MULPD (multiply packed double-precision floating-point) multiplies two packed double-precision floating-point operands.
MULSD (multiply scalar double-precision floating-point) multiplies the low quadwords of two packed double-precision floating-point operands; the result goes to the low quadword of the destination operand, whose high quadword is left unchanged.
DIVPD (divide packed double-precision floating-point) divides two packed double-precision floating-point operands.
DIVSD (divide scalar double-precision floating-point) divides the low quadwords of two packed double-precision floating-point operands; the result goes to the low quadword of the destination operand, whose high quadword is left unchanged.
SQRTPD (square root packed double-precision floating-point) returns the packed square roots of a packed double-precision floating-point operand to the destination operand.
SQRTSD (square root scalar double-precision floating-point) returns the square root of the low quadword of the packed double-precision floating-point source operand to the low quadword of the destination operand; the high quadword of the destination operand is left unchanged.
MAXPD (maximum packed double-precision floating-point) compares the corresponding double-precision floating-point values from two packed double-precision floating-point operands and returns the numerically higher value from each comparison to the destination operand.
MAXSD (maximum scalar double-precision floating-point) compares the low double-precision floating-point values from two packed double-precision floating-point operands and returns the numerically higher value from the comparison to the low quadword of the destination operand; the high quadword of the destination operand is left unchanged.
MINPD (minimum packed double-precision floating-point) compares the corresponding double-precision floating-point values from two packed double-precision floating-point operands and returns the numerically lower value from each comparison to the destination operand.
MINSD (minimum scalar double-precision floating-point) compares the low double-precision floating-point values from two packed double-precision floating-point operands and returns the numerically lower value from the comparison to the low quadword of the destination operand; the high quadword of the destination operand is left unchanged.
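The difference between the packed and the scalar arithmetic forms can be sketched as follows, again in C with the SSE2 intrinsics from `<emmintrin.h>`. The function names are hypothetical, and `hypot2` is a naive formula with no protection against overflow of the intermediate squares.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */

/* sqrt(x*x + y*y) for two (x, y) pairs at once: MULPD, ADDPD, SQRTPD. */
__m128d hypot2(__m128d x, __m128d y)
{
    __m128d s = _mm_add_pd(_mm_mul_pd(x, x), _mm_mul_pd(y, y));
    return _mm_sqrt_pd(s);
}

/* Clamp both values to the range [lo, hi]: MAXPD followed by MINPD. */
__m128d clamp2(__m128d v, __m128d lo, __m128d hi)
{
    return _mm_min_pd(_mm_max_pd(v, lo), hi);
}

/* Scalar form (ADDSD): only the low quadwords are added,
   the high quadword of the result is the high quadword of a. */
__m128d add_low(__m128d a, __m128d b)
{
    return _mm_add_sd(a, b);
}
```

Note that `_mm_set_pd(hi, lo)` takes the high value first, so `add_low(_mm_set_pd(10.0, 1.0), _mm_set_pd(20.0, 2.0))` holds 3.0 in its low quadword and 10.0 in its high quadword.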
 LOGICAL INSTRUCTIONS
ANDPD (AND of packed double-precision floating-point) returns a bitwise logical AND of two packed double-precision floating-point operands.
ANDNPD (AND NOT of packed double-precision floating-point) returns a bitwise logical AND NOT of two packed double-precision floating-point operands, i.e. the source operand ANDed with the bitwise complement of the destination operand.
ORPD (OR of packed double-precision floating-point) returns a bitwise logical OR of two packed double-precision floating-point operands.
XORPD (XOR of packed double-precision floating-point) returns a bitwise logical XOR of two packed double-precision floating-point operands.
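The logical instructions are mostly used to manipulate the sign bit (bit 63) of each value. The sketch below, in C with the SSE2 intrinsics from `<emmintrin.h>`, shows three common idioms; the function names are hypothetical, and the sign mask is built from the constant -0.0, whose only set bit is the sign bit.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Absolute value of both doubles: clear the sign bits with ANDNPD.
   _mm_andnot_pd(m, v) computes (~m) & v. */
__m128d abs2(__m128d v)
{
    const __m128d sign = _mm_set1_pd(-0.0);
    return _mm_andnot_pd(sign, v);
}

/* Negate both doubles: flip the sign bits with XORPD. */
__m128d neg2(__m128d v)
{
    return _mm_xor_pd(v, _mm_set1_pd(-0.0));
}

/* Magnitude of mag combined with the sign of sgn: ANDNPD, ANDPD, ORPD. */
__m128d copysign2(__m128d mag, __m128d sgn)
{
    const __m128d sign = _mm_set1_pd(-0.0);
    return _mm_or_pd(_mm_andnot_pd(sign, mag), _mm_and_pd(sign, sgn));
}
```

None of these operations involves a comparison or a branch, which is why they are generally preferred over conditional code inside vectorized loops.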
 COMPARISON INSTRUCTIONS
These instructions compare packed and scalar double-precision floating-point values and return the results of the comparison either to the destination operand or to the EFLAGS register.
CMPPD (compare packed double-precision floating-point) compares the corresponding double-precision floating-point values from two packed double-precision floating-point operands, using an immediate operand as a predicate, and returns a 64-bit mask result of all 1s or all 0s for each comparison to the destination operand. The immediate operand directly encodes 8 compare conditions: equal, less than, less than or equal, unordered, not equal, not less than, not less than or equal, and ordered. The remaining four conditions (greater than, greater than or equal, not greater than, not greater than or equal) are obtained by swapping the operands of the corresponding less-than predicates, for a total of 12 conditions.
CMPSD (compare scalar double-precision floating-point) compares the low double-precision floating-point values from two packed double-precision floating-point operands, using an immediate operand as a predicate, and returns a 64-bit mask result of all 1s or all 0s for the comparison to the low quadword of the destination operand; the high quadword of the destination operand is left unchanged. The immediate operand selects the compare condition as with the CMPPD instruction.
COMISD (compare scalar double-precision floating-point and set EFLAGS) and UCOMISD (unordered compare scalar double-precision floating-point and set EFLAGS) compare the low quadwords of two packed double-precision floating-point operands and set the ZF, PF, and CF flags in the EFLAGS register to show the result (greater than, less than, equal, or unordered). These two instructions differ as follows: the COMISD instruction signals a floating-point invalid-operation (#I) exception when a source operand is either a QNaN or an SNaN; the UCOMISD instruction only signals an invalid-operation exception when a source operand is an SNaN.
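The all-1s or all-0s masks produced by CMPPD are meant to be combined with the logical instructions, so that a per-value choice can be made without branching. A minimal sketch in C with the SSE2 intrinsics from `<emmintrin.h>` follows; the function names are hypothetical.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */

/* For each of the two values: (a < b) ? x : y, without branches. */
__m128d select_lt(__m128d a, __m128d b, __m128d x, __m128d y)
{
    __m128d m = _mm_cmplt_pd(a, b);  /* CMPPD with the "less than" predicate */
    return _mm_or_pd(_mm_and_pd(m, x), _mm_andnot_pd(m, y));
}

/* Number of positions (0, 1 or 2) where a < b: CMPPD + MOVMSKPD. */
int count_lt(__m128d a, __m128d b)
{
    int bits = _mm_movemask_pd(_mm_cmplt_pd(a, b));
    return (bits & 1) + (bits >> 1);
}

/* Scalar comparison of the low quadwords through EFLAGS (COMISD).
   Returns 1 if a.low < b.low, 0 otherwise (also 0 if either is a NaN). */
int low_is_less(__m128d a, __m128d b)
{
    return _mm_comilt_sd(a, b);
}
```

`select_lt` suits data-parallel code where both outcomes are cheap to compute, whereas `low_is_less` corresponds to the classic compare-and-branch pattern of scalar code.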
 SHUFFLE INSTRUCTIONS
SHUFPD (shuffle packed double-precision floating-point) places either of the two packed double-precision floating-point values from the first source operand in the low quadword of the destination operand, and places either of the two packed double-precision floating-point values from the second source operand in the high quadword of the destination operand. The choice is made by an immediate operand.
UNPCKHPD (unpack high packed double-precision floating-point) performs an interleaved unpack of the high double-precision floating-point values of the two source operands. It ignores the low quadwords of the sources.
UNPCKLPD (unpack low packed double-precision floating-point) performs an interleaved unpack of the low double-precision floating-point values of the two source operands. It ignores the high quadwords of the sources.
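The shuffle and unpack instructions rearrange values between and within registers. Three typical uses are sketched below in C with the SSE2 intrinsics from `<emmintrin.h>`; the function names are hypothetical.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Swap the two halves of a register with SHUFPD.
   Bit 0 of the immediate selects the low result from the first operand,
   bit 1 selects the high result from the second operand. */
__m128d swap_halves(__m128d v)
{
    return _mm_shuffle_pd(v, v, 1);  /* low = v.high, high = v.low */
}

/* Horizontal sum: low + high of the same register (UNPCKHPD + ADDSD). */
double hsum(__m128d v)
{
    __m128d hi = _mm_unpackhi_pd(v, v);  /* both quadwords = v.high */
    return _mm_cvtsd_f64(_mm_add_sd(v, hi));
}

/* Transpose a 2x2 matrix stored as two row registers. */
void transpose2x2(__m128d *r0, __m128d *r1)
{
    __m128d t0 = _mm_unpacklo_pd(*r0, *r1);  /* UNPCKLPD: r0.low,  r1.low  */
    __m128d t1 = _mm_unpackhi_pd(*r0, *r1);  /* UNPCKHPD: r0.high, r1.high */
    *r0 = t0;
    *r1 = t1;
}
```

Since SSE2 has no instruction that adds the two halves of one register, a reduction such as `hsum` is normally done once, after the main loop, rather than at every iteration.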