  1. <?xml version="1.0" encoding="UTF-8"?>
  2. <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN" "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">
  3. <html xmlns="http://www.w3.org/1999/xhtml">
  4. <head>
  5. <title>Chapter Twelve. Digital Data Formats and Their Effects</title>
  6. <link rel="stylesheet" type="text/css" href="9780137028528.css"/>
  7. <link rel="stylesheet" type="application/vnd.adobe-page-template+xml" href="page-template.xpgt"/>
  8. </head>
  9. <body>
  10. <p><a id="ch12"/></p>
  11. <h2><a id="page_623"/>Chapter Twelve. Digital Data Formats and Their Effects</h2>
  12. <p class="image"><img src="graphics/fig12-00.jpg" alt="image"/></p>
  13. <p>In digital signal processing, there are many ways to represent numerical data in computing hardware. These representations, known as <em>data formats</em>, have a profound effect on the accuracy and ease of implementation of any given signal processing algorithm. The simpler data formats enable uncomplicated hardware designs to be used at the expense of a restricted range of number representation and susceptibility to arithmetic errors. The more elaborate data formats are somewhat difficult to implement in hardware, but they allow us to manipulate very large and very small numbers while providing immunity to many problems associated with digital arithmetic. The data format chosen for any given application can mean the difference between processing success and failure&#8212;it&#8217;s where our algorithmic rubber meets the road.</p>
  14. <p>In this chapter, we&#8217;ll introduce the most common types of <em>fixed-point</em> digital data formats and show why and when they&#8217;re used. Next, we&#8217;ll use analog-to-digital (A/D) converter operations to establish the precision and dynamic range afforded by these fixed-point formats along with the inherent errors encountered with their use. Finally, we&#8217;ll cover the interesting subject of <em>floating-point</em> binary formats.</p>
  15. <p><a id="ch12sec1lev1"/></p>
  16. <h3>12.1 Fixed-Point Binary Formats</h3>
  17. <p>Within digital hardware, numbers are represented by binary digits known as bits&#8212;in fact, the term <em>bit</em> originated from the words <em>Binary digIT.</em> A single bit can be in only one of two possible states: either a one or a zero.<sup><a id="ch12fn_01"/><a href="#ch12fn01">&#8224;</a></sup> A six-bit <a id="page_624"/>binary number could, for example, take the form 101101, with the leftmost bit known as the <em>most significant bit</em> (msb); the rightmost bit is called the <em>least significant bit</em> (lsb). The number of bits in a binary number is known as the <em>word length</em>&#8212;hence 101101 has a word length of six. Like the decimal number system so familiar to us, the binary number system assumes a weight associated with each digit in the number. That weight is the base of the system (two for binary numbers and ten for decimal numbers) raised to an integral power. To illustrate this with a simple example, the decimal number 4631 is</p>
  18. <p class="footnotes"><a id="ch12fn01"/><sup><a href="#ch12fn_01">&#8224;</a></sup> Binary numbers are used because early electronic computer pioneers quickly realized that it was much more practical and reliable to use electrical devices (relays, vacuum tubes, transistors, etc.) that had only two states, <em>on</em> or <em>off</em>. Thus, the on/off state of a device could represent a single binary digit.</p>
  19. <p class="caption"><a id="ch12equ01"/>(12-1)</p>
  20. <p class="image"><img src="graphics/eq-12-01.jpg" alt="image"/></p>
  21. <p>The factors 10<sup>3</sup>, 10<sup>2</sup>, 10<sup>1</sup>, and 10<sup>0</sup> are the digit weights in <a href="#ch12equ01">Eq. (12-1)</a>. Similarly, the six-bit binary number 101101 is equal to decimal 45 as shown by</p>
  22. <p class="caption"><a id="ch12equ02"/>(12-2)</p>
  23. <p class="image"><img src="graphics/eq-12-02.jpg" alt="image"/></p>
  24. <p>Using subscripts to signify the base of a number, we can write <a href="#ch12equ02">Eq. (12-2)</a> as 101101<sub>2</sub> = 45<sub>10</sub>. <a href="#ch12equ02">Equation (12-2)</a> shows us that, like decimal numbers, binary numbers use the <em>place value</em> system where the position of a digit signifies its weight. If we use <em>B</em> to denote a number system&#8217;s base, the place value representation of the four-digit number <em>a</em><sub>3</sub><em>a</em><sub>2</sub><em>a</em><sub>1</sub><em>a</em><sub>0</sub> is</p>
  25. <p class="caption"><a id="ch12equ03"/>(12-3)</p>
  26. <p class="image"><img src="graphics/eq-12-03.jpg" alt="image"/></p>
  27. <p>In <a href="#ch12equ03">Eq. (12-3)</a>, <em>B<sup>n</sup></em> is the weight multiplier for the digit <em>a<sub>n</sub></em>, where 0 &#8804; <em>a<sub>n</sub></em> &#8804; <em>B</em>&#8722;1. (This place value system of representing numbers is very old&#8212;so old, in fact, that its origin is obscure. However, with its inherent positioning of the decimal or binary point, this number system is so convenient and powerful that its importance has been compared to that of the alphabet<a href="#ch12en1">[1]</a>.)</p>
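<p>The place value idea of Eq. (12-3) translates directly into code. The following C sketch (ours, for illustration; it handles digits 0 through 9 only) evaluates a digit string in any base <em>B</em> by accumulating the <em>a<sub>n</sub></em><em>B<sup>n</sup></em> terms, reproducing the 101101<sub>2</sub> = 45<sub>10</sub> and 4631<sub>10</sub> results above.</p>
<pre>
#include &lt;stdio.h>

/* Evaluate a digit string (most significant digit first) in base B
   by accumulating digit * B^n place values, as in Eq. (12-3). */
long place_value(const char *digits, int base)
{
    long value = 0;
    for (const char *p = digits; *p != '\0'; p++)
        value = value * base + (*p - '0');   /* shift previous digits up one place */
    return value;
}

int main(void)
{
    printf("%ld\n", place_value("101101", 2));  /* prints 45   */
    printf("%ld\n", place_value("4631", 10));   /* prints 4631 */
    return 0;
}
</pre>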
  28. <p><a id="ch12sec2lev1"/></p>
  29. <h4>12.1.1 Octal Numbers</h4>
  30. <p>As the use of minicomputers and microprocessors rapidly expanded in the 1960s, people grew tired of manipulating long strings of ones and zeros on paper and began to use more convenient ways to represent binary numbers. One way to express a binary number is an octal format, with its base of eight. (Of course, the only valid digits in the octal format are 0 to 7&#8212;the digits 8 and 9 have no meaning in octal representation.)</p>
  31. <p>Converting from binary to octal is as simple as separating the binary number into three-bit groups starting from the right. For example, the binary number 10101001<sub>2</sub> can be converted to octal format as</p>
  32. <p class="center"><strong>10101001<sub>2</sub></strong> &#8594;&#160;&#160;&#160;&#160;10 <sub>&#124;</sub> 101 <sub>&#124;</sub> 001 = 251<sub>8</sub>.</p>
  33. <p><a id="page_625"/>Thus the octal format enables us to represent an eight-digit binary value with a simpler three-digit octal value. However, the relentless march of technology is pushing octal numbers, like wooden tennis rackets, into extinction.</p>
  34. <p><a id="ch12sec2lev2"/></p>
  35. <h4>12.1.2 Hexadecimal Numbers</h4>
  36. <p>Today the predominant binary number representation format is the hexadecimal number format using 16 as its base. Converting from binary to hexadecimal is done, this time, by separating the binary number into four-bit groups starting from the right. The binary number 10101001<sub>2</sub> is converted to hexadecimal format as</p>
  37. <p class="center"><strong>10101001<sub>2</sub></strong> &#8594;&#160;&#160;&#160;&#160;1010 <sub>&#124;</sub> 1001 = <em>A</em>9<sub>16</sub>.</p>
  38. <p>If you haven&#8217;t seen the hexadecimal format used before, don&#8217;t let the <em>A</em>9 digits confuse you. In this format, the characters A, B, C, D, E, and F represent the digits whose decimal values are 10, 11, 12, 13, 14, and 15 respectively. We convert the two groups of bits above to two hexadecimal digits by starting with the left group of bits, 1010<sub>2</sub> = 10<sub>10</sub> = <em>A</em><sub>16</sub>, and 1001<sub>2</sub> = 9<sub>10</sub> = 9<sub>16</sub>. Hexadecimal format numbers also use the place value system, meaning that <em>A</em>9<sub>16</sub> = (<strong><em>A</em></strong> &#183; 16<sup>1</sup> + <strong>9</strong> &#183; 16<sup>0</sup>). For convenience, then, we can represent the eight-digit 10101001<sub>2</sub> with the two-digit number <em>A</em>9<sub>16</sub>. <a href="#ch12tab01">Table 12-1</a> lists the permissible digit representations in the number systems discussed thus far.</p>
  39. <p class="caption"><a id="ch12tab01"/><strong>Table 12-1</strong> Allowable Digit Representations versus Number System Base</p>
  40. <p class="image"><img src="graphics/t0626-01.jpg" alt="image"/></p>
  41. <p>In the above example we used a subscripted 16 to signify a hexadecimal number. Note that it&#8217;s common, in the literature of binary number formats, to have hexadecimal numbers preceded by special characters to signify that indeed they are hexadecimal. You may see, for example, numbers like $A9 or 0xA9 where the &#8220;$&#8221; and &#8220;0x&#8221; characters specify the follow-on digits to be hexadecimal.</p>
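<p>Because programming languages accept and print these bases directly, regrouping bits by hand is rarely necessary in practice. A short C sketch (illustrative only) prints the value 10101001<sub>2</sub> = 169<sub>10</sub> in octal and hexadecimal, confirming the 251<sub>8</sub> and <em>A</em>9<sub>16</sub> groupings above:</p>
<pre>
#include &lt;stdio.h>

int main(void)
{
    unsigned int x = 0xA9;   /* 10101001 in binary, 169 in decimal */

    printf("decimal:     %u\n", x);   /* 169 */
    printf("octal:       %o\n", x);   /* 251 */
    printf("hexadecimal: %X\n", x);   /* A9  */
    return 0;
}
</pre>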
  42. <p><a id="ch12sec2lev3"/></p>
  43. <h4>12.1.3 Sign-Magnitude Binary Format</h4>
  44. <p>For binary numbers to be at all useful in practice, they must be able to represent negative values. Binary numbers do this by dedicating one of the bits in a binary word to indicate the sign of a number. Let&#8217;s consider a popular binary format known as <em>sign magnitude.</em> Here, we assume that a binary word&#8217;s leftmost bit is a sign bit and the remaining bits represent the magnitude of a number that is always positive. For example, we can say that the four-bit number 0011<sub>2</sub> is +3<sub>10</sub> and the binary number 1011<sub>2</sub> is equal to &#8722;3<sub>10</sub>, or</p>
  45. <p class="image"><img src="graphics/625equ01.jpg" alt="image"/></p>
  46. <p><a id="page_626"/>Of course, using one of the bits as a sign bit reduces the magnitude of the numbers we can represent. If an unsigned binary number&#8217;s word length is <em>b</em> bits, the number of different values that can be represented is 2<em><sup>b</sup></em>. An eight-bit word, for example, can represent 2<sup>8</sup> = 256 different integral values. With zero being one of the values we have to express, a <em>b</em>-bit unsigned binary word can represent integers from 0 to 2<em><sup>b</sup></em>&#8722;1. The largest value represented by an unsigned eight-bit word is 2<sup>8</sup>&#8722;1 = 255<sub>10</sub> = 11111111<sub>2</sub>. In the sign-magnitude binary format a <em>b</em>-bit word can represent only a magnitude of &#177;2<sup><em>b</em>&#8722;1</sup>&#8722;1, so the largest positive or negative value we can represent by an eight-bit sign-magnitude word is &#177;2<sup>8&#8722;1</sup>&#8722;1 = &#177;127.</p>
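<p>A brief C sketch of sign-magnitude packing and unpacking for four-bit words follows; the helper names are ours, not standard library routines:</p>
<pre>
#include &lt;stdio.h>

#define B 4                        /* word length in bits      */
#define SIGN_BIT (1u &lt;&lt; (B - 1))   /* leftmost bit is the sign */

/* Pack a small integer (|v| &lt;= 2^(B-1) - 1) into sign-magnitude form. */
unsigned sm_encode(int v)
{
    return (v &lt; 0) ? (SIGN_BIT | (unsigned)(-v)) : (unsigned)v;
}

/* Recover the signed value from a B-bit sign-magnitude word. */
int sm_decode(unsigned w)
{
    int mag = (int)(w &amp; (SIGN_BIT - 1));
    return (w &amp; SIGN_BIT) ? -mag : mag;
}

int main(void)
{
    printf("+3 encodes as %u\n", sm_encode(+3));   /* 3  (0011) */
    printf("-3 encodes as %u\n", sm_encode(-3));   /* 11 (1011) */
    printf("1011 decodes as %d\n", sm_decode(0xB));/* -3        */
    return 0;
}
</pre>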
  47. <p><a id="ch12sec2lev4"/></p>
  48. <h4>12.1.4 Two&#8217;s Complement Format</h4>
  49. <p>Another common binary number scheme, known as the <em>two&#8217;s complement</em> format, also uses the leftmost bit as a sign bit. The two&#8217;s complement format is the most convenient numbering scheme from a hardware design standpoint and has been used for decades. It enables computers to perform both addition and subtraction using the same hardware adder logic. To obtain the negative <a id="page_627"/>version of a positive two&#8217;s complement number, we merely complement (change a one to a zero, and change a zero to a one) each bit, add a binary one to the complemented word, and discard any bits carried beyond the original word length. For example, with 0011<sub>2</sub> representing a decimal 3 in two&#8217;s complement format, we obtain a negative decimal 3 through the following steps:</p>
  50. <p class="image"><img src="graphics/627fig01.jpg" alt="image"/></p>
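<p>The same recipe can be checked in code. A minimal C sketch (ours) forms &#8722;3 from +3 in a four-bit word by inverting the bits, adding one, and keeping only four bits:</p>
<pre>
#include &lt;stdio.h>

/* Negate a 4-bit two's complement word: invert the bits, add one,
   and discard any carry beyond the original word length. */
unsigned negate4(unsigned w)
{
    return (~w + 1u) &amp; 0xFu;
}

int main(void)
{
    unsigned plus3  = 0x3;              /* 0011 = +3 */
    unsigned minus3 = negate4(plus3);   /* 1101 = -3 */
    printf("-3 in 4-bit two's complement: %X\n", minus3);   /* prints D (1101) */
    return 0;
}
</pre>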
  51. <p>In the two&#8217;s complement format, a <em>b</em>-bit word can represent positive amplitudes as great as 2<sup><em>b</em>&#8722;1</sup>&#8722;1, and negative amplitudes as large as &#8722;<em>2<sup>b</sup></em><sup>&#8722;1</sup>. <a href="#ch12tab02">Table 12-2</a> shows four-bit word examples of sign-magnitude and two&#8217;s complement binary formats.</p>
  52. <p class="caption"><a id="ch12tab02"/><strong>Table 12-2</strong> Integer Binary Number Formats</p>
  53. <p class="image"><img src="graphics/t0628-01.jpg" alt="image"/></p>
  54. <p>While using two&#8217;s complement numbers, we have to be careful when adding two numbers of different word lengths. Consider the case where a four-bit number is added to an eight-bit number:</p>
  55. <p class="image"><img src="graphics/627fig02.jpg" alt="image"/></p>
  56. <p>No problem so far. The trouble occurs when our four-bit number is negative. Instead of adding a +3 to the +15, let&#8217;s try to add a &#8722;3 to the +15:</p>
  57. <p class="image"><img src="graphics/627fig03.jpg" alt="image"/></p>
  58. <p>The above arithmetic error can be avoided by performing what&#8217;s called a <em>sign-extend</em> operation on the four-bit number. This process, typically performed automatically in hardware, extends the sign bit of the four-bit negative number to the left, making it an eight-bit negative number. If we sign-extend the &#8722;3 and then perform the addition, we&#8217;ll get the correct answer:</p>
  59. <p class="image"><img src="graphics/627fig04.jpg" alt="image"/></p>
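<p>In hardware the sign-extend happens automatically; in C it can be written explicitly. A minimal sketch (ours) widens a four-bit &#8722;3 to eight bits and repeats the &#8722;3 plus +15 addition above:</p>
<pre>
#include &lt;stdio.h>
#include &lt;stdint.h>

/* Sign-extend a 4-bit two's complement value (held in the low bits
   of 'w') to a full 8-bit two's complement value. */
uint8_t sign_extend_4_to_8(uint8_t w)
{
    if (w &amp; 0x08u)                    /* sign bit of the 4-bit word set?  */
        return (uint8_t)(w | 0xF0u);  /* replicate it into bits 7..4      */
    return (uint8_t)(w &amp; 0x0Fu);
}

int main(void)
{
    uint8_t minus3 = 0x0D;                        /* 1101 = -3 in 4 bits      */
    uint8_t ext    = sign_extend_4_to_8(minus3);  /* 11111101 = -3 in 8 bits  */
    int sum = (int)(int8_t)ext + 15;              /* -3 + 15, two's complement interpretation */
    printf("extended word: %02X, sum: %d\n", (unsigned)ext, sum);   /* FD, 12 */
    return 0;
}
</pre>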
  60. <p><a id="ch12sec2lev5"/></p>
  61. <h4>12.1.5 Offset Binary Format</h4>
  62. <p>Another useful binary number scheme is known as the <em>offset binary</em> format. While this format is not as common as two&#8217;s complement, it still shows up in <a id="page_628"/>some hardware devices. <a href="#ch12tab02">Table 12-2</a> shows offset binary format examples for four-bit words. Offset binary represents numbers by subtracting 2<sup><em>b</em>&#8722;1</sup> from an unsigned binary value. For example, in the second row of <a href="#ch12tab02">Table 12-2</a>, the offset binary number is 1110<sub>2</sub>. When this number is treated as an unsigned binary number, it&#8217;s equivalent to 14<sub>10</sub>. For four-bit words <em>b</em> = 4 and 2<sup><em>b</em>&#8722;1</sup> = 8, so 14<sub>10</sub> &#8722; 8<sub>10</sub> = 6<sub>10</sub>, which is the decimal equivalent of 1110<sub>2</sub> in offset binary. The difference between the unsigned binary equivalent and the actual decimal equivalent of the offset binary numbers in <a href="#ch12tab02">Table 12-2</a> is always &#8722;8. This kind of offset is sometimes referred to as a <em>bias</em> when the offset binary format is used. (It may interest the reader that we can convert back and forth between the two&#8217;s complement and offset binary formats merely by complementing a word&#8217;s most significant bit.)</p>
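<p>That last parenthetical is worth a quick demonstration. A minimal C sketch (ours) converts four-bit offset binary words to signed values both by subtracting the 2<sup><em>b</em>&#8722;1</sup> bias and by flipping the msb and reading the result as two&#8217;s complement; the two methods agree for all sixteen words.</p>
<pre>
#include &lt;stdio.h>

#define B    4
#define BIAS (1 &lt;&lt; (B - 1))        /* 2^(b-1) = 8 for four-bit words */

/* Offset binary to signed value: subtract the bias. */
int offset_to_signed(unsigned w)
{
    return (int)w - BIAS;
}

/* Same conversion done by complementing the msb and interpreting
   the result as a 4-bit two's complement word. */
int offset_to_signed_via_msb(unsigned w)
{
    unsigned t = (w ^ BIAS) &amp; 0xFu;          /* flip the most significant bit */
    return (t &amp; BIAS) ? (int)t - (1 &lt;&lt; B) : (int)t;
}

int main(void)
{
    for (unsigned w = 0; w &lt; 16; w++)
        printf("%2u maps to %3d %3d\n", w, offset_to_signed(w),
               offset_to_signed_via_msb(w));   /* the two columns always match */
    return 0;
}
</pre>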
  63. <p>The history, arithmetic, and utility of the many available number formats is a very broad field of study. A thorough and very readable discussion of the subject is given by Knuth in reference <a href="#ch12en2">[2]</a>.</p>
  64. <p><a id="ch12sec2lev6"/></p>
  65. <h4><a id="page_629"/>12.1.6 Fractional Binary Numbers</h4>
  66. <p>All of the binary numbers we&#8217;ve considered so far had integer decimal values. Noninteger decimal numbers, numbers with nonzero digits to the right of the decimal point, can also be represented with binary numbers if we use a <em>binary point</em>, also called a <em>radix point</em>, identical in function to our familiar decimal point. (As such, in the binary numbers we&#8217;ve discussed so far, the binary point is assumed to be fixed just to the right of the rightmost, lsb, bit.) For example, using the symbol &#9674; to denote a binary point, the six-bit unsigned binary number 11<sub>&#9674;</sub>0101<sub>2</sub> is equal to decimal 3.3125 as shown by</p>
  67. <p class="caption"><a id="ch12equ04"/>(12-4)</p>
  68. <p class="image"><img src="graphics/eq-12-04.jpg" alt="image"/></p>
  69. <p>For our 11<sub>&#9674;</sub>0101<sub>2</sub> example in <a href="#ch12equ04">Eq. (12-4)</a> the binary point is set between the second and third most significant bits and we call that binary number a <em>fractional</em> number. Having a stationary position for the binary point is why this binary number format is called <em>fixed-point binary</em>. The unsigned number 11<sub>&#9674;</sub>0101<sub>2</sub> has two integer bits and four fractional bits, so, in the parlance of binary numbers, such a number is said to have a 2.4, &#8220;two dot four,&#8221; format (two integer bits and four fractional bits).</p>
  70. <p>Two&#8217;s complement binary numbers can also have this <em>integer plus fraction</em> format, and <a href="#ch12tab03">Table 12-3</a> shows, for example, the decimal value ranges for all possible eight-bit two&#8217;s complement fractional binary numbers. Notice how the 8.0-format row in <a href="#ch12tab03">Table 12-3</a> shows the decimal values associated with an eight-bit two&#8217;s complement binary number whose binary point is to the right of the lsb, signifying an all-integer binary number. On the other hand, the 1.7-format row in <a href="#ch12tab03">Table 12-3</a> shows the decimal values associated with an eight-bit two&#8217;s complement binary number whose binary point is just to the right of the msb (the sign bit), signifying an all-fraction binary number.</p>
  71. <p class="caption"><a id="ch12tab03"/><strong>Table 12-3</strong> Eight-Bit, Two&#8217;s Complement, Fractional Format Values</p>
  72. <p class="image"><img src="graphics/t0630-01.jpg" alt="image"/></p>
  73. <p>The decimal value range of a general fractional two&#8217;s complement binary number is</p>
  74. <p class="caption"><a id="ch12equ05"/>(12-5)</p>
  75. <p class="image"><img src="graphics/eq-12-05.jpg" alt="image"/></p>
  76. <p>where the &#8220;# of integer bits&#8221; notation means the number of bits to the left of the binary point and &#8220;# of fraction bits&#8221; means the number of bits to the right of the binary point.</p>
  77. <p><a href="#ch12tab03">Table 12-3</a> teaches us two important lessons. First, we can place the implied binary point anywhere we wish in the eight-bit word, just so long as <a id="page_630"/>everyone accessing the data agrees on that binary point placement and the designer keeps track of that placement throughout all of the system&#8217;s arithmetic computations. Binary arithmetic hardware behavior does not depend on the &#8220;agreed upon&#8221; binary point placement. Stated in different words, binary point placement does not affect two&#8217;s complement binary arithmetic operations. That is, adding or multiplying two binary numbers will yield the same binary result regardless of the implied binary point location within the data words. We leave an example of this behavior as a homework problem.</p>
  78. <p>Second, for a fixed number of bits, fractional two&#8217;s complement binary numbers allow us to represent decimal numbers with poor precision over a wide range of values, or we can represent decimal numbers with fine precision but only over a narrow range of values. In practice you must &#8220;pick your poison&#8221; by choosing the position of the binary point based on what&#8217;s more important to you, number range or number precision.</p>
  79. <p>Due to their 16-bit internal data paths, it&#8217;s very common for programmable 16-bit DSP chips to use a 1.15 format (one integer bit to represent sign, and 15 fractional bits) to represent two&#8217;s complement numbers. These 16-bit signed <em>all-fraction</em> binary numbers are particularly useful because multiplying two such numbers results in an <em>all-fraction</em> product, avoiding any unpleasant binary <em>overflow</em> problems, to be discussed shortly. (Be aware that this 1.15 format is also called <em>Q15 format.</em>) Because the 1.15-format is so commonly used in programmable hardware, we give examples of it and other 16-bit formats in <a href="#ch12tab04">Table 12-4</a>. In that table, the &#8220;resolution&#8221; is the decimal value of the format&#8217;s lsb.</p>
  80. <p class="caption"><a id="ch12tab04"/><strong>Table 12-4</strong> 16-Bit Format Values</p>
  81. <p class="image"><img src="graphics/t0631-01.jpg" alt="image"/></p>
  82. <p>Multiplication of two 1.15 binary words results in a 2.30-format (also called a <em>Q30-format</em>) fractional number. That 32-bit product word contains two <a id="page_631"/>sign bits and 30 fractional bits, with the msb being called an extended sign bit. We have two ways to convert (truncate) such a 32-bit product to the 1.15 format so that it can be stored as a 16-bit word. They are</p>
  83. <p class="indenthangingB">&#8226; shifting the 32-bit word left by one bit and storing the upper 16 bits, and</p>
  84. <p class="indenthangingB">&#8226; shifting the 32-bit word right by 15 bits and storing the lower 16 bits.</p>
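<p>On a 16-bit fixed-point data path a Q15 multiply is typically coded as shown below. This is a minimal sketch, not code from the text; it uses the second option above (an arithmetic right shift by 15) and ignores the one corner case, &#8722;1.0 times &#8722;1.0, whose product cannot be represented in the 1.15 range.</p>
<pre>
#include &lt;stdio.h>
#include &lt;stdint.h>

/* Multiply two 1.15 (Q15) fractional numbers. The 32-bit product is in
   2.30 (Q30) format; shifting right by 15 bits returns it to Q15.
   (Note: 0x8000 * 0x8000, i.e. -1.0 * -1.0, would overflow the Q15 range.) */
int16_t q15_mult(int16_t a, int16_t b)
{
    int32_t product = (int32_t)a * (int32_t)b;   /* 2.30-format product */
    return (int16_t)(product >> 15);             /* back to 1.15 format */
}

int main(void)
{
    int16_t a = 14811;            /* 0.452 in Q15 (0x39DB) */
    int16_t b = 16384;            /* 0.5   in Q15          */
    int16_t p = q15_mult(a, b);   /* expect about 0.226    */
    printf("%d  (%f)\n", p, p / 32768.0);   /* 7405  (0.225983...) */
    return 0;
}
</pre>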
  85. <p><a id="page_632"/>To conclude this fractional binary discussion, we provide the steps to convert a decimal number whose magnitude is less than one, such as an FIR digital filter coefficient, to the 1.15 binary format. As an example, to convert the decimal value 0.452 to the two&#8217;s complement 1.15 binary format:</p>
  86. <p class="indenthangingN">1. Multiply the absolute value of the original decimal number 0.452 by 32768 (2<sup>15</sup>), yielding a scaled decimal 14811.136.</p>
  87. <p class="indenthangingN">2. Round the value 14811.136 to an integer, using your preferred rounding method, producing a scaled decimal value of 14811.</p>
  88. <p class="indenthangingN">3. Convert the decimal 14811 to a binary integer and place the binary point to the right of the msb, yielding 0<sub>&#9674;</sub>011 1001 1101 1011 (39DB<sub>16</sub>).</p>
  89. <p class="indenthangingN">4. If the original decimal value was positive, stop now. If the original decimal value was negative, implement a two&#8217;s complement conversion by inverting Step 3&#8217;s binary bits and adding one.</p>
  90. <p>If you, unfortunately, do not have software to perform the above positive decimal integer to 1.15 binary conversion in Step 3, here&#8217;s how the conversion can be done (painfully) by hand:</p>
  91. <p class="indenthanging3">3.1. Divide 14811 by 2, obtaining integer 7405 plus a remainder of 0.5. Because the remainder is not zero, place a one as the lsb of the desired binary number. Our binary number is 1.</p>
  92. <p class="indenthanging3">3.2. Divide 7405 by 2, obtaining integer 3702 plus a remainder of 0.5. Because the remainder is not zero, place a one as the bit to the left of the lsb bit established in Step 3.1 above. Our binary number is now 11.</p>
  93. <p class="indenthanging3">3.3. Divide 3702 by 2, obtaining integer 1851 plus a remainder of zero. Because the remainder is zero, place a zero as the bit to the left of the bit established in Step 3.2 above. Our binary number is now 011.</p>
  94. <p class="indenthanging3">3.4. Continue this process until the integer portion of the divide-by-two quotient is zero. Append zeros to the left of the binary word to extend its length to 16 bits.</p>
  95. <p>Using the above steps to convert decimal 14811<sub>10</sub> to binary 1.15 format proceeds as shown in <a href="#ch12tab05">Table 12-5</a>, producing our desired binary number of 0<sub>&#9674;</sub>011 1001 1101 1011 (39DB<sub>16</sub>).</p>
  96. <p class="caption"><a id="ch12tab05"/><strong>Table 12-5</strong> Decimal 14811 to Binary 1.15 Conversion Example</p>
  97. <p class="image"><img src="graphics/t0632-01.jpg" alt="image"/></p>
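<p>The four conversion steps above are easy to automate. The following C sketch (ours, not from the text) converts a decimal value of magnitude less than one to the 1.15 format by scaling by 2<sup>15</sup> and rounding; for negative inputs the two&#8217;s complement form of Step 4 falls out of ordinary signed integer arithmetic.</p>
<pre>
#include &lt;stdio.h>
#include &lt;stdint.h>
#include &lt;math.h>

/* Convert a decimal value, -1.0 &lt; x &lt; +1.0, to 1.15 (Q15) format:
   scale by 2^15 = 32768 and round to the nearest integer. Negative
   inputs come out directly in two's complement form. */
int16_t to_q15(double x)
{
    return (int16_t)lround(x * 32768.0);
}

int main(void)
{
    int16_t c = to_q15(0.452);
    printf("0.452  becomes %6d (0x%04X)\n", c, (unsigned)(uint16_t)c);  /*  14811 (0x39DB) */
    c = to_q15(-0.452);
    printf("-0.452 becomes %6d (0x%04X)\n", c, (unsigned)(uint16_t)c);  /* -14811 (0xC625) */
    return 0;
}
</pre>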
  98. <p><a id="ch12sec1lev2"/></p>
  99. <h3>12.2 Binary Number Precision and Dynamic Range</h3>
  100. <p>As we implied earlier, for any binary number format, the number of bits in a data word is a key consideration. The more bits used in the word, the better the resolution of the number, and the larger the maximum value that can be <a id="page_633"/>represented.<sup><a id="ch12fn_02"/><a href="#ch12fn02">&#8224;</a></sup> Assuming that a binary word represents the amplitude of a signal, digital signal processing practitioners find it useful to quantify the dynamic range of various binary number schemes. For a signed integer binary word length of <em>b</em>+1 bits (one sign bit and <em>b</em> magnitude bits), the dynamic range is defined by</p>
  101. <p class="footnotes"><a id="ch12fn02"/><sup><a href="#ch12fn_02">&#8224;</a></sup> Some computers use 64-bit words. Now, 2<sup>64</sup> is approximately equal to 1.8 &#183; 10<sup>19</sup>&#8212;that&#8217;s a pretty large number. So large, in fact, that if we started incrementing a 64-bit counter once per second at the beginning of the universe (&#8776;20 billion years ago), the most significant four bits of this counter would <em>still</em> be all zeros today.</p>
  102. <p class="caption"><a id="ch12equ06"/>(12-6)</p>
  103. <p class="image"><img src="graphics/eq-12-06.jpg" alt="image"/></p>
  104. <p><a id="page_634"/>The dynamic range measured in dB is</p>
  105. <p class="caption"><a id="ch12equ06a"/>(12-6&#8242;)</p>
  106. <p class="image"><img src="graphics/eq-12-06a.jpg" alt="image"/></p>
  107. <p>When 2<em><sup>b</sup></em> is much larger than 1, we can ignore the &#8722;1 in <a href="#ch12equ06a">Eq. (12-6&#8242;)</a> and state that</p>
  108. <p class="caption"><a id="ch12equ06b"/>(12-6&#8243;)</p>
  109. <p class="image"><img src="graphics/eq-12-06b.jpg" alt="image"/></p>
  110. <p><a href="#ch12equ06b">Equation (12-6&#8243;)</a>, dimensioned in dB, tells us that the dynamic range of our number system is directly proportional to the word length. Thus, an eight-bit two&#8217;s complement word, with seven bits available to represent signal magnitude, has a dynamic range of 6.02 &#183; 7 = 42.14 dB. Most people simplify <a href="#ch12equ06b">Eq. (12-6&#8243;)</a> by using the rule of thumb that the dynamic range is equal to &#8220;6 dB per bit.&#8221;</p>
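<p>The rule of thumb and the exact expression differ very little once <em>b</em> reaches even modest values, as the short C check below (illustrative only; the constant 6.02 is just 20log<sub>10</sub>(2)) shows:</p>
<pre>
#include &lt;stdio.h>
#include &lt;math.h>

int main(void)
{
    /* Compare the exact dynamic range 20*log10(2^b - 1) with the
       "6 dB per bit" rule of thumb for a few magnitude-bit counts. */
    for (int b = 4; b &lt;= 16; b += 4) {
        double exact  = 20.0 * log10(pow(2.0, b) - 1.0);
        double approx = 6.02 * b;
        printf("b = %2d:  exact %6.2f dB   approx %6.2f dB\n", b, exact, approx);
    }
    return 0;
}
</pre>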
  111. <p><a id="ch12sec1lev3"/></p>
  112. <h3>12.3 Effects of Finite Fixed-Point Binary Word Length</h3>
  113. <p>The effects of finite binary word lengths touch all aspects of digital signal processing. Using finite word lengths prevents us from representing values with infinite precision, increases the background noise in our spectral estimation techniques, creates nonideal digital filter responses, induces noise in analog-to-digital (A/D) converter outputs, and can (if we&#8217;re not careful) lead to wildly inaccurate arithmetic results. The smaller the word lengths, the greater these problems will be. Fortunately, these finite word-length effects are rather well understood. We can predict their consequences and take steps to minimize any unpleasant surprises. The first finite word-length effect we&#8217;ll cover is the errors that occur during the A/D conversion process.</p>
  114. <p><a id="ch12sec2lev7"/></p>
  115. <h4>12.3.1 A/D Converter Quantization Errors</h4>
  116. <p>Practical A/D converters are constrained to have binary output words of finite length. Commercial A/D converters are categorized by their output word lengths, which are normally in the range from 8 to 16 bits. A typical A/D converter input analog voltage range is from &#8722;1 to +1 volt. If we used such an A/D converter having 8-bit output words, the least significant bit would represent</p>
  117. <p class="caption"><a id="ch12equ07"/>(12-7)</p>
  118. <p class="image"><img src="graphics/eq-12-07.jpg" alt="image"/></p>
  119. <p>What this means is that we can represent continuous (analog) voltages perfectly as long as they&#8217;re integral multiples of 7.81 millivolts&#8212;any intermediate <a id="page_635"/>input voltage will cause the A/D converter to output a <em>best estimate</em> digital data value. The inaccuracies in this process are called <em>quantization errors</em> because an A/D output least significant bit is an indivisible quantity. We illustrate this situation in <a href="#ch12fig01">Figure 12-1(a)</a>, where the continuous waveform is being digitized by an 8-bit A/D converter whose output is in the sign-magnitude format. When we start sampling at time <em>t</em> = 0, the continuous waveform happens to have a value of 31.25 millivolts (mv), and our A/D output data word will be exactly correct for sample <em>x</em>(0). At time <em>T</em> when we get the second A/D output word for sample <em>x</em>(1), the continuous voltage is between 0 and &#8722;7.81 mv. In this case, the A/D converter outputs a sample value of 10000001, representing &#8722;7.81 mv, even though the continuous input was not quite as negative as &#8722;7.81 mv. The 10000001 A/D output word contains some quantization error. Each successive sample contains quantization error because the A/D&#8217;s digitized <a id="page_636"/>output values must lie on the horizontal line in <a href="#ch12fig01">Figure 12-1(a)</a>. The difference between the actual continuous input voltage and the A/D converter&#8217;s representation of the input is shown as the quantization error in <a href="#ch12fig01">Figure 12-1(b)</a>. For an ideal A/D converter, the quantization error, a kind of <em>roundoff</em> noise, can never be greater than &#177;1/2 an lsb, or &#177;3.905 mv.</p>
  120. <p class="caption"><a id="ch12fig01"/><strong>Figure 12-1</strong> Quantization errors: (a) digitized <em>x(n)</em> values of a continuous signal; (b) quantization error between the actual analog signal values and the digitized signal values.</p>
  121. <p class="image"><img src="graphics/fig12-01.jpg" alt="image"/></p>
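<p>To make the &#177;1/2 lsb bound concrete, the sketch below (ours) quantizes a few arbitrary input voltages with the <em>q</em> = 7.81 mv step of the eight-bit, &#177;1 volt converter described above and prints the resulting quantization error; no error magnitude exceeds <em>q</em>/2 &#8776; 3.9 mv.</p>
<pre>
#include &lt;stdio.h>
#include &lt;math.h>

#define B  8      /* A/D converter word length in bits  */
#define VP 1.0    /* full-scale input range is -VP..+VP  */

int main(void)
{
    double q = 2.0 * VP / pow(2.0, B);   /* one lsb = 7.8125 mv */
    double inputs[] = { 0.03125, -0.0051, 0.7237, -0.9999 };

    for (int i = 0; i &lt; 4; i++) {
        /* Ideal quantizer: pick the nearest output level. */
        double quantized = q * round(inputs[i] / q);
        double error     = quantized - inputs[i];
        printf("in %9.5f v   out %9.5f v   error %8.5f v\n",
               inputs[i], quantized, error);
    }
    return 0;
}
</pre>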
  122. <p>While <a href="#ch12fig01">Figure 12-1(b)</a> shows A/D quantization noise in the time domain, we can also illustrate this noise in the frequency domain. <a href="#ch12fig02">Figure 12-2(a)</a> depicts a continuous sinewave of one cycle over the sample interval shown as the dashed line and a quantized version of the time-domain samples of that <a id="page_637"/>wave as the dots. Notice how the quantized version of the wave is constrained to have only integral values, giving it a <em>stair-step</em> effect oscillating above and below the true unquantized sinewave. The quantization here is four bits, meaning that we have a sign bit and three bits to represent the magnitude of the wave. With three bits, the maximum peak values for the wave are &#177;7. <a href="#ch12fig02">Figure 12-2(b)</a> shows the discrete Fourier transform (DFT) of a discrete version of the sinewave whose time-domain sample values are not forced to be integers but have high precision. Notice in this case that the DFT has a nonzero value only at <em>m</em> = 1. On the other hand, <a href="#ch12fig02">Figure 12-2(c)</a> shows the spectrum of the four-bit quantized samples in <a href="#ch12fig02">Figure 12-2(a)</a>, where quantization effects have induced noise components across the entire spectral band. If the quantization noise depictions in <a href="#ch12fig01">Figures 12-1(b)</a> and <a href="#ch12fig02">12-2(c)</a> look random, that&#8217;s because they are. As it turns out, even though A/D quantization noise is random, we can still quantify its effects in a useful way.</p>
  123. <p class="caption"><a id="ch12fig02"/><strong>Figure 12-2</strong> Quantization noise effects: (a) input sinewave applied to a 64-point DFT; (b) theoretical DFT magnitude of high-precision sinewave samples; (c) DFT magnitude of a sinewave quantized to four bits.</p>
  124. <p class="image"><img src="graphics/fig12-02.jpg" alt="image"/></p>
  125. <p>In the field of communications, people often use the notion of output signal-to-noise ratio, or SNR = (signal power)/(noise power), to judge the usefulness of a process or device. We can do likewise and obtain an important expression for the output SNR of an ideal A/D converter, <em>SNR</em><sub>A/D</sub>, accounting for finite word-length quantization effects. Because quantization noise is random, we can&#8217;t explicitly represent its power level, but we can use its statistical equivalent of variance to define <em>SNR</em><sub>A/D</sub> measured in dB as</p>
  126. <p class="caption"><a id="ch12equ08"/>(12-8)</p>
  127. <p class="image"><img src="graphics/eq-12-08.jpg" alt="image"/></p>
  128. <p>Next, we&#8217;ll determine an A/D converter&#8217;s quantization noise variance relative to the converter&#8217;s maximum input peak voltage <em>V<sub>p</sub></em>. If the full-scale (&#8722;<em>V<sub>p</sub></em> to +<em>V<sub>p</sub></em> volts) continuous input range of a <em>b</em>-bit A/D converter is 2<em>Vp</em>, a single quantization level <em>q</em> is that voltage range divided by the number of possible A/D output binary values, or <em>q</em> = 2<em>V<sub>p</sub></em>/2<em><sup>b</sup></em>. (In <a href="#ch12fig01">Figure 12-1</a>, for example, the quantization level <em>q</em> is the lsb value of 7.81 mv.) A depiction of the likelihood of encountering any given quantization error value, called the probability density function <em>p(e)</em> of the quantization error, is shown in <a href="#ch12fig03">Figure 12-3</a>.</p>
  129. <p class="caption"><a id="ch12fig03"/><strong>Figure 12-3</strong> Probability density function of A/D conversion roundoff error (noise).</p>
  130. <p class="image"><img src="graphics/fig12-03.jpg" alt="image"/></p>
  131. <p>This simple rectangular function has much to tell us. It indicates that there&#8217;s an equal chance that any error value between &#8722;<em>q</em>/2 and +<em>q</em>/2 can occur. By definition, because probability density functions have an area of unity (i.e., the probability is 100 percent that the error will be somewhere under the curve), the amplitude of the <em>p(e)</em> density function must be the area <a id="page_638"/>divided by the width, or <em>p(e)</em> = 1/<em>q</em>. From <a href="app04.html#app04fig07">Figure D-7</a> and <a href="app04.html#app04equ29">Eq. (D-29)</a> in <a href="app04.html#app04">Appendix D</a>, the variance of our uniform <em>p(e)</em> is</p>
  132. <p class="caption"><a id="ch12equ09"/>(12-9)</p>
  133. <p class="image"><img src="graphics/eq-12-09.jpg" alt="image"/></p>
  134. <p>We can now express the A/D noise error variance in terms of A/D parameters by replacing <em>q</em> in <a href="#ch12equ09">Eq. (12-9)</a> with <em>q</em> = 2<em>V<sub>p</sub></em>/2<em><sup>b</sup></em> to get</p>
  135. <p class="caption"><a id="ch12equ10"/>(12-10)</p>
  136. <p class="image"><img src="graphics/eq-12-10.jpg" alt="image"/></p>
  137. <p>OK, we&#8217;re halfway to our goal&#8212;with <a href="#ch12equ10">Eq. (12-10)</a> giving us the denominator of <a href="#ch12equ08">Eq. (12-8)</a>, we need the numerator. To arrive at a general result, let&#8217;s express the input signal in terms of its root mean square (rms), the A/D converter&#8217;s peak voltage, and a loading factor <em>LF</em> defined as</p>
  138. <p class="caption"><a id="ch12equ11"/>(12-11)</p>
  139. <p class="image"><img src="graphics/eq-12-11.jpg" alt="image"/></p>
  140. <p class="footnotes"><a id="ch12fn03"/><sup>&#8224;</sup> As covered in <a href="app04.html#app04">Appendix D</a>, <a href="app04.html#app04sec1lev2">Section D.2</a>, although the variance &#963;<sup>2</sup> is associated with the power of a signal, the standard deviation is associated with the rms value of a signal.</p>
  141. <p>With the loading factor defined as the input rms voltage over the A/D converter&#8217;s peak input voltage, we square and rearrange <a href="#ch12equ11">Eq. (12-11)</a> to show the signal variance <img src="graphics/638fig01.jpg" alt="image"/> as</p>
  142. <p class="caption"><a id="ch12equ12"/>(12-12)</p>
  143. <p class="image"><img src="graphics/eq-12-12.jpg" alt="image"/></p>
  144. <p><a id="page_639"/>Substituting <a href="#ch12equ10">Eqs. (12-10)</a> and <a href="#ch12equ12">(12-12)</a> in <a href="#ch12equ08">Eq. (12-8)</a>,</p>
  145. <p class="caption"><a id="ch12equ13"/>(12-13)</p>
  146. <p class="image"><img src="graphics/eq-12-13.jpg" alt="image"/></p>
  147. <p><a href="#ch12equ13">Equation (12-13)</a> gives us the <em>SNR</em><sub>A/D</sub> of an ideal <em>b</em>-bit A/D converter in terms of the loading factor and the number of bits <em>b</em>. <a href="#ch12fig04">Figure 12-4</a> plots <a href="#ch12equ13">Eq. (12-13)</a> for various A/D word lengths as a function of the loading factor. Notice that the loading factor in <a href="#ch12fig04">Figure 12-4</a> is never greater than &#8722;3 dB, because the maximum continuous A/D input peak value must not be greater than <em>V<sub>p</sub></em> volts. Thus, for a sinusoid input, its rms value must not be greater than <img src="graphics/639fig01.jpg" alt="image"/> volts (3 dB below <em>V<sub>p</sub></em>).</p>
  148. <p class="caption"><a id="ch12fig04"/><strong>Figure 12-4</strong> <em>SNR</em><sub>A/D</sub> of ideal A/D converters as a function of loading factor in dB.</p>
  149. <p class="image"><img src="graphics/fig12-04.jpg" alt="image"/></p>
  150. <p>When the input sinewave&#8217;s peak amplitude is equal to the A/D converter&#8217;s full-scale voltage <em>V<sub>p</sub></em>, the full-scale <em>LF</em> is</p>
  151. <p class="caption"><a id="ch12equ14"/>(12-14)</p>
  152. <p class="image"><img src="graphics/eq-12-14.jpg" alt="image"/></p>
  153. <p>Under this condition, the maximum A/D output SNR from <a href="#ch12equ13">Eq. (12-13)</a> is</p>
  154. <p class="caption"><a id="ch12equ15"/>(12-15)</p>
  155. <p class="image"><img src="graphics/eq-12-15.jpg" alt="image"/></p>
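<p>Carrying out the substitution of Eqs. (12-10) and (12-12) into Eq. (12-8) yields the closed form <em>SNR</em><sub>A/D</sub> = 6.02<em>b</em> + 4.77 + 20log<sub>10</sub>(<em>LF</em>) dB. The short C sketch below (ours) evaluates that form; with the full-scale sinewave loading factor of 0.707 from Eq. (12-14) it reproduces the 6.02<em>b</em> + 1.76 dB result of Eq. (12-15).</p>
<pre>
#include &lt;stdio.h>
#include &lt;math.h>

/* Ideal b-bit A/D converter SNR, in dB, as a function of the loading
   factor LF = (input rms)/Vp.  Closed form of Eq. (12-13). */
double snr_ad_db(int b, double lf)
{
    return 6.02 * b + 4.77 + 20.0 * log10(lf);
}

int main(void)
{
    double lf_fullscale_sine = 1.0 / sqrt(2.0);   /* rms of a full-scale sinewave */

    for (int b = 8; b &lt;= 16; b += 4)
        printf("b = %2d bits:  SNR = %5.1f dB\n", b, snr_ad_db(b, lf_fullscale_sine));
    /* prints roughly 49.9, 74.0, 98.1 dB, i.e. 6.02b + 1.76 */
    return 0;
}
</pre>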
  156. <p><a id="page_640"/>This discussion of SNR relative to A/D converters means three important things to us:</p>
  157. <p class="indenthangingN"><strong>1.</strong> An ideal A/D converter will have an <em>SNR</em><sub>A/D</sub> defined by <a href="#ch12equ13">Eq. (12-13)</a>, so any discrete <em>x</em>(<em>n</em>) signal produced by a <em>b</em>-bit A/D converter can never have an SNR greater than that given by <a href="#ch12equ13">Eq. (12-13)</a>. (<a href="app04.html#app04">Appendix D</a> discusses methods for computing the SNR of discrete signals.) For example, let&#8217;s say we want to digitize a continuous signal whose SNR is 55 dB. Using an ideal eight-bit A/D converter with its full-scale <em>SNR</em><sub>A/D</sub> of 6.02 &#183; 8 + 1.76 = 49.9 dB from <a href="#ch12equ15">Eq. (12-15)</a>, the quantization noise will contaminate the digitized values, and the resultant digital signal&#8217;s SNR can be no better than 49.9 dB. We&#8217;ll have lost signal SNR through the A/D conversion process. (A ten-bit A/D, with its ideal <em>SNR</em><sub>A/D</sub> &#8776; 62 dB, could be used to digitize a 55 dB SNR continuous signal to reduce the SNR degradation caused by quantization noise.) <a href="#ch12equ13">Equations (12-13)</a> and <a href="#ch12equ15">(12-15)</a> apply to ideal A/D converters and don&#8217;t take into account such additional A/D noise sources as aperture jitter error, missing output bit patterns, and other nonlinearities. So actual A/D converters are likely to have SNRs that are lower than that indicated by theoretical <a href="#ch12equ13">Eq. (12-13)</a>. To be safe in practice, it&#8217;s sensible to assume that <em>SNR</em><sub>A/D-max</sub> is 3 to 6 dB lower than indicated by <a href="#ch12equ15">Eq. (12-15)</a>.</p>
  158. <p class="indenthangingN"><strong>2.</strong> <a href="#ch12equ15">Equation (12-15)</a> is often expressed in the literature, but it can be a little misleading because it&#8217;s imprudent to force an A/D converter&#8217;s input to full scale. It&#8217;s wise to drive an A/D converter to some level below full scale because inadvertent overdriving will lead to signal clipping and will induce distortion in the A/D&#8217;s output. So <a href="#ch12equ15">Eq. (12-15)</a> is overly optimistic, and, in practice, A/D converter SNRs will be less than indicated by <a href="#ch12equ15">Eq. (12-15)</a>. The best approximation for an A/D&#8217;s SNR is to determine the input signal&#8217;s rms value that will never (or rarely) overdrive the converter input, and plug that value into <a href="#ch12equ11">Eq. (12-11)</a> to get the loading factor value for use in <a href="#ch12equ13">Eq. (12-13)</a>.<sup><a id="ch12fn_04"/><a href="#ch12fn04">&#8224;</a></sup> Again, using an A/D converter with a wider word length will alleviate this problem by increasing the available <em>SNR</em><sub>A/D</sub>.</p>
  159. <p class="footnotes"><a id="ch12fn04"/><sup><a href="#ch12fn_04">&#8224;</a></sup> By the way, some folks use the term <em>crest factor</em> to describe how hard an A/D converter&#8217;s input is being driven. The crest factor is the reciprocal of the loading factor, or <em>CF</em> = <em>V<sub>p</sub></em>/(rms of the input signal).</p>
  160. <p class="indenthangingN"><strong>3.</strong> Remember, now, real-world continuous signals always have their own inherent continuous SNR, so using an A/D converter whose <em>SNR</em><sub>A/D</sub> is a great deal larger than the continuous signal&#8217;s SNR serves no purpose. In this case, we would be wasting A/D converter bits by digitizing the analog signal&#8217;s noise to a high degree of accuracy, which does not improve <a id="page_641"/>our digital signal&#8217;s overall SNR. In general, we want the converter&#8217;s <em>SNR</em><sub>A/D</sub> value to be approximately 6 dB greater than an analog signal&#8217;s SNR.</p>
  161. <p>A word of caution is appropriate here concerning our analysis of A/D converter quantization errors. The derivations of <a href="#ch12equ13">Eqs. (12-13)</a> and <a href="#ch12equ15">(12-15)</a> are based upon three assumptions:</p>
  162. <p class="indenthangingN"><strong>1.</strong> The cause of A/D quantization errors is a stationary random process; that is, the performance of the A/D converter does not change over time. Given the same continuous input voltage, we always expect an A/D converter to provide exactly the same output binary code.</p>
  163. <p class="indenthangingN"><strong>2.</strong> The probability density function of the A/D quantization error is uniform. We&#8217;re assuming that the A/D converter is ideal in its operation and all possible errors between &#8722;<em>q</em>/2 and +<em>q</em>/2 are equally likely. An A/D converter having stuck bits or missing output codes would violate this assumption. High-quality A/D converters being driven by continuous signals that cross many quantization levels will result in our desired uniform quantization noise probability density function.</p>
  164. <p class="indenthangingN"><strong>3.</strong> The A/D quantization errors are uncorrelated with the continuous input signal. If we were to digitize a single continuous sinewave whose frequency was harmonically related to the A/D sample rate, we&#8217;d end up sampling the same input voltage repeatedly and the quantization error sequence would not be random. The quantization error would be predictable and repetitive, and our quantization noise variance derivation would be invalid. In practice, complicated continuous signals such as music or speech, with their rich spectral content, avoid this problem.</p>
  165. <p>To conclude our discussion of A/D converters, let&#8217;s consider one last topic. In the literature the reader is likely to encounter the expression</p>
  166. <p class="caption"><a id="ch12equ16"/>(12-16)</p>
  167. <p class="image"><img src="graphics/eq-12-16.jpg" alt="image"/></p>
  168. <p><a href="#ch12equ16">Equation (12-16)</a> is used by test equipment manufacturers to specify the sensitivity of test instruments using a <em>b</em><sub>eff</sub> parameter known as the number of <em>effective bits</em>, or effective number of bits (ENOB)[<a href="#ch12en3">3</a>&#8211;<a href="#ch12en8">8</a>]. <a href="#ch12equ16">Equation (12-16)</a> is merely <a href="#ch12equ15">Eq. (12-15)</a> solved for <em>b</em> and is based on the assumption that the A/D converter&#8217;s analog input peak-peak voltage spans roughly 90 percent of the converter&#8217;s full-scale voltage range. Test equipment manufacturers measure the actual SNR of their product, indicating its ability to capture continuous input signals relative to the instrument&#8217;s inherent noise characteristics. Given this true SNR, they use <a href="#ch12equ16">Eq. (12-16)</a> to determine the <em>b</em><sub>eff</sub> value for advertisement in their product literature. The larger the <em>b</em><sub>eff</sub>, the greater the continuous voltage <a id="page_642"/>that can be accurately digitized relative to the equipment&#8217;s intrinsic quantization noise.</p>
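<p>Solving Eq. (12-15) for <em>b</em>, as described above, gives <em>b</em><sub>eff</sub> = (SNR<sub>measured</sub> &#8722; 1.76)/6.02, which is simple to evaluate. A one-function C sketch (ours), applied to a hypothetical instrument whose measured SNR is 44 dB:</p>
<pre>
#include &lt;stdio.h>

/* Effective number of bits from a measured SNR, per Eq. (12-15)
   solved for b:  b_eff = (SNR_dB - 1.76)/6.02. */
double enob(double measured_snr_db)
{
    return (measured_snr_db - 1.76) / 6.02;
}

int main(void)
{
    /* A hypothetical instrument whose measured SNR is 44 dB. */
    printf("ENOB = %.1f bits\n", enob(44.0));   /* about 7.0 bits */
    return 0;
}
</pre>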
  169. <p><a id="ch12sec2lev8"/></p>
  170. <h4>12.3.2 Data Overflow</h4>
  171. <p>The next finite word-length effect we&#8217;ll consider is called <em>overflow</em>. Overflow is what happens when the result of an arithmetic operation has too many bits, or digits, to be represented in the hardware registers designed to contain that result. We can demonstrate this situation to ourselves rather easily using a simple four-function, eight-digit pocket calculator. The sum of a decimal 9.9999999 plus 1.0 is 10.9999999, but on an eight-digit calculator the sum is 10.999999 as
  172. <p class="image"><img src="graphics/642fig01.jpg" alt="image"/></p>
  173. <p>The hardware registers, which contain the arithmetic result and drive the calculator&#8217;s display, can hold only eight decimal digits; so the least significant digit is discarded (of course). Although the above error is less than one part in ten million, overflow effects can be striking when we work with large numbers. If we use our calculator to add 99,999,999 plus 1, instead of getting the correct result of 100 million, we&#8217;ll get a result of 1. Now that&#8217;s an authentic overflow error!</p>
  174. <p>Let&#8217;s illustrate overflow effects with examples more closely related to our discussion of binary number formats. First, adding two unsigned binary numbers is as straightforward as adding two decimal numbers. The sum of 42 plus 39 is 81, or</p>
  175. <p class="image"><img src="graphics/642fig02.jpg" alt="image"/></p>
  176. <p>In this case, two 6-bit binary numbers required 7 bits to represent the results. The general rule is <em>the sum of</em> m <em>individual</em> b<em>-bit binary numbers can require as many as</em> [b <em>+ log<sub>2</sub>(</em>m<em>)</em>] <em>bits to represent the results</em>. So, for example, a 24-bit result register (accumulator) is needed to accumulate the sum of sixteen 20-bit binary numbers, or 20 + log<sub>2</sub>(16) = 24. The sum of 256 eight-bit words requires an accumulator whose word length is [8 + log<sub>2</sub>(256)], or 16 bits, to ensure that no overflow errors occur.</p>
  177. <p><a id="page_643"/>In the preceding example, if our accumulator word length was six bits, an overflow error occurs as</p>
  178. <p class="image"><img src="graphics/643fig01.jpg" alt="image"/></p>
  179. <p>Here, the most significant bit of the result overflowed the six-bit accumulator, and an error occurred.</p>
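<p>The accumulator-sizing rule quoted above, <em>b</em> + log<sub>2</sub>(<em>m</em>) bits, is a one-liner in code. A minimal C sketch (ours):</p>
<pre>
#include &lt;stdio.h>
#include &lt;math.h>

/* Accumulator word length needed to sum 'm' b-bit words without
   overflow:  b + ceil(log2(m)) bits. */
int accumulator_bits(int b, int m)
{
    return b + (int)ceil(log2((double)m));
}

int main(void)
{
    printf("%d\n", accumulator_bits(20, 16));   /* 24 */
    printf("%d\n", accumulator_bits(8, 256));   /* 16 */
    return 0;
}
</pre>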
  180. <p>With regard to overflow errors, the two&#8217;s complement binary format has two interesting characteristics. First, under certain conditions, overflow during the summation of two numbers causes no error. Second, with multiple summations, intermediate overflow errors cause no problems if the final magnitude of the sum of the <em>b</em>-bit two&#8217;s complement numbers is less than 2<sup><em>b</em>&#8722;1</sup>. Let&#8217;s illustrate these properties by considering the four-bit two&#8217;s complement format in <a href="#ch12fig05">Figure 12-5</a>, whose binary values are taken from <a href="#ch12tab02">Table 12-2</a>.</p>
  181. <p class="caption"><a id="ch12fig05"/><strong>Figure 12-5</strong> Four-bit two&#8217;s complement binary numbers.</p>
  182. <p class="image"><img src="graphics/fig12-05.jpg" alt="image"/></p>
  183. <p>The first property of two&#8217;s complement overflow, which sometimes causes no errors, can be shown by the following examples:</p>
  184. <p class="image"><a id="page_644"/><img src="graphics/644fig01.jpg" alt="image"/></p>
  185. <p>Then again, the following examples show how two&#8217;s complement overflow sometimes does cause errors:</p>
  186. <p class="image"><img src="graphics/644fig02.jpg" alt="image"/></p>
  187. <p>The rule with two&#8217;s complement addition is <em>if the carry bit into the sign bit is the same as the overflow bit out of the sign bit, the overflow bit can be ignored, causing no errors; if the carry bit into the sign bit is different from the overflow bit out of the sign bit, the result is invalid</em>. An even more interesting property of two&#8217;s complement numbers is that a series of <em>b</em>-bit word summations can be performed where intermediate sums are invalid, but the final sum will be correct if its magnitude is less than 2<sup><em>b</em>&#8722;1</sup>. We show this by the following example. If we add a +6 to a +7, and then add a &#8722;7, we&#8217;ll encounter an intermediate overflow error but our final sum will be correct, as</p>
  188. <p class="image"><img src="graphics/644fig03.jpg" alt="image"/></p>
  189. <p><a id="page_645"/>The magnitude of the sum of the three four-bit numbers was less than 2<sup>4&#8722;1</sup> (&lt;8), so our result was valid. If we add a +6 to a +7, and next add a &#8722;5, we&#8217;ll encounter an intermediate overflow error, and our final sum will also be in error because its magnitude is not less than 8.</p>
  190. <p class="image"><img src="graphics/644fig04.jpg" alt="image"/></p>
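<p>This wraparound behavior is easy to verify in code. The sketch below (ours) performs every sum in four-bit two&#8217;s complement by masking each result to four bits; the +6, +7, &#8722;7 sequence recovers the correct +6 despite the invalid intermediate sum, while the +6, +7, &#8722;5 sequence ends at &#8722;8 because the true sum, +8, exceeds the four-bit range.</p>
<pre>
#include &lt;stdio.h>

/* Interpret the low four bits of 'w' as a 4-bit two's complement value. */
int as_4bit_signed(unsigned w)
{
    w &amp;= 0xFu;
    return (w &amp; 0x8u) ? (int)w - 16 : (int)w;
}

/* Add two values in 4-bit two's complement, discarding carries. */
unsigned add4(unsigned a, unsigned b)
{
    return (a + b) &amp; 0xFu;
}

int main(void)
{
    unsigned s = add4(6, 7);           /* intermediate sum overflows: 1101 = -3 */
    printf("6 + 7     gives %d\n", as_4bit_signed(s));
    s = add4(s, (unsigned)-7);         /* -7 = 1001 in 4-bit two's complement   */
    printf("6 + 7 - 7 gives %d\n", as_4bit_signed(s));   /* +6, correct         */

    s = add4(add4(6, 7), (unsigned)-5);
    printf("6 + 7 - 5 gives %d\n", as_4bit_signed(s));   /* -8, invalid (true sum is +8) */
    return 0;
}
</pre>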
  191. <p>Another situation where overflow problems are conspicuous is during the calculation of the fast Fourier transform (FFT). It&#8217;s difficult at first to imagine that multiplying complex numbers by sines and cosines can lead to excessive data word growth&#8212;particularly because sines and cosines are never greater than unity. Well, we can show how FFT data word growth occurs by considering a decimation-in-time FFT butterfly from <a href="ch04.html#ch04fig14">Figure 4-14(c)</a>, repeated here as <a href="#ch12fig06">Figure 12-6(a)</a>, and grinding through a little algebra. The expression for the <em>x</em>&#8217; output of this FFT butterfly, from <a href="ch04.html#ch04equ26">Eq. (4-26)</a>, is</p>
  192. <p class="caption"><a id="ch12equ17"/>(12-17)</p>
  193. <p class="image"><img src="graphics/eq-12-17.jpg" alt="image"/></p>
  194. <p class="caption"><a id="ch12fig06"/><strong>Figure 12-6</strong> Data overflow scenarios: (a) single decimation-in-time FFT butterfly; (b) 2nd-order IIR filter.</p>
  195. <p class="image"><img src="graphics/fig12-06.jpg" alt="image"/></p>
  196. <p>Breaking up the butterfly&#8217;s <em>x</em> and <em>y</em> inputs into their real and imaginary parts and remembering that <img src="graphics/645fig01.jpg" alt="image"/>, we can express <a href="#ch12equ17">Eq. (12-17)</a> as</p>
  197. <p class="caption"><a id="ch12equ18"/>(12-18)</p>
  198. <p class="image"><img src="graphics/eq-12-18.jpg" alt="image"/></p>
  199. <p>If we let &#945; be the twiddle factor angle of 2&#960;<em>k/N</em>, and recall that <em>e</em><sup>&#8722;<em>j</em>&#945;</sup> = cos(&#945;) &#8722; <em>j</em>sin(&#945;), we can simplify <a href="#ch12equ18">Eq. (12-18)</a> as</p>
  200. <p class="caption"><a id="ch12equ19"/>(12-19)</p>
  201. <p class="image"><img src="graphics/eq-12-19.jpg" alt="image"/></p>
  202. <p><a id="page_646"/>If we look, for example, at just the real part of the <em>x</em>&#8217; output, <em>x</em>&#8217;<sub>real</sub>, it comprises the three terms</p>
  203. <p class="caption"><a id="ch12equ20"/>(12-20)</p>
  204. <p class="image"><img src="graphics/eq-12-20.jpg" alt="image"/></p>
  205. <p>If <em>x</em><sub>real</sub>, <em>y</em><sub>real</sub>, and <em>y</em><sub>imag</sub> are of unity value when they enter the butterfly and the twiddle factor angle &#945; = 2&#960;<em>k/N</em> happens to be &#960;/4 = 45&#176;, then, <em>x</em>&#8217;<sub>real</sub> can be greater than 2 as</p>
  206. <p class="caption"><a id="ch12equ21"/>(12-21)</p>
  207. <p class="image"><img src="graphics/eq-12-21.jpg" alt="image"/></p>
  208. <p>So we see that the real part of a complex number can more than double in magnitude in a single stage of an FFT. The imaginary part of a complex number is equally likely to more than double in magnitude in a single FFT stage. Without mitigating this word growth problem, overflow errors could render an FFT algorithm useless.</p>
  209. <p>Overflow problems can also be troublesome for fixed-point systems containing feedback as shown in <a href="#ch12fig06">Figure 12-6(b)</a>. Examples of such networks are infinite impulse response (IIR) filters, cascaded integrator-comb (CIC) filters, and exponential averagers. The hardware register (accumulator) containing <em>w</em>(<em>n</em>) must have a binary word width that will hold data values as large as the network&#8217;s DC (zero Hz) gain <em>G</em> times the input signal, or <em>G</em> &#183; <em>x</em>(<em>n</em>). To avoid data overflow, the number of bits in the <em>w</em>(<em>n</em>)-results register must be at least</p>
  210. <p class="caption"><a id="ch12equ22"/>(12-22)</p>
  211. <p class="image"><img src="graphics/eq-12-22.jpg" alt="image"/></p>
  212. <p>where <img src="graphics/646inline01.jpg" alt="image"/> means that if log<sub>2</sub>(<em>G</em>) is not an integer, round it up to the next larger integer. (As a quick reminder, we can determine the DC gain of a digital network by substituting <em>z</em> = 1 in the network&#8217;s <em>z</em>-domain transfer function.)</p>
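<p>As a small illustration of that last point, the sketch below uses hypothetical coefficients and assumes a simple all-pole 2nd-order section; it computes the DC gain by substituting <em>z</em> = 1 into the transfer function and then the rounded-up log<sub>2</sub>(<em>G</em>) term of Eq. (12-22), the extra headroom bits needed beyond the <em>x</em>(<em>n</em>) word width.</p>
<pre>
#include &lt;stdio.h>
#include &lt;math.h>

int main(void)
{
    /* Hypothetical all-pole 2nd-order IIR:  H(z) = 1/(1 - a1*z^-1 - a2*z^-2).
       Its DC gain is found by substituting z = 1. */
    double a1 = 1.6, a2 = -0.64;
    double dc_gain = 1.0 / (1.0 - a1 - a2);      /* G = 25       */
    int headroom   = (int)ceil(log2(dc_gain));   /* 5 extra bits */

    printf("DC gain G = %.1f, headroom = %d bits beyond the x(n) width\n",
           dc_gain, headroom);
    return 0;
}
</pre>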
  213. <p>OK, overflow problems are handled in one of two ways&#8212;by truncation or rounding&#8212;each inducing its own individual kind of quantization errors, as we shall see.</p>
  214. <p><a id="ch12sec2lev9"/></p>
  215. <h4>12.3.3 Truncation</h4>
  216. <p>Truncation is the process where some number of least significant bits are discarded from a binary number. A practical example of truncation is the situation where the results of a processing system are 16-bit signal samples that must be passed on to a 12-bit digital-to-analog converter. To avoid overflowing the converter&#8217;s 12-bit input register, the least significant 4 bits of the 16-bit signal samples must be discarded. Thinking about decimal numbers, if we&#8217;re quantizing to decimal integer values, for example, the real value 1.2 would be quantized to 1.</p>
  217. <p><a id="page_647"/>An example of truncation to integer values is shown in <a href="#ch12fig07">Figure 12-7(a)</a>, where all values of <em>x</em> in the range of 0 &#8804; <em>x</em> &lt; 1 are set equal to 0, values of <em>x</em> in the range of 1 &#8804; <em>x</em> &lt; 2 are set equal to 1, and so on. The quantization level (value), in that figure, is <em>q</em> = 1. The quantization error induced by this truncation is the vertical distance between the horizontal bold lines and the dashed diagonal line in <a href="#ch12fig07">Figure 12-7(a)</a>.</p>
  218. <p class="caption"><a id="ch12fig07"/><strong>Figure 12-7</strong> Truncation: (a) quantization nonlinearities; (b) error probability density function; (c) binary truncation.</p>
  219. <p class="image"><img src="graphics/fig12-07.jpg" alt="image"/></p>
  220. <p>As we did with A/D converter quantization errors, we can call upon the concept of probability density functions to characterize the quantization errors induced by truncation. The probability density function of truncation errors, in terms of the quantization level <em>q</em>, is shown in <a href="#ch12fig07">Figure 12-7(b)</a>. In <a id="page_648"/><a href="#ch12fig07">Figure 12-7(a)</a> the quantization level <em>q</em> is 1, so in this case we can have truncation errors as great as &#8722;1. Drawing upon our results from <a href="app04.html#app04equ11">Eqs. (D-11)</a> and <a href="app04.html#app04equ12">(D-12)</a> in <a href="app04.html#app04">Appendix D</a>, the mean and variance of our uniform truncation error probability density function are expressed as</p>
  221. <p class="caption"><a id="ch12equ23"/>(12-23)</p>
  222. <p class="image"><img src="graphics/eq-12-23.jpg" alt="image"/></p>
  223. <p>and</p>
  224. <p class="caption"><a id="ch12equ24"/>(12-24)</p>
  225. <p class="image"><img src="graphics/eq-12-24.jpg" alt="image"/></p>
  226. <p>The notion of binary number truncation is shown in <a href="#ch12fig07">Figure 12-7(c)</a>, where the ten-bit binary word <em>W</em> is to be truncated to six bits by discarding the four Truncate bits. So in this binary truncation situation, <em>q</em> in <a href="#ch12fig07">Figure 12-7(b)</a> is equal to the least significant bit (lsb) value (bit R0) of the retained binary word.</p>
  227. <p>In a sense, truncation error is the price we pay for the privilege of using integer binary arithmetic. One aspect of this is the error introduced when we use truncation to implement division by some integer power of two. A quick way of dividing a binary value by 2<em><sup>K</sup></em> is to shift a binary word <em>K</em> bits to the right; that is, we&#8217;re truncating the data value (not the binary word width) by discarding the rightmost <em>K</em> bits after the right shift.</p>
  228. <p>For example, let&#8217;s say we have the value 31 represented by the six-bit binary number 011111<sub>2</sub>, and we want to divide it by 16 through shifting the bits <em>K</em> = 4 places to the right and discarding those shifted bits. After the right shift we have a binary quotient of 000001<sub>2</sub>. Well, we see the significance of the problem because this type of division gave us a result of one instead of the correct quotient 31/16 = 1.9375. Our division-by-truncation error here is roughly 50 percent of the correct quotient. Had our original dividend been 63 represented by the six-bit binary number 111111<sub>2</sub>, dividing it by 16 through a four-bit shift would give us an answer of binary 000011<sub>2</sub>, or decimal three. The correct answer, of course, is 63/16 = 3.9375. In this case the percentage error is 0.9375/3.9375, or about 23.8 percent. So, the larger the dividend, the lower the truncation error.</p>
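<p>Both of those divide-by-16 cases are easy to reproduce. A short C sketch (ours):</p>
<pre>
#include &lt;stdio.h>

int main(void)
{
    int dividends[] = { 31, 63 };

    for (int i = 0; i &lt; 2; i++) {
        int    n         = dividends[i];
        int    truncated = n >> 4;        /* divide by 16 via right shift */
        double exact     = n / 16.0;
        printf("%2d / 16:  shift gives %d, exact quotient %.4f, error %.1f%%\n",
               n, truncated, exact, 100.0 * (exact - truncated) / exact);
    }
    return 0;
}
</pre>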
  229. <p>If we study these kinds of errors, we&#8217;ll find that truncation error depends on three things: the number of value bits shifted and discarded, the values of the discarded bits (were those dropped bits ones or zeros?), and the magnitude of the binary number left over after shifting. Although a complete analysis of these truncation errors is beyond the scope of this book, a practical example of how division by truncation can cause serious numerical errors is given in reference <a href="#ch12en9">[9]</a>.</p>
  230. <p>Unfortunately, truncation induces a DC bias (an error whose average is a nonzero negative number) on the truncated signal samples, as predicted by <a id="page_649"/><a href="#ch12equ23">Eq. (12-23)</a>. We see this behavior in <a href="#ch12fig07">Figure 12-7(b)</a> where the truncation error is always negative. Inducing a constant (DC) error in a signal sequence can be troublesome in many applications because the always-negative truncation error can grow to an unacceptable level in subsequent computations. So, when data word widths must be reduced to avoid overflow errors, rounding (discussed in the next section), which largely eliminates this bias, is often preferred over truncation.</p>
  231. <p><a id="ch12sec2lev10"/></p>
  232. <h4>12.3.4 Data Rounding</h4>
  233. <p>Rounding is where a binary number requiring truncation is slightly modified before the truncation operation is performed. Let&#8217;s review the behavior of rounding by first defining rounding as the process wherein a number is modified such that it is subsequently represented by, or <em>rounded off to</em>, its nearest quantization level. For example, if we&#8217;re quantizing to integer values, the decimal number 1.2 would be quantized to 1, and the number 1.6 would be quantized to 2. This is shown in <a href="#ch12fig08">Figure 12-8(a)</a>, where all values of <em>x</em> in the range of &#8722;0.5 &#8804; <em>x</em> &lt; 0.5 are set equal to 0, values of <em>x</em> in the range of 0.5 &#8804; <em>x</em> &lt; 1.5 are set equal to 1, and so on.</p>
  234. <p class="caption"><a id="ch12fig08"/><strong>Figure 12-8</strong> Rounding: (a) quantization nonlinearities; (b) error probability density function.</p>
  235. <p class="image"><img src="graphics/fig12-08.jpg" alt="image"/></p>
  236. <p><a id="page_650"/>The quantization error induced by such a rounding operation is the vertical distance between the bold horizontal lines and the dashed diagonal line in <a href="#ch12fig08">Figure 12-8(a)</a>. The probability density function of the error induced by rounding, in terms of the quantization level <em>q</em>, is shown in <a href="#ch12fig08">Figure 12-8(b)</a>. In <a href="#ch12fig08">Figure 12-8(a)</a> the quantization level is <em>q</em> = 1, so in this case we can have quantization error magnitudes no greater than <em>q</em>/2, or 1/2. Using our <a href="app04.html#app04equ11">Eqs. (D-11)</a> and <a href="app04.html#app04equ12">(D-12)</a> results from <a href="app04.html#app04">Appendix D</a>, the mean and variance of our uniform rounding probability density function are expressed as</p>
  237. <p class="caption"><a id="ch12equ25"/>(12-25)</p>
  238. <p class="image"><img src="graphics/eq-12-25.jpg" alt="image"/></p>
  239. <p>and</p>
  240. <p class="caption"><a id="ch12equ26"/>(12-26)</p>
  241. <p class="image"><img src="graphics/eq-12-26.jpg" alt="image"/></p>
  242. <p>The notion of binary number rounding can be described using <a href="#ch12fig07">Figure 12-7(c)</a>, where the binary word <em>W</em> is to be truncated by discarding the four Truncate bits. With rounding, the binary word <em>W</em> is modified before the Truncate bits are discarded. So with binary rounding, <em>q</em> in <a href="#ch12fig08">Figure 12-8(b)</a> is equal to the lsb value (bit R0) of the retained binary word.</p>
  243. <p>Let&#8217;s not forget: the purpose of rounding, its goal in life, is to avoid data overflow errors while reducing the DC bias error (an error whose average is not zero) induced by simple truncation. Rounding achieves this goal because, in theory, its average error is zero as shown by <a href="#ch12equ25">Eq. (12-25)</a>. Next we discuss two popular methods of data rounding.</p>
  244. <p>A common form of binary data rounding is straightforward to implement. Called <em>round-to-nearest,</em> it comprises the two-step process of adding one to the most significant (leftmost) of the lsb bits to be discarded, bit T3 of word <em>W</em> in <a href="#ch12fig07">Figure 12-7(c)</a>, and then discarding the appropriate Truncate bits. For an example of this rounding method, let&#8217;s say we have 16-bit signal samples destined to be routed to a 12-bit digital-to-analog converter. To avoid overflowing the converter&#8217;s 12-bit input register, we add a binary value of 1000<sub>2</sub> (decimal 8<sub>10</sub> = 2<sup>3</sup>) to the original 16-bit sample value and then truncate (discard) the sum&#8217;s least significant 4 bits. As another example of round-to-nearest rounding, if a 32-bit &#8220;long&#8221; word is rounded to 16 bits, a value of 2<sup>15</sup> is added to the long word before discarding the sum&#8217;s 16 least significant bits.</p>
  245. <p>Stated in different words, this round-to-nearest rounding method means: if the T3 bit is a one, increment the R bits by one; then shift the word to the right, discarding the Truncate bits.</p>
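<p>In code terms, the procedure is simply &#8220;add half of the retained lsb&#8217;s weight, then truncate.&#8221; A minimal Python sketch (the function name and the restriction to nonnegative integers are merely illustrative assumptions) is:</p>
<pre>
# Minimal sketch of round-to-nearest: add a one in the T3 bit position
# (half the weight of the retained lsb), then discard the k truncate bits.

def round_to_nearest(value, k):
    half_lsb = 2 ** (k - 1)            # a one in the most significant truncate bit
    return (value + half_lsb) // (2 ** k)

# A 16-bit sample rounded to 12 bits (k = 4), as in the D/A converter example:
print(round_to_nearest(0x1238, 4))     # 292 (0x124): truncate bits equal 8, so we round up
print(round_to_nearest(0x1237, 4))     # 291 (0x123): truncate bits equal 7, so we round down
</pre>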
  246. <p>The round-to-nearest method does reduce the average (DC bias) of the quantization error induced by simple truncation; however the round-to-nearest method&#8217;s average error bias is close to but not exactly equal to zero. <a id="page_651"/>(That&#8217;s because the R bits, in <a href="#ch12fig07">Figure 12-7(c)</a>, are always incremented when the value of the Truncate bits is equal to the value R0/2. This means that over time the R bits are rounded up slightly more often than they are rounded down.) With additional bit checking we can force the average rounding error to be exactly zero using a scheme called <em>convergent rounding.</em></p>
  247. <p>Convergent rounding, also called <em>round to even,</em> is a slightly more complicated method of rounding, but one that yields zero-average rounding error on the rounded binary signal samples. Unlike the round-to-nearest method, convergent rounding does not always increment <a href="#ch12fig07">Figure 12-7(c)</a>&#8217;s R bits (the value Retain) when the value of the Truncate bits is equal to R0/2. In the convergent rounding scheme, when Truncate = R0/2, the value Retain is incremented only if its original value was an odd number. This clever process is shown in <a href="#ch12fig09">Figure 12-9</a>.</p>
  248. <p class="caption"><a id="ch12fig09"/><strong>Figure 12-9</strong> Convergent rounding.</p>
  249. <p class="image"><img src="graphics/fig12-09.jpg" alt="image"/></p>
  250. <p>OK, here&#8217;s what we&#8217;ve learned about rounding: relative to simple truncation, rounding requires more computations, but it both minimizes the constant-level (DC bias) quantization error induced by truncation alone and has a lower maximum quantization error (<em>q</em>/2 rather than <em>q</em>). So rounding is often the preferred method used to avoid binary data overflow errors. The above two rounding methods can, by the way, be used in two&#8217;s complement number format systems.</p>
  251. <p>As a practical rule, to retain maximum numerical precision, all necessary full-width binary arithmetic should be performed first and then rounding (or truncation) should be performed as the very last operation. For example, if we must add twenty 16-bit binary numbers followed by rounding the sum to 12 bits, we should perform the additions at full 16-bit precision and, as a final step, round the summation result to 12 bits.</p>
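<p>A quick numerical experiment makes the point. In the Python sketch below (the sample values are arbitrary, chosen only for illustration), rounding each addend to 12 bits before summing gives a different, less accurate result than summing first and rounding once at the end:</p>
<pre>
# Minimal sketch: rounding early versus rounding as the very last operation.
# Twenty arbitrary 16-bit values are summed and then scaled to 12 bits
# (a 4-bit right shift, i.e. division by 16) with round-to-nearest.

def round_down_4(value):
    return (value + 8) // 16

samples = list(range(1000, 21000, 1000))              # twenty arbitrary 16-bit values

round_last = round_down_4(sum(samples))               # full precision, round once at the end
round_early = sum(round_down_4(s) for s in samples)   # round every addend first

print(round_last, round_early)    # prints 13125 13130; the early rounding is off by 5
</pre>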
  252. <p><a id="page_652"/>In digital signal processing, statistical analysis of quantization error effects is complicated because quantization is a nonlinear process. Analytical results depend on the types of quantization errors, the magnitude of the data being represented, the numerical format used, and which of the many FFT or digital filter structures we are implementing. Be that as it may, digital signal processing experts have developed simplified error models whose analysis has proved useful. Although discussion of these analysis techniques and their results is beyond the scope of this introductory text, many references are available for the energetic reader[<a href="#ch12en10">10</a>&#8211;<a href="#ch12en18">18</a>]. (Reference <a href="#ch12en11">[11]</a> has an extensive reference list of its own on the topic of quantization error analysis.)</p>
  253. <p>Again, the overflow problems using fixed-point binary formats&#8212;which we try to alleviate with truncation or rounding&#8212;arise because so many digital signal processing algorithms comprise large numbers of additions or multiplications. This obstacle, particularly in hardware implementations of digital filters and the FFT, is avoided by hardware designers through the use of floating-point binary number formats.</p>
  254. <p><a id="ch12sec1lev4"/></p>
  255. <h3>12.4 Floating-Point Binary Formats</h3>
  256. <p>Floating-point binary formats allow us to overcome most of the limitations of precision and dynamic range mandated by fixed-point binary formats, particularly in reducing the ill effects of overflow<a href="#ch12en19">[19]</a>. Floating-point formats segment a data word into two parts: a mantissa <em>m</em> and an exponent <em>e</em>. Using these parts, the value of a binary floating-point number <em>n</em> is evaluated as</p>
  257. <p class="caption"><a id="ch12equ27"/>(12-27)</p>
  258. <p class="image"><img src="graphics/eq-12-27.jpg" alt="image"/></p>
  259. <p>that is, the number&#8217;s value is the product of the mantissa and 2 raised to the power of the exponent. (<em>Mantissa</em> is a somewhat unfortunate choice of terms because it has a meaning here very different from that in the mathematics of logarithms. Mantissa originally meant the decimal fraction of a logarithm.<sup><a id="ch12fn_05"/><a href="#ch12fn05">&#8224;</a></sup> However, due to its abundance in the literature we&#8217;ll continue using the term <em>mantissa</em> here.) Of course, both the mantissa and the exponent in <a href="#ch12equ27">Eq. (12-27)</a> can be either positive or negative numbers.</p>
  260. <p class="footnotes"><a id="ch12fn05"/><sup><a href="#ch12fn_05">&#8224;</a></sup> For example, the common logarithm (log to the base 10) of 256 is 2.4082. The 2 to the left of the decimal point is called the characteristic of the logarithm and the 4082 digits are called the mantissa. The 2 in 2.4082 does not mean that we multiply .4082 by 10<sup>2</sup>. The 2 means that we take the antilog of .4082 to get 2.56 and multiply that by 10<sup>2</sup> to get 256.</p>
  261. <p>Let&#8217;s assume that a <em>b</em>-bit floating-point number will use <em>b<sub>e</sub></em> bits for the fixed-point signed exponent and <em>b<sub>m</sub></em> bits for the fixed-point signed mantissa. <a id="page_653"/>The greater the number of <em>b<sub>e</sub></em> bits used, the larger the dynamic range of the number. The more bits used for b<em><sub>m</sub></em>, the better the resolution, or precision, of the number. Early computer simulations conducted by the developers of <em>b</em>-bit floating-point formats indicated that the best trade-off occurred with <em>b<sub>e</sub></em> &#8776; <em>b</em>/4 and <em>b<sub>m</sub></em> &#8776; 3<em>b</em>/4. We&#8217;ll see that for typical 32-bit floating-point formats used today, <em>b<sub>e</sub></em> &#8776; 8 bits and <em>b<sub>m</sub></em> &#8776; 24 bits.</p>
  262. <p>To take advantage of a mantissa&#8217;s full dynamic range, most implementations of floating-point numbers treat the mantissa as a fractional fixed-point binary number, shift the mantissa bits to the right or left, so that the most significant bit is a one, and adjust the exponent accordingly. The process of shifting a binary bit pattern so that the most significant bit is a one is called <em>bit normalization.</em> When normalized, the mantissa bits are typically called the <em>fraction</em> of the floating-point number, instead of the mantissa. For example, the decimal value 3.6875<sub>10</sub> can be represented by the fractional binary number 11.1011<sub>2</sub>. If we use a two-bit exponent with a six-bit fraction floating-point word, we can just as well represent 11.1011<sub>2</sub> by shifting it to the right two places and setting the exponent to two as</p>
  263. <p class="caption"><a id="ch12equ28"/>(12-28)</p>
  264. <p class="image"><img src="graphics/eq-12-28.jpg" alt="image"/></p>
  265. <p>The floating-point word above can be evaluated to retrieve our decimal number again as</p>
  266. <p class="caption"><a id="ch12equ29"/>(12-29)</p>
  267. <p class="image"><img src="graphics/eq-12-29.jpg" alt="image"/></p>
  268. <p>After some experience using floating-point normalization, users soon realized that always having a one in the most significant bit of the fraction was wasteful. That redundant one was taking up a single bit position in all data words and serving no purpose. So practical implementations of floating-point formats discard that one, assume its existence, and increase the useful number of fraction bits by one. This is why the term <em>hidden bit</em> is used to describe some floating-point formats. While increasing the fraction&#8217;s precision, this scheme uses less memory because the hidden bit is merely accounted for in the <a id="page_654"/>hardware arithmetic logic. Using a hidden bit, the fraction in <a href="#ch12equ28">Eq. (12-28)</a>&#8217;s floating-point number is shifted to the left one place and would now be</p>
  269. <p class="caption"><a id="ch12equ30"/>(12-30)</p>
  270. <p class="image"><img src="graphics/eq-12-30.jpg" alt="image"/></p>
  271. <p>Recall that the exponent and mantissa bits were fixed-point signed binary numbers, and we&#8217;ve discussed several formats for representing signed binary numbers, i.e., sign magnitude, two&#8217;s complement, and offset binary. As it turns out, all three signed binary formats are used in industry-standard floating-point formats. The most common floating-point formats, all using 32-bit words, are listed in <a href="#ch12tab06">Table 12-6</a>.</p>
  272. <p class="caption"><a id="ch12tab06"/><strong>Table 12-6</strong> Floating&#8211;Point Number Formats</p>
  273. <p class="image"><img src="graphics/t0654-01.jpg" alt="image"/></p>
  274. <p><a id="page_655"/>The IEEE P754 floating-point format is the most popular because so many manufacturers of floating-point integrated circuits comply with this standard[<a href="#ch12en8">8</a>,<a href="#ch12en20">20</a>&#8211;<a href="#ch12en22">22</a>]. Its exponent <em>e</em> is offset binary (biased exponent), and its fraction is a sign-magnitude binary number with a hidden bit that&#8217;s assumed to be 2<sup>0</sup>. The decimal value of a normalized IEEE P754 floating-point number is evaluated as</p>
  275. <p class="caption"><a id="ch12equ31"/>(12-31)</p>
  276. <p class="image"><img src="graphics/eq-12-31.jpg" alt="image"/></p>
  277. <p>where <em>f</em> is the decimal-formatted value of the fractional bits divided by 2<sup>23</sup>. Value <em>e</em> is the decimal value of the floating-point number&#8217;s exponent bits.</p>
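<p>As a concrete check of <a href="#ch12equ31">Eq. (12-31)</a>, the short Python sketch below (the function name and the test value of &#8722;12.375 are merely illustrative) extracts the sign, biased exponent, and fraction fields from a 32-bit word and evaluates the expression directly. The standard struct module supplies the reference bit pattern:</p>
<pre>
import struct

# Minimal sketch: evaluate Eq. (12-31) for a normalized IEEE P754 number
# held in a 32-bit unsigned integer.

def decode_ieee_p754(word):
    sign = word // 2 ** 31             # 1-bit sign
    e = (word // 2 ** 23) % 2 ** 8     # 8-bit biased exponent
    f = (word % 2 ** 23) / 2.0 ** 23   # 23 fraction bits as a value between 0 and 1
    return (-1.0) ** sign * (1.0 + f) * 2.0 ** (e - 127)

pattern = struct.unpack('!I', struct.pack('!f', -12.375))[0]   # bit pattern of -12.375
print(hex(pattern))                    # 0xc1460000
print(decode_ieee_p754(pattern))       # -12.375
</pre>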
  278. <p>The IBM floating-point format differs somewhat from the other floating-point formats because it uses a base of 16 rather than 2. Its exponent is offset binary, and its fraction is sign magnitude with no hidden bit. The decimal value of a normalized IBM floating-point number is evaluated as</p>
  279. <p class="caption"><a id="ch12equ32"/>(12-32)</p>
  280. <p class="image"><img src="graphics/eq-12-32.jpg" alt="image"/></p>
  281. <p>The DEC floating-point format uses an offset binary exponent, and its fraction is sign magnitude with a hidden bit that&#8217;s assumed to be 2<sup>&#8722;1</sup>. The decimal value of a normalized DEC floating-point number is evaluated as</p>
  282. <p class="caption"><a id="ch12equ33"/>(12-33)</p>
  283. <p class="image"><img src="graphics/eq-12-33.jpg" alt="image"/></p>
  284. <p>MIL-STD 1750A is a United States Military Airborne floating-point standard. Its exponent <em>e</em> is a two&#8217;s complement binary number residing in the least significant eight bits. MIL-STD 1750A&#8217;s fraction is also a two&#8217;s complement number (with no hidden bit), and that&#8217;s why no sign bit is specifically indicated in <a href="#ch12tab06">Table 12-6</a>. The decimal value of a MIL-STD 1750A floating-point number is evaluated as</p>
  285. <p class="caption"><a id="ch12equ34"/>(12-34)</p>
  286. <p class="image"><img src="graphics/eq-12-34.jpg" alt="image"/></p>
  287. <p>Notice how the floating-point formats in <a href="#ch12tab06">Table 12-6</a> all have word lengths of 32 bits. This was not accidental. Using 32-bit words makes these formats easier to handle using 8-, 16-, and 32-bit hardware processors. That fact notwithstanding, and for all the advantages afforded by floating-point number formats, these formats do require a significant amount of logical comparison and branching to perform arithmetic operations correctly. Reference <a href="#ch12en23">[23]</a> provides useful flow charts showing what procedural steps must be taken when floating-point numbers are added and multiplied.</p>
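<p>To give the flavor of those procedural steps, the highly simplified Python sketch below adds two numbers held as (fraction, exponent) pairs: the operand with the smaller exponent is shifted right to align the binary points, the fractions are added, and the result is renormalized. Sign handling, rounding, and overflow checks, which any real implementation must include, are omitted here.</p>
<pre>
import math

# Highly simplified sketch of floating-point addition with (fraction, exponent)
# pairs; sign handling, rounding, and overflow checks are omitted.

def fp_add(a, b):
    (fa, ea), (fb, eb) = a, b
    if eb > ea:                        # make the first operand the one with the larger exponent
        (fa, ea), (fb, eb) = (fb, eb), (fa, ea)
    fb = fb / 2.0 ** (ea - eb)         # align: shift the smaller operand right
    f, e = math.frexp(fa + fb)         # add the fractions, then renormalize
    return f, e + ea

x = math.frexp(3.6875)                 # (0.921875, 2)
y = math.frexp(0.75)                   # (0.75, 0)
print(fp_add(x, y))                    # (0.5546875, 3), i.e. 0.5546875 * 2**3 = 4.4375
</pre>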
  288. <p><a id="ch12sec2lev11"/></p>
  289. <h4><a id="page_656"/>12.4.1 Floating-Point Dynamic Range</h4>
  290. <p>Attempting to determine the dynamic range of an arbitrary floating-point number format is a challenging exercise. We start by repeating the expression for a number system&#8217;s dynamic range from <a href="#ch12equ06">Eq. (12-6)</a> as</p>
  291. <p class="caption"><a id="ch12equ35"/>(12-35)</p>
  292. <p class="image"><img src="graphics/eq-12-35.jpg" alt="image"/></p>
  293. <p>When we attempt to determine the largest and smallest possible values for a floating-point number format, we quickly see that they depend on such factors as</p>
  294. <p class="indenthangingB">&#8226; the position of the binary point</p>
  295. <p class="indenthangingB">&#8226; whether a hidden bit is used or not (If used, its position relative to the binary point is important.)</p>
  296. <p class="indenthangingB">&#8226; the base value of the floating-point number format</p>
  297. <p class="indenthangingB">&#8226; the signed binary format used for the exponent and the fraction (For example, recall from <a href="#ch12tab02">Table 12-2</a> that the binary two&#8217;s complement format can represent larger negative numbers than the sign-magnitude format.)</p>
  298. <p class="indenthangingB">&#8226; how unnormalized fractions are handled, if at all (<em>Unnormalized,</em> also called <em>gradual underflow,</em> means a nonzero number that&#8217;s less than the minimum normalized format but can still be represented when the exponent and hidden bit are both zero.)</p>
  299. <p class="indenthangingB">&#8226; how exponents are handled when they&#8217;re either all ones or all zeros. (For example, the IEEE P754 format treats a number having an all-ones exponent and a nonzero fraction as an invalid number, whereas the DEC format handles a number having a sign = 1 and a zero exponent as a special instruction instead of a valid number.)</p>
  300. <p>Trying to develop a dynamic range expression that accounts for all the possible combinations of the above factors is impractical. What we can do is derive a rule-of-thumb expression for dynamic range that&#8217;s often used in practice[<a href="#ch12en8">8</a>,<a href="#ch12en22">22</a>,<a href="#ch12en24">24</a>].</p>
  301. <p>Let&#8217;s assume the following for our derivation: the exponent is a <em>b<sub>e</sub></em>-bit offset binary number, the fraction is a normalized sign-magnitude number having a sign bit and <em>b<sub>m</sub></em> magnitude bits, and a hidden bit is used just left of the binary point. Our hypothetical floating-point word takes the following form:</p>
  302. <p class="image"><img src="graphics/656fig01.jpg" alt="image"/></p>
  303. <p><a id="page_657"/>First we&#8217;ll determine what the largest value can be for our floating-point word. The largest fraction is a one in the hidden bit, and the remaining <em>b<sub>m</sub></em> fraction bits are all ones. This would make fraction <em>f</em> = [1 + (1 &#8722; 2<sup>&#8722;<em>bm</em></sup>)]. The first 1 in this expression is the hidden bit to the left of the binary point, and the value in parentheses is all <em>b<sub>m</sub></em> bits equal to ones to the right of the binary point. The greatest positive value we can have for the <em>b<sub>e</sub></em>-bit offset binary exponent is 2<sup>(2<sup><em>b<sub>e</sub></em>&#8722;1</sup>&#8722;1)</sup>. So the largest value that can be represented with the floating-point number is the largest fraction raised to the largest positive exponent, or</p>
  304. <p class="caption"><a id="ch12equ36"/>(12-36)</p>
  305. <p class="image"><img src="graphics/eq-12-36.jpg" alt="image"/></p>
  306. <p>The smallest value we can represent with our floating-point word is a one in the hidden bit times two raised to the exponent&#8217;s most negative value, 2<sup>&#8722;(2<sup><em>b<sub>e</sub></em>&#8722;1</sup>)</sup>, or</p>
  307. <p class="caption"><a id="ch12equ37"/>(12-37)</p>
  308. <p class="image"><img src="graphics/eq-12-37.jpg" alt="image"/></p>
  309. <p>Plugging <a href="#ch12equ36">Eqs. (12-36)</a> and <a href="#ch12equ37">(12-37)</a> into <a href="#ch12equ35">Eq. (12-35)</a>,</p>
  310. <p class="caption"><a id="ch12equ38"/>(12-38)</p>
  311. <p class="image"><img src="graphics/eq-12-38.jpg" alt="image"/></p>
  312. <p>Now here&#8217;s where the thumb comes in&#8212;when <em>b<sub>m</sub></em> is large, say over seven, the 2<sup>&#8722;<em>b<sub>m</sub></em></sup> value approaches zero; that is, as <em>b<sub>m</sub></em> increases, the all-ones fraction (1 &#8722; 2<sup>&#8722;<em>b<sub>m</sub></em></sup>) value in the numerator approaches 1. Assuming this, <a href="#ch12equ38">Eq. (12-38)</a> becomes</p>
  313. <p class="caption"><a id="ch12equ39"/>(12-39)</p>
  314. <p class="image"><img src="graphics/eq-12-39.jpg" alt="image"/></p>
  315. <p>Using <a href="#ch12equ39">Eq. (12-39)</a>, we can estimate, for example, the dynamic range of the single-precision IEEE P754 standard floating-point format with its eight-bit exponent:</p>
  316. <p class="caption"><a id="ch12equ40"/>(12-40)</p>
  317. <p class="image"><img src="graphics/eq-12-40.jpg" alt="image"/></p>
  318. <p><a id="page_658"/>Although we&#8217;ve introduced the major features of the most common floating-point formats, there are still more details to learn about floating-point numbers. For the interested reader, the references given in this section provide a good place to start.</p>
  319. <p><a id="ch12sec1lev5"/></p>
  320. <h3>12.5 Block Floating-Point Binary Format</h3>
  321. <p>A marriage of fixed-point and floating-point binary formats is known as <em>block floating point</em>. This scheme is used, particularly in dedicated FFT integrated circuits, when large arrays, or blocks, of associated data are to be manipulated mathematically. Block floating-point schemes begin by examining all the words in a block of data, normalizing the largest-valued word&#8217;s fraction, and establishing the correct exponent. This normalization takes advantage of the fraction&#8217;s full dynamic range. Next, the fractions of the remaining data words are shifted appropriately, so that they can use the exponent of the largest word. In this way, all of the data words use the same exponent value to conserve hardware memory.</p>
  322. <p>In FFT implementations, the arithmetic is performed treating the block normalized data values as fixed-point binary. However, when an addition causes an overflow condition, all of the data words are shifted one bit to the right (division by two), and the exponent is incremented by one. As the reader may have guessed, block floating-point formats have increased dynamic range and avoid the overflow problems inherent in fixed-point formats but do not reach the performance level of true floating-point formats[<a href="#ch12en8">8</a>,<a href="#ch12en25">25</a>,<a href="#ch12en26">26</a>].</p>
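<p>A minimal Python sketch of this block-normalization idea (the function name and the 15-bit fraction width are merely illustrative choices; hardware typically does the equivalent with bit shifts and a single stored exponent) is:</p>
<pre>
import math

# Minimal sketch of block floating point: one shared exponent for a whole
# block of data, with each word stored as a scaled fixed-point fraction.

def block_normalize(samples, fraction_bits=15):
    peak = max(abs(s) for s in samples)
    _, exponent = math.frexp(peak)        # exponent that normalizes the largest word
    scale = 2 ** fraction_bits
    fractions = [int(round(s * scale / 2 ** exponent)) for s in samples]
    return fractions, exponent

data = [0.125, -3.6875, 1.5, 0.01]
fracs, exp = block_normalize(data)
print(exp)                                       # 2, set by the largest-magnitude word
print([f * 2 ** exp / 2 ** 15 for f in fracs])   # very nearly the original data
</pre>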
  323. <p><a id="ch12sec1lev6"/></p>
  324. <h3>References</h3>
  325. <p class="chapterendnote"><a id="ch12en1"/>[1] Neugebauer, O. &#8220;The History of Ancient Astronomy,&#8221; <em>Journal of Near Eastern Studies,</em> Vol. 4, 1945, p. 12.</p>
  326. <p class="chapterendnote"><a id="ch12en2"/>[2] Knuth, D. E. <em>The Art of Computer Programming: Seminumerical Methods</em>, Vol. 2, Addison-Wesley, Reading, Massachusetts, 1981, <a href="ch04.html#ch04sec1lev1">Section 4.1</a>, p. 179.</p>
  327. <p class="chapterendnote"><a id="ch12en3"/>[3] Kester, W. &#8220;Peripheral Circuits Can Make or Break Sampling-ADC Systems,&#8221; <em>EDN Magazine</em>, October 1, 1992.</p>
  328. <p class="chapterendnote"><a id="ch12en4"/>[4] Grove, M. &#8220;Measuring Frequency Response and Effective Bits Using Digital Signal Processing Techniques,&#8221; <em>Hewlett</em>-<em>Packard Journal,</em> February 1992.</p>
  329. <p class="chapterendnote"><a id="ch12en5"/>[5] Tektronix. &#8220;Effective Bits Testing Evaluates Dynamic Range Performance of Digitizing Instruments,&#8221; <em>Tektronix Application Note,</em> No. 45W-7527, December 1989.</p>
  330. <p class="chapterendnote"><a id="ch12en6"/>[6] Ushani, R. &#8220;Subranging ADCs Operate at High Speed with High Resolution,&#8221; <em>EDN Magazine</em>, April 11, 1991.</p>
  331. <p class="chapterendnote"><a id="page_659"/><a id="ch12en7"/>[7] Demler, M. &#8220;Time-Domain Techniques Enhance Testing of High-Speed ADCs,&#8221; <em>EDN Magazine</em>, March 30, 1992.</p>
  332. <p class="chapterendnote"><a id="ch12en8"/>[8] Hilton, H. &#8220;A 10-MHz Analog-to-Digital Converter with 110-dB Linearity,&#8221; <em>Hewlett-Packard Journal</em>, October 1993.</p>
  333. <p class="chapterendnote"><a id="ch12en9"/>[9] Lyons, R. G. &#8220;Providing Software Flexibility for Optical Processor Noise Analysis,&#8221; <em>Computer Design</em>, July 1978, p. 95.</p>
  334. <p class="chapterendnote"><a id="ch12en10"/>[10] Knuth, D. E. <em>The Art of Computer Programming: Seminumerical Methods</em>, Vol. 2, Addison-Wesley, Reading, Massachusetts, 1981, <a href="ch04.html#ch04sec1lev2">Section 4.2</a>, p. 198.</p>
  335. <p class="chapterendnote"><a id="ch12en11"/>[11] Rabiner, L. R., and Gold, B. <em>Theory and Application of Digital Signal Processing</em>, <a href="ch05.html#ch05">Chapter 5</a>, Prentice Hall, Englewood Cliffs, New Jersey, 1975, p. 353.</p>
  336. <p class="chapterendnote"><a id="ch12en12"/>[12] Jackson, L. B. &#8220;An Analysis of Limit Cycles Due to Multiplicative Rounding in Recursive Digital Filters,&#8221; <em>Proc. 7th Allerton Conf. Circuit System Theory</em>, 1969, pp. 69&#8211;78.</p>
  337. <p class="chapterendnote"><a id="ch12en13"/>[13] Kan, E. P. F., and Aggarwal, J. K. &#8220;Error Analysis of Digital Filters Employing Floating Point Arithmetic,&#8221; <em>IEEE Trans. Circuit Theory</em>, Vol. CT-18, November 1971, pp. 678&#8211;686.</p>
  338. <p class="chapterendnote"><a id="ch12en14"/>[14] Crochiere, R. E. &#8220;Digital Ladder Structures and Coefficient Sensitivity,&#8221; <em>IEEE Trans. Audio Electroacoustics</em>, Vol. AU-20, October 1972, pp. 240&#8211;246.</p>
  339. <p class="chapterendnote"><a id="ch12en15"/>[15] Jackson, L. B. &#8220;On the Interaction of Roundoff Noise and Dynamic Range in Digital Filters,&#8221; <em>Bell System Technical Journal</em>, Vol. 49, February 1970, pp. 159&#8211;184.</p>
  340. <p class="chapterendnote"><a id="ch12en16"/>[16] Roberts, R. A., and Mullis, C. T. <em>Digital Signal Processing</em>, Addison-Wesley, Reading, Massachusetts, 1987, p. 277.</p>
  341. <p class="chapterendnote"><a id="ch12en17"/>[17] Jackson, L. B. &#8220;Roundoff Noise Analysis for Fixed-Point Digital Filters Realized in Cascade or Parallel Form,&#8221; <em>IEEE Trans. Audio Electroacoustics</em>, Vol. AU-18, June 1970, pp. 107&#8211;122.</p>
  342. <p class="chapterendnote"><a id="ch12en18"/>[18] Oppenheim, A. V., and Schafer, R. W. <em>Discrete</em>-<em>Time Signal Processing</em>, Prentice Hall, Englewood Cliffs, New Jersey, 1989, Sections 9.7 and 9.8.</p>
  343. <p class="chapterendnote"><a id="ch12en19"/>[19] Larimer, J., and Chen, D. &#8220;Fixed or Floating? A Pointed Question in DSPs,&#8221; <em>EDN Magazine</em>, August 3, 1995.</p>
  344. <p class="chapterendnote"><a id="ch12en20"/>[20] Ashton, C. &#8220;Floating Point Math Handles Iterative and Recursive Algorithms,&#8221; <em>EDN Magazine</em>, January 9, 1986.</p>
  345. <p class="chapterendnote"><a id="ch12en21"/>[21] Windsor, B., and Wilson, J. &#8220;Arithmetic Duo Excels in Computing Floating Point Products,&#8221; <em>Electronic Design</em>, May 17, 1984.</p>
  346. <p class="chapterendnote"><a id="ch12en22"/>[22] Windsor, W. A. &#8220;IEEE Floating Point Chips Implement DSP Architectures,&#8221; <em>Computer Design</em>, January 1985.</p>
  347. <p class="chapterendnote"><a id="ch12en23"/>[23] Texas Instruments Inc. <em>Digital Signal Processing Applications with the TMS320 Family: Theory, Algorithms, and Implementations</em>, SPRA012A, Texas Instruments, Dallas, Texas, 1986.</p>
  348. <p class="chapterendnote"><a id="ch12en24"/>[24] Strauss, W. I. &#8220;Integer or Floating Point? Making the Choice,&#8221; <em>Computer Design Magazine</em>, April 1, 1990, p. 85.</p>
  349. <p class="chapterendnote"><a id="page_660"/><a id="ch12en25"/>[25] Oppenheim, A. V., and Weinstein, C. J. &#8220;Effects of Finite Register Length in Digital Filtering and the Fast Fourier Transform,&#8221; <em>Proceedings of the IEEE</em>, August 1972, pp. 957&#8211;976.</p>
  350. <p class="chapterendnote"><a id="ch12en26"/>[26] Woods, R. E. &#8220;Transform-Based Processing: How Much Precision Is Needed?&#8221; <em>ESD: The Electronic System Design Magazine</em>, February 1987.</p>
  351. <p><a id="ch12sec1lev7"/></p>
  352. <h3><a id="page_661"/>Chapter 12 Problems</h3>
  353. <p class="indenthanging4"><strong>12.1</strong> Given their specified format, convert the following integer binary numbers to decimal format:</p>
  354. <p class="indenthangingN1"><strong>(a)</strong> 1100 0111, unsigned,</p>
  355. <p class="indenthangingN1"><strong>(b)</strong> 1100 0111, sign magnitude,</p>
  356. <p class="indenthangingN1"><strong>(c)</strong> 1100 0111, two&#8217;s complement,</p>
  357. <p class="indenthangingN1"><strong>(d)</strong> 1100 0111, offset binary.</p>
  358. <p class="indenthanging4"><strong>12.2</strong> Convert the following unsigned integer binary numbers, given here in hexadecimal format, to decimal:</p>
  359. <p class="indenthangingN1"><strong>(a)</strong> $A231,</p>
  360. <p class="indenthangingN1"><strong>(b)</strong> 0x71F.</p>
  361. <p class="indenthanging4"><strong>12.3</strong> Given the hexadecimal integer numbers $07 and $E2 in two&#8217;s complement format, what is the decimal value of $07 minus $E2? Show your work.</p>
  362. <p class="indenthanging4"><strong>12.4</strong> Sign-extend the following two&#8217;s complement integer numbers, given in hexadecimal format, to 16 bits and express the results in hexadecimal format:</p>
  363. <p class="indenthangingN1"><strong>(a)</strong> $45,</p>
  364. <p class="indenthangingN1"><strong>(b)</strong> $B3.</p>
  365. <p class="indenthanging4"><strong>12.5</strong> Show that the binary addition operation</p>
  366. <p class="indentpara4"><img src="graphics/661fig01.jpg" alt="image"/></p>
  367. <p class="indentpara4">gives the correct decimal results when the two binary addends and the sum are in the following two&#8217;s complement fractional formats:</p>
  368. <p class="indenthangingN1"><strong>(a)</strong> 7.1 (7 integer bits and 1 fractional bit),</p>
  369. <p class="indenthangingN1"><strong>(b)</strong> 6.2 (6 integer bits and 2 fractional bits),</p>
  370. <p class="indenthangingN1"><strong>(c)</strong> 4.4 (4 integer bits and 4 fractional bits).</p>
  371. <p class="indenthanging4"><strong>12.6</strong> Microchip Technology Inc. produces a microcontroller chip (Part #PIC24F) that accommodates 16-bit data words. When using a two&#8217;s complement integer number format, what are the most positive and most negative decimal numbers that can be represented by the microcontroller&#8217;s data word?</p>
  372. <p class="indenthanging4"><a id="page_662"/><strong>12.7</strong> Consider four-bit unsigned binary words using a 2.2 (&#8220;two dot two&#8221;) &#8220;integer plus fraction&#8221; format. List all 16 possible binary words in this format and give their decimal equivalents.</p>
  373. <p class="indenthanging4"><strong>12.8</strong> The annual residential property tax in California is 0.0165 times the assessed dollar value of the property. What is this 0.0165 tax rate factor in a two&#8217;s complement 1.15 format? Give the answer in both binary and hexadecimal representations. Show how you arrived at your solution.</p>
  374. <p class="indenthanging4"><strong>12.9</strong> The decimal number 1/3 cannot be represented exactly with a finite number of decimal digits, nor with a finite number of binary bits. What would be the base of a number system that would allow decimal 1/3 to be exactly represented with a finite number of digits?</p>
  375. <p class="indenthanging4"><strong>12.10</strong> If the number 4273<sub>6</sub> is in a base 6 numbering system, what would be its decimal value?</p>
  376. <p class="indenthanging4"><strong>12.11</strong> Think about a 32-bit two&#8217;s complement fixed-point binary number having 31 fractional bits (a &#8220;1.31&#8221; two&#8217;s complement number). This number format is very common in today&#8217;s high-performance programmable DSP chips.</p>
  377. <p class="indenthangingN1"><strong>(a)</strong> What is the most positive decimal value that can be represented by such a binary number? Show how you arrived at your solution.</p>
  378. <p class="indenthangingN1"><strong>(b)</strong> What is the most negative decimal value?</p>
  379. <p class="indenthanging4"><strong>12.12</strong> As of this writing, Analog Devices Inc. produces an integrated circuit (Part #AD9958), called a <em>direct digital synthesizer</em>, that generates high-precision analog sinewaves. The AD9958 uses a 31-bit binary word to control the device&#8217;s output frequency. When the control word is at its minimum value, the device&#8217;s output frequency is zero Hz. When the control word is at its maximum value, the output frequency is 250 MHz. What is the frequency resolution (the frequency step size) of this sinusoidal signal generator in Hz?</p>
  380. <p class="indenthanging4"><strong>12.13</strong> The first commercial audio compact disc (CD) players used 16-bit samples to represent an analog audio signal. Their sample rate was <em>f<sub>s</sub></em> = 44.1 kHz. Those 16-bit samples were applied to a digital-to-analog (D/A) converter whose analog output was routed to a speaker. What is the combined data output rate of the digital portion, measured in bytes (8-bit binary words) per second, of a stereo CD player?</p>
  381. <p class="indenthanging4"><strong>12.14</strong> When implementing a digital filter using a fixed-point binary number format, care must be taken to avoid arithmetic overflow errors. With that notion in mind, if the <em>x</em>(<em>n</em>) input samples in <a href="#ch12fig10">Figure P12-14</a> are eight-bit binary words, <a id="page_663"/>how many bits are needed to represent the <em>y</em>(<em>n</em>) output sequence to avoid any data overflow errors? Show how you arrived at your answer.</p>
  382. <p class="caption"><a id="ch12fig10"/><strong>Figure P12-14</strong></p>
  383. <p class="image"><img src="graphics/fig-p12-14.jpg" alt="image"/></p>
  384. <p class="indentpara4"><strong>Hint:</strong> Review the last portion of the text&#8217;s <a href="#ch12sec2lev8">Section 12.3.2</a>.</p>
  385. <p class="indenthanging4"><strong>12.15</strong> Review the brief description of <em><a href="app06.html#gloss01_01">allpass filters</a></em> in <a href="app06.html#app06">Appendix F</a>. One form of an allpass filter is shown in <a href="#ch12fig11">Figure P12-15(a)</a>. For the filter to have the desired constant magnitude response over its full operating frequency, coefficient <em>A</em> must be equal to</p>
  386. <p class="image"><img src="graphics/uneq-12-15.jpg" alt="image"/></p>
  387. <p class="caption"><a id="ch12fig11"/><strong>Figure P12-15</strong></p>
  388. <p class="image"><img src="graphics/fig-p12-15.jpg" alt="image"/></p>
  389. <p class="indentpara4">If the filter is designed such that <em>B</em> = 2.5, show why we cannot achieve the desired constant frequency magnitude response when coefficients <em>A</em> and <em>B</em> are quantized using four-bit unsigned binary words in a 2.2 (&#8220;two dot two&#8221;) &#8220;integer plus fraction&#8221; format, where <em>A</em><sub>Q</sub> and <em>B</em><sub>Q</sub> are the quantized coefficients as shown in <a href="#ch12fig11">Figure P12-15(b)</a>.</p>
  390. <p class="indenthanging4"><a id="page_664"/><strong>12.16</strong> National Semiconductors Inc. produces a digital tuner chip (Part #CLC5903), used for building digital receivers, that has the capability to amplify its output signal by shifting its binary signal sample values to the left by as few as one bit to as many as seven bits. What is the maximum gain, measured in dB (decibels), of this tuner&#8217;s bit-shifting amplification capability?</p>
  391. <p class="indenthanging4"><strong>12.17</strong> <a href="#ch12fig12">Figure P12-17</a> shows an algorithm that approximates the operation of dividing a sign-magnitude binary number <em>x</em>(<em>n</em>) by an integer value <em>K</em>. (A block containing the &#8220;&#8212;&gt; 2&#8221; symbol means truncation by way of a binary right shift by two bits.) What is the value of integer <em>K</em>? Show your work.</p>
  392. <p class="caption"><a id="ch12fig12"/><strong>Figure P12-17</strong></p>
  393. <p class="image"><img src="graphics/fig-p12-17.jpg" alt="image"/></p>
  394. <p class="indenthanging4"><strong>12.18</strong> When using programmable DSP chips, multiplication is a simple straightforward operation. However, when using field-programmable gate arrays (FPGAs), multiplier hardware is typically difficult to implement and should be avoided whenever possible. <a href="#ch12fig13">Figure P12-18</a> shows how we can multiply a binary <em>x</em>(<em>n</em>) input sequence by 54, without the need for multiplier hardware. What are the values for <em>A</em> and <em>B</em> in <a href="#ch12fig13">Figure P12-18</a> so that <em>y</em>(<em>n</em>) equals 54 times <em>x</em>(<em>n</em>)?</p>
  395. <p class="caption"><a id="ch12fig13"/><strong>Figure P12-18</strong></p>
  396. <p class="image"><img src="graphics/fig-p12-18.jpg" alt="image"/></p>
  397. <p class="indenthanging4"><a id="page_665"/><strong>12.19</strong> Consider the network shown in <a href="#ch12fig14">Figure P12-19</a> which approximates a 2nd-order differentiation operation. In many DSP implementations (using field-programmable gate arrays, for example) it is advantageous to minimize the number of multiplications. Assuming that all the sequences in <a href="#ch12fig14">Figure P12-19</a> use a binary two&#8217;s complement integer number format, what data bit manipulations must be implemented to eliminate the two multipliers?</p>
  398. <p class="caption"><a id="ch12fig14"/><strong>Figure P12-19</strong></p>
  399. <p class="image"><img src="graphics/fig-p12-19.jpg" alt="image"/></p>
  400. <p class="indenthanging4"><strong>12.20</strong> Agilent Inc. produces an A/D converter (Model #DP1400) whose sample rate is 2&#215;10<sup>9</sup> samples/second (<em>f<sub>s</sub></em> = 2 GHz). This digitizer provides super-fine time resolution samples of analog signals whose durations are <em>T</em> = 5&#215;10<sup>&#8722;6</sup> seconds (5 microseconds) as shown in <a href="#ch12fig15">Figure P12-20</a>. If each converter output sample is stored in one memory location of a computer, how many memory locations are required to store the converter&#8217;s <em>x</em>(<em>n</em>) output sequence representing the 5-microsecond-duration <em>x</em>(<em>t</em>) signal?</p>
  401. <p class="caption"><a id="ch12fig15"/><strong>Figure P12-20</strong></p>
  402. <p class="image"><img src="graphics/fig-p12-20.jpg" alt="image"/></p>
  403. <p class="indenthanging4"><strong>12.21</strong> Here is a problem often encountered by DSP engineers. Assume we sample exactly three cycles of a continuous <em>x</em>(<em>t</em>) sinewave resulting in a block of 1024 <a id="page_666"/><em>x</em>(<em>n</em>) time samples and compute a 1024-point fast Fourier transform (FFT) to obtain the FFT magnitude samples. Also assume that we repeat the sampling and FFT magnitude computations many times and average the FFT magnitude sequences to produce the average magnitude samples, &#124;<em>X</em><sub>ave</sub>(<em>m</em>)&#124;, shown in <a href="#ch12fig16">Figure P12-21</a>. (We averaged multiple FFT magnitude sequences to increase the accuracy, by reducing the variance, of our final &#124;<em>X</em><sub>ave</sub>(<em>m</em>)&#124; sequence.) If the A/D converter produces ten-bit binary words in sign-magnitude format and has an input full-scale bipolar voltage range of &#177;5 volts, what is the peak value of the continuous <em>x</em>(<em>t</em>) sinewave? Justify your answer.</p>
  404. <p class="caption"><a id="ch12fig16"/><strong>Figure P12-21</strong></p>
  405. <p class="image"><img src="graphics/fig-p12-21.jpg" alt="image"/></p>
  406. <p class="indenthanging4"><strong>12.22</strong> Suppose we have a 12-bit A/D converter that operates over an input voltage range of &#177;5 volts (10 volts peak-peak). Assume the A/D converter is <em>ideal</em> in its operation and its transfer function is that shown in <a href="#ch12fig17">Figure P12-22</a> where the tick mark spacing of the <em>x</em>(<em>t</em>) and <em>x</em>(<em>n</em>) axes is the converter&#8217;s quantization-level <em>q</em>.</p>
  407. <p class="caption"><a id="ch12fig17"/><strong>Figure P12-22</strong></p>
  408. <p class="image"><img src="graphics/fig-p12-22.jpg" alt="image"/></p>
  409. <p class="indenthangingN1"><a id="page_667"/><strong>(a)</strong> What is the A/D converter&#8217;s quantization-level <em>q</em> (least significant bit) voltage?</p>
  410. <p class="indenthangingN1"><strong>(b)</strong> What are the A/D converter&#8217;s maximum positive and maximum negative quantization error voltages?</p>
  411. <p class="indenthangingN1"><strong>(c)</strong> If we apply a 7-volt peak-peak sinusoidal voltage to the converter&#8217;s analog input, what A/D output signal-to-quantization noise value, <em>SNR</em><sub>A/D</sub> in dB, should we expect? Show how you arrived at your answer.</p>
  412. <p class="indenthanging4"><strong>12.23</strong> Suppose an A/D converter manufacturer applies a 10-volt peak-peak sinusoidal voltage to their 12-bit converter&#8217;s analog input, conducts careful testing, and measures the converter&#8217;s overall signal-to-noise level to be 67 dB. What is the <em>effective number of bits</em> value, <em>b</em><sub>eff</sub>, for their A/D converter?</p>
  413. <p class="indenthanging4"><strong>12.24</strong> Let&#8217;s reinforce our understanding of the quantization errors induced by typical A/D converters.</p>
  414. <p class="indenthangingN1"><strong>(a)</strong> <a href="#ch12fig18">Figure P12-24</a> shows the quantized <em>x</em>(<em>n</em>) output integer values of truncating and rounding A/D converters as a function of their continuous <em>x</em>(<em>t</em>) input voltage. It&#8217;s sensible to call those bold <em>stair-step</em> curves the &#8220;transfer functions&#8221; of the A/D converters. The curves are normalized to the A/D converter&#8217;s quantization-level voltage <em>q</em>, such that an <em>x</em>(<em>t</em>) value of 2 represents a voltage of 2<em>q</em> volts. Draw the curves of the quantization error as a function of the continuous <em>x</em>(<em>t</em>) input for both truncating and rounding A/D converters.</p>
  415. <p class="caption"><a id="ch12fig18"/><strong>Figure P12-24</strong></p>
  416. <p class="image"><img src="graphics/fig-p12-24.jpg" alt="image"/></p>
  417. <p class="indenthangingN1"><strong>(b)</strong> Fill in the following table of important A/D converter quantization error properties in terms of the A/D converters&#8217; quantization-level voltage <em>q</em>.</p>
  418. <p class="image"><a id="page_668"/><img src="graphics/t0668-01.jpg" alt="image"/></p>
  419. <p class="indenthanging4"><strong>12.25</strong> Assume we want to digitize the output voltage of a temperature measurement system, monitoring the internal temperature of an automobile radiator, as shown in <a href="#ch12fig19">Figure P12-25</a>. The system&#8217;s manufacturer states that its output voltage <em>v</em>(<em>t</em>) will represent the thermocouple&#8217;s junction temperature with an accuracy of 2 degrees Fahrenheit (1.1 degrees Celsius), and its operating range covers temperatures as low as just-freezing water to twice the temperature of boiling water. To accommodate the precision and operating range of the temperature measurement system, how many bits, <em>b</em>, do we need for our A/D converter? Show your work.</p>
  420. <p class="caption"><a id="ch12fig19"/><strong>Figure P12-25</strong></p>
  421. <p class="image"><img src="graphics/fig-p12-25.jpg" alt="image"/></p>
  422. <p class="indenthanging4"><strong>12.26</strong> One useful way to test the performance of A/D converters is to apply a specific analog signal to the A/D converter&#8217;s analog input and perform a histogram of the converter&#8217;s output samples. For example, if an analog squarewave-like signal is applied to an A/D converter, the converter&#8217;s output sequence might be that shown in the left panel of <a href="#ch12fig20">Figure P12-26(a)</a>, and the histogram of the converter&#8217;s output samples is shown in the right panel of <a href="#ch12fig20">Figure P12-26(a)</a>. That histogram shows that there are many converter output samples whose values are &#8722;0.2, and many converter output samples whose values are 0.5, and no sample values other than &#8722;0.2 and 0.5. The shape of the histogram curve will indicate any severe defects in the converter&#8217;s performance.</p>
  423. <p class="caption"><a id="ch12fig20"/><strong>Figure P12-26</strong></p>
  424. <p class="image"><img src="graphics/fig-p12-26.jpg" alt="image"/></p>
  425. <p class="indentpara4">If a triangular analog signal is applied to an A/D converter, the converter&#8217;s output sequence would be that shown in the left panel of <a href="#ch12fig20">Figure P12-26(b)</a> <a id="page_669"/>and the histogram of the converter&#8217;s output samples is shown in the right panel of <a href="#ch12fig20">Figure P12-26(b)</a>. This histogram shows that there are (ideally) an equal number of samples at all amplitudes between &#8722;1 and +1, which happens to indicate correct converter behavior.</p>
  426. <p class="indentpara4">In the testing of high-frequency A/D converters, high-frequency analog square and triangular waves are difficult to generate, so A/D converter engineers use high-frequency analog sinewaves to test their converters. Assuming that an analog sinewave is used as an input for A/D converter histogram testing and the converter output samples are those shown in the left panel of <a href="#ch12fig20">Figure P12-26(c)</a>, draw a rough sketch of the histogram of converter output samples.</p>
  427. <p class="indenthanging4"><strong>12.27</strong> In the text we discussed how to use the concept of a uniform probability density function (PDF), described in <a href="app04.html#app04sec1lev3">Section D.3</a> of <a href="app04.html#app04">Appendix D</a>, to help us determine the variance (a measure of power) of random A/D-converter <a id="page_670"/>quantization noise. Sometimes we want to generate random noise samples, for testing purposes, that have a uniform PDF such as that shown in <a href="#ch12fig21">Figure P12-27</a>. What is the value of <em>A</em> for a uniform PDF random sequence whose variance is equal to 2?</p>
  428. <p class="caption"><a id="ch12fig21"/><strong>Figure P12-27</strong></p>
  429. <p class="image"><img src="graphics/fig-p12-27.jpg" alt="image"/></p>
  430. <p class="indenthanging4"><strong>12.28</strong> Assume we have a single numerical data sample value in floating-point binary format. What two bit manipulation methods exist to multiply that sample by 4 without using any multiplier hardware circuitry?</p>
  431. <p class="indenthanging4"><strong>12.29</strong> Convert the following IEEE P754 floating-point number, given here in hexadecimal format, to a decimal number:</p>
  432. <p class="indentpara4">$C2ED0000</p>
  433. <p class="indentpara4">Show your work.</p>
  434. <p class="indentpara4"><strong>Hint:</strong> Don&#8217;t forget to account for the hidden one in the IEEE P754 format.</p>
  435. </body>
  436. </html>