<?xml version="1.0" encoding="ISO-8859-1"?><article xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<front>
<journal-meta>
<journal-id>1405-5546</journal-id>
<journal-title><![CDATA[Computación y Sistemas]]></journal-title>
<abbrev-journal-title><![CDATA[Comp. y Sist.]]></abbrev-journal-title>
<issn>1405-5546</issn>
<publisher>
<publisher-name><![CDATA[Instituto Politécnico Nacional, Centro de Investigación en Computación]]></publisher-name>
</publisher>
</journal-meta>
<article-meta>
<article-id>S1405-55462004000300007</article-id>
<title-group>
<article-title xml:lang="en"><![CDATA[Exposing Instruction Level Parallelism in the Presence of Loops]]></article-title>
<article-title xml:lang="es"><![CDATA[Exponiendo el Paralelismo a Nivel de Instrucciones en Presencia de Bucles]]></article-title>
</title-group>
<contrib-group>
<contrib contrib-type="author">
<name>
<surname><![CDATA[de Alba]]></surname>
<given-names><![CDATA[Marcos R]]></given-names>
</name>
<xref ref-type="aff" rid="A01"/>
</contrib>
<contrib contrib-type="author">
<name>
<surname><![CDATA[Kaeli]]></surname>
<given-names><![CDATA[David]]></given-names>
</name>
<xref ref-type="aff" rid="A02"/>
</contrib>
</contrib-group>
<aff id="A01">
<institution><![CDATA[Northeastern University, Department of Electrical and Computer Engineering]]></institution>
<addr-line><![CDATA[Boston ]]></addr-line>
</aff>
<aff id="A02">
<institution><![CDATA[Northeastern University, Computer Architecture Research Laboratory]]></institution>
<addr-line><![CDATA[Boston ]]></addr-line>
</aff>
<pub-date pub-type="pub">
<day>00</day>
<month>09</month>
<year>2004</year>
</pub-date>
<pub-date pub-type="epub">
<day>00</day>
<month>09</month>
<year>2004</year>
</pub-date>
<volume>8</volume>
<numero>1</numero>
<fpage>74</fpage>
<lpage>85</lpage>
<copyright-statement/>
<copyright-year/>
<self-uri xlink:href="http://www.scielo.org.mx/scielo.php?script=sci_arttext&amp;pid=S1405-55462004000300007&amp;lng=en&amp;nrm=iso"></self-uri><self-uri xlink:href="http://www.scielo.org.mx/scielo.php?script=sci_abstract&amp;pid=S1405-55462004000300007&amp;lng=en&amp;nrm=iso"></self-uri><self-uri xlink:href="http://www.scielo.org.mx/scielo.php?script=sci_pdf&amp;pid=S1405-55462004000300007&amp;lng=en&amp;nrm=iso"></self-uri><abstract abstract-type="short" xml:lang="en"><p><![CDATA[In this thesis we explore how to utilize a loop cache to relieve the unnecessary pressure placed on the trace cache by loops. Due to the high temporal locality of loops, loops should be cached. We have observed that when loops contain control flow instructions in their bodies it is better to collect traces in a dedicated loop cache instead of using trace cache space. The traces of instructions within loops tend to exhibit predictable patterns that can be detected and exploited at run-time. We propose to capture dynamic traces of loop bodies in a loop cache. The novelty of this loop cache consists of dynamically capturing loop iterations with conditional branches and correlating them to unique loops. Once loop iterations are cached in the loop cache, their bodies can be provided by the loop cache without polluting the trace cache and without any instruction cache accesses. The proposed loop cache includes hardware capable of dynamically unfolding loops such that large traces of instructions are accessed in a single loop cache interrogation. We evaluate our loop cache and compare it against a baseline machine with a larger first-level instruction cache. We also consider how the loop cache can complement the introduction of a trace cache by filtering out loop traces that needlessly dominate the trace cache space. We quantify the benefits provided by a fetch engine equipped with the proposed loop cache and unrolling hardware. 
In our experiments we explore the design space of a loop cache and associated unfolding hardware and evaluate its efficiency in detecting independent iterations in loops in SPECint2000, MediaBench and MiBench applications. We show that trace cache efficiency and ILP can be significantly improved using our loop caching scheme. This improvement translates into a performance speedup of up to 38% when a machine with a loop cache and no trace cache is compared to a baseline machine with no loop cache. Further experiments show up to a 16% speedup on a hybrid machine with loop and trace caches compared to a machine with a larger L1 cache and a trace cache.]]></p></abstract>
<abstract abstract-type="short" xml:lang="es"><p><![CDATA[Este trabajo se concentra en el análisis y detección de bucles para incrementar el paralelismo a nivel de instrucciones a través de la especulación de visitas enteras a los bucles. En la tesis se comparan las técnicas propuestas con otras existentes y se proponen técnicas híbridas que explotan las características benéficas de los mecanismos involucrados. Se lleva a cabo un estudio dinámico de las propiedades de muchos conjuntos de aplicaciones con el fin de determinar las características óptimas del hardware propuesto. Tal incluye una memoria cache especialmente diseñada para el almacenamiento y manejo óptimo de instrucciones pertenecientes a los bucles. Proveyendo miles de instrucciones para especulación en la memoria cache de bucles se obtienen aceleraciones en la mayoría de las aplicaciones con el mismo presupuesto de hardware. Se presenta de forma detallada el estudio exhaustivo de técnicas similares así como los detalles del diseño del hardware propuesto. Se justifican cada una de las características basadas en estudios dinámicos de las propiedades de las aplicaciones. También se analizan posibles formas de proveer mayor ganancia en el rendimiento y se presentan alternativas de adaptación del hardware en arquitecturas futuras y en procesadores comerciales existentes.]]></p></abstract>
</article-meta>
</front><body><![CDATA[ <p align="justify"><font face="verdana" size="4">Resumen de tesis doctoral </font></p>     <p align="justify"><font face="verdana" size="2">&nbsp;</font></p>     <p align="center"><font face="verdana" size="4"><b>Exposing Instruction Level Parallelism in the Presence of Loops</b></font></p>     <p align="center"><font face="verdana" size="2">&nbsp;</font></p>     <p align="center"><font face="verdana" size="3"><b><i>Exponiendo el Paralelismo a Nivel de Instrucciones en Presencia de Bucles</i></b></font></p>     <p align="justify"><font face="verdana" size="2">&nbsp;</font></p>     <p align="justify"><font face="verdana" size="2"><b>Graduated: Marcos R. de Alba    <br> </b><i>Department of Electrical and Computer Engineering    <br> Northeastern University    <br> Boston MA 02115</i>    ]]></body>
<body><![CDATA[<br> e-mail: <a href="mailto:mdealba@ece.neu.edu">mdealba@ece.neu.edu</a></font></p>     <p align="justify"><font face="verdana" size="2"><b>Advisor: Dr. David Kaeli    <br> </b><i>Computer Architecture Research Laboratory (NUCAR)     <br> Northeastern University     <br> Boston MA 02115</i></font></p>     <p align="justify"><font face="verdana" size="2">&nbsp;</font></p>     <p align="center"><font face="verdana" size="2">Graduated on May 1, 2004</font></p>     <p align="justify"><font face="verdana" size="2">&nbsp;</font></p>     <p align="justify"><font face="verdana" size="2"><b>Abstract</b></font></p>     <p align="justify"><font face="verdana" size="2">In this thesis we explore how to utilize a loop cache to relieve the unnecessary pressure placed on the trace cache by loops. Due to the high temporal locality of loops, loops should be cached. We have observed that when loops contain control flow instructions in their bodies it is better to collect traces in a dedicated loop cache instead of using trace cache space. The traces of instructions within loops tend to exhibit predictable patterns that can be detected and exploited at run-time. We propose to capture dynamic traces of loop bodies in a loop cache. The novelty of this loop cache consists of dynamically capturing loop iterations with conditional branches and correlating them to unique loops. Once loop iterations are cached in the loop cache, their bodies can be provided by the loop cache without polluting the trace cache and without any instruction cache accesses. The proposed loop cache includes hardware capable of dynamically unfolding loops such that large traces of instructions are accessed in a single loop cache interrogation.</font></p>     ]]></body>
<body><![CDATA[<p align="justify"><font face="verdana" size="2">We evaluate our loop cache and compare it against a baseline machine with a larger first-level instruction cache. We also consider how the loop cache can complement the introduction of a trace cache by filtering out loop traces that needlessly dominate the trace cache space. We quantify the benefits provided by a fetch engine equipped with the proposed loop cache and unrolling hardware. In our experiments we explore the design space of a loop cache and associated unfolding hardware and evaluate its efficiency in detecting independent iterations in loops in SPECint2000, MediaBench and MiBench applications. We show that trace cache efficiency and ILP can be significantly improved using our loop caching scheme. This improvement translates into a performance speedup of up to 38% when a machine with a loop cache and no trace cache is compared to a baseline machine with no loop cache. Further experiments show up to a 16% speedup on a hybrid machine with loop and trace caches compared to a machine with a larger L1 cache and a trace cache.</font></p>     <p align="justify"><font face="verdana" size="2">&nbsp;</font></p>     <p align="justify"><font face="verdana" size="2"><b>Resumen</b></font></p>     <p align="justify"><font face="verdana" size="2">Este trabajo se concentra en el an&aacute;lisis y detecci&oacute;n de bucles para incrementar el paralelismo a nivel de instrucciones a trav&eacute;s de la especulaci&oacute;n de visitas enteras a los bucles. En la tesis se comparan las t&eacute;cnicas propuestas con otras existentes y se proponen t&eacute;cnicas h&iacute;bridas que explotan las caracter&iacute;sticas ben&eacute;ficas de los mecanismos involucrados. Se lleva a cabo un estudio din&aacute;mico de las propiedades de muchos conjuntos de aplicaciones con el fin de determinar las caracter&iacute;sticas &oacute;ptimas del hardware propuesto. 
Tal incluye una memoria cache especialmente dise&ntilde;ada para el almacenamiento y manejo &oacute;ptimo de instrucciones pertenecientes a los bucles. Proveyendo miles de instrucciones para especulaci&oacute;n en la memoria cache de bucles se obtienen aceleraciones en la mayor&iacute;a de las aplicaciones con el mismo presupuesto de hardware. Se presenta de forma detallada el estudio exhaustivo de t&eacute;cnicas similares as&iacute; como los detalles del dise&ntilde;o del hardware propuesto. Se justifican cada una de las caracter&iacute;sticas basadas en estudios din&aacute;micos de las propiedades de las aplicaciones. Tambi&eacute;n se analizan posibles formas de proveer mayor ganancia en el rendimiento y se presentan alternativas de adaptaci&oacute;n del hardware en arquitecturas futuras y en procesadores comerciales existentes.</font></p>     <p align="justify"><font face="verdana" size="2">&nbsp;</font></p>     <p align="justify"><font face="verdana" size="2"><a href="/pdf/cys/v8n1/v8n1a7.pdf" target="_blank">DESCARGAR ART&Iacute;CULO EN FORMATO PDF</a></font></p>     ]]></body><back>
</back>
</article>
