Clarify this sentence.

Bradley M. Kuhn 2016-08-09 11:38:50 -07:00
parent 73f0b434ca
commit c41c973f99

@@ -10,7 +10,7 @@
<p>Software is often modified in various ways; indeed, Linux developers form a community that encourages and enables modification by many parties. Given this development model, communities often find it valuable to determine when software source code moves from one place to another with only minor modifications. Various scientifically-vetted techniques can be used to identify &quot;clones&quot; -- a portion of code that is substantially similar to pre-existing source code. The specific area of academic research is called &quot;code cloning detection&quot; or &quot;code duplication detection&quot;. The area has been under active research since the mid-1990s <a href="#fn1" class="footnoteRef" id="fnref1"><sup>1</sup></a>. In 2002, Japanese researchers published a tool called CCFinder <a href="#fn2" class="footnoteRef" id="fnref2"><sup>2</sup></a>, which, in its updated incarnation (called CCFinderX), is widely used and referenced by academic researchers in the field <a href="#fn3" class="footnoteRef" id="fnref3"><sup>3</sup></a> and has specifically been used to explore reuses of code in GPL'd software such as Linux <a href="#fn4" class="footnoteRef" id="fnref4"><sup>4</sup></a>.</p>
<p>CCFinderX uses a token-based clone detection method and a suffix-tree matching algorithm; both techniques have been highly vetted and considered in the academic literature. The techniques are considered viable and useful in detecting clones. Many academic papers on the subject have been peer-reviewed and published, and nearly every newly published paper compares its new techniques of clone detection to the seminal results found by CCFinderX. For purposes of our analysis, we have therefore chosen to use CCFinderX. These results can be easily reproduced since CCFinderX is, itself, also Open Source software.</p>
<h1 id="establishing-a-baseline-of-the-ccfinderx-tool">Establishing A Baseline of the CCFinderX Tool</h1>
-<p>CCFinderX offers many statistics for clone detection. After expert analysis, we concluded that most relevant to this situation is the &quot;ratio of similarity&quot; between the existing code and the new code. To establish a baseline, we considered two different comparisons of Free and Open Source Software (FOSS). First, we compared the Linux kernel, Version 4.5.2, to the FreeBSD kernel, Version 10.3.0. This comparison was inspired by the similar 2002 study <a href="#fn5" class="footnoteRef" id="fnref5"><sup>5</sup></a> of these two large C programs. The hypothesis remained that CCFinderX would encounter a small percentage of code similarity, since the FreeBSD and Linux projects collaborate on some subprojects and willingly share code under the 3-Clause BSD license for those parts. (These collaborations are public and well-documented.)</p>
+<p>CCFinderX offers many statistics for clone detection. After expert analysis, we concluded that most relevant to this situation is the &quot;ratio of similarity&quot; between the existing code and the new code. To establish a baseline, we considered two different comparisons of Free and Open Source Software (FOSS). First, we compared the Linux kernel, Version 4.5.2, to the FreeBSD kernel, Version 10.3.0. This comparison was inspired by the similar 2002 study <a href="#fn5" class="footnoteRef" id="fnref5"><sup>5</sup></a> of these two large C programs. The hypothesis remained that CCFinderX would encounter a low but significant percentage of code similarity, since the FreeBSD and Linux projects collaborate on some subprojects and willingly share code under the 3-Clause BSD license for those parts. (These collaborations are public and well-documented.)</p>
<p>The experiment confirmed the hypothesis. We found a 3.68% &quot;ratio of similarity&quot; when comparing code from Linux to the FreeBSD kernel.</p>
<p>Next, we compared the source code of the Linux Kernel 4.5.2 to the LLVM+Clang system, version 3.8.0. These two projects are each a large program written in the C programming language, but they are not known to actively share code. We would expect some very minimal similarity simply due to chance, but something much lower than the 3.68% found between Linux and FreeBSD's kernel.</p>
<p>Indeed, when the same test was run to compare Linux to the LLVM+Clang system, the &quot;ratio of similarity&quot; was 0.075%.</p>
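
To make the idea of a token-based "ratio of similarity" concrete, here is a toy sketch in Python. It is not CCFinderX and does not reproduce its metric, whose exact definition the text above does not give. The sketch assumes a crude regex tokenizer, replaces identifiers and numbers with placeholders, and matches fixed-size token windows using a set lookup as a simple stand-in for suffix-tree matching. The names `tokenize`, `similarity_ratio`, and the `window` parameter are hypothetical.

```python
import re

# Crude C-like tokenizer: identifiers/keywords, integer runs, or any single symbol.
TOKEN_RE = re.compile(r"[A-Za-z_]\w*|\d+|\S")

# Small illustrative keyword list (an assumption, not the full C keyword set).
KEYWORDS = {"if", "else", "for", "while", "return", "int", "char", "void"}


def tokenize(source):
    """Split source into tokens, replacing identifiers and numbers with placeholders."""
    tokens = []
    for tok in TOKEN_RE.findall(source):
        if tok in KEYWORDS:
            tokens.append(tok)
        elif tok[0].isalpha() or tok[0] == "_":
            tokens.append("ID")   # renamed variables still match
        elif tok[0].isdigit():
            tokens.append("NUM")  # changed constants still match
        else:
            tokens.append(tok)
    return tokens


def similarity_ratio(source_a, source_b, window=8):
    """Fraction of tokens in source_a covered by a `window`-token run also found in source_b."""
    a, b = tokenize(source_a), tokenize(source_b)
    if len(a) < window:
        return 0.0
    # Every run of `window` consecutive tokens that occurs anywhere in b.
    b_windows = {tuple(b[i:i + window]) for i in range(len(b) - window + 1)}
    covered = [False] * len(a)
    for i in range(len(a) - window + 1):
        if tuple(a[i:i + window]) in b_windows:
            for j in range(i, i + window):
                covered[j] = True
    return sum(covered) / len(a)
```

Because identifiers and constants are normalized before matching, two functions that differ only in variable names score as fully similar under this sketch, which is the kind of "minor modification" that token-based detection is meant to see through. Unrelated code sharing no run of `window` tokens scores zero.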