An Algorithm for Differential File Comparison 论文

2008引用 301
Algorithms and Data CompressionAdvanced Data Storage TechnologiesParallel Computing and Optimization Techniques

摘要

The program diff reports differences between two files, expressed as a minimal list of line changes to bring either file into agreement with the other. Diff has been engineered to make efficient use of time and space on typical inputs that arise in vetting version-to-version changes in computer-maintained or computer-generated documents. Time and space usage are observed to vary about as the sum of the file lengths on real data, although they are known to vary as the product of the file lengths in the worst case. The central algorithm of diff solves the ‘longest common subsequence problem’ to find the lines that do not change between files. Practical efficiency is gained by attending only to certain critical ‘candidate’ matches between the files, the breaking of which would shorten the longest subsequence common to some pair of initial segments of the two files. Various techniques of hashing, presorting into equivalence classes, merging by binary search, and dynamic storage allocation are used to obtain good performance. [This document was scanned from Bell Laboratories Computing Science Technical Report #41, dated July 1976. Te xt was converted by OCR and hand-corrected (last changed June, 2012). Figures were reconstructed. Some OCR errors may remain, especially in tables and equations. Please report them to doug@cs.dartmouth.edu.] The program diff creates a list of what lines of one file have to be changed to bring it into agreement with a second file or vice versa. It is based on ideas from several sources[1,2,7,8]. As an example of its work, consider the two files, listed horizontally for brevity: a b c d e f g w a b x y z e It is easy to see that the first file can be made into the second by the following prescription, in which an imaginary line 0 is understood at the beginning of each: append after line 0: w, change lines 3 through 4, which were: c d into: x y z, delete lines 6 through 7, which were: f g. Going the other way, the first file can be made from the second this way: delete line 1, which was: w, change lines 4 through 6, which were: x y z into: c d, append after line 7: f g. Delete, change and append are the only operations available to diff . It indicates them by 1-letter