Search and Replace
Search and Replace is one of the most common operations that computers automate.
Searching and replacing across hundreds, thousands or even millions of documents is not practical for a human. The time required, the errors introduced and the replacements missed make it essential to automate the whole process with a suitable set of tools.
Commonly an 'exact search' is required, whereby the text is found exactly as entered. However, to search for special characters such as carriage returns, line feeds and tabs, the tool must accept special 'escape codes' such as \r, \n and \t, which are interpreted specially. A literal backslash is then entered as \\.
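As a minimal sketch of how a tool might interpret these escape codes before running an exact search, the Python below translates \r, \n, \t and \\ into the characters they stand for. The names unescape and exact_replace are hypothetical and do not come from any particular product.

```python
# Escape codes the search box understands, mapped to the real characters.
ESCAPES = {"r": "\r", "n": "\n", "t": "\t", "\\": "\\"}


def unescape(text: str) -> str:
    """Turn the two-character codes \\r, \\n, \\t and \\\\ into single characters."""
    out = []
    i = 0
    while i < len(text):
        ch = text[i]
        if ch == "\\" and i + 1 < len(text) and text[i + 1] in ESCAPES:
            out.append(ESCAPES[text[i + 1]])
            i += 2  # skip the backslash and the code letter
        else:
            out.append(ch)  # ordinary character, or an unknown escape left as-is
            i += 1
    return "".join(out)


def exact_replace(document: str, search: str, replacement: str) -> str:
    """Exact (non-regex) replace after interpreting escape codes in both strings."""
    return document.replace(unescape(search), unescape(replacement))


# The user types a backslash followed by 't'; the tool searches for a real tab.
print(exact_replace("name\tvalue", r"\t", ", "))  # prints "name, value"
```

Unknown escapes are passed through unchanged here; a real tool might instead report them as errors.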
Searching on its own has driven the creation of regular expressions. These are 'mini-programs' that look through text for a 'pattern' to match. Common examples used in search/replace tools are Perl regular expressions and EasyPattern regular expressions.
Very often the found text is not simply swapped for fixed text: parts of it have to be re-arranged or combined with other text. A very common example is reformatting dates from US to EU format - mm/dd/yyyy to dd-mm-yyyy. To do this, parts of the found text have to be 'captured' and then substituted into the replacement string. Usually the fragments of captured text are stored in 'macros' or 'variables' named $1, $2, $3 etc, with $0 representing the entire match.
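The date example can be sketched with Python's standard re module. Note that Python writes captured groups in the replacement string as \1, \2, \3 (or \g<1>) rather than $1, $2, $3, and the whole match as \g<0>. The function name us_to_eu_dates is hypothetical, and the pattern assumes two-digit months and days and a four-digit year.

```python
import re

# Group 1 = month, group 2 = day, group 3 = year.
US_DATE = re.compile(r"\b(\d{2})/(\d{2})/(\d{4})\b")


def us_to_eu_dates(text: str) -> str:
    """Rewrite mm/dd/yyyy as dd-mm-yyyy by re-ordering the captured groups."""
    return US_DATE.sub(r"\2-\1-\3", text)


print(us_to_eu_dates("Invoice due 12/31/2024"))  # prints "Invoice due 31-12-2024"
```

The pattern does not check that the month or day is in range, so a value such as 99/99/2024 would also be re-arranged; a production pattern would normally be stricter.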
Search and Replace Tools - Binary Data Formats
The replace tools used differ depending on the format of the files being processed. Often, as with Microsoft products, the actual data is stored in a compressed binary format that is incomprehensible to humans, but is fast and small to store. Modifying these files naively can easily result in corrupted Word documents, although if the search and replacement text have identical lengths the edit may happen to work through blind luck. Some popular tools work around this problem by interfacing directly with the native Microsoft application, which avoids data loss and corruption.
Search and Replace Tools - Windows, Mac and Unix Line Feeds
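Plain text files (i.e. those without formatting, fonts, tables, shading etc) can end each line with different ASCII control codes: Windows uses a carriage return followed by a line feed (\r\n), Unix uses a line feed alone (\n), and classic Mac OS used a carriage return alone (\r). Search and replace tools need to work with all three.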
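One possible way to make a multi-line search independent of the line-ending style is to normalise every line ending to \n, run the replacement, and then restore the style the file originally used. The sketch below assumes the file uses one style throughout (the first line ending found is taken as the file's style); the function names are hypothetical.

```python
import re

# Matches a Windows (\r\n), classic Mac (\r) or Unix (\n) line ending.
# \r\n is listed first so it is never split into two separate endings.
_ANY_NEWLINE = re.compile(r"\r\n|\r|\n")


def detect_newline(text: str) -> str:
    """Return the first line ending found, or '\\n' if the text has none."""
    match = _ANY_NEWLINE.search(text)
    return match.group(0) if match else "\n"


def replace_any_newlines(text: str, search: str, replacement: str) -> str:
    """Exact replace where '\\n' in search/replacement means 'any line ending'."""
    original = detect_newline(text)
    normalized = _ANY_NEWLINE.sub("\n", text)
    result = normalized.replace(search, replacement)
    # Assumption: the whole file is written back using its first-seen style.
    return result.replace("\n", original)
```

When reading real files in Python, opening them with newline="" keeps the original line endings intact so that they can be detected in this way.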
Search and Replace Tools - Mainframe format
Files that come from a Mainframe are usually encoded in EBCDIC, which is an alternative to ASCII. If Mainframe files have already been converted to ASCII, any numbers stored in packed formats will have been corrupted by the conversion, so it is essential to use a tool that operates on the original EBCDIC data.
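To illustrate why packed numbers need separate handling, the sketch below works on a single fixed-layout record: only the text field is decoded from EBCDIC, searched and re-encoded, while the packed-decimal bytes are copied through untouched. The record layout, the choice of Python's cp037 codec (one of several EBCDIC code pages), the sign-nibble convention and the function names are all assumptions made for illustration.

```python
CODEC = "cp037"  # assumption: EBCDIC US/Canada; real files may use another code page


def unpack_comp3(field: bytes) -> int:
    """Decode a packed-decimal field: two digits per byte, sign in the last nibble."""
    digits = []
    for byte in field[:-1]:
        digits.append(byte >> 4)
        digits.append(byte & 0x0F)
    digits.append(field[-1] >> 4)
    sign_nibble = field[-1] & 0x0F
    value = int("".join(str(d) for d in digits))
    # Assumption: 0xD marks a negative number, anything else is non-negative.
    return -value if sign_nibble == 0x0D else value


def replace_text_field(record: bytes, text_len: int, search: str, replacement: str) -> bytes:
    """Replace inside the leading EBCDIC text field; leave the remaining bytes as-is."""
    text = record[:text_len].decode(CODEC)
    # Keep the field at its fixed width by truncating or padding with spaces.
    text = text.replace(search, replacement)[:text_len].ljust(text_len)
    return text.encode(CODEC) + record[text_len:]


# A 5-character text field followed by +12345 in packed decimal (0x12 0x34 0x5C).
record = "HELLO".encode(CODEC) + bytes([0x12, 0x34, 0x5C])
updated = replace_text_field(record, 5, "HELLO", "HOWDY")
print(updated[:5].decode(CODEC), unpack_comp3(updated[5:]))  # prints "HOWDY 12345"
```

Had the whole record been converted to ASCII first, the three packed bytes would have been treated as characters and the number could no longer be recovered reliably.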
Search and Replace Tools - Data Size
Many tools rely on loading an entire file into main memory before processing it. Even as memory sizes grow, the text files we process seem to grow to match. Loading huge files into memory is a naive approach, and can slow a computer to a crawl or even crash it. Tools such as TextPipe Pro work around this issue by processing files in large chunks.
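Chunked processing has one subtlety: a match can straddle the boundary between two chunks. The sketch below shows one way to handle this for an exact search, by holding back the last len(search) - 1 characters of each chunk until the next chunk arrives. It is an illustration only and is not a description of how TextPipe Pro or any other product works; the function name and default chunk size are arbitrary.

```python
import io


def replace_in_chunks(src, dst, search: str, replacement: str,
                      chunk_size: int = 1 << 20) -> None:
    """Stream text from src to dst, replacing search without loading the whole file."""
    if not search:
        raise ValueError("search text must not be empty")
    keep = len(search) - 1  # longest tail that could be the start of a match
    buffer = ""
    while True:
        chunk = src.read(chunk_size)
        if not chunk:
            break
        buffer += chunk
        out = []
        pos = 0
        while True:
            hit = buffer.find(search, pos)
            if hit == -1:
                break
            out.append(buffer[pos:hit])
            out.append(replacement)
            pos = hit + len(search)
        # Write everything except a tail that might begin a match in the next chunk.
        safe_end = max(pos, len(buffer) - keep)
        out.append(buffer[pos:safe_end])
        dst.write("".join(out))
        buffer = buffer[safe_end:]
    dst.write(buffer)  # the final tail is too short to contain a match


# Tiny chunks force matches to straddle chunk boundaries.
source = io.StringIO("foo bar foo baz foo")
target = io.StringIO()
replace_in_chunks(source, target, "foo", "qux", chunk_size=4)
print(target.getvalue())  # prints "qux bar qux baz qux"
```

With real files, src and dst would be file objects opened in text mode. Memory use then depends on the chunk size and the length of the search text rather than on the size of the file.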