== Pipelines in command line interfaces ==
{{anchor|pipe character}}
All widely used Unix shells have a special syntax construct for the creation of pipelines. To create a pipeline, one writes the commands in sequence, separated by the [[ASCII]] [[vertical bar]] character <code>|</code> (which, for this reason, is often called the "pipe character"). The shell starts the processes and arranges for the necessary connections between their standard streams (including some amount of [[buffer (computer science)|buffer]] storage).

The pipeline uses [[anonymous pipe]]s. For anonymous pipes, data written by one process is buffered by the operating system until it is read by the next process, and this uni-directional channel disappears when the processes are completed; this differs from [[named pipe]]s, where messages are passed to or from a pipe that is named by making it a file, and which remains after the processes are completed.

The standard [[Shell (computing)|shell]] syntax for [[anonymous pipe]]s is to list multiple commands, separated by [[vertical bar]]s ("pipes" in common Unix verbiage):

<syntaxhighlight lang="bash" style="width: 50%">command1 | command2 | command3</syntaxhighlight>

For example, to list files in the current directory ({{mono|[[ls]]}}), retain only the lines of {{mono|ls}} output containing the string {{mono|"key"}} ({{mono|[[grep]]}}), and view the result in a scrolling page ({{mono|[[Less (Unix)|less]]}}), a user types the following into the command line of a terminal:

<syntaxhighlight lang="bash" style="width: 50%">ls -l | grep key | less</syntaxhighlight>

The command <code>ls -l</code> is executed as a process, the output (stdout) of which is piped to the input (stdin) of the process for <code>grep key</code>; and likewise for the process for <code>less</code>. Each [[process (computing)|process]] takes input from the previous process and produces output for the next process via ''[[standard streams]]''. Each <code>|</code> tells the shell to connect the standard output of the command on the left to the standard input of the command on the right by an [[inter-process communication]] mechanism called an [[anonymous pipe|(anonymous) pipe]], implemented in the operating system. Pipes are unidirectional; data flows through the pipeline from left to right.<!-- Shouldn't this be in the shell article? As with all shell commands, a command line can be extended over multiple physical lines by using a '\' character before the newline. -->

=== Example ===
Below is an example of a pipeline that implements a kind of [[spell checker]] for the [[World Wide Web|web]] resource indicated by a [[Uniform Resource Locator|URL]]. An explanation of what it does follows.

<syntaxhighlight lang="bash" line="">
curl 'https://en.wikipedia.org/wiki/Pipeline_(Unix)' |
sed 's/[^a-zA-Z ]/ /g' |
tr 'A-Z ' 'a-z\n' |
grep '[a-z]' |
sort -u |
comm -23 - <(sort /usr/share/dict/words) |
less
</syntaxhighlight>

# '''<code>[[CURL|curl]]</code>''' obtains the [[HTML]] contents of a web page (could use <code>[[wget]]</code> on some systems).
# '''<code>[[sed]]</code>''' replaces all characters (from the web page's content) that are not spaces or letters with spaces. ([[Newline]]s are preserved.)
# '''<code>[[tr (program)|tr]]</code>''' changes all of the uppercase letters into lowercase and converts the spaces in the lines of text to newlines (each 'word' is now on a separate line).
# '''<code>[[grep]]</code>''' includes only lines that contain at least one lowercase [[alphabetical]] character (removing any blank lines).
# '''<code>[[Sort (Unix)|sort]]</code>''' sorts the list of 'words' into alphabetical order, and the <code>-u</code> switch removes duplicates.
# '''<code>[[comm (Unix)|comm]]</code>''' finds lines in common between two files; <code>-23</code> suppresses the lines unique to the second file and those common to both, leaving only the lines found solely in the first file named. The <code>-</code> in place of a filename causes <code>comm</code> to use its standard input (from the pipeline in this case). <code>sort /usr/share/dict/words</code> sorts the contents of the <code>words</code> file alphabetically, as <code>comm</code> expects, and <code><( ... )</code> presents the result as a temporary, file-like argument (via [[process substitution]]), which <code>comm</code> reads. The result is a list of words (lines) that are not found in <code>/usr/share/dict/words</code>.
# '''<code>[[less (Unix)|less]]</code>''' allows the user to page through the results.

===Error stream===
By default, the [[standard error stream]]s ("[[stderr]]") of the processes in a pipeline are not passed on through the pipe; instead, they are merged and directed to the [[system console|console]]. However, many shells have additional syntax for changing this behavior. In the [[C shell|csh]] shell, for instance, using <code>|&</code> instead of <code>|</code> signifies that the standard error stream should also be merged with the standard output and fed to the next process. The [[Bash (Unix shell)|Bash]] shell can also merge standard error with <code>|&</code> since version 4.0<ref>{{cite web |title=Bash release notes |url=https://tiswww.case.edu/php/chet/bash/NEWS |access-date=2017-06-14 |website=tiswww.case.edu}}</ref> or using <code>2>&1</code>, as well as redirect it to a different file.
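For example, the following Bash fragments illustrate these variants (a minimal sketch; <code>command1</code> stands in for any command that writes to both streams, and <code>errors.log</code> is an arbitrary file name):

<syntaxhighlight lang="bash">
# Default: only stdout enters the pipe; stderr still goes to the console
command1 | less

# Bash 4.0+: send stderr through the pipe along with stdout
command1 |& less

# Portable Bourne-style equivalent: duplicate stderr onto stdout before piping
command1 2>&1 | less

# Keep stderr out of the pipe by redirecting it to a separate file
command1 2> errors.log | less
</syntaxhighlight>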
===Pipemill=== <!-- section header used in redirect -->
In the most commonly used simple pipelines the shell connects a series of sub-processes via pipes, and executes external commands within each sub-process. Thus the shell itself is doing no direct processing of the data flowing through the pipeline. However, it is possible for the shell to perform processing directly, using a so-called '''mill''' or '''pipemill''' (since a <code lang="bash">while</code> command is used to "mill" over the results from the initial command). This construct generally looks something like:

<syntaxhighlight lang="bash">
command | while read -r var1 var2 ...; do
    # process each line, using variables as parsed into var1, var2, etc
    # (note that this may be a subshell: var1, var2 etc will not be available
    # after the while loop terminates; some shells, such as zsh and newer
    # versions of Korn shell, process the commands to the left of the pipe
    # operator in a subshell)
done
</syntaxhighlight>

Such a pipemill may not perform as intended if the body of the loop includes commands, such as <code>cat</code> and <code>ssh</code>, that read from <code>[[stdin]]</code>:<ref>{{cite web |date=6 March 2012 |title=Shell Loop Interaction with SSH |url=http://72.14.189.113/howto/shell/while-ssh/ |url-status=dead |archive-url=https://web.archive.org/web/20120306135439/http://72.14.189.113/howto/shell/while-ssh/ |archive-date=6 March 2012}}</ref> on the loop's first iteration, such a program (let's call it ''the drain'') will read the remaining output from <code>command</code>, and the loop will then terminate (with results depending on the specifics of the drain). There are a couple of possible ways to avoid this behavior. First, some drains support an option to disable reading from <code>stdin</code> (e.g. <code>ssh -n</code>). Alternatively, if the drain does not ''need'' to read any input from <code>stdin</code> to do something useful, it can be given <code>< /dev/null</code> as input.

As all components of a pipeline are run in parallel, a shell typically forks a subprocess (a subshell) to run the while loop, making it impossible to propagate variable changes made inside the loop to the outer shell environment. To remedy this issue, the "pipemill" can instead be fed from a [[here document]] containing a [[command substitution]], which waits for the pipeline to finish running before milling through the contents. Alternatively, a [[named pipe]] or a [[process substitution]] can be used for parallel execution. [[GNU bash]] also has a {{code|lastpipe}} option to disable forking for the last pipe component.<ref>{{cite web |author=John1024 |title=How can I store the "find" command results as an array in Bash |url=https://stackoverflow.com/a/23357277 |website=Stack Overflow}}</ref>
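A rough sketch of two of those alternatives in Bash (with <code>command</code> again standing in for any command that produces lines of output):

<syntaxhighlight lang="bash">
# Feeding the loop through process substitution keeps the loop in the
# current shell, so "count" is still visible after the loop ends.
count=0
while read -r line; do
  count=$((count + 1))
done < <(command)
echo "Lines read: $count"

# Alternatively, Bash's lastpipe option runs the last element of a pipeline
# in the current shell rather than a subshell. It only takes effect when
# job control is off, e.g. in a non-interactive script.
shopt -s lastpipe
count=0
command | while read -r line; do
  count=$((count + 1))
done
echo "Lines read: $count"
</syntaxhighlight>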