Great — I’ll research how GNU wc handles each flag and edge case, including buffering, signal behavior, background/suspend handling, and file input quirks. I’ll also examine coreutils source and runtime behavior to ensure precise matching.

The goal will be to produce a fully specced plan for a Cython-friendly, modular implementation of wc with live preview updates every 200ms when output is a TTY. I’ll include design notes and annotations for portability (e.g. macOS/BSD differences), and structure the result around best practices with a top-level docstring for __doc__.

I’ll let you know when it’s ready.

Deep Dive into GNU wc (Word Count) Behavior

Overview and Default Behavior

GNU Coreutils wc is a command-line utility that counts lines, words, and bytes (and optionally characters and maximum line length) for text input. By default, running wc on one or more files prints three counts for each input: newline count, word count, and byte count, in that order. If multiple files are provided, wc also prints a final “total” line with the cumulative counts.

By default, the three counts appear in the order lines, words, bytes, each right-aligned in a column. The utility defines a “word” as a non-empty sequence of non-whitespace characters delimited by whitespace or by start/end of input. “Lines” are counted by newline characters (each '\n' increments the line count). Notably, if a file does not end with a newline, GNU wc does not count the last line as a line at all (i.e. it counts only complete lines ending in '\n'). This means a file with no newline at EOF could show 0 lines even if it contains text.
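As a rough illustration of these definitions, here is a minimal Python sketch. The function name count_lines_words_bytes is hypothetical. It assumes the whole input fits in memory and that only ASCII whitespace delimits words (which is how bytes.split() with no argument behaves), so it approximates wc in the C locale rather than reproducing its locale handling.

```python
def count_lines_words_bytes(data: bytes) -> tuple[int, int, int]:
    """Return (newlines, words, bytes) for an in-memory buffer.

    Assumption: only ASCII whitespace separates words.
    """
    lines = data.count(b"\n")   # lines are counted as newline bytes
    words = len(data.split())   # runs of non-whitespace bytes
    return lines, words, len(data)
```

A file containing only hello world with no trailing newline yields a line count of 0 here, matching the behavior described above. A streaming version would need to carry an "inside a word" flag across chunk boundaries.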

Command-Line Options and Flags

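GNU wc provides several options to control which counts are displayed. You can mix these flags to display multiple counts at once (each specified count will be shown in the fixed column order described later). The flags and their meanings are:

-c, --bytes: print the byte counts
-m, --chars: print the character counts
-l, --lines: print the newline counts
-w, --words: print the word counts
-L, --max-line-length: print the length of the longest line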

Combining options: You can specify multiple count options together to display more than one kind of count. Each option you include will add its respective column to the output (in the standard column order). For example, wc -c -w will print both the byte count and the word count. Options do not override one another except in unusual cases – GNU wc will accumulate them. (By contrast, some non-GNU implementations treat -c and -m as mutually exclusive, where one might cancel the other, but GNU does not cancel flags; all specified counts are shown.) The only effect of specifying count flags is that only those specified counts are shown (the default three are omitted unless their flags are included). If no count-selective flags are given, the default is to show lines, words, and bytes.

Input Sources: Stdin vs Files vs NUL-Lists

Single file vs multiple files: When wc is given one or more filenames as arguments, it will process each sequentially. For each input file, it produces one line of output with the counts and then the filename. If no filename is given, or if a filename is specified as “-”, wc reads from standard input (stdin). Data from stdin is treated as one continuous stream (one “file” for counting purposes). In the output, stdin data is not labeled with a name (when no filename argument was provided at all), or is labeled as “-” if you explicitly included “-” in the file list.

Using --files0-from: Instead of listing files as separate command-line arguments, you can pass a NUL-separated list of file paths via --files0-from. This is especially useful when dealing with a large number of files or filenames containing special characters. For example, one could generate a list of files with find and pipe it:

find . -name "*.[ch]" -print0 | wc -L --files0-from=-

This command finds all .c and .h files and feeds them to wc -L. Combining -print0 with --files0-from=- ensures that even files with newlines in their names are handled correctly. When using --files0-from, each file from the list is processed just like a normal argument. The option cannot be combined with file operands: if file names are also given on the command line, GNU wc reports an error and exits rather than silently ignoring them (the GNU manual syntax shows the two as mutually exclusive modes).

One thing to note is that --files0-from can read the list from standard input by specifying - as the list file. In that case, stdin is consumed for filenames and cannot simultaneously serve as a data source. If you need to count data from stdin along with other files, do not use --files0-from=-. Conversely, if the NUL-separated list is read from a regular file and contains an entry named exactly “-”, GNU wc interprets that entry as a request to read data from stdin (as if you had given a “-” file argument). When the list itself comes from stdin, an entry of “-” is rejected with an error, because stdin is already in use. The file-list case is an edge case and can be confusing if misused.
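The list format itself is simple to parse. The sketch below is a hypothetical helper (parse_files0 is not a name from coreutils) that splits a NUL-separated buffer into file names. As a simplifying assumption it raises on a zero-length name, whereas a faithful clone would need to match how GNU wc diagnoses such entries.

```python
def parse_files0(raw: bytes) -> list[bytes]:
    """Split a NUL-separated list of file names, as produced by find -print0.

    Names are kept as bytes because file names need not be valid text.
    """
    names = raw.split(b"\0")
    if names and names[-1] == b"":
        names.pop()  # drop the empty piece after the final NUL terminator
    for name in names:
        if name == b"":
            # Assumption: treat an empty name as a hard error in this sketch.
            raise ValueError("invalid zero-length file name")
    return names
```

Keeping names as bytes avoids decoding failures for file names that are not valid in the current locale; they can be passed directly to open().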

Byte Counting vs Character Counting

When counting bytes (-c) versus characters (-m), the difference becomes apparent with multibyte encodings (like UTF-8). Byte count is straightforward – it’s essentially the file size in bytes (or the total bytes read from a stream). Character count requires decoding according to the locale to correctly count multi-byte sequences as one.

GNU wc handles this by decoding the input according to the current locale’s character encoding and incrementing the character count for each valid character read. If it encounters a byte sequence that does not form a valid character in that encoding, it does not increment the character count for that sequence. Those bytes are effectively ignored in the -m tally, but they still contribute to the byte count and are treated as non-whitespace for word counting (see the note on word counting below). This keeps character counts from being inflated by error bytes counted individually. In practice, for valid UTF-8 text in a UTF-8 locale, wc -m produces the number of Unicode code points, which is less than the byte count whenever any non-ASCII characters are present. For example, for a file containing only “€” (Euro sign, 3 bytes in UTF-8), wc -c reports 3 and wc -m reports 1.

Encoding assumptions: wc uses the C library’s multibyte character handling (the locale’s MB_CUR_MAX, mbrtowc(), etc.) to decode characters. It does not attempt to guess encoding; it relies on the environment (e.g. the LANG or LC_ALL variables on Unix) being set correctly. If you run wc -m in a UTF-8 locale on a UTF-8 file, you get proper character counts. If the locale is C/POSIX or another single-byte locale (MB_CUR_MAX is 1), every byte is one character, so wc -m reports the same number as wc -c even when the file contains UTF-8 multibyte sequences. Only in a multibyte locale can wc -m be lower than wc -c.
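For a Python implementation, one way to approximate the multibyte character count is an incremental decoder that skips invalid sequences. This is a sketch under assumptions: count_chars is a hypothetical name, the encoding is passed explicitly rather than taken from the locale, and Python’s decoder may group invalid bytes differently from mbrtowc(), so counts on malformed input are not guaranteed to match GNU wc exactly.

```python
import codecs
from typing import Iterable


def count_chars(chunks: Iterable[bytes], encoding: str = "utf-8") -> int:
    """Count decodable characters in a stream of byte chunks.

    Invalid sequences are skipped (errors="ignore"), so they add to the
    byte count kept elsewhere but not to the character count.
    """
    decoder = codecs.getincrementaldecoder(encoding)(errors="ignore")
    total = 0
    for chunk in chunks:
        # The decoder buffers an incomplete trailing sequence until the
        # next chunk arrives, so characters split across reads count once.
        total += len(decoder.decode(chunk))
    total += len(decoder.decode(b"", final=True))
    return total
```

The incremental decoder matters because a fixed-size read can end in the middle of a multibyte character; decoding each chunk independently would miscount at chunk boundaries.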

Word counting and locale: The definition of what constitutes a word delimiter (whitespace) is also locale-sensitive in subtle ways. GNU wc uses the locale’s classification of whitespace characters (via iswspace or similar). In addition, it treats a few Unicode space characters, such as non-breaking spaces, as whitespace even where the locale’s classification does not, which is more intuitive. This means that in most locales, spaces, tabs, newlines, and other Unicode spaces will break words. Letters and printable symbols count as part of words. An “encoding error” byte (one that can’t form a valid character) is treated as a non-whitespace character, meaning it is counted within a “word” if it sits between whitespace regions. Essentially, invalid bytes won’t split words apart.

Lines and multibyte: Line counting (-l) is unaffected by encoding as long as newline is the single byte 0x0A, which is the case in ASCII and UTF-8. wc does not handle UTF-16 directly, so UTF-16 input would need to be converted first. In normal usage, every '\n' increments the line counter. If the last line of a file is missing a newline terminator, wc does not add an extra line count, because it literally counts newline characters rather than “lines” in a more abstract sense.

Output Formatting and Column Order

Regardless of which counts are selected, GNU wc always prints them in a specific column order: lines, words, characters, bytes, maximum-line-length, in that sequence. Only the requested fields are shown, but if multiple are requested, they appear in that relative order. For example, wc -c -l (lines and bytes) will output the line count first, then the byte count (even though -c was listed first in the command). If all options are used, the order would be: lines, words, chars, bytes, max-line-length.
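The fixed ordering can be expressed as a filter over a constant tuple. In this sketch the names COLUMN_ORDER and select_columns, and the string keys, are illustrative choices rather than anything taken from coreutils.

```python
# Output order is fixed, independent of the order of flags on the command line.
COLUMN_ORDER = ("lines", "words", "chars", "bytes", "max_line_length")


def select_columns(requested: set[str]) -> list[str]:
    """Return the columns to print, in wc's fixed output order."""
    if not requested:
        # No count flags given: fall back to the default three counts.
        requested = {"lines", "words", "bytes"}
    return [name for name in COLUMN_ORDER if name in requested]
```

With this, the flags from wc -c -l map to select_columns({"bytes", "lines"}), which returns the line column before the byte column.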

Each count is formatted as a right-aligned integer in its column. By default, wc separates columns by at least one space. The numbers are arranged so that, when possible, the digits line up in vertical columns for easy reading. File names (or the word “total”) appear at the end of the line after at least one space separating them from the last number.

Column width and alignment: GNU wc uses dynamic column sizing. It pads the numbers with spaces on the left so that every line of output, including the “total” line, uses the same field width. To choose that width, the GNU implementation performs a stat() on each input file (when possible) before reading any data. From the file sizes it can infer an upper bound on every count. For example, if a file is 1,772 bytes, the byte count is exactly 1772 (4 digits); the line count cannot exceed 1772 (if every byte were a newline), so it also needs at most 4 digits; the word count cannot exceed roughly half the size (in an extreme alternating pattern), so the file size is a safe upper bound there too. Using these sizes, wc chooses a width that can accommodate the largest possible count. If all input files are regular files that were stat’ed, wc can use a minimal width: just enough digits for the combined size of all inputs, since the “total” line must fit as well.

If any input is not a regular file (for example, data from a pipe or tty), wc cannot know the total size ahead of time. In those cases, GNU wc falls back to a default minimum field width of 7 characters for each count column. This is why wc output often shows extra padding when reading from a pipe: printf 'a b\n' | wc prints each of its three counts right-aligned in a 7-character field, whereas wc run on a small regular file with the same content prints the counts separated by single spaces.

In general, you should not assume a fixed width for wc output in scripts. The GNU manual warns that the field widths vary. Historically, many implementations used a fixed 7-column format for each of the three default counts, but GNU wc optimizes the width when possible (while still defaulting to 7 if uncertain). The only guarantee is that there will be at least one space separating fields. Also, as a GNU extension, if only one count is printed (a single input with a single selected field, for example wc -l on one file), wc does not pad it with leading spaces. It prints the number as-is (since there is nothing to align with), followed by the filename or a newline. This makes it easier to use in scripts: lines=$(wc -l < file) yields a clean number without leading blanks. With several files, wc -l still aligns the numbers in a column.
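The width rules above can be approximated with a small helper. This is a sketch built on assumptions, not a transcription of coreutils: the names number_width and format_row are hypothetical, sizes holds one entry per input (an int for a regular file, None for anything whose size is unknown), and a single shared width is used for all columns.

```python
from typing import Optional, Sequence


def number_width(sizes: Sequence[Optional[int]], single_count: bool = False) -> int:
    """Pick a field width from the input sizes gathered before reading.

    single_count: True when exactly one number will be printed
    (one input, one selected column), which is printed unpadded.
    """
    if single_count:
        return 1
    # Unknown sizes (pipes, ttys) force the traditional minimum of 7.
    minimum = 7 if any(size is None for size in sizes) else 1
    # The total line is the largest possible count, so size the field for it.
    total = sum(size for size in sizes if size is not None)
    return max(minimum, len(str(total)))


def format_row(counts: Sequence[int], width: int, name: Optional[str] = None) -> str:
    """Right-align each count in `width` columns, separated by one space."""
    fields = " ".join(f"{count:>{width}d}" for count in counts)
    return fields if name is None else f"{fields} {name}"
```

For a 4-byte regular file named f.txt containing a b and a newline, format_row([1, 2, 4], number_width([4]), "f.txt") gives 1 2 4 f.txt, while the same counts read from a pipe are formatted with width 7.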

File name formatting: After the numeric fields, wc prints the file name (if a file was an argument) or total. There is always at least one space (usually just one) between the last number and the file name. In the special case of --total=only, no “total” label is printed, and leading spaces are suppressed, so it’s just the numbers on one line.

If output is going to a terminal, wc will output a newline at the end of each line of counts (and flush on newline, as usual). If output is redirected to a file or pipe, the buffering may delay printing until all processing is done (see next section on buffering).

Buffering, Signals, and Progress Updates

Buffering behavior: Like most Unix utilities, wc uses stdio for output. When the output is a terminal (TTY), stdio typically line-buffers the output, meaning it will flush each line as it’s printed (each newline triggers a flush). Since wc prints each result followed by a newline, you normally see each line immediately on the terminal as soon as that file’s counting is done. When output is redirected (for example, piped into another command or into a file), stdio uses block buffering, which means wc might buffer multiple lines of output before writing them out – but in practice, wc will flush at the end of execution anyway, and each line of results isn’t large, so you usually get the output once wc finishes each file or the whole job. In the multiple-files case, wc prints each file’s line immediately after counting that file (it doesn’t wait to finish all files to start outputting). Thanks to the internal logic with stat, it can print with correct alignment without needing to retroactively adjust previous lines.

SIGINT (Ctrl+C): GNU wc does not have special signal handling for interruption beyond the default. If you press Ctrl+C during execution, it terminates immediately and does not print any partial results. For example, if you are piping a huge input to wc and hit Ctrl+C, wc aborts and you will likely see no output (results for files that were already finished may have been printed, but nothing for the file in progress). Because the process is killed by the signal, the shell reports an exit status of 130 (128 plus the signal number 2). In a custom Python implementation, one might consider catching SIGINT to print a summary of what was counted so far, but the actual GNU wc does not do this; it just stops.
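In Python, Ctrl+C surfaces as KeyboardInterrupt by default, which would print a traceback. A minimal sketch of a quieter entry point follows; run_cli is a hypothetical wrapper name, and returning 130 only mimics the status a shell would report for a SIGINT death rather than actually dying from the signal.

```python
from typing import Callable


def run_cli(main: Callable[[], int]) -> int:
    """Run main() and turn Ctrl+C into a quiet exit status.

    No partial counts are printed, matching wc's behavior of simply stopping.
    """
    try:
        return main()
    except KeyboardInterrupt:
        return 130  # 128 + SIGINT (2), the conventional shell-reported status
```

A script would typically end with sys.exit(run_cli(main)).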

SIGPIPE: If wc is writing output to a pipe (say you piped wc into another program) and the downstream program closes early (causing a broken pipe), wc will receive SIGPIPE. The default behavior on SIGPIPE is to terminate the process quietly. GNU wc doesn’t explicitly handle SIGPIPE; it simply dies when it tries to write and finds the pipe closed. In a Python version, you’d want to handle the BrokenPipeError by exiting cleanly (and not dumping a traceback). This is important for scripting: for instance, in wc file1 file2 | head -n1, head exits after reading the first line, so wc may get a SIGPIPE on a later write and should exit without an error message.
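One way to get that quiet exit in Python is to catch BrokenPipeError around the output path. The sketch below follows the general pattern suggested in the Python documentation; emit is a hypothetical function name, and the choice of 1 as the return status is an assumption (a process killed by SIGPIPE would be reported differently by the shell).

```python
import os
import sys
from typing import Iterable, TextIO


def emit(lines: Iterable[str], out: TextIO = sys.stdout) -> int:
    """Write result lines; return a non-zero status if the reader went away."""
    try:
        for line in lines:
            out.write(line + "\n")
        out.flush()
    except BrokenPipeError:
        if out is sys.stdout:
            # Point stdout at /dev/null so the interpreter's flush at exit
            # does not raise a second BrokenPipeError.
            try:
                devnull = os.open(os.devnull, os.O_WRONLY)
                os.dup2(devnull, sys.stdout.fileno())
            except (OSError, ValueError):
                pass
        return 1
    return 0
```

An alternative is to restore the default SIGPIPE disposition at startup with the signal module, which makes the process die from the signal as a C program would; that option is only available on POSIX systems.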

Suspension (Ctrl+Z): Stopping the process (SIGTSTP) and later resuming (fg) doesn’t affect wc’s logic – it will continue where it left off. There’s no special handling needed; the OS will freeze and thaw the process state.

TTY-based live progress (flushing every 200ms): By default, wc does not print any progress updates while counting; it only prints final results. However, a custom implementation might introduce a feature to display live progress when working with a large input interactively. For example, if output is a TTY, the program could periodically output the current counts (perhaps overwriting the line or on a new line) so the user can see the count rising in real time. This is not a standard feature of GNU wc, but it’s a possible extension.

In summary, standard wc doesn’t do periodic flushing for progress – but a custom implementation can incorporate this as a non-standard enhancement. If doing so, make sure to throttle the updates (200 ms is reasonable) to avoid flooding output, and only do it when the output is interactive.
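A throttled, TTY-only progress display could be sketched as follows. Everything here is hypothetical and not part of GNU wc: the ProgressThrottle class, the show_progress function, and the choice of a carriage return to overwrite the current line. The clock is injectable so the throttle can be tested without sleeping.

```python
import sys
import time
from typing import Callable, Optional, Sequence, TextIO


class ProgressThrottle:
    """Allow an update at most once per `interval` seconds."""

    def __init__(self, interval: float = 0.2,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.interval = interval
        self.clock = clock
        self._last: Optional[float] = None

    def ready(self) -> bool:
        now = self.clock()
        if self._last is None or now - self._last >= self.interval:
            self._last = now
            return True
        return False


def show_progress(throttle: ProgressThrottle, counts: Sequence[int],
                  stream: TextIO = sys.stdout) -> bool:
    """Redraw the running counts on one line; return True if drawn."""
    # Only draw on an interactive terminal, and only when the throttle allows.
    if not stream.isatty() or not throttle.ready():
        return False
    stream.write("\r" + " ".join(str(count) for count in counts))
    stream.flush()
    return True
```

The read loop would call show_progress after each chunk; the check is cheap, so calling it often is fine. Using time.monotonic rather than time.time keeps the interval stable if the system clock is adjusted.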

Structuring the Implementation (for Python/Cython)

When writing a Python program to mimic wc exactly, it’s important to structure the code for correctness, performance, and maintainability, especially if a future rewrite in Cython (or another low-level approach) is planned. A good starting point is a module-level docstring that doubles as the usage text:

__doc__ = """Usage: wc [OPTION]... [FILE]...
Print newline, word, and byte counts for each FILE, and a total line if more than one FILE is specified.
With no FILE, or when FILE is -, read standard input.

  -c, --bytes            print the byte counts
  -m, --chars            print the character counts
  -l, --lines            print the newline counts
  -w, --words            print the word counts
  -L, --max-line-length  print the length of the longest line
      --files0-from=F    read input from the files specified by NUL-terminated names in file F;
                         If F is - then read names from standard input
      --total=WHEN       when to print a line with total counts: auto, always, only, never
"""

This mirrors the GNU help text (you can adjust wording to avoid any copyright issues, but the idea is to clearly document the options). Having this as __doc__ means the usage text is shown by help(your_wc_module), and the program can print the same string when it is run with --help.

Platform Differences and Compatibility

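GNU wc (part of GNU coreutils) has some behaviors and extensions that may not be present in other systems’ versions of wc. As noted in the sections above, these include the dynamic column widths (other implementations have historically used fixed 7-character fields), the unpadded output when only one count is printed, and the way -c and -m accumulate rather than cancel each other. Long options such as --files0-from and --total are GNU-specific and are not part of POSIX, so scripts that must run on macOS or BSD should not rely on them.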

By structuring the program with clear modular functions and careful attention to these details, the resulting Python (and future Cython) implementation will closely emulate GNU wc in behavior and output. Comments should indicate any deliberate differences or tricky aspects (especially those involving locale or platform-specific behaviors). The end result will be a robust wc clone in Python, with a well-documented codebase ready for optimization or conversion to C/Cython as needed.