Monday, November 15, 2021

[SOLVED] How is line buffering implemented for C stdio input streams?

Issue

I understand that fully buffered input can be implemented by issuing a single read syscall for a block of data possibly larger than required by the application. But I don't understand how line buffering could ever be applied to input without support from the kernel. I imagine one would have to read a block of data and then look for newlines, but if so, what is the difference with full buffering?


To be more specific:

Suppose I have an input stream FILE* in. Is there any difference between the following, with regard to how the stdio library will retrieve bytes from the operating system to fill its buffer?

  • Line buffering: setvbuf(in, NULL, _IOLBF, BUFSIZ)
  • Full buffering: setvbuf(in, NULL, _IOFBF, BUFSIZ)

If so, what is that difference?


Solution

Every FILE stream has an internal buffer. After fopen, the first fread, fgets, etc. causes the stdio layer to populate that buffer with a read(2) call.

When you do fgets, it will copy data to your buffer, pulling it from the internal buffer [until newline is found]. If no newline is found, the stream internal buffer is replenished with another read(2) call. Then, the scan for newline and fill of your buffer continues.

This can repeat a number of times [particularly true if you're doing fread]. Whatever is left over is available for the next stream read operation (e.g. fread, fgets, fgetc).

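You can set the size of the stream buffer with setvbuf [setlinebuf only switches the stream to line buffering and takes no size], and it must be called before any other operation on the stream. For efficiency, the typical default size is the machine page size [IIRC].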

So, the stream buffer "stays one step ahead of you", so to speak. It operates much like a ring queue [in effect, if not actuality].


Dunno for sure, but line buffering [or any buffering mode] is mostly meaningful for output files (e.g. it is the default for stdout when stdout is a terminal). It says, if you see a newline, do an implied fflush. Full buffering means do the fflush when the buffer is full. Unbuffered means do the fflush on every write.

If you open an output logfile, you get full buffering [most efficient], so if your program crashes, you might not get the last N lines output (i.e. they're still pending in the buffer). You can set line buffering so you get the last trace line after a program crash.

On input, line buffering doesn't have any meaning for a file [AFAICT]. It just tries to use the most efficient size possible (e.g. the stream buffer size).

I think that the important point is that, on input, you don't know where the newline is beforehand, so _IOLBF operates like full buffering--because it has to. That is, the stream does a read(2) of up to the stream buffer size (or the amount needed to fulfill the outstanding fread). In other words, the only things that matter are the internal buffer size and the size/count parameters of the fread, not whether the mode is line or full buffering.


For a TTY device (e.g. stdin), the stream will wait for newline [unless you use tcsetattr or a terminal ioctl on the underlying fildes (e.g. 0) to set char-at-a-time aka raw mode], regardless of the stream mode. That's because the TTY canonical processing layer [in the kernel] holds up the read until a full line is available (e.g. that's why you can type backspace, etc. without the application having to deal with it).

This is also why fgets on a TTY device/stream behaves as expected without any special treatment inside stdio. The stream simply issues a read(2) for up to its buffer size, and in canonical mode the TTY layer completes that read as soon as one full line is available, returning at most that line no matter how many bytes were requested. fgets then finds the newline in the stream buffer and returns. So the read does not block waiting for 4096 bytes if the user entered 10 chars + newline.


UPDATE:

To answer your second round of followup questions:

"I see the tty subsystem and the stdio code running in the process as completely independent. The only way they interface is by the process issuing read syscalls; these may block or not, and this is what depends on the tty settings."

Normally, that is true. Most applications do not try to adjust the TTY layer settings. An app can do so if it wishes, but not via any stream/stdio functions.

"But the process is completely unaware of those settings and can't change them."

Again, normally true. But, again, the process can change them.

"If we're on the same page, what you're saying implies that a setvbuf call will change the buffering policy of the tty device, and I find that hard to reconcile with my understanding of Unix I/O."

No. setvbuf only sets the stream buffer size and policy. It has nothing to do with the kernel at all. The kernel only sees read(2) and has no idea whether the app issued it directly or whether the stream did it via fread [or fgets]. It does not affect the TTY layer in any way.

In a normal app that is looping on fgetc, if the user inputs abcdef\n, the first fgetc will block [in the driver] until the newline is entered. This is the TTY canonical processing layer doing this. Then, when the newline is entered, the read(2) done by that fgetc will return with the value of 7. The first fgetc will return and the remaining six will occur rapidly, being fulfilled from the stream's internal buffer.

However ...

More sophisticated apps may change the TTY layer policy via ioctl(fileno(stdin),TIOC*,...). The stream will not be aware of this, so when doing so, one must be careful. Thus, if a process wants, it can fully control the TTY layer behind the file unit, but it must do so manually via the ioctl [or via tcsetattr, as in the sample program below].

Modifying [or even disabling] TTY canonical processing [aka "TTY raw mode"] via the ioctl is what applications do when they need true char-at-a-time input. For example, vim, emacs, getkey, etc.

While an application can intermix raw mode and a stdio stream [and do so effectively], the normal usage is to either use streams in their normal mode/usage or bypass the stdio layer entirely, do ioctl(0,TIOC*,...) and then do read(2) directly.


Here's a sample getkey program:

// getkey -- wait for user input

#include <stdio.h>
#include <fcntl.h>
#include <termios.h>
#include <stdlib.h>
#include <unistd.h>
#include <time.h>
#include <string.h>
#include <errno.h>

// report a fatal error on stderr and exit
#define sysfault(...) \
    do { \
        fprintf(stderr,__VA_ARGS__); \
        exit(1); \
    } while (0)

int
main(int argc,char **argv)
{
    int fd;
    int remain;
    int err;
    int oflag;
    int stdflg;
    char *cp;
    struct termios tiold;
    struct termios tinew;
    int len;
    int flag;
    char buf[1];
    int code;

    --argc;
    ++argv;

    stdflg = 0;

    for (;  argc > 0;  --argc, ++argv) {
        cp = *argv;
        if (*cp != '-')
            break;

        switch (cp[1]) {
        case 's':
            stdflg = 1;
            break;
        }
    }

    printf("using %s\n",stdflg ? "fgetc" : "read");

    fd = fileno(stdin);

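    // make reads on stdin non-blocking so the countdown loop can poll for a
    // key once per second instead of waiting for one
    oflag = fcntl(fd,F_GETFL);
    fcntl(fd,F_SETFL,oflag | O_NONBLOCK);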

    err = tcgetattr(fd,&tiold);
    if (err < 0)
        sysfault("getkey: tcgetattr failure -- %s\n",strerror(errno));

    tinew = tiold;

#if 1
    tinew.c_iflag &= ~(IGNBRK | BRKINT | PARMRK | ISTRIP |
        INLCR | IGNCR | ICRNL | IXON);
    tinew.c_oflag &= ~OPOST;
    tinew.c_lflag &= ~(ECHO | ECHONL | ICANON | ISIG | IEXTEN);
    tinew.c_cflag &= ~(CSIZE | PARENB);
    tinew.c_cflag |= CS8;

#else
    cfmakeraw(&tinew);
#endif

#if 0
    tinew.c_cc[VMIN] = 0;
    tinew.c_cc[VTIME] = 0;
#endif

    err = tcsetattr(fd,TCSAFLUSH,&tinew);
    if (err < 0)
        sysfault("getkey: tcsetattr failure -- %s\n",strerror(errno));

    for (remain = 9;  remain > 0;  --remain) {
        printf("\rHit any key within %d seconds to abort ...",remain);
        fflush(stdout);

        sleep(1);

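        if (stdflg) {
            len = fgetc(stdin);
            if (len != EOF)
                break;
            // a non-blocking read with no input pending sets the stream's
            // error indicator; clear it so the next fgetc tries again
            clearerr(stdin);
        }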
        else {
            len = read(fd,buf,sizeof(buf));
            if (len > 0)
                break;
        }
    }

    tcsetattr(fd,TCSAFLUSH,&tiold);
    fcntl(fd,F_SETFL,oflag);

    code = (remain > 0);

    printf("\n");
    printf("%s (%d remaining) ...\n",code ? "abort" : "normal",remain);

    return code;
}


Answered By - Craig Estey