Troubleshooters.Com and Code Corner Present

Steve Litt's Perls of Wisdom:
Perl File Input, Output and Sorting
(With Snippets)

Copyright (C) 1998-2003 by Steve Litt




Contents

  • Introduction
  • File Output In Perl
  • File Input in Perl
  • File Conversion Example
  • Passing Files as Arguments
  • Piping
  • Other File Algorithms


    Introduction

    File input and output is an integral part of every programming language. Perl has complete file input and output capabilities, and it has especially handy syntax for line-at-a-time sequential input. Because Perl's strengths are text manipulation and parsing, that syntax is especially important, and it is covered thoroughly on this page. Sequential file output is covered too. This page does not discuss fixed-record reads or random I/O.

    Because writing files in Perl is actually simpler, we'll start with output, then move to input.

    File Output In Perl

    $append = 0;
    if ($append)
    {
    open(MYOUTFILE, ">>filename.out"); #open for write, append
    }
    else
    {
    open(MYOUTFILE, ">filename.out"); #open for write, overwrite
    }
    print MYOUTFILE "Timestamp: "; #write text, no newline
    print MYOUTFILE &timestamp(); #write text-returning fcn
    print MYOUTFILE "\n"; #write newline
    #*** Print freeform text, semicol required ***
    print MYOUTFILE <<"MyLabel";
    Steve was here
    and now is gone
    but left his name
    to carry on.
    MyLabel
    #*** Close the file ***
    close(MYOUTFILE);
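
    The listing above calls timestamp() without showing it. One possible definition is sketched below. It is not the original routine: the name of the format, the use of the POSIX module's strftime(), and the "YYYY-MM-DD HH:MM:SS" layout are all assumptions made for illustration.

```perl
#!/usr/bin/perl -w
use strict;
use POSIX qw(strftime);

# Hypothetical helper: returns the local time as a string
# such as "2003-05-17 14:02:59". The format is an assumption;
# change the strftime() format string to suit.
sub timestamp
{
    return strftime("%Y-%m-%d %H:%M:%S", localtime());
}

print "Timestamp: ", timestamp(), "\n";
```

    Because the routine returns a string, it can be handed straight to print, exactly as the output listing does.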

    File Input in Perl

    Opening a file for read requires no angle brackets in the filename. If you wish, you can put in a left angle bracket <, which means "input file". It's good practice to close any files you open. Files can be read line by line, or the entire contents of the file can be dumped into a list, with each list element being a line. Here is an example of a program that reads a file, converts each line to upper case, and prints it to the screen:

    Reading a File a Line at a Time

    open(MYINPUTFILE, "<filename.out");
    while(<MYINPUTFILE>)
    {
    # Good practice to store $_ value because
    # subsequent operations may change it.
    my($line) = $_;

    # Good practice to always strip the trailing
    # newline from the line.
    chomp($line);

    # Convert the line to upper case.
    # (tr takes plain ranges; square brackets are not needed.)
    $line =~ tr/a-z/A-Z/;

    # Print the line to the screen and add a newline
    print "$line\n";
    }
    close(MYINPUTFILE);
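
    None of the open() calls shown so far test whether the open succeeded. The sketch below adds that check, using Perl's $! variable to report the operating system's reason for the failure. The subroutine name upcase_file and the file name no_such_file.in are hypothetical, chosen only for this illustration.

```perl
#!/usr/bin/perl -w
use strict;

# Returns the lines of a file, upper-cased, or dies with the
# operating system's error message if the file cannot be opened.
sub upcase_file
{
    my($filename) = @_;
    open(MYINPUTFILE, "<$filename")
        or die "Cannot open $filename for read: $!\n";
    my(@result);
    while(<MYINPUTFILE>)
    {
        my($line) = $_;
        chomp($line);
        push(@result, uc($line));
    }
    close(MYINPUTFILE);
    return @result;
}

# Trap the failure with eval so the script can report it and go on.
# The file name is assumed not to exist.
my(@lines) = eval { upcase_file("no_such_file.in") };
print "Open failed: $@" if $@;
```

    Without the "or die", a failed open goes unnoticed and the while loop simply reads nothing.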

    Reading a Whole File at Once

    Sometimes it's easier to read a whole file into a list, especially with complex break logic, read-ahead totals, or sorting. Here's a program that reads a file and prints it in sorted order:

    open(MYINPUTFILE, "<filename.out"); # open for input
    my(@lines) = <MYINPUTFILE>; # read file into list
    @lines = sort(@lines); # sort the list
    my($line);
    foreach $line (@lines) # loop thru list
    {
    print "$line"; # print in sort order
    }
    close(MYINPUTFILE);
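
    By default sort() compares lines as strings. When the lines start with a number, a comparison block gives numeric order instead. The following is a sketch only: the sample lines and the helper name leading_number are invented for illustration, and in practice @lines would come from the file as in the listing above.

```perl
#!/usr/bin/perl -w
use strict;

# Sample data invented for illustration.
my(@lines) = ("10 pears\n", "9 apples\n", "100 figs\n");

# Default sort is a string comparison, so "10" sorts before "9".
my(@by_string) = sort(@lines);

# Hypothetical helper: the number at the start of a line, or 0.
sub leading_number
{
    my($line) = @_;
    return ($line =~ m/^(\d+)/) ? $1 : 0;
}

# Numeric sort on the leading number of each line.
my(@by_number) = sort { leading_number($a) <=> leading_number($b) } @lines;

print "String order:\n", @by_string;
print "Numeric order:\n", @by_number;
```

    Inside the comparison block, $a and $b are the two elements being compared, <=> compares numbers, and cmp compares strings.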

    File Conversion Example

    Perl is exceptionally good at file conversion. Here's an example where each line in the file has 3 fields (in this order): A 5 digit zip code, a 20 char name (first last) and a mm/dd/yy birth date. You want to change it to a 16 char last name, a 10 char first name, a mm/dd/yyyy birth date, and a 5 digit zip. For simplicity, assume names have no spaces (no Mary Anns, no Van Gelders). Here's a 21 line program to do the conversion:

    open(MYINPUTFILE, "<filename.in");
    open(MYOUTPUTFILE, ">filename.out");
    while(<MYINPUTFILE>)
    {
    my($line) = $_;
    chomp($line);
    if($line =~ m|(\d{5})(.{20})(\d\d)/(\d\d)/(\d\d)|)
    {
    my($zip,$name,$mm,$dd,$yy) = ($1,$2,$3,$4,$5);
    if($yy > 10)
    {$yy += 1900}
    else
    {$yy += 2000}
    my($first, $last) = split(/ /, $name);
    $line = sprintf("%-16s%-10s%02d/%02d/%04d%05d",
    $last,$first,$mm,$dd,$yy,$zip);
    print MYOUTPUTFILE "$line\n";
    }
    }
    close(MYINPUTFILE);
    close(MYOUTPUTFILE);
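
    To make the field layout concrete, here is the same conversion applied to a single record. This is a sketch: the subroutine name convert_line and the sample record are invented for illustration. It pads the zip code with %05d so that a leading zero survives the conversion.

```perl
#!/usr/bin/perl -w
use strict;

# Converts one record; returns undef if the line does not match.
# Input:  5 digit zip, 20 char name (first last), mm/dd/yy
# Output: 16 char last, 10 char first, mm/dd/yyyy, 5 digit zip
sub convert_line
{
    my($line) = @_;
    return undef unless
        $line =~ m|(\d{5})(.{20})(\d\d)/(\d\d)/(\d\d)|;
    my($zip,$name,$mm,$dd,$yy) = ($1,$2,$3,$4,$5);
    $yy += ($yy > 10) ? 1900 : 2000;   # same century rule as above
    my($first, $last) = split(/ /, $name);
    return sprintf("%-16s%-10s%02d/%02d/%04d%05d",
        $last,$first,$mm,$dd,$yy,$zip);
}

# Sample record invented for illustration: zip 02134,
# name padded to 20 characters, birth date 07/04/62.
my($in) = sprintf("%05d%-20s%s", 2134, "Jane Doe", "07/04/62");
print "[", convert_line($in), "]\n";
```

    The brackets in the print statement just make the padding visible on the screen.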

    File Slurping

    You might occasionally want to grab an entire file without paying attention to line termination. You can do that by undefing the $/ built in variable, and then assigning the <file> to a scalar. This is called "slurping" the file.

    The following code slurps the STDIN file, then splits it into lines, then reassembles the lines into a single string, and prints the string:

    #!/usr/bin/perl -w
    use strict;

    my $holdTerminator = $/;
    undef $/;
    my $buf = <STDIN>;
    $/ = $holdTerminator;
    my @lines = split /$holdTerminator/, $buf;
    $buf = "init";
    $buf = join $holdTerminator, @lines;
    print $buf;
    print "\n";

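    The preceding code works like this: it saves the current value of $/ (the input record separator, normally a newline), undefines it so that a single read of <STDIN> returns all of the input as one string, and then restores $/. The split breaks that string into lines on the saved separator, and the join puts the lines back together into one string, which is printed.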
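
    A more compact way to get the same effect is to give $/ a temporary value with local, so that Perl restores it automatically. The following is a minimal sketch, not a replacement for the code above; the subroutine name slurp_file and the scratch file name are hypothetical.

```perl
#!/usr/bin/perl -w
use strict;

# Returns the entire contents of a file as one string.
sub slurp_file
{
    my($filename) = @_;
    my $inf;
    open($inf, "<" . $filename)
        or die "Cannot open $filename: $!\n";
    # local restores the old value of $/ automatically
    # when this subroutine returns.
    local $/ = undef;
    my $buf = <$inf>;
    close($inf);
    return $buf;
}

# Demonstration with a small scratch file (name is hypothetical).
my $scratch = "slurp_demo.txt";
open(my $ouf, ">" . $scratch) or die "Cannot write $scratch: $!\n";
print $ouf "line one\nline two\n";
close($ouf);

my $text = slurp_file($scratch);
print length($text), " characters read\n";
unlink($scratch);
```

    Because the change to $/ is confined to the subroutine, code elsewhere in the program still reads a line at a time.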
    Slurping isn't as handy as it might seem. If you're a C programmer accustomed to using the read() and write() functions with a large buffer to accomplish incredibly fast I/O, you might think file-at-a-time I/O would be much faster than line oriented I/O. Not in Perl! For whatever reason, line oriented is faster.

    One reason is the need for huge amounts of memory, which on UNIX systems translates into heavy disk usage as swap space is used. But this doesn't account for the whole thing, as you'll see in the following test program:

    #!/usr/bin/perl -w
    use strict;

    my $bigfileName = "/scratch/bigfile.txt";
    my $sipfileName = "/scratch/sip.out";
    my $arrayfileName = "/scratch/array.out";
    my $slurpfileName = "/scratch/slurp.out";

    sub slurp()
    {
    my $inf;
    my $ouf;
    my $holdTerminator = $/;
    undef $/;
    open $inf, "<" . $bigfileName;
    my $buf = <$inf>;
    close $inf;
    $/ = $holdTerminator;
    my @lines = split /$holdTerminator/, $buf;
    $buf = "init";
    $buf = join $holdTerminator, @lines;
    open $ouf, ">" . $slurpfileName;
    print $ouf $buf;
    print $ouf "\n";
    close $ouf;
    }

    sub sip()
    {
    my $inf;
    my $ouf;
    open $inf, "<" . $bigfileName;
    open $ouf, ">" . $sipfileName;
    while(<$inf>)
    {
    my $line = $_;
    chomp $line;
    print $ouf $line, "\n";
    }
    close $ouf;
    close $inf;
    }

    sub buildarray()
    {
    my $inf;
    my $ouf;
    my @array;
    open $inf, "<" . $bigfileName;
    while(<$inf>)
    {
    my $line = $_;
    chomp $line;
    push @array, ($line);
    }

    close $inf;
    open $ouf, ">" . $arrayfileName;
    foreach my $line (@array)
    {
    print $ouf $line, "\n";
    }
    close $ouf;
    }

    sub main()
    {
    my $time1 = time();

    print "Starting sip\n";
    sip();
    print "End sip\n";

    my $time2 = time();

    print "Starting array\n";
    buildarray();
    print "End array\n";

    my $time3 = time();

    print "Starting slurp\n";
    slurp();
    print "End slurp\n";

    my $time4 = time();

    print "Sip time is ", $time2-$time1, " seconds\n";
    print "Array time is ", $time3-$time2, " seconds\n";
    print "Slurp time is ", $time4-$time3, " seconds\n";
    }

    main();

    The preceding program creates the following output:
    [slitt@mydesk littperl]$ ./slurp.pl
    Starting sip
    End sip
    Starting array
    End array
    Starting slurp
    End slurp
    Sip time is 14 seconds
    Array time is 74 seconds
    Slurp time is 279 seconds
    [slitt@mydesk littperl]$

    As you can see in the preceding program and output, the line in, line out method copied a 50 MB file in 14 seconds. A line at a time input that pushed each line onto an array and then output the array a line at a time took 74 seconds. Note that this stores the full file in memory. The slurp method, which reads the file into a string and then copies it to an array, took 279 seconds. Looking more closely, the slurp version actually has two copies of the file in memory -- one in the array and one in the scalar. Indeed, if you add the following line to the array method, right after the building of the array is complete, array runtime more closely approximates that of the slurp method:
    my @arraycopy = @array;
    Adding the preceding statement means storing 2 copies of the file in memory, just like the slurp method. Here are the run results with the extra copy:

    [slitt@mydesk littperl]$ ./slurp.pl
    Starting sip
    End sip
    Starting array
    End array
    Starting slurp
    End slurp
    Sip time is 14 seconds
    Array time is 304 seconds
    Slurp time is 258 seconds
    [slitt@mydesk littperl]$

    The Moral of the Story

    The moral of the story is clear. Large buffer I/O is not efficient in Perl the way it is in C. If the file is large enough to save time by whole file reads, then it's so large as to exhaust RAM, thus incurring swap penalties.

    The most efficient algorithm reads a line, writes a line, and stores nothing. That's not always practical, and it's certainly not the easiest way to design code.

    A further advantage of read a line, write a line occurs when dealing with pipes. This is discussed in the Piping section, later in this document.

    If you really want to get faster I/O in Perl, you might experiment with the sysopen(), sysread(), sysseek(), and syswrite() functions. But beware, they interact quirkily with normal Perl I/O functions.

    Passing Files as Arguments

    Given the Perl syntax, it's not obvious how to pass files as arguments. There are three methods: as Globs, as filehandles, and as variables.

    Globs

    The Glob method of passing files is very Perlistic, and as such is far from obvious to general purpose programmers not using Perl on a regular basis. The Glob method is useful when retrofitting file passing in programs using Perl's <FILEHANDLE> syntax. If you're starting fresh, consider filehandles.

    Here's the Glob method:
    sub printFile($)
    {
    my $fileHandle = $_[0];
    while (<$fileHandle>)
    {
    my $line = $_;
    chomp($line);
    print "$line\n";
    }
    }

    open(MYINPUTFILE, "<filename.in");
    printFile(\*MYINPUTFILE);
    close(MYINPUTFILE);

    Output files work similarly.
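
    As a sketch of the output case, the following passes an open output file, and then STDOUT, to the same subroutine. The subroutine name printReport and the file name report.out are hypothetical.

```perl
#!/usr/bin/perl -w
use strict;

# Writes a short report to whatever filehandle it is given.
sub printReport($)
{
    my $fileHandle = $_[0];
    print $fileHandle "Report line 1\n";
    print $fileHandle "Report line 2\n";
}

# Pass an open output file as a glob reference...
open(MYOUTFILE, ">report.out") or die "Could not open file: $!\n";
printReport(\*MYOUTFILE);
close(MYOUTFILE);

# ...or pass STDOUT the same way.
printReport(\*STDOUT);
```

    Note that there is no comma between the filehandle variable and the text in the print statements.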

    If you need to assign the glob to an actual variable, you can do that also. The code in the subroutine remains the same, and the following is the code doing the passing:
    open(MYINPUTFILE, "<filename.in");
    my $fileGlob = \*MYINPUTFILE;
    printFile($fileGlob);
    close(MYINPUTFILE);

    Use of an actual variable makes the code much more obvious to the programmer with only casual Perl experience.

    Once again, Globs are the old method, and they're compatible with older Perl file methods, but for new construction you'll probably prefer to use the FileHandle module.

    FileHandles

    This is the modern, preferred way. With the FileHandle module you can assign a file handle to a variable that can be passed, just like in C. Unlike Globs, its use is obvious to any experienced programmer.

    use FileHandle;

    sub printFile($)
    {
    my $fileHandle = $_[0];
    while (<$fileHandle>)
    {
    my $line = $_;
    chomp($line);
    print "$line\n";
    }
    }

    my $fh = FileHandle->new();
    $fh->open("<filename.in") or die "Could not open file\n";
    printFile($fh);
    $fh->close(); # explicit close; also happens when $fh goes out of scope

    The FileHandle class also has methods like gets(), print(), printf(). This gives the programmer much better control, and helps in OOP programs.
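
    Here is a minimal sketch of the method-call style, writing a file with print() and printf() and reading it back with getline(). The file name fh_demo.out and the sample data are invented for illustration.

```perl
#!/usr/bin/perl -w
use strict;
use FileHandle;

# File name is hypothetical.
my $name = "fh_demo.out";

# Write with the print() and printf() methods.
my $out = FileHandle->new(">$name") or die "Could not open $name: $!\n";
$out->print("Name      Qty\n");
$out->printf("%-10s%3d\n", "widgets", 42);
$out->close();

# Read back with getline(), which returns one line per call
# and undef at end of file.
my $in = FileHandle->new("<$name") or die "Could not open $name: $!\n";
my @seen;
while (defined(my $line = $in->getline()))
{
    chomp($line);
    push(@seen, $line);
    print "$line\n";
}
$in->close();
unlink($name);
```

    Passing the mode and file name to new() opens the file in one step; new() returns undef if the open fails.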

    Variables

    We usually see files expressed as uppercase bare text, as in <INF>, but it can also be a variable, such as <$inf>. As such, the variable can be passed between subroutines. Usually the FileHandle method is preferred, but if you're an oldschool perl guy who wants to use the oldschool syntax but be able to pass open files without resorting to cumbersome globs, variables are just what's needed. Watch this:

    #!/usr/bin/perl -w
    use strict;

    sub printFile($)
    {
    my $fileHandle = $_[0];
    while (<$fileHandle>)
    {
    my $line = $_;
    chomp($line);
    print "$line\n";
    }
    }

    my $fh;
    open($fh,"<filename.in") or die "Could not open file\n";
    printFile($fh);
    close($fh);

    Piping

    One really quick, modular and high quality method of program design/coding is to build the program out of small executables connected with pipes. For instance, the following CGI shellscript, let's call it showrpt.cgi, illustrates such a piping situation:
    #!/bin/bash
    ./get_mainframe_data.pl | ./zap_extraneous_text.pl | ./parse_data.pl | ./make_into_web_page.pl
    In the preceding, zap_extraneous_text.pl, parse_data.pl, and make_into_web_page.pl are Perl scripts receiving their data through STDIN and outputting data through STDOUT. They're what are called filters in the UNIX world. The get_mainframe_data.pl program generates its own data and passes it out through STDOUT. The pipeline route is defined by showrpt.cgi, which calls all four in a pipe.

    Now ask yourself this: What if a Perl program had to decide the pipe route? This is a very real question. Perhaps a parsing program starts with a complex parse to determine which parser units to use, then assembles the pipe, and then pipes data into it. You do that with a Perl pipe:
    # $report_type and @environment_lines are assumed to have
    # been set earlier in the program.
    my $pipestring;
    if($report_type eq 'ar')
    {
    $pipestring = "|./zap_extraneous_data_ar_report.pl | ./parse_ar_report | ./make_into_web_page.pl";
    }
    elsif($report_type eq 'journal')
    {
    $pipestring = "|./parse_journal_report | ./sort_journal_by_account.pl | ./make_into_web_page.pl";
    }
    my $pipe;
    open $pipe, $pipestring or die "Could not open pipe\n";
    foreach my $environment_line (@environment_lines)
    {
    print $pipe $environment_line, "\n";
    }
    while(<STDIN>)
    {
    my $line = $_;
    chomp ($line);
    print $pipe $line, "\n";
    }
    close $pipe; # waits for the pipeline to finish

    In the preceding code, the open $pipe, $pipestring sets it up so anything printed to $pipe is sent to the STDIN of the pipe laid out in $pipestring. From there, the environment lines are sent to that pipe, and then this program's STDIN is sent to that pipe.
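
    The following self-contained sketch shows the same mechanism on a small scale. It assumes a UNIX-like system with the standard sort command on the path; the scratch file name and the sample data are invented for illustration.

```perl
#!/usr/bin/perl -w
use strict;

# Pipe three unsorted lines into the UNIX sort command, which
# writes its result to a scratch file (name is hypothetical).
my $outfile = "pipe_demo.out";
my $pipestring = "| sort > $outfile";

my $pipe;
open($pipe, $pipestring) or die "Could not start pipe: $!\n";
foreach my $line ("pear", "apple", "fig")
{
    print $pipe $line, "\n";
}
# close() waits for the pipeline to finish; if the command
# failed, close() returns false and its status is in $?.
close($pipe) or die "Pipe failed, status $?\n";

# Show what came out the far end of the pipe.
open(my $inf, "<" . $outfile) or die "Cannot read $outfile: $!\n";
my @sorted = <$inf>;
close($inf);
unlink($outfile);
print @sorted;
```

    Closing the pipe matters here: until the close, sort has not seen end of file and cannot finish writing its output.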

    Piping Efficiency Issues

    Small executables piped together are a great way to rapidly develop an application. They're a great way to quickly rearrange an application. Applications built with piped executables are so modular that bugs are few, shallow, and easy to test for. The main problem with piped executables, especially those made with Perl, is that piping data is slow. Perl programs handle STDIN and STDOUT about half the speed of awk, and about 1/5 the speed of equivalently written C programs.

    Beyond that, assuming you're running a Linux, UNIX or BSD box, order counts. Ideally, you read a little, process a little, write a little:

    while(<STDIN>)
    {
    my $line = $_;
    chomp($line);
    $line = process_one_line($line);
    print $pipe $line, "\n";
    }

    The preceding code implements a true bucket brigade, where each process on the pipeline has something to do, and they can all work concurrently. This is especially important on multiprocessor machines.

    Often, however, you cannot output until all the input has been read and processed. This means that the next stage must wait until completion of the previous stage, and only then begin. Compound that by several stages, and processing time balloons. Unfortunately, it's often very difficult to write an executable so that it outputs before completion of input.

    Other File Algorithms

    truncate()

    This is a way of emptying a file without deleting it. This is wonderful for web apps, where the Apache user can be given write rights to the file, but not write rights to the whole directory. As a privileged user, create the file with touch, and then change its permissions to be writable by the Apache user. From there, it never gets deleted, so it's always modifiable by the Apache user.
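
    A minimal sketch of emptying a file in place follows. The file name truncate_demo.log is hypothetical, and the script creates and removes the file itself only so that the example is self-contained.

```perl
#!/usr/bin/perl -w
use strict;

# File name is hypothetical; it stands in for a file that the
# web server user may write to but not delete.
my $name = "truncate_demo.log";

# Give the file some contents to start with.
open(my $ouf, ">>" . $name) or die "Cannot open $name: $!\n";
print $ouf "old contents\n";
close($ouf);
my $size_before = (stat($name))[7];   # element 7 is size in bytes

# Empty the file in place. truncate() takes a filehandle or
# a file name, plus the new length in bytes.
truncate($name, 0) or die "Cannot truncate $name: $!\n";
my $size_after = (stat($name))[7];

print "Size before: $size_before, size after: $size_after\n";
unlink($name);   # cleanup for this demonstration only
```

    The file itself, along with its owner and permissions, is untouched; only its contents go away.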

    unlink()

    This deletes a file.

    rename()

    This renames a file, like the UNIX mv command.

    mkdir, rmdir, chdir, chmod, chown, chroot

    These perform identical functions to their UNIX counterparts.

    -X

    In this case the "X" is actually one of the following letters:
    -r File is readable by effective uid/gid.
    -w File is writable by effective uid/gid.
    -x File is executable by effective uid/gid.
    -o File is owned by effective uid.


    -R File is readable by real uid/gid.
    -W File is writable by real uid/gid.
    -X File is executable by real uid/gid.
    -O File is owned by real uid.
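
    The tests are written in front of a file name or filehandle and return true or false. Here is a sketch using the four effective uid/gid tests; the subroutine name describe_file and the scratch file name are hypothetical.

```perl
#!/usr/bin/perl -w
use strict;

# Returns a string of flags describing what the effective
# uid/gid may do with the named file.
sub describe_file
{
    my($filename) = @_;
    my($flags) = "";
    $flags .= "r" if -r $filename;   # readable
    $flags .= "w" if -w $filename;   # writable
    $flags .= "x" if -x $filename;   # executable
    $flags .= "o" if -o $filename;   # owned by effective uid
    return $flags;
}

# Scratch file name is hypothetical.
my $name = "filetest_demo.txt";
open(my $ouf, ">" . $name) or die "Cannot create $name: $!\n";
close($ouf);
chmod(0600, $name);              # owner may read and write only
my $flags = describe_file($name);
print "$name: $flags\n";
unlink($name);
```

    A typical use is to test a file with -r before trying to open it, so the program can give a friendlier message than a failed open would.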

    sysopen(), sysread(), sysseek(), and syswrite()

    These are low level calls corresponding to C's open(), read(), lseek() and write() functions. Because they bypass Perl's buffering, they interact strangely with calls to Perl's buffered file I/O, so avoid mixing the two on the same filehandle. But if you really want to speed up Perl's I/O, this might (or might not) be a way to do it. A full treatment is beyond the scope of this document.
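
    For readers who want to experiment, here is a minimal sketch of a block-at-a-time file copy using only the unbuffered calls. Whether it is faster than line oriented I/O is something to measure, not assume. The subroutine name block_copy, the 64 KB buffer size, and the scratch file names are all arbitrary choices made for illustration.

```perl
#!/usr/bin/perl -w
use strict;
use Fcntl;   # supplies O_RDONLY, O_WRONLY, O_CREAT, O_TRUNC

# Copies a file in large blocks using only unbuffered calls.
# Returns the number of bytes copied.
sub block_copy
{
    my($from, $to) = @_;
    my($inf, $ouf);
    sysopen($inf, $from, O_RDONLY)
        or die "Cannot open $from: $!\n";
    sysopen($ouf, $to, O_WRONLY | O_CREAT | O_TRUNC, 0644)
        or die "Cannot open $to: $!\n";

    my $blocksize = 65536;   # buffer size is an arbitrary choice
    my $total = 0;
    my $buf;
    while (1)
    {
        my $got = sysread($inf, $buf, $blocksize);
        die "Read error on $from: $!\n" unless defined($got);
        last if $got == 0;               # end of file
        my $offset = 0;
        while ($offset < $got)           # syswrite may write less
        {
            my $put = syswrite($ouf, $buf, $got - $offset, $offset);
            die "Write error on $to: $!\n" unless defined($put);
            $offset += $put;
        }
        $total += $got;
    }
    close($ouf);
    close($inf);
    return $total;
}

# Demonstration with scratch files (names are hypothetical).
my($src, $dst) = ("syscopy_demo.in", "syscopy_demo.out");
open(my $t, ">" . $src) or die "Cannot create $src: $!\n";
print $t "some text\n" x 1000;
close($t);
my $copied = block_copy($src, $dst);
my $dstsize = (stat($dst))[7];
print "Copied $copied bytes\n";
unlink($src, $dst);
```

    Note that the subroutine never uses print or <> on the handles it opens with sysopen(), which keeps the buffered and unbuffered calls apart.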

