



List files that contain `n` or fewer lines (7)

python -c "import sys, itertools; print('\n'.join(fn for fn in sys.argv[1:] if len(list(itertools.islice(open(fn, 'rb'), 28))) <= 27))" *.txt

Question

In a folder, I would like to print the name of every .txt file that contains n=27 or fewer lines. I could do

wc -l *.txt | awk '{if ($1 <= 27){print}}'

The problem is that many files in the folder have millions of lines (and the lines are pretty long), so the command wc -l *.txt is very slow. In principle, a process could stop counting as soon as it has seen more than n lines and then proceed to the next file.

What is a faster alternative?

FYI, I am on Mac OS X 10.11.6

Attempt

Here is an attempt with awk

#!/bin/awk -f

function printPreviousFileIfNeeded(previousNbLines, previousFILENAME)
{
  if (previousNbLines <= n) 
  {
    print previousNbLines": "previousFILENAME
  }
}

BEGIN{
  previousNbLines=n+1
  previousFILENAME=NA
} 


{
  if (FNR==1)
  {
    printPreviousFileIfNeeded(previousNbLines, previousFILENAME)
    previousFILENAME=FILENAME
  }
  previousNbLines=FNR
  if (FNR > n)
  {
    nextfile
  }
}

END{
  printPreviousFileIfNeeded(previousNbLines, previousFILENAME)
}

which can be called as

awk -v n=27 -f myAwk.awk *.txt

However, the code fails at printing out perfectly empty files. I am not sure how to fix that and I am not sure my awk script is the way to go.


How's this?

awk 'BEGIN { for(i=1;i<ARGC; ++i) arg[ARGV[i]] }
  FNR==28 { delete arg[FILENAME]; nextfile }
  END { for (file in arg) print file }' *.txt

We copy the list of file name arguments to an associative array, then remove all files which have a 28th line from it. Empty files obviously won't match this condition, so at the end, we are left with all files which have fewer lines, including the empty ones.

nextfile was a common extension in many Awk variants and then was codified by POSIX in 2012. If you need this to work on really old dinosaur OSes (or, good heavens, probably Windows), good luck, and/or try GNU Awk.
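
One side effect of looping with for (file in arg) is that awk does not guarantee the order in which the names come out. If the original argument order matters, a variant could record the files that are too long and walk ARGV again in END. This is only a sketch under the same assumptions as above (an awk with nextfile, no repeated file names, no var=value operands); the function name list_short_in_order and the array name toolong are made up for illustration.

```shell
# Sketch: print, in argument order, the files that have at most $1 lines.
# Assumes an awk that supports nextfile; all names here are illustrative.
list_short_in_order() {
  max=$1; shift
  [ "$#" -gt 0 ] || return 0          # no files given: nothing to report
  awk -v n="$max" '
    FNR > n { toolong[FILENAME] = 1; nextfile }   # stop reading at line n+1
    END {
      for (i = 1; i < ARGC; i++)
        if (!(ARGV[i] in toolong)) print ARGV[i]
    }' "$@"
}
```

It would be called as list_short_in_order 27 *.txt. Empty files never trigger the FNR > n rule, so they are still reported.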


If you have to call awk individually, ask it to stop at line 28:

for f in ./*.txt
do
  if awk 'NR > 27 { fail=1; exit; } END { exit fail; }' "$f"
  then
    printf '%s\n' "$f"
  fi
done

Awk variables default to zero, so if we never reach line 28, the exit code is zero; the if test then succeeds and the filename is printed.
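
If the limit should not be hard-coded, the same exit-status idea can be wrapped in a small shell function. This is a sketch rather than part of the answer above: the name has_at_most_lines is hypothetical, and it calls exit 1 directly from the main rule instead of going through a fail variable and an END block.

```shell
# Sketch: succeed (exit status 0) if file "$1" has at most "$2" lines.
# awk stops reading as soon as it sees line number $2 + 1.
# The function name is illustrative only.
has_at_most_lines() {
  awk -v n="$2" 'NR > n { exit 1 }' "$1"
}
```

Inside the loop, the test would then read: if has_at_most_lines "$f" 27. An empty file never runs the rule, so awk exits with status 0 and the file counts as short.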


Software tools and GNU sed (older versions before v4.5) mashup:

find *.txt -print0 | xargs -0 -L 1 sed -n '28q;$F'

That misses 0-byte files. To include those as well, do:

find *.txt \( -exec sed -n '28{q 1}' '{}' \; -or -size 0 \) -print

(For some reason running sed via -exec is about 12% slower than xargs.)


sed code stolen from ctac's answer.

Note: On my own system's older sed v4.4-2, the quit command combined with the --separate switch doesn't just quit the current file, it quits sed entirely. Which means it requires a separate instance of sed for every file.


With GNU awk for nextfile and ENDFILE:

awk -v n=27 'FNR>n{f=1; nextfile} ENDFILE{if (!f) print FILENAME; f=0}' *.txt

With any awk:

awk -v n=27 '
    { fnrs[FILENAME] = FNR }
    END {
        for (i=1; i<ARGC; i++) {
            filename = ARGV[i]
            if ( fnrs[filename] <= n ) {
                print filename
            }
        }
    }
' *.txt

Those will both work whether the input files are empty or not. The caveats for the non-gawk version are the same as for your other current awk answers:

  1. It relies on the same file name not appearing multiple times in the argument list (e.g. awk 'script' foo bar foo) in cases where you want it displayed once per occurrence, and
  2. It relies on there being no variables set in the arg list (e.g. awk 'script' foo FS=, bar)

The gawk version has no such restrictions.

UPDATE:

xhienne stated that her GNU grep+sed solution would be faster than a pure awk script. To compare its timing against the GNU awk script above, I created 10,000 input files, each 0 to 1000 lines long, using this script:

$ awk -v numFiles=10000 -v maxLines=1000 'BEGIN{for (i=1;i<=numFiles;i++) {numLines=int(rand()*(maxLines+1)); out="out_"i".txt"; printf "" > out; for (j=1;j<=numLines; j++) print ("foo" j) > out} }'

and then ran the two commands on them. These are the timing results from the third run:

$ time grep -c -m28 -H ^ *.txt | sed '/:28$/ d; s/:[^:]*$//' > out.grepsed

real    0m1.326s
user    0m0.249s
sys     0m0.654s

$ time awk -v n=27 'FNR>n{f=1; nextfile} ENDFILE{if (!f) print FILENAME; f=0}' *.txt > out.awk

real    0m1.092s
user    0m0.343s
sys     0m0.748s

Both scripts produced the same output files. The above was run in bash on cygwin. I expect the timing results to vary a little on different systems, but the difference should remain negligible.


To print 10 lines of up to 20 random chars per line (see the comments):

$ maxChars=20
    LC_ALL=C tr -dc '[:print:]' </dev/urandom |
    fold -w "$maxChars" |
    awk -v maxChars="$maxChars" -v numLines=10 '
        { print substr($0,1,rand()*(maxChars+1)) }
        NR==numLines { exit }
    '
0J)-8MzO2V\XA/o'qJH
@r5|g<WOP780
^[email protected]\
vP{l^pgKUFH9
-6r&]/-6dl}pp W
&.UnTYLoi['2CEtB
Y~wrM3>4{
^F1mc9
?~NHh}a-EEV=O1!y
of

To do it all within awk (which will be much slower):

$ cat tst.awk
BEGIN {
    for (i=32; i<127; i++) {
        chars[++charsSize] = sprintf("%c",i)
    }
    minChars = 1
    maxChars = 20
    srand()
    for (lineNr=1; lineNr<=10; lineNr++) {
        numChars = int(minChars + rand() * (maxChars - minChars + 1))
        str = ""
        for (charNr=1; charNr<=numChars; charNr++) {
            charsIdx = int(1 + rand() * charsSize)
            str = str chars[charsIdx]
        }
        print str
    }
}

$ awk -f tst.awk
Heer H{QQ?qHDv|
Psuq
Ey`-:O2v7[]|N^EJ0
j#@/y>CJ3:=3*b-joG:
?
^|O.[tYlmDo
TjLw
`2Rs=
!('IC
hui

You can use find with the help of a little bash inline script:

find . -type f -exec bash -c '[ $(grep -cm 28 ^ "${1}") != "28" ] && echo "${1}"' -- {} \;

The command [ $(grep -cm 28 ^ "${1}") != "28" ] && echo "${1}" uses grep to match the beginning of a line (^) at most 28 times. If the resulting count is not "28", the file must have fewer than 28 lines.
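
The same counting trick can also be used without starting a bash process per file. The following is a minimal sketch, assuming a grep that supports -m (GNU and BSD grep do, but POSIX does not require it); the function name list_short_files_grep and its argument layout are invented for this example.

```shell
# Sketch: list the files among the arguments that have at most $1 lines.
# Relies on grep -c -m, which stops counting after the given number of matches.
# Function and variable names are illustrative only.
list_short_files_grep() {
  max=$1; shift
  for f in "$@"; do
    # Count line starts, giving up after max + 1 matches.
    count=$(grep -c -m "$((max + 1))" '^' "$f")
    if [ "$count" -le "$max" ]; then
      printf '%s\n' "$f"
    fi
  done
}
```

A call such as list_short_files_grep 27 ./*.txt prints one name per line. Empty files yield a count of 0 and are therefore included.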


with sed (GNU sed) 4.5 :

sed -n -s '28q;$F' *.txt



