Unix & Linux: How do I use grep to find lines, in which any word occurs 3 times?
[+4] [5] doelie247
[2020-11-30 12:46:36]
[ grep regular-expression ]
[ https://unix.stackexchange.com/questions/622195/how-do-i-use-grep-to-find-lines-in-which-any-word-occurs-3-times ]

I want to find the lines that contain any word three times. For this, I thought it would be best to use the grep command.

This was my attempt.

grep '\(.*\)\{3\}' myfile.txt
(4) how do you define a "word"? - umläute
(5) Please edit your question and include a few input lines and the exact expected output. - thanasisp
(1) This seems a popular question going around, see stackoverflow: Find multiple occurences of same string - Sundeep
[+13] [2020-11-30 13:01:01] Quasímodo [ACCEPTED]

Using the standard word definition,

  • GNU Grep, 3 or more occurrences of any word.

    grep -E '(\W|^)(\w+)\W(.*\<\2\>){2}' file
    

  • GNU Grep, exactly 3 occurrences of any word.

    grep -E '(\W|^)(\w+)\W(.*\<\2\>){2}' file | grep -Ev '(\W|^)(\w+)\W(.*\<\2\>){3}'
    

  • POSIX Awk, exactly 3 occurrences of any word.

    awk -F '[^_[:alnum:]]+' '{           # Field separator is non-word sequences
        split("", cnt)                   # Delete array cnt
        for (i=1; i<=NF; i++) cnt[$i]++  # Count number of occurrences of each word
        for (i in cnt) {
            if (cnt[i]==3) {             # If a word appears exactly 3 times
                print                    # Print the line
                break
            }
        }
    }' file
    

    For 3 or more occurrences, simply change == to >=.

    Equivalent golfed one-liner:

    awk -F '[^_[:alnum:]]+' '{split("",c);for(i=1;i<=NF;i++)c[$i]++;for(i in c)if(c[i]==3){print;next;}}' file
    

  • GNU Awk, exactly 3 occurrences of the word ab.

    gawk 'gsub(/\<ab\>/,"&")==3' file
    

    For 3 or more occurrences, simply change == to >=.
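The difference between "exactly 3" and "3 or more" can be tried on a small sample. What follows is only a sketch built on the POSIX Awk program above: the function name lines_with_repeated_word, its two arguments (eq or ge, then the count) and the sample text are assumptions made for illustration, not part of the original answer. It also skips empty fields, which the original program does not need to do.

```shell
# Hypothetical helper: read stdin and print every line in which some word
# occurs exactly n times (mode "eq") or at least n times (mode "ge").
lines_with_repeated_word() {
    awk -F '[^_[:alnum:]]+' -v mode="$1" -v n="$2" '{
        split("", cnt)                                      # reset the per-line counts
        for (i = 1; i <= NF; i++) if ($i != "") cnt[$i]++   # count non-empty fields
        for (w in cnt)
            if ((mode == "eq" && cnt[w] == n) || (mode == "ge" && cnt[w] >= n)) {
                print
                next                                        # one hit is enough
            }
    }'
}

# Made-up sample: "a" occurs 3 times, then 4 times, then not at all.
sample='a b a c a
a a a a
x y z'

printf '%s\n' "$sample" | lines_with_repeated_word eq 3   # prints: a b a c a
printf '%s\n' "$sample" | lines_with_repeated_word ge 3   # prints both of the first two lines
```

Passing the comparison as an argument keeps a single Awk program for both cases instead of editing == into >= by hand.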


Reading material

[1] https://www.gnu.org/software/grep/manual/html_node/Back_002dreferences-and-Subexpressions.html
[2] https://www.gnu.org/software/grep/manual/html_node/The-Backslash-Character-and-Special-Expressions.html
[3] https://en.wikipedia.org/wiki/Regular_expression#Character_classes

Since regex quantifiers are greedy, I think this will suffice: (\w+)(.+\1){2} - glenn jackman
Also, extended regexes provide the word boundary anchors \< and \> - glenn jackman
(1) @glennjackman That would match lorem ipsum lor ipsum lorem dolor. - Quasímodo
or gawk -F'\\<ab\\>' 'NF==3' infile - αғsнιη
@αғsнιη Indeed, but should be NF==4. - Quasímodo
echo 'dog and foo and bar and baz land good' | grep -E '(\W|^)(\w+)\W(.*\<\2\>){2}' gave me Aborted (core dumped) .. I just wanted to test if it will run into this bug: unix.stackexchange.com/questions/579889/… - Sundeep
@Sundeep Oh dear! Nice find. Until that bug is fixed, it is better to favor the POSIX Awk version. - Quasímodo
1
[+9] [2020-11-30 13:01:12] umläute

Like this?

egrep '(\<.+\>).+\<\1\>.+\<\1\>'
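  • egrep (or grep -E) enables extended regexes. POSIX only specifies back-references for basic regexes; supporting them in extended regexes is an implementation extension, which GNU grep provides
  • \<.+\> will match any word of at least 1 character (.+ can also cross spaces and capture a run of several words, which is harmless here because that whole run then has to occur three times)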
    • \< resp \> match word boundaries (in your attempt you didn't take word boundaries into account at all)
    • .+ matches a sequence of one or more characters (in your attempt you used .* which matches a sequence of zero or more characters!)
  • use back-references, to check whether the matched sequence occurs a 2nd time (\1) and a 3rd time (\1 again).
    • we allow any sequence of one or more characters (.+) between the matches, so "foo bar foo dorbs foo godly" will match (there are 3 occurrences of the word "foo").
    • if you only want to match adjacent words (e.g. "foo foo foo"), use something like [[:space:]]+ instead.
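The two points above about .* and word boundaries can be checked on a few lines. This is a minimal sketch: the sample text and the variable names all and hits are made up for illustration, and the second command assumes a grep that accepts \<, \> and back-references in extended regexes (GNU grep does; POSIX does not require it).

```shell
# Made-up sample: only the first line has the same whole word three times.
sample='foo bar foo dorbs foo godly
foo foobar foo food
one two three'

# The attempt from the question: .* may match the empty string, so three
# repetitions of the group match every line and no repetition is checked.
all=$(printf '%s\n' "$sample" | grep -c '\(.*\)\{3\}')

# The word-boundary pattern from this answer: intended to select only lines
# with three whole-word occurrences, so "foobar" and "food" do not count.
hits=$(printf '%s\n' "$sample" | grep -E '(\<.+\>).+\<\1\>.+\<\1\>')

echo "$all"    # 3
echo "$hits"
```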

(3) Very nice! It's not POSIX compliant because of \<\> but BSD Grep and GNU Grep both understand them. - Quasímodo
2
[+3] [2020-11-30 14:12:11] thanasisp

I assume your question means: if any word in the line occurs at least 3 times, print the line, otherwise discard it. I would use awk, for a more readable and customizable solution:

awk -F '\\W+' '{
    delete c; for (i=1;i<=NF;i++) if (length($i) && ++c[$i]==3) {print; next}
}' file

It loops over all fields, counting their occurrences per line. As soon as any word reaches 3 occurrences, it prints the line and goes to the next line, where delete [1] clears the array again. The length test skips empty fields, so they are never counted.

Here we can easily customize the meaning of "word" by choosing a different field separator with -F, which takes an ERE. In the above, word separators are all characters except _ and [:alnum:]: awk -F '\\W+' (\W is a GNU Awk extension) or the portable awk -F '[^_[:alnum:]]+', similar to matching word boundaries with grep.

For a human language, we may need different word boundaries, such as everything except letters: awk -F '[^[:alpha:]]+', or everything except letters and digits: awk -F '[^[:alnum:]]+', or to include not only the underscore but also the dash in words: awk -F '[^-_[:alnum:]]+'.

Without -F, fields are separated by runs of whitespace only.
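To make the effect of the separator concrete, here is a small sketch that runs the same program with two of the separators mentioned above. The function name repeated_word_lines and the sample line are assumptions chosen for illustration; the awk body is the one from this answer.

```shell
# Hypothetical helper: $1 is the field-separator regex; prints lines of stdin
# in which some non-empty field occurs at least 3 times.
repeated_word_lines() {
    awk -F "$1" '{
        delete c    # start each line with empty counts
        for (i = 1; i <= NF; i++)
            if (length($i) && ++c[$i] == 3) { print; next }
    }'
}

line='x-ray x-ray y-ray'

# Dash is a separator: the fields are x, ray, x, ray, y, ray, so "ray"
# reaches 3 and the line is printed.
printf '%s\n' "$line" | repeated_word_lines '[^_[:alnum:]]+'

# Dash is a word character: the fields are x-ray, x-ray, y-ray, no field
# reaches 3, and nothing is printed.
printf '%s\n' "$line" | repeated_word_lines '[^-_[:alnum:]]+'
```

Only the separator changes between the two calls; the counting logic stays the same.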

[1] https://www.gnu.org/software/gawk/manual/html_node/Delete.html

The main reason I added this answer is that the ERE definition of "word" may not be enough in some cases. For example, in human language, dashes can be defined as word characters, so we might have to exclude x-ray x-ray y-ray, and complex EREs with word boundaries are not easy to modify for such cases. Using awk, we just adjust the FS. - thanasisp
3
[+1] [2020-12-01 14:20:48] MikeFHay

grep -P '(\b\w+\b)(.*\b\1\b){2}'

See explanation and test cases at https://regex101.com/r/Kr2VUc/2 . You may also want to make this case-insensitive:

grep -P '(?i)(\b\w+\b)(.*\b\1\b){2}'


4
[0] [2020-11-30 21:59:53] guest_7

  • GNU sed in extended regex mode -E to detect lines in which any word is repeated exactly 3 times in a line.

$ r1='.*\<\1\>'
$ r2=$r1$r1 r3=$r2$r1
$ sed -Ee "
    /\<(\w+)\>$r2/! d
    /\<(\w+)\>$r3/d
" file

  • Perl, using a hash that stores each word as key and its count in the current line as value.

$ perl -lne 'my %h;
    $h{$_}++ for /\w+/g;
    print if grep { $_ == 3 } values %h;
' file


5