I want to find the lines that contain any word three times. For this, I thought it would be best to use the grep command.
This was my attempt.
grep '\(.*\)\{3\}' myfile.txt
[ACCEPTED]
Using the standard word definition,
GNU Grep, 3 or more occurrences of any word.
grep -E '(\W|^)(\w+)\W(.*\<\2\>){2}' file
GNU Grep, only 3 occurrences of any word.
grep -E '(\W|^)(\w+)\W(.*\<\2\>){2}' file | grep -Ev '(\W|^)(\w+)\W(.*\<\2\>){3}'
POSIX Awk, only 3 occurrences of any word.
awk -F '[^_[:alnum:]]+' '{ # Field separator is non-word sequences
    split("", cnt) # Delete array cnt
    for (i=1; i<=NF; i++) cnt[$i]++ # Count number of occurrences of each word
    for (i in cnt) {
        if (cnt[i]==3) { # If a word appears exactly 3 times
            print # Print the line
            break
        }
    }
}' file
For 3 or more occurrences, simply change == to >=.
Equivalent golfed one-liner:
awk -F '[^_[:alnum:]]+' '{split("",c);for(i=1;i<=NF;i++)c[$i]++;for(i in c)if(c[i]==3){print;next;}}' file
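To check the difference between "exactly 3" and "3 or more", here is a minimal sketch that wraps the same POSIX Awk counting logic in a shell function. The function name triple_words, its operator argument, and the sample lines are made up for illustration; they are not part of the original answer.

```shell
# Hypothetical helper: print lines in which some word's count satisfies
# "count OP 3", where OP ($1) is "==" (exactly 3) or ">=" (3 or more).
triple_words() {
    awk -F '[^_[:alnum:]]+' -v op="$1" '{
        split("", cnt)                    # reset the per-line counters
        for (i = 1; i <= NF; i++)
            if ($i != "") cnt[$i]++       # skip empty leading/trailing fields
        for (w in cnt)
            if ((op == "==" && cnt[w] == 3) || (op == ">=" && cnt[w] >= 3)) {
                print
                next
            }
    }'
}

printf '%s\n' 'a b a c a' 'a a a a' 'a b c' | triple_words '=='
printf '%s\n' 'a b a c a' 'a a a a' 'a b c' | triple_words '>='
```

The first call prints only "a b a c a"; the second also prints "a a a a", where the word occurs four times.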
GNU Awk, only 3 occurrences of the word ab.
gawk 'gsub(/\<ab\>/,"&")==3' file
For 3 or more occurrences, simply change == to >=.
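The \< and \> operators make the command above specific to GNU Awk. A rough portable sketch, assuming the same word definition as the POSIX Awk answer (runs of letters, digits and underscores), compares whole fields instead. The function name count_word_lines and the sample lines are made up for illustration.

```shell
# Hypothetical helper: print lines where the word given as $1 occurs exactly 3 times.
count_word_lines() {
    awk -F '[^_[:alnum:]]+' -v word="$1" '{
        n = 0
        for (i = 1; i <= NF; i++)
            # Concatenating "" forces a string comparison of the whole field,
            # so "abc" is not counted as "ab".
            if (($i "") == (word "")) n++
        if (n == 3) print
    }'
}

printf '%s\n' 'ab cd ab, ab.' 'ab abc ab' 'ab ab ab ab' | count_word_lines ab
```

Only the first sample line is printed: the second contains ab twice (abc is a different word) and the third contains it four times.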
Reading material
\2 is a back-reference.
\w \W \< \> are special expressions in GNU Grep.
[:alnum:] is a POSIX character class.
(\w+)(.+\1){2} - glenn jackman
\< and \> - glenn jackman
lorem ipsum lor ipsum lorem dolor. - Quasímodo
gawk -F'\\<ab\\>' 'NF==3' infile - αғsнιη
NF==4. - Quasímodo
echo 'dog and foo and bar and baz land good' | grep -E '(\W|^)(\w+)\W(.*\<\2\>){2}' gave me Aborted (core dumped) .. I just wanted to test if it will run into this bug: unix.stackexchange.com/questions/579889/… - Sundeep
Like this?
egrep '(\<.+\>).+\<\1\>.+\<\1\>'
egrep (or grep -E) enables extended regexes, where groups are written ( ) rather than \( \). Backreferences in extended regexes are not specified by POSIX, but GNU Grep supports them.
\<.+\> will match any word of at least 1 character.
\< resp. \> match word boundaries (in your attempt you didn't take word boundaries into account at all).
.+ matches a sequence of one or more characters (in your attempt you used .* which matches a sequence of zero or more characters!).
The captured word must then match a 2nd time (\1) and a 3rd time (\1 again).
There can be arbitrary text (.+) between the matches, so "foo bar foo dorbs foo godly" will match (there are 3 occurrences of the word "foo").
[...] [[:space:]]+ instead.
[...] \<\> but BSD Grep and GNU Grep both understand them. - Quasímodo
I assume that your question means if any of the words in the line exists at least 3 times, then print the line, else discard it. I would use awk, for a more readable and customizable solution:
awk -F '\\W+' '{
delete c; for (i=1;i<=NF;i++) if (length($i) && ++c[$i]==3) {print; next}
}' file
The loop runs over all fields, counting the occurrences of each word per line. As soon as any word reaches 3 occurrences, awk prints the line and moves on to the next one, where the array is deleted before counting starts again. The length test skips empty fields, so they are never counted.
Here we can easily customize the meaning of "word" by choosing a different field separator with -F, which awk treats as an extended regular expression. In the above, word separators are all characters except _ and [:alnum:]: awk -F '\\W+' (\W is a GNU Awk extension, not POSIX) or awk -F '[^_[:alnum:]]+', similar to matching word boundaries with grep.
For a human language, we may need different word boundaries, such as everything except letters: awk -F '[^[:alpha:]]+', everything except letters and digits: awk -F '[^[:alnum:]]+', or, to allow the dash as well as the underscore inside words: awk -F '[^-_[:alnum:]]+'.
Without setting -F, only the whitespace characters are used.
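To see how the choice of separator changes the result, the sketch below runs the same counting logic with two word definitions on one sample line. The function name at_least_three and the sample text are illustrative assumptions; split("", c) is the array-clearing idiom from the POSIX Awk answer above, used here in place of delete c.

```shell
# Hypothetical helper: $1 is the field-separator regex, i.e. what is NOT part of a word.
at_least_three() {
    awk -F "$1" '{
        split("", c)                      # clear the per-line counters
        for (i = 1; i <= NF; i++)
            if (length($i) && ++c[$i] == 3) { print; next }
    }'
}

line='x-ray x-ray y-ray'
# Dash separates words: "ray" is seen three times, so the line is printed.
printf '%s\n' "$line" | at_least_three '[^_[:alnum:]]+'
# Dash is part of a word: "x-ray" is seen twice and "y-ray" once, so nothing is printed.
printf '%s\n' "$line" | at_least_three '[^-_[:alnum:]]+'
```

Passing the separator as an argument keeps the counting code unchanged while the definition of "word" varies.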
x-ray x-ray y-ray and complex EREs with word boundaries are not easy to modify for these cases. Using awk, we adjust the FS. - thanasisp
grep -P '(\b\w+\b)(.*\b\1\b){2}'
See explanation and test cases at https://regex101.com/r/Kr2VUc/2 . You may also want to make this case-insensitive:
grep -P '(?i)(\b\w+\b)(.*\b\1\b){2}'
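Where grep -P is not available, case-insensitive counting can be approximated in awk by lowercasing each word before it is counted. This is a minimal sketch under that assumption, not an exact equivalent of the PCRE pattern; the function name triple_words_nocase and the sample lines are made up for illustration.

```shell
# Hypothetical helper: print lines in which some word occurs at least 3 times,
# ignoring case. Words are runs of letters, digits and underscores.
triple_words_nocase() {
    awk -F '[^_[:alnum:]]+' '{
        split("", c)                      # clear the per-line counters
        for (i = 1; i <= NF; i++)
            if ($i != "" && ++c[tolower($i)] == 3) { print; next }
    }'
}

printf '%s\n' 'Dog dog DOG cat' 'Dog cat dog' | triple_words_nocase
```

Only the first sample line is printed, since Dog, dog and DOG are all counted under the key dog.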
GNU sed in extended regex mode (-E), printing the lines in which some word occurs exactly 3 times.
$ r1='.*\<\1\>'
$ r2=$r1$r1 r3=$r2$r1
$ sed -Ee "
/\<(\w+)\>$r2/! d
/\<(\w+)\>$r3/d
" file
$ perl -lne 'my %h;
$h{$_}++ for /\w+/g;
print if grep { $_ == 3 } values %h;
' file