Stack Overflow: Build an ASCII chart of the most commonly used words in a given text
[+156] [59] ChristopheD
[2010-07-02 20:54:05]
[ language-agnostic code-golf ]
[ https://stackoverflow.com/questions/3169051/build-an-ascii-chart-of-the-most-commonly-used-words-in-a-given-text ]

The challenge:

Build an ASCII chart of the most commonly used words in a given text.

The rules:

Parse a given text (read a file specified via command line arguments or piped in; presume us-ascii) and build us a word frequency chart with the following characteristics:

An example:

The text for the example can be found here [1] (Alice's Adventures in Wonderland, by Lewis Carroll).

This specific text would yield the following chart:

 _________________________________________________________________________
|_________________________________________________________________________| she 
|_______________________________________________________________| you 
|____________________________________________________________| said 
|____________________________________________________| alice 
|______________________________________________| was 
|__________________________________________| that 
|___________________________________| as 
|_______________________________| her 
|____________________________| with 
|____________________________| at 
|___________________________| s 
|___________________________| t 
|_________________________| on 
|_________________________| all 
|______________________| this 
|______________________| for 
|______________________| had 
|_____________________| but 
|____________________| be 
|____________________| not 
|___________________| they 
|__________________| so 


For your information: these are the frequencies the above chart is built upon:

[('she', 553), ('you', 481), ('said', 462), ('alice', 403), ('was', 358), ('that', 330),
 ('as', 274), ('her', 248), ('with', 227), ('at', 227), ('s', 219), ('t', 218),
 ('on', 204), ('all', 200), ('this', 181), ('for', 179), ('had', 178), ('but', 175),
 ('be', 167), ('not', 166), ('they', 155), ('so', 152)]

A second example (to check if you implemented the complete spec): Replace every occurrence of you in the linked Alice in Wonderland file with superlongstringstring:

 ________________________________________________________________
|________________________________________________________________| she 
|_______________________________________________________| superlongstringstring 
|_____________________________________________________| said 
|______________________________________________| alice 
|________________________________________| was 
|_____________________________________| that 
|______________________________| as 
|___________________________| her 
|_________________________| with 
|_________________________| at 
|________________________| s 
|________________________| t 
|______________________| on 
|_____________________| all 
|___________________| this 
|___________________| for 
|___________________| had 
|__________________| but 
|_________________| be 
|_________________| not 
|________________| they 
|________________| so 
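The second example is narrower because the long word, not the most frequent one, hits the 80-column limit first. A minimal Python sketch of that scaling, assuming the (bar + space + word + space) <= 80 reading of the constraint and rounding bar lengths down (the rounding is an assumption, so individual bars may differ by a character from the sample output); the function name `chart` is illustrative:

```python
from fractions import Fraction


def chart(freqs, width=80):
    """Render (word, count) pairs, sorted by descending count, as bars."""
    # Every line is '|' + bar + '| ' + word + ' ', i.e. bar + word + 4 chars.
    # The scale (underscores per occurrence) is limited by whichever line
    # would overflow first, which need not be the most frequent word.
    scale = min(Fraction(width - 4 - len(word), count) for word, count in freqs)
    bars = ["_" * int(count * scale) for _, count in freqs]
    lines = [" " + bars[0]]  # top edge of the first bar
    # The optional trailing space is counted above but not printed.
    lines += ["|%s| %s" % (bar, word) for bar, (word, _) in zip(bars, freqs)]
    return "\n".join(lines)
```

With `she` at 553 and `superlongstringstring` at 481 occurrences, the long word limits the scale to 55/481 underscores per occurrence, so the top bar shrinks from 73 to 63 underscores. Exact fractions are used to avoid a float rounding a full-width bar down by one.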

The winner:

Shortest solution (by character count, per language). Have fun!


Edit: Table summarizing the results so far (2012-02-15) (originally added by user Nas Banov):

Language          Relaxed  Strict
=========         =======  ======
GolfScript          130     143
Perl                        185
Windows PowerShell  148     199
Mathematica                 199
Ruby                185     205
Unix Toolchain      194     228
Python              183     243
Clojure                     282
Scala                       311
Haskell                     333
Awk                         336
R                   298
Javascript          304     354
Groovy              321
Matlab                      404
C#                          422
Smalltalk           386
PHP                 450
F#                          452
TSQL                483     507

The numbers represent the length of the shortest solution in a specific language. "Strict" refers to a solution that implements the spec completely (draws |____| bars, closes the first bar on top with a ____ line, accounts for the possibility of long words with high frequency etc). "Relaxed" means some liberties were taken to shorten the solution.

Only solutions shorter than 500 characters are included. The list of languages is sorted by the length of the 'strict' solution. 'Unix Toolchain' is used to signify various solutions that use traditional *nix shell plus a mix of tools (like grep, tr, sort, uniq, head, perl, awk).

Spec: how to decide how big to make bars? E.g. all bars as 1 space long would fit the spec I think? - Brian
Spec: I presume we can assume that no 80+char-long word will be among most frequent (e.g. all words are reasonable length). - Brian
@Brian: good points, will edit a little further to (hopefully) clarify. - ChristopheD
@Brian: the length of the longest bar appears to be specified exactly, and the rest should scale, no? - dmckee --- ex-moderator kitten
(4) Well, 'longest bar'+word=80 may not fit within 80 cols if second-most-common-word is a much longer word. Am looking for the 'max constraint' I guess. - Brian
(1) Do we normalize casing? 'She' = 'she'? - Brian
Rather than blacklist e.g. 'ignore punctuation' (currently hyphens, numbers, etc would count), can we just whitelist alphabetics (a-z, A-Z)? - Brian
@Brian, dmckee: that's the idea yes; 1. scale bars according to the frequencies they represent and 2. maximize bar width within the (bar + space + word + space) <= 80 constraint. I hope the spec is clearer now... - ChristopheD
@Brian: that'd be better, I'll update the spec a little - ChristopheD
@Brian: casing can be ignored: She == she - ChristopheD
(2) IMO making this perform, both in terms of execution time and memory usage, seems like a more interesting challenge than character count. - Frank Farmer
(81) I'm glad to see that my favorite words s and t are represented. - indiv
(1) @Frank, yes, but then it wouldn't be code golf. - JSBձոգչ
(1) @indiv: funny ;-) (feels a little late to change the spec now though ; already 3 solutions in ;-) - ChristopheD
@ChristopheD: change it to "don't count words of length 1", that will take care of 's' and 't' as well as 'i' and 'a'. Seems like 15 votes till now disagree that it's too late - Nas Banov
(8) @indiv, @Nas Banov -- silly too-simple tokenizer reads "didn't" as {didn, t} and "she's" as {she, s} :) - hobbs
What should happen with the trailing space? Your sample output has it on each line, however, I read the bar length restriction as if that should only be included for calculation purposes. So should a trailing space be output on every line? - Joey
@Johannes Rössel: feel free to drop the trailing space writing (most solutions don't have them), I'll edit the question a bit. - ChristopheD
@indiv, @Nas Banov: updated the spec to allow (optionally) for the dropping of 1-character words. - ChristopheD
@Johannes Rössel: i think logically the trailing space should not be included in the output, because otherwise the following new-line (at position 80) will cause empty line/line to be skipped. - Nas Banov
Does the spec really need to say "make everything lowercase", or is it sufficient to make "She" and "she" show up in the same bin as you told Brian? - Gabe
@Gabe: good point (although the results shouldn't differ), I'll update the spec a little. - ChristopheD
Thanks. I just wanted to make sure that it doesn't matter whether I print "she" or "She", so long as it has the right length bar. - Gabe
@LiraNuna: thanks! But the next time I'll make sure my specifications are a lot more rigid and unambiguous before posting ;-) - ChristopheD
You really should provide a test case that handles your bar length issue... There seem to be quite a few answers here that don't take that into account... - gnarf
@gnarf: I've added a second sample output about an hour ago (would this be sufficient?). I feel bad for not providing this sample output when I posted the question although it was already worded in the specification since the beginning. In a few days I'll make an ASCII ranking (char count - language - user - full spec implemented). Implementing this bar+word scaling in general accounts for about 50/60 extra chars. - ChristopheD
@ChristopheD -- sorry missed the sed 's/\byou\b/superlongstring/gI' 11.txt portion... :) - gnarf
Hm, fun quiz: Which implementations handle "foo_the" in the input string correctly? According to the spec only foo needs to remain. However, a few regex-based splitting implementations (mine included) yield foo and the (I'm going to fix it, but it might be a pretty common error as well). [And yes, bar scaling is exactly 50 additional chars here ;-)] - Joey
Hm, me again: What can we assume about the input (apart from being ASCII and no word is longer than 75 characters)? How many words are there at least? How many distinct words are there at least? - Joey
Can you really assume no word is longer than 75 characters? I would only assume that no word in the top 22 would be that long, but you can imagine a line of asterisks or Os. - Gabe
@Johannes Rössel: considering the fun quiz: foo_the should become two separate words (foo and the) in the spirit of the fourth bullet point in the spec. I know it isn't ideal, but the spec is already on the heavy side. - ChristopheD
@Johannes Rössel, Gabe: you can indeed safely assume that no word in the input is longer than 75 characters (for this code-golf) and that there are at least 22 different words. - ChristopheD
Where in the specs is it asked to write the first bar (the one without words)? Many solutions implement that to look like the given sample chart, but it seems not to be asked. - kriss
Can someone please do one in LOLCODE? That would look really funny. - Almo
[+122] [2010-07-04 05:07:54] Joe Z

LabVIEW 51 nodes, 5 structures, 10 diagrams

Teaching the elephant to tap-dance is never pretty. I'll, ah, skip the character count.

[image: LabVIEW code]

[image: results]

The program flows from left to right:

[image: LabVIEW code explained]


Wow, I've never before seen an example of visual programming that looks useful! I'd thought it was kind of consensus that it was impossible or, not worth it. - JDonner
(10) It IS not worth it - user216441
(4) LabVIEW's very happy in its hardware control and measurement niche, but really pretty awful for string manipulation. - Joe Z
(19) Best code golf answer I've seen. +1 for thinking outside the box! - Blair Holloway
(1) Gotta count the elements for us...every box and widget you had to drag to the screen counts. - dmckee --- ex-moderator kitten
@dmckee Good call. Most metrics are based on node count, so I'll add that. - Joe Z
@Underflow: Fair enough. I'm not sure it is a precise comparison, but it is something. - dmckee --- ex-moderator kitten
(1) Would it be possible to add a link to a bigger version of those charts? - Svish
@Svish Switched to a different host for the images. Hopefully it helps. - Joe Z
My first programming language was LabView, so it sends chills down my back to see it used so beautifully and with such great reception from the community. Very nice work. - Carter Pape
1
[+42] [2010-07-03 11:38:27] Ventero

Ruby 1.9, 185 chars

(heavily based on the other Ruby solutions)

w=($<.read.downcase.scan(/[a-z]+/)-%w{the and of to a i it in or is}).group_by{|x|x}.map{|x,y|[-y.size,x]}.sort[0,22]
k,l=w[0]
puts [?\s+?_*m=76-l.size,w.map{|f,x|?|+?_*(f*m/k)+"| "+x}]

Instead of using any command line switches like the other solutions, you can simply pass the filename as argument. (i.e. ruby1.9 wordfrequency.rb Alice.txt)

Since I'm using character-literals here, this solution only works in Ruby 1.9.

Edit: Replaced semicolons by line breaks for "readability". :P

Edit 2: Shtééf pointed out I forgot the trailing space - fixed that.

Edit 3: Removed the trailing space again ;)
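The `[-y.size,x]` pairs are what let the one-liner call `sort` without a block. The same idea as a small Python sketch (the name `rank` is only illustrative):

```python
def rank(counts):
    """Order a {word: count} mapping by descending count using a plain sort."""
    # Negating the count makes the default ascending sort put the most
    # frequent word first; ties fall back to alphabetical order.
    return sorted((-count, word) for word, count in counts.items())
```

Note that ties come out alphabetically this way, so `at` would be listed before `with` when both occur 227 times.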


It's missing the trailing space, after each word. - Stéphan Kochen
Aww shoot, disregard that. Looks like the golf was just updated, trailing space no longer required. :) - Stéphan Kochen
Does not seem to accommodate for 'superlongstringstring' in 2nd or later position? (see problem description) - Nas Banov
(2) That looks really maintainable. - Zombies
2
[+39] [2010-07-03 09:52:29] Nabb

GolfScript, 177 175 173 167 164 163 144 131 130 chars

Slow - 3 minutes for the sample text (130)

{32|.123%97<n@if}%]''*n%"oftoitinorisa"2/-"theandi"3/-$(1@{.3$>1{;)}if}/]2/{~~\;}$22<.0=~:2;,76\-:1'_':0*' '\@{"
|"\~1*2/0*'| '@}/

Explanation:

{           #loop through all characters
 32|.       #convert to lowercase (set bit 32) and duplicate
 123%97<    #determine if is a letter
 n@if       #return either the letter or a newline
}%          #return an array (of ints)
]''*        #convert array to a string with magic
n%          #split on newline, removing blanks (stack is an array of words now)
"oftoitinorisa"   #push this string
2/          #split into groups of two, i.e. ["of" "to" "it" "in" "or" "is" "a"]
-           #remove any occurrences from the text
"theandi"3/-#remove "the", "and", and "i"
$           #sort the array of words
(1@         #takes the first word in the array, pushes a 1, reorders stack
            #the 1 is the current number of occurrences of the first word
{           #loop through the array
 .3$>1{;)}if#increment the count or push the next word and a 1
}/
]2/         #gather stack into an array and split into groups of 2
{~~\;}$     #sort by the latter element - the count of occurrences of each word
22<         #take the first 22 elements
.0=~:2;     #store the highest count
,76\-:1     #store the length of the first line
'_':0*' '\@ #make the first line
{           #loop through each word
"
|"\~        #start drawing the bar
1*2/0       #divide by zero
*'| '@      #finish drawing the bar
}/
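The middle of the explanation (sort the words, then walk the array while either incrementing the current count or starting a new one) is a run-length count over sorted input. A rough Python equivalent of just that step, as a sketch with an illustrative function name:

```python
def count_sorted(words):
    """Count occurrences by sorting and then collapsing runs of equal words."""
    pairs = []  # [word, count] pairs, in alphabetical order
    for word in sorted(words):
        if pairs and pairs[-1][0] == word:
            pairs[-1][1] += 1        # same word as the previous one: bump its count
        else:
            pairs.append([word, 1])  # new word: start a run with a count of 1
    # Most frequent first, as the chart needs (the sort is stable, so ties
    # stay in alphabetical order).
    return sorted(pairs, key=lambda pair: -pair[1])
```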

"Correct" (hopefully). (143)

{32|.123%97<n@if}%]''*n%"oftoitinorisa"2/-"theandi"3/-$(1@{.3$>1{;)}if}/]2/{~~\;}$22<..0=1=:^;{~76@,-^*\/}%$0=:1'_':0*' '\@{"
|"\~1*^/0*'| '@}/

Less slow - half a minute. (162)

'"'/' ':S*n/S*'"#{%q
'\+"
.downcase.tr('^a-z','
')}\""+~n%"oftoitinorisa"2/-"theandi"3/-$(1@{.3$>1{;)}if}/]2/{~~\;}$22<.0=~:2;,76\-:1'_':0*S\@{"
|"\~1*2/0*'| '@}/

Output visible in revision logs.


(2) About GolfScript: golfscript.com/golfscript - Assaf Lavie
(2) Not correct, in that if the second word is really long it will wrap to the next line. - Gabe
(5) "divide by zero" ...GolfScript allows that? - JAB
3
[+35] [2010-07-03 09:15:35] stor

206

shell, grep, tr, grep, sort, uniq, sort, head, perl

~ % wc -c wfg
209 wfg
~ % cat wfg
egrep -oi \\b[a-z]+|tr A-Z a-z|egrep -wv 'the|and|of|to|a|i|it|in|or|is'|sort|uniq -c|sort -nr|head -22|perl -lape'($f,$w)=@F;$.>1or($q,$x)=($f,76-length$w);$b="_"x($f/$q*$x);$_="|$b| $w ";$.>1or$_=" $b\n$_"'
~ % # usage:
~ % sh wfg < 11.txt

hm, just seen above: sort -nr -> sort -n and then head -> tail => 208 :)
update2: erm, of course the above is silly, as it will be reversed then. So, 209.
update3: optimized the exclusion regexp -> 206

egrep -oi \\b[a-z]+|tr A-Z a-z|egrep -wv 'the|and|o[fr]|to|a|i[tns]?'|sort|uniq -c|sort -nr|head -22|perl -lape'($f,$w)=@F;$.>1or($q,$x)=($f,76-length$w);$b="_"x($f/$q*$x);$_="|$b| $w ";$.>1or$_=" $b\n$_"'
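The shortened exclusion pattern should match exactly the same ten words as the long alternation. A quick way to convince yourself, sketched in Python; `fullmatch` stands in for egrep's `-w` whole-word matching here, and the helper name is hypothetical:

```python
import re

STOP_WORDS = {"the", "and", "of", "to", "a", "i", "it", "in", "or", "is"}
COMPACT = re.compile(r"the|and|o[fr]|to|a|i[tns]?")


def is_stop_word(word):
    """True if the whole word matches the compact pattern."""
    return COMPACT.fullmatch(word) is not None
```

`o[fr]` covers of/or and `i[tns]?` covers i/it/in/is, while near misses such as `at`, `on` or `its` are left alone.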



for fun, here's a perl-only version (much faster):

~ % wc -c pgolf
204 pgolf
~ % cat pgolf
perl -lne'$1=~/^(the|and|o[fr]|to|.|i[tns])$/i||$f{lc$1}++while/\b([a-z]+)/gi}{@w=(sort{$f{$b}<=>$f{$a}}keys%f)[0..21];$Q=$f{$_=$w[0]};$B=76-y///c;print" "."_"x$B;print"|"."_"x($B*$f{$_}/$Q)."| $_"for@w'
~ % # usage:
~ % sh pgolf < 11.txt

4
[+35] [2010-07-03 23:48:56] Martin Smith

Transact SQL set based solution (SQL Server 2005) 1063 892 873 853 827 820 783 683 647 644 630 characters

Thanks to Gabe for some useful suggestions to reduce the character count.

NB: Line breaks added to avoid scrollbars; only the last line break is required.

DECLARE @ VARCHAR(MAX),@F REAL SELECT @=BulkColumn FROM OPENROWSET(BULK'A',
SINGLE_BLOB)x;WITH N AS(SELECT 1 i,LEFT(@,1)L UNION ALL SELECT i+1,SUBSTRING
(@,i+1,1)FROM N WHERE i<LEN(@))SELECT i,L,i-RANK()OVER(ORDER BY i)R INTO #D
FROM N WHERE L LIKE'[A-Z]'OPTION(MAXRECURSION 0)SELECT TOP 22 W,-COUNT(*)C
INTO # FROM(SELECT DISTINCT R,(SELECT''+L FROM #D WHERE R=b.R FOR XML PATH
(''))W FROM #D b)t WHERE LEN(W)>1 AND W NOT IN('the','and','of','to','it',
'in','or','is')GROUP BY W ORDER BY C SELECT @F=MIN(($76-LEN(W))/-C),@=' '+
REPLICATE('_',-MIN(C)*@F)+' 'FROM # SELECT @=@+' 
|'+REPLICATE('_',-C*@F)+'| '+W FROM # ORDER BY C PRINT @

Readable Version

DECLARE @  VARCHAR(MAX),
        @F REAL
SELECT @=BulkColumn
FROM   OPENROWSET(BULK'A',SINGLE_BLOB)x; /*  Loads text file from path
                                             C:\WINDOWS\system32\A  */

/*Recursive common table expression to
generate a table of numbers from 1 to string length
(and associated characters)*/
WITH N AS
     (SELECT 1 i,
             LEFT(@,1)L

     UNION ALL

     SELECT i+1,
            SUBSTRING(@,i+1,1)
     FROM   N
     WHERE  i<LEN(@)
     )
  SELECT   i,
           L,
           i-RANK()OVER(ORDER BY i)R
           /*Will group characters
           from the same word together*/
  INTO     #D
  FROM     N
  WHERE    L LIKE'[A-Z]'OPTION(MAXRECURSION 0)
             /*Assuming case insensitive accent sensitive collation*/

SELECT   TOP 22 W,
         -COUNT(*)C
INTO     #
FROM     (SELECT DISTINCT R,
                          (SELECT ''+L
                          FROM    #D
                          WHERE   R=b.R FOR XML PATH('')
                          )W
                          /*Reconstitute the word from the characters*/
         FROM             #D b
         )
         T
WHERE    LEN(W)>1
AND      W NOT IN('the',
                  'and',
                  'of' ,
                  'to' ,
                  'it' ,
                  'in' ,
                  'or' ,
                  'is')
GROUP BY W
ORDER BY C

/*Just noticed this looks risky as it relies on the order of evaluation of the 
 variables. I'm not sure that's guaranteed but it works on my machine :-) */
SELECT @F=MIN(($76-LEN(W))/-C),
       @ =' '      +REPLICATE('_',-MIN(C)*@F)+' '
FROM   #

SELECT @=@+' 
|'+REPLICATE('_',-C*@F)+'| '+W
             FROM     #
             ORDER BY C

PRINT @
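The `i-RANK()OVER(ORDER BY i)` expression is what glues single characters back into words: position minus rank stays constant within a run of adjacent letters and jumps at every gap. A small Python sketch of the same idea, with illustrative names and `isalpha` standing in for the `LIKE'[A-Z]'` filter:

```python
from itertools import groupby


def words_by_islands(text):
    """Rebuild words from single characters, mimicking the i - RANK() trick."""
    # Keep (position, character) only for ASCII letters.
    letters = [(i, ch) for i, ch in enumerate(text, start=1)
               if ch.isascii() and ch.isalpha()]
    # Position minus rank among the kept rows is the same for all letters of
    # one word and grows after every skipped character.
    keyed = [(i - rank, ch) for rank, (i, ch) in enumerate(letters, start=1)]
    return ["".join(ch for _, ch in group)
            for _, group in groupby(keyed, key=lambda pair: pair[0])]
```

For "She's here" the keys are 0, 0, 0 for `She`, 1 for `s` and 2 for each letter of `here`, giving three groups.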

Output

 _________________________________________________________________________ 
|_________________________________________________________________________| she
|_______________________________________________________________| You
|____________________________________________________________| said
|_____________________________________________________| Alice
|_______________________________________________| was
|___________________________________________| that
|____________________________________| as
|________________________________| her
|_____________________________| at
|_____________________________| with
|__________________________| on
|__________________________| all
|_______________________| This
|_______________________| for
|_______________________| had
|_______________________| but
|______________________| be
|_____________________| not
|____________________| they
|____________________| So
|___________________| very
|__________________| what

And with the long string

 _______________________________________________________________ 
|_______________________________________________________________| she
|_______________________________________________________| superlongstringstring
|____________________________________________________| said
|______________________________________________| Alice
|________________________________________| was
|_____________________________________| that
|_______________________________| as
|____________________________| her
|_________________________| at
|_________________________| with
|_______________________| on
|______________________| all
|____________________| This
|____________________| for
|____________________| had
|____________________| but
|___________________| be
|__________________| not
|_________________| they
|_________________| So
|________________| very
|________________| what

(12) I gave you a +1 because you did it in T-SQL, and to quote Team America - "You have balls. I like balls." - user174624
I took the liberty of converting some spaces into newlines to make it more readable. Hopefully I didn't mess things up. I also minified it a bit more. - Gabe
@Gabe Thanks. I ended up largely rewriting it though. It is now shorter and quicker than before. - Martin Smith
(3) That code is screaming at me! :O - Joey
(1) One good way to save is by changing 0.000 to just 0, then using -C instead of 1.0/C. And making FLOAT into REAL will save a stroke too. The biggest thing, though, is that it looks like you have lots of AS instances that should be optional. - Gabe
@Gabe - Thanks for the tips. I was able to replace float with real and get rid of some of the 'AS's the two that remain are both required. The -C thing didn't work. The row with 0 on it is the top of the top bar. This ended up positioned at the bottom and I would have needed to replace 0.000 with a large magnitude negative number to get it at the right place. Thanks though! - Martin Smith
How about this: SELECT $0 O, ' '+REPLICATE('_', MAX(C)*@F)+' ' [ ] FROM # UNION SELECT $1/C, '|'+REPLICATE('_',C*@F)+'| '+W FROM # ORDER BY 1 - Gabe
@Gabe - Yep that works I'll implement the $ thing thanks. The problem is though that it returns an additional column to the output that isn't part of the spec (Hence the need for the additional temp table step) - Martin Smith
(1) OK, how about SELECT [ ] FROM (SELECT $0 O, ' '+REPLICATE('_', MAX(C)*@F)+' ' [ ] FROM # UNION SELECT $1/C, '|'+REPLICATE('_',C*@F)+'| '+W FROM #)X ORDER BY O? - Gabe
@Gabe - Nice! That brings it down to comfortably less than 800. Thanks for your help! - Martin Smith
You don't need to declare @F where it's used. You can declare it up with @ and save a whole DECLARE worth of chars. - Gabe
Is i-ROW_NUMBER the same as RANK? Can the second CTE be moved into the FROM clause where it's used? Can the #D table query be made into a CTE? Can the #t table query be made into a CTE, or at least put into the FROM clause of the SELECT TOP 22 query? - Gabe
@Gabe - Thanks, All good points. Made some other simplifications as well and collectively knocked another 100 off. The '#D' needs to be a temp table. At the moment it takes about 12 seconds on my machine. Swapping to a CTE slowed it down massively (I cancelled the query after 2 minutes so don't know how long it would have taken-or indeed if it would have finished at all) - Martin Smith
@Gabe I love your duo with Martin. I took a punt and tried to shorten it (another answer) - I thought the CTE looked longish. - RichardTheKiwi
5
[+34] [2010-07-03 08:55:16] archgoon

Ruby 207 213 211 210 207 203 201 200 chars

An improvement on Anurag, incorporating suggestion from rfusca. Also removes argument to sort and a few other minor golfings.

w=(STDIN.read.downcase.scan(/[a-z]+/)-%w{the and of to a i it in or is}).group_by{|x|x}.map{|x,y|[-y.size,x]}.sort.take 22;k,l=w[0];m=76.0-l.size;puts' '+'_'*m;w.map{|f,x|puts"|#{'_'*(m*f/k)}| #{x} "}

Execute as:

ruby GolfedWordFrequencies.rb < Alice.txt

Edit: put 'puts' back in, needs to be there to avoid having quotes in output.
Edit2: Changed File->IO
Edit3: removed /i
Edit4: Removed parentheses around (f*1.0), recounted
Edit5: Use string addition for the first line; expand s in-place.
Edit6: Made m float, removed 1.0. EDIT: Doesn't work, changes lengths. EDIT: No worse than before
Edit7: Use STDIN.read.


+1 - love the sorting part, very clever :) - Anurag
Hey, small optimization compared to coming up with the bulk of it in the first place. :) - archgoon
Nice! Added two of the changes I also made in Anurag's version. Shaves off another 4. - Stéphan Kochen
The solution has deviated from the original output, I'm going to try and figure out where that happened. - archgoon
Huh, note that the last two are the same length (in our and several other versions), but the original questioner has them as different. Anurag's original solution has this issue. It's going to be a pain tracking it down. I'm putting back in the 76.0 trick, since it isn't the problem. - archgoon
@archgoon: I applaud your noble effort, but string addition is not shorter for the loop. It's only shorter because you took out the trailing space. But don't feel bad, it doesn't make Perl look any better. ;) - Stéphan Kochen
@Shtééf, ah, that explains why you didn't do it already ;). At least we got the proper count. Congratulations to you. - archgoon
How about [0..21] instead of .take 22? - Dogbert
(1) There's a shorter variant of this down further. - archgoon
6
[+28] [2010-07-03 02:43:14] Dr. belisarius

Mathematica (297 284 248 244 242 199 chars) Pure Functional

and Zipf's Law Testing

Look Mamma ... no vars, no hands, .. no head

Edit 1> some shorthands defined (284 chars)

f[x_, y_] := Flatten[Take[x, All, y]]; 

BarChart[f[{##}, -1], 
         BarOrigin -> Left, 
         ChartLabels -> Placed[f[{##}, 1], After], 
         Axes -> None
] 
& @@
Take[
  SortBy[
     Tally[
       Select[
        StringSplit[ToLowerCase[Import[i]], RegularExpression["\\W+"]], 
       !MemberQ[{"the", "and", "of", "to", "a", "i", "it", "in", "or","is"}, #]&]
     ], 
  Last], 
-22]

Some explanations

Import[] 
   # Get The File

ToLowerCase []
   # To Lower Case :)

StringSplit[ STRING , RegularExpression["\\W+"]]
   # Split By Words, getting a LIST

Select[ LIST, !MemberQ[{LIST_TO_AVOID}, #]&]
   #  Select from LIST except those words in LIST_TO_AVOID
   #  Note that !MemberQ[{LIST_TO_AVOID}, #]& is a FUNCTION for the test

Tally[LIST]
   # Get the LIST {word,word,..} 
     and produce another  {{word,counter},{word,counter}...}

SortBy[ LIST ,Last]
   # Get the list produced by Tally and sort by counters
     Note that counters are the LAST element of {word,counter}

Take[ LIST ,-22]
   # Once sorted, get the biggest 22 counters

BarChart[f[{##}, -1], ChartLabels -> Placed[f[{##}, 1], After]] &@@ LIST
   # Get the list produced by Take as input and produce a bar chart

f[x_, y_] := Flatten[Take[x, All, y]]
   # Auxiliary to get the list of the first or second element of lists of lists x_
     depending upon y
   # So f[{##}, -1] is the list of counters
   # and f[{##}, 1] is the list of words (labels for the chart)

Output

alt text http://i49.tinypic.com/2n8mrer.jpg [1]

Mathematica is not well suited for golfing, and that is just because of the long, descriptive function names. Functions like "RegularExpression[]" or "StringSplit[]" just make me sob :(.

Zipf's Law Testing

Zipf's law [2] predicts that, for a natural-language text, a plot of Log (Rank) against Log (occurrences) follows a linear relationship.

The law is used in developing algorithms for cryptography and data compression. (But it's NOT the "Z" in the LZW algorithm).

In our text, we can test it with the following

 f[x_, y_] := Flatten[Take[x, All, y]]; 
 ListLogLogPlot[
     Reverse[f[{##}, -1]], 
     AxesLabel -> {"Log (Rank)", "Log Counter"}, 
     PlotLabel -> "Testing Zipf's Law"]
 & @@
 Take[
  SortBy[
    Tally[
       StringSplit[ToLowerCase[b], RegularExpression["\\W+"]]
    ], 
   Last],
 -1000]

The result is (pretty well linear)

alt text http://i46.tinypic.com/33fcmdk.jpg [3]
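To put a number on "pretty well linear", one could fit a straight line to the log-log points. A minimal Python sketch using an ordinary least-squares slope; the function name is made up, and this is not what the Mathematica code above does (it only plots):

```python
import math


def zipf_slope(counts):
    """Least-squares slope of log(count) against log(rank); needs >= 2 counts."""
    ranked = sorted(counts, reverse=True)
    xs = [math.log(rank) for rank in range(1, len(ranked) + 1)]
    ys = [math.log(count) for count in ranked]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    return covariance / variance
```

An idealized Zipf distribution, where the count is proportional to 1/rank, gives a slope of -1; how close a real text comes is something to measure rather than assume.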

Edit 6 > (242 Chars)

Refactoring the Regex (no Select function anymore)
Dropping 1 char words
More efficient definition for function "f"

f = Flatten[Take[#1, All, #2]]&; 
BarChart[
     f[{##}, -1], 
     BarOrigin -> Left, 
     ChartLabels -> Placed[f[{##}, 1], After], 
     Axes -> None] 
& @@
  Take[
    SortBy[
       Tally[
         StringSplit[ToLowerCase[Import[i]], 
          RegularExpression["(\\W|\\b(.|the|and|of|to|i[tns]|or)\\b)+"]]
       ],
    Last],
  -22]

Edit 7 → 199 characters

BarChart[#2, BarOrigin->Left, ChartLabels->Placed[#1, After], Axes->None]&@@ 
  Transpose@Take[SortBy[Tally@StringSplit[ToLowerCase@Import@i, 
    RegularExpression@"(\\W|\\b(.|the|and|of|to|i[tns]|or)\\b)+"],Last], -22]
  • Replaced f with Transpose and Slot (#1/#2) arguments.
  • We don't need no stinkin' brackets (use f@x instead of f[x] where possible)

[1] http://i49.tinypic.com/2n8mrer.jpg
[2] http://en.wikipedia.org/wiki/Zipf%27s_law
[3] http://i46.tinypic.com/33fcmdk.jpg

(9) You think "RegularExpression" is bad? I cried when I typed "System.Text.RegularExpressions.Regex.Split" into the C# version, up until I saw the Objective-C code: "stringWithContentsOfFile", "enumerateSubstringsInRange", "NSStringEnumerationByWords", "sortedArrayUsingComparator", and so on. - Gabe
(2) @Gabe Thanks ... I feel better now. In spanish we say "mal de muchos, consuelo de tontos" .. Something like "Many troubled, fools relieved" :D - Dr. belisarius
(1) The |i| is redundant in your regex because you already have .|. - Gabe
(1) I like that Spanish saying. The closest thing I can think of in English is "misery loves company". Here's my translation attempt: "It's a fool who, when suffering, takes consolation in thinking of others in the same situation." Amazing work on the Mathematica implementation, btw. - dreeves
@dreeves Foolishness surpass the language barrier easily ... Glad to see you like my little Mathematica program, I'm just starting to learn the language - Dr. belisarius
@Michael Pilat Wow! I've a lot to learn ... wonderful! - Dr. belisarius
The 199 version will not interpret things like the_foo according to spec, right? - Joey
@Johannes Rössel Good eye! The bug is in all versions due to Mathematica matching the underscore as a letter char (Why did they do that??!!). The regexp should be something like "(_|\\W|\\b(.|the|and|of|to|i[tns]|or)\\b)+" , but \\W also recognizes digits as letters, so perhaps an utterly correct version is a little longer. - Dr. belisarius
@belisarius: Actually all regex engines consider \w as something like [a-zA-Z0-9_] or maybe [\p{L}\p{Nd}_] for Unicode-aware engines. And since \b is considered a boundary between \w and \W this doesn't work according to the spec here. But many solutions have that problem and it took me quite a few characters to get that part right in my solution. As for the why I think it fits with what many programming languages allow as identifiers. You can simply match them with \w+ (doesn't quite work, but close enough for most hackish solutions). - Joey
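The difference is easy to see on a small input. A Python sketch (the sample string is made up) comparing `\w+` with a plain letter whitelist:

```python
import re

text = "foo_the didn't 42nd"

# \w also matches digits and the underscore, so "foo_the" stays in one piece.
loose = re.findall(r"\w+", text.lower())
print(loose)   # ['foo_the', 'didn', 't', '42nd']

# Whitelisting letters splits on everything else, which is how the asker
# ruled on foo_the (it becomes foo and the).
strict = re.findall(r"[a-z]+", text.lower())
print(strict)  # ['foo', 'the', 'didn', 't', 'nd']
```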
7
[+26] [2010-07-02 21:37:32] Paul Creasey

C# - 510 451 436 446 434 426 422 chars (minified)

Not that short, but now probably correct! Note, the previous version did not show the first line of the bars, did not scale the bars correctly, downloaded the file instead of getting it from stdin, and did not include all the required C# verbosity. You could easily shave many strokes if C# didn't need so much extra crap. Maybe Powershell could do better.

using C=System.Console;   // alias for Console
using System.Linq;  // for Split, GroupBy, Select, OrderBy, etc.

class Class // must define a class
{
    static void Main()  // must define a Main
    {
        // split into words
        var allwords = System.Text.RegularExpressions.Regex.Split(
                // convert stdin to lowercase
                C.In.ReadToEnd().ToLower(),
                // eliminate stopwords and non-letters
                @"(?:\b(?:the|and|of|to|a|i[tns]?|or)\b|\W)+")
            .GroupBy(x => x)    // group by words
            .OrderBy(x => -x.Count()) // sort descending by count
            .Take(22);   // take first 22 words

        // compute length of longest bar + word
        var lendivisor = allwords.Max(y => y.Count() / (76.0 - y.Key.Length));

        // prepare text to print
        var toPrint = allwords.Select(x=> 
            new { 
                // remember bar pseudographics (will be used in two places)
                Bar = new string('_',(int)(x.Count()/lendivisor)), 
                Word=x.Key 
            })
            .ToList();  // convert to list so we can index into it

        // print top of first bar
        C.WriteLine(" " + toPrint[0].Bar);
        toPrint.ForEach(x =>  // for each word, print its bar and the word
            C.WriteLine("|" + x.Bar + "| " + x.Word));
    }
}

422 chars with lendivisor inlined (which makes it 22 times slower) in the below form (newlines used for select spaces):

using System.Linq;using C=System.Console;class M{static void Main(){var
a=System.Text.RegularExpressions.Regex.Split(C.In.ReadToEnd().ToLower(),@"(?:\b(?:the|and|of|to|a|i[tns]?|or)\b|\W)+").GroupBy(x=>x).OrderBy(x=>-x.Count()).Take(22);var
b=a.Select(x=>new{p=new string('_',(int)(x.Count()/a.Max(y=>y.Count()/(76d-y.Key.Length)))),t=x.Key}).ToList();C.WriteLine(" "+b[0].p);b.ForEach(x=>C.WriteLine("|"+x.p+"| "+x.t));}}
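For readers who don't speak LINQ, the scaling step in the readable version can be restated in a few lines of Python. This is only a sketch of the idea, not the answer's code: the name bar_lengths is made up for illustration, and it uses exact fractions where the C# uses doubles, so the top bar can't lose an underscore to floating-point rounding. The counts are the ones quoted in the question.

```python
from fractions import Fraction

def bar_lengths(freqs, budget=76):
    """freqs: list of (word, count) pairs, most frequent first.

    Mirrors the role of `lendivisor`: the largest count-per-available-column
    ratio over all words, so bar length + word length never exceeds `budget`.
    """
    divisor = max(Fraction(count, budget - len(word)) for word, count in freqs)
    # int() truncates, like the (int) cast in the C# version.
    return [(word, int(count / divisor)) for word, count in freqs]

sample = [("she", 553), ("you", 481), ("said", 462)]
for word, n in bar_lengths(sample):
    print("|" + "_" * n + "| " + word)
```

Because the divisor is taken over all 22 words rather than just the first, a long runner-up word shrinks every bar instead of overflowing its own line.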

+1 for the smart-ass downloading the file inline. :) - sarnold
(1) Steal the short URL from Matt's answer. - indiv
(2) The spec said the file must be piped in or passed as an argument. If you were to assume that args[0] contained the local file name, you could shorten it considerably by using args[0] instead of (new WebClient()).DownloadString(@"gutenberg.org/files/11/11.txt") -> it would save you approx 70 characters - thorkia
(1) Here is a version replacing the WebClient call with args 0, a call to StreamReader, and removing a few extra spaces. Total char count=413 var a=Regex.Replace((new StreamReader(args[0])).ReadToEnd(),"[^a-zA-Z]"," ").ToLower().Split(' ').Where(x=>!(new[]{"the","and","of","to","a","i","it","in","or","is"}).Contains(x)).GroupBy(x=>x).Select(g=>new{w=g.Key,c=g.Count()}).OrderByDescending(x=>x.c).Skip(1).Take(22).ToList();var m=a.OrderByDescending(x=>x.c).First();a.ForEach(x=>Console.WriteLine("|"+new String('_',x.c*(80-m.w.Length-4)/m.c)+"| "+x.w)); - thorkia
"new StreamReader" without "using" is dirty. File.ReadAllText(args[0]) or Console.In.ReadToEnd() are much better. In the latter case you can even remove argument from your Main(). :) - Rotsor
The bar widths are incorrect. "with"'s bar is shorter than "at"'s. - Rotsor
Rotsor: As far as I can tell, "with" and "at" have the same width of bar, which they should because they have the same frequency. - Gabe
You use Console.WriteLine a number of times. Save some more chars by aliasing using C=System.Console; and then in your code C.WriteLine(..), or a different char since you already have C as a class name. - John K
This is an awesome example of the power of LINQ. Just imagine that in Java. - James Davies
@Zoomzoom83: It would be great to have but it would probably still be two orders of magnitude longer. We're talking about Java, after all ;) (and it will probably only show up in Java 8 which set its release date after Duke Nukem Forever). - Joey
8
[+25] [2010-07-02 21:29:35] JSBձոգչ

Perl, 237 229 209 chars

(Updated again to beat the Ruby version with more dirty golf tricks, replacing split/[^a-z]/,lc with lc=~/[a-z]+/g, and eliminating a check for empty string in another place. These were inspired by the Ruby version, so credit where credit is due.)

Update: now with Perl 5.10! Replace print with say, and use ~~ to avoid a map. This has to be invoked on the command line as perl -E '<one-liner>' alice.txt. Since the entire script is on one line, writing it as a one-liner shouldn't present any difficulty :).

 @s=qw/the and of to a i it in or is/;$c{$_}++foreach grep{!($_~~@s)}map{lc=~/[a-z]+/g}<>;@s=sort{$c{$b}<=>$c{$a}}keys%c;$f=76-length$s[0];say" "."_"x$f;say"|"."_"x($c{$_}/$c{$s[0]}*$f)."| $_ "foreach@s[0..21];

Note that this version normalizes for case. This doesn't shorten the solution any, since removing ,lc (for lower-casing) requires you to add A-Z to the split regex, so it's a wash.

If you're on a system where a newline is one character and not two, you can shorten this by another two chars by using a literal newline in place of \n. However, I haven't written the above sample that way, since it's "clearer" (ha!) that way.
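Why the switch from split to a global match also removes the empty-string check is easiest to see side by side. The following Python sketch is an illustration with a made-up sample string, not a translation of the Perl: splitting on single non-letters yields empty strings wherever two separators are adjacent, while matching runs of letters never does.

```python
import re

text = "Alice's cat -- the cat!".lower()

# Splitting on each non-letter leaves an empty string between every pair
# of adjacent separators (and after a trailing one), so a filter is needed.
split_words = re.split(r"[^a-z]", text)

# Matching runs of letters directly can never produce an empty string.
matched_words = re.findall(r"[a-z]+", text)

print(split_words)
print(matched_words)
```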


Here is a mostly correct, but not remotely short enough, perl solution:

use strict;
use warnings;

my %short = map { $_ => 1 } qw/the and of to a i it in or is/;
my %count = ();

$count{$_}++ foreach grep { $_ && !$short{$_} } map { split /[^a-zA-Z]/ } (<>);
my @sorted = (sort { $count{$b} <=> $count{$a} } keys %count)[0..21];
my $widest = 76 - (length $sorted[0]);

print " " . ("_" x $widest) . "\n";
foreach (@sorted)
{
    my $width = int(($count{$_} / $count{$sorted[0]}) * $widest);
    print "|" . ("_" x $width) . "| $_ \n";
}

The following is about as short as it can get while remaining relatively readable. (392 chars).

%short = map { $_ => 1 } qw/the and of to a i it in or is/;
%count;

$count{$_}++ foreach grep { $_ && !$short{$_} } map { split /[^a-z]/, lc } (<>);
@sorted = (sort { $count{$b} <=> $count{$a} } keys %count)[0..21];
$widest = 76 - (length $sorted[0]);

print " " . "_" x $widest . "\n";
print"|" . "_" x int(($count{$_} / $count{$sorted[0]}) * $widest) . "| $_ \n" foreach @sorted;

Has a few bugs right now; fixing and shortening. - JSBձոգչ
(4) This doesn't cover the case when the second word is much longer than the first, right? - Joey
(1) Both foreach s can be written as for s. That's 8 chars down. Then you have the grep{!($_~~@s)}map{lc=~/[a-z]+/g}<>, which I believe could be written as grep{!(/$_/i~~@s)}<>=~/[a-z]+/g to go 4 more down. Replace the " " with $" and you're down 1 more... - Zaid
sort{$c{$b}-$c{$a}}... to save two more. You can also just pass %c instead of keys %c to the sort function and save four more. - mob
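The case Joey asks about can be checked with a small Python sketch. It assumes the same scaling rule as the golfed one-liner (every bar scaled against 76 minus the length of the top word, truncated, with a trailing space after the word); the long runner-up word is a hypothetical input and naive_lines is an illustrative name.

```python
def naive_lines(freqs):
    """Scale every bar by the top word only; freqs is (word, count), descending."""
    top_word, top_count = freqs[0]
    widest = 76 - len(top_word)
    return ["|" + "_" * int(count / top_count * widest) + "| " + word + " "
            for word, count in freqs]

# Hypothetical input: the second word is much longer than the first.
freqs = [("she", 553), ("superlongstringstring", 481)]
lengths = [len(line) for line in naive_lines(freqs)]
print(lengths)
```

The first line lands exactly on 80 columns, but the second runs past it, because nothing in the scale factor knows about the longer word.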
9
[+20] [2010-07-03 10:51:43] Joey

Windows PowerShell, 199 chars

$x=$input-split'\P{L}'-notmatch'^(the|and|of|to|.?|i[tns]|or)$'|group|sort *
filter f($w){' '+'_'*$w
$x[-1..-22]|%{"|$('_'*($w*$_.Count/$x[-1].Count))| "+$_.Name}}
f(76..1|?{!((f $_)-match'.'*80)})[0]

(The last line break isn't necessary, but included here for readability.)

(Current code and my test files available in my SVN repository [1]. I hope my test cases catch most common errors (bar length, problems with regex matching and a few others))

Assumptions:

  • US ASCII as input. It probably gets weird with Unicode.
  • At least two non-stop words in the text

History [2]

Relaxed version (137), since that's counted separately by now, apparently:

($x=$input-split'\P{L}'-notmatch'^(the|and|of|to|.?|i[tns]|or)$'|group|sort *)[-1..-22]|%{"|$('_'*(76*$_.Count/$x[-1].Count))| "+$_.Name}
  • doesn't close the first bar
  • doesn't account for word length of non-first word

Bar lengths may differ by one character from other solutions because PowerShell rounds rather than truncates when converting floating-point numbers to integers. Since the task requires only proportional bar lengths, this should be fine, though.

Compared to other solutions I took a slightly different approach in determining the longest bar length by simply trying out and taking the highest such length where no line is longer than 80 characters.
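The try-every-width approach translates readily to other languages. Below is a rough Python sketch of it, under a few assumptions: chart_lines and widest_fit are illustrative names, Python's round() stands in for PowerShell's float-to-int conversion (they may not agree in every edge case), and a line is rejected as soon as it reaches 80 characters, which is how I read the '.'*80 match.

```python
def chart_lines(w, freqs):
    """Render the chart for top-bar width w; freqs is (word, count), descending."""
    top = freqs[0][1]
    lines = [" " + "_" * w]
    for word, count in freqs:
        # round() approximates PowerShell's rounding conversion (an assumption).
        lines.append("|" + "_" * round(w * count / top) + "| " + word)
    return lines

def widest_fit(freqs, limit=80):
    """Largest w in 76..1 for which no rendered line reaches `limit` characters."""
    for w in range(76, 0, -1):
        if all(len(line) < limit for line in chart_lines(w, freqs)):
            return w
    return None

print(widest_fit([("she", 553), ("you", 481)]))
```

The appeal of the brute-force search is that it needs no formula for the scale factor at all: whatever the rendering rule is, the widest chart that fits is found by rendering and measuring.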

An older version explained can be found here [3].

[1] http://svn.lando.us/joey/Public/SO/3169051
[2] http://svn.lando.us/joey/Public/SO/3169051/history.txt
[3] http://svn.lando.us/joey/Public/SO/3169051/words_explained.ps1

Impressive, seems Powershell is a suitable environment for golfing. Your approach considering the bar length is exactly what I tried to describe (not so brilliantly, I admit) in the spec. - ChristopheD
(1) @ChristopheD: In my experience (Anarchy Golf, some Project Euler tasks and some more tasks just for the fun of it), PowerShell is usually only slightly worse than Ruby and often tied with or better than Perl and Python. No match for GolfScript, though. But as far as I can see, this might be the shortest solution that correctly accounts for bar lengths ;-) - Joey
Apparently I was right. Powershell can do better -- much better! Please provide an expanded version with comments. - Gabe
Johannes: Did you try -split("\b(?:the|and|of|to|a|i[tns]?|or)\b|[^a-z]")? It works for me. - Gabe
Don't forget to interpolate the output string: "|$('_'*($w*$_.count/$x[0].count))| $($_.name) " (or eliminate the last space, as it's sort of automatic). And you can use -split("(?:\b(?:the|and|of|to|a|i[tns]?|or)\b|[^a-z])+") to save a few more by not including blanks (or use [-2..-23]). - Gabe
Note that without the trailing space you need to match .{80}. And you can guarantee that blanks will always be first like this: "\b(?:the|and|of|to|a|i[tns]?|or)\b|[^a-z]()" (the empty capturing group ensures a blank for every word) - Gabe
So if you have "$input" do you still need the ()? Also, now that you eliminated the trailing space, you can save a couple strokes by not interpolating the name: "|$('_'*($w*$_.count/$x[0].count))| "+$_.name. We'll get to 200 yet! - Gabe
With the elimination of .ToString, it's now back under 200! - Gabe
@Gabe: Yay, thank you. And good catch on function vs. filter. I thought about using filter, I just didn't think of the fact that filters can take arguments too. For me it was a comparison between function f($w){...} and filter{$w=$_;...} (since I definitely need a loop in the function and therefore can't leave the argument as $_. Nice trick to remember, thanks :-). Still, I think this approach has been golfed almost to death by now. [And I notice we killed it somewhere in between ... my other two test cases don't run anymore – debugging ...] - Joey
One could argue that you're making it a little too general, but I'm not going to complain as long as it's still under 200. - Gabe
@Gabe: Well, I've revisited my assumptions concerning at least one non-stop word already. But the \b problem was clearly against the spec and only happened to work for the test input. - Joey
10
[+19] [2010-07-03 00:03:27] Anurag

Ruby, 215, 216, 218, 221, 224, 236, 237 chars

update 1: Hurray! It's a tie with JS Bangs [1]' solution [2]. Can't think of a way to cut down any more :)

update 2: Played a dirty golf trick. Changed each to map to save 1 character :)

update 3: Changed File.read to IO.read +2. Array.group_by wasn't very fruitful, changed to reduce +6. Case insensitive check is not needed after lower casing with downcase in regex +1. Sorting in descending order is easily done by negating the value +6. Total savings +15

update 4: [0] rather than .first, +3. (@Shtééf)

update 5: Expand variable l in-place, +1. Expand variable s in-place, +2. (@Shtééf)

update 6: Use string addition rather than interpolation for the first line, +2. (@Shtééf)

w=(IO.read($_).downcase.scan(/[a-z]+/)-%w{the and of to a i it in or is}).reduce(Hash.new 0){|m,o|m[o]+=1;m}.sort_by{|k,v|-v}.take 22;m=76-w[0][0].size;puts' '+'_'*m;w.map{|x,f|puts"|#{'_'*(f*1.0/w[0][1]*m)}| #{x} "}

update 7: I went through a whole lot of hoopla to detect the first iteration inside the loop, using instance variables. All I got is +1, though perhaps there is potential. Preserving the previous version, because I believe this one is black magic. (@Shtééf)

(IO.read($_).downcase.scan(/[a-z]+/)-%w{the and of to a i it in or is}).reduce(Hash.new 0){|m,o|m[o]+=1;m}.sort_by{|k,v|-v}.take(22).map{|x,f|@f||(@f=f;puts' '+'_'*(@m=76-x.size));puts"|#{'_'*(f*1.0/@f*@m)}| #{x} "}

Readable version

string = File.read($_).downcase

words = string.scan(/[a-z]+/i)
allowed_words = words - %w{the and of to a i it in or is}
sorted_words = allowed_words.group_by{ |x| x }.map{ |x,y| [x, y.size] }.sort{ |a,b| b[1] <=> a[1] }.take(22)
highest_frequency = sorted_words.first
highest_frequency_count = highest_frequency[1]
highest_frequency_word = highest_frequency[0]

word_length = highest_frequency_word.size
widest = 76 - word_length

puts " #{'_' * widest}"    
sorted_words.each do |word, freq|
  width = (freq * 1.0 / highest_frequency_count) * widest
  puts "|#{'_' * width}| #{word} "
end
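The two tricks from update 3 (counting with reduce instead of group_by, and sorting on the negated count) are not Ruby-specific. Here is a loose Python rendering of them, as a sketch only: count_words and top are illustrative names, and the list comprehension stands in for Ruby's array difference.

```python
from functools import reduce

STOP = "the and of to a i it in or is".split()

def count_words(words):
    def bump(tally, word):
        tally[word] = tally.get(word, 0) + 1
        return tally  # the accumulator must be returned, like the trailing ;m in Ruby
    return reduce(bump, words, {})

def top(words, n=22):
    kept = [w for w in words if w not in STOP]  # Ruby: words - %w{the and ...}
    # Sorting on the negated count gives descending order without a reverse step.
    return sorted(count_words(kept).items(), key=lambda kv: -kv[1])[:n]

print(top("the cat and the dog saw a cat".split()))
```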

To use:

echo "Alice.txt" | ruby -ln GolfedWordFrequencies.rb

Output:

 _________________________________________________________________________
|_________________________________________________________________________| she 
|_______________________________________________________________| you 
|____________________________________________________________| said 
|_____________________________________________________| alice 
|_______________________________________________| was 
|___________________________________________| that 
|____________________________________| as 
|________________________________| her 
|_____________________________| with 
|_____________________________| at 
|____________________________| s 
|____________________________| t 
|__________________________| on 
|__________________________| all 
|_______________________| this 
|_______________________| for 
|_______________________| had 
|_______________________| but 
|______________________| be 
|_____________________| not 
|____________________| they 
|____________________| so 
[1] https://stackoverflow.com/users/8078/js-bangs
[2] https://stackoverflow.com/questions/3169051/code-golf-word-frequency-chart/3169203#3169203

(3) Isn't "p" a shortcut for "puts" ? That could shave a few. - rfusca
(1) Nice. Your use of scan, though, gave me a better idea, so I got ahead again :). - JSBձոգչ
@rfusca, p puts quotes around the output so it wouldn't match OP's. - Anurag
@JS Looks like its going to be a cat and mouse game, until J comes along :) - Anurag
You miscounted. At update 3, you were at 224 characters. I brought you back to 221, any way. :) That reduce trick is black magic. :o - Stéphan Kochen
@Shtééf - thanks :) .. last thing we want in code-golf is miscounting on the higher side.. lol :o) - Anurag
Isn't using the shell to "read" and pipe the file cheating? - mP.
The question states that the input can be piped in. Also, we are just piping in the file name, not its contents. - Anurag
Okay, I am now totally done with this. You may now shout at me. :) I sure hope that Perl version doesn't become much shorter. - Stéphan Kochen
(2) You need to scale the bars so the longest word plus its bar fits on 80 characters. As Brian suggested, a long second word will break your program. - Gabe
(3) I wonder why this is still gathering votes. The solution is incorrect (in the general case) and two way shorter Ruby solutions are here by now. - Joey
(1) Now, Correct me if i'm wrong, but instead of using "downcase", why don't you use the REGEXP case insensitive flag, that saves 6-7 bytes, does it not? - st0le
How about [0..21] instead of .take 22? - Dogbert
11
[+19] [2010-07-03 06:32:16] Nas Banov

Python 2.x, latitudinarian approach = 227 183 chars

import sys,re
t=re.split('\W+',sys.stdin.read().lower())
r=sorted((-t.count(w),w)for w in set(t)if w not in'andithetoforinis')[:22]
for l,w in r:print(78-len(r[0][1]))*l/r[0][0]*'=',w

Allowing for freedom in the implementation, I constructed a string concatenation that contains all the words requested for exclusion (the, and, of, to, a, i, it, in, or, is) - plus it also excludes the two infamous "words" s and t from the example - and I threw in for free the exclusion for an, for, he. I tried all concatenations of those words against a corpus of the words from Alice, King James' Bible and the Jargon file to see if there are any words that would be mis-excluded by the string. And that is how I ended up with two exclusion strings: itheandtoforinis and andithetoforinis.
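To make the substring trick concrete, here is a small stand-alone Python check. It is a sketch for illustration; the helper name excluded and the short word lists are mine, while the exclusion string is the one used in the code above.

```python
EXCLUDE = "andithetoforinis"
STOP_WORDS = "the and of to a i it in or is".split()

def excluded(word):
    # Substring test: true for any word that occurs somewhere inside EXCLUDE.
    # The empty string is a substring of everything, so blanks are dropped too.
    return word in EXCLUDE

required = all(excluded(w) for w in STOP_WORDS)
extras = [w for w in ("s", "t", "an", "for", "he") if excluded(w)]
kept = [w for w in ("she", "you", "said", "alice", "that") if not excluded(w)]

print(required, extras, kept)
```

The cost of the trick is that every other substring of the exclusion string is silently dropped as well, which is why it had to be checked against real word lists.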

PS. borrowed from other solutions to shorten the code.

=========================================================================== she 
================================================================= you
============================================================== said
====================================================== alice
================================================ was
============================================ that
===================================== as
================================= her
============================== at
============================== with
=========================== on
=========================== all
======================== this
======================== had
======================= but
====================== be
====================== not
===================== they
==================== so
=================== very
=================== what
================= little

Rant

Regarding words to ignore, one would think those would be taken from list of the most used words in English. That list depends on the text corpus [1] used. Per one of the most popular lists (http://en.wikipedia.org/wiki/Most_common_words_in_English, http://www.english-for-students.com/Frequently-Used-Words.html, http://www.sporcle.com/games/common_english_words.php), top 10 words are: the be(am/are/is/was/were) to of and a in that have I

The top 10 words from the Alice in Wonderland text are the and to a of it she i you said
The top 10 words from the Jargon File (v4.4.7) are the a of to and in is that or for

So the question is why or was included in the problem's ignore list, when it is only ~30th in popularity, while the word that (8th most used) is not, and so on. Hence I believe the ignore list should be provided dynamically (or could be omitted).

Alternative idea would be simply to skip the top 10 words from the result - which actually would shorten the solution (elementary - have to show only the 11th to 32nd entries).
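A sketch of that alternative, assuming the words are already split and lower-cased (chart_entries is an illustrative name, not part of either solution):

```python
from collections import Counter

def chart_entries(words, skip=10, show=22):
    """Rank all words, then drop the `skip` most frequent and keep the next `show`."""
    ranked = Counter(words).most_common(skip + show)
    return ranked[skip:]
```

With skip=10 and show=22 this returns the 11th to 32nd entries, with no stop-word list involved at all.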


Python 2.x, punctilious approach = 277 243 chars

The chart drawn in the above code is simplified (using only one character for the bars). If one wants to reproduce exactly the chart from the problem description (which was not required), this code will do it:

import sys,re
t=re.split('\W+',sys.stdin.read().lower())
r=sorted((-t.count(w),w)for w in set(t)-set(sys.argv))[:22]
h=min(9*l/(77-len(w))for l,w in r)
print'',9*r[0][0]/h*'_'
for l,w in r:print'|'+9*l/h*'_'+'|',w

I take issue with the somewhat random choice of the 10 words to exclude (the, and, of, to, a, i, it, in, or, is), so those are to be passed as command line parameters, like so:
python WordFrequencyChart.py the and of to a i it in or is <"Alice's Adventures in Wonderland.txt"

This is 213 chars + 30 if we account for the "original" ignore list passed on command line = 243

PS. The second code also does "adjustment" for the lengths of all top words, so none of them will overflow in degenerate case.

 _______________________________________________________________
|_______________________________________________________________| she
|_______________________________________________________| superlongstringstring
|_____________________________________________________| said
|______________________________________________| alice
|_________________________________________| was
|______________________________________| that
|_______________________________| as
|____________________________| her
|__________________________| at
|__________________________| with
|_________________________| s
|_________________________| t
|_______________________| on
|_______________________| all
|____________________| this
|____________________| for
|____________________| had
|____________________| but
|___________________| be
|___________________| not
|_________________| they
|_________________| so
[1] http://en.wikipedia.org/wiki/Text_corpus

Nice solution so far although the word ignore list isn't implemented (yet) and the bars are a bit rudimentary at the moment. - ChristopheD
@ChristopheD: it was there, but there was no "user guide". Just added bunch text - Nas Banov
Regarding your list of languages and solutions: Please look for solutions that use splitting along \W or use \b in a regex because those are very likely not according to spec, meaning they won't split on digits or _ and they might also not remove stop words from strings such as the_foo_or123bar. They may not appear in the test text but the specification is pretty clear on that case. - Joey
Amazing work Nas, I spent an afternoon trying to optimize this and only found one improvement. You can cut it down to 239 chars by removing the sys.argv hack and using: re.findall(r'\b(?!(?:the|and|.|of|to|i[tns]|or)\b)\w+',sys.stdin.read().lower()) - intgr
12
[+12] [2010-07-03 19:46:33] Thomas

Haskell - 366 351 344 337 333 characters

(One line break in main added for readability, and no line break needed at end of last line.)

import Data.List
import Data.Char
l=length
t=filter
m=map
f c|isAlpha c=toLower c|0<1=' '
h w=(-l w,head w)
x!(q,w)='|':replicate(minimum$m(q?)x)'_'++"| "++w
q?(g,w)=q*(77-l w)`div`g
b x=m(x!)x
a(l:r)=(' ':t(=='_')l):l:r
main=interact$unlines.a.b.take 22.sort.m h.group.sort
  .t(`notElem`words"the and of to a i it in or is").words.m f

How it works is best seen by reading the argument to interact backwards:

  • map f lowercases alphabetics, replaces everything else with spaces.
  • words produces a list of words, dropping the separating whitespace.
  • filter (`notElem` words "the and of to a i it in or is") discards all entries with forbidden words.
  • group . sort sorts the words, and groups identical ones into lists.
  • map h maps each list of identical words to a tuple of the form (-frequency, word).
  • take 22 . sort sorts the tuples by descending frequency (the first tuple entry), and keeps only the first 22 tuples.
  • b maps tuples to bars (see below).
  • a prepends the first line of underscores, to complete the topmost bar.
  • unlines joins all these lines together with newlines.

The tricky bit is getting the bar length right. I assumed that only underscores counted towards the length of the bar, so || would be a bar of zero length. The function b maps (x!) over x, where x is the list of histogram entries. The entire list is passed to !, so that each invocation of ! can compute the scale factor for itself by calling ?. In this way, I avoid using floating-point math or rationals, whose conversion functions and imports would eat many characters.

Note the trick of using -frequency. This removes the need to reverse the sort, since sorting -frequency in ascending order places the words with the largest frequency first. Later, in the operator ?, one -frequency value is multiplied in and another is divided out, which cancels the negation.
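For anyone who wants to follow the integer-only scaling without reading golfed Haskell, here is a rough Python port of the two operators. It is a sketch under my own naming (bars, scaled); Python's // floors the same way Haskell's div does, so the arithmetic should carry over.

```python
def bars(entries):
    """entries: list of (-frequency, word) tuples, sorted ascending."""
    def scaled(q, entry):  # plays the role of q ? (g, w)
        g, w = entry
        # q and g are both negative, so the quotient comes out positive.
        return q * (77 - len(w)) // g
    # Each word's bar is the smallest length any entry in the list allows it.
    return [(w, min(scaled(q, e) for e in entries)) for q, w in entries]

entries = sorted([(-481, "you"), (-553, "she")])
print(bars(entries))
```

Taking the minimum over the whole list is what makes a long word further down the chart shrink all the bars, without ever computing a fractional scale factor.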


Very nice work (would upvote but ran out of votes for today with all the great answers in this thread). - ChristopheD
This hurts my eyes in a way that's painful even to think about describing, but I learned a lot of Haskell by reverse-engineering it into legible code. Well done, sir. :-) - Owen S.
It's actually fairly idiomatic Haskell still, albeit not really efficient. The short names make it look far worse than it really is. - Thomas
@Thomas: You can say that again. :-) - Owen S.
u q(g,w)=q*div(77-l w)g -- can save you 2 chars - Edward Kmett
@MtnViewMark: Nice work! I didn't know that words discards runs of whitespace, nor that you can put | conditions onto one line. And I can't believe I put a two-letter variable name in there... - Thomas
(1) Can't move the div, actually! Try it- the output is wrong. The reason is that doing the div before the * loses precision. - MtnViewMark
Ah, whoops, got precedences wrong. Should've tested before editing :P - Thomas
@trinithis: It's shorter alright, but now I don't understand how it works any longer! I'm afraid you moved beyond my understanding of Haskell. Why is a bang pattern needed? What does the question mark even mean? - Thomas
It's not a bang pattern :D. All I did was change binary functions to infix operators. I just chose to use ? and ! for the operator names. - Thomas Eding
13
[+11] [2010-07-02 23:05:58] Matt

JavaScript 1.8 (SpiderMonkey) - 354

x={};p='|';e=' ';z=[];c=77
while(l=readline())l.toLowerCase().replace(/\b(?!(the|and|of|to|a|i[tns]?|or)\b)\w+/g,function(y)x[y]?x[y].c++:z.push(x[y]={w:y,c:1}))
z=z.sort(function(a,b)b.c-a.c).slice(0,22)
for each(v in z){v.r=v.c/z[0].c
c=c>(l=(77-v.w.length)/v.r)?l:c}for(k in z){v=z[k]
s=Array(v.r*c|0).join('_')
if(!+k)print(e+s+e)
print(p+s+p+e+v.w)}

Sadly, the for([k,v]in z) from the Rhino version doesn't seem to want to work in SpiderMonkey, and readFile() is a little easier than using readline() but moving up to 1.8 allows us to use function closures to cut a few more lines....

Adding whitespace for readability:

x={};p='|';e=' ';z=[];c=77
while(l=readline())
  l.toLowerCase().replace(/\b(?!(the|and|of|to|a|i[tns]?|or)\b)\w+/g,
   function(y) x[y] ? x[y].c++ : z.push( x[y] = {w: y, c: 1} )
  )
z=z.sort(function(a,b) b.c - a.c).slice(0,22)
for each(v in z){
  v.r=v.c/z[0].c
  c=c>(l=(77-v.w.length)/v.r)?l:c
}
for(k in z){
  v=z[k]
  s=Array(v.r*c|0).join('_')
  if(!+k)print(e+s+e)
  print(p+s+p+e+v.w)
}

Usage: js golf.js < input.txt

Output:

 _________________________________________________________________________ 
|_________________________________________________________________________| she
|_______________________________________________________________| you
|____________________________________________________________| said
|____________________________________________________| alice
|______________________________________________| was
|___________________________________________| that
|___________________________________| as
|________________________________| her
|_____________________________| at
|_____________________________| with
|____________________________| s
|____________________________| t
|__________________________| on
|_________________________| all
|_______________________| this
|______________________| for
|______________________| had
|______________________| but
|_____________________| be
|_____________________| not
|___________________| they
|___________________| so

(base version - doesn't handle bar widths correctly)

JavaScript (Rhino) - 405 395 387 377 368 343 304 chars

I think my sorting logic is off, but.. I duno. Brainfart fixed.

Minified (abusing \n's interpreted as a ; sometimes):

x={};p='|';e=' ';z=[]
readFile(arguments[0]).toLowerCase().replace(/\b(?!(the|and|of|to|a|i[tns]?|or)\b)\w+/g,function(y){x[y]?x[y].c++:z.push(x[y]={w:y,c:1})})
z=z.sort(function(a,b){return b.c-a.c}).slice(0,22)
for([k,v]in z){s=Array((v.c/z[0].c)*70|0).join('_')
if(!+k)print(e+s+e)
print(p+s+p+e+v.w)}

Ah, sir. I believe this is your gauntlet. Have your second speak to mine. - dmckee --- ex-moderator kitten
(2) BTW-- I like the i[tns]? bit. Very sneaky. - dmckee --- ex-moderator kitten
@dmckee - well played, I don't think I can beat your 336, enjoy your much-deserved upvote :) - Matt
You can definitely beat 336... There is a 23 character cut available -- .replace(/[^\w ]/g, e).split(/\s+/).map( can be replaced with .replace(/\w+/g, and use the same function your .map did... Also not sure if Rhino supports function(a,b)b.c-a.c instead of your sort function (spidermonkey does), but that will shave {return } ... b.c-a.c is a better sort than a.c<b.c btw... Editing a Spidermonkey version at the bottom with these changes - gnarf
I moved my SpiderMonkey version up to the top since it conforms to the bar width constraint... Also managed to cut out a few more chars in your original version by using a negative lookahead regexp to deny words allowing for a single replace(), and golfed a few ifs with ?: Great base to work from though! - gnarf
This will not eliminate stop words when surrounded by digits or underscores such as in foo_the123 where only foo should remain. - Joey
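The foo_the123 case is easy to reproduce. The sketch below uses Python's re module rather than JavaScript, on the assumption that \w and \b behave the same way for this ASCII input; letters_only is an illustrative name for the split-on-non-letters approach the spec asks for.

```python
import re

STOP = set("the and of to a i it in or is".split())

# Lookahead pattern in the style of the answer above. \w also matches digits
# and underscores, so \b sees no word boundary anywhere inside foo_the123.
lookahead = re.compile(r"\b(?!(?:the|and|of|to|a|i[tns]?|or)\b)\w+")

def letters_only(text):
    """Treat everything outside a-z as a separator, then drop stop words."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP]

sample = "foo_the123"
by_lookahead = lookahead.findall(sample)
by_letters = letters_only(sample)
print(by_lookahead, by_letters)
```

The lookahead version keeps the whole token, while the letters-only version leaves just foo, which is the result the spec calls for.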
14
[+11] [2010-07-03 14:37:48] pdehaan

perl, 205 191 189 characters/ 205 characters (fully implemented)

Some parts were inspired by the earlier perl/ruby submissions, a couple similar ideas were arrived at independently, the others are original. Shorter version also incorporates some things I saw/learned from other submissions.

Original:

$k{$_}++for grep{$_!~/^(the|and|of|to|a|i|it|in|or|is)$/}map{lc=~/[a-z]+/g}<>;@t=sort{$k{$b}<=>$k{$a}}keys%k;$l=76-length$t[0];printf" %s
",'_'x$l;printf"|%s| $_
",'_'x int$k{$_}/$k{$t[0]}*$l for@t[0..21];

Latest version down to 191 characters:

/^(the|and|of|to|.|i[tns]|or)$/||$k{$_}++for map{lc=~/[a-z]+/g}<>;@e=sort{$k{$b}<=>$k{$a}}keys%k;$n=" %s
";$r=(76-y///c)/$k{$_=$e[0]};map{printf$n,'_'x($k{$_}*$r),$_;$n="|%s| %s
"}@e[0,0..21]

Latest version down to 189 characters:

/^(the|and|of|to|.|i[tns]|or)$/||$k{$_}++for map{lc=~/[a-z]+/g}<>;@_=sort{$k{$b}<=>$k{$a}}keys%k;$n=" %s
";$r=(76-m//)/$k{$_=$_[0]};map{printf$n,'_'x($k{$_}*$r),$_;$n="|%s| %s
"}@_[0,0..21]

This version (205 char) accounts for the lines with words longer than what would be found later.

/^(the|and|of|to|.|i[tns]|or)$/||$k{$_}++for map{lc=~/[a-z]+/g}<>;($r)=sort{$a<=>$b}map{(76-y///c)/$k{$_}}@e=sort{$k{$b}<=>$k{$a}}keys%k;$n=" %s
";map{printf$n,'_'x($k{$_}*$r),$_;$n="|%s| %s
";}@e[0,0..21]

15
[+11] [2010-07-03 17:13:20] Sam Dolan

Python 3.1 - 245 229 characters

I guess using Counter [1] is kind of cheating :) I just read about it about a week ago, so this was the perfect chance to see how it works.

import re,collections
o=collections.Counter([w for w in re.findall("[a-z]+",open("!").read().lower())if w not in"a and i in is it of or the to".split()]).most_common(22)
print('\n'.join('|'+76*v//o[0][1]*'_'+'| '+k for k,v in o))

Prints out:

|____________________________________________________________________________| she
|__________________________________________________________________| you
|_______________________________________________________________| said
|_______________________________________________________| alice
|_________________________________________________| was
|_____________________________________________| that
|_____________________________________| as
|__________________________________| her
|_______________________________| with
|_______________________________| at
|______________________________| s
|_____________________________| t
|____________________________| on
|___________________________| all
|________________________| this
|________________________| for
|________________________| had
|________________________| but
|______________________| be
|______________________| not
|_____________________| they
|____________________| so

Some of the code was "borrowed" from AKX's solution.

[1] http://docs.python.org/dev/library/collections.html#collections.Counter

The first line is missing. And the bar length isn't correct. - Joey
in your code seems that open('!') reads from stdin - which version/OS is that on? or do you have to name the file '!'? - Nas Banov
Name the file "!" :) Sorry that was pretty unclear, and I should have mentioned it. - Sam Dolan
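For anyone else who hasn't met Counter yet, the counting part of this answer can be isolated as below. This is an ungolfed sketch that takes the text as a string instead of opening a file; top_words is an illustrative name, and it deliberately leaves out the chart drawing (and therefore the top line and bar scaling raised in the comments).

```python
import re
from collections import Counter

STOP = "a and i in is it of or the to".split()

def top_words(text, n=22):
    """Return the n most frequent non-stop words as (word, count) pairs."""
    words = (w for w in re.findall("[a-z]+", text.lower()) if w not in STOP)
    return Counter(words).most_common(n)

print(top_words("The cat and the hat. The cat sat."))
```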
16
[+11] [2010-07-03 23:17:32] user382874

PHP CLI version (450 chars)

This solution takes into account the last requirement, which most purists have conveniently chosen to ignore. That cost 170 characters!

Usage: php.exe <this.php> <file.txt>

Minified:

<?php $a=array_count_values(array_filter(preg_split('/[^a-z]/',strtolower(file_get_contents($argv[1])),-1,1),function($x){return !preg_match("/^(.|the|and|of|to|it|in|or|is)$/",$x);}));arsort($a);$a=array_slice($a,0,22);function R($a,$F,$B){$r=array();foreach($a as$x=>$f){$l=strlen($x);$r[$x]=$b=$f*$B/$F;if($l+$b>76)return R($a,$f,76-$l);}return$r;}$c=R($a,max($a),76-strlen(key($a)));foreach($a as$x=>$f)echo '|',str_repeat('-',$c[$x]),"| $x\n";?>

Human readable:

<?php

// Read:
$s = strtolower(file_get_contents($argv[1]));

// Split:
$a = preg_split('/[^a-z]/', $s, -1, PREG_SPLIT_NO_EMPTY);

// Remove unwanted words:
$a = array_filter($a, function($x){
       return !preg_match("/^(.|the|and|of|to|it|in|or|is)$/",$x);
     });

// Count:
$a = array_count_values($a);

// Sort:
arsort($a);

// Pick top 22:
$a=array_slice($a,0,22);


// Recursive function to adjust bar widths
// according to the last requirement:
function R($a,$F,$B){
    $r = array();
    foreach($a as $x=>$f){
        $l = strlen($x);
        $r[$x] = $b = $f * $B / $F;
        if ( $l + $b > 76 )
            return R($a,$f,76-$l);
    }
    return $r;
}

// Apply the function:
$c = R($a,max($a),76-strlen(key($a)));


// Output:
foreach ($a as $x => $f)
    echo '|',str_repeat('-',$c[$x]),"| $x\n";

?>

Output:

|-------------------------------------------------------------------------| she
|---------------------------------------------------------------| you
|------------------------------------------------------------| said
|-----------------------------------------------------| alice
|-----------------------------------------------| was
|-------------------------------------------| that
|------------------------------------| as
|--------------------------------| her
|-----------------------------| at
|-----------------------------| with
|--------------------------| on
|--------------------------| all
|-----------------------| this
|-----------------------| for
|-----------------------| had
|-----------------------| but
|----------------------| be
|---------------------| not
|--------------------| they
|--------------------| so
|-------------------| very
|------------------| what

When there is a long word, the bars are adjusted properly:

|--------------------------------------------------------| she
|---------------------------------------------------| thisisareallylongwordhere
|-------------------------------------------------| you
|-----------------------------------------------| said
|-----------------------------------------| alice
|------------------------------------| was
|---------------------------------| that
|---------------------------| as
|-------------------------| her
|-----------------------| with
|-----------------------| at
|--------------------| on
|--------------------| all
|------------------| this
|------------------| for
|------------------| had
|-----------------| but
|-----------------| be
|----------------| not
|---------------| they
|---------------| so
|--------------| very
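
The restart-on-overflow idea behind the recursive function R above can be sketched in Python. This is only an illustration, not the author's code: the name bar_widths and the use of integer division are assumptions (the PHP version works with floats and lets str_repeat truncate), and the input is assumed to be sorted by descending frequency with every word shorter than the limit.

```python
def bar_widths(counts, limit=76):
    """counts: list of (word, frequency) pairs, most frequent first.

    Returns a dict mapping each word to its bar width.
    Hypothetical helper; mirrors the PHP function R, but iteratively.
    """
    top_word, top_freq = counts[0]
    # Start by letting the most frequent word fill the line.
    scale_freq, scale_width = top_freq, limit - len(top_word)
    while True:
        widths = {}
        for word, freq in counts:
            width = freq * scale_width // scale_freq
            if len(word) + width > limit:
                # This word overflows: rescale so that it fits exactly,
                # then recompute every bar from the start.
                scale_freq, scale_width = freq, limit - len(word)
                break
            widths[word] = width
        else:
            # No word overflowed, so the current scale is final.
            return widths


if __name__ == "__main__":
    sample = [("she", 553), ("thisisareallylongwordhere", 500), ("you", 481)]
    for word, width in bar_widths(sample).items():
        print("|" + "-" * width + "| " + word)
```

Each restart strictly lowers the scale, so the loop ends after at most one restart per word.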

17
[+10] [2010-07-03 15:33:14] Syntaera

Perl: 203 202 201 198 195 208 203 / 231 chars

$/=\0;/^(the|and|of|to|.|i[tns]|or)$/i||$x{lc$_}++for<>=~/[a-z]+/gi;map{$z=$x{$_};$y||{$y=(76-y///c)/$z}&&warn" "."_"x($z*$y)."\n";printf"|%.78s\n","_"x($z*$y)."| $_"}(sort{$x{$b}<=>$x{$a}}keys%x)[0..21]

Alternate, full implementation including indicated behaviour (global bar-squishing) for the pathological case in which the secondary word is both popular and long enough to combine to over 80 chars (this implementation is 231 chars):

$/=\0;/^(the|and|of|to|.|i[tns]|or)$/i||$x{lc$_}++for<>=~/[a-z]+/gi;@e=(sort{$x{$b}<=>$x{$a}}keys%x)[0..21];for(@e){$p=(76-y///c)/$x{$_};($y&&$p>$y)||($y=$p)}warn" "."_"x($x{$e[0]}*$y)."\n";for(@e){warn"|"."_"x($x{$_}*$y)."| $_\n"}

The specification didn't state anywhere that this had to go to STDOUT, so I used perl's warn() instead of print - four characters saved there. Used map instead of foreach, but I feel like there could still be some more savings in the split(join()). Still, got it down to 203 - might sleep on it. At least Perl's now under the "shell, grep, tr, grep, sort, uniq, sort, head, perl" char count for now ;)

PS: Reddit says "Hi" ;)

Update: Removed join() in favour of assignment and implicit scalar conversion join. Down to 202. Also please note I have taken advantage of the optional "ignore 1-letter words" rule to shave 2 characters off, so bear in mind the frequency count will reflect this.

Update 2: Swapped out assignment and implicit join for killing $/ to get the file in one gulp using <> in the first place. Same size, but nastier. Swapped out if(!$y){} for $y||{}&&, saved 1 more char => 201.

Update 3: Took control of lowercasing early (lc<>) by moving lc out of the map block - Swapped out both regexes to no longer use /i option, as no longer needed. Swapped explicit conditional x?y:z construct for traditional perlgolf || implicit conditional construct - /^...$/i?1:$x{$_}++ for /^...$/||$x{$_}++. Saved three characters! => 198, broke the 200 barrier. Might sleep soon... perhaps.

Update 4: Sleep deprivation has made me insane. Well. More insane. Figuring that this only has to parse normal happy text files, I made it give up if it hits a null. Saved two characters. Replaced "length" with the 1-char shorter (and much more golfish) y///c - you hear me, GolfScript?? I'm coming for you!!! sob

Update 5: Sleep dep made me forget about the 22row limit and subsequent-line limiting. Back up to 208 with those handled. Not too bad, 13 characters to handle it isn't the end of the world. Played around with perl's regex inline eval, but having trouble getting it to both work and save chars... lol. Updated the example to match current output.

Update 6: Removed unneeded braces protecting (...)for, since the syntactic candy ++ allows shoving it up against the for happily. Thanks to input from Chas. Owens (reminding my tired brain), got the character class i[tns] solution in there. Back down to 203.

Update 7: Added second piece of work, full implementation of specs (including the full bar-squishing behaviour for secondary long-words, instead of truncation which most people are doing, based on the original spec without the pathological example case)

Examples:

 _________________________________________________________________________
|_________________________________________________________________________| she
|_______________________________________________________________| you
|____________________________________________________________| said
|_____________________________________________________| alice
|_______________________________________________| was
|___________________________________________| that
|____________________________________| as
|________________________________| her
|_____________________________| with
|_____________________________| at
|__________________________| on
|__________________________| all
|_______________________| this
|_______________________| for
|_______________________| had
|_______________________| but
|______________________| be
|_____________________| not
|____________________| they
|____________________| so
|___________________| very
|__________________| what

Alternative implementation in pathological case example:

 _______________________________________________________________
|_______________________________________________________________| she
|_______________________________________________________| superlongstringstring
|____________________________________________________| said
|______________________________________________| alice
|________________________________________| was
|_____________________________________| that
|_______________________________| as
|____________________________| her
|_________________________| with
|_________________________| at
|_______________________| on
|______________________| all
|____________________| this
|____________________| for
|____________________| had
|____________________| but
|___________________| be
|__________________| not
|_________________| they
|_________________| so
|________________| very
|________________| what

You can shorten the regex for the stop words by collapsing is|in|it|i into i[snt]? – and then there's no difference with the optional rule anymore. (Hm, I never would have thought about telling a Perl guy how to do Regex :D) – only problem now: I have to look how I can shave off three bytes from my own solution to be better than Perl again :-| - Joey
Ok, disregard part of what I said earlier. Ignoring one-letter words is indeed a byte shorter than not doing it. - Joey
Every byte counts ;) I considered doing the newline trick, but I figured it was actually the same number of bytes, even if it was fewer printable characters. Still working on seeing if I can shrink it some more :) - Syntaera
Ah well, case normalization threw me back to 209. I don't see what else I could cut. Although PowerShell can be shorter than Perl. ;-) - Joey
I don't see where you restrict the output to the top 22 words, nor where you make sure that a long second word doesn't wrap. - Gabe
You can save even more by using say: perl -E '$/=\0;map{/^(the|and|of|to|.|it|in|or|is|)$/||$x{$_}++}split(/[^a-z]/,lc<>);map{$z=$x{$_};$y||{$y=(76-y///c)/$z}&&say" "."_"x($z*$y);say"|"."_"x($z*$y)."| $_"}sort{$x{$b}<=>$x{$a}}keys%x' - Chas. Owens
Even more by using a character class and using for instead of map where possible: perl -E '$/=\0;/^(the|and|of|to|.|i[tns]|or)$/||$x{$_}++for split(/[^a-z]/,lc<>);map{$z=$x{$_};$y||{$y=(76-y///c)/$z}&&say" "."_"x($z*$y);say"|"."_"x($z*$y)."| $_"}sort{$x{$b}<=>$x{$a}}keys%x' - Chas. Owens
Thanks for that - I came to the same conclusion about for, but also got rid of split(), just using a bare regex instead for it. Back down to 203! - Syntaera
18
[+9] [2010-07-02 21:52:47] Brian

F#, 452 chars

Straightforward: get a sequence a of word-count pairs, find the best word-count-per-column multiplier k, then print results.

let a=
 stdin.ReadToEnd().Split(" .?!,\":;'\r\n".ToCharArray(),enum 1)
 |>Seq.map(fun s->s.ToLower())|>Seq.countBy id
 |>Seq.filter(fun(w,n)->not(set["the";"and";"of";"to";"a";"i";"it";"in";"or";"is"].Contains w))
 |>Seq.sortBy(fun(w,n)-> -n)|>Seq.take 22
let k=a|>Seq.map(fun(w,n)->float(78-w.Length)/float n)|>Seq.min
let u n=String.replicate(int(float(n)*k)-2)"_"
printfn" %s "(u(snd(Seq.nth 0 a)))
for(w,n)in a do printfn"|%s| %s "(u n)w
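
For readers who don't speak F#, the multiplier step might look like this in Python. It is a sketch under assumptions: the function names are made up for illustration, and Fraction replaces the F# floats so the longest bar cannot lose a column to rounding. The constants 78 and 2 follow the F# code (78 columns minus the two '|' delimiters); the trailing space the F# version prints is left out.

```python
from fractions import Fraction


def scale_factor(pairs, line_width=78):
    """Smallest columns-per-occurrence ratio over all (word, count) pairs.

    The word with the tightest ratio decides the scale for the whole chart.
    """
    return min(Fraction(line_width - len(word), count) for word, count in pairs)


def bar(count, k):
    # Subtract 2 for the two '|' delimiters, as the F# code does.
    return "_" * (int(count * k) - 2)


def render(pairs):
    """pairs: (word, count) tuples sorted by descending count."""
    k = scale_factor(pairs)
    lines = [" " + bar(pairs[0][1], k)]
    lines += ["|%s| %s" % (bar(count, k), word) for word, count in pairs]
    return lines


if __name__ == "__main__":
    print("\n".join(render([("she", 553), ("you", 481), ("said", 462)])))
```

Because k is a minimum over all displayed words, a long second word shrinks every bar, not just its own.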

Example (I have different freq counts than you, unsure why):

% app.exe < Alice.txt

 _________________________________________________________________________
|_________________________________________________________________________| she
|_______________________________________________________________| you
|_____________________________________________________________| said
|_____________________________________________________| alice
|_______________________________________________| was
|___________________________________________| that
|___________________________________| as
|________________________________| her
|_____________________________| with
|_____________________________| at
|____________________________| t
|____________________________| s
|__________________________| on
|_________________________| all
|_______________________| this
|______________________| had
|______________________| for
|_____________________| but
|_____________________| be
|____________________| not
|___________________| they
|__________________| so

turns out my own solution was indeed a little off (due to a little different spec), the solutions correspond now ;-) - ChristopheD
+1 for the only correct bar scaling implementation so far - Rotsor
(2) (@Rotsor: Ironic, given that mine is the oldest solution.) - Brian
I bet you could shorten it quite a bit by merging the split, map, and filter stages. I'd also expect that you wouldn't need so many floats. - Gabe
Isn't nesting functions usually shorter than using the pipeline operator |>? - Joey
19
[+8] [2010-07-02 23:27:39] AKX

Python 2.6, 347 chars

import re
W,x={},"a and i in is it of or the to".split()
[W.__setitem__(w,W.get(w,0)-1)for w in re.findall("[a-z]+",file("11.txt").read().lower())if w not in x]
W=sorted(W.items(),key=lambda p:p[1])[:22]
bm=(76.-len(W[0][0]))/W[0][1]
U=lambda n:"_"*int(n*bm)
print "".join(("%s\n|%s| %s "%((""if i else" "+U(n)),U(n),w))for i,(w,n)in enumerate(W))

Output:

 _________________________________________________________________________
|_________________________________________________________________________| she 
|_______________________________________________________________| you 
|____________________________________________________________| said 
|_____________________________________________________| alice 
|_______________________________________________| was 
|___________________________________________| that 
|____________________________________| as 
|________________________________| her 
|_____________________________| with 
|_____________________________| at 
|____________________________| s 
|____________________________| t 
|__________________________| on 
|__________________________| all 
|_______________________| this 
|_______________________| for 
|_______________________| had 
|_______________________| but 
|______________________| be 
|_____________________| not 
|____________________| they 
|____________________| so 

(1) You can lose the line bm=(76.-len(W[0][0]))/W[0][1] since you're only using bm once (make the next line U=lambda n:"_"*int(n*(76.-len(W[0][0]))/W[0][1]), shaves off 5 characters. Also: why would you use a 2-character variable name in code golfing? ;-) - ChristopheD
On the last line the space after print isn't necessary, shaves off one character - ChristopheD
(1) Doesn't consider the case when the second-most frequent word is very long, right? - Joey
@ChristopheD: Because I had been staring at that code for a little too long. :P Good catch. @Johannes: That could be fixed too, yes. Not sure all other implementations did it when I wrote this either. - AKX
20
[+7] [2010-07-02 22:54:33] dmckee --- ex-moderator kitten

Gawk -- 336 (originally 507) characters

(after fixing the output formatting; fixing the contractions thing; tweaking; tweaking again; removing a wholly unnecessary sorting step; tweaking yet again; and again (oops this one broke the formatting); tweak some more; taking up Matt's challenge I desperately tweak some more; found another place to save a few, but gave two back to fix the bar length bug)

Heh heh! I am momentarily ahead of Matt's JavaScript solution and AKX's Python solution! ;)

The problem seems to call out for a language that implements native associative arrays, so of course I've chosen one with a horribly deficient set of operators on them. In particular, you cannot control the order in which awk offers up the elements of a hash map, so I repeatedly scan the whole map to find the currently most numerous item, print it and delete it from the array.

It is all terribly inefficient, and with all the golfifications I've made it has gotten to be pretty awful as well.
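
The scan-print-delete loop described above can be sketched in Python for comparison. The function name top_words is an assumption made for illustration; the awk code does the same thing inline in its END block.

```python
def top_words(counts, n=22):
    """Yield (word, count) pairs, highest count first, by repeated scanning.

    Mirrors the awk approach: find the current maximum, emit it, delete it.
    """
    remaining = dict(counts)  # work on a copy so the caller's map survives
    for _ in range(min(n, len(remaining))):
        best = max(remaining, key=remaining.get)  # one full scan per output row
        yield best, remaining.pop(best)


if __name__ == "__main__":
    for word, count in top_words({"she": 553, "so": 151, "you": 481}, n=2):
        print(word, count)
```

With only 22 rows to print, the repeated scans cost about 22 passes over the map, which is why skipping a full sort is tolerable here.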

Minified:

{gsub("[^a-zA-Z]"," ");for(;NF;NF--)a[tolower($NF)]++}
END{split("the and of to a i it in or is",b," ");
for(w in b)delete a[b[w]];d=1;for(w in a){e=a[w]/(78-length(w));if(e>d)d=e}
for(i=22;i;--i){e=0;for(w in a)if(a[w]>e)e=a[x=w];l=a[x]/d-2;
t=sprintf(sprintf("%%%dc",l)," ");gsub(" ","_",t);if(i==22)print" "t;
print"|"t"| "x;delete a[x]}}

Line breaks are for clarity only: they are not necessary and should not be counted.


Output:

$ gawk -f wordfreq.awk.min < 11.txt 
 _________________________________________________________________________
|_________________________________________________________________________| she
|_______________________________________________________________| you
|____________________________________________________________| said
|____________________________________________________| alice
|______________________________________________| was
|__________________________________________| that
|___________________________________| as
|_______________________________| her
|____________________________| with
|____________________________| at
|___________________________| s
|___________________________| t
|_________________________| on
|_________________________| all
|______________________| this
|______________________| for
|______________________| had
|_____________________| but
|____________________| be
|____________________| not
|___________________| they
|__________________| so
$ sed 's/you/superlongstring/gI' 11.txt | gawk -f wordfreq.awk.min
 ______________________________________________________________________
|______________________________________________________________________| she
|_____________________________________________________________| superlongstring
|__________________________________________________________| said
|__________________________________________________| alice
|____________________________________________| was
|_________________________________________| that
|_________________________________| as
|______________________________| her
|___________________________| with
|___________________________| at
|__________________________| s
|__________________________| t
|________________________| on
|________________________| all
|_____________________| this
|_____________________| for
|_____________________| had
|____________________| but
|___________________| be
|___________________| not
|__________________| they
|_________________| so

Readable; 633 characters (originally 949):

{
    gsub("[^a-zA-Z]"," ");
    for(;NF;NF--)
    a[tolower($NF)]++
}
END{
    # remove "short" words
    split("the and of to a i it in or is",b," ");
    for (w in b) 
    delete a[b[w]];
    # Find the bar ratio
    d=1;
    for (w in a) {
    e=a[w]/(78-length(w));
    if (e>d)
        d=e
    }
    # Print the entries highest count first
    for (i=22; i; --i){               
    # find the highest count
    e=0;
    for (w in a) 
        if (a[w]>e)
        e=a[x=w];
        # Print the bar
    l=a[x]/d-2;
    # make a string of "_" the right length
    t=sprintf(sprintf("%%%dc",l)," ");
    gsub(" ","_",t);
    if (i==22) print" "t;
    print"|"t"| "x;
    delete a[x]
    }
}

Nice work, good you included an indented / commented version ;-) - ChristopheD
21
[+7] [2010-07-02 22:55:19] Frank Farmer

*sh (+curl), partial solution

This is incomplete, but for the hell of it, here's the word-frequency counting half of the problem in 192 bytes:

curl -s http://www.gutenberg.org/files/11/11.txt|sed -e 's@[^a-z]@\n@gi'|tr '[:upper:]' '[:lower:]'|egrep -v '(^[^a-z]*$|\b(the|and|of|to|a|i|it|in|or|is)\b)' |sort|uniq -c|sort -n|tail -n 22
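
For comparison, the same counting half can be sketched in Python. This is an illustration rather than a golfed entry: the names STOP_WORDS and word_frequencies are assumptions, and the text is passed in as a string instead of being fetched with curl. Unlike the pipeline, which ends with sort -n and tail and so lists the most frequent word last, most_common returns it first.

```python
import re
from collections import Counter

# The stop words from the challenge, as used in the egrep pattern above.
STOP_WORDS = {"the", "and", "of", "to", "a", "i", "it", "in", "or", "is"}


def word_frequencies(text, n=22):
    """Return the n most frequent non-stop words as (word, count) pairs."""
    words = re.findall(r"[a-z]+", text.lower())  # split on anything but letters
    counts = Counter(w for w in words if w not in STOP_WORDS)
    return counts.most_common(n)


if __name__ == "__main__":
    import sys
    for word, count in word_frequencies(sys.stdin.read()):
        print(count, word)
```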

22
[+7] [2010-07-04 15:53:54] 6502

Common LISP, 670 characters

I'm a LISP newbie, and this is an attempt using a hash table for counting (so probably not the most compact method).

(flet((r()(let((x(read-char t nil)))(and x(char-downcase x)))))(do((c(
make-hash-table :test 'equal))(w NIL)(x(r)(r))y)((not x)(maphash(lambda
(k v)(if(not(find k '("""the""and""of""to""a""i""it""in""or""is"):test
'equal))(push(cons k v)y)))c)(setf y(sort y #'> :key #'cdr))(setf y
(subseq y 0(min(length y)22)))(let((f(apply #'min(mapcar(lambda(x)(/(-
76.0(length(car x)))(cdr x)))y))))(flet((o(n)(dotimes(i(floor(* n f)))
(write-char #\_))))(write-char #\Space)(o(cdar y))(write-char #\Newline)
(dolist(x y)(write-char #\|)(o(cdr x))(format t "| ~a~%"(car x))))))
(cond((char<= #\a x #\z)(push x w))(t(incf(gethash(concatenate 'string(
reverse w))c 0))(setf w nil)))))

It can be run with, for example, cat alice.txt | clisp -C golf.lisp.

In readable form:

(flet ((r () (let ((x (read-char t nil)))
               (and x (char-downcase x)))))
  (do ((c (make-hash-table :test 'equal))  ; the word count map
       w y                                 ; current word and final word list
       (x (r) (r)))  ; iteration over all chars
       ((not x)

        ; make a list with (word . count) pairs removing stopwords
        (maphash (lambda (k v)
                   (if (not (find k '("" "the" "and" "of" "to"
                                      "a" "i" "it" "in" "or" "is")
                                  :test 'equal))
                       (push (cons k v) y)))
                 c)

        ; sort and truncate the list
        (setf y (sort y #'> :key #'cdr))
        (setf y (subseq y 0 (min (length y) 22)))

        ; find the scaling factor
        (let ((f (apply #'min
                        (mapcar (lambda (x) (/ (- 76.0 (length (car x)))
                                               (cdr x)))
                                y))))
          ; output
          (flet ((outx (n) (dotimes (i (floor (* n f))) (write-char #\_))))
             (write-char #\Space)
             (outx (cdar y))
             (write-char #\Newline)
             (dolist (x y)
               (write-char #\|)
               (outx (cdr x))
               (format t "| ~a~%" (car x))))))

       ; add alphabetic to current word, and bump word counter
       ; on non-alphabetic
       (cond
        ((char<= #\a x #\z)
         (push x w))
        (t
         (incf (gethash (concatenate 'string (reverse w)) c 0))
         (setf w nil)))))
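
The character-by-character loop at the end of the readable version can be sketched in Python as follows. The name count_words is an assumption, and ASCII input is assumed as in the challenge. Like the Lisp loop, it bumps a counter on every non-letter, so runs of separators are tallied under the empty string (which the Lisp code later drops by listing "" among the stop words), and a word is only counted once a non-letter follows it.

```python
def count_words(chars):
    """Count lowercase words in an iterable of characters.

    Letters extend the current word; any other character closes it
    and bumps its counter, even when the current word is empty.
    """
    counts = {}
    word = []
    for ch in chars:
        ch = ch.lower()
        if "a" <= ch <= "z":
            word.append(ch)
        else:
            key = "".join(word)
            counts[key] = counts.get(key, 0) + 1  # "" is counted too
            word = []
    return counts


if __name__ == "__main__":
    print(count_words("Hi, hi there.\n"))
```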

have you tried installing a custom reader macro to shave off some input size? - Aaron
@Aaron actually it wasn't trivial for me even just getting this working... :-) for the actual golfing part i just used one-letter variables and that's all. Anyway besides somewhat high verbosity that is inherent in CL for this scale of problems ("concatenate 'string", "setf" or "gethash" are killers... in python they are "+", "=", "[]") still I felt this a lot worse than I would have expected even on a logical level. In a sense I've a feeling that lisp is ok, but common lisp is so-so and this beyond naming (re-reading it, a very unfair comment, as my experience with CL is close to zero). - 6502
true. scheme would make the golfing a bit easier, with the single namespace. instead of string-append all over the place, you could (letrec ((a string-append)(b gethash)) ... (a "x" "yz") ...) - Aaron
23
[+6] [2010-07-04 10:16:55] ShinTakezou

C (828)

It looks a lot like obfuscated code, and uses glib for string, list and hash. Char count with wc -m says 828. It does not consider single-char words. To calculate the max length of the bar, it considers the longest possible word among all, not only the first 22. Is this a deviation from the spec?

It does not handle failures and it does not release used memory.

#include <glib.h>
#define S(X)g_string_##X
#define H(X)g_hash_table_##X
GHashTable*h;int m,w=0,z=0;y(const void*a,const void*b){int*A,*B;A=H(lookup)(h,a);B=H(lookup)(h,b);return*B-*A;}void p(void*d,void*u){int *v=H(lookup)(h,d);if(w<22){g_printf("|");*v=*v*(77-z)/m;while(--*v>=0)g_printf("=");g_printf("| %s\n",d);w++;}}main(c){int*v;GList*l;GString*s=S(new)(NULL);h=H(new)(g_str_hash,g_str_equal);char*n[]={"the","and","of","to","it","in","or","is"};while((c=getchar())!=-1){if(isalpha(c))S(append_c)(s,tolower(c));else{if(s->len>1){for(c=0;c<8;c++)if(!strcmp(s->str,n[c]))goto x;if((v=H(lookup)(h,s->str))!=NULL)++*v;else{z=MAX(z,s->len);v=g_malloc(sizeof(int));*v=1;H(insert)(h,g_strdup(s->str),v);}}x:S(truncate)(s,0);}}l=g_list_sort(H(get_keys)(h),y);m=*(int*)H(lookup)(h,g_list_first(l)->data);g_list_foreach(l,p,NULL);}

Newlines do count as characters, but you can strip any from lines that are not preprocessor instructions. For a golf, I wouldn't consider not freeing memory a bad practice. - Stéphan Kochen
ok... put it all on one line (except preprocessor macros) and gave a version without freeing memory (and with two other spaces removed... a little bit of improvement can be made on the "obfuscation", e.g. *v=*v*(77-lw)/m will give 929 ... but I think it can be ok unless I find a way to do it a lot shorter) - ShinTakezou
I think you can move at least the int c into the main declaration and main is implicitly int (as are any untyped arguments, afaik): main(c){...}. You could probably also just write 0 instead of NULL. - Joey
doing it... of course will trigger some warning with the -Wall or with -std=c99 flag on... but I suppose this is pointless for a code-golf, right? - ShinTakezou
uff, sorry for short-gap time edits, ... I should change Without freeing memory stuff, it reaches 866 (removed some other unuseful space) to something else to let not think people that the difference with the free-memory version is all in that: now the no-free-memory version has a lot of more "improvements". - ShinTakezou
still some improvements can be done shortening names of variables+function - ShinTakezou
@Shin: BTW--you can have more than one answer to a single question. Scroll to the very bottom of the page to find the [Add Another Answer] button. I suppose it's moved down because the expectation is that multiple answers will be the exception, not the rule. - dmckee --- ex-moderator kitten
@dmckee thanks, I am going to disentangle C and Smalltalk! - ShinTakezou
24
[+6] [2010-07-06 03:17:02] mob

Perl, 185 char

200 (slightly broken) 199 197 195 193 187 185 characters. Last two newlines are significant. Complies with the spec.

map$X{+lc}+=!/^(.|the|and|to|i[nst]|o[rf])$/i,/[a-z]+/gfor<>;
$n=$n>($:=$X{$_}/(76-y+++c))?$n:$:for@w=(sort{$X{$b}-$X{$a}}%X)[0..21];
die map{$U='_'x($X{$_}/$n);" $U
"x!$z++,"|$U| $_
"}@w

First line loads counts of valid words into %X.

The second line computes minimum scaling factor so that all output lines will be <= 80 characters.

The third line (contains two newline characters) produces the output.


This won't remove stop words from strings such as "foo_the_bar". Line length is also one too long (re-read the spec: "bar + space + word + space <= 80 chars") - Joey
25
[+5] [2010-07-03 05:57:45] BalusC

Java - 886 865 756 744 742 744 752 742 714 680 chars

  • Updates before first 742: improved regex, removed superfluous parameterized types, removed superfluous whitespace.

  • Update 742 > 744 chars: fixed the fixed-length hack. It's only dependent on the 1st word, not other words (yet). Found several places to shorten the code (\\s in regex replaced by a literal space and ArrayList replaced by Vector). I'm now looking for a short way to remove the Commons IO dependency and read from stdin.

  • Update 744 > 752 chars: I removed the commons dependency. It now reads from stdin. Paste the text in stdin and hit Ctrl+Z to get result.

  • Update 752 > 742 chars: I removed public and a space, made classname 1 char instead of 2 and it's now ignoring one-letter words.

  • Update 742 > 714 chars: Updated as per comments of Carl: removed redundant assignment (742 > 730), replaced m.containsKey(k) by m.get(k)!=null (730 > 728), introduced substringing of line (728 > 714).

  • Update 714 > 680 chars: Updated as per comments of Rotsor: improved bar size calculation to remove unnecessary casting and improved split() to remove unnecessary replaceAll().


import java.util.*;class F{public static void main(String[]a)throws Exception{StringBuffer b=new StringBuffer();for(int c;(c=System.in.read())>0;b.append((char)c));final Map<String,Integer>m=new HashMap();for(String w:b.toString().toLowerCase().split("(\\b(.|the|and|of|to|i[tns]|or)\\b|\\W)+"))m.put(w,m.get(w)!=null?m.get(w)+1:1);List<String>l=new Vector(m.keySet());Collections.sort(l,new Comparator(){public int compare(Object l,Object r){return m.get(r)-m.get(l);}});int c=76-l.get(0).length();String s=new String(new char[c]).replace('\0','_');System.out.println(" "+s);for(String w:l.subList(0,22))System.out.println("|"+s.substring(0,m.get(w)*c/m.get(l.get(0)))+"| "+w);}}

More readable version:

import java.util.*;
class F{
 public static void main(String[]a)throws Exception{
  StringBuffer b=new StringBuffer();for(int c;(c=System.in.read())>0;b.append((char)c));
  final Map<String,Integer>m=new HashMap();for(String w:b.toString().toLowerCase().split("(\\b(.|the|and|of|to|i[tns]|or)\\b|\\W)+"))m.put(w,m.get(w)!=null?m.get(w)+1:1);
  List<String>l=new Vector(m.keySet());Collections.sort(l,new Comparator(){public int compare(Object l,Object r){return m.get(r)-m.get(l);}});
  int c=76-l.get(0).length();String s=new String(new char[c]).replace('\0','_');System.out.println(" "+s);
  for(String w:l.subList(0,22))System.out.println("|"+s.substring(0,m.get(w)*c/m.get(l.get(0)))+"| "+w);
 }
}

Output:

 _________________________________________________________________________
|_________________________________________________________________________| she
|_______________________________________________________________| you
|____________________________________________________________| said
|_____________________________________________________| alice
|_______________________________________________| was
|___________________________________________| that
|____________________________________| as
|________________________________| her
|_____________________________| with
|_____________________________| at
|__________________________| on
|__________________________| all
|_______________________| this
|_______________________| for
|_______________________| had
|_______________________| but
|______________________| be
|_____________________| not
|____________________| they
|____________________| so
|___________________| very
|__________________| what

It pretty much sucks that Java doesn't have String#join() and closures [1] (yet).

Edit by Rotsor:

I have made several changes to your solution:

  • Replaced List with a String[]
  • Reused the 'args' argument instead of declaring my own String array. Also used it as an argument to .toArray()
  • Replaced StringBuffer with a String (yes, yes, terrible performance)
  • Replaced Java sorting with a selection-sort with early halting (only first 22 elements have to be found)
  • Aggregated some int declaration into a single statement
  • Implemented the non-cheating algorithm finding the most limiting line of output. Implemented it without FP.
  • Fixed the problem of the program crashing when there were fewer than 22 distinct words in the text
  • Implemented a new algorithm of reading input, which is fast and only 9 characters longer than the slow one.
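
Two of the changes listed above, the early-halting sort and the floating-point-free search for the most limiting line, can be sketched in Python. The function names are assumptions made for illustration; the logic follows the loops in the condensed Java code, and counts are assumed positive with every word shorter than the limit.

```python
def top_by_exchange(words, counts, g=22):
    """Move the g most frequent words to the front, then stop sorting."""
    words = list(words)
    g = min(g, len(words))  # guards against fewer than g distinct words
    for i in range(g):
        for j in range(i + 1, len(words)):
            if counts[words[i]] < counts[words[j]]:
                words[i], words[j] = words[j], words[i]
    return words[:g]


def limiting_ratio(pairs, limit=76):
    """Return (width, count) for the word that constrains the chart most.

    Two ratios w/count are compared by cross-multiplication, so no
    floating-point arithmetic is needed.
    """
    best_w, best_c = limit - len(pairs[0][0]), pairs[0][1]
    for word, count in pairs[1:]:
        w = limit - len(word)
        if count * best_w > w * best_c:  # same as w/count < best_w/best_c
            best_w, best_c = w, count
    return best_w, best_c


def bar_length(count, ratio):
    w, c = ratio
    return count * w // c  # integer division, as in the Java code


if __name__ == "__main__":
    pairs = [("she", 553), ("thisisareallylongwordhere", 500), ("you", 481)]
    ratio = limiting_ratio(pairs)
    for word, count in pairs:
        print("|" + "_" * bar_length(count, ratio) + "| " + word)
```

The outer loop of the sort runs only g times, so the rest of the array is never put in order.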

The condensed code is 688 711 684 characters long:

import java.util.*;class F{public static void main(String[]l)throws Exception{Map<String,Integer>m=new HashMap();String w="";int i=0,k=0,j=8,x,y,g=22;for(;(j=System.in.read())>0;w+=(char)j);for(String W:w.toLowerCase().split("(\\b(.|the|and|of|to|i[tns]|or)\\b|\\W)+"))m.put(W,m.get(W)!=null?m.get(W)+1:1);l=m.keySet().toArray(l);x=l.length;if(x<g)g=x;for(;i<g;++i)for(j=i;++j<x;)if(m.get(l[i])<m.get(l[j])){w=l[i];l[i]=l[j];l[j]=w;}for(;k<g;k++){x=76-l[k].length();y=m.get(l[k]);if(k<1||y*i>x*j){i=x;j=y;}}String s=new String(new char[m.get(l[0])*i/j]).replace('\0','_');System.out.println(" "+s);for(k=0;k<g;k++){w=l[k];System.out.println("|"+s.substring(0,m.get(w)*i/j)+"| "+w);}}}

The fast version (720 693 characters)

import java.util.*;class F{public static void main(String[]l)throws Exception{Map<String,Integer>m=new HashMap();String w="";int i=0,k=0,j=8,x,y,g=22;for(;j>0;){j=System.in.read();if(j>90)j-=32;if(j>64&j<91)w+=(char)j;else{if(!w.matches("^(|.|THE|AND|OF|TO|I[TNS]|OR)$"))m.put(w,m.get(w)!=null?m.get(w)+1:1);w="";}}l=m.keySet().toArray(l);x=l.length;if(x<g)g=x;for(;i<g;++i)for(j=i;++j<x;)if(m.get(l[i])<m.get(l[j])){w=l[i];l[i]=l[j];l[j]=w;}for(;k<g;k++){x=76-l[k].length();y=m.get(l[k]);if(k<1||y*i>x*j){i=x;j=y;}}String s=new String(new char[m.get(l[0])*i/j]).replace('\0','_');System.out.println(" "+s);for(k=0;k<g;k++){w=l[k];System.out.println("|"+s.substring(0,m.get(w)*i/j)+"| "+w);}}}

More readable version:

import java.util.*;class F{public static void main(String[]l)throws Exception{
    Map<String,Integer>m=new HashMap();String w="";
    int i=0,k=0,j=8,x,y,g=22;
    for(;j>0;){j=System.in.read();if(j>90)j-=32;if(j>64&j<91)w+=(char)j;else{
        if(!w.matches("^(|.|THE|AND|OF|TO|I[TNS]|OR)$"))m.put(w,m.get(w)!=null?m.get(w)+1:1);w="";
    }}
    l=m.keySet().toArray(l);x=l.length;if(x<g)g=x;
    for(;i<g;++i)for(j=i;++j<x;)if(m.get(l[i])<m.get(l[j])){w=l[i];l[i]=l[j];l[j]=w;}
    for(;k<g;k++){x=76-l[k].length();y=m.get(l[k]);if(k<1||y*i>x*j){i=x;j=y;}}
    String s=new String(new char[m.get(l[0])*i/j]).replace('\0','_');
    System.out.println(" "+s);
    for(k=0;k<g;k++){w=l[k];System.out.println("|"+s.substring(0,m.get(w)*i/j)+"| "+w);}}
}

The version without behaviour improvements is 615 characters:

import java.util.*;class F{public static void main(String[]l)throws Exception{Map<String,Integer>m=new HashMap();String w="";int i=0,k=0,j=8,g=22;for(;j>0;){j=System.in.read();if(j>90)j-=32;if(j>64&j<91)w+=(char)j;else{if(!w.matches("^(|.|THE|AND|OF|TO|I[TNS]|OR)$"))m.put(w,m.get(w)!=null?m.get(w)+1:1);w="";}}l=m.keySet().toArray(l);for(;i<g;++i)for(j=i;++j<l.length;)if(m.get(l[i])<m.get(l[j])){w=l[i];l[i]=l[j];l[j]=w;}i=76-l[0].length();String s=new String(new char[i]).replace('\0','_');System.out.println(" "+s);for(k=0;k<g;k++){w=l[k];System.out.println("|"+s.substring(0,m.get(w)*i/m.get(l[0]))+"| "+w);}}}
[1] http://javac.info

Couldn't you just use the fully-qualified name to IOUtils instead of importing it? As far as I can see you're using it only once anyway. - Joey
(5) You kind of cheated by assuming that the longest bar will be exactly 75 characters. You have to make sure that no bar+word is longer than 80 chars. - Gabe
You're missing a space after the word. ;) - st0le
As I was trimming down my answer, I was hoping I'd beat BalusC's submission. I still have 200 characters to go, ugh! I wonder how long this would be without Commons IO & 75 char assumption. - Jonathon Faust
@Gabe and @Jon: I removed the 75 char fix, it's now however only dependent on 1st word. I removed the Commons IO dependency as well. - BalusC
m.get(w)==null is a shorter check than m.containsKey(w), and you don't actually need to declare t - you can dump it straight into the for-each construct. I think it might also be shorter to assign the new String(new char[c]).replace... to a string, and use substring in the second call to get a slice of it. - Carl
@Carl: thanks, I now see in the update history that you actually updated it, but I totally missed it and have overwritten it, sorry! :) - BalusC
(1) Looks like you could shave some characters by making b a String instead of a StringBuffer. I don't want to think about what the performance would be, though (especially since you're adding one character at a time). - Michael Myers
@mmyers: Code golf isn't all about performance! You're right, I was too much used to a StringBuilder/Buffer for this kind of job. Updating... - BalusC
@mmyers: It didn't work nicely, it's indeed too slow. Pressing Ctrl+Z would abort it immediately. It took about 10 seconds to gather the example input instead of a subsecond. - BalusC
What is the purpose of final before Map? It seems extraneous to me. - Gabe
Here is a trick: Instead of writing System.out.println(...), define System x; and write later x.out.println(...) - Landei
@Gabe: so that it can be accessed in the anonymous Comparator class. @Landei: that's not possible in Java. - BalusC
Can you make m a Map<String,Float>? I would expect that to save 4 strokes. - Gabe
You can save some space by placing the code in an initializer: enum X{X{{..code goes here..}}} - ealf
@ealf: you need somewhere a main() anyway to execute the code. - BalusC
I think there is a way to shave some characters by doing something like: final Map<String,Integer>m=new HashMap();new BufferedInputReader(new InputStreamReader(System.in)){{for(String w:readLine().toLowerCase().replaceAll("\\b(.|the|and|of|to|i[tns]|or)\\b|\\W"," ").split(" +"))m.put(w,m.get(w)!=null?m.get(w)+1:1);}}; - Carl
though that assumes no words are hyphen split over lines, which doesn't seem to be accounted for elsewhere, so seems safe enough an assumption. - Carl
@Carl: Unfortunately, the input can consist of multiple lines. I already considered that. - BalusC
hrm, how about sticking a while(ready()) in front of the for? - Carl
Why do you need so much casting here? (int)(((float)m.get(w)/m.get(l.get(0)))*c) You can do this: m.get(w)*c/m.get(l.get(0)) saving 16 characters, which will have an added benefit of being always exact and not FP-dangerous. However, l.get(0) is cheating. This solution will not work with 'superlongstring' example. - Rotsor
You can replace .replaceAll("\\b(.|the|and|of|to|i[tns]|or)\\b|\\W"," ").split(" +") with .split("(\\b(.|the|and|of|to|i[tns]|or)\\b|\\W)+"), saving 18 characters! - Rotsor
The solution with the two changes above and StringBuffer replaced with a String is just 641 characters long. Nobody required linear time, so I think String instead of StringBuffer is acceptable. - Rotsor
@Rotsor: Thank you very much for the hints :) I'll update the answer soon. And yes, I am cheating. But that requirement came in later, and it's going to cost another >100 chars ... - BalusC
I have changed it to a version without cheating, also fixing a bug of IndexOutOfBoundsException. I will append it to your post. Feel free to do anything you want with it. - Rotsor
The requirement didn't come later by the way, it was there from the beginning, but there just was no example to show this bug. - Rotsor
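
Rotsor's hint about folding replaceAll and split into a single split can be tried out in Python as well. This is only a sketch: the function name words and the sample sentence are made up, and the groups are written as non-capturing because Python's re.split would otherwise return the captured text as list items (Java's String.split does not).

```python
import re

# One delimiter pattern: any run of ignored words, single characters
# standing alone, and non-word characters
DELIMITER = r"(?:\b(?:.|the|and|of|to|i[tns]|or)\b|\W)+"


def words(text):
    # A delimiter at the very start or end leaves an empty string behind,
    # so those are filtered out
    return [w for w in re.split(DELIMITER, text.lower()) if w]


sample = "The cat and a dog sat in it, or so I heard."
result = words(sample)
```

The ignored words never show up in the result, because they are consumed as part of the delimiter.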
26
[+4] [2010-07-03 14:02:20] pr1001

Scala, 368 chars

First, a legible version in 592 characters:

object Alice {
  def main(args:Array[String]) {
    val s = io.Source.fromFile(args(0))
    val words = s.getLines.flatMap("(?i)\\w+\\b(?<!\\bthe|and|of|to|a|i|it|in|or|is)".r.findAllIn(_)).map(_.toLowerCase)
    val freqs = words.foldLeft(Map[String, Int]())((countmap, word)  => countmap + (word -> (countmap.getOrElse(word, 0)+1)))
    val sortedFreqs = freqs.toList.sort((a, b)  => a._2 > b._2)
    val top22 = sortedFreqs.take(22)
    val highestWord = top22.head._1
    val highestCount = top22.head._2
    val widest = 76 - highestWord.length
    println(" " + "_" * widest)
    top22.foreach(t => {
      val width = Math.round((t._2 * 1.0 / highestCount) * widest).toInt
      println("|" + "_" * width + "| " + t._1)
    })
  }
}

The console output looks like this:

$ scalac alice.scala 
$ scala Alice aliceinwonderland.txt
 _________________________________________________________________________
|_________________________________________________________________________| she
|_______________________________________________________________| you
|_____________________________________________________________| said
|_____________________________________________________| alice
|_______________________________________________| was
|____________________________________________| that
|____________________________________| as
|_________________________________| her
|______________________________| at
|______________________________| with
|_____________________________| s
|_____________________________| t
|___________________________| on
|__________________________| all
|_______________________| had
|_______________________| but
|______________________| be
|______________________| not
|____________________| they
|____________________| so
|___________________| very
|___________________| what

We can do some aggressive minifying and get it down to 415 characters:

object A{def main(args:Array[String]){val l=io.Source.fromFile(args(0)).getLines.flatMap("(?i)\\w+\\b(?<!\\bthe|and|of|to|a|i|it|in|or|is)".r.findAllIn(_)).map(_.toLowerCase).foldLeft(Map[String, Int]())((c,w)=>c+(w->(c.getOrElse(w,0)+1))).toList.sort((a,b)=>a._2>b._2).take(22);println(" "+"_"*(76-l.head._1.length));l.foreach(t=>println("|"+"_"*Math.round((t._2*1.0/l.head._2)*(76-l.head._1.length)).toInt+"| "+t._1))}}

The console session looks like this:

$ scalac a.scala 
$ scala A aliceinwonderland.txt
 _________________________________________________________________________
|_________________________________________________________________________| she
|_______________________________________________________________| you
|_____________________________________________________________| said
|_____________________________________________________| alice
|_______________________________________________| was
|____________________________________________| that
|____________________________________| as
|_________________________________| her
|______________________________| at
|______________________________| with
|_____________________________| s
|_____________________________| t
|___________________________| on
|__________________________| all
|_______________________| had
|_______________________| but
|______________________| be
|______________________| not
|____________________| they
|____________________| so
|___________________| very
|___________________| what

I'm sure a Scala expert could do even better.

Update: In the comments Thomas gave an even shorter version, at 368 characters:

object A{def main(a:Array[String]){val t=(Map[String, Int]()/:(for(x<-io.Source.fromFile(a(0)).getLines;y<-"(?i)\\w+\\b(?<!\\bthe|and|of|to|a|i|it|in|or|is)".r findAllIn x) yield y.toLowerCase).toList)((c,x)=>c+(x->(c.getOrElse(x,0)+1))).toList.sortBy(_._2).reverse.take(22);val w=76-t.head._1.length;print(" "+"_"*w);t map (s=>"\n|"+"_"*(s._2*w/t.head._2)+"| "+s._1) foreach print}}

Legibly, at 375 characters:

object Alice {
  def main(a:Array[String]) {
    val t = (Map[String, Int]() /: (
      for (
        x <- io.Source.fromFile(a(0)).getLines;
        y <- "(?i)\\w+\\b(?<!\\bthe|and|of|to|a|i|it|in|or|is)".r.findAllIn(x)
      ) yield y.toLowerCase
    ).toList)((c, x) => c + (x -> (c.getOrElse(x, 0) + 1))).toList.sortBy(_._2).reverse.take(22)
    val w = 76 - t.head._1.length
    print (" "+"_"*w)
    t.map(s => "\n|" + "_" * (s._2 * w / t.head._2) + "| " + s._1).foreach(print)
  }
}

383 chars: object A{def main(a:Array[String]){val t=(Map[String, Int]()/:(for(x<-io.Source.fromFile(a(0)).getLines;y<-"(?i)\\w+\\b(?<!\\bthe|and|of|to|a|i|it|in|or|is)".r findAllIn x) yield y.toLowerCase).toList)((c,x)=>c+(x->(c.getOrElse(x,0)+1))).toList.sortBy(_._2).reverse.take(22);val w=76-t.head._1.length;print(" "+"_"*w);t map (s=>"\n|"+"_"*(s._2*w/t.head._2)+"| "+s._1) foreach print}} - Thomas Jung
Of course, the ever handy for comprehension! Nice! - pr1001
27
[+4] [2010-07-05 22:52:24] mkneissl

Scala 2.8, 311 314 320 330 332 336 341 375 characters

including long word adjustment. Ideas borrowed from the other solutions.

Now as a script (a.scala):

val t="\\w+\\b(?<!\\bthe|and|of|to|a|i[tns]?|or)".r.findAllIn(io.Source.fromFile(argv(0)).mkString.toLowerCase).toSeq.groupBy(w=>w).mapValues(_.size).toSeq.sortBy(-_._2)take 22
def b(p:Int)="_"*(p*(for((w,c)<-t)yield(76.0-w.size)/c).min).toInt
println(" "+b(t(0)._2))
for(p<-t)printf("|%s| %s \n",b(p._2),p._1)

Run with

scala -howtorun:script a.scala alice.txt

BTW, the edit from 314 to 311 characters actually removes only 1 character. Someone got the counting wrong before (Windows CRs?).
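
The "Windows CRs?" guess is not confirmed anywhere, but the effect itself is easy to show: every line break counted as CR+LF adds one character. A small Python sketch, using a made-up three-line placeholder rather than the real script:

```python
# Placeholder text, not the actual Scala script from this answer
script = "val t=1\ndef b(p:Int)=p\nprintln(b(t))\n"

unix_len = len(script)                            # line breaks as LF
windows_len = len(script.replace("\n", "\r\n"))   # line breaks as CR+LF
extra = windows_len - unix_len                    # one extra char per break
```

So a count taken from a file saved with Windows line endings comes out higher by the number of line breaks.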


28
[+4] [2010-07-07 12:21:34] Alex Taggart

Clojure 282 strict

(let[[[_ m]:as s](->>(slurp *in*).toLowerCase(re-seq #"\w+\b(?<!\bthe|and|of|to|a|i[tns]?|or)")frequencies(sort-by val >)(take 22))[b](sort(map #(/(- 76(count(key %)))(val %))s))p #(do(print %1)(dotimes[_(* b %2)](print \_))(apply println %&))](p " " m)(doseq[[k v]s](p \| v \| k)))

Somewhat more legibly:

(let[[[_ m]:as s](->> (slurp *in*)
                   .toLowerCase
                   (re-seq #"\w+\b(?<!\bthe|and|of|to|a|i[tns]?|or)")
                   frequencies
                   (sort-by val >)
                   (take 22))
     [b] (sort (map #(/ (- 76 (count (key %)))(val %)) s))
     p #(do
          (print %1)
          (dotimes[_(* b %2)] (print \_))
          (apply println %&))]
  (p " " m)
  (doseq[[k v] s] (p \| v \| k)))

29
[+3] [2010-07-03 03:19:20] Jonathon Faust

Java - 896 chars

931 chars

1233 chars made unreadable

1977 chars "uncompressed"


Update: I have aggressively reduced the character count. Omits single-letter words per updated spec.

I envy C# and LINQ so much.

import java.util.*;import java.io.*;import static java.util.regex.Pattern.*;class g{public static void main(String[] a)throws Exception{PrintStream o=System.out;Map<String,Integer> w=new HashMap();Scanner s=new Scanner(new File(a[0])).useDelimiter(compile("[^a-z]+|\\b(the|and|of|to|.|it|in|or|is)\\b",2));while(s.hasNext()){String z=s.next().trim().toLowerCase();if(z.equals(""))continue;w.put(z,(w.get(z)==null?0:w.get(z))+1);}List<Integer> v=new Vector(w.values());Collections.sort(v);List<String> q=new Vector();int i,m;i=m=v.size()-1;while(q.size()<22){for(String t:w.keySet())if(!q.contains(t)&&w.get(t).equals(v.get(i)))q.add(t);i--;}int r=80-q.get(0).length()-4;String l=String.format("%1$0"+r+"d",0).replace("0","_");o.println(" "+l);o.println("|"+l+"| "+q.get(0)+" ");for(i=m-1;i>m-22;i--){o.println("|"+l.substring(0,(int)Math.round(r*(v.get(i)*1.0)/v.get(m)))+"| "+q.get(m-i)+" ");}}}

"Readable":

import java.util.*;
import java.io.*;
import static java.util.regex.Pattern.*;
class g
{
   public static void main(String[] a)throws Exception
      {
      PrintStream o = System.out;
      Map<String,Integer> w = new HashMap();
      Scanner s = new Scanner(new File(a[0]))
         .useDelimiter(compile("[^a-z]+|\\b(the|and|of|to|.|it|in|or|is)\\b",2));
      while(s.hasNext())
      {
         String z = s.next().trim().toLowerCase();
         if(z.equals(""))
            continue;
         w.put(z,(w.get(z) == null?0:w.get(z))+1);
      }
      List<Integer> v = new Vector(w.values());
      Collections.sort(v);
      List<String> q = new Vector();
      int i,m;
      i = m = v.size()-1;
      while(q.size()<22)
      {
         for(String t:w.keySet())
            if(!q.contains(t)&&w.get(t).equals(v.get(i)))
               q.add(t);
         i--;
      }
      int r = 80-q.get(0).length()-4;
      String l = String.format("%1$0"+r+"d",0).replace("0","_");
      o.println(" "+l);
      o.println("|"+l+"| "+q.get(0)+" ");
      for(i = m-1; i > m-22; i--)
      {
         o.println("|"+l.substring(0,(int)Math.round(r*(v.get(i)*1.0)/v.get(m)))+"| "+q.get(m-i)+" ");
      }
   }
}

Output of Alice:

 _________________________________________________________________________
|_________________________________________________________________________| she
|_______________________________________________________________| you
|_____________________________________________________________| said
|_____________________________________________________| alice
|_______________________________________________| was
|____________________________________________| that
|____________________________________| as
|_________________________________| her
|______________________________| with
|______________________________| at
|___________________________| on
|__________________________| all
|________________________| this
|________________________| for
|_______________________| had
|_______________________| but
|______________________| be
|______________________| not
|____________________| they
|____________________| so
|___________________| very
|___________________| what

Output of Don Quixote (also from Gutenberg):

 ________________________________________________________________________
|________________________________________________________________________| that
|________________________________________________________| he
|______________________________________________| for
|__________________________________________| his
|________________________________________| as
|__________________________________| with
|_________________________________| not
|_________________________________| was
|________________________________| him
|______________________________| be
|___________________________| don
|_________________________| my
|_________________________| this
|_________________________| all
|_________________________| they
|________________________| said
|_______________________| have
|_______________________| me
|______________________| on
|______________________| so
|_____________________| you
|_____________________| quixote

(8) Wholly carp, is there really no way to make it shorter in Java? I hope you guys get paid by number of characters and not by functionality :-) - Nas Banov
30
[+3] [2010-07-03 15:47:17] DMA57361

C++, 647 chars

I don't expect to score highly by using C++, but never mind that. I'm pretty sure it hits all the requirements. Note that I used the C++0x auto keyword for variable declaration, so adjust your compiler appropriately if you decide to test my code.

Minimised version

#include <iostream>
#include <cstring>
#include <map>
using namespace std;
#define C string
#define S(x)v=F/a,cout<<#x<<C(v,'_')
#define F t->first
#define G t->second
#define O &&F!=
#define L for(i=22;i-->0;--t)
int main(){map<C,int>f;char d[230];int i=1,v;for(;i<256;i++)d[i<123?i-1:i-27]=i;d[229]=0;char w[99];while(cin>>w){for(i=0;w[i];i++)w[i]=tolower(w[i]);char*p=strtok(w,d);while(p)++f[p],p=strtok(0,d);}multimap<int,C>c;for(auto t=f.end();--t!=f.begin();)if(F!="the"O"and"O"of"O"to"O"a"O"i"O"it"O"in"O"or"O"is")c.insert(pair<int,C>(G,F));auto t=--c.end();float a=0,A;L A=F/(76.0-G.length()),a=a>A?a:A;t=--c.end();S( );L S(\n|)<<"| "<<G;}

Here's a second version that is more "C++" by using string, not char[] and strtok. It's a bit larger, at 669 (+22 vs above), but I can't get it smaller at the moment so thought I'd post it anyway.

#include <iostream>
#include <map>
using namespace std;
#define C string
#define S(x)v=F/a,cout<<#x<<C(v,'_')
#define F t->first
#define G t->second
#define O &&F!=
#define L for(i=22;i-->0;--t)
#define E e=w.find_first_of(d,g);g=w.find_first_not_of(d,e);
int main(){map<C,int>f;int i,v;C w,x,d="abcdefghijklmnopqrstuvwxyz";while(cin>>w){for(i=w.size();i-->0;)w[i]=tolower(w[i]);unsigned g=0,E while(g-e>0){x=w.substr(e,g-e),++f[x],E}}multimap<int,C>c;for(auto t=f.end();--t!=f.begin();)if(F!="the"O"and"O"of"O"to"O"a"O"i"O"it"O"in"O"or"O"is")c.insert(pair<int,C>(G,F));auto t=--c.end();float a=0,A;L A=F/(76.0-G.length()),a=a>A?a:A;t=--c.end();S( );L S(\n|)<<"| "<<G;}

I've removed the full version, because I can't be bothered to keep updating it with my tweaks to the minimised version. See edit history if you're interested in the (possibly outdated) long version.


If you're going to put an arbitrary limit on word length, you might as well make it 999 instead of 1024 and save a stroke. - Gabe
If you use float a=0,A;L A=F/(76.0-G.length()),a=a>A?a:A; you can eliminate a #define and shave a few strokes. - Gabe
@Gabe - thanks for that second one, trimmed a few extra away. As for word, having an arbitrary length doesn't really feel right - but I'm not sure of the best way to extract cin into a char array, as opposed to a string, without the risk of breaking in the middle of a word (ie, if I just pulled it in 80-char chunks). But I've put finding a "better" solution until probably tomorrow. - DMA57361
Isn't d[i-27]=0; the same as d[229]=0;? - Gabe
Why did you decide to use a char buffer instead of a string? - EvilTeach
You could save a space by making L{A=F/(76.0-G.length()),a=a>A?a:A;} into L A=F/(76.0-G.length()),a=a>A?a:A;. - Gabe
@EvilTeach - so I could use strtok. I'm not aware of a C++ string tokenising function (see stackoverflow.com/questions/53849/…) that would take "lots" of delimiters, and needed a reliable method to split words like "don't" on the punctuation. @Gabe - nice catch (again, thanks!) on d[229], as for the second suggestion - you'd already given that earlier and I obviously hadn't paid sufficient attention... - DMA57361
31
[+3] [2010-07-05 02:13:24] kriss

Yet another python 2.x - 206 chars (or 232 with 'width bar')

I believe this one is fully compliant with the question. The ignore list is included, and it fully checks line length (see the example, where I replaced Alice with Aliceinwonderlandbylewiscarroll throughout the text, making the fifth item the longest line). Even the filename is provided from the command line instead of being hardcoded (hardcoding it would remove about 10 chars). It has one drawback (which I believe is acceptable under the question's rules): because it computes an integer divisor to keep every line shorter than 80 chars, the longest line ends up shorter than 80 characters, not exactly 80 characters. The Python 3.x version does not have this defect (but is much longer).

Also I believe it is not so hard to read.

import sys,re
t=re.split("\W+(?:(?:the|and|o[fr]|to|a|i[tns]?)\W+)*",sys.stdin.read().lower())
b=sorted((-t.count(x),x)for x in set(t))[:22]
for l,w in b:print"|"+l/min(z/(78-len(e))for z,e in b)*'-'+"|",w

|----------------------------------------------------------------| she
|--------------------------------------------------------| you
|-----------------------------------------------------| said
|----------------------------------------------| aliceinwonderlandbylewiscarroll
|-----------------------------------------| was
|--------------------------------------| that
|-------------------------------| as
|----------------------------| her
|--------------------------| at
|--------------------------| with
|-------------------------| s
|-------------------------| t
|-----------------------| on
|-----------------------| all
|---------------------| this
|--------------------| for
|--------------------| had
|--------------------| but
|-------------------| be
|-------------------| not
|------------------| they
|-----------------| so
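
The integer-divisor drawback mentioned above can be seen by comparing it with an exact scale. The following Python 3 sketch is modelled on the golfed code but is not that code: the helper names are hypothetical, only three counts from the question are used, and Fraction is my own choice to keep the exact variant free of floating-point rounding.

```python
import math
from fractions import Fraction


def int_divisor_bars(pairs, total=78):
    # Smallest integer divisor that keeps every bar within its budget
    d = max(math.ceil(c / (total - len(w))) for w, c in pairs)
    return [c // d for _, c in pairs]


def exact_scale_bars(pairs, total=78):
    # Largest scale factor that still fits every bar, kept as a fraction
    s = min(Fraction(total - len(w), c) for w, c in pairs)
    return [int(c * s) for _, c in pairs]


pairs = [("she", 553), ("you", 481), ("said", 462)]
coarse = int_divisor_bars(pairs)
fine = exact_scale_bars(pairs)
```

Here the divisor has to be rounded up to 8, so the top bar gets 69 characters, while the exact scale lets it use all 75.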

It is not clear whether we must print the max bar alone on its own line (as in the sample output). Below is another version that does so, at 232 chars.

import sys,re
t=re.split("\W+(?:(?:the|and|o[fr]|to|a|i[tns]?)\W+)*",sys.stdin.read().lower())
b=sorted((-t.count(x),x)for x in set(t))[:22]
f=min(z/(78-len(e))for z,e in b)
print"",b[0][0]/f*'-'
for y,w in b:print"|"+y/f*'-'+"|",w

Python 3.x - 256 chars

With the Counter class from Python 3.x, there were high hopes of making it shorter (as Counter does everything that we need here). It turns out it's not better. Below is my attempt, at 266 chars:

import sys,re,collections as c
b=c.Counter(re.split("\W+(?:(?:the|and|o[fr]|to|a|i[tns]?)\W+)*",
sys.stdin.read().lower())).most_common(22)
F=lambda p,x,w:print(p+'-'*int(x/max(z/(77.-len(e))for e,z in b))+w)
F(" ",b[0][1],"")
for w,y in b:F("|",y,"| "+w)

The problem is that collections and most_common are very long words and even Counter is not short... really, not using Counter makes the code only 2 characters longer ;-(

Python 3.x also introduces other constraints: dividing two integers no longer yields an integer (so we have to cast to int), print is now a function (so parentheses must be added), etc. That's why it comes out 22 characters longer than the Python 2.x version, but way faster. Maybe a more experienced Python 3.x coder will have ideas to shorten the code.
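
The two Python 3.x points above (Counter doing the counting, and division no longer producing an integer) can be shown in a few ungolfed lines. This is only an illustration with a made-up sentence and an arbitrary width of 70, not a replacement for the answer's code:

```python
from collections import Counter

counts = Counter("she said she said she was".split())
top = counts.most_common(2)        # list of (word, count), highest first

width = 70
true_div = top[1][1] * width / top[0][1]     # "/" gives a float in Python 3
floor_div = top[1][1] * width // top[0][1]   # "//" keeps it an int
bar = "-" * floor_div                        # a str can only repeat an int
```

Using "//" is one character longer than "/", but it avoids the int(...) cast that the float result would otherwise need.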


That's a clever way of sorting from high to low. - Ponkadoodle
In the Python 2 solution, you can do for l,w in b:print"|"+int(l/min(z/(76.-len(e))for z,e in b))*'-'+"|",w in the last line to align your lines correctly. This adds five characters and makes your code adhere to the rules ("Maximize bar width within these constraints and scale the bars appropriately"). You can, however, strip a few chars by importing os instead of sys and then doing os.read(0,1e9) instead of sys.stdin.read(). This makes for 208 chars overall, still one of the best solutions. - Colin Emonds
32
[+2] [2010-07-03 05:09:21] BalusC

Java - 991 chars (incl newlines and indentations)

I took the code of @seanizer [1], fixed a bug (he omitted the 1st output line), and made some improvements to make the code more 'golfy'.

import java.util.*;
import java.util.regex.*;
import org.apache.commons.io.IOUtils;
public class WF{
 public static void main(String[] a)throws Exception{
  String t=IOUtils.toString(new java.net.URL(a[0]).openStream());
  class W implements Comparable<W> {
   String w;int f=1;W(String W){w=W;}public int compareTo(W o){return o.f-f;}
   String d(float r){char[]c=new char[(int)(f/r)];Arrays.fill(c,'_');return "|"+new String(c)+"| "+w;}
  }
  Map<String,W>M=new HashMap<String,W>();
  Matcher m=Pattern.compile("\\b\\w+\\b").matcher(t.toLowerCase());
  while(m.find()){String w=m.group();W W=M.get(w);if(W==null)M.put(w,new W(w));else W.f++;}
  M.keySet().removeAll(Arrays.asList("the,and,of,to,a,i,it,in,or,is".split(",")));
  List<W>L=new ArrayList<W>(M.values());Collections.sort(L);int l=76-L.get(0).w.length();
  System.out.println(" "+new String(new char[l]).replace('\0','_'));
  for(W w:L.subList(0,22))System.out.println(w.d((float)L.get(0).f/(float)l));
 }
}

Output:

 _________________________________________________________________________
|_________________________________________________________________________| she
|_______________________________________________________________| you
|____________________________________________________________| said
|_____________________________________________________| alice
|_______________________________________________| was
|___________________________________________| that
|____________________________________| as
|________________________________| her
|_____________________________| with
|_____________________________| at
|____________________________| s
|____________________________| t
|__________________________| on
|__________________________| all
|_______________________| this
|_______________________| for
|_______________________| had
|_______________________| but
|______________________| be
|_____________________| not
|____________________| they
|____________________| so

[1] https://stackoverflow.com/questions/3169051/code-golf-word-frequency-chart/3169871#3169871

(1) new String(new char[l]).replace('\0','_') that's a nice trick to remember, thanks. - Sean Patrick Floyd
33
[+2] [2010-07-03 12:27:28] nico

R 449 chars

can probably get shorter...

bar <- function(w, l)
    {
    b <- rep("-", l)
    s <- rep(" ", l)
    cat(" ", b, "\n|", s, "| ", w, "\n ", b, "\n", sep="")
    }

f <- "alice.txt"
e <- c("the", "and", "of", "to", "a", "i", "it", "in", "or", "is", "")
w <- unlist(lapply(readLines(file(f)), strsplit, s=" "))
w <- tolower(w)
w <- unlist(lapply(w, gsub, pa="[^a-z]", r=""))
u <- unique(w[!w %in% e])
n <- unlist(lapply(u, function(x){length(w[w==x])}))
o <- rev(order(n))
n <- n[o]
m <- 77 - max(unlist(lapply(u[1:22], nchar)))
n <- floor(m*n/n[1])
u <- u[o]

for (i in 1:22)
    bar(u[i], n[i])

@Johannes Rössel: It is dynamic, just scaled to 100% = 60px = max length. E.g.: 1st word = 50 occurrences, 2nd word = 25 occurrences. 1st bar = 60 px, 2nd bar = 30 px - nico
@Johannes Rössel: Ok, I didn't read the part that said you should maximise the length, thought it just needed to fit 80 chars... now it works as intended :) Thanks for spotting that - nico
Well, it's the one thing most often done wrong in the answers here, I think. Took me also quite a while to figure out an elegant way of doing so. - Joey
34
[+2] [2010-07-03 13:12:32] Sean Patrick Floyd

Groovy, 424 389 378 321 chars

replaced b=map.get(a) with b=map[a], replaced split with matcher / iterator

def r,s,m=[:],n=0;def p={println it};def w={"_".multiply it};(new URL(this.args[0]).text.toLowerCase()=~/\b\w+\b/).each{s=it;if(!(s==~/(the|and|of|to|a|i[tns]?|or)/))m[s]=m[s]==null?1:m[s]+1};m.keySet().sort{a,b->m[b]<=>m[a]}.subList(0,22).each{k->if(n++<1){r=(m[k]/(76-k.length()));p" "+w(m[k]/r)};p"|"+w(m[k]/r)+"|"+k}

(executed as groovy script with the URL as cmd line arg. No imports required!)

Readable version here:

def r,s,m=[:],n=0;
def p={println it};
def w={"_".multiply it};
(new URL(this.args[0]).text.toLowerCase()
        =~ /\b\w+\b/
        ).each{
        s=it;
        if (!(s ==~/(the|and|of|to|a|i[tns]?|or)/))
            m[s] = m[s] == null ? 1 : m[s] + 1
        };
    m.keySet()
        .sort{
            a,b -> m[b] <=> m[a]
        }
        .subList(0,22).each{
            k ->
                if( n++ < 1 ){
                    r=(m[k]/(76-k.length()));
                    p " " + w(m[k]/r)
                };
                p "|" + w(m[k]/r) + "|" + k
}

35
[+2] [2010-07-03 14:04:26] user382714

Python 2.6, 273 269 267 266 characters.

(Edit: Props to ChristopheD for character-shaving suggestions)

import sys,re
t=re.findall('[a-z]+',"".join(sys.stdin).lower())
d=sorted((t.count(w),w)for w in set(t)-set("the and of to a i it in or is".split()))[:-23:-1]
r=min((78.-len(m[1]))/m[0]for m in d)
print'','_'*(int(d[0][0]*r-2))
for(a,b)in d:print"|"+"_"*(int(a*r-2))+"|",b

Output:

 _________________________________________________________________________
|_________________________________________________________________________| she
|_______________________________________________________________| you
|____________________________________________________________| said
|____________________________________________________| alice
|______________________________________________| was
|__________________________________________| that
|___________________________________| as
|_______________________________| her
|____________________________| with
|____________________________| at
|___________________________| s
|___________________________| t
|_________________________| on
|_________________________| all
|______________________| this
|______________________| for
|______________________| had
|_____________________| but
|____________________| be
|____________________| not
|___________________| they
|__________________| so

You could drop the square brackets in r=min([(78.0-len(m[1]))/m[0] for m in d]) (shaves off 2 characters: min((78.0-len(m[1]))/m[0] for m in d)). The same goes for the square brackets in line three: sorted([... - ChristopheD
Also in line three and four you can lose an unneeded space just before for (shaves off 2 characters). - ChristopheD
I like the way you abuse this print'', to print the starting space on the first line; clever ;-) - ChristopheD
Just realised I didn't need a following zero to declare a float on the fourth line. Is this the only Python entry that takes into account that some words might be significantly longer than the most common one? - user382714
Instead of 78 you can use 76 and save the two "-2"s; instead of m[0],m[1] you can use w and r by doing "for w,r in d". You can use \w instead of [a-z]. sys.stdin.read() is shorter. I like the idea of using commas! - 6502
Good points; however \w matches underscores, which is why I didn't use it. - user382714
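
The point about \w is easy to check. A short Python sketch with a made-up sentence:

```python
import re

text = "snake_case is not one word; neither is _private."

# \w also matches the underscore (and digits), so these stay glued together
with_w = re.findall(r"\w+", text.lower())

# [a-z] splits on the underscore, which is what the challenge text needs
letters_only = re.findall(r"[a-z]+", text.lower())
```

With \w+ the tokens snake_case and _private survive as single "words", while [a-z]+ yields snake, case and private.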
36
[+2] [2010-07-03 19:40:12] reso

MATLAB 335 404 410 bytes 357 bytes. 390 bytes.

The updated code is now 335 characters instead of 404, and seems to do well for both examples.


Original Message (For code of 404 characters)

This version is a bit longer, however, it will properly scale the length of the bars if there is a word that is ridiculously long so that none of the columns go over 80.

So, my code is 357 bytes without re-scaling, and 410 long with re-scaling.

A=textscan(fopen('11.txt'),'%s','delimiter',' 0123456789,.!?-_*^:;=+\\/(){}[]@&#$%~`|"''');
s=lower(A{1});s(cellfun('length', s)<2)=[];s(ismember(s,{'the','and','of','to','it','in','or','is'}))=[];
[w,~,i]=unique(s);N=hist(i,max(i)); [j,k]=sort(N,'descend'); b=k(1:22); n=cellfun('length',w(b));
q=80*N(b)'/N(k(1))+n; q=floor(q*78/max(q)-n); for i=1:22, fprintf('%s| %s\n',repmat('_',1,q(i)),w{k(i)});end

Results:

___________________________________________________________________________| she
_________________________________________________________________| you
______________________________________________________________| said
_______________________________________________________| alice
________________________________________________| was
____________________________________________| that
_____________________________________| as
_________________________________| her
______________________________| at
______________________________| with
____________________________| on
___________________________| all
_________________________| this
________________________| for
________________________| had
________________________| but
_______________________| be
_______________________| not
_____________________| they
____________________| so
___________________| very
___________________| what

For example, replacing all instances of "you" in the Alice in Wonderland text with "superlongstringofridiculousness", my code will correctly scale the results:

____________________________________________________________________| she
_________________________________________________________| superlongstringstring
________________________________________________________| said
_________________________________________________| alice
____________________________________________| was
________________________________________| that
_________________________________| as
______________________________| her
___________________________| with
___________________________| at
_________________________| on
________________________| all
_____________________| this
_____________________| for
_____________________| had
_____________________| but
____________________| be
____________________| not
__________________| they
__________________| so
_________________| very
_________________| what

Here is the updated code written a little bit more legibly:

A=textscan(fopen('t'),'%s','delimiter','':'@');
s=lower(A{1});
s(cellfun('length', s)<2|ismember(s,{'the','and','of','to','it','in','or','is'}))=[];
[w,~,i]=unique(s);
N=hist(i,max(i)); 
[j,k]=sort(N,'descend'); 
n=cellfun('length',w(k));
q=80*N(k)'/N(k(1))+n; 
q=floor(q*78/max(q)-n); 
for i=1:22, 
    fprintf('%s| %s\n',repmat('_',1,q(i)),w{k(i)});
end

Kudos for implementing the spec completely! (I would upvote but I've run out of votes for today...) - ChristopheD
(10) shouldn't the bar for "superlongstringofridiculousness" be longer than the bar for "said"? - Bwmat
@Bwmat: ahh!! good eye! back to the drawing board... - reso
(1) you could save 18 chars by replacing the delimiter string with: char([32:64 91:96 123:126]) - Amro
@Amro: hey, thanks for the tip, that is great. One day I will go back and fix the bug that Bwmat spotted and add that to it as well - reso
Fixed typo and shortened code quite a bit, did not look at most of the logic, just tried to go for shorter syntax. - Dennis Jaheruddin
37
[+2] [2010-07-04 15:58:56] mb14

Shell, 228 characters, with the 80-character constraint working

tr A-Z a-z|tr -Cs a-z "\n"|sort|egrep -v "^(the|and|of|to|a|i|it|in|or|is)$" |uniq -c|sort -r|head -22>g
n=1
while :
do
awk '{printf "|%0*s| %s\n",$1*'$n'/1e3,"",$2;}' g|tr 0 _>o 
egrep -q .{80} o&&break
n=$((n+1))
done
cat o

I'm surprised nobody seems to have used the amazing * feature of printf.
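
For readers unfamiliar with the `*` width specifier: it takes the field width from the argument list instead of hard-coding it in the format string. Python's `%` operator supports the same specifier, so the idea can be sketched as follows. This is only an illustration, not the shell code above; the helper name `bar` is hypothetical, and it uses `%0*d` with the integer 0 rather than `%0*s` with an empty string.

```python
# Sketch of the printf "*" width trick, using Python's printf-style % operator.
# `bar` is a hypothetical helper name used only for this illustration.

def bar(width, word):
    # "%0*d" consumes two arguments: the field width, then the integer to print.
    # Printing 0 zero-padded to `width` yields `width` zeros, which are then
    # turned into underscores (the shell version does this with `tr 0 _`).
    # Assumes `word` contains no "0" characters, which holds for a-z words.
    return ("|%0*d| %s" % (width, 0, word)).replace("0", "_")


print(bar(10, "she"))
print(bar(4, "you"))
```

Note that a width of 0 still prints a single character, because the integer 0 itself needs one column.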

cat 11-very.txt > golf.sh

|__________________________________________________________________________| she
|________________________________________________________________| you
|_____________________________________________________________| said
|______________________________________________________| alice
|_______________________________________________| was
|____________________________________________| that
|____________________________________| as
|_________________________________| her
|______________________________| with
|______________________________| at
|_____________________________| s
|_____________________________| t
|___________________________| on
|__________________________| all
|________________________| this
|_______________________| for
|_______________________| had
|_______________________| but
|______________________| be
|______________________| not
|____________________| they
|____________________| so

cat 11 | golf.sh

|_________________________________________________________________| she
|_________________________________________________________| verylongstringstring
|______________________________________________________| said
|_______________________________________________| alice
|__________________________________________| was
|_______________________________________| that
|________________________________| as
|_____________________________| her
|___________________________| with
|___________________________| at
|__________________________| s
|_________________________| t
|________________________| on
|_______________________| all
|_____________________| this
|_____________________| for
|_____________________| had
|____________________| but
|___________________| be
|___________________| not
|__________________| they
|__________________| so

Missing the very first line in the output (the top line of the first bar). Also couldn't you just sort ascending and then use the last 22 lines instead? Dunno whether that would make it shorter here but for me it was a serious consideration. - Joey
I know about the first point. I just don't see a simple way to do it, and I wasn't sure it was really mandatory. I could indeed skip the reverse sort, but then the output would be inverted (she on the last line) - mb14
38
[+2] [2010-07-06 02:47:26] Daniel C. Sobral

Scala, 327 characters

This was adapted from mkneissl's answer [1] inspired by a Python version, though it is bigger. I'm leaving it here in case someone can make it shorter.

val f="\\w+\\b(?<!\\bthe|and|of|to|a|i[tns]?|or)".r.findAllIn(io.Source.fromFile("11.txt").mkString.toLowerCase).toSeq
val t=f.toSet[String].map(x=> -f.count(x==)->x).toSeq.sorted take 22
def b(p:Int)="_"*(-p/(for((c,w)<-t)yield-c/(76.0-w.size)).max).toInt
println(" "+b(t(0)._1))
for(p<-t)printf("|%s| %s \n",b(p._1),p._2)
[1] https://stackoverflow.com/questions/3169051/code-golf-word-frequency-chart/3182550#3182550

39
[+2] [2010-07-08 18:53:08] Baishampayan Ghose

Clojure - 611 chars (not minimized)

I tried to write the code in Clojure as idiomatic as I could manage so late at night. I am not too proud of the draw-chart function, but I guess the code will speak volumes about the succinctness of Clojure.

(ns word-freq
(:require [clojure.contrib.io :as io]))

(defn word-freq
  [f]
  (take 22 (->> f
                io/read-lines ;;; slurp should work too, but I love map/red
                (mapcat (fn [l] (map #(.toLowerCase %) (re-seq #"\w+" l))))
                (remove #{"the" "and" "of" "to" "a" "i" "it" "in" "or" "is"})
                (reduce #(assoc %1 %2 (inc (%1 %2 0))) {})
                (sort-by (comp - val)))))

(defn draw-chart
  [fs]
  (let [[[w f] & _] fs]
    (apply str
           (interpose \newline
                      (map (fn [[k v]] (apply str (concat "|" (repeat (int (* (- 76 (count w)) (/ v f 1))) "_") "| " k " ")) ) fs)))))

;;; (println (draw-chart (word-freq "/Users/ghoseb/Desktop/alice.txt")))

Output:

|_________________________________________________________________________| she 
|_______________________________________________________________| you 
|____________________________________________________________| said 
|____________________________________________________| alice 
|_______________________________________________| was 
|___________________________________________| that 
|____________________________________| as 
|________________________________| her 
|_____________________________| with 
|_____________________________| at 
|____________________________| t 
|____________________________| s 
|__________________________| on 
|__________________________| all 
|_______________________| for 
|_______________________| had 
|_______________________| this 
|_______________________| but 
|______________________| be 
|_____________________| not 
|____________________| they 
|____________________| so

I know, this doesn't follow the spec, but hey, this is some very clean Clojure code which is already so small :)


40
[+1] [2010-07-03 00:54:35] Sean Patrick Floyd

Java, slowly getting shorter (1500 1358 1241 1020 913 890 chars)

Stripped even more whitespace and shortened variable names. Removed generics where possible, and removed the inline class and a try/catch block. Too bad: my 900 version had a bug.

Removed another try/catch block.

import java.net.*;import java.util.*;import java.util.regex.*;import org.apache.commons.io.*;public class G{public static void main(String[]a)throws Exception{String text=IOUtils.toString(new URL(a[0]).openStream()).toLowerCase().replaceAll("\\b(the|and|of|to|a|i[tns]?|or)\\b","");final Map<String,Integer>p=new HashMap();Matcher m=Pattern.compile("\\b\\w+\\b").matcher(text);Integer b;while(m.find()){String w=m.group();b=p.get(w);p.put(w,b==null?1:b+1);}List<String>v=new Vector(p.keySet());Collections.sort(v,new Comparator(){public int compare(Object l,Object m){return p.get(m)-p.get(l);}});boolean t=true;float r=0;for(String w:v.subList(0,22)){if(t){t=false;r=p.get(w)/(float)(80-(w.length()+4));System.out.println(" "+new String(new char[(int)(p.get(w)/r)]).replace('\0','_'));}System.out.println("|"+new String(new char[(int)(((Integer)p.get(w))/r)]).replace('\0','_')+"|"+w);}}}

Readable version:

import java.net.*;
import java.util.*;
import java.util.regex.*;
import org.apache.commons.io.*;

public class G{

    public static void main(String[] a) throws Exception{
        String text =
            IOUtils.toString(new URL(a[0]).openStream())
                .toLowerCase()
                .replaceAll("\\b(the|and|of|to|a|i[tns]?|or)\\b", "");
        final Map<String, Integer> p = new HashMap();
        Matcher m = Pattern.compile("\\b\\w+\\b").matcher(text);
        Integer b;
        while(m.find()){
            String w = m.group();
            b = p.get(w);
            p.put(w, b == null ? 1 : b + 1);
        }
        List<String> v = new Vector(p.keySet());
        Collections.sort(v, new Comparator(){

            public int compare(Object l, Object m){
                return p.get(m) - p.get(l);
            }
        });
        boolean t = true;
        float r = 0;
        for(String w : v.subList(0, 22)){
            if(t){
                t = false;
                r = p.get(w) / (float) (80 - (w.length() + 4));
                System.out.println(" "
                    + new String(new char[(int) (p.get(w) / r)]).replace('\0',
                        '_'));
            }
            System.out.println("|"
                + new String(new char[(int) (((Integer) p.get(w)) / r)]).replace('\0',
                    '_') + "|" + w);
        }
    }
}

(7) I like goofball high-character-count golf submissions. It's good to break up the monotony of line noise with something readable and almost laughably verbose. - John Y
(3) @John: I disagree. Even if you are going to use a verbose language (see my fortran 77 entries in some earlier code golfs for instance) you should code it as tightly as the language allows. Code golf isn't about good practices; indeed it is very nearly the antithesis of good practice. - dmckee --- ex-moderator kitten
(2) @dmckee: I completely understand and accept your viewpoint. Still, I personally like to see just about any submission. Variety is the spice of life, and to me that even includes differing (even opposing) spirit and ideals in code golf. Better to dance, but dance "poorly" (for whatever definition of dance), than to stand in the corner or worse yet, not even show up. - John Y
41
[+1] [2010-07-03 01:01:53] user216441

Javascript, 348 characters

After I finished mine, I stole some ideas from Matt :3

t=prompt().toLowerCase().replace(/\b(the|and|of|to|a|i[tns]?|or)\b/gm,'');r={};o=[];t.replace(/\b([a-z]+)\b/gm,function(a,w){r[w]?++r[w]:r[w]=1});for(i in r){o.push([i,r[i]])}m=o[0][1];o=o.slice(0,22);o.sort(function(F,D){return D[1]-F[1]});for(B in o){F=o[B];L=new Array(~~(F[1]/m*(76-F[0].length))).join('_');print(' '+L+'\n|'+L+'| '+F[0]+' \n')}

Requires print and prompt function support.


This will have some problems with strings like the_foo, right? (Because then \b breaks apart) - Joey
42
[+1] [2010-07-03 07:29:46] Joshua Weinberg

Gotta love the big ones...Objective-C (1070 931 905 chars)

#define S NSString
#define C countForObject
#define O objectAtIndex
#define U stringWithCString
main(int g,char**b){id c=[NSCountedSet set];S*d=[S stringWithContentsOfFile:[S U:b[1]]];id p=[NSPredicate predicateWithFormat:@"SELF MATCHES[cd]'(the|and|of|to|a|i[tns]?|or)|[^a-z]'"];[d enumerateSubstringsInRange:NSMakeRange(0,[d length])options:NSStringEnumerationByWords usingBlock:^(S*s,NSRange x,NSRange y,BOOL*z){if(![p evaluateWithObject:s])[c addObject:[s lowercaseString]];}];id s=[[c allObjects]sortedArrayUsingComparator:^(id a,id b){return(NSComparisonResult)([c C:b]-[c C:a]);}];g=[c C:[s O:0]];int j=76-[[s O:0]length];char*k=malloc(80);memset(k,'_',80);S*l=[S U:k length:80];printf(" %s\n",[[l substringToIndex:j]cString]),[[s subarrayWithRange:NSMakeRange(0,22)]enumerateObjectsUsingBlock:^(id a,NSUInteger x,BOOL*y){printf("|%s| %s\n",[[l substringToIndex:[c C:a]*j/g]cString],[a cString]);}];}

Switched to using a lot of deprecated APIs, removed some memory management that wasn't needed, more aggressive whitespace removal

 _________________________________________________________________________
|_________________________________________________________________________| she
|______________________________________________________________| said
|__________________________________________________________| you
|____________________________________________________| alice
|________________________________________________| was
|_______________________________________| that
|____________________________________| as
|_________________________________| her
|______________________________| with
|______________________________| at
|___________________________| on
|__________________________| all
|________________________| this
|________________________| for
|________________________| had
|_______________________| but
|______________________| be
|______________________| not
|____________________| so
|___________________| very
|__________________| what
|_________________| they

(1) Note that the spec calls for ignoring 's, so "don't" parses as two words "don" and "t". You'll see in the reference implementation that "s" and "t" are represented in the top 22... - dmckee --- ex-moderator kitten
Kudos for doing it in obj-c (not a language you see often in code golfing)! - ChristopheD
@Christophe: And here we see exactly why we don't see it that often ;) - Joey
Try #define S NSString, #define C countForObject, and use these two appropriately. Also replace calloc(80,1) with simply malloc(80), since you're setting the contents straight afterwards. Also, reuse the a parameter, to save on an int declaration. This should get it less than 1,000 chars... - please delete me
@brone thanks for the idea, took those, and some other extra stuff I saw, well below 1000 now - Joshua Weinberg
Use id instead of NSCountedSet* etc! - kennytm
Holy hell, how did I not think of that....edited to fix, 905 :) - Joshua Weinberg
43
[+1] [2010-07-03 16:38:16] dhruvbird

Python, 320 characters

import sys
i="the and of to a i it in or is".split()
d={}
for j in filter(lambda x:x not in i,sys.stdin.read().lower().split()):d[j]=d.get(j,0)+1
w=sorted(d.items(),key=lambda x:x[1])[:-23:-1]
m=sorted(dict(w).values())[-1]
print" %s\n"%("_"*(76-m)),"\n".join(map(lambda x:("|%s| "+x[0])%("_"*((76-m)*x[1]/w[0][1])),w))

44
[+1] [2010-07-04 02:57:14] Andrei

R, 298 chars

f=scan("stdin","ch")
u=unlist
s=strsplit
a=u(s(u(s(tolower(f),"[^a-z]")),"^(the|and|of|to|it|in|or|is|.|)$"))
v=unique(a)
r=sort(sapply(v,function(i) sum(a==i)),T)[2:23]  #the first item is an empty string, just skipping it
w=names(r)
q=(78-max(nchar(w)))*r/max(r)
cat(" ",rep("_",q[1])," \n",sep="")
for(i in 1:22){cat("|",rep("_",q[i]),"| ",w[i],"\n",sep="")}

The output is:

 _________________________________________________________________________ 
|_________________________________________________________________________| she
|_______________________________________________________________| you
|____________________________________________________________| said
|_____________________________________________________| alice
|_______________________________________________| was
|___________________________________________| that
|____________________________________| as
|________________________________| her
|_____________________________| at
|_____________________________| with
|__________________________| on
|__________________________| all
|_______________________| this
|_______________________| for
|_______________________| had
|_______________________| but
|______________________| be
|_____________________| not
|____________________| they
|____________________| so
|___________________| very
|__________________| what

And if "you" is replaced by something longer:

 ____________________________________________________________ 
|____________________________________________________________| she
|____________________________________________________| veryverylongstring
|__________________________________________________| said
|___________________________________________| alice
|______________________________________| was
|___________________________________| that
|_____________________________| as
|__________________________| her
|________________________| at
|________________________| with
|______________________| on
|_____________________| all
|___________________| this
|___________________| for
|___________________| had
|__________________| but
|__________________| be
|__________________| not
|________________| they
|________________| so
|_______________| very
|_______________| what

(1) This is not doing the maximum scaling - 6502
45
[+1] [2010-07-04 07:26:43] 6502

Python 290, 255, 253


290 characters in python (text read from standard input)

import sys,re
c={}
for w in re.findall("[a-z]+",sys.stdin.read().lower()):c[w]=c.get(w,0)+1-(","+w+","in",a,i,the,and,of,to,it,in,or,is,")
r=sorted((-v,k)for k,v in c.items())[:22]
sf=max((76.0-len(k))/v for v,k in r)
print" "+"_"*int(r[0][0]*sf)
for v,k in r:print"|"+"_"*int(v*sf)+"| "+k

but... after reading other solutions I all of a sudden realized that efficiency was not a requirement; so this is another shorter and much slower one (255 characters)

import sys,re
w=re.findall("\w+",sys.stdin.read().lower())
r=sorted((-w.count(x),x)for x in set(w)-set("the and of to a i it in or is".split()))[:22]
f=max((76.-len(k))/v for v,k in r)
print" "+"_"*int(f*r[0][0])
for v,k in r:print"|"+"_"*int(f*v)+"| "+k

and after some more reading other solutions...

import sys,re
w=re.findall("\w+",sys.stdin.read().lower())
r=sorted((-w.count(x),x)for x in set(w)-set("the and of to a i it in or is".split()))[:22]
f=max((76.-len(k))/v for v,k in r)
print"","_"*int(f*r[0][0])
for v,k in r:print"|"+"_"*int(f*v)+"|",k

And now this solution is almost byte-per-byte identical to Astatine's one :-D
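
The first (290-character) version above filters stop words without a separate pass: it subtracts a boolean from the increment, so a stop word gains 1 - 1 = 0 per occurrence. A minimal Python 3 sketch of that trick follows; the names `STOP` and `count_words` are introduced here for illustration and are not part of the golfed code.

```python
import re

# Comma-delimited stop list: wrapping each word in commas means "in" can only
# match the list entry ",in," and never a fragment of a longer entry.
STOP = ",a,i,the,and,of,to,it,in,or,is,"


def count_words(text):
    counts = {}
    for w in re.findall("[a-z]+", text.lower()):
        # The membership test is True (== 1) for stop words and False (== 0)
        # otherwise, so stop words are incremented by 1 - 1 = 0.
        counts[w] = counts.get(w, 0) + 1 - (("," + w + ",") in STOP)
    return counts


c = count_words("The cat and the hat. The CAT sat.")
print(c)
```

Stop words still end up in the dictionary, but with a count of 0, so they sort after every real word and fall out of the top 22 as long as the text has at least 22 other distinct words.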


I worked out a very similar solution. Looking at yours there seem to be ways to merge both, you thought of some tricks I didn't... - kriss
46
[+1] [2010-07-05 01:50:49] user383340

Object Rexx 4.0 with PC-Pipes

The PC-Pipes library can be found at [1].
This solution ignores single letter words.


address rxpipe 'pipe (end ?) < Alice.txt',
   '|regex split /[^a-zA-Z]/', -- split at non-alphabetic characters
   '|locate 2',                -- discard words shorter than 2 chars
   '|xlate lower',             -- translate all words to lower case
   ,                           -- discard list words that match list
   '|regex not match /^(the||and||of||to||it||in||or||is)$/',
   '|l:lookup autoadd before count',  -- accumulate and count words
 '? l:',                       -- no master records to feed into lookup 
 '? l:',                       -- list of counted words comes here
   ,                           -- columns 1-10 hold count, 11-n hold word
   '|sort 1.10 d',             -- sort in descending order by count
   '|take 22',                 -- take first 22 records only
   '|array wordlist',          -- store into a rexx array
   '|count max',               -- get length of longest record 
   '|var maxword'              -- save into a rexx variable

parse value wordlist[1] with count 11 .  -- get frequency of first word
barunit = count % (76-(maxword-10))      -- frequency units per chart bar char

say ' '||copies('_', (count+barunit)%barunit)  -- first line of the chart
do cntwd over wordlist                    
  parse var cntwd count 11 word          -- get word frequency and the word
  say '|'||copies('_', (count+barunit)%barunit)||'| '||word||' '
end

The output produced:

 ________________________________________________________________________________
|________________________________________________________________________________| she
|_____________________________________________________________________| you
|___________________________________________________________________| said
|__________________________________________________________| alice
|____________________________________________________| was
|________________________________________________| that
|________________________________________| as
|____________________________________| her
|_________________________________| at
|_________________________________| with
|______________________________| on
|_____________________________| all
|__________________________| this
|__________________________| for
|__________________________| had
|__________________________| but
|________________________| be
|________________________| not
|_______________________| they
|______________________| so
|_____________________| very
|_____________________| what
[1] http://ipages.iland.net/~jimj

How long is the solution (number of characters)? This is code golf, after all. - Nas Banov
47
[+1] [2010-07-05 05:06:41] user383392

Ruby, 205


This Ruby version handles "superlongstringstring". (The first two lines are almost identical to the previous Ruby programs.)

It must be run this way:

ruby -n0777 golf.rb Alice.txt


W=($_.upcase.scan(/\w+/)-%w(THE AND OF TO A I IT
IN OR IS)).group_by{|x|x}.map{|k,v|[-v.size,k]}.sort[0,22]
u=proc{|m|"_"*(W.map{|n,s|(76.0-s.size)/n}.max*m)}
puts" "+u[W[0][0]],W.map{|n,s|"|%s| "%u[n]+s}

The third line creates a closure or lambda that yields a correctly scaled string of underscores:

u = proc{|m|
  "_" *
    (W.map{|n,s| (76.0 - s.size)/n}.max * m)
}

.max is used instead of .min because the numbers are negative.
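
The scaling rule behind that closure can be sketched in Python 3 as well. This is a hedged illustration rather than a translation: it uses positive counts, so the tightest ratio is found with `min` instead of `.max`, and the names `scale`, `chart` and `sample` are made up for the example. The counts for "she" and "said" come from the reference frequencies in the question; the long word reuses the count of "you".

```python
def scale(pairs, width=76):
    # pairs: list of (count, word), most frequent first.
    # width=76 assumes an 80-column line minus two "|" and two spaces.
    # Each word allows at most (width - len(word)) underscores for its own
    # count, so the smallest ratio is the largest factor that fits every row.
    return min((width - len(word)) / count for count, word in pairs)


def chart(pairs, width=76):
    f = scale(pairs, width)
    # Top line: one leading space, then the outline of the first bar.
    lines = [" " + "_" * int(f * pairs[0][0])]
    for count, word in pairs:
        lines.append("|" + "_" * int(f * count) + "| " + word + " ")
    return lines


sample = [(553, "she"), (481, "superlongstringofridiculousness"), (462, "said")]
for line in chart(sample):
    print(line)
```

Here the long second word, not the most frequent one, determines the factor, which is exactly the case that scaling by the first word alone gets wrong.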


Implementing the full spec and still very short (213 characters at the moment according to wc -c), nice work! - ChristopheD
48
[+1] [2010-07-06 00:29:29] user384075

Bourne shell, 213/240 characters

Improving on the shell version posted earlier, I can get it down to 213 characters:

tr A-Z a-z|tr -Cs a-z \\n|sort|egrep -v '^(the|and|of|to|a|i|it|in|or|is)$'|uniq -c|sort -rn|sed 22q>g
n=1
>o
until egrep -q .{80} o
do
awk '{printf "|%0*d| %s\n",$1*'$n'/1e3,0,$2}' g|tr 0 _>o 
((n++))
done
cat o

In order to get the upper outline on the top bar, I had to expand it to 240 characters:

tr A-Z a-z|tr -Cs a-z \\n|sort|egrep -v "^(the|and|of|to|a|i|it|in|or|is)$"|uniq -c|sort -r|sed 1p\;22q>g
n=1
>o
until egrep -q .{80} o
do
awk '{printf "|%0*d| %s\n",$1*'$n'/1e3,0,NR==1?"":$2}' g|sed '1s,|, ,g'|tr 0 _>o 
((n++))
done
cat o

49
[+1] [2010-07-06 01:14:43] mvds

shell, grep, tr, grep, sort, uniq, sort, head, perl - 194 chars

Adding some -i flags may drop the overly long tr A-Z a-z| step; the spec said nothing about the case displayed, and uniq -ci drops any case differences.

egrep -oi [a-z]+|egrep -wiv 'the|and|o[fr]|to|a|i[tns]?'|sort|uniq -ci|sort -nr|head -22|perl -lape'($f,$w)=@F;$.>1or($q,$x)=($f,76-length$w);$b="_"x($f/$q*$x);$_="|$b| $w ";$.>1or$_=" $b\n$_"'

That's minus 11 for the tr plus 2 for the -i's compared to the original 206 chars.

edit: minus 3 for the \\b which can be left out as pattern matching will commence on a boundary anyway.

sort gives lower case first, and uniq -ci takes the first occurrence, so the only real change in output will be that Alice retains her upper case initial.


The bar length constraint isn't working. - Joey
50
[+1] [2010-07-06 01:49:12] user384098

Go, 613 chars, could probably be much smaller:

package main
import(r "regexp";. "bytes";. "io/ioutil";"os";st "strings";s "sort";. "container/vector")
type z struct{c int;w string}
func(e z)Less(o interface{})bool{return o.(z).c<e.c}
func main(){b,_:=ReadAll(os.Stdin);g:=r.MustCompile
c,m,x:=g("[A-Za-z]+").AllMatchesIter(b,0),map[string]int{},g("the|and|of|it|in|or|is|to")
for w:=range c{w=ToLower(w);if len(w)>1&&!x.Match(w){m[string(w)]++}}
o,y:=&Vector{},0
for k,v:=range m{o.Push(z{v,k});if v>y{y=v}}
s.Sort(o)
for i,v:=range *o{if i>21{break};x:=v.(z);c:=int(float(x.c)/float(y)*80)
u:=st.Repeat("_",c);if i<1{println(" "+u)};println("|"+u+"| "+x.w)}}

I feel so dirty.


51
[+1] [2010-07-06 02:46:04] mvds

perl, 188 characters

The perl version above (as well as any regexp splitting based version) can get a few bytes shorter by including the list of forbidden words as negative lookahead assertions, rather than as a separate list. Furthermore the trailing semicolon can be left out.

I also included some other suggestions (- instead of <=>, for/foreach, dropped "keys") to get to

$c{$_}++for grep{$_}map{lc=~/\b(?!(?:the|and|a|of|or|i[nts]?|to)\b)[a-z]+/g}<>;@s=sort{$c{$b}-$c{$a}}%c;$f=76-length$s[0];say$"."_"x$f;say"|"."_"x($c{$_}/$c{$s[0]}*$f)."| $_ "for@s[0..21]

I don't know perl, but I presume that the (?!(?:...)\b) may lose the ?: if the handling around it is fixed.
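
The negative-lookahead idea carries over to other regex engines. A small Python 3 sketch, using the same pattern as the Perl line above (the names `WORD` and `words` are introduced only for this illustration):

```python
import re

# The lookahead (?!...) sits right after the leading \b and rejects a position
# if a stop word followed by a word boundary starts there. Everything else is
# matched by [a-z]+, so no separate filtering pass is needed.
WORD = re.compile(r"\b(?!(?:the|and|a|of|or|i[nts]?|to)\b)[a-z]+")


def words(text):
    return WORD.findall(text.lower())


print(words("The cat and I sat in it; to theorize is an art."))
# prints ['cat', 'sat', 'theorize', 'an', 'art']
```

In Python the `?:` cannot simply be dropped: `findall` returns the contents of capturing groups when the pattern has any, so a capturing group inside the lookahead would change what comes back.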


This throws a syntax error for me: »String found where operator expected at c.pl line 1, near "say"|"" syntax error at c.pl line 1, near "say"|"" Search pattern not terminated at c.pl line 1.« (Perl 5.10.1). Also the code looks like the bar length constraint isn't working. And it may also well be that strings such as foo_the_bar won't get the stop words removed (because of \b). - Joey
52
[+1] [2010-07-06 11:55:50] ShinTakezou

GNU Smalltalk (386)

I think it can be made a little bit shorter, but still no idea how.

|q s f m|q:=Bag new. f:=FileStream stdin. m:=0.[f atEnd]whileFalse:[s:=f nextLine.(s notNil)ifTrue:[(s tokenize:'\W+')do:[:i|(((i size)>1)&({'the'.'and'.'of'.'to'.'it'.'in'.'or'.'is'}includes:i)not)ifTrue:[q add:(i asLowercase)]. m:=m max:(i size)]]].(q:=q sortedByCount)from:1to:22 do:[:i|'|'display.((i key)*(77-m)//(q first key))timesRepeat:['='display].('| %1'%{i value})displayNl]

53
[+1] [2010-07-09 08:32:19] Kristofer

Lua solution: 478 characters.

t,u={},{}for l in io.lines()do
for w in l:gmatch("%a+")do
w=w:lower()if not(" the and of to a i it in or is "):find(" "..w.." ")then
t[w]=1+(t[w]or 0)end
end
end
for k,v in next,t do
u[#u+1]={k,v}end
table.sort(u,function(a,b)return a[2]>b[2]end)m,n=u[1][2],math.min(#u,22)for w=80,1,-1 do
s=""for i=1,n do
a,b=u[i][1],w*u[i][2]/m
if b+#a>=78 then s=nil break end
s2=("_"):rep(b)if i==1 then
s=s.." " ..s2.."\n"end
s=s.."|"..s2.."| "..a.."\n"end
if s then print(s)break end end

Readable version:

t,u={},{}
for line in io.lines() do
    for w in line:gmatch("%a+") do
        w = w:lower()
        if not (" the and of to a i it in or is "):find(" "..w.." ") then
            t[w] = 1 + (t[w] or 0)
        end
    end
end
for k, v in pairs(t) do
    u[#u+1]={k, v}
end

table.sort(u, function(a, b)
    return a[2] > b[2]
end)

local max = u[1][2]
local n = math.min(#u, 22)

for w = 80, 1, -1 do
    s=""
    for i = 1, n do
        f = u[i][2]
        word = u[i][1]
        width = w * f / max
        if width + #word >= 78 then
            s=nil
            break
        end
        s2=("_"):rep(width)
        if i==1 then
            s=s.." " .. s2 .."\n"
        end
        s=s.."|" .. s2 .. "| " .. word.."\n"
    end
    if s then
        print(s)
        break
    end
end

54
[+1] [2010-07-11 05:34:07] RHSeeger

TCL 554 Strict

foreach w [regexp -all -inline {[a-z]+} [string tolower [read stdin]]] {if {[lsearch {the and of to it in or is a i} $w]>=0} {continue};if {[catch {incr Ws($w)}]} {set Ws($w) 1}}
set T [lrange [lsort -decreasing -stride 2 -index 1 -integer [array get Ws]] 0 43]
foreach {w c} $T {lappend L [string length $w];lappend C $c}
set N [tcl::mathfunc::max {*}$L]
set C [lsort -integer $C]
set M [lindex $C end]
puts " [string repeat _ [expr {int((76-$N) * [lindex $T 1] / $M)}]] "
foreach {w c} $T {puts "|[string repeat _ [expr {int((76-$N) * $c / $M)}]]| $w"}

Or, more legibly

foreach w [regexp -all -inline {[a-z]+} [string tolower [read stdin]]] {
    if {[lsearch {the and of to a i it in or is} $w] >= 0} { continue }
    if {[catch {incr words($w)}]} {
        set words($w) 1
    }
}
set topwords [lrange [lsort -decreasing -stride 2 -index 1 -integer [array get words]] 0 43]
foreach {word count} $topwords {
    lappend lengths [string length $word]
    lappend counts $count
}
set maxlength [lindex [lsort -integer $lengths] end]
set counts [lsort -integer $counts]
set mincount [lindex $counts 0].0
set maxcount [lindex $counts end].0
puts " [string repeat _ [expr {int((76-$maxlength) * [lindex $topwords 1] / $maxcount)}]] "
foreach {word count} $topwords {
    set barlength [expr {int((76-$maxlength) * $count / $maxcount)}]
    puts "|[string repeat _ $barlength]| $word"
}

55
[+1] [2012-02-13 01:19:40] RichardTheKiwi

Another T-SQL solution borrowing some ideas from Martin's solution [1] (min76- etc).

declare @ varchar(max),@w real,@j int;select s=@ into[ ]set @=(select*
from openrowset(bulk'a',single_blob)a)while @>''begin set @=stuff(@,1,
patindex('%[a-z]%',@)-1,'')+'.'set @j=patindex('%[^a-z]%',@)if @j>2insert[ ]
select lower(left(@,@j-1))set @=stuff(@,1,@j,'')end;select top(22)s,count(*)
c into # from[ ]where',the,and,of,to,it,in,or,is,'not like'%,'+s+',%'
group by s order by 2desc;select @w=min((76.-len(s))/c),@=' '+replicate(
'_',max(c)*@w)from #;select @=@+'
|'+replicate('_',c*@w)+'| '+s+' 'from #;print @

The entire solution should be on two lines (concatenate first 7), although you can cut, paste and run it as-is. Total characters = 507 (counting the line break as 1 if you save it in Unix format and execute using SQLCMD)

Assumptions:

  1. There isn't a temp table #
  2. There isn't a table named [ ]
  3. The input is in the default system folder, e.g. C:\windows\system32\a
  4. Your query window has 'set nocount on' active (prevent spurious "rows affected" msgs)

And to get onto the list of solutions (<500 char), here's the "relaxed" edition at 483 characters (No vertical bars / No top bar / No trailing space after word)

declare @ varchar(max),@w real,@j int;select s=@ into[ ]set @=(select*
from openrowset(bulk'b',single_blob)a)while @>''begin set @=stuff(@,1,
patindex('%[a-z]%',@)-1,'')+'.'set @j=patindex('%[^a-z]%',@)if @j>2insert[ ]
select lower(left(@,@j-1))set @=stuff(@,1,@j,'')end;select top(22)s,count(*)
c into # from[ ]where',the,and,of,to,it,in,or,is,'not like'%,'+s+',%'
group by s order by 2desc;select @w=min((78.-len(s))/c),@=''from #;select @=@+'
'+replicate('_',c*@w)+' '+s from #;print @

Readable version

declare @ varchar(max), @w real, @j int
select s=@ into[ ] -- shortcut to create table; use defined variable to specify column type
-- openrowset reads an entire file
set @=(select * from openrowset(bulk'a',single_blob) a) -- a bit shorter than naming 'BulkColumn'

while @>'' begin -- loop until input is empty
    set @=stuff(@,1,patindex('%[a-z]%',@)-1,'')+'.' -- remove lead up to first A-Z char *
    set @j=patindex('%[^a-z]%',@) -- find first non A-Z char. The +'.' above makes sure there is one
    if @j>2insert[ ] select lower(left(@,@j-1)) -- insert only words >1 char
    set @=stuff(@,1,@j,'') -- remove word and trailing non A-Z char
end;

select top(22)s,count(*)c
into #
from[ ]
where ',the,and,of,to,it,in,or,is,' not like '%,'+s+',%' -- exclude list
group by s
order by 2desc; -- highest occurrence, assume no ties at 22!

-- 80 - 2 vertical bars - 2 spaces = 76
-- @w = weighted frequency
-- this produces a line equal to the length of the max occurrence (max(c))
select @w=min((76.-len(s))/c),@=' '+replicate('_',max(c)*@w)
from #;

-- for each word, append it as a new line. note: embedded newline
select @=@+'
|'+replicate('_',c*@w)+'| '+s+' 'from #;
-- note: 22 words in a table should always fit on an 8k page
--       the order of processing should always be the same as the insert-orderby
--       thereby producing the correct output

print @ -- output
[1] https://stackoverflow.com/a/3173246/573261

56
[0] [2010-07-03 23:04:36] Will

Python, 250 chars

borrowing from all the other Python snippets

import re,sys
t=re.findall("\w+","".join(sys.stdin).lower())
W=sorted((-t.count(w),w)for w in set(t)-set("the and of to a i it in or is".split()))[:22]
Z,U=W[0],lambda n:"_"*int(n*(76.-len(Z[1]))/Z[0])
print"",U(Z[0])
for(n,w)in W:print"|"+U(n)+"|",w

If you're cheeky and put the words to avoid as arguments, 223 chars

import re,sys
t=re.findall("\w+","".join(sys.stdin).lower())
W=sorted((-t.count(w),w)for w in set(t)-set(sys.argv[1:]))[:22]
Z,U=W[0],lambda n:"_"*int(n*(76.-len(Z[1]))/Z[0])
print"",U(Z[0])
for(n,w)in W:print"|"+U(n)+"|",w

Output is:

$ python alice4.py  the and of to a i it in or is < 11.txt 
 _________________________________________________________________________
|_________________________________________________________________________| she
|_______________________________________________________________| you
|____________________________________________________________| said
|_____________________________________________________| alice
|_______________________________________________| was
|___________________________________________| that
|____________________________________| as
|________________________________| her
|_____________________________| at
|_____________________________| with
|____________________________| s
|____________________________| t
|__________________________| on
|__________________________| all
|_______________________| this
|_______________________| for
|_______________________| had
|_______________________| but
|______________________| be
|_____________________| not
|____________________| they
|____________________| so

(1) This doesn't handle the problem of having the scale determined by a word that is not the most frequent one. - 6502
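
To make the comment above concrete, here is a minimal Python 3 sketch that applies the same scaling rule as the answer (scale taken from the most frequent word only) and reports which lines would run past 80 columns. The function name `overflowing_words` and the sample frequencies are invented for illustration; they are not from the answer or from the Alice text.

```python
def overflowing_words(freqs, columns=80):
    """Return the words whose chart line would exceed `columns` characters
    when the scale is derived from the most frequent word only.

    freqs is a list of (word, count) pairs sorted by descending count.
    """
    top_word, top_count = freqs[0]
    budget = 76 - len(top_word)  # same 76 as in the answer's lambda
    too_long = []
    for word, count in freqs:
        bar = int(count * budget / top_count)
        # Line layout as printed by the answer: "|" + bar + "|" + " " + word
        if 1 + bar + 1 + 1 + len(word) > columns:
            too_long.append(word)
    return too_long


# Invented frequencies: a long word that is almost as frequent as the top one.
print(overflowing_words([("she", 100), ("extraordinarily", 99)]))
# prints ['extraordinarily']
```

One way around this, used by some other answers, is to take the smallest ratio `(76 - len(word)) / count` over all displayed words instead of the ratio for the first word alone.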
57
[0] [2011-11-30 11:49:04] Armand

Groovy, 250

Code:

m=[:]
(new URL(args[0]).text.toLowerCase()=~/\w+/).each{it==~/(the|and|of|to|a|i[tns]?|or)/?:(m[it]=1+(m[it]?:0))}
k=m.keySet().sort{a,b->m[b]<=>m[a]}
b={d,c,b->println d+'_'*c+d+' '+b}
b' ',z=77-k[0].size(),''
k[0..21].each{b'|',m[it]*z/m[k[0]],it}

Execution:

$ groovy wordcount.groovy http://www.gutenberg.org/files/11/11.txt

Output:

 __________________________________________________________________________  
|__________________________________________________________________________| she
|________________________________________________________________| you
|_____________________________________________________________| said
|_____________________________________________________| alice
|_______________________________________________| was
|____________________________________________| that
|____________________________________| as
|_________________________________| her
|______________________________| at
|______________________________| with
|_____________________________| s
|_____________________________| t
|___________________________| on
|__________________________| all
|________________________| this
|_______________________| for
|_______________________| had
|_______________________| but
|______________________| be
|______________________| not
|____________________| they
|____________________| so

N.B. This follows the relaxed rules re: long strings.


58
[0] [2012-04-07 23:23:26] tmartin

Q, 194

{t::y;{(-1')t#(.:)[b],'(!:)[b:"|",/:(((_)70*x%(*:)x)#\:"_"),\:"|"];}desc(#:')(=)($)(`$inter\:[(,/)" "vs'" "sv/:"'"vs'a(&)0<(#:')a:(_:')read0 -1!x;52#.Q.an])except`the`and`of`to`a`i`it`in`or`is`}

The function takes two arguments: a file containing the text, and the number of chart lines to display.

q){t::y;{(-1')t#(.:)[b],'(!:)[b:"|",/:(((_)70*x%(*:)x)#\:"_"),\:"|"];}desc(#:')(=)($)(`$inter\:[(,/)" "vs'" "sv/:"'"vs'a(&)0<(#:')a:(_:')read0 -1!x;52#.Q.an])except`the`and`of`to`a`i`it`in`or`is`}[`a.txt;20]

Output:

|______________________________________________________________________|she
|____________________________________________________________|you
|__________________________________________________________|said
|___________________________________________________|alice
|_____________________________________________|was
|_________________________________________|that
|__________________________________|as
|_______________________________|her
|_____________________________|with
|____________________________|at
|___________________________|t
|___________________________|s
|_________________________|on
|_________________________|all
|_______________________|this
|______________________|for
|______________________|had
|_____________________|but
|_____________________|be
|_____________________|not

59