Build an ASCII chart of the most commonly used words in a given text.
The rules:
a-z
and A-Z
(alphabetic characters) as part of a word. She
== she
for our purpose). the, and, of, to, a, i, it, in, or, is
Clarification: considering don't
: this would be taken as 2 different 'words' in the ranges a-z
and A-Z
: (don
and t
).
Optionally (it's too late to be formally changing the specifications now) you may choose to drop all single-letter 'words' (this could potentially make for a shortening of the ignore list too).
Parse a given text
(read a file specified via command line arguments or piped in; presume us-ascii
) and build us a word frequency chart
with the following characteristics:
width
represents the number of occurences (frequency) of the word (proportionally). Append one space and print the word.bar
+ [space]
+ word
+ [space]
should be always <= 80
characters (make sure you account for possible differing bar and word lengths: e.g.: the second most common word could be a lot longer then the first while not differing so much in frequency). Maximize bar width within these constraints and scale the bars appropriately (according to the frequencies they represent).An example:
The text for the example can be found here [1] (Alice's Adventures in Wonderland, by Lewis Carroll).
This specific text would yield the following chart:
_________________________________________________________________________ |_________________________________________________________________________| she |_______________________________________________________________| you |____________________________________________________________| said |____________________________________________________| alice |______________________________________________| was |__________________________________________| that |___________________________________| as |_______________________________| her |____________________________| with |____________________________| at |___________________________| s |___________________________| t |_________________________| on |_________________________| all |______________________| this |______________________| for |______________________| had |_____________________| but |____________________| be |____________________| not |___________________| they |__________________| so
For your information: these are the frequencies the above chart is built upon:
[('she', 553), ('you', 481), ('said', 462), ('alice', 403), ('was', 358), ('that ', 330), ('as', 274), ('her', 248), ('with', 227), ('at', 227), ('s', 219), ('t' , 218), ('on', 204), ('all', 200), ('this', 181), ('for', 179), ('had', 178), (' but', 175), ('be', 167), ('not', 166), ('they', 155), ('so', 152)]
A second example (to check if you implemented the complete spec):
Replace every occurence of you
in the linked Alice in Wonderland file with superlongstringstring
:
________________________________________________________________ |________________________________________________________________| she |_______________________________________________________| superlongstringstring |_____________________________________________________| said |______________________________________________| alice |________________________________________| was |_____________________________________| that |______________________________| as |___________________________| her |_________________________| with |_________________________| at |________________________| s |________________________| t |______________________| on |_____________________| all |___________________| this |___________________| for |___________________| had |__________________| but |_________________| be |_________________| not |________________| they |________________| so
The winner:
Shortest solution (by character count, per language). Have fun!
Edit: Table summarizing the results so far (2012-02-15) (originally added by user Nas Banov):
Language Relaxed Strict ========= ======= ====== GolfScript 130 143 Perl 185 Windows PowerShell 148 199 Mathematica 199 Ruby 185 205 Unix Toolchain 194 228 Python 183 243 Clojure 282 Scala 311 Haskell 333 Awk 336 R 298 Javascript 304 354 Groovy 321 Matlab 404 C# 422 Smalltalk 386 PHP 450 F# 452 TSQL 483 507
The numbers represent the length of the shortest solution in a specific language. "Strict" refers to a solution that implements the spec completely (draws |____|
bars, closes the first bar on top with a ____
line, accounts for the possibility of long words with high frequency etc). "Relaxed" means some liberties were taken to shorten to solution.
Only solutions shorter then 500 characters are included. The list of languages is sorted by the length of the 'strict' solution. 'Unix Toolchain' is used to signify various solutions that use traditional *nix shell plus a mix of tools (like grep, tr, sort, uniq, head, perl, awk).
Teaching the elephant to tap-dance is never pretty. I'll, ah, skip the character count.
The program flows from left to right:
(heavily based on the other Ruby solutions)
w=($<.read.downcase.scan(/[a-z]+/)-%w{the and of to a i it in or is}).group_by{|x|x}.map{|x,y|[-y.size,x]}.sort[0,22]
k,l=w[0]
puts [?\s+?_*m=76-l.size,w.map{|f,x|?|+?_*(f*m/k)+"| "+x}]
Instead of using any command line switches like the other solutions, you can simply pass the filename as argument. (i.e. ruby1.9 wordfrequency.rb Alice.txt
)
Since I'm using character-literals here, this solution only works in Ruby 1.9.
Edit: Replaced semicolons by line breaks for "readability". :P
Edit 2: Shtééf pointed out I forgot the trailing space - fixed that.
Edit 3: Removed the trailing space again ;)
Slow - 3 minutes for the sample text (130)
{32|.123%97<n@if}%]''*n%"oftoitinorisa"2/-"theandi"3/-$(1@{.3$>1{;)}if}/]2/{~~\;}$22<.0=~:2;,76\-:1'_':0*' '\@{"
|"\~1*2/0*'| '@}/
Explanation:
{ #loop through all characters
32|. #convert to uppercase and duplicate
123%97< #determine if is a letter
n@if #return either the letter or a newline
}% #return an array (of ints)
]''* #convert array to a string with magic
n% #split on newline, removing blanks (stack is an array of words now)
"oftoitinorisa" #push this string
2/ #split into groups of two, i.e. ["of" "to" "it" "in" "or" "is" "a"]
- #remove any occurrences from the text
"theandi"3/-#remove "the", "and", and "i"
$ #sort the array of words
(1@ #takes the first word in the array, pushes a 1, reorders stack
#the 1 is the current number of occurrences of the first word
{ #loop through the array
.3$>1{;)}if#increment the count or push the next word and a 1
}/
]2/ #gather stack into an array and split into groups of 2
{~~\;}$ #sort by the latter element - the count of occurrences of each word
22< #take the first 22 elements
.0=~:2; #store the highest count
,76\-:1 #store the length of the first line
'_':0*' '\@ #make the first line
{ #loop through each word
"
|"\~ #start drawing the bar
1*2/0 #divide by zero
*'| '@ #finish drawing the bar
}/
"Correct" (hopefully). (143)
{32|.123%97<n@if}%]''*n%"oftoitinorisa"2/-"theandi"3/-$(1@{.3$>1{;)}if}/]2/{~~\;}$22<..0=1=:^;{~76@,-^*\/}%$0=:1'_':0*' '\@{"
|"\~1*^/0*'| '@}/
Less slow - half a minute. (162)
'"'/' ':S*n/S*'"#{%q
'\+"
.downcase.tr('^a-z','
')}\""+~n%"oftoitinorisa"2/-"theandi"3/-$(1@{.3$>1{;)}if}/]2/{~~\;}$22<.0=~:2;,76\-:1'_':0*S\@{"
|"\~1*2/0*'| '@}/
Output visible in revision logs.
~ % wc -c wfg
209 wfg
~ % cat wfg
egrep -oi \\b[a-z]+|tr A-Z a-z|egrep -wv 'the|and|of|to|a|i|it|in|or|is'|sort|uniq -c|sort -nr|head -22|perl -lape'($f,$w)=@F;$.>1or($q,$x)=($f,76-length$w);$b="_"x($f/$q*$x);$_="|$b| $w ";$.>1or$_=" $b\n$_"'
~ % # usage:
~ % sh wfg < 11.txt
hm, just seen above: sort -nr
-> sort -n
and then head
-> tail
=> 208 :)
update2: erm, of course the above is silly, as it will be reversed then. So, 209.
update3: optimized the exclusion regexp -> 206
egrep -oi \\b[a-z]+|tr A-Z a-z|egrep -wv 'the|and|o[fr]|to|a|i[tns]?'|sort|uniq -c|sort -nr|head -22|perl -lape'($f,$w)=@F;$.>1or($q,$x)=($f,76-length$w);$b="_"x($f/$q*$x);$_="|$b| $w ";$.>1or$_=" $b\n$_"'
for fun, here's a perl-only version (much faster):
~ % wc -c pgolf
204 pgolf
~ % cat pgolf
perl -lne'$1=~/^(the|and|o[fr]|to|.|i[tns])$/i||$f{lc$1}++while/\b([a-z]+)/gi}{@w=(sort{$f{$b}<=>$f{$a}}keys%f)[0..21];$Q=$f{$_=$w[0]};$B=76-y///c;print" "."_"x$B;print"|"."_"x($B*$f{$_}/$Q)."| $_"for@w'
~ % # usage:
~ % sh pgolf < 11.txt
Thanks to Gabe for some useful suggestions to reduce the character count.
NB: Line breaks added to avoid scrollbars only the last line break is required.
DECLARE @ VARCHAR(MAX),@F REAL SELECT @=BulkColumn FROM OPENROWSET(BULK'A',
SINGLE_BLOB)x;WITH N AS(SELECT 1 i,LEFT(@,1)L UNION ALL SELECT i+1,SUBSTRING
(@,i+1,1)FROM N WHERE i<LEN(@))SELECT i,L,i-RANK()OVER(ORDER BY i)R INTO #D
FROM N WHERE L LIKE'[A-Z]'OPTION(MAXRECURSION 0)SELECT TOP 22 W,-COUNT(*)C
INTO # FROM(SELECT DISTINCT R,(SELECT''+L FROM #D WHERE R=b.R FOR XML PATH
(''))W FROM #D b)t WHERE LEN(W)>1 AND W NOT IN('the','and','of','to','it',
'in','or','is')GROUP BY W ORDER BY C SELECT @F=MIN(($76-LEN(W))/-C),@=' '+
REPLICATE('_',-MIN(C)*@F)+' 'FROM # SELECT @=@+'
|'+REPLICATE('_',-C*@F)+'| '+W FROM # ORDER BY C PRINT @
Readable Version
DECLARE @ VARCHAR(MAX),
@F REAL
SELECT @=BulkColumn
FROM OPENROWSET(BULK'A',SINGLE_BLOB)x; /* Loads text file from path
C:\WINDOWS\system32\A */
/*Recursive common table expression to
generate a table of numbers from 1 to string length
(and associated characters)*/
WITH N AS
(SELECT 1 i,
LEFT(@,1)L
UNION ALL
SELECT i+1,
SUBSTRING(@,i+1,1)
FROM N
WHERE i<LEN(@)
)
SELECT i,
L,
i-RANK()OVER(ORDER BY i)R
/*Will group characters
from the same word together*/
INTO #D
FROM N
WHERE L LIKE'[A-Z]'OPTION(MAXRECURSION 0)
/*Assuming case insensitive accent sensitive collation*/
SELECT TOP 22 W,
-COUNT(*)C
INTO #
FROM (SELECT DISTINCT R,
(SELECT ''+L
FROM #D
WHERE R=b.R FOR XML PATH('')
)W
/*Reconstitute the word from the characters*/
FROM #D b
)
T
WHERE LEN(W)>1
AND W NOT IN('the',
'and',
'of' ,
'to' ,
'it' ,
'in' ,
'or' ,
'is')
GROUP BY W
ORDER BY C
/*Just noticed this looks risky as it relies on the order of evaluation of the
variables. I'm not sure that's guaranteed but it works on my machine :-) */
SELECT @F=MIN(($76-LEN(W))/-C),
@ =' ' +REPLICATE('_',-MIN(C)*@F)+' '
FROM #
SELECT @=@+'
|'+REPLICATE('_',-C*@F)+'| '+W
FROM #
ORDER BY C
PRINT @
Output
_________________________________________________________________________
|_________________________________________________________________________| she
|_______________________________________________________________| You
|____________________________________________________________| said
|_____________________________________________________| Alice
|_______________________________________________| was
|___________________________________________| that
|____________________________________| as
|________________________________| her
|_____________________________| at
|_____________________________| with
|__________________________| on
|__________________________| all
|_______________________| This
|_______________________| for
|_______________________| had
|_______________________| but
|______________________| be
|_____________________| not
|____________________| they
|____________________| So
|___________________| very
|__________________| what
And with the long string
_______________________________________________________________
|_______________________________________________________________| she
|_______________________________________________________| superlongstringstring
|____________________________________________________| said
|______________________________________________| Alice
|________________________________________| was
|_____________________________________| that
|_______________________________| as
|____________________________| her
|_________________________| at
|_________________________| with
|_______________________| on
|______________________| all
|____________________| This
|____________________| for
|____________________| had
|____________________| but
|___________________| be
|__________________| not
|_________________| they
|_________________| So
|________________| very
|________________| what
0.000
to just 0
, then using -C
instead of 1.0/C
. And making FLOAT
into REAL
will save a stroke too. The biggest thing, though, is that it looks like you have lots of AS
instances that should be optional. - Gabe
SELECT $0 O, ' '+REPLICATE('_', MAX(C)*@F)+' ' [ ] FROM # UNION SELECT $1/C, '|'+REPLICATE('_',C*@F)+'| '+W FROM # ORDER BY 1
- Gabe
$
thing thanks. The problem is though that it returns an additional; column to the output that isn't part of the spec (Hence the need for the additional temp table step) - Martin Smith
SELECT [ ] FROM (SELECT $0 O, ' '+REPLICATE('_', MAX(C)*@F)+' ' [ ] FROM # UNION SELECT $1/C, '|'+REPLICATE('_',C*@F)+'| '+W FROM #)X ORDER BY O
? - Gabe
@F
where it's used. You can declare it up with @
and save a whole DECLARE
worth of chars. - Gabe
i-ROW_NUMBER
the same as RANK
? Can the second CTE be moved into the FROM
clause where it's used? Can the #D table query be made into a CTE? Can the #t table query be made into a CTE, or at least put into the FROM
clause of the SELECT TOP 22
query? - Gabe
An improvement on Anurag, incorporating suggestion from rfusca. Also removes argument to sort and a few other minor golfings.
w=(STDIN.read.downcase.scan(/[a-z]+/)-%w{the and of to a i it in or is}).group_by{|x|x}.map{|x,y|[-y.size,x]}.sort.take 22;k,l=w[0];m=76.0-l.size;puts' '+'_'*m;w.map{|f,x|puts"|#{'_'*(m*f/k)}| #{x} "}
Execute as:
ruby GolfedWordFrequencies.rb < Alice.txt
Edit: put 'puts' back in, needs to be there to avoid having quotes in output.
Edit2: Changed File->IO
Edit3: removed /i
Edit4: Removed parentheses around (f*1.0), recounted
Edit5: Use string addition for the first line; expand s
in-place.
Edit6: Made m float, removed 1.0. EDIT: Doesn't work, changes lengths. EDIT: No worse than before
Edit7: Use STDIN.read
.
Look Mamma ... no vars, no hands, .. no head
Edit 1> some shorthands defined (284 chars)
f[x_, y_] := Flatten[Take[x, All, y]];
BarChart[f[{##}, -1],
BarOrigin -> Left,
ChartLabels -> Placed[f[{##}, 1], After],
Axes -> None
]
& @@
Take[
SortBy[
Tally[
Select[
StringSplit[ToLowerCase[Import[i]], RegularExpression["\\W+"]],
!MemberQ[{"the", "and", "of", "to", "a", "i", "it", "in", "or","is"}, #]&]
],
Last],
-22]
Some explanations
Import[]
# Get The File
ToLowerCase []
# To Lower Case :)
StringSplit[ STRING , RegularExpression["\\W+"]]
# Split By Words, getting a LIST
Select[ LIST, !MemberQ[{LIST_TO_AVOID}, #]&]
# Select from LIST except those words in LIST_TO_AVOID
# Note that !MemberQ[{LIST_TO_AVOID}, #]& is a FUNCTION for the test
Tally[LIST]
# Get the LIST {word,word,..}
and produce another {{word,counter},{word,counter}...}
SortBy[ LIST ,Last]
# Get the list produced bt tally and sort by counters
Note that counters are the LAST element of {word,counter}
Take[ LIST ,-22]
# Once sorted, get the biggest 22 counters
BarChart[f[{##}, -1], ChartLabels -> Placed[f[{##}, 1], After]] &@@ LIST
# Get the list produced by Take as input and produce a bar chart
f[x_, y_] := Flatten[Take[x, All, y]]
# Auxiliary to get the list of the first or second element of lists of lists x_
dependending upon y
# So f[{##}, -1] is the list of counters
# and f[{##}, 1] is the list of words (labels for the chart)
Output
alt text http://i49.tinypic.com/2n8mrer.jpg [1]
Mathematica is not well suited for golfing, and that is just because of the long, descriptive function names. Functions like "RegularExpression[]" or "StringSplit[]" just make me sob :(.
The Zipf's law [2] predicts that for a natural language text, the Log (Rank) vs Log (occurrences) Plot follows a linear relationship.
The law is used in developing algorithms for criptography and data compression. (But it's NOT the "Z" in the LZW algorithm).
In our text, we can test it with the following
f[x_, y_] := Flatten[Take[x, All, y]];
ListLogLogPlot[
Reverse[f[{##}, -1]],
AxesLabel -> {"Log (Rank)", "Log Counter"},
PlotLabel -> "Testing Zipf's Law"]
& @@
Take[
SortBy[
Tally[
StringSplit[ToLowerCase[b], RegularExpression["\\W+"]]
],
Last],
-1000]
The result is (pretty well linear)
alt text http://i46.tinypic.com/33fcmdk.jpg [3]
Refactoring the Regex (no Select function anymore)
Dropping 1 char words
More efficient definition for function "f"
f = Flatten[Take[#1, All, #2]]&;
BarChart[
f[{##}, -1],
BarOrigin -> Left,
ChartLabels -> Placed[f[{##}, 1], After],
Axes -> None]
& @@
Take[
SortBy[
Tally[
StringSplit[ToLowerCase[Import[i]],
RegularExpression["(\\W|\\b(.|the|and|of|to|i[tns]|or)\\b)+"]]
],
Last],
-22]
BarChart[#2, BarOrigin->Left, ChartLabels->Placed[#1, After], Axes->None]&@@
Transpose@Take[SortBy[Tally@StringSplit[ToLowerCase@Import@i,
RegularExpression@"(\\W|\\b(.|the|and|of|to|i[tns]|or)\\b)+"],Last], -22]
f
with Transpose
and Slot
(#1
/#2
) arguments. f@x
instead of f[x]
where possible)|i|
is redundant in your regex because you already have .|
. - Gabe
the_foo
according to spec, right? - Joey
\w
as something like [a-zA-Z0-9_]
or maybe [\p{L}\p{Nd}_]
for Unicode-aware engines. And since \b
is considered a boundary between \w
and \W
this doesn't work according to the spec here. But many solutions have that problem and it took me quite a few characters to get that part right in my solution. As for the why I think it fits with what many programming languages allow as identifiers. You can simply match them with \w+
(doesn't quite work, but close enough for most hackish solutions). - Joey
Not that short, but now probably correct! Note, the previous version did not show the first line of the bars, did not scale the bars correctly, downloaded the file instead of getting it from stdin, and did not include all the required C# verbosity. You could easily shave many strokes if C# didn't need so much extra crap. Maybe Powershell could do better.
using C=System.Console; // alias for Console
using System.Linq; // for Split, GroupBy, Select, OrderBy, etc.
class Class // must define a class
{
static void Main() // must define a Main
{
// split into words
var allwords = System.Text.RegularExpressions.Regex.Split(
// convert stdin to lowercase
C.In.ReadToEnd().ToLower(),
// eliminate stopwords and non-letters
@"(?:\b(?:the|and|of|to|a|i[tns]?|or)\b|\W)+")
.GroupBy(x => x) // group by words
.OrderBy(x => -x.Count()) // sort descending by count
.Take(22); // take first 22 words
// compute length of longest bar + word
var lendivisor = allwords.Max(y => y.Count() / (76.0 - y.Key.Length));
// prepare text to print
var toPrint = allwords.Select(x=>
new {
// remember bar pseudographics (will be used in two places)
Bar = new string('_',(int)(x.Count()/lendivisor)),
Word=x.Key
})
.ToList(); // convert to list so we can index into it
// print top of first bar
C.WriteLine(" " + toPrint[0].Bar);
toPrint.ForEach(x => // for each word, print its bar and the word
C.WriteLine("|" + x.Bar + "| " + x.Word));
}
}
422 chars with lendivisor inlined (which makes it 22 times slower) in the below form (newlines used for select spaces):
using System.Linq;using C=System.Console;class M{static void Main(){var
a=System.Text.RegularExpressions.Regex.Split(C.In.ReadToEnd().ToLower(),@"(?:\b(?:the|and|of|to|a|i[tns]?|or)\b|\W)+").GroupBy(x=>x).OrderBy(x=>-x.Count()).Take(22);var
b=a.Select(x=>new{p=new string('_',(int)(x.Count()/a.Max(y=>y.Count()/(76d-y.Key.Length)))),t=x.Key}).ToList();C.WriteLine(" "+b[0].p);b.ForEach(x=>C.WriteLine("|"+x.p+"| "+x.t));}}
using C=System.Console;
and then in your code C.WriteLine(..)
, or a different char since you already have C as a class name. - John K
(Updated again to beat the Ruby version with more dirty golf tricks, replacing split/[^a-z/,lc
with lc=~/[a-z]+/g
, and eliminating a check for empty string in another place. These were inspired by the Ruby version, so credit where credit is due.)
Update: now with Perl 5.10! Replace print
with say
, and use ~~
to avoid a map
. This has to be invoked on the command line as perl -E '<one-liner>' alice.txt
. Since the entire script is on one line, writing it as a one-liner shouldn't present any difficulty :).
@s=qw/the and of to a i it in or is/;$c{$_}++foreach grep{!($_~~@s)}map{lc=~/[a-z]+/g}<>;@s=sort{$c{$b}<=>$c{$a}}keys%c;$f=76-length$s[0];say" "."_"x$f;say"|"."_"x($c{$_}/$c{$s[0]}*$f)."| $_ "foreach@s[0..21];
Note that this version normalizes for case. This doesn't shorten the solution any, since removing ,lc
(for lower-casing) requires you to add A-Z
to the split regex, so it's a wash.
If you're on a system where a newline is one character and not two, you can shorten this by another two chars by using a literal newline in place of \n
. However, I haven't written the above sample that way, since it's "clearer" (ha!) that way.
Here is a mostly correct, but not remotely short enough, perl solution:
use strict;
use warnings;
my %short = map { $_ => 1 } qw/the and of to a i it in or is/;
my %count = ();
$count{$_}++ foreach grep { $_ && !$short{$_} } map { split /[^a-zA-Z]/ } (<>);
my @sorted = (sort { $count{$b} <=> $count{$a} } keys %count)[0..21];
my $widest = 76 - (length $sorted[0]);
print " " . ("_" x $widest) . "\n";
foreach (@sorted)
{
my $width = int(($count{$_} / $count{$sorted[0]}) * $widest);
print "|" . ("_" x $width) . "| $_ \n";
}
The following is about as short as it can get while remaining relatively readable. (392 chars).
%short = map { $_ => 1 } qw/the and of to a i it in or is/;
%count;
$count{$_}++ foreach grep { $_ && !$short{$_} } map { split /[^a-z]/, lc } (<>);
@sorted = (sort { $count{$b} <=> $count{$a} } keys %count)[0..21];
$widest = 76 - (length $sorted[0]);
print " " . "_" x $widest . "\n";
print"|" . "_" x int(($count{$_} / $count{$sorted[0]}) * $widest) . "| $_ \n" foreach @sorted;
foreach
s can be written as for
s. That's 8 chars down. Then you have the grep{!($_~~@s)}map{lc=~/[a-z]+/g}<>
, which I believe could be written as grep{!(/$_/i~~@s)}<>=~/[a-z]+/g
to go 4 more down. Replace the " "
with $"
and you're down 1 more... - Zaid
sort{$c{$b}-$c{$a}}...
to save two more. You can also just pass %c
instead of keys %c
to the sort
function and save four more. - mob
$x=$input-split'\P{L}'-notmatch'^(the|and|of|to|.?|i[tns]|or)$'|group|sort *
filter f($w){' '+'_'*$w
$x[-1..-22]|%{"|$('_'*($w*$_.Count/$x[-1].Count))| "+$_.Name}}
f(76..1|?{!((f $_)-match'.'*80)})[0]
(The last line break isn't necessary, but included here for readability.)
(Current code and my test files available in my SVN repository [1]. I hope my test cases catch most common errors (bar length, problems with regex matching and a few others))
Assumptions:
History [2]
Relaxed version (137), since that's counted separately by now, apparently:
($x=$input-split'\P{L}'-notmatch'^(the|and|of|to|.?|i[tns]|or)$'|group|sort *)[-1..-22]|%{"|$('_'*(76*$_.Count/$x[-1].Count))| "+$_.Name}
Variations of the bar lengths of one character compared to other solutions is due to PowerShell using rounding instead of truncation when converting floating-point numbers into integers. Since the task required only proportional bar length this should be fine, though.
Compared to other solutions I took a slightly different approach in determining the longest bar length by simply trying out and taking the highest such length where no line is longer than 80 characters.
An older version explained can be found here [3].
[1] http://svn.lando.us/joey/Public/SO/3169051-split("\b(?:the|and|of|to|a|i[tns]?|or)\b|[^a-z]")
? It works for me. - Gabe
"|$('_'*($w*$_.count/$x[0].count))| $($_.name) "
(or eliminate the last space, as it's sort of automatic). And you can use -split("(?:\b(?:the|and|of|to|a|i[tns]?|or)\b|[^a-z])+")
to save a few more by not including blanks (or use [-2..-23]
). - Gabe
.{80}
. And you can guarantee that blanks will always be first like this: "\b(?:the|and|of|to|a|i[tns]?|or)\b|[^a-z]()"
(the empty capturing group ensures a blank for every word) - Gabe
"$input"
do you still need the ()
? Also, now that you eliminated the trailing space, you can save a couple strokes by not interpolating the name: "|$('_'*($w*$_.count/$x[0].count))| "+$_.name
. We'll get to 200 yet! - Gabe
function
vs. filter
. I thought about using filter
, I just didn't think of the fact that filters can take arguments too. For me it was a comparison between function f($w){...}
and filter{$w=$_;...}
(since I definitely need a loop in the function and therefore can't leave the argument as $_
. Nice trick to remember, thanks :-). Still, I think this approach has been golfed almost to death by now. [And I notice we killed it somewhere in between ... my other two test cases don't run anymore – debugging ...] - Joey
\b
problem was clearly against the spec and only happened to work for the test input. - Joey
update 1: Hurray! It's a tie with JS Bangs [1]' solution [2]. Can't think of a way to cut down any more :)
update 2: Played a dirty golf trick. Changed each
to map
to save 1 character :)
update 3: Changed File.read
to IO.read
+2. Array.group_by
wasn't very fruitful, changed to reduce
+6. Case insensitive check is not needed after lower casing with downcase
in regex +1. Sorting in descending order is easily done by negating the value +6. Total savings +15
update 4: [0]
rather than .first
, +3. (@Shtééf)
update 5: Expand variable l
in-place, +1. Expand variable s
in-place, +2. (@Shtééf)
update 6: Use string addition rather than interpolation for the first line, +2. (@Shtééf)
w=(IO.read($_).downcase.scan(/[a-z]+/)-%w{the and of to a i it in or is}).reduce(Hash.new 0){|m,o|m[o]+=1;m}.sort_by{|k,v|-v}.take 22;m=76-w[0][0].size;puts' '+'_'*m;w.map{|x,f|puts"|#{'_'*(f*1.0/w[0][1]*m)}| #{x} "}
update 7: I went through a whole lot of hoopla to detect the first iteration inside the loop, using instance variables. All I got is +1, though perhaps there is potential. Preserving the previous version, because I believe this one is black magic. (@Shtééf)
(IO.read($_).downcase.scan(/[a-z]+/)-%w{the and of to a i it in or is}).reduce(Hash.new 0){|m,o|m[o]+=1;m}.sort_by{|k,v|-v}.take(22).map{|x,f|@f||(@f=f;puts' '+'_'*(@m=76-x.size));puts"|#{'_'*(f*1.0/@f*@m)}| #{x} "}
Readable version
string = File.read($_).downcase
words = string.scan(/[a-z]+/i)
allowed_words = words - %w{the and of to a i it in or is}
sorted_words = allowed_words.group_by{ |x| x }.map{ |x,y| [x, y.size] }.sort{ |a,b| b[1] <=> a[1] }.take(22)
highest_frequency = sorted_words.first
highest_frequency_count = highest_frequency[1]
highest_frequency_word = highest_frequency[0]
word_length = highest_frequency_word.size
widest = 76 - word_length
puts " #{'_' * widest}"
sorted_words.each do |word, freq|
width = (freq * 1.0 / highest_frequency_count) * widest
puts "|#{'_' * width}| #{word} "
end
To use:
echo "Alice.txt" | ruby -ln GolfedWordFrequencies.rb
Output:
_________________________________________________________________________
|_________________________________________________________________________| she
|_______________________________________________________________| you
|____________________________________________________________| said
|_____________________________________________________| alice
|_______________________________________________| was
|___________________________________________| that
|____________________________________| as
|________________________________| her
|_____________________________| with
|_____________________________| at
|____________________________| s
|____________________________| t
|__________________________| on
|__________________________| all
|_______________________| this
|_______________________| for
|_______________________| had
|_______________________| but
|______________________| be
|_____________________| not
|____________________| they
|____________________| so
[1] https://stackoverflow.com/users/8078/js-bangsscan
, though, gave me a better idea, so I got ahead again :). - JSBձոգչ
p
puts quotes around the output so it wouldn't match OP's. - Anurag
import sys,re
t=re.split('\W+',sys.stdin.read().lower())
r=sorted((-t.count(w),w)for w in set(t)if w not in'andithetoforinis')[:22]
for l,w in r:print(78-len(r[0][1]))*l/r[0][0]*'=',w
Allowing for freedom in the implementation, I constructed a string concatenation that contains all the words requested for exclusion (the, and, of, to, a, i, it, in, or, is
) - plus it also excludes the two infamous "words" s
and t
from the example - and I threw in for free the exclusion for an, for, he
. I tried all concatenations of those words against corpus of the words from Alice, King James' Bible and the Jargon file to see if there are any words that will be mis-excluded by the string. And that is how I ended with two exclusion strings:itheandtoforinis
and andithetoforinis
.
PS. borrowed from other solutions to shorten the code.
=========================================================================== she
================================================================= you
============================================================== said
====================================================== alice
================================================ was
============================================ that
===================================== as
================================= her
============================== at
============================== with
=========================== on
=========================== all
======================== this
======================== had
======================= but
====================== be
====================== not
===================== they
==================== so
=================== very
=================== what
================= little
Regarding words to ignore, one would think those would be taken from list of the most used words in English. That list depends on the
text corpus
[1] used. Per one of the most popular lists (http://en.wikipedia.org/wiki/Most_common_words_in_English, http://www.english-for-students.com/Frequently-Used-Words.html, http://www.sporcle.com/games/common_english_words.php), top 10 words are: the be(am/are/is/was/were) to of and a in that have I
The top 10 words from the Alice in Wonderland text are the and to a of it she i you said
The top 10 words from the Jargon File (v4.4.7) are the a of to and in is that or for
So question is why or
was included in the problem's ignore list, where it's ~30th in popularity when the word that
(8th most used) is not. etc, etc. Hence I believe the ignore list should be provided dynamically (or could be omitted).
Alternative idea would be simply to skip the top 10 words from the result - which actually would shorten the solution (elementary - have to show only the 11th to 32nd entries).
The chart drawn in the above code is simplified (using only one character for the bars). If one wants to reproduce exactly the chart from the problem description (which was not required), this code will do it:
import sys,re
t=re.split('\W+',sys.stdin.read().lower())
r=sorted((-t.count(w),w)for w in set(t)-set(sys.argv))[:22]
h=min(9*l/(77-len(w))for l,w in r)
print'',9*r[0][0]/h*'_'
for l,w in r:print'|'+9*l/h*'_'+'|',w
I take an issue with the somewhat random choice of the 10 words to exclude the, and, of, to, a, i, it, in, or, is
so those are to be passed as command line parameters, like so:
python WordFrequencyChart.py the and of to a i it in or is <"Alice's Adventures in Wonderland.txt"
This is 213 chars + 30 if we account for the "original" ignore list passed on command line = 243
PS. The second code also does "adjustment" for the lengths of all top words, so none of them will overflow in degenerate case.
_______________________________________________________________
|_______________________________________________________________| she
|_______________________________________________________| superlongstringstring
|_____________________________________________________| said
|______________________________________________| alice
|_________________________________________| was
|______________________________________| that
|_______________________________| as
|____________________________| her
|__________________________| at
|__________________________| with
|_________________________| s
|_________________________| t
|_______________________| on
|_______________________| all
|____________________| this
|____________________| for
|____________________| had
|____________________| but
|___________________| be
|___________________| not
|_________________| they
|_________________| so
[1] http://en.wikipedia.org/wiki/Text_corpus\W
or use \b
in a regex because those are very likely not according to spec, meaning they won't split on digits or _
and they might also not remove stop words from strings such as the_foo_or123bar
. They may not appear in the test text but the specification is pretty clear on that case. - Joey
sys.argv
hack and using: re.findall(r'\b(?!(?:the|and|.|of|to|i[tns]|or)\b)\w+',sys.stdin.read().lower())
- intgr
(One line break in main
added for readability, and no line break needed at end of last line.)
import Data.List
import Data.Char
l=length
t=filter
m=map
f c|isAlpha c=toLower c|0<1=' '
h w=(-l w,head w)
x!(q,w)='|':replicate(minimum$m(q?)x)'_'++"| "++w
q?(g,w)=q*(77-l w)`div`g
b x=m(x!)x
a(l:r)=(' ':t(=='_')l):l:r
main=interact$unlines.a.b.take 22.sort.m h.group.sort
.t(`notElem`words"the and of to a i it in or is").words.m f
How it works is best seen by reading the argument to interact
backwards:
map f
lowercases alphabetics, replaces everything else with spaces.words
produces a list of words, dropping the separating whitespace.filter (
notElemwords "the and of to a i it in or is")
discards all entries with forbidden words.group . sort
sorts the words, and groups identical ones into lists.map h
maps each list of identical words to a tuple of the form (-frequency, word)
.take 22 . sort
sorts the tuples by descending frequency (the first tuple entry), and keeps only the first 22 tuples.b
maps tuples to bars (see below).a
prepends the first line of underscores, to complete the topmost bar.unlines
joins all these lines together with newlines.The tricky bit is getting the bar length right. I assumed that only underscores counted towards the length of the bar, so ||
would be a bar of zero length. The function b
maps c x
over x
, where x
is the list of histograms. The entire list is passed to c
, so that each invocation of c
can compute the scale factor for itself by calling u
. In this way, I avoid using floating-point math or rationals, whose conversion functions and imports would eat many characters.
Note the trick of using -frequency
. This removes the need to reverse
the sort
since sorting (ascending) -frequency
will places the words with the largest frequency first. Later, in the function u
, two -frequency
values are multiplied, which will cancel the negation out.
words
discards runs of whitespace, nor that you can put |
conditions onto one line. And I can't believe I put a two-letter variable name in there... - Thomas
div
, actually! Try it- the output is wrong. The reason is that doing the div
before the *
looses precision. - MtnViewMark
?
and !
for the operator names. - Thomas Eding
x={};p='|';e=' ';z=[];c=77
while(l=readline())l.toLowerCase().replace(/\b(?!(the|and|of|to|a|i[tns]?|or)\b)\w+/g,function(y)x[y]?x[y].c++:z.push(x[y]={w:y,c:1}))
z=z.sort(function(a,b)b.c-a.c).slice(0,22)
for each(v in z){v.r=v.c/z[0].c
c=c>(l=(77-v.w.length)/v.r)?l:c}for(k in z){v=z[k]
s=Array(v.r*c|0).join('_')
if(!+k)print(e+s+e)
print(p+s+p+e+v.w)}
Sadly, the for([k,v]in z)
from the Rhino version doesn't seem to want to work in SpiderMonkey, and readFile()
is a little easier than using readline()
but moving up to 1.8 allows us to use function closures to cut a few more lines....
Adding whitespace for readability:
x={};p='|';e=' ';z=[];c=77
while(l=readline())
l.toLowerCase().replace(/\b(?!(the|and|of|to|a|i[tns]?|or)\b)\w+/g,
function(y) x[y] ? x[y].c++ : z.push( x[y] = {w: y, c: 1} )
)
z=z.sort(function(a,b) b.c - a.c).slice(0,22)
for each(v in z){
v.r=v.c/z[0].c
c=c>(l=(77-v.w.length)/v.r)?l:c
}
for(k in z){
v=z[k]
s=Array(v.r*c|0).join('_')
if(!+k)print(e+s+e)
print(p+s+p+e+v.w)
}
Usage: js golf.js < input.txt
Output:
_________________________________________________________________________ |_________________________________________________________________________| she |_______________________________________________________________| you |____________________________________________________________| said |____________________________________________________| alice |______________________________________________| was |___________________________________________| that |___________________________________| as |________________________________| her |_____________________________| at |_____________________________| with |____________________________| s |____________________________| t |__________________________| on |_________________________| all |_______________________| this |______________________| for |______________________| had |______________________| but |_____________________| be |_____________________| not |___________________| they |___________________| so
(base version - doesn't handle bar widths correctly)
I think my sorting logic is off, but.. I duno. Brainfart fixed.
Minified (abusing \n
's interpreted as a ;
sometimes):
x={};p='|';e=' ';z=[]
readFile(arguments[0]).toLowerCase().replace(/\b(?!(the|and|of|to|a|i[tns]?|or)\b)\w+/g,function(y){x[y]?x[y].c++:z.push(x[y]={w:y,c:1})})
z=z.sort(function(a,b){return b.c-a.c}).slice(0,22)
for([k,v]in z){s=Array((v.c/z[0].c)*70|0).join('_')
if(!+k)print(e+s+e)
print(p+s+p+e+v.w)}
i[tns]?
bit. Very sneaky. - dmckee --- ex-moderator kitten
.replace(/[^\w ]/g, e).split(/\s+/).map(
can be replaced with .replace(/\w+/g,
and use the same function your .map
did... Also not sure if Rhino supports function(a,b)b.c-a.c
instead of your sort function (spidermonkey does), but that will shave {return }
... b.c-a.c
is a better sort that a.c<b.c
btw... Editing a Spidermonkey version at the bottom with these changes - gnarf
?:
Great base to work from though! - gnarf
foo_the123
where only foo
should remain. - Joey
Some parts were inspired by the earlier perl/ruby submissions, a couple similar ideas were arrived at independently, the others are original. Shorter version also incorporates some things I saw/learned from other submissions.
Original:
$k{$_}++for grep{$_!~/^(the|and|of|to|a|i|it|in|or|is)$/}map{lc=~/[a-z]+/g}<>;@t=sort{$k{$b}<=>$k{$a}}keys%k;$l=76-length$t[0];printf" %s
",'_'x$l;printf"|%s| $_
",'_'x int$k{$_}/$k{$t[0]}*$l for@t[0..21];
Latest version down to 191 characters:
/^(the|and|of|to|.|i[tns]|or)$/||$k{$_}++for map{lc=~/[a-z]+/g}<>;@e=sort{$k{$b}<=>$k{$a}}keys%k;$n=" %s
";$r=(76-y///c)/$k{$_=$e[0]};map{printf$n,'_'x($k{$_}*$r),$_;$n="|%s| %s
"}@e[0,0..21]
Latest version down to 189 characters:
/^(the|and|of|to|.|i[tns]|or)$/||$k{$_}++for map{lc=~/[a-z]+/g}<>;@_=sort{$k{$b}<=>$k{$a}}keys%k;$n=" %s
";$r=(76-m//)/$k{$_=$_[0]};map{printf$n,'_'x($k{$_}*$r),$_;$n="|%s| %s
"}@_[0,0..21]
This version (205 char) accounts for the lines with words longer than what would be found later.
/^(the|and|of|to|.|i[tns]|or)$/||$k{$_}++for map{lc=~/[a-z]+/g}<>;($r)=sort{$a<=>$b}map{(76-y///c)/$k{$_}}@e=sort{$k{$b}<=>$k{$a}}keys%k;$n=" %s
";map{printf$n,'_'x($k{$_}*$r),$_;$n="|%s| %s
";}@e[0,0..21]
I guess using Counter [1] is kind of cheating :) I just read about it about a week ago, so this was the perfect chance to see how it works.
import re,collections
o=collections.Counter([w for w in re.findall("[a-z]+",open("!").read().lower())if w not in"a and i in is it of or the to".split()]).most_common(22)
print('\n'.join('|'+76*v//o[0][1]*'_'+'| '+k for k,v in o))
Prints out:
|____________________________________________________________________________| she
|__________________________________________________________________| you
|_______________________________________________________________| said
|_______________________________________________________| alice
|_________________________________________________| was
|_____________________________________________| that
|_____________________________________| as
|__________________________________| her
|_______________________________| with
|_______________________________| at
|______________________________| s
|_____________________________| t
|____________________________| on
|___________________________| all
|________________________| this
|________________________| for
|________________________| had
|________________________| but
|______________________| be
|______________________| not
|_____________________| they
|____________________| so
Some of the code was "borrowed" from AKX's solution.
[1] http://docs.python.org/dev/library/collections.html#collections.Counteropen('!')
reads from stdin - which version/OS is that on? or do you have to name the file '!'? - Nas Banov
This solution takes into account the last requirement which most purists have conviniently chosen to ignore. That costed 170 characters!
Usage: php.exe <this.php> <file.txt>
Minified:
<?php $a=array_count_values(array_filter(preg_split('/[^a-z]/',strtolower(file_get_contents($argv[1])),-1,1),function($x){return !preg_match("/^(.|the|and|of|to|it|in|or|is)$/",$x);}));arsort($a);$a=array_slice($a,0,22);function R($a,$F,$B){$r=array();foreach($a as$x=>$f){$l=strlen($x);$r[$x]=$b=$f*$B/$F;if($l+$b>76)return R($a,$f,76-$l);}return$r;}$c=R($a,max($a),76-strlen(key($a)));foreach($a as$x=>$f)echo '|',str_repeat('-',$c[$x]),"| $x\n";?>
Human readable:
<?php
// Read:
$s = strtolower(file_get_contents($argv[1]));
// Split:
$a = preg_split('/[^a-z]/', $s, -1, PREG_SPLIT_NO_EMPTY);
// Remove unwanted words:
$a = array_filter($a, function($x){
return !preg_match("/^(.|the|and|of|to|it|in|or|is)$/",$x);
});
// Count:
$a = array_count_values($a);
// Sort:
arsort($a);
// Pick top 22:
$a=array_slice($a,0,22);
// Recursive function to adjust bar widths
// according to the last requirement:
function R($a,$F,$B){
$r = array();
foreach($a as $x=>$f){
$l = strlen($x);
$r[$x] = $b = $f * $B / $F;
if ( $l + $b > 76 )
return R($a,$f,76-$l);
}
return $r;
}
// Apply the function:
$c = R($a,max($a),76-strlen(key($a)));
// Output:
foreach ($a as $x => $f)
echo '|',str_repeat('-',$c[$x]),"| $x\n";
?>
Output:
|-------------------------------------------------------------------------| she
|---------------------------------------------------------------| you
|------------------------------------------------------------| said
|-----------------------------------------------------| alice
|-----------------------------------------------| was
|-------------------------------------------| that
|------------------------------------| as
|--------------------------------| her
|-----------------------------| at
|-----------------------------| with
|--------------------------| on
|--------------------------| all
|-----------------------| this
|-----------------------| for
|-----------------------| had
|-----------------------| but
|----------------------| be
|---------------------| not
|--------------------| they
|--------------------| so
|-------------------| very
|------------------| what
When there is a long word, the bars are adjusted properly:
|--------------------------------------------------------| she
|---------------------------------------------------| thisisareallylongwordhere
|-------------------------------------------------| you
|-----------------------------------------------| said
|-----------------------------------------| alice
|------------------------------------| was
|---------------------------------| that
|---------------------------| as
|-------------------------| her
|-----------------------| with
|-----------------------| at
|--------------------| on
|--------------------| all
|------------------| this
|------------------| for
|------------------| had
|-----------------| but
|-----------------| be
|----------------| not
|---------------| they
|---------------| so
|--------------| very
$/=\0;/^(the|and|of|to|.|i[tns]|or)$/i||$x{lc$_}++for<>=~/[a-z]+/gi;map{$z=$x{$_};$y||{$y=(76-y///c)/$z}&&warn" "."_"x($z*$y)."\n";printf"|%.78s\n","_"x($z*$y)."| $_"}(sort{$x{$b}<=>$x{$a}}keys%x)[0..21]
Alternate, full implementation including indicated behaviour (global bar-squishing) for the pathological case in which the secondary word is both popular and long enough to combine to over 80 chars (this implementation is 231 chars):
$/=\0;/^(the|and|of|to|.|i[tns]|or)$/i||$x{lc$_}++for<>=~/[a-z]+/gi;@e=(sort{$x{$b}<=>$x{$a}}keys%x)[0..21];for(@e){$p=(76-y///c)/$x{$_};($y&&$p>$y)||($y=$p)}warn" "."_"x($x{$e[0]}*$y)."\n";for(@e){warn"|"."_"x($x{$_}*$y)."| $_\n"}
The specification didn't state anywhere that this had to go to STDOUT, so I used perl's warn() instead of print - four characters saved there. Used map instead of foreach, but I feel like there could still be some more savings in the split(join()). Still, got it down to 203 - might sleep on it. At least Perl's now under the "shell, grep, tr, grep, sort, uniq, sort, head, perl" char count for now ;)
PS: Reddit says "Hi" ;)
Update: Removed join() in favour of assignment and implicit scalar conversion join. Down to 202. Also please note I have taken advantage of the optional "ignore 1-letter words" rule to shave 2 characters off, so bear in mind the frequency count will reflect this.
Update 2: Swapped out assignment and implicit join for killing $/ to get the file in one gulp using <> in the first place. Same size, but nastier. Swapped out if(!$y){} for $y||{}&&, saved 1 more char => 201.
Update 3: Took control of lowercasing early (lc<>) by moving lc out of the map block - Swapped out both regexes to no longer use /i option, as no longer needed. Swapped explicit conditional x?y:z construct for traditional perlgolf || implicit conditional construct - /^...$/i?1:$x{$}++ for /^...$/||$x{$}++ Saved three characters! => 198, broke the 200 barrier. Might sleep soon... perhaps.
Update 4: Sleep deprivation has made me insane. Well. More insane. Figuring that this only has to parse normal happy text files, I made it give up if it hits a null. Saved two characters. Replaced "length" with the 1-char shorter (and much more golfish) y///c - you hear me, GolfScript?? I'm coming for you!!! sob
Update 5: Sleep dep made me forget about the 22row limit and subsequent-line limiting. Back up to 208 with those handled. Not too bad, 13 characters to handle it isn't the end of the world. Played around with perl's regex inline eval, but having trouble getting it to both work and save chars... lol. Updated the example to match current output.
Update 6: Removed unneeded braces protecting (...)for, since the syntactic candy ++ allows shoving it up against the for happily. Thanks to input from Chas. Owens (reminding my tired brain), got the character class i[tns] solution in there. Back down to 203.
Update 7: Added second piece of work, full implementation of specs (including the full bar-squishing behaviour for secondary long-words, instead of truncation which most people are doing, based on the original spec without the pathological example case)
Examples:
_________________________________________________________________________
|_________________________________________________________________________| she
|_______________________________________________________________| you
|____________________________________________________________| said
|_____________________________________________________| alice
|_______________________________________________| was
|___________________________________________| that
|____________________________________| as
|________________________________| her
|_____________________________| with
|_____________________________| at
|__________________________| on
|__________________________| all
|_______________________| this
|_______________________| for
|_______________________| had
|_______________________| but
|______________________| be
|_____________________| not
|____________________| they
|____________________| so
|___________________| very
|__________________| what
Alternative implementation in pathological case example:
_______________________________________________________________
|_______________________________________________________________| she
|_______________________________________________________| superlongstringstring
|____________________________________________________| said
|______________________________________________| alice
|________________________________________| was
|_____________________________________| that
|_______________________________| as
|____________________________| her
|_________________________| with
|_________________________| at
|_______________________| on
|______________________| all
|____________________| this
|____________________| for
|____________________| had
|____________________| but
|___________________| be
|__________________| not
|_________________| they
|_________________| so
|________________| very
|________________| what
is|in|it|i
into i[snt]?
– and then there's no difference with the optional rule anymore. (Hm, I never would have thought about telling a Perl guy how to do Regex :D) – only problem now: I have to look how I can shave off three bytes from my own solution to be better than Perl again :-| - Joey
say
: perl -E '$/=\0;map{/^(the|and|of|to|.|it|in|or|is|)$/||$x{$_}++}split(/[^a-z]/,lc<>);map{$z=$x{$_};$y||{$y=(76-y///c)/$z}&&say" "."_"x($z*$y);say"|"."_"x($z*$y)."| $_"}sort{$x{$b}<=>$x{$a}}keys%x'
- Chas. Owens
map
where possible: perl -E '$/=\0;/^(the|and|of|to|.|i[tns]|or)$/||$x{$_}++for split(/[^a-z]/,lc<>);map{$z=$x{$_};$y||{$y=(76-y///c)/$z}&&say" "."_"x($z*$y);say"|"."_"x($z*$y)."| $_"}sort{$x{$b}<=>$x{$a}}keys%x'
- Chas. Owens
Strightforward: get a sequence a
of word-count pairs, find the best word-count-per-column multiplier k
, then print results.
let a=
stdin.ReadToEnd().Split(" .?!,\":;'\r\n".ToCharArray(),enum 1)
|>Seq.map(fun s->s.ToLower())|>Seq.countBy id
|>Seq.filter(fun(w,n)->not(set["the";"and";"of";"to";"a";"i";"it";"in";"or";"is"].Contains w))
|>Seq.sortBy(fun(w,n)-> -n)|>Seq.take 22
let k=a|>Seq.map(fun(w,n)->float(78-w.Length)/float n)|>Seq.min
let u n=String.replicate(int(float(n)*k)-2)"_"
printfn" %s "(u(snd(Seq.nth 0 a)))
for(w,n)in a do printfn"|%s| %s "(u n)w
Example (I have different freq counts than you, unsure why):
% app.exe < Alice.txt
_________________________________________________________________________
|_________________________________________________________________________| she
|_______________________________________________________________| you
|_____________________________________________________________| said
|_____________________________________________________| alice
|_______________________________________________| was
|___________________________________________| that
|___________________________________| as
|________________________________| her
|_____________________________| with
|_____________________________| at
|____________________________| t
|____________________________| s
|__________________________| on
|_________________________| all
|_______________________| this
|______________________| had
|______________________| for
|_____________________| but
|_____________________| be
|____________________| not
|___________________| they
|__________________| so
float
s. - Gabe
|>
? - Joey
import re
W,x={},"a and i in is it of or the to".split()
[W.__setitem__(w,W.get(w,0)-1)for w in re.findall("[a-z]+",file("11.txt").read().lower())if w not in x]
W=sorted(W.items(),key=lambda p:p[1])[:22]
bm=(76.-len(W[0][0]))/W[0][1]
U=lambda n:"_"*int(n*bm)
print "".join(("%s\n|%s| %s "%((""if i else" "+U(n)),U(n),w))for i,(w,n)in enumerate(W))
Output:
_________________________________________________________________________
|_________________________________________________________________________| she
|_______________________________________________________________| you
|____________________________________________________________| said
|_____________________________________________________| alice
|_______________________________________________| was
|___________________________________________| that
|____________________________________| as
|________________________________| her
|_____________________________| with
|_____________________________| at
|____________________________| s
|____________________________| t
|__________________________| on
|__________________________| all
|_______________________| this
|_______________________| for
|_______________________| had
|_______________________| but
|______________________| be
|_____________________| not
|____________________| they
|____________________| so
bm=(76.-len(W[0][0]))/W[0][1]
since you're only using bm once (make the next line U=lambda n:"_"*int(n*(76.-len(W[0][0]))/W[0][1])
, shaves off 5 characters. Also: why would you use a 2-character variable name in code golfing? ;-) - ChristopheD
(after fixing the output formatting; fixing the contractions thing; tweaking; tweaking again; removing a wholly unnecessary sorting step; tweaking yet again; and again (oops this one broke the formatting); tweak some more; taking up Matt's challenge I desperately tweak so more; found another place to save a few, but gave two back to fix the bar length bug)
Heh heh! I am momentarily ahead of [Matt's JavaScript][1] solutioncounter challenge! ;) and [AKX's python][2].
The problem seems to call out for a language that implements native associative arrays, so of course I've chosen one with a horribly deficient set of operators on them. In particular, you cannot control the order in which awk offers up the elements of a hash map, so I repeatedly scan the whole map to find the currently most numerous item, print it and delete it from the array.
It is all terribly inefficient, with all the golfifcations I've made it has gotten to be pretty awful, as well.
Minified:
{gsub("[^a-zA-Z]"," ");for(;NF;NF--)a[tolower($NF)]++}
END{split("the and of to a i it in or is",b," ");
for(w in b)delete a[b[w]];d=1;for(w in a){e=a[w]/(78-length(w));if(e>d)d=e}
for(i=22;i;--i){e=0;for(w in a)if(a[w]>e)e=a[x=w];l=a[x]/d-2;
t=sprintf(sprintf("%%%dc",l)," ");gsub(" ","_",t);if(i==22)print" "t;
print"|"t"| "x;delete a[x]}}
line breaks for clarity only: they are not necessary and should not be counted.
Output:
$ gawk -f wordfreq.awk.min < 11.txt
_________________________________________________________________________
|_________________________________________________________________________| she
|_______________________________________________________________| you
|____________________________________________________________| said
|____________________________________________________| alice
|______________________________________________| was
|__________________________________________| that
|___________________________________| as
|_______________________________| her
|____________________________| with
|____________________________| at
|___________________________| s
|___________________________| t
|_________________________| on
|_________________________| all
|______________________| this
|______________________| for
|______________________| had
|_____________________| but
|____________________| be
|____________________| not
|___________________| they
|__________________| so
$ sed 's/you/superlongstring/gI' 11.txt | gawk -f wordfreq.awk.min
______________________________________________________________________
|______________________________________________________________________| she
|_____________________________________________________________| superlongstring
|__________________________________________________________| said
|__________________________________________________| alice
|____________________________________________| was
|_________________________________________| that
|_________________________________| as
|______________________________| her
|___________________________| with
|___________________________| at
|__________________________| s
|__________________________| t
|________________________| on
|________________________| all
|_____________________| this
|_____________________| for
|_____________________| had
|____________________| but
|___________________| be
|___________________| not
|__________________| they
|_________________| so
Readable; 633 characters (originally 949):
{
gsub("[^a-zA-Z]"," ");
for(;NF;NF--)
a[tolower($NF)]++
}
END{
# remove "short" words
split("the and of to a i it in or is",b," ");
for (w in b)
delete a[b[w]];
# Find the bar ratio
d=1;
for (w in a) {
e=a[w]/(78-length(w));
if (e>d)
d=e
}
# Print the entries highest count first
for (i=22; i; --i){
# find the highest count
e=0;
for (w in a)
if (a[w]>e)
e=a[x=w];
# Print the bar
l=a[x]/d-2;
# make a string of "_" the right length
t=sprintf(sprintf("%%%dc",l)," ");
gsub(" ","_",t);
if (i==22) print" "t;
print"|"t"| "x;
delete a[x]
}
}
This is incomplete, but for the hell of it, here's the word-frequency counting half of the problem in 192 bytes:
curl -s http://www.gutenberg.org/files/11/11.txt|sed -e 's@[^a-z]@\n@gi'|tr '[:upper:]' '[:lower:]'|egrep -v '(^[^a-z]*$|\b(the|and|of|to|a|i|it|in|or|is)\b)' |sort|uniq -c|sort -n|tail -n 22
I'm a LISP newbie, and this is an attempt using an hash table for counting (so probably not the most compact method).
(flet((r()(let((x(read-char t nil)))(and x(char-downcase x)))))(do((c(
make-hash-table :test 'equal))(w NIL)(x(r)(r))y)((not x)(maphash(lambda
(k v)(if(not(find k '("""the""and""of""to""a""i""it""in""or""is"):test
'equal))(push(cons k v)y)))c)(setf y(sort y #'> :key #'cdr))(setf y
(subseq y 0(min(length y)22)))(let((f(apply #'min(mapcar(lambda(x)(/(-
76.0(length(car x)))(cdr x)))y))))(flet((o(n)(dotimes(i(floor(* n f)))
(write-char #\_))))(write-char #\Space)(o(cdar y))(write-char #\Newline)
(dolist(x y)(write-char #\|)(o(cdr x))(format t "| ~a~%"(car x))))))
(cond((char<= #\a x #\z)(push x w))(t(incf(gethash(concatenate 'string(
reverse w))c 0))(setf w nil)))))
can be run on for example with
cat alice.txt | clisp -C golf.lisp
.
In readable form is
(flet ((r () (let ((x (read-char t nil)))
(and x (char-downcase x)))))
(do ((c (make-hash-table :test 'equal)) ; the word count map
w y ; current word and final word list
(x (r) (r))) ; iteration over all chars
((not x)
; make a list with (word . count) pairs removing stopwords
(maphash (lambda (k v)
(if (not (find k '("" "the" "and" "of" "to"
"a" "i" "it" "in" "or" "is")
:test 'equal))
(push (cons k v) y)))
c)
; sort and truncate the list
(setf y (sort y #'> :key #'cdr))
(setf y (subseq y 0 (min (length y) 22)))
; find the scaling factor
(let ((f (apply #'min
(mapcar (lambda (x) (/ (- 76.0 (length (car x)))
(cdr x)))
y))))
; output
(flet ((outx (n) (dotimes (i (floor (* n f))) (write-char #\_))))
(write-char #\Space)
(outx (cdar y))
(write-char #\Newline)
(dolist (x y)
(write-char #\|)
(outx (cdr x))
(format t "| ~a~%" (car x))))))
; add alphabetic to current word, and bump word counter
; on non-alphabetic
(cond
((char<= #\a x #\z)
(push x w))
(t
(incf (gethash (concatenate 'string (reverse w)) c 0))
(setf w nil)))))
It looks alot like obfuscated code, and uses glib for string, list and hash. Char count with wc -m
says 828 . It does not consider single-char words. To calculate the max length of the bar, it consider the longest possible word among all, not only the first 22. Is this a deviation from the spec?
It does not handle failures and it does not release used memory.
#include <glib.h>
#define S(X)g_string_##X
#define H(X)g_hash_table_##X
GHashTable*h;int m,w=0,z=0;y(const void*a,const void*b){int*A,*B;A=H(lookup)(h,a);B=H(lookup)(h,b);return*B-*A;}void p(void*d,void*u){int *v=H(lookup)(h,d);if(w<22){g_printf("|");*v=*v*(77-z)/m;while(--*v>=0)g_printf("=");g_printf("| %s\n",d);w++;}}main(c){int*v;GList*l;GString*s=S(new)(NULL);h=H(new)(g_str_hash,g_str_equal);char*n[]={"the","and","of","to","it","in","or","is"};while((c=getchar())!=-1){if(isalpha(c))S(append_c)(s,tolower(c));else{if(s->len>1){for(c=0;c<8;c++)if(!strcmp(s->str,n[c]))goto x;if((v=H(lookup)(h,s->str))!=NULL)++*v;else{z=MAX(z,s->len);v=g_malloc(sizeof(int));*v=1;H(insert)(h,g_strdup(s->str),v);}}x:S(truncate)(s,0);}}l=g_list_sort(H(get_keys)(h),y);m=*(int*)H(lookup)(h,g_list_first(l)->data);g_list_foreach(l,p,NULL);}
*v=*v*(77-lw)/m
will give 929 ... but I think it can be ok unless I find a way to do it a lot shorter) - ShinTakezou
int c
into the main
declaration and main
is implicitly int
(as are any untyped arguments, afaik): main(c){...}
. You could probably also just write 0
instead of NULL
. - Joey
-Wall
or with -std=c99
flag on... but I suppose this is pointless for a code-golf, right? - ShinTakezou
Without freeing memory stuff, it reaches 866 (removed some other unuseful space)
to something else to let not think people that the difference with the free-memory version is all in that: now the no-free-memory version has a lot of more "improvements". - ShinTakezou
200 (slightly broken)
199
197
195
193
187
185 characters. Last two newlines are significant. Complies with the spec.
map$X{+lc}+=!/^(.|the|and|to|i[nst]|o[rf])$/i,/[a-z]+/gfor<>;
$n=$n>($:=$X{$_}/(76-y+++c))?$n:$:for@w=(sort{$X{$b}-$X{$a}}%X)[0..21];
die map{$U='_'x($X{$_}/$n);" $U
"x!$z++,"|$U| $_
"}@w
First line loads counts of valid words into %X
.
The second line computes minimum scaling factor so that all output lines will be <= 80 characters.
The third line (contains two newline characters) produces the output.
Updates before first 742: improved regex, removed superfluous parameterized types, removed superfluous whitespace.
Update 742 > 744 chars: fixed the fixed-length hack. It's only dependent on the 1st word, not other words (yet). Found several places to shorten the code (\\s
in regex replaced by
and ArrayList
replaced by Vector
). I'm now looking for a short way to remove the Commons IO dependency and reading from stdin.
Update 744 > 752 chars: I removed the commons dependency. It now reads from stdin. Paste the text in stdin and hit Ctrl+Z
to get result.
Update 752 > 742 chars: I removed public
and a space, made classname 1 char instead of 2 and it's now ignoring one-letter words.
Update 742 > 714 chars: Updated as per comments of Carl: removed redundant assignment (742 > 730), replaced m.containsKey(k)
by m.get(k)!=null
(730 > 728), introduced substringing of line (728 > 714).
Update 714 > 680 chars: Updated as per comments of Rotsor: improved bar size calculation to remove unnecessary casting and improved split()
to remove unnecessary replaceAll()
.
import java.util.*;class F{public static void main(String[]a)throws Exception{StringBuffer b=new StringBuffer();for(int c;(c=System.in.read())>0;b.append((char)c));final Map<String,Integer>m=new HashMap();for(String w:b.toString().toLowerCase().split("(\\b(.|the|and|of|to|i[tns]|or)\\b|\\W)+"))m.put(w,m.get(w)!=null?m.get(w)+1:1);List<String>l=new Vector(m.keySet());Collections.sort(l,new Comparator(){public int compare(Object l,Object r){return m.get(r)-m.get(l);}});int c=76-l.get(0).length();String s=new String(new char[c]).replace('\0','_');System.out.println(" "+s);for(String w:l.subList(0,22))System.out.println("|"+s.substring(0,m.get(w)*c/m.get(l.get(0)))+"| "+w);}}
More readable version:
import java.util.*;
class F{
public static void main(String[]a)throws Exception{
StringBuffer b=new StringBuffer();for(int c;(c=System.in.read())>0;b.append((char)c));
final Map<String,Integer>m=new HashMap();for(String w:b.toString().toLowerCase().split("(\\b(.|the|and|of|to|i[tns]|or)\\b|\\W)+"))m.put(w,m.get(w)!=null?m.get(w)+1:1);
List<String>l=new Vector(m.keySet());Collections.sort(l,new Comparator(){public int compare(Object l,Object r){return m.get(r)-m.get(l);}});
int c=76-l.get(0).length();String s=new String(new char[c]).replace('\0','_');System.out.println(" "+s);
for(String w:l.subList(0,22))System.out.println("|"+s.substring(0,m.get(w)*c/m.get(l.get(0)))+"| "+w);
}
}
Output:
_________________________________________________________________________ |_________________________________________________________________________| she |_______________________________________________________________| you |____________________________________________________________| said |_____________________________________________________| alice |_______________________________________________| was |___________________________________________| that |____________________________________| as |________________________________| her |_____________________________| with |_____________________________| at |__________________________| on |__________________________| all |_______________________| this |_______________________| for |_______________________| had |_______________________| but |______________________| be |_____________________| not |____________________| they |____________________| so |___________________| very |__________________| what
It pretty sucks that Java doesn't have String#join()
and
closures
[1] (yet).
Edit by Rotsor:
I have made several changes to your solution:
The condensed code is 688 711 684 characters long:
import java.util.*;class F{public static void main(String[]l)throws Exception{Map<String,Integer>m=new HashMap();String w="";int i=0,k=0,j=8,x,y,g=22;for(;(j=System.in.read())>0;w+=(char)j);for(String W:w.toLowerCase().split("(\\b(.|the|and|of|to|i[tns]|or)\\b|\\W)+"))m.put(W,m.get(W)!=null?m.get(W)+1:1);l=m.keySet().toArray(l);x=l.length;if(x<g)g=x;for(;i<g;++i)for(j=i;++j<x;)if(m.get(l[i])<m.get(l[j])){w=l[i];l[i]=l[j];l[j]=w;}for(;k<g;k++){x=76-l[k].length();y=m.get(l[k]);if(k<1||y*i>x*j){i=x;j=y;}}String s=new String(new char[m.get(l[0])*i/j]).replace('\0','_');System.out.println(" "+s);for(k=0;k<g;k++){w=l[k];System.out.println("|"+s.substring(0,m.get(w)*i/j)+"| "+w);}}}
The fast version (720 693 characters)
import java.util.*;class F{public static void main(String[]l)throws Exception{Map<String,Integer>m=new HashMap();String w="";int i=0,k=0,j=8,x,y,g=22;for(;j>0;){j=System.in.read();if(j>90)j-=32;if(j>64&j<91)w+=(char)j;else{if(!w.matches("^(|.|THE|AND|OF|TO|I[TNS]|OR)$"))m.put(w,m.get(w)!=null?m.get(w)+1:1);w="";}}l=m.keySet().toArray(l);x=l.length;if(x<g)g=x;for(;i<g;++i)for(j=i;++j<x;)if(m.get(l[i])<m.get(l[j])){w=l[i];l[i]=l[j];l[j]=w;}for(;k<g;k++){x=76-l[k].length();y=m.get(l[k]);if(k<1||y*i>x*j){i=x;j=y;}}String s=new String(new char[m.get(l[0])*i/j]).replace('\0','_');System.out.println(" "+s);for(k=0;k<g;k++){w=l[k];System.out.println("|"+s.substring(0,m.get(w)*i/j)+"| "+w);}}}
More readable version:
import java.util.*;class F{public static void main(String[]l)throws Exception{
Map<String,Integer>m=new HashMap();String w="";
int i=0,k=0,j=8,x,y,g=22;
for(;j>0;){j=System.in.read();if(j>90)j-=32;if(j>64&j<91)w+=(char)j;else{
if(!w.matches("^(|.|THE|AND|OF|TO|I[TNS]|OR)$"))m.put(w,m.get(w)!=null?m.get(w)+1:1);w="";
}}
l=m.keySet().toArray(l);x=l.length;if(x<g)g=x;
for(;i<g;++i)for(j=i;++j<x;)if(m.get(l[i])<m.get(l[j])){w=l[i];l[i]=l[j];l[j]=w;}
for(;k<g;k++){x=76-l[k].length();y=m.get(l[k]);if(k<1||y*i>x*j){i=x;j=y;}}
String s=new String(new char[m.get(l[0])*i/j]).replace('\0','_');
System.out.println(" "+s);
for(k=0;k<g;k++){w=l[k];System.out.println("|"+s.substring(0,m.get(w)*i/j)+"| "+w);}}
}
The version without behaviour improvements is 615 characters:
import java.util.*;class F{public static void main(String[]l)throws Exception{Map<String,Integer>m=new HashMap();String w="";int i=0,k=0,j=8,g=22;for(;j>0;){j=System.in.read();if(j>90)j-=32;if(j>64&j<91)w+=(char)j;else{if(!w.matches("^(|.|THE|AND|OF|TO|I[TNS]|OR)$"))m.put(w,m.get(w)!=null?m.get(w)+1:1);w="";}}l=m.keySet().toArray(l);for(;i<g;++i)for(j=i;++j<l.length;)if(m.get(l[i])<m.get(l[j])){w=l[i];l[i]=l[j];l[j]=w;}i=76-l[0].length();String s=new String(new char[i]).replace('\0','_');System.out.println(" "+s);for(k=0;k<g;k++){w=l[k];System.out.println("|"+s.substring(0,m.get(w)*i/m.get(l[0]))+"| "+w);}}}
[1] http://javac.infoIOUtils
instead of importing it? As far as I can see you're using it only once anyway. - Joey
m.get(w)==null
is a shorter check than m.containsKey(w)
, and you don't actually need to declare t
- you can dump it straight into the for-each construct. I think it might also be shorter to assign the new String(new char[c]).replace...
to a string, and use substring in the second call to get a slice of it. - Carl
b
a String instead of a StringBuffer. I don't want to think about what the performance would be, though (especially since you're adding one character at a time). - Michael Myers
final
before Map
? It seems extraneous to me. - Gabe
Comparator
class. @Landei: that's not possible in Java. - BalusC
m
a Map<String,Float>
? I would expect that to save 4 strokes. - Gabe
main()
anyway to execute the code. - BalusC
final Map<String,Integer>m=new HashMap();new BufferedInputReader(new InputStreamReader(System.in)){{for(String w:readLine().toLowerCase().replaceAll("\\b(.|the|and|of|to|i[tns]|or)\\b|\\W"," ").split(" +"))m.put(w,m.get(w)!=null?m.get(w)+1:1);}};
- Carl
while(ready())
in front of the for
? - Carl
(int)(((float)m.get(w)/m.get(l.get(0)))*c)
You can do this: m.get(w)*c/m.get(l.get(0))
saving 16 characters, which will have an added benefit of being always exact and not FP-dangerous. However, l.get(0) is cheating. This solution will not work with 'superlongstring' example. - Rotsor
.replaceAll("\\b(.|the|and|of|to|i[tns]|or)\\b|\\W"," ").split(" +")
with .split("(\\b(.|the|and|of|to|i[tns]|or)\\b|\\W)+")
, saving 18 characters! - Rotsor
First, a legible version in 592 characters:
object Alice {
def main(args:Array[String]) {
val s = io.Source.fromFile(args(0))
val words = s.getLines.flatMap("(?i)\\w+\\b(?<!\\bthe|and|of|to|a|i|it|in|or|is)".r.findAllIn(_)).map(_.toLowerCase)
val freqs = words.foldLeft(Map[String, Int]())((countmap, word) => countmap + (word -> (countmap.getOrElse(word, 0)+1)))
val sortedFreqs = freqs.toList.sort((a, b) => a._2 > b._2)
val top22 = sortedFreqs.take(22)
val highestWord = top22.head._1
val highestCount = top22.head._2
val widest = 76 - highestWord.length
println(" " + "_" * widest)
top22.foreach(t => {
val width = Math.round((t._2 * 1.0 / highestCount) * widest).toInt
println("|" + "_" * width + "| " + t._1)
})
}
}
The console output looks like this:
$ scalac alice.scala
$ scala Alice aliceinwonderland.txt
_________________________________________________________________________
|_________________________________________________________________________| she
|_______________________________________________________________| you
|_____________________________________________________________| said
|_____________________________________________________| alice
|_______________________________________________| was
|____________________________________________| that
|____________________________________| as
|_________________________________| her
|______________________________| at
|______________________________| with
|_____________________________| s
|_____________________________| t
|___________________________| on
|__________________________| all
|_______________________| had
|_______________________| but
|______________________| be
|______________________| not
|____________________| they
|____________________| so
|___________________| very
|___________________| what
We can do some aggressive minifying and get it down to 415 characters:
object A{def main(args:Array[String]){val l=io.Source.fromFile(args(0)).getLines.flatMap("(?i)\\w+\\b(?<!\\bthe|and|of|to|a|i|it|in|or|is)".r.findAllIn(_)).map(_.toLowerCase).foldLeft(Map[String, Int]())((c,w)=>c+(w->(c.getOrElse(w,0)+1))).toList.sort((a,b)=>a._2>b._2).take(22);println(" "+"_"*(76-l.head._1.length));l.foreach(t=>println("|"+"_"*Math.round((t._2*1.0/l.head._2)*(76-l.head._1.length)).toInt+"| "+t._1))}}
The console session looks like this:
$ scalac a.scala
$ scala A aliceinwonderland.txt
_________________________________________________________________________
|_________________________________________________________________________| she
|_______________________________________________________________| you
|_____________________________________________________________| said
|_____________________________________________________| alice
|_______________________________________________| was
|____________________________________________| that
|____________________________________| as
|_________________________________| her
|______________________________| at
|______________________________| with
|_____________________________| s
|_____________________________| t
|___________________________| on
|__________________________| all
|_______________________| had
|_______________________| but
|______________________| be
|______________________| not
|____________________| they
|____________________| so
|___________________| very
|___________________| what
I'm sure a Scala expert could do even better.
Update: In the comments Thomas gave an even shorter version, at 368 characters:
object A{def main(a:Array[String]){val t=(Map[String, Int]()/:(for(x<-io.Source.fromFile(a(0)).getLines;y<-"(?i)\\w+\\b(?<!\\bthe|and|of|to|a|i|it|in|or|is)".r findAllIn x) yield y.toLowerCase).toList)((c,x)=>c+(x->(c.getOrElse(x,0)+1))).toList.sortBy(_._2).reverse.take(22);val w=76-t.head._1.length;print(" "+"_"*w);t map (s=>"\n|"+"_"*(s._2*w/t.head._2)+"| "+s._1) foreach print}}
Legibly, at 375 characters:
object Alice {
def main(a:Array[String]) {
val t = (Map[String, Int]() /: (
for (
x <- io.Source.fromFile(a(0)).getLines
y <- "(?i)\\w+\\b(?<!\\bthe|and|of|to|a|i|it|in|or|is)".r.findAllIn(x)
) yield y.toLowerCase
).toList)((c, x) => c + (x -> (c.getOrElse(x, 0) + 1))).toList.sortBy(_._2).reverse.take(22)
val w = 76 - t.head._1.length
print (" "+"_"*w)
t.map(s => "\n|" + "_" * (s._2 * w / t.head._2) + "| " + s._1).foreach(print)
}
}
object A{def main(a:Array[String]){val t=(Map[String, Int]()/:(for(x<-io.Source.fromFile(a(0)).getLines;y<-"(?i)\\w+\\b(?<!\\bthe|and|of|to|a|i|it|in|or|is)".r findAllIn x) yield y.toLowerCase).toList)((c,x)=>c+(x->(c.getOrElse(x,0)+1))).toList.sortBy(_._2).reverse.take(22);val w=76-t.head._1.length;print(" "+"_"*w);t map (s=>"\n|"+"_"*(s._2*w/t.head._2)+"| "+s._1) foreach print}}
- Thomas Jung
including long word adjustment. Ideas borrowed from the other solutions.
Now as a script (a.scala
):
val t="\\w+\\b(?<!\\bthe|and|of|to|a|i[tns]?|or)".r.findAllIn(io.Source.fromFile(argv(0)).mkString.toLowerCase).toSeq.groupBy(w=>w).mapValues(_.size).toSeq.sortBy(-_._2)take 22
def b(p:Int)="_"*(p*(for((w,c)<-t)yield(76.0-w.size)/c).min).toInt
println(" "+b(t(0)._2))
for(p<-t)printf("|%s| %s \n",b(p._2),p._1)
Run with
scala -howtorun:script a.scala alice.txt
BTW, the edit from 314 to 311 characters actually removes only 1 character. Someone got the counting wrong before (Windows CRs?).
(let[[[_ m]:as s](->>(slurp *in*).toLowerCase(re-seq #"\w+\b(?<!\bthe|and|of|to|a|i[tns]?|or)")frequencies(sort-by val >)(take 22))[b](sort(map #(/(- 76(count(key %)))(val %))s))p #(do(print %1)(dotimes[_(* b %2)](print \_))(apply println %&))](p " " m)(doseq[[k v]s](p \| v \| k)))
Somewhat more legibly:
(let[[[_ m]:as s](->> (slurp *in*)
.toLowerCase
(re-seq #"\w+\b(?<!\bthe|and|of|to|a|i[tns]?|or)")
frequencies
(sort-by val >)
(take 22))
[b] (sort (map #(/ (- 76 (count (key %)))(val %)) s))
p #(do
(print %1)
(dotimes[_(* b %2)] (print \_))
(apply println %&))]
(p " " m)
(doseq[[k v] s] (p \| v \| k)))
Update: I have aggressively reduced the character count. Omits single-letter words per updated spec.
I envy C# and LINQ so much.
import java.util.*;import java.io.*;import static java.util.regex.Pattern.*;class g{public static void main(String[] a)throws Exception{PrintStream o=System.out;Map<String,Integer> w=new HashMap();Scanner s=new Scanner(new File(a[0])).useDelimiter(compile("[^a-z]+|\\b(the|and|of|to|.|it|in|or|is)\\b",2));while(s.hasNext()){String z=s.next().trim().toLowerCase();if(z.equals(""))continue;w.put(z,(w.get(z)==null?0:w.get(z))+1);}List<Integer> v=new Vector(w.values());Collections.sort(v);List<String> q=new Vector();int i,m;i=m=v.size()-1;while(q.size()<22){for(String t:w.keySet())if(!q.contains(t)&&w.get(t).equals(v.get(i)))q.add(t);i--;}int r=80-q.get(0).length()-4;String l=String.format("%1$0"+r+"d",0).replace("0","_");o.println(" "+l);o.println("|"+l+"| "+q.get(0)+" ");for(i=m-1;i>m-22;i--){o.println("|"+l.substring(0,(int)Math.round(r*(v.get(i)*1.0)/v.get(m)))+"| "+q.get(m-i)+" ");}}}
"Readable":
import java.util.*;
import java.io.*;
import static java.util.regex.Pattern.*;
class g
{
public static void main(String[] a)throws Exception
{
PrintStream o = System.out;
Map<String,Integer> w = new HashMap();
Scanner s = new Scanner(new File(a[0]))
.useDelimiter(compile("[^a-z]+|\\b(the|and|of|to|.|it|in|or|is)\\b",2));
while(s.hasNext())
{
String z = s.next().trim().toLowerCase();
if(z.equals(""))
continue;
w.put(z,(w.get(z) == null?0:w.get(z))+1);
}
List<Integer> v = new Vector(w.values());
Collections.sort(v);
List<String> q = new Vector();
int i,m;
i = m = v.size()-1;
while(q.size()<22)
{
for(String t:w.keySet())
if(!q.contains(t)&&w.get(t).equals(v.get(i)))
q.add(t);
i--;
}
int r = 80-q.get(0).length()-4;
String l = String.format("%1$0"+r+"d",0).replace("0","_");
o.println(" "+l);
o.println("|"+l+"| "+q.get(0)+" ");
for(i = m-1; i > m-22; i--)
{
o.println("|"+l.substring(0,(int)Math.round(r*(v.get(i)*1.0)/v.get(m)))+"| "+q.get(m-i)+" ");
}
}
}
Output of Alice:
_________________________________________________________________________
|_________________________________________________________________________| she
|_______________________________________________________________| you
|_____________________________________________________________| said
|_____________________________________________________| alice
|_______________________________________________| was
|____________________________________________| that
|____________________________________| as
|_________________________________| her
|______________________________| with
|______________________________| at
|___________________________| on
|__________________________| all
|________________________| this
|________________________| for
|_______________________| had
|_______________________| but
|______________________| be
|______________________| not
|____________________| they
|____________________| so
|___________________| very
|___________________| what
Output of Don Quixote (also from Gutenberg):
________________________________________________________________________
|________________________________________________________________________| that
|________________________________________________________| he
|______________________________________________| for
|__________________________________________| his
|________________________________________| as
|__________________________________| with
|_________________________________| not
|_________________________________| was
|________________________________| him
|______________________________| be
|___________________________| don
|_________________________| my
|_________________________| this
|_________________________| all
|_________________________| they
|________________________| said
|_______________________| have
|_______________________| me
|______________________| on
|______________________| so
|_____________________| you
|_____________________| quixote
I don't expect to score highly by using C++, but nevermind that. I'm pretty sure it hits all the requirements. Note that I used the C++0x auto
keyword for variable declaration, so adjust your complier appropriately if you decide to test my code.
Minimised version
#include <iostream>
#include <cstring>
#include <map>
using namespace std;
#define C string
#define S(x)v=F/a,cout<<#x<<C(v,'_')
#define F t->first
#define G t->second
#define O &&F!=
#define L for(i=22;i-->0;--t)
int main(){map<C,int>f;char d[230];int i=1,v;for(;i<256;i++)d[i<123?i-1:i-27]=i;d[229]=0;char w[99];while(cin>>w){for(i=0;w[i];i++)w[i]=tolower(w[i]);char*p=strtok(w,d);while(p)++f[p],p=strtok(0,d);}multimap<int,C>c;for(auto t=f.end();--t!=f.begin();)if(F!="the"O"and"O"of"O"to"O"a"O"i"O"it"O"in"O"or"O"is")c.insert(pair<int,C>(G,F));auto t=--c.end();float a=0,A;L A=F/(76.0-G.length()),a=a>A?a:A;t=--c.end();S( );L S(\n|)<<"| "<<G;}
Here's a second version that is more "C++" by using string
, not char[]
and strtok
. It's a bit larger, at 669 (+22 vs above), but I can't get it smaller at the moment so thought I'd post it anyway.
#include <iostream>
#include <map>
using namespace std;
#define C string
#define S(x)v=F/a,cout<<#x<<C(v,'_')
#define F t->first
#define G t->second
#define O &&F!=
#define L for(i=22;i-->0;--t)
#define E e=w.find_first_of(d,g);g=w.find_first_not_of(d,e);
int main(){map<C,int>f;int i,v;C w,x,d="abcdefghijklmnopqrstuvwxyz";while(cin>>w){for(i=w.size();i-->0;)w[i]=tolower(w[i]);unsigned g=0,E while(g-e>0){x=w.substr(e,g-e),++f[x],E}}multimap<int,C>c;for(auto t=f.end();--t!=f.begin();)if(F!="the"O"and"O"of"O"to"O"a"O"i"O"it"O"in"O"or"O"is")c.insert(pair<int,C>(G,F));auto t=--c.end();float a=0,A;L A=F/(76.0-G.length()),a=a>A?a:A;t=--c.end();S( );L S(\n|)<<"| "<<G;}
I've removed the full version, because I can't be bothered to keep updating it with my tweaks to the minimised version. See edit history if you're interested in the (possibly outdated) long version.
float a=0,A;L A=F/(76.0-G.length()),a=a>A?a:A;
you can eliminate a #define and shave a few strokes. - Gabe
word
, having an arbitrary length doesn't really feel right - but I'm not sure of the best way to extract cin
into a char
array, as opposed to a string
, without the risk of breaking in the middle of a word (ie, if I just pulled it in 80-char chunks). But I've put finding a "better" solution until probably tomorrow. - DMA57361
d[i-27]=0;
the same as d[229]=0;
? - Gabe
L{A=F/(76.0-G.length()),a=a>A?a:A;}
into L A=F/(76.0-G.length()),a=a>A?a:A;
. - Gabe
I believe this one if fully compliant with the question. Ignore list is here, it fully checks for line length (see exemple where I replaced Alice
by Aliceinwonderlandbylewiscarroll
througout the text making the fifth item the longest line. Even the filename is provided from command line instead of hardcoded (hardcoding it would remove about 10 chars). It has one drawback (but I believe it's ok with the question) as it compute an integer divider to make line shorter than 80 chars, the longest line is shorter than 80 characters, not exactly 80 characters. The python 3.x version does not have this defect (but is way longer).
Also I believe it is not so hard to read.
import sys,re
t=re.split("\W+(?:(?:the|and|o[fr]|to|a|i[tns]?)\W+)*",sys.stdin.read().lower())
b=sorted((-t.count(x),x)for x in set(t))[:22]
for l,w in b:print"|"+l/min(z/(78-len(e))for z,e in b)*'-'+"|",w
|----------------------------------------------------------------| she
|--------------------------------------------------------| you
|-----------------------------------------------------| said
|----------------------------------------------| aliceinwonderlandbylewiscarroll
|-----------------------------------------| was
|--------------------------------------| that
|-------------------------------| as
|----------------------------| her
|--------------------------| at
|--------------------------| with
|-------------------------| s
|-------------------------| t
|-----------------------| on
|-----------------------| all
|---------------------| this
|--------------------| for
|--------------------| had
|--------------------| but
|-------------------| be
|-------------------| not
|------------------| they
|-----------------| so
As it is not clear if we must print the max bar alone on it's line (like in sample output). Below is another one that do it, but 232 chars.
import sys,re
t=re.split("\W+(?:(?:the|and|o[fr]|to|a|i[tns]?)\W+)*",sys.stdin.read().lower())
b=sorted((-t.count(x),x)for x in set(t))[:22]
f=min(z/(78-len(e))for z,e in b)
print"",b[0][0]/f*'-'
for y,w in b:print"|"+y/f*'-'+"|",w
Using Counter class from python 3.x, there was high hopes to make it shorter (as Counter does everything that we need here). It comes out it's not better. Below is my trial 266 chars:
import sys,re,collections as c
b=c.Counter(re.split("\W+(?:(?:the|and|o[fr]|to|a|i[tns]?)\W+)*",
sys.stdin.read().lower())).most_common(22)
F=lambda p,x,w:print(p+'-'*int(x/max(z/(77.-len(e))for e,z in b))+w)
F(" ",b[0][1],"")
for w,y in b:F("|",y,"| "+w)
The problem is that collections
and most_common
are very long words and even Counter
is not short... really, not using Counter
makes code only 2 characters longer ;-(
python 3.x also introduce other constraints : dividing two integers is not an integer any more (so we have to cast to int), print is now a function (must add parenthesis), etc. That's why it comes out 22 characters longer than python2.x version, but way faster. Maybe some more experimented python 3.x coder will have ideas to shorten the code.
for l,w in b:print"|"+int(l/min(z/(76.-len(e))for z,e in b))*'-'+"|",w
in the last line to align your lines correctly. This adds five characters and makes your code adhere to the rules ("Maximize bar width within these constraints and scale the bars appropriately"). You can, however, strip a few chars by importing os instead of sys and then doing os.read(0,1e9)
instead of sys.stdin.read()
. This makes for 208 chars overall, still one of the best solutions. - Colin Emonds
I took the code of @seanizer [1], fixed a bug (he omitted the 1st output line), made some improvements to make the code more 'golfy'.
import java.util.*;
import java.util.regex.*;
import org.apache.commons.io.IOUtils;
public class WF{
public static void main(String[] a)throws Exception{
String t=IOUtils.toString(new java.net.URL(a[0]).openStream());
class W implements Comparable<W> {
String w;int f=1;W(String W){w=W;}public int compareTo(W o){return o.f-f;}
String d(float r){char[]c=new char[(int)(f/r)];Arrays.fill(c,'_');return "|"+new String(c)+"| "+w;}
}
Map<String,W>M=new HashMap<String,W>();
Matcher m=Pattern.compile("\\b\\w+\\b").matcher(t.toLowerCase());
while(m.find()){String w=m.group();W W=M.get(w);if(W==null)M.put(w,new W(w));else W.f++;}
M.keySet().removeAll(Arrays.asList("the,and,of,to,a,i,it,in,or,is".split(",")));
List<W>L=new ArrayList<W>(M.values());Collections.sort(L);int l=76-L.get(0).w.length();
System.out.println(" "+new String(new char[l]).replace('\0','_'));
for(W w:L.subList(0,22))System.out.println(w.d((float)L.get(0).f/(float)l));
}
}
Output:
_________________________________________________________________________ |_________________________________________________________________________| she |_______________________________________________________________| you |____________________________________________________________| said |_____________________________________________________| alice |_______________________________________________| was |___________________________________________| that |____________________________________| as |________________________________| her |_____________________________| with |_____________________________| at |____________________________| s |____________________________| t |__________________________| on |__________________________| all |_______________________| this |_______________________| for |_______________________| had |_______________________| but |______________________| be |_____________________| not |____________________| they |____________________| so[1] https://stackoverflow.com/questions/3169051/code-golf-word-frequency-chart/3169871#3169871
new String(new char[l]).replace('\0','_')
that's a nice trick to remember, thanks. - Sean Patrick Floyd
can probably get shorter...
bar <- function(w, l)
{
b <- rep("-", l)
s <- rep(" ", l)
cat(" ", b, "\n|", s, "| ", w, "\n ", b, "\n", sep="")
}
f <- "alice.txt"
e <- c("the", "and", "of", "to", "a", "i", "it", "in", "or", "is", "")
w <- unlist(lapply(readLines(file(f)), strsplit, s=" "))
w <- tolower(w)
w <- unlist(lapply(w, gsub, pa="[^a-z]", r=""))
u <- unique(w[!w %in% e])
n <- unlist(lapply(u, function(x){length(w[w==x])}))
o <- rev(order(n))
n <- n[o]
m <- 77 - max(unlist(lapply(u[1:22], nchar)))
n <- floor(m*n/n[1])
u <- u[o]
for (i in 1:22)
bar(u[i], n[i])
replaced b=map.get(a)
with b=map[a]
,
replaced split with matcher / iterator
def r,s,m=[:],n=0;def p={println it};def w={"_".multiply it};(new URL(this.args[0]).text.toLowerCase()=~/\b\w+\b/).each{s=it;if(!(s==~/(the|and|of|to|a|i[tns]?|or)/))m[s]=m[s]==null?1:m[s]+1};m.keySet().sort{a,b->m[b]<=>m[a]}.subList(0,22).each{k->if(n++<1){r=(m[k]/(76-k.length()));p" "+w(m[k]/r)};p"|"+w(m[k]/r)+"|"+k}
(executed as groovy script with the URL as cmd line arg. No imports required!)
Readable version here:
def r,s,m=[:],n=0;
def p={println it};
def w={"_".multiply it};
(new URL(this.args[0]).text.toLowerCase()
=~ /\b\w+\b/
).each{
s=it;
if (!(s ==~/(the|and|of|to|a|i[tns]?|or)/))
m[s] = m[s] == null ? 1 : m[s] + 1
};
m.keySet()
.sort{
a,b -> m[b] <=> m[a]
}
.subList(0,22).each{
k ->
if( n++ < 1 ){
r=(m[k]/(76-k.length()));
p " " + w(m[k]/r)
};
p "|" + w(m[k]/r) + "|" + k
}
(Edit: Props to ChristopheD for character-shaving suggestions)
import sys,re
t=re.findall('[a-z]+',"".join(sys.stdin).lower())
d=sorted((t.count(w),w)for w in set(t)-set("the and of to a i it in or is".split()))[:-23:-1]
r=min((78.-len(m[1]))/m[0]for m in d)
print'','_'*(int(d[0][0]*r-2))
for(a,b)in d:print"|"+"_"*(int(a*r-2))+"|",b
Output:
_________________________________________________________________________
|_________________________________________________________________________| she
|_______________________________________________________________| you
|____________________________________________________________| said
|____________________________________________________| alice
|______________________________________________| was
|__________________________________________| that
|___________________________________| as
|_______________________________| her
|____________________________| with
|____________________________| at
|___________________________| s
|___________________________| t
|_________________________| on
|_________________________| all
|______________________| this
|______________________| for
|______________________| had
|_____________________| but
|____________________| be
|____________________| not
|___________________| they
|__________________| so
r=min([(78.0-len(m[1]))/m[0] for m in d])
(shaves off 2 characters: min((78.0-len(m[1]))/m[0] for m in d)
). The same goes for the square brackets in line three: sorted([...
- ChristopheD
for
(shaves off 2 characters). - ChristopheD
print'',
to print the starting space on the first line; clever ;-) - ChristopheD
MATLAB 335 404 410 bytes 357 bytes. 390 bytes.
The updated code is now 335 characters instead of 404, and seems to do well for both examples.
Original Message (For code of 404 characters)
This version is a bit longer, however, it will properly scale the length of the bars if there is a word that is ridiculously long so that none of the columns go over 80.
So, my code is 357 bytes without re-scaling, and 410 long with re-scaling.
A=textscan(fopen('11.txt'),'%s','delimiter',' 0123456789,.!?-_*^:;=+\\/(){}[]@&#$%~`|"''');
s=lower(A{1});s(cellfun('length', s)<2)=[];s(ismember(s,{'the','and','of','to','it','in','or','is'}))=[];
[w,~,i]=unique(s);N=hist(i,max(i)); [j,k]=sort(N,'descend'); b=k(1:22); n=cellfun('length',w(b));
q=80*N(b)'/N(k(1))+n; q=floor(q*78/max(q)-n); for i=1:22, fprintf('%s| %s\n',repmat('_',1,l(i)),w{k(i)});end
Results:
___________________________________________________________________________| she
_________________________________________________________________| you
______________________________________________________________| said
_______________________________________________________| alice
________________________________________________| was
____________________________________________| that
_____________________________________| as
_________________________________| her
______________________________| at
______________________________| with
____________________________| on
___________________________| all
_________________________| this
________________________| for
________________________| had
________________________| but
_______________________| be
_______________________| not
_____________________| they
____________________| so
___________________| very
___________________| what
For example, replacing all instances of "you" in the Alice in Wonderland text with "superlongstringofridiculousness", my code will correctly scale the results:
____________________________________________________________________| she
_________________________________________________________| superlongstringstring
________________________________________________________| said
_________________________________________________| alice
____________________________________________| was
________________________________________| that
_________________________________| as
______________________________| her
___________________________| with
___________________________| at
_________________________| on
________________________| all
_____________________| this
_____________________| for
_____________________| had
_____________________| but
____________________| be
____________________| not
__________________| they
__________________| so
_________________| very
_________________| what
Here is the updated code written a little bit more legibly:
A=textscan(fopen('t'),'%s','delimiter','':'@');
s=lower(A{1});
s(cellfun('length', s)<2|ismember(s,{'the','and','of','to','it','in','or','is'}))=[];
[w,~,i]=unique(s);
N=hist(i,max(i));
[j,k]=sort(N,'descend');
n=cellfun('length',w(k));
q=80*N(k)'/N(k(1))+n;
q=floor(q*78/max(q)-n);
for i=1:22,
fprintf('%s| %s\n',repmat('_',1,q(i)),w{k(i)});
end
char([32:64 91:96 123:126])
- Amro
tr A-Z a-z|tr -Cs a-z "\n"|sort|egrep -v "^(the|and|of|to|a|i|it|in|or|is)$" |uniq -c|sort -r|head -22>g
n=1
while :
do
awk '{printf "|%0*s| %s\n",$1*'$n'/1e3,"",$2;}' g|tr 0 _>o
egrep -q .{80} o&&break
n=$((n+1))
done
cat o
I'm surprised nobody seems to have used the amazing * feature of printf.
cat 11-very.txt > golf.sh
|__________________________________________________________________________| she
|________________________________________________________________| you
|_____________________________________________________________| said
|______________________________________________________| alice
|_______________________________________________| was
|____________________________________________| that
|____________________________________| as
|_________________________________| her
|______________________________| with
|______________________________| at
|_____________________________| s
|_____________________________| t
|___________________________| on
|__________________________| all
|________________________| this
|_______________________| for
|_______________________| had
|_______________________| but
|______________________| be
|______________________| not
|____________________| they
|____________________| so
cat 11 | golf.sh
|_________________________________________________________________| she
|_________________________________________________________| verylongstringstring
|______________________________________________________| said
|_______________________________________________| alice
|__________________________________________| was
|_______________________________________| that
|________________________________| as
|_____________________________| her
|___________________________| with
|___________________________| at
|__________________________| s
|_________________________| t
|________________________| on
|_______________________| all
|_____________________| this
|_____________________| for
|_____________________| had
|____________________| but
|___________________| be
|___________________| not
|__________________| they
|__________________| so
Scala, 327 characters
This was adapted from mkneissl's answer [1] inspired by a Python version, though it is bigger. I'm leaving it here in case someone can make it shorter.
val f="\\w+\\b(?<!\\bthe|and|of|to|a|i[tns]?|or)".r.findAllIn(io.Source.fromFile("11.txt").mkString.toLowerCase).toSeq
val t=f.toSet[String].map(x=> -f.count(x==)->x).toSeq.sorted take 22
def b(p:Int)="_"*(-p/(for((c,w)<-t)yield-c/(76.0-w.size)).max).toInt
println(" "+b(t(0)._1))
for(p<-t)printf("|%s| %s \n",b(p._1),p._2)
[1] https://stackoverflow.com/questions/3169051/code-golf-word-frequency-chart/3182550#3182550I tried writing the code in as much idiomatic Clojure as I could so late in the night. I am not too proud of the draw-chart
function, but I guess the code will speak volumes of the succinctness of Clojure.
(ns word-freq
(:require [clojure.contrib.io :as io]))
(defn word-freq
[f]
(take 22 (->> f
io/read-lines ;;; slurp should work too, but I love map/red
(mapcat (fn [l] (map #(.toLowerCase %) (re-seq #"\w+" l))))
(remove #{"the" "and" "of" "to" "a" "i" "it" "in" "or" "is"})
(reduce #(assoc %1 %2 (inc (%1 %2 0))) {})
(sort-by (comp - val)))))
(defn draw-chart
[fs]
(let [[[w f] & _] fs]
(apply str
(interpose \newline
(map (fn [[k v]] (apply str (concat "|" (repeat (int (* (- 76 (count w)) (/ v f 1))) "_") "| " k " ")) ) fs)))))
;;; (println (draw-chart (word-freq "/Users/ghoseb/Desktop/alice.txt")))
Output:
|_________________________________________________________________________| she
|_______________________________________________________________| you
|____________________________________________________________| said
|____________________________________________________| alice
|_______________________________________________| was
|___________________________________________| that
|____________________________________| as
|________________________________| her
|_____________________________| with
|_____________________________| at
|____________________________| t
|____________________________| s
|__________________________| on
|__________________________| all
|_______________________| for
|_______________________| had
|_______________________| this
|_______________________| but
|______________________| be
|_____________________| not
|____________________| they
|____________________| so
I know, this doesn't follow the spec, but hey, this is some very clean Clojure code which is already so small :)
Java, slowly getting shorter (1500 1358 1241 1020 913 890 chars)
stripped even more white space and var name length. removed generics where possible, removed inline class and try/catch block too bad, my 900 version had a bug
import java.net.*;import java.util.*;import java.util.regex.*;import org.apache.commons.io.*;public class G{public static void main(String[]a)throws Exception{String text=IOUtils.toString(new URL(a[0]).openStream()).toLowerCase().replaceAll("\\b(the|and|of|to|a|i[tns]?|or)\\b","");final Map<String,Integer>p=new HashMap();Matcher m=Pattern.compile("\\b\\w+\\b").matcher(text);Integer b;while(m.find()){String w=m.group();b=p.get(w);p.put(w,b==null?1:b+1);}List<String>v=new Vector(p.keySet());Collections.sort(v,new Comparator(){public int compare(Object l,Object m){return p.get(m)-p.get(l);}});boolean t=true;float r=0;for(String w:v.subList(0,22)){if(t){t=false;r=p.get(w)/(float)(80-(w.length()+4));System.out.println(" "+new String(new char[(int)(p.get(w)/r)]).replace('\0','_'));}System.out.println("|"+new String(new char[(int)(((Integer)p.get(w))/r)]).replace('\0','_')+"|"+w);}}}
Readable version:
import java.net.*;
import java.util.*;
import java.util.regex.*;
import org.apache.commons.io.*;
public class G{
public static void main(String[] a) throws Exception{
String text =
IOUtils.toString(new URL(a[0]).openStream())
.toLowerCase()
.replaceAll("\\b(the|and|of|to|a|i[tns]?|or)\\b", "");
final Map<String, Integer> p = new HashMap();
Matcher m = Pattern.compile("\\b\\w+\\b").matcher(text);
Integer b;
while(m.find()){
String w = m.group();
b = p.get(w);
p.put(w, b == null ? 1 : b + 1);
}
List<String> v = new Vector(p.keySet());
Collections.sort(v, new Comparator(){
public int compare(Object l, Object m){
return p.get(m) - p.get(l);
}
});
boolean t = true;
float r = 0;
for(String w : v.subList(0, 22)){
if(t){
t = false;
r = p.get(w) / (float) (80 - (w.length() + 4));
System.out.println(" "
+ new String(new char[(int) (p.get(w) / r)]).replace('\0',
'_'));
}
System.out.println("|"
+ new String(new char[(int) (((Integer) p.get(w)) / r)]).replace('\0',
'_') + "|" + w);
}
}
}
After I finished mine, I stole some ideas from Matt :3
t=prompt().toLowerCase().replace(/\b(the|and|of|to|a|i[tns]?|or)\b/gm,'');r={};o=[];t.replace(/\b([a-z]+)\b/gm,function(a,w){r[w]?++r[w]:r[w]=1});for(i in r){o.push([i,r[i]])}m=o[0][1];o=o.slice(0,22);o.sort(function(F,D){return D[1]-F[1]});for(B in o){F=o[B];L=new Array(~~(F[1]/m*(76-F[0].length))).join('_');print(' '+L+'\n|'+L+'| '+F[0]+' \n')}
Requires print and prompt function support.
the_foo
, right? (Because then \b
breaks apart) - Joey
Gotta love the big ones...Objective-C (1070 931 905 chars)
#define S NSString
#define C countForObject
#define O objectAtIndex
#define U stringWithCString
main(int g,char**b){id c=[NSCountedSet set];S*d=[S stringWithContentsOfFile:[S U:b[1]]];id p=[NSPredicate predicateWithFormat:@"SELF MATCHES[cd]'(the|and|of|to|a|i[tns]?|or)|[^a-z]'"];[d enumerateSubstringsInRange:NSMakeRange(0,[d length])options:NSStringEnumerationByWords usingBlock:^(S*s,NSRange x,NSRange y,BOOL*z){if(![p evaluateWithObject:s])[c addObject:[s lowercaseString]];}];id s=[[c allObjects]sortedArrayUsingComparator:^(id a,id b){return(NSComparisonResult)([c C:b]-[c C:a]);}];g=[c C:[s O:0]];int j=76-[[s O:0]length];char*k=malloc(80);memset(k,'_',80);S*l=[S U:k length:80];printf(" %s\n",[[l substringToIndex:j]cString]),[[s subarrayWithRange:NSMakeRange(0,22)]enumerateObjectsUsingBlock:^(id a,NSUInteger x,BOOL*y){printf("|%s| %s\n",[[l substringToIndex:[c C:a]*j/g]cString],[a cString]);}];}
Switched to using a lot of depreciate APIs, removed some memory management that wasn't needed, more aggressive whitespace removal
_________________________________________________________________________
|_________________________________________________________________________| she
|______________________________________________________________| said
|__________________________________________________________| you
|____________________________________________________| alice
|________________________________________________| was
|_______________________________________| that
|____________________________________| as
|_________________________________| her
|______________________________| with
|______________________________| at
|___________________________| on
|__________________________| all
|________________________| this
|________________________| for
|________________________| had
|_______________________| but
|______________________| be
|______________________| not
|____________________| so
|___________________| very
|__________________| what
|_________________| they
#define S NSString
, #define C countForObject
, and use these two appropriately. Also replace calloc(80,1)
with simply malloc(80)
, since you're setting the contents straight afterwards. Also, reuse the a
parameter, to save on an int
declaration. This should get it less than 1,000 chars... - please delete me
id
instead of NSCountedSet*
etc! - kennytm
import sys
i="the and of to a i it in or is".split()
d={}
for j in filter(lambda x:x not in i,sys.stdin.read().lower().split()):d[j]=d.get(j,0)+1
w=sorted(d.items(),key=lambda x:x[1])[:-23:-1]
m=sorted(dict(w).values())[-1]
print" %s\n"%("_"*(76-m)),"\n".join(map(lambda x:("|%s| "+x[0])%("_"*((76-m)*x[1]/w[0][1])),w))
f=scan("stdin","ch")
u=unlist
s=strsplit
a=u(s(u(s(tolower(f),"[^a-z]")),"^(the|and|of|to|it|in|or|is|.|)$"))
v=unique(a)
r=sort(sapply(v,function(i) sum(a==i)),T)[2:23] #the first item is an empty string, just skipping it
w=names(r)
q=(78-max(nchar(w)))*r/max(r)
cat(" ",rep("_",q[1])," \n",sep="")
for(i in 1:22){cat("|",rep("_",q[i]),"| ",w[i],"\n",sep="")}
The output is:
_________________________________________________________________________
|_________________________________________________________________________| she
|_______________________________________________________________| you
|____________________________________________________________| said
|_____________________________________________________| alice
|_______________________________________________| was
|___________________________________________| that
|____________________________________| as
|________________________________| her
|_____________________________| at
|_____________________________| with
|__________________________| on
|__________________________| all
|_______________________| this
|_______________________| for
|_______________________| had
|_______________________| but
|______________________| be
|_____________________| not
|____________________| they
|____________________| so
|___________________| very
|__________________| what
And if "you" is replaced by something longer:
____________________________________________________________
|____________________________________________________________| she
|____________________________________________________| veryverylongstring
|__________________________________________________| said
|___________________________________________| alice
|______________________________________| was
|___________________________________| that
|_____________________________| as
|__________________________| her
|________________________| at
|________________________| with
|______________________| on
|_____________________| all
|___________________| this
|___________________| for
|___________________| had
|__________________| but
|__________________| be
|__________________| not
|________________| they
|________________| so
|_______________| very
|_______________| what
290 characters in python (text read from standard input)
import sys,re
c={}
for w in re.findall("[a-z]+",sys.stdin.read().lower()):c[w]=c.get(w,0)+1-(","+w+","in",a,i,the,and,of,to,it,in,or,is,")
r=sorted((-v,k)for k,v in c.items())[:22]
sf=max((76.0-len(k))/v for v,k in r)
print" "+"_"*int(r[0][0]*sf)
for v,k in r:print"|"+"_"*int(v*sf)+"| "+k
but... after reading other solutions I all of a sudden realized that efficiency was not a request; so this is another shorter and much slower one (255 characters)
import sys,re
w=re.findall("\w+",sys.stdin.read().lower())
r=sorted((-w.count(x),x)for x in set(w)-set("the and of to a i it in or is".split()))[:22]
f=max((76.-len(k))/v for v,k in r)
print" "+"_"*int(f*r[0][0])
for v,k in r:print"|"+"_"*int(f*v)+"| "+k
and after some more reading other solutions...
import sys,re
w=re.findall("\w+",sys.stdin.read().lower())
r=sorted((-w.count(x),x)for x in set(w)-set("the and of to a i it in or is".split()))[:22]
f=max((76.-len(k))/v for v,k in r)
print"","_"*int(f*r[0][0])
for v,k in r:print"|"+"_"*int(f*v)+"|",k
And now this solution is almost byte-per-byte identical to Astatine's one :-D
address rxpipe 'pipe (end ?) < Alice.txt',
'|regex split /[^a-zA-Z]/', -- split at non alphbetic character
'|locate 2', -- discard words shorter that 2 char
'|xlate lower', -- translate all words to lower case
, -- discard list words that match list
'|regex not match /^(the||and||of||to||it||in||or||is)$/',
'|l:lookup autoadd before count', -- accumulate and count words
'? l:', -- no master records to feed into lookup
'? l:', -- list of counted words comes here
, -- columns 1-10 hold count, 11-n hold word
'|sort 1.10 d', -- sort in desending order by count
'|take 22', -- take first 22 records only
'|array wordlist', -- store into a rexx array
'|count max', -- get length of longest record
'|var maxword' -- save into a rexx variable
parse value wordlist[1] with count 11 . -- get frequency of first word
barunit = count % (76-(maxword-10)) -- frequency units per chart bar char
say ' '||copies('_', (count+barunit)%barunit) -- first line of the chart
do cntwd over wordlist
parse var cntwd count 11 word -- get word frequency and the word
say '|'||copies('_', (count+barunit)%barunit)||'| '||word||' '
end
The output produced
________________________________________________________________________________ |________________________________________________________________________________| she |_____________________________________________________________________| you |___________________________________________________________________| said |__________________________________________________________| alice |____________________________________________________| was |________________________________________________| that |________________________________________| as |____________________________________| her |_________________________________| at |_________________________________| with |______________________________| on |_____________________________| all |__________________________| this |__________________________| for |__________________________| had |__________________________| but |________________________| be |________________________| not |_______________________| they |______________________| so |_____________________| very |_____________________| what[1] http://ipages.iland.net/~jimj
This Ruby version handles "superlongstringstring". (The first two lines are almost identical to the previous Ruby programs.)
It must be run this way:
ruby -n0777 golf.rb Alice.txt
W=($_.upcase.scan(/\w+/)-%w(THE AND OF TO A I IT
IN OR IS)).group_by{|x|x}.map{|k,v|[-v.size,k]}.sort[0,22]
u=proc{|m|"_"*(W.map{|n,s|(76.0-s.size)/n}.max*m)}
puts" "+u[W[0][0]],W.map{|n,s|"|%s| "%u[n]+s}
The third line creates a closure or lambda that yields a correctly scaled string of underscores:
u = proc{|m| "_" * (W.map{|n,s| (76.0 - s.size)/n}.max * m) }
.max
is used instead of .min
because the numbers are negative.
wc -c
), nice work! - ChristopheD
Improving on the shell version posted earlier, I can get it down to 213 characters:
tr A-Z a-z|tr -Cs a-z \\n|sort|egrep -v '^(the|and|of|to|a|i|it|in|or|is)$'|uniq -c|sort -rn|sed 22q>g
n=1
>o
until egrep -q .{80} o
do
awk '{printf "|%0*d| %s\n",$1*'$n'/1e3,0,$2}' g|tr 0 _>o
((n++))
done
cat o
In order to get the upper outline on the top bar, I had to expand it to 240 characters:
tr A-Z a-z|tr -Cs a-z \\n|sort|egrep -v "^(the|and|of|to|a|i|it|in|or|is)$"|uniq -c|sort -r|sed 1p\;22q>g
n=1
>o
until egrep -q .{80} o
do
awk '{printf "|%0*d| %s\n",$1*'$n'/1e3,0,NR==1?"":$2}' g|sed '1s,|, ,g'|tr 0 _>o
((n++))
done
cat o
Adding some -i flags may drop the overly long tr A-Z a-z| step; the spec said nothing about the case displayed, and uniq -ci drops any case differences.
egrep -oi [a-z]+|egrep -wiv 'the|and|o[fr]|to|a|i[tns]?'|sort|uniq -ci|sort -nr|head -22|perl -lape'($f,$w)=@F;$.>1or($q,$x)=($f,76-length$w);$b="_"x($f/$q*$x);$_="|$b| $w ";$.>1or$_=" $b\n$_"'
That's minus 11 for the tr plus 2 for the -i's compared to the original 206 chars.
edit: minus 3 for the \\b which can be left out as pattern matching will commence on a boundary anyway.
sort gives lower case first, and uniq -ci takes the first occurence, so the only real change in output will be that Alice retains her upper case initial.
Go, 613 chars, could probably be much smaller:
package main
import(r "regexp";. "bytes";. "io/ioutil";"os";st "strings";s "sort";. "container/vector")
type z struct{c int;w string}
func(e z)Less(o interface{})bool{return o.(z).c<e.c}
func main(){b,_:=ReadAll(os.Stdin);g:=r.MustCompile
c,m,x:=g("[A-Za-z]+").AllMatchesIter(b,0),map[string]int{},g("the|and|of|it|in|or|is|to")
for w:=range c{w=ToLower(w);if len(w)>1&&!x.Match(w){m[string(w)]++}}
o,y:=&Vector{},0
for k,v:=range m{o.Push(z{v,k});if v>y{y=v}}
s.Sort(o)
for i,v:=range *o{if i>21{break};x:=v.(z);c:=int(float(x.c)/float(y)*80)
u:=st.Repeat("_",c);if i<1{println(" "+u)};println("|"+u+"| "+x.w)}}
I feel so dirty.
perl, 188 characters
The perl version above (as well as any regexp splitting based version) can get a few bytes shorter by including the list of forbidden words as negative lookahead assertions, rather than as a separate list. Furthermore the trailing semicolon can be left out.
I also included some other suggestions (- instead of <=>, for/foreach, dropped "keys") to get to
$c{$_}++for grep{$_}map{lc=~/\b(?!(?:the|and|a|of|or|i[nts]?|to)\b)[a-z]+/g}<>;@s=sort{$c{$b}-$c{$a}}%c;$f=76-length$s[0];say$"."_"x$f;say"|"."_"x($c{$_}/$c{$s[0]}*$f)."| $_ "for@s[0..21]
I don't know perl, but I presume that the (?!(?:...)\b) may lose the ?: if the handling around it is fixed.
foo_the_bar
won't get the stop words removed (because of \b
). - Joey
I think it can be made a little bit shorter, but still no idea how.
|q s f m|q:=Bag new. f:=FileStream stdin. m:=0.[f atEnd]whileFalse:[s:=f nextLine.(s notNil)ifTrue:[(s tokenize:'\W+')do:[:i|(((i size)>1)&({'the'.'and'.'of'.'to'.'it'.'in'.'or'.'is'}includes:i)not)ifTrue:[q add:(i asLowercase)]. m:=m max:(i size)]]].(q:=q sortedByCount)from:1to:22 do:[:i|'|'display.((i key)*(77-m)//(q first key))timesRepeat:['='display].('| %1'%{i value})displayNl]
Lua solution: 478 characters.
t,u={},{}for l in io.lines()do
for w in l:gmatch("%a+")do
w=w:lower()if not(" the and of to a i it in or is "):find(" "..w.." ")then
t[w]=1+(t[w]or 0)end
end
end
for k,v in next,t do
u[#u+1]={k,v}end
table.sort(u,function(a,b)return a[2]>b[2]end)m,n=u[1][2],math.min(#u,22)for w=80,1,-1 do
s=""for i=1,n do
a,b=u[i][1],w*u[i][2]/m
if b+#a>=78 then s=nil break end
s2=("_"):rep(b)if i==1 then
s=s.." " ..s2.."\n"end
s=s.."|"..s2.."| "..a.."\n"end
if s then print(s)break end end
Readable version:
t,u={},{}
for line in io.lines() do
for w in line:gmatch("%a+") do
w = w:lower()
if not (" the and of to a i it in or is "):find(" "..w.." ") then
t[w] = 1 + (t[w] or 0)
end
end
end
for k, v in pairs(t) do
u[#u+1]={k, v}
end
table.sort(u, function(a, b)
return a[2] > b[2]
end)
local max = u[1][2]
local n = math.min(#u, 22)
for w = 80, 1, -1 do
s=""
for i = 1, n do
f = u[i][2]
word = u[i][1]
width = w * f / max
if width + #word >= 78 then
s=nil
break
end
s2=("_"):rep(width)
if i==1 then
s=s.." " .. s2 .."\n"
end
s=s.."|" .. s2 .. "| " .. word.."\n"
end
if s then
print(s)
break
end
end
foreach w [regexp -all -inline {[a-z]+} [string tolower [read stdin]]] {if {[lsearch {the and of to it in or is a i} $w]>=0} {continue};if {[catch {incr Ws($w)}]} {set Ws($w) 1}}
set T [lrange [lsort -decreasing -stride 2 -index 1 -integer [array get Ws]] 0 43]
foreach {w c} $T {lappend L [string length $w];lappend C $c}
set N [tcl::mathfunc::max {*}$L]
set C [lsort -integer $C]
set M [lindex $C end]
puts " [string repeat _ [expr {int((76-$N) * [lindex $T 1] / $M)}]] "
foreach {w c} $T {puts "|[string repeat _ [expr {int((76-$N) * $c / $M)}]]| $w"}
Or, more legibly
foreach w [regexp -all -inline {[a-z]+} [string tolower [read stdin]]] {
if {[lsearch {the and of to a i it in or is} $w] >= 0} { continue }
if {[catch {incr words($w)}]} {
set words($w) 1
}
}
set topwords [lrange [lsort -decreasing -stride 2 -index 1 -integer [array get words]] 0 43]
foreach {word count} $topwords {
lappend lengths [string length $word]
lappend counts $count
}
set maxlength [lindex [lsort -integer $lengths] end]
set counts [lsort -integer $counts]
set mincount [lindex $counts 0].0
set maxcount [lindex $counts end].0
puts " [string repeat _ [expr {int((76-$maxlength) * [lindex $topwords 1] / $maxcount)}]] "
foreach {word count} $topwords {
set barlength [expr {int((76-$maxlength) * $count / $maxcount)}]
puts "|[string repeat _ $barlength]| $word"
}
Another T-SQL solution borrowing some ideas from Martin's solution [1] (min76- etc).
declare @ varchar(max),@w real,@j int;select s=@ into[ ]set @=(select*
from openrowset(bulk'a',single_blob)a)while @>''begin set @=stuff(@,1,
patindex('%[a-z]%',@)-1,'')+'.'set @j=patindex('%[^a-z]%',@)if @j>2insert[ ]
select lower(left(@,@j-1))set @=stuff(@,1,@j,'')end;select top(22)s,count(*)
c into # from[ ]where',the,and,of,to,it,in,or,is,'not like'%,'+s+',%'
group by s order by 2desc;select @w=min((76.-len(s))/c),@=' '+replicate(
'_',max(c)*@w)from #;select @=@+'
|'+replicate('_',c*@w)+'| '+s+' 'from #;print @
The entire solution should be on two lines (concatenate first 7), although you can cut, paste and run it as-is. Total characters = 507 (counting the line break as 1 if you save it in Unix format and execute using SQLCMD)
Assumptions:
#
[ ]
C:\windows\system32\a
And to get onto the list of solutions (<500 char), here's the "relaxed" edition at 483 characters (No vertical bars / No top bar / No trailing space after word)
declare @ varchar(max),@w real,@j int;select s=@ into[ ]set @=(select*
from openrowset(bulk'b',single_blob)a)while @>''begin set @=stuff(@,1,
patindex('%[a-z]%',@)-1,'')+'.'set @j=patindex('%[^a-z]%',@)if @j>2insert[ ]
select lower(left(@,@j-1))set @=stuff(@,1,@j,'')end;select top(22)s,count(*)
c into # from[ ]where',the,and,of,to,it,in,or,is,'not like'%,'+s+',%'
group by s order by 2desc;select @w=min((78.-len(s))/c),@=''from #;select @=@+'
'+replicate('_',c*@w)+' '+s from #;print @
declare @ varchar(max), @w real, @j int
select s=@ into[ ] -- shortcut to create table; use defined variable to specify column type
-- openrowset reads an entire file
set @=(select * from openrowset(bulk'a',single_blob) a) -- a bit shorter than naming 'BulkColumn'
while @>'' begin -- loop until input is empty
set @=stuff(@,1,patindex('%[a-z]%',@)-1,'')+'.' -- remove lead up to first A-Z char *
set @j=patindex('%[^a-z]%',@) -- find first non A-Z char. The +'.' above makes sure there is one
if @j>2insert[ ] select lower(left(@,@j-1)) -- insert only words >1 char
set @=stuff(@,1,@j,'') -- remove word and trailing non A-Z char
end;
select top(22)s,count(*)c
into #
from[ ]
where ',the,and,of,to,it,in,or,is,' not like '%,'+s+',%' -- exclude list
group by s
order by 2desc; -- highest occurence, assume no ties at 22!
-- 80 - 2 vertical bars - 2 spaces = 76
-- @w = weighted frequency
-- this produces a line equal to the length of the max occurence (max(c))
select @w=min((76.-len(s))/c),@=' '+replicate('_',max(c)*@w)
from #;
-- for each word, append it as a new line. note: embedded newline
select @=@+'
|'+replicate('_',c*@w)+'| '+s+' 'from #;
-- note: 22 words in a table should always fit on an 8k page
-- the order of processing should always be the same as the insert-orderby
-- thereby producing the correct output
print @ -- output
[1] https://stackoverflow.com/a/3173246/573261Python, 250 chars
borrowing from all the other Python snippets
import re,sys
t=re.findall("\w+","".join(sys.stdin).lower())
W=sorted((-t.count(w),w)for w in set(t)-set("the and of to a i it in or is".split()))[:22]
Z,U=W[0],lambda n:"_"*int(n*(76.-len(Z[1]))/Z[0])
print"",U(Z[0])
for(n,w)in W:print"|"+U(n)+"|",w
If you're cheeky and put the words to avoid as arguments, 223 chars
import re,sys
t=re.findall("\w+","".join(sys.stdin).lower())
W=sorted((-t.count(w),w)for w in set(t)-set(sys.argv[1:]))[:22]
Z,U=W[0],lambda n:"_"*int(n*(76.-len(Z[1]))/Z[0])
print"",U(Z[0])
for(n,w)in W:print"|"+U(n)+"|",w
Output is:
$ python alice4.py the and of to a i it in or is < 11.txt _________________________________________________________________________ |_________________________________________________________________________| she |_______________________________________________________________| you |____________________________________________________________| said |_____________________________________________________| alice |_______________________________________________| was |___________________________________________| that |____________________________________| as |________________________________| her |_____________________________| at |_____________________________| with |____________________________| s |____________________________| t |__________________________| on |__________________________| all |_______________________| this |_______________________| for |_______________________| had |_______________________| but |______________________| be |_____________________| not |____________________| they |____________________| so
Code:
m=[:]
(new URL(args[0]).text.toLowerCase()=~/\w+/).each{it==~/(the|and|of|to|a|i[tns]?|or)/?:(m[it]=1+(m[it]?:0))}
k=m.keySet().sort{a,b->m[b]<=>m[a]}
b={d,c,b->println d+'_'*c+d+' '+b}
b' ',z=77-k[0].size(),''
k[0..21].each{b'|',m[it]*z/m[k[0]],it}
Execution:
$ groovy wordcount.groovy http://www.gutenberg.org/files/11/11.txt
Output:
__________________________________________________________________________
|__________________________________________________________________________| she
|________________________________________________________________| you
|_____________________________________________________________| said
|_____________________________________________________| alice
|_______________________________________________| was
|____________________________________________| that
|____________________________________| as
|_________________________________| her
|______________________________| at
|______________________________| with
|_____________________________| s
|_____________________________| t
|___________________________| on
|__________________________| all
|________________________| this
|_______________________| for
|_______________________| had
|_______________________| but
|______________________| be
|______________________| not
|____________________| they
|____________________| so
N.B. this follows relaxed rules re: long strings
{t::y;{(-1')t#(.:)[b],'(!:)[b:"|",/:(((_)70*x%(*:)x)#\:"_"),\:"|"];}desc(#:')(=)($)(`$inter\:[(,/)" "vs'" "sv/:"'"vs'a(&)0<(#:')a:(_:')read0 -1!x;52#.Q.an])except`the`and`of`to`a`i`it`in`or`is`}
the function takes two arguments: one a file containing the text and the other is the number of lines of the chart to display
q){t::y;{(-1')t#(.:)[b],'(!:)[b:"|",/:(((_)70*x%(*:)x)#\:"_"),\:"|"];}desc(#:')(=)($)(`$inter\:[(,/)" "vs'" "sv/:"'"vs'a(&)0<(#:')a:(_:')read0 -1!x;52#.Q.an])except`the`and`of`to`a`i`it`in`or`is`}[`a.txt;20]
output
|______________________________________________________________________|she
|____________________________________________________________|you
|__________________________________________________________|said
|___________________________________________________|alice
|_____________________________________________|was
|_________________________________________|that
|__________________________________|as
|_______________________________|her
|_____________________________|with
|____________________________|at
|___________________________|t
|___________________________|s
|_________________________|on
|_________________________|all
|_______________________|this
|______________________|for
|______________________|had
|_____________________|but
|_____________________|be
|_____________________|not
She
==she
- ChristopheDs
andt
are represented. - indivbar+word
scaling in general accounts for about 50/60 extra chars. - ChristopheDsed 's/\byou\b/superlongstring/gI' 11.txt
portion... :) - gnarf"foo_the"
in the input string correctly? According to the spec onlyfoo
needs to remain. However, a few regex-based splitting implementations (mine included) yieldfoo
andthe
(I'm going to fix it, but it might be a pretty common error as well). [And yes, bar scaling is exactly 50 additional chars here ;-)] - Joey