Counting Words
Written by: Hongwei Xi (the creator of ATS)
Intro
This example gives an implementation that counts words in a given source and then sorts these words according to their frequencies.
I regard it as a good programming style to introduce type definitions for various concepts occurring in the code:
//
typedef word = string
typedef iword = (int, string)
//
typedef words = list0(word)
typedef iwords = list0(iword)
//
The type iword
is just for each pair consisting of an int-value and a
string-value.
Let us first read all the words from the standard input (STDIN):
val words =
FILEref_streamize_word(the_stdin())
At the this point, words
is a linear stream of string-values. We can
turn this stream into a linear list:
val words = listize(words)
As we do not plan to use a linear list in this example, we turn this linear list into a non-linear list:
val words = list0_vt2t(words)
Then we can merge-sort the list of words obtained so far:
val words = mergesort(words)
At this point, we have a list of words sorted according to the standard lexicographic ordering on string-values.
The next step is to turn the sorted list of words into a list of iword-values:
val iwords =
(
loop(words)
) where
{
fun
loop(xs: words): iwords =
(
case+ xs of
| nil() => nil()
| cons(x0, xs) => loop2(xs, x0, 1, nil())
) (* end of [loop] *)
and
loop2
( xs: words
, x0: word, i0: int
, res: iwords): iwords =
(
case+ xs of
| nil() =>
cons((i0, x0), res)
| cons(x1, xs) =>
if x0 = x1
then loop2(xs, x0, i0+1, res)
else loop2(xs, x1, 1, cons((i0, x0), res))
) (* end of [loop2] *)
}
For each iword-value in iwords
, the int-value indicates the number of
times the string-value appears in words
.
We can now sort iwords
based on a specific order:
val iwords =
(
mergesort(iwords)
) where
{
implement
list0_mergesort$cmp<iword>
(iw1, iw2) =
let
val i1 = iw1.0
val i2 = iw2.0
in
if i1 > i2 then ~1 else (if i1 < i2 then 1 else compare(iw1.1, iw2.1))
end
}
Note that the specified order indicates that one iword-value iw1
is
less than another one iw2
if the int-value in iw1
is greater than
that in iw2
or the two int-values are equal and the string-value in
iw1
is less than that in iw2
.
Let print out the first 250 (or fewer) words together with their frquencies:
//
#define N 250
//
val
_(*bool*) =
(
list0_iforall(iwords)
) where
{
implement
list0_iforall$test<iword>
(i0, iw) =
if
(i0 < N)
then (println!("word#", i0+1, ": ", iw.1, "(", iw.0, ")"); true)
else false
} (* end of [val] *)
//
One can execute make
to generate WordFrqncyCnt_dats
. For instance,
the following command-line should count the words in the novel Moby Dick
by Herman Melville:
wget -q -O - "http://www.gutenberg.org/files/2701/2701-0.txt" | ./WordFrqncyCnt_dats
The following output is expected (where each word is followed by its frequency in parentheses):
word#1: the(14715)
word#2: of(6742)
word#3: and(6517)
word#4: a(4805)
word#5: to(4707)
word#6: in(4241)
word#7: that(3100)
word#8: it(2536)
word#9: his(2532)
word#10: i(2127)
word#11: he(1900)
word#12: s(1825)
word#13: but(1823)
word#14: with(1770)
word#15: as(1753)
word#16: is(1751)
word#17: was(1646)
word#18: for(1644)
word#19: all(1545)
word#20: this(1443)
word#21: at(1335)
word#22: whale(1245)
...
word#53: man(530)
word#54: up(526)
word#55: into(523)
word#56: ship(519)
word#57: ahab(517)
word#58: more(509)
word#59: if(501)
word#60: them(474)
word#61: ye(473)
word#62: we(469)
word#63: sea(455)
word#64: old(452)
word#65: would(432)
word#66: other(431)
...
word#80: boat(337)
word#81: long(334)
word#82: time(334)
word#83: her(332)
word#84: captain(329)
...
word#106: whales(272)
word#107: thou(271)
word#108: after(270)
word#109: again(263)
word#110: stubb(261)
word#111: how(259)
word#112: did(258)
word#113: your(258)
word#114: may(255)
word#115: queequeg(253)
word#116: little(249)
word#117: can(247)
word#118: round(247)
word#119: while(246)
word#120: sperm(245)
word#121: three(245)
word#122: men(244)
word#123: say(244)
word#124: first(239)
word#125: through(235)
word#126: us(234)
word#127: every(232)
word#128: well(230)
word#129: being(225)
word#130: much(224)
word#131: where(223)
word#132: off(220)
word#133: could(217)
word#134: good(216)
word#135: hand(215)
word#136: same(215)
word#137: our(211)
word#138: side(208)
word#139: ever(206)
word#140: never(206)
word#141: himself(205)
word#142: look(205)
word#143: own(205)
word#144: deck(199)
word#145: starbuck(199)
word#146: almost(197)
word#147: go(194)
word#148: even(193)
word#149: water(190)
word#150: thing(188)
word#151: away(186)
word#152: should(185)
word#153: too(185)
word#154: might(183)
word#155: come(180)
word#156: day(179)
word#157: made(178)
word#158: pequod(178)
word#159: life(176)
word#160: world(176)
word#161: sir(175)
word#162: fish(171)
word#163: many(168)
word#164: among(167)
word#165: far(165)
word#166: seen(165)
word#167: back(164)
word#168: without(164)
word#169: line(160)
word#170: let(158)
word#171: oh(157)
word#172: right(157)
word#173: cried(156)
word#174: eyes(156)
word#175: nor(156)
word#176: aye(155)
word#177: god(153)
word#178: know(153)
word#179: part(153)
word#180: night(152)
...
word#184: boats(147)
word#185: air(143)
word#186: crew(141)
...
word#196: whaling(133)
...
It is not surprising to see the word whale
as the first noun in this list
(word#22): The novel "Moby Dick" is all about whales and whaling!
Happy programming in ATS-Temptory!!!