perl -pe 'tr/A-Z/a-z/; s/[!\/\[\];?\x27",\-\$.()]/ /g; s/ +/\n/g;' < sotu2007.text | sort | uniq -c | sort -nr > sotu2007freq.text
David Banash and I wanted to see what issues are on the radar at WIU, based on the content of Big Al’s State of the University speech. I pasted the speech from the web into a text editor and saved it as “sotu2007.text”, then ran the Unix trick above to write a word-frequency list to “sotu2007freq.text”. (The first substantial word? “Quad,” as in “Quad Cities.”)
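Perl lowercases the text, replaces punctuation with spaces, and turns each run of spaces into a newline, leaving one word per line. (Yeah, that regular expression looks gnarly; it’s just all the backslashing. I forget how to escape a whole regex.) Pass that to sort, which puts identical words on adjacent lines, then to uniq -c, which collapses each run of repeated lines into a single line prefixed with its count. Finally, sort -nr orders the result numerically, highest count first.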
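To watch the stages work on something smaller than a whole speech, here is a rough sketch of the same pipeline as a shell function. It is not the command from the title: the name wordfreq and the sample sentence are made up for illustration, and it swaps Perl for tr, which treats every character that is not a letter, digit, or apostrophe as a word break instead of using the hand-picked punctuation list.

```shell
# Hypothetical helper (not from the original command): word frequencies from stdin.
# Assumption: tr stands in for the Perl step, so ANY non-alphanumeric character
# except the apostrophe splits words, not just the punctuation listed in the regex.
wordfreq() {
  tr '[:upper:]' '[:lower:]' |    # lowercase everything
    tr -cs "[:alnum:]'" '\n' |    # squeeze each run of other characters into one newline
    sort |                        # put identical words on adjacent lines
    uniq -c |                     # collapse each run into "count word"
    sort -nr                      # highest counts first
}

# Made-up sample input: "the" appears three times, "quad" twice.
printf 'The quad, the Quad Cities; the end.\n' | wordfreq
```

On the sample sentence the top line should pair the count 3 with "the", followed by 2 with "quad". Dropping the first sort shows why it is there: uniq -c only merges lines that are next to each other, so unsorted input yields several separate counts for the same word.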
Here are the first few lines of sotu2007freq.text.