Unixtrix: Word frequency in plaintext

perl -pe "tr/[A-Z]/[a-z]/; s/[\!\/\[\];?\'\",\-\$\.\(\)]/ /gs; s/ +/\n/g;" < sotu2007.text | sort | uniq -c | sort -nr > sotu2007freq.text


David Banash and I wanted to see what issues are on the radar at WIU, based on the content of Big Al’s State of the University speech. I pasted the speech from the web into a text editor and saved it as “sotu2007.text”, then ran the Unix trick above to write a word-frequency list to “sotu2007freq.text”. (The first substantial word? “Quad,” as in “Quad Cities.”)


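Use Perl to lowercase everything, replace punctuation with spaces, and turn each run of spaces into a newline, so every word lands on its own line. (Yeah, that regular expression looks gnarly; it’s just all the backslashing. I forget how to escape a whole regex.) Pipe that to sort, then to uniq -c, which counts repeated adjacent lines (that’s why the sort has to come first). A final sort -nr orders the result numerically in reverse, so the most frequent words come out on top.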
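If the backslashing gets in the way, here is a rough sketch of the same idea that swaps the Perl step for two tr calls. It is a different technique from the one-liner above, not a drop-in copy: it treats every non-alphanumeric character as a word break, so it splits on a few characters the Perl regex leaves alone (colons, for instance). The function name wordfreq is made up for illustration.

```shell
# Hypothetical helper: read text on stdin, print "count word" lines,
# most frequent first. Uses tr instead of the Perl one-liner.
wordfreq() {
  tr '[:upper:]' '[:lower:]' |   # fold everything to lowercase
  tr -cs '[:alnum:]' '\n' |      # squeeze each run of non-alphanumerics into one newline
  sed '/^$/d' |                  # drop the empty line a leading separator can leave
  sort |                         # put identical words next to each other
  uniq -c |                      # count each run of identical lines
  sort -nr                       # order by count, highest first
}

# Tiny sample instead of the real speech file.
printf 'The cat and the hat. The END!\n' | wordfreq
```

On the sample, "the" comes out on top with a count of 3. For the real file it would be something like wordfreq < sotu2007.text > sotu2007freq.text.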


Here are the first few lines.

135 the
130 and
104 we
99 to
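The top of the list is all function words, so finding the first substantial word means scrolling past them. A minimal sketch of one way to skip them, assuming the "count word" layout that uniq -c produces; the skip_stopwords name and the short stop list are invented here, and the sample counts are toy data, not numbers from the speech.

```shell
# Hypothetical helper: drop lines whose word (second field) is in a
# small, hand-picked stop list. Extend the list to taste.
skip_stopwords() {
  awk -v stop="the and we to of a in our is that for" '
    BEGIN {
      # Build a lookup table from the space-separated stop list.
      n = split(stop, words, " ")
      for (i = 1; i <= n; i++) skip[words[i]] = 1
    }
    !($2 in skip)   # print only lines whose word is not in the table
  '
}

# Toy input with made-up counts, in the same shape as the frequency file.
printf '3 the\n2 budget\n1 and\n' | skip_stopwords
```

Against the real output it would be run as skip_stopwords < sotu2007freq.text, which keeps the existing order and simply omits the listed words.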
