Unixtrix: Word frequency in plaintext

perl -pe "tr/[A-Z]/[a-z]/; s/[\!\/\[\];?\'\",\-\$\.\(\)]/ /gs; s/ +/\n/g;" < sotu2007.text | sort | uniq -c | sort -nr > sotu2007freq.text

Context:

David Banash and I wanted to see what issues are on the radar at WIU, based on the content of Big Al’s State of the University speech. I pasted the speech from the web into a text editor, saved it as “sotu2007.text”, then ran the Unix trick to write a word-frequency list to “sotu2007freq.text”. (The first substantial word? “Quad,” as in “Quad Cities.”)

Explanation:

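Use Perl to lowercase everything, replace punctuation with spaces, and turn each run of spaces into a newline, so that every word lands on its own line. (Yeah, that regular expression looks gnarly; it’s just all the backslashing. I forget how to escape a whole regex.) Pipe that to sort, then to uniq -c, which collapses adjacent identical lines and prefixes each with its count; that is why the sort has to come first. Finally, sort -nr orders the result numerically, largest count first.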
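If the backslashing is the off-putting part, the same idea can be sketched without Perl, using tr inside single quotes so that nothing needs escaping. This is the common tr-based variant, not the command I actually ran, and it is a bit more aggressive: anything that is not a letter (digits included) counts as a separator. The function name word_freq is made up for this example.

```shell
# word_freq: hypothetical helper name, not part of the original one-liner.
# Reads text on stdin, prints "count word" lines, most frequent first.
word_freq() {
  tr '[:upper:]' '[:lower:]' |   # fold everything to lowercase
    tr -cs '[:alpha:]' '\n' |    # each run of non-letters becomes a single newline
    sed '/^$/d' |                # drop the empty line left by a leading non-letter
    sort |                       # group identical words together
    uniq -c |                    # count each group
    sort -nr                     # biggest counts first
}
```

With that defined in the shell, the equivalent of the command at the top would be:

word_freq < sotu2007.text > sotu2007freq.text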

Output:

Here are the first few lines.

135 the
130 and
104 we
99 to
[etc.]
