Python 2 :: word counting

Scholars,

Your assignment is to produce the top 10 words from decl.txt.  The algorithm is this:

1.
for each word in the file
    if the word does not exist in the dictionary
        put the word into the dictionary with a value of 1
    else
        increment the value associated with the word
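
As a rough illustration of step 1, here is a minimal sketch. The function name count_words and the sample list are made up for the example; in the assignment the words come from decl.txt. It sticks to a single-argument print(...) so it should run the same under Python 2 and Python 3.

```python
def count_words(words):
    """Return a dictionary mapping each word to the number of times it occurs."""
    counts = {}
    for word in words:
        if word not in counts:
            # first sighting: start the count at 1
            counts[word] = 1
        else:
            # seen before: bump the existing count
            counts[word] += 1
    return counts


# hypothetical sample input, standing in for the words read from the file
sample_words = ["we", "hold", "these", "truths", "we", "the", "we"]
counts = count_words(sample_words)
print(counts["we"])
```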

2.

invert the dictionary
so instead of word->count  mappings, it will contain count->[list,of,words,with,this,count]
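
One possible way to do the inversion in step 2 is sketched below. The name invert_counts and the sample dictionary are hypothetical, chosen just for this example.

```python
def invert_counts(counts):
    """Turn word -> count into count -> [list of words with that count]."""
    inverted = {}
    for word, count in counts.items():
        if count not in inverted:
            # first word seen with this count: start a new list
            inverted[count] = []
        inverted[count].append(word)
    return inverted


# hypothetical sample input, shaped like the dictionary built in step 1
sample_counts = {"we": 3, "the": 2, "of": 2, "truths": 1}
inverted = invert_counts(sample_counts)
print(sorted(inverted[2]))
```

Several words can share one count, which is why each value in the inverted dictionary is a list rather than a single word.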

3.

sort the keys of the inverted dictionary

print out the words associated with the 10 highest counts (the 10 largest keys of the inverted dictionary)
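
Step 3 might look something like the sketch below. The name top_words and the sample data are made up for the example. It assumes "top 10" means the ten highest counts, so ties can produce more than ten words in total; the words within each count are sorted alphabetically only to make the output predictable.

```python
def top_words(inverted, how_many=10):
    """Return (count, words) pairs for the how_many highest counts."""
    # sort the keys (the counts) from largest to smallest
    keys = sorted(inverted.keys(), reverse=True)
    result = []
    for count in keys[:how_many]:
        result.append((count, sorted(inverted[count])))
    return result


# hypothetical sample input, shaped like the inverted dictionary from step 2
sample_inverted = {3: ["we"], 2: ["the", "of"], 1: ["truths", "hold"]}
for count, words in top_words(sample_inverted, how_many=2):
    print("%d: %s" % (count, ", ".join(words)))
```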

NOTE: this is not the only way to do this in Python, but I want you to have practice doing the dictionary inversion, sorting the keys, etc.

CODE FROM CLASS

1.  The code I wrote in class…more or less, with comments

# open() is a built-in function, so no import is needed to read a file

# open a file for reading
fp = open("path/to/file/file.txt", "r")

# iterate through each line
for line in fp:
    line = line.replace("\n", "")
    line = line.replace(",", "")
    line = line.lower()

    # line.split returns a list, so you can put it right into a loop
    for word in line.split(" "):
        # of course, you'll do more than print the word, you'll add it to a dictionary with the proper count, etc.
        print word

# close the file once every line has been read
fp.close()

HINTS AND ALTERNATIVES:
There is a faster way to read the file.  You can just read it all out into one big string.  That way you don’t have to go through the song and dance of processing each line.  Just do this…

fp = open("path/to/file/file.txt", "r")
fullText = fp.read()
fp.close()

fullText is now a string of the whole file.
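
A variation on the same idea, not what we did in class: a with block closes the file for you, even if the read fails. The sketch below writes a small made-up demo file first so that it runs on its own; the helper name read_full_text and the demo file are assumptions for the example.

```python
import os
import tempfile


def read_full_text(path):
    """Read the whole file at path into one string."""
    # the with block closes the file even if read() raises an error
    with open(path, "r") as fp:
        return fp.read()


# hypothetical demo file so the sketch runs on its own
demo_path = os.path.join(tempfile.mkdtemp(), "demo.txt")
with open(demo_path, "w") as out:
    out.write("When in the Course\nof human events\n")

fullText = read_full_text(demo_path)
print(len(fullText.split()))
```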

Also, here is a shortcut for stripping out everything except the words (punctuation, digits, and so on):

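import re
# turn newlines into spaces first so words on adjacent lines do not run together
fullText = fullText.lower().replace("\n", " ")
fullText = re.sub("[^a-z' ]", "", fullText)

NOTE: there is a blank space after the apostrophe (') and before the ] in the first argument.  What this says is: “for every character in fullText, if the character is NOT a through z, an apostrophe, or a blank space, then replace it with an empty string.”  A newline is not one of the characters that are kept, which is why newlines are converted to spaces beforehand; otherwise the last word on one line would be glued to the first word on the next.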
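
To see what the pattern keeps and what it throws away, here is a small sketch on two made-up strings. Notice that the hyphen is deleted, so a hyphenated word collapses into one word, and that a newline is deleted too unless it is turned into a space first.

```python
import re

# hypothetical sample text, not taken from decl.txt
sample = "We hold these Truths, to be self-evident: that all men's rights..."
cleaned = re.sub("[^a-z' ]", "", sample.lower())
print(cleaned)

# a newline is neither a letter, an apostrophe, nor a blank space, so it is removed
two_lines = "one\ntwo"
glued = re.sub("[^a-z' ]", "", two_lines.lower())
separated = re.sub("[^a-z' ]", "", two_lines.lower().replace("\n", " "))
print(glued)
print(separated)
```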

Finally, making a list of all the words can be accomplished in basically one line, thusly:

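import re

fp = open("path/to/file.txt", "r")
# newlines become spaces first so words on adjacent lines stay separate;
# split() with no argument also skips the empty strings that runs of spaces would produce
allWords = re.sub("[^a-z' ]", "", fp.read().lower().replace("\n", " ")).split()
fp.close()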
