OpenStudy (anonymous):

I need help with Python for this task. I am creating a program that determines word frequency from a .txt file. I managed to load the text into Python without punctuation, all lowercase, and excluding any numbers that may have been in the text. Can anybody help me figure out how to count the words and how many times each has been repeated? Thanks.

OpenStudy (anonymous):

Can't you load the .txt file as a list, then use something like: "for word in word_list: if word == given_search: count += 1" and then return the count? I did a similar problem and used that kind of syntax.
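As a minimal sketch, the list-based search described above might look like this (the function name and the sample text are assumptions, not from the original post):

    def count_occurrences(word_list, given_search):
        """Count how many times given_search appears in word_list."""
        count = 0
        for word in word_list:
            if word == given_search:
                count += 1
        return count

    words = "day by day the count grows day by day".split()
    print(count_occurrences(words, "day"))  # 4

Note that this scans the whole list once per word you search for, which is why the dictionary approach below scales better when you want counts for every word.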

OpenStudy (anonymous):

Use a dictionary with the words as the keys and the counts as the values. When you find a word, increment its count. If it wasn't already in the dictionary, add it with a count of 1.

OpenStudy (anonymous):

@dmancine, can you elaborate on how to use the dictionary for this? Do I just convert the list of words into a dictionary and then count from there? @bmp, does given_search mean I would be indexing the words in my list? I have already converted the file into a list.

OpenStudy (anonymous):

given_search is just a placeholder variable for a given word to search for. It makes the code somewhat easier to understand, at least for me, although dicts are generally better than lists for search operations. What dmancine is describing is probably along the lines of MIT's code for a problem set, i.e.: http://codepad.org/mWyxc59K . I don't know exactly how much cheaper the lookup is compared to searching a list, but likely a lot, since a dict gives you constant-time access for each search.

OpenStudy (anonymous):

The get_frequency_dict() function is exactly the same thing I'm describing, but it takes a sequence. You could convert the text into a sequence (list) of words, but that's an extra step. Since you're already iterating over the words, just add them to the dictionary as you find them. Let's say your text is "day by day". I'll assume you already know how to iterate over those words individually. You get the first word, "day", and you look in your dictionary (I'll assume you created an empty dictionary) and you don't find it as a key. So you add "day" to your dictionary with a count of 1. Same for "by". When you get to the second "day", your dictionary already has the key "day", so you get the existing count, 1, increment it, and store that as the new value for "day". Now your dictionary looks like { "day": 2, "by": 1 }. Those are the counts of the words in your text.
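The walkthrough above can be sketched like this (the text "day by day" comes from the post; the function name is an assumption):

    def word_frequencies(words):
        """Build a dict mapping each word to its count."""
        counts = {}
        for word in words:
            if word in counts:
                counts[word] += 1   # existing key: increment its count
            else:
                counts[word] = 1    # new key: start at 1
        return counts

    print(word_frequencies("day by day".split()))  # {'day': 2, 'by': 1}
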

OpenStudy (anonymous):

@dmancine, I could build the dictionary without putting the words in a set first, right?

OpenStudy (anonymous):

You said you managed to load the text into Python. I'm assuming you get a line at a time, as a string. You then have to break that string apart into words (delimited by whitespace). Once you have one of those words, just put it in the dictionary as I described. There's no need for intermediate storage for the words; otherwise you'd put the words into something like a list, then read them back out and put them into the dictionary. Cut out the middleman.
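A sketch of that line-at-a-time approach: split each line on whitespace and update the dictionary directly, with no intermediate list (the function name is an assumption, and the file name in the usage comment is hypothetical):

    def count_words(lines):
        """Count words across an iterable of text lines, updating the dict directly."""
        counts = {}
        for line in lines:
            for word in line.split():  # split() with no args splits on any whitespace
                counts[word] = counts.get(word, 0) + 1
        return counts

    # Usage with a real file (hypothetical name):
    # with open("input.txt") as f:
    #     counts = count_words(f)

Using counts.get(word, 0) collapses the "is it already a key?" check and the increment into one line.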

OpenStudy (anonymous):

thanks for the help!

OpenStudy (anonymous):

Here is my solution using the Natural Language Toolkit ( http://nltk.org ):

    # wordcount.py
    """Uses NLTK to create a frequency distribution of words in a string."""
    import nltk

    a_string = "This is a string. All words in this string will be counted."

    def word_frequency_distribution(a_string):
        # Create an empty frequency count
        frequency_count = {}
        # Create a sentence list
        sentences = nltk.sent_tokenize(a_string)
        # Tokenize each sentence to get the individual words
        for sentence in sentences:
            tokens = nltk.word_tokenize(sentence)
            # Count the word occurrences
            for word in tokens:
                word = word.lower()  # make sure all counted words are lowercase
                if word not in frequency_count:
                    frequency_count[word] = 0  # put the word in the count dict
                frequency_count[word] += 1  # count the occurrence
        return frequency_count

    dist = word_frequency_distribution(a_string)
    for k, v in dist.items():
        print(k, '=', v)

OpenStudy (anonymous):

http://nltk.org is the Natural Language Toolkit. NLTK documentation: http://www.nltk.org/book

Can't find your answer? Make a FREE account and ask your own questions, OR help others and earn volunteer hours!

Join our real-time social learning platform and learn together with your friends!