Python Forum
NLTK create corpora - Printable Version

+- Python Forum (https://python-forum.io)
+-- Forum: Python Coding (https://python-forum.io/forum-7.html)
+--- Forum: General Coding Help (https://python-forum.io/forum-8.html)
+--- Thread: NLTK create corpora (/thread-640.html)



NLTK create corpora - pythlang - Oct-26-2016

Hi guys,

Pretty straightforward and most likely easy question for you guys here:

I'm trying to create and use my own corpus, saved as .txt files, but they are not being found.

There are two files, and their paths are as follows:

/jordanxxx/nltk_data/corpora/short_reviews/neg/neg.txt
/jordanxxx/nltk_data/corpora/short_reviews/pos/pos.txt

import nltk
import random
from nltk.corpus import movie_reviews
from nltk.classify.scikitlearn import SklearnClassifier
import pickle

from sklearn.naive_bayes import MultinomialNB, GaussianNB, BernoulliNB
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.svm import SVC, LinearSVC, NuSVC

from nltk.classify import ClassifierI
from statistics import mode

from nltk import word_tokenize

class VoteClassifier(ClassifierI):
    def __init__(self, *classifiers):
        self._classifiers = classifiers

    def classify(self, features):
        votes = []
        for c in self._classifiers:
            v = c.classify(features)
            votes.append(v)
        return mode(votes)

    def confidence(self, features):
        votes = []
        for c in self._classifiers:
            v = c.classify(features)
            votes.append(v)

        choice_votes = votes.count(mode(votes))
        conf = choice_votes / len(votes)
        return conf

short_pos = open("short_reviews/pos.txt", "r").read
short_neg = open("short_reviews/neg.txt", "r").read

documents = []

for r in short_pos.split('\n'):
    documents.append((r, "pos"))

for r in short_neg.split('\n'):
    documents.append((r, "neg"))

all_words = []

short_pos_words = word_tokenize(short_pos)
short_neg_words = word_tokenize(short_neg)

for w in short_pos_words:
    all_words.append(w.lower())

for w in short_neg_words:
    all_words.append(w.lower())

all_words = nltk.FreqDist(all_words)
Error:
Traceback (most recent call last):
  File "/Users/jordanXXX/Documents/NLP/bettertrainingdata", line 37, in <module>
    short_pos = open("short_reviews/pos.txt", "r").read
IOError: [Errno 2] No such file or directory: 'short_reviews/pos.txt'
I have already tried:

f=open('neg.txt', 'rU')
Error:
>>> f=open('neg.txt','rU')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
IOError: [Errno 2] No such file or directory: 'neg.txt'
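For what it's worth, open() with a relative path looks in the current working directory (apparently /Users/jordanXXX/Documents/NLP, going by the traceback) rather than inside nltk_data/corpora, and the files sit one level deeper (pos/pos.txt, neg/neg.txt). A minimal sketch of building absolute paths, assuming the layout listed above (untested):

import os

# Assumed layout, taken from the directory listing at the top of this post:
#   ~/nltk_data/corpora/short_reviews/pos/pos.txt
#   ~/nltk_data/corpora/short_reviews/neg/neg.txt
corpus_root = os.path.expanduser("~/nltk_data/corpora/short_reviews")

# Absolute paths, so open() no longer depends on the current working directory.
pos_path = os.path.join(corpus_root, "pos", "pos.txt")
neg_path = os.path.join(corpus_root, "neg", "neg.txt")

short_pos = open(pos_path, "r").read()
short_neg = open(neg_path, "r").read()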
That said, I'm not really trying to add a whole lot of code to append paths, etc., unless I have to.



Any input would be great, as I'd really like to use my own bodies of text in the future with something as simple as converting them to .txt files and copying and pasting them into an appropriate spot.




EDIT: I am using Homebrew, if that is of any significance.


RE: NLTK create corpora - Larz60+ - Oct-26-2016

You could look at the downloader.py file (source available here)
There are probably some hooks that you have to set within nltk itself so it knows about your corpus.
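One such hook appears to be nltk.data.path: NLTK searches the directories on that list (which normally includes ~/nltk_data) when it resolves a resource name, so files under ~/nltk_data/corpora should already be discoverable. A quick sketch to check, assuming the layout from the first post:

import nltk

# Directories NLTK searches when it resolves a resource name;
# ~/nltk_data is normally already on this list.
print(nltk.data.path)

# If the files really live under ~/nltk_data/corpora/short_reviews/...
# (as in the first post), this should resolve without any extra setup;
# otherwise it raises a LookupError.
print(nltk.data.find("corpora/short_reviews/pos/pos.txt"))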


RE: NLTK create corpora - pythlang - Oct-26-2016

(Oct-26-2016, 03:31 AM)Larz60+ Wrote: You could look at the downloader.py file (source available here). There are probably some hooks that you have to set within nltk itself so it knows about your corpus.

thanks,

so do you think something like:


1.9   Loading your own Corpus

If you have your own collection of text files that you would like to access using the above methods, you can easily load them with the help of NLTK's PlaintextCorpusReader. Check the location of your files on your file system; in the following example, we have taken this to be the directory /usr/share/dict. Whatever the location, set this to be the value of corpus_root [1]. The second parameter of the PlaintextCorpusReader initializer [2] can be a list of fileids, like ['a.txt', 'test/b.txt'], or a pattern that matches all fileids, like '[abc]/.*\.txt' (see 3.4 for information about regular expressions).


>>> from nltk.corpus import PlaintextCorpusReader
>>> corpus_root = '/usr/share/dict' [1]
>>> wordlists = PlaintextCorpusReader(corpus_root, '.*') [2]
>>> wordlists.fileids()
['README', 'connectives', 'propernames', 'web2', 'web2a', 'words']
>>> wordlists.words('connectives')
['the', 'of', 'and', 'to', 'a', 'in', 'that', 'is', ...]
would work, and if so how would I write that?

I can post my attempt with traceback if needed
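For example, a rough sketch along these lines, with the corpus root and file names assumed from the directory listing in the first post (untested):

from nltk.corpus import PlaintextCorpusReader

# Corpus root and file names are taken from the first post's listing.
corpus_root = '/jordanxxx/nltk_data/corpora/short_reviews'

# r'.*\.txt' matches every .txt file under the root, including subdirectories.
wordlists = PlaintextCorpusReader(corpus_root, r'.*\.txt')

print(wordlists.fileids())            # e.g. ['neg/neg.txt', 'pos/pos.txt']
print(wordlists.words('pos/pos.txt')[:20])
print(wordlists.raw('neg/neg.txt')[:200])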


RE: NLTK create corpora - Larz60+ - Oct-26-2016

That looks like what you need. I did this a couple of years ago,
and not since. I'm afraid you're going to have to dig into the book (also available on GitHub).


RE: NLTK create corpora - pythlang - Oct-26-2016

(Oct-26-2016, 06:37 AM)Larz60+ Wrote: That looks like what you need. I did this a couple of years ago,
and not since. I'm afraid you're going to have to dig into the book (also available on GitHub).

Thanks for pointing me in the right direction, Larz; sorry for the delay in expressing my gratitude. I'm just trying to put together some working code to post back here for others once I've read through the book you've kindly provided.


RE: NLTK create corpora - Larz60+ - Oct-26-2016

Quote:I've read through the book you've kindly provided

Correction - the book link I provided.