NLTK create corpora

Python Forum (https://python-forum.io), General Coding Help, thread /thread-640.html
NLTK create corpora - pythlang - Oct-26-2016

Hi guys,

Pretty straightforward and most likely an easy question for you guys here: I'm trying to create and use my own corpora saved as .txt files, but they are not being found.

There are two files, and their paths are as follows:

/jordanxxx/nltk_data/corpora/short_reviews/neg/neg.txt
/jordanxxx/nltk_data/corpora/short_reviews/pos/pos.txt

    import nltk
    import random
    from nltk.corpus import movie_reviews
    from nltk.classify.scikitlearn import SklearnClassifier
    import pickle
    from sklearn.naive_bayes import MultinomialNB, GaussianNB, BernoulliNB
    from sklearn.linear_model import LogisticRegression, SGDClassifier
    from sklearn.svm import SVC, LinearSVC, NuSVC
    from nltk.classify import ClassifierI
    from statistics import mode
    from nltk import word_tokenize


    class VoteClassifier(ClassifierI):
        """Classifier that returns the majority vote of several classifiers."""

        def __init__(self, *classifiers):
            self._classifiers = classifiers

        def classify(self, features):
            votes = []
            for c in self._classifiers:
                v = c.classify(features)
                votes.append(v)
            return mode(votes)

        def confidence(self, features):
            # Fraction of classifiers that agree with the majority vote.
            votes = []
            for c in self._classifiers:
                v = c.classify(features)
                votes.append(v)
            choice_votes = votes.count(mode(votes))
            conf = choice_votes / len(votes)
            return conf


    # Relative paths as posted; these are the files that are not being found.
    short_pos = open("short_reviews/pos.txt", "r").read()
    short_neg = open("short_reviews/neg.txt", "r").read()

    # One document per line, labelled by the file it came from.
    documents = []
    for r in short_pos.split('\n'):
        documents.append((r, "pos"))
    for r in short_neg.split('\n'):
        documents.append((r, "neg"))

    all_words = []
    short_pos_words = word_tokenize(short_pos)
    short_neg_words = word_tokenize(short_neg)
    for w in short_pos_words:
        all_words.append(w.lower())
    for w in short_neg_words:
        all_words.append(w.lower())

    all_words = nltk.FreqDist(all_words)

I have already tried:

    f = open('neg.txt', 'rU')

and I'm not really trying to add a whole lot of code to append paths etc. unless I have to. Any input would be great, as I'd really like to use my own bodies of text in the future with something as simple as converting them to a .txt file and copying and pasting it into an appropriate spot.

EDIT: I am using Homebrew, if that is of any significance.

RE: NLTK create corpora - Larz60+ - Oct-26-2016

You could look at the downloader.py file (source available here). There are probably some hooks that you have to set within nltk itself so it knows about your corpus.

RE: NLTK create corpora - pythlang - Oct-26-2016

(Oct-26-2016, 03:31 AM)Larz60+ Wrote: You could look at the downloader.py file (source available here). There are probably some hooks that you have to set within nltk itself so it knows about your corpus.

Thanks. So do you think something like:

1.9 Loading your own Corpus

If you have your own collection of text files that you would like to access using the above methods, you can easily load them with the help of NLTK's PlaintextCorpusReader. Check the location of your files on your file system; in the following example, we have taken this to be the directory /usr/share/dict. Whatever the location, set this to be the value of corpus_root [1]. The second parameter of the PlaintextCorpusReader initializer [2] can be a list of fileids, like ['a.txt', 'test/b.txt'], or a pattern that matches all fileids, like '[abc]/.*\.txt' (see 3.4 for information about regular expressions).

    from nltk.corpus import PlaintextCorpusReader
    corpus_root = '/usr/share/dict'  # [1]
    wordlists = PlaintextCorpusReader(corpus_root, '.*')  # [2]
    wordlists.fileids()
    # ['README', 'connectives', 'propernames', 'web2', 'web2a', 'words']
    wordlists.words('connectives')
    # ['the', 'of', 'and', 'to', 'a', 'in', 'that', 'is', ...]

would work, and if so how would I write that?
I can post my attempt with traceback if needed.

RE: NLTK create corpora - Larz60+ - Oct-26-2016

That looks like what you need. I did this a couple of years ago, and not since. I'm afraid you're going to have to dig into the book (also available on GitHub).

RE: NLTK create corpora - pythlang - Oct-26-2016

(Oct-26-2016, 06:37 AM)Larz60+ Wrote: That looks like what you need. I did this a couple of years ago,

Thanks for pointing me in the right direction, Larz; sorry for the delay on the gratitude. I'm just attempting to write working code to post back here for others after I've read through the book you've kindly provided.

RE: NLTK create corpora - Larz60+ - Oct-26-2016

Quote: I've read through the book you've kindly provided

Correction - Book link I provided
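
A minimal sketch of the two approaches discussed in this thread, under stated assumptions: the corpus directory contains pos/pos.txt and neg/neg.txt as in the first post, each file holds one review per line, and the files are UTF-8 (the thread does not say). The names read_labeled_reviews and open_corpus_reader are hypothetical helpers, not part of NLTK. The first builds the full path to each file instead of relying on the current working directory; the second hands the same directory to NLTK's PlaintextCorpusReader, as in the book excerpt quoted above.

```python
from pathlib import Path


def read_labeled_reviews(corpus_root):
    """Return (line, label) pairs read from <corpus_root>/<label>/<label>.txt."""
    documents = []
    for label in ("pos", "neg"):
        # Full path, including the pos/ or neg/ subdirectory from the post.
        path = Path(corpus_root) / label / f"{label}.txt"
        text = path.read_text(encoding="utf-8")  # encoding is an assumption
        for line in text.splitlines():
            if line.strip():  # skip blank lines
                documents.append((line, label))
    return documents


def open_corpus_reader(corpus_root):
    """Wrap the same directory in NLTK's PlaintextCorpusReader."""
    # Imported here so the helper above also runs without NLTK installed.
    from nltk.corpus import PlaintextCorpusReader

    # The pattern matches pos/pos.txt and neg/neg.txt under corpus_root.
    return PlaintextCorpusReader(str(corpus_root), r"(pos|neg)/.*\.txt")
```

Called with the short_reviews directory from the first post as corpus_root, read_labeled_reviews would return the (review, label) pairs that the original script collects in documents, and open_corpus_reader(corpus_root).words('pos/pos.txt') would give the tokens of the positive file.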