Python - How to make a feature vector from the lists
I'm new to Python. I have train data in bag-of-words form. Each line of the train data is an article. The labels of the train data are in a separate file, and each label corresponds to one article in the train data. I did stemming on the train data and removed stop words. The output is a list of words for each article (line). I want to extract a feature vector from these lists and use it in a KNN classifier in Python, but I don't know how to do it! I'd appreciate a quick answer. Here's the code for the things I did:
    import nltk
    from nltk.corpus import stopwords
    from nltk import stem

    stemmer = stem.PorterStemmer()
    with open('data.txt') as file:
        while 1:
            # Read one article (line) and split it into words
            line = file.readline().split()
            # Drop English stop words, then stem what is left
            filtered_words = [w for w in line if not w in stopwords.words('english')]
            documents = [stemmer.stem(w) for w in filtered_words]
            print(documents)
            if not line:
                break
            pass
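Take a look at scikit-learn's CountVectorizer or TfidfVectorizer. These take a list of documents as input and return a feature matrix. By default each document is expected to be a string, so in your example you would either join each list of tokens back into one string or pass a custom analyzer that accepts the token lists as they are: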
    from sklearn.feature_extraction.text import CountVectorizer
    count_vect = CountVectorizer()
    X_train_counts = count_vect.fit_transform(your_list_of_documents)
You can find more information in the Working With Text Data tutorial.
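To connect this to the KNN part of the question, here is a minimal sketch of the whole path from token lists to a prediction. The toy articles, the labels, and the names train_tokens, train_labels and test_tokens are made up for illustration. It assumes the articles are already stemmed token lists, so the vectorizer gets a pass-through analyzer instead of tokenizing raw strings, and n_neighbors=1 is only chosen because the toy data set is tiny.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical toy data: token lists as produced by the stemming step,
# plus one made-up label per article.
train_tokens = [
    ["stock", "market", "fall"],
    ["market", "trade", "stock"],
    ["team", "win", "match"],
    ["player", "score", "match"],
]
train_labels = ["finance", "finance", "sport", "sport"]

# The documents are already tokenized, so the analyzer returns them unchanged.
count_vect = CountVectorizer(analyzer=lambda tokens: tokens)
X_train_counts = count_vect.fit_transform(train_tokens)

# Fit a nearest-neighbour classifier on the sparse count matrix.
knn = KNeighborsClassifier(n_neighbors=1)
knn.fit(X_train_counts, train_labels)

# New articles must be transformed with the same fitted vectorizer.
test_tokens = [["stock", "trade"]]
X_test_counts = count_vect.transform(test_tokens)
predicted = knn.predict(X_test_counts)
print(predicted)
```

Swapping CountVectorizer for TfidfVectorizer with the same analyzer argument should give TF-IDF weights instead of raw counts. On real data you would pick n_neighbors by validating on held-out articles.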