
I am working on a Named Entity Recognition (NER) project. Instead of using an existing library, I decided to implement one from scratch because I want to learn how probabilistic graphical models (PGMs) work under the hood. I converted the words in each sentence into feature vectors. I picked the features manually, and I can only think of about 20 (such as "Is the token capitalized?", "Is the token an English word?", etc.). However, I've heard that good NER algorithms represent tokens using far more than 20 features, sometimes hundreds. How do they come up with so many features? Are there any recommended best practices for feature construction?

Is your question about the thought process that goes into feature selection, or about what additional features a NER algorithm might use? – David Marx Aug 6 at 20:13
Hi David, I think I need to know more about what additional features are used for NER, and also what some common approaches for finding these features are. Thanks – xiaoyao Aug 6 at 20:29
One place to start: you might compare the kinds of features you've developed with the kinds of features in the Stanford NER library (see slides 10 and 11): nlp.stanford.edu/software/jenny-ner-2007.pdf – David Marx Aug 6 at 20:56
Oftentimes the huge number of features comes from sets with extremely high cardinality, like the vocabulary of your document collection, the part-of-speech tags, and so on. It's also fairly common to use features from neighboring words, so people aren't necessarily thinking up lots of unique features focused only on the target token. – lmjohns3 Aug 9 at 21:21
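
To make the point about high-cardinality and neighboring-word features concrete, here is a minimal Python sketch. It is only an illustration under stated assumptions, not how any particular NER library does it: the function name `token_features`, the feature-name strings, the window size of 2, and the example sentence are all made up for this example. It shows how four simple templates, copied across a context window, can turn into a very large feature space once the word-identity and suffix templates are expanded over a real vocabulary.

```python
from typing import Dict, List


def token_features(tokens: List[str], i: int, window: int = 2) -> Dict[str, float]:
    """Sparse feature dict for tokens[i] (hypothetical helper, not a library API).

    Each template is applied to the target token and to its neighbors within
    `window` positions, so a few templates yield many distinct features.
    """
    feats: Dict[str, float] = {"bias": 1.0}
    for offset in range(-window, window + 1):
        j = i + offset
        if j < 0:
            # Position falls before the start of the sentence.
            feats[f"{offset}:BOS"] = 1.0
            continue
        if j >= len(tokens):
            # Position falls after the end of the sentence.
            feats[f"{offset}:EOS"] = 1.0
            continue
        tok = tokens[j]
        # Word identity: one indicator per vocabulary item (high cardinality).
        feats[f"{offset}:word={tok.lower()}"] = 1.0
        # Suffix identity: another template whose size grows with the corpus.
        feats[f"{offset}:suffix3={tok[-3:].lower()}"] = 1.0
        # Hand-picked boolean features, like the ones described in the question.
        feats[f"{offset}:is_capitalized"] = float(tok[:1].isupper())
        feats[f"{offset}:is_digit"] = float(tok.isdigit())
    return feats


# Made-up example sentence, one feature dict per token.
sentence = ["Alice", "works", "at", "Stanford", "University"]
features = [token_features(sentence, i) for i in range(len(sentence))]
```

Each dict here has at most 21 entries, but the number of distinct feature names across a corpus scales with the vocabulary size times the window width, which is where the "hundreds" (or many more) of features tend to come from. Dicts like these can be turned into sparse vectors with something like scikit-learn's `DictVectorizer`, if you don't want to write that part from scratch too.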
