feeding html to spacy #6157
Background: When I naively feed HTML (scripts removed, converted to a UTF-8 string) to spaCy to generate BERT vectors, it crashes the cloud GPU with a huge memory buildup (60GB+). This is likely because the spaCy tokenizer is built for normal text, not HTML: it doesn't really tokenize the markup, leaving mile-long sentences. The reason I want to use HTML is to teach text categorization the user-interface aspects of a webpage; for example, the 'download' button tells you there is something to download on a page. If I were to extract the text only, a floating word 'download' in the text stream may not offer enough context (if in fact that word is even expressed as text in the HTML, vs. inside a tag). [This is a hypothesis; we are also trying better classifiers.]

The question: do I need to write a custom tokenizer to handle HTML, or is there a better way to feed it to spaCy? Thank you!
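For reference, a minimal sketch of the extract-text-only alternative mentioned above: strip scripts and styles with BeautifulSoup before handing the result to spaCy. This assumes `bs4` and the `en_core_web_sm` model are installed; the tag list and the `html_to_text` helper are illustrative choices, not a definitive approach.

```python
import spacy
from bs4 import BeautifulSoup

def html_to_text(html: str) -> str:
    soup = BeautifulSoup(html, "html.parser")
    # Drop non-visible content that inflates the token stream.
    for tag in soup(["script", "style", "noscript"]):
        tag.decompose()
    # A separator keeps block boundaries as whitespace, avoiding the
    # "mile-long sentence" effect of fused-together text nodes.
    return soup.get_text(separator=" ", strip=True)

nlp = spacy.load("en_core_web_sm")
html = "<html><body><a href='/file.zip'>download</a></body></html>"
doc = nlp(html_to_text(html))
print([token.text for token in doc])  # ['download']
```

Note that this discards the markup entirely, so any signal carried by tags (button vs. plain text) is lost, which is exactly the trade-off described above.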
Replies: 1 comment
The spaCy pipeline should work just fine with UTF-8 encoded text. If you do end up writing this custom tokenizer, you'll probably have to train the NER/parser/... models from scratch as well (if you want them), as this type of data will look quite different from your "usual" text. Not sure whether this answers your question - if not, let me know.
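For what it's worth, a custom tokenizer in spaCy is just a callable assigned to `nlp.tokenizer` that takes a string and returns a `Doc`. Here is a minimal sketch of an HTML-aware version that keeps tags as tokens so UI signals survive; the `HTMLTokenizer` name and the regex are illustrative, not a recommended production design.

```python
import re
import spacy
from spacy.tokens import Doc

class HTMLTokenizer:
    """Emit one token per HTML tag and one per non-tag word."""
    def __init__(self, vocab):
        self.vocab = vocab
        # One token per <tag ...> or per run of non-tag, non-space characters.
        self.pattern = re.compile(r"<[^>]+>|[^<\s]+")

    def __call__(self, text):
        words = self.pattern.findall(text)
        return Doc(self.vocab, words=words)

nlp = spacy.blank("en")
nlp.tokenizer = HTMLTokenizer(nlp.vocab)
doc = nlp("<a href='/file.zip'>download</a>")
print([token.text for token in doc])
# ["<a href='/file.zip'>", 'download', '</a>']
```

As noted above, any statistical components (NER, parser, textcat) would then need to be trained on data tokenized this way, since the pretrained pipelines assume the default tokenization.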