feeding html to spacy #6157
Background: When I naively feed HTML (scripts removed, converted to a UTF-8 string) to spaCy to generate BERT vectors, it crashes the cloud GPU with a huge memory buildup (60GB+). This is likely because the spaCy tokenizer is built for normal text, not HTML: it doesn't really tokenize the markup, leaving mile-long sentences. The reason I want to use HTML is to teach text categorization the user-interface aspects of a webpage; for example, the 'download' button tells you there is something to download on a page. If I were to extract the text only, a floating word 'download' in the text stream may not offer enough context (if in fact that word is even expressed as text in the HTML, vs. inside a tag). [This is a hypothesis; we are also trying better classifiers.]

The question: do I need to write a custom tokenizer to handle HTML, or is there a better way to feed it to spaCy? Thank you!
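For reference, a minimal sketch of the extract-text-only alternative mentioned above: strip scripts and styles with BeautifulSoup before handing the result to spaCy. This assumes `bs4` and the `en_core_web_sm` model are installed; the tag list and the `html_to_text` helper are illustrative choices, not a definitive approach.

```python
import spacy
from bs4 import BeautifulSoup

def html_to_text(html: str) -> str:
    soup = BeautifulSoup(html, "html.parser")
    # Drop non-visible content that inflates the token stream.
    for tag in soup(["script", "style", "noscript"]):
        tag.decompose()
    # A separator keeps block boundaries as whitespace, avoiding the
    # "mile-long sentence" effect of fused-together text nodes.
    return soup.get_text(separator=" ", strip=True)

nlp = spacy.load("en_core_web_sm")
html = "<html><body><a href='/file.zip'>download</a></body></html>"
doc = nlp(html_to_text(html))
print([token.text for token in doc])  # ['download']
```

Note that this discards the markup entirely, so any signal carried by tags (button vs. plain text) is lost, which is exactly the trade-off described above.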
Replies: 1 comment
The spaCy pipeline should work just fine with UTF-8 encoded text. If you do end up writing this custom tokenizer, you'll probably have to train the NER/parser/... models from scratch as well (if you want them), as this type of data will look quite different from your "usual" text. Not sure whether this answers your question - if not, let me know.
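For what it's worth, a custom tokenizer in spaCy is just a callable assigned to `nlp.tokenizer` that takes a string and returns a `Doc`. Here is a minimal sketch of an HTML-aware version that keeps tags as tokens so UI signals survive; the `HTMLTokenizer` name and the regex are illustrative, not a recommended production design.

```python
import re
import spacy
from spacy.tokens import Doc

class HTMLTokenizer:
    """Emit one token per HTML tag and one per non-tag word."""
    def __init__(self, vocab):
        self.vocab = vocab
        # One token per <tag ...> or per run of non-tag, non-space characters.
        self.pattern = re.compile(r"<[^>]+>|[^<\s]+")

    def __call__(self, text):
        words = self.pattern.findall(text)
        return Doc(self.vocab, words=words)

nlp = spacy.blank("en")
nlp.tokenizer = HTMLTokenizer(nlp.vocab)
doc = nlp("<a href='/file.zip'>download</a>")
print([token.text for token in doc])
# ["<a href='/file.zip'>", 'download', '</a>']
```

As noted above, any statistical components (NER, parser, textcat) would then need to be trained on data tokenized this way, since the pretrained pipelines assume the default tokenization.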