Skip to content
Discussion options

You must be logged in to vote

The spaCy pipeline should work just fine with UTF8 encoded text. If you do end up writing this custom tokenizer, you'll probably have to train NER/parser/... models from scratch as well (if you want them), as this type of data will look quite different than your "usual" text. Not sure whether this answers your question - if not, let me know.

Replies: 1 comment

Comment options

You must be logged in to vote
0 replies
Answer selected by ines
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feat / tokenizer Feature: Tokenizer
2 participants