Sentence categorization doesn't work using CLI train, json_to_tuple issue #5299

dobryy · 2020-04-13T12:27:52Z

How to reproduce the behaviour

I'm trying to check how sentence categorization works. Unfortunately, I hit an error when trying to train my model using the CLI train command:

corpus = GoldCorpus(train_path, dev_path)
  File "gold.pyx", line 225, in spacy.gold.GoldCorpus.__init__
  File "gold.pyx", line 236, in spacy.gold.GoldCorpus.write_msgpack
  File "gold.pyx", line 281, in read_tuples
  File "gold.pyx", line 604, in _json_iterate
  File "gold.pyx", line 515, in json_to_tuple
TypeError: unhashable type: 'list'

According to the GoldParse documentation, the cats attribute is

Labels for text classification. Each key in the dictionary may be a string or an int, or a (start_char, end_char, label) tuple, indicating that the label is applied to only part of the document (usually a sentence).

In my json I have cats specified like this:

"cats": [
    {
        "label": [2223, 2238, "INGREDIENT_LINE"],
        "value": 1
    },
    {
        "label": [2239, 2253, "INGREDIENT_LINE"],
        "value": 1
    },

I checked how json_to_tuple works and found out that it doesn't handle tuples as a category label here:

spaCy/spacy/gold.pyx

Line 515 in 6a8a526

cats[cat["label"]] = cat["value"]

I'm not sure whether it would be enough to check if cat['label'] is a list and convert it into a tuple on this line.
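To make the idea concrete, here is a minimal sketch in plain Python. It is not the actual spaCy code: `cats_to_dict` is a hypothetical helper that only mirrors the dictionary assignment quoted above, and it assumes the JSON entries look like the ones in my file. JSON has no tuple type, so the label arrives as a list, and a list cannot be a dictionary key.

```python
# Hypothetical helper, not part of spaCy: mirrors the assignment
# cats[cat["label"]] = cat["value"] with a list-to-tuple conversion added.
def cats_to_dict(cat_entries):
    cats = {}
    for cat in cat_entries:
        label = cat["label"]
        # JSON arrays are loaded as lists, which are unhashable.
        if isinstance(label, list):
            label = tuple(label)
        cats[label] = cat["value"]
    return cats


entries = [
    {"label": [2223, 2238, "INGREDIENT_LINE"], "value": 1},
    {"label": [2239, 2253, "INGREDIENT_LINE"], "value": 1},
]

# Without the conversion, the list key raises the TypeError from the traceback.
try:
    {}[entries[0]["label"]] = entries[0]["value"]
    naive_error = None
except TypeError as err:
    naive_error = str(err)

cats = cats_to_dict(entries)
```

This only shows that the conversion avoids the TypeError. Whether the rest of the training code accepts tuple keys is a separate question.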

Your Environment

Info about spaCy

spaCy version: 2.2.4
Platform: Linux-5.3.0-46-generic-x86_64-with-debian-buster-sid
Python version: 3.7.7

adrianeboyd · 2020-04-14T08:26:45Z

Sorry, the docs are out of date here. Supporting both versions of cats was the initial plan, but the TextCategorizer only supports whole-document cats at this point.

The idea of supporting subdocument cats could make sense for training with the current implementation, but an evaluation that uses this format gets very complicated: do you evaluate every possible span in an eval document? What kind of summary score would you provide for a whole document after evaluating every single span?

Your task is reasonable, but unfortunately not a perfect match for either the current TextCategorizer or EntityRecognizer. The EntityRecognizer probably gets closer because it handles spans, but it's better at shorter proper name-like spans than longer text spans.

See #3961 for related discussion. Suggestions/help are very welcome!
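One possible way to stay within whole-document cats, offered only as a sketch and not as a recommended spaCy recipe, is to treat each line of a recipe as its own small document. The helper below is hypothetical (`lines_to_textcat_examples` and its parameters are illustrative names). It assumes ingredient lines are given as exact `(start_char, end_char)` offsets into the full text, and it emits `(text, {"cats": {...}})` pairs in the style of spaCy v2's simple training examples.

```python
# Hypothetical helper: split a recipe into per-line examples so that each
# line carries a whole-document category instead of a span-level one.
def lines_to_textcat_examples(text, ingredient_spans, label="INGREDIENT_LINE"):
    examples = []
    offset = 0
    for line in text.splitlines(keepends=True):
        start = offset
        content = line.rstrip("\r\n")
        end = start + len(content)
        offset += len(line)
        if not content.strip():
            continue  # skip blank lines
        # Assumption: a line is an ingredient line only on an exact offset match.
        is_ingredient = (start, end) in ingredient_spans
        examples.append((content, {"cats": {label: float(is_ingredient)}}))
    return examples


recipe = "Pancakes\n2 cups flour\n1 tsp salt\nMix well."
spans = {(9, 21), (22, 32)}
examples = lines_to_textcat_examples(recipe, spans)
```

Evaluation then happens per line, which sidesteps the question of how to score arbitrary spans inside a longer document, at the cost of losing the surrounding context for each line.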

dobryy · 2020-04-14T10:41:41Z

Thank you for the reply and the link to the related discussion.

I'm new to spaCy, ML, NLP, etc., and I'm mostly looking around and trying different things to figure out how it all works and whether spaCy and NLP can solve the problem I have. It is quite possible that the approach I've chosen is not right.

What I'd like to achieve is to find the ingredient sentences in a recipe and then, in each sentence/span, find the ingredient name, amount and unit of measurement. I want to use NER to identify the ingredient name, amount and unit of measurement, and sentence categorization to identify the ingredient sentences/lines.

If categorization doesn't work for sentences yet, it seems like I can create two models. The first will use NER to label ingredient sentences/spans, and the second will identify the ingredient name, amount and unit of measurement in those sentences. Please kindly advise whether I'm on the right track or should choose another approach. Or maybe I have to use totally different tools for that :)

dobryy changed the title ~~Sentence categorization doesn't work using CLI, json_to_tuple issue~~ Sentence categorization doesn't work using CLI train, json_to_tuple issue Apr 13, 2020

svlandeg added feat / cli feat / textcat labels Apr 13, 2020

adrianeboyd added the docs label Apr 14, 2020
