Sentence categorization doesn't work using CLI train, json_to_tuple issue #5299

Open
dobryy opened this issue Apr 13, 2020 · 2 comments

@dobryy commented Apr 13, 2020

How to reproduce the behaviour

I'm trying to check how sentence categorization works. Unfortunately, I get an error when I try to train my model with the CLI train command:

corpus = GoldCorpus(train_path, dev_path)
  File "gold.pyx", line 225, in spacy.gold.GoldCorpus.__init__
  File "gold.pyx", line 236, in spacy.gold.GoldCorpus.write_msgpack
  File "gold.pyx", line 281, in read_tuples
  File "gold.pyx", line 604, in _json_iterate
  File "gold.pyx", line 515, in json_to_tuple
TypeError: unhashable type: 'list'

According to the GoldParse documentation, the cats attribute is

Labels for text classification. Each key in the dictionary may be a string or an int, or a (start_char, end_char, label) tuple, indicating that the label is applied to only part of the document (usually a sentence).

In my json I have cats specified like this:

"cats": [
    {
        "label": [
            2223,
            2238,
            "INGREDIENT_LINE"
        ],
        "value": 1
    },
    {
        "label": [
            2239,
            2253,
            "INGREDIENT_LINE"
        ],
        "value": 1
    },

I checked how json_to_tuple works and found out that it doesn't handle tuples as a category label here:

cats[cat["label"]] = cat["value"]

I'm not sure whether it would be enough to check if cat["label"] is a list and convert it to a tuple on this line.
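
The failure and the proposed conversion can be reproduced outside spaCy. The sketch below is a standalone illustration, not spaCy's actual json_to_tuple code: the helper name cats_from_json is hypothetical, and it only assumes the entry shape shown in the JSON above. Because JSON has no tuple type, the label arrives as a list, and a list cannot be used as a dict key.

```python
def cats_from_json(cat_entries):
    """Build a cats dict from entries shaped like {"label": ..., "value": ...}.

    Hypothetical helper for illustration; not part of spaCy.
    """
    cats = {}
    for cat in cat_entries:
        label = cat["label"]
        # A (start_char, end_char, label) key is deserialized from JSON
        # as a list, which is unhashable. Convert it to a tuple first.
        if isinstance(label, list):
            label = tuple(label)
        cats[label] = cat["value"]
    return cats


entries = [
    {"label": [2223, 2238, "INGREDIENT_LINE"], "value": 1},
    {"label": [2239, 2253, "INGREDIENT_LINE"], "value": 1},
]

# Without the conversion, using the list as a key raises the same TypeError.
try:
    {}[entries[0]["label"]] = entries[0]["value"]
    raised_type_error = False
except TypeError:
    raised_type_error = True

cats = cats_from_json(entries)
```

This only removes the TypeError while the data is loaded. Whether the training code then does anything useful with span-level keys is a separate question.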

Your Environment

Info about spaCy

  • spaCy version: 2.2.4
  • Platform: Linux-5.3.0-46-generic-x86_64-with-debian-buster-sid
  • Python version: 3.7.7

@dobryy changed the title from "Sentence categorization doesn't work using CLI, json_to_tuple issue" to "Sentence categorization doesn't work using CLI train, json_to_tuple issue" on Apr 13, 2020
@adrianeboyd added the docs label on Apr 14, 2020

@adrianeboyd (Member) commented Apr 14, 2020

Sorry, the docs are out of date here. Supporting both versions of cats was the initial plan, but the TextCategorizer only supports whole-document cats at this point.
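
One possible workaround, sketched here as an assumption rather than a recommended approach, is to turn each labelled span into its own small document so that every label becomes a whole-document cat. The helper split_span_cats and the sample recipe text are hypothetical. The (text, {"cats": {...}}) shape follows the pattern used in spaCy v2 text classification examples, but check it against the version you are running.

```python
def split_span_cats(text, span_cats):
    """Turn {(start_char, end_char, label): value} into one example per span.

    Hypothetical helper: each span's text becomes a standalone document
    whose label applies to the whole (short) document.
    """
    examples = []
    for (start, end, label), value in sorted(span_cats.items()):
        examples.append((text[start:end], {"cats": {label: float(value)}}))
    return examples


# Made-up sample data for illustration only.
text = "Pancakes\n2 cups flour\n1 tsp salt\nMix well."
span_cats = {
    (9, 21, "INGREDIENT_LINE"): 1,
    (22, 32, "INGREDIENT_LINE"): 1,
    (33, 42, "INGREDIENT_LINE"): 0,
}

examples = split_span_cats(text, span_cats)
for span_text, annotations in examples:
    print(repr(span_text), annotations)
```

The trade-off is that each span is classified without its surrounding context, which may or may not matter for a given task.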

The idea of supporting subdocument cats could make sense for training with the current implementation, but an evaluation that uses this format gets very complicated: do you evaluate every possible span in an eval document? What kind of summary score would you provide for a whole document after evaluating every single span?
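
To give a rough sense of scale for "every possible span": a document of n tokens has n(n+1)/2 contiguous spans, so the count grows quadratically. The short sketch below checks that formula by brute-force enumeration; the function names are hypothetical and nothing here uses spaCy.

```python
def count_spans(n_tokens):
    """Number of contiguous, non-empty spans in a sequence of n_tokens."""
    return n_tokens * (n_tokens + 1) // 2


def enumerate_spans(n_tokens):
    """List every (start, end) token span with start < end, end exclusive."""
    return [
        (start, end)
        for start in range(n_tokens)
        for end in range(start + 1, n_tokens + 1)
    ]


# The closed form agrees with explicit enumeration for small inputs.
assert all(count_spans(n) == len(enumerate_spans(n)) for n in range(50))

print(count_spans(4))    # 10
print(count_spans(500))  # 125250
```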

Your task is reasonable, but unfortunately not a perfect match for either the current TextCategorizer or EntityRecognizer. The EntityRecognizer probably gets closer because it handles spans, but it's better at shorter proper name-like spans than longer text spans.

See #3961 for related discussion. Suggestions/help are very welcome!

@dobryy (Author) commented Apr 14, 2020

Thank you for the reply and the link to the related discussion.

I'm new to spaCy, ML, NLP, etc., so I'm mostly looking around and trying different things to figure out how it all works and whether spaCy and NLP can solve my problem. It is quite possible that the approach I've chosen is not the right one.

What I'd like to achieve is to find the ingredient sentences in a recipe and then, in each sentence/span, find the ingredient name, amount and unit of measurement. I want to use NER for the name, amount and unit, and I wanted to use sentence categorization for the ingredient sentences/lines.

If categorization doesn't work for sentences yet, it seems I could create two models. The first would use NER to label ingredient sentences/spans, and the second would identify the ingredient name, amount and unit of measurement within those sentences. Please advise whether I'm on the right track or should choose another approach. Or maybe I have to use totally different tools for that :)
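
The two-stage idea can be sketched as two functions chained together. This is only a structural illustration: both stages below are toy regex rules standing in for trained models, and every name and the sample recipe are hypothetical. In a real setup each stand-in would be replaced by a call to the corresponding trained pipeline.

```python
import re


def find_ingredient_lines(text):
    """Stage 1 stand-in: return (start_char, end_char) of likely ingredient lines."""
    spans = []
    offset = 0
    for line in text.split("\n"):
        # Toy rule in place of a model: ingredient lines start with a digit.
        if re.match(r"\d", line):
            spans.append((offset, offset + len(line)))
        offset += len(line) + 1  # +1 accounts for the newline separator
    return spans


def extract_fields(line):
    """Stage 2 stand-in: split one line into amount, unit and name."""
    match = re.match(r"(?P<amount>[\d/.]+)\s+(?P<unit>\w+)\s+(?P<name>.+)", line)
    return match.groupdict() if match else None


def parse_recipe(text):
    """Run stage 1 over the whole text, then stage 2 on each span it found."""
    results = []
    for start, end in find_ingredient_lines(text):
        fields = extract_fields(text[start:end])
        if fields is not None:
            results.append(fields)
    return results


recipe = "Pancakes\n2 cups flour\n1 tsp salt\nMix well."
parsed = parse_recipe(recipe)
print(parsed)
```

Keeping the stages separate means the second one only ever sees text the first one selected, so errors in stage 1 carry over into stage 2.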
