Vocabulary min_count should error when using a nonexistent namespace #2292

nelson-liu · 2019-01-06T17:42:30Z

If you use the min_count parameter of the Vocabulary, but you specify a namespace that does not exist, the vocabulary creation will just silently proceed. It'd be great if it could error in this case, perhaps by popping off the namespaces in min_count and erroring if any are left at the end of vocab creation (would probably go at the end of https://github.com/allenai/allennlp/blob/master/allennlp/data/vocabulary.py#L454)?
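A minimal sketch of the popping idea, written outside of AllenNLP. The helper name `tokens_to_keep`, the use of `ValueError`, and the plain nested-dict counter are assumptions for illustration, not the library's actual code.

```python
from typing import Dict, List


def tokens_to_keep(
    counter: Dict[str, Dict[str, int]],
    min_count: Dict[str, int],
) -> Dict[str, List[str]]:
    """Hypothetical helper: filter tokens by a per-namespace min_count."""
    # Copy so that popping does not mutate the caller's dict.
    remaining = dict(min_count)
    kept: Dict[str, List[str]] = {}
    for namespace, token_counts in counter.items():
        # Namespaces not listed in min_count fall back to a threshold of 1.
        threshold = remaining.pop(namespace, 1)
        kept[namespace] = [
            token for token, count in token_counts.items() if count >= threshold
        ]
    # Anything left over never matched a namespace in the counter.
    if remaining:
        raise ValueError(
            f"min_count refers to unknown namespaces: {sorted(remaining)}"
        )
    return kept
```

With this shape, a typo such as `{"token": 2}` instead of `{"tokens": 2}` would surface as an error once all namespaces have been processed, rather than being ignored.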

schmmd · 2019-01-18T21:05:38Z

We agree it'd be great to have a fix here, since a number of us have hit it, but we're not sure what the right solution is presently.

saujasv · 2019-02-18T02:43:58Z

When iterating through token counts in the loop at line 553 (https://github.com/allenai/allennlp/blob/master/allennlp/data/vocabulary.py#L553), the get method is used to either get a min_count for that namespace, or 1 if that namespace isn't in min_count. One way to address this issue would be to have a try block that tries to access min_count[namespace]. If the namespace doesn't exist, a KeyError occurs, and a ConfigurationError could be raised.

An alternate method is iterating through all the keys in min_count before the loop at line 553, and checking if that key is in counter. If it isn't, a ConfigurationError is raised.
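A rough sketch of this second approach, assuming the counter is a dict keyed by namespace. The function name `check_min_count_namespaces` is hypothetical, and the `ConfigurationError` class below is a local stand-in so the snippet runs without AllenNLP installed.

```python
from typing import Dict, Optional


class ConfigurationError(Exception):
    """Local stand-in for AllenNLP's ConfigurationError (assumption)."""


def check_min_count_namespaces(
    counter: Dict[str, Dict[str, int]],
    min_count: Optional[Dict[str, int]],
) -> None:
    """Raise if min_count names a namespace that is absent from the counter."""
    if not min_count:
        return
    # Collect every offending key so the message lists all of them at once.
    unknown = sorted(key for key in min_count if key not in counter)
    if unknown:
        raise ConfigurationError(
            f"min_count has namespaces that are not in the counter: {unknown}"
        )
```

Because the check runs before the main loop, the loop itself can keep using `min_count.get(namespace, 1)` for namespaces that are simply not mentioned.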

If you think either solution is suitable, I could implement it.

matt-gardner · 2019-02-18T06:02:09Z

It seems to me like the second solution is correct - (I think) we want to fall back to a default of 1 for namespaces that aren't mentioned, so the first solution doesn't work. @nelson-liu, what do you think?

nadgeri14 · 2020-01-16T15:25:18Z

@nelson-liu @matt-gardner If this issue is still pending, I would like to take this up. If so, should I go forward with the second approach to solve it?

matt-gardner · 2020-01-16T16:59:58Z

Yes, @nadgeri14, that would be great. The line numbers referenced above have changed, so the place to do it is probably right above this line, at the top of _extend:

allennlp/allennlp/data/vocabulary.py

Line 436 in c9bd9b2

if not isinstance(max_vocab_size, dict):

nadgeri14 · 2020-01-16T18:58:15Z

@matt-gardner Thanks, that was really helpful.

I need a little help: could you please provide a test case? It would help me validate my solution before submitting the PR.

Thanks in advance for the help.

matt-gardner · 2020-01-16T20:11:31Z

You just need a counter with some set of keys, and a min_count parameter that has keys not present in the counter.
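For illustration, such a test could look roughly like the sketch below. Everything here is an assumption: `validate_min_count` is a hypothetical stand-in for the code under test, `ConfigurationError` is redefined locally so the snippet runs on its own, and the real test would presumably build a Vocabulary with these `counter` and `min_count` arguments instead.

```python
from typing import Dict


class ConfigurationError(Exception):
    """Local stand-in for AllenNLP's ConfigurationError (assumption)."""


def validate_min_count(
    counter: Dict[str, Dict[str, int]], min_count: Dict[str, int]
) -> None:
    """Hypothetical stand-in for the check being tested."""
    missing = sorted(set(min_count) - set(counter))
    if missing:
        raise ConfigurationError(f"namespaces not in counter: {missing}")


# A counter with two namespaces, as the vocabulary would see it.
COUNTER = {"tokens": {"the": 5, "cat": 1}, "labels": {"pos": 2}}


def test_known_namespace_is_accepted() -> None:
    # "tokens" exists in the counter, so no error is expected.
    validate_min_count(COUNTER, {"tokens": 2})


def test_unknown_namespace_is_rejected() -> None:
    # "token" (missing the trailing "s") is not a key of the counter.
    try:
        validate_min_count(COUNTER, {"token": 2})
    except ConfigurationError as error:
        assert "token" in str(error)
    else:
        raise AssertionError("expected ConfigurationError")
```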

nelson-liu added the Good First Issue and Contributions welcome labels Jan 6, 2019

schmmd added the P2 label Jan 18, 2019

nadgeri14 mentioned this issue Jan 17, 2020

compare namespace in counter and min_count #3644

Open
