Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Vocabulary min_count should error when using a nonexistent namespace #2292

Open
nelson-liu opened this issue Jan 6, 2019 · 7 comments
Open

Vocabulary min_count should error when using a nonexistent namespace #2292

nelson-liu opened this issue Jan 6, 2019 · 7 comments

Comments

@nelson-liu
Copy link
Member

@nelson-liu nelson-liu commented Jan 6, 2019

If you use the min_count parameter of the Vocabulary, but you specify a namespace that does not exist, the vocabulary creation will just silently proceed. It'd be great if it could error in this case, perhaps by popping off the namespaces in min_count and erroring if any are left at the end of vocab creation (would probably go at the end of https://github.com/allenai/allennlp/blob/master/allennlp/data/vocabulary.py#L454)?

@schmmd
Copy link
Member

@schmmd schmmd commented Jan 18, 2019

We agree it'd be great to have a fix here, since a number of us have hit it, but we're not sure what the right solution is presently.

@saujasv
Copy link
Contributor

@saujasv saujasv commented Feb 18, 2019

When iterating through token counts in the loop at line 553 (https://github.com/allenai/allennlp/blob/master/allennlp/data/vocabulary.py#L553), the get method is used to either get a min_count for that namespace, or 1 if that namespace isn't in min_count. One way to address this issue would be to have a try block that tries to access min_count[namespace]. If the namespace doesn't exist, a KeyError occurs, and a ConfigurationError could be raised.

An alternate method is iterating through all the keys in min_count before the loop at line 553, and checking if that key is in counter. If it isn't, a ConfigurationError is raised.

If you think either solution is suitable, I could implement it.

@matt-gardner
Copy link
Member

@matt-gardner matt-gardner commented Feb 18, 2019

It seems to me like the second solution is correct - (I think) we want to fall back to a default of 1 for namespaces that aren't mentioned, so the first solution doesn't work. @nelson-liu, what do you think?

@nadgeri14
Copy link
Contributor

@nadgeri14 nadgeri14 commented Jan 16, 2020

@nelson-liu @matt-gardner If this issue is still pending, I would like to take this up. If so, should I go forward with the second approach to solve it?

@matt-gardner
Copy link
Member

@matt-gardner matt-gardner commented Jan 16, 2020

Yes, @nadgeri14, that would be great. The line numbers referenced above have changed, so the place to do it is probably right above this line, at the top of _extend:

if not isinstance(max_vocab_size, dict):

@nadgeri14
Copy link
Contributor

@nadgeri14 nadgeri14 commented Jan 16, 2020

@matt-gardner Thanks, that was really helpful.

I need a small help: Could you please provide a test case, as it will help me to validate my solution before submitting the PR.

Thanks in advance for the help.

@matt-gardner
Copy link
Member

@matt-gardner matt-gardner commented Jan 16, 2020

You just need a counter with some set of keys, and a min_count parameter that has keys not present in the counter.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Linked pull requests

Successfully merging a pull request may close this issue.

None yet
5 participants
You can’t perform that action at this time.