
🧩 Plugins and project ideas master thread #4338

ines opened this issue Sep 29, 2019 · 8 comments


@ines ines commented Sep 29, 2019

I was going through the existing enhancement issues again and thought it'd be nice to collect ideas for spaCy plugins and related projects. There are always people in the community who are looking for new things to build, so here's some inspiration. For existing plugins and projects, check out the spaCy universe.

If you have questions about the projects I suggested, or the spaCy plugin system in general, I should also be able to help. And if you're looking for collaborators or there's a plugin you'd love to see built, feel free to comment here as well.


Visual Studio Code extension (#2969)

I started on a little spaCy snippets extension ages ago and never really quite finished it. But I always thought it'd be cool to have a spaCy extension with some helpers and maybe some deeper pipeline, data structures and model inspection tools. I haven't really worked with VSCode plugins (yet), but maybe someone from the community has an idea and/or experience? Would be cool to work on this together!

Pandas helpers and utilities (#3702)

I think some helpers for pandas could be a nice spaCy plugin? We wouldn't want to ship anything that depends on pandas in the core library, but I can totally see a little helper library that depends on spaCy and pandas and includes useful functions to represent a spaCy Doc as a dataframe.

See: https://github.com/yash1994/dframcy
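A minimal sketch of what such a helper could look like (the `doc_to_dataframe` name is made up for illustration; a blank pipeline is used here, so linguistic attributes like `lemma_` and `pos_` stay empty without a trained model):

```python
import pandas as pd
import spacy


def doc_to_dataframe(doc):
    # One row per token, one column per token attribute
    return pd.DataFrame(
        {
            "text": [token.text for token in doc],
            "lemma": [token.lemma_ for token in doc],
            "pos": [token.pos_ for token in doc],
            "is_alpha": [token.is_alpha for token in doc],
        }
    )


nlp = spacy.blank("en")
doc = nlp("spaCy plays well with pandas")
df = doc_to_dataframe(doc)
```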

Wrappers for debugging pipeline components

Inspired by this Stack Overflow question: https://stackoverflow.com/a/57964354/6400719. Could be a helper that wraps the nlp object and logs processing time and other useful details. I also have a bunch of draft code I'm happy to share if someone wants to work on this. (Also see #3943 for related functionality we want to ship in spaCy.)
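The basic wrapping idea can be sketched in plain Python. The stand-in components below are hypothetical no-ops; a real version would wrap the entries of `nlp.pipeline` instead:

```python
import time


def make_timed(name, component, timings):
    # Wrap a pipeline component so every call records its processing time
    def timed(doc):
        start = time.perf_counter()
        result = component(doc)
        timings[name] = time.perf_counter() - start
        return result

    return timed


timings = {}
# Hypothetical stand-in components; a real wrapper would use spaCy's
pipeline = [("tagger", lambda doc: doc), ("ner", lambda doc: doc)]
doc = "Some text"
for name, component in pipeline:
    doc = make_timed(name, component, timings)(doc)
```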

Project starter as GitHub repo template

GitHub now supports template repos, so it could be cool to have a "spaCy project starter" template that's set up as a Python package, includes some basic scaffolding around loading models and processing texts, and maybe exposes a small REST API using FastAPI.

spaCy + Apache Beam

Thread with notebook and discussion: https://twitter.com/swartchris8/status/1194192895244480512. A package could, for instance, wrap the boilerplate code so all the user has to do is pass in an nlp object and config options (i.e. what should be extracted).
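Independently of Beam, the user-facing API could be as simple as a function that takes the nlp object plus a config describing what to extract (the `extract` name and config shape are hypothetical):

```python
import spacy


def extract(nlp, texts, config):
    # config maps output keys to token attribute names to pull out
    for doc in nlp.pipe(texts):
        yield {
            key: [getattr(token, attr) for token in doc]
            for key, attr in config.items()
        }


nlp = spacy.blank("en")
rows = list(extract(nlp, ["hello world"], {"tokens": "text", "alpha": "is_alpha"}))
```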

Translations of the spaCy course

The spaCy course is open-source and on GitHub and the content is released under a CC BY-NC license. Translating it to other languages could be really cool, to make it easier for people to get started 🙂

Other ideas

  • #2466: Export spaCy models for use in Java environments
  • #2264: Render dependency graph with graphviz
  • #2625: Visualising NER activations inside spaCy models

@GoooIce GoooIce commented Oct 17, 2019

I am translating the course into Chinese:
goooice/spacy-course
course.spacy.cn.miantu.net


@yash1994 yash1994 commented Oct 17, 2019

I've made a utility module to integrate Pandas Dataframe with spaCy. https://github.com/yash1994/dframcy


@ines ines commented Oct 18, 2019

@GoooIce Woooow, this is really cool! Let me know if you have questions or need help. If you give me the text, I can also make a Chinese version of the logo 😃

@yash1994 Nice, thanks for sharing! Do you want to submit it to the spaCy Universe (see here for details)?

Also, one small suggestion: I think it'd be cleaner if your custom classes like DframCy took a loaded nlp object instead of just the model name. Users often want to load their models in a custom way, decide what to enable/disable or use a blank language class instead. If you let the user load the nlp object themselves, they have full flexibility, and your wrapper won't have to consider all possible options under the hood. It also makes it easier to reuse the same nlp object.
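In other words, something along these lines (a simplified sketch, not DframCy's actual code):

```python
import spacy


class DframCy:
    def __init__(self, nlp):
        # Accept a loaded pipeline instead of a model name, so users keep
        # full control over how the nlp object is created and configured
        self.nlp = nlp


nlp = spacy.blank("en")  # or spacy.load("en_core_web_sm", disable=["ner"])
dframcy = DframCy(nlp)
```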


@yash1994 yash1994 commented Oct 18, 2019

Thank you @ines for your suggestions. I understand the point you've made; I'll make the necessary changes in the code and submit a pull request for the spaCy universe. Thanks again for your time.


@kabirkhan kabirkhan commented Oct 21, 2019

@ines It's not a GitHub template repo, but it's a pretty great start with Cookiecutter.
https://github.com/microsoft/cookiecutter-spacy-fastapi

The API follows the rather opinionated request/response format of Azure Search Cognitive Skills, 'cause Microsoft.

PR to add to universe is here:
#4498


@kabirkhan kabirkhan commented Oct 21, 2019

If there's interest in a Template Repo I can also contribute that pretty easily.

@ines I'm super interested in working on the debugging of pipeline components. I wrote a quick wrapper around the Language class to time pipeline steps and I've found it to be really useful despite its hacky nature. Would love to see the draft code you mentioned and I can start working on that.


@ines ines commented Oct 22, 2019

@kabirkhan Thanks – just shared the cookiecutter template on Twitter!

And here's one draft of a SpacyDebugger – it's actually more comments and TODOs than actual code, but it outlines a few ideas I've had (like, having the debugger store the metadata via extension attributes on the Doc so you can process a bunch of objects and then analyse them later).

import copy
import datetime
from spacy.tokens import Doc


class SpacyDebugger(object):
    def __init__(self, nlp):
        self.orig_nlp = nlp
        self.nlp = self.wrap_pipeline(nlp)
        # force=True so creating a second debugger doesn't raise on
        # re-registering the extension
        Doc.set_extension("debug_start_times", default={}, force=True)
        # TODO: add extension method that calculates execution time based on
        # start and end for given component, e.g. doc._.debug_exec_time("ner")
        # TODO: method on Doc that writes everything to a log file?

    def make_debug_component(self, name):
        def debug_component(doc):
            # TODO: add option to not print but store timestamp in extension
            # attributes on the Doc (for each component) in doc._.debug_start_times
            # TODO: use logging module instead of print
            # TODO: option to generate visualization?
            print(f"Before '{name}'", datetime.datetime.now().timestamp())
            return doc

        return debug_component

    def wrap_pipeline(self, nlp):
        nlp = copy.deepcopy(nlp)
        # We don't want to modify this while we're looping over it
        pipeline = list(nlp.pipeline)
        for name, pipe in pipeline:
            debug_component = self.make_debug_component(name)
            nlp.add_pipe(debug_component, before=name, name=f"debug_{name}")
            # TODO: add component after and also log end times
        return nlp
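For instance, the `doc._.debug_exec_time("ner")` TODO above might look roughly like this (a hypothetical sketch, assuming start and end timestamps get stored per component):

```python
import spacy
from spacy.tokens import Doc

# Store per-component start/end timestamps on the Doc, then compute
# execution time from them (force=True allows re-registration)
Doc.set_extension("debug_start_times", default=None, force=True)
Doc.set_extension("debug_end_times", default=None, force=True)


def debug_exec_time(doc, name):
    # Execution time of one component = end timestamp minus start timestamp
    return doc._.debug_end_times[name] - doc._.debug_start_times[name]


Doc.set_extension("debug_exec_time", method=debug_exec_time, force=True)

nlp = spacy.blank("en")
doc = nlp("An example")
doc._.debug_start_times = {"ner": 10.0}
doc._.debug_end_times = {"ner": 10.25}
```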

@skvrahul skvrahul commented Feb 28, 2020

@ines I also would like to work on contributing to the Debugging Wrapper. Could I take that up?
