
🧩 Plugins and project ideas master thread #4338

ines opened this issue Sep 29, 2019 · 8 comments


@ines ines commented Sep 29, 2019

I was going through the existing enhancement issues again and thought it'd be nice to collect ideas for spaCy plugins and related projects. There are always people in the community who are looking for new things to build, so here's some inspiration. For existing plugins and projects, check out the spaCy universe.

If you have questions about the projects I suggested, or the spaCy plugin system in general, I should also be able to help. And if you're looking for collaborators or there's a plugin you'd love to see built, feel free to comment here as well.


Visual Studio Code extension (#2969)

I started on a little spaCy snippets extension ages ago and never really quite finished it. But I always thought it'd be cool to have a spaCy extension with some helpers and maybe some deeper pipeline, data structures and model inspection tools. I haven't really worked with VSCode plugins (yet), but maybe someone from the community has an idea and/or experience? Would be cool to work on this together!

Pandas helpers and utilities (#3702)

I think some helpers for pandas could be a nice spaCy plugin? We wouldn't want to ship anything that depends on pandas in the core library, but I can totally see a little helper library that depends on spaCy and pandas and includes useful functions to represent a spaCy Doc as a dataframe.

See: https://github.com/yash1994/dframcy
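A minimal sketch of what such a helper could look like (the `doc_to_dataframe` name is made up for illustration; a blank pipeline is used here, so linguistic attributes like `lemma_` and `pos_` stay empty without a trained model):

```python
import pandas as pd
import spacy


def doc_to_dataframe(doc):
    # One row per token, one column per token attribute
    return pd.DataFrame(
        {
            "text": [token.text for token in doc],
            "lemma": [token.lemma_ for token in doc],
            "pos": [token.pos_ for token in doc],
            "is_alpha": [token.is_alpha for token in doc],
        }
    )


nlp = spacy.blank("en")
doc = nlp("spaCy plays well with pandas")
df = doc_to_dataframe(doc)
```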

Wrappers for debugging pipeline components

Inspired by this Stack Overflow question: https://stackoverflow.com/a/57964354/6400719. Could be a helper that wraps the nlp object and logs processing time and other useful details. I also have a bunch of draft code I'm happy to share if someone wants to work on this. (Also see #3943 for related functionality we want to ship in spaCy.)
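The basic wrapping idea can be sketched in plain Python. The stand-in components below are hypothetical no-ops; a real version would wrap the entries of `nlp.pipeline` instead:

```python
import time


def make_timed(name, component, timings):
    # Wrap a pipeline component so every call records its processing time
    def timed(doc):
        start = time.perf_counter()
        result = component(doc)
        timings[name] = time.perf_counter() - start
        return result

    return timed


timings = {}
# Hypothetical stand-in components; a real wrapper would use spaCy's
pipeline = [("tagger", lambda doc: doc), ("ner", lambda doc: doc)]
doc = "Some text"
for name, component in pipeline:
    doc = make_timed(name, component, timings)(doc)
```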

Project starter as GitHub repo template

GitHub now supports template repos, so it could be cool to have a "spaCy project starter" template that's set up as a Python package, includes some basic scaffolding around loading models and processing texts, and maybe exposes a small REST API using FastAPI.

spaCy + Apache Beam

Thread with notebook and discussion: https://twitter.com/swartchris8/status/1194192895244480512. A package could, for instance, wrap the boilerplate code so all the user has to do is pass in an nlp object and config options (i.e. what should be extracted).
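Independently of Beam, the user-facing API could be as simple as a function that takes the nlp object plus a config describing what to extract (the `extract` name and config shape are hypothetical):

```python
import spacy


def extract(nlp, texts, config):
    # config maps output keys to token attribute names to pull out
    for doc in nlp.pipe(texts):
        yield {
            key: [getattr(token, attr) for token in doc]
            for key, attr in config.items()
        }


nlp = spacy.blank("en")
rows = list(extract(nlp, ["hello world"], {"tokens": "text", "alpha": "is_alpha"}))
```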

Translations of the spaCy course

The spaCy course is open-source and on GitHub and the content is released under a CC BY-NC license. Translating it to other languages could be really cool, to make it easier for people to get started 🙂

Other ideas

  • #2466: Export spaCy models for use in Java environments
  • #2264: Render dependency graph with graphviz
  • #2625: Visualising NER activations inside spaCy models

@GoooIce GoooIce commented Oct 17, 2019

I am translating the course into Chinese:
goooice/spacy-course
course.spacy.cn.miantu.net


@yash1994 yash1994 commented Oct 17, 2019

I've made a utility module to integrate Pandas Dataframe with spaCy. https://github.com/yash1994/dframcy


@ines ines commented Oct 18, 2019

@GoooIce Woooow, this is really cool! Let me know if you have questions or need help. If you give me the text, I can also make a Chinese version of the logo 😃

@yash1994 Nice, thanks for sharing! Do you want to submit it to the spaCy Universe (see here for details)?

Also, one small suggestion: I think it'd be cleaner if your custom classes like DframCy took a loaded nlp object instead of just the model name. Users often want to load their models in a custom way, decide what to enable/disable or use a blank language class instead. If you let the user load the nlp object themselves, they have full flexibility, and your wrapper won't have to consider all possible options under the hood. It also makes it easier to reuse the same nlp object.
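In other words, something along these lines (a simplified sketch, not DframCy's actual code):

```python
import spacy


class DframCy:
    def __init__(self, nlp):
        # Accept a loaded pipeline instead of a model name, so users keep
        # full control over how the nlp object is created and configured
        self.nlp = nlp


nlp = spacy.blank("en")  # or spacy.load("en_core_web_sm", disable=["ner"])
dframcy = DframCy(nlp)
```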


@yash1994 yash1994 commented Oct 18, 2019

Thank you @ines for your suggestions. I understand the point you've made; I'll make the necessary changes in the code and submit a pull request for the spaCy universe. Thanks again for your time.


@kabirkhan kabirkhan commented Oct 21, 2019

@ines It's not a GitHub template repo, but it's a pretty great start with Cookiecutter.
https://github.com/microsoft/cookiecutter-spacy-fastapi

The API follows the rather opinionated request/response format of Azure Search Cognitive Skills, 'cause Microsoft.

PR to add to universe is here:
#4498


@kabirkhan kabirkhan commented Oct 21, 2019

If there's interest in a Template Repo I can also contribute that pretty easily.

@ines I'm super interested in working on the debugging of pipeline components. I wrote a quick wrapper around the Language class to time pipeline steps and I've found it to be really useful despite its hacky nature. Would love to see the draft code you mentioned and I can start working on that.


@ines ines commented Oct 22, 2019

@kabirkhan Thanks – just shared the cookiecutter template on Twitter!

And here's one draft of a SpacyDebugger – it's actually more comments and TODOs than actual code, but it outlines a few ideas I've had (like, having the debugger store the metadata via extension attributes on the Doc so you can process a bunch of objects and then analyse them later).

import copy
import datetime
from spacy.tokens import Doc


class SpacyDebugger(object):
    def __init__(self, nlp):
        self.orig_nlp = nlp
        self.nlp = self.wrap_pipeline(nlp)
        # force=True so creating a second debugger doesn't raise on
        # re-registering the extension
        Doc.set_extension("debug_start_times", default={}, force=True)
        # TODO: add extension method that calculates execution time based on
        # start and end for given component, e.g. doc._.debug_exec_time("ner")
        # TODO: method on Doc that writes everything to a log file?

    def make_debug_component(self, name):
        def debug_component(doc):
            # TODO: add option to not print but store timestamp in extension
            # attributes on the Doc (for each component) in doc._.debug_start_times
            # TODO: use logging module instead of print
            # TODO: option to generate visualization?
            print(f"Before '{name}'", datetime.datetime.now().timestamp())
            return doc

        return debug_component

    def wrap_pipeline(self, nlp):
        nlp = copy.deepcopy(nlp)
        # We don't want to modify this while we're looping over it
        pipeline = list(nlp.pipeline)
        for name, pipe in pipeline:
            debug_component = self.make_debug_component(name)
            nlp.add_pipe(debug_component, before=name, name=f"debug_{name}")
            # TODO: add component after and also log end times
        return nlp
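For instance, the `doc._.debug_exec_time("ner")` TODO above might look roughly like this (a hypothetical sketch, assuming start and end timestamps get stored per component):

```python
import spacy
from spacy.tokens import Doc

# Store per-component start/end timestamps on the Doc, then compute
# execution time from them (force=True allows re-registration)
Doc.set_extension("debug_start_times", default=None, force=True)
Doc.set_extension("debug_end_times", default=None, force=True)


def debug_exec_time(doc, name):
    # Execution time of one component = end timestamp minus start timestamp
    return doc._.debug_end_times[name] - doc._.debug_start_times[name]


Doc.set_extension("debug_exec_time", method=debug_exec_time, force=True)

nlp = spacy.blank("en")
doc = nlp("An example")
doc._.debug_start_times = {"ner": 10.0}
doc._.debug_end_times = {"ner": 10.25}
```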

@skvrahul skvrahul commented Feb 28, 2020

@ines I also would like to work on contributing to the Debugging Wrapper. Could I take that up?
