🧩 Plugins and project ideas master thread #4338
Comments
I am translating the course into Chinese:
I've made a utility module to integrate pandas DataFrames with spaCy: https://github.com/yash1994/dframcy
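The idea of pairing spaCy with pandas can be sketched in a few lines. This is not dframcy's actual API, just an illustration of the concept, using a blank English pipeline so no model download is needed:

```python
import pandas as pd
import spacy

# Blank pipeline: tokenizer + lexeme attributes only, no trained components
nlp = spacy.blank("en")
doc = nlp("spaCy and pandas work well together")

# One row per token, one column per token attribute
df = pd.DataFrame(
    [(token.i, token.text, token.is_alpha, token.is_stop) for token in doc],
    columns=["index", "text", "is_alpha", "is_stop"],
)
print(df)
```

From here, the usual pandas machinery (filtering, grouping, CSV export) applies directly to the token table.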
@GoooIce Woooow, this is really cool! Let me know if you have questions or need help. If you give me the text, I can also make a Chinese version of the logo. @yash1994 Nice, thanks for sharing! Do you want to submit it to the spaCy Universe (see here for details)? Also, one small suggestion: I think it'd be cleaner if your custom classes like …
Thank you @ines for your suggestions. I understand the point you've made, will make the necessary changes in the code, and will submit a pull request for the spaCy Universe. Thanks again for your time.
@ines It's not a GitHub template repo, but it's a pretty great start with Cookiecutter. The API follows the rather opinionated request/response format of Azure Search Cognitive Skills ('cause Microsoft). The PR to add it to the universe is here:
If there's interest in a template repo, I can also contribute that pretty easily. @ines I'm super interested in working on the debugging of pipeline components. I wrote a quick wrapper around the Language class to time pipeline steps, and I've found it really useful despite its hacky nature. I'd love to see the draft code you mentioned so I can start working on that.
@kabirkhan Thanks – just shared the cookiecutter template on Twitter! And here's one draft of a debugging wrapper:

```python
import copy
import datetime

from spacy.tokens import Doc


class SpacyDebugger(object):
    def __init__(self, nlp):
        self.orig_nlp = nlp
        self.nlp = self.wrap_pipeline(nlp)
        Doc.set_extension("debug_start_times", default={})
        # TODO: add extension method that calculates execution time based on
        # start and end for given component, e.g. doc._.debug_exec_time("ner")
        # TODO: method on Doc that writes everything to a log file?

    def make_debug_component(self, name):
        def debug_component(doc):
            # TODO: add option to not print but store timestamp in extension
            # attributes on the Doc (for each component) in
            # doc._.debug_start_times
            # TODO: use logging module instead of print
            # TODO: option to generate visualization?
            print(f"Before '{name}'", datetime.datetime.now().timestamp())
            return doc
        return debug_component

    def wrap_pipeline(self, nlp):
        nlp = copy.deepcopy(nlp)
        # We don't want to modify this while we're looping over it
        pipeline = list(nlp.pipeline)
        for name, pipe in pipeline:
            debug_component = self.make_debug_component(name)
            nlp.add_pipe(debug_component, before=name, name=f"debug_{name}")
        # TODO: add component after and also log end times
        return nlp
```
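Independent of spaCy, the timing idea in the draft above can be sketched with plain Python callables. Everything here (the `time_pipeline` helper and the toy pipeline steps) is illustrative, not spaCy API:

```python
import time


def time_pipeline(pipeline):
    """Wrap each (name, func) step so its execution time is recorded."""
    timings = {}

    def wrap(name, func):
        def timed(doc):
            start = time.perf_counter()
            result = func(doc)
            timings[name] = time.perf_counter() - start
            return result
        return timed

    wrapped = [(name, wrap(name, func)) for name, func in pipeline]
    return wrapped, timings


# Toy "pipeline": each step transforms the input and passes it on,
# mirroring how spaCy components pass the Doc along
pipeline = [
    ("lower", lambda text: text.lower()),
    ("tokenize", lambda text: text.split()),
]
wrapped, timings = time_pipeline(pipeline)

doc = "Hello World"
for name, step in wrapped:
    doc = step(doc)
print(doc, timings)
```

The same closure-over-a-dict pattern is what the `make_debug_component` factory above relies on: each wrapper remembers its component's name and writes into shared state.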
@ines I'd also like to contribute to the debugging wrapper. Could I take that up?
I was going through the existing enhancement issues again and thought it'd be nice to collect ideas for spaCy plugins and related projects. There are always people in the community who are looking for new things to build, so here's some inspiration ✨ For existing plugins and projects, check out the spaCy universe.
If you have questions about the projects I suggested, or the spaCy plugin system in general, I should also be able to help. And if you're looking for collaborators or there's a plugin you'd love to see built, feel free to comment here as well.
Visual Studio Code extension (#2969)
Pandas helpers and utilities (#3702)
Wrappers for debugging pipeline components
Inspired by this Stack Overflow question: https://stackoverflow.com/a/57964354/6400719. Could be a helper that wraps the `nlp` object and logs processing time and other useful details. I also have a bunch of draft code I'm happy to share if someone wants to work on this. (Also see #3943 for related functionality we want to ship in spaCy.)

Project starter as GitHub repo template
GitHub now supports template repos, so it could be cool to have a "spaCy project starter" template that's set up as a Python package, includes some basic scaffolding around loading models and processing texts, and maybe exposes a small REST API using FastAPI.
microsoft/cookiecutter-spacy-fastapi by @kabirkhan

spaCy + Apache Beam
Thread with notebook and discussion: https://twitter.com/swartchris8/status/1194192895244480512. A package could, for instance, wrap the boilerplate code so all the user has to do is pass in an `nlp` object and config options (what should be extracted).

Translations of the spaCy course
The spaCy course is open-source on GitHub, and its content is released under a CC BY-NC license. Translating it into other languages could be really cool and would make it easier for people to get started 🙂
Other ideas