Open-Source Pre-Processing Tools for Unstructured Data

Welcome to Unstructured Technologies! We're delivering the first-ever open-source toolkit designed to make it easy to prepare unstructured data, such as PDFs, HTML, and Word documents, for downstream data science tasks. Check out our core repos:

  • unstructured - Core library with pre-processing components for unstructured data, including partitioning, cleaning, and staging bricks.
  • unstructured-api - Project that provides unstructured's core partitioning capability as an API, able to process many types of raw documents.
  • unstructured-api-tools - Library that converts pipeline notebooks to REST APIs for easy consumption in data science and machine learning workflows.
  • unstructured-inference - Library with inference code that can be used locally in unstructured or as a hosted service.
  • pipeline-oer - A document pre-processing pipeline for US Army Officer Evaluation Reports (OERs).
  • pipeline-paddleocr - A pipeline for running images through PaddleOCR, an open-source, multilingual OCR tool.
  • pipeline-sec-filings - A document pre-processing pipeline for SEC filings focused on 10-Ks, 10-Qs, and S-1s.
  • pipeline-template - Use this template-driven utility when creating a new pipeline- project.
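
To give a feel for the partitioning, cleaning, and staging steps mentioned in the list above, here is a minimal sketch that uses only the Python standard library. It is not the unstructured library's API: `Element`, `partition_html_sketch`, `clean_whitespace`, and `stage_as_dicts` are hypothetical names made up for this illustration, and the real bricks handle far more document types and edge cases.

```python
import re
from dataclasses import dataclass, asdict
from html.parser import HTMLParser


@dataclass
class Element:
    """Hypothetical stand-in for a document element (not the library's class)."""
    category: str  # "Title" or "NarrativeText"
    text: str


class _BlockCollector(HTMLParser):
    """Collects the text of heading and paragraph tags in document order."""

    HEADINGS = {"h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self):
        super().__init__()
        self.elements = []
        self._tag = None      # tag currently being collected, if any
        self._buffer = []     # text chunks seen inside that tag

    def handle_starttag(self, tag, attrs):
        if tag in self.HEADINGS or tag == "p":
            self._tag = tag
            self._buffer = []

    def handle_data(self, data):
        if self._tag is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == self._tag:
            category = "Title" if tag in self.HEADINGS else "NarrativeText"
            self.elements.append(Element(category, "".join(self._buffer)))
            self._tag = None


def partition_html_sketch(raw):
    """Partitioning: split raw HTML into a list of typed elements."""
    parser = _BlockCollector()
    parser.feed(raw)
    parser.close()
    return parser.elements


def clean_whitespace(text):
    """Cleaning: collapse runs of whitespace and trim both ends."""
    return re.sub(r"\s+", " ", text).strip()


def stage_as_dicts(elements):
    """Staging: convert elements to plain dicts for a downstream tool."""
    return [asdict(e) for e in elements]


raw = "<h1>Annual   Report</h1><p>Revenue grew\n in 2022.</p>"
elements = partition_html_sketch(raw)
cleaned = [Element(e.category, clean_whitespace(e.text)) for e in elements]
records = stage_as_dicts(cleaned)
print(records)
```

The three functions are kept separate so each step can be swapped or skipped independently, which is the idea behind composing bricks into a pipeline.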

See community for more general documentation about the pipeline- family of APIs and about contributing across all of Unstructured's repos.

Learn more

Section          Description
Company Website  Unstructured.io product and company info
Documentation    Full unstructured documentation

Popular repositories

  1. Open source libraries and APIs to build custom preprocessing pipelines for labeling, training, or production machine learning pipelines.

    HTML 1.5k 96

  2. Preprocessing pipeline notebooks and API supporting text extraction from SEC documents

    Jupyter Notebook 61 13

  3. Jupyter Notebook 51 8

  4. Open source libraries and APIs to build custom preprocessing pipelines for labeling, training, or production machine learning pipelines.

    8 4
