Unstructured

Open-Source Pre-Processing Tools for Unstructured Data

Welcome to Unstructured Technologies! We're delivering the first-ever open-source toolkit designed to make it easy to prepare unstructured data such as PDFs, HTML, and Word documents for downstream data science tasks. Check out our core repos:

unstructured - Core library with pre-processing components for unstructured data, including partitioning, cleaning, and staging bricks.
unstructured-api - Project that provides unstructured's core partitioning capability as an API, able to process many types of raw documents.
unstructured-api-tools - Library that converts pipeline notebooks to REST APIs for easy consumption in data science and machine learning workflows.
unstructured-inference - Library with inference code that can be used locally in unstructured or as a hosted service.
pipeline-oer - A document pre-processing pipeline for US Army Officer Evaluation Reports (OERs).
pipeline-paddleocr - A pipeline for running images through PaddleOCR, an open-source, multilingual OCR tool.
pipeline-sec-filings - A document pre-processing pipeline for SEC filings focused on 10-Ks, 10-Qs, and S-1s.
pipeline-template - Use this template-driven utility when creating a new pipeline- project.
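
The three brick types named in the unstructured entry above (partitioning, cleaning, staging) can be pictured as stages of a small pipeline. The following is a minimal conceptual sketch in plain Python, not the library's API: the `Element` class, the three function names, and the title heuristic are all hypothetical and exist only to illustrate the flow.

```python
import json
import re
from dataclasses import dataclass


@dataclass
class Element:
    """Hypothetical stand-in for a document element; not the library's class."""
    category: str
    text: str


def partition_text(raw: str) -> list[Element]:
    """Partition: split raw text into elements on blank lines."""
    elements = []
    for block in re.split(r"\n\s*\n", raw.strip()):
        # Assumption for this sketch: a short block with no trailing period is a title.
        is_title = len(block) < 40 and not block.rstrip().endswith(".")
        elements.append(Element("Title" if is_title else "NarrativeText", block))
    return elements


def clean_whitespace(element: Element) -> Element:
    """Clean: collapse runs of whitespace (including newlines) into single spaces."""
    return Element(element.category, re.sub(r"\s+", " ", element.text).strip())


def stage_as_json(elements: list[Element]) -> str:
    """Stage: serialize elements into JSON for a downstream consumer."""
    return json.dumps([{"type": e.category, "text": e.text} for e in elements])


raw_document = "Annual   Report\n\nRevenue grew   in the\nfourth quarter."
elements = [clean_whitespace(e) for e in partition_text(raw_document)]
staged = stage_as_json(elements)
print(staged)
```

Keeping each stage as a separate function mirrors the idea of composable bricks: a different cleaner or staging format can be swapped in without touching the partitioning step.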

See community for more general documentation about the pipeline- family of APIs and about contributing across all of Unstructured's repos.

Learn more

Section	Description
Company Website	Unstructured.io product and company info
Documentation	Full `unstructured` documentation
