Open-Source Pre-Processing Tools for Unstructured Data
Welcome to Unstructured Technologies! We're delivering the first-ever open-source toolkit designed to make it easy to prepare unstructured data such as PDFs, HTML, and Word documents for downstream data science tasks. Check out our core repos:
- `unstructured` - Core library with pre-processing components for unstructured data, including partitioning, cleaning, and staging bricks.
- `unstructured-api` - Project that provides `unstructured`'s core partitioning capability as an API, able to process many types of raw documents.
- `unstructured-api-tools` - Library that converts pipeline notebooks to REST APIs for easy consumption in data science and machine learning workflows.
- `unstructured-inference` - Library with inference code that can be used locally in `unstructured` or as a hosted service.
- `pipeline-oer` - A document pre-processing pipeline for US Army Officer Evaluation Reports (OERs).
- `pipeline-paddleocr` - A pipeline for running images through PaddleOCR, an open-source, multilingual OCR tool.
- `pipeline-sec-filings` - A document pre-processing pipeline for SEC filings focused on 10-Ks, 10-Qs, and S-1s.
- `pipeline-template` - Use this template-driven utility when creating a new `pipeline-` project.
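
To give a feel for what partitioning, cleaning, and staging mean in practice, here is a minimal sketch using only the Python standard library. The helper names (`partition_text`, `clean_whitespace`, `stage_as_json`) are hypothetical and chosen for illustration; they are not the `unstructured` library's API, so please check the documentation for the real bricks.

```python
import json
import re

# Hypothetical helpers for illustration only.
# These are NOT functions from the unstructured library.


def partition_text(raw: str) -> list[str]:
    """Partition: split raw text into paragraph-level elements on blank lines."""
    return [block for block in re.split(r"\n\s*\n", raw) if block.strip()]


def clean_whitespace(element: str) -> str:
    """Clean: collapse runs of whitespace (including line breaks) and trim the ends."""
    return re.sub(r"\s+", " ", element).strip()


def stage_as_json(elements: list[str]) -> str:
    """Stage: serialize the cleaned elements as JSON for a downstream task."""
    return json.dumps([{"id": i, "text": text} for i, text in enumerate(elements)])


raw_document = "Quarterly   Report\n\nRevenue grew   in the\nthird quarter.\n\n"

# Run the three stages in order: partition, then clean, then stage.
elements = [clean_whitespace(e) for e in partition_text(raw_document)]
staged = stage_as_json(elements)
print(staged)
```

Each stage here is a small, independent function, which loosely mirrors the idea of composable bricks: you could swap the cleaning step or the staging format without touching the others.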
See `community` for more general documentation about the `pipeline-` family of APIs and about contributing across all of Unstructured's repos.
Learn more
| Section | Description |
|---|---|
| Company Website | Unstructured.io product and company info |
| Documentation | Full unstructured documentation |
