Quick post about using the unstructured Python library to work with documents (e.g., in ETL processes).
- https://github.com/Unstructured-IO/unstructured
- https://docs.unstructured.io/open-source/introduction/overview
Before you can import this library, you’ll need to install it with one of the following commands.
- pip install unstructured[all-docs]
- pip install unstructured[docx]
If you are only working with Word (docx) files, install just the docx extra using the second command above.
Why only install what you need?
- unstructured[docx]: 56 packages installed
- unstructured[all-docs]: 129 packages installed
- Difference: 73 additional packages (130% more)
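To double-check the numbers above, here is a minimal sketch of the arithmetic. The helper names (`extra_packages`, `count_installed`) are made up for this post. `count_installed` is just one possible way to count packages in a fresh virtual environment, and the totals you get will likely vary with the library version and your platform.

```python
from importlib.metadata import distributions


def extra_packages(small, large):
    """Return (absolute difference, percent increase) between two package counts."""
    diff = large - small
    return diff, diff / small * 100


def count_installed():
    """Count the distributions visible in the current environment (hypothetical helper)."""
    return sum(1 for _ in distributions())


# Counts taken from the list above.
diff, pct = extra_packages(56, 129)
print(f"{diff} additional packages ({pct:.0f}% more)")
```

Running `count_installed()` in a clean virtual environment after each install command is one way to reproduce counts like these.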
You’ll import the unstructured lib into your Python script with statements like the following.
- from unstructured.partition.docx import partition_docx
- from unstructured.partition.auto import partition
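As a rough sketch of how these two imports might be used together: the function below picks `partition_docx` for Word files and falls back to the auto-detecting `partition` for everything else. The `extract_text` name and the `example.docx` file are hypothetical. The imports sit inside the function so that only the code path you actually hit needs its extra installed.

```python
from pathlib import Path


def extract_text(path):
    """Return the text of each element found in the document at `path`."""
    if Path(path).suffix.lower() == ".docx":
        # Word files: only the docx extra is needed for this branch.
        from unstructured.partition.docx import partition_docx

        elements = partition_docx(filename=str(path))
    else:
        # Other formats: partition() detects the file type itself,
        # but each format needs its own extra installed.
        from unstructured.partition.auto import partition

        elements = partition(filename=str(path))
    # Each element exposes its content through the .text attribute.
    return [element.text for element in elements]


# Example usage (assumes a file named example.docx exists):
# for line in extract_text("example.docx"):
#     print(line)
```

If you only ever handle docx files, you can drop the `else` branch and keep just the `partition_docx` import.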
Enjoy!