API Reference¶
This page shows the primary APIs, minimal runnable examples, expected outputs, and the dependencies required for optional features. Hyperlinks are included when include_cell_links=True (or when using mode="verbose"). Auto page-break areas are COM-only and appear when auto page-break extraction/output is enabled.
TOC¶
- API Reference
- TOC
- Quick Examples
- Dependencies
- Auto-generated API (mkdocstrings)
- Models
- Error Handling
- Tuning Examples
Quick Examples¶
from exstruct import extract, export
wb = extract("sample.xlsx", mode="standard")
export(wb, "out.json") # compact JSON by default
Expected JSON snippet (links appear when enabled):
{
  "book_name": "sample.xlsx",
  "sheets": {
    "Sheet1": {
      "rows": [{ "r": 1, "c": { "0": "Name", "1": "Age" }, "links": null }],
      "shapes": [
        {
          "text": "note",
          "l": 10,
          "t": 20,
          "w": 80,
          "h": 24,
          "type": "TextBox"
        }
      ],
      "charts": [],
      "table_candidates": ["A1:B5"]
    }
  }
}
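As a rough illustration of consuming this output, the sketch below parses a payload shaped like the snippet above using only the standard library. The helper names (row_values, summarize) are hypothetical and not part of exstruct, and the ordering step assumes column keys are integer-like strings, as they are in the snippet.

```python
import json

# Sample payload copied from the expected JSON snippet above.
payload = json.loads("""
{
  "book_name": "sample.xlsx",
  "sheets": {
    "Sheet1": {
      "rows": [{"r": 1, "c": {"0": "Name", "1": "Age"}, "links": null}],
      "shapes": [{"text": "note", "l": 10, "t": 20, "w": 80, "h": 24, "type": "TextBox"}],
      "charts": [],
      "table_candidates": ["A1:B5"]
    }
  }
}
""")


def row_values(row: dict) -> list:
    """Return a row's cell values ordered by column key (assumed integer-like)."""
    cells = row["c"]
    return [cells[key] for key in sorted(cells, key=int)]


def summarize(book: dict) -> dict:
    """Collect first-row cells, shape texts, and table candidates per sheet."""
    summary = {}
    for name, sheet in book["sheets"].items():
        rows = sheet.get("rows", [])
        summary[name] = {
            "first_row": row_values(rows[0]) if rows else [],
            "shape_texts": [s["text"] for s in sheet.get("shapes", []) if s.get("text")],
            "table_candidates": sheet.get("table_candidates", []),
        }
    return summary


summary = summarize(payload)
print(summary["Sheet1"])
```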
CLI-equivalent flow via Python:
from pathlib import Path
from exstruct import process_excel
process_excel(
    file_path=Path("input.xlsx"),
    output_path=None,  # default: stdout (redirect if you want a file)
    sheets_dir=Path("out_sheets"),  # optional per-sheet outputs
    out_fmt="json",
    image=True,
    pdf=True,
    mode="standard",
    pretty=True,
)
# Same as: exstruct input.xlsx --format json --pdf --image --mode standard --pretty --sheets-dir out_sheets > out.json
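To make the correspondence between the keyword arguments and the command line explicit, here is a small sketch that builds an argument list from the same options. build_cli_args is a hypothetical helper, not an exstruct function, and it uses only the flags that appear in the comment above; any other CLI options are out of scope here.

```python
from pathlib import Path
from typing import List, Optional


def build_cli_args(
    file_path: Path,
    out_fmt: str = "json",
    image: bool = False,
    pdf: bool = False,
    mode: str = "standard",
    pretty: bool = False,
    sheets_dir: Optional[Path] = None,
) -> List[str]:
    """Build an argv list using only the flags shown in the comment above."""
    args = ["exstruct", str(file_path), "--format", out_fmt]
    if pdf:
        args.append("--pdf")
    if image:
        args.append("--image")
    args += ["--mode", mode]
    if pretty:
        args.append("--pretty")
    if sheets_dir is not None:
        args += ["--sheets-dir", str(sheets_dir)]
    return args


argv = build_cli_args(
    Path("input.xlsx"),
    image=True,
    pdf=True,
    pretty=True,
    sheets_dir=Path("out_sheets"),
)
print(" ".join(argv))
```

The resulting list could be handed to subprocess.run, with stdout redirected to a file when a JSON file is wanted.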
Dependencies¶
- Core extraction: pandas, openpyxl (installed with the package).
- YAML export: pyyaml (lazy import; a missing module raises MissingDependencyError).
- TOON export: python-toon (lazy import; a missing module raises MissingDependencyError).
- Auto page-break extraction/export: Excel + COM required (the feature is skipped when COM is unavailable).
- Rendering (PDF/PNG): Excel + COM + pypdfium2 are mandatory. Missing Excel/COM or pypdfium2 surfaces as RenderError/MissingDependencyError.
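Because the optional dependencies are imported lazily, it can be convenient to check for them up front. The following is a minimal sketch using importlib from the standard library; missing_features is a hypothetical helper, and the import names are assumptions: PyYAML is imported as yaml and pypdfium2 as pypdfium2, while "toon" is only a guess at the module name provided by python-toon.

```python
import importlib.util

# Feature -> import name. The import names are assumptions (see the lead-in);
# in particular "toon" is a guess for the python-toon distribution.
OPTIONAL_MODULES = {
    "yaml export": "yaml",
    "toon export": "toon",
    "pdf/png rendering": "pypdfium2",
}


def missing_features(modules: dict = OPTIONAL_MODULES) -> list:
    """Return the feature names whose backing module cannot be found."""
    return [
        feature
        for feature, module in modules.items()
        if importlib.util.find_spec(module) is None
    ]


print(missing_features())
```

This only checks Python modules; whether Excel and COM are available has to be verified separately.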
Auto-generated API (mkdocstrings)¶
For the latest Python API details, see the auto-generated sections below (kept in sync with the docstrings).
Core functions¶
exstruct.extract ¶
extract(file_path: str | Path, mode: ExtractionMode = 'standard') -> WorkbookData
Extract an Excel workbook into WorkbookData.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| file_path | str \| Path | Path to .xlsx/.xlsm/.xls. | required |
| mode | ExtractionMode | "light" / "standard" / "verbose". light: cells + table detection only (no COM; shapes/charts empty), print areas via openpyxl. standard: shapes with text + arrows + charts (COM if available), print areas included; shape/chart size is kept but hidden by default in output. verbose: all shapes (including textless) with size, charts with size. | 'standard' |
Returns:
| Type | Description |
|---|---|
| WorkbookData | WorkbookData containing sheets, rows, shapes, charts, and print areas. |
Raises:
| Type | Description |
|---|---|
| ValueError | If an invalid mode is provided. |
Examples:
Extract with hyperlinks (verbose) and inspect table candidates:
>>> from exstruct import extract
>>> wb = extract("input.xlsx", mode="verbose")
>>> wb.sheets["Sheet1"].table_candidates
['A1:B5']
exstruct.export ¶
export(data: WorkbookData, path: str | Path, fmt: Literal['json', 'yaml', 'yml', 'toon'] | None = None, *, pretty: bool = False, indent: int | None = None) -> None
Save WorkbookData to a file (format inferred from extension).
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| data | WorkbookData | WorkbookData from | required |
| path | str \| Path | destination path; extension is used to infer format | required |
| fmt | Literal['json', 'yaml', 'yml', 'toon'] \| None | explicitly set format if desired (json/yaml/yml/toon) | None |
| pretty | bool | pretty-print JSON | False |
| indent | int \| None | JSON indent width (defaults to 2 when pretty=True and indent is None) | None |
Raises:
| Type | Description |
|---|---|
| ValueError | If the format is unsupported. |
Examples:
Write pretty JSON and YAML (requires pyyaml):
>>> from exstruct import export, extract
>>> wb = extract("input.xlsx")
>>> export(wb, "out.json", pretty=True)
>>> export(wb, "out.yaml", fmt="yaml")
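The precedence between an explicit fmt and the file extension can be pictured with a small stand-alone sketch. infer_format is a hypothetical helper written for illustration, not exstruct's implementation; it assumes an explicit fmt wins over the suffix and that matching is case-insensitive.

```python
from pathlib import Path
from typing import Optional

SUPPORTED = {"json", "yaml", "yml", "toon"}


def infer_format(path: str, fmt: Optional[str] = None) -> str:
    """Use an explicit fmt if given, otherwise fall back to the file suffix."""
    chosen = (fmt or Path(path).suffix.lstrip(".")).lower()
    if chosen not in SUPPORTED:
        # Mirrors the documented ValueError for unsupported formats.
        raise ValueError(f"Unsupported format: {chosen!r}")
    return chosen


print(infer_format("out.json"))              # json
print(infer_format("out.txt", fmt="toon"))   # toon
```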
exstruct.export_sheets ¶
export_sheets(data: WorkbookData, dir_path: str | Path) -> dict[str, Path]
Export each sheet as an individual JSON file.
- Payload: {book_name, sheet_name, sheet: SheetData}
- Returns: {sheet_name: Path}
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| data | WorkbookData | WorkbookData to split by sheet. | required |
| dir_path | str \| Path | Output directory. | required |
Returns:
| Type | Description |
|---|---|
| dict[str, Path] | Mapping from sheet name to written JSON path. |
Examples:
>>> from exstruct import export_sheets, extract
>>> wb = extract("input.xlsx")
>>> paths = export_sheets(wb, "out_sheets")
>>> "Sheet1" in paths
True
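Reading the per-sheet files back is straightforward given the documented payload shape {book_name, sheet_name, sheet}. The sketch below is an illustration only: load_sheet_payloads is a hypothetical helper, and the file name Sheet1.json is made up for the demo (the actual names come from the mapping returned by export_sheets).

```python
import json
import tempfile
from pathlib import Path


def load_sheet_payloads(paths: dict) -> dict:
    """Read each per-sheet JSON file and check the documented top-level keys."""
    loaded = {}
    for sheet_name, path in paths.items():
        payload = json.loads(Path(path).read_text(encoding="utf-8"))
        missing = {"book_name", "sheet_name", "sheet"} - set(payload)
        if missing:
            raise KeyError(f"{path}: missing keys {sorted(missing)}")
        loaded[sheet_name] = payload["sheet"]
    return loaded


# Stand-in for the mapping returned by export_sheets, using a hand-written file.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "Sheet1.json"  # hypothetical file name
    sample.write_text(
        json.dumps({"book_name": "input.xlsx", "sheet_name": "Sheet1", "sheet": {"rows": []}}),
        encoding="utf-8",
    )
    sheets = load_sheet_payloads({"Sheet1": sample})

print(sheets)
```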
exstruct.export_sheets_as ¶
export_sheets_as(data: WorkbookData, dir_path: str | Path, fmt: Literal['json', 'yaml', 'yml', 'toon'] = 'json', *, pretty: bool = False, indent: int | None = None) -> dict[str, Path]
Export each sheet in the given format (json/yaml/toon); returns sheet name to path map.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| data | WorkbookData | WorkbookData to split by sheet. | required |
| dir_path | str \| Path | Output directory. | required |
| fmt | Literal['json', 'yaml', 'yml', 'toon'] | Output format; defaults to json. | 'json' |
| pretty | bool | Pretty-print JSON. | False |
| indent | int \| None | JSON indent width (defaults to 2 when pretty=True and indent is None). | None |
Returns:
| Type | Description |
|---|---|
| dict[str, Path] | Mapping from sheet name to written file path. |
Raises:
| Type | Description |
|---|---|
| ValueError | If an unsupported format is passed. |
Examples:
Export per sheet as YAML (requires pyyaml):
>>> from exstruct import export_sheets_as, extract
>>> wb = extract("input.xlsx")
>>> _ = export_sheets_as(wb, "out_yaml", fmt="yaml")
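The pretty/indent rule described in the parameter table (indent defaults to 2 when pretty=True and indent is None) can be sketched with the standard json module. resolve_indent and dump are hypothetical helpers; what the library does when indent is given while pretty is False is not stated here, so the sketch simply honors an explicit indent as an assumption.

```python
import json
from typing import Optional


def resolve_indent(pretty: bool, indent: Optional[int]) -> Optional[int]:
    """Documented rule: indent falls back to 2 only when pretty is requested."""
    if indent is not None:
        return indent  # assumption: an explicit indent is always honored
    return 2 if pretty else None


def dump(data: dict, pretty: bool = False, indent: Optional[int] = None) -> str:
    """Serialize data with the resolved indent width."""
    return json.dumps(data, ensure_ascii=False, indent=resolve_indent(pretty, indent))


print(dump({"a": 1}))               # single-line output
print(dump({"a": 1}, pretty=True))  # two-space indentation
```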
exstruct.export_print_areas_as ¶
export_print_areas_as(data: WorkbookData, dir_path: str | Path, fmt: Literal['json', 'yaml', 'yml', 'toon'] = 'json', *, pretty: bool = False, indent: int | None = None, normalize: bool = False) -> dict[str, Path]
Export each print area as a PrintAreaView.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| data | WorkbookData | WorkbookData that contains print areas | required |
| dir_path | str \| Path | output directory | required |
| fmt | Literal['json', 'yaml', 'yml', 'toon'] | json/yaml/yml/toon | 'json' |
| pretty | bool | Pretty-print JSON output. | False |
| indent | int \| None | JSON indent width (defaults to 2 when pretty is True and indent is None). | None |
| normalize | bool | rebase row/col indices to the print-area origin when True | False |
Returns:
| Type | Description |
|---|---|
| dict[str, Path] | dict mapping area key to path (e.g., "Sheet1#1": /.../Sheet1_area1_...json) |
Examples:
Export print areas when present:
>>> from exstruct import export_print_areas_as, extract
>>> wb = extract("input.xlsx", mode="standard")
>>> paths = export_print_areas_as(wb, "areas")
>>> isinstance(paths, dict)
True
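The returned mapping uses area keys such as "Sheet1#1". A possible way to group the written files by sheet is sketched below; split_area_key and group_by_sheet are hypothetical helpers, and the sketch assumes the text after the last "#" is always the numeric area index. The sample paths are invented for the demo.

```python
from collections import defaultdict
from pathlib import Path


def split_area_key(key: str) -> tuple:
    """Split "Sheet1#1" into ("Sheet1", 1); the last "#" separates the index."""
    sheet, sep, index = key.rpartition("#")
    if not sep or not index.isdigit():
        raise ValueError(f"Unexpected area key: {key!r}")
    return sheet, int(index)


def group_by_sheet(paths: dict) -> dict:
    """Group area files by sheet name, ordered by area index."""
    grouped = defaultdict(list)
    for key in sorted(paths, key=split_area_key):
        sheet, _ = split_area_key(key)
        grouped[sheet].append(paths[key])
    return dict(grouped)


# Hand-written stand-in for the mapping returned by export_print_areas_as.
sample = {
    "Sheet1#2": Path("a2.json"),
    "Sheet1#1": Path("a1.json"),
    "Data#1": Path("d1.json"),
}
print(group_by_sheet(sample))
```

Using rpartition keeps sheet names that themselves contain "#" intact.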
exstruct.export_auto_page_breaks ¶
export_auto_page_breaks(data: WorkbookData, dir_path: str | Path, fmt: Literal['json', 'yaml', 'yml', 'toon'] = 'json', *, pretty: bool = False, indent: int | None = None, normalize: bool = False) -> dict[str, Path]
Export auto page-break areas (COM-computed) as PrintAreaView files.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| data | WorkbookData | WorkbookData containing auto_print_areas (COM extraction with auto breaks enabled) | required |
| dir_path | str \| Path | output directory | required |
| fmt | Literal['json', 'yaml', 'yml', 'toon'] | json/yaml/yml/toon | 'json' |
| pretty | bool | Pretty-print JSON output. | False |
| indent | int \| None | JSON indent width (defaults to 2 when pretty is True and indent is None). | None |
| normalize | bool | rebase row/col indices to the area origin when True | False |
Returns:
| Type | Description |
|---|---|
| dict[str, Path] | dict mapping area key to path (e.g., "Sheet1#1": /.../Sheet1_auto_page1_...json) |
Raises:
| Type | Description |
|---|---|
| PrintAreaError | If no auto page-break areas are present. |
Examples:
>>> from exstruct import export_auto_page_breaks, extract
>>> wb = extract("input.xlsx", mode="standard")
>>> try:
...     export_auto_page_breaks(wb, "auto_areas")
... except ValueError:  # PrintAreaError is ValueError-compatible (see Error Handling)
...     pass
exstruct.export_pdf ¶
export_pdf(excel_path: str | Path, output_pdf: str | Path) -> list[str]
Export an Excel workbook to PDF via Excel COM and return sheet names in order.
exstruct.export_sheet_images ¶
export_sheet_images(excel_path: str | Path, output_dir: str | Path, dpi: int = 144) -> list[Path]
Export each sheet as PNG (via PDF then pypdfium2 rasterization) and return paths in sheet order.
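To get a feel for what the dpi argument means for image size, the sketch below converts a page size in PDF points to pixels. It relies on the general fact that PDF user space has 72 points per inch, so a render scale of dpi / 72 is assumed; the helper names are hypothetical, and the exact rounding used by the rasterizer may differ by a pixel.

```python
POINTS_PER_INCH = 72  # PDF user-space units per inch


def render_scale(dpi: int) -> float:
    """Scale factor relative to a 72 dpi rendering."""
    return dpi / POINTS_PER_INCH


def pixel_size(width_pt: float, height_pt: float, dpi: int = 144) -> tuple:
    """Approximate pixel dimensions of a page rendered at the given dpi."""
    scale = render_scale(dpi)
    return round(width_pt * scale), round(height_pt * scale)


print(pixel_size(612, 792))          # US Letter at 144 dpi -> (1224, 1584)
print(pixel_size(612, 792, dpi=72))  # same page at 72 dpi -> (612, 792)
```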
exstruct.process_excel ¶
process_excel(file_path: str | Path, output_path: str | Path | None = None, out_fmt: str = 'json', image: bool = False, pdf: bool = False, dpi: int = 72, mode: ExtractionMode = 'standard', pretty: bool = False, indent: int | None = None, sheets_dir: str | Path | None = None, print_areas_dir: str | Path | None = None, auto_page_breaks_dir: str | Path | None = None, stream: TextIO | None = None) -> None
Convenience wrapper: extract -> serialize (file or stdout) -> optional PDF/PNG.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| file_path | str \| Path | Input Excel workbook (path string or Path). | required |
| output_path | str \| Path \| None | None for stdout; otherwise, write to file (string or Path). | None |
| out_fmt | str | json/yaml/yml/toon. | 'json' |
| image | bool | True to also output PNGs (requires Excel + COM + pypdfium2). | False |
| pdf | bool | True to also output PDF (requires Excel + COM + pypdfium2). | False |
| dpi | int | DPI for image output. | 72 |
| mode | ExtractionMode | light/standard/verbose (same meaning as | 'standard' |
| pretty | bool | Pretty-print JSON. | False |
| indent | int \| None | JSON indent width. | None |
| sheets_dir | str \| Path \| None | Directory to write per-sheet files (string or Path). | None |
| print_areas_dir | str \| Path \| None | Directory to write per-print-area files (string or Path). | None |
| auto_page_breaks_dir | str \| Path \| None | Directory to write per-auto-page-break files (COM only). | None |
| stream | TextIO \| None | IO override when output_path is None. | None |
Raises:
| Type | Description |
|---|---|
| ValueError | If an unsupported format or mode is given. |
| PrintAreaError | When exporting auto page breaks without available data. |
| RenderError | When rendering fails (Excel/COM/pypdfium2 issues). |
Examples:
Extract and write JSON to stdout, plus per-sheet files:
>>> from pathlib import Path
>>> from exstruct import process_excel
>>> process_excel(Path("input.xlsx"), output_path=None, sheets_dir=Path("sheets"))
Write JSON to a file and also render a PDF (COM + Excel required):
>>> process_excel(Path("input.xlsx"), output_path=Path("out.json"), pdf=True)
Engine and options¶
exstruct.engine.ExStructEngine ¶
Configurable engine for ExStruct extraction and export.
Instances are immutable; override options per call if needed.
Key behaviors
- StructOptions: extraction mode and optional table detection params.
- OutputOptions: serialization format/pretty-print, include/exclude filters, per-sheet/per-print-area output dirs, etc.
- Main methods:
  - extract(path, mode=None) -> WorkbookData
    - Modes: light/standard/verbose
    - light: COM-free; cells + tables + print areas only (shapes/charts empty)
  - serialize(workbook, ...) -> str
    - Applies include_* filters, then serializes
  - export(workbook, ...)
    - Writes to file/stdout; optionally per-sheet and per-print-area files
  - process(file_path, ...)
    - One-shot extract->export (CLI equivalent), with optional PDF/PNG
from_defaults staticmethod ¶
from_defaults() -> ExStructEngine
Factory to create an engine with default options.
extract ¶
extract(file_path: str | Path, *, mode: ExtractionMode | None = None) -> WorkbookData
Extract a workbook and return normalized workbook data.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| file_path | str \| Path | Path to the .xlsx/.xlsm/.xls file to extract. | required |
| mode | ExtractionMode \| None | Extraction mode; defaults to the engine's StructOptions.mode. light: COM-free; cells, table candidates, and print areas only. standard: shapes with text/arrows plus charts; print areas included; size fields retained but hidden from default output. verbose: all shapes (with size) and charts (with size). | None |
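The combination of immutable instances and per-call overrides can be illustrated with frozen dataclasses. This is a sketch of the pattern only: MiniStructOptions and MiniEngine are hypothetical stand-ins, not the real ExStructEngine, and the ValueError on an invalid mode is borrowed from the top-level extract documentation as an assumption.

```python
from dataclasses import FrozenInstanceError, dataclass, field
from typing import Optional

VALID_MODES = ("light", "standard", "verbose")


@dataclass(frozen=True)
class MiniStructOptions:
    """Hypothetical stand-in for extraction-time options."""
    mode: str = "standard"


@dataclass(frozen=True)
class MiniEngine:
    """Hypothetical stand-in showing the per-call override pattern."""
    options: MiniStructOptions = field(default_factory=MiniStructOptions)

    def resolve_mode(self, mode: Optional[str] = None) -> str:
        """Use the per-call mode if given, else the engine's configured mode."""
        chosen = mode if mode is not None else self.options.mode
        if chosen not in VALID_MODES:
            raise ValueError(f"Invalid mode: {chosen!r}")
        return chosen


engine = MiniEngine(MiniStructOptions(mode="light"))
print(engine.resolve_mode())           # light
print(engine.resolve_mode("verbose"))  # verbose
```

The per-call value never mutates the engine, so the same instance can be shared safely between callers.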
serialize ¶
serialize(data: WorkbookData, *, fmt: Literal['json', 'yaml', 'yml', 'toon'] | None = None, pretty: bool | None = None, indent: int | None = None) -> str
Serialize a workbook after applying include/exclude filters.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| data | WorkbookData | Workbook to serialize after filtering. | required |
| fmt | Literal['json', 'yaml', 'yml', 'toon'] \| None | Serialization format; defaults to OutputOptions.fmt. | None |
| pretty | bool \| None | Whether to pretty-print JSON output. | None |
| indent | int \| None | Indentation to use when pretty-printing JSON. | None |
export ¶
export(data: WorkbookData, output_path: str | Path | None = None, *, fmt: Literal['json', 'yaml', 'yml', 'toon'] | None = None, pretty: bool | None = None, indent: int | None = None, sheets_dir: str | Path | None = None, print_areas_dir: str | Path | None = None, auto_page_breaks_dir: str | Path | None = None, stream: TextIO | None = None) -> None
Write filtered workbook data to a file or stream.
Includes optional per-sheet and per-print-area outputs when destinations are provided.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| data | WorkbookData | Workbook to serialize and write. | required |
| output_path | str \| Path \| None | Target file path (str or Path); writes to stdout when None. | None |
| fmt | Literal['json', 'yaml', 'yml', 'toon'] \| None | Serialization format; defaults to OutputOptions.fmt. | None |
| pretty | bool \| None | Whether to pretty-print JSON output. | None |
| indent | int \| None | Indentation to use when pretty-printing JSON. | None |
| sheets_dir | str \| Path \| None | Directory for per-sheet outputs when provided (str or Path). | None |
| print_areas_dir | str \| Path \| None | Directory for per-print-area outputs when provided (str or Path). | None |
| auto_page_breaks_dir | str \| Path \| None | Directory for auto page-break outputs (str or Path; COM environments only). | None |
| stream | TextIO \| None | Stream override when output_path is None. | None |
process ¶
process(file_path: str | Path, output_path: str | Path | None = None, *, out_fmt: str | None = None, image: bool = False, pdf: bool = False, dpi: int = 72, mode: ExtractionMode | None = None, pretty: bool | None = None, indent: int | None = None, sheets_dir: str | Path | None = None, print_areas_dir: str | Path | None = None, auto_page_breaks_dir: str | Path | None = None, stream: TextIO | None = None) -> None
One-shot extract->export wrapper (CLI equivalent) with optional PDF/PNG output.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| file_path | str \| Path | Input Excel workbook path (str or Path). | required |
| output_path | str \| Path \| None | Target file path (str or Path); writes to stdout when None. | None |
| out_fmt | str \| None | Serialization format for structured output. | None |
| image | bool | Whether to export PNGs alongside structured output. | False |
| pdf | bool | Whether to export a PDF snapshot alongside structured output. | False |
| dpi | int | DPI to use when rendering images. | 72 |
| mode | ExtractionMode \| None | Extraction mode; defaults to the engine's StructOptions.mode. | None |
| pretty | bool \| None | Whether to pretty-print JSON output. | None |
| indent | int \| None | Indentation to use when pretty-printing JSON. | None |
| sheets_dir | str \| Path \| None | Directory for per-sheet structured outputs (str or Path). | None |
| print_areas_dir | str \| Path \| None | Directory for per-print-area structured outputs (str or Path). | None |
| auto_page_breaks_dir | str \| Path \| None | Directory for auto page-break outputs (str or Path). | None |
| stream | TextIO \| None | Stream override when writing to stdout. | None |
exstruct.engine.StructOptions dataclass ¶
Extraction-time options for ExStructEngine.
Attributes:
| Name | Type | Description |
|---|---|---|
| mode | ExtractionMode | Extraction mode. One of "light", "standard", "verbose". light: cells + table candidates only (no COM; shapes/charts empty). standard: shapes with text + arrows + charts (if COM available). verbose: all shapes (width/height), charts, table candidates. |
| table_params | TableParams \| None | Optional dict passed to |
exstruct.engine.OutputOptions ¶
Bases: BaseModel
Output-time options for ExStructEngine.
- format: serialization format/indent.
- filters: include/exclude flags (rows/shapes/charts/tables/print_areas, size flags).
- destinations: side outputs (per-sheet, per-print-area, stream override).
Legacy flat fields (fmt, pretty, indent, include_*, sheets_dir, print_areas_dir, stream) are still accepted and normalized into the nested structures.
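The normalization of legacy flat fields into nested groups might look roughly like the following. This is a sketch with plain dicts, not the actual pydantic models: normalize_options is a hypothetical function, and the group names (format, filters, destinations) and the key sets are taken from the description above.

```python
from typing import Any, Dict

FORMAT_KEYS = {"fmt", "pretty", "indent"}
DESTINATION_KEYS = {"sheets_dir", "print_areas_dir", "stream"}


def normalize_options(**flat: Any) -> Dict[str, Dict[str, Any]]:
    """Sort legacy flat keyword arguments into the three documented groups."""
    nested: Dict[str, Dict[str, Any]] = {"format": {}, "filters": {}, "destinations": {}}
    for key, value in flat.items():
        if key in FORMAT_KEYS:
            nested["format"][key] = value
        elif key.startswith("include_"):
            nested["filters"][key] = value
        elif key in DESTINATION_KEYS:
            nested["destinations"][key] = value
        else:
            raise TypeError(f"Unknown option: {key!r}")
    return nested


options = normalize_options(
    fmt="yaml", pretty=True, include_cell_links=True, sheets_dir="out_sheets"
)
print(options)
```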
exstruct.engine.FormatOptions ¶
Bases: BaseModel
Formatting options for serialization.
exstruct.engine.FilterOptions ¶
Bases: BaseModel
Include/exclude filters for output.
exstruct.engine.DestinationOptions ¶
Bases: BaseModel
Destinations for optional side outputs.
Models¶
See generated/models.md for the detailed model fields (run python scripts/gen_model_docs.py to refresh).
Model helpers (SheetData / WorkbookData)¶
- to_json(pretty=False, indent=None) → JSON string (pretty when requested)
- to_yaml() → YAML string (requires pyyaml)
- to_toon() → TOON string (requires python-toon)
- save(path, pretty=False, indent=None) → infers format from suffix (.json/.yaml/.yml/.toon)
- WorkbookData.__getitem__(name) → get a SheetData by name
- WorkbookData.__iter__() → yields (sheet_name, SheetData) in order
Example:
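from exstruct import extract

wb = extract("input.xlsx")
first = wb["Sheet1"]  # look up a SheetData by sheet name
for name, sheet in wb:  # yields (sheet_name, SheetData) pairs in order
    print(name, len(sheet.rows))
wb.save("out.json", pretty=True)  # format inferred from the .json suffix
first.save("sheet.yaml")  # requires pyyaml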
Error Handling¶
- Exception types:
  - SerializationError: Unsupported format requested (serialize_workbook, export APIs).
  - MissingDependencyError: Optional dependency (pyyaml/python-toon/pypdfium2) is missing; message includes install instructions.
  - RenderError: Excel/COM is unavailable or PDF/PNG rendering fails.
  - PrintAreaError (ValueError-compatible): export_auto_page_breaks invoked when no auto_print_areas are available.
  - OutputError: Writing output to disk/stream failed (original exception kept in __cause__).
  - ValueError: Invalid inputs such as an unsupported mode.
- Excel COM unavailable: extraction falls back to cells + table_candidates; shapes/charts are empty, and a warning is logged.
- No print areas: export_print_areas_as writes nothing and returns {}; this is not an error.
- Auto page-break export: export_auto_page_breaks raises PrintAreaError if no auto page-break areas are present (enable them via DestinationOptions.auto_page_breaks_dir).
- CLI mirrors these behaviors: exits non-zero on failures, prints messages in English.
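Two of the behaviors listed above, a ValueError-compatible PrintAreaError and an OutputError that keeps the original exception in __cause__, can be demonstrated with stand-in classes. The classes and functions below are local illustrations defined in this snippet; they are not imported from exstruct, and the subclass relationship is an assumption based on the "ValueError-compatible" wording.

```python
import os


class PrintAreaError(ValueError):
    """Stand-in: assumed to subclass ValueError ("ValueError-compatible")."""


class OutputError(Exception):
    """Stand-in: raised when writing fails, keeping the original in __cause__."""


def export_areas(areas: list) -> list:
    """Hypothetical exporter that refuses to run without any areas."""
    if not areas:
        raise PrintAreaError("no auto page-break areas available")
    return areas


def write_output(path: str, text: str) -> None:
    """Hypothetical writer that wraps I/O failures in OutputError."""
    try:
        with open(path, "w", encoding="utf-8") as handle:
            handle.write(text)
    except OSError as exc:
        raise OutputError(f"failed to write {path}") from exc


# Code written against ValueError keeps working for PrintAreaError.
try:
    export_areas([])
except ValueError as exc:
    caught = type(exc).__name__

# The original OSError stays reachable through __cause__.
try:
    write_output(os.path.join("no-such-dir-for-demo", "out.json"), "{}")
except OutputError as exc:
    cause = exc.__cause__

print(caught, type(cause).__name__)
```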
Tuning Examples¶
- Reduce false positives (layout frames):
set_table_detection_params(table_score_threshold=0.4, coverage_min=0.25)
- Recover missed tiny tables:
set_table_detection_params(density_min=0.03, min_nonempty_cells=2)