API Reference¶

This page shows the primary APIs, minimal runnable examples, expected outputs, and the dependencies required for optional features. Hyperlinks are included when include_cell_links=True (or when using mode="verbose"). Auto page-break areas are COM-only and appear when auto page-break extraction/output is enabled.

TOC¶

API Reference
TOC
Quick Examples
Dependencies
Auto-generated API (mkdocstrings)
Models
- Model helpers SheetData / WorkbookData
Error Handling
Tuning Examples

Quick Examples¶

from exstruct import extract, export

wb = extract("sample.xlsx", mode="standard")
export(wb, "out.json")  # compact JSON by default

Expected JSON snippet (links appear when enabled):

{
  "book_name": "sample.xlsx",
  "sheets": {
    "Sheet1": {
      "rows": [{ "r": 1, "c": { "0": "Name", "1": "Age" }, "links": null }],
      "shapes": [
        {
          "text": "note",
          "l": 10,
          "t": 20,
          "w": 80,
          "h": 24,
          "type": "TextBox"
        }
      ],
      "charts": [],
      "table_candidates": ["A1:B5"]
    }
  }
}
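As a minimal sketch, the structure above can be read back with the standard library alone. It assumes only the shape shown in the snippet (rows carrying `r` plus a `c` mapping of column-index strings to values); the variable names are illustrative and the embedded JSON is a trimmed copy.

```python
import json

# A trimmed copy of the snippet above, embedded so the example is self-contained.
raw = """
{
  "book_name": "sample.xlsx",
  "sheets": {
    "Sheet1": {
      "rows": [{"r": 1, "c": {"0": "Name", "1": "Age"}, "links": null}],
      "shapes": [],
      "charts": [],
      "table_candidates": ["A1:B5"]
    }
  }
}
"""

doc = json.loads(raw)
sheet = doc["sheets"]["Sheet1"]

# Column keys are strings, so sort them numerically before reading the values.
first_row_cells = sorted(sheet["rows"][0]["c"].items(), key=lambda kv: int(kv[0]))
header = [value for _, value in first_row_cells]

print(header)
print(sheet["table_candidates"])
```

Sorting by `int(key)` matters once a row has ten or more columns, because plain string sorting would place "10" before "2".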

CLI-equivalent flow via Python:

from pathlib import Path
from exstruct import process_excel

process_excel(
    file_path=Path("input.xlsx"),
    output_path=None,  # default: stdout (redirect if you want a file)
    sheets_dir=Path("out_sheets"),  # optional per-sheet outputs
    out_fmt="json",
    image=True,
    pdf=True,
    mode="standard",
    pretty=True,
)
# Same as: exstruct input.xlsx --format json --pdf --image --mode standard --pretty --sheets-dir out_sheets > out.json

Dependencies¶

Core extraction: pandas, openpyxl (installed with the package).
YAML export: pyyaml (lazy import; missing module raises MissingDependencyError).
TOON export: python-toon (lazy import; missing module raises MissingDependencyError).
Auto page-break extraction/export: Excel + COM required (feature is skipped when COM is unavailable).
Rendering (PDF/PNG): Excel + COM + pypdfium2 are mandatory. Missing Excel/COM or pypdfium2 surfaces as RenderError/MissingDependencyError.
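
Before requesting an optional feature, it can help to check whether its module is importable. The sketch below uses only the standard library; the helper name `is_installed` is hypothetical, and the import names are assumptions (pyyaml is imported as `yaml`; `toon` is a guess for python-toon and may differ).

```python
from importlib.util import find_spec


def is_installed(module_name: str) -> bool:
    """Return True when a top-level module can be located without importing it."""
    return find_spec(module_name) is not None


# Import names are assumptions: pyyaml is imported as "yaml";
# "toon" is a guess for python-toon and may differ.
optional_modules = {
    "YAML export": "yaml",
    "TOON export": "toon",
    "PNG rendering": "pypdfium2",
}

for feature, module_name in optional_modules.items():
    status = "available" if is_installed(module_name) else "missing"
    print(f"{feature}: {module_name} is {status}")
```

This check does not cover Excel/COM availability, which depends on the host machine rather than on an installed Python package.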

Auto-generated API (mkdocstrings)¶

For the latest Python API details, see the auto-generated section below (kept in sync with the docstrings).

Core functions¶

exstruct.extract ¶

extract(file_path: str | Path, mode: ExtractionMode = 'standard') -> WorkbookData

Extract an Excel workbook into WorkbookData.

Parameters:

Name	Type	Description	Default
`file_path`	`str \| Path`	Path to .xlsx/.xlsm/.xls.	required
`mode`	`ExtractionMode`	"light" / "standard" / "verbose". - light: cells + table detection only (no COM; shapes/charts empty), with print areas read via openpyxl. - standard: shapes with text + arrows + charts (COM if available), print areas included. Shape/chart size is kept but hidden by default in output. - verbose: all shapes (including textless) with size, charts with size.	`'standard'`

Returns:

Type	Description
`WorkbookData`	WorkbookData containing sheets, rows, shapes, charts, and print areas.

Raises:

Type	Description
`ValueError`	If an invalid mode is provided.

Examples:

Extract with hyperlinks (verbose) and inspect table candidates:

>>> from exstruct import extract
>>> wb = extract("input.xlsx", mode="verbose")
>>> wb.sheets["Sheet1"].table_candidates
['A1:B5']
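
The ValueError raised for an invalid mode can be pictured with the sketch below. It is not the library's code: `ExtractionMode` is assumed here to be a `Literal` of the three documented values, and `validate_mode` is a hypothetical helper.

```python
from typing import Literal, get_args

# Assumed definition; the real ExtractionMode may be declared differently.
ExtractionMode = Literal["light", "standard", "verbose"]


def validate_mode(mode: str) -> str:
    """Return the mode unchanged, or raise ValueError for anything unsupported."""
    allowed = get_args(ExtractionMode)
    if mode not in allowed:
        raise ValueError(f"Unsupported mode: {mode!r}; expected one of {allowed}")
    return mode


print(validate_mode("verbose"))

try:
    validate_mode("fast")
except ValueError as exc:
    message = str(exc)
    print(message)
```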

exstruct.export ¶

export(data: WorkbookData, path: str | Path, fmt: Literal['json', 'yaml', 'yml', 'toon'] | None = None, *, pretty: bool = False, indent: int | None = None) -> None

Save WorkbookData to a file (format inferred from extension).

Parameters:

Name	Type	Description	Default
`data`	`WorkbookData`	WorkbookData from `extract` or similar	required
`path`	`str \| Path`	destination path; extension is used to infer format	required
`fmt`	`Literal['json', 'yaml', 'yml', 'toon'] \| None`	explicitly set format if desired (json/yaml/yml/toon)	`None`
`pretty`	`bool`	pretty-print JSON	`False`
`indent`	`int \| None`	JSON indent width (defaults to 2 when pretty=True and indent is None)	`None`

Raises:

Type	Description
`ValueError`	If the format is unsupported.

Examples:

Write pretty JSON and YAML (requires pyyaml):

>>> from exstruct import export, extract
>>> wb = extract("input.xlsx")
>>> export(wb, "out.json", pretty=True)
>>> export(wb, "out.yaml", fmt="yaml")
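
The rule "explicit `fmt` wins, otherwise use the file extension" can be sketched as follows. `infer_format` is a hypothetical helper written for illustration, not the function exstruct uses internally, and whether the library normalizes `yml` to `yaml` is not stated in this page.

```python
from __future__ import annotations

from pathlib import Path

SUPPORTED = {"json", "yaml", "yml", "toon"}


def infer_format(path: str | Path, fmt: str | None = None) -> str:
    """Pick the explicit format if given, otherwise fall back to the file suffix."""
    chosen = (fmt or Path(path).suffix.lstrip(".")).lower()
    if chosen not in SUPPORTED:
        raise ValueError(f"Unsupported format: {chosen!r}")
    return chosen


print(infer_format("out.json"))
print(infer_format("out.txt", fmt="yaml"))

try:
    infer_format("out.txt")
except ValueError as exc:
    error = str(exc)
    print(error)
```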

exstruct.export_sheets ¶

export_sheets(data: WorkbookData, dir_path: str | Path) -> dict[str, Path]

Export each sheet as an individual JSON file.

Payload: {book_name, sheet_name, sheet: SheetData}
Returns: {sheet_name: Path}

Parameters:

Name	Type	Description	Default
`data`	`WorkbookData`	WorkbookData to split by sheet.	required
`dir_path`	`str \| Path`	Output directory.	required

Returns:

Type	Description
`dict[str, Path]`	Mapping from sheet name to written JSON path.

Examples:

>>> from exstruct import export_sheets, extract
>>> wb = extract("input.xlsx")
>>> paths = export_sheets(wb, "out_sheets")
>>> "Sheet1" in paths
True
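
To make the per-sheet payload concrete, here is a standard-library sketch that writes `{book_name, sheet_name, sheet}` files from a plain dict. The function `split_by_sheet`, the dict-based workbook, and the `<sheet_name>.json` file naming are all assumptions for illustration; the real function operates on WorkbookData.

```python
import json
import tempfile
from pathlib import Path


def split_by_sheet(workbook: dict, dir_path: Path) -> dict:
    """Write one JSON file per sheet and return a sheet-name-to-path mapping."""
    dir_path.mkdir(parents=True, exist_ok=True)
    written = {}
    for sheet_name, sheet in workbook["sheets"].items():
        payload = {
            "book_name": workbook["book_name"],
            "sheet_name": sheet_name,
            "sheet": sheet,
        }
        target = dir_path / f"{sheet_name}.json"  # file naming is an assumption
        target.write_text(json.dumps(payload), encoding="utf-8")
        written[sheet_name] = target
    return written


workbook = {
    "book_name": "sample.xlsx",
    "sheets": {"Sheet1": {"rows": []}, "Sheet2": {"rows": []}},
}

with tempfile.TemporaryDirectory() as tmp:
    paths = split_by_sheet(workbook, Path(tmp) / "out_sheets")
    print(sorted(paths))
    first_payload = json.loads(paths["Sheet1"].read_text(encoding="utf-8"))
```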

exstruct.export_sheets_as ¶

export_sheets_as(data: WorkbookData, dir_path: str | Path, fmt: Literal['json', 'yaml', 'yml', 'toon'] = 'json', *, pretty: bool = False, indent: int | None = None) -> dict[str, Path]

Export each sheet in the given format (json/yaml/toon); returns sheet name to path map.

Parameters:

Name	Type	Description	Default
`data`	`WorkbookData`	WorkbookData to split by sheet.	required
`dir_path`	`str \| Path`	Output directory.	required
`fmt`	`Literal['json', 'yaml', 'yml', 'toon']`	Output format; defaults to json.	`'json'`
`pretty`	`bool`	Pretty-print JSON.	`False`
`indent`	`int \| None`	JSON indent width (defaults to 2 when pretty=True and indent is None).	`None`

Returns:

Type	Description
`dict[str, Path]`	Mapping from sheet name to written file path.

Raises:

Type	Description
`ValueError`	If an unsupported format is passed.

Examples:

Export per sheet as YAML (requires pyyaml):

>>> from exstruct import export_sheets_as, extract
>>> wb = extract("input.xlsx")
>>> _ = export_sheets_as(wb, "out_yaml", fmt="yaml")
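
The `pretty` / `indent` interaction described in the parameter tables can be sketched with `json.dumps`. The helper `resolve_indent` is hypothetical, and letting an explicit `indent` apply even when `pretty` is False is an assumption, since this page only documents the `pretty=True`, `indent=None` case.

```python
from __future__ import annotations

import json


def resolve_indent(pretty: bool, indent: int | None) -> int | None:
    """Default to an indent of 2 only when pretty output is requested without a width."""
    if indent is not None:
        return indent  # assumption: an explicit width is always honored
    return 2 if pretty else None


payload = {"book_name": "sample.xlsx"}

compact = json.dumps(payload, indent=resolve_indent(pretty=False, indent=None))
pretty_default = json.dumps(payload, indent=resolve_indent(pretty=True, indent=None))
pretty_wide = json.dumps(payload, indent=resolve_indent(pretty=True, indent=4))

print(compact)
print(pretty_default)
print(pretty_wide)
```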

exstruct.export_print_areas_as ¶

export_print_areas_as(data: WorkbookData, dir_path: str | Path, fmt: Literal['json', 'yaml', 'yml', 'toon'] = 'json', *, pretty: bool = False, indent: int | None = None, normalize: bool = False) -> dict[str, Path]

Export each print area as a PrintAreaView.

Parameters:

Name	Type	Description	Default
`data`	`WorkbookData`	WorkbookData that contains print areas	required
`dir_path`	`str \| Path`	output directory	required
`fmt`	`Literal['json', 'yaml', 'yml', 'toon']`	json/yaml/yml/toon	`'json'`
`pretty`	`bool`	Pretty-print JSON output.	`False`
`indent`	`int \| None`	JSON indent width (defaults to 2 when pretty is True and indent is None).	`None`
`normalize`	`bool`	rebase row/col indices to the print-area origin when True	`False`

Returns:

Type	Description
`dict[str, Path]`	dict mapping area key to path (e.g., "Sheet1#1": /.../Sheet1_area1_...json)

Examples:

Export print areas when present:

>>> from exstruct import export_print_areas_as, extract
>>> wb = extract("input.xlsx", mode="standard")
>>> paths = export_print_areas_as(wb, "areas")
>>> isinstance(paths, dict)
True
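
What `normalize=True` means can be illustrated with a small sketch that rebases indices to an area's top-left cell. The row shape follows the JSON snippet in Quick Examples; `rebase_rows`, its `top_row` / `left_col` parameters, and the zero-based origin are assumptions, not the library's documented behavior.

```python
def rebase_rows(rows, top_row, left_col):
    """Shift row numbers and column keys so the area's top-left cell becomes the origin."""
    rebased = []
    for row in rows:
        cells = {str(int(col) - left_col): value for col, value in row["c"].items()}
        rebased.append({"r": row["r"] - top_row, "c": cells})
    return rebased


# Hypothetical print area whose top-left cell is at row 3, column index 2.
rows = [
    {"r": 3, "c": {"2": "Name", "3": "Age"}},
    {"r": 4, "c": {"2": "Ann", "3": "31"}},
]

normalized = rebase_rows(rows, top_row=3, left_col=2)
print(normalized)
```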

exstruct.export_auto_page_breaks ¶

export_auto_page_breaks(data: WorkbookData, dir_path: str | Path, fmt: Literal['json', 'yaml', 'yml', 'toon'] = 'json', *, pretty: bool = False, indent: int | None = None, normalize: bool = False) -> dict[str, Path]

Export auto page-break areas (COM-computed) as PrintAreaView files.

Parameters:

Name	Type	Description	Default
`data`	`WorkbookData`	WorkbookData containing auto_print_areas (COM extraction with auto breaks enabled)	required
`dir_path`	`str \| Path`	output directory	required
`fmt`	`Literal['json', 'yaml', 'yml', 'toon']`	json/yaml/yml/toon	`'json'`
`pretty`	`bool`	Pretty-print JSON output.	`False`
`indent`	`int \| None`	JSON indent width (defaults to 2 when pretty is True and indent is None).	`None`
`normalize`	`bool`	rebase row/col indices to the area origin when True	`False`

Returns:

Type	Description
`dict[str, Path]`	dict mapping area key to path (e.g., "Sheet1#1": /.../Sheet1_auto_page1_...json)

Raises:

Type	Description
`PrintAreaError`	If no auto page-break areas are present.

Examples:

>>> from exstruct import export_auto_page_breaks, extract
>>> wb = extract("input.xlsx", mode="standard")
>>> try:
...     export_auto_page_breaks(wb, "auto_areas")
... except ValueError:  # PrintAreaError is ValueError-compatible
...     pass

exstruct.export_pdf ¶

export_pdf(excel_path: str | Path, output_pdf: str | Path) -> list[str]

Export an Excel workbook to PDF via Excel COM and return sheet names in order.

exstruct.export_sheet_images ¶

export_sheet_images(excel_path: str | Path, output_dir: str | Path, dpi: int = 144) -> list[Path]

Export each sheet as PNG (via PDF then pypdfium2 rasterization) and return paths in sheet order.

exstruct.process_excel ¶

process_excel(file_path: str | Path, output_path: str | Path | None = None, out_fmt: str = 'json', image: bool = False, pdf: bool = False, dpi: int = 72, mode: ExtractionMode = 'standard', pretty: bool = False, indent: int | None = None, sheets_dir: str | Path | None = None, print_areas_dir: str | Path | None = None, auto_page_breaks_dir: str | Path | None = None, stream: TextIO | None = None) -> None

Convenience wrapper: extract -> serialize (file or stdout) -> optional PDF/PNG.

Parameters:

Name	Type	Description	Default
`file_path`	`str \| Path`	Input Excel workbook (path string or Path).	required
`output_path`	`str \| Path \| None`	None for stdout; otherwise, write to file (string or Path).	`None`
`out_fmt`	`str`	json/yaml/yml/toon.	`'json'`
`image`	`bool`	True to also output PNGs (requires Excel + COM + pypdfium2).	`False`
`pdf`	`bool`	True to also output PDF (requires Excel + COM + pypdfium2).	`False`
`dpi`	`int`	DPI for image output.	`72`
`mode`	`ExtractionMode`	light/standard/verbose (same meaning as `extract`).	`'standard'`
`pretty`	`bool`	Pretty-print JSON.	`False`
`indent`	`int \| None`	JSON indent width.	`None`
`sheets_dir`	`str \| Path \| None`	Directory to write per-sheet files (string or Path).	`None`
`print_areas_dir`	`str \| Path \| None`	Directory to write per-print-area files (string or Path).	`None`
`auto_page_breaks_dir`	`str \| Path \| None`	Directory to write per-auto-page-break files (COM only).	`None`
`stream`	`TextIO \| None`	IO override when output_path is None.	`None`

Raises:

Type	Description
`ValueError`	If an unsupported format or mode is given.
`PrintAreaError`	When exporting auto page breaks without available data.
`RenderError`	When rendering fails (Excel/COM/pypdfium2 issues).

Examples:

Extract and write JSON to stdout, plus per-sheet files:

>>> from pathlib import Path
>>> from exstruct import process_excel
>>> process_excel(Path("input.xlsx"), output_path=None, sheets_dir=Path("sheets"))

Write JSON to a file and also render a PDF (COM + Excel required):

>>> process_excel(Path("input.xlsx"), output_path=Path("out.json"), pdf=True)
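
The relationship between `output_path` and `stream` can be sketched as below: a path takes the output to a file, otherwise the stream (stdout by default) receives it. `write_output` is a hypothetical stand-in written for illustration; it does not call exstruct.

```python
from __future__ import annotations

import io
import sys
from pathlib import Path
from typing import TextIO


def write_output(
    text: str,
    output_path: str | Path | None = None,
    stream: TextIO | None = None,
) -> None:
    """Write to a file when a path is given; otherwise to the stream (stdout by default)."""
    if output_path is not None:
        Path(output_path).write_text(text, encoding="utf-8")
        return
    target = stream if stream is not None else sys.stdout
    target.write(text)


# Capture what would otherwise go to stdout, e.g. in a test.
buffer = io.StringIO()
write_output('{"book_name": "sample.xlsx"}', stream=buffer)
captured = buffer.getvalue()
```

Passing an in-memory stream like this is a common way to test serialized output without touching the filesystem.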

Engine and options¶

exstruct.engine.ExStructEngine ¶

Configurable engine for ExStruct extraction and export.

Instances are immutable; override options per call if needed.

Key behaviors

StructOptions: extraction mode and optional table detection params.
OutputOptions: serialization format/pretty-print, include/exclude filters, per-sheet/per-print-area output dirs, etc.
Main methods:
extract(path, mode=None) -> WorkbookData
- Modes: light/standard/verbose
- light: COM-free; cells + tables + print areas only (shapes/charts empty)
serialize(workbook, ...) -> str
- Applies include_* filters, then serializes
export(workbook, ...)
- Writes to file/stdout; optionally per-sheet and per-print-area files
process(file_path, ...)
- One-shot extract->export (CLI equivalent), with optional PDF/PNG

from_defaults `staticmethod` ¶

from_defaults() -> ExStructEngine

Factory to create an engine with default options.

extract ¶

extract(file_path: str | Path, *, mode: ExtractionMode | None = None) -> WorkbookData

Extract a workbook and return normalized workbook data.

Parameters:

Name	Type	Description	Default
`file_path`	`str \| Path`	Path to the .xlsx/.xlsm/.xls file to extract.	required
`mode`	`ExtractionMode \| None`	Extraction mode; defaults to the engine's StructOptions.mode. - light: COM-free; cells, table candidates, and print areas only. - standard: Shapes with text/arrows plus charts; print areas included; size fields retained but hidden from default output. - verbose: All shapes (with size) and charts (with size).	`None`

serialize ¶

serialize(data: WorkbookData, *, fmt: Literal['json', 'yaml', 'yml', 'toon'] | None = None, pretty: bool | None = None, indent: int | None = None) -> str

Serialize a workbook after applying include/exclude filters.

Parameters:

Name	Type	Description	Default
`data`	`WorkbookData`	Workbook to serialize after filtering.	required
`fmt`	`Literal['json', 'yaml', 'yml', 'toon'] \| None`	Serialization format; defaults to OutputOptions.fmt.	`None`
`pretty`	`bool \| None`	Whether to pretty-print JSON output.	`None`
`indent`	`int \| None`	Indentation to use when pretty-printing JSON.	`None`

export ¶

export(data: WorkbookData, output_path: str | Path | None = None, *, fmt: Literal['json', 'yaml', 'yml', 'toon'] | None = None, pretty: bool | None = None, indent: int | None = None, sheets_dir: str | Path | None = None, print_areas_dir: str | Path | None = None, auto_page_breaks_dir: str | Path | None = None, stream: TextIO | None = None) -> None

Write filtered workbook data to a file or stream.

Includes optional per-sheet and per-print-area outputs when destinations are provided.

Parameters:

Name	Type	Description	Default
`data`	`WorkbookData`	Workbook to serialize and write.	required
`output_path`	`str \| Path \| None`	Target file path (str or Path); writes to stdout when None.	`None`
`fmt`	`Literal['json', 'yaml', 'yml', 'toon'] \| None`	Serialization format; defaults to OutputOptions.fmt.	`None`
`pretty`	`bool \| None`	Whether to pretty-print JSON output.	`None`
`indent`	`int \| None`	Indentation to use when pretty-printing JSON.	`None`
`sheets_dir`	`str \| Path \| None`	Directory for per-sheet outputs when provided (str or Path).	`None`
`print_areas_dir`	`str \| Path \| None`	Directory for per-print-area outputs when provided (str or Path).	`None`
`auto_page_breaks_dir`	`str \| Path \| None`	Directory for auto page-break outputs (str or Path; COM environments only).	`None`
`stream`	`TextIO \| None`	Stream override when output_path is None.	`None`

process ¶

process(file_path: str | Path, output_path: str | Path | None = None, *, out_fmt: str | None = None, image: bool = False, pdf: bool = False, dpi: int = 72, mode: ExtractionMode | None = None, pretty: bool | None = None, indent: int | None = None, sheets_dir: str | Path | None = None, print_areas_dir: str | Path | None = None, auto_page_breaks_dir: str | Path | None = None, stream: TextIO | None = None) -> None

One-shot extract->export wrapper (CLI equivalent) with optional PDF/PNG output.

Parameters:

Name	Type	Description	Default
`file_path`	`str \| Path`	Input Excel workbook path (str or Path).	required
`output_path`	`str \| Path \| None`	Target file path (str or Path); writes to stdout when None.	`None`
`out_fmt`	`str \| None`	Serialization format for structured output.	`None`
`image`	`bool`	Whether to export PNGs alongside structured output.	`False`
`pdf`	`bool`	Whether to export a PDF snapshot alongside structured output.	`False`
`dpi`	`int`	DPI to use when rendering images.	`72`
`mode`	`ExtractionMode \| None`	Extraction mode; defaults to the engine's StructOptions.mode.	`None`
`pretty`	`bool \| None`	Whether to pretty-print JSON output.	`None`
`indent`	`int \| None`	Indentation to use when pretty-printing JSON.	`None`
`sheets_dir`	`str \| Path \| None`	Directory for per-sheet structured outputs (str or Path).	`None`
`print_areas_dir`	`str \| Path \| None`	Directory for per-print-area structured outputs (str or Path).	`None`
`auto_page_breaks_dir`	`str \| Path \| None`	Directory for auto page-break outputs (str or Path).	`None`
`stream`	`TextIO \| None`	Stream override when writing to stdout.	`None`

exstruct.engine.StructOptions `dataclass` ¶

Extraction-time options for ExStructEngine.

Attributes:

Name	Type	Description
`mode`	`ExtractionMode`	Extraction mode. One of "light", "standard", "verbose". - light: cells + table candidates only (no COM, shapes/charts empty) - standard: shapes with text + arrows + charts (if COM available) - verbose: all shapes (width/height), charts, table candidates
`table_params`	`TableParams \| None`	Optional dict passed to `set_table_detection_params(**table_params)` before extraction. Use this to tweak table detection heuristics per engine instance without touching global state.

exstruct.engine.OutputOptions ¶

Bases: BaseModel

Output-time options for ExStructEngine.

format: serialization format/indent.
filters: include/exclude flags (rows/shapes/charts/tables/print_areas, size flags).
destinations: side outputs (per-sheet, per-print-area, stream override).

Legacy flat fields (fmt, pretty, indent, include_*, sheets_dir, print_areas_dir, stream) are still accepted and normalized into the nested structures.
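
The idea of folding legacy flat fields into the nested groups can be sketched with plain dicts. This is only an illustration of the grouping named above: `normalize_options` is hypothetical, the real class is a pydantic BaseModel, and its actual field handling is not shown on this page.

```python
FORMAT_KEYS = {"fmt", "pretty", "indent"}
DESTINATION_KEYS = {"sheets_dir", "print_areas_dir", "stream"}


def normalize_options(**flat):
    """Group flat keyword options into format / filters / destinations."""
    nested = {"format": {}, "filters": {}, "destinations": {}}
    for key, value in flat.items():
        if key in FORMAT_KEYS:
            nested["format"][key] = value
        elif key.startswith("include_"):
            nested["filters"][key] = value
        elif key in DESTINATION_KEYS:
            nested["destinations"][key] = value
        else:
            raise TypeError(f"unknown option: {key}")
    return nested


options = normalize_options(
    fmt="json",
    pretty=True,
    include_cell_links=True,
    sheets_dir="out_sheets",
)
print(options)
```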

exstruct.engine.FormatOptions ¶

Bases: BaseModel

Formatting options for serialization.

exstruct.engine.FilterOptions ¶

Bases: BaseModel

Include/exclude filters for output.

exstruct.engine.DestinationOptions ¶

Bases: BaseModel

Destinations for optional side outputs.

Models¶

See generated/models.md for the detailed model fields (run python scripts/gen_model_docs.py to refresh).

Model helpers (SheetData / WorkbookData)¶

to_json(pretty=False, indent=None) → JSON string (pretty when requested)
to_yaml() → YAML string (requires pyyaml)
to_toon() → TOON string (requires python-toon)
save(path, pretty=False, indent=None) → infers format from suffix (.json/.yaml/.yml/.toon)
WorkbookData.__getitem__(name) → get a SheetData by name
WorkbookData.__iter__() → yields (sheet_name, SheetData) in order

Example:

from exstruct import extract

wb = extract("input.xlsx")
first = wb["Sheet1"]
for name, sheet in wb:
    print(name, len(sheet.rows))
wb.save("out.json", pretty=True)
first.save("sheet.yaml")  # requires pyyaml
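
The lookup and iteration helpers behave like the standard `__getitem__` / `__iter__` protocol. The sketch below uses stand-in dataclasses (`Sheet`, `Workbook`) rather than the real SheetData / WorkbookData models, so it runs without exstruct installed.

```python
from dataclasses import dataclass, field


@dataclass
class Sheet:  # stand-in for SheetData
    rows: list = field(default_factory=list)


@dataclass
class Workbook:  # stand-in for WorkbookData
    sheets: dict = field(default_factory=dict)

    def __getitem__(self, name):
        # Look a sheet up by name; raises KeyError when it is absent.
        return self.sheets[name]

    def __iter__(self):
        # Yields (sheet_name, sheet) pairs in insertion order.
        return iter(self.sheets.items())


wb = Workbook({"Sheet1": Sheet(rows=[1, 2]), "Sheet2": Sheet()})
first = wb["Sheet1"]
summary = [(name, len(sheet.rows)) for name, sheet in wb]
print(summary)
```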

Error Handling¶

Exception types:
SerializationError: Unsupported format requested (serialize_workbook, export APIs).
MissingDependencyError: Optional dependency (pyyaml / python-toon / pypdfium2) is missing; message includes install instructions.
RenderError: Excel/COM is unavailable or PDF/PNG rendering fails.
PrintAreaError (ValueError-compatible): export_auto_page_breaks invoked when no auto_print_areas are available.
OutputError: Writing output to disk/stream failed (original exception kept in __cause__).
ValueError: Invalid inputs such as an unsupported mode.
Related behaviors:
Excel COM unavailable: extraction falls back to cells + table_candidates; shapes/charts are empty, warning is logged.
No print areas: export_print_areas_as writes nothing and returns {}; this is not an error.
Auto page-break export: export_auto_page_breaks raises PrintAreaError if no auto page-break areas are present (enable them via DestinationOptions.auto_page_breaks_dir).
CLI mirrors these behaviors: exits non-zero on failures, prints messages in English.
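
Because PrintAreaError is described as ValueError-compatible, code that already catches ValueError should also handle it. The sketch below models that with a stand-in subclass; the class definition and `export_auto_areas` are hypothetical and only mimic the documented behavior.

```python
class PrintAreaError(ValueError):
    """Stand-in that models the ValueError compatibility described above."""


def export_auto_areas(auto_print_areas):
    """Mimic the documented rule: raise when no auto page-break areas exist."""
    if not auto_print_areas:
        raise PrintAreaError("no auto page-break areas are present")
    return {f"Sheet1#{i}": area for i, area in enumerate(auto_print_areas, start=1)}


try:
    export_auto_areas([])
except ValueError as exc:  # also catches the PrintAreaError stand-in
    caught = type(exc).__name__
    print(f"skipped auto page-break export: {exc}")

exported = export_auto_areas(["A1:F40"])
print(exported)
```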

Tuning Examples¶

Reduce false positives (layout frames):

set_table_detection_params(table_score_threshold=0.4, coverage_min=0.25)

Recover missed tiny tables:

set_table_detection_params(density_min=0.03, min_nonempty_cells=2)
