API Reference

This page shows the primary APIs, minimal runnable examples, expected outputs, and the dependencies required for optional features. Hyperlinks are included when include_cell_links=True (or when using mode="verbose"). Auto page-break areas are COM-only and appear when auto page-break extraction/output is enabled.

Quick Examples

from exstruct import extract, export

wb = extract("sample.xlsx", mode="standard")
export(wb, "out.json")  # compact JSON by default

Expected JSON snippet (links appear when enabled):

{
  "book_name": "sample.xlsx",
  "sheets": {
    "Sheet1": {
      "rows": [{ "r": 1, "c": { "0": "Name", "1": "Age" }, "links": null }],
      "shapes": [
        {
          "text": "note",
          "l": 10,
          "t": 20,
          "w": 80,
          "h": 24,
          "type": "TextBox"
        }
      ],
      "charts": [],
      "table_candidates": ["A1:B5"]
    }
  }
}
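To read this output back, the standard json module is enough. The sketch below parses a trimmed copy of the snippet above and looks up cells and shapes. It assumes, as the snippet suggests, that "r" is the row number and that "c" maps column indices (as strings) to cell values.

```python
import json

# Trimmed copy of the snippet above, embedded so the example is self-contained.
payload = """
{
  "book_name": "sample.xlsx",
  "sheets": {
    "Sheet1": {
      "rows": [{ "r": 1, "c": { "0": "Name", "1": "Age" }, "links": null }],
      "shapes": [{ "text": "note", "l": 10, "t": 20, "w": 80, "h": 24, "type": "TextBox" }],
      "charts": [],
      "table_candidates": ["A1:B5"]
    }
  }
}
"""

book = json.loads(payload)
sheet = book["sheets"]["Sheet1"]

# JSON object keys are always strings, so sort the column keys numerically.
first_row = sheet["rows"][0]
header = [first_row["c"][key] for key in sorted(first_row["c"], key=int)]

# Collect the text of every shape on the sheet.
shape_texts = [shape["text"] for shape in sheet["shapes"]]

print(header, shape_texts, sheet["table_candidates"])
```

Sorting with key=int matters once a row has ten or more columns, because a plain string sort would place "10" before "2".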

CLI-equivalent flow via Python:

from pathlib import Path
from exstruct import process_excel

process_excel(
    file_path=Path("input.xlsx"),
    output_path=None,  # default: stdout (redirect if you want a file)
    sheets_dir=Path("out_sheets"),  # optional per-sheet outputs
    out_fmt="json",
    image=True,
    pdf=True,
    mode="standard",
    pretty=True,
)
# Same as: exstruct input.xlsx --format json --pdf --image --mode standard --pretty --sheets-dir out_sheets > out.json

Dependencies

  • Core extraction: pandas, openpyxl (installed with the package).
  • YAML export: pyyaml (lazy import; missing module raises MissingDependencyError).
  • TOON export: python-toon (lazy import; missing module raises MissingDependencyError).
  • Auto page-break extraction/export: Excel + COM required (feature is skipped when COM is unavailable).
  • Rendering (PDF/PNG): Excel + COM + pypdfium2 are mandatory. Missing Excel/COM or pypdfium2 surfaces as RenderError/MissingDependencyError.
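The lazy-import behavior listed above can be pictured with a small stand-in. This is a minimal sketch, not the library's implementation: the MissingDependencyError class defined here (including its base class) and the lazy_import helper are hypothetical, and only illustrate deferring an import until the optional feature is first used.

```python
import importlib
from types import ModuleType


class MissingDependencyError(RuntimeError):
    """Stand-in for the library's exception of the same name (base class assumed)."""


def lazy_import(module_name: str, package_name: str) -> ModuleType:
    """Import an optional module on first use, adding an install hint on failure."""
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        raise MissingDependencyError(
            f"{module_name!r} is required for this feature; "
            f"install it with: pip install {package_name}"
        ) from exc


# A module that is available imports normally.
json_mod = lazy_import("json", "json")

# A module that is absent surfaces as MissingDependencyError.
try:
    lazy_import("surely_not_installed_module_xyz", "example-package")
except MissingDependencyError as err:
    message = str(err)
```

Because the import happens inside the function, users who never request YAML or TOON output never need those packages installed.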

Auto-generated API (mkdocstrings)

For the latest Python API details, see the auto-generated sections below (kept in sync with the docstrings).

Core functions

exstruct.extract

extract(file_path: str | Path, mode: ExtractionMode = 'standard') -> WorkbookData

Extract an Excel workbook into WorkbookData.

Parameters:

Name Type Description Default
file_path str | Path

Path to .xlsx/.xlsm/.xls.

required
mode ExtractionMode

One of "light", "standard", or "verbose":
  • light: cells and table detection only (no COM; shapes and charts are empty). Print areas are read via openpyxl.
  • standard: shapes with text, arrows, and charts (via COM when available); print areas included. Shape and chart sizes are kept but hidden from output by default.
  • verbose: all shapes (including those without text) with size, and charts with size.

'standard'

Returns:

Type Description
WorkbookData

WorkbookData containing sheets, rows, shapes, charts, and print areas.

Raises:

Type Description
ValueError

If an invalid mode is provided.

Examples:

Extract with hyperlinks (verbose) and inspect table candidates:

>>> from exstruct import extract
>>> wb = extract("input.xlsx", mode="verbose")
>>> wb.sheets["Sheet1"].table_candidates
['A1:B5']

exstruct.export

export(data: WorkbookData, path: str | Path, fmt: Literal['json', 'yaml', 'yml', 'toon'] | None = None, *, pretty: bool = False, indent: int | None = None) -> None

Save WorkbookData to a file (format inferred from extension).

Parameters:

Name Type Description Default
data WorkbookData

WorkbookData from extract or similar

required
path str | Path

destination path; extension is used to infer format

required
fmt Literal['json', 'yaml', 'yml', 'toon'] | None

explicitly set format if desired (json/yaml/yml/toon)

None
pretty bool

pretty-print JSON

False
indent int | None

JSON indent width (defaults to 2 when pretty=True and indent is None)

None

Raises:

Type Description
ValueError

If the format is unsupported.
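The format and indent rules described in the parameter table can be sketched as two small helpers. Both resolve_format and resolve_indent are hypothetical names used only for illustration. The sketch also assumes that an explicit indent is used as given, which the table does not state.

```python
from pathlib import Path

SUPPORTED = {"json", "yaml", "yml", "toon"}


def resolve_format(path, fmt=None):
    """Use the explicit fmt when given, otherwise the file suffix (hypothetical helper)."""
    chosen = (fmt or Path(path).suffix.lstrip(".")).lower()
    if chosen not in SUPPORTED:
        raise ValueError(f"Unsupported format: {chosen!r}")
    return chosen


def resolve_indent(pretty=False, indent=None):
    """Indent defaults to 2 only when pretty=True and no indent was given."""
    if indent is not None:
        return indent  # assumption: an explicit indent is passed through unchanged
    return 2 if pretty else None


print(resolve_format("out.json"))              # suffix decides
print(resolve_format("out.txt", fmt="yaml"))   # explicit fmt wins over the suffix
print(resolve_indent(pretty=True))             # default indent for pretty output
```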

Examples:

Write pretty JSON and YAML (requires pyyaml):

>>> from exstruct import export, extract
>>> wb = extract("input.xlsx")
>>> export(wb, "out.json", pretty=True)
>>> export(wb, "out.yaml", fmt="yaml")

exstruct.export_sheets

export_sheets(data: WorkbookData, dir_path: str | Path) -> dict[str, Path]

Export each sheet as an individual JSON file.

  • Payload: {book_name, sheet_name, sheet: SheetData}
  • Returns: {sheet_name: Path}

Parameters:

Name Type Description Default
data WorkbookData

WorkbookData to split by sheet.

required
dir_path str | Path

Output directory.

required

Returns:

Type Description
dict[str, Path]

Mapping from sheet name to written JSON path.
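To make the per-sheet payload shape concrete, the sketch below splits a plain dict (shaped like the JSON output) into one file per sheet. It is a stand-in rather than the library's code: split_by_sheet is a hypothetical helper, and naming each file after its sheet is an assumption.

```python
import json
import tempfile
from pathlib import Path


def split_by_sheet(book: dict, dir_path) -> dict:
    """Write one {book_name, sheet_name, sheet} payload per sheet; return {sheet_name: Path}."""
    out_dir = Path(dir_path)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = {}
    for sheet_name, sheet in book["sheets"].items():
        payload = {
            "book_name": book["book_name"],
            "sheet_name": sheet_name,
            "sheet": sheet,
        }
        target = out_dir / f"{sheet_name}.json"  # file naming is an assumption
        target.write_text(json.dumps(payload, ensure_ascii=False), encoding="utf-8")
        written[sheet_name] = target
    return written


book = {
    "book_name": "sample.xlsx",
    "sheets": {"Sheet1": {"rows": []}, "Sheet2": {"rows": []}},
}

with tempfile.TemporaryDirectory() as tmp:
    paths = split_by_sheet(book, tmp)
    # Read one payload back before the temporary directory is removed.
    reloaded = json.loads(paths["Sheet2"].read_text(encoding="utf-8"))
```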

Examples:

>>> from exstruct import export_sheets, extract
>>> wb = extract("input.xlsx")
>>> paths = export_sheets(wb, "out_sheets")
>>> "Sheet1" in paths
True

exstruct.export_sheets_as

export_sheets_as(data: WorkbookData, dir_path: str | Path, fmt: Literal['json', 'yaml', 'yml', 'toon'] = 'json', *, pretty: bool = False, indent: int | None = None) -> dict[str, Path]

Export each sheet in the given format (json/yaml/toon); returns sheet name to path map.

Parameters:

Name Type Description Default
data WorkbookData

WorkbookData to split by sheet.

required
dir_path str | Path

Output directory.

required
fmt Literal['json', 'yaml', 'yml', 'toon']

Output format; defaults to json.

'json'
pretty bool

Pretty-print JSON.

False
indent int | None

JSON indent width (defaults to 2 when pretty=True and indent is None).

None

Returns:

Type Description
dict[str, Path]

Mapping from sheet name to written file path.

Raises:

Type Description
ValueError

If an unsupported format is passed.

Examples:

Export per sheet as YAML (requires pyyaml):

>>> from exstruct import export_sheets_as, extract
>>> wb = extract("input.xlsx")
>>> _ = export_sheets_as(wb, "out_yaml", fmt="yaml")

exstruct.export_print_areas_as

export_print_areas_as(data: WorkbookData, dir_path: str | Path, fmt: Literal['json', 'yaml', 'yml', 'toon'] = 'json', *, pretty: bool = False, indent: int | None = None, normalize: bool = False) -> dict[str, Path]

Export each print area as a PrintAreaView.

Parameters:

Name Type Description Default
data WorkbookData

WorkbookData that contains print areas

required
dir_path str | Path

output directory

required
fmt Literal['json', 'yaml', 'yml', 'toon']

json/yaml/yml/toon

'json'
pretty bool

Pretty-print JSON output.

False
indent int | None

JSON indent width (defaults to 2 when pretty is True and indent is None).

None
normalize bool

rebase row/col indices to the print-area origin when True

False

Returns:

Type Description
dict[str, Path]

dict mapping area key to path (e.g., "Sheet1#1": /.../Sheet1_area1_...json)
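The effect of normalize=True can be illustrated on rows shaped like the JSON output ("r" for the row, "c" keyed by column index). This is only a sketch under assumptions: rebase_rows is a hypothetical helper, and it rebases to zero-based offsets from the area's top-left cell, whereas this page does not say which base the library uses after rebasing.

```python
def rebase_rows(rows, origin_row, origin_col):
    """Shift row/column indices so the area's top-left cell becomes (0, 0)."""
    rebased = []
    for row in rows:
        # Column keys are strings in the JSON output, so convert before subtracting.
        cells = {str(int(col) - origin_col): value for col, value in row["c"].items()}
        rebased.append({"r": row["r"] - origin_row, "c": cells})
    return rebased


# Rows that sit inside a print area whose top-left cell is row 5, column 2.
rows = [
    {"r": 5, "c": {"2": "Name", "3": "Age"}},
    {"r": 6, "c": {"2": "Ann", "3": 31}},
]

normalized = rebase_rows(rows, origin_row=5, origin_col=2)
print(normalized)
```

Rebased indices make each exported area self-contained, which is convenient when areas are processed independently of the sheet they came from.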

Examples:

Export print areas when present:

>>> from exstruct import export_print_areas_as, extract
>>> wb = extract("input.xlsx", mode="standard")
>>> paths = export_print_areas_as(wb, "areas")
>>> isinstance(paths, dict)
True

exstruct.export_auto_page_breaks

export_auto_page_breaks(data: WorkbookData, dir_path: str | Path, fmt: Literal['json', 'yaml', 'yml', 'toon'] = 'json', *, pretty: bool = False, indent: int | None = None, normalize: bool = False) -> dict[str, Path]

Export auto page-break areas (COM-computed) as PrintAreaView files.

Parameters:

Name Type Description Default
data WorkbookData

WorkbookData containing auto_print_areas (COM extraction with auto breaks enabled)

required
dir_path str | Path

output directory

required
fmt Literal['json', 'yaml', 'yml', 'toon']

json/yaml/yml/toon

'json'
pretty bool

Pretty-print JSON output.

False
indent int | None

JSON indent width (defaults to 2 when pretty is True and indent is None).

None
normalize bool

rebase row/col indices to the area origin when True

False

Returns:

Type Description
dict[str, Path]

dict mapping area key to path (e.g., "Sheet1#1": /.../Sheet1_auto_page1_...json)

Raises:

Type Description
PrintAreaError

If no auto page-break areas are present.

Examples:

>>> from exstruct import export_auto_page_breaks, extract
>>> wb = extract("input.xlsx", mode="standard")
>>> try:
...     export_auto_page_breaks(wb, "auto_areas")
... except ValueError:  # PrintAreaError is ValueError-compatible
...     pass

exstruct.export_pdf

export_pdf(excel_path: str | Path, output_pdf: str | Path) -> list[str]

Export an Excel workbook to PDF via Excel COM and return sheet names in order.

exstruct.export_sheet_images

export_sheet_images(excel_path: str | Path, output_dir: str | Path, dpi: int = 144) -> list[Path]

Export each sheet as PNG (via PDF then pypdfium2 rasterization) and return paths in sheet order.
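The dpi argument determines the pixel size of each PNG. PDF page sizes are expressed in points (72 per inch), so the scale factor is dpi / 72. The helper below is a hypothetical illustration of that arithmetic, not part of the library, and a real rasterizer may round slightly differently.

```python
def pixels_at_dpi(width_pt: float, height_pt: float, dpi: int = 144) -> tuple:
    """Convert a page size in PDF points to pixel dimensions at the given DPI."""
    scale = dpi / 72  # PDF user space uses 72 points per inch
    return round(width_pt * scale), round(height_pt * scale)


# US Letter is 612 x 792 points.
letter_144 = pixels_at_dpi(612, 792)          # dpi=144 doubles each dimension
letter_72 = pixels_at_dpi(612, 792, dpi=72)   # dpi=72 maps one point to one pixel
print(letter_144, letter_72)
```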

exstruct.process_excel

process_excel(file_path: str | Path, output_path: str | Path | None = None, out_fmt: str = 'json', image: bool = False, pdf: bool = False, dpi: int = 72, mode: ExtractionMode = 'standard', pretty: bool = False, indent: int | None = None, sheets_dir: str | Path | None = None, print_areas_dir: str | Path | None = None, auto_page_breaks_dir: str | Path | None = None, stream: TextIO | None = None) -> None

Convenience wrapper: extract -> serialize (file or stdout) -> optional PDF/PNG.

Parameters:

Name Type Description Default
file_path str | Path

Input Excel workbook (path string or Path).

required
output_path str | Path | None

None for stdout; otherwise, write to file (string or Path).

None
out_fmt str

json/yaml/yml/toon.

'json'
image bool

True to also output PNGs (requires Excel + COM + pypdfium2).

False
pdf bool

True to also output PDF (requires Excel + COM + pypdfium2).

False
dpi int

DPI for image output.

72
mode ExtractionMode

light/standard/verbose (same meaning as extract).

'standard'
pretty bool

Pretty-print JSON.

False
indent int | None

JSON indent width.

None
sheets_dir str | Path | None

Directory to write per-sheet files (string or Path).

None
print_areas_dir str | Path | None

Directory to write per-print-area files (string or Path).

None
auto_page_breaks_dir str | Path | None

Directory to write per-auto-page-break files (COM only).

None
stream TextIO | None

IO override when output_path is None.

None

Raises:

Type Description
ValueError

If an unsupported format or mode is given.

PrintAreaError

When exporting auto page breaks without available data.

RenderError

When rendering fails (Excel/COM/pypdfium2 issues).

Examples:

Extract and write JSON to stdout, plus per-sheet files:

>>> from pathlib import Path
>>> from exstruct import process_excel
>>> process_excel(Path("input.xlsx"), output_path=None, sheets_dir=Path("sheets"))

Write JSON to a file and also render a PDF (COM + Excel required):

>>> process_excel(Path("input.xlsx"), output_path=Path("out.json"), pdf=True)
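How output_path and stream interact can be pictured with a stand-in writer: a path wins when given, otherwise the text goes to the stream, which falls back to stdout. The write_text function below is hypothetical and only mirrors the parameter descriptions above. Passing an io.StringIO as the stream is a convenient way to capture the serialized text in memory.

```python
import io
import sys
from pathlib import Path


def write_text(text, output_path=None, stream=None):
    """Hypothetical output step: write to a file if a path is given, else to a stream."""
    if output_path is not None:
        Path(output_path).write_text(text, encoding="utf-8")
        return
    target = stream if stream is not None else sys.stdout
    target.write(text)


# Capture output in memory instead of printing it.
buffer = io.StringIO()
write_text('{"book_name": "sample.xlsx"}', stream=buffer)
captured = buffer.getvalue()
```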

Engine and options

exstruct.engine.ExStructEngine

Configurable engine for ExStruct extraction and export.

Instances are immutable; override options per call if needed.

Key behaviors
  • StructOptions: extraction mode and optional table detection params.
  • OutputOptions: serialization format/pretty-print, include/exclude filters, per-sheet/per-print-area output dirs, etc.
  • Main methods:
      • extract(path, mode=None) -> WorkbookData: modes are light/standard/verbose. light is COM-free and returns cells, tables, and print areas only (shapes/charts empty).
      • serialize(workbook, ...) -> str: applies include_* filters, then serializes.
      • export(workbook, ...): writes to file/stdout; optionally writes per-sheet and per-print-area files.
      • process(file_path, ...): one-shot extract->export (CLI equivalent), with optional PDF/PNG.

from_defaults staticmethod

from_defaults() -> ExStructEngine

Factory to create an engine with default options.

extract

extract(file_path: str | Path, *, mode: ExtractionMode | None = None) -> WorkbookData

Extract a workbook and return normalized workbook data.

Parameters:

Name Type Description Default
file_path str | Path

Path to the .xlsx/.xlsm/.xls file to extract.

required
mode ExtractionMode | None

Extraction mode; defaults to the engine's StructOptions.mode.
  • light: COM-free; cells, table candidates, and print areas only.
  • standard: shapes with text/arrows plus charts; print areas included; size fields retained but hidden from default output.
  • verbose: all shapes (with size) and charts (with size).

None

serialize

serialize(data: WorkbookData, *, fmt: Literal['json', 'yaml', 'yml', 'toon'] | None = None, pretty: bool | None = None, indent: int | None = None) -> str

Serialize a workbook after applying include/exclude filters.

Parameters:

Name Type Description Default
data WorkbookData

Workbook to serialize after filtering.

required
fmt Literal['json', 'yaml', 'yml', 'toon'] | None

Serialization format; defaults to OutputOptions.fmt.

None
pretty bool | None

Whether to pretty-print JSON output.

None
indent int | None

Indentation to use when pretty-printing JSON.

None

export

export(data: WorkbookData, output_path: str | Path | None = None, *, fmt: Literal['json', 'yaml', 'yml', 'toon'] | None = None, pretty: bool | None = None, indent: int | None = None, sheets_dir: str | Path | None = None, print_areas_dir: str | Path | None = None, auto_page_breaks_dir: str | Path | None = None, stream: TextIO | None = None) -> None

Write filtered workbook data to a file or stream.

Includes optional per-sheet and per-print-area outputs when destinations are provided.

Parameters:

Name Type Description Default
data WorkbookData

Workbook to serialize and write.

required
output_path str | Path | None

Target file path (str or Path); writes to stdout when None.

None
fmt Literal['json', 'yaml', 'yml', 'toon'] | None

Serialization format; defaults to OutputOptions.fmt.

None
pretty bool | None

Whether to pretty-print JSON output.

None
indent int | None

Indentation to use when pretty-printing JSON.

None
sheets_dir str | Path | None

Directory for per-sheet outputs when provided (str or Path).

None
print_areas_dir str | Path | None

Directory for per-print-area outputs when provided (str or Path).

None
auto_page_breaks_dir str | Path | None

Directory for auto page-break outputs (str or Path; COM environments only).

None
stream TextIO | None

Stream override when output_path is None.

None

process

process(file_path: str | Path, output_path: str | Path | None = None, *, out_fmt: str | None = None, image: bool = False, pdf: bool = False, dpi: int = 72, mode: ExtractionMode | None = None, pretty: bool | None = None, indent: int | None = None, sheets_dir: str | Path | None = None, print_areas_dir: str | Path | None = None, auto_page_breaks_dir: str | Path | None = None, stream: TextIO | None = None) -> None

One-shot extract->export wrapper (CLI equivalent) with optional PDF/PNG output.

Parameters:

Name Type Description Default
file_path str | Path

Input Excel workbook path (str or Path).

required
output_path str | Path | None

Target file path (str or Path); writes to stdout when None.

None
out_fmt str | None

Serialization format for structured output.

None
image bool

Whether to export PNGs alongside structured output.

False
pdf bool

Whether to export a PDF snapshot alongside structured output.

False
dpi int

DPI to use when rendering images.

72
mode ExtractionMode | None

Extraction mode; defaults to the engine's StructOptions.mode.

None
pretty bool | None

Whether to pretty-print JSON output.

None
indent int | None

Indentation to use when pretty-printing JSON.

None
sheets_dir str | Path | None

Directory for per-sheet structured outputs (str or Path).

None
print_areas_dir str | Path | None

Directory for per-print-area structured outputs (str or Path).

None
auto_page_breaks_dir str | Path | None

Directory for auto page-break outputs (str or Path).

None
stream TextIO | None

Stream override when writing to stdout.

None

exstruct.engine.StructOptions dataclass

Extraction-time options for ExStructEngine.

Attributes:

Name Type Description
mode ExtractionMode

Extraction mode. One of "light", "standard", "verbose".
  • light: cells + table candidates only (no COM, shapes/charts empty)
  • standard: shapes with text + arrows + charts (if COM available)
  • verbose: all shapes (width/height), charts, table candidates

table_params TableParams | None

Optional dict passed to set_table_detection_params(**table_params) before extraction. Use this to tweak table detection heuristics per engine instance without touching global state.

exstruct.engine.OutputOptions

Bases: BaseModel

Output-time options for ExStructEngine.

  • format: serialization format/indent.
  • filters: include/exclude flags (rows/shapes/charts/tables/print_areas, size flags).
  • destinations: side outputs (per-sheet, per-print-area, stream override).

Legacy flat fields (fmt, pretty, indent, include_*, sheets_dir, print_areas_dir, stream) are still accepted and normalized into the nested structures.
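One way to picture the normalization of legacy flat fields is a constructor that routes each flat keyword into its nested group. The sketch below uses plain dataclasses as stand-ins for the Pydantic models and omits the filters group for brevity. The field names inside each group and the from_flat method are assumptions for illustration, not the library's API.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class FormatOptions:
    fmt: str = "json"
    pretty: bool = False
    indent: Optional[int] = None


@dataclass
class DestinationOptions:
    sheets_dir: Optional[str] = None
    print_areas_dir: Optional[str] = None


FORMAT_KEYS = {"fmt", "pretty", "indent"}
DESTINATION_KEYS = {"sheets_dir", "print_areas_dir"}


@dataclass
class OutputOptions:
    format: FormatOptions = field(default_factory=FormatOptions)
    destinations: DestinationOptions = field(default_factory=DestinationOptions)

    @classmethod
    def from_flat(cls, **flat):
        """Route legacy flat keyword arguments into the nested groups (hypothetical)."""
        unknown = set(flat) - FORMAT_KEYS - DESTINATION_KEYS
        if unknown:
            raise TypeError(f"Unknown option(s): {sorted(unknown)}")
        fmt_kwargs = {k: v for k, v in flat.items() if k in FORMAT_KEYS}
        dest_kwargs = {k: v for k, v in flat.items() if k in DESTINATION_KEYS}
        return cls(FormatOptions(**fmt_kwargs), DestinationOptions(**dest_kwargs))


opts = OutputOptions.from_flat(fmt="yaml", pretty=True, sheets_dir="out_sheets")
print(opts.format.fmt, opts.destinations.sheets_dir)
```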

exstruct.engine.FormatOptions

Bases: BaseModel

Formatting options for serialization.

exstruct.engine.FilterOptions

Bases: BaseModel

Include/exclude filters for output.

exstruct.engine.DestinationOptions

Bases: BaseModel

Destinations for optional side outputs.

Models

See generated/models.md for the detailed model fields (run python scripts/gen_model_docs.py to refresh).

Model helpers (SheetData / WorkbookData)

  • to_json(pretty=False, indent=None) → JSON string (pretty when requested)
  • to_yaml() → YAML string (requires pyyaml)
  • to_toon() → TOON string (requires python-toon)
  • save(path, pretty=False, indent=None) → infers format from suffix (.json/.yaml/.yml/.toon)
  • WorkbookData.__getitem__(name) → get a SheetData by name
  • WorkbookData.__iter__() → yields (sheet_name, SheetData) in order

Example:

from exstruct import extract

wb = extract("input.xlsx")
first = wb["Sheet1"]
for name, sheet in wb:
    print(name, len(sheet.rows))
wb.save("out.json", pretty=True)
first.save("sheet.yaml")  # requires pyyaml

Error Handling

  • Exception types:
      • SerializationError: Unsupported format requested (serialize_workbook, export APIs).
      • MissingDependencyError: Optional dependency (pyyaml / python-toon / pypdfium2) is missing; message includes install instructions.
      • RenderError: Excel/COM is unavailable or PDF/PNG rendering fails.
      • PrintAreaError (ValueError-compatible): export_auto_page_breaks invoked when no auto_print_areas are available.
      • OutputError: Writing output to disk/stream failed (original exception kept in __cause__).
      • ValueError: Invalid inputs such as an unsupported mode.
  • Excel COM unavailable: extraction falls back to cells + table_candidates; shapes/charts are empty, warning is logged.
  • No print areas: export_print_areas_as writes nothing and returns {}; this is not an error.
  • Auto page-break export: export_auto_page_breaks raises PrintAreaError if no auto page-break areas are present (enable them via DestinationOptions.auto_page_breaks_dir).
  • CLI mirrors these behaviors: exits non-zero on failures, prints messages in English.
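Because PrintAreaError is described as ValueError-compatible, callers that already catch ValueError also catch it. The sketch below demonstrates this with stand-ins: the PrintAreaError class and export_auto_page_breaks_stub are defined locally for illustration and are not the library's own objects.

```python
class PrintAreaError(ValueError):
    """Stand-in: the real class is described as ValueError-compatible."""


def export_auto_page_breaks_stub(auto_print_areas):
    """Hypothetical stand-in that mimics the documented failure mode."""
    if not auto_print_areas:
        raise PrintAreaError("No auto page-break areas are present.")
    return {f"Sheet1#{i}": area for i, area in enumerate(auto_print_areas, start=1)}


# A generic ValueError handler still sees the more specific exception type.
caught = None
try:
    export_auto_page_breaks_stub([])
except ValueError as exc:
    caught = type(exc).__name__

# With areas present, the stub returns an area-key mapping instead of raising.
result = export_auto_page_breaks_stub(["A1:B5"])
print(caught, result)
```

Catching the specific type first and ValueError second keeps the missing-areas case distinguishable from other invalid-input errors.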

Tuning Examples

  • Reduce false positives (layout frames):
set_table_detection_params(table_score_threshold=0.4, coverage_min=0.25)
  • Recover missed tiny tables:
set_table_detection_params(density_min=0.03, min_nonempty_cells=2)