API Reference

This page shows the primary APIs, minimal runnable examples, expected outputs, and the dependencies required for optional features. Hyperlinks are included when include_cell_links=True (or when using mode="verbose"). Auto page-break areas are COM-only and appear when auto page-break extraction/output is enabled (CLI exposes the option only when COM is available).

Quick Examples

from exstruct import extract, export

wb = extract("sample.xlsx", mode="standard")
export(wb, "out.json")  # compact JSON by default

Expected JSON snippet (links appear when enabled):

{
  "book_name": "sample.xlsx",
  "sheets": {
    "Sheet1": {
      "rows": [{ "r": 1, "c": { "0": "Name", "1": "Age" }, "links": null }],
      "shapes": [
        {
          "text": "note",
          "l": 10,
          "t": 20,
          "w": 80,
          "h": 24,
          "type": "TextBox"
        }
      ],
      "charts": [],
      "table_candidates": ["A1:B5"]
    }
  }
}
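Output of this shape can be consumed with the standard library alone. The following is a minimal sketch, assuming the payload looks exactly like the snippet above; the `snippet` string and the `header_cells` helper are illustrative and not part of ExStruct.

```python
import json

# Trimmed copy of the snippet above; a real file would be read from disk instead.
snippet = """
{
  "book_name": "sample.xlsx",
  "sheets": {
    "Sheet1": {
      "rows": [{ "r": 1, "c": { "0": "Name", "1": "Age" }, "links": null }],
      "shapes": [{ "text": "note", "l": 10, "t": 20, "w": 80, "h": 24, "type": "TextBox" }],
      "charts": [],
      "table_candidates": ["A1:B5"]
    }
  }
}
"""


def header_cells(payload: dict, sheet_name: str) -> list:
    """Return the first row's cell values ordered by column index (hypothetical helper)."""
    first_row = payload["sheets"][sheet_name]["rows"][0]
    # Column keys are 0-based numeric strings, so sort them numerically.
    return [first_row["c"][key] for key in sorted(first_row["c"], key=int)]


payload = json.loads(snippet)
print(header_cells(payload, "Sheet1"))  # ['Name', 'Age']
```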

CLI-equivalent flow via Python:

from pathlib import Path
from exstruct import process_excel

process_excel(
    file_path=Path("input.xlsx"),
    output_path=None,  # default: stdout (redirect if you want a file)
    sheets_dir=Path("out_sheets"),  # optional per-sheet outputs
    out_fmt="json",
    include_backend_metadata=True,
    image=True,
    pdf=True,
    mode="standard",
    pretty=True,
)
# Same as: exstruct input.xlsx --format json --include-backend-metadata --pdf --image --mode standard --pretty --sheets-dir out_sheets > out.json
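
The mapping also works in the other direction. The sketch below assembles the command line from the comment above out of keyword arguments; `build_cli_args` is a hypothetical helper, and it uses only the flags that appear in that comment.

```python
from __future__ import annotations

import shlex


def build_cli_args(
    input_path: str,
    *,
    out_fmt: str = "json",
    mode: str = "standard",
    sheets_dir: str | None = None,
    include_backend_metadata: bool = False,
    pdf: bool = False,
    image: bool = False,
    pretty: bool = False,
) -> list[str]:
    """Build an argv list from the flags shown in the comment above (hypothetical helper)."""
    args = ["exstruct", input_path, "--format", out_fmt]
    # Boolean options map to bare switches; dicts keep insertion order.
    switches = {
        "--include-backend-metadata": include_backend_metadata,
        "--pdf": pdf,
        "--image": image,
    }
    args += [flag for flag, enabled in switches.items() if enabled]
    args += ["--mode", mode]
    if pretty:
        args.append("--pretty")
    if sheets_dir is not None:
        args += ["--sheets-dir", sheets_dir]
    return args


argv = build_cli_args(
    "input.xlsx",
    sheets_dir="out_sheets",
    include_backend_metadata=True,
    pdf=True,
    image=True,
    pretty=True,
)
print(shlex.join(argv))
```

Passing the resulting list to `subprocess.run` (rather than a shell string) avoids quoting problems with paths that contain spaces.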

Editing API

ExStruct also exposes workbook editing under exstruct.edit, but this is a secondary surface. If you are writing Python code to edit Excel directly, openpyxl / xlwings are usually simpler choices. Reach for exstruct.edit when you specifically want the same patch contract used by ExStruct's CLI and MCP integration layer.

from pathlib import Path

from exstruct.edit import PatchOp, PatchRequest, patch_workbook

result = patch_workbook(
    PatchRequest(
        xlsx_path=Path("book.xlsx"),
        ops=[PatchOp(op="set_value", sheet="Sheet1", cell="A1", value="updated")],
        backend="openpyxl",
    )
)

print(result.out_path)
print(result.engine)

Key points:

  • exstruct.edit does not require MCP PathPolicy.
  • PatchOp, PatchRequest, MakeRequest, and PatchResult keep the existing MCP patch contract in Phase 1.
  • Use list_patch_op_schemas() / get_patch_op_schema() to inspect the public operation schema programmatically.
  • The matching CLI commands are exstruct patch, exstruct make, exstruct ops, and exstruct validate.

Backend capability guide:

Backend | Use it for | Notes
openpyxl | Basic cell/style/layout edits, plus dry_run, return_inverse_ops, and preflight_formula_check flows | Pure Python path. Not valid for .xls, and not for COM-only ops such as create_chart.
com | Highest-fidelity workbook editing, .xls, and COM-only ops such as create_chart | Requires Excel COM. Rejects dry_run, return_inverse_ops, and preflight_formula_check.
auto | Default mixed workflow | Resolves to the best supported backend for the request. dry_run, return_inverse_ops, and preflight_formula_check force the openpyxl path even on COM-capable hosts, so inspect PatchResult.engine before assuming the same engine will run the real apply.
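
The rules in the guide above can be read as a small decision table. The sketch below encodes them in plain Python for illustration only: `resolve_engine` and its parameters are hypothetical, it is not ExStruct's actual resolver, and the two branches marked as assumptions are not specified by this page.

```python
def resolve_engine(
    backend: str,
    *,
    com_available: bool,
    dry_run: bool = False,
    return_inverse_ops: bool = False,
    preflight_formula_check: bool = False,
    is_xls: bool = False,
    needs_com_only_op: bool = False,
) -> str:
    """Illustrative decision table for the backend guide; not the library's code."""
    openpyxl_only = dry_run or return_inverse_ops or preflight_formula_check
    com_only = is_xls or needs_com_only_op

    if backend == "openpyxl":
        if com_only:
            raise ValueError("openpyxl cannot handle .xls or COM-only ops such as create_chart")
        return "openpyxl"
    if backend == "com":
        if not com_available:
            raise ValueError("backend='com' requires Excel COM")
        if openpyxl_only:
            raise ValueError("com rejects dry_run, return_inverse_ops, and preflight_formula_check")
        return "com"
    if backend == "auto":
        if openpyxl_only:
            if com_only:
                # Assumption: the page does not say how this conflict is resolved.
                raise ValueError("request mixes openpyxl-only flags with COM-only needs")
            return "openpyxl"
        if com_only:
            if not com_available:
                raise ValueError("request needs Excel COM, which is unavailable")
            return "com"
        # Assumption: "best supported backend" means COM whenever it is available.
        return "com" if com_available else "openpyxl"
    raise ValueError(f"unknown backend: {backend!r}")


# On a COM-capable host, the dry run and the real apply can use different engines.
print(resolve_engine("auto", com_available=True, dry_run=True))  # openpyxl
print(resolve_engine("auto", com_available=True))                # com
```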

Known editing limits:

  • create_chart requires the COM-backed path.
  • .xls editing requires COM.
  • exstruct.edit does not own PathPolicy, artifact mirroring, or host approval flows.
  • Existing MCP compatibility imports remain valid.

For local shell or AI-agent edit workflows, prefer the CLI so you can do dry_run -> inspect PatchResult -> apply with an explicit backend. Use backend="openpyxl" when you want the dry run and the real apply to exercise the same engine. With backend="auto", dry runs resolve to openpyxl while the real apply may switch to COM on Windows/Excel hosts. For restricted hosts, use the MCP server, which wraps the same core and adds host policy.

Dependencies

  • Core extraction: pandas, openpyxl (installed with the package).
  • YAML export: pyyaml (lazy import; missing module raises MissingDependencyError).
  • TOON export: python-toon (lazy import; missing module raises MissingDependencyError).
  • Auto page-break extraction/export: Excel + COM required. mode="libreoffice" rejects auto page-break requests with ConfigError.
  • Rendering (PDF/PNG): Excel + COM + pypdfium2 are mandatory. Missing Excel/COM or pypdfium2 surfaces as RenderError/MissingDependencyError, and mode="libreoffice" rejects PDF/PNG requests with ConfigError.

Auto-generated API (mkdocstrings)

See the auto-generated sections below for the latest Python API details (kept in sync with the docstrings).

Core functions

exstruct.extract

extract(file_path: str | Path, mode: ExtractionMode = 'standard', *, alpha_col: bool = False) -> WorkbookData

Extracts an Excel workbook into a WorkbookData structure.

Parameters:

Name Type Description Default
file_path str | Path

Path to the workbook file (.xlsx, .xlsm, .xls).

required
mode ExtractionMode

Extraction detail level. "light" includes cells and table detection only (no COM, shapes/charts empty; print areas via openpyxl). "libreoffice" is a best-effort non-COM mode that adds merged cells, shapes, connectors, and charts when the LibreOffice backend is available. "standard" includes texted shapes, arrows, charts (COM if available) and print areas. "verbose" also includes shape/chart sizes, cell link map, colors map, and formulas map.

'standard'
alpha_col bool

When True, convert CellRow column keys to Excel-style ABC names (A, B, ..., Z, AA, ...) instead of 0-based numeric strings.

False
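
To make the alpha_col key conversion concrete, here is a minimal sketch of turning 0-based column indices into Excel-style names. `column_letters` is a hypothetical helper written for this page, not the function ExStruct uses internally.

```python
def column_letters(index: int) -> str:
    """Convert a 0-based column index to an Excel-style name (0 -> A, 25 -> Z, 26 -> AA)."""
    if index < 0:
        raise ValueError("column index must be non-negative")
    letters = ""
    n = index + 1  # switch to 1-based for bijective base-26 arithmetic
    while n > 0:
        n, rem = divmod(n - 1, 26)
        letters = chr(ord("A") + rem) + letters
    return letters


# Keys as they appear with alpha_col=False, remapped to letter names.
row_cells = {"0": "Name", "1": "Age", "26": "Notes"}
print({column_letters(int(key)): value for key, value in row_cells.items()})
# {'A': 'Name', 'B': 'Age', 'AA': 'Notes'}
```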

Returns:

Name Type Description
WorkbookData WorkbookData

Parsed workbook representation containing sheets, rows, shapes, charts, and print areas.

exstruct.export

export(data: WorkbookData, path: str | Path, fmt: Literal['json', 'yaml', 'yml', 'toon'] | None = None, *, pretty: bool = False, indent: int | None = None, include_backend_metadata: bool = False) -> None

Save WorkbookData to a file (format inferred from extension).

Parameters:

Name Type Description Default
data WorkbookData

WorkbookData from extract or similar

required
path str | Path

destination path; extension is used to infer format

required
fmt Literal['json', 'yaml', 'yml', 'toon'] | None

explicitly set format if desired (json/yaml/yml/toon)

None
pretty bool

pretty-print JSON

False
indent int | None

JSON indent width (defaults to 2 when pretty=True and indent is None)

None

Raises:

Type Description
ValueError

If the format is unsupported.
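
The interaction between path and fmt can be pictured as follows. This is a sketch under stated assumptions: `infer_format` is a hypothetical helper, and lowercasing the suffix and rejecting a missing suffix are guesses rather than documented behavior.

```python
from __future__ import annotations

from pathlib import Path

SUPPORTED_FORMATS = {"json", "yaml", "yml", "toon"}


def infer_format(path: str | Path, fmt: str | None = None) -> str:
    """Pick the explicit fmt if given, else fall back to the file extension."""
    # Assumption: suffix matching is case-insensitive.
    chosen = fmt if fmt is not None else Path(path).suffix.lstrip(".").lower()
    if chosen not in SUPPORTED_FORMATS:
        raise ValueError(f"unsupported format: {chosen!r}")
    return chosen


print(infer_format("out.json"))             # json
print(infer_format("out.txt", fmt="yaml"))  # yaml
```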

Examples:

Write pretty JSON and YAML (requires pyyaml):

>>> from exstruct import export, extract
>>> wb = extract("input.xlsx")
>>> export(wb, "out.json", pretty=True)
>>> export(wb, "out.yaml", fmt="yaml")

exstruct.export_sheets

export_sheets(data: WorkbookData, dir_path: str | Path, *, include_backend_metadata: bool = False) -> dict[str, Path]

Export each sheet as an individual JSON file.

  • Payload: {book_name, sheet_name, sheet: SheetData}
  • Returns: {sheet_name: Path}

Parameters:

Name Type Description Default
data WorkbookData

WorkbookData to split by sheet.

required
dir_path str | Path

Output directory.

required

Returns:

Type Description
dict[str, Path]

Mapping from sheet name to written JSON path.

Examples:

>>> from exstruct import export_sheets, extract
>>> wb = extract("input.xlsx")
>>> paths = export_sheets(wb, "out_sheets")
>>> "Sheet1" in paths
True

exstruct.export_sheets_as

export_sheets_as(data: WorkbookData, dir_path: str | Path, fmt: Literal['json', 'yaml', 'yml', 'toon'] = 'json', *, pretty: bool = False, indent: int | None = None, include_backend_metadata: bool = False) -> dict[str, Path]

Export each sheet in the given format (json/yaml/toon); returns sheet name to path map.

Parameters:

Name Type Description Default
data WorkbookData

WorkbookData to split by sheet.

required
dir_path str | Path

Output directory.

required
fmt Literal['json', 'yaml', 'yml', 'toon']

Output format; defaults to json.

'json'
pretty bool

Pretty-print JSON.

False
indent int | None

JSON indent width (defaults to 2 when pretty=True and indent is None).

None
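
The pretty/indent defaulting rule can be sketched with the standard json module. `dump_json` is a hypothetical helper; honoring an explicit indent even when pretty is False is an assumption, as is the exact compact layout.

```python
from __future__ import annotations

import json


def dump_json(payload: object, *, pretty: bool = False, indent: int | None = None) -> str:
    """Serialize payload, using indent=2 when pretty=True and no indent is given."""
    # Assumption: an explicit indent always wins, even if pretty is False.
    effective_indent = indent if indent is not None else (2 if pretty else None)
    return json.dumps(payload, indent=effective_indent, ensure_ascii=False)


print(dump_json({"r": 1}))               # single line
print(dump_json({"r": 1}, pretty=True))  # indented by 2 spaces
```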

Returns:

Type Description
dict[str, Path]

Mapping from sheet name to written file path.

Raises:

Type Description
ValueError

If an unsupported format is passed.

Examples:

Export per sheet as YAML (requires pyyaml):

>>> from exstruct import export_sheets_as, extract
>>> wb = extract("input.xlsx")
>>> _ = export_sheets_as(wb, "out_yaml", fmt="yaml")

exstruct.export_print_areas_as

export_print_areas_as(data: WorkbookData, dir_path: str | Path, fmt: Literal['json', 'yaml', 'yml', 'toon'] = 'json', *, pretty: bool = False, indent: int | None = None, normalize: bool = False, include_backend_metadata: bool = False) -> dict[str, Path]

Export each print area as a PrintAreaView.

Parameters:

Name Type Description Default
data WorkbookData

WorkbookData that contains print areas

required
dir_path str | Path

output directory

required
fmt Literal['json', 'yaml', 'yml', 'toon']

json/yaml/yml/toon

'json'
pretty bool

Pretty-print JSON output.

False
indent int | None

JSON indent width (defaults to 2 when pretty is True and indent is None).

None
normalize bool

rebase row/col indices to the print-area origin when True

False
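
As an illustration of what rebasing to the print-area origin means, the sketch below shifts row and column indices so the top-left cell of an area becomes the origin. The row shape mirrors the JSON snippet earlier on this page; `rebase_rows`, its parameters, and the choice of (0, 0) as the rebased origin are assumptions, so the library's actual convention may differ.

```python
def rebase_rows(rows: list, origin_row: int, origin_col: int) -> list:
    """Shift row numbers and column keys so the area's top-left cell is (0, 0)."""
    rebased = []
    for row in rows:
        # Column keys are numeric strings, so convert, shift, and convert back.
        cells = {str(int(col) - origin_col): value for col, value in row["c"].items()}
        rebased.append({"r": row["r"] - origin_row, "c": cells})
    return rebased


rows = [
    {"r": 5, "c": {"2": "Name", "3": "Age"}},
    {"r": 6, "c": {"2": "Ann", "3": 31}},
]
print(rebase_rows(rows, origin_row=5, origin_col=2))
# [{'r': 0, 'c': {'0': 'Name', '1': 'Age'}}, {'r': 1, 'c': {'0': 'Ann', '1': 31}}]
```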

Returns:

Type Description
dict[str, Path]

dict mapping area key to path (e.g., "Sheet1#1": /.../Sheet1_area1_...json)

Examples:

Export print areas when present:

>>> from exstruct import export_print_areas_as, extract
>>> wb = extract("input.xlsx", mode="standard")
>>> paths = export_print_areas_as(wb, "areas")
>>> isinstance(paths, dict)
True

exstruct.export_auto_page_breaks

export_auto_page_breaks(data: WorkbookData, dir_path: str | Path, fmt: Literal['json', 'yaml', 'yml', 'toon'] = 'json', *, pretty: bool = False, indent: int | None = None, normalize: bool = False, include_backend_metadata: bool = False) -> dict[str, Path]

Export auto page-break areas (COM-computed) as PrintAreaView files.

Parameters:

Name Type Description Default
data WorkbookData

WorkbookData containing auto_print_areas (COM extraction with auto breaks enabled)

required
dir_path str | Path

output directory

required
fmt Literal['json', 'yaml', 'yml', 'toon']

json/yaml/yml/toon

'json'
pretty bool

Pretty-print JSON output.

False
indent int | None

JSON indent width (defaults to 2 when pretty is True and indent is None).

None
normalize bool

rebase row/col indices to the area origin when True

False

Returns:

Type Description
dict[str, Path]

dict mapping area key to path (e.g., "Sheet1#1": /.../Sheet1_auto_page1_...json)

Raises:

Type Description
PrintAreaError

If no auto page-break areas are present.

Examples:

>>> from exstruct import export_auto_page_breaks, extract
>>> wb = extract("input.xlsx", mode="standard")
>>> try:
...     export_auto_page_breaks(wb, "auto_areas")
... except ValueError:  # PrintAreaError is ValueError-compatible
...     pass

exstruct.export_pdf

export_pdf(excel_path: str | Path, output_pdf: str | Path) -> list[str]

Export an Excel workbook to PDF via Excel COM and return sheet names in order.

exstruct.export_sheet_images

export_sheet_images(excel_path: str | Path, output_dir: str | Path, dpi: int = 144, *, sheet: str | None = None, a1_range: str | None = None) -> list[Path]

Export each worksheet in the given Excel workbook to PNG files and return the image paths in workbook order.

Returns:

Name Type Description
paths list[Path]

Paths to the generated PNG files, ordered by the corresponding worksheets.

Raises:

Type Description
RenderError

If export or rendering fails.

exstruct.process_excel

process_excel(file_path: str | Path, output_path: str | Path | None = None, out_fmt: str = 'json', image: bool = False, pdf: bool = False, dpi: int = 72, mode: ExtractionMode = 'standard', pretty: bool = False, indent: int | None = None, sheets_dir: str | Path | None = None, print_areas_dir: str | Path | None = None, auto_page_breaks_dir: str | Path | None = None, stream: TextIO | None = None, *, alpha_col: bool = False, include_backend_metadata: bool = False) -> None

Convenience wrapper: extract -> serialize (file or stdout) -> optional PDF/PNG.

Parameters:

Name Type Description Default
file_path str | Path

Input Excel workbook (path string or Path).

required
output_path str | Path | None

None for stdout; otherwise, write to file (string or Path).

None
out_fmt str

json/yaml/yml/toon.

'json'
image bool

True to also output PNGs (requires Excel + COM + pypdfium2 and is not supported in mode="libreoffice").

False
pdf bool

True to also output PDF (requires Excel + COM + pypdfium2 and is not supported in mode="libreoffice").

False
dpi int

DPI for image output.

72
mode ExtractionMode

light/libreoffice/standard/verbose (same meaning as extract).

'standard'
pretty bool

Pretty-print JSON.

False
indent int | None

JSON indent width.

None
sheets_dir str | Path | None

Directory to write per-sheet files (string or Path).

None
print_areas_dir str | Path | None

Directory to write per-print-area files (string or Path).

None
auto_page_breaks_dir str | Path | None

Directory to write per-auto-page-break files (COM only and not supported in mode="libreoffice").

None
stream TextIO | None

IO override when output_path is None.

None
alpha_col bool

When True, convert CellRow column keys to Excel-style ABC names (A, B, ...) instead of 0-based numeric strings.

False
include_backend_metadata bool

When True, include shape/chart backend metadata fields (provenance, approximation_level, confidence) in output.

False

Raises:

Type Description
ConfigError

If mode="libreoffice" is combined with PDF/PNG rendering or auto page-break export.

ValueError

If an unsupported format or mode is given.

PrintAreaError

When exporting auto page breaks without available data.

RenderError

When rendering fails (Excel/COM/pypdfium2 issues).

Examples:

Extract and write JSON to stdout, plus per-sheet files:

>>> from pathlib import Path
>>> from exstruct import process_excel
>>> process_excel(Path("input.xlsx"), output_path=None, sheets_dir=Path("sheets"))

Write JSON to a file and also render a PDF (COM + Excel required):

>>> process_excel(Path("input.xlsx"), output_path=Path("out.json"), pdf=True)

Editing functions

exstruct.edit.patch_workbook

patch_workbook(request: PatchRequest) -> PatchResult

Edit an existing workbook without MCP path policy enforcement.

exstruct.edit.make_workbook

make_workbook(request: MakeRequest) -> PatchResult

Create a new workbook and apply initial patch operations.

Engine and options

exstruct.engine.ExStructEngine

Configurable engine for ExStruct extraction and export.

Instances are immutable; override options per call if needed.

Key behaviors
  • StructOptions: extraction mode and optional table detection params.
  • OutputOptions: serialization format/pretty-print, include/exclude filters, per-sheet/per-print-area output dirs, etc.
  • Main methods:
      - extract(path, mode=None) -> WorkbookData
          - Modes: light/libreoffice/standard/verbose
          - light: COM-free; cells + tables + print areas only (shapes/charts empty)
      - serialize(workbook, ...) -> str
          - Applies include_* filters, then serializes
      - export(workbook, ...)
          - Writes to file/stdout; optionally per-sheet and per-print-area files
      - process(file_path, ...)
          - One-shot extract->export (CLI equivalent), with optional PDF/PNG

__init__

__init__(options: StructOptions | None = None, output: OutputOptions | None = None) -> None

Initialize the engine with optional struct/output options.

from_defaults staticmethod

from_defaults() -> ExStructEngine

Factory to create an engine with default options.

extract

extract(file_path: str | Path, *, mode: ExtractionMode | None = None, _auto_page_breaks_dir_override: Path | None | object = _AUTO_PAGE_BREAKS_DIR_UNSET) -> WorkbookData

Produce a normalized WorkbookData extracted from the given workbook file.

Parameters:

Name Type Description Default
file_path str | Path

Path to the .xlsx/.xlsm/.xls file to extract.

required
mode ExtractionMode | None

Extraction mode to use; if None the engine's configured mode is used. Modes: "light", "libreoffice", "standard", "verbose".

None

Returns:

Name Type Description
WorkbookData WorkbookData

Normalized workbook data extracted from the file.

serialize

serialize(data: WorkbookData, *, fmt: Literal['json', 'yaml', 'yml', 'toon'] | None = None, pretty: bool | None = None, indent: int | None = None) -> str

Serialize a workbook after applying include/exclude filters.

Parameters:

Name Type Description Default
data WorkbookData

Workbook to serialize after filtering.

required
fmt Literal['json', 'yaml', 'yml', 'toon'] | None

Serialization format; defaults to OutputOptions.format.fmt.

None
pretty bool | None

Whether to pretty-print JSON output.

None
indent int | None

Indentation to use when pretty-printing JSON.

None

export

export(data: WorkbookData, output_path: str | Path | None = None, *, fmt: Literal['json', 'yaml', 'yml', 'toon'] | None = None, pretty: bool | None = None, indent: int | None = None, sheets_dir: str | Path | None = None, print_areas_dir: str | Path | None = None, auto_page_breaks_dir: str | Path | None = None, stream: TextIO | None = None) -> None

Write filtered workbook data to a file or stream.

Includes optional per-sheet and per-print-area outputs when destinations are provided.

Parameters:

Name Type Description Default
data WorkbookData

Workbook to serialize and write.

required
output_path str | Path | None

Target file path (str or Path); writes to stdout when None.

None
fmt Literal['json', 'yaml', 'yml', 'toon'] | None

Serialization format; defaults to OutputOptions.format.fmt.

None
pretty bool | None

Whether to pretty-print JSON output.

None
indent int | None

Indentation to use when pretty-printing JSON.

None
sheets_dir str | Path | None

Directory for per-sheet outputs when provided (str or Path).

None
print_areas_dir str | Path | None

Directory for per-print-area outputs when provided (str or Path).

None
auto_page_breaks_dir str | Path | None

Directory for auto page-break outputs (str or Path; COM environments only).

None
stream TextIO | None

Stream override when output_path is None.

None
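
The output_path/stream precedence described above can be sketched like this. `write_text` is a hypothetical helper that only illustrates the documented rule (a path wins, otherwise the stream override, otherwise stdout); it is not the engine's implementation.

```python
from __future__ import annotations

import io
import sys
from pathlib import Path
from typing import TextIO


def write_text(text: str, output_path: str | Path | None = None, stream: TextIO | None = None) -> None:
    """Write to output_path when given; otherwise to stream, falling back to stdout."""
    if output_path is not None:
        Path(output_path).write_text(text, encoding="utf-8")
        return
    target = stream if stream is not None else sys.stdout
    target.write(text)


# Capture serialized output in memory instead of printing it.
buffer = io.StringIO()
write_text('{"book_name": "sample.xlsx"}', stream=buffer)
print(buffer.getvalue())
```

An in-memory stream like this is handy in tests, where writing to stdout or to disk would be inconvenient.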

process

process(file_path: str | Path, output_path: str | Path | None = None, *, out_fmt: str | None = None, image: bool = False, pdf: bool = False, dpi: int = 72, mode: ExtractionMode | None = None, pretty: bool | None = None, indent: int | None = None, sheets_dir: str | Path | None = None, print_areas_dir: str | Path | None = None, auto_page_breaks_dir: str | Path | None = None, stream: TextIO | None = None) -> None

One-shot extract->export wrapper (CLI equivalent) with optional PDF/PNG output.

Parameters:

Name Type Description Default
file_path str | Path

Input Excel workbook path (str or Path).

required
output_path str | Path | None

Target file path (str or Path); writes to stdout when None.

None
out_fmt str | None

Serialization format for structured output.

None
image bool

Whether to export PNGs alongside structured output. Requires Excel COM and is not supported in mode="libreoffice".

False
pdf bool

Whether to export a PDF snapshot alongside structured output. Requires Excel COM and is not supported in mode="libreoffice".

False
dpi int

DPI to use when rendering images.

72
mode ExtractionMode | None

Extraction mode; defaults to the engine's StructOptions.mode.

None
pretty bool | None

Whether to pretty-print JSON output.

None
indent int | None

Indentation to use when pretty-printing JSON.

None
sheets_dir str | Path | None

Directory for per-sheet structured outputs (str or Path).

None
print_areas_dir str | Path | None

Directory for per-print-area structured outputs (str or Path).

None
auto_page_breaks_dir str | Path | None

Directory for auto page-break outputs (str or Path). Requires Excel COM and is not supported in mode="libreoffice".

None
stream TextIO | None

Stream override when writing to stdout.

None

Raises:

Type Description
ConfigError

If mode="libreoffice" is combined with PDF/PNG rendering or auto page-break export.

exstruct.engine.StructOptions dataclass

Extraction-time options for ExStructEngine.

Attributes:

Name Type Description
mode ExtractionMode

Extraction mode. One of "light", "libreoffice", "standard", "verbose".
  • light: cells + table candidates only (no COM, shapes/charts empty)
  • libreoffice: best-effort non-COM mode using the LibreOffice backend
  • standard: texted shapes + arrows + charts (if COM available)
  • verbose: all shapes (width/height), charts, table candidates

table_params TableParams | None

Optional dict passed to set_table_detection_params(**table_params) before extraction. Use this to tweak table detection heuristics per engine instance without touching global state.

include_colors_map bool | None

Whether to extract background color maps.

include_formulas_map bool | None

Whether to extract formulas map.

include_merged_cells bool | None

Whether to extract merged cell ranges.

include_merged_values_in_rows bool

Whether to keep merged values in rows.

colors ColorsOptions

Color extraction options.

alpha_col bool

When True, convert CellRow column keys to Excel-style ABC names (A, B, ..., Z, AA, ...) instead of 0-based numeric strings.

exstruct.engine.OutputOptions

Bases: BaseModel

Output-time options for ExStructEngine.

  • format: serialization format/indent.
  • filters: include/exclude flags (rows/shapes/charts/tables/print_areas, size flags).
  • destinations: side outputs (per-sheet, per-print-area, stream override).

exstruct.engine.FormatOptions

Bases: BaseModel

Formatting options for serialization.

exstruct.engine.FilterOptions

Bases: BaseModel

Include/exclude filters for output.

exstruct.engine.DestinationOptions

Bases: BaseModel

Destinations for optional side outputs.

exstruct.engine.ColorsOptions

Bases: BaseModel

Color extraction options.

Examples:

>>> from exstruct.engine import ColorsOptions
>>> ColorsOptions(
...     include_default_background=False,
...     ignore_colors=["#FFFFFF", "AD3815", "theme:1:0.2", "indexed:64", "auto"],
... )

ignore_colors_set

ignore_colors_set() -> set[str]

Return ignore_colors as a set of normalized strings.

Returns:

Type Description
set[str]

Set of color keys to ignore.

Models

See generated/models.md for the detailed model fields (run python scripts/gen_model_docs.py to refresh).

Model helpers for SheetData and WorkbookData

  • to_json(pretty=False, indent=None, include_backend_metadata=False) → JSON string (pretty when requested)
  • to_yaml(include_backend_metadata=False) → YAML string (requires pyyaml)
  • to_toon(include_backend_metadata=False) → TOON string (requires python-toon)
  • save(path, pretty=False, indent=None, include_backend_metadata=False) → infers format from suffix (.json/.yaml/.yml/.toon)
  • WorkbookData.__getitem__(name) → get a SheetData by name
  • WorkbookData.__iter__() → yields (sheet_name, SheetData) in order

Serialized output omits shape/chart backend metadata (provenance, approximation_level, confidence) by default to reduce token usage. Set include_backend_metadata=True when you need those fields.

Example:

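from exstruct import extract

wb = extract("input.xlsx")
first = wb["Sheet1"]  # look up a SheetData by name
for name, sheet in wb:  # iterates (sheet_name, SheetData) in order
    print(name, len(sheet.rows))
wb.save("out.json", pretty=True)  # format inferred from the suffix
first.save("sheet.yaml")  # requires pyyaml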
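
The default omission of backend metadata described above amounts to dropping three keys from each shape or chart entry. A minimal sketch, assuming entries are plain dicts after serialization: `strip_backend_metadata` is a hypothetical helper, and the sample values for provenance and confidence are invented for illustration.

```python
BACKEND_METADATA_KEYS = ("provenance", "approximation_level", "confidence")


def strip_backend_metadata(item: dict, include_backend_metadata: bool = False) -> dict:
    """Return a copy of item, without backend metadata unless it is requested."""
    if include_backend_metadata:
        return dict(item)
    return {key: value for key, value in item.items() if key not in BACKEND_METADATA_KEYS}


# Sample values are made up; only the key names come from this page.
shape = {"text": "note", "type": "TextBox", "provenance": "com", "confidence": 1.0}
print(strip_backend_metadata(shape))
# {'text': 'note', 'type': 'TextBox'}
```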

Error Handling

  • Exception types:
      - SerializationError: Unsupported format requested (serialize_workbook, export APIs).
      - MissingDependencyError: Optional dependency (pyyaml / python-toon / pypdfium2) is missing; message includes install instructions.
      - ConfigError: Invalid option combinations such as mode="libreoffice" with PDF/PNG rendering or auto page-break export.
      - RenderError: Excel/COM is unavailable or PDF/PNG rendering fails.
      - PrintAreaError (ValueError-compatible): export_auto_page_breaks invoked when no auto_print_areas are available.
      - OutputError: Writing output to disk/stream failed (original exception kept in __cause__).
      - ValueError: Invalid inputs such as an unsupported mode.
  • Excel COM unavailable: extraction falls back to cells + table_candidates; shapes/charts are empty, and a warning is logged.
  • No print areas: export_print_areas_as writes nothing and returns {}; this is not an error.
  • Auto page-break export: export_auto_page_breaks raises PrintAreaError if no auto page-break areas are present (enable them via DestinationOptions.auto_page_breaks_dir).
  • CLI mirrors these behaviors: it exits non-zero on failures and prints messages in English.
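
Because PrintAreaError is described as ValueError-compatible, callers that already catch ValueError should not need a separate handler. The sketch below demonstrates the pattern with stand-ins: the `PrintAreaError` class here is a local substitute assumed to subclass ValueError, and `export_auto_areas` is a hypothetical function that only mimics the failure mode of export_auto_page_breaks.

```python
class PrintAreaError(ValueError):
    """Local stand-in for ExStruct's PrintAreaError (assumed to subclass ValueError)."""


def export_auto_areas(auto_print_areas: list) -> list:
    """Hypothetical stand-in that fails the way export_auto_page_breaks is documented to."""
    if not auto_print_areas:
        raise PrintAreaError("no auto page-break areas are present")
    return list(auto_print_areas)


def safe_export(auto_print_areas: list) -> list:
    """Treat missing auto page-break areas as 'nothing to export' instead of a crash."""
    try:
        return export_auto_areas(auto_print_areas)
    except ValueError as exc:  # also catches PrintAreaError
        print(f"skipped auto page-break export: {exc}")
        return []


print(safe_export([]))            # prints the skip message, then []
print(safe_export(["Sheet1#1"]))  # ['Sheet1#1']
```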

Tuning Examples

  • Reduce false positives (layout frames):
set_table_detection_params(table_score_threshold=0.4, coverage_min=0.25)
  • Recover missed tiny tables:
set_table_detection_params(density_min=0.03, min_nonempty_cells=2)