What did I want to do here?
I went looking for a handy library for splitting Markdown.
Part of me felt it might be better to just write one myself, but first, let's look at an existing library.
This time I focused on LangChain.
Though after actually trying it, I did come away feeling that using a library is the better call.
LangChain's text splitters
LangChain provides text splitters, which are used to split large documents into chunks.
Text splitter integrations - Docs by LangChain
For Markdown, there is MarkdownHeaderTextSplitter.
Split markdown - text splitter integration - Docs by LangChain
MarkdownHeaderTextSplitter | langchain_text_splitters | LangChain Reference
It appears to split at the header levels you specify while attaching metadata to each chunk.
If you additionally want to cap the chunk size, you apparently combine it with RecursiveCharacterTextSplitter.
It seems quite simple. Let's give it a try.
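As a mental model for what recursive character splitting does, here is a rough pure-Python sketch. This is my own simplification, not the library's actual implementation: try coarse separators first, and only fall back to finer ones when a piece still exceeds the chunk size.

```python
def recursive_split(text: str, separators: list[str], chunk_size: int) -> list[str]:
    """Rough sketch of recursive splitting (not the library's code).

    Separators are simply dropped for brevity; the real splitter also
    supports chunk overlap and custom length functions.
    """
    if len(text) <= chunk_size:
        return [text]
    if not separators:
        # No separators left: hard-cut at chunk_size.
        return [text[i : i + chunk_size] for i in range(0, len(text), chunk_size)]
    sep, rest = separators[0], separators[1:]
    chunks: list[str] = []
    for piece in text.split(sep):
        if len(piece) <= chunk_size:
            chunks.append(piece)
        else:
            # Piece is still too large: recurse with finer separators.
            chunks.extend(recursive_split(piece, rest, chunk_size))
    return [c for c in chunks if c]
```

The point is only the recursion order: paragraph breaks before line breaks before hard cuts.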
Environment
Here is the environment this time.
$ python3 --version
Python 3.12.3
$ uv --version
uv 0.10.10
Setup
Create a uv project and install the dependencies.
$ uv init --vcs none .
Add LangChain's text splitters.
$ uv add langchain-text-splitters
Honestly, the dependency tree feels heavy for the purpose, but...
Add the testing and tooling packages as well.
$ uv add --dev pytest mypy pyright ruff
The project definition and dependencies.
pyproject.toml
[project]
name = "langchain-markdown-splitter"
version = "0.1.0"
description = "Add your description here"
readme = "README.md"
requires-python = ">=3.12"
dependencies = [
    "langchain-text-splitters>=1.1.1",
]

[dependency-groups]
dev = [
    "mypy>=1.19.1",
    "pyright>=1.1.408",
    "pytest>=9.0.2",
    "ruff>=0.15.6",
]

[tool.mypy]
strict = true
disallow_any_unimported = true
disallow_any_expr = true
disallow_any_explicit = true
warn_unreachable = true
pretty = true

[tool.pyright]
pythonVersion = "3.12"
typeCheckingMode = "strict"
deprecateTypingAliases = true
reportImplicitOverride = "error"
reportImplicitStringConcatenation = "error"
reportImportCycles = "error"
reportMissingSuperCall = "error"
reportPropertyTypeMismatch = "error"
reportUnnecessaryTypeIgnoreComment = "error"
reportUnreachable = "error"
$ uv pip list
Package                  Version
------------------------ ---------
annotated-types          0.7.0
anyio                    4.12.1
certifi                  2026.2.25
charset-normalizer       3.4.5
h11                      0.16.0
httpcore                 1.0.9
httpx                    0.28.1
idna                     3.11
iniconfig                2.3.0
jsonpatch                1.33
jsonpointer              3.0.0
langchain-core           1.2.19
langchain-text-splitters 1.1.1
langsmith                0.7.17
librt                    0.8.1
mypy                     1.19.1
mypy-extensions          1.1.0
nodeenv                  1.10.0
orjson                   3.11.7
packaging                26.0
pathspec                 1.0.4
pluggy                   1.6.0
pydantic                 2.12.5
pydantic-core            2.41.5
pygments                 2.19.2
pyright                  1.1.408
pytest                   9.0.2
pyyaml                   6.0.3
requests                 2.32.5
requests-toolbelt        1.0.0
ruff                     0.15.6
tenacity                 9.1.4
typing-extensions        4.15.0
typing-inspection        0.4.2
urllib3                  2.6.3
uuid-utils               0.14.1
xxhash                   3.6.0
zstandard                0.25.0
Trying out LangChain's MarkdownHeaderTextSplitter
Now let's try LangChain's MarkdownHeaderTextSplitter from test code.
Here is the skeleton of the test code.
markdown_test.py
from urllib.request import urlopen

from langchain_text_splitters.markdown import MarkdownHeaderTextSplitter

# Tests go here
I wondered which Markdown file to target, and settled on LangChain's own CLAUDE.md.
https://github.com/langchain-ai/langchain/blob/langchain%3D%3D1.1.1/CLAUDE.md
First, let's split on # and ##.
def test_split_markdown_h1_h2() -> None:
    with urlopen(
        "https://raw.githubusercontent.com/langchain-ai/langchain/refs/tags/langchain%3D%3D1.1.1/CLAUDE.md"
    ) as f:
        markdown = f.read().decode("utf-8")

    headers_to_split_on = [("#", "Header 1"), ("##", "Header 2")]
    markdown_splitter = MarkdownHeaderTextSplitter(headers_to_split_on)
    md_header_splits = markdown_splitter.split_text(markdown)

    assert len(md_header_splits) == 4

    doc1 = md_header_splits[0]
    assert doc1.metadata == {
        "Header 1": "Global development guidelines for the LangChain monorepo"
    }
    assert (
        doc1.page_content
        == """This document provides context to understand the LangChain Python project and assist with development."""
    )

    doc2 = md_header_splits[1]
    assert doc2.metadata == {
        "Header 1": "Global development guidelines for the LangChain monorepo",
        "Header 2": "Project architecture and context",
    }
    assert (
        doc2.page_content
        == """### Monorepo structure
This is a Python monorepo with multiple independently versioned packages that use `uv`.
```txt
langchain/
├── libs/
│ ├── core/ # `langchain-core` primitives and base abstractions
│ ├── langchain/ # `langchain-classic` (legacy, no new features)
│ ├── langchain_v1/ # Actively maintained `langchain` package
│ ├── partners/ # Third-party integrations
│ │ ├── openai/ # OpenAI models and embeddings
│ │ ├── anthropic/ # Anthropic (Claude) integration
│ │ ├── ollama/ # Local model support
│ │ └── ... (other integrations maintained by the LangChain team)
│ ├── text-splitters/ # Document chunking utilities
│ ├── standard-tests/ # Shared test suite for integrations
│ ├── model-profiles/ # Model configuration profiles
│ └── cli/ # Command-line interface tools
├── .github/ # CI/CD workflows and templates
├── .vscode/ # VSCode IDE standard settings and recommended extensions
└── README.md # Information about LangChain
```
- **Core layer** (`langchain-core`): Base abstractions, interfaces, and protocols. Users should not need to know about this layer directly.
- **Implementation layer** (`langchain`): Concrete implementations and high-level public utilities
- **Integration layer** (`partners/`): Third-party service integrations. Note that this monorepo is not exhaustive of all LangChain integrations; some are maintained in separate repos, such as `langchain-ai/langchain-google` and `langchain-ai/langchain-aws`. Usually these repos are cloned at the same level as this monorepo, so if needed, you can refer to their code directly by navigating to `../langchain-google/` from this monorepo.
- **Testing layer** (`standard-tests/`): Standardized integration tests for partner integrations
### Development tools & commands**
- `uv` – Fast Python package installer and resolver (replaces pip/poetry)
- `make` – Task runner for common development commands. Feel free to look at the `Makefile` for available commands and usage patterns.
- `ruff` – Fast Python linter and formatter
- `mypy` – Static type checking
- `pytest` – Testing framework
This monorepo uses `uv` for dependency management. Local development uses editable installs: `[tool.uv.sources]`
Each package in `libs/` has its own `pyproject.toml` and `uv.lock`.
```bash
# Run unit tests (no network)
make test
# Run specific test file
uv run --group test pytest tests/unit_tests/test_specific.py
```
```bash
# Lint code
make lint
# Format code
make format
# Type checking
uv run --group lint mypy .
```
#### Key config files
- pyproject.toml: Main workspace configuration with dependency groups
- uv.lock: Locked dependencies for reproducible builds
- Makefile: Development tasks
#### Commit standards
Suggest PR titles that follow Conventional Commits format. Refer to .github/workflows/pr_lint for allowed types and scopes.
#### Pull request guidelines
- Always add a disclaimer to the PR description mentioning how AI agents are involved with the contribution.
- Describe the "why" of the changes, why the proposed solution is the right one. Limit prose.
- Highlight areas of the proposed changes that require careful review."""
    )

    doc3 = md_header_splits[2]
    assert doc3.metadata == {
        "Header 1": "Global development guidelines for the LangChain monorepo",
        "Header 2": "Core development principles",
    }
    assert (
        doc3.page_content
        == '''### Maintain stable public interfaces
CRITICAL: Always attempt to preserve function signatures, argument positions, and names for exported/public methods. Do not make breaking changes.
**Before making ANY changes to public APIs:**
- Check if the function/class is exported in `__init__.py`
- Look for existing usage patterns in tests and examples
- Use keyword-only arguments for new parameters: `*, new_param: str = "default"`
- Mark experimental features clearly with docstring warnings (using MkDocs Material admonitions, like `!!! warning`)
Ask: "Would this change break someone's code if they used it last week?"
### Code quality standards
All Python code MUST include type hints and return types.
```python title="Example"
def filter_unknown_users(users: list[str], known_users: set[str]) -> list[str]:
    """Single line description of the function.
    Any additional context about the function can go here.
    Args:
        users: List of user identifiers to filter.
        known_users: Set of known/valid user identifiers.
    Returns:
        List of users that are not in the known_users set.
    """
```
- Use descriptive, self-explanatory variable names.
- Follow existing patterns in the codebase you're modifying
- Attempt to break up complex functions (>20 lines) into smaller, focused functions where it makes sense
### Testing requirements
Every new feature or bugfix MUST be covered by unit tests.
- Unit tests: `tests/unit_tests/` (no network calls allowed)
- Integration tests: `tests/integration_tests/` (network calls permitted)
- We use `pytest` as the testing framework; if in doubt, check other existing tests for examples.
- The testing file structure should mirror the source code structure.
**Checklist:**
- [ ] Tests fail when your new logic is broken
- [ ] Happy path is covered
- [ ] Edge cases and error conditions are tested
- [ ] Use fixtures/mocks for external dependencies
- [ ] Tests are deterministic (no flaky tests)
- [ ] Does the test suite fail if your new logic is broken?
### Security and risk assessment
- No `eval()`, `exec()`, or `pickle` on user-controlled input
- Proper exception handling (no bare `except:`) and use a `msg` variable for error messages
- Remove unreachable/commented code before committing
- Race conditions or resource leaks (file handles, sockets, threads).
- Ensure proper resource cleanup (file handles, connections)
### Documentation standards
Use Google-style docstrings with Args section for all public functions.
```python title="Example"
def send_email(to: str, msg: str, *, priority: str = "normal") -> bool:
    """Send an email to a recipient with specified priority.
    Any additional context about the function can go here.
    Args:
        to: The email address of the recipient.
        msg: The message body to send.
        priority: Email priority level.
    Returns:
        `True` if email was sent successfully, `False` otherwise.
    Raises:
        InvalidEmailError: If the email address format is invalid.
        SMTPConnectionError: If unable to connect to email server.
    """
```
- Types go in function signatures, NOT in docstrings
- If a default is present, DO NOT repeat it in the docstring unless there is post-processing or it is set conditionally.
- Focus on "why" rather than "what" in descriptions
- Document all parameters, return values, and exceptions
- Keep descriptions concise but clear
- Ensure American English spelling (e.g., "behavior", not "behaviour")'''
    )

    doc4 = md_header_splits[3]
    assert doc4.metadata == {
        "Header 1": "Global development guidelines for the LangChain monorepo",
        "Header 2": "Additional resources",
    }
    assert (
        doc4.page_content
        == """- **Documentation:** https://docs.langchain.com/oss/python/langchain/overview and source at https://github.com/langchain-ai/docs or `../docs/`. Prefer the local install and use file search tools for best results. If needed, use the docs MCP server as defined in `.mcp.json` for programmatic access.
- **Contributing Guide:** [`.github/CONTRIBUTING.md`](https://docs.langchain.com/oss/python/contributing/overview)"""
    )
Download CLAUDE.md,
with urlopen(
    "https://raw.githubusercontent.com/langchain-ai/langchain/refs/tags/langchain%3D%3D1.1.1/CLAUDE.md"
) as f:
    markdown = f.read().decode("utf-8")
then split on # and ##.
headers_to_split_on = [("#", "Header 1"), ("##", "Header 2")]
markdown_splitter = MarkdownHeaderTextSplitter(headers_to_split_on)
md_header_splits = markdown_splitter.split_text(markdown)
The result was four splits.
assert len(md_header_splits) == 4
First, the first # section. The metadata is a dictionary whose keys are the names specified at split time; the body goes into page_content.
doc1 = md_header_splits[0]
assert doc1.metadata == {
    "Header 1": "Global development guidelines for the LangChain monorepo"
}
assert (
    doc1.page_content
    == """This document provides context to understand the LangChain Python project and assist with development."""
)
In other words, the original data was this:
# Global development guidelines for the LangChain monorepo
This document provides context to understand the LangChain Python project and assist with development.
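Since the header text moves out of page_content and into the metadata, a downstream consumer often wants to stitch the two back together. A hypothetical helper might look like this (chunk_with_context is my own name, not a LangChain API):

```python
def chunk_with_context(metadata: dict[str, str], page_content: str) -> str:
    """Prepend the header metadata to the chunk body (hypothetical helper).

    Useful when each chunk should carry its section context, e.g. for
    embedding or retrieval.
    """
    # Join header values in insertion order, outermost first.
    headers = " > ".join(metadata.values())
    return f"[{headers}]\n{page_content}"
```

Applied to the splits above, each chunk would be prefixed with its section path, such as the "Header 1 > Header 2" values.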
The second document contains the result for ##. The metadata also carries over the values from the higher-level headers.
doc2 = md_header_splits[1]
assert doc2.metadata == {
"Header 1": "Global development guidelines for the LangChain monorepo",
"Header 2": "Project architecture and context",
}
assert (
doc2.page_content
== """### Monorepo structure
This is a Python monorepo with multiple independently versioned packages that use `uv`.
```txt
langchain/
├── libs/
│ ├── core/ # `langchain-core` primitives and base abstractions
│ ├── langchain/ # `langchain-classic` (legacy, no new features)
│ ├── langchain_v1/ # Actively maintained `langchain` package
│ ├── partners/ # Third-party integrations
│ │ ├── openai/ # OpenAI models and embeddings
│ │ ├── anthropic/ # Anthropic (Claude) integration
│ │ ├── ollama/ # Local model support
│ │ └── ... (other integrations maintained by the LangChain team)
│ ├── text-splitters/ # Document chunking utilities
│ ├── standard-tests/ # Shared test suite for integrations
│ ├── model-profiles/ # Model configuration profiles
│ └── cli/ # Command-line interface tools
├── .github/ # CI/CD workflows and templates
├── .vscode/ # VSCode IDE standard settings and recommended extensions
└── README.md # Information about LangChain
```
- **Core layer** (`langchain-core`): Base abstractions, interfaces, and protocols. Users should not need to know about this layer directly.
- **Implementation layer** (`langchain`): Concrete implementations and high-level public utilities
- **Integration layer** (`partners/`): Third-party service integrations. Note that this monorepo is not exhaustive of all LangChain integrations; some are maintained in separate repos, such as `langchain-ai/langchain-google` and `langchain-ai/langchain-aws`. Usually these repos are cloned at the same level as this monorepo, so if needed, you can refer to their code directly by navigating to `../langchain-google/` from this monorepo.
- **Testing layer** (`standard-tests/`): Standardized integration tests for partner integrations
### Development tools & commands**
- `uv` – Fast Python package installer and resolver (replaces pip/poetry)
- `make` – Task runner for common development commands. Feel free to look at the `Makefile` for available commands and usage patterns.
- `ruff` – Fast Python linter and formatter
- `mypy` – Static type checking
- `pytest` – Testing framework
This monorepo uses `uv` for dependency management. Local development uses editable installs: `[tool.uv.sources]`
Each package in `libs/` has its own `pyproject.toml` and `uv.lock`.
```bash
# Run unit tests (no network)
make test
# Run specific test file
uv run --group test pytest tests/unit_tests/test_specific.py
```
```bash
# Lint code
make lint
# Format code
make format
# Type checking
uv run --group lint mypy .
```
#### Key config files
- pyproject.toml: Main workspace configuration with dependency groups
- uv.lock: Locked dependencies for reproducible builds
- Makefile: Development tasks
#### Commit standards
Suggest PR titles that follow Conventional Commits format. Refer to .github/workflows/pr_lint for allowed types and scopes.
#### Pull request guidelines
- Always add a disclaimer to the PR description mentioning how AI agents are involved with the contribution.
- Describe the "why" of the changes, why the proposed solution is the right one. Limit prose.
- Highlight areas of the proposed changes that require careful review."""
)
Since ### was not specified as a header to split on, it does not trigger a split.
In other words, the original data was this:
## Project architecture and context
### Monorepo structure
This is a Python monorepo with multiple independently versioned packages that use `uv`.
```txt
langchain/
├── libs/
│ ├── core/ # `langchain-core` primitives and base abstractions
│ ├── langchain/ # `langchain-classic` (legacy, no new features)
│ ├── langchain_v1/ # Actively maintained `langchain` package
│ ├── partners/ # Third-party integrations
│ │ ├── openai/ # OpenAI models and embeddings
│ │ ├── anthropic/ # Anthropic (Claude) integration
│ │ ├── ollama/ # Local model support
│ │ └── ... (other integrations maintained by the LangChain team)
│ ├── text-splitters/ # Document chunking utilities
│ ├── standard-tests/ # Shared test suite for integrations
│ ├── model-profiles/ # Model configuration profiles
│ └── cli/ # Command-line interface tools
├── .github/ # CI/CD workflows and templates
├── .vscode/ # VSCode IDE standard settings and recommended extensions
└── README.md # Information about LangChain
```
- **Core layer** (`langchain-core`): Base abstractions, interfaces, and protocols. Users should not need to know about this layer directly.
- **Implementation layer** (`langchain`): Concrete implementations and high-level public utilities
- **Integration layer** (`partners/`): Third-party service integrations. Note that this monorepo is not exhaustive of all LangChain integrations; some are maintained in separate repos, such as `langchain-ai/langchain-google` and `langchain-ai/langchain-aws`. Usually these repos are cloned at the same level as this monorepo, so if needed, you can refer to their code directly by navigating to `../langchain-google/` from this monorepo.
- **Testing layer** (`standard-tests/`): Standardized integration tests for partner integrations
### Development tools & commands**
- `uv` – Fast Python package installer and resolver (replaces pip/poetry)
- `make` – Task runner for common development commands. Feel free to look at the `Makefile` for available commands and usage patterns.
- `ruff` – Fast Python linter and formatter
- `mypy` – Static type checking
- `pytest` – Testing framework
This monorepo uses `uv` for dependency management. Local development uses editable installs: `[tool.uv.sources]`
Each package in `libs/` has its own `pyproject.toml` and `uv.lock`.
```bash
# Run unit tests (no network)
make test
Incidentally, trailing spaces and blank lines within the content seem to get adjusted in various ways, so a test written by pasting in the original content verbatim will not pass...
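If you still want to derive the expected strings from the original file, one option is to apply the same kind of normalization to the pasted content first. The helper below is an assumption based on the adjustments observed above (trailing whitespace stripped, blank lines dropped), not documented behavior of the splitter:

```python
def normalize(text: str) -> str:
    """Strip trailing whitespace and drop blank lines.

    Hypothetical: approximates the adjustments observed in the split
    output; not a documented guarantee of MarkdownHeaderTextSplitter.
    """
    lines = [line.rstrip() for line in text.splitlines()]
    return "\n".join(line for line in lines if line)
```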
I'll omit the rest.
Let's also include an example that splits down to ###.
def test_split_markdown_h1_h2_3() -> None:
    with urlopen(
        "https://raw.githubusercontent.com/langchain-ai/langchain/refs/tags/langchain%3D%3D1.1.1/CLAUDE.md"
    ) as f:
        markdown = f.read().decode("utf-8")

    headers_to_split_on = [
        ("#", "Header 1"),
        ("##", "Header 2"),
        ("###", "Header 3"),
    ]
    markdown_splitter = MarkdownHeaderTextSplitter(headers_to_split_on)
    md_header_splits = markdown_splitter.split_text(markdown)

    assert len(md_header_splits) == 9

    doc1 = md_header_splits[0]
    assert doc1.metadata == {
        "Header 1": "Global development guidelines for the LangChain monorepo"
    }
    assert (
        doc1.page_content
        == """This document provides context to understand the LangChain Python project and assist with development."""
    )

    doc2 = md_header_splits[1]
    assert doc2.metadata == {
        "Header 1": "Global development guidelines for the LangChain monorepo",
        "Header 2": "Project architecture and context",
        "Header 3": "Monorepo structure",
    }
    assert (
        doc2.page_content
        == """This is a Python monorepo with multiple independently versioned packages that use `uv`.
```txt
langchain/
├── libs/
│ ├── core/ # `langchain-core` primitives and base abstractions
│ ├── langchain/ # `langchain-classic` (legacy, no new features)
│ ├── langchain_v1/ # Actively maintained `langchain` package
│ ├── partners/ # Third-party integrations
│ │ ├── openai/ # OpenAI models and embeddings
│ │ ├── anthropic/ # Anthropic (Claude) integration
│ │ ├── ollama/ # Local model support
│ │ └── ... (other integrations maintained by the LangChain team)
│ ├── text-splitters/ # Document chunking utilities
│ ├── standard-tests/ # Shared test suite for integrations
│ ├── model-profiles/ # Model configuration profiles
│ └── cli/ # Command-line interface tools
├── .github/ # CI/CD workflows and templates
├── .vscode/ # VSCode IDE standard settings and recommended extensions
└── README.md # Information about LangChain
```
- **Core layer** (`langchain-core`): Base abstractions, interfaces, and protocols. Users should not need to know about this layer directly.
- **Implementation layer** (`langchain`): Concrete implementations and high-level public utilities
- **Integration layer** (`partners/`): Third-party service integrations. Note that this monorepo is not exhaustive of all LangChain integrations; some are maintained in separate repos, such as `langchain-ai/langchain-google` and `langchain-ai/langchain-aws`. Usually these repos are cloned at the same level as this monorepo, so if needed, you can refer to their code directly by navigating to `../langchain-google/` from this monorepo.
- **Testing layer** (`standard-tests/`): Standardized integration tests for partner integrations"""
    )

    doc3 = md_header_splits[2]
    assert doc3.metadata == {
        "Header 1": "Global development guidelines for the LangChain monorepo",
        "Header 2": "Project architecture and context",
        "Header 3": "Development tools & commands**",
    }
    assert (
        doc3.page_content
        == """- `uv` – Fast Python package installer and resolver (replaces pip/poetry)
- `make` – Task runner for common development commands. Feel free to look at the `Makefile` for available commands and usage patterns.
- `ruff` – Fast Python linter and formatter
- `mypy` – Static type checking
- `pytest` – Testing framework
This monorepo uses `uv` for dependency management. Local development uses editable installs: `[tool.uv.sources]`
Each package in `libs/` has its own `pyproject.toml` and `uv.lock`.
```bash
# Run unit tests (no network)
make test
# Run specific test file
uv run --group test pytest tests/unit_tests/test_specific.py
```
```bash
# Lint code
make lint
# Format code
make format
# Type checking
uv run --group lint mypy .
```
#### Key config files
- pyproject.toml: Main workspace configuration with dependency groups
- uv.lock: Locked dependencies for reproducible builds
- Makefile: Development tasks
#### Commit standards
Suggest PR titles that follow Conventional Commits format. Refer to .github/workflows/pr_lint for allowed types and scopes.
#### Pull request guidelines
- Always add a disclaimer to the PR description mentioning how AI agents are involved with the contribution.
- Describe the "why" of the changes, why the proposed solution is the right one. Limit prose.
- Highlight areas of the proposed changes that require careful review."""
    )

    doc4 = md_header_splits[3]
    assert doc4.metadata == {
        "Header 1": "Global development guidelines for the LangChain monorepo",
        "Header 2": "Core development principles",
        "Header 3": "Maintain stable public interfaces",
    }
    assert (
        doc4.page_content
        == '''CRITICAL: Always attempt to preserve function signatures, argument positions, and names for exported/public methods. Do not make breaking changes.
**Before making ANY changes to public APIs:**
- Check if the function/class is exported in `__init__.py`
- Look for existing usage patterns in tests and examples
- Use keyword-only arguments for new parameters: `*, new_param: str = "default"`
- Mark experimental features clearly with docstring warnings (using MkDocs Material admonitions, like `!!! warning`)
Ask: "Would this change break someone's code if they used it last week?"'''
    )

    doc5 = md_header_splits[4]
    assert doc5.metadata == {
        "Header 1": "Global development guidelines for the LangChain monorepo",
        "Header 2": "Core development principles",
        "Header 3": "Code quality standards",
    }
    assert (
        doc5.page_content
        == '''All Python code MUST include type hints and return types.
```python title="Example"
def filter_unknown_users(users: list[str], known_users: set[str]) -> list[str]:
    """Single line description of the function.
    Any additional context about the function can go here.
    Args:
        users: List of user identifiers to filter.
        known_users: Set of known/valid user identifiers.
    Returns:
        List of users that are not in the known_users set.
    """
```
- Use descriptive, self-explanatory variable names.
- Follow existing patterns in the codebase you're modifying
- Attempt to break up complex functions (>20 lines) into smaller, focused functions where it makes sense'''
    )

    doc6 = md_header_splits[5]
    assert doc6.metadata == {
        "Header 1": "Global development guidelines for the LangChain monorepo",
        "Header 2": "Core development principles",
        "Header 3": "Testing requirements",
    }
    assert (
        doc6.page_content
        == """Every new feature or bugfix MUST be covered by unit tests.
- Unit tests: `tests/unit_tests/` (no network calls allowed)
- Integration tests: `tests/integration_tests/` (network calls permitted)
- We use `pytest` as the testing framework; if in doubt, check other existing tests for examples.
- The testing file structure should mirror the source code structure.
**Checklist:**
- [ ] Tests fail when your new logic is broken
- [ ] Happy path is covered
- [ ] Edge cases and error conditions are tested
- [ ] Use fixtures/mocks for external dependencies
- [ ] Tests are deterministic (no flaky tests)
- [ ] Does the test suite fail if your new logic is broken?"""
    )

    doc7 = md_header_splits[6]
    assert doc7.metadata == {
        "Header 1": "Global development guidelines for the LangChain monorepo",
        "Header 2": "Core development principles",
        "Header 3": "Security and risk assessment",
    }
    assert (
        doc7.page_content
        == """- No `eval()`, `exec()`, or `pickle` on user-controlled input
- Proper exception handling (no bare `except:`) and use a `msg` variable for error messages
- Remove unreachable/commented code before committing
- Race conditions or resource leaks (file handles, sockets, threads).
- Ensure proper resource cleanup (file handles, connections)"""
    )

    doc8 = md_header_splits[7]
    assert doc8.metadata == {
        "Header 1": "Global development guidelines for the LangChain monorepo",
        "Header 2": "Core development principles",
        "Header 3": "Documentation standards",
    }
    assert (
        doc8.page_content
        == '''Use Google-style docstrings with Args section for all public functions.
```python title="Example"
def send_email(to: str, msg: str, *, priority: str = "normal") -> bool:
    """Send an email to a recipient with specified priority.
    Any additional context about the function can go here.
    Args:
        to: The email address of the recipient.
        msg: The message body to send.
        priority: Email priority level.
    Returns:
        `True` if email was sent successfully, `False` otherwise.
    Raises:
        InvalidEmailError: If the email address format is invalid.
        SMTPConnectionError: If unable to connect to email server.
    """
```
- Types go in function signatures, NOT in docstrings
- If a default is present, DO NOT repeat it in the docstring unless there is post-processing or it is set conditionally.
- Focus on "why" rather than "what" in descriptions
- Document all parameters, return values, and exceptions
- Keep descriptions concise but clear
- Ensure American English spelling (e.g., "behavior", not "behaviour")'''
    )

    doc9 = md_header_splits[8]
    assert doc9.metadata == {
        "Header 1": "Global development guidelines for the LangChain monorepo",
        "Header 2": "Additional resources",
    }
    assert (
        doc9.page_content
        == """- **Documentation:** https://docs.langchain.com/oss/python/langchain/overview and source at https://github.com/langchain-ai/docs or `../docs/`. Prefer the local install and use file search tools for best results. If needed, use the docs MCP server as defined in `.mcp.json` for programmatic access.
- **Contributing Guide:** [`.github/CONTRIBUTING.md`](https://docs.langchain.com/oss/python/contributing/overview)"""
    )
Like this:
headers_to_split_on = [
("#", "Header 1"),
("##", "Header 2"),
("###", "Header 3"),
]
markdown_splitter = MarkdownHeaderTextSplitter(headers_to_split_on)
md_header_splits = markdown_splitter.split_text(markdown)
In this case the ## heading had no body of its own, so the ### content appears immediately. Still, given the behavior we have seen so far, the result is easy enough to predict.
doc2 = md_header_splits[1]
assert doc2.metadata == {
    "Header 1": "Global development guidelines for the LangChain monorepo",
    "Header 2": "Project architecture and context",
    "Header 3": "Monorepo structure",
}
assert (
    doc2.page_content
    == """This is a Python monorepo with multiple independently versioned packages that use `uv`.
```txt
langchain/
├── libs/
│ ├── core/ # `langchain-core` primitives and base abstractions
│ ├── langchain/ # `langchain-classic` (legacy, no new features)
│ ├── langchain_v1/ # Actively maintained `langchain` package
│ ├── partners/ # Third-party integrations
│ │ ├── openai/ # OpenAI models and embeddings
│ │ ├── anthropic/ # Anthropic (Claude) integration
│ │ ├── ollama/ # Local model support
│ │ └── ... (other integrations maintained by the LangChain team)
│ ├── text-splitters/ # Document chunking utilities
│ ├── standard-tests/ # Shared test suite for integrations
│ ├── model-profiles/ # Model configuration profiles
│ └── cli/ # Command-line interface tools
├── .github/ # CI/CD workflows and templates
├── .vscode/ # VSCode IDE standard settings and recommended extensions
└── README.md # Information about LangChain
```
- **Core layer** (`langchain-core`): Base abstractions, interfaces, and protocols. Users should not need to know about this layer directly.
- **Implementation layer** (`langchain`): Concrete implementations and high-level public utilities
- **Integration layer** (`partners/`): Third-party service integrations. Note that this monorepo is not exhaustive of all LangChain integrations; some are maintained in separate repos, such as `langchain-ai/langchain-google` and `langchain-ai/langchain-aws`. Usually these repos are cloned at the same level as this monorepo, so if needed, you can refer to their code directly by navigating to `../langchain-google/` from this monorepo.
- **Testing layer** (`standard-tests/`): Standardized integration tests for partner integrations"""
)
That's about it for this time.
Conclusion
I tried splitting Markdown with LangChain's MarkdownHeaderTextSplitter.
The behavior itself was easy to understand, but when splitting on # or ## sections, depending on the target document it would sometimes pick up comments inside code blocks as headers.
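That pitfall is easy to reproduce with a naive line-based approach (a deliberately simplistic sketch, not LangChain's code): without tracking fences, a bash comment like `# Lint code` inside a code block is indistinguishable from an H1 heading.

```python
def naive_header_lines(markdown: str) -> list[str]:
    """Treat every line starting with '#' as a header, ignoring code fences.

    Deliberately broken sketch: shell comments inside fenced code blocks
    get misidentified as headers.
    """
    return [line for line in markdown.splitlines() if line.startswith("#")]


def fence_aware_header_lines(markdown: str) -> list[str]:
    """Skip lines inside ``` fences before checking for headers."""
    headers: list[str] = []
    in_fence = False
    for line in markdown.splitlines():
        if line.lstrip().startswith("```"):
            in_fence = not in_fence  # Toggle on opening/closing fences.
            continue
        if not in_fence and line.startswith("#"):
            headers.append(line)
    return headers
```

Even fence tracking only gets you so far (indented code blocks, `#` in setext contexts, and so on), which is why a real Markdown parser is the safer route.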
Watching behavior like this, it is clear that unless you parse the Markdown with a proper parser, you can end up with odd results...
It was a good learning experience...