Using HTMLSemanticPreservingSplitter for Structured HTML Splitting
Description
The HTMLSemanticPreservingSplitter is designed to split HTML content into manageable chunks while preserving the semantic structure of important elements like tables, lists, and other complex HTML components. This ensures that such elements are not split across chunks, which is crucial for maintaining complete context for RAG.
At its heart, this splitter is designed to create contextually relevant chunks. General recursive splitting with HTMLHeaderTextSplitter can cause tables, lists, and other structured elements to be split in the middle, losing significant context and creating bad chunks.
IMPORTANT: max_chunk_size is not a hard upper bound on the size of a chunk. The size limit is applied while the preserved content is held out of the chunk, which ensures that the preserved content is never split. When the preserved content is added back into the chunk, the chunk size may exceed max_chunk_size. This is necessary to maintain the structure of the original document.
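The effect described above can be illustrated with a small standalone sketch. It is not the splitter's actual implementation: the function split_then_restore, the greedy word-based splitting, and the PRESERVED_0 placeholder name are all assumptions made for illustration. The size limit is checked while the table is represented by a short placeholder, and the full table text is substituted back only afterwards, so the restored chunk can be longer than the limit.

```python
def split_then_restore(text: str, preserved: dict, max_chunk_size: int) -> list:
    """Greedy word-based split that measures size BEFORE placeholders are restored."""
    chunks, current = [], ""
    for word in text.split():
        candidate = f"{current} {word}".strip()
        if current and len(candidate) > max_chunk_size:
            # The limit is enforced here, while preserved content is still a placeholder.
            chunks.append(current)
            current = word
        else:
            current = candidate
    if current:
        chunks.append(current)

    # Only now is the preserved content put back, which may push a chunk past the limit.
    restored = []
    for chunk in chunks:
        for placeholder, content in preserved.items():
            chunk = chunk.replace(placeholder, content)
        restored.append(chunk)
    return restored


# Hypothetical input: the table text is held out under an illustrative placeholder name.
table_text = "Item Quantity Price Apples 10 $1.00 Oranges 5 $0.50 Bananas 50 $1.50"
preserved = {"PRESERVED_0": table_text}
text = "Intro sentence before the table. PRESERVED_0 Closing sentence."

chunks = split_then_restore(text, preserved, max_chunk_size=50)
for chunk in chunks:
    print(len(chunk), repr(chunk))
```

In this sketch the first chunk fits within 50 characters while it still contains the placeholder, and grows past 50 once the table text is restored.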
Usage Example: Preserving Tables and Lists
In this example, we will demonstrate how the HTMLSemanticPreservingSplitter
can preserve a table and a large list within an HTML document. The chunk size will be set to 50 characters to illustrate how the splitter ensures that these elements are not split, even when they exceed the maximum defined chunk size.
from langchain_core.documents import Document
from langchain_text_splitters import HTMLSemanticPreservingSplitter
html_string = """
<!DOCTYPE html>
<html>
<body>
<div>
<h1>Section 1</h1>
<p>This section contains an important table and list that should not be split across chunks.</p>
<table>
<tr>
<th>Item</th>
<th>Quantity</th>
<th>Price</th>
</tr>
<tr>
<td>Apples</td>
<td>10</td>
<td>$1.00</td>
</tr>
<tr>
<td>Oranges</td>
<td>5</td>
<td>$0.50</td>
</tr>
<tr>
<td>Bananas</td>
<td>50</td>
<td>$1.50</td>
</tr>
</table>
<h2>Subsection 1.1</h2>
<p>Additional text in subsection 1.1 that is separated from the table and list.</p>
<p>Here is a detailed list:</p>
<ul>
<li>Item 1: Description of item 1, which is quite detailed and important.</li>
<li>Item 2: Description of item 2, which also contains significant information.</li>
<li>Item 3: Description of item 3, another item that we don't want to split across chunks.</li>
</ul>
</div>
</body>
</html>
"""
headers_to_split_on = [("h1", "Header 1"), ("h2", "Header 2")]
splitter = HTMLSemanticPreservingSplitter(
headers_to_split_on=headers_to_split_on,
max_chunk_size=50,
elements_to_preserve=["table", "ul"],
)
documents = splitter.split_text(html_string)
print(documents)
"""
[
Document(metadata={'Header 1': 'Section 1'}, page_content='This section contains an important table and list'),
Document(metadata={'Header 1': 'Section 1'}, page_content='that should not be split across chunks.'),
Document(metadata={'Header 1': 'Section 1'}, page_content='Item Quantity Price Apples 10 $1.00 Oranges 5 $0.50 Bananas 50 $1.50'),
Document(metadata={'Header 2': 'Subsection 1.1'}, page_content='Additional text in subsection 1.1 that is'),
Document(metadata={'Header 2': 'Subsection 1.1'}, page_content='separated from the table and list. Here is a'),
Document(metadata={'Header 2': 'Subsection 1.1'}, page_content="detailed list: Item 1: Description of item 1, which is quite detailed and important. Item 2: Description of item 2, which also contains significant information. Item 3: Description of item 3, another item that we don't want to split across chunks.")
]
"""
Explanation
In this example, the HTMLSemanticPreservingSplitter ensures that the entire table and the unordered list (<ul>) are preserved within their respective chunks. Even though the chunk size is set to 50 characters, the splitter recognizes that these elements should not be split and keeps them intact.
This is particularly important when dealing with data tables or lists, where splitting the content could lead to loss of context or confusion. The resulting Document
objects retain the full structure of these elements, ensuring that the contextual relevance of the information is maintained.
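To see which chunks ended up larger than the configured size because of preserved elements, a small check along these lines may help. The helper name oversized_chunks is hypothetical, and SimpleNamespace objects are used here as stand-ins for the first three Document objects in the printed output, since only the page_content attribute is needed.

```python
from types import SimpleNamespace


def oversized_chunks(documents, max_chunk_size):
    """Return (length, page_content) pairs for chunks longer than max_chunk_size."""
    return [
        (len(doc.page_content), doc.page_content)
        for doc in documents
        if len(doc.page_content) > max_chunk_size
    ]


# Stand-ins for Document objects; the text is copied from the example output.
documents = [
    SimpleNamespace(page_content="This section contains an important table and list"),
    SimpleNamespace(page_content="that should not be split across chunks."),
    SimpleNamespace(
        page_content="Item Quantity Price Apples 10 $1.00 Oranges 5 $0.50 Bananas 50 $1.50"
    ),
]

result = oversized_chunks(documents, max_chunk_size=50)
for length, content in result:
    print(length, content)
```

Only the chunk holding the preserved table is reported; the two plain-text chunks stay within the 50-character limit.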
Usage Example: Using a Custom Handler
The HTMLSemanticPreservingSplitter allows you to define custom handlers for specific HTML elements. Some platforms have custom HTML tags that are not natively parsed by BeautifulSoup; when this occurs, you can use custom handlers to add the formatting logic easily.
This can be particularly useful for elements that require special processing, such as <iframe> tags. In this example, we'll create a custom handler for iframe tags that converts them into Markdown-like links.
def custom_iframe_extractor(iframe_tag):
    iframe_src = iframe_tag.get("src", "")
    return f"[iframe:{iframe_src}]({iframe_src})"
splitter = HTMLSemanticPreservingSplitter(
headers_to_split_on=headers_to_split_on,
max_chunk_size=50,
separators=["\n\n", "\n", ". "],
elements_to_preserve=["table", "ul", "ol"],
custom_handlers={"iframe": custom_iframe_extractor},
)
html_string = """
<!DOCTYPE html>
<html>
<body>
<div>
<h1>Section with Iframe</h1>
<iframe src="https://example.com/embed"></iframe>
<p>Some text after the iframe.</p>
<ul>
<li>Item 1: Description of item 1, which is quite detailed and important.</li>
<li>Item 2: Description of item 2, which also contains significant information.</li>
<li>Item 3: Description of item 3, another item that we don't want to split across chunks.</li>
</ul>
</div>
</body>
</html>
"""
documents = splitter.split_text(html_string)
print(documents)
"""
[
Document(metadata={'Header 1': 'Section with Iframe'}, page_content='[iframe:https://example.com/embed](https://example.com/embed) Some text after the iframe'),
Document(metadata={'Header 1': 'Section with Iframe'}, page_content=". Item 1: Description of item 1, which is quite detailed and important. Item 2: Description of item 2, which also contains significant information. Item 3: Description of item 3, another item that we don't want to split across chunks.")
]
"""
Explanation
In this example, we defined a custom handler for iframe
tags that converts them into Markdown-like links. When the splitter processes the HTML content, it uses this custom handler to transform the iframe
tags while preserving other elements like tables and lists. The resulting Document
objects show how the iframe is handled according to the custom logic you provided.
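Conceptually, a custom_handlers mapping works like a lookup table keyed by tag name. The sketch below illustrates that idea only and is not the library's internal code: render_element is a hypothetical helper, and plain dicts stand in for parsed tags because the handler only calls .get() on them.

```python
def custom_iframe_extractor(iframe_tag):
    iframe_src = iframe_tag.get("src", "")
    return f"[iframe:{iframe_src}]({iframe_src})"


def render_element(tag_name, tag, default_text, custom_handlers):
    """Use the handler registered for tag_name, else fall back to default_text."""
    handler = custom_handlers.get(tag_name)
    if handler is not None:
        return handler(tag)
    return default_text


custom_handlers = {"iframe": custom_iframe_extractor}

# Dict stand-ins for tags: an <iframe> with a src attribute, and a <p> with none.
iframe_text = render_element(
    "iframe", {"src": "https://example.com/embed"}, "", custom_handlers
)
paragraph_text = render_element("p", {}, "Some text after the iframe.", custom_handlers)

print(iframe_text)      # [iframe:https://example.com/embed](https://example.com/embed)
print(paragraph_text)   # Some text after the iframe.
```

Tags without a registered handler fall through to their default text, which is why the paragraph is unchanged.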
Important: When preserving items such as links, be careful not to include a bare . in your separators, and do not leave separators blank. RecursiveCharacterTextSplitter splits on the full stop, which will cut links in half. Make sure you provide a separator list that uses ". " (a full stop followed by a space) instead.
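The difference between the two separators can be seen with plain string splitting. This is a simplified illustration that uses str.split rather than the splitter itself, so it only shows where the cut points fall.

```python
link = "[iframe:https://example.com/embed](https://example.com/embed)"
text = link + ". Some text after the iframe."

# Splitting on a bare full stop also cuts at the dots inside the URL.
pieces_on_dot = text.split(".")

# Splitting on a full stop followed by a space only cuts at the sentence boundary.
pieces_on_dot_space = text.split(". ")

print(any(link in piece for piece in pieces_on_dot))        # False
print(any(link in piece for piece in pieces_on_dot_space))  # True
```

With the bare full stop the link is broken into several pieces, while with ". " it survives intact in a single piece.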
Usage Example: Using a custom handler to analyze an image with an LLM
With custom handlers, we can also override the default processing for any element. A great example of this is inserting semantic analysis of an image within a document, directly in the chunking flow.
Since our function is called when the tag is discovered, we can override the <a> tag and turn off preserve_links to insert any content we would like to embed in our chunks.
"""This example assumes you have helper methods `load_image_from_url` and an LLM agent `llm` that can process image data."""
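# Placeholder standing in for your own agent or model; replace it with a real one.
# Only an `invoke` method is assumed, since that is all the handler below calls.
class PlaceholderLLM:
    def invoke(self, image_data: bytes) -> str:
        return "This is an llm description of the image."
llm = PlaceholderLLM()
# This method is a placeholder for loading image data from a URL and is not implemented here
def load_image_from_url(image_url: str) -> bytes:
    # Assuming this method fetches the image data from the URL
    return b"image_data"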
html_string = """
<!DOCTYPE html>
<html>
<body>
<div>
<h1>Section with Image and Link</h1>
<a href="https://example.com">
<img src="https://example.com/image.jpg" alt="An example image" />
</a>
<p>Some text after the image.</p>
<ul>
<li>Item 1: Description of item 1, which is quite detailed and important.</li>
<li>Item 2: Description of item 2, which also contains significant information.</li>
<li>Item 3: Description of item 3, another item that we don't want to split across chunks.</li>
</ul>
</div>
</body>
</html>
"""
def custom_image_handler(img_tag) -> str:
    # The handler is registered for "a", so the tag passed in is the <a> element itself;
    # attributes it does not have fall back to the defaults given here.
    img_src = img_tag.get("src", "")
    img_alt = img_tag.get("alt", "No alt text provided")
    image_data = load_image_from_url(img_src)
    semantic_meaning = llm.invoke(image_data)
    markdown_text = f"[Image Alt Text: {img_alt} | Image Source: {img_src} | Image Semantic Meaning: {semantic_meaning}]"
    return markdown_text
splitter = HTMLSemanticPreservingSplitter(
headers_to_split_on=headers_to_split_on,
max_chunk_size=50,
separators=["\n\n", "\n", ". "],
elements_to_preserve=["table", "ul", "ol"],
preserve_links=False,
custom_handlers={"a": custom_image_handler},
)
documents = splitter.split_text(html_string)
print(documents)
"""
[Document(metadata={'Header 1': 'Section with Image and Link'}, page_content='[Image Alt Text: No alt text provided | Image Source: | Image Semantic Meaning: This is an llm description of the image.] Some text after the image'),
Document(metadata={'Header 1': 'Section with Image and Link'}, page_content=". Item 1: Description of item 1, which is quite detailed and important. Item 2: Description of item 2, which also contains significant information. Item 3: Description of item 3, another item that we don't want to split across chunks.")
]
"""
Explanation
With our custom handler written to extract specific fields from an <a> element in HTML, we can further process the data with our agent and insert the result directly into our chunk. It is important to ensure preserve_links is set to False; otherwise, the default processing of <a> elements will take place.
This makes it easy to implement text processing anywhere within the HTML document. We can write similar processors for videos and other elements we may want formatted in specific ways.
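As a rough sketch of what a similar processor for videos might look like, the handler below turns a <video> tag into a Markdown-like link. The name custom_video_handler and the output format are assumptions for illustration. It reads only the src and title attributes of the tag itself, so a <video> that declares its media through nested <source> elements would need extra logic. A plain dict stands in for the parsed tag here, because the handler only calls .get().

```python
def custom_video_handler(video_tag) -> str:
    # Read attributes from the tag, with defaults for anything missing.
    video_src = video_tag.get("src", "")
    video_title = video_tag.get("title", "Untitled video")
    return f"[Video: {video_title}]({video_src})"


# It would be registered the same way as the other handlers, for example:
# custom_handlers={"video": custom_video_handler}

# Dict stand-in for a parsed <video> tag.
stand_in_tag = {"src": "https://example.com/clip.mp4", "title": "Demo clip"}
video_text = custom_video_handler(stand_in_tag)
print(video_text)  # [Video: Demo clip](https://example.com/clip.mp4)
```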
Conclusion
The HTMLSemanticPreservingSplitter
is essential for splitting HTML content that includes structured elements like tables and lists, especially when it's critical to preserve these elements intact. Additionally, its ability to define custom handlers for specific HTML tags makes it a versatile tool for processing complex HTML documents. By using this splitter, you can ensure that your documents maintain their context and readability, even when split into smaller chunks.