Version: 8.1

Parsing XML with Java Libraries

What is the DOM Parser?

The Document Object Model (DOM) parser reads an entire XML document into memory and represents it as a tree of nodes that can be navigated and modified. It's commonly used because the tree is easy to work with, though holding the whole document in memory makes it less suited to very large files. Each element in the XML becomes a node in the tree. For more information on interfacing with the DOM parser, refer to the Java XML DOM Parser Documentation.

Using the DOM Parser

There are several ways to load XML data with the DOM parser, depending on where the data is stored: it can read from an XML file given its path, or from a string. Either way, parsing returns a Document object, and the document's root element is available through getDocumentElement().

Jython - Reading a File

from javax.xml.parsers import DocumentBuilderFactory
from java.io import File

# Define your XML file path
xmlFilePath = "file.xml" # Replace with your actual XML file path

# Create a DOM document builder
builderFactory = DocumentBuilderFactory.newInstance()
builder = builderFactory.newDocumentBuilder()

# Parse the XML file
file = File(xmlFilePath)
document = builder.parse(file)

# Access the root element
root = document.getDocumentElement()

Jython - Reading from a String

from javax.xml.parsers import DocumentBuilderFactory
from java.io import ByteArrayInputStream

# Define your XML string
xmlString = """
<employee id="1234">
<name>John Smith</name>
<start_date>2010-11-26</start_date>
<department>IT</department>
<title>Tech Support</title>
</employee>
""" # Replace with your actual XML string

# Create a DOM document builder
builderFactory = DocumentBuilderFactory.newInstance()
builder = builderFactory.newDocumentBuilder()

# Parse the XML string
stream = ByteArrayInputStream(xmlString.encode('utf-8'))
document = builder.parse(stream)

# Access the root element
root = document.getDocumentElement()

Each tag is represented as an Element object. In the example above, the root element is the employee tag. Elements can carry attributes inside the opening tag; here, the employee element has an id attribute with a value of 1234. Elements can also contain data between the start and end tags, either text or further child elements. This data can be read through the Element object's methods.

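Function | Description | Example | Output
--- | --- | --- | ---
Element.getTagName() | Returns the name of the element's tag. | print(root.getTagName()) | employee
Element.getAttribute(name) | Returns the value of the named attribute as a string, or an empty string if the attribute is not set. | print(root.getAttribute("id")) | 1234
Element.getTextContent() | Returns the text inside the element, including the text of any descendants. | print(root.getElementsByTagName("title").item(0).getTextContent()) | Tech Support
Element.getChildNodes() | Returns a NodeList of the element's children. The list includes the whitespace text nodes between child elements, so check getNodeType() before treating a child as an element. | print(root.getChildNodes().getLength()) | 9
Element.getElementsByTagName(name) | Returns a NodeList of descendant elements with the given tag name ("*" matches every element). Use item(index) to pick one. | print(root.getElementsByTagName("*").item(2).getTagName()) | department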
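
As a rough illustration of walking an element's children, the sketch below uses Python's standard-library xml.dom.minidom as a stand-in for the Java parser so that it can run in a plain Python interpreter. That substitution is an assumption made for this example, and child_elements and element_text are hypothetical helper names. The helpers rely only on childNodes, length, item(), nodeType, and nodeValue, which the W3C DOM defines, and they skip the whitespace text nodes that sit between child elements.

```python
from xml.dom import Node
from xml.dom.minidom import parseString

# Same employee record as the string example above
xml_string = """<employee id="1234">
<name>John Smith</name>
<start_date>2010-11-26</start_date>
<department>IT</department>
<title>Tech Support</title>
</employee>"""


def child_elements(node):
    """Return the element children of node, skipping text nodes."""
    children = node.childNodes
    found = []
    for i in range(children.length):
        child = children.item(i)
        if child.nodeType == Node.ELEMENT_NODE:
            found.append(child)
    return found


def element_text(element):
    """Join the text nodes that sit directly inside element."""
    children = element.childNodes
    parts = []
    for i in range(children.length):
        child = children.item(i)
        if child.nodeType == Node.TEXT_NODE:
            parts.append(child.nodeValue)
    return "".join(parts)


root = parseString(xml_string).documentElement
print(root.tagName)             # employee
print(root.getAttribute("id"))  # 1234
for child in child_elements(root):
    print("%s: %s" % (child.tagName, element_text(child)))
```

On a Java DOM Element the corresponding methods are getChildNodes(), getLength(), item(), getNodeType(), and getNodeValue().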

A Simple Employee Example

Using the functions above, let's parse an XML string and extract employee data. We'll demonstrate how to access different elements and attributes and display them. Here's a simple XML string representing employee information:

XML String
<employees>
<employee id="1">
<name>John Doe</name>
<department>Engineering</department>
</employee>
<employee id="2">
<name>Jane Smith</name>
<department>Marketing</department>
</employee>
</employees>

We can then use Java libraries to parse this XML string and access the employee data. Let's iterate through the XML elements and print out the employee details:

from javax.xml.parsers import DocumentBuilderFactory
from java.io import ByteArrayInputStream

# xmlString is assumed to hold the XML string shown above

# Create a DOM document builder
builderFactory = DocumentBuilderFactory.newInstance()
builder = builderFactory.newDocumentBuilder()

# Parse the XML string
document = builder.parse(ByteArrayInputStream(xmlString.encode('utf-8')))

# Access the root element
root = document.getDocumentElement()

# Iterate through employees, reading the NodeList with getLength() and item()
employees = root.getElementsByTagName("employee")
for i in range(employees.getLength()):
    employee = employees.item(i)
    # Convert the id attribute to an integer
    employeeId = int(employee.getAttribute("id"))
    # getTextContent() returns the text between the start and end tags
    name = employee.getElementsByTagName("name").item(0).getTextContent()
    department = employee.getElementsByTagName("department").item(0).getTextContent()
    print("Employee ID: %d" % employeeId)
    print("Name: %s" % name)
    print("Department: %s" % department)
    print("")

What is the SAX Parser?

The Simple API for XML (SAX) parser, available through Java libraries, provides an event-driven approach to parse XML documents. It's widely used for its efficiency, especially when handling large XML files. SAX parses XML sequentially and triggers events as it encounters elements, attributes, and other components in the XML document. For more detailed information about the SAX parser, refer to the Java XML SAX Parser Documentation.

Using the SAX Parser

The SAX parser doesn't build a tree structure like the DOM parser. Instead, it parses the XML document sequentially and triggers events that the developer can handle. Here's an example of using the SAX parser to parse an XML string:

Jython - Reading from a String

from javax.xml.parsers import SAXParserFactory
from org.xml.sax.helpers import DefaultHandler
from java.io import ByteArrayInputStream
from java.lang import String

# Define your XML string
xmlString = """
<employees>
<employee id="1">
<name>John Doe</name>
<department>Engineering</department>
</employee>
<employee id="2">
<name>Jane Smith</name>
<department>Marketing</department>
</employee>
</employees>
""" # Replace with your actual XML string

# Define a custom ContentHandler
class MyContentHandler(DefaultHandler):
    def startElement(self, uri, localName, qName, attributes):
        print("Start Element: %s" % qName)
        # Read attributes by index
        for i in range(attributes.getLength()):
            print("Attribute: %s = %s" % (attributes.getQName(i), attributes.getValue(i)))

    def endElement(self, uri, localName, qName):
        print("End Element: %s" % qName)

    def characters(self, ch, start, length):
        # ch is a Java char array; build a string from the reported range
        print("Character Data: %s" % String(ch, start, length))

# Create a SAX parser
saxParserFactory = SAXParserFactory.newInstance()
saxParser = saxParserFactory.newSAXParser()

# Parse the XML string
stream = ByteArrayInputStream(xmlString.encode('utf-8'))
saxParser.parse(stream, MyContentHandler())

In the above example, we define a custom ContentHandler class that extends DefaultHandler. This class overrides methods to handle different events encountered during XML parsing, such as starting and ending elements, and character data.

A Simple Example Using SAX Parser

Let's parse a simple XML string containing employee data using the SAX parser. Here's a sample XML string representing employee information:

XML String
<employees>
<employee id="1">
<name>John Doe</name>
<department>Engineering</department>
</employee>
<employee id="2">
<name>Jane Smith</name>
<department>Marketing</department>
</employee>
</employees>

We can then use the SAX parser to extract employee details and print them out:

from javax.xml.parsers import SAXParserFactory
from org.xml.sax.helpers import DefaultHandler
from java.io import ByteArrayInputStream
from java.lang import String

# xmlString is assumed to hold the XML string shown above

# Define a custom ContentHandler
class MyContentHandler(DefaultHandler):
    def startElement(self, uri, localName, qName, attributes):
        print("Start Element: %s" % qName)
        # Read attributes by index with getLength(), getQName(), and getValue()
        for i in range(attributes.getLength()):
            print("Attribute: %s = %s" % (attributes.getQName(i), attributes.getValue(i)))

    def endElement(self, uri, localName, qName):
        print("End Element: %s" % qName)

    def characters(self, content, start, length):
        # content is a Java char array; build a string from the reported range
        print("Character Data: %s" % String(content, start, length))

# Create a SAX parser
saxParserFactory = SAXParserFactory.newInstance()
saxParser = saxParserFactory.newSAXParser()

# Parse the XML string
saxParser.parse(ByteArrayInputStream(xmlString.encode('utf-8')), MyContentHandler())
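
The handler above prints each event as it arrives. To turn those events into records, a handler typically has to remember which element it is inside and buffer character data until the closing tag, because a SAX parser may deliver the text of one element in several characters calls. The sketch below shows that pattern with Python's standard-library xml.sax, used here only as a stand-in so the example can run in a plain Python interpreter. Its callbacks take (name, attrs) and (content) rather than the Java signatures shown above, and EmployeeCollector is a hypothetical name.

```python
import xml.sax

xml_string = """<employees>
<employee id="1">
<name>John Doe</name>
<department>Engineering</department>
</employee>
<employee id="2">
<name>Jane Smith</name>
<department>Marketing</department>
</employee>
</employees>"""


class EmployeeCollector(xml.sax.ContentHandler):
    """Collect one dict per <employee> element."""

    def __init__(self):
        xml.sax.ContentHandler.__init__(self)
        self.employees = []
        self._current = None  # dict for the employee being read
        self._buffer = []     # text chunks seen since the last start tag

    def startElement(self, name, attrs):
        self._buffer = []
        if name == "employee":
            self._current = {"id": int(attrs.getValue("id"))}

    def characters(self, content):
        # May fire more than once per element, so only accumulate here
        self._buffer.append(content)

    def endElement(self, name):
        text = "".join(self._buffer).strip()
        self._buffer = []
        if name == "employee":
            self.employees.append(self._current)
            self._current = None
        elif self._current is not None:
            # A child of <employee>: store its text under the tag name
            self._current[name] = text


handler = EmployeeCollector()
xml.sax.parseString(xml_string.encode("utf-8"), handler)
for employee in handler.employees:
    print("%d: %s (%s)" % (employee["id"], employee["name"], employee["department"]))
```

The same bookkeeping should carry over to a DefaultHandler subclass, with the element name taken from qName and the text built from the char array passed to characters.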

What is the StAX Parser?

The Streaming API for XML (StAX) parser, available through Java libraries, offers a cursor-based approach to parse XML documents. It provides an efficient way to read and process XML sequentially without loading the entire document into memory. StAX parsers allow developers to iterate through XML elements, attributes, and other components as they are encountered in the XML stream. For more detailed information about the StAX parser, refer to the Java XML StAX Parser Documentation.

Using the StAX Parser

The StAX parser operates in a streaming fashion, allowing developers to read XML content sequentially without the need to build a complete in-memory representation of the XML document. Here's an example of using the StAX parser to parse an XML string:

Jython - Reading from a String

from javax.xml.stream import XMLInputFactory, XMLStreamConstants
from java.io import ByteArrayInputStream

# xmlString is assumed to hold the XML string defined in the SAX example

# Create an XML input factory
inputFactory = XMLInputFactory.newInstance()

# Create an XML stream reader
streamReader = inputFactory.createXMLStreamReader(ByteArrayInputStream(xmlString.encode('utf-8')))

# Iterate through the XML stream, pulling one event at a time
while streamReader.hasNext():
    event = streamReader.next()
    if event == XMLStreamConstants.START_ELEMENT:
        print("Start Element: %s" % streamReader.getLocalName())
        # Print attributes, if any
        for i in range(streamReader.getAttributeCount()):
            print("Attribute: %s = %s" % (streamReader.getAttributeLocalName(i), streamReader.getAttributeValue(i)))
    elif event == XMLStreamConstants.END_ELEMENT:
        print("End Element: %s" % streamReader.getLocalName())
    elif event == XMLStreamConstants.CHARACTERS:
        print("Character Data: %s" % streamReader.getText())

# Release the reader's resources
streamReader.close()

The above example demonstrates how to create an XML input factory and a stream reader to parse the XML content. We iterate through the XML stream and handle different events such as starting and ending elements, as well as character data.

A Simple Example Using StAX Parser

Let's parse a simple XML string containing employee data using the StAX parser. Here's a sample XML string representing employee information:

XML String
<employees>
<employee id="1">
<name>John Doe</name>
<department>Engineering</department>
</employee>
<employee id="2">
<name>Jane Smith</name>
<department>Marketing</department>
</employee>
</employees>

We can then use the StAX parser to extract employee details and print them out:

from javax.xml.stream import XMLInputFactory, XMLStreamConstants
from java.io import ByteArrayInputStream

# xmlString is assumed to hold the XML string shown above

# Create an XML input factory
inputFactory = XMLInputFactory.newInstance()

# Create an XML stream reader
streamReader = inputFactory.createXMLStreamReader(ByteArrayInputStream(xmlString.encode('utf-8')))

# Iterate through the XML stream, pulling one event at a time
while streamReader.hasNext():
    event = streamReader.next()
    if event == XMLStreamConstants.START_ELEMENT:
        print("Start Element: %s" % streamReader.getLocalName())
        # Print attributes, if any
        for i in range(streamReader.getAttributeCount()):
            print("Attribute: %s = %s" % (streamReader.getAttributeLocalName(i), streamReader.getAttributeValue(i)))
    elif event == XMLStreamConstants.END_ELEMENT:
        print("End Element: %s" % streamReader.getLocalName())
    elif event == XMLStreamConstants.CHARACTERS:
        print("Character Data: %s" % streamReader.getText())

# Release the reader's resources
streamReader.close()
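
With a pull parser, the loop that requests events can also hold the parsing state, so no handler class is needed to build records. The sketch below illustrates that idea with Python's standard-library xml.dom.pulldom, which is used here as a stand-in for StAX so the example can run in a plain Python interpreter; that substitution is an assumption made for this example. The variable names current and field are hypothetical.

```python
from xml.dom import pulldom

xml_string = """<employees>
<employee id="1">
<name>John Doe</name>
<department>Engineering</department>
</employee>
<employee id="2">
<name>Jane Smith</name>
<department>Marketing</department>
</employee>
</employees>"""

employees = []
current = None  # dict for the employee being read
field = None    # tag name of the child element being read

# Each iteration pulls the next (event, node) pair from the stream
for event, node in pulldom.parseString(xml_string):
    if event == pulldom.START_ELEMENT:
        if node.tagName == "employee":
            current = {"id": int(node.getAttribute("id"))}
        elif current is not None:
            field = node.tagName
            current[field] = ""
    elif event == pulldom.CHARACTERS:
        # Ignore whitespace between elements; keep text inside a field
        if field is not None:
            current[field] += node.data
    elif event == pulldom.END_ELEMENT:
        if node.tagName == "employee":
            employees.append(current)
            current = None
        field = None

for employee in employees:
    print("%d: %s (%s)" % (employee["id"], employee["name"], employee["department"]))
```

In a StAX loop the same decisions would be driven by getLocalName(), getAttributeValue(None, "id"), and getText() on the stream reader.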