python - Parsing XML containing a default namespace to get an element value using lxml
I have an XML string like this:
str1 = """<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<sitemap>
<loc> http://www.example.org/sitemap_1.xml.gz </loc>
<lastmod>2015-07-01</lastmod>
</sitemap>
</sitemapindex>
"""
I want to extract the URLs inside the <loc> node, i.e. http://www.example.org/sitemap_1.xml.gz
I tried this code, but it didn't work:
from lxml import etree

root = etree.fromstring(str1)
urls = root.xpath("//loc/text()")
print(urls)
# prints: []
I checked whether the root node was formed correctly. I tried this and got back the same string as str1:
etree.tostring(root)
'<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n<sitemap>\n<loc>http://www.example.org/sitemap_1.xml.gz</loc>\n<lastmod>2015-07-01</lastmod>\n</sitemap>\n</sitemapindex>'
This is a common error when dealing with XML that has a default namespace. Your XML has a default namespace, that is, a namespace declared without a prefix, here:
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
Note that it is not only the element where the default namespace is declared that is in that namespace: all descendant elements inherit the ancestor's default namespace implicitly, unless otherwise specified (using an explicit namespace prefix, or a local default namespace that points to a different namespace URI). That means, in this case, all elements, including loc, are in the default namespace.
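One way to see this inheritance for yourself is to print the tag of every element after parsing. This is a minimal sketch using the same document as in the question; the variable names (xml_text, tags, loc_name) are only illustrative. lxml reports each tag in Clark notation, {namespace-uri}local-name, so the namespace shows up even on elements that carry no xmlns attribute of their own.

```python
from lxml import etree

# Same document as in the question; only the root declares xmlns.
xml_text = """<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<sitemap>
<loc>http://www.example.org/sitemap_1.xml.gz</loc>
<lastmod>2015-07-01</lastmod>
</sitemap>
</sitemapindex>"""

root = etree.fromstring(xml_text)

# Each tag comes back as "{namespace-uri}local-name" (Clark notation).
tags = [element.tag for element in root.iter()]
for tag in tags:
    print(tag)

# QName splits a qualified tag into its namespace URI and local name.
loc = root[0][0]  # sitemapindex -> sitemap -> loc
loc_name = etree.QName(loc)
print(loc_name.namespace, loc_name.localname)
```

Every printed tag starts with {http://www.sitemaps.org/schemas/sitemap/0.9}, which is why an XPath that asks for a plain loc (an element in no namespace) matches nothing.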
To select an element in a namespace, you'll need to define a prefix-to-namespace mapping and use the prefix in the XPath:
from lxml import etree

str1 = '''<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<sitemap>
<loc> http://www.example.org/sitemap_1.xml.gz </loc>
<lastmod>2015-07-01</lastmod>
</sitemap>
</sitemapindex>'''

root = etree.fromstring(str1)
# Map an arbitrary prefix ("d") to the document's default namespace URI.
ns = {"d": "http://www.sitemaps.org/schemas/sitemap/0.9"}
url = root.xpath("//d:loc", namespaces=ns)[0]
# encoding="unicode" returns text rather than bytes on Python 3.
print(etree.tostring(url, encoding="unicode"))
Output:
<loc xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"> http://www.example.org/sitemap_1.xml.gz </loc>
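If you want just the URL strings the question asked for, rather than the serialized element, the same prefix mapping should work with text(). The sketch below also shows a local-name() variant that needs no prefix mapping; the variable names (xml_text, urls, urls_by_local_name) are illustrative, and the local-name() form ignores namespaces entirely, so it would also match a loc element from any other namespace.

```python
from lxml import etree

xml_text = """<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<sitemap>
<loc> http://www.example.org/sitemap_1.xml.gz </loc>
<lastmod>2015-07-01</lastmod>
</sitemap>
</sitemapindex>"""

root = etree.fromstring(xml_text)
ns = {"d": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# text() selects the text nodes; strip() removes surrounding whitespace.
urls = [text.strip() for text in root.xpath("//d:loc/text()", namespaces=ns)]
print(urls)  # ['http://www.example.org/sitemap_1.xml.gz']

# Alternative without a prefix mapping: compare only the local name.
urls_by_local_name = [
    text.strip() for text in root.xpath("//*[local-name()='loc']/text()")
]
print(urls_by_local_name == urls)  # True
```

The prefix d is arbitrary: only the namespace URI it maps to has to match the one declared in the document.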