Sitemap creation

According to Google's Support pages, a sitemap is a file in which you list the web pages of your site to tell Google and other search engines how your site content is organized. Search engine crawlers such as Googlebot read this file to crawl your site more intelligently.

ACS Commons sitemap generator

The easiest (and probably the best) way to create a sitemap is to use the Sitemap Generator from the ACS Commons project.

You need to create a new OSGi configuration for the com.adobe.acs.commons.wcm.impl.SiteMapServlet service, and you might need an additional configuration for the AEM Externalizer service.

Here’s an example of a SiteMapServlet configuration. Create the file /apps/mysite/config/com.adobe.acs.commons.wcm.impl.SiteMapServlet.xml with the following content:

<?xml version="1.0" encoding="UTF-8"?>
<jcr:root
    xmlns:sling="http://sling.apache.org/jcr/sling/1.0"
    xmlns:cq="http://www.day.com/jcr/cq/1.0"
    xmlns:jcr="http://www.jcp.org/jcr/1.0"
    xmlns:nt="http://www.jcp.org/jcr/nt/1.0"
    jcr:primaryType="sling:OsgiConfig"
    sling.servlet.resourceTypes="[mysite/components/structure/homepage]"
    externalizer.domain="sitemap"
    include.lastmod="{Boolean}true"
    include.inherit="{Boolean}true"/>

The key setting is in line 8 (sling.servlet.resourceTypes): SiteMapServlet is bound to the site homepage (the page whose sling:resourceType property is set to mysite/components/structure/homepage), so a request for that page with the sitemap selector and xml extension, for example /content/mysite/en.sitemap.xml, generates sitemap URLs for the whole content subtree below the homepage.

It is also common to set sling.servlet.resourceTypes to the resource type of a language page (assuming your site has localized variants). Because all pages sit below the language page, SiteMapServlet picks up all the content when generating the sitemap.

Although it is not relevant for sitemap generation, note that the language page can have a redirectTarget property that points to the real home page. The logic that performs the redirect when that property exists usually lives in a ‘super page’ component inherited by all page components in the project.

Which approach to choose depends mainly on how your website content is organized.

SiteMapServlet skips hidden pages. For more control over which pages are excluded from the sitemap, use the exclude.property config option. See the official documentation page for all available config options.

Externalizer configuration

In the sample configuration above, line 9 (externalizer.domain) references the sitemap domain. This domain is defined in the OSGi configuration for the Externalizer service (e.g. /apps/mysite/configs/config/com.day.cq.commons.impl.ExternalizerImpl.xml):

<?xml version="1.0" encoding="UTF-8"?>
<jcr:root xmlns:sling="http://sling.apache.org/jcr/sling/1.0"
    xmlns:jcr="http://www.jcp.org/jcr/1.0" jcr:primaryType="sling:OsgiConfig"
    externalizer.domains="[local http://localhost:4502,author http://localhost:4502,publish http://localhost:4503,sitemap https://www.mysite.com]"/>

The domains local, author and publish are present by default; sitemap is the new one. Internally, SiteMapServlet uses the Externalizer service to look up the sitemap domain and prefixes every resource path with it (in this case https://www.mysite.com).
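
As a rough illustration of that prefixing, the sketch below models the externalizer.domains entries in plain Java. It is a simplified model and not the Externalizer API: the class ExternalizerSketch and its methods are hypothetical names, and the real service also applies resource resolver mappings, which are ignored here.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Simplified, hypothetical model of the externalizer.domains lookup (not the AEM API). */
final class ExternalizerSketch {
    private final Map<String, String> domains = new LinkedHashMap<>();

    /** Each entry has the form "name scheme://host[:port]", as in the OSGi config above. */
    ExternalizerSketch(String... entries) {
        for (String entry : entries) {
            String[] parts = entry.trim().split("\\s+", 2);
            if (parts.length != 2) {
                throw new IllegalArgumentException("Malformed domain entry: " + entry);
            }
            domains.put(parts[0], parts[1]);
        }
    }

    /** Prefixes the path with the base URL configured for the given domain name. */
    String externalLink(String domainName, String path) {
        String base = domains.get(domainName);
        if (base == null) {
            throw new IllegalArgumentException("Unknown domain: " + domainName);
        }
        return base + path;
    }

    public static void main(String[] args) {
        ExternalizerSketch externalizer = new ExternalizerSketch(
                "local http://localhost:4502",
                "author http://localhost:4502",
                "publish http://localhost:4503",
                "sitemap https://www.mysite.com");
        // prints https://www.mysite.com/content/mysite/en.html
        System.out.println(externalizer.externalLink("sitemap", "/content/mysite/en.html"));
    }
}
```

A lookup for a domain name that is not configured fails loudly in this sketch, which mirrors why the sitemap entry has to exist before SiteMapServlet refers to it.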

The Externalizer configuration above can be made run mode specific by taking advantage of the Sling OSGi Installer’s run mode awareness (e.g. /apps/myproject/configs/config.prod/com.day.cq.commons.impl.ExternalizerImpl.xml for the Production instance).

Resource resolver mappings

To make the sitemap available at the root of the website domain, e.g. https://www.mysite.com/sitemap.xml, resource resolver mappings have to be defined.

Here’s a sample. Under /etc/map.prod.publish, create an https folder (sling:OrderedFolder) and, below it, a mysite.com folder with the following content:

<?xml version="1.0" encoding="UTF-8"?>
<jcr:root xmlns:sling="http://sling.apache.org/jcr/sling/1.0" xmlns:jcr="http://www.jcp.org/jcr/1.0"
    jcr:primaryType="sling:Mapping"
    sling:match="(.*)\.mysite\.com.(4503|443)">
    <sitemap
        jcr:primaryType="sling:Mapping"
        sling:internalRedirect="[/content/mysite/en.sitemap.xml]"
        sling:match="sitemap.xml$"/>
    <sitemap_any
        jcr:primaryType="sling:Mapping"
        sling:internalRedirect="[/content/mysite/$3.sitemap.xml]"
        sling:match="([a-z]{2})/sitemap.xml$"/>
    <!-- other mappings -->
    <robots
        jcr:primaryType="sling:Mapping"
        sling:internalRedirect="[/etc/designs/mysite/robots.txt]"
        sling:match="robots.txt$"/>
    <!-- ... -->
</jcr:root>

This is the mapping defined for the Publish instance in Production.
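
To see why the sitemap_any entry refers to $3 rather than $1, it can help to replay the match in plain Java. This is a minimal sketch under two assumptions: that Sling matches against a string of the form scheme/host.port/path, and that the sling:match values of the parent and child entries are joined with a slash into one regular expression. MappingSketch and resolve are hypothetical names, not Sling API.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical replay of the sitemap_any mapping with java.util.regex. */
final class MappingSketch {
    // "https" folder + parent sling:match + child sling:match (assumed concatenation).
    // Groups: $1 = subdomain, $2 = port, $3 = two-letter language code.
    private static final Pattern SITEMAP_ANY = Pattern.compile(
            "^https/(.*)\\.mysite\\.com.(4503|443)/([a-z]{2})/sitemap.xml$");

    // Value of sling:internalRedirect on the sitemap_any entry.
    private static final String INTERNAL_REDIRECT = "/content/mysite/$3.sitemap.xml";

    /** Returns the internal path for the request, or null if the mapping does not apply. */
    static String resolve(String scheme, String host, int port, String path) {
        String candidate = scheme + "/" + host + "." + port + path;
        Matcher matcher = SITEMAP_ANY.matcher(candidate);
        return matcher.matches() ? matcher.replaceFirst(INTERNAL_REDIRECT) : null;
    }

    public static void main(String[] args) {
        // prints /content/mysite/de.sitemap.xml
        System.out.println(resolve("https", "www.mysite.com", 443, "/de/sitemap.xml"));
        // prints null: the root sitemap.xml is handled by the separate sitemap entry
        System.out.println(resolve("https", "www.mysite.com", 443, "/sitemap.xml"));
    }
}
```

The first two groups come from the host pattern on the parent entry, so the language code captured by the child entry is the third group.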

To activate it, the Resource Resolver service must be configured with the new location of the mapping entries. Create the config file /apps/myproject/configs/config.prod.publish/org.apache.sling.jcr.resource.internal.JcrResourceResolverFactoryImpl.xml with the following content:

<?xml version="1.0" encoding="UTF-8"?>
<jcr:root xmlns:sling="http://sling.apache.org/jcr/sling/1.0" xmlns:cq="http://www.day.com/jcr/cq/1.0"
    xmlns:jcr="http://www.jcp.org/jcr/1.0" xmlns:nt="http://www.jcp.org/jcr/nt/1.0"
    jcr:primaryType="sling:OsgiConfig"
    resource.resolver.map.location="/etc/map.prod.publish"/>

Additionally, reverse mapping entries can be defined if the sitemap needs to output custom URL formats.

Submitting the sitemap

Assuming the website has only three localizations (en, de, and fr), add the following lines to robots.txt:

Sitemap: https://www.mysite.com/en.sitemap.xml
Sitemap: https://www.mysite.com/de.sitemap.xml
Sitemap: https://www.mysite.com/fr.sitemap.xml

Sitemap index

If your site has many sitemaps, you can use a sitemap index file as a way to submit them at once.

In this case, you need to provide your own sitemap index generator. This could be a servlet registered under a path such as /bin/mysite/sitemapindex that handles only requests with the xml extension.

This servlet should generate the following output:

<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
    <sitemap>
        <loc>https://www.mysite.com/en.sitemap.xml</loc>
    </sitemap>
    <sitemap>
        <loc>https://www.mysite.com/de.sitemap.xml</loc>
    </sitemap>
    <sitemap>
        <loc>https://www.mysite.com/fr.sitemap.xml</loc>
    </sitemap>
</sitemapindex>

The sitemap index servlet reads the content structure below /content/mysite and generates the sitemap URLs as they are listed in robots.txt above. You can use com.adobe.acs.commons.wcm.impl.SiteMapServlet as a reference when writing your own sitemap index servlet.
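
The XML-writing part of such a servlet can be sketched with the JDK's StAX writer alone. The example below is a minimal sketch, not the ACS Commons implementation: SitemapIndexSketch and render are hypothetical names, and the language codes are passed in as a list instead of being read from the content tree below /content/mysite.

```java
import java.io.StringWriter;
import java.util.Arrays;
import java.util.List;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamWriter;

/** Hypothetical helper that renders a sitemap index for a list of language codes. */
final class SitemapIndexSketch {
    private static final String NS = "http://www.sitemaps.org/schemas/sitemap/0.9";

    static String render(String baseUrl, List<String> languages) {
        StringWriter out = new StringWriter();
        try {
            XMLStreamWriter xml = XMLOutputFactory.newInstance().createXMLStreamWriter(out);
            xml.writeStartDocument("UTF-8", "1.0");
            xml.writeStartElement("sitemapindex");
            xml.writeDefaultNamespace(NS);
            for (String language : languages) {
                xml.writeStartElement("sitemap");
                xml.writeStartElement("loc");
                // writeCharacters escapes XML special characters in the URL
                xml.writeCharacters(baseUrl + "/" + language + ".sitemap.xml");
                xml.writeEndElement(); // loc
                xml.writeEndElement(); // sitemap
            }
            xml.writeEndElement(); // sitemapindex
            xml.writeEndDocument();
            xml.close();
        } catch (XMLStreamException e) {
            throw new IllegalStateException("Could not write sitemap index", e);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(render("https://www.mysite.com", Arrays.asList("en", "de", "fr")));
    }
}
```

In a real servlet the same writer calls would target the response writer, with the content type set to an XML type.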

robots.txt should then be updated to contain a single line, the URL of the sitemap index:

Sitemap: https://www.mysite.com/sitemapindex.xml

Of course, a proper resource mapping entry has to be defined. Starting from the mappings defined above, remove sitemap and sitemap_any and define the new mapping:

<?xml version="1.0" encoding="UTF-8"?>
<jcr:root xmlns:sling="http://sling.apache.org/jcr/sling/1.0" xmlns:jcr="http://www.jcp.org/jcr/1.0"
    jcr:primaryType="sling:Mapping"
    sling:match="(.*)\.mysite\.com.(4503|443)">
    <sitemapindex
        jcr:primaryType="sling:Mapping"
        sling:internalRedirect="[/bin/mysite/sitemapindex.xml]"
        sling:match="sitemapindex.xml$"/>
    <!-- other mappings -->
    <robots
        jcr:primaryType="sling:Mapping"
        sling:internalRedirect="[/etc/designs/mysite/robots.txt]"
        sling:match="robots.txt$"/>
    <!-- ... -->
</jcr:root>

We also need reverse resource mappings if we’d like sitemap URLs in the form https://www.mysite.com/<lang>.sitemap.xml. Inside the folder /etc/map.prod.publish/https, create a folder www.mysite.com with the following content:

<?xml version="1.0" encoding="UTF-8"?>
<jcr:root xmlns:sling="http://sling.apache.org/jcr/sling/1.0" xmlns:jcr="http://www.jcp.org/jcr/1.0"
    jcr:primaryType="sling:Mapping"
    sling:internalRedirect="[/content/mysite/([a-z]{2})(.*)]"
    sling:match="$1$2"/>
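
One way to read this reverse mapping is as a regular-expression rewrite from the internal path to the external URL. The plain-Java sketch below models only that reading and is an assumption about the effect of the entry, not Sling's actual resolver code; ReverseMappingSketch, map, and sitemapUrl are hypothetical names, and the input path with an .html extension is also assumed.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical model of the reverse mapping: internal path in, external URL out. */
final class ReverseMappingSketch {
    // sling:internalRedirect regex: $1 = language code, $2 = rest of the path
    private static final Pattern INTERNAL = Pattern.compile("^/content/mysite/([a-z]{2})(.*)$");
    // https folder + www.mysite.com folder + sling:match="$1$2"
    private static final String EXTERNAL = "https://www.mysite.com/$1$2";

    /** Rewrites a content path to its external form; other paths are returned unchanged. */
    static String map(String resourcePath) {
        Matcher matcher = INTERNAL.matcher(resourcePath);
        return matcher.matches() ? matcher.replaceFirst(EXTERNAL) : resourcePath;
    }

    /** Builds <lang>.sitemap.xml from the mapped URL of a language page. */
    static String sitemapUrl(String languagePagePath) {
        String loc = map(languagePagePath);
        int dot = loc.lastIndexOf('.');
        // strip an extension only when the last path segment actually has one
        String base = dot > loc.lastIndexOf('/') ? loc.substring(0, dot) : loc;
        return base + ".sitemap.xml";
    }

    public static void main(String[] args) {
        // prints https://www.mysite.com/de/products.html
        System.out.println(map("/content/mysite/de/products.html"));
        // prints https://www.mysite.com/de.sitemap.xml
        System.out.println(sitemapUrl("/content/mysite/de.html"));
    }
}
```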

To make things clearer, here’s the code snippet responsible for generating sitemap URLs in the proper format:

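// code snippet from SitemapIndexServlet
import com.day.cq.commons.Externalizer;
import org.apache.felix.scr.annotations.Reference;
import org.apache.felix.scr.annotations.sling.SlingServlet;
import org.apache.sling.api.resource.ResourceResolver;
import org.apache.sling.api.servlets.HttpConstants;
import org.apache.sling.api.servlets.SlingAllMethodsServlet;

@SlingServlet(
        methods = HttpConstants.METHOD_GET,
        paths = {"/bin/mysite/sitemapindex"},
        extensions = {"xml"})
public class SitemapIndexServlet extends SlingAllMethodsServlet {
    private static final String DOMAIN_SITEMAP = "sitemap";

    @Reference // needed only for the Externalizer-based alternative below
    private Externalizer externalizer;

    // the code for creating sitemap index in XML format

    private String getSitemapURL(String languagePagePath, ResourceResolver resourceResolver) {
        // apply the resource resolver mappings, including the reverse mapping above
        String loc = resourceResolver.map(languagePagePath);
        // strip an extension such as ".html" only if the last path segment has one
        int dot = loc.lastIndexOf('.');
        String base = dot > loc.lastIndexOf('/') ? loc.substring(0, dot) : loc;
        return String.format("%s.sitemap.xml", base);
        // or
        // return externalizer.externalLink(resourceResolver, DOMAIN_SITEMAP, String.format("%s.sitemap.xml", languagePagePath));
    }
}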

Dispatcher configuration

You might need to change the Dispatcher configuration to allow sitemap URLs. Assuming your Dispatcher configuration follows a whitelist strategy, add the following to the /filter section:

# allow sitemap URLs
/0001 { /type "deny"   /glob "*.xml*"}
/0002 { /type "allow"  /glob "* *.sitemap.xml *"}
/0003 { /type "allow"  /glob "* /sitemapindex.xml *"}
/0004 { /type "allow"  /glob "* /robots.txt *"}
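
The order of these rules matters because, when several filter rules match a request, the last matching one takes effect. The sketch below replays that evaluation in plain Java against the request line. It is a simplified model with hypothetical names (DispatcherFilterSketch, rule, isAllowed): it supports only the * and ? glob wildcards, contains only the four rules above, and assumes a request is denied when no rule matches.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

/** Hypothetical model of /filter evaluation: the last matching rule decides. */
final class DispatcherFilterSketch {
    private static final class Rule {
        final boolean allow;
        final Pattern pattern;

        Rule(boolean allow, String glob) {
            this.allow = allow;
            this.pattern = globToRegex(glob);
        }
    }

    private final List<Rule> rules = new ArrayList<>();

    DispatcherFilterSketch rule(String type, String glob) {
        rules.add(new Rule("allow".equals(type), glob));
        return this;
    }

    /** Evaluates the rules in order against a request line such as "GET /x.xml HTTP/1.1". */
    boolean isAllowed(String requestLine) {
        boolean allowed = false; // assumption: deny unless a rule allows the request
        for (Rule rule : rules) {
            if (rule.pattern.matcher(requestLine).matches()) {
                allowed = rule.allow;
            }
        }
        return allowed;
    }

    /** Converts a glob with * and ? wildcards into an equivalent regular expression. */
    static Pattern globToRegex(String glob) {
        StringBuilder regex = new StringBuilder();
        for (char c : glob.toCharArray()) {
            if (c == '*') {
                regex.append(".*");
            } else if (c == '?') {
                regex.append('.');
            } else {
                regex.append(Pattern.quote(String.valueOf(c)));
            }
        }
        return Pattern.compile(regex.toString());
    }

    public static void main(String[] args) {
        DispatcherFilterSketch filter = new DispatcherFilterSketch()
                .rule("deny", "*.xml*")
                .rule("allow", "* *.sitemap.xml *")
                .rule("allow", "* /sitemapindex.xml *")
                .rule("allow", "* /robots.txt *");
        System.out.println(filter.isAllowed("GET /content/mysite/en/data.xml HTTP/1.1")); // false
        System.out.println(filter.isAllowed("GET /en.sitemap.xml HTTP/1.1"));             // true
        System.out.println(filter.isAllowed("GET /sitemapindex.xml HTTP/1.1"));           // true
    }
}
```

Placing the deny rule after the allow rules would block the sitemap URLs again, since the deny glob also matches them.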