Most blogging systems offer a way to create user-defined groupings of content called taxonomies. While I like the feature, I do not want them showing up in search results.
2017-09-10: After migrating to GitLab CI for building and publishing, I now use XMLStarlet in the CI job to achieve the same outcome.
When I was using WordPress, it was easy to find a plugin that omits taxonomies from the sitemap. With Hugo, I could not find a way to keep them from showing up on search engines. I also found that in some cases the taxonomy links ranked higher than the actual content pages. Since my posts and pages are already indexed, there is no value in having taxonomies indexed as well, so I had to remove them. One way to do that is to add a post-build step.
My batch file for generating the public output folder with Hugo was:
rd /s /q public
md public
hugo.exe
The above deletes the output folder, recreates it, and then has Hugo publish into it. If you do not like the idea of deleting and recreating the public folder, the first two lines are optional.
I wrote a simple PowerShell script that strips the taxonomy entries from sitemap.xml:
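# Load the sitemap that Hugo generated.
$sitemapPath = Join-Path (Get-Location) "public\sitemap.xml"
[xml]$xml = Get-Content $sitemapPath
# The sitemap declares a default namespace, so XPath needs a prefix mapped to it.
$ns = New-Object System.Xml.XmlNamespaceManager($xml.NameTable)
$ns.AddNamespace("ns", $xml.DocumentElement.NamespaceURI)
# Match every <url> entry whose <loc> points at a category or tag page.
$xpathSelectCriterion = "//ns:url[contains(ns:loc, '/categories/') or contains(ns:loc, '/tags/')]"
$node = $xml.SelectSingleNode($xpathSelectCriterion, $ns)
while ($null -ne $node) {
    # RemoveChild returns the removed node; discard it to keep the output quiet.
    $node.ParentNode.RemoveChild($node) | Out-Null
    $node = $xml.SelectSingleNode($xpathSelectCriterion, $ns)
}
# Save with a full path: .NET resolves relative paths against its own working directory.
$xml.Save($sitemapPath)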
Then I added a line at the end of my batch file to call the script above:
rd /s /q public
md public
hugo.exe
powershell -noexit "& "".\clean-sitemap.ps1"""
I tried plain Windows batch commands first and found them too cumbersome. In the end, I needed to get the result in the shortest possible time, and with an interpreted language so that I do not have to commit binary files to source control.
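To confirm that the post-build step did its job, the cleaned sitemap can be scanned for leftover taxonomy URLs. The following is only a minimal sketch, written in Python rather than PowerShell so it can run on any build machine. The function name `find_taxonomy_urls` is my own, and the sketch assumes the sitemap uses the standard sitemaps.org namespace.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemap protocol (assumed to match the generated file).
SITEMAP_NS = {"ns": "http://www.sitemaps.org/schemas/sitemap/0.9"}
# The same path fragments the cleanup script filters on.
TAXONOMY_PATHS = ("/categories/", "/tags/")


def find_taxonomy_urls(sitemap_xml):
    """Return every <loc> value that still points at a taxonomy page."""
    root = ET.fromstring(sitemap_xml)
    locs = [loc.text or "" for loc in root.findall("ns:url/ns:loc", SITEMAP_NS)]
    return [loc for loc in locs if any(path in loc for path in TAXONOMY_PATHS)]
```

After a build, reading `public/sitemap.xml` in binary mode and passing its contents to `find_taxonomy_urls` should give back an empty list. Anything it does return is an entry the cleanup missed.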
You also need to tell search bots, through robots.txt, that crawling the categories and tags taxonomies is disallowed:
User-agent: *
Disallow: /categories/
Disallow: /tags/
Sitemap: https://www.leowkahman.com/sitemap.xml
Remember to point the Sitemap line to your own sitemap.xml.
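Before waiting on a recrawl, the rules above can be checked locally with the robots.txt parser in Python's standard library. This is only a sketch: the helper name `is_crawlable` and the example post URL are hypothetical, and real crawlers may interpret edge cases differently from this parser.

```python
from urllib.robotparser import RobotFileParser

# The same rules as in the robots.txt shown above.
ROBOTS_TXT = """\
User-agent: *
Disallow: /categories/
Disallow: /tags/
Sitemap: https://www.leowkahman.com/sitemap.xml
"""


def is_crawlable(robots_txt, url, user_agent="*"):
    """Return True if the given rules allow user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Taxonomy pages should be blocked; ordinary posts should stay crawlable.
print(is_crawlable(ROBOTS_TXT, "https://www.leowkahman.com/tags/hugo/"))
print(is_crawlable(ROBOTS_TXT, "https://www.leowkahman.com/example-post/"))  # hypothetical post URL
```

The first call should report that the tag page is blocked, and the second that the post is still allowed.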
I had to wait a few days for Google to recrawl and purge the taxonomies. To verify that Google has removed them from its search results, search for site:www.leowkahman.com. The taxonomy pages should no longer appear. Replace the domain with your own, of course.