When creating an XML sitemap, you may want to modify it slightly before you upload it to your server. There are several reasons to manually edit your XML sitemap file: ensuring proper indexation, working around URL limits per sitemap (although search engines have gotten better at handling larger XML files), and increasing the percentage of your pages that get indexed.
There are plenty of XML sitemap generation tools, like AuditMyPC, XML-Sitemaps, or the Google Sitemap Generator, that will help you create your sitemap. However, many of these tools will not take the following into consideration.
- URL Duplicates: Make sure you don’t include multiple versions of the same URL. For example, if you have both domain.com/services and domain.com/services/, remove one of them so you don’t create any canonical issues.
- Robots.txt: Many sitemap generators will read your robots.txt file and exclude the directories or URLs you have already disallowed, but not all of them do. Double-check that those directories are not being included in your sitemap.
- Error Pages: If you notice 404 error pages in your sitemap, either fix those pages so you can include them or remove them from the sitemap.
- Images: There is no need to list all of your images in your XML sitemap. Google will index the images when it crawls the page, so I wouldn’t try to have Google focus on your images.
- Invalid Links: Some sitemap generators offer an option to ignore invalid relative links; using it helps ensure you submit only valid links to the search engines.
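The duplicate-URL and robots.txt checks above can be automated before you upload the file. Here is a minimal sketch in Python, assuming you have already fetched your robots.txt contents as a list of lines; the function name `clean_urls`, the `domain.com` URLs, and the trailing-slash normalization rule are illustrative, not part of any particular generator tool.

```python
import urllib.robotparser


def normalize(url):
    # Treat "/services" and "/services/" as the same page.
    return url.rstrip("/")


def clean_urls(urls, robots_lines, agent="*"):
    """Drop duplicate URLs and URLs already disallowed by robots.txt."""
    rp = urllib.robotparser.RobotFileParser()
    rp.parse(robots_lines)  # parse robots.txt content you fetched earlier
    seen, kept = set(), []
    for url in urls:
        key = normalize(url)
        if key in seen:
            continue  # duplicate version of a URL (canonical issue)
        if not rp.can_fetch(agent, url):
            continue  # already blocked in robots.txt, keep sitemap consistent
        seen.add(key)
        kept.append(url)
    return kept
```

For example, with a robots.txt containing `Disallow: /private/`, passing in `domain.com/services`, `domain.com/services/`, and `domain.com/private/page` would return only the first URL: the trailing-slash duplicate and the disallowed page are both dropped.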
Quick Tip: Make sure you don’t block your CSS files in your robots.txt file. Google will want to be able to read your CSS files to ensure that you are not doing anything blackhat with how you display your content.
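You can check this with Python’s built-in robots.txt parser. A minimal sketch, assuming hypothetical robots.txt rules and an example `/css/main.css` URL:

```python
import urllib.robotparser

# Hypothetical robots.txt rules for illustration only.
rp = urllib.robotparser.RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /css/",  # a rule like this would hide your styling from Google
])

# Googlebot falls under the "User-agent: *" group here, so the CSS is blocked.
css_blocked = not rp.can_fetch("Googlebot", "https://domain.com/css/main.css")
```

If `css_blocked` comes back true for any of your stylesheets, remove the offending `Disallow` rule before Google crawls the page.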
Great article, it's important not to have CSS files blocked in the robots.txt file. Also, people make the mistake of including video files in the sitemap; these, along with images, will not get picked up by the crawlers. Instead, they can be included in a separate video sitemap.
I am not 100% sure what you are asking, but I think you are talking about pages that have lots of external links on them. I wouldn't worry too much about blocking those pages; however, you could restrict them in the robots.txt file or with a noindex, nofollow meta tag.
Thanks for the list. I had never thought about checking a sitemap that closely. Some items, like 404 error pages and invalid links, are quite obvious, but I had not thought of the robots.txt file. Thanks.
@Mark great points, and yes, duplicates and 404s are the two biggest items you need to flag and remove from your sitemap. Interesting point about not blocking Google from your CSS to show you are following their rules. Has anyone else seen an impact from blocking their CSS in robots.txt?