Author: Martin Hepp, mheppATcomputerDOTorg
Much more than any other type of rich meta-data on the Web, GoodRelations-related data is subject to change and updates, e.g. prices change, products sell out, special offers expire, and opening hours vary.
Search engines in the traditional Web crawl your page only once in a couple of weeks or so. Usually, PageRank or other popularity metrics are used to decide which pages to crawl more frequently than others. So BestBuy pages may be crawled every day, while Peter Miller's Hardware Shop Site in rural Kentucky will be checked only once every two months. Only a small fraction of Web resources that can be expected to change very often, like Twitter feeds and popular blogs, are re-crawled much more frequently.
The simple reason is that crawling consumes a lot of resources on both ends - both the search engine and the servers hosting the site will be subject to CPU load and network traffic. Since those resources are costly and limited, a good search engine will employ sophisticated algorithms to decide when to visit your page again.
Now, it is important that you help the search engines decide when to crawl your pages again.
The first important thing is to be clear about the expected validity of the statements in your GoodRelations data. For example, a price quotation may be guaranteed only for a few days, while a basic company profile will likely remain valid for a year or more.
In many cases, it is hard to determine exact dates for each statement in advance, so we need to apply heuristics (rules of thumb). You should find a good balance between the two extremes, i.e. a validity so short that it causes unnecessary crawler traffic, and a validity so long that outdated data lingers in search engines and indexing services.
A good heuristic is 48 - 72 hours of validity for typical shop pages that are generated dynamically from a database, and one year for static company profiles, both counted from the creation of the respective RDF/XML or HTML+RDFa resource.
So for a database-driven shop application, you could estimate the validity as the moment of page generation plus 48 hours.
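As an illustration, here is a minimal Python sketch of how such a validity window could be computed at page-generation time; the function name and the 48-hour default are illustrative assumptions, not part of GoodRelations itself:

from datetime import datetime, timedelta, timezone

def offering_validity(hours=48):
    """Return (gr:validFrom, gr:validThrough) as xsd:dateTime strings,
    counted from the moment the page is generated."""
    valid_from = datetime.now(timezone.utc)
    valid_through = valid_from + timedelta(hours=hours)
    return (valid_from.isoformat(timespec="seconds"),
            valid_through.isoformat(timespec="seconds"))

valid_from, valid_through = offering_validity()
# e.g. "2010-03-04T10:30:00+00:00" and "2010-03-06T10:30:00+00:00"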
Now, there are several techniques that can be used to communicate the validity of your GoodRelations data, and which may be used by search engines and indexing services to decide upon the intervals for checking back with your site:
The GoodRelations vocabulary itself defines two datatype properties, gr:validFrom and gr:validThrough, which can be attached to instances of gr:Offering, gr:OpeningHoursSpecification, or gr:PriceSpecification.
They should be used at least for gr:Offering nodes.
# foo: is a placeholder namespace for your own data
@prefix foo: <http://www.example.com/#> .
@prefix gr: <http://purl.org/goodrelations/v1#> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

foo:myOffering
    a gr:Offering ;
    foaf:page <http://www.example.com/xyz> ;
    rdfs:comment "We sell and repair computers and motorbikes"@en ;
    gr:includes foo:myProducts ;
    gr:hasBusinessFunction gr:Sell, gr:Repair ;
    gr:validFrom "2010-03-04T00:00:00+01:00"^^xsd:dateTime ;
    gr:validThrough "2010-03-06T00:00:00+01:00"^^xsd:dateTime .
The HTTP protocol specifies several means for the server to indicate how long it is safe to use a cached copy of a resource (more precisely: of its representation). This is explained in more detail in section 13.2.1 "Server-Specified Expiration" of RFC 2616, the authoritative specification of the HTTP 1.1 protocol.
Basically, you should configure your HTTP server so that it specifies a cache expiration time at or before the end of the validity of the GoodRelations data contained in the representation. If multiple validity specifications exist in the same resource (e.g. one for the offering and one for the price), the one that expires first should be used.
Servers specify explicit expiration times using either the Expires header, or the max-age directive of the Cache-Control header.
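As an illustration, here is a minimal Python sketch of how such headers could be derived from the GoodRelations validity data; the setting (a server-side script in which you control the response headers) and the function name are assumptions for illustration:

from datetime import datetime, timezone
from email.utils import format_datetime

def caching_headers(valid_through_dates):
    """Build Expires and Cache-Control headers so that the cached copy
    expires no later than the earliest gr:validThrough in the page.
    valid_through_dates: the gr:validThrough values of all offerings and
    price specifications in the resource, as UTC datetime objects."""
    deadline = min(valid_through_dates)
    remaining = (deadline - datetime.now(timezone.utc)).total_seconds()
    max_age = max(0, int(remaining))
    return {"Expires": format_datetime(deadline, usegmt=True),
            "Cache-Control": "max-age=%d" % max_age}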
Sitemap documents following the Sitemap Protocol are an important technique to help search engines and other crawlers find all relevant resources in a given Web site. You should always provide a sitemap file for your shop and list the URIs of all individual pages that contain GoodRelations markup in RDF/XML or RDFa.
Besides helping a crawler to initially discover all pages contained in your site, you can also use the sitemap document to indicate how frequently the content is expected to change, and in particular, which pages will change more frequently than others.
Here is a minimal example of a sitemap:
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
   <url>
      <loc>http://www.example.com/</loc>
      <lastmod>2005-01-01</lastmod>
      <changefreq>monthly</changefreq>
      <priority>0.8</priority>
   </url>
</urlset>
The interesting elements for us in here are lastmod, changefreq, and priority.
The following definitions are slightly adapted excerpts from the sitemap protocol specification:

lastmod: The date of last modification of the resource. This date should be in W3C Datetime format, e.g. 2005-01-01.
changefreq: How frequently the page is likely to change. Valid values are always, hourly, daily, weekly, monthly, yearly, and never.
priority: The priority of this URL relative to other URLs on your site. Valid values range from 0.0 to 1.0; the default priority of a page is 0.5.
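For a database-driven shop, you will typically generate one such entry per product page. Here is a minimal Python sketch of what that could look like; the function name, the example URL, and the daily changefreq (chosen to match the 48 - 72 hour validity heuristic above) are illustrative assumptions:

from datetime import date
from xml.sax.saxutils import escape

def sitemap_entry(loc, changefreq="daily", priority=0.5):
    """Render a single <url> element for a sitemap document."""
    return ("<url>\n"
            "   <loc>%s</loc>\n"
            "   <lastmod>%s</lastmod>\n"
            "   <changefreq>%s</changefreq>\n"
            "   <priority>%.1f</priority>\n"
            "</url>" % (escape(loc), date.today().isoformat(),
                        changefreq, priority))

print(sitemap_entry("http://www.example.com/products/item-4711"))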
You can use the Semantic Sitemap Extension to help crawlers find and fetch data dumps and SPARQL endpoints related to your data.
All three techniques should be used in parallel and should ideally indicate the same validity period, at least approximately.
As a minimal solution, you should attach gr:validFrom and gr:validThrough to all of your gr:Offering nodes, as shown above.
Also, your server should not specify a cache validity for any single page that extends beyond the day on which the first gr:Offering or gr:UnitPriceSpecification contained in that resource (!) will expire.
If you want to minimize crawler / spider traffic and maximize the freshness of your data in search engines and indexing services, you should use the priority element to point crawlers to the subset of products / data that changes more frequently than the rest. For example, memory chips and hard disk drives are usually subject to more substantial price variations than books or cables.
Thanks to Giovanni Tumarello from DERI for raising this topic and to Kingsley Idehen from OpenLink Software for initially pointing me to the HTTP caching mechanisms.