GoodRelations is a standardized vocabulary for product, price, and company data that (1) can be embedded into existing static and dynamic Web pages and (2) can be processed by other computers. This increases the visibility of your products and services in the latest generation of search engines, recommender systems, and other novel applications.
Martin Hepp (UniBW)
martin.hepp at ebusiness-unibw.org
Fri Jan 15 16:08:55 CET 2010
Hi all,

It seems there is a quick and easy way to get a full RDF/XML representation of all 20 million Amazon offers. Here is how it will likely work:

1. Take the Amazon sitemap index files, as given by http://www.amazon.com/robots.txt

   # Sitemap files
   Sitemap: http://www.amazon.de/sitemap_index_0.xml
   Sitemap: http://www.amazon.de/sitemap_index_1.xml
   Sitemap: http://www.amazon.de/sitemap_index_2.xml
   Sitemap: http://www.amazon.de/sitemap_index_3.xml
   Sitemap: http://www.amazon.de/sitemap-manual-index.xml
   Sitemap: http://www.amazon.de/sitemap_wishlist_index.xml

2. Take the individual sitemap files from all of those, e.g. http://www.amazon.de/sitemap_page_0.xml.gz from http://www.amazon.de/sitemap_index_0.xml

   <sitemapindex xmlns="http://www.google.com/schemas/sitemap/0.84">
     <sitemap>
       <loc>http://www.amazon.de/sitemap_page_0.xml.gz</loc>
       <lastmod>2006-10-16</lastmod>
     </sitemap>

3. Now, for each of those ca. 20 million entries given as <loc> elements, e.g. http://www.amazon.com/Pull-Power-Semantic-Transform-Business/dp/1591842778/

   <url>
     <loc>http://www.amazon.com/Pull-Power-Semantic-Transform-Business/dp/1591842778/</loc>
   </url>

   use the URIburner service (http://uriburner.com/sparql/) to extract the complete commercial meta-data in GoodRelations.

Note that not all URIs are current and that URIburner cannot produce GoodRelations data for all pages, but it can for the majority of the ca. 20 million pages.

You will get, on average, 200 GoodRelations triples per Amazon page, so the total will be on the order of 4 billion!

(If you want to check it for yourself, try

   select COUNT (*) WHERE {?s ?p ?o.
     FILTER (regex(?o, "^http://purl.org/goodrelations/v1#", "i")
          or regex(?p, "^http://purl.org/goodrelations/v1#", "i")) }

against the URI http://www.amazon.com/Pull-Power-Semantic-Transform-Business/dp/1591842778/ )

Important: Using URIburner on the full set of Amazon URIs will likely impose a great load on the underlying server, operated by OpenLink Software.
If you want to use this option, in particular for commercial purposes, please contact Kingsley Idehen before you start. His e-mail is <kidehen at openlinksw.com>.

Best wishes

Martin Hepp

--
--------------------------------------------------------------
martin hepp
e-business & web science research group
universitaet der bundeswehr muenchen

e-mail:  hepp at ebusiness-unibw.org
phone:   +49-(0)89-6004-4217
fax:     +49-(0)89-6004-4620
www:     http://www.unibw.de/ebusiness/ (group)
         http://www.heppnetz.de/ (personal)
skype:   mfhepp
twitter: mfhepp

Check out GoodRelations for E-Commerce on the Web of Linked Data!
=================================================================

Project page:
http://purl.org/goodrelations/

Resources for developers:
http://www.ebusiness-unibw.org/wiki/GoodRelations

Webcasts:
Overview - http://www.heppnetz.de/projects/goodrelations/webcast/
How-to   - http://vimeo.com/7583816

Recipe for Yahoo SearchMonkey:
http://www.ebusiness-unibw.org/wiki/GoodRelations_and_Yahoo_SearchMonkey

Talk at the Semantic Technology Conference 2009:
"Semantic Web-based E-Commerce: The GoodRelations Ontology"
http://www.slideshare.net/mhepp/semantic-webbased-ecommerce-the-goodrelations-ontology-1535287

Overview article on Semantic Universe:
http://www.semanticuniverse.com/articles-semantic-web-based-e-commerce-webmasters-get-ready.html

Tutorial materials:
ISWC 2009 Tutorial: The Web of Data for E-Commerce in Brief: A Hands-on Introduction to the GoodRelations Ontology, RDFa, and Yahoo! SearchMonkey
http://www.ebusiness-unibw.org/wiki/Web_of_Data_for_E-Commerce_Tutorial_ISWC2009
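
For readers who want to try the three steps from the message on a small sample, the following is a minimal Python sketch, not the author's own tooling. It only parses sitemap XML that you have already downloaded and builds the request URL for the COUNT query quoted in the message; it deliberately performs no network requests, in line with the warning about server load. The helper names (extract_locs, maybe_gunzip, build_request_url) are hypothetical. Passing the page URI as the standard SPARQL protocol parameter default-graph-uri is an assumption about how the query is run "against the URI"; the message does not say how URIburner is told to fetch and convert the page.

```python
import gzip
import urllib.parse
import xml.etree.ElementTree as ET

# Endpoint named in the message.
SPARQL_ENDPOINT = "http://uriburner.com/sparql/"

# The COUNT query quoted in the message, unchanged apart from line breaks.
COUNT_QUERY = """select COUNT (*) WHERE {?s ?p ?o.
  FILTER (regex(?o, "^http://purl.org/goodrelations/v1#", "i")
       or regex(?p, "^http://purl.org/goodrelations/v1#", "i")) }"""


def maybe_gunzip(data):
    """Decompress gzip data (e.g. sitemap_page_0.xml.gz); pass other bytes through."""
    if data[:2] == b"\x1f\x8b":  # gzip magic number
        return gzip.decompress(data)
    return data


def extract_locs(xml_bytes):
    """Return the text of every <loc> element, whatever the XML namespace.

    Works for both a sitemap index (step 1 to 2) and an individual
    sitemap file (step 2 to 3), since both list their targets in <loc>.
    """
    root = ET.fromstring(xml_bytes)
    return [
        el.text.strip()
        for el in root.iter()
        if el.tag.rsplit("}", 1)[-1] == "loc" and el.text
    ]


def build_request_url(page_uri):
    """Build a GET URL for the COUNT query (hypothetical helper).

    Assumption: the page URI is supplied as the default graph.
    """
    params = {"default-graph-uri": page_uri, "query": COUNT_QUERY}
    return SPARQL_ENDPOINT + "?" + urllib.parse.urlencode(params)


if __name__ == "__main__":
    # Sitemap index excerpt from the message, closed so that it parses.
    index_xml = b"""<sitemapindex xmlns="http://www.google.com/schemas/sitemap/0.84">
      <sitemap>
        <loc>http://www.amazon.de/sitemap_page_0.xml.gz</loc>
        <lastmod>2006-10-16</lastmod>
      </sitemap>
    </sitemapindex>"""
    for sitemap_url in extract_locs(maybe_gunzip(index_xml)):
        print("sitemap file:", sitemap_url)

    page = "http://www.amazon.com/Pull-Power-Semantic-Transform-Business/dp/1591842778/"
    print("query URL:", build_request_url(page))
```

The sketch stops at building the URL on purpose: anyone planning to send such requests for more than a handful of pages should first follow the note above and contact OpenLink Software.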