Data and API Catalog Schema
Data sources need to be able to advertise their APIs and datasets so that catalogs can pick them up. There are two important pieces to doing this:
- Files or endpoints that catalogs or search engines can access to find the appropriate APIs and datasets. This is similar to how sitemap.xml works for websites.
- A standard format for describing APIs and datasets, so that callers can gather the appropriate information. Several formats have been suggested for this; this standard starts with CKAN's Data Catalog Schema and Protocol modifications on top of DCAT.
catalog.xml and apis.xml
Any data source that provides datasets or APIs can expose them to catalogs and search engines through a few endpoints. These endpoints are designed so that they can be implemented either by simply creating a file or by building a more dynamic web service. This matters because one goal of the standard is to let simple data stores expose APIs or datasets through a catalog. The available endpoints are:
- /catalog.xml is an endpoint that lists all the resources provided by a data source. This can include datasets, APIs, and visualizations.
- /apis.xml is an endpoint that lists all the APIs surfaced by a data source. This will ONLY include APIs. The purpose of pulling this file out separately is to make it easier for API catalogs to provide programmatic access. These SHOULD be implemented as includes in catalog.xml.
In addition, catalogs MAY provide queryable endpoints above and beyond catalog.xml. Responses from such endpoints should use the same format described below.
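As a rough illustration of how a catalog or crawler might locate these endpoints, the sketch below resolves the two paths against a data source's base URL. It assumes the endpoints live at the site root, as sitemap.xml does; the function name `discovery_urls` and the host `data.example.org` are placeholders, not part of the standard.

```python
from urllib.parse import urljoin

# Endpoint paths named by the standard.
CATALOG_PATH = "/catalog.xml"
APIS_PATH = "/apis.xml"


def discovery_urls(base_url: str) -> dict:
    """Return the two well-known discovery URLs for a data source."""
    # Joining with an absolute path replaces any path on base_url,
    # so both endpoints resolve against the site root (an assumption).
    return {
        "catalog": urljoin(base_url, CATALOG_PATH),
        "apis": urljoin(base_url, APIS_PATH),
    }


if __name__ == "__main__":
    # data.example.org is a placeholder host, not a real data source.
    for name, url in discovery_urls("https://data.example.org/some/page").items():
        print(name, url)
```

A static site can satisfy this by placing two files at those paths, while a dynamic service can generate the same documents on request.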
catalog.xml example
NOTE: It would be nice to standardize on the catalog query API as well. This could be an area of more work.
We think about the format of the endpoints in two ways: the data model and the actual file formats. The data model is based on DCAT, and the default file format is XML. The XML format is based on RDF; however, some dynamic endpoints MAY decide to offer other formats such as N3 or JSON.
Differences from DCAT
There are a few differences from DCAT proper: a series of properties we suggest not using, and a new set of properties that extend DCAT.
- Don't use dcat:dataDictionary. Save for later use, when it is better defined
- Don't use dcat:dataQuality. Save for later use, when it is better defined
- Don't use dcat:granularity. Save for later use, when it is better defined
- Don't use dc:references. Save for later use, when it is better defined
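A publisher could check a catalog document for these discouraged properties before publishing it. The following is a minimal sketch, assuming the document is RDF/XML and that the `dcat:` and `dc:` prefixes bind to the namespace URIs shown; the text above names only the prefixes, so those URIs, the function name, and the sample document are all assumptions.

```python
import xml.etree.ElementTree as ET

# Assumed namespace URIs; the standard's text only gives the prefixes.
DCAT = "http://www.w3.org/ns/dcat#"
DC = "http://purl.org/dc/terms/"

# Map of fully qualified tag -> prefixed name used in the list above.
DISCOURAGED = {
    f"{{{DCAT}}}dataDictionary": "dcat:dataDictionary",
    f"{{{DCAT}}}dataQuality": "dcat:dataQuality",
    f"{{{DCAT}}}granularity": "dcat:granularity",
    f"{{{DC}}}references": "dc:references",
}


def find_discouraged(xml_text: str) -> list:
    """Return the discouraged properties used anywhere in the document."""
    root = ET.fromstring(xml_text)
    found = {DISCOURAGED[el.tag] for el in root.iter() if el.tag in DISCOURAGED}
    return sorted(found)


# Made-up sample document, for illustration only.
SAMPLE = f"""
<catalog xmlns:dcat="{DCAT}" xmlns:dc="{DC}">
  <dcat:Dataset>
    <dc:title>Egg recalls</dc:title>
    <dcat:granularity>year</dcat:granularity>
    <dc:references>https://example.org/other</dc:references>
  </dcat:Dataset>
</catalog>
"""

if __name__ == "__main__":
    print(find_discouraged(SAMPLE))
```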
Description: Describes a resource provided by this data service (or catalog). This is based on dcat:distribution, but extends it with a number of properties to add context, since this element does not appear as part of a catalog entry.
| Name | Property or type | Description |
|---|---|---|
| title | dct:title | The name or title for this representation. If not present, implementers should use the name from the related dataset. |
| description | dct:description | Text describing this representation. If not present, implementers should use the description from the related dataset. |
| identifier | dct:identifier | Unique identifier for this data representation. |
| keyword | dcat:keyword | Keywords specific to this representation. If the dataset has keywords as well, the keywords relating to this object should be the union of the two sets. |
| dataset | rdfs:Resource | Reference to the datasets this resource is a part of. If this resource aggregates or uses multiple datasets, this will contain all the datasets referenced. |
| last_modified | xsd:dateTime | The time this was last modified. |
| created | xsd:dateTime | The time this was initially created. |
| authentication | rdfs:Literal | Authentication type required for this resource. "None" means it is open to the public. "Basic" means basic authentication. "OAuth" and "OAuth2" refer to OAuth version 1.0 and 2.0, respectively. This ONLY refers to the mechanism and not to how authorization is done. |
| contact_email | rdfs:Literal | An optional field with the email address of the person to contact about this dataset. |
| documentation | rdfs:Resource | The documentation for this site. Should link to human-readable documentation. |
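The fallback and union rules for title, description, and keyword can be sketched as follows. This is only one possible reading of those rules: the dictionary layout, the key names, and the function name `effective_metadata` are hypothetical and chosen for illustration.

```python
def effective_metadata(representation: dict, dataset: dict) -> dict:
    """Resolve title, description, and keywords for a representation."""
    # Title and description fall back to the related dataset when absent.
    title = representation.get("title") or dataset.get("title")
    description = representation.get("description") or dataset.get("description")
    # Keywords are the union of the representation's and the dataset's.
    keywords = sorted(
        set(representation.get("keyword", [])) | set(dataset.get("keyword", []))
    )
    return {"title": title, "description": description, "keyword": keywords}


if __name__ == "__main__":
    # Made-up records, for illustration only.
    dataset = {
        "title": "Egg recalls",
        "description": "All egg product recalls",
        "keyword": ["Eggs", "Recalls"],
    }
    representation = {"title": "Recalls chart", "keyword": ["Charts", "Eggs"]}
    print(effective_metadata(representation, dataset))
```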
Description: This is an API that accesses a particular dataset or set of datasets.
| Name | Property or type | Description |
|---|---|---|
| client_library | rdfs:Resource | References a client library that can be used to access this API. |
| api_type | rdfs:Literal | A string denoting a particular API this endpoint supports. |
<ods:api>
  <dct:title>SODA Api for Current Recalls</dct:title>
  <dct:description>
    This is the SODA Api for getting access to all the egg products that are actively being investigated
  </dct:description>
  <dct:identifier>ongoing-egg-recalls</dct:identifier>
  <dcat:accessURL>https://fda.demo.socrata.com/resource/ongoing-egg-recalls</dcat:accessURL>
  <dcat:keyword>Eggs</dcat:keyword>
  <dcat:keyword>Recalls</dcat:keyword>
  <ods:authentication>Basic</ods:authentication>
  <ods:documentation>https://fda.demo.socrata.com/developers/docs/egg-recalls</ods:documentation>
  <ods:client_library>https://github.com/socrata/soda-java</ods:client_library>
  <ods:client_library>https://github.com/socrata/soda-scala</ods:client_library>
  <ods:client_library>https://github.com/socrata/socrata-php</ods:client_library>
  <ods:client_library>https://github.com/socrata/soda-js</ods:client_library>
  <ods:client_library>https://github.com/socrata/socrata-python</ods:client_library>
  <ods:client_library>https://github.com/socrata/socrata-api-csharp</ods:client_library>
  <ods:contact_email>firstname.lastname@example.org</ods:contact_email>
  <!-- MIME Types provided -->
  <dct:format>application/json</dct:format>
  <dct:format>application/rdf+xml</dct:format>
  <dct:format>text/csv</dct:format>
  <!-- What type of API it is -->
  <ods:api_type>rest/soda</ods:api_type>
  <ods:last_modified>2012-10-10T16:46:00+08:00</ods:last_modified>
  <ods:created>2012-10-10T16:46:00+08:00</ods:created>
</ods:api>
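A consumer of apis.xml might read an entry like the one above as shown in this sketch. The example does not declare its namespaces, so the URIs bound here are assumptions, and the `ods` URI in particular is a hypothetical placeholder; `summarize_api` is an illustrative name, and the sample is a shortened copy of the entry.

```python
import xml.etree.ElementTree as ET

NS = {
    "dct": "http://purl.org/dc/terms/",   # assumed binding
    "dcat": "http://www.w3.org/ns/dcat#",  # assumed binding
    "ods": "http://example.org/ods#",      # hypothetical placeholder URI
}

# Shortened copy of the ods:api example, with namespace declarations added.
SAMPLE = """
<ods:api xmlns:dct="http://purl.org/dc/terms/"
         xmlns:dcat="http://www.w3.org/ns/dcat#"
         xmlns:ods="http://example.org/ods#">
  <dct:identifier>ongoing-egg-recalls</dct:identifier>
  <dcat:accessURL>https://fda.demo.socrata.com/resource/ongoing-egg-recalls</dcat:accessURL>
  <ods:authentication>Basic</ods:authentication>
  <ods:client_library>https://github.com/socrata/soda-java</ods:client_library>
  <ods:client_library>https://github.com/socrata/soda-js</ods:client_library>
  <dct:format>application/json</dct:format>
  <dct:format>text/csv</dct:format>
  <ods:api_type>rest/soda</ods:api_type>
</ods:api>
"""


def summarize_api(xml_text: str) -> dict:
    """Pull the single-valued and repeated fields out of one ods:api entry."""
    api = ET.fromstring(xml_text)
    return {
        "identifier": api.findtext("dct:identifier", namespaces=NS),
        "access_url": api.findtext("dcat:accessURL", namespaces=NS),
        "authentication": api.findtext("ods:authentication", namespaces=NS),
        "api_type": api.findtext("ods:api_type", namespaces=NS),
        # dct:format and ods:client_library may repeat, so collect them all.
        "formats": [el.text for el in api.findall("dct:format", NS)],
        "client_libraries": [el.text for el in api.findall("ods:client_library", NS)],
    }


if __name__ == "__main__":
    for key, value in summarize_api(SAMPLE).items():
        print(key, value)
```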
Description: A chart built on top of the associated datasets.
The following example surfaces a chart that is based on several different datasets, one for each year of egg recalls. Since this does not fit neatly into the dataset-down approach of DCAT, we annotate it with the dataset property from ods:Representation.
<ods:chart>
  <dct:title>Number of recalls per manufacturer (2004 - 2012)</dct:title>
  <dct:description>
    This chart graphs the number of egg product recalls per manufacturer for the years 2004 - 2012
  </dct:description>
  <dct:identifier>recalls-per-manufacturer</dct:identifier>
  <dcat:accessURL>https://fda.demo.socrata.com/resource/recalls-per-manufacturer</dcat:accessURL>
  <dcat:keyword>Eggs</dcat:keyword>
  <dcat:keyword>Recalls</dcat:keyword>
  <ods:contact_email>email@example.com</ods:contact_email>
  <!-- MIME Types accepted -->
  <dct:format>text/html</dct:format>
  <dct:format>image/png</dct:format>
  <!-- Since this chart uses data from many different datasets, it references them here -->
  <ods:dataset>https://fda.demo.socrata.com/resource/egg-recalls-2004</ods:dataset>
  <ods:dataset>https://fda.demo.socrata.com/resource/egg-recalls-2005</ods:dataset>
  <ods:dataset>https://fda.demo.socrata.com/resource/egg-recalls-2006</ods:dataset>
  <ods:dataset>https://fda.demo.socrata.com/resource/egg-recalls-2007</ods:dataset>
  <ods:dataset>https://fda.demo.socrata.com/resource/egg-recalls-2008</ods:dataset>
  <ods:dataset>https://fda.demo.socrata.com/resource/egg-recalls-2009</ods:dataset>
  <ods:dataset>https://fda.demo.socrata.com/resource/egg-recalls-2010</ods:dataset>
  <ods:dataset>https://fda.demo.socrata.com/resource/egg-recalls-2011</ods:dataset>
  <ods:dataset>https://fda.demo.socrata.com/resource/egg-recalls-2012</ods:dataset>
  <ods:last_modified>2012-10-10T16:46:00+08:00</ods:last_modified>
  <ods:created>2012-10-10T16:46:00+08:00</ods:created>
</ods:chart>
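For a data source that generates its catalog dynamically, the repeated ods:dataset references in an entry like this could be produced programmatically. The sketch below builds a reduced ods:chart element with one reference per year; the `ods` namespace URI is a hypothetical placeholder, the `dct` URI is an assumed binding, and `build_chart` is an illustrative helper rather than anything defined by the standard.

```python
import xml.etree.ElementTree as ET

ODS = "http://example.org/ods#"    # hypothetical placeholder URI
DCT = "http://purl.org/dc/terms/"  # assumed binding for the dct: prefix

# Ask ElementTree to serialize with the prefixes used in the examples.
ET.register_namespace("ods", ODS)
ET.register_namespace("dct", DCT)


def build_chart(identifier: str, title: str, dataset_urls: list) -> ET.Element:
    """Build an ods:chart element that references every source dataset."""
    chart = ET.Element(f"{{{ODS}}}chart")
    ET.SubElement(chart, f"{{{DCT}}}title").text = title
    ET.SubElement(chart, f"{{{DCT}}}identifier").text = identifier
    # One ods:dataset child per dataset the chart draws on.
    for url in dataset_urls:
        ET.SubElement(chart, f"{{{ODS}}}dataset").text = url
    return chart


if __name__ == "__main__":
    base = "https://fda.demo.socrata.com/resource/egg-recalls-"
    chart = build_chart(
        "recalls-per-manufacturer",
        "Number of recalls per manufacturer (2004 - 2012)",
        [f"{base}{year}" for year in range(2004, 2013)],
    )
    print(ET.tostring(chart, encoding="unicode"))
```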
Description: A map representation of the data in one or more datasets. The metadata fields provided for this Representation type were inspired by standards such as ISO 19115:2003.
| Name | Property or type | Description |
|---|---|---|
| extents | TBD | The geographic extents that this map covers. |
| projection | TBD | The geospatial projection of this map. |
Description: A "Derived Dataset" is a data product that is the result of filtering or combining one or more datasets in a particular way. For example:
- Taking two datasets and joining them on a common identifier
- Taking a larger dataset and filtering it down to represent a specific subset
- Taking a larger dataset and aggregating it in some way
- Modifying a dataset as the result of a computation or conversion, such as translating a dataset from one language to another, or converting the currencies or units of a dataset
At a minimum, every derived dataset should include a description of the modifications that were performed in its Description field. A more formal description of the steps involved will be considered in the future, but is out of scope for now.
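The four kinds of derivation listed above can be illustrated on a tiny in-memory dataset. Everything in this sketch is made up for illustration: the records, the field names, the manufacturer names, and the conversion rate. Each derived result carries a plain-text description of the modification, in the spirit of the minimum requirement.

```python
from collections import Counter

# Made-up source data, for illustration only.
RECALLS = [
    {"id": 1, "maker_id": "a", "year": 2011, "cost_usd": 100.0},
    {"id": 2, "maker_id": "b", "year": 2012, "cost_usd": 250.0},
    {"id": 3, "maker_id": "a", "year": 2012, "cost_usd": 50.0},
]
MAKERS = {"a": "Acme Eggs", "b": "Best Farms"}
USD_TO_EUR = 0.5  # invented rate, not a real exchange rate


def derive_all(recalls: list, makers: dict) -> dict:
    """Return one derived dataset per kind of derivation, with descriptions."""
    return {
        "joined": {
            "description": "Recalls joined to manufacturers on maker_id",
            "rows": [{**r, "maker": makers[r["maker_id"]]} for r in recalls],
        },
        "filtered": {
            "description": "Recalls filtered to the year 2012",
            "rows": [r for r in recalls if r["year"] == 2012],
        },
        "aggregated": {
            "description": "Number of recalls per manufacturer",
            "rows": dict(Counter(r["maker_id"] for r in recalls)),
        },
        "converted": {
            "description": "Costs converted from USD to EUR at a fixed rate",
            "rows": [{**r, "cost_eur": r["cost_usd"] * USD_TO_EUR} for r in recalls],
        },
    }


if __name__ == "__main__":
    for name, derived in derive_all(RECALLS, MAKERS).items():
        print(name, "-", derived["description"])
```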
If you’d like to contribute, please do! There are many ways to do so:
- Join the conversation on the Google Group for Open Data Standards
- Add your own comments to the spec and send us a pull request
- Add issues to the issue tracking system for this project
Inspirations and References
- DCAT Working Draft
- Dublin Core
- Vocab.org Gregorian Intervals
- CKAN Data Management System Documentation
- CKAN Data Catalog Schema and Protocol v0.1
- Data.gov’s API Catalog proposal
This work is licensed under Creative Commons Attribution 3.0.