Data-catalog-schema

Data and API Catalog Schema

Data sources need to be able to advertise their APIs and datasets so that catalogs can pick them up. Doing this takes two pieces:
  1. Files or endpoints that catalogs or search engines can access to find the appropriate APIs and datasets. This is similar to how sitemap.xml works for web sites.
  2. A standard format for describing APIs and datasets, so that callers can gather the appropriate information. A number of formats have been suggested for this; this standard starts with CKAN's Data Catalog Schema and Protocol modifications on top of DCAT.

catalog.xml and apis.xml

Any data source that provides datasets or APIs can expose them to catalogs and search engines through a few endpoints. These endpoints are designed so that they can be implemented either by simply creating a file or by building a more dynamic web service. This matters because one goal of the standard is to let simple data stores expose APIs or datasets through a catalog. The available endpoints are:

In addition, catalogs MAY provide queryable endpoints above and beyond catalog.xml. They are free to do so, but the responses should use the same format described below.
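
As a rough illustration of how a catalog or search engine might locate these files, the sketch below builds candidate URLs for a data source. The file names come from the heading above; serving them directly under the source's base URL is an assumption made for this example, and `endpoint_urls` is a hypothetical helper, not part of the standard.

```python
from urllib.parse import urljoin

# File names taken from the section heading. Serving them directly under
# the base URL is an assumption of this sketch, not a rule of the standard.
ENDPOINT_FILES = ("catalog.xml", "apis.xml")


def endpoint_urls(base_url):
    """Return a dict mapping each endpoint file name to a candidate URL."""
    # Add a trailing slash so urljoin appends the file name instead of
    # replacing the last path segment.
    if not base_url.endswith("/"):
        base_url += "/"
    return {name: urljoin(base_url, name) for name in ENDPOINT_FILES}
```

A crawler could then try each URL in turn and skip the ones that do not resolve.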

Examples

apis.xml example

catalog.xml example

NOTE: It would be nice to standardize the catalog query API as well. This could be an area for more work.

The Format

We think about the format of the endpoints in two ways: the data model and the actual file formats. The data model is based on DCAT, and the default file format is XML, based on RDF. Some dynamic endpoints MAY, however, decide to offer other formats such as N3 or JSON.
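
The text does not say how a dynamic endpoint would let callers choose among formats. One possible approach, sketched here purely as an assumption, is HTTP content negotiation on the `Accept` header with RDF/XML as the fallback. `pick_media_type` is a hypothetical helper, and quality values are ignored to keep the sketch short.

```python
# Media types for the serializations mentioned in the text.
# RDF/XML is treated as the default, as the text describes.
SUPPORTED = ("application/rdf+xml", "text/n3", "application/json")
DEFAULT = "application/rdf+xml"


def pick_media_type(accept_header):
    """Return the first supported media type listed in an Accept header.

    Quality values (q=...) are ignored for brevity. Falls back to RDF/XML
    when nothing in the header is supported.
    """
    for part in accept_header.split(","):
        media_type = part.split(";")[0].strip().lower()
        if media_type in SUPPORTED:
            return media_type
    return DEFAULT
```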

Differences from DCAT

There are a few differences from DCAT proper: a series of properties we suggest not using, and a new set of properties that extend DCAT.
  1. Don't use dcat:dataDictionary. Save for later use, when it is better defined.
  2. Don't use dcat:dataQuality. Save for later use, when it is better defined.
  3. Don't use dcat:granularity. Save for later use, when it is better defined.
  4. Don't use dc:references. Save for later use, when it is better defined.
The new extensions to DCAT extend distributions to handle more specifically the cases where an API, chart or graph is created from multiple datasets. More properties are also added to a new Representation class derived from Distribution, because many datasets, charts, graphs, maps, etc. will require their own titles, descriptions, identifiers, etc.
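
A publisher could check a document for the discouraged properties before releasing it. The following is a minimal sketch, assuming the `dcat` prefix is bound to `http://www.w3.org/ns/dcat#` and the `dc` prefix to `http://purl.org/dc/terms/`; the document does not declare its namespaces, so both bindings are assumptions. `find_discouraged` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

# Namespace bindings are assumptions; the document does not declare them.
DCAT = "http://www.w3.org/ns/dcat#"
DCT = "http://purl.org/dc/terms/"

# ElementTree tag names in {namespace}local form, mapped to the prefixed
# names used in the list of properties to avoid.
DISCOURAGED = {
    "{%s}dataDictionary" % DCAT: "dcat:dataDictionary",
    "{%s}dataQuality" % DCAT: "dcat:dataQuality",
    "{%s}granularity" % DCAT: "dcat:granularity",
    "{%s}references" % DCT: "dc:references",
}


def find_discouraged(xml_text):
    """Return the discouraged properties used in xml_text, in document order."""
    found = []
    for element in ET.fromstring(xml_text).iter():
        name = DISCOURAGED.get(element.tag)
        if name is not None and name not in found:
            found.append(name)
    return found
```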

Representation

Range: dcat:distribution

Description: Describes a resource provided by this data service (or catalog). It is based on dcat:distribution, but extends it with a number of properties that add context, since this element does not appear as part of a catalog entry.

Property Type Description
title dct:title The name or title for this representation. If not present, implementers should use the name from the related Dataset.
description dct:description Text describing this representation. If not present, implementers should use the description from the related Dataset.
identifier dct:identifier Unique identifier for this data representation.
keyword dcat:keyword Keywords specific to this representation. If the dataset has keywords as well, the keywords for this object should be the union of both sets.
dataset rdfs:Resource Reference to the datasets this resource is a part of. If it aggregates or uses multiple datasets, this contains all the datasets referenced.
last_modified xsd:dateTime The time this was last modified.
created xsd:dateTime The time this was initially created.
authentication rdfs:Literal Authentication type required for this resource. "None" means it is open to the public. "Basic" means basic authentication. "OAuth" and "OAuth2" refer to OAuth version 1.0 and 2.0, respectively. This refers ONLY to the mechanism, not to how authorization is done.
contact_email rdfs:Literal An optional field with the email address of the person to contact about this dataset.
documentation rdfs:Resource The documentation for this site. Should link to human-readable documentation.

Clarifications

Every resource MUST have at least one dcat:format property.
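
The fallback and union rules in the table can be sketched in a few lines. This is only an illustration under simplifying assumptions: the class and function names are hypothetical, and a representation is resolved against a single related dataset even though the `dataset` property may reference several.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Dataset:
    title: str
    description: str
    keywords: List[str] = field(default_factory=list)


@dataclass
class Representation:
    identifier: str
    formats: List[str]
    title: Optional[str] = None
    description: Optional[str] = None
    keywords: List[str] = field(default_factory=list)

    def __post_init__(self):
        # Every resource must carry at least one format.
        if not self.formats:
            raise ValueError("a representation needs at least one format")


def effective_title(rep, dataset):
    """Use the representation's title, else fall back to the dataset's."""
    return rep.title or dataset.title


def effective_description(rep, dataset):
    """Use the representation's description, else fall back to the dataset's."""
    return rep.description or dataset.description


def effective_keywords(rep, dataset):
    """Union of both keyword lists, representation keywords first."""
    merged = []
    for keyword in rep.keywords + dataset.keywords:
        if keyword not in merged:
            merged.append(keyword)
    return merged
```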

Api

Range: ods:representation

Description: This is an API that accesses a particular dataset or set of datasets.

Property Type Description
client_library rdfs:Resource References a client library that can be used to access this API.
api_type Literal A string denoting a particular API this endpoint supports.

Examples

<ods:api>
    <dct:title>SODA Api for Current Recalls</dct:title>
    <dct:description>
    This is the SODA Api for getting access to all the egg products that are actively being investigated
    </dct:description>
    <dct:identifier>ongoing-egg-recalls</dct:identifier>
    <dcat:accessURL>https://fda.demo.socrata.com/resource/ongoing-egg-recalls</dcat:accessURL>
    <dcat:keyword>Eggs</dcat:keyword>
    <dcat:keyword>Recalls</dcat:keyword>

    <ods:authentication>Basic</ods:authentication>
    <ods:documentation>https://fda.demo.socrata.com/developers/docs/egg-recalls</ods:documentation>
    <ods:client_library>https://github.com/socrata/soda-java</ods:client_library>
    <ods:client_library>https://github.com/socrata/soda-scala</ods:client_library>
    <ods:client_library>https://github.com/socrata/socrata-php</ods:client_library>
    <ods:client_library>https://github.com/socrata/soda-js</ods:client_library>
    <ods:client_library>https://github.com/socrata/socrata-python</ods:client_library>
    <ods:client_library>https://github.com/socrata/socrata-api-csharp</ods:client_library>

    <ods:contact_email>someone@fda.gov</ods:contact_email>

    <!-- MIME Types provided -->
    <dct:format>application/json</dct:format>
    <dct:format>application/rdf+xml</dct:format>
    <dct:format>text/csv</dct:format>

    <!-- What type of API it is -->
    <ods:api_type>rest/soda</ods:api_type>
    <ods:last_modified>2012-10-10T16:46:00+08:00</ods:last_modified>
    <ods:created>2012-10-10T16:46:00+08:00</ods:created>
</ods:api>
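
A consumer might read such an entry along the following lines. This is a sketch, not a reference parser: the example above does not declare its namespaces, so the `ods` URI below is a placeholder invented for illustration, and the `dct` and `dcat` URIs are assumed to be the usual Dublin Core Terms and DCAT namespaces. `summarize_api` is a hypothetical helper, and `EXAMPLE` is a shortened copy of the entry with namespace declarations added so that it parses on its own.

```python
import xml.etree.ElementTree as ET

NS = {
    "dct": "http://purl.org/dc/terms/",    # assumed binding
    "dcat": "http://www.w3.org/ns/dcat#",  # assumed binding
    "ods": "http://example.org/ods#",      # placeholder, not a real namespace
}

EXAMPLE = """<ods:api xmlns:ods="http://example.org/ods#"
         xmlns:dct="http://purl.org/dc/terms/"
         xmlns:dcat="http://www.w3.org/ns/dcat#">
    <dct:title>SODA Api for Current Recalls</dct:title>
    <dcat:accessURL>https://fda.demo.socrata.com/resource/ongoing-egg-recalls</dcat:accessURL>
    <ods:authentication>Basic</ods:authentication>
    <ods:client_library>https://github.com/socrata/soda-java</ods:client_library>
    <dct:format>application/json</dct:format>
    <dct:format>text/csv</dct:format>
    <ods:api_type>rest/soda</ods:api_type>
</ods:api>"""


def summarize_api(xml_text):
    """Extract the main fields of an ods:api element into a dict."""
    api = ET.fromstring(xml_text)

    def first(path):
        element = api.find(path, NS)
        return element.text.strip() if element is not None and element.text else None

    def every(path):
        return [e.text.strip() for e in api.findall(path, NS) if e.text]

    return {
        "title": first("dct:title"),
        "access_url": first("dcat:accessURL"),
        "authentication": first("ods:authentication"),
        "api_type": first("ods:api_type"),
        "formats": every("dct:format"),
        "client_libraries": every("ods:client_library"),
    }
```

Single-valued properties come back as a string or `None`; repeatable ones such as `dct:format` and `ods:client_library` come back as lists.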

Chart

Range: ods:representation

Description: A chart built on top of the associated datasets.

Example

The following example surfaces a chart that is based on several different datasets, one for each year of egg recalls. Since this does not fit neatly into the dataset-down approach of DCAT, we annotate it with the dataset property from ods:Representation.

<ods:chart>
    <dct:title>Number of recalls per manufacturer (2004 - 2012)</dct:title>
    <dct:description>
    This chart graphs the number of egg product recalls per manufacturer for the years 2004 - 2012
    </dct:description>
    <dct:identifier>recalls-per-manufacturer</dct:identifier>
    <dcat:accessURL>https://fda.demo.socrata.com/resource/recalls-per-manufacturer</dcat:accessURL>
    <dcat:keyword>Eggs</dcat:keyword>
    <dcat:keyword>Recalls</dcat:keyword>

    <ods:contact_email>someone@fda.gov</ods:contact_email>

    <!-- MIME Types provided -->
    <dct:format>text/html</dct:format>
    <dct:format>image/png</dct:format>


    <!-- Since this chart uses data from many different datasets, it references them here -->
    <ods:dataset>https://fda.demo.socrata.com/resource/egg-recalls-2004</ods:dataset>
    <ods:dataset>https://fda.demo.socrata.com/resource/egg-recalls-2005</ods:dataset>
    <ods:dataset>https://fda.demo.socrata.com/resource/egg-recalls-2006</ods:dataset>
    <ods:dataset>https://fda.demo.socrata.com/resource/egg-recalls-2007</ods:dataset>
    <ods:dataset>https://fda.demo.socrata.com/resource/egg-recalls-2008</ods:dataset>
    <ods:dataset>https://fda.demo.socrata.com/resource/egg-recalls-2009</ods:dataset>
    <ods:dataset>https://fda.demo.socrata.com/resource/egg-recalls-2010</ods:dataset>
    <ods:dataset>https://fda.demo.socrata.com/resource/egg-recalls-2011</ods:dataset>
    <ods:dataset>https://fda.demo.socrata.com/resource/egg-recalls-2012</ods:dataset>


    <ods:last_modified>2012-10-10T16:46:00+08:00</ods:last_modified>
    <ods:created>2012-10-10T16:46:00+08:00</ods:created>
</ods:chart>
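
An entry like this, with one `ods:dataset` reference per source dataset, could be generated programmatically. The sketch below shows one way to do it with Python's standard library. The `ods` namespace URI is a placeholder invented for illustration, the `dct` and `dcat` URIs are assumed bindings, and `build_chart` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

DCT = "http://purl.org/dc/terms/"    # assumed binding
DCAT = "http://www.w3.org/ns/dcat#"  # assumed binding
ODS = "http://example.org/ods#"      # placeholder, not a real namespace

# Register prefixes so the serialized output uses dct:, dcat: and ods:.
for _prefix, _uri in (("dct", DCT), ("dcat", DCAT), ("ods", ODS)):
    ET.register_namespace(_prefix, _uri)


def build_chart(identifier, title, access_url, formats, dataset_urls):
    """Build an ods:chart element that references every source dataset."""
    if not formats:
        raise ValueError("a chart needs at least one format")
    chart = ET.Element("{%s}chart" % ODS)

    def add(namespace, name, value):
        ET.SubElement(chart, "{%s}%s" % (namespace, name)).text = value

    add(DCT, "title", title)
    add(DCT, "identifier", identifier)
    add(DCAT, "accessURL", access_url)
    for media_type in formats:
        add(DCT, "format", media_type)
    # One ods:dataset element per dataset the chart draws on.
    for url in dataset_urls:
        add(ODS, "dataset", url)
    return chart
```

Calling `ET.tostring(chart, encoding="unicode")` on the result yields the serialized element.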

Map

Range: ods:representation

Description: A map representation of the data in one or more datasets. The metadata fields provided for this Representation type were inspired by standards such as ISO 19115:2003.

Property Type Description
extents TBD The geographic extents that this map covers.
projection TBD The geospatial projection of this map

DerivedDataset

Range: ods:representation

Description: A "Derived Dataset" is a data product that is the result of filtering or combining one or more datasets in a particular way. For example:

At a minimum, every derived dataset should include a description of the modifications that were performed in its Description field. A more formal description of the steps involved will be considered in the future, but is out of scope for now.
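
To make the description requirement concrete, here is a small sketch of a filtering step that records what it did alongside the result. Everything in it is hypothetical: the `derive` helper, the shape of the rows, and the wording of the generated description are illustrations, not part of the standard.

```python
def derive(rows, predicate, step_description, source_ids):
    """Filter rows and record how the derived dataset was produced.

    rows: list of dicts, one per record (hypothetical shape).
    predicate: function deciding which rows to keep.
    step_description: human-readable text describing the modification.
    source_ids: identifiers of the datasets the rows came from.
    """
    kept = [row for row in rows if predicate(row)]
    description = "Derived from %s: %s" % (", ".join(source_ids), step_description)
    return {
        "description": description,   # goes in the Description field
        "datasets": list(source_ids),  # the datasets this was derived from
        "rows": kept,
    }
```

Keeping the description next to the data means the provenance text is produced by the same step that performs the modification.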

Contributing

If you’d like to contribute, please do! There are many ways to do so:

Inspirations and References

License

This work is licensed under the Creative Commons Attribution 3.0 license.