Hopefully
I won't make this posting too rambling. Full Disclosure: I am not a
database expert of any kind. For the most part, all GISers are probably
familiar enough with what a database, or more specifically a Relational
Database is. Pretty much any entry-level GIS textbook discusses it, and
you might even use one like the personal geodatabase in ArcGIS or something
more sophisticated like Oracle Spatial. The idea behind the relational
database (and I realize I will butcher this) is that you have a series of
tables, and those tables are related to one another through primary keys and
foreign keys stored in those tables. There is a movement, and I'm not
sure how long it has been going, for a non-relational data storage system that
has been called NoSQL. Google uses a NoSQL data store
called Bigtable, and Facebook developed one called Cassandra, which is
now an Apache project. NoSQL is a bit all-encompassing, because it really covers any
of the data stores that are not the classic Relational Databases. These
cover a range of things that I don't really want to get into. The two
that are of most interest to me are the Document Store and Graph
Databases. Document stores are cool because they store, well, documents
rather than tables. Each document can have its own set of properties and
values, and you aren't tied to a database schema.
Typically the documents are stored in a format like XML or JSON (JavaScript
Object Notation). It isn't hard to make a leap of storing GIS geometry in
a Document-oriented Database, because there already exists a specification
called GeoJSON. Personally, I find it freeing not to be tied to a
database schema, since I find schemas difficult to design.
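To make the document idea a little more concrete, here is a minimal Python sketch. It is not how any particular document store works; it just keeps GeoJSON features in a plain list and searches them by property. The feature names, the properties, and the find helper are all made up for illustration.

```python
import json

# A hypothetical in-memory "collection". A real document store would
# persist and index these; here they are just Python dicts.
# Note that the two documents do not share the same set of properties.
documents = [
    {
        "type": "Feature",
        "geometry": {"type": "Point", "coordinates": [-105.0, 39.7]},
        "properties": {"name": "Well 12", "depth_m": 85},
    },
    {
        "type": "Feature",
        "geometry": {
            "type": "Polygon",
            "coordinates": [
                [[0.0, 0.0], [1.0, 0.0], [1.0, 1.0], [0.0, 1.0], [0.0, 0.0]]
            ],
        },
        "properties": {"name": "Parcel A", "owner": "Smith", "zoning": "R1"},
    },
]


def find(collection, **criteria):
    """Return documents whose properties match every key/value given."""
    return [
        doc
        for doc in collection
        if all(doc["properties"].get(k) == v for k, v in criteria.items())
    ]


parcels = find(documents, owner="Smith")
print(json.dumps(parcels[0]["geometry"]))
```

The point and the polygon sit side by side in the same collection, and neither one had to be declared in a schema first.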
But
now for the meat of this post: the graph database. In case you don't know, a graph
in this sense is not a chart. It is a mathematical structure for modeling
relationships. We GISers are most familiar with its form as a network, or
transportation network. Graphs are made up of nodes, or vertices, and
edges that connect nodes. Importantly, an edge may have direction or no
direction. For example, if node1 and node2 are mutual friends, an undirected
edge simply connects them; if node1 considers node2 a friend but node2 doesn't
consider node1 a friend, the edge points only from node1 to node2.
As you can probably guess, graphs are used extensively in social
network analysis. A graph database is a database that stores data as a
graph, or I suppose multiple graphs. The emphasis is on the relationship
between the nodes of data. Personally, I think this type of database is
the obvious direction that spatially enabled databases should take. A lot
of our spatial analysis tasks involve searching the relationships between
data. This could really expand those functions, and potentially make them
quicker. There are at least two areas that come to mind when I think of
these possibilities. One is topology. What is topology to us but
the relationship between different geometries? Here is a graph of the
topological relationship of some theoretical data:
One
thing that might be obvious from this is that we are used to separating out our
polygons into different tables or shapefiles that group our data. At a
higher level geometry is grouped by type: polygon, point, and polyline.
But with the graph database that wouldn't be necessary and we would be able to
search for data based on their relationship with each other. This
presents new analytical possibilities because data is no longer separate.
See Tim Berners-Lee's TED talk for more info about linked data.
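As a rough illustration of the topology idea, here is a small Python sketch of a graph whose nodes are features of mixed geometry types and whose edges are topological relationships. The Graph class, the feature names, and the relationship labels are all hypothetical; a real graph database would have its own API and query language. Symmetric relationships such as "touches" are stored as undirected edges, while "contains" is directed.

```python
from collections import defaultdict


class Graph:
    """A tiny adjacency-list graph with labelled edges (illustrative only)."""

    def __init__(self):
        self.nodes = {}
        self.edges = defaultdict(list)  # node id -> [(relation, other id)]

    def add_node(self, node_id, **props):
        self.nodes[node_id] = props

    def add_edge(self, source, relation, target, directed=True):
        self.edges[source].append((relation, target))
        if not directed:
            # An undirected edge can be followed from either end.
            self.edges[target].append((relation, source))

    def related(self, node_id, relation=None):
        """Ids reachable in one step, optionally filtered by relation."""
        return [
            other
            for rel, other in self.edges.get(node_id, [])
            if relation is None or rel == relation
        ]


g = Graph()
# Polygons, a polyline, and a point all live in the same graph.
g.add_node("parcel_a", geometry="polygon")
g.add_node("parcel_b", geometry="polygon")
g.add_node("road_1", geometry="polyline")
g.add_node("well_12", geometry="point")

g.add_edge("parcel_a", "touches", "parcel_b", directed=False)
g.add_edge("parcel_a", "contains", "well_12")  # directed
g.add_edge("road_1", "crosses", "parcel_b", directed=False)

print(g.related("parcel_b"))
print(g.related("parcel_a", "contains"))
```

Asking what is related to parcel_b returns both a polygon and a road, without having to know which table or shapefile each one came from.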
The other
possibility that I see with this is relationships between metadata.
Metadata in a GIS is boring. Yes it is important, but no one seems to use
it, and it is tedious to create. FGDC is a pain. Metadata through
relationships sounds a lot more interesting to me. Searching for related
information by who created it, by region or area, or by time period could be
really useful.
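Here is a minimal Python sketch of what I mean, with metadata stored as simple (dataset, relation, value) triples rather than as a form to fill out. The dataset names, creators, and regions are invented, and related_datasets is just a helper I made up for the example.

```python
# Hypothetical metadata expressed as relationships instead of a record.
triples = [
    ("roads_2009", "created_by", "J. Doe"),
    ("parcels_2010", "created_by", "J. Doe"),
    ("parcels_2010", "covers", "Region 5"),
    ("wells_2008", "covers", "Region 5"),
    ("wells_2008", "created_by", "A. Roe"),
]


def related_datasets(dataset, triples):
    """Map each other dataset to the (relation, value) pairs it shares."""
    mine = {(rel, val) for subj, rel, val in triples if subj == dataset}
    shared = {}
    for subj, rel, val in triples:
        if subj != dataset and (rel, val) in mine:
            shared.setdefault(subj, []).append((rel, val))
    return shared


links = related_datasets("parcels_2010", triples)
print(links)
```

Starting from one dataset, this finds the others that share a creator or a region with it, which is the kind of search I would actually want to use metadata for.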
Anyway,
those are my thoughts on how NoSQL should be the next step in the GIS world.