Document-oriented database
A document-oriented database is a computer program designed for storing, retrieving, and managing document-oriented information, also known as semi-structured data. Document-oriented databases are one of the main categories of NoSQL databases, and the popularity of the term "document-oriented database" (or "document store") has grown[1] with the use of the term NoSQL itself. In contrast to relational databases and their notion of a "Relation", i.e., a table whose rows (or tuples) hold related, strongly typed data items, these systems are designed around an abstract notion of a "Document".
Documents
The central concept of a document-oriented database is that Documents, in largely the usual English sense, contain vast amounts of data which can usefully be made available. Document-oriented database implementations differ widely in detail and functionality. Most accept documents in a variety of forms, and encapsulate them in a standardized internal format, while extracting at least some specific data items that are then associated with the document.
A trivial example would be scanning paper documents, extracting the title, author, and date from them either by OCR or by having a human locate and enter them, and storing each document in a 4-column relational database, the columns being author, title, date, and a blob full of page images. Some document-oriented databases do essentially the same thing, but with PDF (which may or may not contain text rather than images of text).
Today much more can be accomplished, and an effective document-oriented database must extract and manage a great deal more information about the documents it manages. Fortunately, documents are usually now available in more usable forms. A great deal of publishing is done in HTML, XML, TeX, or systems that can at least export or convert to those. Many other documents in the real world are emails, which also have a moderate amount of metadata available explicitly in their headers. In such cases a document database has access not just to images but to words, phrases, paragraph boundaries, and descriptive labels indicating the significance of parts of the text ("footnote", "chapter", "author name", etc.), and can make all that available for searching, statistical analysis, data mining, and other uses. Even when data is not in high-value forms such as these, modern document-oriented databases can often extract meaningful components via heuristic and other methods.
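As a minimal sketch of the kind of extraction described above, the following uses Python's standard `email` package to pull explicit header metadata out of a message. The sample message, the `extract_metadata` helper, and the output field names are illustrative assumptions, not part of any particular document database.

```python
from email import message_from_string
from email.utils import parsedate_to_datetime

# Hypothetical sample message; the addresses and subject are made up.
RAW_EMAIL = """\
From: Bob Smith <bob@example.org>
To: alice@example.org
Subject: Draft of the introduction
Date: Mon, 16 Sep 2013 10:30:00 +0000

Here is the first section of the paper.
"""


def extract_metadata(raw):
    """Collect a few explicit header fields plus one derived value."""
    msg = message_from_string(raw)
    return {
        "author": msg["From"],
        "title": msg["Subject"],
        "date": parsedate_to_datetime(msg["Date"]).date().isoformat(),
        # Derived from the body text rather than from a header.
        "word_count": len(msg.get_payload().split()),
    }


metadata = extract_metadata(RAW_EMAIL)
print(metadata)
```

A real system would typically store the original message unchanged alongside the extracted fields, so that nothing is lost if the extraction rules change later.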
In a non-document database, there is generally a very small range of fields, many or most of which can only occur in extremely limited contexts, and which are generally required in those contexts. For example, a "person" record might consist of first and last names, address, city, country, work phone, home phone, and so on. Importantly, none of those fields has much internal structure, or repeats. Relational database implementations often require that any repeatable field be mapped into a separate table, in which multiple records refer back to the record they relate to in the original table via a "foreign key" attribute. Likewise, relational database implementations may not readily allow complex structure within a given field, since fields tend to be limited to a few atomic datatypes such as integers, dates, and strings. (This, however, may be relaxed: The PostGIS extension of PostgreSQL makes available geometric attribute types. This makes it possible to store complex geometric objects in fields which can then be processed via geometric relational operators. Another example, also from the PostgreSQL implementation, is a native XML attribute type that can be queried via a native "xpath" operator.)
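The foreign-key mapping described above can be sketched with Python's built-in `sqlite3` module. The table and column names (`person`, `phone`, `person_id`) and the sample values are assumptions chosen for illustration: the repeatable "phone" field moves to its own table, and each phone row refers back to its person.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE person (
        id         INTEGER PRIMARY KEY,
        first_name TEXT NOT NULL,
        last_name  TEXT NOT NULL
    );
    -- The repeatable field gets its own table; person_id is the foreign key.
    CREATE TABLE phone (
        person_id INTEGER NOT NULL REFERENCES person(id),
        kind      TEXT NOT NULL,
        number    TEXT NOT NULL
    );
""")
conn.execute("INSERT INTO person VALUES (1, 'Bob', 'Smith')")
conn.executemany(
    "INSERT INTO phone VALUES (?, ?, ?)",
    [(1, "work", "555-0100"), (1, "home", "555-0101")],
)

# Reassembling one person's data requires a join across both tables.
rows = conn.execute(
    "SELECT p.first_name, ph.kind, ph.number "
    "FROM person AS p JOIN phone AS ph ON ph.person_id = p.id "
    "ORDER BY ph.kind"
).fetchall()
print(rows)
```

In a document model the same information could instead be kept as one nested record, with the phone numbers held as a repeated component inside the person.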
Documents, in contrast, are structured in ways accessible to humans as well as computers. They are characterized by extremely frequent re-use of small components (words and phrases, but also component types such as "paragraph" or "footnote"), and by a very free mixture of those types, as compared to the mixtures allowed in traditional databases. Hamlet is a document, consisting of structural units such as acts, scenes, speeches, attributions, stage directions, and notes. An entry in a smartphone address book is a "document" but only barely so; it far more closely resembles a single record in a relational or similar database.
Extracted metadata can be represented in almost any format, such as XML, YAML, JSON, or BSON. The document itself, however, is usually also stored, at least as a blob in its original format, which may be XML, PDF, a proprietary or binary word-processor format, or "plain text". The functionality of the database depends largely on the format in which documents reach it and on the database's ability to extract specific data from that format.
Documents inside a document-oriented database are similar, in some ways, to records or rows in relational databases, but they have vastly more internal structure (the extent to which the database itself is aware of that structure, and can use it, varies). Documents, particularly in XML, TeX, and other high-end formats, often adhere to a formal schema; but many documents do not, or if they do, the schema is not explicit. For example, the following is a document:
<Article>
  <Author>
    <FirstName>Bob</FirstName>
    <Surname>Smith</Surname>
  </Author>
  <Abstract>This paper concerns....</Abstract>
  <Section n="1"><Title>Introduction</Title>
    <Para>...</Para>
  </Section>
</Article>
A second document, even of the same genre and schema, may have a far different number and arrangement of sections, paragraphs, and the like; it may have multiple co-authors; it may have much other metadata such as copyright or publication information, bibliographic references to other documents (in the same or other databases, or in no database at all), and so on.
Two such documents typically share many structural elements with one another, but each may also have elements the other does not. Unlike a relational database, where every record contains the identical sequence of fields (a few of which may be empty or hold missing-value indicators), document structures generally allow for an unbounded number of hierarchically organized components, with extensive repetition. It would be absurd, for example, to design a database with a table for "sections" that tried to provide as many fields as the number of paragraphs in the longest section one will ever see (not to mention the many other kinds of document components that appear within sections). Even if one did, naming fields in a relation something like "p1", "p2", ... does not, so far as the database is concerned, indicate that those fields have anything to do with one another, or belong in a certain meaningful order. To avoid confusion with the quite different notion of database "fields", document databases may refer to the parts of documents as "components" or "elements".
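The point about shared but varying structure can be illustrated with Python's standard `xml.etree.ElementTree` module. In this sketch the first document is adapted from the article example above (with the paragraph element closed so that it parses); the second document, the `summarize` helper, and its output keys are hypothetical.

```python
import xml.etree.ElementTree as ET

DOC_A = """<Article>
  <Author><FirstName>Bob</FirstName><Surname>Smith</Surname></Author>
  <Abstract>This paper concerns....</Abstract>
  <Section n="1"><Title>Introduction</Title><Para>...</Para></Section>
</Article>"""

# Hypothetical second article: two authors, no abstract, two sections
# with different numbers of paragraphs.
DOC_B = """<Article>
  <Author><FirstName>Ann</FirstName><Surname>Lee</Surname></Author>
  <Author><FirstName>Raj</FirstName><Surname>Patel</Surname></Author>
  <Section n="1"><Title>Background</Title><Para>...</Para><Para>...</Para></Section>
  <Section n="2"><Title>Method</Title>
    <Para>...</Para><Para>...</Para><Para>...</Para>
  </Section>
</Article>"""


def summarize(xml_text):
    """Report which components a document has and how often they repeat."""
    root = ET.fromstring(xml_text)
    return {
        "authors": [a.findtext("Surname") for a in root.findall("Author")],
        "has_abstract": root.find("Abstract") is not None,
        "paras_per_section": [
            len(s.findall("Para")) for s in root.findall("Section")
        ],
    }


summary_a = summarize(DOC_A)
summary_b = summarize(DOC_B)
print(summary_a)
print(summary_b)
```

Both documents use the same component types, yet no fixed row layout with columns such as "p1", "p2" would fit both without leaving most fields empty.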
Documents, however, often conform to formal schemata which constrain just what classes of components are allowed, and where. TeX provides a wide range of components, though authors can create their own as well. The many established schemata for use with XML are similar, but authors can also create or use a formal schema in a schema language such as DTD, XSD, Relax NG, or Schematron. Among the most widely used schemata are JATS for technical journals; Text Encoding Initiative for literary works; DocBook for computer systems manuals; and HTML for Web publication.
Some of the most popular Web sites are document databases: the many collections of articles at pubmed.gov or major journal publishers; Wikipedia and its kin; and even search engines (though many of those store links to indexed documents, rather than the full documents themselves).
Keys and retrieval
Documents may be addressed in the database via a unique key that represents that document. This key is often a simple string, a URI, or a path. The key can be used to retrieve the document from the database. Typically, the database retains an index on the key to speed up document retrieval. The most primitive document databases may do little more than that. However, modern document-oriented databases provide far more, because they extract and index all kinds of metadata, and usually also the entire data content, of the documents. Such databases offer a query language that allows the user to retrieve documents based on their content. For example, a user may want to retrieve all the documents whose date falls within some range, or that contain a citation to another document. The set of query APIs or query language features available, as well as the expected performance of the queries, varies significantly from one implementation to the next.
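The two retrieval styles described above, lookup by key and query by content, can be sketched with a toy in-memory store. The `TinyDocumentStore` class, its method names, and the sample keys are all hypothetical and do not correspond to the API of any product listed in this article; real implementations differ widely.

```python
import bisect
from datetime import date, timedelta


class TinyDocumentStore:
    """Toy store: a primary index on key plus a secondary index on date."""

    def __init__(self):
        self._by_key = {}    # key -> document
        self._by_date = []   # sorted list of (date, key) pairs

    def put(self, key, document):
        if key in self._by_key:
            # Drop the stale secondary-index entry before overwriting.
            self._by_date.remove((self._by_key[key]["date"], key))
        self._by_key[key] = document
        bisect.insort(self._by_date, (document["date"], key))

    def get(self, key):
        return self._by_key[key]

    def find_by_date(self, start, end):
        """Return keys of documents dated within [start, end], in date order."""
        lo = bisect.bisect_left(self._by_date, (start,))
        hi = bisect.bisect_left(self._by_date, (end + timedelta(days=1),))
        return [key for _, key in self._by_date[lo:hi]]

    def find_citing(self, cited_key):
        """Return keys of documents that cite cited_key (a full scan)."""
        return sorted(
            key for key, doc in self._by_key.items()
            if cited_key in doc.get("cites", [])
        )


store = TinyDocumentStore()
store.put("/articles/smith-2013", {"date": date(2013, 9, 18), "cites": []})
store.put("/articles/lee-2014",
          {"date": date(2014, 3, 2), "cites": ["/articles/smith-2013"]})
store.put("/articles/patel-2015", {"date": date(2015, 2, 23), "cites": []})

in_range = store.find_by_date(date(2014, 1, 1), date(2015, 12, 31))
citing = store.find_citing("/articles/smith-2013")
print(in_range, citing)
```

The date query uses a sorted index and binary search, while the citation query scans every document. That difference is one reason query performance can vary so much between implementations and between query types.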
Organization
Implementations offer a variety of ways of organizing documents, including notions of:
- Collections
- Tags
- Non-visible metadata
- Directory hierarchies
- Buckets
Implementations
Name | Publisher | License | Language | Notes | RESTful API |
---|---|---|---|---|---|
ArangoDB | triAGENS | Apache License 2.0 | C, C++ & JavaScript | A distributed, multi-model, high-performance document store and graph database. | Yes [2] |
BaseX | BaseX Team | BSD License | Java, XQuery | Support for XML, JSON and binary formats; client-/server based architecture; concurrent structural and full-text searches and updates; REST APIs. | Yes |
Cassandra | Apache Software Foundation | Apache License | Java | JSON over HTTP | Yes |
Cloudant | Cloudant, Inc. | Proprietary | Erlang, Java, Scala, and C | Distributed database service based on BigCouch, the company's open source fork of the Apache-backed CouchDB project. | Yes |
Clusterpoint Database | Clusterpoint Ltd. | Free license | C, C++, SQL, PHP, Java, .NET, Python, Node.js | Distributed XML and JSON database server with secure high-performance ACID-compliant transactions; built-in full text search; database as a service[3] | Yes |
Couchbase Server | Couchbase, Inc. | Apache License | Erlang and C | Distributed NoSQL Document Database. | Yes [4] |
CouchDB | Apache Software Foundation | Apache License | Erlang | JSON over REST/HTTP with Multi-Version Concurrency Control and limited ACID properties. Uses map and reduce for views and queries.[5] | Yes [6] |
CryptonorDB | Dotissi SRL | Commercial | C# - .NET, Windows Store, Windows Phone, Xamarin.iOS, Xamarin.Android, Unity3D, Mono; Java - Android | Privacy aware cloud-mobile database, with client libraries for Windows Store, Windows Phone, Xamarin Android, Xamarin iOS, Android, Unity3D (iOS, Android, Windows Store, Windows Phone) | Yes |
eXist | eXist | LGPL | XQuery, Java | XML over REST/HTTP, WebDAV, Lucene Fulltext search, validation, versioning, clustering, triggers, URL rewriting, collections, ACLS, XQuery Update | Yes [7] |
FleetDB | FleetDB | MIT License | Clojure | A JSON-based schema-free database optimized for agile development. | (unknown) |
Informix | IBM | Proprietary | Various (Compatible with MongoDB API) | RDBMS with JSON, replication, sharding and ACID compliance | (unknown) |
Inquire | Infodata Systems, Inc. | Proprietary | unknown | In the mid-1980s this was the dominant commercial document-oriented database. The company seems to have gone out of business in 2005. | (unknown) |
Lotus Notes | IBM | Proprietary | LotusScript, Java, Lotus @Formula | | (unknown) |
MarkLogic | MarkLogic Corporation | Free Developer license or Commercial | REST, Java, XQuery, XSLT, C++ | Distributed document-oriented database with Multi-Version Concurrency Control, integrated Full text search and ACID-compliant transaction semantics | Yes |
MongoDB | MongoDB, Inc | GNU AGPL v3.0 for the DBMS, Apache 2 License for the client drivers[8] | C++ | Document database with replication and sharding | Optional [9] |
MUMPS Database[10] | | Proprietary and Affero GPL[11] | MUMPS | Commonly used in health applications. | (unknown) |
OrientDB | Orient Technologies | Apache License | Java | JSON over HTTP | Yes |
Qizx[12] | Qualcomm | Commercial | REST, Java, XQuery, XSLT, C, Python | Distributed document-oriented database with integrated Full text search | Yes |
RavenDB | Hibernating Rhinos LTD | Proprietary and modified Affero GPL[13] | C#, JavaScript | | Yes |
Redis | | BSD License | ANSI C | Key-value store supporting lists and sets with binary-safe protocol | (unknown) |
RethinkDB | | GNU AGPL for the DBMS, Apache 2 License for the client drivers | C++ | | (unknown) |
Rocket U2 | Rocket Software | Proprietary | | UniData, UniVerse | Yes (Beta) |
Sqrrl Enterprise | sqrrl | Proprietary | Java | Distributed, real-time database featuring cell-level security and massive scalability. | Yes |
Symport | Mountain Labs | Proprietary | Ruby on Rails, CoffeeScript, EmberJS | Secure web-based data collection and management platform tailored for research. | (unknown) |
XML database implementations
Most XML databases are document-oriented databases.
See also
- Database theory
- Data hierarchy
- Full text search
- In-memory database
- Internet Message Access Protocol (IMAP)
- NoSQL
- Object database
- Online database
- Real time database
- Relational database
References
- [1] DB-Engines Ranking per database model category
- [2] ArangoDB REST API
- [3] Clusterpoint Database
- [4] Documentation. Couchbase. Retrieved on 2013-09-18.
- [5] CouchDB Overview
- [6] CouchDB Document API
- [7] eXist-db Open Source Native XML Database. Exist-db.org. Retrieved on 2013-09-18.
- [8] MongoDB Licensing
- [9] MongoDB REST Interfaces
- [10] Extreme Database programming with MUMPS Globals
- [11] GTM MUMPS FOSS on SourceForge
- [12] "Qizx". Qualcomm Qizx. Retrieved 23 February 2015.
- [13] RavenDB Licensing
Further reading
- Assaf Arkin. (2007, September 20). Read Consistency: Dumb Databases, Smart Services. Labnotes:Don’t let the bubble go to your head!