Color-Of-Code

Funani: Data Storage

January 01, 2019

Data storage

Rationale

Data is stored on disk; metadata is stored in a database. How the metadata is structured in the database is defined in another blueprint: [metadata storage](metadata storage). Each metadata record also contains a link back to its data. The data is usually large and should not be stored inside the database. Keeping it on disk also gives better performance and makes backups easier.

Tasks

The data storage module's jobs are:

  • store data in the store and return a unique handle to it
  • load data from the store, given its handle
  • enumerate the store contents (for backup or verification)
  • ensure no duplicate file is in the store
  • keep the files as they are, without any change to them
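
The jobs listed above could be sketched roughly as follows. This is only a minimal illustration, not the Funani implementation: the class name `FileStore`, its methods `put`, `load` and `handles`, and the flat one-file-per-digest layout are all assumptions made for the example.

```python
import hashlib
import shutil
from pathlib import Path


class FileStore:
    """Hypothetical store that keeps one file per SHA-1 digest (flat layout)."""

    def __init__(self, root):
        self.root = Path(root)
        self.root.mkdir(parents=True, exist_ok=True)

    def put(self, source_path):
        """Copy a file into the store and return its handle (hex digest)."""
        hasher = hashlib.sha1()
        with open(source_path, "rb") as src:
            # Hash block by block so large files never sit fully in memory.
            for block in iter(lambda: src.read(65536), b""):
                hasher.update(block)
        handle = hasher.hexdigest()
        target = self.root / handle
        if not target.exists():
            # Byte-for-byte copy: the stored content is left unchanged.
            shutil.copyfile(source_path, target)
        # If the target already existed, the file is treated as a duplicate.
        return handle

    def load(self, handle):
        """Return a binary file object for the data behind a handle."""
        return open(self.root / handle, "rb")

    def handles(self):
        """List all handles in the store, e.g. for backup or verification."""
        return sorted(p.name for p in self.root.iterdir() if p.is_file())
```

Calling `put` twice with two files of identical content returns the same handle and leaves a single file in the store, which is how this sketch covers the "no duplicates" requirement. A real store would probably also want to copy to a temporary name and rename, so that an interrupted copy never leaves a partial file under a valid handle.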

Ideas

The handle to a file can be implemented as the SHA-1 sum of its content. That way, if a new file is uploaded and a file with the same SHA-1 sum already exists, the new file is a duplicate. Another benefit is that data consistency can be checked by recomputing the hash of a stored file and comparing it with its handle. For quick retrieval, the file name and directory inside the store can also be derived from the SHA-1 sum.
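
One possible way to derive the directory and file name from the hash is sketched below. The layout is an assumption chosen for illustration (Git stores its loose objects in a similar way): the first two hex characters name a subdirectory and the remaining 38 name the file. The function name `path_for_handle` is hypothetical.

```python
from pathlib import Path


def path_for_handle(root, handle):
    """Map a 40-character hex SHA-1 digest to a path below root.

    Assumed layout: <root>/<first 2 hex chars>/<remaining 38 hex chars>.
    """
    handle = handle.lower()
    if len(handle) != 40 or any(c not in "0123456789abcdef" for c in handle):
        raise ValueError("not a hex SHA-1 digest: %r" % handle)
    return Path(root) / handle[:2] / handle[2:]
```

Splitting on the first two characters spreads the files over at most 256 subdirectories, which keeps any single directory from growing too large. The split width is a tunable choice, not something the blueprint prescribes.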

The real names of the files are stored in the database.

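SHA-1 in Python:

import hashlib
import sys

BLOCKSIZE = 65536  # read in 64 KiB blocks so large files are never fully in memory

file_path = sys.argv[1]  # path of the file to hash, given on the command line
hasher = hashlib.sha1()
with open(file_path, 'rb') as afile:
    # feed the file to the hasher block by block until read() returns b''
    for buf in iter(lambda: afile.read(BLOCKSIZE), b''):
        hasher.update(buf)
print(hasher.hexdigest())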
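
The same hashing loop can be reused for the consistency check mentioned earlier: recompute the hash of a stored file and compare it with its handle. The helper names `sha1_of_file` and `is_intact` below are hypothetical, and the sketch assumes the handle is the hex digest.

```python
import hashlib


def sha1_of_file(file_path, blocksize=65536):
    """Return the hex SHA-1 digest of a file, read block by block."""
    hasher = hashlib.sha1()
    with open(file_path, "rb") as afile:
        for buf in iter(lambda: afile.read(blocksize), b""):
            hasher.update(buf)
    return hasher.hexdigest()


def is_intact(file_path, expected_handle):
    """True if the file content still matches the handle it was stored under."""
    return sha1_of_file(file_path) == expected_handle.lower()
```

Running `is_intact` over every file returned by the store enumeration would give a simple verification pass, for example before or after a backup.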

Open Source ideas

Filesystems

Libraries

  • Cassette (C#), under APL v2
  • Keep, inside Arvados (Go), under AGPL v2
  • Vault (Clojure), under Public Domain
  • Camlistore (Go), under APL v2
  • IPFS (Go), under MIT