Data storage
Rationale
Data is stored on disk; metadata is stored in a database. The structure of the metadata in the database is defined in another blueprint: [metadata storage](metadata storage). The metadata also contains a link back to the data. The data is usually huge and shall not be stored inside the database. Keeping it on disk also gives better performance and makes backups easier.
Tasks
The data storage module's jobs are:
- store data inside a store and return a unique handle to it
- load data from the store given its handle
- enumerate the store contents (for backup or verification)
- ensure no duplicate file is in the store
- keep the files as they are, without any change to them
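The tasks above can be read as a small interface. The following is only a sketch under assumptions: the class name `DataStore`, the method names and the use of a string as handle are hypothetical and not defined by this blueprint.

```python
from abc import ABC, abstractmethod
from typing import BinaryIO, Iterator


class DataStore(ABC):
    """Hypothetical interface mirroring the tasks of the data storage module."""

    @abstractmethod
    def store(self, data: BinaryIO) -> str:
        """Store the data unchanged and return a unique handle to it.

        Storing identical content twice should return the same handle
        and should not create a second copy in the store.
        """

    @abstractmethod
    def load(self, handle: str) -> BinaryIO:
        """Return a readable binary stream for the given handle."""

    @abstractmethod
    def handles(self) -> Iterator[str]:
        """Enumerate all handles in the store (for backup or verification)."""
```

A concrete store would subclass `DataStore` and implement all three methods; Python refuses to instantiate the class until that is done.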
Ideas
The handle on the files can be implemented as the SHA1 sum of the file. That way, if a new file is uploaded and a file with the same SHA1 sum already exists, the new file is a duplicate. Another benefit is that data consistency can be checked by recomputing the hash of each stored file and comparing it with its handle. For quick retrieval, the file name and directory can also be derived from the SHA1 sum.
The real names of the files are stored in the database.
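Deriving the directory and file name from the SHA1 sum could look like the following minimal sketch. The two-character directory prefix, the function name `handle_to_path` and the root `/srv/store` are assumptions for illustration, not decisions of this blueprint.

```python
import hashlib
from pathlib import Path


def handle_to_path(store_root: Path, handle: str) -> Path:
    """Map a SHA1 hex digest to a path inside the store (assumed layout).

    The first two hex characters become a subdirectory, so no single
    directory has to hold every file in the store.
    """
    if len(handle) != 40 or any(c not in "0123456789abcdef" for c in handle):
        raise ValueError("expected a 40-character lowercase SHA1 hex digest")
    return store_root / handle[:2] / handle[2:]


# Example: the handle of the bytes b"hello" and where the file would live.
example_handle = hashlib.sha1(b"hello").hexdigest()
example_path = handle_to_path(Path("/srv/store"), example_handle)
```

With this layout the store never needs a lookup table: the handle alone is enough to find the file on disk.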
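SHA1 in Python:
import hashlib

BLOCKSIZE = 65536  # read 64 KiB at a time so huge files never have to fit in memory
hasher = hashlib.sha1()
# file_path is the path of the file to hash
with open(file_path, 'rb') as afile:
    buf = afile.read(BLOCKSIZE)
    while len(buf) > 0:
        hasher.update(buf)
        buf = afile.read(BLOCKSIZE)
# the hex digest can serve as the handle of the file
print(hasher.hexdigest())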
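Building on the hashing loop, the tasks of the module (store, load, enumerate, no duplicates, consistency check) might be sketched as follows. This is one possible shape, not a settled design: the function names, the `store_root` parameter and the directory layout (first two hex characters as subdirectory) are all assumptions.

```python
import hashlib
import shutil
from pathlib import Path

BLOCKSIZE = 65536


def sha1_of_file(path: Path) -> str:
    """Return the SHA1 hex digest of a file, read in fixed-size blocks."""
    hasher = hashlib.sha1()
    with open(path, "rb") as afile:
        for buf in iter(lambda: afile.read(BLOCKSIZE), b""):
            hasher.update(buf)
    return hasher.hexdigest()


def store_file(store_root: Path, source: Path) -> str:
    """Copy source into the store unless it is already there; return its handle."""
    handle = sha1_of_file(source)
    target = store_root / handle[:2] / handle[2:]  # assumed layout
    if not target.exists():  # same SHA1 sum already stored: treat as duplicate
        target.parent.mkdir(parents=True, exist_ok=True)
        shutil.copyfile(source, target)  # byte-for-byte copy, content unchanged
    return handle


def load_file(store_root: Path, handle: str):
    """Open the stored file for reading; the caller closes the stream."""
    return open(store_root / handle[:2] / handle[2:], "rb")


def enumerate_handles(store_root: Path):
    """Yield every handle in the store, e.g. for backup or verification."""
    for path in sorted(store_root.glob("*/*")):
        if path.is_file():
            yield path.parent.name + path.name


def verify_store(store_root: Path) -> list:
    """Return the handles whose file content no longer matches the hash."""
    return [h for h in enumerate_handles(store_root)
            if sha1_of_file(store_root / h[:2] / h[2:]) != h]
```

Calling `store_file` twice with identical content returns the same handle and leaves a single copy on disk. `verify_store` returns an empty list as long as no stored file has been altered.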
Open Source ideas
- Content Addressable Storage (CAS)
- XAM Initiative: http://www.snia.org/forums/xam
Filesystems
- ZFS filesystem: ZFS and ECC, ZFS Linux support
- Use raw disk access in case of virtualization (see one of the answers)
- Distributed filesystem: Ceph
- Open Archive
Libraries
- Cassette (C#), under APL v2
- Keep (Go), inside Arvados, under AGPL v3
- Vault (Clojure), under Public Domain
- Camlistore (Go), under APL v2
- IPFS (Go), under MIT