Data is stored on disk; metadata is stored in a database. How the metadata in the database is structured is defined in another blueprint: metadata storage. Each metadata record also contains a link back to its data. Because the data is usually very large, it shall not be stored inside the database. Keeping it on disk also gives better database performance and makes backups easier.
The data storage module's jobs are:
- store data inside a store and return a unique handle to it
- load data from the store given its handle
- enumerate the store contents (for backup or verification)
- ensure no duplicate file is in the store
- keep the files as they are, without any change to them
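The jobs listed above can be sketched as a small class. This is only an illustration, not the blueprint's actual module: the class name `DataStore`, its method names, and the flat directory layout are assumptions made for the example, and the handle is the SHA1 sum of the content, which is the option the following paragraph proposes.

```python
import hashlib
import os
import tempfile


class DataStore:
    """Hypothetical flat-directory store; names and layout are assumptions."""

    def __init__(self, root):
        self.root = root
        os.makedirs(root, exist_ok=True)

    def store(self, data):
        """Store the bytes unchanged and return their handle."""
        handle = hashlib.sha1(data).hexdigest()
        path = os.path.join(self.root, handle)
        # Identical content yields the same handle, so it is written only once.
        if not os.path.exists(path):
            with open(path, "wb") as f:
                f.write(data)
        return handle

    def load(self, handle):
        """Return the stored bytes for a handle."""
        with open(os.path.join(self.root, handle), "rb") as f:
            return f.read()

    def enumerate(self):
        """List all handles, e.g. for backup or verification."""
        return sorted(os.listdir(self.root))


# Usage on a throwaway directory: storing the same content twice
# returns the same handle and keeps a single file.
store = DataStore(tempfile.mkdtemp())
first = store.store(b"example content")
second = store.store(b"example content")
print(first == second, len(store.enumerate()))
```

A real module would also need to handle interrupted writes and concurrent uploads, which this sketch leaves out.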
The handle for a file can be implemented as the SHA1 sum of its content. That way, if a new file is uploaded and a file with the same SHA1 sum already exists, the new file is a duplicate. Another benefit is that data consistency can be checked by recomputing the hash of a stored file and comparing it with its handle. For quick retrieval, the file name and directory inside the store can also be derived from the SHA1 sum.
The real names of the files are stored in the database.
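One possible way to derive the directory and file name from the hash is to use the first two hex characters as a subdirectory and the remaining 38 as the file name, so that no single directory grows too large. The function name `handle_to_path` and the two-character split are assumptions chosen for this sketch; the blueprint does not prescribe a layout.

```python
import os

HEX_DIGITS = set("0123456789abcdef")


def handle_to_path(root, handle):
    """Map a 40-character lowercase hex SHA1 handle to a path below root."""
    if len(handle) != 40 or not set(handle) <= HEX_DIGITS:
        raise ValueError("not a lowercase hex SHA1 digest: %r" % (handle,))
    # The first two hex characters select one of 256 subdirectories;
    # the remaining 38 characters become the file name.
    return os.path.join(root, handle[:2], handle[2:])


# Usage with a made-up handle (40 hex characters).
print(handle_to_path("store", "ab" + "0" * 38))
```

Since the path depends only on the content hash, the real file name never appears on disk and has to be looked up in the database.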
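SHA1 in Python:
```python
import hashlib

BLOCKSIZE = 65536  # read in 64 KiB blocks so a huge file is never fully in memory


def sha1_of_file(file_path):
    """Return the hexadecimal SHA1 sum of the file at file_path."""
    hasher = hashlib.sha1()
    with open(file_path, 'rb') as afile:
        buf = afile.read(BLOCKSIZE)
        while len(buf) > 0:
            hasher.update(buf)
            buf = afile.read(BLOCKSIZE)
    return hasher.hexdigest()
```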
Open Source ideas
- ZFS filesystem (related: ZFS and ECC, ZFS Linux support)
- Use raw disk access in case of virtualization (see one of the answers)
- Distributed filesystem: Ceph
- Open Archive