I’d like to talk about a project I’ve been working on in my spare time for a while now. The project, called Deneb, is a system for synchronizing directories across multiple computers. I will go over its design and planned features to give an idea of how it differs from existing solutions.
Deneb internally represents files and directories using content-addressed storage - the content of files in a Deneb synchronized directory is stored on disk as chunks referenced by their hash. The chunks are immutable, greatly simplifying the interaction with the chunks in a concurrent setting. Race conditions cannot happen since chunks are atomically created and are only read, never modified. Compression is optionally used when storing the chunks.
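To make the storage model concrete, here is a minimal Python sketch of a content-addressed chunk store. It illustrates the idea and is not Deneb's actual code: the hash function (SHA-256), the compression library (zlib), the fixed chunk size, and the names `ChunkStore` and `store_file` are all assumptions made for this example.

```python
import hashlib
import os
import tempfile
import zlib

CHUNK_SIZE = 4 * 1024 * 1024  # assumed chunk size, not taken from Deneb


class ChunkStore:
    """Toy content-addressed store: one file per chunk, named by its hash."""

    def __init__(self, root, compress=False):
        self.root = root
        self.compress = compress
        os.makedirs(root, exist_ok=True)

    def put(self, data: bytes) -> str:
        # The hash is computed over the uncompressed content.
        digest = hashlib.sha256(data).hexdigest()
        path = os.path.join(self.root, digest)
        if os.path.exists(path):
            return digest  # same content already stored; chunks never change
        payload = zlib.compress(data) if self.compress else data
        # Write to a temporary file, then rename it into place, so readers
        # see either the complete chunk or no chunk at all.
        fd, tmp = tempfile.mkstemp(dir=self.root)
        with os.fdopen(fd, "wb") as f:
            f.write(payload)
        os.replace(tmp, path)
        return digest

    def get(self, digest: str) -> bytes:
        with open(os.path.join(self.root, digest), "rb") as f:
            payload = f.read()
        return zlib.decompress(payload) if self.compress else payload


def store_file(store, data: bytes, chunk_size=CHUNK_SIZE):
    """Split data into fixed-size chunks and return the list of chunk hashes."""
    return [store.put(data[i:i + chunk_size])
            for i in range(0, len(data), chunk_size)]
```

Because a chunk's name is derived from its content, storing the same data twice is a no-op, and an existing chunk file is never rewritten with different content.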
The metadata of all files and subdirectories in a Deneb directory is recorded in a catalog file. Catalog entries corresponding to files also contain a list of hashes of the chunks making up the file. The catalogs are immutable - when changes are made inside a directory, a new catalog needs to be created. This approach has been successfully used in the CernVM File System (CernVM-FS) [1], a read-only distributed filesystem used to ship the physics software stacks to more than 100,000 compute nodes making up the Worldwide LHC Computing Grid (WLCG). In Deneb, an important addition to this storage model is planned: the encryption of all chunks and file catalogs.
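The immutable catalog idea can be sketched in a few lines of Python. This is a hypothetical in-memory model, not Deneb's catalog format: the `FileEntry` and `Catalog` types, their fields, and the `with_entry` helper are assumptions chosen for illustration.

```python
from dataclasses import dataclass
from types import MappingProxyType
from typing import Mapping, Tuple


@dataclass(frozen=True)
class FileEntry:
    size: int
    mode: int
    chunks: Tuple[str, ...]  # hashes of the chunks making up the file, in order


@dataclass(frozen=True)
class Catalog:
    entries: Mapping[str, FileEntry]  # path -> metadata, exposed read-only

    def with_entry(self, path: str, entry: FileEntry) -> "Catalog":
        """Return a new catalog; the current one is left untouched."""
        updated = dict(self.entries)
        updated[path] = entry
        return Catalog(MappingProxyType(updated))


def empty_catalog() -> Catalog:
    return Catalog(MappingProxyType({}))


# A change produces a new catalog instead of modifying the old one.
v1 = empty_catalog()
v2 = v1.with_entry("notes.txt", FileEntry(size=5, mode=0o644, chunks=("ab12",)))
```

Here `v1` still describes the directory as it was before the change, while `v2` describes the state after it.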
The main way to interact with Deneb is through its filesystem interface, implemented as a Filesystem in Userspace (FUSE) module. The filesystem has basic read and write support, but a fully featured filesystem implementation, including links, extended attributes, locks, etc., is planned. Due to the reliance on FUSE, the filesystem interface is only available on Linux, macOS and FreeBSD.
Internally, the project is split into a core library and user interface modules which use the core library. There is only the FUSE interface at the moment, but mobile applications are also planned.
The networking and synchronization aspects of the system have been sketched out, but are yet to be implemented. The desired workflow is having Deneb instances running on multiple machines, of which only one is active at any given moment. The active instance announces the new catalogs it produces to its network of peers. Passive instances can retrieve these new catalogs and merge them with their own. Using a central server as a rendezvous point for the different instances when communicating catalog updates is the most natural implementation.
Transferring the immutable file chunks between instances can be done through a separate channel, and doesn’t necessarily need to rely on a central server for synchronization. A peer-to-peer architecture, such as a distributed hash table (DHT), could be employed.
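As a rough illustration of why content hashes fit a DHT well, the sketch below picks the peers responsible for a chunk by XOR distance between the chunk hash and each peer's ID. This is the scheme used by Kademlia-style DHTs, chosen here only as an example; nothing in Deneb's design commits to it, and the peer names and the `node_id` and `closest_peers` helpers are hypothetical.

```python
import hashlib


def node_id(name: str) -> int:
    """Derive a 256-bit peer ID from a peer name (an assumption of this sketch)."""
    return int.from_bytes(hashlib.sha256(name.encode()).digest(), "big")


def closest_peers(chunk_hash_hex: str, peers, k=2):
    """Return the k peers whose IDs are closest to the chunk hash by XOR distance."""
    key = int(chunk_hash_hex, 16)
    return sorted(peers, key=lambda p: node_id(p) ^ key)[:k]


peers = ["laptop", "desktop", "phone", "server"]
chunk = hashlib.sha256(b"some chunk data").hexdigest()
holders = closest_peers(chunk, peers)
```

Every instance that knows the same peer list computes the same `holders` for a given chunk, so no central index of chunk locations is needed.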
There is now basic support for reading from and writing to a Deneb directory through the filesystem interface. Writes do not yet persist across restarts of the Deneb instance. The transaction engine - the subsystem committing the changes and creating new metadata catalogs - is under development.
Once the transaction engine is completed, and optional compression of chunks and encryption of all file data and metadata are added, the minimum set of features related to local operation will be in place. The remaining work will be on the communication and synchronization aspects of the system.
This concludes the brief overview of Deneb. In future posts, I would like to go into more detail about its different subsystems and discuss some of the tradeoffs made in its design.
Deneb is open source, licensed under the Mozilla Public License v2, and can be found on GitHub.
[1] At the time of writing, I was working as a software engineer at CERN, in the team developing the CernVM File System.