A Private “DropBox” for Your Enterprise

Many companies still use FTP, SFTP, SSH, and/or HTTP as their daily data-sharing platform. Some people collect or generate data and upload it to a data server, while others log in to the server and download the data they are interested in. This model has several problems. First, as the data grows, more and more data servers are set up, and users often have a hard time figuring out which server hosts a particular data file. Second, when there are multiple data servers, data files are often replicated, either intentionally or by mistake; it is difficult to tell which version is the most recent, and even more difficult to keep all replicas consistent. Third, when users download data files from remote locations over the Internet, they often experience low throughput. A business called WAN acceleration has emerged to help solve the third problem, but WAN acceleration software does nothing for the first two.

On the other hand, many Internet storage services have emerged in recent years: DropBox, Google Docs, and Amazon S3, just to name a few. These services put users’ files in a storage “cloud” and provide a single namespace. Replication and conflicts are handled transparently to users. This solves the first two problems described above. In fact, because of these benefits, online storage services are very popular today.

However, enterprises cannot simply move their data to these online storage providers. Security is an immediate concern. Then there are other issues, including capacity and cost. Uploading 100+ TB of data to an online storage service is still impractical in most cases, and the cost of keeping it there is very high (e.g., at Amazon S3, 100 TB will cost approximately $12K per month plus data transfer costs). In addition, the intranet connection is usually faster than the connection to an outside provider.
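
To make the capacity and cost argument concrete, here is a back-of-the-envelope sketch in Python. The link speeds are illustrative assumptions rather than measurements, the link is assumed to be fully utilized with no protocol overhead, and 1 TB is taken as 1,000 GB. The storage cost is the approximate $12K per month figure quoted above.

```python
# Rough numbers for moving 100 TB to an online storage provider.
# Link speeds are hypothetical; decimal units are assumed (1 TB = 10**12 bytes).

def upload_days(size_tb: float, link_mbps: float) -> float:
    """Days needed to push size_tb terabytes through a link of link_mbps
    megabits per second, assuming the link is saturated the whole time."""
    bits = size_tb * 1e12 * 8
    seconds = bits / (link_mbps * 1e6)
    return seconds / 86400


def price_per_gb_month(monthly_cost_usd: float, size_tb: float) -> float:
    """Implied storage price in dollars per GB per month."""
    return monthly_cost_usd / (size_tb * 1000)


if __name__ == "__main__":
    for mbps in (100, 1000):
        print(f"100 TB over {mbps} Mb/s: {upload_days(100, mbps):.1f} days")
    print(f"Implied price: ${price_per_gb_month(12_000, 100):.2f} per GB-month")
```

Even at a sustained 1 Gb/s, the initial upload takes more than nine days, and at 100 Mb/s it takes about three months.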

An alternative, and probably better, approach is to run a private “DropBox”-style data cloud inside the company that manages and serves data to all branches. Such a private data cloud should have the following features:

  • Provide a single name space across multiple servers, even if the servers are in different locations
  • Allow servers to be added and removed at run time (dynamic scaling)
  • Maintain replicas and take care of consistency between replicas transparently
  • Allow users to control the replication number and location of each file when necessary (e.g., hot files can be replicated more times)

Sector/Sphere meets all the above requirements. Sector can manage your data across thousands of servers with a single name space. Sector automatically replicates data files to multiple data centers for fault tolerance and to increase read performance. Data location and replication number can be configured at the per-file level if necessary and can be changed dynamically. For example, if new files have more readers than old files, the new files can be replicated at a higher degree, and their replication number can be gradually reduced as users become more interested in even newer files. In addition to all of these benefits, Sector also gives you integrated WAN acceleration through the UDT protocol, another piece of open source software that we contribute. UDT has helped millions of users with their daily data transfer needs.
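
The per-file replication idea can be sketched as a small policy function. This is only an illustration in Python, not Sector's actual API or algorithm: the function name, the `reads_per_replica` threshold, and the minimum and maximum bounds are all hypothetical.

```python
import math


def target_replicas(reads_per_day: float,
                    reads_per_replica: float = 1000.0,
                    min_replicas: int = 2,
                    max_replicas: int = 8) -> int:
    """Pick a replica count proportional to read demand, clamped to a range.

    Every parameter here is a hypothetical tuning knob for illustration.
    """
    wanted = math.ceil(reads_per_day / reads_per_replica)
    return max(min_replicas, min(max_replicas, wanted))


if __name__ == "__main__":
    # A new file is read heavily at first; interest fades as newer files arrive.
    for day, reads in enumerate([7500, 4200, 1800, 300, 10], start=1):
        print(f"day {day}: {reads} reads/day -> {target_replicas(reads)} replicas")
```

A hot file gets more copies while it is popular, and the count falls back toward the minimum as reads drop, which is the behavior described in the example above.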

Overall, our system can support very large enterprises sharing 100+ TB of data every day among their global branches. You may also refer to our previous blog post to start trying the system yourself.

Storage 2.0

While I was discussing new storage and file systems with my friend Chuan Wang, he came up with this term: storage 2.0, which is exactly the catchword I had been searching for.

Numerous file systems have been developed since the birth of modern computers. However, the mature and widely used ones fall into basically two groups: the desktop file system (ZFS, NTFS, EXT, etc.) and the supercomputer file system (GPFS, Lustre, Panasas, etc.). In the middle, there is chaos.

The “middle” is an area where users have a large number of loosely coupled commodity computers, where a supercomputer file system will not work well and an additional middleware layer is required to aggregate individual desktop file systems. There are many file or storage systems in this area too (HDFS, Gluster, Dynamo, just to list a few), but each is used in more or less specific scenarios, and none of them has become dominant or widely deployed. This is partially because requirements from different users can vary drastically, from high consistency to high availability, and in general it is hard to provide all of these properties at once (see Brewer’s CAP Theorem).

Yet there are indeed some common characteristics the “middle” file systems tend to (or need to) share. Together, these are the features required by what we would like to call storage 2.0, or the general “middle” file system.

Software level fault tolerance.

Almost all distributed file systems need to deal with hardware and network failure. Supercomputer file systems usually use RAID as a hardware level solution. However, hardware failure in commodity clusters is normal, so a file system must be able to provide fault tolerance within itself, rather than depending on another layer.

Self-healing.

Storage nodes may join and leave frequently, either due to system/network failure or maintenance/upgrade. The file system should continue to work in such situations and hide these internal changes from the clients. In particular, the file system should also automatically re-balance storage space (e.g., when new nodes are added). In short, the file system should keep working even if there is only one node left (although files may be lost if too many nodes are down).
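
A minimal sketch of the repair step, in Python, may help illustrate the idea. It is not taken from any particular file system: the function name, the data layout, and the "least-loaded node first" rule are assumptions made for this example.

```python
from collections import Counter


def plan_repairs(placement: dict[str, set[str]],
                 live_nodes: set[str],
                 replicas_wanted: int = 2) -> dict[str, list[str]]:
    """Return, for each under-replicated file, the nodes that should get a copy.

    placement maps a file name to the nodes believed to hold a replica.
    New copies go to the least-loaded live nodes that lack the file, so
    freshly added (empty) nodes fill up first. Hypothetical policy.
    """
    # Count surviving replicas per live node.
    load = Counter({node: 0 for node in live_nodes})
    for nodes in placement.values():
        for node in nodes & live_nodes:
            load[node] += 1

    repairs: dict[str, list[str]] = {}
    for name in sorted(placement):
        alive = placement[name] & live_nodes
        missing = replicas_wanted - len(alive)
        if not alive or missing <= 0:
            continue  # no surviving copy to read from, or already healthy
        candidates = sorted(live_nodes - alive, key=lambda n: (load[n], n))
        chosen = candidates[:missing]
        for node in chosen:
            load[node] += 1
        if chosen:
            repairs[name] = chosen
    return repairs


if __name__ == "__main__":
    placement = {"a.dat": {"n1", "n2"}, "b.dat": {"n2", "n3"}, "c.dat": {"n1", "n3"}}
    # n2 has failed and an empty node n4 has joined.
    print(plan_repairs(placement, live_nodes={"n1", "n3", "n4"}))
```

With a single surviving node the function returns an empty plan: files on that node stay readable, nothing can be re-replicated yet, and files with no surviving copy are skipped because they are lost.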

In-storage processing.

If a file system runs on a system where each storage device is attached to CPUs, it is a waste of resources to use these CPUs only for serving data. In-storage processing can significantly accelerate operations such as md5sum and grep, because it not only avoids the need to read data out to the client, but can also execute the commands on multiple files in parallel.
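
The md5sum case can be sketched as follows. This is a toy simulation in Python, not how Sector/Sphere implements in-storage processing: the storage nodes are modeled as an in-memory dictionary, threads stand in for the nodes' CPUs, and all names are hypothetical.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor

# Simulated storage nodes: node name -> {file name -> file content}.
# In a real deployment each node would read these files from its local disks.
NODES = {
    "node1": {"a.log": b"alpha\n" * 1000},
    "node2": {"b.log": b"beta\n" * 1000},
    "node3": {"c.log": b"gamma\n" * 1000},
}


def md5_on_node(node: str) -> dict[str, str]:
    """Runs "on" the storage node: hash local files, return only the digests."""
    return {name: hashlib.md5(data).hexdigest()
            for name, data in NODES[node].items()}


def md5_everywhere() -> dict[str, str]:
    """Client side: fan the job out to all nodes and merge the small results."""
    digests: dict[str, str] = {}
    with ThreadPoolExecutor(max_workers=len(NODES)) as pool:
        for partial in pool.map(md5_on_node, NODES):
            digests.update(partial)
    return digests


if __name__ == "__main__":
    for name, digest in sorted(md5_everywhere().items()):
        print(digest, name)
```

Only the 32-character digests travel back to the client, while the file contents stay on the nodes that store them, and the nodes work on their files at the same time.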

Ability to treat files differently.

Distributed file systems, especially those that serve data across wide area networks, usually require a higher level of versatility and flexibility. Files may come from different sources and serve different purposes. The location, security, and replication factor of each file need to be treated differently, and the rules should be updated dynamically whenever necessary.

Scalability.

A file system should be able to handle 10,000 storage nodes if necessary, yet we must be aware that the majority of systems never come close to this scale. Extremely high scalability often does not come free. Many highly scalable systems use P2P routing (e.g., a distributed hash table), but consistency and performance are often compromised. Therefore, the file system should have “reasonable” scalability. It is also worth noting that scalability applies not only to the number of nodes, but also to the number and size of files, and sometimes even to geographical locations.

Performance.

The file system should support high-performance lookup and high I/O throughput. Because data can be accessed concurrently across nodes, its throughput should exceed that of a desktop file system. However, latency is usually higher than in supercomputer file systems.

There are other features that may be less important but can be crucial in certain situations, such as integrated security and integrated monitoring. For distributed file systems, depending on external security and monitoring may not be enough. They need to support these features within the system.

Storage 2.0 is not meant to replace current desktop and supercomputer file systems, but to fill the void left between them. At VeryCloud, we are trying to shape our Sector DFS around these requirements. We hope you will get involved with us, whether you are an open source developer interested in distributed file system development or a potential user who feels that these features meet your specific requirements.