December 12, 2010
Companies today still use FTP, SFTP, SSH, and/or HTTP as their daily data sharing platform. Some people collect or generate data and upload it to a data server, while others log in to the server and download the data they are interested in. There are several problems with this model. First, as the amount of data grows, more and more data servers are set up, and users often have a hard time figuring out which server hosts a particular file. Second, when there are multiple data servers, files are often replicated, either intentionally or by mistake; it is difficult to tell which copy is the most recent, and even more difficult to keep all replicas consistent. Third, when users download files from remote locations over the Internet, they often experience low throughput. A line of business called WAN acceleration has emerged to address the third problem, but WAN acceleration software can do nothing about the first two.
On the other hand, many Internet storage services have emerged in recent years: DropBox, Google Docs, and Amazon S3, just to name a few. These services put users' files in a storage "cloud" and provide a single namespace. Replication and conflicts are handled transparently to users, which solves the first two problems described above. In fact, thanks to these benefits, online storage services are very popular today.
However, enterprises cannot simply move their data to these online storage providers. There is an immediate security concern, and then there are other issues, including capacity and cost. Uploading 100TB+ of data to an online storage service is still impractical in most cases today, and the cost of keeping it in service is high (e.g., on Amazon S3, 100TB costs approximately $12K per month, plus data transfer fees). In addition, an intranet network connection is usually faster.
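The $12K/month figure is easy to sanity-check with a back-of-the-envelope calculation. The per-GB rate below is an assumption based on 2010-era S3 pricing tiers (roughly $0.12–$0.15 per GB-month), not an official quote:

```python
# Back-of-the-envelope check of the monthly S3 storage cost for 100TB.
# The per-GB rate is an assumed, illustrative 2010-era figure.
ASSUMED_PRICE_PER_GB_MONTH = 0.12  # USD per GB per month

capacity_gb = 100 * 1024  # 100 TB expressed in GB
monthly_storage_cost = capacity_gb * ASSUMED_PRICE_PER_GB_MONTH

print(f"${monthly_storage_cost:,.0f} per month")  # roughly $12K, before transfer fees
```

Data transfer charges come on top of this, which is part of why keeping large working sets in-house can be attractive.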
An alternative, and probably better, approach is to run a private data cloud, a "DropBox" inside the company, that manages and serves data to all branches. Such a private data cloud should have the following features:
- A single namespace across multiple servers, even if the servers are located at different sites
- The ability to add and remove servers at run time (dynamic scaling)
- Transparent management of replicas, including consistency between them
- User control over the replication number and location of each file when necessary (e.g., hot files can be replicated more widely)
Sector/Sphere meets all of the above requirements. Sector can manage your data across thousands of servers under a single namespace. It automatically replicates data files to multiple data centers for fault tolerance and to increase read performance. Data location and replication number can be configured at the per-file level if necessary and changed dynamically. For example, if new files have more readers than old files, the new files can be replicated at a higher degree, and their replication number gradually reduced as users become more interested in even newer files. In addition to all of these benefits, Sector also gives you integrated WAN acceleration via the UDT protocol, another open-source project we contribute. UDT has helped millions of users with their daily data transfer needs.
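To make the popularity-driven replication idea concrete, here is a minimal sketch of such a policy. This is not Sector's real API; all names and thresholds below are hypothetical, purely to illustrate a replication number that rises with recent reads and falls back as interest fades:

```python
# Illustrative sketch (not Sector's actual API): pick a per-file
# replication number from recent read popularity, clamped to a range.

MIN_REPLICAS = 2   # assumed floor, for fault tolerance
MAX_REPLICAS = 8   # assumed ceiling, to bound storage overhead

def target_replicas(reads_last_day: int) -> int:
    """One extra replica per 100 reads/day, within [MIN, MAX]."""
    extra = reads_last_day // 100
    return max(MIN_REPLICAS, min(MAX_REPLICAS, MIN_REPLICAS + extra))

# A hot new file is replicated more widely than a cooling old one.
print(target_replicas(650))  # hot file -> near the ceiling
print(target_replicas(30))   # old file -> back at the floor
```

Rerunning such a policy periodically gives exactly the behavior described above: new, heavily read files gain replicas, and the count drifts back down once readers move on.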
Overall, our system can support very large enterprises in sharing 100+ TB of data every day among their global branches. You may also refer to our previous blog post to try the system yourself.