Sector vs. Hadoop

When I introduce Sector/Sphere to people I meet at conferences, I usually start with one sentence: “Sector is a system similar to Hadoop”, because many people know Hadoop and more or less understand how it works, and Sector provides similar functionality. This claim, however, is not very accurate: there are many critical differences between the two systems.

Sector is not simply a direct implementation of GFS/MapReduce. In fact, when I started to work on Sector in 2005, I had not yet read the Google papers, and I was not aware of Hadoop until 2007. Sector originated from a content distribution system for very large scientific datasets (the Sloan Digital Sky Survey). The current version of Sector still supports efficient data access and distribution over wide area networks, a goal that was not considered by the GFS/Hadoop community. Unlike GFS, Sector does not split files. On the one hand, this limits the size of files that can be stored in the Sector file system and hence the system's usability. On the other hand, it also greatly improves data transfer and processing performance when proper file sizes are used.

Sector – to be accurate, Sphere, a component of Sector – allows arbitrary user-defined functions (UDFs) to be applied to any data segment (a record, a group of records, a file, etc.) and allows the results to be written to independent files or sent to multiple bucket files according to a user-defined key. The UDF model turns out to be equivalent to the MapReduce model: each UDF can simulate a Map operation, while organizing the output according to keys can simulate a Reduce operation. Note that the “key” in a Sphere UDF is not part of a data record; it is used only to select the output destination. While MapReduce treats each record as a <key, value> pair, Sphere sees all data as binary and leaves the specific processing to the UDF.

The table below compares the Sphere UDF model with MapReduce. Any MapReduce computation can be rewritten using Sphere UDFs: Sphere uses a persistent record index instead of a run-time parser, the Map and Reduce operations can each be replaced with one or more UDFs, and the Sphere output can be written into Sector files. A more detailed list of the technologies used in Sector can be found on the Sector website.

Sphere                 MapReduce
-------------------    ---------------------
Record Offset Index    Parser / Input Reader
Bucket                 Partition
UDF                    Map / Reduce
Sector file            Output Writer

Overall, Sector performs 2 to 20 times faster than Hadoop in our benchmark applications. It is worth trying, especially if you are a C++ developer.


About Yunhong Gu
Yunhong is a computer scientist and open source software developer. He is the architect and lead developer of open source software UDT, a versatile application level UDP-based data transfer protocol that outperforms TCP in many cases, and Sector/Sphere, a cloud computing system software that supports distributed data storage, sharing, and processing. Yunhong earned a PhD degree in Computer Science from the University of Illinois at Chicago in 2005.
