Wednesday, March 6, 2013

Here's a repost of Bryan Duxbury's "When to use HBase", as the original Rapleaf link no longer works and I think it's important to keep this around.



Matching Impedance: When to use HBase

(For the duration of this discussion, I’m going to assume you have at least heard of HBase. If not, go check it out first or you might be a little confused.)
Ever since I read the original Bigtable paper, I knew that its design was something that would befuddle a lot of developers. As an industry, we are largely educated into the world of relational databases, the ubiquitous system of tables, relationships, and SQL. On the whole, relational databases are one of the most widespread, reliable, and well-understood technologies out there. This is one reason why so many developers today are resistant to different storage technologies, such as object databases and distributed hash tables.
However, at some point, the model starts to break down. Usually there are two kinds of pain that people run into: scaling and impedance mismatch. The scaling issue usually boils down to the fact that most RDBMSs are monolithic, single-process systems. The way you scale this type of database (MySQL, Oracle, etc) is by adding bigger and more expensive hardware – more CPUs, RAM, and especially disks. In this regard, at least the problem is already solved: you just have to spend the money. Unfortunately, the cost of this approach does not scale nearly linearly – getting a machine that can support twice as many disks costs more than twice as much money.
Impedance mismatch is a more subtle and challenging problem to get over. The problem occurs when more and more complex schemas are shoehorned into a tabular format. The traditional issue is mapping object graphs to tables and relationships and back again. One common case where this sort of problem comes to light is when your objects have a lot of possible fields but most objects don’t have an instance of every field. In a traditional RDBMS, you have to have a separate column for each field and store NULLs. Essentially, you have to decide on a homogeneous set of fields for every object. Another problem is when your data is less structured than a standard RDBMS allows. If you will have an undefined, unpredictable set of fields for your objects, you either have to make a generic field schema (Object has many Fields) or use something like RDF to represent your schema.
HBase seeks to address some of these issues. Still, there are situations where HBase is the wrong tool for the job. As a developer, you need to make sure you take the time to see beyond the hype about this technology or that and really be sure that you’re matching impedance.

When HBase Shines

One place where HBase really does well is when you have records that are very sparse. This might mean un- or semi-structured data. In any case, unlike row-oriented RDBMSs, HBase is column-oriented, meaning that nulls are stored for free. If you have a row that only has one out of dozens of possible columns, literally only that single column is stored. This can mean huge savings in both disk space and IO read time.
Another way that HBase matches well to un- or semi-structured data is in its treatment of column families. In HBase, individual records of data are called cells. Cells are addressed with a row key/column family/cell qualifier/timestamp tuple. However, when you define your schema, you only specify what column families you want, with the qualifier portion determined dynamically by consumers of the table at runtime. This means that you can store pretty much anything in a column family without having to know what it will be in advance. This also allows you to essentially store one-to-many relationships in a single row! Note that this is not denormalization in the traditional sense, as you aren’t storing one row per parent-child tuple. This can be very powerful – if your child entities are truly subordinate, they can be stored with their parent, eliminating all join operations.
In addition to handling sparse data well, HBase is also great for versioned data. As mentioned, the timestamp is a part of the cell “coordinates”. This is handy, because HBase stores a configurable number of versions of each cell you write, and then allows you to query what the state of that cell is at different points in time. Imagine, for instance, a record of a person with a column for location. Over time, that location might change. HBase’s schema would allow you to easily store a person’s location history along with when it changed, all in the same logical place.
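As a rough sketch in the HBase shell (the table, column family, and row names here are invented for the example), you can see both ideas at once: the column family is declared up front, qualifiers are made up at write time, and multiple timestamped versions of a cell can be read back:

create 'people', {NAME => 'info', VERSIONS => 5}

# qualifiers like 'info:location' are never declared in the schema
put 'people', 'person1', 'info:location', 'Chicago'
put 'people', 'person1', 'info:location', 'San Francisco'

# fetch up to 5 stored versions of the cell, newest first
get 'people', 'person1', {COLUMN => 'info:location', VERSIONS => 5}
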
Finally, of course, there’s the scaling. HBase is designed to partition horizontally across tens to hundreds of commodity PCs. This is how HBase deals with the problem of adding more CPUs, RAM and disks. I don’t feel like I need to go far down the road of discussing this idea, because it seems to be the one thing everyone gets about HBase. (If you need more convincing, read the original Bigtable paper. It’s got graphs!)

When HBase Isn’t Right

I’ll just go ahead and say it: HBase isn’t right for every purpose. Sure, you could go ahead and take your problem domain and squeeze it into HBase in one way or another, but then you’d be committing the same error we’re trying avoid by moving away from RDBMSs in the first place.
Firstly, if your data fits into a standard RDBMS without too much squeezing, chances are you don’t need HBase. That is, if a modestly expensive server loaded with MySQL fits your needs, then that’s probably what you want. Don’t make the mistake of assuming you need massive scale right off the bat.
Next, if your data model is pretty simple, you probably want to use an RDBMS. If your entities are all homogeneous, you'll probably have an easy time of mapping your objects to tables. You also get some nice flexibility in terms of your ability to add indexes, query on non-primary-key values, do aggregations, and so on without much additional work. This is where RDBMSs shine – for decades they've been doing this sort of thing and doing it well, at least at lower scale. HBase, on the other hand, doesn't allow for querying on non-primary-key values, at least not directly. HBase allows get operations by primary key and scans (think: cursor) over row ranges. (If you have both scale and need of secondary indexes, don't worry – Lucene to the rescue! But that's another post.)
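For concreteness, the two access paths HBase does give you look like this in the HBase shell (continuing the made-up 'people' table from the sketch above):

# point lookup by row key
get 'people', 'person1'

# cursor-style scan over a range of row keys
scan 'people', {STARTROW => 'person1', STOPROW => 'person2'}
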
Finally, another thing you shouldn’t do with HBase (or an RDBMS, for that matter), is store large amounts of binary data. When I say large amounts, I mean tens to hundreds of megabytes. Certainly both RDBMSs and HBase are capable of storing large amounts of binary data. However, again, we have an impedance mismatch. RDBMSs are built to be fast metadata stores; HBase is designed to have lots of rows and cells, but functions best when the rows are (relatively) small. HBase splits the virtual table space into regions that can be spread out across many servers. The default size of individual files in a region is 256MB. The closer to the region limit you make each row, the more overhead you are paying to host those rows. If you have to store a lot of big files, then you’re best off storing them in the local filesystem or, if you have LOTS of data, in HDFS. You can still keep the metadata in an RDBMS or HBase – but do us all a favor and just keep the path in the metadata.

Conclusion

This post certainly doesn’t cover every use case and benefit or drawback of HBase, but I think it gives a pretty decent start. My hope is that people will be able to gain some insight into when they should start thinking of HBase for their applications, and also use this as a springboard for more questions about how to make use of HBase and ideas about how to make it better. So, I’ll end with a request – please, tell us what’s missing!

Tuesday, June 14, 2011

working with Oracle datapump

First of all, only dumps created by expdp can be imported via impdp! (Dumps taken with the old exp utility have to be imported with imp instead.)
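
For reference, a dump like the one imported below would have been created on the source database with something along these lines (the connect string, directory object, and file name here are only illustrative, mirroring the import further down):

expdp system/manager@source DIRECTORY=ARAMARK DUMPFILE=purisma_allobjs_0512.dmp SCHEMAS=purisma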

Second, you need to have the correct entry in your fstab if you are using an NFS-mounted share as your datapump directory:


pso:/PSO /PSO nfs rw,bg,hard,nointr,rsize=32768,wsize=32768,tcp,vers=3,timeo=600,actimeo=0 0 0

Make sure the directory itself (PSO in this case) has the proper ownership and permissions - it should be owned by the oracle user (make sure the group is correct as well) with rwx for all.
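
Something like this does it, assuming 'oinstall' is your install's Oracle group (substitute whatever group your installation actually uses):

chown oracle:oinstall /PSO
chmod 777 /PSO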

Then, create the Oracle directory object that references this directory (or a child of it). Note the grant to the schema user 'ARAMARKPURISMA' - this is important if you want to import as the schema user (rather than as SYSTEM).


DROP DIRECTORY ARAMARK;

CREATE OR REPLACE DIRECTORY ARAMARK AS '/PSO/Customers/Aramark';

GRANT READ, WRITE ON DIRECTORY SYS.ARAMARK TO ARAMARKPURISMA;
GRANT READ, WRITE ON DIRECTORY SYS.ARAMARK TO SYSTEM WITH GRANT OPTION;
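
A quick sanity check, run as a DBA, confirms the directory object exists and points where you expect:

SELECT owner, directory_name, directory_path FROM dba_directories WHERE directory_name = 'ARAMARK';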


Then, run impdp and remap tablespaces as needed:

impdp system/manager@se04 DIRECTORY=ARAMARK DUMPFILE=purisma_allobjs_0512.dmp \
    SCHEMAS=purisma REMAP_SCHEMA=purisma:aramarkpurisma \
    REMAP_TABLESPACE=PDH_LARGE:MEDIUM REMAP_TABLESPACE=PDH_MEDIUM:MEDIUM \
    REMAP_TABLESPACE=PDH_SMALL:MEDIUM REMAP_TABLESPACE=PDH_BATCH:MEDIUM

wget on Linux with a proxy

I had to use wget with a proxy to download unrar from behind my company's firewall.

This is what I had to do.

First, I set up my proxy (note that a SOCKS proxy isn't usable here)

I added this to my .bash_profile and sourced it:

export http_proxy=http://bastion2.us.dnb.com:8080
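
Sourcing the profile picks the variable up in the current shell, and echoing it is a quick way to confirm it's set:

source ~/.bash_profile
echo $http_proxy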

Then I executed this command:

wget http://www.rarlab.com/rar/rarlinux-4.0.1.tar.gz

I went to the rarlab website and moused over the unrar for linux download link to see the proper URL to use.

Monday, November 16, 2009

How to mount a Windows fileshare

Assuming the folder 'library' is being shared on the ntserv1 machine, here's how to do it from the command line:

mount -t cifs -o username=pcr,password=Purisma1 //ntserv1/library /library

Be sure to create the /library directory on your unix box and set the permissions to allow everyone to access it.
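
Something like this does it (world-accessible per the note above; tighten the mode if you don't need that):

mkdir /library
chmod 777 /library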

The /etc/fstab entry will look like this:

//ntserv1/Library /library cifs password=Purisma1,username=pcr 0 0
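
If you'd rather not leave the password sitting in /etc/fstab, cifs also supports a credentials file; the /root/.smbcred path here is just an example:

//ntserv1/Library /library cifs credentials=/root/.smbcred 0 0

where /root/.smbcred (chmod 600) contains:

username=pcr
password=Purisma1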

Friday, August 28, 2009

Help, there's garbage in my Outlook signature (or what the hell is this ﻿ in my sig)

Outlook will add garbage characters to your sig. They are the UTF-8 byte order mark (BOM); for some reason Microsoft wants to add it to all signatures.

Anyway, to fix it, you need to manually modify your .sig files. They are located under C:\Documents and Settings\Username\Application Data\Microsoft\Signatures on Windows XP. On Vista they can be found under C:\Users\Username\AppData\Roaming\Microsoft\Signatures.

Once there, you will see several files:

signature.htm
signature.rtf
signature.txt

I just needed to modify the .rtf and .txt files by deleting the "﻿" at the start of the first line to rectify the issue. I used Notepad for the .txt file and Word for the .rtf file.

Wednesday, August 26, 2009

Oracle client libraries on 64-bit Linux (resolving libclntsh.so.10.1 error loading shared libraries)

I recently had a problem where an Oracle client install looked like it went OK, but when I tried to execute a particular utility (csscan) it gave me the error "libclntsh.so.10.1 error loading shared libraries".

I checked my LD_LIBRARY_PATH, and it was set to the following:

/home/oracle/product/10.2.0/lib:/lib/:/usr/lib

After banging my head, I realized I was installing a 64-bit Oracle client but trying to load 32-bit shared libraries from the OS (/lib and /usr/lib are 32-bit, while the Oracle 'lib' directory is 64-bit; Oracle has a separate 'lib32' directory for 32-bit libraries).

After changing my LD_LIBRARY_PATH to the following:

/home/oracle/product/10.2.0/lib:/lib64/:/usr/lib64

My problems were solved. So even if you think your LD_LIBRARY_PATH is set right, it probably isn't if you're getting that error.
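
To make the fix stick across logins, the same path can go in your .bash_profile (this is the path from my install - substitute your own ORACLE_HOME):

export LD_LIBRARY_PATH=/home/oracle/product/10.2.0/lib:/lib64/:/usr/lib64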

Tuesday, August 11, 2009

Symbolic Links with 'ln'

To remember how it works, I say something in my head: "I'll be making a link to x, and calling it y."

ln -s x y

So if you want to make a link to a directory called /PSO/achan/data-dumps, and you want to reference it from your current project directory as data-dumps, you'd do something like:

ln -s /PSO/achan/data-dumps data-dumps
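
To double-check where the link points, ls -l shows the target (other columns omitted here):

ls -l data-dumps
lrwxrwxrwx ... data-dumps -> /PSO/achan/data-dumps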