On HDFS, anyone can run hadoop fs -rmr /* and delete everything. The permissions system minimizes the danger of accidental global deletion on UNIX or NT because you're less likely to type an administrator password by accident. But HDFS has no such safeguard, and the typo corollary to Murphy's Law guarantees that someone is going to accidentally do this at some point. From reading documentation and Googling around, it seems like the mechanisms for protecting high-value HDFS data from accidental deletion are:

1. Set fs.trash.interval to a non-zero value.
2. Write your own backup utility.

(1) is nice because it's built in to HDFS, but it only works for shell operations, and you may not have spare terabytes of Trash to catch the big accidental deletes. (2) seems like a roll-your-own distcp solution.

What are examples of systems people working with high-value HDFS data put in place so that they can sleep at night? Are there easy-to-use and reliable backup utilities, either to another HDFS cluster or to DVD/tape? Is there a way to disable fs -rmr for certain directories?

On 03/16/2011 01:35 PM, W.P. McNeill wrote:
> On HDFS, anyone can run hadoop fs -rmr /* and delete everything.

Not sure how you have your installation set up, but on ours (we installed
Cloudera CDH), only user "hadoop" has full read/write access to HDFS.
Since we rarely either log in as user hadoop or run jobs as that user,
this forces us to explicitly create and chown directory trees in HDFS that
only specific users can access, thus enforcing file read/write restrictions.

HTH,
DR
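
A setup like the one described above could be scripted roughly as follows. This is a minimal sketch under stated assumptions, not anyone's actual tooling: the helper name `lock_down_commands`, the user `brian`, the group `analysts`, and the mode `750` are all illustrative. Only `hadoop fs -chown -R` and `hadoop fs -chmod -R` are real shell commands. The function only builds the argument lists, so they can be reviewed before someone runs them as the HDFS superuser.

```python
def lock_down_commands(path, owner, group, mode="750"):
    """Build (but do not run) the commands that hand one HDFS tree to one user.

    All names and the default mode are illustrative assumptions.
    """
    # Refuse relative paths and the root itself, so a typo cannot re-own all of HDFS.
    if not path.startswith("/") or path.strip("/") == "":
        raise ValueError("expected an absolute, non-root HDFS path: %r" % path)
    return [
        # Give the tree to the user and group that should own it.
        ["hadoop", "fs", "-chown", "-R", "%s:%s" % (owner, group), path],
        # Owner: full access; group: read/list only; everyone else: nothing.
        ["hadoop", "fs", "-chmod", "-R", mode, path],
    ]


if __name__ == "__main__":
    for argv in lock_down_commands("/user/brian", "brian", "analysts"):
        print(" ".join(argv))
```

With permissions like these, a recursive delete issued by any other non-superuser account should fail the permission check instead of removing the tree.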
W.P. is correct, however, that standard techniques like snapshots, mirrors,
and point-in-time backups do not exist in standard Hadoop.
This requires a variety of creative workarounds if you use stock Hadoop.
It is not uncommon for people to have memories of either removing everything
or somebody close to them doing the same thing.
Few people have memories of doing it twice.

On Wed, Mar 16, 2011 at 11:20 AM, David Rosenstrauch wrote:
> On 03/16/2011 01:35 PM, W.P. McNeill wrote:
>
>> On HDFS, anyone can run hadoop fs -rmr /* and delete everything.
>>
>
> Not sure how you have your installation set but on ours (we installed
> Cloudera CDH), only user "hadoop" has full read/write access to HDFS. Since
> we rarely either login as user hadoop, or run jobs as that user, this forces
> us to explicitly set and chown directory trees in HDFS that only specific
> users can access, thus enforcing file read/write restrictions.
>
> HTH,
>
> DR
>
Hi W.P.,
Hadoop does apply permissions taken from the shell. So, if the directory is owned by user "brian" and user "ted" does a "rmr /user/brian", then you get a permission denied error.
By default, this is not safeguarded against malicious users. A malicious user can do whatever they want with a Hadoop cluster. The permission check only safeguards against accidents by authorized users.
*However*, there is a security branch that uses Kerberos for authentication. This is considered secure.
In the end, though, Disks Are Not Backups. As someone who has accidentally deleted over 100TB of data in a matter of minutes, I can assure you that high-value data belongs backed up on tape, ejected, stored in a vault, and with the write-protection tab flipped on.
Hadoop IS NOT ARCHIVAL STORAGE. That's an important fact.
Brian

On Mar 16, 2011, at 12:35 PM, W.P. McNeill wrote:
> On HDFS, anyone can run hadoop fs -rmr /* and delete everything. The
> permissions system minimizes the danger of accidental global deletion on
> UNIX or NT because you're less likely to type an administrator password by
> accident. But HDFS has no such safeguard, and the typo corollary to
> Murphy's Law guarantees that someone is going to accidentally do this at
> some point. From reading documentation and Googling around it seems like
> the mechanisms for protecting high-value HDFS data from accidental deletion
> are:
>
> 1. Set fs.trash.interval to a non-zero value.
> 2. Write your own backup utility.
>
> (1) is nice because it's built in to HDFS, but it only works for shell
> operations, and you may not have spare terabytes of Trash to catch the big
> accidental deletes. (2) seems like a roll-your-own distcp solution.
>
> What are examples of systems people working with high-value HDFS data put in
> place so that they can sleep at night? Are there easy-to-use and reliable
> backup utilities, either to another HDFS cluster or to DVD/tape? Is there a
> way to disable fs -rmr for certain directories?
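
Option (2) in the quoted message, a roll-your-own distcp solution, could start as a thin wrapper around `hadoop distcp`. The following is a minimal sketch, assuming a second cluster is reachable: the namenode URIs, the `/data/base` and `/backup/base` paths, and the helper name `distcp_backup_command` are hypothetical. `-update` is a real distcp option that copies files that are missing or differ on the target, so repeated runs do not recopy everything. The result is a mirror rather than a point-in-time backup, because a file that was changed or corrupted at the source will overwrite the good copy on the next run.

```python
def distcp_backup_command(src_uri, dst_uri):
    """Build (but do not run) a distcp command that mirrors src_uri into dst_uri.

    The URIs passed in are the caller's responsibility; the ones used in the
    example below are made up.
    """
    for uri in (src_uri, dst_uri):
        # Insist on fully qualified URIs so source and target cannot be
        # silently resolved against the same default filesystem.
        if not uri.startswith("hdfs://"):
            raise ValueError("expected a fully qualified hdfs:// URI: %r" % uri)
    if src_uri.rstrip("/") == dst_uri.rstrip("/"):
        raise ValueError("source and destination are the same location")
    return ["hadoop", "distcp", "-update", src_uri, dst_uri]


if __name__ == "__main__":
    print(" ".join(distcp_backup_command(
        "hdfs://nn1.example.com:8020/data/base",     # hypothetical source cluster
        "hdfs://nn2.example.com:8020/backup/base",   # hypothetical backup cluster
    )))
```

Running the printed command from cron gives a second copy on another cluster; it does not replace an offline copy for data that cannot be regenerated.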
On Mar 16, 2011, at 10:35 AM, W.P. McNeill wrote:
> On HDFS, anyone can run hadoop fs -rmr /* and delete everything.

In addition to what everyone else has said, I'm fairly certain that -rmr / is specifically safeguarded against. But /* might have slipped through the cracks.

> What are examples of systems people working with high-value HDFS data put in
> place so that they can sleep at night?

I set in place crontabs where we randomly delete the entire file system to remind folks that HDFS is still immature. :D

OK, not really.

In reality, we basically have a policy that everyone signs off on before getting an account where they understand that Hadoop should not be considered 'primary storage', is not a data warehouse, is not backed up, and could disappear at any moment. But we also make sure that the base (ETL'd) data lives on multiple grids. Any other data should be reproducible from that base data.
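
The thread does not name a built-in switch for disabling -rmr on particular directories. One workaround is to have people delete through a small wrapper that refuses dangerous targets such as /* before the real command is ever built. This is a minimal sketch under stated assumptions: the protected list, the `/data/base` path, and the function names are hypothetical, and the wrapper only helps users who actually go through it.

```python
import posixpath

# Hypothetical list of trees that must never be removed through the wrapper.
PROTECTED = ("/data/base",)


def check_delete_target(path, protected=PROTECTED):
    """Return the normalized path, or raise ValueError if deleting it looks unsafe."""
    # Globs like /* are exactly the typo the thread worries about.
    if any(ch in path for ch in "*?[{"):
        raise ValueError("refusing glob pattern: %r" % path)
    if not path.startswith("/"):
        raise ValueError("refusing relative path: %r" % path)
    # Collapse "..", "." and repeated slashes so /a/../.. is seen as "/".
    norm = "/" + posixpath.normpath(path).lstrip("/")
    if norm == "/":
        raise ValueError("refusing to delete the filesystem root")
    for entry in protected:
        root = "/" + posixpath.normpath(entry).lstrip("/")
        # Refuse the protected tree itself, anything inside it, and any ancestor of it.
        if (norm == root or norm.startswith(root + "/")
                or root.startswith(norm + "/")):
            raise ValueError("refusing %r: overlaps protected tree %r" % (norm, root))
    return norm


def rmr_command(path, protected=PROTECTED):
    """Build (but do not run) the recursive delete used in this thread."""
    return ["hadoop", "fs", "-rmr", check_delete_target(path, protected)]


if __name__ == "__main__":
    print(" ".join(rmr_command("/user/brian/tmp")))
```

A wrapper like this complements, rather than replaces, ownership and permissions on the directories themselves, since nothing stops a user from calling hadoop fs directly.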