common-user.hadoop.apache.org

hadoop fs -rmr /*?

W.P. McNeill - Wed, 16 Mar 2011 17:35:59 +0000 (UTC)
On HDFS, anyone can run hadoop fs -rmr /* and delete everything.  The
permissions system minimizes the danger of accidental global deletion on
UNIX or NT because you're less likely to type an administrator password by
accident.  But HDFS has no such safeguard, and the typo corollary to
Murphy's Law guarantees that someone is going to accidentally do this at
some point.  From reading documentation and Googling around, it seems like
the mechanisms for protecting high-value HDFS data from accidental deletion
are:

   1. Set fs.trash.interval to a non-zero value.
   2. Write your own backup utility.

(1) is nice because it's built into HDFS, but it only works for shell
operations, and you may not have spare terabytes of Trash to catch the big
accidental deletes.  (2) seems like a roll-your-own distcp solution.

What are examples of systems people working with high-value HDFS data put in
place so that they can sleep at night?  Are there easy-to-use and reliable
backup utilities, either to another HDFS cluster or to DVD/tape?  Is there a
way to disable fs -rmr for certain directories?
David Rosenstrauch - Wed, 16 Mar 2011 18:21:25 +0000 (UTC)
On 03/16/2011 01:35 PM, W.P. McNeill wrote:
> On HDFS, anyone can run hadoop fs -rmr /* and delete everything.

Not sure how you have your installation set up, but on ours (we installed
Cloudera CDH), only the "hadoop" user has full read/write access to HDFS.
Since we rarely log in as user hadoop or run jobs as that user, this forces
us to explicitly set up and chown directory trees in HDFS that only specific
users can access, thus enforcing file read/write restrictions.
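To sketch what that looks like (the user, group, and path here are made up),
the superuser sets up each tree once:

   hadoop fs -mkdir /data/projectx
   hadoop fs -chown alice:projectx /data/projectx
   hadoop fs -chmod 750 /data/projectx

After that, only alice (and the superuser) can write or delete under
/data/projectx, the projectx group can read it, and everyone else is locked out.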

HTH,

DR
Ted Dunning - Wed, 16 Mar 2011 19:55:29 +0000 (UTC)
W.P. is correct, however, that standard techniques like snapshots, mirrors,
and point-in-time backups do not exist in standard Hadoop.

This requires a variety of creative workarounds if you use stock Hadoop.

It is not uncommon for people to have memories of either removing everything
themselves or of somebody close to them doing the same thing.

Few people have memories of doing it twice.

On Wed, Mar 16, 2011 at 11:20 AM, David Rosenstrauch wrote:

> On 03/16/2011 01:35 PM, W.P. McNeill wrote:
>
>> On HDFS, anyone can run hadoop fs -rmr /* and delete everything.
>>
>
> Not sure how you have your installation set up, but on ours (we installed
> Cloudera CDH), only the "hadoop" user has full read/write access to HDFS.
> Since we rarely log in as user hadoop or run jobs as that user, this forces
> us to explicitly set up and chown directory trees in HDFS that only specific
> users can access, thus enforcing file read/write restrictions.
>
> HTH,
>
> DR
>
Brian Bockelman - Wed, 16 Mar 2011 20:31:58 +0000 (UTC)
Hi W.P.,

Hadoop does apply permissions, taking the username from the shell.  So, if the directory is owned by user "brian" and user "ted" does a "rmr /user/brian", ted gets a permission denied error.
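For instance (user and path invented here, and the exact message text varies
by Hadoop version), user ted would see something like:

   $ hadoop fs -rmr /user/brian
   rmr: org.apache.hadoop.security.AccessControlException: Permission denied: user=ted, access=WRITE, inode="brian":brian:supergroup:rwxr-xr-x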

By default, this is not a safeguard against malicious users; a malicious user can do whatever they want with the Hadoop cluster.  It safeguards against accidents by authorized users.

*However*, there is a security branch that uses Kerberos for authentication.  This is considered secure.

In the end though, Disks Are Not Backups.  As someone who has accidentally deleted over 100TB of data in a matter of minutes, I can assure you that high-value data belongs backed up on tape, ejected and locked in a vault, with the write-protection tab flipped on.

Hadoop IS NOT ARCHIVAL STORAGE.   That's an important fact.

Brian

On Mar 16, 2011, at 12:35 PM, W.P. McNeill wrote:

> On HDFS, anyone can run hadoop fs -rmr /* and delete everything.  The
> permissions system minimizes the danger of accidental global deletion on
> UNIX or NT because you're less likely to type an administrator password by
> accident.  But HDFS has no such safeguard, and the typo corollary to
> Murphy's Law guarantees that someone is going to accidentally do this at
> some point.  From reading documentation and Googling around, it seems like
> the mechanisms for protecting high-value HDFS data from accidental deletion
> are:
> 
>   1. Set fs.trash.interval to a non-zero value.
>   2. Write your own backup utility.
> 
> (1) is nice because it's built into HDFS, but it only works for shell
> operations, and you may not have spare terabytes of Trash to catch the big
> accidental deletes.  (2) seems like a roll-your-own distcp solution.
> 
> What are examples of systems people working with high-value HDFS data put in
> place so that they can sleep at night?  Are there easy-to-use and reliable
> backup utilities, either to another HDFS cluster or to DVD/tape?  Is there a
> way to disable fs -rmr for certain directories?
Allen Wittenauer - Wed, 16 Mar 2011 21:39:23 +0000 (UTC)
On Mar 16, 2011, at 10:35 AM, W.P. McNeill wrote:

> On HDFS, anyone can run hadoop fs -rmr /* and delete everything.

In addition to what everyone else has said, I'm fairly certain that -rmr / is specifically safeguarded against.  But /* might have slipped through the cracks.
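The reason /* slips through is that the glob is expanded by the local shell,
against the local filesystem, before the hadoop client ever runs, so HDFS sees
a list of individual paths rather than a bare /.  A harmless way to see the
expansion (the exact listing depends on the machine):

   $ echo hadoop fs -rmr /*
   hadoop fs -rmr /bin /boot /etc /home /tmp /usr /var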

> What are examples of systems people working with high-value HDFS data put in
> place so that they can sleep at night?

I set in place crontabs where we randomly delete the entire file system to remind folks that HDFS is still immature.

	:D

	OK, not really.

	In reality, we have a policy that everyone signs off on before getting an account: they acknowledge that Hadoop should not be considered 'primary storage', is not a data warehouse, is not backed up, and could disappear at any moment.  But we also make sure that the base (ETL'd) data lives on multiple grids.  Any other data should be reproducible from that base data.