Today, during a test run, data was being loaded into the cluster when the power suddenly went out.
After rebooting and running ./start-dfs.sh, the logs show the following error:
2012-04-13 15:39:43,208 INFO org.apache.hadoop.ipc.Server: IPC Server handler 9 on 8020, call rollEditLog() from 196.1.2.160:34939: error: org.apache.hadoop.hdfs.server.namenode.SafeModeException: Checkpoint not created. Name node is in safe mode.
The ratio of reported blocks 0.8544 has not reached the threshold 0.9990. Safe mode will be turned off automatically.
org.apache.hadoop.hdfs.server.namenode.SafeModeException: Checkpoint not created. Name node is in safe mode. The ratio of reported blocks 0.8544 has not reached the threshold 0.9990. Safe mode will be turned off automatically.
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.rollEditLog(FSNamesystem.java:4584)
    at org.apache.hadoop.hdfs.server.namenode.NameNode.rollEditLog(NameNode.java:660)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:616)
    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:508)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:961)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:957)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:416)
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:955)
An error like this keeps appearing, and safe mode never exits on its own.
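To confirm that the NameNode really is stuck in safe mode, you can query it directly. A minimal sketch, run from ${hadoophome}/bin:
# Prints "Safe mode is ON" or "Safe mode is OFF"
./hadoop dfsadmin -safemode get
# A cluster-wide summary (live DataNodes, capacity, missing blocks)
./hadoop dfsadmin -report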
The cause is that the load was in progress at the moment the power failed, so some data blocks were lost.
Now run ./hadoop fsck / and the report will contain a line like:
Corrupt blocks: 15
which means 15 blocks are corrupt or missing.
These blocks cannot be recovered; the only option is to delete the files that reference them.
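If you want to see exactly which files are affected before deleting anything, fsck can list them. A minimal sketch, assuming the fsck options of Hadoop 0.20/1.x (the exact output format varies by version):
# List every file with its block list; damaged files are flagged as CORRUPT or MISSING
./hadoop fsck / -files -blocks | grep -i corrupt
If you would rather not delete the damaged files outright, ./hadoop fsck / -move (where supported) parks them under /lost+found instead.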
Then, from the ${hadoophome}/bin directory, run
./hadoop dfsadmin -safemode leave to force the NameNode out of safe mode,
and then ./hadoop fsck -delete / to delete the files whose blocks are lost.
After that, run ./stop-dfs.sh to shut HDFS down,
then start it again, and the HDFS side is back to normal.
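Putting the HDFS-side steps together, a sketch (paths assume you are in ${hadoophome}/bin, as above):
# 1. Force the NameNode out of safe mode
./hadoop dfsadmin -safemode leave
# 2. Delete the files whose blocks were lost in the power failure
./hadoop fsck -delete /
# 3. Restart HDFS cleanly
./stop-dfs.sh
./start-dfs.sh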
Because my cluster stores and reads its data through HBase, starting HBase directly at this point also fails: it complains that it cannot find its data.
2012-04-13 16:08:58,172 WARN org.apache.hadoop.hbase.util.FSUtils: Waited 85176ms for lease recovery on hdfs://gsmaster/hbase/.logs/gsdata1,60020,1334287545838/gsdata1%3A60020.1334291689722:org.apache.hadoop.hdfs.protocol.AlreadyBeingCreatedException: failed to create file /hbase/.logs/gsdata1,60020,1334287545838/gsdata1%3A60020.1334291689722 for DFSClient_hb_m_gsmaster:60000_1334304432374 on client 196.1.2.160, because this file is already being created by NN_Recovery on 196.1.2.161
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.recoverLeaseInternal(FSNamesystem.java:1202)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.recoverLease(FSNamesystem.java:1157)
    at org.apache.hadoop.hdfs.server.namenode.NameNode.recoverLease(NameNode.java:404)
    at sun.reflect.GeneratedMethodAccessor7.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:616)
    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:508)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:961)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:957)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:416)
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:955)
When this happens, first kill the HMaster process.
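A minimal sketch of that step (the jps output and the PID 12345 are only illustrative):
# Find the HMaster process on the master node
jps | grep HMaster
# suppose it prints "12345 HMaster"
kill 12345
If the master still responds, ${hbasehome}/bin/hbase-daemon.sh stop master is the more graceful way to bring it down.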
Then, from ${hadoophome}/bin, look at what is under the /hbase directory. The listing shows four directories:

type | modified         | permissions | owner    | group
dir  | 2012-04-13 17:07 | rwxr-xr-x   | goldsoft | supergroup
dir  | 2012-04-13 17:07 | rwxr-xr-x   | goldsoft | supergroup
dir  | 2012-04-13 17:12 | rwxr-xr-x   | goldsoft | supergroup
dir  | 2012-04-13 17:11 | rwxr-xr-x   | goldsoft | supergroup

Delete all of these four directories except .oldlogs, using commands of the form ./hadoop fs -rmr /hbase/-ROOT-.
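A sketch of that cleanup; /hbase/-ROOT- is the example given above, while <directory-name> stands for whichever of the remaining directories your own listing shows (leave .oldlogs alone):
# Re-check what is actually under /hbase first
./hadoop fs -ls /hbase
# Remove each directory except .oldlogs
./hadoop fs -rmr /hbase/-ROOT-
./hadoop fs -rmr /hbase/<directory-name>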
Then start HBase again and wait for it to finish initializing, i.e. until the log shows
2012-04-13 17:08:43,901 INFO org.apache.hadoop.hbase.master.HMaster: Master has completed initialization
At that point, go to ${hbasehome}/bin and run ./hbase org.jruby.Main add_table.rb /<hbase dir>/<table name> (a sketch of this step is at the end of this post), and the tables are usable again.
To keep data from being lost in the first place, it is best, when loading data, to delete the source data only after the load has been confirmed successful. Only then is the data guaranteed to be safe, because nobody can foresee something like a power failure.
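A minimal sketch of the add_table.rb step; the table name mytable is only an example, and as I understand it the script re-adds a table's regions to .META. from the .regioninfo files under its directory:
cd ${hbasehome}/bin
# Re-register one table with HBase after its metadata was removed above
./hbase org.jruby.Main add_table.rb /hbase/mytable
Run it once for each table that needs to be restored.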