Troubleshooting GlusterFS
=========================

This section describes how to manage GlusterFS logs and the most common
troubleshooting scenarios related to GlusterFS.

Managing GlusterFS Logs
=======================

This section describes how to manage GlusterFS logs by performing the
following operation:

-   Rotating Logs

Rotating Logs
-------------

Administrators can rotate the log file in a volume, as needed.

**To rotate a log file**

-   Rotate the log file using the following command:

    `# gluster volume log rotate <VOLNAME>`

    For example, to rotate the log file on test-volume:

        # gluster volume log rotate test-volume
        log rotate successful

    > **Note**
    >
    > When a log file is rotated, the contents of the current log file
    > are moved to `<log-file-name>.<epoch-timestamp>`.

Troubleshooting Geo-replication
===============================

This section describes the most common troubleshooting scenarios related
to GlusterFS Geo-replication.

Locating Log Files
------------------

For every Geo-replication session, the following three log files are
associated with it (four, if the slave is a gluster volume):

-   Master-log-file - log file for the process which monitors the Master
    volume

-   Slave-log-file - log file for the process which initiates the
    changes on the slave

-   Master-gluster-log-file - log file for the maintenance mount point
    that the Geo-replication module uses to monitor the master volume

-   Slave-gluster-log-file - the slave's counterpart of the above

**Master Log File**

To get the Master-log-file for geo-replication, use the following
command:

`gluster volume geo-replication <MASTER> <SLAVE> config log-file`

For example:

`# gluster volume geo-replication Volume1 example.com:/data/remote_dir config log-file `

**Slave Log File**

To get the log file for Geo-replication on the slave (glusterd must be
running on the slave machine), use the following commands:

1.  On master, run the following command:

    `# gluster volume geo-replication Volume1 example.com:/data/remote_dir config session-owner`

        5f6e5200-756f-11e0-a1f0-0800200c9a66

    This displays the session owner details.

2.  On slave, run the following command:

    `# gluster volume geo-replication /data/remote_dir config log-file /var/log/gluster/${session-owner}:remote-mirror.log `

3.  Substitute the session owner details (output of Step 1) into the
    output of Step 2 to get the location of the log file.

    `/var/log/gluster/5f6e5200-756f-11e0-a1f0-0800200c9a66:remote-mirror.log`
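
As a combined sketch (the volume, host, and session owner below are the ones from the example above; each command runs on the host named in its comment):

    # On the master: display the session owner of the Volume1 session
    gluster volume geo-replication Volume1 example.com:/data/remote_dir config session-owner
    # On the slave: substitute that value into the path from Step 2, for example:
    ls /var/log/gluster/5f6e5200-756f-11e0-a1f0-0800200c9a66:remote-mirror.log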

Rotating Geo-replication Logs
-----------------------------

Administrators can rotate the log file of a particular master-slave
session, as needed. When you run geo-replication's `log-rotate`
command, the log file is backed up with the current timestamp suffixed
to the file name, and a signal is sent to gsyncd to start logging to a
new log file.

**To rotate a geo-replication log file**

-   Rotate log file for a particular master-slave session using the
    following command:

    `# gluster volume geo-replication <MASTER> <SLAVE> log-rotate`

    For example, to rotate the log file of master `Volume1` and slave
    `example.com:/data/remote_dir` :

        # gluster volume geo-replication Volume1 example.com:/data/remote_dir log-rotate
        log rotate successful

-   Rotate log file for all sessions for a master volume using the
    following command:

    `# gluster volume geo-replication <MASTER> log-rotate`

    For example, to rotate the log file of master `Volume1`:

        # gluster volume geo-replication Volume1 log-rotate
        log rotate successful

-   Rotate log file for all sessions using the following command:

    `# gluster volume geo-replication log-rotate`

    For example, to rotate the log file for all sessions:

        # gluster volume geo-replication log-rotate
        log rotate successful

Synchronization is not complete
-------------------------------

**Description**: GlusterFS Geo-replication did not synchronize the data
completely, but the geo-replication status is displayed as OK.

**Solution**: You can enforce a full sync of the data by erasing the
index and restarting GlusterFS Geo-replication. After restarting,
GlusterFS Geo-replication begins synchronizing all the data. All files
are compared using checksums, which can be a lengthy and
resource-intensive operation on large data sets. If the error
persists, contact Red Hat Support.

For more information about erasing the index, see ?.

Issues in Data Synchronization
------------------------------

**Description**: Geo-replication displays the status as OK, but the
files do not get synced; only directories and symlinks get synced, with
the following error message in the log:

    [2011-05-02 13:42:13.467644] E [master:288:regjob] GMaster: failed to sync ./some_file`

**Solution**: Geo-replication invokes rsync v3.0.0 or higher on the host
and the remote machine. You must verify that the required version is
installed on both machines.
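
A quick check, to be run on both the host and the remote machine:

    # Print the installed rsync version; it must be 3.0.0 or higher
    rsync --version | head -n 1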

Geo-replication status displays Faulty very often
-------------------------------------------------

**Description**: Geo-replication displays the status as Faulty very
often, with a backtrace similar to the following:

    2011-04-28 14:06:18.378859] E [syncdutils:131:log_raise_exception] <top>: FAIL:
    Traceback (most recent call last):
      File "/usr/local/libexec/glusterfs/python/syncdaemon/syncdutils.py", line 152, in twrap
        tf(*aa)
      File "/usr/local/libexec/glusterfs/python/syncdaemon/repce.py", line 118, in listen
        rid, exc, res = recv(self.inf)
      File "/usr/local/libexec/glusterfs/python/syncdaemon/repce.py", line 42, in recv
        return pickle.load(inf)
    EOFError

**Solution**: This error indicates that the RPC communication between
the master gsyncd module and the slave gsyncd module is broken, which
can happen for various reasons. Check that all of the following
prerequisites are met (a minimal set of check commands is sketched
after this list):

-   Password-less SSH is set up properly between the host and the remote
    machine.

-   FUSE is installed on the machine, because the geo-replication
    module mounts the GlusterFS volume using FUSE to sync data.

-   If the **Slave** is a volume, check that the volume is started.

-   If the Slave is a plain directory, verify that the directory has
    already been created with the required permissions.

-   If GlusterFS 3.2 or higher is not installed in the default location
    on the Master and has instead been installed under a custom prefix,
    configure the `gluster-command` option to point to the exact
    location.

-   If GlusterFS 3.2 or higher is not installed in the default location
    on the slave and has instead been installed under a custom prefix,
    configure the `remote-gsyncd-command` option to point to the exact
    place where gsyncd is located.
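
A minimal sketch of the corresponding checks (the slave host and volume names are placeholders):

    # Password-less SSH: this should print the slave's hostname without prompting for a password
    ssh <slave-host> hostname
    # FUSE: verify that the kernel module is loaded (load it with "modprobe fuse" if it is not)
    lsmod | grep fuse
    # Slave volume: verify that it is started
    gluster volume info <slave-volume> | grep Status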

Intermediate Master goes to Faulty State
----------------------------------------

**Description**: In a cascading set-up, the intermediate master goes to
faulty state with the following log:

    raise RuntimeError("aborting on uuid change from %s to %s" % \
    RuntimeError: aborting on uuid change from af07e07c-427f-4586-ab9f-4bf7d299be81 to de6b5040-8f4e-4575-8831-c4f55bd41154

**Solution**: In a cascading set-up, the intermediate master is loyal to
the original primary master. The above log means that the
geo-replication module has detected a change in the primary master. If
this is the desired behavior, delete the `volume-id` configuration
option in the session initiated from the intermediate master, as
sketched below.
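
A sketch of removing the option, using placeholder master and slave names; in the geo-replication CLI, a configuration option is deleted by prefixing its name with `!`:

    # Run on the intermediate master, for the session it initiated
    gluster volume geo-replication <master-volume> <slave-host>:<slave-dir> config '!volume-id'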

Troubleshooting POSIX ACLs
==========================

This section describes the most common troubleshooting issues related to
POSIX ACLs.

setfacl command fails with “setfacl: \<file or directory name\>: Operation not supported” error
-----------------------------------------------------------------------------------------------

You may encounter this error when the backend file system on one of the
servers is not mounted with the `-o acl` option. This can be confirmed
by the following error message in the server's log file: "Posix access
control list is not supported".

**Solution**: Remount the backend file system with the `-o acl` option,
as sketched below. For more information, see ?.
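
A minimal remount sketch, assuming a hypothetical brick mount point `/export/brick1` backed by `/dev/sdb1`:

    # Remount the backend file system with ACL support enabled
    mount -o remount,acl /export/brick1
    # To make it persistent across reboots, add "acl" to the mount options in /etc/fstab:
    # /dev/sdb1  /export/brick1  ext4  defaults,acl  0 0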

Troubleshooting Hadoop Compatible Storage
=========================================

This section describes the most common troubleshooting issues related to
Hadoop Compatible Storage.

Time Sync
---------

Running a MapReduce job may throw exceptions if the time is out of sync
on the hosts in the cluster.

**Solution**: Synchronize the time on all hosts using the `ntpd`
program.
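
A minimal sketch, assuming the hosts can reach a public NTP server such as `pool.ntp.org` (the init script may be named `ntp` on some distributions):

    # One-off check of the clock offset against an NTP server
    ntpdate -q pool.ntp.org
    # Keep the clock in sync by running the NTP daemon
    /etc/init.d/ntpd start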

Troubleshooting NFS
===================

This section describes the most common troubleshooting issues related to
NFS .

mount command on NFS client fails with “RPC Error: Program not registered”
--------------------------------------------------------------------------

Start portmap or rpcbind service on the NFS server.

This error is encountered when the server has not started correctly.

On most Linux distributions this is fixed by starting portmap:

`$ /etc/init.d/portmap start`

On some distributions where portmap has been replaced by rpcbind, the
following command is required:

`$ /etc/init.d/rpcbind start `

After starting portmap or rpcbind, the Gluster NFS server needs to be
restarted.
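
Once the services are running and the Gluster NFS server has been restarted, you can verify the registration (the server name is a placeholder):

    # List the RPC programs registered with the portmapper on the NFS server
    rpcinfo -p <nfs-server>
    # Expect entries for portmapper (port 111), mountd (program 100005) and nfs (program 100003)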

NFS server start-up fails with “Port is already in use” error in the log file.
-------------------------------------------------------------------------------

Another Gluster NFS server is running on the same machine.

This error can arise in case there is already a Gluster NFS server
running on the same machine. This situation can be confirmed from the
log file, if the following error lines exist:

    [2010-05-26 23:40:49] E [rpc-socket.c:126:rpcsvc_socket_listen] rpc-socket: binding socket failed:Address already in use
    [2010-05-26 23:40:49] E [rpc-socket.c:129:rpcsvc_socket_listen] rpc-socket: Port is already in use 
    [2010-05-26 23:40:49] E [rpcsvc.c:2636:rpcsvc_stage_program_register] rpc-service: could not create listening connection 
    [2010-05-26 23:40:49] E [rpcsvc.c:2675:rpcsvc_program_register] rpc-service: stage registration of program failed 
    [2010-05-26 23:40:49] E [rpcsvc.c:2695:rpcsvc_program_register] rpc-service: Program registration failed: MOUNT3, Num: 100005, Ver: 3, Port: 38465 
    [2010-05-26 23:40:49] E [nfs.c:125:nfs_init_versions] nfs: Program init failed 
    [2010-05-26 23:40:49] C [nfs.c:531:notify] nfs: Failed to initialize protocols

To resolve this error, one of the Gluster NFS servers will have to be
shut down. At this time, the Gluster NFS server does not support running
multiple NFS servers on the same machine.
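
To identify the process that is already using the port (38465 is the default Gluster NFS port), a quick check:

    # Show which process is listening on the Gluster NFS/MOUNT port
    netstat -tulpn | grep 38465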

mount command fails with “rpc.statd” related error message
----------------------------------------------------------

If the mount command fails with the following error message:

    mount.nfs: rpc.statd is not running but is required for remote locking.
    mount.nfs: Either use '-o nolock' to keep locks local, or start statd.

Start rpc.statd.

For NFS clients to mount the NFS server, the rpc.statd service must be
running on the clients.

Start rpc.statd service by running the following command:

`$ rpc.statd `
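
To confirm that rpc.statd is running and registered, a quick check:

    # The "status" RPC program (rpc.statd) should appear in the portmapper listing
    rpcinfo -p | grep status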

mount command takes too long to finish.
---------------------------------------

Start rpcbind service on the NFS client.

The problem is that the rpcbind or portmap service is not running on the
NFS client. The resolution for this is to start either of these services
by running the following command:

`$ /etc/init.d/portmap start`

On some distributions where portmap has been replaced by rpcbind, the
following command is required:

`$ /etc/init.d/rpcbind start`

NFS server glusterfsd starts but initialization fails with “rpc-service: portmap registration of program failed” error message in the log.
----------------------------------------------------------------------------------------------------------------------------------------------

NFS start-up can succeed but the initialization of the NFS service can
still fail, preventing clients from accessing the mount points. Such a
situation can be confirmed from the following error messages in the log
file:

    [2010-05-26 23:33:47] E [rpcsvc.c:2598:rpcsvc_program_register_portmap] rpc-service: Could not register with portmap
    [2010-05-26 23:33:47] E [rpcsvc.c:2682:rpcsvc_program_register] rpc-service: portmap registration of program failed
    [2010-05-26 23:33:47] E [rpcsvc.c:2695:rpcsvc_program_register] rpc-service: Program registration failed: MOUNT3, Num: 100005, Ver: 3, Port: 38465
    [2010-05-26 23:33:47] E [nfs.c:125:nfs_init_versions] nfs: Program init failed
    [2010-05-26 23:33:47] C [nfs.c:531:notify] nfs: Failed to initialize protocols
    [2010-05-26 23:33:49] E [rpcsvc.c:2614:rpcsvc_program_unregister_portmap] rpc-service: Could not unregister with portmap
    [2010-05-26 23:33:49] E [rpcsvc.c:2731:rpcsvc_program_unregister] rpc-service: portmap unregistration of program failed
    [2010-05-26 23:33:49] E [rpcsvc.c:2744:rpcsvc_program_unregister] rpc-service: Program unregistration failed: MOUNT3, Num: 100005, Ver: 3, Port: 38465

1.  Start portmap or rpcbind service on the NFS server.

    On most Linux distributions, portmap can be started using the
    following command:

    `$ /etc/init.d/portmap start `

    On some distributions where portmap has been replaced by rpcbind,
    run the following command:

    `$ /etc/init.d/rpcbind start `

    After starting portmap or rpcbind, the Gluster NFS server needs to be
    restarted.

2.  Stop another NFS server running on the same machine.

    Such an error is also seen when there is another NFS server running
    on the same machine but it is not the Gluster NFS server. On Linux
    systems, this could be the kernel NFS server. Resolution involves
    stopping the other NFS server or not running the Gluster NFS server
    on the machine. Before stopping the kernel NFS server, ensure that
    no critical service depends on access to that NFS server's exports.

    On Linux, kernel NFS servers can be stopped by using either of the
    following commands depending on the distribution in use:

    `$ /etc/init.d/nfs-kernel-server stop`

    `$ /etc/init.d/nfs stop`

3.  Restart Gluster NFS server.

mount command fails with NFS server failed error.
-------------------------------------------------

The mount command fails with the following error:

*mount: mount to NFS server '10.1.10.11' failed: timed out (retrying).*

Perform one of the following to resolve this issue:

1.  Disable name lookup requests from the NFS server to a DNS server.

    The NFS server attempts to authenticate NFS clients by performing a
    reverse DNS lookup to match hostnames in the volume file with the
    client IP addresses. There can be a situation where the NFS server
    either is not able to connect to the DNS server or the DNS server is
    taking too long to respond to DNS requests. These delays can result
    in delayed replies from the NFS server to the NFS client, resulting
    in the timeout error seen above.

    The NFS server provides a workaround that disables DNS requests,
    instead relying only on the client IP addresses for authentication.
    The following option can be added for successful mounting in such
    situations:

    `option rpc-auth.addr.namelookup off `

    > **Note**
    >
    > Remember that disabling name lookup forces the NFS server to
    > authenticate clients using only IP addresses; if the
    > authentication rules in the volume file use hostnames, those
    > authentication rules will fail and disallow mounting for those
    > clients.

    or

2.  NFS version used by the NFS client is other than version 3.

    The Gluster NFS server supports version 3 of the NFS protocol. In
    recent Linux kernels, the default NFS version has been changed from
    3 to 4. It is possible that the client machine is unable to connect
    to the Gluster NFS server because it is using version 4 messages,
    which are not understood by the Gluster NFS server. The timeout can
    be resolved by forcing the NFS client to use version 3. The
    **vers** option to the mount command is used for this purpose, as
    shown in the sketch below:

    `$ mount -o vers=3 <server>:<volume> <mount-point>`
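
    A concrete sketch, using a hypothetical server name `nfsserver`, the
    volume `test-volume`, and the mount point `/mnt/glusterfs`:

        $ mount -t nfs -o vers=3 nfsserver:/test-volume /mnt/glusterfs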

showmount fails with clnt\_create: RPC: Unable to receive
---------------------------------------------------------

Check your firewall settings to open port 111 for portmap
requests/replies and to open the Gluster NFS server ports for its
requests/replies. The Gluster NFS server operates over the following
port numbers: 38465, 38466, and 38467.

For more information, see ?.
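
A sketch of verifying reachability from the client and opening the ports on the server (the server name is a placeholder; adapt the firewall rule to your setup):

    # From the client: check that the RPC services and exports are reachable
    rpcinfo -p nfsserver
    showmount -e nfsserver
    # On the server: open the portmapper and Gluster NFS ports, for example with iptables
    iptables -A INPUT -p tcp -m multiport --dports 111,38465:38467 -j ACCEPT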

Application fails with "Invalid argument" or "Value too large for defined data type" error.
-------------------------------------------------------------------------------------------

These two errors generally happen for 32-bit NFS clients, or
applications that do not support 64-bit inode numbers or large files.
Use the following option from the CLI to make Gluster NFS return 32-bit
inode numbers instead: `nfs.enable-ino32 <on|off>`
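
A minimal sketch of enabling the option on a volume named `test-volume`:

    # gluster volume set test-volume nfs.enable-ino32 on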

Applications that will benefit are those that were either:

-   built 32-bit and run on 32-bit machines such that they do not
    support large files by default

-   built 32-bit on 64-bit systems

This option is disabled by default, so NFS returns 64-bit inode numbers
by default.

If an application can be rebuilt from source, it is recommended to
rebuild it using the following flag with gcc:

` -D_FILE_OFFSET_BITS=64`
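
For example, a hypothetical application `myapp.c` could be rebuilt as:

    gcc -D_FILE_OFFSET_BITS=64 -o myapp myapp.c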

Troubleshooting File Locks
==========================

In GlusterFS 3.3, you can use the `statedump` command to list the locks
held on files. The statedump output also provides information on each
lock, such as its range, basename, and the PID of the application
holding the lock. You can analyze the output to identify locks whose
owner or application is no longer running or no longer interested in
the lock. After ensuring that no application is using the file, you can
clear the lock using the following `clear-locks` command:

`# gluster volume clear-locks <VOLNAME> <path> kind {blocked|granted|all} {inode [range] | entry [basename] | posix [range]}`

For more information on performing a `statedump`, see ?.

**To identify a locked file and clear the locks**

1.  Perform statedump on the volume to view the files that are locked
    using the following command:

    `# gluster volume statedump <VOLNAME> inode`

    For example, to display statedump of test-volume:

        # gluster volume statedump test-volume
        Volume statedump successful

    The statedump files are created on the brick servers in the `/tmp`
    directory or in the directory set using the `server.statedump-path`
    volume option. The naming convention of the dump file is
    `<brick-path>.<brick-pid>.dump`.

    The following is sample content from a statedump file. It indicates
    that GlusterFS has entered a state where there is an entry lock
    (entrylk) and an inode lock (inodelk). Ensure that these are stale
    locks and that no resources own them.

        [xlator.features.locks.vol-locks.inode]
        path=/
        mandatory=0
        entrylk-count=1
        lock-dump.domain.domain=vol-replicate-0
        xlator.feature.locks.lock-dump.domain.entrylk.entrylk[0](ACTIVE)=type=ENTRYLK_WRLCK on basename=file1, pid = 714782904, owner=ffffff2a3c7f0000, transport=0x20e0670, , granted at Mon Feb 27 16:01:01 2012

        conn.2.bound_xl./gfs/brick1.hashsize=14057
        conn.2.bound_xl./gfs/brick1.name=/gfs/brick1/inode
        conn.2.bound_xl./gfs/brick1.lru_limit=16384
        conn.2.bound_xl./gfs/brick1.active_size=2
        conn.2.bound_xl./gfs/brick1.lru_size=0
        conn.2.bound_xl./gfs/brick1.purge_size=0

        [conn.2.bound_xl./gfs/brick1.active.1]
        gfid=538a3d4a-01b0-4d03-9dc9-843cd8704d07
        nlookup=1
        ref=2
        ia_type=1
        [xlator.features.locks.vol-locks.inode]
        path=/file1
        mandatory=0
        inodelk-count=1
        lock-dump.domain.domain=vol-replicate-0
        inodelk.inodelk[0](ACTIVE)=type=WRITE, whence=0, start=0, len=0, pid = 714787072, owner=00ffff2a3c7f0000, transport=0x20e0670, , granted at Mon Feb 27 16:01:01 2012

2.  Clear the entry lock using the following command:

    `# gluster volume clear-locks <VOLNAME> <path> kind {blocked|granted|all} entry <basename>`

    For example, to clear the entry lock on `file1` of test-volume:

        # gluster volume clear-locks test-volume / kind granted entry file1
        Volume clear-locks successful
        vol-locks: entry blocked locks=0 granted locks=1

3.  Clear the inode lock using the following command:

    `# gluster volume clear-locks <VOLNAME> <path> kind {blocked|granted|all} inode <range>`

    For example, to clear the inode lock on `file1` of test-volume:

        # gluster volume clear-locks test-volume /file1 kind granted inode 0,0-0
        Volume clear-locks successful
        vol-locks: inode blocked locks=0 granted locks=1

    You can perform statedump on test-volume again to verify that the
    above inode and entry locks are cleared.