System monitoring tool usage examples [AIX]

by 라제폰 2009. 11. 4. 15:33

1. What to do when top shows httpd processes using excessive CPU

First identify the offending httpd process, then
check its state with the commands below.

Running /httpd/kosdaq/bin/lsof -i | grep httpd shows:

httpd  38546 kosdaq   16u IPv4 0x706092dc    0t0 TCP *:www (LISTEN)        --> healthy
httpd  78970 kosdaq    5u IPv4                  0t509 TCP no PCB, CANTSENDMORE, CANTRCVMORE   --> should be killed
httpd  78970 kosdaq    6u IPv4 0x705fa2dc    0t0 TCP loopback:43513->loopback:8011 (CLOSE_WAIT) --> should be killed
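
The manual check above can be automated. The following is only a minimal sketch: it assumes the lsof -i lines have already been captured as text, and the function name suspect_pids and the marker list are hypothetical names chosen for illustration. It flags httpd PIDs that hold a socket in CLOSE_WAIT or in the "no PCB" state.

```python
# Sample lines copied from the lsof -i output above (annotations removed).
LSOF_OUTPUT = """\
httpd  38546 kosdaq   16u IPv4 0x706092dc    0t0 TCP *:www (LISTEN)
httpd  78970 kosdaq    5u IPv4                  0t509 TCP no PCB, CANTSENDMORE, CANTRCVMORE
httpd  78970 kosdaq    6u IPv4 0x705fa2dc    0t0 TCP loopback:43513->loopback:8011 (CLOSE_WAIT)
"""

# Socket states that the text marks as "should be killed" (assumed list).
SUSPECT_MARKERS = ("CLOSE_WAIT", "no PCB")


def suspect_pids(lsof_text, command="httpd"):
    """Return the sorted PIDs of `command` that own a suspect socket."""
    pids = set()
    for line in lsof_text.splitlines():
        fields = line.split()
        # Skip blank lines and lines that belong to other commands.
        if len(fields) < 2 or fields[0] != command:
            continue
        if any(marker in line for marker in SUSPECT_MARKERS):
            pids.add(int(fields[1]))
    return sorted(pids)


print(suspect_pids(LSOF_OUTPUT))
```

Whether to kill the reported PIDs is still a judgment call for the operator; the sketch only narrows down the candidates.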

/httpd/kosdaq/bin/lsof -p 78970
COMMAND   PID   USER   FD   TYPE     DEVICE SIZE/OFF   NODE NAME
httpd   78970 kosdaq cwd   VDIR       10,4     1536      2 / (/dev/hd4)
httpd   78970 kosdaq    0r VCHR        2,2      0t0     96 /dev/null
httpd   78970 kosdaq    1w VCHR        2,2      0t0     96 /dev/null
httpd   78970 kosdaq    2w VREG       58,4 13193922 436524 /httpd (/dev/httplv)
httpd   78970 kosdaq    3r FIFO 0x5349e620        0
httpd   78970 kosdaq    4w FIFO 0x5349e620        0
httpd   78970 kosdaq    5u IPv4               0t509    TCP no PCB, CANTSENDMORE, CANTRCVMORE
httpd   78970 kosdaq    6u IPv4 0x705fa2dc      0t0    TCP loopback:43513->loopback:8011 (CLOSE_WAIT)
httpd   78970 kosdaq   15w VREG       58,4 13193922 436524 /httpd (/dev/httplv)
httpd   78970 kosdaq   16u IPv4 0x706092dc      0t0    TCP *:www (LISTEN)

netstat -n | grep 9090
tcp4 0 0 172.16.1.21.9090 172.16.1.21.51851 TIME_WAIT

2. Memory monitoring

2.1 vmstat 2 5

kthr     memory             page              faults        cpu
----- ----------- ------------------------ ------------ -----------
r b   avm   fre   re pi po fr   sr cy in    sy cs us sy id wa
0 1 728275 1219   0   0   0 62   98   0 32   106 40 20 6 70 3
2 2 728393 1078   0   3   0   0    0   0 247 28819 245 59 13 26 3
3 2 728373 1077   0   0   0   0    0   0 265 20857 426 77 22 0 0
1 2 728277 1209   0   0   0   0    0   0 240 7176 265 78 13 8 1
1 2 728277 1209   0   0   0   0    0   0 258 1451 114 85 1 14 0

If fre < 2 * Real Memory(MB) - 8, there is a problem (memory is overcommitted).
If pi > 5, there is a problem (memory is overcommitted).
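
The two rules of thumb above can be applied to captured vmstat output. This is a rough sketch, not a definitive tool: the column names list, the helper names, and in particular the real memory size of 600 MB are assumptions made for illustration (the post does not state the memory size of this machine).

```python
# vmstat 2 5 output copied from above.
VMSTAT_OUTPUT = """\
kthr     memory             page              faults        cpu
----- ----------- ------------------------ ------------ -----------
r b   avm   fre   re pi po fr   sr cy in    sy cs us sy id wa
0 1 728275 1219   0   0   0 62   98   0 32   106 40 20 6 70 3
2 2 728393 1078   0   3   0   0    0   0 247 28819 245 59 13 26 3
3 2 728373 1077   0   0   0   0    0   0 265 20857 426 77 22 0 0
1 2 728277 1209   0   0   0   0    0   0 240 7176 265 78 13 8 1
1 2 728277 1209   0   0   0   0    0   0 258 1451 114 85 1 14 0
"""

# "sy" appears twice in the header, so the CPU one is renamed here.
COLUMNS = ["r", "b", "avm", "fre", "re", "pi", "po", "fr", "sr", "cy",
           "in", "sy", "cs", "us", "cpu_sy", "id", "wa"]


def parse_vmstat(text):
    """Turn the numeric vmstat lines into a list of dicts."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) == len(COLUMNS) and all(f.isdigit() for f in fields):
            rows.append(dict(zip(COLUMNS, map(int, fields))))
    return rows


def memory_warnings(row, real_memory_mb):
    """Apply the two rules of thumb from the text to one sample."""
    warnings = []
    if row["fre"] < 2 * real_memory_mb - 8:
        warnings.append("fre below 2 * real memory (MB) - 8")
    if row["pi"] > 5:
        warnings.append("pi above 5")
    return warnings


REAL_MEMORY_MB = 600  # hypothetical value, for illustration only
for sample in parse_vmstat(VMSTAT_OUTPUT):
    print(sample["fre"], sample["pi"], memory_warnings(sample, REAL_MEMORY_MB))
```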

2.2 sar -r 1 10

AIX KEPA1 3 4 000BF3DD4C00 12/21/01

17:13:13   slots cycle/s fault/s odio/s
17:13:14 704651    0.00    3.00    0.00
17:13:15 704651    0.00    1.00    0.00
17:13:16 704651    0.00    0.00    2.00
17:13:17 704651    0.00    0.00    0.00
17:13:18 704651    0.00    0.00    0.00
17:13:19 704651    0.00   15.00    4.00
17:13:20 704651    0.00   37.62    8.91
17:13:21 704651    0.00   45.00    7.00
17:13:22 704651    0.00    2.00    9.00
17:13:23 704651    0.00 5005.00   10.00

Average 704651 0 510 4

cycle/s : Reports the number of page replacement cycles per second.

fault/s : Reports the number of page faults per second. This is not a count of
page faults that generate I/O, because some page faults can be resolved without I/O.

slots : Reports the number of free pages on the paging spaces.

odio/s : Reports the number of nonpaging disk I/Os per second.

2.3 lsps -a

Page Space Physical Volume   Volume Group    Size   %Used Active Auto Type
paging00    hdisk0            rootvg        2560MB      47     yes   yes    lv
hd6         hdisk0            rootvg        2560MB      47     yes   yes    lv

This shows that at least 2560 * 0.47 + 2560 * 0.47 = 2406.4 MB of additional physical memory is needed.
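
The arithmetic above can be reproduced from the lsps -a output directly. A small sketch, assuming the column layout shown above (size in MB in the fourth column, %Used in the fifth); the function name paging_space_used_mb is hypothetical.

```python
# lsps -a output copied from above.
LSPS_OUTPUT = """\
Page Space Physical Volume   Volume Group    Size   %Used Active Auto Type
paging00    hdisk0            rootvg        2560MB      47     yes   yes    lv
hd6         hdisk0            rootvg        2560MB      47     yes   yes    lv
"""


def paging_space_used_mb(lsps_text):
    """Sum size * %Used over all paging spaces, in MB."""
    total = 0.0
    for line in lsps_text.splitlines()[1:]:  # skip the header line
        fields = line.split()
        if len(fields) < 5 or not fields[3].endswith("MB"):
            continue
        size_mb = float(fields[3][:-2])      # "2560MB" -> 2560.0
        percent_used = float(fields[4])
        total += size_mb * percent_used / 100
    return total


print(round(paging_space_used_mb(LSPS_OUTPUT), 1))
```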

2.4 ps gvc
    PID    TTY STAT TIME PGIN SIZE   RSS   LIM TSIZ   TRS %CPU %MEM COMMAND
39242      - A     0:05 857 14360 13364 32768    21    24 0.0 0.0 java      --> KosdaqServer3
42246      - A     3:36   45 1320    84 32768     1     4 0.0 0.0 java      --> WSMRefreshServer
158916      - A     2:17 18836 83888 76200 32768    21    24 0.0 2.0 java      --> tomcat4
181890      - A    23:21 18835 94868 84988 32768    21    24 0.4 2.0 java      --> tomcat1
191810      - A     7:14 18594 93312 85056 32768    21    24 0.1 2.0 java      --> tomcat3
203768      - A    18:06 31489 19568 12964 32768     1     4 0.3 0.0 java      --> HFtpDaemon
221316      - A    11:35 20167 92572 88020 32768    21    24 0.2 2.0 java      --> tomcat2

PGIN : Number of memory frames paged in
%MEM : Percentage of system memory used

2.5 svmon :

Captures and analyzes a snapshot of virtual memory.

svmon -G
      m e m o r y           i n u s e             p i n       p g s p a c e
size inuse free pin   work pers clnt   work pers clnt size    inuse
65536 62724 2812 3508 41482 21242 0      3347 161   0    131072 25555

1) m e m o r y : System memory usage
   - size : Total size of real memory
   - inuse : Amount of memory in use
   - free: Amount of free memory
   - pin : Pinned memory(memory pages that cannot be swapped out)

2) i n u s e : Expands the column memory "inuse"
   - work : The system working set(data and stack regions)
   - pers : Pages that are persistent on file
   - clnt : Client allocated memory(network clients)

3) p i n : Expands the column "pin"
   - (Refer to preceding description of "inuse" columns)

4) p g s p a c e : size of the paging space
   - size : Size of paging area
   - inuse : Amount of page space in use(size of real memory extension)
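
The figures in the svmon -G output are page counts, so they are easier to read after conversion to megabytes. The sketch below assumes 4 KB pages (an assumption; check the page size on the actual system) and uses hypothetical dictionary keys for the values shown above. It also checks that the breakdown columns add up to their totals.

```python
PAGE_SIZE_KB = 4  # assumed page size


def pages_to_mb(pages):
    """Convert a page count to megabytes."""
    return pages * PAGE_SIZE_KB / 1024


# Values copied from the svmon -G output above.
SVMON_G = {
    "memory size": 65536,
    "memory inuse": 62724,
    "memory free": 2812,
    "memory pin": 3508,
    "inuse work": 41482,
    "inuse pers": 21242,
    "inuse clnt": 0,
    "pin work": 3347,
    "pin pers": 161,
    "pin clnt": 0,
    "pgspace size": 131072,
    "pgspace inuse": 25555,
}

for name, pages in SVMON_G.items():
    print(f"{name:14s} {pages:7d} pages {pages_to_mb(pages):8.1f} MB")

# Consistency checks: inuse + free = size, and the work/pers/clnt
# columns sum to the "inuse" and "pin" totals.
inuse_parts = sum(SVMON_G[k] for k in ("inuse work", "inuse pers", "inuse clnt"))
pin_parts = sum(SVMON_G[k] for k in ("pin work", "pin pers", "pin clnt"))
print(inuse_parts == SVMON_G["memory inuse"], pin_parts == SVMON_G["memory pin"])
```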

In addition, the command below lets you work out how much memory
a specific process occupies.

svmon -P <pid>

Subtracting the size of the "shared library" segments from the total "inuse" value
gives the amount of memory that a newly started process actually needs,
which should be very useful.

2.6 rmss :

Simulates a system with various sizes of memory for performance testing of applications
The rmss command simulates a system with various sizes of real memory, without having to extract
and replace memory boards. By running an application at several memory sizes and collecting performance statistics,
one can determine the memory needed to run an application with acceptable performance

Even if the machine has 6 GB of real memory, you can limit it in software to use only 2 GB, for whatever reason.

rmss -p : shows the current simulated memory size.
rmss -r : resets to the real memory size.
rmss -c <number of MB> : limits the system to the given amount of memory for testing.
rmss [ -d MemSize ] [ -f MemSize ] [ -n NumIterations ] [ -o OutputFile ] [ -s MemSize ] Command

Example) rmss -s 24 -f 8 -d 2 -n 1 -o cc.rmss.out cc -O foo.c
Runs the cc -O foo.c command once at each memory size, starting at 24 MB and shrinking in 2 MB steps down to 8 MB, and saves the results to the file cc.rmss.out.

Result)
Hostname: xray.austin.ibm.com
Real memory size:   48.00 Mb
Time of day: Wed Aug 8 13:07:33 1990
Command: cc -O foo.c
Simulated memory size initialized to 24.00 Mb.
Number of iterations per memory size = 1 warmup + 1 measured = 2.
Memory size Avg. Pageins Avg. Response Time   Avg. Pagein Rate
(megabytes)                       (sec.)           (pageins/sec.)
    -----------------------------------------------------------------
24.00             0.0              113.7                0.0
22.00             5.0              114.8                0.0
20.00             0.0              113.7                0.0
18.00             3.0              114.3                0.0
16.00             0.0              114.6                0.0
14.00             139.0            116.1                1.2
12.00             816.0            126.9                6.4
10.00             1246.0           135.7                9.2
8.00              2218.0           162.9                13.6
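
One way to read the table above is to look for the smallest memory size at which the response time is still close to the full-memory baseline. The following is a sketch under stated assumptions: the 5% tolerance and the function name smallest_acceptable_size are hypothetical, and the rows are typed in by hand from the result above.

```python
# (memory MB, avg pageins, avg response time sec, avg pagein rate)
RMSS_ROWS = [
    (24.0, 0.0, 113.7, 0.0),
    (22.0, 5.0, 114.8, 0.0),
    (20.0, 0.0, 113.7, 0.0),
    (18.0, 3.0, 114.3, 0.0),
    (16.0, 0.0, 114.6, 0.0),
    (14.0, 139.0, 116.1, 1.2),
    (12.0, 816.0, 126.9, 6.4),
    (10.0, 1246.0, 135.7, 9.2),
    (8.0, 2218.0, 162.9, 13.6),
]


def smallest_acceptable_size(rows, tolerance=0.05):
    """Smallest memory size whose response time is within `tolerance`
    (a fraction) of the response time at the largest memory size."""
    baseline = max(rows, key=lambda row: row[0])[2]
    limit = baseline * (1 + tolerance)
    acceptable = [size for size, _pageins, response, _rate in rows
                  if response <= limit]
    return min(acceptable)


print(smallest_acceptable_size(RMSS_ROWS))
```

With the assumed 5% tolerance this points at 14 MB, which is also where page-ins start to climb in the table.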

3. CPU monitoring

3.1 sar -P ALL 1 10

checking active cpus..

AIX KEPA1 3 4 000BF3DD4C00 12/21/01

17:58:47 cpu    %usr    %sys    %wio   %idle
17:58:48 0       16       0      17      67
          1       93       1       0       6
          -       55       0       8      36
17:58:49 0       53       0       1      46
          1      100       0       0       0
          -       76       0       0      23
17:58:50 0       80       1       0      19
          1       96       2       0       2
          -       88       2       0      10

Average   0       50       0       6      44
          1       96       1       0       3
          -       73       1       3      23

3.2 nmon

3.3 cpu_state -l

Shows the state of every CPU (active or inactive).

3.3 vmstat

vmstat 5

kthr     memory             page              faults        cpu
----- ----------- ------------------------ ------------ -----------
r b   avm    fre re pi po fr   sr cy in    sy cs us sy id wa
0 1 729827 1286   0   0   0 62   98   0 32   107 40 20 6 70 3
1 2 729915 1168   0   0   0   0    0   0 250 18689 606 72 15 11 2
1 2 729699 1426   0   0   0   0    0   0 264 3631 312 13 8 78 1
0 2 729699 1425   0   0   0   0    0   0 229 1111 92 42 1 58 0
0 2 729571 1568   0   0   0   0    0   0 262 21076 1254 17 21 58 4

fre : Free list pages
fr : number of page steals
sr : number of page scans

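When free memory (fre) shrinks while page steals (fr) and page scans (sr) rise, and at the same time
idle approaches 0% with system around 25% and wait around 60%, the machine is badly short of free memory
and the system (VMM) is busy replenishing free memory. During this time user applications
cannot get the CPU, so everything slows down.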
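
The pattern described above can be written down as a simple check on two consecutive vmstat samples. This is only an illustrative sketch: the cut-off for "idle close to 0%" (5 here) is an assumption, the function name is hypothetical, and the second pair of samples is made up to show a positive match.

```python
def looks_memory_starved(prev, cur, idle_max=5, sys_min=25, wait_min=60):
    """True when fre drops, fr and sr rise, and the CPU columns match
    the idle/system/wait pattern described in the text."""
    memory_pressure = (cur["fre"] < prev["fre"]
                       and cur["fr"] > prev["fr"]
                       and cur["sr"] > prev["sr"])
    cpu_pattern = (cur["id"] <= idle_max
                   and cur["sy"] >= sys_min
                   and cur["wa"] >= wait_min)
    return memory_pressure and cpu_pattern


# Last two samples of the vmstat 5 output above (CPU "sy" column).
real_prev = {"fre": 1425, "fr": 0, "sr": 0, "sy": 1, "id": 58, "wa": 0}
real_cur = {"fre": 1568, "fr": 0, "sr": 0, "sy": 21, "id": 58, "wa": 4}

# Hypothetical samples, invented for illustration only.
fake_prev = {"fre": 1200, "fr": 50, "sr": 80, "sy": 10, "id": 40, "wa": 5}
fake_cur = {"fre": 300, "fr": 900, "sr": 4000, "sy": 25, "id": 1, "wa": 60}

print(looks_memory_starved(real_prev, real_cur))
print(looks_memory_starved(fake_prev, fake_cur))
```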

3.4 sar

sar 1 5

AIX KEPA1 3 4 000BF3DD4C00 12/21/01

18:52:11     %usr    %sys    %wio   %idle
18:52:12      94       1       0       5
18:52:13      82      18       0       0
18:52:14      68      32       0       0
18:52:15      87      13       0       0
18:52:16      65      34       0       0

Average 79 20 0 1

sar -P ALL 1 5 : shows usage for each CPU separately.

AIX KEPA1 3 4 000BF3DD4C00 12/21/01

18:58:03 cpu    %usr    %sys    %wio   %idle
18:58:04 0       73      26       0       1
          1       98       2       0       0
          -       86      14       0       0
18:58:05 0       95       0       0       5
          1       98       1       0       1
          -       96       0       0       3
18:58:06 0      100       0       0       0
          1       98       2       0       0
          -       99       1       0       0
18:58:07 0       65       1       0      34
          1       47       0       0      53
          -       56       0       0      43
18:58:08 0       22       1       0      77
          1       28       1       0      71
          -       25       1       0      74

Average   0       71       6       0      23
          1       74       1       0      25
          -       72       3       0      24
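
The Average rows can be recomputed from the individual samples, which is a handy sanity check when post-processing sar output. A minimal sketch, assuming the layout shown above (the "-" row is the all-CPU figure); the function name per_cpu_averages is hypothetical.

```python
from collections import defaultdict

# Per-second samples copied from the sar -P ALL 1 5 output above.
SAR_OUTPUT = """\
18:58:03 cpu    %usr    %sys    %wio   %idle
18:58:04 0       73      26       0       1
          1       98       2       0       0
          -       86      14       0       0
18:58:05 0       95       0       0       5
          1       98       1       0       1
          -       96       0       0       3
18:58:06 0      100       0       0       0
          1       98       2       0       0
          -       99       1       0       0
18:58:07 0       65       1       0      34
          1       47       0       0      53
          -       56       0       0      43
18:58:08 0       22       1       0      77
          1       28       1       0      71
          -       25       1       0      74
"""


def per_cpu_averages(sar_text):
    """Return {cpu: [usr, sys, wio, idle]} rounded to whole percent."""
    samples = defaultdict(list)
    for line in sar_text.splitlines():
        if line.startswith("Average"):
            break  # stop before sar's own summary rows
        values = line.split()[-5:]  # drop the timestamp when present
        if len(values) == 5 and all(v.isdigit() for v in values[1:]):
            samples[values[0]].append([int(v) for v in values[1:]])
    return {cpu: [round(sum(col) / len(col)) for col in zip(*rows)]
            for cpu, rows in samples.items()}


print(per_cpu_averages(SAR_OUTPUT))
```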

4 Text-based GUI monitoring tools

linux : top
solaris : Performance Meter, SE, top
aix : monitor, nmon, top
hp : perf, top

5 Monitoring using file descriptors

5.1 ulimit -a or ulimit -n

ulimit -a (real machine)

time(seconds)        unlimited
file(blocks)         unlimited
data(kbytes)         1048576
stack(kbytes)        32768
memory(kbytes)       32768
coredump(blocks)     2097151 --> limit on the size of core files
nofiles(descriptors) 2000  --> maximum number of file descriptors a single process can open

These values can be set permanently in /etc/security/limits.
The file looks like this:

default:
        fsize = -1
        core = 2097151
        cpu = -1
        data = 2097151
        rss = 65536
        stack = 65536
        nofiles = 2000
        fsize_hard = -1

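Changing a value with the ulimit command is a soft change: it applies only to the current shell session and
the processes started from it, so it is gone by the next reboot at the latest.
Also, on AIX only the nofiles value can be changed with the ulimit command.
A file descriptor does not mean a physical file. It represents a socket, an open file stream, and so on.
The default file descriptor limit for each OS is as follows.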
AIX : 2000
Linux : 1024
Solaris : 64

The terms above may differ from one OS to another.
On Linux, for example:

ulimit -a

time(cpu-seconds)    unlimited
file(blocks)         unlimited
data(kbytes)         unlimited
stack(kbytes)        8192
memory(kbytes)       unlimited
coredump(blocks)     1000000
nofiles(descriptors) 1024
lockedmem(kbytes)    unlimited
processes            6144
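
The nofiles limit reported by ulimit -n can also be read from inside a process. The sketch below uses Python's standard resource module, which is available on Unix-like systems only; the helper name describe is hypothetical.

```python
import resource


def describe(limit):
    """Render a limit value the way ulimit does."""
    return "unlimited" if limit == resource.RLIM_INFINITY else str(limit)


# getrlimit returns the (soft, hard) pair for the calling process.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print("nofiles soft limit:", describe(soft))
print("nofiles hard limit:", describe(hard))
```

The soft value is the one in effect for the process; it can be raised up to the hard value without extra privileges.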

5.2 lsof utility

The nobody user is not subject to the ulimit restrictions.
Even if the number of file descriptors opened by the nobody user exceeds the configured
value, it is not a problem. (Opinion of 문수영 at IBM, regarding AIX)

6 Network monitoring

6.1 no command (Network Options)

The no command sets the initial network options that affect TCP, UDP, and IP,
and it is independent of the network adapter type.

no -a : shows the current no parameter values for all attributes.
no -o attribute_name : shows the value of attribute_name.
no -o attribute_name = value : sets attribute_name to value.

6.1.1 Important attributes

- thewall
-the maximum amount of memory, in KB, that is allocated to the memory pool. (multiples of 4KB)
-default : 1/2 of real memory or 1048576(1GB)
-can be increased if necessary.
- sb_max
-absolute upper bound on the size of TCP and UDP socket buffers per socket (multiples of 4096)
-default : 1024 bytes
-recommended : S7A(12-way, 8GB), H70(8-way, 4GB), 43P-260(2-way, 4GB) --> 262144
- somaxconn

- tcp_sendspace

- tcp_recvspace

- udp_sendspace

- udp_recvspace

- rfc1323

- tcp_timewait
-how long connections are kept in the timewait state.
-given in 15-second intervals.
-default is 1(which means 15 seconds)
-recommended : S7A(12-way, 8GB), H70(8-way, 4GB), S80(12-way, 16GB), 43P-260(2-way, 4GB) --> 5 (75 seconds)

- nbc_max_cache

- nbc_pseg_limit

- nbc_pseg

- nbc_limit

- MTU
-Limits the size of packets that are transmitted on the network, in bytes.
-default : depends on the adapter
-range : 512 ~ 65536 bytes
-to obtain : lsattr -E -l tr0, netstat -i
-to change : chdev -l interface -a mtu=<newvalue>, or SMIT
-NOTE : all the systems on the LAN must have the same MTU, so they must be changed simultaneously.
         The change is effective across boots.
-recommended : (default/max/optimal )
  Ethernet : 1500/1500/1500
  Token Ring : 1492/17284/4096
  FDDI  : 4352/4352/4352
  ATM   : 9180/65530/9180
  Gigabit Ethernet : 9000/NaN/9000
-FYI.
  S7A(12-way, 8GB) --> ATM MTU=9180
  H70( 8-way, 8GB) --> ATM MTU=9180, Jumbo Frame Gigabit MTU=9000
  S80(12-way, 16GB) --> Jumbo Frame Gigabit MTU=9000
  43P-260(2-way, 4GB) --> Jumbo Frame Gigabit MTU=9000
- tcp_mssdflt
-default maximum segment size used to communicate with remote networks.
-MTU of interface - TCP header size - IP header size - rfc1323 header size
which is : MTU - 20 - 20 - 12 or MTU - 52
Limiting data to MTU - 52 bytes ensures that, where possible, only full packets will be sent.
If set higher than the MTU of the adapter, IP or an intermediate router may fragment packets.
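
Two of the attributes above involve simple arithmetic: tcp_timewait is counted in 15-second units, and tcp_mssdflt is the MTU minus 52 bytes of headers. A short sketch with hypothetical function names, using the header sizes given above:

```python
TCP_HEADER_BYTES = 20
IP_HEADER_BYTES = 20
RFC1323_OPTION_BYTES = 12
TIMEWAIT_UNIT_SECONDS = 15


def default_mss(mtu):
    """MSS so that a full segment plus headers fits in one MTU."""
    return mtu - TCP_HEADER_BYTES - IP_HEADER_BYTES - RFC1323_OPTION_BYTES


def timewait_seconds(tcp_timewait):
    """Convert a tcp_timewait setting to seconds."""
    return tcp_timewait * TIMEWAIT_UNIT_SECONDS


for mtu in (1500, 4352, 9000, 9180):
    print(f"MTU {mtu:5d} -> tcp_mssdflt {default_mss(mtu)}")
for setting in (1, 5):
    print(f"tcp_timewait {setting} -> {timewait_seconds(setting)} seconds")
```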

Caution : to change the values, do the following.
# chdev -l env0 -a tcp_recvspace=65536 -a tcp_sendspace=65536 -a tcp_nodelay=1
# chdev -l env -a rfc1323=0

To verify : a value set with the no -o command applies only temporarily, whereas chdev changes the value in the ODM
so the change becomes permanent. (Well.. I would have to check to be sure.. ^^)

6.2 netstat command

6.2.1 netstat -l

ifconfig -a
------------------------------------------------------------------------------------
lo0: flags=e08084b<UP,BROADCAST,LOOPBACK,RUNNING,SIMPLEX,MULTICAST,GROUPRT,64BIT>
        inet 127.0.0.1 netmask 0xff000000 broadcast 127.255.255.255
        inet6 ::1/0
en0: flags=4e080863<UP,BROADCAST,NOTRAILERS,RUNNING,SIMPLEX,MULTICAST,GROUPRT,64BIT,PSEG>
        inet 210.126.140.22 netmask 0xffffffe0 broadcast 210.126.140.31
en1: flags=4e080863<UP,BROADCAST,NOTRAILERS,RUNNING,SIMPLEX,MULTICAST,GROUPRT,64BIT,PSEG>
        inet 172.16.1.21 netmask 0xffff0000 broadcast 172.16.255.255

netstat -i
------------------------------------------------------------------------------------
Name Mtu   Network     Address            Ipkts Ierrs    Opkts Oerrs Coll
lo0   16896 link#1                        160309285     0 160358915     0     0
lo0   16896 127         loopback          160309285     0 160358915     0     0
lo0   16896 ::1                           160309285     0 160358915     0     0
en0   1500 link#2      0.6.29.ac.29.47   53177310     0 154226192     0     0
en0   1500 210.126.140 KEPA1             53177310     0 154226192     0     0
en1   1500 link#3      0.4.ac.2a.14.ca   494582532     0 299190271     0     0
en1   1500 172.16      kepa1             494582532     0 299190271     0     0

netstat -I en0
------------------------------------------------------------------------------------
Name Mtu   Network     Address            Ipkts Ierrs    Opkts Oerrs Coll
en0   1500 link#2      0.6.29.ac.29.47   53176749     0 154225896     0     0
en0   1500 210.126.140 KEPA1             53176749     0 154225896     0     0

netstat -I en1
------------------------------------------------------------------------------------
Name Mtu   Network     Address            Ipkts Ierrs    Opkts Oerrs Coll
en1   1500 link#3      0.4.ac.2a.14.ca   494579059     0 299186341     0     0
en1   1500 172.16      kepa1             494579059     0 299186341     0     0

netstat -I en1 2
------------------------------------------------------------------------------------
    input   (en1)      output           input   (Total)    output
packets errs packets errs colls packets errs packets errs colls
494589307     0 299197799     0     0 708081120     0 613787971     0     0
     379     0      464     0     0      454     0      536     0     0
      39     0       41     0     0      263     0      262     0     0
      36     0       38     0     0       56     0       51     0     0

The value to watch is colls (number of collisions). Seeing some collisions is
normal, but an excessively high value means that the network is heavily loaded or that a bottleneck
is occurring.
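
To judge whether colls is "excessively high", it helps to relate it to the number of packets sent. The sketch below computes collisions as a percentage of output packets for each interval of the netstat -I en1 2 output. That ratio is an assumption of this sketch (the post gives no formula or threshold), and the function names are hypothetical.

```python
# netstat -I en1 2 output copied from above.
NETSTAT_OUTPUT = """\
    input   (en1)      output           input   (Total)    output
packets errs packets errs colls packets errs packets errs colls
494589307     0 299197799     0     0 708081120     0 613787971     0     0
     379     0      464     0     0      454     0      536     0     0
      39     0       41     0     0      263     0      262     0     0
      36     0       38     0     0       56     0       51     0     0
"""


def collision_percent(output_packets, collisions):
    """Collisions as a percentage of output packets."""
    if output_packets == 0:
        return 0.0
    return 100.0 * collisions / output_packets


def interval_collision_percent(netstat_text):
    """Per-interval collision percentage for the named interface."""
    rows = []
    for line in netstat_text.splitlines():
        fields = line.split()
        if len(fields) == 10 and all(f.isdigit() for f in fields):
            rows.append([int(f) for f in fields])
    # The first numeric row holds the cumulative totals, so skip it.
    # Columns 2 and 4 are the interface's output packets and colls.
    return [collision_percent(row[2], row[4]) for row in rows[1:]]


print(interval_collision_percent(NETSTAT_OUTPUT))
```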

6.2.2 netstat -m

AIX efficiently allocates and reuses pinned (physical) memory within the communication subsystem.
It does this with memory buffers called mbufs and clusters.

netstat -m shows how much of these buffers is in use.