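Monitoring serves two purposes. First, by looking at historical or current data you can see how a host has been running and how loaded it has been over a period of time. Second, when something goes wrong, it sends a timely notification so the people responsible can deal with it. This post is mainly about the second. In a Nagios configuration there are three main ways to wire up notifications for host and service states: directly via contacts or contact_groups; via a template reference to a define contact; and via a define host template reference.
This post is a follow-up to my earlier post on Nagios grouping, which ended by noting that Nagios configuration references are very flexible. Here I illustrate that flexibility using the ways notification contacts can be referenced.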
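I. Contact reference method one (via contacts or contact_groups)
First define who gets notified and how with define contact, then reference the contact in a host or service like this: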
define service{
    use                  windows-service                 ; service template to inherit from
    host_name            jjh
    service_description  PING
    check_command        check_ping!100.0,20%!500.0,60%
    contacts             admin1                          ; must already be defined
}
Note: use above refers to a template, which is what we usually keep in templates.cfg. contacts refers to what is defined in contacts.cfg.
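For completeness, here is a minimal sketch of what the contacts.cfg side of this method might look like. The alias and email address are placeholders (assumptions, not taken from the original setup); generic-contact is the contact template kept in templates.cfg, and the contact group is only needed if you prefer contact_groups over contacts.

```cfg
# contacts.cfg (sketch; alias and email are placeholder values)
define contact{
    contact_name    admin1
    use             generic-contact       ; inherit notification periods, options and commands
    alias           Admin One             ; placeholder
    email           admin1@example.com    ; placeholder address
}

# Optional: group contacts so hosts/services can reference contact_groups instead
define contactgroup{
    contactgroup_name   admins
    alias               Nagios Administrators
    members             admin1
}
```

With the group in place, a host or service can carry `contact_groups admins` instead of listing individual contacts.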
II. Contact reference method two (via a template reference to define contact)
1. Define the contact first
define contact{
    contact_name                   ZheJiang
    use                            generic-contact          ; the contact itself inherits from a template
    alias                          ZheJiang_Mobile
    service_notification_commands  notify-service-by-email,notify-service-by-sms
    email                          abc@361way.com,def@361way.com
    pager                          "1366XXXXXXX,13819XXXXXX"
}
2. Reference it with use
define service{
    use                  ZheJiang                ; reference the contact
    host_name            ZJ-ZJ-App
    service_description  CPU Load
    low_flap_threshold   0
    high_flap_threshold  0.999
    check_command        check_nrpe!check_load
}
define service{
    use                  ZheJiang                ; reference the contact
    host_name            ZJ-ZJ-App
    service_description  Check_Disk
    check_command        check_nrpe!check_disk
}
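Note: here the contact definition is pulled in directly with use. use works much like include in programming: it takes something defined earlier and applies it as-is. The contact defined above in turn uses a definition from templates.cfg, which typically sets the notification trigger conditions, time periods and so on. One caveat: as far as I know, Nagios resolves a use target only among templates of the same object type, so a service can only inherit from something named ZheJiang if a service template with that name exists. Run nagios -v against your configuration to confirm it loads.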
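A variant of this method that stays within a single object type is to wrap the contact in a small service template and let the services inherit from that. The sketch below assumes the ZheJiang contact defined in step 1; the template name ZheJiang-service is made up for illustration and does not come from the original configuration.

```cfg
# Service template that carries the contact (template name is hypothetical)
define service{
    name        ZheJiang-service
    use         generic-service     ; defaults from templates.cfg
    contacts    ZheJiang            ; the contact defined in contacts.cfg
    register    0                   ; template only, not a real service
}

define service{
    use                  ZheJiang-service
    host_name            ZJ-ZJ-App
    service_description  Check_Disk
    check_command        check_nrpe!check_disk
}
```

If generic-service also sets contact_groups, the service inherits that as well, so both that group and the ZheJiang contact would be notified.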
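III. Contact reference method three (via a define host template reference)
This method is essentially method two turned around: define the contacts first, then reference them from templates.cfg through contacts or contact_groups, and have host-xxxx.cfg use the templates from templates.cfg. Since the contact definitions in contacts.cfg were already covered in method two, they are omitted here. Below are a few typical definitions from templates.cfg: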
# Contact template
define contact{
    name                           generic-contact          ; The name of this contact template
    service_notification_period    24x7                     ; service notifications can be sent anytime
    host_notification_period       24x7                     ; host notifications can be sent anytime
    service_notification_options   w,u,c,r,f,s              ; notification triggers (same below)
    host_notification_options      d,u,r,f,s                ; send notifications for all host states, flapping events, and scheduled downtime events
    service_notification_commands  notify-service-by-email  ; send service notifications via email
    host_notification_commands     notify-host-by-email     ; send host notifications via email
    register                       0                        ; DONT REGISTER THIS DEFINITION - ITS NOT A REAL CONTACT, JUST A TEMPLATE!
}

# Host templates
define host{
    name                           generic-host  ; The name of this host template
    notifications_enabled          1             ; Host notifications are enabled
    event_handler_enabled          1             ; Host event handler is enabled
    flap_detection_enabled         1             ; Flap detection is enabled
    failure_prediction_enabled     1             ; Failure prediction is enabled
    process_perf_data              1             ; Process performance data
    retain_status_information      1             ; Retain status information across program restarts
    retain_nonstatus_information   1             ; Retain non-status information across program restarts
    notification_period            24x7          ; Send host notifications at any time
    register                       0             ; DONT REGISTER THIS DEFINITION - ITS NOT A REAL HOST, JUST A TEMPLATE!
}
define host{
    name                           JJH-server        ; The name of this host template
    use                            generic-host      ; This template inherits other values from the generic-host template
    check_period                   24x7              ; By default, Linux hosts are checked round the clock
    check_interval                 5                 ; Actively check the host every 5 minutes
    retry_interval                 1                 ; Schedule host check retries at 1 minute intervals
    max_check_attempts             10                ; Check each Linux host 10 times (max)
    check_command                  check-host-alive  ; Default command to check Linux hosts
    notification_period            workhours
    notification_interval          120               ; Resend notifications every 2 hours
    notification_options           d,u,r             ; Only send notifications for specific host states
    contact_groups                 admins-jjh        ; references a contact group
    hostgroups                     JJH-servers
    register                       0                 ; DONT REGISTER THIS DEFINITION - ITS NOT A REAL HOST, JUST A TEMPLATE!
}

# Service templates
define service{
    name                           generic-service  ; The 'name' of this service template
    active_checks_enabled          1                ; Active service checks are enabled
    passive_checks_enabled         1                ; Passive service checks are enabled/accepted
    parallelize_check              1                ; Active service checks should be parallelized (disabling this can lead to major performance problems)
    obsess_over_service            1                ; We should obsess over this service (if necessary)
    check_freshness                0                ; Default is to NOT check service 'freshness'
    notifications_enabled          1                ; Service notifications are enabled
    event_handler_enabled          1                ; Service event handler is enabled
    flap_detection_enabled         1                ; Flap detection is enabled
    failure_prediction_enabled     1                ; Failure prediction is enabled
    process_perf_data              1                ; Process performance data
    retain_status_information      1                ; Retain status information across program restarts
    retain_nonstatus_information   1                ; Retain non-status information across program restarts
    is_volatile                    0                ; The service is not volatile
    check_period                   24x7             ; The service can be checked at any time of the day
    max_check_attempts             3                ; Re-check the service up to 3 times in order to determine its final (hard) state
    normal_check_interval          10               ; Check the service every 10 minutes under normal conditions
    retry_check_interval           2                ; Re-check the service every two minutes until a hard state can be determined
    contact_groups                 admins           ; Notifications get sent out to everyone in the 'admins' group
    notification_options           w,u,c,r          ; Send notifications about warning, unknown, critical, and recovery events
    notification_interval          60               ; Re-notify about service problems every hour
    notification_period            24x7             ; Notifications can be sent out at any time
    register                       0                ; DONT REGISTER THIS DEFINITION - ITS NOT A REAL SERVICE, JUST A TEMPLATE!
}

# Turn notifications off by setting notifications_enabled to 0
define service{
    name                           no-notice-service  ; The name of this service template
    use                            generic-service    ; Inherit default values from the generic-service definition
    max_check_attempts             4                  ; Re-check the service up to 4 times in order to determine its final (hard) state
    normal_check_interval          5                  ; Check the service every 5 minutes under normal conditions
    notifications_enabled          0                  ; Service notifications are disabled
    event_handler_enabled          0
    retry_check_interval           1                  ; Re-check the service every minute until a hard state can be determined
    register                       0                  ; DONT REGISTER THIS DEFINITION - ITS NOT A REAL SERVICE, JUST A TEMPLATE!
}

# The service templates below specify a notification (contact) group
define service{
    name                           windows-service  ; The name of this service template
    use                            generic-service  ; Inherit default values from the generic-service definition
    max_check_attempts             4                ; Re-check the service up to 4 times in order to determine its final (hard) state
    normal_check_interval          5                ; Check the service every 5 minutes under normal conditions
    retry_check_interval           1                ; Re-check the service every minute until a hard state can be determined
    register                       0                ; DONT REGISTER THIS DEFINITION - ITS NOT A REAL SERVICE, JUST A TEMPLATE!
    contact_groups                 admins-win       ; references a contact group
}
define service{
    name                           JJH-service      ; The name of this service template
    use                            generic-service  ; Inherit default values from the generic-service definition
    max_check_attempts             4                ; Re-check the service up to 4 times in order to determine its final (hard) state
    normal_check_interval          5                ; Check the service every 5 minutes under normal conditions
    retry_check_interval           1                ; Re-check the service every minute until a hard state can be determined
    register                       0                ; DONT REGISTER THIS DEFINITION - ITS NOT A REAL SERVICE, JUST A TEMPLATE!
    contact_groups                 admins-jjh       ; references a contact group
}
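Note: as the templates above show, they can be written very flexibly. A template can set whether notifications are sent at all, which contact group is notified, how often, under which conditions, and so on. The point of writing templates is to make later use references in xxxhost.cfg convenient and to cut down on how much has to be written there. In that sense a template is similar to a variable in programming.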
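The JJH-server template sets notification_period to workhours, and that name has to exist as a timeperiod object (usually kept in timeperiods.cfg). A minimal sketch of such a definition follows; the weekdays and hours are assumed values for illustration, not the schedule used in the original setup.

```cfg
# timeperiods.cfg (sketch; the hours are assumed, adjust to your own schedule)
define timeperiod{
    timeperiod_name workhours
    alias           Normal Work Hours
    monday          09:00-17:00
    tuesday         09:00-17:00
    wednesday       09:00-17:00
    thursday        09:00-17:00
    friday          09:00-17:00
}
```

Days that are not listed are simply outside the period, so with this sketch no host notifications from JJH-server hosts would go out on weekends.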
In a host file such as 361way.cfg, the templates are then referenced like this:
# Template references
define host{
    use                    JJH-server           ; use the host template
    host_name              jjh-cc
    parents                aliyun
    statusmap_image        linux40.gd2
    alias                  jjh-cc
    address                115.29.161.54
    notification_interval  0
    process_perf_data      1
    action_url             /pnp4nagios/graph?host=$HOSTNAME$
}
define service{
    use                    JJH-service,srv-pnp  ; Name of service template to use
    host_name              jjh-cc
    service_description    PING
    check_command          check_ping!100.0,20%!500.0,60%
}
define service{
    use                    JJH-service,srv-pnp
    host_name              jjh-cc
    service_description    check_cpu
    check_command          check_nrpe!check_cpu
}
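To make the inheritance concrete, here is roughly what the PING service resolves to once JJH-service and generic-service are merged in. This is a hand-expanded sketch, not Nagios output: srv-pnp is not shown in this post, so whatever it contributes is left out, and only the check- and notification-related values are listed.

```cfg
# Effective PING service after template resolution (hand-expanded sketch)
define service{
    host_name               jjh-cc
    service_description     PING
    check_command           check_ping!100.0,20%!500.0,60%
    max_check_attempts      4           ; from JJH-service (overrides generic-service's 3)
    normal_check_interval   5           ; from JJH-service (overrides 10)
    retry_check_interval    1           ; from JJH-service (overrides 2)
    contact_groups          admins-jjh  ; from JJH-service (replaces generic-service's admins)
    notification_options    w,u,c,r     ; from generic-service
    notification_interval   60          ; from generic-service
    notification_period     24x7        ; from generic-service
    notifications_enabled   1           ; from generic-service
}
```

Values set in the object itself win over its templates, and when several templates are listed in use, the one listed first takes precedence.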
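IV. Summary
The examples above try to make clear how flexibly contacts.cfg, templates.cfg and XXXhost.cfg can reference one another in Nagios. One file was left out: timeperiods.cfg, which defines time ranges for notifications (working hours versus off hours, China time versus US time, and so on). If the configuration above, or the three methods I described, only get more confusing the longer you look at them, the following summary points may help.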
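1. Start from the most naive approach: when you define a check in hostxxx.cfg, you can put notification_options, notification_period, notification_interval, contact_groups and similar directives directly into the definition. That alone is enough to get the notifications you need.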
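2. To simplify the naive approach, you bundle those directives into a "variable", give it a name, define it in templates.cfg, and then call it from hostxxx.cfg with use plus the name defined in templates.cfg. Now all of the directives above live in the template and can be left out of the host file.
3. When there are many contacts, and different applications and hosts must notify different people, you add another file, contacts.cfg, which defines and groups the people to be notified. Whether contacts use templates, or templates point at contacts from contacts.cfg, the end result is simply a consolidated configuration for hostxxx.cfg to use.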
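4. How many configuration files there are and what they are called does not matter. If you like, you can keep everything in a single file. Separate file names only serve to keep things distinguishable, easy to find, and easier to work with. As long as the files are included from nagios.cfg (through cfg_file or cfg_dir), Nagios handles them just fine.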
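5. You can think of define as declaring a variable, and of use as referencing that variable or including a configuration file. contacts and contact_groups are Nagios directives, which you can regard as built-in functions.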
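As a sketch of point 4, the object files are pulled in from nagios.cfg with cfg_file or cfg_dir lines. The paths below assume a default source install under /usr/local/nagios and are not taken from the original setup; adjust them to wherever your files actually live.

```cfg
# nagios.cfg (sketch; paths assume a default source install)
cfg_file=/usr/local/nagios/etc/objects/templates.cfg
cfg_file=/usr/local/nagios/etc/objects/contacts.cfg
cfg_file=/usr/local/nagios/etc/objects/timeperiods.cfg
cfg_file=/usr/local/nagios/etc/objects/361way.cfg

# Or load every .cfg file in a directory instead of listing files one by one
#cfg_dir=/usr/local/nagios/etc/servers
```

After editing, running nagios -v against nagios.cfg checks that every use, contacts and contact_groups reference resolves before you reload the daemon.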
Reference: the Nagios online manual