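InfiniBand Network Monitoring
Application Overview
A project for monitoring an InfiniBand network.
The main IB commands covered are:
Identify all switches in the fabric
Identify all HCAs in the fabric
Display the InfiniBand topology
Display a route through the fabric
Display the link status of a node
Display the counters for a node
Display the data counters for a node
Display low-level detailed information for a node
Display low-level detailed information for a port
Map LIDs to GUIDs
Perform comprehensive diagnostics for the entire fabric
Perform comprehensive diagnostics for a route
Determine changes to the InfiniBand topology
Determine which links are experiencing significant errors
Check all ports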
15.1.1 Identify All Switches in the Fabric
You can use the ibswitches command to identify the Sun Network QDR InfiniBand Gateway Switches in the InfiniBand fabric of your Exalogic machine. For each switch, this command displays the Globally Unique Identifier (GUID), name, Local Identifier (LID), and LID mask control (LMC). The output of the command is a mapping of GUID to LID for the switches in the fabric.
On any command-line interface (CLI), run the following command:
# ibswitches
The output is displayed, as in the following example:
Switch : 0x0021283a8389a0a0 ports 36 "Sun DCS 36 QDR switch localhost" enhanced port 0 lid 15 lmc 0
Note:
The actual output for your InfiniBand fabric will differ from that in the example.
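If you want to reuse the GUID-to-LID mapping in a script, you can parse captured ibswitches output (for example, saved with `ibswitches > switches.txt`). The following Python sketch assumes the line layout shown in the example above. The names `parse_ibswitches` and `guid_to_lid` are hypothetical, and other versions of the tool may format the line differently.

```python
import re

# Matches ibswitches lines in the layout shown in the example, e.g.
# Switch : 0x0021283a8389a0a0 ports 36 "Sun DCS 36 QDR switch localhost" enhanced port 0 lid 15 lmc 0
SWITCH_RE = re.compile(
    r'^Switch\s*:\s*(0x[0-9a-fA-F]+)\s+ports\s+(\d+)\s+"([^"]*)".*?\blid\s+(\d+)\s+lmc\s+(\d+)'
)


def parse_ibswitches(text):
    """Return {guid: {"name", "ports", "lid", "lmc"}} for every switch line in text."""
    switches = {}
    for line in text.splitlines():
        m = SWITCH_RE.match(line.strip())
        if m:  # lines that do not match the assumed layout are ignored
            guid, ports, name, lid, lmc = m.groups()
            switches[guid] = {"name": name, "ports": int(ports), "lid": int(lid), "lmc": int(lmc)}
    return switches


def guid_to_lid(text):
    """Reduce the parsed output to the plain GUID -> LID mapping."""
    return {guid: info["lid"] for guid, info in parse_ibswitches(text).items()}
```

Passing the example line above to `guid_to_lid()` returns `{'0x0021283a8389a0a0': 15}`.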
15.1.2 Identify All HCAs in the Fabric
You can use the ibhosts command to display identity information about the host channel adapters (HCAs) in the InfiniBand fabric of a subnet. This command displays the GUID and name of each HCA.
On the command-line interface (CLI), run the following command:
# ibhosts
The output is displayed, as in the following example:
Ca : 0x0003ba000100e388 ports 2 "nsn33-43 HCA-1"
Ca : 0x5080020000911310 ports 1 "nsn32-20 HCA-1"
Ca : 0x50800200008e532c ports 1 "ib-71 HCA-1"
Ca : 0x50800200008e5328 ports 1 "ib-70 HCA-1"
Ca : 0x50800200008296a4 ports 2 "ib-90 HCA-1"
.
.
.
#
Note:
The output in the example is just a portion of the full output and varies for each InfiniBand topology.
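A similar sketch can turn ibhosts output into a list that a script can search. It assumes the `Ca : <guid> ports <n> "<description>"` layout from the example. As in the sample descriptions, it also assumes that the first word of the description is the host name. `parse_ibhosts` and `find_host` are hypothetical helpers.

```python
import re

# Matches ibhosts lines such as: Ca : 0x0003ba000100e388 ports 2 "nsn33-43 HCA-1"
CA_RE = re.compile(r'^Ca\s*:\s*(0x[0-9a-fA-F]+)\s+ports\s+(\d+)\s+"([^"]*)"')


def parse_ibhosts(text):
    """Return (guid, port_count, description) tuples; '.' and '#' lines are skipped."""
    hosts = []
    for line in text.splitlines():
        m = CA_RE.match(line.strip())
        if m:
            hosts.append((m.group(1), int(m.group(2)), m.group(3)))
    return hosts


def find_host(hosts, hostname):
    """Return the entries whose description starts with exactly this host name."""
    return [h for h in hosts if h[2].partition(" ")[0] == hostname]
```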
15.1.3 Display the InfiniBand Fabric Topology
To understand the routing that happens within your InfiniBand fabric, use the ibnetdiscover command to display the node-to-node connectivity. The length of the output depends on the size of your fabric. You can also use this command to display the LIDs of HCAs.
On the command-line interface (CLI), enter the following command:
# ibnetdiscover
The output is displayed, as in the following example:
# Topology file: generated on Sat Apr 13 22:28:55 2002
#
# Max of 1 hops discovered
# Initiated from node 0021283a8389a0a0 port 0021283a8389a0a0
vendid=0x2c9
devid=0xbd36
sysimgguid=0x21283a8389a0a3
switchguid=0x21283a8389a0a0(21283a8389a0a0)
Switch 36 "S-0021283a8389a0a0" # "Sun DCS 36 QDR switch localhost" enhanced port 0 lid 15 lmc 0
[23] "H-0003ba000100e388"[2](3ba000100e38a) # "nsn33-43 HCA-1" lid 14 4xQDR
vendid=0x2c9
devid=0x673c
sysimgguid=0x3ba000100e38b
caguid=0x3ba000100e388
Ca 2 "H-0003ba000100e388" # "nsn33-43 HCA-1"
[2](3ba000100e38a) "S-0021283a8389a0a0"[23] # lid 14 lmc 0 "Sun DCS 36 QDR switch localhost" lid 15 4xQDR
Note:
The actual output for your InfiniBand fabric will differ from that in the example.
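Because the ibnetdiscover output grows with the fabric, it can help to extract just the cabled switch ports. The Python sketch below is based only on the two node blocks in the example. It tracks the current `Switch` header and parses the port lines beneath it. It ignores `Ca` blocks, whose port lines use a different layout. The regular expressions and the `switch_links` name are assumptions, not a documented format specification.

```python
import re

# Node header lines, e.g.  Switch 36 "S-0021283a8389a0a0" # ...   or   Ca 2 "H-0003ba000100e388" # ...
NODE_RE = re.compile(r'^(Switch|Ca)\s+(\d+)\s+"([SH]-[0-9a-fA-F]+)"')
# Port lines inside a Switch block, e.g.
# [23] "H-0003ba000100e388"[2](3ba000100e38a) # "nsn33-43 HCA-1" lid 14 4xQDR
LINK_RE = re.compile(
    r'^\[(\d+)\]\s+"([SH]-[0-9a-fA-F]+)"\[(\d+)\](?:\([0-9a-fA-F]+\))?'
    r'\s+#\s+"([^"]*)"\s+lid\s+(\d+)\s+(\S+)'
)


def switch_links(text):
    """Return one dict per cabled switch port found in ibnetdiscover output."""
    links = []
    current_switch = None
    for line in text.splitlines():
        line = line.strip()
        node = NODE_RE.match(line)
        if node:
            # Only port lines under a Switch header are collected.
            current_switch = node.group(3) if node.group(1) == "Switch" else None
            continue
        m = LINK_RE.match(line)
        if m and current_switch:
            links.append({
                "switch": current_switch,
                "port": int(m.group(1)),
                "peer": m.group(2),
                "peer_port": int(m.group(3)),
                "peer_desc": m.group(4),
                "peer_lid": int(m.group(5)),
                "rate": m.group(6),  # width and speed, e.g. 4xQDR
            })
    return links
```

For the example output, `switch_links()` returns a single entry. It links switch port 23 to port 2 of the nsn33-43 HCA at LID 14, running at 4xQDR. Filtering on the `rate` field is one way to spot links that came up at a lower width or speed than expected.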
15.1.4 Display a Route Through the Fabric
You sometimes need to know the route between two nodes in the InfiniBand fabric. The ibtracert command provides that information by displaying the GUIDs, ports, and LIDs of the nodes along the route.
On the command-line interface (CLI), run the following command:
# ibtracert slid dlid
where slid is the LID of the source node and dlid is the LID of the destination node in the fabric.
The output is displayed, as in the following example:
# ibtracert 15 14
#
From switch {0x0021283a8389a0a0} portnum 0 lid 15-15 "Sun DCS 36 QDR switch localhost"
[23] -> ca port {0x0003ba000100e38a}[2] lid 14-14 "nsn33-43 HCA-1"
To ca {0x0003ba000100e388} portnum 2 lid 14-14 "nsn33-43 HCA-1"
#
For this example:
The route starts at the switch with GUID 0x0021283a8389a0a0, using port 0. The switch has LID 15, and its node description is Sun DCS 36 QDR switch localhost. The route leaves the switch through port 23 and enters port 2 of the HCA, whose port GUID is 0x0003ba000100e38a (node GUID 0x0003ba000100e388). The HCA has LID 14.
Note:
The actual output for your InfiniBand fabric will differ from that in the example.
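To list the hops of a route from a script, you can parse the hop lines of captured ibtracert output. The sketch below assumes that every hop of a longer route is printed in the same `[port] -> <type> port {guid}[port] lid x-y "description"` layout as the single hop in the example. `parse_ibtracert` is a hypothetical name.

```python
import re

# Hop line, e.g.: [23] -> ca port {0x0003ba000100e38a}[2] lid 14-14 "nsn33-43 HCA-1"
HOP_RE = re.compile(
    r'^\[(\d+)\]\s+->\s+(\w+)\s+port\s+\{(0x[0-9a-fA-F]+)\}\[(\d+)\]'
    r'\s+lid\s+(\d+)-(\d+)\s+"([^"]*)"'
)


def parse_ibtracert(text):
    """Return one dict per hop; the 'From' and 'To' summary lines are skipped."""
    hops = []
    for line in text.splitlines():
        m = HOP_RE.match(line.strip())
        if m:
            hops.append({
                "out_port": int(m.group(1)),   # port the route leaves through
                "node_type": m.group(2),       # e.g. "ca" or "switch"
                "port_guid": m.group(3),       # GUID of the port entered on the next node
                "in_port": int(m.group(4)),    # port number entered on the next node
                "lid": int(m.group(5)),        # first LID of the printed LID range
                "lid_range": (int(m.group(5)), int(m.group(6))),
                "desc": m.group(7),
            })
    return hops
```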
15.1.5 Display the Link Status of a Node
If you want to know the link status of a node in the InfiniBand fabric, run the ibportstate command to display the state, width, and speed of that node.
On the command-line interface (CLI), run the following command:
# ibportstate lid port
where lid is the LID of the node in the fabric and port is the port of the node.
The output is displayed, as in the following example:
# ibportstate 15 23
PortInfo:
# Port info: Lid 15 port 23
LinkState:.......................Active
PhysLinkState:...................LinkUp
LinkWidthSupported:..............1X or 4X
LinkWidthEnabled:................1X or 4X
LinkWidthActive:.................4X
LinkSpeedSupported:..............2.5 Gbps or 5.0 Gbps or 10.0 Gbps
LinkSpeedEnabled:................2.5 Gbps or 5.0 Gbps or 10.0 Gbps
LinkSpeedActive:.................10.0 Gbps
Peer PortInfo:
# Port info: Lid 15 DR path slid 15; dlid 65535; 0,23
LinkState:.......................Active
PhysLinkState:...................LinkUp
LinkWidthSupported:..............1X or 4X
LinkWidthEnabled:................1X or 4X
LinkWidthActive:.................4X
LinkSpeedSupported:..............2.5 Gbps or 5.0 Gbps or 10.0 Gbps
LinkSpeedEnabled:................2.5 Gbps or 5.0 Gbps or 10.0 Gbps
LinkSpeedActive:.................10.0 Gbps
#
Note:
The actual output for your InfiniBand fabric will differ from that in the example.
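To check many ports from a script, you can parse the ibportstate output into the local (`PortInfo`) and remote (`Peer PortInfo`) sections and test the fields shown in the example. In this sketch, `expected_width="4X"` is an assumption taken from the example. Set it to whatever width your links should negotiate. A link that reports Active but a narrower LinkWidthActive than expected is worth investigating. The function names are hypothetical.

```python
import re

FIELD_RE = re.compile(r'^(\w+):\.+(.*)$')


def parse_ibportstate(text):
    """Split ibportstate output into {"local": {...}, "peer": {...}} field dictionaries."""
    sections = {"local": {}, "peer": {}}
    current = None
    for line in text.splitlines():
        line = line.strip()
        if line == "PortInfo:":
            current = "local"
        elif line == "Peer PortInfo:":
            current = "peer"
        elif current:
            m = FIELD_RE.match(line)  # '# Port info' comment lines do not match
            if m:
                sections[current][m.group(1)] = m.group(2).strip()
    return sections


def link_problems(sections, expected_width="4X"):
    """List problems on either end of the link; an empty list means none were found."""
    problems = []
    for side, fields in sections.items():
        state = fields.get("LinkState")
        if state != "Active":
            problems.append(f"{side}: LinkState is {state}")
        phys = fields.get("PhysLinkState")
        if phys != "LinkUp":
            problems.append(f"{side}: PhysLinkState is {phys}")
        width = fields.get("LinkWidthActive")
        if width != expected_width:
            problems.append(f"{side}: LinkWidthActive is {width}, expected {expected_width}")
    return problems
```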
15.1.6 Display Counters for a Node
To help ascertain the health of a node in the fabric, use the perfquery command to display the performance, error, and data counters for that node.
On the command-line interface (CLI), enter the following command:
# perfquery lid port
where lid is the LID of the node in the fabric, and port is the port of the node.
Note:
If a port value of 255 is specified for a switch node, the counters are the total for all switch ports.
For example:
# perfquery 15 23
#
# Port counters: Lid 15 port 23
PortSelect:......................23
CounterSelect:...................0x1b01
SymbolErrors:....................0
.
.
.
VL15Dropped:.....................0
XmtData:.........................20232
RcvData:.........................20232
XmtPkts:.........................281
RcvPkts:.........................281
Note:
The output in the example is just a portion of the full output.
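The following sketch parses perfquery output and reports error counters that are not zero. Counter names differ between perfquery versions; the example uses short names such as SymbolErrors and VL15Dropped. For that reason, the sketch treats any counter whose name contains one of a few fragments as an error counter. That fragment list is a heuristic assumption, and the function names are hypothetical.

```python
import re

FIELD_RE = re.compile(r'^(\w+):\.+\s*(\S+)')
# Heuristic: counters whose names contain one of these fragments are treated as error counters.
ERROR_HINTS = ("Err", "Drop", "Discard", "Recover", "Downed")


def parse_counters(text):
    """Return {counter_name: int}; hexadecimal values such as 0x1b01 are decoded too."""
    counters = {}
    for line in text.splitlines():
        m = FIELD_RE.match(line.strip())
        if not m:
            continue  # skips '#' comment lines and the '.' elision markers
        raw = m.group(2)
        try:
            counters[m.group(1)] = int(raw, 16) if raw.lower().startswith("0x") else int(raw)
        except ValueError:
            pass  # ignore non-numeric fields
    return counters


def nonzero_errors(counters):
    """Return only the error counters that are greater than zero."""
    return {name: value for name, value in counters.items()
            if value > 0 and any(hint in name for hint in ERROR_HINTS)}
```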
15.1.7 Display Data Counters for a Node
To list the data counters for a node in the fabric, use the ibdatacounts command.
On the command-line interface (CLI), enter the following command:
# ibdatacounts lid port
where lid is the LID of the node in the fabric, and port is the port of the node.
For example:
# ibdatacounts 15 23
#
XmtData:.........................6048
RcvData:.........................6048
XmtPkts:.........................84
RcvPkts:.........................84
Note:
The actual output for your InfiniBand fabric will differ from that in the example.
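The raw data counters are easier to interpret as bytes. This sketch assumes that XmtData and RcvData follow the InfiniBand convention of counting in units of 4 octets (32-bit words). The helper names are hypothetical. To estimate throughput, take two samples a known interval apart and pass both to `xmt_rate()`.

```python
import re

FIELD_RE = re.compile(r'^(XmtData|RcvData|XmtPkts|RcvPkts):\.+\s*(\d+)')
# Assumption: the data counters use the InfiniBand convention of 4-octet units.
OCTETS_PER_UNIT = 4


def parse_datacounts(text):
    """Return the four data counters as integers."""
    counts = {}
    for line in text.splitlines():
        m = FIELD_RE.match(line.strip())
        if m:
            counts[m.group(1)] = int(m.group(2))
    return counts


def summarize(counts):
    """Convert data counters to bytes and compute the average transmitted packet size."""
    xmt_bytes = counts["XmtData"] * OCTETS_PER_UNIT
    rcv_bytes = counts["RcvData"] * OCTETS_PER_UNIT
    pkts = counts["XmtPkts"]
    return {
        "xmt_bytes": xmt_bytes,
        "rcv_bytes": rcv_bytes,
        "avg_xmt_packet_bytes": xmt_bytes / pkts if pkts else 0.0,
    }


def xmt_rate(before, after, interval_s):
    """Average transmit rate in bytes per second between two samples interval_s apart."""
    return (after["XmtData"] - before["XmtData"]) * OCTETS_PER_UNIT / interval_s
```

Under that assumption, the example above gives 24192 bytes in 84 packets, which is an average of 288 bytes per transmitted packet. The perfquery example in the previous section works out to the same average (20232 × 4 / 281 = 288).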
15.1.8 Display Low-Level Detailed Information for a Node
If intensive troubleshooting is necessary to resolve a problem, you can use the smpquery command to display very detailed information about a node in the fabric.
On the command-line interface (CLI), enter the following command:
# smpquery switchinfo lid
where lid is the LID of the node in the fabric.
For example:
# smpquery switchinfo 15
#
# Switch info: Lid 15
LinearFdbCap:....................49152
RandomFdbCap:....................0
McastFdbCap:.....................4096
LinearFdbTop:....................16
DefPort:.........................0
DefMcastPrimPort:................255
DefMcastNotPrimPort:.............255
LifeTime:........................18
StateChange:.....................0
LidsPerPort:.....................0
PartEnforceCap:..................32
InboundPartEnf:..................1
OutboundPartEnf:.................1
FilterRawInbound:................1
FilterRawOutbound:...............1
EnhancedPort0:...................1
#
Note:
The actual output for your InfiniBand fabric will differ from that in the example.
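Because smpquery switchinfo returns a fixed set of fields, you can save its output periodically and compare snapshots to notice configuration changes on a switch. This is a minimal sketch of that idea. `parse_switchinfo` and `changed_fields` are hypothetical names. Numeric values are converted to integers so that `0x` and decimal fields compare consistently.

```python
import re

FIELD_RE = re.compile(r'^(\w+):\.+\s*(\S+)')


def parse_switchinfo(text):
    """Return {field: value} for every 'Name:.....value' line of smpquery switchinfo output."""
    fields = {}
    for line in text.splitlines():
        m = FIELD_RE.match(line.strip())
        if m:
            raw = m.group(2)
            try:
                fields[m.group(1)] = int(raw, 16) if raw.lower().startswith("0x") else int(raw)
            except ValueError:
                fields[m.group(1)] = raw  # keep non-numeric values as text
    return fields


def changed_fields(before, after):
    """Return {field: (old, new)} for every field whose value differs between two snapshots."""
    names = sorted(before.keys() | after.keys())
    return {n: (before.get(n), after.get(n)) for n in names if before.get(n) != after.get(n)}
```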
15.1.9 Display Low-Level Detailed Information for a Port
If intensive troubleshooting is necessary to resolve a problem, you can use the smpquery command to display very detailed information about a port.
On the command-line interface (CLI), enter the following command:
# smpquery portinfo lid port
where lid is the LID of the node in the fabric, and port is the port of the node.
For example:
# smpquery portinfo 15 23
#
Mkey:............................0x0000000000000000
GidPrefix:.......................0x0000000000000000
Lid:.............................0x0000
SMLid:...........................0x0000
CapMask:.........................0x0
DiagCode:........................0x0000
MkeyLeasePeriod:.................0
LocalPort:.......................0
LinkWidthEnabled:................1X or 4X
LinkWidthSupported:..............1X or 4X
LinkWidthActive:.................4X
LinkSpeedSupported:..............2.5 Gbps or 5.0 Gbps or 10.0 Gbps
LinkState:.......................Active
PhysLinkState:...................LinkUp
LinkDownDefState:................Polling
ProtectBits:.....................0
LMC:.............................0
.
.
.
SubnetTimeout:...................0
RespTimeVal:.....................0
LocalPhysErr:....................8
OverrunErr:......................8
MaxCreditHint:...................85
RoundTrip:.......................16777215
#
Note:
The actual output for your InfiniBand fabric will differ from that in the example, and it is just a portion of the full output.
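The portinfo output mixes hexadecimal values (Mkey, Lid, SMLid), decimal values, and text such as `1X or 4X`. The sketch below parses all three into Python values and builds a one-line summary. It assumes the `Name:.....value` layout from the example, and the function names are hypothetical.

```python
import re

FIELD_RE = re.compile(r'^(\w+):\.+\s*(.*)$')
HEX_RE = re.compile(r'^0x[0-9a-fA-F]+$')


def parse_portinfo(text):
    """Parse 'Name:.....value' lines; hex and decimal values become ints, others stay strings."""
    info = {}
    for line in text.splitlines():
        m = FIELD_RE.match(line.strip())
        if not m:
            continue  # skips '#' lines and '.' elision markers
        name, value = m.group(1), m.group(2).strip()
        if HEX_RE.match(value):
            info[name] = int(value, 16)
        elif value.isdigit():
            info[name] = int(value)
        else:
            info[name] = value
    return info


def describe(info):
    """One-line summary of the fields most often checked when troubleshooting a port."""
    return (f"LID {info.get('Lid')} (SM LID {info.get('SMLid')}), "
            f"{info.get('LinkState')}/{info.get('PhysLinkState')}, "
            f"width {info.get('LinkWidthActive')}")
```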
15.1.10 Map LIDs to GUIDs
In the InfiniBand fabric of an Exalogic machine, the Subnet Manager assigns subnet-specific LIDs to the nodes in the fabric. When you use the InfiniBand commands, you often must provide a LID to direct a command to a particular InfiniBand device.
Likewise, the output of a command might identify InfiniBand devices by their LIDs. You can create a file that maps node LIDs to node GUIDs, which can help with administering your InfiniBand fabric.
Note:
Creation of the mapping file is not a requirement for InfiniBand administration.
The following procedure creates a file that lists the LID in hexadecimal, the GUID in hexadecimal, and the node description:
- Create an inventory file:
# osmtest -f c -i inventory.txt
The inventory.txt file can also be used for purposes other than this procedure.
- Create a mapping file:
# cat inventory.txt |grep -e '^lid' -e 'port_guid' -e 'desc' |sed 's/^lid/\nlid/'> mapping.txt
- Edit the latter half of the mapping.txt file to remove the nonessential information. The content of the mapping.txt file looks similar to the following:
lid 0x14
port_guid 0x0021283a8620b0a0
# node_desc Sun DCS 72 QDR switch 1.2(LC)

lid 0x15
port_guid 0x0021283a8620b0b0
# node_desc Sun DCS 72 QDR switch 1.2(LC)

lid 0x16
port_guid 0x0021283a8620b0c0
# node_desc Sun DCS 72 QDR switch 1.2(LC)
Note:
The output in the example is just a portion of the entire file.
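Once mapping.txt exists, a script can look up the GUID for a LID that a command printed. The sketch below reads records of the form `lid <hex> port_guid <hex> # node_desc <text>`. It accepts a record spread over several lines or run together on one line. The regular expression and the function names are assumptions based on the example content, not on a documented file format.

```python
import re

# One record: lid <hex> port_guid <hex> # node_desc <text>, on one line or spread over several.
RECORD_RE = re.compile(
    r'\blid\s+(0x[0-9a-fA-F]+)\s+port_guid\s+(0x[0-9a-fA-F]+)\s+#\s*node_desc\s+'
    r'(.*?)(?=\s+lid\s+0x|\Z)',
    re.S,
)


def load_mapping(text):
    """Return {lid_as_int: (port_guid, node_desc)} from the contents of mapping.txt."""
    return {int(lid, 16): (guid, desc.strip()) for lid, guid, desc in RECORD_RE.findall(text)}


def lid_to_guid(mapping, lid):
    """Look up a LID given in decimal (as in the ibswitches output) or as a 0x-prefixed string."""
    if isinstance(lid, str) and lid.lower().startswith("0x"):
        key = int(lid, 16)
    else:
        key = int(lid)
    return mapping[key][0]
```

Note that the mapping file lists LIDs in hexadecimal, so `lid 0x14` corresponds to LID 20 in decimal.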
15.1.11 Perform Comprehensive Diagnostics for the Entire Fabric
If you require full testing of your InfiniBand fabric, you can use the ibdiagnet command to perform many tests with verbose results. The command is a useful tool for determining the overall health of the InfiniBand fabric.
On the command-line interface (CLI), run the following command:
# ibdiagnet -v -r
The ibdiagnet.log file contains the log of the testing.
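Because the ibdiagnet.log file can be long, it can help to count its messages by severity. This sketch assumes that ibdiagnet prefixes message lines with `-I-`, `-W-`, or `-E-` (information, warning, error), which is the convention we assume for ibutils-based ibdiagnet. Check a log from your own version before relying on it. The function names are hypothetical.

```python
from collections import Counter
from pathlib import Path

# Assumption: ibdiagnet tags message lines with one of these three-character prefixes.
SEVERITIES = {"-I-": "info", "-W-": "warning", "-E-": "error"}


def summarize_log(lines):
    """Count messages per severity and collect the error lines verbatim."""
    counts = Counter()
    errors = []
    for line in lines:
        text = line.strip()
        tag = text[:3]
        if tag in SEVERITIES:
            counts[SEVERITIES[tag]] += 1
            if tag == "-E-":
                errors.append(text)
    return counts, errors


def summarize_log_file(path):
    """Convenience wrapper that reads a log file such as ibdiagnet.log."""
    return summarize_log(Path(path).read_text(errors="replace").splitlines())
```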
15.1.12 Perform Comprehensive Diagnostics for a Route
You can use the ibdiagpath command to perform some of the same comprehensive tests on a particular route.
On the command-line interface (CLI), run the following command:
# ibdiagpath -v -l slid dlid
where slid is the LID of the source node in the fabric, and dlid is the LID of the destination node.
The ibdiagpath.log file contains the log of the testing.
15.1.13 Determine Changes to the InfiniBand Topology
If your fabric has a number of suspect nodes, the osmtest command enables you to take a snapshot (inventory file) of your fabric and later compare that file to the present conditions.
Note:
Although this procedure is most useful after initializing the Subnet Manager, it can be performed at any time.
Complete the following steps:
- Ensure that the Subnet Manager is running.
- On the command-line interface (CLI), run the following command to take a snapshot of the topology:
# osmtest -f c
For example:
# osmtest -f c
Command Line Arguments
Done with args
Flow = Create Inventory
Aug 13 19:44:53 601222 [B7D466C0] 0x7f -> Setting log level to: 0x03
Aug 13 19:44:53 601969 [B7D466C0] 0x02 -> osm_vendor_init: 1000 pending umads specified
using default guid 0x21283a8620b0f0
Aug 13 19:44:53 612312 [B7D466C0] 0x02 -> osm_vendor_bind: Binding to port 0x21283a8620b0f0
Aug 13 19:44:53 636876 [B7D466C0] 0x02 -> osmtest_validate_sa_class_port_info:
-----------------------------
SA Class Port Info:
base_ver:1
class_ver:2
cap_mask:0x2602
cap_mask2:0x0
resp_time_val:0x10
-----------------------------
OSMTEST: TEST "Create Inventory" PASS
#
- After an event, compare the present topology to that saved in the inventory file, as in the following example:
# osmtest -f v
Command Line Arguments
Done with args
Flow = Validate Inventory
Aug 13 19:45:02 342143 [B7EF96C0] 0x7f -> Setting log level to: 0x03
Aug 13 19:45:02 342857 [B7EF96C0] 0x02 -> osm_vendor_init: 1000 pending umads specified
using default guid 0x21283a8620b0f0
Aug 13 19:45:02 351555 [B7EF96C0] 0x02 -> osm_vendor_bind: Binding to port 0x21283a8620b0f0
Aug 13 19:45:02 375997 [B7EF96C0] 0x02 -> osmtest_validate_sa_class_port_info:
-----------------------------
SA Class Port Info:
base_ver:1
class_ver:2
cap_mask:0x2602
cap_mask2:0x0
resp_time_val:0x10
-----------------------------
Aug 13 19:45:02 378991 [B7EF96C0] 0x01 -> osmtest_validate_node_data: Checking node 0x0021283a8620b0a0, LID 0x14
Aug 13 19:45:02 379172 [B7EF96C0] 0x01 -> osmtest_validate_node_data: Checking node 0x0021283a8620b0b0, LID 0x15
.
.
.
Aug 13 19:45:02 480201 [B7EF96C0] 0x01 -> osmtest_validate_single_path_rec_guid_pair: Checking src 0x0021283a8620b0f0 to dest 0x0021283a8620b0f0
Aug 13 19:45:02 480588 [B7EF96C0] 0x01 -> osmtest_validate_path_data: Checking path SLID 0x19 to DLID 0x19
Aug 13 19:45:02 480989 [B7EF96C0] 0x02 -> osmtest_run: ***************** ALL TESTS PASS *****************
OSMTEST: TEST "Validate Inventory" PASS
#
Note:
Depending on the size of your InfiniBand fabric, the output from the osmtest command could be tens of thousands of lines long.
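Given how long osmtest output can be, searching it programmatically is often easier than reading it. This sketch extracts the final `OSMTEST: TEST "<name>" <status>` lines and the `Checking node <guid>, LID <hex>` lines shown in the examples. Treating any status other than PASS as a failure is an assumption, and the function names are hypothetical.

```python
import re

RESULT_RE = re.compile(r'OSMTEST: TEST "([^"]+)" (\w+)')
# \s* also tolerates the run-together "Checkingnode" seen in some copies of the output.
CHECK_RE = re.compile(r'Checking\s*node\s+(0x[0-9a-fA-F]+),\s*LID\s+(0x[0-9a-fA-F]+)')


def osmtest_results(output):
    """Map each test name to its reported status, e.g. {"Validate Inventory": "PASS"}."""
    return dict(RESULT_RE.findall(output))


def checked_nodes(output):
    """(node GUID, LID) pairs from the 'Checking node' lines, with LIDs as ints."""
    return [(guid, int(lid, 16)) for guid, lid in CHECK_RE.findall(output)]


def all_passed(output):
    """True only if at least one result line exists and every result is PASS."""
    results = osmtest_results(output)
    return bool(results) and all(status == "PASS" for status in results.values())
```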
15.1.14 Determine Which Links Are Experiencing Significant Errors
You can use the ibdiagnet command to determine which links are experiencing symbol errors and recovery errors by injecting packets.
On the command-line interface (CLI), run the following command:
# ibdiagnet -c 100 -P all=1
In this instance of the ibdiagnet command, 100 test packets are injected into each link, and the -P all=1 option reports all counters that increment during the test.
In the output of the ibdiagnet command, search for the symbol_error_counter string. That line contains the symbol error count in hexadecimal, and the preceding lines identify the node and port with the errors. Symbol errors are minor errors; if relatively few occur during the diagnostic, they can simply be monitored.
Note:
According to the InfiniBand specification's bit error rate (BER) requirement of 10^-12, the maximum allowable symbol error rate is 120 errors per hour.
In addition, in the output of the ibdiagnet command, search for the link_error_recovery_counter string. That line contains the recovery error count in hexadecimal, and the preceding lines identify the node and port with the errors. Recovery errors are major errors, and the affected links must be investigated to find the cause of the rapid symbol error propagation.
Additionally, the ibdiagnet.log file contains the log of the testing.
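The search for the two counter strings can be scripted. The sketch below assumes that each counter line contains the counter name followed by a `0x` value. It also assumes, as this section describes, that the node and port appear on the lines just before it. The exact ibdiagnet line layout is not shown here, so the regular expression is an assumption. The sketch also scales a symbol error count to an hourly rate for comparison with the 120-errors-per-hour limit in the note, assuming you know how long the counter was accumulating. All names are hypothetical.

```python
import re

# Assumption: the counter line has the form "<counter name> ... 0x<hex value>".
COUNTER_RE = re.compile(
    r'\b(symbol_error_counter|link_error_recovery_counter)\b\D*?(0x[0-9a-fA-F]+)'
)
SEVERITY = {"symbol_error_counter": "minor", "link_error_recovery_counter": "major"}
MAX_SYMBOL_ERRORS_PER_HOUR = 120  # limit quoted in the note in this section


def find_error_counters(output, context_lines=2):
    """Return each counter hit with its decoded value and the preceding lines (node/port context)."""
    lines = output.splitlines()
    hits = []
    for i, line in enumerate(lines):
        m = COUNTER_RE.search(line)
        if m:
            hits.append({
                "counter": m.group(1),
                "severity": SEVERITY[m.group(1)],
                "value": int(m.group(2), 16),
                "context": [l.strip() for l in lines[max(0, i - context_lines):i]],
            })
    return hits


def symbol_errors_per_hour(count, elapsed_seconds):
    """Scale a symbol error count observed over elapsed_seconds to an hourly rate."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return count * 3600 / elapsed_seconds


def over_symbol_error_limit(count, elapsed_seconds):
    """True if the hourly rate is above the 120-errors-per-hour limit."""
    return symbol_errors_per_hour(count, elapsed_seconds) > MAX_SYMBOL_ERRORS_PER_HOUR
```

For example, 31 symbol errors accumulated over 15 minutes (900 seconds) scale to 124 errors per hour, just above the limit.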
15.1.15 Check All Ports
To perform a quick check of all ports of all nodes in your InfiniBand fabric, you can use the ibcheckstate command.
On the command-line interface (CLI), run the following command:
# ibcheckstate -v
The output is displayed, as in the following example:
# Checking Switch: nodeguid 0x0021283a8389a0a0
Node check lid 15: OK
Port check lid 15 port 23: OK
Port check lid 15 port 19: OK
.
.
.
# Checking Ca: nodeguid 0x0003ba000100e388
Node check lid 14: OK
Port check lid 14 port 2: OK
## Summary: 5 nodes checked, 0 bad nodes found
## 10 ports checked, 0 ports with bad state found
#
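For unattended checks, you can parse the ibcheckstate summary lines and turn them into a pass/fail result. The sketch assumes the `Node check ...: OK` / `Port check ...: OK` and `## Summary` layout from the example. Treating any result other than OK as a failure is an assumption, and the function names are hypothetical.

```python
import re

NODE_SUMMARY_RE = re.compile(r'(\d+) nodes checked, (\d+) bad nodes found')
PORT_SUMMARY_RE = re.compile(r'(\d+) ports checked, (\d+) ports with bad state found')
CHECK_RE = re.compile(r'^(Node|Port) check ([^:]*):\s*(\S+)')


def parse_ibcheckstate(output):
    """Return summary counts plus every individual check line whose result is not OK."""
    summary = {}
    m = NODE_SUMMARY_RE.search(output)
    if m:
        summary["nodes_checked"], summary["bad_nodes"] = int(m.group(1)), int(m.group(2))
    m = PORT_SUMMARY_RE.search(output)
    if m:
        summary["ports_checked"], summary["bad_ports"] = int(m.group(1)), int(m.group(2))
    failures = []
    for line in output.splitlines():
        c = CHECK_RE.match(line.strip())
        if c and c.group(3) != "OK":
            failures.append(line.strip())
    summary["failures"] = failures
    return summary


def fabric_healthy(summary):
    """True when no bad nodes, bad ports, or failed check lines were reported."""
    return (summary.get("bad_nodes", 0) == 0 and summary.get("bad_ports", 0) == 0
            and not summary["failures"])
```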
When reposting, please credit the source: apollocode » InfiniBand Network Monitoring
File list (partial)
Name | Size | Date modified |
---|---|---|
infiniband监控网络.doc | 13.97 KB | 2020-05-16 |