Line 2: Line 2:
 ===== 29 Sep 2015 ===== ===== 29 Sep 2015 =====
 +  * K80 Failure, Vesta1
 +volker@vesta1:~$ nvidia-smi -L
 +GPU 0: Tesla K80 (UUID: GPU-a4c1ed90-a3e2-22be-1739-bf836ea157fc)
 +GPU 1: Tesla K80 (UUID: GPU-f090063b-9435-87a5-0ba5-34975b5fc981)
 +GPU 2: Tesla K80 (UUID: GPU-f76f6bfa-a99f-5196-7624-24b8a0e84269)
 +GPU 3: Tesla K80 (UUID: GPU-dc4b8a71-f635-ea72-a054-973021340366)
 +GPU 4: Tesla K80 (UUID: GPU-eb698ccb-c615-ab6c-dfc6-74561a6098ea)
 +GPU 5: Tesla K80 (UUID: GPU-c2ecd9ae-8fed-5172-e7e3-6dc6c770e2a4)
 +Unable to determine the product name for gpu 0000:14:00.0: GPU is lost
 +Unable to determine the product name for gpu 0000:15:00.0: GPU is lost
 +GPU 8: Tesla K80 (UUID: GPU-63eb6611-c38f-3146-07e4-17b0dd45beec)
 +GPU 9: Tesla K80 (UUID: GPU-3acb2e5f-5862-e59e-237a-24affdbb65dd)
 +GPU 10: Tesla K80 (UUID: GPU-c6ea8d80-24bc-6a47-91ac-616db5b680eb)
 +GPU 11: Tesla K80 (UUID: GPU-27266a29-25e5-c902-4f6b-e7018ee2d145)
 +GPU 12: Tesla K80 (UUID: GPU-3c7c64c7-9ea9-809a-92e9-f78b30174375)
 +GPU 13: Tesla K80 (UUID: GPU-bff19037-7ff9-8e92-41ee-4a1b72196287)
 +GPU 14: Tesla K80 (UUID: GPU-3bfef1d4-9262-f902-84fa-809fde093c1a)
 +GPU 15: Tesla K80 (UUID: GPU-d6784c9c-c990-1d47-da07-30a341630bb6)
   * K80 Failure, Vesta2   * K80 Failure, Vesta2
Line 24: Line 46:
 GPU 15: Tesla K80 (UUID: GPU-c7fdb877-88ab-3b85-4d4f-ed80cf5cdd7b) GPU 15: Tesla K80 (UUID: GPU-c7fdb877-88ab-3b85-4d4f-ed80cf5cdd7b)
 </code> </code>
 +  * Newly Retired Pages (WTF?)
 +volker@vesta2:$ nvidia-smi --query-retired-pages=gpu_uuid,retired_pages.address,retired_pages.cause --format=csv
 +gpu_uuid, retired_pages.address, retired_pages.cause
 +GPU-4457aa27-9352-3308-199c-86bc3fbe31de, 0x0000000000004fc2, Double Bit ECC
 +GPU-4457aa27-9352-3308-199c-86bc3fbe31de, 0x0000000000003d62, Double Bit ECC
 +GPU-4457aa27-9352-3308-199c-86bc3fbe31de, 0x0000000000006a56, Double Bit ECC
 +===== 24 Sep 2015 =====
   * Retired Pages   * Retired Pages
 +volker@vesta1:~$ nvidia-smi --query-retired-pages=gpu_uuid,retired_pages.address,retired_pages.cause --format=csv
 +gpu_uuid, retired_pages.address, retired_pages.cause
 +GPU-3acb2e5f-5862-e59e-237a-24affdbb65dd, 0x00000000000065a1, Double Bit ECC
 +GPU-3acb2e5f-5862-e59e-237a-24affdbb65dd, 0x00000000000065e2, Double Bit ECC
 <code> <code>
Line 33: Line 73:
 GPU-4457aa27-9352-3308-199c-86bc3fbe31de, 0x0000000000003d62, Double Bit ECC GPU-4457aa27-9352-3308-199c-86bc3fbe31de, 0x0000000000003d62, Double Bit ECC
 </code> </code>
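The retired-page listings above are easier to compare across dates when tallied per GPU. The following is a minimal sketch, not part of the original log, assuming the CSV layout shown in the `nvidia-smi --query-retired-pages ... --format=csv` output (header row, then `gpu_uuid, address, cause`). The function name `count_retired_pages` and the `SAMPLE` constant are hypothetical; the sample rows are copied from the 29 Sep vesta2 listing.

```python
import csv
import io
from collections import Counter


def count_retired_pages(csv_text):
    """Count retired pages per (gpu_uuid, cause).

    Assumes the three-column CSV layout shown in the log:
    gpu_uuid, retired_pages.address, retired_pages.cause
    """
    reader = csv.reader(io.StringIO(csv_text), skipinitialspace=True)
    next(reader, None)  # skip the header row
    counts = Counter()
    for row in reader:
        if len(row) < 3:
            continue  # ignore blank or malformed lines
        uuid, _address, cause = (field.strip() for field in row[:3])
        counts[(uuid, cause)] += 1
    return counts


# Sample rows copied from the vesta2 output above (hypothetical constant name).
SAMPLE = """gpu_uuid, retired_pages.address, retired_pages.cause
GPU-4457aa27-9352-3308-199c-86bc3fbe31de, 0x0000000000004fc2, Double Bit ECC
GPU-4457aa27-9352-3308-199c-86bc3fbe31de, 0x0000000000003d62, Double Bit ECC
GPU-4457aa27-9352-3308-199c-86bc3fbe31de, 0x0000000000006a56, Double Bit ECC
"""

if __name__ == "__main__":
    for (uuid, cause), n in sorted(count_retired_pages(SAMPLE).items()):
        print(f"{uuid}: {n} page(s) retired ({cause})")
```

Saving the command output to a file on each check and feeding it to this function would show whether a GPU's count grows between log entries.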
