====== Hardware Log ======

===== 29 Sep 2015 =====

  * K80 Failure, Vesta1

  volker@vesta1:~$ nvidia-smi -L
  GPU 0: Tesla K80 (UUID: GPU-a4c1ed90-a3e2-22be-1739-bf836ea157fc)
  GPU 1: Tesla K80 (UUID: GPU-f090063b-9435-87a5-0ba5-34975b5fc981)
  GPU 2: Tesla K80 (UUID: GPU-f76f6bfa-a99f-5196-7624-24b8a0e84269)
  GPU 3: Tesla K80 (UUID: GPU-dc4b8a71-f635-ea72-a054-973021340366)
  GPU 4: Tesla K80 (UUID: GPU-eb698ccb-c615-ab6c-dfc6-74561a6098ea)
  GPU 5: Tesla K80 (UUID: GPU-c2ecd9ae-8fed-5172-e7e3-6dc6c770e2a4)
  Unable to determine the product name for gpu 0000:14:00.0: GPU is lost
  Unable to determine the product name for gpu 0000:15:00.0: GPU is lost
  GPU 8: Tesla K80 (UUID: GPU-63eb6611-c38f-3146-07e4-17b0dd45beec)
  GPU 9: Tesla K80 (UUID: GPU-3acb2e5f-5862-e59e-237a-24affdbb65dd)
  GPU 10: Tesla K80 (UUID: GPU-c6ea8d80-24bc-6a47-91ac-616db5b680eb)
  GPU 11: Tesla K80 (UUID: GPU-27266a29-25e5-c902-4f6b-e7018ee2d145)
  GPU 12: Tesla K80 (UUID: GPU-3c7c64c7-9ea9-809a-92e9-f78b30174375)
  GPU 13: Tesla K80 (UUID: GPU-bff19037-7ff9-8e92-41ee-4a1b72196287)
  GPU 14: Tesla K80 (UUID: GPU-3bfef1d4-9262-f902-84fa-809fde093c1a)
  GPU 15: Tesla K80 (UUID: GPU-d6784c9c-c990-1d47-da07-30a341630bb6)

  * K80 Failure, Vesta2

  volker@vesta2:~$ nvidia-smi -L
  GPU 0: Tesla K80 (UUID: GPU-0c5ab30c-edd7-5783-863f-1e7456dfe380)
  GPU 1: Tesla K80 (UUID: GPU-850dfaee-cd37-0da0-8a0e-89fd2dc4195c)
  GPU 2: Tesla K80 (UUID: GPU-53da1a60-cbd9-aaae-7b7a-0486f7a68956)
  GPU 3: Tesla K80 (UUID: GPU-1b2c7b2e-b895-a4eb-bac7-ae5887eec894)
  GPU 4: Tesla K80 (UUID: GPU-2195f8d4-a646-3ae5-9867-bcbca2cd41b0)
  GPU 5: Tesla K80 (UUID: GPU-c0a7616e-6af3-da34-47cb-4ae0a32b3fac)
  GPU 6: Tesla K80 (UUID: GPU-aaf87699-406d-a1f0-0d59-d73efb6b906c)
  GPU 7: Tesla K80 (UUID: GPU-986eef1c-8c4f-45ac-42af-42bf5792dd92)
  Unable to determine the product name for gpu 0000:86:00.0: GPU is lost
  Unable to determine the product name for gpu 0000:87:00.0: GPU is lost
  GPU 10: Tesla K80 (UUID: GPU-36ef7210-3faa-a757-9e53-3cdc2b973aeb)
  GPU 11: Tesla K80 (UUID: GPU-10756d7f-f8d9-3711-a8b9-fd3d30db5e88)
  GPU 12: Tesla K80 (UUID: GPU-12ff95e8-3277-e9d1-d87e-02b83bd55431)
  GPU 13: Tesla K80 (UUID: GPU-4457aa27-9352-3308-199c-86bc3fbe31de)
  GPU 14: Tesla K80 (UUID: GPU-143afd37-0bbd-ba30-8e5f-59eb0a169eda)
  GPU 15: Tesla K80 (UUID: GPU-c7fdb877-88ab-3b85-4d4f-ed80cf5cdd7b)

  * Newly Retired Pages (WTF?)

  volker@vesta2:$ nvidia-smi --query-retired-pages=gpu_uuid,retired_pages.address,retired_pages.cause --format=csv
  gpu_uuid, retired_pages.address, retired_pages.cause
  GPU-4457aa27-9352-3308-199c-86bc3fbe31de, 0x0000000000004fc2, Double Bit ECC
  GPU-4457aa27-9352-3308-199c-86bc3fbe31de, 0x0000000000003d62, Double Bit ECC
  GPU-4457aa27-9352-3308-199c-86bc3fbe31de, 0x0000000000006a56, Double Bit ECC

===== 24 Sep 2015 =====

  * Retired Pages

  volker@vesta1:~$ nvidia-smi --query-retired-pages=gpu_uuid,retired_pages.address,retired_pages.cause --format=csv
  gpu_uuid, retired_pages.address, retired_pages.cause
  GPU-3acb2e5f-5862-e59e-237a-24affdbb65dd, 0x00000000000065a1, Double Bit ECC
  GPU-3acb2e5f-5862-e59e-237a-24affdbb65dd, 0x00000000000065e2, Double Bit ECC

  volker@vesta2:~$ nvidia-smi --query-retired-pages=gpu_uuid,retired_pages.address,retired_pages.cause --format=csv
  gpu_uuid, retired_pages.address, retired_pages.cause
  GPU-4457aa27-9352-3308-199c-86bc3fbe31de, 0x0000000000004fc2, Double Bit ECC
  GPU-4457aa27-9352-3308-199c-86bc3fbe31de, 0x0000000000003d62, Double Bit ECC
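
The "newly retired" entry on vesta2 can be found by diffing the two retired-pages snapshots in this log (24 Sep vs. 29 Sep). The following is a minimal sketch, assuming the CSV text has been saved or pasted in exactly as printed by ''nvidia-smi''. The helper names ''parse_retired_pages'' and ''newly_retired'' are hypothetical and not part of any NVIDIA tool.

```python
import csv
import io


def parse_retired_pages(csv_text):
    """Return a set of (gpu_uuid, address, cause) tuples from the CSV text.

    Assumes the three-column layout shown in this log; the header row
    (first field "gpu_uuid") and blank lines are skipped.
    """
    reader = csv.reader(io.StringIO(csv_text), skipinitialspace=True)
    pages = set()
    for row in reader:
        fields = tuple(field.strip() for field in row)
        if len(fields) != 3 or fields[0] == "gpu_uuid":
            continue
        pages.add(fields)
    return pages


def newly_retired(old_csv, new_csv):
    """Return the sorted entries present in new_csv but not in old_csv."""
    return sorted(parse_retired_pages(new_csv) - parse_retired_pages(old_csv))


# vesta2 snapshots copied from the log entries above.
VESTA2_SEP_24 = """\
gpu_uuid, retired_pages.address, retired_pages.cause
GPU-4457aa27-9352-3308-199c-86bc3fbe31de, 0x0000000000004fc2, Double Bit ECC
GPU-4457aa27-9352-3308-199c-86bc3fbe31de, 0x0000000000003d62, Double Bit ECC
"""

VESTA2_SEP_29 = """\
gpu_uuid, retired_pages.address, retired_pages.cause
GPU-4457aa27-9352-3308-199c-86bc3fbe31de, 0x0000000000004fc2, Double Bit ECC
GPU-4457aa27-9352-3308-199c-86bc3fbe31de, 0x0000000000003d62, Double Bit ECC
GPU-4457aa27-9352-3308-199c-86bc3fbe31de, 0x0000000000006a56, Double Bit ECC
"""

if __name__ == "__main__":
    for uuid, address, cause in newly_retired(VESTA2_SEP_24, VESTA2_SEP_29):
        print(uuid, address, cause)
```

For the vesta2 data in this log, the only new entry is address ''0x0000000000006a56'' on GPU-4457aa27-9352-3308-199c-86bc3fbe31de (GPU 13 in the vesta2 listing).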