Summary
- I have 3 cards of this exact type altogether, all very high spec: this one, plus Gigabyte Windforce Extreme GTX 980 Ti Card B (power rail issues) and Gigabyte Windforce Extreme GTX 980 Ti Card C (memory issue).
- Resistances
- Vcore – 2.4
- Vmem – 61
- PEX – 306
- Artefacts, BIOS screen messed up.
- Failed the MATS test with an odd pattern, and MODS throws assertion errors. Hopefully it’s memory and not the memory controller or GPU core.
- I have some of the required memory chips. All I can think of is to attempt a replacement of VRAM chip F1, but it could be more complicated than that, as I have never seen MATS/MODS look quite like that…
mats version 367.38. Testing GM200 with 100 MB of memory starting with 0 MB.
Errors found. Use -matsinfo for details.
This message will only appear once.
SUBPART RANK0 RD ERR RANK0 WR ERR UNKNOWN ERR
------------- ------------- ------------- ------------
FBIOA[ 31: 0] 0 0 0
FBIOA[ 63: 32] 0 0 0
FBIOB[ 31: 0] 0 0 0
FBIOB[ 63: 32] 0 0 0
FBIOC[ 31: 0] 0 0 0
FBIOC[ 63: 32] 0 0 0
FBIOD[ 31: 0] 0 0 0
FBIOD[ 63: 32] 0 0 0
FBIOE[ 31: 0] 0 0 0
FBIOE[ 63: 32] 0 0 0
FBIOF[ 31: 0] 0 2729656 0
FBIOF[ 63: 32] 0 0 0
Rank 0 Failing bits:
F008 F009 F010 F011 F012 F013 F014 F015
Read Error Count: 0
Write Error Count: 2729656
Unknown Error Count: 0
BIT RANK0 WRITE RANK0 READ UNKNOWN
--- ----------- ---------- -------
F008 135886 0 0
F008 135886 0 0
F008 135886 0 0
F008 135886 0 0
F009 273024 0 0
F009 273664 0 0
F009 273024 0 0
F009 273664 0 0
F009 408590 0 0
F009 408910 0 0
F009 408590 0 0
F009 408910 0 0
F010 135886 0 0
F010 135886 0 0
F010 135886 0 0
F010 135886 0 0
F011 271162 0 0
F011 271162 0 0
F011 271162 0 0
F011 271162 0 0
F012 135886 0 0
F012 135886 0 0
F012 135886 0 0
F012 135886 0 0
F013 271162 0 0
F013 271162 0 0
F013 271162 0 0
F013 271162 0 0
F014 135886 0 0
F014 135886 0 0
F014 135886 0 0
F014 135886 0 0
F015 271162 0 0
F015 271162 0 0
F015 271162 0 0
F015 271162 0 0
ADDRESS EXPECTED ACTUAL REREAD1 REREAD2 FAILBITS TPSEIB ROW COL
---------- -------- -------- -------- -------- -------- ------ ---- ---
0002fdc45c 00000000 0000ff00 0000ff00 0000ff00 0000ff00 WF00a7 007f 117
0002fdc458 00000000 0000ff00 0000ff00 0000ff00 0000ff00 WF00a6 007f 116
0002fdc454 00000000 0000ff00 0000ff00 0000ff00 0000ff00 WF00a5 007f 115
0002fdc450 00000000 0000ff00 0000ff00 0000ff00 0000ff00 WF00a4 007f 114
0002fd845c 00000000 0000ff00 0000ff00 0000ff00 0000ff00 WF00a7 007f 097
.......
000464bc50 00000000 0000ff00 0000ff00 0000ff00 0000ff00 WF0094 00bb 094
000581f61c 00000000 0000ff00 0000ff00 0000ff00 0000ff00 WF0077 00ea 1c7
000581f618 00000000 0000ff00 0000ff00 0000ff00 0000ff00 WF0076 00ea 1c6
000581f614 00000000 0000ff00 0000ff00 0000ff00 0000ff00 WF0075 00ea 1c5
000581f610 00000000 0000ff00 0000ff00 0000ff00 0000ff00 WF0074 00ea 1c4
if you are getting failure for first MBof FB then try option -no_scan_out
Error Code = 00000001
#######     ####    ########  ###
#######    ######   ########  ###
##        ##    ##     ##     ###
##        ##    ##     ##     ###
#######   ########     ##     ###
#######   ########     ##     ###
##        ##    ##     ##     ###
##        ##    ##  ########  ########
##        ##    ##  ########  ########
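As a quick sanity check on the address table in the MATS output above, XORing EXPECTED with ACTUAL should give the bits that came back wrong within each 32-bit word. The sketch below is only an illustration: `failing_bits` is a hypothetical helper, not part of MATS, and the values are copied from the table. The result, bits 8 to 15, is a single byte lane, which lines up with the F008 to F015 entries in the failing-bits list.

```python
def failing_bits(expected, actual):
    """Return the bit positions (0 = least significant) where the two words differ."""
    diff = expected ^ actual
    return [bit for bit in range(32) if (diff >> bit) & 1]

# EXPECTED and ACTUAL copied from the first MATS address table.
bits = failing_bits(0x00000000, 0x0000FF00)
print(bits)  # [8, 9, 10, 11, 12, 13, 14, 15]
```

Every row in that table shows the same 0000ff00 pattern, so the same eight data bits are wrong at every failing address.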
Here is a small sample of the MODS failure:
......
------------------------- BEGIN ASSERT INFO DUMP -------------------------
uteTime: 0x0003d0900
NVRM: _threadNodeCheckTimeout: Timeout was set to: 4000 msecs!
NVRM: flcnQueueCmdWrite_IMPL: error while opening queue (id=0x1, status=0x1a).
NVRM: pmuCommandPostBlocking_IMPL: Failed to post command to PMU.
NVRM: thermPmuRpcExecute_IMPL: Failed to execute RPC command (status=0x0000001a, function=0x0000000a)
NVRM: bp @ ../../../../resman/kernel/thermctl/nv/thrmpmu.c:292
** ModsDrvBreakPoint **
NVRM: _thermPmuThermPolicyLoad: Error while executing THERM_POLICY_LOAD RPC (status = 0x0000ffff).
NVRM: bp @ ../../../../resman/kernel/thermctl/nv/thrmpmu.c:2195
** ModsDrvBreakPoint **
NVRM: threadId: 0x0000003e8 irql: 0x000000000 flags: 0x020
NVRM: enterTime: 0x16dfb3c0e6d0ef20 Limits: nonComputeTime: 0x000000000 computeTime: 0x0003d0900
NVRM: _threadNodeCheckTimeout: Timeout was set to: 4000 msecs!
NVRM: _pmgrPmuSendUnloadCmdToPmu: Failed waiting for UNLOAD command to complete (status=101).
NVRM: bp @ ../../../../resman/kernel/pmgr/nv/pmgrpmu.c:2097
** ModsDrvBreakPoint **
NVRM: pmgrPmuUnloadHelper_IMPL: Failed to send Power Policy Unload cmd to PMU - status=0x00000065.
NVRM: bp @ ../../../../resman/kernel/pmgr/nv/pmgrpmu.c:1536
** ModsDrvBreakPoint **
NVRM: _pmgrPmuPrereqCallbackInit: Failed to UNLOAD PMGR task - status=0x00000065.
NVRM: threadId: 0x0000003e8 irql: 0x000000000 flags: 0x020
NVRM: enterTime: 0x16dfb3c0e6d0ef20 Limits: nonComputeTime: 0x000000000 computeTime: 0x0003d0900
NVRM: _threadNodeCheckTimeout: Timeout was set to: 4000 msecs!
NVRM: flcnQueueCmdWrite_IMPL: error while opening queue (id=0x0, status=0x1a).
NVRM: pmuDetach_IMPL - Failed to detach PMU. Resetting it normally.
NVRM: pmuDetach_IMPL failed. Resetting...
NVRM: threadId: 0x0000003e8 irql: 0x000000000 flags: 0x020
NVRM: enterTime: 0x16dfb3c0e6d0ef20 Limits: nonComputeTime: 0x000000000 computeTime: 0x0003d0900
NVRM: _threadNodeCheckTimeout: Timeout was set to: 4000 msecs!
NVRM: pmuStateUnload_IMPL: error detaching PMU (status=0x1a)
NVRM: bp @ ../../../../resman/kernel/pmu/nv/objpmu.c:2806
** ModsDrvBreakPoint **
-------------------------- END ASSERT INFO DUMP --------------------------
Error 000000000818 : Gpu.ShutDown Mods detected an assertion failure
Error Code = 000000000690 (NVRM FB Training Failed)
#######     ####    ########  ###
#######    ######   ########  ###
##        ##    ##     ##     ###
##        ##    ##     ##     ###
#######   ########     ##     ###
#######   ########     ##     ###
##        ##    ##     ##     ###
##        ##    ##  ########  ########
##        ##    ##  ########  ########
MODS end : Fri Mar 25 18:40:38 2022 [74.726 seconds (00:01:14.726 h:m:s)]
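To get a rough feel for where the MODS assertions cluster, one option is to count the `bp @` lines by the directory that follows `resman/kernel/`. This is a minimal sketch, assuming the log has been saved as text; `breakpoint_modules` is a hypothetical helper and the sample lines are copied from the dump above. It only summarises the excerpt shown here, not the full log.

```python
import re
from collections import Counter

# Matches lines such as "NVRM: bp @ ../../../../resman/kernel/pmu/nv/objpmu.c:2806".
BP = re.compile(r"NVRM: bp @ (\S+):(\d+)")

def breakpoint_modules(log):
    """Count breakpoints by the path component that follows 'kernel'."""
    counts = Counter()
    for path, _line_no in BP.findall(log):
        parts = path.split("/")
        if "kernel" in parts and parts.index("kernel") + 1 < len(parts):
            counts[parts[parts.index("kernel") + 1]] += 1
        else:
            counts[path] += 1  # unexpected layout: fall back to the full path
    return counts

SAMPLE = """\
NVRM: bp @ ../../../../resman/kernel/thermctl/nv/thrmpmu.c:292
NVRM: bp @ ../../../../resman/kernel/thermctl/nv/thrmpmu.c:2195
NVRM: bp @ ../../../../resman/kernel/pmgr/nv/pmgrpmu.c:2097
NVRM: bp @ ../../../../resman/kernel/pmgr/nv/pmgrpmu.c:1536
NVRM: bp @ ../../../../resman/kernel/pmu/nv/objpmu.c:2806
"""
modules = breakpoint_modules(SAMPLE)
print(dict(modules))  # {'thermctl': 2, 'pmgr': 2, 'pmu': 1}
```

In this excerpt every breakpoint sits in a file with "pmu" in its name, while the final error code reported is "NVRM FB Training Failed".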
Replacing VRAM F1
- Had a very tough time getting the chip off! Even with a preheater, it took several attempts, eventually going up to 450 °C on the hot air. It could be down to the large board.
- I am not feeling very lucky with this replacement. Hopefully I haven’t caused other damage in the process.
Hmm… well, in some ways I’ve made it worse, as the other chip, F0, is now showing errors. But I am really starting to think there is another issue.
mats version 367.38. Testing GM200 with 10 MB of memory starting with 0 MB.
Errors found. Use -matsinfo for details.
This message will only appear once.
SUBPART RANK0 RD ERR RANK0 WR ERR UNKNOWN ERR
------------- ------------- ------------- ------------
FBIOA[ 31: 0] 0 0 0
FBIOA[ 63: 32] 0 0 0
FBIOB[ 31: 0] 0 0 0
FBIOB[ 63: 32] 0 0 0
FBIOC[ 31: 0] 0 0 0
FBIOC[ 63: 32] 0 0 0
FBIOD[ 31: 0] 0 0 0
FBIOD[ 63: 32] 0 0 0
FBIOE[ 31: 0] 0 0 0
FBIOE[ 63: 32] 0 0 0
FBIOF[ 31: 0] 0 272400 0
FBIOF[ 63: 32] 0 441696 0
Rank 0 Failing bits:
F008 F009 F010 F011 F012 F013 F014 F015 F032 F033 F034 F035 F036 F037 F038 F039
F040 F041 F042 F043 F044 F045 F046 F047 F048 F049 F050 F051 F052 F053 F054 F055
F056 F057 F058 F059 F060 F061 F062 F063
Read Error Count: 0
Write Error Count: 714096
Unknown Error Count: 0
BIT RANK0 WRITE RANK0 READ UNKNOWN
--- ----------- ---------- -------
F008 13059 0 0
F008 13059 0 0
F008 13059 0 0
F008 13059 0 0
F009 27786 0 0
F009 27418 0 0
F009 27786 0 0
........
F062 27948 0 0
F062 27948 0 0
F062 27948 0 0
F062 27948 0 0
F063 54528 0 0
F063 54528 0 0
F063 54528 0 0
F063 54528 0 0
F063 54528 0 0
F063 54528 0 0
F063 54528 0 0
F063 54528 0 0
ADDRESS EXPECTED ACTUAL REREAD1 REREAD2 FAILBITS TPSEIB ROW COL
---------- -------- -------- -------- -------- -------- ------ ---- ---
00000adb3c 00000000 ffffffff ffffffff ffffffff ffffffff WF1077 0001 14f
00000adb38 00000000 ffffffff ffffffff ffffffff ffffffff WF1076 0001 14e
00000adb34 00000000 ffffffff ffffffff ffffffff ffffffff WF1075 0001 14d
00000adb30 00000000 ffffffff ffffffff ffffffff ffffffff WF1074 0001 14c
00000adb2c 00000000 ffffffff ffffffff ffffffff ffffffff WF1073 0001 14b
00000adb28 00000000 ffffffff ffffffff ffffffff ffffffff WF1072 0001 14a
00000adb24 00000000 ffffffff ffffffff ffffffff ffffffff WF1071 0001 149
00000adb20 00000000 ffffffff ffffffff ffffffff ffffffff WF1070 0001 148
00000adb1c 00000000 ffffffff ffffffff ffffffff ffffffff WF1077 0001 147
00000adb18 00000000 ffffffff ffffffff ffffffff ffffffff WF1076 0001 146
00000adb14 00000000 ffffffff ffffffff ffffffff ffffffff WF1075 0001 145
00000adb10 00000000 ffffffff ffffffff ffffffff ffffffff WF1074 0001 144
00000adb0c 00000000 ffffffff ffffffff ffffffff ffffffff WF1073 0001 143
00000adb08 00000000 ffffffff ffffffff ffffffff ffffffff WF1072 0001 142
00000adb04 00000000 ffffffff ffffffff ffffffff ffffffff WF1071 0001 141
00000adb00 00000000 ffffffff ffffffff ffffffff ffffffff WF1070 0001 140
00000a9b3c 00000000 ffffffff ffffffff ffffffff ffffffff WF1077 0001 0cf
00000a9b38 00000000 ffffffff ffffffff ffffffff ffffffff WF1076 0001 0ce
00000a9b34 00000000 ffffffff ffffffff ffffffff ffffffff WF1075 0001 0cd
00000a9b30 00000000 ffffffff ffffffff ffffffff ffffffff WF1074 0001 0cc
00000a9b2c 00000000 ffffffff ffffffff ffffffff ffffffff WF1073 0001 0cb
00000a9b28 00000000 ffffffff ffffffff ffffffff ffffffff WF1072 0001 0ca
00000a9b24 00000000 ffffffff ffffffff ffffffff ffffffff WF1071 0001 0c9
00000a9b20 00000000 ffffffff ffffffff ffffffff ffffffff WF1070 0001 0c8
.............
00002e0468 00000000 ffffffff ffffffff ffffffff ffffffff WF1012 0007 01a
00002e0464 00000000 ffffffff ffffffff ffffffff ffffffff WF1011 0007 019
00002e0460 00000000 ffffffff ffffffff ffffffff ffffffff WF1010 0007 018
00002e045c 00000000 ffffffff ffffffff ffffffff ffffffff WF1017 0007 017
00002e0458 00000000 ffffffff ffffffff ffffffff ffffffff WF1016 0007 016
00002e0454 00000000 ffffffff ffffffff ffffffff ffffffff WF1015 0007 015
00002e0450 00000000 ffffffff ffffffff ffffffff ffffffff WF1014 0007 014
00002e044c 00000000 ffffffff ffffffff ffffffff ffffffff WF1013 0007 013
00002e0448 00000000 ffffffff ffffffff ffffffff ffffffff WF1012 0007 012
00002e0444 00000000 ffffffff ffffffff ffffffff ffffffff WF1011 0007 011
00002e0440 00000000 ffffffff ffffffff ffffffff ffffffff WF1010 0007 010
000039e41c 00000000 0000ff00 0000ff00 0000ff00 0000ff00 WF00a7 0009 187
000039e418 00000000 0000ff00 0000ff00 0000ff00 0000ff00 WF00a6 0009 186
000039e414 00000000 0000ff00 0000ff00 0000ff00 0000ff00 WF00a5 0009 185
000039e410 00000000 0000ff00 0000ff00 0000ff00 0000ff00 WF00a4 0009 184
if you are getting failure for first MBof FB then try option -no_scan_out
Error Code = 00000001
#######     ####    ########  ###
#######    ######   ########  ###
##        ##    ##     ##     ###
##        ##    ##     ##     ###
#######   ########     ##     ###
#######   ########     ##     ###
##        ##    ##     ##     ###
##        ##    ##  ########  ########
##        ##    ##  ########  ########
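The two MATS summaries can be compared with a small script. This is a rough sketch rather than anything MATS provides: `failing_subparts` and `errors_per_mb` are hypothetical helpers, and the rows are copied from the two reports. Note that the first run tested 100 MB and the second only 10 MB, so the raw counts are not directly comparable; dividing by the tested size assumes errors scale roughly with the range tested, which may not hold.

```python
import re

# Matches summary rows such as "FBIOF[ 63: 32] 0 441696 0".
ROW = re.compile(r"^(FBIO[A-Z])\[\s*(\d+):\s*(\d+)\]\s+(\d+)\s+(\d+)\s+(\d+)$")

def failing_subparts(report):
    """Return {subpartition label: total errors} for rows with any errors."""
    failing = {}
    for line in report.splitlines():
        match = ROW.match(line.strip())
        if match is None:
            continue
        name, hi, lo, rd, wr, unknown = match.groups()
        total = int(rd) + int(wr) + int(unknown)
        if total > 0:
            failing[f"{name}[{hi}:{lo}]"] = total
    return failing

def errors_per_mb(count, tested_mb):
    """Error density, assuming errors scale with the amount of memory tested."""
    return count / tested_mb

# Rows copied from the report before the chip replacement (100 MB tested).
BEFORE = """\
FBIOE[ 63: 32] 0 0 0
FBIOF[ 31: 0] 0 2729656 0
FBIOF[ 63: 32] 0 0 0
"""
# Rows copied from the report after the chip replacement (10 MB tested).
AFTER = """\
FBIOE[ 63: 32] 0 0 0
FBIOF[ 31: 0] 0 272400 0
FBIOF[ 63: 32] 0 441696 0
"""
before = failing_subparts(BEFORE)
after = failing_subparts(AFTER)
newly_failing = sorted(set(after) - set(before))
print(newly_failing)  # ['FBIOF[63:32]']
print(errors_per_mb(before["FBIOF[31:0]"], 100))
print(errors_per_mb(after["FBIOF[31:0]"], 10))
```

On these numbers FBIOF[31:0] sits at roughly 27,300 write errors per MB before the swap and about 27,240 per MB after it, so the original fault looks essentially unchanged while FBIOF[63:32] is newly failing. The two post-swap counts also add up to the reported Write Error Count of 714096.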