Seller Note “PLEASE NOTE: THE CARD IS STRICTLY FOR PARTS ONLY AS THE FANS SPIN BUT NO DISPLAY. DON’T KNOW THE HISTORY OF THE CARD, BOUGHT IT IN A JOBLOT.”
Summary
- The resistances and voltages seem ok.
- The power-up test shows symptoms consistent with a failed memory chip: “Blank screen, backlight comes on after a few minutes.”
- The Mats test does indeed indicate a failing or partially failing chip.
Mats Report
=== MEMORY ERRORS BY SUBPARTITION ===
SUBPART READ ERRORS WRITE ERRORS UNKNOWN ERRS
FBIOA0 0 0 0
FBIOA1 0 0 0
FBIOB0 0 0 0
FBIOB1 0 11112 0
FBIOC0 0 0 0
FBIOC1 0 0 0
Failing Bits:
B032 B033 B034 B035 B036 B037 B038 B039
Faulty chip is shown here:
This memory chip is K4G80325FB-HC25 – (labelled M6)
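The failing bit names can be decoded to narrow the fault down to a single byte lane. This is a minimal Python sketch. It assumes, as the report above suggests (B032 to B039 are reported against FBIOB1), that each partition is 64 bits wide and split into two 32-bit subpartitions. The helper name `decode_bit` is made up for illustration and is not part of Mats.

```python
def decode_bit(name: str) -> dict:
    """Decode a Mats bit name such as 'B032'.

    Assumption (not from any datasheet): a partition is 64 bits wide,
    made of two 32-bit subpartitions, each with four 8-bit byte lanes.
    """
    partition = name[0]          # e.g. 'B'
    bit = int(name[1:])          # e.g. 32
    return {
        "partition": partition,
        "subpartition": f"FBIO{partition}{bit // 32}",
        "byte_lane": (bit % 32) // 8,
        "bit_in_lane": bit % 8,
    }


failing = "B032 B033 B034 B035 B036 B037 B038 B039".split()
decoded = [decode_bit(b) for b in failing]
subparts = {d["subpartition"] for d in decoded}
lanes = {d["byte_lane"] for d in decoded}
print(subparts, lanes)  # {'FBIOB1'} {0}
```

Under that assumption, every failing bit sits in byte lane 0 of FBIOB1, which matches the subpartition table in the report.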
Update 20/12/2021 – Reflow improvements and new issues!
After reflowing the faulty chip, I thought I had grabbed an unlikely victory: the card not only produced a picture, but the picture seemed clear and the drivers loaded. However, my joy was short-lived. After a few minutes of FurMark, the display broke up and the card crashed! Now, after a reboot, it still gets into Windows (an improvement), but there are mild artefacts and it shows Code 43 in Device Manager. I will retest with Mats to see if there has been any change. Hopefully, the faulty VRAM isn’t hiding a worse problem, such as a faulty memory controller.
Update (later that day) – Code 43 seemingly gone; also runs fine in Kombustor with detuned GPU and memory and increased fan speed!
This has been an interesting development. It could be that I tested too soon after the reflow, but the Code 43 and mild artefacting seem to have gone and Windows loads the drivers. The card also passes Mats memory testing. Not only that, but if the GPU and memory are underclocked and the fan is increased to 70%, stress testing from Afterburner seems stable, or at least can be. The thermals seem bad to me, with a quick ramp-up. This may well correlate with memory, and possibly also GPU, instability under load. It is likely that the reflowed memory chip ‘may appear to work’ but cannot take the stress. There may also be a secondary issue, for example with the GPU as it heats up. As things stand, the card seems fine for desktop usage, which is at least interesting in that it allows further testing.
Next Steps
- Recheck resistances, especially memory.
- Consider replacing the reflowed memory chip (possibly there are also other faulty chips)
Update 05/02/2022 – Replaced faulty VRAM chip
Replaced the faulty memory chip. However, testing was only partially successful. Initially, the card showed a BIOS screen, so I shut down and replaced the cooler. On the next start, there was only a blank screen and backlight. Mats testing revealed that the same memory chip had an issue.
Mats report:
mats version 400.184. Testing GP106 with 20 MB of memory starting with 0 MB.
Read Error Count: 0
Write Error Count: 11112
Unknown Error Count: 0
=== MEMORY ERRORS BY SUBPARTITION ===
SUBPART READ ERRORS WRITE ERRORS UNKNOWN ERRS
FBIOA0 0 0 0
FBIOA1 0 0 0
FBIOB0 0 0 0
FBIOB1 0 11112 0
FBIOC0 0 0 0
FBIOC1 0 0 0
Failing Bits:
B032 B033 B034 B035 B036 B037 B038 B039
=== MEMORY ERRORS BY BIT ===
P : Partition (FBIO)
READ 0 READ 1 READ ?
P BIT READ ERRORS WRITE ERRORS UNKNOWN ERRS EXP. 1 EXP. 0 EXP. ?
B 032 0 80 0 40 40 0
B 033 0 120 0 56 64 0
B 034 0 80 0 40 40 0
B 035 0 120 0 56 64 0
B 036 0 80 0 40 40 0
B 037 0 120 0 56 64 0
B 038 0 11048 0 16 11032 0
B 039 0 120 0 56 64 0
=== MEMORY ERRORS BY ADDRESS ===
ADDRESS : Failing memory address, or buffer offset if starting with 'X+'
T : Type of memory error: W = write, R = read
P : Partition (FBIO)
S : Subpartition
B : Bank
E : Beat
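The per-bit table above is easier to judge once each bit's errors are split by direction. The sketch below is illustrative only. It assumes my reading of the two-line column header is right (first count is "read 0, expected 1", second is "read 1, expected 0"). The 0.9 skew threshold is an arbitrary choice of mine.

```python
# (bit, write_errors, read0_expected1, read1_expected0), copied from the
# "MEMORY ERRORS BY BIT" table above. The column meaning is an assumption.
rows = [
    ("B032", 80, 40, 40),
    ("B033", 120, 56, 64),
    ("B034", 80, 40, 40),
    ("B035", 120, 56, 64),
    ("B036", 80, 40, 40),
    ("B037", 120, 56, 64),
    ("B038", 11048, 16, 11032),
    ("B039", 120, 56, 64),
]


def summarise(rows, skew=0.9):
    """Label each bit by the dominant error direction (threshold is arbitrary)."""
    verdicts = {}
    for bit, errors, read0, read1 in rows:
        # Sanity check: the two directional counts should add up to the total.
        assert read0 + read1 == errors, bit
        if read1 / errors >= skew:
            verdicts[bit] = "mostly reads 1 where 0 was expected"
        elif read0 / errors >= skew:
            verdicts[bit] = "mostly reads 0 where 1 was expected"
        else:
            verdicts[bit] = "mixed"
    return verdicts


verdicts = summarise(rows)
worst = max(rows, key=lambda r: r[1])[0]
for bit, verdict in verdicts.items():
    print(bit, verdict)
print("worst bit:", worst)
```

On these numbers B038 stands out: it accounts for nearly all of the errors and almost always reads back 1 where 0 was written, while the other seven bits fail far less often and in both directions.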
Update 06/02/2022 – Reflowed faulty chip
On the theory that I hadn’t used sufficient heat when replacing the VRAM chip, I decided to see whether a reflow might help. At first this yielded very positive results. Not only was the display back, but:
- The Mats test passed.
- Kombustor passed an HD benchmark.
- However, Heaven started to stutter without crashing, and so did Subnautica, although the whole PC remained stable.
- I repeated the Mats test and then attempted ./mods gputest.js -mfg -dramclk_percent 100 -test 118 -matsinfo -fanspeed 75, which failed with:
Failure(s) :
LOOP TEST CODE MESSAGE
---- ------------------------ ------------ ---------------------------
1 SetPState 000000000143 PCI Express bus error
Error Code = 000000000143 (PCI Express bus error)
(There is a PCI fault with my test motherboard / CPU, which explains this error.)
I decided to take the card apart and replace the old thermal paste ahead of more testing. Unfortunately, this perhaps wrenched the card a little, and now it is back to a blank screen and backlight, and Mats fails again.
Update 16/04/2022
Retesting with Mats. Quite a strange failure pattern; I wouldn’t be surprised if there were an underlying secondary fault with this card:
mats version 400.184. Testing GP106 with 20 MB of memory starting with 0 MB.
Read Error Count: 0
Write Error Count: 11112
Unknown Error Count: 0
=== MEMORY ERRORS BY SUBPARTITION ===
SUBPART READ ERRORS WRITE ERRORS UNKNOWN ERRS
------- ----------- ------------ ------------
FBIOA0 0 0 0
FBIOA1 0 0 0
FBIOB0 0 0 0
FBIOB1 0 11112 0
FBIOC0 0 0 0
FBIOC1 0 0 0
Failing Bits:
B032 B033 B034 B035 B036 B037 B038 B039
=== MEMORY ERRORS BY BIT ===
P : Partition (FBIO)
READ 0 READ 1 READ ?
P BIT READ ERRORS WRITE ERRORS UNKNOWN ERRS EXP. 1 EXP. 0 EXP. ?
- --- ----------- ------------ ------------ ------ ------ ------
B 032 0 80 0 40 40 0
B 033 0 120 0 56 64 0
B 034 0 80 0 40 40 0
B 035 0 120 0 56 64 0
B 036 0 80 0 40 40 0
B 037 0 120 0 56 64 0
B 038 0 11048 0 16 11032 0
B 039 0 120 0 56 64 0
=== MEMORY ERRORS BY ADDRESS ===
ADDRESS : Failing memory address, or buffer offset if starting with 'X+'
T : Type of memory error: W = write, R = read
P : Partition (FBIO)
S : Subpartition
B : Bank
E : Beat
ADDRESS EXPECTED ACTUAL REREAD1 REREAD2 FAILBITS TPSBE ROW COL BIT(s)
------- -------- ------ ------- ------- -------- ----- --- --- ------
0000fbff5c 00000000 000000ff 000000ff 000000ff 000000ff WB147 0029 07a B032,B033,B034,B035,B036,B037,B038,B039
0000fbff58 00000000 000000ff 000000ff 000000ff 000000ff WB146 0029 07a B032,B033,B034,B035,B036,B037,B038,B039
0000fbff54 00000000 000000ff 000000ff 000000ff 000000ff WB145 0029 07a B032,B033,B034,B035,B036,B037,B038,B039
...
0000f416d8 55555555 555555ff 555555ff 555555ff 000000aa WB1f6 0028 05e B033,B035,B037,B039
0000f416d4 55555555 555555ff 555555ff 555555ff 000000aa WB1f5 0028 05e B033,B035,B037,B039
0000f416d0 55555555 555555ff 555555ff 555555ff 000000aa WB1f4 0028 05e B033,B035,B037,B039
00011ac5dc 55555555 555555ff 555555ff 555555ff 000000aa WB177 002f 026 B033,B035,B037,B039
00011ac5d8 55555555 555555ff 555555ff 555555ff 000000aa WB176 002f 026 B033,B035,B037,B039
00011ac5d4 55555555 555555ff 555555ff 555555ff 000000aa WB175 002f 026 B033,B035,B037,B039
...
0000fbff5c 55555555 555555ea 555555ea 555555ea 000000bf WB147 0029 07a B032,B033,B034,B035,B036,B037,B039
0000fbff58 55555555 555555ea 555555ea 555555ea 000000bf WB146 0029 07a B032,B033,B034,B035,B036,B037,B039
0000fbff54 55555555 555555ea 555555ea 555555ea 000000bf WB145 0029 07a B032,B033,B034,B035,B036,B037,B039
0000fbff50 55555555 555555ea 555555ea 555555ea 000000bf WB144 0029 07a B032,B033,B034,B035,B036,B037,B039
0000fbbf5c 55555555 555555ea 555555ea 555555ea 000000bf WB147 0029 04a B032,B033,B034,B035,B036,B037,B039
0000fbbf58 55555555 555555ea 555555ea 555555ea 000000bf WB146 0029 04a B032,B033,B034,B035,B036,B037,B039
0000fbbf54 55555555 555555ea 555555ea 555555ea 000000bf WB145 0029 04a B032,B033,B034,B035,B036,B037,B039
0000fbbf50 55555555 555555ea 555555ea 555555ea 000000bf WB144 0029 04a B032,B033,B034,B035,B036,B037,B039
...
0000a38f60 aaaaaaaa aaaaaaea aaaaaaea aaaaaaea 00000040 WB180 001b 00b B038
0000a38f64 aaaaaaaa aaaaaaea aaaaaaea aaaaaaea 00000040 WB181 001b 00b B038
0000a38f68 aaaaaaaa aaaaaaea aaaaaaea aaaaaaea 00000040 WB182 001b 00b B038
0000a38f6c aaaaaaaa aaaaaaea aaaaaaea aaaaaaea 00000040 WB183 001b 00b B038
If you are getting failure for first MB of FB then try option -no_scan_out
Error Code = 00000001
[ASCII-art result banner: FAIL]
My VRAM replacement skills are hopefully much better now, so I will try a reflow followed by a VRAM replacement.
Approach
- Reflow – Failed. Still the same blank screen; I didn’t have high hopes.
- Remove/Clean/Replace – I am generally much more efficient at cleaning pads these days, so I hoped this would make a difference. It didn’t work: same error. I have probably wasted a couple of chips now, and these aren’t as cheap as some others!
There is something deeper going on here, possibly the memory controller, as the pattern is unusual:
- Only the first 8 bits (one byte, B032 to B039) of FBIOB1 are failing. It would be good to see a datasheet; perhaps there is a faulty connection?
- Also, the card did once output a clear picture. So perhaps the memory controller / GPU is OK?
- The pads appeared OK under a magnifying glass; it is a shame I didn’t check under the microscope.
I am going to have to come back to this card; no easy fix this time.
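To check the single-byte observation against the by-address rows in the Mats report, the FAILBITS column can be recomputed as EXPECTED XOR ACTUAL. This is a small Python sketch. The function name and the bit offset of 32 (for subpartition 1 of partition B) are my own assumptions for illustration; the sample values are copied from the report.

```python
def failing_bits(expected: str, actual: str, partition: str = "B", bit_offset: int = 32):
    """Return (xor mask, bit names) for one 32-bit word.

    bit_offset=32 assumes the word belongs to subpartition 1 (FBIOB1).
    """
    diff = int(expected, 16) ^ int(actual, 16)
    names = [f"{partition}{bit_offset + i:03d}" for i in range(32) if (diff >> i) & 1]
    return diff, names


# (EXPECTED, ACTUAL) pairs taken from the report rows.
samples = [
    ("00000000", "000000ff"),
    ("55555555", "555555ff"),
    ("55555555", "555555ea"),
    ("aaaaaaaa", "aaaaaaea"),
]

results = []
for exp, act in samples:
    diff, names = failing_bits(exp, act)
    results.append((diff, names))
    print(f"{exp} -> {act}: failbits={diff:08x} {','.join(names)}")
```

The recomputed masks (ff, aa, bf, 40) match the FAILBITS column. In these samples the upper three bytes always read back correctly, while the low byte comes back as either ff or ea whatever was written.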
I have an MSI GTX 1060 OC doing the same thing, giving me memory errors on FBIOB0. I did try a hot air reflow specifically on that VRAM chip and thought I had it solved, but like you the victory was short-lived and the error came back after the card had been on for a short time. I only have 33 errors compared to your 11112, so I can still use this GPU for everyday use, but it will not work for any rendering.
Did you happen to solve the issue with yours yet? Please let me know, as I am sure it is the same with me.
PS. New thermal paste and pads did not seem to help in my case after I hit it with the hot air, although my GPU temp is around 27-28C.
Hi, ok, thanks for the comment.
I am told that ’33 errors’ in MATS indicate a software error within MATS and can be ignored.
You may have a different type of issue. If you are still seeing memory-related errors, e.g. artefacts, then I would recommend replacing the faulty chip rather than reflowing it, or possibly a re-ball, although both are more effort than simply reflowing the chip. Does it simply crash under load? That can also point to the core BGA or to VRM stability.
In my card’s case, I have put it to one side because I strongly suspect that there are faulty solder joints under the GPU core. When I get a chance, I will remove the VRAM chip and measure some of the data lanes to see if I can spot proof of this. I cannot yet re-ball cores, so a fix will be a while.
Thanks