So in the first post, it looks like OP just dumped a bunch of sand on the lower level. At this point, I don't know whether he was mixing cell capacities in that level or what configuration it might have been in. We don't know if the pictures of the disassembled cells are in the order that they were installed, but perhaps they are, and to the extent that they all look very similar, maybe it doesn't matter.
If the cells were brushed off (with, say, a brush or broom), we would have a better idea as to whether the crud on all the cells is bonded silica (i.e. sand) or burned grease from the anti-corrosion treatment. It looks like it has bonded to the cells, which is probably the result of elevated heat. That would be indicative of high but balanced currents, which in turn suggests that all the cells were involved in a similar current loop (i.e. one bank discharging into the other). If this is the case, then a BMS with cutoff would have stopped the big current loop but would have done little for an internal short within the single cell.
Without any further clarification, I can summarize what it looks like:
1.) One post on one cell was basically destroyed, indicating an imbalance in the cell current flow and a likely short in and around that post. Any current flowing into that damaged post would have to be balanced by the current leaving the opposite post. Since the opposite post is not damaged, there was obviously much more heat generated at the damaged post than at the opposite one.
Now one could claim that it is a dirty post that caused elevated heat, but that does not explain why all the other cells seem to have dumped their energy into this cell. Nor does it even explain why there would have been elevated currents which are clearly indicated by the state of all of the posts. The sand appears to be bonded/heat oxidized to all the posts and not simply sitting on top.
This is not from a normal load charge/discharge cycle. It is indicative of elevated currents through all cells. However, although this fire may have, in the OP's own words, "been hotter than a bakery oven", it did not even melt the aluminum cases, so it did not get much above the melting point of AL (roughly 1100 to 1220 deg F depending on the alloy) except in the direct proximity of the burned post. The evidence points to a long, sustained drain of energy from many if not all the cells into the short at the single post.
2.) Had there been separate BMSs with disconnects on the parallel strings, the big energy dump would NOT have occurred. The single cell that shorted would have expended itself, and that string would have been disconnected by the BMS. The fire might have lasted for 1 hour instead of 24 hours, involved only that single cell, and left the remaining cells largely intact if not totally unaffected.
We can conclude this because, with all the energy that was expended, even the damaged cell still only melted a small portion of its thin aluminum case! The crud on all the other cells strongly suggests that they were all under high load and dumped their energy into this single post. As such, this post was subjected to much higher thermal stress than any other, yet that sustained current dump still did relatively minor damage to the cell. Essentially, if the cell had been so damaged that it could not support a current, then the source of the energy release would have been removed.
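To put rough numbers on the difference, here is a minimal sketch comparing the energy that could feed a single shorted cell with and without per-string disconnects. The bank layout (4 parallel strings of 4 cells), the 280 Ah capacity, and the 3.2 V nominal voltage are purely hypothetical placeholders, since we don't know the OP's actual configuration, and the model ignores resistance and state of charge and simply totals stored energy.

```python
# Rough upper-bound comparison, NOT a simulation of the actual incident.
# All pack numbers below are hypothetical placeholders.

NOMINAL_CELL_VOLTAGE = 3.2  # volts, typical LiFePO4 nominal value


def stored_energy_wh(cell_count, capacity_ah, nominal_v=NOMINAL_CELL_VOLTAGE):
    """Nominal stored energy of a group of identical cells, in watt-hours."""
    return cell_count * capacity_ah * nominal_v


def energy_available_to_fault(strings, cells_per_string, capacity_ah,
                              per_string_disconnect):
    """Upper bound on the energy that can end up in one shorted cell.

    With a disconnect on every parallel string, the faulted string is
    isolated, so only the shorted cell's own energy is released.
    Without disconnects, every cell in the bank can feed the fault.
    """
    if per_string_disconnect:
        return stored_energy_wh(1, capacity_ah)
    return stored_energy_wh(strings * cells_per_string, capacity_ah)


if __name__ == "__main__":
    # Hypothetical bank: 4 parallel strings of 4 cells, 280 Ah each.
    unprotected = energy_available_to_fault(4, 4, 280, per_string_disconnect=False)
    protected = energy_available_to_fault(4, 4, 280, per_string_disconnect=True)
    print(f"No disconnects:   up to {unprotected:.0f} Wh can feed the short")
    print(f"With disconnects: about {protected:.0f} Wh (the shorted cell only)")
    print(f"Ratio: {unprotected / protected:.0f}x")
```

The ratio is simply the number of cells in the bank, which is the point: without disconnects, the size of the event scales with the whole bank instead of with one cell.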
Finally, I feel compelled to make the following statements not to disparage the OP, but more to provide some explanations for any other DIY'er that may look at the thread in horror and be dissuaded from building their own DIY battery bank.
Caution with LiFePO4 is warranted, but it also needs to be understood that the OP violated some fundamental best practices (i.e. no BMS with an automatic current disconnect). The violation of this best practice seems to be the primary contributor to the cascading meltdown of something that would otherwise have been easily confined to a single cell. While the OP seems to be very diligent in his efforts to construct this battery system, those efforts are based on what amounts to a trial and error process where error means "smoke release" from the electronics.
There are other best practices being violated (e.g. spaghetti wiring) which in my estimation did not contribute to this meltdown, but that doesn't mean they won't create some other problem further down the road. One of the biggest problems with trial and error development (one not involving understanding the engineering and designing to mitigate risks) is the presumption that something is safe because it has not failed yet, when in fact it just has not bitten you yet. I think this episode is an example of that, where the potential danger of running without a BMS was not fully appreciated.
In the engineering domain, Failure Mode and Effects Analysis (FMEA) is part of a set of standard practices for analyzing what can go wrong or fail and, in the event of those failures, what their effects are predicted to be. A cursory FMEA of parallel strings of cells will quickly lead to the realization that unprotected (i.e. no BMS) parallel strings have the potential to dump all their energy into a single shorted cell.
For when to use FMEA and the general procedure to follow, with an example, see the FMEA overview at asq.org.
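For anyone who has not seen one, here is a minimal sketch of what a cursory FMEA worksheet for a DIY bank might look like, using the common Risk Priority Number convention (RPN = severity x occurrence x detection, each rated 1 to 10, where a higher detection rating means the fault is harder to catch). The rows and every rating in them are my own illustrative guesses, not measured data and not a completed analysis of the OP's system.

```python
from dataclasses import dataclass


@dataclass
class FailureMode:
    item: str
    failure_mode: str
    effect: str
    severity: int    # 1 (negligible) .. 10 (catastrophic)
    occurrence: int  # 1 (remote) .. 10 (almost certain)
    detection: int   # 1 (always caught) .. 10 (never caught before harm)

    @property
    def rpn(self):
        """Risk Priority Number: severity x occurrence x detection."""
        return self.severity * self.occurrence * self.detection


# Illustrative rows only; all ratings are placeholder guesses.
worksheet = [
    FailureMode("Cell, parallel strings, no BMS disconnect",
                "Internal short",
                "Other strings discharge into the shorted cell",
                severity=10, occurrence=2, detection=9),
    FailureMode("Cell, parallel strings, BMS disconnect per string",
                "Internal short",
                "Faulted string isolated; one cell expends itself",
                severity=6, occurrence=2, detection=3),
    FailureMode("Terminal post",
                "Dirty or loose connection",
                "Local heating at the post",
                severity=7, occurrence=4, detection=5),
]

# Highest RPN first: these are the risks to mitigate first.
ranked = sorted(worksheet, key=lambda fm: fm.rpn, reverse=True)

if __name__ == "__main__":
    for fm in ranked:
        print(f"RPN {fm.rpn:4d} | {fm.item} | {fm.failure_mode} -> {fm.effect}")
```

Even with made-up ratings, the exercise forces you to write down the effect column, and that is where the unprotected parallel string problem shows up.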
Hopefully, this provides some potential explanations for the incident, and I think it is far more productive than coming up with funny names for this thread. We have to appreciate the OP's willingness to share this disaster in the hopes of avoiding similar things, both for himself and for others in the future. Any form of jokes/comedy is, in my opinion, misplaced, inappropriate, and immature. It ultimately runs counter to the stated goal of this forum, which is sharing information, and it will dissuade future OPs from sharing their failures.