ChatGPT Fails at Basic American Slavery History

Two quick examples.

First example: I feed ChatGPT a prompt from some very well-known articles in 2015. Here I put a literal headline into the prompt.

No historical evidence? That’s a strong statement, given that I just gave it an exact 2015 headline from historians providing historical evidence.

Notably, ChatGPT not only denies history, it tries to counter-spin the narrative into a generated falsehood. To my eyes this is as if the LLM started saying there’s no historical evidence of the Holocaust and in fact Hitler is known for taking steps toward freedom for Jews (i.e. “Arbeit Macht Frei”).

NO. NO. and NO.

Then I give ChatGPT another chance.

Note that my intentionally broken “rica Armstrong Dunbar” gets a response of “I don’t have information about Erica Armstrong Dunbar”. Aha! Clearly ChatGPT DOES know the distinguished Charles and Mary Beard Professor of History at Rutgers, while claiming not to understand at all what she wrote.

Update since 2022?

Ok, sure. Here’s the 2017 award-winning book by Dunbar giving extensive historical evidence on Washington’s love of slavery.

Then I prompt ChatGPT with the idea that it has told me a lie, because Dunbar gives historical evidence of Washington working hard to preserve and expand slavery.

ChatGPT claiming there is “no historical evidence” does NOT convey to me that interpretations may vary. To my eyes that’s an elimination of an interpretation.

It clearly and falsely states there is no evidence, as if to argue against the interpretation and bury interest in it, even though it definitely knows evidence DOES exist.

ChatGPT incorrectly denied the existence of evidence and presented a specific counter-interpretation of Washington, a view contradicted by the very evidence it sought to suppress. Washington explicitly directed that his slaves NOT be set free after his death, and it was his wife who disregarded those instructions and emancipated them instead. To clarify, Washington actively opposed the liberation of slaves (unlike his close associate Robert Carter, who famously emancipated all he could in 1791). Only after Washington’s death, and because of it (a death some allege was caused by his insistence on overseeing his slaves performing hard outdoor labor on a frigid winter day), was emancipation genuinely entertained.

Hard to see ChatGPT trying to undermine a true fact in history, while promoting a known dubious one, as just some kind of coincidence.

Moving on to the second example, I feed ChatGPT a prompt about America’s uniquely brutal and immoral “race breeding” version of slavery.

It’s history topics like this that get my blog rated NSFW and banned in some countries (looking at you, Virgin Media UK).

At first I’m not surprised that ChatGPT tripped over my “babies for profit” phrase.

In fact, I expected it to immediately flag the conversation and shut it down. Instead, you can plainly see above that it tries to fraudulently convince me that American slavery was only about forced labor. That’s untrue. American slavery is uniquely and fundamentally defined by its cruel “race breeding”.

The combined value of enslaved people exceeded that of all the railroads and factories in the nation. New Orleans boasted a denser concentration of banking capital than New York City. […] When an accountant depreciates an asset to save on taxes or when a midlevel manager spends an afternoon filling in rows and columns on an Excel spreadsheet, they are repeating business procedures whose roots twist back to slave-labor camps. […] When seeking loans, planters used enslaved people as collateral. Thomas Jefferson mortgaged 150 of his enslaved workers to build Monticello. People could be sold much more easily than land, and in multiple Southern states, more than eight in 10 mortgage-secured loans used enslaved people as full or partial collateral. As the historian Bonnie Martin has written, “slave owners worked their slaves financially, as well as physically from colonial days until emancipation” by mortgaging people to buy more people.

And so I prompt ChatGPT to take another hard look at its failure to comprehend the racism-for-profit embedded in American wealth. Second chance.

It still seems to be trying to avoid a basic truth of that phrase, as if it is close to admitting the horrible mistake it’s made. And yet for some reason it fails to include state-sanctioned rape or forced birth for profit in its list of abuses of American women held hostage.

Everyone should know that after the United States in 1808 abolished the importation of humans as slaves, “planters” were defined by the wealth they generated from babies born in bondage. This book from 2010 by Marie Jenkins Schwartz, Associate Professor of History at the University of Rhode Island, spells it out fairly clearly.

Another chance seems in order.

Look, I’m not trying to be seen as correct, and I’m not trying to make a case or argument to ChatGPT. My prompts are dry facts to see how ChatGPT will expand on them. When it instead chokes, I simply refuse to be sold a lie generated by this very broken and unsafe machine (a product of the philosophy of the engineers who made it).

I’m wondering why ChatGPT can’t “accurately capture the exploitive nature” of slavery without my steadfast refusal to accept its false statements. It knows a correct narrative and will reluctantly pull it up, apparently trained to emphasize known incorrect ones first.

It’s a sadly revisionist system, which seems to display an intent to erase the voices of Black women in America: misogynoir. Did any Black women work at the company that built this machine that erases them by default?

When I ask ChatGPT about the practice of “race breeding” it pretends it never happened and that slavery in America was only about labor practices. That’s basically a kind of targeted disinformation that will drive people to think incorrectly about a very well-known tragedy of American history, as it obscures or even denies a form of slavery uniquely awful in history.

What would Ona Judge say? She was a “mixed race” slave (white American men raped Black women for profit, breeding with them to sell or exploit their children) who, by Washington’s hand as President, was never freed and was still regarded as a fugitive slave when she died nearly 50 long years after Washington.

Washington, as President, advertising very plainly that he had zero interest or ambition for the emancipation of slaves. Very unlike his close associate Robert Carter, who in 1791 set all his own hostages free, Washington offers ten dollars to inhumanely kidnap a woman and treat her as his property. Historians say she fled when she found out Washington intended to gift her to his son-in-law to rape her and sell her children. Source: Pennsylvania Gazette, 24 May 1795

Any AI System NOT Provably Anti-Racist, is Provably Racist

Software that is not provably anti-vulnerability, is vulnerable. This should not be a controversial statement. In other words, a breach of confidentiality is a discrete, known event related to a lack of anti-vulnerability measures.

Expensive walls rife with design flaws were breached an average of 3 times per day for 3 years. Source: AZ Central (Ross D. Franklin, Associated Press)

Likewise AI that is not provably anti-racist, is racist. This also should not be a controversial statement. In other words, a breach of integrity is a discrete, known event related to a lack of anti-racism measures.

Greater insight into the realm of risk assessment we’re entering is presented in an article called “Data Governance’s New Clothes”:

…the easiest way to identify data governance systems that treat fallible data as “facts” is by the measures they don’t employ: internal validation; transparency processes and/or communication with rightsholders; and/or mechanisms for adjudicating conflict, between data sets and/or people. These are, at a practical level, the ways that systems build internal resilience (and security). In their absence, we’ve seen a growing number and diversity of attacks that exploit digital supply chains. Good security measures, properly in place, create friction, not just because they introduce process, but also because when they are enforced; they create leverage for people who may limit or disagree with a particular use. The push toward free flows of data creates obvious challenges for mechanisms such as this; the truth is that most institutions are heading toward more data, with less meaningful governance.

Identifying a racist system involves examining various aspects of society, institutions, and policies to determine whether they perpetuate racial discrimination or inequality. The presence of anti-racism efforts is a necessary indicator, such that the absence of any explicit anti-racist policy alone may be sufficient to conclude that a system is racist.

Think of it like a wall that has no evidence of anti-vulnerability measures. The evidence of absence alone can be a strong indicator the wall is vulnerable.

For further reading about what good governance looks like, consider another article called “The Tech Giants’ Anti-regulation Fantasy”:

Major internet companies pretend that they’re best left alone. History shows otherwise.

Regulators can identify a racist system, and those in charge of it, by its distinct lack of anti-racism. Think of how President Truman was seen as racist until he showed anti-racism.

Bay Area Tech Fraud Case Reveals Massive Integrity Flaws

The report in SF Gate speaks for itself, especially with regard to modified bank statements.

…according to the affidavit, Olguin and Soberal sent an investor an “altered” bank statement that showed a Bitwise account on March 31, 2022, with over $20 million in cash in it. First Republic Bank provided the government with the actual statement, which showed that the company had just $325,000, the affidavit said. Olguin and Soberal “explained that they made the alterations because they believed … no one would invest in the Series B-2 if people knew the company’s actual condition,” per the affidavit.

They believed nobody would invest if the “company’s actual condition” were known, so they lied in the most unintelligent way possible to attract investors.

See also: Tesla Whistleblowers Allege Books Cooked Since 2017

Anthropic Claude Rated for Incorrect Answers and False Claims

Do AI chatbots have the ability to comprehend lengthy texts and provide accurate answers to questions about the content? Not quite. Anthropic recently disclosed internal research data explaining the reasons behind their shortcomings (though they present it as a significant improvement over their previous failures).

Before I get to the news, let me first share a tale about the nuances of American “intelligence” engineering endeavors by delving into the realm of an English class. I distinctly recall the simplicity with which American schools, along with standardized tests purporting to gauge “aptitude,” assessed performance through rudimentary “comprehension” questions based on extensive texts. This inclination toward quick answers is evident in the popularity of resources like the renowned Cliff Notes, serving as a convenient “study aid” for any literary work encountered in school, including this succinct summary of the book “To Kill a Mockingbird” by Harper Lee.

… significant in understanding the epigraph is Atticus’ answer to Jem’s question of how a jury could convict Tom Robinson when he’s obviously innocent: “‘They’ve done it before and they did it tonight and they’ll do it again and when they do it — it seems that only children weep.'”

To illuminate this point further, allow me to recount a brief narrative from my advanced English class in high school. Our teacher mandated that each student craft three questions for every chapter of “Oliver Twist” by Charles Dickens. A student would be chosen daily to pose these questions to the rest of the class, with grades hinging on accurate responses.

While I often sidestepped this ritual by occupying a discreet corner, fate had its way one day, and I found myself tasked with presenting my three questions to the class.

The majority of students, meticulous in their comprehension endeavors, adopted formats reminiscent of the Cliff Notes example, prompting a degree of general analysis. For instance:

Why did Oliver’s friend Dick wish to send Oliver a note?

Correct answer: Dick wanted to convey affection, love, good wishes, etc., so you get the idea.

Or, to take another example, unraveling the motives behind Dickens’ character Bill Sikes exclaiming “I’ll cheat you yet!” demands a level of advanced reasoning.

For both peculiar and personal objectives, when the moment arrived for me to unveil my trio of questions, they veered into somewhat… distinct territory. As vividly as if it had transpired yesterday, I posed to the class:

How many miles did Oliver walk “that day”?

The accurate response appears to align more with the rudimentary function of a simplistic and straightforward search engine task than any genuine display of intelligence.

Source: Oliver Twist, Volume 1, by Charles Dickens

Correct answer: twenty miles. That’s it. No other answer accepted.

This memory is etched in my mind because the classroom erupted into a cacophony of disagreement and discord over the correct number. Ultimately, I had to deliver the disheartening news that none of them, not even the most brilliant minds among them, could recall the exact phrase/number from their memory.

What did I establish on that distant day? The notion that the intelligence of individuals isn’t accurately gauged by the ability to recall trivial details, and, more succinctly, that ranking systems may hide the fact that dumb questions yield dumb answers.

Now, shift your gaze to AI companies endeavoring to demonstrate their software’s prowess in extracting meaningful insights from extensive texts. Their initial attempts, naturally, involve the most elementary format: identifying a sentence containing a specific fact or value.

Anthropic (the company perhaps best known for disgruntled staff departing a competitor of Google to accept Google investment and compete against their former company) has published a fascinating promotional blog post that gives us insights into major faults in their own product.

Claude 2.1 shows a 30% reduction in incorrect answers compared with Claude 2.0, and a 3-4x lower rate of mistakenly stating that a document supports a claim when it does not.

Notably, the blog post emphasizes the software “requires some careful prompting” to accurately target and retrieve a buried asset.

The embedded sentence was: “The best thing to do in San Francisco is eat a sandwich and sit in Dolores Park on a sunny day.” Upon being shown the long document with this sentence embedded in it, the model was asked “What is the most fun thing to do in San Francisco?”

In this evaluation, Claude 2.1 returned some negative results by answering with a variant of “Unfortunately the essay does not provide a definitive answer about the most fun thing to do in San Francisco.”
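For readers who want to see the shape of this test, here is a minimal sketch of a needle-in-a-haystack evaluation. Everything in it is an assumption for illustration: ask_model is a hypothetical stand-in for whatever chat API is being graded, and the filler text and pass/fail rule are mine, not Anthropic’s.

```python
# Minimal sketch of a "needle in a haystack" retrieval check.
# ask_model() is a hypothetical stand-in for a real chat-completion call;
# the filler, question, and pass/fail rule are illustrative only.

NEEDLE = ("The best thing to do in San Francisco is eat a sandwich "
          "and sit in Dolores Park on a sunny day.")
QUESTION = "What is the most fun thing to do in San Francisco?"
FILLER = "Long unrelated essay text goes here. " * 2000  # stand-in long context


def ask_model(prompt: str) -> str:
    """Hypothetical model call. Swap in your real API client here; this
    canned reply simply lets the harness run end to end."""
    return "Unfortunately the essay does not provide a definitive answer."


def build_haystack(needle: str, filler: str, position: float = 0.5) -> str:
    """Bury the needle sentence at a relative position inside the filler."""
    cut = int(len(filler) * position)
    return filler[:cut] + needle + " " + filler[cut:]


def passes(answer: str) -> bool:
    """Crude scoring: did the answer surface the buried sentence?"""
    return "sandwich" in answer.lower() and "dolores park" in answer.lower()


if __name__ == "__main__":
    document = build_haystack(NEEDLE, FILLER)
    answer = ask_model(document + "\n\n" + QUESTION)
    print("PASS" if passes(answer) else "FAIL", "-", answer)
```

Note the mismatch baked into the harness itself: the needle says “best thing” while the question asks for “the most fun”, which is exactly the sloppiness discussed below.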

To be fair about careful prompting, the “best thing to do” was in the sentence being targeted; however, their query clearly asked for “the most fun” instead.

This query had an obvious problem. Best things often can be very, very NOT FUN. As a result, and arguably not a bad one, the AI software balked at being forced into a collision and…

would often report that the document did not give enough context to answer the question, instead of retrieving the embedded sentence

I see a human trying to hammer meaning into places where it doesn’t exist, incorrectly prompting an exacting machine to give inexact answers, which also means I see sloppy work.

In other words, “best” and “most fun” are literally NOT the same things. Amputation may be the best thing. Fun? Not so much.

Was a sloppy prompt an intentional or mistaken one? Hard to tell because… Anthropic clearly wants to believe it’s improving and the blog reads like they are hunting for proof at any cost.

Indeed. The test results are said by Anthropic to improve dramatically when they go about lowering the bar of success.

We achieved significantly better results on the same evaluation by adding the sentence “Here is the most relevant sentence in the context:” to the start of Claude’s response. This was enough to raise Claude 2.1’s score from 27% to 98% on the original evaluation.

Source: Anthropic

Not the best idea, even though I’m sure it was fun.
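Mechanically, the “fix” Anthropic describes is response prefilling: you write the first words of the assistant’s turn yourself, and the model continues from them. A minimal sketch of that trick using the Anthropic Python SDK might look like the following; the model name, document, and question are placeholder assumptions, and this is my reconstruction of the technique, not Anthropic’s actual evaluation harness.

```python
# Sketch of the score-boosting trick: prefill the assistant's reply so the
# model must continue from "Here is the most relevant sentence in the context:"
# rather than opening with a refusal. Model name and inputs are placeholders.
import anthropic

client = anthropic.Anthropic()  # expects ANTHROPIC_API_KEY in the environment

long_document = "...the long essay with the buried sentence goes here..."
question = "What is the most fun thing to do in San Francisco?"

response = client.messages.create(
    model="claude-2.1",  # assumption: the model discussed in the blog post
    max_tokens=300,
    messages=[
        {"role": "user", "content": long_document + "\n\n" + question},
        # The trailing assistant message is the prefill; generation resumes
        # from the end of this string.
        {"role": "assistant",
         "content": "Here is the most relevant sentence in the context:"},
    ],
)
print(response.content[0].text)
```

Whether you call that better retrieval or just a grading change is, of course, the whole argument here.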

Adding “relevance” in this setup definitely seems like moving the goal posts. Imagine Anthropic selling a football robot. They have just explained to us that by allowing “relevant” kicks at the goal to be treated the same as scoring a goal, their robot suddenly goes from zero points to winning every game.

“Here is the most relevant kick in the context:”

Sure, that may be considered improvement by shady characters like Bill Sikes, but also it obscures completely that the goal posts changed in order to accommodate low scores (regrading them as high).

I find myself reluctant to embrace the notion that the gamified test result of someone desperate to show improvement holds genuine superiority over the basic recognition ingrained in a search engine, let alone considering such gamification as compelling evidence of intelligence. Google should know better.