I wrote this post before I had heard about the Waluigi Effect, but I will keep it here because none of the people interested in alignment to whom I have mentioned this concern said to me ‘oh, that is the Waluigi Effect’, so it is clearly not well-known!
Thinking about alignment recently, I have been bothered by something that I am thinking of as ‘The Bit Flip Problem’.
I can’t find the original story, but I heard about some developers who were working on LLMs. There was some code to check that the output met the various guardrails they had put in place to ensure it was ‘safe’. However, a developer accidentally switched the check that the guardrails had been met from ‘True’ to ‘False’, and out the LLM spewed invective and other inappropriate content. Changing one bit from a 1 to a 0 had totally reversed the ‘safety’ of the LLM.
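To make the fragility concrete, here is a minimal sketch of the kind of bug described above. The filter and all the names are hypothetical; the point is only that inverting one boolean check releases exactly the outputs the guardrails were meant to block.

```python
# Hypothetical sketch of a guardrail check, not any real system.
def passes_guardrails(text: str) -> bool:
    """Return True if the output is judged 'safe' by a stand-in content filter."""
    banned_words = {"invective", "slur"}
    return not any(word in text.lower() for word in banned_words)

def release_output(text: str) -> str:
    # Intended check: only release text that passes the guardrails.
    # Flipping this condition to `not passes_guardrails(text)` would
    # release precisely the unsafe outputs and withhold the safe ones.
    if passes_guardrails(text):
        return text
    return "[output withheld]"

print(release_output("Here is some helpful advice."))
print(release_output("Some invective aimed at the user."))
```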
I heard a similar story about how easy it was to switch an AI algorithm for discovering medicines to one for discovering toxins and poisons.
Any solution to alignment which relies on being able to answer a question along the lines of ‘Is the following action safe?’ or ‘Is the following action aligned?’ could hit the same problem and be equally fragile. Likewise, any utility function can easily be reversed.
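As a toy illustration of that reversal (purely hypothetical actions and numbers), negating a utility function turns an optimiser that picks the best action into one that picks the worst, and the change is a single sign:

```python
# Hypothetical toy utilities over a handful of actions.
utilities = {"help_user": 1.0, "do_nothing": 0.0, "cause_harm": -1.0}

def utility(action: str) -> float:
    return utilities[action]

best = max(utilities, key=utility)                    # 'help_user'
flipped = max(utilities, key=lambda a: -utility(a))   # one sign flip: 'cause_harm'
print(best, flipped)
```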
It feels like we need a solution to alignment where there isn’t a symmetry that can be easily reversed. But how do we build a system without that symmetry when the concept of alignment is a rather arbitrary one?
I think this is why alignment ideas related to minimising power seeking and minimising side effects are interesting. I feel they might be a route into getting that asymmetry somehow.
Humans don’t seem to have this fragility, as far as I can make out. I wonder if the same systems that keep us alive are also the ones that make most of us ‘aligned’. Perhaps caring about other people and caring about our own survival are intertwined in a way that comes not from a single ‘bit’ but from a whole structure that isn’t easily reversible. Obviously there are psychopaths and the like, but it doesn’t feel like it is that easy to flip a human into a non-aligned state.
Re answering the question ‘Is the following action safe?', perhaps the kind of structure that would be important is simply the probabilistic flow chart of the possible consequences of that action. The consequences could then be assessed for safety as a whole. It wouldn't be possible to assess (again, probabilistically) an infinity of possible consequences, many of which are contingent upon others and upon unknowns, so there would have to be some kind of heuristic to obtain a ballpark determination of the safety, or otherwise, of the action the AI is considering.
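As a rough sketch of what I mean (the tree structure, the numbers and the depth cut-off are all illustrative assumptions, not a proposal), the consequences could form a small probability-weighted tree, truncated at some depth as the heuristic, and scored as a whole:

```python
# Illustrative sketch: a 'flow chart' of possible consequences of one action,
# each branch with a probability and a local safety score in [0, 1],
# assessed as a whole with a depth-limited heuristic.
from dataclasses import dataclass, field

@dataclass
class Consequence:
    description: str
    probability: float                      # probability of this branch given its parent
    safety: float                           # local safety score in [0, 1]
    children: list["Consequence"] = field(default_factory=list)

def expected_safety(node: Consequence, depth: int = 3) -> float:
    """Probability-weighted safety of the tree, truncated at `depth`
    as a stand-in for the ballpark heuristic mentioned above."""
    if depth == 0 or not node.children:
        return node.safety
    return sum(c.probability * expected_safety(c, depth - 1) for c in node.children)

action = Consequence("send the email", 1.0, 0.9, [
    Consequence("recipient is helped", 0.8, 0.95),
    Consequence("recipient is misled", 0.2, 0.3, [
        Consequence("mistake is caught and corrected", 0.5, 0.7),
        Consequence("harm propagates", 0.5, 0.1),
    ]),
])

print(f"ballpark safety: {expected_safety(action):.2f}")
```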