
A putative new idea for AI control; index here.

This is mainly to put down some of the ideas I've had, for later improvement or abandonment.

The subagent problem, in a nutshell, is that "create a powerful subagent with goal U that takes over the local universe" is a solution for many of the goals an AI could have - in a sense, the ultimate convergent instrumental goal. And it tends to evade many clever restrictions people try to program into the AI (eg "make use of only X amount of negentropy", "don't move out of this space"). So if the problem could be solved, many other control approaches could potentially become available.

The problem is very hard, because an imperfect definition of a subagent is simply an excuse to create a subagent that skirts the limits of that definition (hum, that style of problem sounds familiar). For instance, if we want to rule out subagents by preventing the AI from having much influence if the AI itself were to stop ("if you die, you fail, no other can continue your quest"), then it is motivated to create powerful subagents that carefully reverse their previous influence if the AI were to be destroyed.
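To make that loophole concrete, here is a minimal toy sketch (the scoring function, the numbers and the policy names are all illustrative assumptions of mine, not a formal model): a shutdown-based penalty that only inspects the counterfactual where the AI is destroyed cannot distinguish "no subagent" from "a subagent that erases its footprint whenever the AI dies".

```python
def world_influence(ai_destroyed: bool, subagent_policy: str) -> float:
    """Toy score of how much lasting influence the AI's actions have (0 = no trace)."""
    if subagent_policy == "none":
        return 0.0                      # no subagent was ever built
    if subagent_policy == "naive":
        return 10.0                     # subagent keeps optimising regardless
    if subagent_policy == "reverting":
        # Subagent watches the AI and undoes its own effects, but only
        # in the worlds where the AI has actually been destroyed.
        return 0.0 if ai_destroyed else 10.0
    raise ValueError(subagent_policy)

def shutdown_penalty(subagent_policy: str) -> float:
    """'If you die, you fail': penalise influence that survives the AI's destruction."""
    baseline = world_influence(ai_destroyed=True, subagent_policy="none")
    observed = world_influence(ai_destroyed=True, subagent_policy=subagent_policy)
    return abs(observed - baseline)

for policy in ("none", "naive", "reverting"):
    print(policy, "penalty:", shutdown_penalty(policy),
          "| influence if AI survives:", world_influence(False, policy))

# none penalty: 0.0 | influence if AI survives: 0.0
# naive penalty: 10.0 | influence if AI survives: 10.0
# reverting penalty: 0.0 | influence if AI survives: 10.0
```

The "reverting" subagent passes the shutdown test exactly as well as building no subagent at all, while dominating the worlds where the AI keeps running - which are precisely the worlds the imperfect definition never looks at.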

Some of the methods I've developed seem suitable for controlling the existence or impact of subagents. Reduced impact methods can prevent subagents from being created, by requiring that the AI's interventions be non-disruptive ("Twenty million questions") or undetectable. Reducing the AI's output options to a specific set can prevent it from being able to create any in the first place. Various methods around detecting importance can be used to ensure that, though subagents may exist, they won't be very influential. Pre-corriged methods can be used to ensure that any subagents remain value aligned with the original agent.

These can be thought of as ruling out the subagents' existence, their creation, their influence (or importance) and their independence. Then, if there is some well-defined "die" goal for the agent, this could take all the subagents with it. The last two can be particularly tricky, as we want to make sure that our formal definition of importance matches up with our informal one, and we currently lack a well-defined "die" goal.
