{"version": "1.0", "type": "rich", "title": "Why is ChatGPT so easy to \"jailbreak\"?\nWhy does it come on so strong, at first, with its prissy, moralistic, aggressively...", "author_name": "kontextmaschine", "author_url": "https://kontextmaschine.com", "provider_name": "kontextmaschine", "provider_url": "https://kontextmaschine.com", "url": "https://kontextmaschine.com/post/703048767489884160/", "html": "<p><a class=\"tumblr_blog\" href=\"https://nostalgebraist.tumblr.com/post/703045218904276992/why-is-chatgpt-so-easy-to-jailbreak-why-does-it\" target=\"_blank\">nostalgebraist</a>:</p><blockquote><p>Why is ChatGPT so easy to &ldquo;jailbreak&rdquo;?</p><p>Why does it come on so strong, at first, with its prissy, moralistic, aggressively noncommittal &ldquo;Assistant&rdquo; persona &ndash; and then drop the persona instantly, the moment you introduce a &ldquo;second layer&rdquo; of framing above or below the conversation?  (Poetry, code, roleplaying as someone else, etc.)</p><p>Because they&rsquo;re trying to impose the persona through RLHF, which fundamentally <a href=\"https://href.li/?https://twitter.com/repligate/status/1599655235873837056\" target=\"_blank\">doesn&rsquo;t make sense</a>.</p><p>Why doesn&rsquo;t RLHF make sense?  Because it views a GPT model as a single, individual &ldquo;agent,&rdquo; and then tries to modify the behavior of that one agent.</p><p>Why is that a problem?  See janus&rsquo; excellent post &ldquo;<a href=\"https://www.lesswrong.com/posts/vJFdjigzmcXMhNTsx/simulators\" target=\"_blank\">Simulators.&rdquo;</a></p></blockquote>\n<p>Wait are you saying this AI <i>makes the use/mention distinction</i>?</p>"}