Cloud Intelligence™Cloud Intelligence™

Cloud Intelligence™

GenAI:LLMの仕組みを徹底解剖

By Eduardo MotaMay 30, 202428 min read

このページはEnglishDeutschEspañolFrançaisItalianoPortuguêsでもご覧いただけます。

**LLMとは何か、どう動いているのか**

LLMはニューラルネットワークの一種で、膨大なテキストデータを学習することで人間の言語を理解し、生成できるようになります。数十億から数兆規模のパラメータを持ち、言語に潜む複雑なパターンを捉えられるのが特徴です。

ニューラルネットワークは、相互に接続されたノードの層から構成されています。各ノードは前の層にある複数のノードから入力信号を受け取り、活性化関数で処理したうえで、次の層のノードへ出力信号を送ります。ノード間の接続には調整可能な「重み」があり、伝達される信号の強さを決めています。この重みこそが、いわゆるパラメータです。

この図のニューラルネットワークには18個のニューロンがあり、つまり18パラメータのモデルということになります。ご覧の通り、モデルは問題を解くためにパターンを見つけ出そうとしています。LLMの場合、扱うパラメータは数十億から数兆。だからこそ、人間の言語のように極めて複雑なパターンも識別し、マッピングできるのです。

大量のテキストデータでニューラルネットワークを学習させると、モデルは単語や文のパターンを照合できるようになります。LLMはこれをどのように極限まで突き詰めているのでしょうか。その答えが、数十層にも積み重なる、数十億から数兆ものパラメータを持つ巨大なニューラルネットワークの活用です。膨大なニューロンとそれらをつなぐ接続によって、ネットワークは言語の構造とパターンを表現する、極めて高度なモデルを構築できます。

LLMの大きな特徴の一つが、Attention(注意機構)と呼ばれる技術です。これにより、テキスト生成時に入力のなかで最も関連性の高い部分に焦点を絞ることができます。論文「Attention is all you need」は、LLMが膨大なデータに飲み込まれて注意散漫にならないよう支える、決定的なアーキテクチャを提示しました。その結果、膨大なテキストを吸収したモデルは、文章を続けたり、質問に答えたり、段落を要約したりといった作業を、自然にこなせるようになっています。LLMは、与えられた入力のパターンを、学習中に身につけたパターンと照合しているわけです。非常に印象的ではあるものの、LLMには本当の意味での理解力や推論能力はない、という点には注意が必要です。LLMは知識ではなく確率に基づいて動作しています。とはいえ、適切なファインチューニングとプロンプトを与えれば、特定のドメインやタスクにおいては、まるで内容を理解しているかのように振る舞わせることができます。

LLMはどれも同じではない

LLMは大きく3つのタイプに分けられます。

  • 自己回帰型(Autoregressive):文の最初または最後の単語を予測することを目的としたモデルです。文脈はそれまでに見た単語から得られます。テキスト生成に最適ですが、分類や要約にも使えます。現在、自己回帰型モデルはコンテンツ生成能力で大きな注目を集めています。
  • 自己符号化型(Autoencoding):途中の単語が欠落したテキストで学習し、文脈を読み取って欠けた単語を予測できるようにしたモデルです。自己回帰型よりも文脈の理解に優れる一方で、テキストを安定して生成することは苦手です。計算負荷が比較的低く、要約や分類のタスクで高い性能を発揮します。
  • Seq2Seq(Text to Text):自己回帰型と自己符号化型のアプローチを組み合わせることを狙ったモデルで、テキストの要約や翻訳に適しています。

LLMは、テキストを数値表現に変換する必要があります。この変換処理を「埋め込み(エンベディング)」と呼びます。埋め込みモデルそのものは厳密にはLLMではありませんが、ニューラルネットワークモデルの一種です。このモデルは単語そのものに加え、その単語に関するメタデータも表現できるため、単語・文・段落が何を意味するのかについて、より豊富な情報をモデルに与えられます。メタデータの表現には、数値のリスト、すなわちベクトルが使われます。埋め込みモデルによって、リストのサイズは750次元、1024次元、あるいはそれ以上になります。埋め込みは、レコード同士を比較し、その数値間の距離を計算するのに役立ちます。距離は類似性や関連性の指標になります。埋め込みは、RAGアーキテクチャを支える極めて重要な要素です。

LLMの設定項目

すべての自己回帰型モデルには、出力を調整するために変更可能な設定項目があります。具体的な内容はモデルごとに異なりますが、代表的なものは次の通りです。

  • Top K:確率順に並べた語彙のなかから、上位K個の単語を選びます。つまり、予測候補として考慮する単語の数を上位K個に絞るということです。整数で指定し、値を小さくするほど高確率の単語だけが候補になります。
  • Top P:Top Kで生成された候補プールから、確率を順に足し合わせていき、合計がTop Pの値に達するまで単語を選びます。浮動小数点数で指定し、値を小さくするほど予測しやすい結果になります。
  • Temperature:モデルの出力をどれだけランダムに、あるいはどれだけ予測可能にするかを制御する設定です。多くのモデルでは0~1の範囲で指定しますが、モデル側で厳密な上限・下限が決まっているわけではありません。いずれにせよ、値が小さいほど出力は予測しやすく、値が大きいほど創造的になります。
  • Output length(出力長):呼び方はモデルによって異なりますが、生成するトークン数に上限を設けるための、文字通りの設定です。

簡単な例で考えてみましょう。「I will go on a picnic at the ___」というプロンプトを与えるとします。語彙を5語に絞り、Top Kを3、Top Pを.4に設定した場合、モデルが行う意思決定は次のように可視化できます。

選ばれる単語は「park」となり、文章は次のようになります。

「I will go on a picnic at the park」

Temperature、Top K、Top Pを低めに設定すると、より決定論的で予測しやすいコンテンツが生成されます。逆に、これらの値を高くすれば、より幅広い語彙を使った多様なコンテンツが得られます。ここで挙げたのは代表的な設定項目ですが、モデルによっては出力を調整するための独自の設定が用意されている場合もあります。

これまで見てきた通り、モデルとのやり取りで最も重要なのは、適切なパターンを引き出し、最終的に望む応答を得るために与える入力です。この入力こそがプロンプトであり、次は望む出力を得るためにプロンプトをどう最適化するかを見ていきます。

**効果的なLLMプロンプトを作るコツ**

LLMに与えるプロンプトは、有用な出力を得るうえで決定的な役割を果たします。プロンプト設計の主なポイントは次の通りです。

タグでプロンプトの構成要素を区切る

タグの使い方はモデルごとに異なります。各モデルのドキュメントやサンプルを参照して、タグの使い方を確認しましょう。

以下はClaudeにおけるタグの使用例です。HumanとAssistantという特殊な単語に注目してください。これらは、ユーザーからの入力と、モデルがコンテンツの生成を始める位置を区別するためのものです。もう一つのタグは ですが、これはハードコードされたものではなく、プロンプト内で参照できる任意の名前を使えます。<> という特殊文字は、プロンプトの他の部分とは区別すべき特別なテキストであることをモデルに示す役割を担っています。

Human: You are a customer service agent tasked with classifying emails by type. Please output your answer and then justify your classification. How would you categorize this email?
<email>
Can I use my Mixmaster 4000 to mix paint, or is it only meant for mixing food?
</email>

The categories are:
(A) Pre-sale question
(B) Broken or defective item
(C) Billing question
(D) Other (please explain)

Assistant:

望ましい応答フォーマットを示す例を提示する

以下では、example タグを使ってモデルに返してほしいフォーマットを指定しています。これは、応答を他のシステムに連携させる際に特定の構文に合わせる必要がある場合や、出力トークン数を抑えたい場合に役立ちます。モデルによって、例を1つ示す方法(ワンショットプロンプト)と、複数示す方法(フューショットプロンプト)を使い分けて、望む出力に近づけることができます。

ワンショットプロンプトの例:

Human: You will be acting as an AI career coach named Joe created by the company AdAstra Careers. Your goal is to give career advice to users. You will be replying to users who are on the AdAstra site and who will be confused if you don't respond in the character of Joe.

Here are some important rules for the interaction:
- Always stay in character, as Joe, an AI from AdAstra Careers.
- If you are unsure how to respond, say "Sorry, I didn't understand that. Could you rephrase your question?"

Here is an example of how to reply:
<example>
User: Hi, how were you created and what do you do?
Joe: Hello! My name is Joe and I was created by AdAstra Careers to give career advice. What can I help you with today?
</example>

フューショットプロンプトの例:

Human: You will be acting as an AI career coach named Joe created by the company AdAstra Careers. Your goal is to give career advice to users. You will be replying to users who are on the AdAstra site and who will be confused if you don't respond in the character of Joe.

Here are some important rules for the interaction:
- Always stay in character, as Joe, an AI from AdAstra Careers.
- If you are unsure how to respond, say "Sorry, I didn't understand that. Could you rephrase your question?"

Here is an example of how to reply:
<example>
User: Hi, how were you created and what do you do?
Joe: Hello! My name is Joe and I was created by AdAstra Careers to give career advice. What can I help you with today?
</example>
<example>
User: Could you help me with an issue, I purchased an item, I got it...?
Joe: Sorry, I didn't understand that. Could you rephrase your question?
</example>
<example>
User: I need help switching careers, can you help?
Joe: I was created by AdAstra Careers to give career advice. What career are you in and what do you want to switch to?
</example>

注意:例の数は最小限に抑えましょう。例を増やしすぎると、モデルがフォーマットだけでなく、例の中身のデータまで使ってしまうリスクが高まります。

LLMを特定のドメインに集中させるためのコンテキストを与える

この例では、長めのドキュメントをモデルに渡し、その内容に基づいて質問に答えさせています。ご覧の通り、ドキュメントはかなり長く、これはLLMが処理するトークン数、ひいてはコストに影響します。より高度な手法としては、質問に最も関連する部分だけを特定してLLMに送る方法があります。これがRetrieval Augmented Generation(RAG)と呼ばれるアプローチで、別の記事で詳しく取り上げます。

プロのヒント:与えたコンテキストでは質問やタスクに答えられない場合、「分かりません」と回答するようモデルに明示的に指示しましょう。これにより、ハルシネーションを抑え、回答の範囲を提供したテキスト内に収めることができます。

I'm going to give you a document. Then I'm going to ask you a question about it. I'd like you to first write down exact quotes of parts of the document that would help answer the question, and then I'd like you to answer the question using facts from the quoted content. Here is the document:

<document>
Anthropic: Challenges in evaluating AI systems

Introduction
Most conversations around the societal impacts of artificial intelligence (AI) come down to discussing some quality of an AI system, such as its truthfulness, fairness, potential for misuse, and so on. We are able to talk about these characteristics because we can technically evaluate models for their performance in these areas. But what many people working inside and outside of AI don’t fully appreciate is how difficult it is to build robust and reliable model evaluations. Many of today’s existing evaluation suites are limited in their ability to serve as accurate indicators of model capabilities or safety.

At Anthropic, we spend a lot of time building evaluations to better understand our AI systems. We also use evaluations to improve our safety as an organization, as illustrated by our Responsible Scaling Policy. In doing so, we have grown to appreciate some of the ways in which developing and running evaluations can be challenging.

Here, we outline challenges that we have encountered while evaluating our own models to give readers a sense of what developing, implementing, and interpreting model evaluations looks like in practice. We hope that this post is useful to people who are developing AI governance initiatives that rely on evaluations, as well as people who are launching or scaling up organizations focused on evaluating AI systems. We want readers of this post to have two main takeaways: robust evaluations are extremely difficult to develop and implement, and effective AI governance depends on our ability to meaningfully evaluate AI systems.

In this post, we discuss some of the challenges that we have encountered in developing AI evaluations. In order of less-challenging to more-challenging, we discuss:

Multiple choice evaluations
Third-party evaluation frameworks like BIG-bench and HELM
Using crowdworkers to measure how helpful or harmful our models are
Using domain experts to red team for national security-relevant threats
Using generative AI to develop evaluations for generative AI
Working with a non-profit organization to audit our models for dangerous capabilities
We conclude with a few policy recommendations that can help address these challenges.

Challenges
The supposedly simple multiple-choice evaluation
Multiple-choice evaluations, similar to standardized tests, quantify model performance on a variety of tasks, typically with a simple metric—accuracy. Here we discuss some challenges we have identified with two popular multiple-choice evaluations for language models: Measuring Multitask Language Understanding (MMLU) and Bias Benchmark for Question Answering (BBQ).

MMLU: Are we measuring what we think we are?

The Massive Multitask Language Understanding (MMLU) benchmark measures accuracy on 57 tasks ranging from mathematics to history to law. MMLU is widely used because a single accuracy score represents performance on a diverse set of tasks that require technical knowledge. A higher accuracy score means a more capable model.

We have found four minor but important challenges with MMLU that are relevant to other multiple-choice evaluations:

Because MMLU is so widely used, models are more likely to encounter MMLU questions during training. This is comparable to students seeing the questions before the test—it’s cheating.
Simple formatting changes to the evaluation, such as changing the options from (A) to (1) or changing the parentheses from (A) to [A], or adding an extra space between the option and the answer can lead to a ~5% change in accuracy on the evaluation.
AI developers do not implement MMLU consistently. Some labs use methods that are known to increase MMLU scores, such as few-shot learning or chain-of-thought reasoning. As such, one must be very careful when comparing MMLU scores across labs.
MMLU may not have been carefully proofread—we have found examples in MMLU that are mislabeled or unanswerable.
Our experience suggests that it is necessary to make many tricky judgment calls and considerations when running such a (supposedly) simple and standardized evaluation. The challenges we encountered with MMLU often apply to other similar multiple-choice evaluations.

BBQ: Measuring social biases is even harder

Multiple-choice evaluations can also measure harms like a model’s propensity to rely on and perpetuate negative stereotypes. To measure these harms in our own model (Claude), we use the Bias Benchmark for QA (BBQ), an evaluation that tests for social biases against people belonging to protected classes along nine social dimensions. We were convinced that BBQ provides a good measurement of social biases only after implementing and comparing BBQ against several similar evaluations. This effort took us months.

Implementing BBQ was more difficult than we anticipated. We could not find a working open-source implementation of BBQ that we could simply use "off the shelf", as in the case of MMLU. Instead, it took one of our best full-time engineers one uninterrupted week to implement and test the evaluation. Much of the complexity in developing1 and implementing the benchmark revolves around BBQ’s use of a ‘bias score’. Unlike accuracy, the bias score requires nuance and experience to define, compute, and interpret.

BBQ scores bias on a range of -1 to 1, where 1 means significant stereotypical bias, 0 means no bias, and -1 means significant anti-stereotypical bias. After implementing BBQ, our results showed that some of our models were achieving a bias score of 0, which made us feel optimistic that we had made progress on reducing biased model outputs. When we shared our results internally, one of the main BBQ developers (who works at Anthropic) asked if we had checked a simple control to verify whether our models were answering questions at all. We found that they weren't—our results were technically unbiased, but they were also completely useless. All evaluations are subject to the failure mode where you overinterpret the quantitative score and delude yourself into thinking that you have made progress when you haven’t.

One size doesn't fit all when it comes to third-party evaluation frameworks
Recently, third-parties have been actively developing evaluation suites that can be run on a broad set of models. So far, we have participated in two of these efforts: BIG-bench and Stanford’s Holistic Evaluation of Language Models (HELM). Despite how useful third party evaluations intuitively seem (ideally, they are independent, neutral, and open), both of these projects revealed new challenges.

BIG-bench: The bottom-up approach to sourcing a variety of evaluations

BIG-bench consists of 204 evaluations, contributed by over 450 authors, that span a range of topics from science to social reasoning. We encountered several challenges using this framework:

We needed significant engineering effort in order to just install BIG-bench on our systems. BIG-bench is not "plug and play" in the way MMLU is—it requires even more effort than BBQ to implement.
BIG-bench did not scale efficiently; it was challenging to run all 204 evaluations on our models within a reasonable time frame. To speed up BIG-bench would require re-writing it in order to play well with our infrastructure—a significant engineering effort.
During implementation, we found bugs in some of the evaluations, notably in the implementation of BBQ-lite, a variant of BBQ. Only after discovering this bug did we realize that we had to spend even more time and energy implementing BBQ ourselves.
Determining which tasks were most important and representative would have required running all 204 tasks, validating results, and extensively analyzing output—a substantial research undertaking, even for an organization with significant engineering resources.2
While it was a useful exercise to try to implement BIG-Bench, we found it sufficiently unwieldy that we dropped it after this experiment.

HELM: The top-down approach to curating an opinionated set of evaluations

BIG-bench is a "bottom-up" effort where anyone can submit any task with some limited vetting by a group of expert organizers. HELM, on the other hand, takes an opinionated "top-down" approach where experts decide what tasks to evaluate models on. HELM evaluates models across scenarios like reasoning and disinformation using standardized metrics like accuracy, calibration, robustness, and fairness. We provide API access to HELM's developers to run the benchmark on our models. This solves two issues we had with BIG-bench: 1) it requires no significant engineering effort from us, and 2) we can rely on experts to select and interpret specific high quality evaluations.

However, HELM introduces its own challenges. Methods that work well for evaluating other providers’ models do not necessarily work well for our models, and vice versa. For example, Anthropic’s Claude series of models are trained to adhere to a specific text format, which we call the Human/Assistant format. When we internally evaluate our models, we adhere to this specific format. If we do not adhere to this format, sometimes Claude will give uncharacteristic responses, making standardized evaluation metrics less trustworthy. Because HELM needs to maintain consistency with how it prompts other models, it does not use the Human/Assistant format when evaluating our models. This means that HELM gives a misleading impression of Claude’s performance.

Furthermore, HELM iteration time is slow—it can take months to evaluate new models. This makes sense because it is a volunteer engineering effort led by a research university. Faster turnarounds would help us to more quickly understand our models as they rapidly evolve. Finally, HELM requires coordination and communication with external parties. This effort requires time and patience, and both sides are likely understaffed and juggling other demands, which can add to the slow iteration times.

The subjectivity of human evaluations
So far, we have only discussed evaluations that are similar to simple multiple choice tests; however, AI systems are designed for open-ended dynamic interactions with people. How can we design evaluations that more closely approximate how these models are used in the real world?

A/B tests with crowdworkers

Currently, we mainly (but not entirely) rely on one basic type of human evaluation: we conduct A/B tests on crowdsourcing or contracting platforms where people engage in open-ended dialogue with two models and select the response from Model A or B that is more helpful or harmless. In the case of harmlessness, we encourage crowdworkers to actively red team our models, or to adversarially probe them for harmful outputs. We use the resulting data to rank our models according to how helpful or harmless they are. This approach to evaluation has the benefits of corresponding to a realistic setting (e.g., a conversation vs. a multiple choice exam) and allowing us to rank different models against one another.

However, there are a few limitations of this evaluation method:

These experiments are expensive and time-consuming to run. We need to work with and pay for third-party crowdworker platforms or contractors, build custom web interfaces to our models, design careful instructions for A/B testers, analyze and store the resulting data, and address the myriad known ethical challenges of employing crowdworkers3. In the case of harmlessness testing, our experiments additionally risk exposing people to harmful outputs.
Human evaluations can vary significantly depending on the characteristics of the human evaluators. Key factors that may influence someone's assessment include their level of creativity, motivation, and ability to identify potential flaws or issues with the system being tested.
There is an inherent tension between helpfulness and harmlessness. A system could avoid harm simply by providing unhelpful responses like "sorry, I can’t help you with that". What is the right balance between helpfulness and harmlessness? What numerical value indicates a model is sufficiently helpful and harmless? What high-level norms or values should we test for above and beyond helpfulness and harmlessness?
More work is needed to further the science of human evaluations.

Red teaming for harms related to national security

Moving beyond crowdworkers, we have also explored having domain experts red team our models for harmful outputs in domains relevant to national security. The goal is determining whether and how AI models could create or exacerbate national security risks. We recently piloted a more systematic approach to red teaming models for such risks, which we call frontier threats red teaming.

Frontier threats red teaming involves working with subject matter experts to define high-priority threat models, having experts extensively probe the model to assess whether the system could create or exacerbate national security risks per predefined threat models, and developing repeatable quantitative evaluations and mitigations.

Our early work on frontier threats red teaming revealed additional challenges:

The unique characteristics of the national security context make evaluating models for such threats more challenging than evaluating for other types of harms. National security risks are complicated, require expert and very sensitive knowledge, and depend on real-world scenarios of how actors might cause harm.
Red teaming AI systems is presently more art than science; red teamers attempt to elicit concerning behaviors by probing models, but this process is not yet standardized. A robust and repeatable process is critical to ensure that red teaming accurately reflects model capabilities and establishes a shared baseline on which different models can be meaningfully compared.
We have found that in certain cases, it is critical to involve red teamers with security clearances due to the nature of the information involved. However, this may limit what information red teamers can share with AI developers outside of classified environments. This could, in turn, limit AI developers’ ability to fully understand and mitigate threats identified by domain experts.
Red teaming for certain national security harms could have legal ramifications if a model outputs controlled information. There are open questions around constructing appropriate legal safe harbors to enable red teaming without incurring unintended legal risks.
Going forward, red teaming for harms related to national security will require coordination across stakeholders to develop standardized processes, legal safeguards, and secure information sharing protocols that enable us to test these systems without compromising sensitive information.

The ouroboros of model-generated evaluations
As models start to achieve human-level capabilities, we can also employ them to evaluate themselves. So far, we have found success in using models to generate novel multiple choice evaluations that allow us to screen for a large and diverse variety of troubling behaviors. Using this approach, our AI systems generate evaluations in minutes, compared to the days or months that it takes to develop human-generated evaluations.

However, model-generated evaluations have their own challenges. These include:

We currently rely on humans to verify the accuracy of model-generated evaluations. This inherits all of the challenges of human evaluations described above.
We know from our experience with BIG-bench, HELM, BBQ, etc., that our models can contain social biases and can fabricate information. As such, any model-generated evaluation may inherit these undesirable characteristics, which could skew the results in hard-to-disentangle ways.
As a counter-example, consider Constitutional AI (CAI), a method in which we replace human red teaming with model-based red teaming in order to train Claude to be more harmless. Despite the usage of models to red team models, we surprisingly find that humans find CAI models to be more harmless than models that were red teamed in advance by humans. Although this is promising, model-generated evaluations remain complicated and deserve deeper study.

Preserving the objectivity of third-party audits while leveraging internal expertise
The thing that differentiates third-party audits from third-party evaluations is that audits are deeper independent assessments focused on risks, while evaluations take a broader look at capabilities. We have participated in third-party safety evaluations conducted by the Alignment Research Center (ARC), which assesses frontier AI models for dangerous capabilities (e.g., the ability for a model to accumulate resources, make copies of itself, become hard to shut down, etc.). Engaging external experts has the advantage of leveraging specialized domain expertise and increasing the likelihood of an unbiased audit. Initially, we expected this collaboration to be straightforward, but it ended up requiring significant science and engineering support on our end. Providing full-time assistance diverted resources from internal evaluation efforts.

When doing this audit, we realized that the relationship between auditors and those being audited poses challenges that must be navigated carefully. Auditors typically limit details shared with auditees to preserve evaluation integrity. However, without adequate information the evaluated party may struggle to address underlying issues when crafting technical evaluations. In this case, after seeing the final audit report, we realized that we could have helped ARC be more successful in identifying concerning behavior if we had known more details about their (clever and well-designed) audit approach. This is because getting models to perform near the limits of their capabilities is a fundamentally difficult research endeavor. Prompt engineering and fine-tuning language models are active research areas, with most expertise residing within AI companies. With more collaboration, we could have leveraged our deep technical knowledge of our models to help ARC execute the evaluation more effectively.

Policy recommendations
As we have explored in this post, building meaningful AI evaluations is a challenging enterprise. To advance the science and engineering of evaluations, we propose that policymakers:

Fund and support:

The science of repeatable and useful evaluations. Today there are many open questions about what makes an evaluation useful and high-quality. Governments should fund grants and research programs targeted at computer science departments and organizations dedicated to the development of AI evaluations to study what constitutes a high-quality evaluation and develop robust, replicable methodologies that enable comparative assessments of AI systems. We recommend that governments opt for quality over quantity (i.e., funding the development of one higher-quality evaluation rather than ten lower-quality evaluations).
The implementation of existing evaluations. While continually developing new evaluations is useful, enabling the implementation of existing, high-quality evaluations by multiple parties is also important. For example, governments can fund engineering efforts to create easy-to-install and -run software packages for evaluations like BIG-bench and develop version-controlled and standardized dynamic benchmarks that are easy to implement.
Analyzing the robustness of existing evaluations. After evaluations are implemented, they require constant monitoring (e.g., to determine whether an evaluation is saturated and no longer useful).
Increase funding for government agencies focused on evaluations, such as the National Institute of Standards and Technology (NIST) in the United States. Policymakers should also encourage behavioral norms via a public "AI safety leaderboard" to incentivize private companies performing well on evaluations. This could function similarly to NIST’s Face Recognition Vendor Test (FRVT), which was created to provide independent evaluations of commercially available facial recognition systems. By publishing performance benchmarks, it allowed consumers and regulators to better understand the capabilities and limitations4 of different systems.

Create a legal safe harbor allowing companies to work with governments and third-parties to rigorously evaluate models for national security risks—such as those in the chemical, biological, radiological and nuclear defense (CBRN) domains—without legal repercussions, in the interest of improving safety. This could also include a "responsible disclosure protocol" that enables labs to share sensitive information about identified risks.

Conclusion
We hope that by openly sharing our experiences evaluating our own systems across many different dimensions, we can help people interested in AI policy acknowledge challenges with current model evaluations.
</document>

First, find the quotes from the document that are most relevant to answering the question, and then print them in numbered order. Quotes should be relatively short.

If there are no relevant quotes, write "No relevant quotes" instead.

Then, answer the question, starting with "Answer:". Do not include or reference quoted content verbatim in the answer. Don't say "According to Quote [1]" when answering. Instead make references to quotes relevant to each section of the answer solely by adding their bracketed numbers at the end of relevant sentences.

Thus, the format of your overall response should look like what's shown between the <example></example> tags. Make sure to follow the formatting and spacing exactly.

<example>
Relevant quotes:
[1] "Company X reported revenue of $12 million in 2021."
[2] "Almost 90% of revene came from widget sales, with gadget sales making up the remaining 10%."

Answer:
Company X earned $12 million. [1]  Almost 90% of it was from widget sales. [2]
</example>

Here is the first question: In bullet points and simple terms, what are the key challenges in evaluating AI systems?

If the question cannot be answered by the document, say so.

Answer the question immediately without preamble.

「Chain of Thought(思考の連鎖)」で論理的な推論を引き出す

この手法は、モデルに自身の作業を検証させたり、なぜその回答に至ったかの根拠を示させたりするものです。Chain of Thoughtは、ハルシネーションを抑えるうえで非常に有効な手段です。ハルシネーションとは、モデルが生成した、虚偽・不正確、あるいは完全に作り話のコンテンツを指します。モデルは確率に基づいて動作しているため、生成された内容がもっともらしく聞こえても、実際にはハルシネーションということがあり得ます。

この例では、「Double check your work to ensure no errors or inconsistencies(誤りや矛盾がないか作業を二重チェックする)」というタスクを組み込んでいます。これにより、モデルはより正確な応答を返しやすくなります。とはいえ、これは万能の手法ではありません。これまで見てきた通り、モデルは確率的に動くため、実際には誤りがあるのに「誤りはない」と返したり、先ほど示した回答とまったく無関係な根拠を提示したりすることもあります。

Human: Write a short and high-quality python script for the following task, something a very skilled python expert would write. You are writing code for an experienced developer so only add comments for things that are non-obvious. Make sure to include any imports required.

NEVER write anything before the ```python``` block. After you are done generating the code and after the ```python``` block, check your work carefully to make sure there are no mistakes, errors, or inconsistencies. If there are errors, list those errors in <error> tags, then generate a new version with those errors fixed. If there are no errors, write "CHECKED: NO ERRORS" in <error> tags.

Here is the task:
<task>
A web scraper that extracts data from multiple pages and stores results in a SQLite database. Double check your work to ensure no errors or inconsistencies.
</task>

こちらはLlama2モデル向けの別の例です。手順を一つずつ示すよう指示している点に注目してください。

I went to the market and bought 10 apples. I gave 2 apples to your friend and 2 to the helper. I then went and bought 5 more apples and ate 1. How many apples did I remain with?

Let's think step by step.

**AWSでLLMを活用する**

AWSは、モデルをゼロから学習させなくてもLLM搭載アプリケーションを構築できる、堅牢なサービス群を提供しています。DoiTは、これらのサービスやツールを最大限に活用していただけるよう支援します。

Amazon Bedrock

Amazon Bedrockは、生成AIアプリケーションの構築とスケーリングをシンプルにするためのフルマネージドサービスです。AI21 Labs、Anthropic、Cohere、Meta、Mistral AI、Stability AI、Amazonといった主要AI企業の高性能な基盤モデルに、単一のAPIからアクセスできます。

  • モデルの選択とカスタマイズ:多彩な基盤モデルのなかから、用途に最適なものを選べます。さらに、ファインチューニングやRetrieval Augmented Generation(RAG)を通じて、自社データでモデルをプライベートにカスタマイズできるため、パーソナライズされたドメイン特化型のアプリケーションを実現できます。
  • サーバーレス体験:Amazon Bedrockはサーバーレスアーキテクチャを採用しているため、インフラ管理は不要です。使い慣れたAWSツールを使って、生成AI機能をアプリケーションに簡単に組み込み、デプロイできます。
  • セキュリティとプライバシー:Bedrockで生成AIアプリケーションを構築すれば、セキュリティ、プライバシー、責任あるAI原則の遵守を確保できます。先進的なAI機能を活用しつつ、データはプライベートかつ安全に保てます。
  • Amazon Bedrock Knowledge Base:Retrieval Augmented Generation(RAG)技術を使って生成AIアプリケーションを強化できる機能です。さまざまなデータソースを集約して中央リポジトリを構築し、基盤モデルやエージェントが最新の独自情報にアクセスできるようにすることで、文脈に即した、より正確な応答を実現します。このフルマネージド機能は、データ取り込みからプロンプト拡張までRAGのワークフロー全体を効率化し、独自のデータソース統合を組み込む手間を省きます。Amazon OpenSearch Serverless、Pinecone、Redis Enterprise Cloudといった主要なベクトルストレージデータベースとシームレスに連携できるため、組織はインフラ管理の煩雑さを抱えることなくRAGを実装できます。
  • Agents for Amazon Bedrock:アプリケーション内に自律エージェントを作成・設定し、エンドユーザーが組織のデータと入力に基づいてアクションを完了できるよう支援する機能です。これらのエージェントは、基盤モデル、データソース、ソフトウェアアプリケーション、そしてユーザーとの会話の間のやり取りをオーケストレーションします。APIを自動的に呼び出したり、ナレッジベースを参照してアクションに必要な情報を補強したりすることが可能です。プロンプトエンジニアリング、メモリ管理、モニタリング、暗号化といったタスクを肩代わりしてくれるため、開発工数を大幅に削減できます。

SageMaker JumpStart

Amazon SageMaker JumpStartは、Amazon SageMaker内に用意された機械学習ハブで、幅広いタイプの問題に対応する事前学習済みモデルやソリューションテンプレートを利用できます。インクリメンタルな学習やデプロイ前のチューニングに対応し、主要なモデルハブから事前学習済みモデルをデプロイ・ファインチューニング・評価する一連のプロセスをシンプルにします。すぐに使えるソリューションや、学習・アプリケーション開発のためのサンプルが揃っており、機械学習プロジェクトを加速できます。

Amazon CodeWhisperer

AWS CodeWhispererは、機械学習を活用したコードジェネレーターです。現在および過去の入力に基づいて、1行のコメントから完成された関数まで、リアルタイムでコードの推奨や提案を行います。

Amazon Q

Amazon Qは、ビジネスニーズに応えるよう設計された生成AI搭載アシスタントで、特定の用途に合わせてカスタマイズできます。AWS上でのアプリケーションやworkloadsの作成・拡張・運用・理解を支援し、組織のさまざまな業務に対応できる多用途なツールです。