Common Hooks

The most common hooking functions are supported out of the box by Unseal.

Some of these methods can be used directly as the function in a hook, others return such a function, and some return the hook itself. This is indicated in each method's docstring.

Saving Outputs

Calling this method returns a hooking function that can be passed directly when constructing a hook.

common_hooks.save_output(detach: bool = True) → Callable

Basic hooking function for saving the output of a module to the global context object.

Parameters
  • cpu (bool) – Whether to move the output to the CPU before saving.

  • detach (bool) – Whether to detach the output from the computation graph.

Returns

Function that saves the output to the context object.

Return type

Callable
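
For illustration, here is a minimal sketch of attaching the returned function to a module. The Hook and HookedModel constructors, the forward signature and the exact save_ctx layout are assumptions about unseal.hooks.commons and are not documented on this page.

    import torch
    from unseal.hooks import commons, common_hooks

    # Wrap a plain torch module so hooks can be attached
    # (the HookedModel constructor call is an assumption).
    model = commons.HookedModel(torch.nn.Sequential(torch.nn.Linear(4, 8), torch.nn.ReLU()))

    # save_output is a factory: its return value is the hooking function.
    hook = commons.Hook(
        layer_name='0',                                # assumed: name of the module to hook
        func=common_hooks.save_output(detach=True),
        key='linear_out',                              # assumed: key under which the output is stored
    )

    out = model(torch.randn(2, 4), hooks=[hook])       # assumed forward signature
    saved = model.save_ctx['linear_out']               # saved output, per the docstring above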

Replacing Activations

This method is a factory and returns a function that can be used in a hook to replace the activation of a layer.

common_hooks.replace_activation(indices: str, replacement_tensor: torch.Tensor, tuple_index: Optional[int] = None) → Callable

Creates a hooking function which replaces (part of) a module's activation (output) with a replacement tensor. If there is a dimension mismatch, the replacement tensor is copied along the leading dimensions of the output.

Example: If the activation has shape (B, T, D) and the replacement tensor has shape (D,), and you want to plug it in at position t of the T dimension for every element in the batch, then indices should be ':,t,:'.

Parameters
  • indices (str) – Indices at which to insert the replacement tensor

  • replacement_tensor (torch.Tensor) – Tensor that is filled in.

  • tuple_index (int) – Index of the tuple in the output of the module.

Returns

Function that replaces part of a given tensor with replacement_tensor

Return type

Callable
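
As a sketch, the returned function can be plugged into a hook like any other hooking function. The Hook constructor and the '->' layer naming are assumptions carried over from the other docstrings on this page; the model dimension D = 768 is an assumed example value.

    import torch
    from unseal.hooks import commons, common_hooks

    # Replace the activation at sequence position t for every batch element,
    # matching the (B, T, D) example above.
    t = 3
    replace_fn = common_hooks.replace_activation(
        indices=f':,{t},:',
        replacement_tensor=torch.zeros(768),
        tuple_index=0,   # assumed: the module returns a tuple whose first element is the activation
    )

    # Hook argument names and order are assumptions.
    hook = commons.Hook(layer_name='transformer->h->5->mlp', func=replace_fn, key='patch_mlp_5')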

Saving Attention

common_hooks.transformers_get_attention(heads: Optional[Union[int, Iterable[int], str]] = None, output_idx: Optional[int] = None) → Callable

Creates a hooking function to get the attention patterns of a given layer.

Parameters
  • heads (Optional[Union[int, Iterable[int], str]], optional) – The heads for which to save the attention, defaults to None

  • output_idx (Optional[int], optional) – If the attention module returns a tuple, use this argument to index it, defaults to None

Returns

Hooking function that saves the attention patterns of the specified heads

Return type

Callable
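
A sketch of building a hook from this factory. output_idx=2 follows the GPT-like convention mentioned for create_attention_hook below, and the Hook constructor is an assumption.

    from unseal.hooks import commons, common_hooks

    # Save the attention pattern of heads 0-2 of layer 0's attention module.
    attn_fn = common_hooks.transformers_get_attention(heads='0:3', output_idx=2)

    # The layer name follows the 'transformer->h' prefix convention used elsewhere
    # on this page; the Hook constructor arguments are assumptions.
    hook = commons.Hook(layer_name='transformer->h->0->attn', func=attn_fn, key='attn_0')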

Creating an Attention Hook

common_hooks.create_attention_hook(layer: int, key: str, output_idx: Optional[int] = None, attn_name: Optional[str] = 'attn', layer_key_prefix: Optional[str] = None, heads: Optional[Union[int, Iterable[int], str]] = None) → unseal.hooks.commons.Hook

Creates a hook which saves the attention patterns of a given layer.

Parameters
  • layer (int) – The layer to hook.

  • key (str) – The key to use for saving the attention patterns.

  • output_idx (Optional[int], optional) – If the module output is a tuple, index it with this. GPT-like models need this to be equal to 2, defaults to None

  • attn_name (Optional[str], optional) – The name of the attention module in the transformer, defaults to ‘attn’

  • layer_key_prefix (Optional[str], optional) – The prefix in the model structure before the layer idx, e.g. ‘transformer->h’, defaults to None

  • heads (Optional[Union[int, Iterable[int], str]], optional) – Which heads to save the attention pattern for. Can be int, tuple of ints or string like ‘1:3’, defaults to None

Returns

Hook which saves the attention patterns

Return type

Hook
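
For a GPT-like model this might look as follows; the concrete values (layer 0, heads '0:4') are illustrative, only the parameter meanings come from the docstring above.

    from unseal.hooks import common_hooks

    # create_attention_hook returns a ready-made Hook.
    hook = common_hooks.create_attention_hook(
        layer=0,
        key='attn_layer_0',
        output_idx=2,                        # GPT-like models, per the docstring above
        attn_name='attn',
        layer_key_prefix='transformer->h',   # GPT-like layer prefix
        heads='0:4',                         # heads 0-3, assuming slice-like string semantics
    )
    # The hook is then passed to a HookedModel forward call (exact signature not documented here).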

Creating a Logit Hook

common_hooks.create_logit_hook(layer: int, model: unseal.hooks.commons.HookedModel, unembedding_key: str, layer_key_prefix: Optional[str] = None, target: Optional[Union[int, List[int]]] = None, position: Optional[Union[int, List[int]]] = None, key: Optional[str] = None, split_heads: Optional[bool] = False, num_heads: Optional[int] = None) → unseal.hooks.commons.Hook

Creates a hook that saves the logits of a layer’s output. Outputs are saved to save_ctx[key]['logits'].

Parameters
  • layer (int) – The number of the layer

  • model (HookedModel) – The model.

  • unembedding_key (str) – The key/name of the unembedding matrix, e.g. ‘lm_head’ for causal LM models

  • layer_key_prefix (str) – The prefix of the key of the layer, e.g. ‘transformer->h’ for GPT-like models

  • target (Union[int, List[int]]) – The target token(s) to extract logits for. Defaults to all tokens.

  • position (Union[int, List[int]]) – The position(s) to extract logits for. Defaults to all positions.

  • key (str) – The key of the hook. Defaults to {layer}_logits.

  • split_heads (bool) – Whether to split the heads. Defaults to False.

  • num_heads (int) – The number of heads to split. Defaults to None.

Returns

The hook.

Return type

Hook
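
A sketch for a GPT-2 style model. Wrapping the model with HookedModel and the exact forward interface are assumptions; the parameter meanings come from the docstring above.

    from transformers import AutoModelForCausalLM
    from unseal.hooks import commons, common_hooks

    # Wrapping the model is an assumption about the HookedModel constructor.
    model = commons.HookedModel(AutoModelForCausalLM.from_pretrained('gpt2'))

    hook = common_hooks.create_logit_hook(
        layer=6,
        model=model,
        unembedding_key='lm_head',           # unembedding of a causal LM, per the docstring
        layer_key_prefix='transformer->h',   # GPT-like layer prefix, per the docstring
        key='layer6_logits',
    )
    # After a forward pass that includes this hook, the projected logits are
    # available under save_ctx['layer6_logits']['logits'], as stated above.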

GPT _attn Wrapper

common_hooks.gpt_attn_wrapper(func: Callable, save_ctx: Dict, c_proj: torch.Tensor, vocab_embedding: torch.Tensor, target_ids: torch.Tensor, batch_size: Optional[int] = None) → Tuple[Callable, Callable]

Wraps around the [AttentionBlock]._attn function to save the individual heads’ logits. This is necessary because the individual heads’ logits are not available on a module level and thus not accessible via a hook.

Parameters
  • func (Callable) – original _attn function

  • save_ctx (Dict) – context to which the logits will be saved

  • c_proj (torch.Tensor) – projection matrix, this is W_O in Anthropic’s terminology

  • vocab_embedding (torch.Tensor) – vocabulary/unembedding matrix, this is W_U in Anthropic’s terminology

  • target_ids (torch.Tensor) – indices of the target tokens for which the logits are computed

  • batch_size (Optional[int]) – batch size to reduce memory footprint, defaults to None

Returns

A tuple (inner, func): the wrapped function and the original function

Return type

Tuple[Callable, Callable]
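
A sketch of applying the wrapper to a HuggingFace GPT-2 attention block. Which tensors to pass for c_proj and vocab_embedding (here the output projection weight and the tied lm_head weight) are assumptions, as is the choice of target token.

    from transformers import AutoModelForCausalLM, AutoTokenizer
    from unseal.hooks import common_hooks

    model = AutoModelForCausalLM.from_pretrained('gpt2')
    tokenizer = AutoTokenizer.from_pretrained('gpt2')

    block = model.transformer.h[0].attn                      # GPT-2 attention block exposing _attn
    target_ids = tokenizer(' Paris', return_tensors='pt').input_ids[0]

    save_ctx = {}
    wrapped, original = common_hooks.gpt_attn_wrapper(
        func=block._attn,                                    # original _attn function
        save_ctx=save_ctx,                                   # per-head logits are written here
        c_proj=block.c_proj.weight,                          # output projection (W_O), assumed tensor
        vocab_embedding=model.lm_head.weight,                # unembedding matrix, assumed tensor
        target_ids=target_ids,
        batch_size=8,                                        # optional, to reduce memory footprint
    )
    block._attn = wrapped                                    # patch in the wrapped method
    # ... run a forward pass, then restore the original:
    block._attn = original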