Temporal Circuits: Mapping Time In GPT-2 With Attribution Graphs

by Alex Johnson

Unveiling the Secrets of Time: How Language Models Grasp Temporal Information

In the realm of artificial intelligence, understanding how language models (like GPT-2 and its larger siblings) process temporal information remains a fascinating and crucial challenge. Current probe-based methods offer glimpses into this process, but a deeper understanding requires identifying the actual circuits responsible for temporal reasoning. This exploration delves into reverse-engineering how these models represent and process time: going beyond simple detection to map the specific neural pathways that let them understand and reason about it, paving the way for more robust and reliable AI systems.

This article outlines a plan to map the temporal circuits within GPT-2 (and potentially larger models) using mechanistic interpretability techniques. This approach combines attribution graphs, similar to those used in factual recall circuit analysis, with Sparse Autoencoders (SAEs), drawing inspiration from Llama 3.1 model-diffing work. By identifying the specific neurons and connections involved in processing temporal information, we can gain a more granular understanding of how language models grasp the concept of time.

Existing research has already shown that linear probes can detect temporal scope with impressive accuracy (around 97.1% at Layer 8), and that steering vectors can shift the model's output toward short-term or long-term perspectives. A crucial piece of the puzzle is still missing, however: the specific circuits within the model that encode this temporal information. This research aims to close that gap, moving beyond mere detection to the underlying mechanisms. Imagine being able to see the exact pathways in a neural network that light up when it understands the difference between "now" and "later": that is the level of detail this work aims to achieve.
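For context, a temporal-scope probe of this kind is typically just a logistic regression trained on residual-stream activations. The sketch below illustrates the idea; the prompt lists and the collect_activations helper are illustrative assumptions (they reappear in the Phase 2 code later), not the original experimental setup.

# A minimal sketch of a Layer 8 temporal-scope probe: logistic regression
# on residual-stream activations. SHORT_TERM_PROMPTS, LONG_TERM_PROMPTS, and
# collect_activations (returning [n_prompts, d_model] arrays) are assumptions.
import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.concatenate([
    collect_activations(SHORT_TERM_PROMPTS),  # label 0: short-term
    collect_activations(LONG_TERM_PROMPTS),   # label 1: long-term
])
y = np.array([0] * len(SHORT_TERM_PROMPTS) + [1] * len(LONG_TERM_PROMPTS))

probe = LogisticRegression(max_iter=1000).fit(X, y)
print(f"probe accuracy (train): {probe.score(X, y):.3f}")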


Key Questions: Probing the Temporal Mind of AI

This investigation seeks to answer several key questions about how language models handle temporal information:

  1. Attention to Time: Which attention heads within the model actively attend to temporal cues present in the input text? Identifying these attention heads will reveal which parts of the model are specifically focused on extracting temporal information. By pinpointing these areas, we can understand how the model identifies and processes time-related words and phrases.
  2. Dedicated Temporal Neurons?: Are there specific, dedicated "temporal neurons" responsible for processing temporal information, or is this function distributed across a broader network of neurons? Understanding the distribution of temporal processing will shed light on the model's architectural design and how it allocates resources for this critical task. Is there a specialized team of neurons handling time, or is it a shared responsibility across the entire network?
  3. Minimal Circuit Discovery: Can we identify a minimal circuit, a core set of neurons and connections, that is sufficient for accurate temporal classification? Finding this minimal circuit would reveal the essential components necessary for temporal reasoning, potentially leading to more efficient and targeted model architectures. It's like finding the smallest engine that can still power a car – identifying the most crucial elements for temporal understanding.
  4. SAE Feature Separation: Do the features extracted by Sparse Autoencoders (SAEs) cleanly separate short-term and long-term temporal information? This will help us understand if SAEs can effectively disentangle different aspects of temporal reasoning, providing interpretable representations of time-related concepts. Can these features clearly distinguish between immediate plans and future goals?

Methodology: A Three-Phased Approach to Uncover Temporal Secrets

Phase 1: Attribution Patching for Temporal Decisions

This phase uses activation patching to identify the components that drive temporal classification. By selectively swapping activations between runs on contrasting prompts, we can determine which components are most responsible for distinguishing short-term from long-term temporal contexts. In essence, this is like carefully tweaking different parts of a machine to see which ones have the biggest impact on its ability to tell time.

The following pseudo-code outlines the experimental process using TransformerLens:

# Pseudo-code using TransformerLens
import numpy as np
import transformer_lens as tl

model = tl.HookedTransformer.from_pretrained("gpt2")

# Clean run: "I will plan for retirement" (long-term)
# Corrupt run: "I will eat lunch now" (short-term)

def temporal_patching_experiment():
    """Identify which components change temporal classification."""

    # Metric: probe score at Layer 8 (`probe` is the pre-trained temporal probe)
    def temporal_metric(logits, cache):
        layer_8_acts = cache["blocks.8.hook_resid_post"]
        return probe.predict_proba(layer_8_acts)[:, 1]  # P(long-term)

    # clean_score: the probe score on the unpatched clean run (baseline).
    # Patch each attention head's output (hook_z) from the corrupt run into
    # the clean run and record how far the probe score moves.
    attribution = np.zeros((model.cfg.n_layers, model.cfg.n_heads))
    for layer in range(model.cfg.n_layers):
        for head in range(model.cfg.n_heads):
            patched_score = run_with_patch(
                clean="I will plan for retirement",
                corrupt="I will eat lunch now",
                patch_location=f"blocks.{layer}.attn.hook_z",
                head_idx=head,
            )
            attribution[layer, head] = patched_score - clean_score
    return attribution

The expected output from this phase is an attribution matrix. This matrix will visually represent which attention heads are most causally important for accurate temporal classification. The matrix will highlight the specific areas of the model that are most sensitive to temporal cues, providing a roadmap for further investigation.
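The run_with_patch helper (and the clean_score baseline) are left abstract in the pseudo-code above. A minimal sketch of how they might look with TransformerLens hooks follows; it assumes the clean and corrupt prompts tokenize to equal lengths, and that probe is the pre-trained Layer 8 temporal probe. clean_score is the same quantity computed with the patching hook removed.

# A minimal sketch of the run_with_patch helper assumed above, using
# TransformerLens hooks. Prompts are assumed to tokenize to equal lengths.
def run_with_patch(clean, corrupt, patch_location, head_idx):
    _, corrupt_cache = model.run_with_cache(corrupt)
    stored = {}

    def patch_head(z, hook):
        # z: [batch, pos, head_index, d_head]; overwrite one head's output
        z[:, :, head_idx, :] = corrupt_cache[hook.name][:, :, head_idx, :]
        return z

    def store_resid(resid, hook):
        # Capture the Layer 8 residual stream at the final token for the probe
        stored["acts"] = resid[:, -1, :].detach().cpu().numpy()
        return resid

    model.run_with_hooks(clean, fwd_hooks=[
        (patch_location, patch_head),
        ("blocks.8.hook_resid_post", store_resid),
    ])
    return float(probe.predict_proba(stored["acts"])[:, 1].mean())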

Phase 2: SAE Feature Analysis

In this phase, Sparse Autoencoders (SAEs) are employed to discover and analyze interpretable temporal features. An SAE is trained to reconstruct the model's activations as sparse combinations of learned features, which tend to be more interpretable than raw neurons. By examining these features, we can gain insight into how the model represents and processes temporal information at a more abstract level. It's like looking at the key ingredients a chef uses to create a dish: understanding the core elements that contribute to the final temporal understanding.

The process involves loading an existing SAE (e.g., from Neuronpedia) or training a custom one, then collecting activations on both short-term and long-term temporal prompts. By comparing feature activations across these two types of prompts, we can identify features that are differentially active and therefore likely to encode temporal information.

# Use an existing SAE (from Neuronpedia/SAELens) or train a custom one
sae = load_sae("gpt2-layer-8")

# Collect Layer 8 activations on temporal prompts
short_term_acts = collect_activations(SHORT_TERM_PROMPTS)
long_term_acts = collect_activations(LONG_TERM_PROMPTS)

# Encode once, then compare each feature's mean activation across prompt sets
short_mean = sae.encode(short_term_acts).mean(dim=0)  # [n_features]
long_mean = sae.encode(long_term_acts).mean(dim=0)

# Find differentially active features
temporal_features = []
for feature_idx in range(sae.n_features):
    short_activation = short_mean[feature_idx]
    long_activation = long_mean[feature_idx]

    if abs(short_activation - long_activation) > threshold:
        temporal_features.append({
            'idx': feature_idx,
            'direction': 'short' if short_activation > long_activation else 'long',
            'diff': abs(short_activation - long_activation),
        })

The expected output is a list of SAE features that encode temporal information, along with interpretable descriptions of what each feature represents. For example, one feature might represent the concept of "immediacy," while another might represent "future planning." These features will provide a higher-level understanding of how the model encodes and processes time.
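The load_sae and collect_activations helpers above are placeholders. One way to realize them is sketched below, assuming SAELens's SAE.from_pretrained interface and a public GPT-2 residual-stream SAE release; the release and sae_id strings are assumptions that may differ across SAELens versions, and the hook point must match wherever the SAE was trained.

# A minimal sketch of the load_sae / collect_activations placeholders.
# The release/sae_id strings are assumptions, not verified identifiers.
import torch
from sae_lens import SAE

def load_sae(_name="gpt2-layer-8"):
    sae, _, _ = SAE.from_pretrained(
        release="gpt2-small-res-jb",       # assumed GPT-2 SAE release name
        sae_id="blocks.8.hook_resid_pre",  # Layer 8 residual stream
    )
    return sae

def collect_activations(prompts, layer=8):
    acts = []
    for prompt in prompts:
        _, cache = model.run_with_cache(prompt)
        # Final-token residual-stream activation for each prompt
        acts.append(cache[f"blocks.{layer}.hook_resid_pre"][0, -1])
    return torch.stack(acts)  # [n_prompts, d_model]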

Phase 3: Circuit Mapping

This final phase integrates the findings from Phases 1 and 2 to create a comprehensive map of the temporal circuit within the language model. By combining the attribution data with the interpretable SAE features, we can trace the flow of information through the model and identify the key components involved in temporal reasoning. It's like putting together a puzzle, where each piece represents a different part of the model, to reveal the complete picture of how temporal information is processed.

The resulting circuit diagram will visually represent the pathway of temporal processing, highlighting the connections between different attention heads, MLPs, and SAE features. This diagram will provide a clear and intuitive understanding of how the model reasons about time, from the initial detection of temporal cues to the final output.

Input
  → [Temporal Keyword Detection Heads]   L1H3, L2H7 (attend to "now", "future")
  → [Integration MLPs]                   L4, L5 (combine cues)
  → [Layer 8 Representation]             (probe location)
  → Output

The ultimate goal is to create a Neuronpedia-style visualization of the temporal circuit, making it accessible and explorable for the wider research community. This visualization will allow others to delve into the details of the circuit and further investigate the model's temporal reasoning capabilities.
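As a concrete starting point for that visualization, the Phase 1 attribution matrix and the Phase 2 feature list can be stitched into a graph structure. The sketch below uses networkx and the 0.1 attribution threshold from the success criteria; the edge structure is schematic (real edges between heads and features would come from further patching experiments), and all names are assumptions.

# A minimal sketch of assembling Phase 1/2 outputs into a circuit graph.
# `attribution` and `temporal_features` come from the earlier phases; the
# edges drawn here are schematic placeholders, not measured connections.
import numpy as np
import networkx as nx

G = nx.DiGraph()
G.add_node("input")
G.add_node("layer8_probe", kind="probe")

# Attention heads that pass the attribution threshold become nodes
for (layer, head), score in np.ndenumerate(attribution):
    if abs(score) > 0.1:
        name = f"L{layer}H{head}"
        G.add_node(name, kind="attn_head", attribution=float(score))
        G.add_edge("input", name)
        G.add_edge(name, "layer8_probe")

# Differentially active SAE features annotate the probe-side representation
for feat in temporal_features:
    name = f"SAE-{feat['idx']}"
    G.add_node(name, kind="sae_feature", direction=feat["direction"])
    G.add_edge(name, "layer8_probe")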


Building on the Shoulders of Giants: Connecting to Existing Research

This research builds upon several existing works in the field of mechanistic interpretability, extending their methodologies to the specific domain of temporal reasoning:

Prior Work                   Extension Here
Factual Recall Circuits      Apply same methodology to temporal reasoning
Refusal Directions           Compare temporal steering vectors to refusal vectors
Truth Probe Generalization   Test temporal probe robustness across domains
SAE Model-Diffing            Find temporal features that differ across model sizes

Technical Foundation: Tools and Resources

To successfully execute this research, several key technical requirements must be met:

  • TransformerLens: A powerful library for circuit analysis and manipulation.
  • SAE Toolkit: Utilizing resources from Neuronpedia and SAELens for Sparse Autoencoder analysis.
  • Existing Temporal Probe: Leveraging a pre-trained temporal probe with high accuracy (97.1% at Layer 8).
  • Temporal Prompts Dataset: A curated dataset of temporal prompts (data/raw/token_temporal_annotations.json).

Tangible Outcomes: Deliverables

This research will produce the following concrete deliverables:

  1. Attribution Matrix: A detailed matrix highlighting the causal relationships between heads/MLPs and temporal classification.
  2. SAE Feature Catalog: A comprehensive catalog of interpretable temporal features, complete with descriptions.
  3. Circuit Diagram: A clear and intuitive visualization of the temporal processing pathway.
  4. Neuronpedia Integration: Seamless integration of findings into Neuronpedia for community exploration.
  5. Robustness Analysis: An analysis of how the temporal circuits adapt across different domains, such as finance and health.

Step-by-Step: Implementation Plan

The implementation of this research will follow a structured plan:

  • [ ] Set up TransformerLens pipeline with temporal probe metric.
  • [ ] Run attention head attribution patching.
  • [ ] Run MLP attribution patching.
  • [ ] Load/train SAE for Layer 8.
  • [ ] Identify top-k temporal features in SAE.
  • [ ] Manually interpret top features.
  • [ ] Map circuit diagram.
  • [ ] Test circuit across OOD temporal prompts.
  • [ ] Document failure modes.
  • [ ] Create Neuronpedia visualization.

Measuring Success: Criteria

The success of this research will be evaluated based on the following metrics:

Metric                    Target
Identified causal heads   ≥3 heads with attribution >0.1
SAE temporal features     ≥10 clearly interpretable features
Circuit faithfulness      Ablating circuit drops accuracy >20%
OOD robustness            Circuit works across 3+ domains
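
For the circuit-faithfulness criterion, one concrete test is to ablate the identified heads and re-score the probe on held-out prompts. The sketch below uses zero-ablation for simplicity (mean-ablation is a common, gentler alternative); circuit_heads, probe, and the prompt/label lists are assumed from earlier phases.

# A minimal sketch of the faithfulness check: zero-ablate the identified
# circuit heads and measure the resulting drop in probe accuracy.
import numpy as np

def ablated_probe_accuracy(model, prompts, labels, circuit_heads):
    layers = sorted({l for (l, _) in circuit_heads})

    def zero_heads(z, hook):
        # z: [batch, pos, head_index, d_head]; zero every circuit head here
        layer = int(hook.name.split(".")[1])
        for (l, h) in circuit_heads:
            if l == layer:
                z[:, :, h, :] = 0.0
        return z

    acts = []
    for prompt in prompts:  # one prompt at a time so the final token is well-defined
        stored = {}

        def store_resid(resid, hook):
            stored["acts"] = resid[0, -1, :].detach().cpu().numpy()
            return resid

        fwd_hooks = [(f"blocks.{l}.attn.hook_z", zero_heads) for l in layers]
        fwd_hooks.append(("blocks.8.hook_resid_post", store_resid))
        model.run_with_hooks(prompt, fwd_hooks=fwd_hooks)
        acts.append(stored["acts"])

    preds = probe.predict(np.array(acts))
    return (preds == np.array(labels)).mean()  # compare to unablated accuracy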

Related Issues

  • Issue #4: Synthetic temporal neurons via SAEs
  • Issue #19: Token-level temporal scoring
  • Issue #20: Temporal logit steering

References

  • Conmy et al. (2023). Towards Automated Circuit Discovery
  • Marks et al. (2024). Sparse Feature Circuits
  • Templeton et al. (2024). Scaling Monosemanticity (Anthropic)
  • Neuronpedia Attribution Graphs

For further exploration of interpretability in large language models, consider visiting the Transformer Circuits Initiative at https://transformer-circuits.pub/. This resource offers a wealth of information on related research and techniques.