Author: Jack Kowalski
https://entropment.com
PATENT PENDING

Modern large language models based on transformer architectures combine several conceptually distinct computational stages. In practice these stages are often implemented within a single neural framework, although not all of them require learning or high-capacity neural representations. In particular, a significant portion of the preprocessing and early representation pipeline can be interpreted as a problem of organizing relational data into a structure that allows efficient aggregation, locality control, and multiscale addressing. When expressed in this way, many of the operations preceding neural inference reduce to deterministic transformations on graphs, metric representations, and hierarchical indices. The purpose of the present work is to isolate and formalize this structural layer. Rather than treating the initial stages of representation as implicit behavior of a neural network, we describe them as an explicit computational framework based on graph projections, metric flows, and hierarchical addressing. This perspective removes a number of implementation heuristics that historically emerged from engineering constraints and replaces them with a direct mathematical formulation. The resulting construction provides a compact and efficient mechanism for organizing relational data before it is passed to downstream neural components, where learned models remain essential. In this sense, the framework can be viewed as an explicit formulation of the structural front-end of transformer-based language models: a deterministic representation layer that organizes information prior to neural inference. From an engineering perspective, this separation allows the expensive neural components to operate on data that has already been organized using deterministic and computationally inexpensive procedures.
======== 1_introduction.txt ======== Conceptual Introduction We consider the problem of constructing a geometry on data sets that are only partially observable and whose structure may evolve as new observations appear. Typical examples include vector embeddings, astronomical catalogs, or any system where objects are represented as vectors relative to a chosen reference frame. In such settings the observed set of data points should not be treated as complete; rather, it should be understood as a finite projection of a potentially much larger space. Let X = {xᵢ}ᵢ∈J , xᵢ ∈ ℝᵈ be the observed data set. The index set J is finite, but we assume it is embedded in a larger index space J ⊂ I which may be infinite. The limitation arises not from the underlying structure but from the finite resolution of observation. In practice this means that the data we see form only a visible subset of a larger space. Objects outside this range behave as “dark tokens”: they exist in principle but remain unobserved within the current resolution. To construct a geometry on X, we introduce a projection point x₀ and represent all data points as vectors anchored at this reference: vᵢ = xᵢ − x₀ . This representation induces a radial projection graph in which every observed point is connected to the projection point. The structure of the graph is therefore determined not by an intrinsic tree but by the choice of projection. Alternative projections produce different induced geometries. The central operation of the framework is local aggregation of neighboring values using a variable exponent operator. For a collection of values y₁, …, yₖ we define R_w(y₁, …, yₖ) = ( ∑_{i=1}^k αᵢ(w) |yᵢ|^w )^{1/w} with parameter w ∈ (0, ∞). The exponent w controls the geometry of aggregation. The case w = 1 acts as a neutral multiplicative point separating two regimes: w < 1 emphasizes diffuse contributions and promotes generalization, w > 1 amplifies dominant contributions and induces selective aggregation. 
This symmetry around the multiplicative unit 1 plays a central structural role. The exponent is not fixed globally. Instead it is derived from the statistical structure of the data set and may vary locally during computation. In practice one may determine bounds Gₘᵢₙ , Gₘₐₓ from the observed data and discretize the range according to the desired resolution. This provides an initial addressing structure that avoids exhaustive search over the entire token space. After this initial stage, the exponent may be modified dynamically to produce local projections and adaptive geometric behavior. Repeated application of the aggregation operator along the projection graph induces a metric flow on the data structure. Distances and paths become relative to the local aggregation regime, meaning that the geometry is context-dependent. In this sense the system behaves analogously to a “next-token map”: each step corresponds to moving through a locally defined neighborhood determined by the current aggregation parameters. An important consequence of the exponentiation is the appearance of natural infinitesimal thresholds. For sufficiently large exponents, contributions below a certain magnitude fall below a practical epsilon and become negligible. This provides an intrinsic mechanism for suppressing weak signals and limiting the effective depth of computation. The framework also allows reprojection of the entire data set with respect to a different reference point. Such a transformation induces a new coordinate system on the same underlying set of vectors. Local reprojections may be used dynamically, while global reprojection corresponds to rewriting the data in a different basis. Finally, the geometry evolves as new observations are incorporated. New tokens extend the observed subset of the space and may shift the relative ordering of existing elements. 
This produces effects analogous to a redshift of earlier elements: previously dominant tokens may gradually move toward the resolution horizon as the corpus grows. The effective horizon is determined by the chosen resolution and can be interpreted as an entropic property of the representation. The purpose of the present work is to formalize this construction as a variable-exponent aggregation framework on projection graphs, to describe the resulting metric structures, and to analyze how local manipulation of the exponent parameter provides a controllable deformation of the geometry of data. ======== 2_data_space_and_graphs.txt ======== 1. Data Space and Projection Graphs 1.1 Observed Data Space Let X = {xᵢ}ᵢ∈J be a set of observed data points with xᵢ ∈ ℝᵈ. The index set J is finite. However, we assume that the observed data represent only a fragment of a larger underlying space. Formally, there exists an index set I such that J ⊂ I. The full data space may therefore be written as X̃ = {xᵢ}ᵢ∈I, which may be infinite or only partially observable. The difference between J and I reflects the finite resolution of observation: the available data correspond to the visible subset of a larger structure. 1.2 Projection Representation To define a geometry on X, we introduce a projection point x₀ ∈ ℝᵈ. Each observed point is represented as a vector anchored at this projection point: vᵢ = xᵢ − x₀. The set V = {vᵢ}ᵢ∈J is the projection representation of the data set with respect to x₀. Different choices of x₀ induce different coordinate descriptions of the same underlying data. In particular, changing the projection point corresponds to a transformation vᵢ⁽ˣ⁰⁾ ↦ vᵢ⁽ˣ⁰′⁾ given by vᵢ⁽ˣ⁰′⁾ = xᵢ − x₀′. 1.3 Projection Graph The projection representation naturally induces a graph structure. Definition 1 (Projection Graph). The projection graph associated with X and projection point x₀ is the graph G = (V_G, E_G) with vertex set V_G = {x₀} ∪ X and edge set E_G = {(x₀, xᵢ) : i ∈ J}. 
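As an illustration, the projection representation and the radial edge set of Definition 1 can be built in a few lines (a minimal sketch; the coordinates and vertex labels are illustrative choices):

```python
# Observed data points x_i in R^2 and a chosen projection point x0
# (coordinates here are illustrative).
X = [(1.0, 2.0), (3.0, 1.0), (0.5, 4.0)]
x0 = (1.0, 1.0)

# Projection representation: v_i = x_i - x0.
V = [tuple(xc - c0 for xc, c0 in zip(x, x0)) for x in X]

# Radial projection graph of Definition 1: vertex set {x0} ∪ X,
# one edge (x0, x_i) per observed point.
edges = [("x0", i) for i in range(len(X))]

print(V)       # [(0.0, 1.0), (2.0, 0.0), (-0.5, 3.0)]
print(edges)   # [('x0', 0), ('x0', 1), ('x0', 2)]
```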
This graph has a radial (star-like) structure: every observed point is connected to the projection point. The projection graph does not encode relationships between points directly. Instead, relationships are reconstructed through local operations defined on neighborhoods of the graph. 1.4 Local Neighborhoods For a vertex v ∈ V_G, we define a local neighborhood N(v) ⊂ V_G. The neighborhood may be determined by a selection rule such as distance ranking, density-based selection, similarity thresholds, or any other locally defined criterion. The choice of neighborhood function N : V_G → P(V_G) is not fixed globally and may depend on the local structure of the data. 1.5 Ordered Local Data Given a neighborhood N(v) = {u₁, …, uₖ}, we associate a locally ordered collection of values y₁, y₂, …, yₖ. The ordering may arise from distance from v, magnitude of vector components, similarity scores, or other application-dependent criteria. The ordering is not required to be global; it is sufficient that it is well-defined within each neighborhood. 1.6 Observational Horizon Because the observed set X is finite, the projection graph represents only a bounded portion of the underlying space. Elements outside this range cannot be accessed through the current representation. We therefore introduce the notion of an observational horizon determined by the resolution of the representation. Informally, the horizon corresponds to the boundary beyond which contributions of additional elements become indistinguishable within the numerical precision or aggregation thresholds used by the system. This concept will become explicit once the aggregation operators and their associated infinitesimal thresholds are introduced. 1.7 Reprojection Let x₀′ be another point in ℝᵈ. Replacing x₀ by x₀′ induces a new projection representation vᵢ′ = xᵢ − x₀′. This operation defines a reprojection map Φ_{x₀ → x₀′} : V → V′ that transforms the coordinate representation of all data points. 
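In coordinates, the reprojection map Φ_{x₀ → x₀′} reduces to a single uniform shift of all anchored vectors, v′ᵢ = vᵢ + (x₀ − x₀′); a minimal sketch (the function name is an illustrative choice):

```python
def reproject(V, x0, x0_new):
    """Apply Phi_{x0 -> x0'}: v_i' = x_i - x0' = v_i + (x0 - x0')."""
    shift = tuple(a - b for a, b in zip(x0, x0_new))
    return [tuple(v + s for v, s in zip(vec, shift)) for vec in V]

# Vectors anchored at x0 = (1, 1); reprojecting to the origin
# recovers the raw coordinates x_i.
V = [(0.0, 1.0), (2.0, 0.0)]
print(reproject(V, (1.0, 1.0), (0.0, 0.0)))   # [(1.0, 2.0), (3.0, 1.0)]
```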
Reprojection changes the induced graph geometry while preserving the underlying data set X. Local reprojection may be used dynamically during computation, while global reprojection corresponds to rewriting the entire data set in a different reference frame. 1.8 Summary The framework introduced in this section defines: 1. an observed data space embedded in a potentially larger structure, 2. a projection-based coordinate representation, 3. a radial projection graph induced by the choice of reference point, 4. locally defined neighborhoods used for subsequent aggregation operations. In the next section we introduce aggregation operators with variable exponent, which define the local metric transformations acting on these neighborhoods. --------------------------------------------------------------
Lemma (Prefix–Ultrametric Equivalence). Let addresses be sequences of symbols a(x) = (a₁, a₂, …) and let L(x,y) denote the length of the common prefix. Then the metric d(x,y) = e^(−λ·L(x,y)) is an ultrametric, and NN(x) = arg max_y L(x,y), i.e. finding the nearest neighbor is equivalent to maximizing the common prefix length. Consequence: a prefix tree is a nearest-neighbor search structure.
Construction of E. Possible strategies:
- k-nearest neighbors
- radius graph
- locality window after sorting
- domain graph (e.g. semantic relations)
Requirement: the graph must only ensure local coupling of information. That is: E is not part of the theory; it is only an instance of the data.
Selection strategy. The exponent w(v) is a data modeling parameter. Examples:
- constant exponent
- local entropy function
- neighborhood density function, e.g. w(v) = f(|N(v)|) or w(v) = 1 + α H(N(v))
The prefix metric induces an ultrametric structure, making the resulting representation equivalent to a hierarchical tree of relations. ======== 3_variable_exponents_and_agregators.txt ======== 2.
Variable Exponent Aggregation Operators 2.1 Local Aggregation Let N(v) = {u₁, …, uₖ} be a neighborhood of a vertex v in the projection graph introduced in Section 1. To each element uᵢ we associate a scalar value yᵢ ∈ ℝ. These values may represent magnitudes of vector components, distances, similarity scores, or other locally defined quantities. To combine these values we introduce a variable exponent aggregation operator. 2.2 Definition of the Aggregation Operator Definition 2 (Variable Exponent Aggregation). For a collection of values y₁, …, yₖ and parameter w ∈ (0, ∞), define the aggregation operator R_w(y₁, …, yₖ) = ( ∑_{i=1}^k αᵢ(w) |yᵢ|^w )^{1/w} where αᵢ(w) ≥ 0 are weighting coefficients satisfying ∑_{i=1}^k αᵢ(w) = 1 unless stated otherwise. The operator generalizes classical ℓ_p-type aggregations while allowing the exponent to vary across the data structure. 2.3 Neutral Multiplicative Point The parameter w determines how strongly dominant values influence the aggregation. The value w = 1 plays a special role as a neutral multiplicative point separating two regimes. For 0 < w < 1 the aggregation emphasizes distributed contributions and tends to smooth differences between values. For w > 1 the aggregation increasingly favors dominant components and suppresses weaker ones. Thus the operator exhibits a symmetry with respect to the multiplicative unit 1, which acts as the transition point between diffusive and selective aggregation regimes. 2.4 Asymptotic Behavior The aggregation operator interpolates between well-known limiting behaviors. As w → 0⁺ the operator approaches a form of geometric averaging (under appropriate normalization of the weights). As w → ∞ the aggregation approaches the dominant value R_w(y₁, …, yₖ) → maxᵢ |yᵢ|. Thus the parameter w continuously deforms the aggregation geometry between diffuse averaging and strict dominance selection. 
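As a concrete numerical illustration of Sections 2.2–2.4 (a minimal sketch; the function name and the uniform default weights αᵢ = 1/k are illustrative choices):

```python
def r_w(values, w, alphas=None):
    """R_w(y_1, ..., y_k) = (sum_i alpha_i * |y_i|**w) ** (1/w),
    with weights defaulting to the uniform normalization 1/k."""
    k = len(values)
    if alphas is None:
        alphas = [1.0 / k] * k
    return sum(a * abs(y) ** w for a, y in zip(alphas, values)) ** (1.0 / w)

ys = [0.1, 0.5, 2.0]
print(r_w(ys, 1.0))    # w = 1: plain weighted average, ≈ 0.8667
print(r_w(ys, 0.5))    # w < 1: diffuse regime, weak values weigh in more
print(r_w(ys, 50.0))   # w >> 1: approaches the dominant value max|y_i| = 2
```

With normalized weights the aggregation is monotone in w, interpolating between the diffuse and selective regimes described above.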
2.5 Infinitesimal Suppression Exponentiation induces a natural mechanism for suppressing small contributions. Let |y| < 1. Then |y|^w → 0 rapidly as w increases. In practical computations we therefore introduce a numerical threshold ε > 0 such that terms satisfying |yᵢ|^w < ε may be neglected. This defines an effective infinitesimal regime: elements whose contribution falls below the threshold behave as computationally negligible quantities. The threshold depends on the numerical precision and the intended resolution of the aggregation process. 2.6 Data-Driven Exponent Selection The exponent parameter is not fixed globally. Instead it is derived from statistical properties of the observed data. Let Gₘᵢₙ , Gₘₐₓ denote the minimal and maximal magnitudes observed within the data set under a chosen measurement scheme. These values determine the practical dynamic range of the system. A discretization of this range can be used to construct a finite set of admissible exponent levels w₁, w₂, …, w_L corresponding to the desired resolution of the representation. This discretization allows efficient addressing of data elements without exhaustive search across the entire token space. 2.7 Local Adaptation of the Exponent After the initial addressing stage, the exponent parameter may be adjusted dynamically. Formally, we allow w = w(v) to depend on the vertex v of the projection graph. More generally, the exponent may depend on the local configuration of values within the neighborhood: w = w(v, N(v), y₁, …, yₖ). This flexibility allows the aggregation operator to adapt to local structural features of the data. 2.8 Interpretation as Metric Deformation The operator R_w induces a deformation of the geometry of the data space. Different values of w correspond to different aggregation geometries: small exponents produce diffuse averaging behavior, while large exponents generate selective dominance regimes. 
Consequently, varying w modifies the effective distances between data elements when aggregation operators are composed along paths in the projection graph. In this sense the exponent parameter acts as a control variable for the local metric structure of the data. 2.9 Iterated Aggregation Applying the aggregation operator repeatedly across neighborhoods of the projection graph generates a sequence x⁽ˡ⁺¹⁾(v) = R_{w(v)} ( {x⁽ˡ⁾(u) : u ∈ N(v)} ). This iterative process defines a dynamic transformation of the data representation. Each step aggregates local information while potentially modifying the local geometry through the choice of exponent. Such iterated applications will later be interpreted as a metric flow on the projection graph. 2.10 Summary The variable exponent aggregation operator introduced in this section provides: 1. a local mechanism for combining neighboring values, 2. a continuous control parameter w governing aggregation geometry, 3. a natural suppression mechanism for small contributions, 4. a framework for dynamic and data-driven adjustment of the aggregation regime. These properties allow the operator to serve as the fundamental building block for constructing adaptive metric structures on projection graphs. The next section describes how these operators generate path-dependent metric structures and flows on the graph. ======== 4_metric_flow.txt ======== 3. Metric Flow on Projection Graphs 3.1 Local Aggregation Dynamics Let G = (V_G, E_G) be the projection graph introduced in Section 1, and let N(v) denote the neighborhood of a vertex v. Using the aggregation operator defined in Section 2, we define an iterative update rule on scalar values associated with the vertices of the graph. Let x⁽⁰⁾ : V_G → ℝ be an initial assignment of values to vertices. The aggregation update at step l+1 is defined by x⁽ˡ⁺¹⁾(v) = R_{w(v)} ( {x⁽ˡ⁾(u) : u ∈ N(v)} ). 
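The update rule above can be sketched as a direct iteration (a toy example assuming a uniform-weight instance of R_w; the graph, exponent field, and initial values are illustrative):

```python
def r_w(values, w):
    # uniform-weight instance of the aggregation operator R_w
    k = len(values)
    return (sum(abs(y) ** w for y in values) / k) ** (1.0 / w)

# A toy neighborhood structure N(v) and local exponent field w(v).
N = {0: [0, 1, 2], 1: [0, 1], 2: [1, 2]}
w = {0: 0.5, 1: 1.0, 2: 4.0}

x = {0: 1.0, 1: 0.2, 2: 3.0}   # initial assignment x^(0)
for step in range(3):          # three iterations x^(l) -> x^(l+1)
    x = {v: r_w([x[u] for u in N[v]], w[v]) for v in N}
    print(step + 1, x)
```

Each vertex aggregates its neighborhood under its own exponent, so the sequence of states depends jointly on the topology and on the exponent field.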
This rule aggregates information from the local neighborhood of each vertex using the variable exponent operator. 3.2 Aggregation Paths The iterative process induces propagation of information along paths in the projection graph. Let γ = (v₀, v₁, …, vₘ) be a path in G such that v_{i+1} ∈ N(v_i). Along such a path, the aggregation process generates a sequence of intermediate values x⁽⁰⁾(v₀), x⁽¹⁾(v₁), …, x⁽ᵐ⁾(vₘ). Each step involves aggregation with a potentially different exponent w(v_i). Thus the effective transformation along the path depends on the entire sequence {w(v₀), …, w(vₘ)}. 3.3 Path-Dependent Geometry Because the exponent parameter may vary across vertices, the effective transformation between two vertices is not determined solely by their positions in the graph. Instead, it depends on the chosen path. Let v_a, v_b ∈ V_G. Different paths γ₁, γ₂ : v_a → v_b may produce different aggregated values due to different sequences of exponents and neighborhoods encountered along the paths. Consequently, the induced geometry is path-dependent. This differs from classical metric spaces, where distances depend only on endpoints. 3.4 Local Metric Deformation At each vertex v, the exponent parameter w(v) determines the local aggregation regime. Since the operator R_w interpolates between diffuse averaging and dominance selection, the exponent effectively controls the local deformation of the metric structure. Regions of the graph with small exponents behave diffusively, while regions with large exponents exhibit strong selection effects. Thus the projection graph becomes equipped with a locally varying aggregation geometry. 3.5 Metric Flow Interpretation The iterative aggregation process x⁽ˡ⁺¹⁾ = R_{w(·)}(x⁽ˡ⁾) may be interpreted as a flow on the projection graph. Each iteration propagates information through neighborhoods while modifying the effective geometry through the local exponent values. 
We therefore refer to the sequence {x⁽ˡ⁾}_{l=0}^∞ as a metric flow on the projection graph. The flow reflects both the topology of the graph and the local aggregation regimes determined by the exponent function. 3.6 Local Traversal Interpretation From a computational perspective, each aggregation step corresponds to selecting and combining information from a local neighborhood. The sequence of updates along a path v₀ → v₁ → ⋯ → vₘ may therefore be interpreted as a traversal of the graph guided by local aggregation rules. In contexts such as embedding spaces or token representations, such traversals can be viewed as transitions between neighboring elements under locally defined aggregation dynamics. 3.7 Stability Through Infinitesimal Suppression The suppression mechanism described in Section 2 ensures that contributions below the effective threshold ε rapidly vanish during repeated aggregation. This property stabilizes the flow by preventing the accumulation of insignificant contributions. As a result, the effective depth of influence along a path remains bounded in practice even when the underlying graph is large or potentially unbounded. 3.8 Resolution and Effective Horizon Because contributions below the threshold become negligible, the aggregation process has a finite effective range determined by the exponent values, the threshold ε, and the resolution of the numerical representation. Vertices beyond this effective range do not influence the aggregation outcome. This phenomenon defines an effective horizon of the metric flow: a boundary beyond which additional elements cannot affect the current computation. 3.9 Reprojection Effects Changing the projection point x₀ modifies the graph representation and therefore the structure of neighborhoods. Since the metric flow operates through these neighborhoods, reprojection induces a transformation of the aggregation dynamics. 
Global reprojection corresponds to redefining the coordinate system of the entire data set, while local reprojection may be used to explore alternative geometric descriptions of a given region of the data space. 3.10 Summary The variable exponent aggregation operators introduced in Section 2 generate a dynamic process on the projection graph. This process has the following structural properties: 1. information propagates along graph paths through local aggregation, 2. the exponent parameter induces local deformations of aggregation geometry, 3. the resulting transformations depend on the chosen path through the graph, 4. infinitesimal suppression introduces a natural computational horizon. Together these features define a metric flow on projection graphs, providing a flexible framework for constructing adaptive geometries on partially observed data spaces. ======== 5_adressing_resolutions_and_scales.txt ======== 4. Addressing, Resolution, and Data Horizons 4.1 Observational Scale The aggregation framework introduced in the previous sections operates on data whose observable range is finite. Let Gₘᵢₙ , Gₘₐₓ denote the minimal and maximal magnitudes encountered in the observed data set under a chosen measurement scheme. These values define the observable scale of the system. Conceptually, this scale arises from the ratio observable size / smallest resolvable scale. Such scale relations appear naturally in physical measurement processes, where the smallest measurable variation determines the effective resolution of the representation. In practice, the theoretical minimal scale does not necessarily coincide with the scale used for sorting or indexing data. Instead, an operational resolution must be introduced. 
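The link between the suppression threshold ε of Section 2.5 and such an operational resolution can be made concrete: a term |y|^w drops below ε exactly when |y| < ε^(1/w), so for sub-unit magnitudes a larger exponent widens the negligible band (a sketch with an illustrative threshold value):

```python
EPS = 1e-12   # numerical threshold of Section 2.5 (illustrative value)

def cutoff(w, eps=EPS):
    """Magnitude below which |y|**w < eps, i.e. the term is negligible."""
    return eps ** (1.0 / w)

for w in (1, 2, 8, 32):
    print(w, cutoff(w))
# As w grows, the cutoff rises toward 1: an ever wider band of sub-unit
# magnitudes falls below the operational resolution.
```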
4.2 Normalized Magnitude Representation To obtain a bounded representation of magnitudes we introduce a nonlinear normalization function u : ℝ → (0,1) defined by u(x) = ¹/₂ (1 + tanh(α x)) (1 − δ) where α > 0 controls the compression of the input scale, δ is a small global correction parameter. This transformation has several desirable properties: 1. the image of the function is bounded within (0,1), 2. large magnitudes saturate smoothly, 3. small magnitudes remain distinguishable near the origin. The normalized quantity u(x) therefore acts as a bounded representation of the observed magnitude. 4.3 Logarithmic Resolution To convert the normalized representation into a scale suitable for addressing we introduce the transformation λ(x) = − ln u(x). Since u(x) ∈ (0,1), the transformed quantity satisfies λ(x) ∈ (0, ∞). This mapping converts multiplicative differences in the original scale into additive differences in the logarithmic domain. The resulting representation is particularly suitable for discretization. 4.4 Discrete Resolution Levels Let L be the desired number of resolution levels. Define the logarithmic interval [λₘᵢₙ, λₘₐₓ] with λₘᵢₙ = − ln u(Gₘₐₓ), λₘₐₓ = − ln u(Gₘᵢₙ). The interval length Δ = λₘₐₓ − λₘᵢₙ determines the dynamic range of the representation. We define the resolution step ε = Δ / L. This step determines the effective discretization of the logarithmic scale. 4.5 Discrete Address Levels Each value x is mapped to an integer index n(x) = ⌊ λ(x) / ε ⌋ . The index n(x) represents the discrete level of the element within the logarithmic scale. Only indices within the interval nₘᵢₙ ≤ n(x) ≤ nₘₐₓ are relevant for the observed data set, where nₘᵢₙ = ⌈ λₘᵢₙ / ε ⌉ , nₘₐₓ = ⌊ λₘₐₓ / ε ⌋ . The number of effectively occupied levels is therefore R = nₘₐₓ − nₘᵢₙ + 1. 4.6 Address Construction The integer index n(x) can be converted into a symbolic representation by expressing it in a chosen base b ≥ 2. 
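The normalization, logarithmic scaling, and discretization steps of Sections 4.2–4.5, together with a base-36 address, compose into a short pipeline (a minimal sketch; α, δ, the step size, and the digit alphabet are illustrative choices):

```python
import math
import string

ALPHA, DELTA = 0.5, 1e-6                         # illustrative parameters
DIGITS = string.digits + string.ascii_lowercase  # base-36 alphabet

def u(x):
    """Bounded normalization u(x) = (1/2)(1 + tanh(alpha*x))(1 - delta)."""
    return 0.5 * (1.0 + math.tanh(ALPHA * x)) * (1.0 - DELTA)

def lam(x):
    """Logarithmic resolution lambda(x) = -ln u(x), always positive."""
    return -math.log(u(x))

def address(x, step, base=36):
    """Discrete level n(x) = floor(lambda(x)/step), written in base b."""
    n = int(lam(x) // step)
    digits = ""
    while True:
        n, d = divmod(n, base)
        digits = DIGITS[d] + digits
        if n == 0:
            return digits

# Larger magnitudes get smaller lambda, hence lower address levels.
print(address(0.0, 0.01), address(2.0, 0.01))
```

Elements whose λ values fall close together receive equal or adjacent levels n(x) and hence tend to share address prefixes, which is the property exploited by the hierarchical address tree below.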
Let n(x) = ∑_{k=0}^m d_k b^k with digits d_k ∈ {0,1,…,b−1}. The sequence (d_m, …, d_0) constitutes an address string representing the position of the element in the discretized scale. In practical implementations bases such as b = 36 are convenient because they allow compact symbolic representations using alphanumeric characters. 4.7 p-Adic Interpretation The address construction above can be interpreted as a representation in a positional system analogous to a p-adic expansion. Instead of storing explicit ordering relations between tokens, one may construct a tree structure whose nodes correspond to prefixes of these digit sequences. Such a structure acts as a hierarchical address tree mapping addresses to data elements. This provides an efficient alternative to exhaustive search through the entire token space. 4.8 Dynamic Extension and Redshift As new data points are incorporated into the system, the observable range [Gₘᵢₙ, Gₘₐₓ] may expand. Consequently, the discretized representation must adapt to the new scale. This process may shift the relative positions of previously observed elements within the logarithmic scale. In particular, elements that were initially dominant may move toward larger index values as the data set grows. This gradual displacement can be interpreted as a redshift of earlier tokens: their relative position moves toward the outer regions of the representation as new elements appear. 4.9 Dark Tokens Elements that lie outside the currently observable range cannot be resolved within the present discretization. Such elements may exist in the underlying data space but remain invisible to the current representation. We refer to these elements as dark tokens. The presence of dark tokens reflects the fact that the observable data set represents only a finite window into a potentially larger structure. 4.10 Entropic Data Horizon The discretization process introduces an effective boundary beyond which additional distinctions cannot be represented. 
This boundary defines a data horizon determined by the resolution parameter L, the logarithmic step ε, and the numerical precision of the representation. From an information-theoretic perspective the horizon corresponds to the maximum entropy that can be represented within the chosen resolution. Elements beyond this horizon cannot influence the aggregation dynamics because their contributions fall below the effective resolution of the system. 4.11 Summary The addressing scheme introduced in this section provides a practical mechanism for organizing data elements within the aggregation framework. The construction combines: 1. nonlinear normalization of magnitudes, 2. logarithmic scaling, 3. discrete resolution levels, 4. symbolic addressing through positional representations. Together these elements produce a hierarchical representation that allows efficient navigation of large data sets while remaining compatible with the variable-exponent aggregation operators introduced earlier. ======== 6_exponent_control_and_metric_manipualtion.txt ======== 5. Exponent Control and Metric Manipulation 5.1 Exponent Field The aggregation operators introduced earlier depend on a variable exponent w(v) defined on the data space V ⊂ ℝᵈ. The function w : V → ℝ⁺ acts as a control field that modulates the contribution of individual elements during aggregation. Unlike fixed-exponent constructions, the exponent here is not constant across the data space. Instead it may depend on the magnitude of the vector, its position within the projection graph, or external control parameters. Consequently, the exponent field introduces a spatially varying deformation of the aggregation geometry. 5.2 Aggregation with Variable Exponent Let {xᵢ}, i = 1, …, N, be a set of vectors associated with nodes of a projection graph. The aggregated quantity at node v can be expressed abstractly as A(v) = ∑_{i=1}^N Φ(xᵢ)^{w(v)} where Φ(x) denotes the magnitude operator introduced previously. 
The exponent therefore controls the relative amplification or suppression of contributions. Small variations in w(v) may strongly modify the effective influence of different elements. 5.3 Symmetry Around the Neutral Exponent The value w = 1 plays a special role. When w(v) = 1 the aggregation operator reduces to the linear regime A(v) = ∑_{i=1}^N Φ(xᵢ). Thus the exponent w = 1 represents a neutral configuration in which the aggregation does not introduce nonlinear distortion. Values w > 1 increase the dominance of large magnitudes, while w < 1 enhances the relative contribution of smaller elements. The neighborhood of w = 1 therefore acts as a symmetry point separating two qualitatively different aggregation regimes. 5.4 Metric Deformation Because the exponent modifies the contribution of magnitudes, it effectively alters the geometry induced by the aggregation operator. Let d(x, y) denote the base distance measure defined on the embedding space. Under the influence of the exponent field the effective distance becomes d_w(x, y), which depends implicitly on w(v). In this sense the exponent field produces a metric deformation of the data space. Regions where w(v) differs significantly from unity exhibit altered sensitivity to differences in magnitude. 5.5 Local Metric Control The exponent field may be defined locally on the projection graph. Thus different regions of the data space may operate under different effective geometries. For example: regions with w(v) > 1 emphasize dominant signals, while regions with w(v) < 1 emphasize weak contributions. This local modulation enables the system to adapt to heterogeneous structures within the data. In practical applications the exponent field may be derived from local statistics or from properties of the addressing structure introduced earlier. 5.6 Relation to Resolution The exponent field interacts with the resolution mechanism described in the previous section. 
Recall that the addressing scheme depends on the logarithmic quantity λ(x) = − ln u(x). Changing the exponent effectively modifies the scaling of magnitudes prior to discretization. Consequently, variations in w(v) may shift elements between resolution levels. This effect introduces a controlled drift in address space analogous to a change of observational scale. 5.7 Infinitesimal Sensitivity Consider small variations w(v) = 1 + ε. Expanding the aggregation operator around w = 1 yields Φ(x)^{1+ε} = Φ(x) (1 + ε ln Φ(x) + O(ε²)). Thus the first-order correction depends on ln Φ(x). This logarithmic term reveals how infinitesimal changes of the exponent induce systematic biases toward particular magnitude scales. Such sensitivity provides a mechanism for fine-grained control of aggregation behavior. 5.8 Dynamic Adjustment The exponent field may evolve during the operation of the system. Possible adjustment mechanisms include adaptation to local density of data points, response to changes in the addressing horizon, and external control parameters. Dynamic adjustment enables the system to balance two competing objectives: 1. stability of the aggregation structure, 2. responsiveness to newly introduced data. 5.9 Emergent Structural Regimes When the exponent field varies across the data space, the system may exhibit qualitatively different regimes of behavior. For example: strong exponent amplification may lead to concentration of influence around a small number of dominant nodes, while reduced exponent values may produce diffuse aggregation patterns. The transition between these regimes may occur abruptly when the exponent crosses certain thresholds. Such behavior resembles phase-transition phenomena observed in complex systems. In particular, the structure is reminiscent of energy landscapes encountered in spin-glass models. A detailed analysis of this analogy lies outside the scope of the present text and will be discussed separately. 
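The first-order expansion in 5.7 is easy to verify numerically (illustrative values of Φ and ε):

```python
import math

phi, eps = 3.0, 1e-4
exact = phi ** (1 + eps)
approx = phi * (1 + eps * math.log(phi))   # first-order expansion

# The discrepancy is O(eps^2), orders of magnitude below eps itself.
print(exact - approx)
```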
5.10 Summary
The exponent field introduces a mechanism for controlling the geometry of aggregation on projection graphs. Its main roles are: 1. modulation of contributions during aggregation, 2. deformation of the effective metric structure, 3. interaction with the addressing and resolution mechanisms. Through these effects the exponent field transforms the aggregation framework into a dynamically controllable geometry of data space.

======== 7_operational_algebra_on_projection_graphs.txt ========

6. Operational Algebra on Projection Graphs

6.1 Projection Graph as the Primary Object
Let G = (V, E) be a projection graph. Nodes v ∈ V represent observed data elements (e.g., tokens), while edges (vᵢ, vⱼ) ∈ E represent relations derived from projection operations. Each node carries a vector-valued attribute xᵥ ∈ ℝᵈ. These vectors do not define a physical space. Instead they represent coordinate descriptions of relational observations. Thus the underlying structure of the system is the graph G, not the embedding space. The embedding is only an intermediate representation used to compute relational quantities.

6.2 Projection Operator
Let Pₚ : V → ℝ be a projection operator defined with respect to a reference node p. For a node v, Pₚ(v) measures the relation between v and the projection center p. Different choices of p produce different relational descriptions of the same graph. Thus the system admits multiple observational frames. Changing the projection center corresponds to a transformation of coordinates rather than a modification of the underlying structure.

6.3 Aggregation Operator
For a node v, define the aggregation operator A(v) = ∑_{u ∈ N(v)} Φ(xᵤ)^{w(v)} where N(v) is the neighborhood of v, Φ is the magnitude operator, and w(v) is the exponent control field introduced earlier. This operator combines information from neighboring nodes according to the local metric deformation induced by w(v).
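A minimal sketch of the aggregation operator A(v) on a toy graph. The Euclidean norm stands in for the abstract magnitude operator Φ, and the adjacency and attribute values are illustrative assumptions:

```python
import math

def phi(x):
    # magnitude operator: Euclidean norm (an assumed concrete choice for Phi)
    return math.sqrt(sum(c * c for c in x))

def aggregate(v, attrs, neighbors, w):
    """A(v) = sum over u in N(v) of Phi(x_u)^{w(v)}."""
    return sum(phi(attrs[u]) ** w[v] for u in neighbors[v])

# toy projection graph: attributes, neighborhoods, and exponent field
attrs = {"a": (3.0, 4.0), "b": (1.0, 0.0), "c": (0.0, 2.0)}
neighbors = {"a": ["b", "c"], "b": ["a"], "c": ["a"]}
w = {"a": 1.0, "b": 2.0, "c": 0.5}

print(aggregate("a", attrs, neighbors, w))  # 1 + 2 = 3 in the neutral regime
print(aggregate("b", attrs, neighbors, w))  # 5^2 = 25 under amplification
```

In the neutral regime w(v) = 1 the operator is a plain sum of neighbor magnitudes; for w(v) > 1 the dominant neighbor's contribution is amplified.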
Aggregation therefore acts as a local information flow on the projection graph.

6.4 Metric Flow Operator
Define the metric flow operator F which maps node states to updated states xᵥ⁽ᵗ⁺¹⁾ = F( xᵥ⁽ᵗ⁾ , N(v) ). The operator consists of two stages: 1. aggregation of neighboring contributions, 2. normalization and projection back into the representation domain. Iterating the flow x⁽⁰⁾ → x⁽¹⁾ → x⁽²⁾ → … produces an evolving relational structure on the graph. The dynamics reflect the interaction between addressing resolution, exponent control, and projection geometry.

6.5 Address-Space Operations
Each node possesses an address a(v) derived from the logarithmic discretization described in the previous section. Operations may therefore be performed directly in address space rather than in vector space. Typical operations include:
Address comparison: a(vᵢ) ≈ a(vⱼ), indicating that two nodes fall within the same resolution region.
Prefix relations: nodes sharing address prefixes correspond to nearby regions of the logarithmic scale.
Hierarchical grouping: address prefixes define a natural tree structure used for efficient retrieval and aggregation.

6.6 Projection Re-centering
Let p ∈ V be a projection center. A change of observational frame corresponds to selecting another node p′ ∈ V. Formally this induces a transformation Pₚ → P_{p′}. Since addressing and aggregation depend on projections, the transformation propagates through the entire system. This operation may be interpreted as relocating the origin of observation. Importantly, the underlying graph G remains unchanged.

6.7 Local Metric Manipulation
The exponent field w(v) provides a mechanism for modifying the local geometry of aggregation. Operations on the exponent field therefore induce metric transformations on the graph. Examples include amplification of dominant nodes, diffusion of weak signals, and selective focusing on specific regions of the graph.
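The manipulations above can be illustrated on a single neighborhood. A minimal sketch (magnitudes are illustrative; normalized shares are used only to make the focusing effect visible):

```python
def influence_shares(magnitudes, w):
    """Normalized contribution of each neighbor under exponent-w aggregation."""
    powered = [m ** w for m in magnitudes]
    total = sum(powered)
    return [p / total for p in powered]

mags = [8.0, 2.0, 1.0, 1.0]  # one dominant neighbor among weak ones
for w in (0.5, 1.0, 3.0):
    dominant = influence_shares(mags, w)[0]
    print(w, round(dominant, 3))
```

Raising w concentrates influence on the dominant node (selective focusing), while lowering it below 1 diffuses influence toward the weak signals.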
These transformations alter the effective influence structure without modifying the graph topology. 6.8 Resolution Transformation Resolution parameters determine the discretization step ε used for address construction. Changing this parameter produces a transformation a(v) → a′(v) which corresponds to observing the graph under a different resolution scale. This operation may reveal previously hidden structures or merge previously distinct regions. 6.9 Emergent Geometry Although the formalism operates purely on graph relations and scalar values, an effective geometry may emerge from the interaction of projections, addressing, and aggregation. Distances, neighborhoods, and clusters arise as secondary constructs induced by the operational algebra. Thus the apparent geometry of the data is not assumed but produced by the dynamics of the system. 6.10 Algebra of Operations The system admits a set of fundamental operations: 1. Projection change Pₚ → P_{p′} 2. Exponent transformation w(v) → w′(v) 3. Resolution change ε → ε′ 4. Metric flow iteration Fᵗ These operations form an operational algebra acting on the projection graph. Compositions of these operations define transformations of the relational structure observed in the data. Map of the Construction The framework introduced in the preceding sections can be summarized as follows. Section 1 — Data Space and Projection Graphs Defines the primary object: projection graph G = (V, E) and introduces vector attributes used as observational coordinates. The embedding space serves only as a computational representation. Section 2 — Variable Exponent Aggregation Operators Introduces nonlinear aggregation operators controlled by an exponent field w(v). These operators define how local information is combined. Section 3 — Metric Flow on Projection Graphs Defines iterative flows that propagate information across the graph. The dynamics arise from repeated aggregation and normalization. 
Section 4 — Addressing, Resolution, and Data Horizons Introduces logarithmic discretization and address construction. Addresses organize nodes into a hierarchical structure and determine the observable resolution.
Section 5 — Exponent Control and Metric Manipulation Shows how the exponent field modifies the effective geometry of aggregation. This enables local control of the metric properties of the system.
Section 6 — Operational Algebra Combines the previous components into a coherent set of operations acting on the projection graph. The system can now perform transformations such as changing observational frames, modifying metric structure, adjusting resolution, and evolving relational states through metric flow.

Interpretation
The framework does not assume a pre-existing geometric space. Instead it provides an operational mechanism through which geometric structures emerge from relational observations. The apparent geometry of the data therefore reflects the interaction between projections, aggregation operators, resolution mechanisms, and metric control fields.

======== 8_Appendix_A_Minimal_Implementation_Model.txt ========

Appendix A – Minimal Implementation Model

A.1 Conceptual Setting
We consider a set of observations D = {vᵢ}ᵢ₌₁ᴺ where each element vᵢ ∈ ℝᵈ is a feature vector describing the observation relative to a chosen projection point. The interpretation of the data remains arbitrary: semantic embeddings (e.g., GloVe-300D), directions and brightnesses of astronomical objects, particle configurations in a gas, amplitudes of quantum states, or effective coordinates in MERA / cMERA-type constructions. Key constructive assumption: The vectors do not represent points in a space. They merely describe observational relations with respect to the projection point. Geometric space appears only as an emergent structure of the projection graph.
A.2 Projection Coordinates
For each coordinate x we apply the mantissa transformation u(x) = ¹/₂ (1 + tanh(α x)) (1 − δ) where
• α – local scaling parameter
• δ – numerical boundary regulator
and the logarithmic transformation ℓ(x) = − ln u(x). Interpretation: large values of x are compressed asymptotically, local dynamics remains differentiable, and the range of values becomes quasi-logarithmic, which enables addressable discretization.

A.3 Resolution Levels
Let Gₘᵢₙ , Gₘₐₓ denote the minimal and maximal observed value in the dataset. We introduce resolution L, which determines the number of addressing levels. Logarithmic range: ℓ_high = − ln(u(Gₘᵢₙ)), ℓ_low = − ln(u(Gₘₐₓ)), R = ℓ_high − ℓ_low; local resolution: ε = R / L. Discrete level indices: n = ⌊ ℓ(x) / ε ⌋ which defines the resolution layer of the observation.

A.4 Address Construction
For a vector v = (x₁, x₂, …, x_d) we define the level index for each coordinate nₚ = ⌊ − ln(u(xₚ)) / ε ⌋ Then we represent it in base B, nₚ = ∑_{k=0}^m aₖ B^k where aₖ ∈ {0, …, B−1}. The global address is obtained by concatenating the digits of all coordinates A(v) = (a_{0,1} a_{1,1} …) (a_{0,2} a_{1,2} …) … (a_{0,d} a_{1,d} …) which creates a p-adic address tree.

A.5 Projection Graph
Based on the addresses we build the graph G = (V, E) where V = {vᵢ} and an edge exists when A(vᵢ) ~ A(vⱼ), i.e. the addresses agree up to a given tree depth. In practice this means: d_A(vᵢ, vⱼ) < k where k is the resolution level. This gives rise to a tree-like structure with local data clusters naturally representing a hierarchy of density.

A.6 Emergent Clusters
As resolution L is increased, new tree nodes appear and existing nodes split into subclusters. Interpretatively: in astronomical data this corresponds to the structure star → cluster → galaxy; in linguistic embeddings, token → context → semantic cluster; i.e. the hierarchy is a property of the projection resolution, not of the data itself.
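The transformations of A.2–A.4 combine into a short sketch. The parameter values α, δ, and the resolution L = 8 are assumed for illustration:

```python
import math

ALPHA, DELTA = 1.0, 1e-6   # assumed scaling parameter and boundary regulator

def u(x):
    # mantissa transformation: compresses the real line into (0, 1)
    return 0.5 * (1.0 + math.tanh(ALPHA * x)) * (1.0 - DELTA)

def ell(x):
    # logarithmic transformation; decreasing in x since u is increasing
    return -math.log(u(x))

def level(x, eps):
    # discrete resolution layer n = floor(ell(x) / eps)
    return int(ell(x) // eps)

def digits(n, base, width):
    # base-B expansion of a level index, least-significant digit first
    out = []
    for _ in range(width):
        out.append(n % base)
        n //= base
    return out

data = [-2.0, -0.5, 0.0, 0.5, 2.0]
ell_low, ell_high = ell(max(data)), ell(min(data))
eps = (ell_high - ell_low) / 8          # L = 8 addressing levels
levels = [level(x, eps) for x in data]
print(levels)                            # indices decrease as x increases
print(digits(levels[0], base=4, width=3))
```

Concatenating the per-coordinate digit strings then yields the global address A(v) of A.4.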
A.7 Minimal Processing Procedure
The minimal data processing algorithm has the form:
Step 1 - Dataset scan: determine Gₘᵢₙ, Gₘₐₓ for the entire dataset.
Step 2 - Resolution calibration: compute ε = [− ln u(Gₘᵢₙ) + ln u(Gₘₐₓ)] / L
Step 3 - Address computation, for each vector: 1. u(x) transformation 2. computation of ℓ(x) 3. computation of index n 4. representation in base B
Step 4 - Tree construction: insert the address into the p-adic tree.
Step 5 - Local neighborhood queries: nearest neighbors are found via common address prefix and local graph aggregation.

A.8 Changing Projection Point
Each projection point defines a different geometric representation of the data. Changing the projection point: P → P′ means re-expressing the coordinates vᵢ → vᵢ′ which leads to a new projection graph G′ = (V, E′). Important - different projections can lead to different effective metrics.

A.9 Metric Gauge Interpretation
The parameter controlling the aggregation exponent λ can be interpreted as a gauge of the projective metric. Changing λ₁ → λ₂ causes a change in aggregation geometry, a change in neighborhood relations, and a redistribution of graph density. Consequently, different values of λ can lead to equivalent descriptions of the same data structure.

A.10 Epicycle Analogy
Classical astronomical analogy: the epicycle model and the heliocentric model describe the same orbital dynamics. The difference concerns only the computational complexity of the representation. Analogously: different projective metrics λ₁, λ₂, … can describe the same data structure with different computational cost. The goal of the construction is not to identify the “true geometry”, but to find the algebraically simplest representation.

A.11 Practical Interpretation
The minimal model enables: 1. scalable indexing of very large datasets 2. local neighborhood search 3. analysis of hierarchical data structure 4. manipulation of the metric via the aggregation exponent parameter.
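Steps 4 and 5 of the procedure above can be sketched with a prefix dictionary standing in for the p-adic address tree. The example addresses are assumed to have been produced by Step 3:

```python
from collections import defaultdict

def insert(tree, address, item):
    """Step 4: register an item under every prefix of its address."""
    for depth in range(1, len(address) + 1):
        tree[address[:depth]].append(item)

def neighbors(tree, address, k):
    """Step 5: nodes sharing the address prefix up to depth k."""
    return tree.get(address[:k], [])

tree = defaultdict(list)
addresses = {"v1": (0, 2, 1), "v2": (0, 2, 3), "v3": (1, 0, 0)}
for name, addr in addresses.items():
    insert(tree, addr, name)

print(neighbors(tree, (0, 2, 1), 2))  # ['v1', 'v2'] share the prefix (0, 2)
```

Shallower prefix depth k corresponds to coarser resolution: more nodes fall into the same neighborhood.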
The key property is the fact that geometry is not assumed, but emerges from operations on the projection graph.

======== 9_Appendix_B_Relation_to_Known_Mathematical_Structures.txt ========

Appendix B – Relation to Known Mathematical Structures

B.1 General remark
The construction presented in this work was formulated independently as an operational tool for manipulating relations in large datasets via projection graphs and variable-exponent aggregation. However, during formalization it became apparent that many elements of this construction exhibit strong structural similarity to known mathematical and physical structures. In this appendix we indicate the most important of these connections. Proofs of equivalence, isomorphisms, and precise formal mappings are not developed here. They will be presented in a separate part of the work along with full proofs. The purpose of this appendix is merely to point out the mathematical contexts in which similar structures already appear.

B.2 p-adic Structures
Data addressing in the projection construction uses representation in base B and a hierarchical prefix structure. For coordinates nₚ = ⌊ − ln(u(xₚ)) / ε ⌋ and their representation in base B, nₚ = ∑_{k=0}^m aₖ B^k, a natural tree structure arises. The address metric based on the length of the common prefix d_A(vᵢ, vⱼ) is structurally consistent with the ultrametric characteristic of p-adic spaces. In particular: the distance depends on the deepest common tree node, and the ultrametric inequality d(x, z) ≤ max(d(x, y), d(y, z)) is satisfied. The structure of the projection graph can therefore be interpreted as a realization of an ultrametric data space. A precise mapping between projection addressing and the p-adic metric will be presented in the proof part of the work.

B.3 Ultrametric Geometry and Hierarchical Clustering
The ultrametric properties of the projection graph lead to a natural hierarchical structure.
In particular: clusters arise via common address prefixes, and resolution L acts as a scale parameter. Increasing resolution causes successive splitting of clusters. This gives rise to a hierarchy similar to dendrograms used in data analysis, but without the need for global optimization and without defining a Euclidean metric. The hierarchy emerges directly from the projection construction.

B.4 Renormalization Structures
The process of increasing resolution L → L′ leads to successive revealing of local structures in the data. This mechanism is formally analogous to renormalization procedures used in statistical physics and field theory. In particular: graph aggregation corresponds to integration of degrees of freedom, and increasing resolution corresponds to moving to lower length scales. A natural interpretation of the projection graph arises as a renormalization tree. These relations show similarity to structures used in tensor networks of MERA type.

B.5 MERA and cMERA Analogies
The tree-like structure of the projection graph and variable-scale aggregation resemble constructions used in the Multi-scale Entanglement Renormalization Ansatz (MERA) and Continuous MERA (cMERA). In these constructions: the state space is built hierarchically, and aggregating operators act at successive scale levels. In the projection graph: nodes represent local data aggregations, and transitions between resolution levels play the role of renormalization transformations. Formal connections between the aggregation operator and disentangling operators in MERA will be discussed later in the work.

B.6 Spin Glass Energy Landscapes
Changing the aggregation exponent λ causes a change in the effective metric of the data space. In many cases this leads to abrupt changes in the topology of the projection graph. This phenomenon resembles: reorganization of the energy landscape in spin glass models, and transitions between different configurations of local minima.
In particular: small changes in the parameter can lead to sudden reorganizations of cluster structure, and regions of metric stability appear. These relations suggest an interpretation of the exponent parameter as a phase parameter of data geometry.

B.7 Information Geometry
The mantissa transformation u(x) = ¹/₂ (1 + tanh(α x)) (1 − δ) introduces nonlinear compression of the value space. After passing to the logarithmic representation ℓ(x) = − ln(u(x)) the coordinate space becomes naturally consistent with many constructions from information geometry. In particular: distances take on a quasi-logarithmic character, and local dynamics remains differentiable. The interpretation of this transformation as a change of coordinate system in information space will be developed in the proof part of the work.

B.8 Variable Exponent Spaces
The aggregation operator with dynamic exponent A_λ(·) leads to structures resembling function spaces with variable exponent. In functional analysis, spaces of the type L^{p(x)} are known, where the exponent depends on position. In the projection construction: the exponent can depend on the aggregation context, and the metric becomes locally variable. Precise relations between the aggregation operator and variable-exponent spaces will be presented later in the work.

B.9 Gauge Interpretation of Metric Parameter
The parameter controlling the aggregation exponent λ can be interpreted as a parameter for choosing the geometric representation. Changing this parameter does not alter the data itself, but only the manner of its geometric organization. Analogously to field theory: different values of the parameter correspond to different gauge choices of the metric. Consequently, many different geometric representations can describe the same relational structure of the data.
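The ultrametric inequality invoked in B.2 can be checked exhaustively on a small set of addresses. The concrete distance d_A(a, b) = B^(−common prefix length) used here is one natural realization (an assumption; the text defers the exact form to the proof part):

```python
import itertools

def d_A(a, b, base=2):
    """Prefix distance: base^-(common prefix length); 0 for identical addresses."""
    if a == b:
        return 0.0
    depth = 0
    for ca, cb in zip(a, b):
        if ca != cb:
            break
        depth += 1
    return float(base) ** (-depth)

addresses = ["0010", "0011", "0100", "1110", "1111"]
# strong (ultrametric) triangle inequality on every triple of addresses
for x, y, z in itertools.product(addresses, repeat=3):
    assert d_A(x, z) <= max(d_A(x, y), d_A(y, z))
print("ultrametric inequality holds on all triples")
```

The check passes because the common prefix of x and z is at least as long as the shorter of the prefixes shared with any intermediate y, which is exactly the p-adic mechanism described above.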
B.10 Summary of Structural Analogies
The construction presented in the work exhibits structural analogies to several known classes of mathematical objects:
element of the construction | mathematical analogy
--------------------------------------|-------------------------
projection addressing | p-adic spaces
projection graph | ultrametric spaces
change of resolution | renormalization
aggregation operator | variable-exponent spaces
change of parameter λ | gauge transformations
cluster reorganization | spin glass energy landscapes
In the next part of the work the following will be presented: formal theorems, proofs of equivalence, and constructions of isomorphisms between the above structures.

======== 10_Conclusion_Metric_Gauge_on_Data_Graphs.txt ========

Conclusion – Metric Gauge on Data Graphs

C.1 Emergent Geometry from Projection
In the presented construction, geometry is not assumed a priori. Instead, it emerges as a property of operations performed on the data projection graph. Observations are described solely by relation vectors with respect to a chosen projection point, and subsequently organized into a graph structure through: mantissa transformation, hierarchical addressing, and variable-exponent aggregation. As a result, the geometric space is an emergent structure arising from the operational organization of data. This approach allows the analysis of relations between observations even when the actual metric of the observation space is unknown or is only an approximation.

C.2 Metric Gauge Parameter
The central element of the construction is the parameter λ controlling the exponent of the aggregation operator. This parameter acts as a gauge of the data metric, since changing its value alters the effective geometry of the relational space of the graph.
Depending on the range of values of the parameter, different metric regimes appear:
range of λ | metric characteristics
---------------|------------------------
small λ | structures close to Banach spaces
λ ≈ 2 | Hilbertian structures
λ > 2 | hypermetric regimes
λ → ∞ | ultrametric / p-adic limit
The CD QCO construction, from which this approach historically originated, corresponds to the special case λ = 2, lying at the boundary of the hypermetric regime.

C.3 Metric Limits
For computational reasons, the two extreme cases remain merely limits of the construction. The linear limit λ → 0 corresponds to completely linear aggregation. In this limit, the geometric structure generated by the aggregation operator disappears, and the relational space degenerates to a linear form. In practical applications, such a limit is not useful. The p-adic limit λ → ∞ leads to an ultrametric space. In this limit, distances are determined solely by the deepest common prefix in the address tree structure. Although such a structure is extremely useful analytically, its direct numerical realization becomes unstable. Therefore, in practical applications the construction operates in the range 0 < λ < ∞ with the possibility of smoothly approaching both limits.

C.4 Continuous Metric Interpolation
The most important property of the parameter λ is the possibility of continuous interpolation between different data geometries. Changing the parameter leads to a smooth deformation of the metric of the projection graph. This allows analysis of how the neighborhood structure of the data changes, which clusters remain stable, and where reorganizations of the graph topology appear. Thus, the construction provides a tool for studying intermediate states between different classes of metrics.

C.5 Metric Transformations of Equations
An important consequence of the above construction is the possibility of treating the metric parameter as a tool for transforming equations.
If a mathematical or physical problem can be written in relational form on the projection graph, then changing the parameter λ leads to an alternative representation of the same structure. This means that different theories or models can be interpreted as different metric representations of the same problem. Such an approach enables identification of equivalent mathematical descriptions, finding computationally simpler representations, and analysis of the stability of the problem structure with respect to metric deformation.

C.6 Metric Continuation Beyond Classical Regimes
Since the parameter λ can be treated as a continuous variable, it is also possible to consider intermediate regimes between known classes of spaces. This allows the construction of equations in metrics that do not belong to classical categories: neither Hilbertian nor ultrametric. Such regimes may prove particularly useful in the analysis of complex data systems, where the relational structure does not correspond to any classical geometry.

C.7 Engineering Perspective
In engineering applications, the goal is not to identify the “true” geometry of the problem. Far more important is finding a representation in which computations are stable, computational complexity is minimal, and approximation is sufficiently good for the given scale of the problem. The metric parameter λ thus serves as a tool for regulating the geometry of computations. It allows matching the structure of the relational space to the nature of the analyzed problem.

C.8 Final Perspective
The presented construction can be interpreted as a gauge theory of metric on data graphs. In this view, data form a relational space, the projection graph defines the topology of this space, and the parameter λ controls the choice of geometric representation. Instead of seeking one privileged metric, the construction enables exploration of an entire family of geometries describing the same data structure.
Such an approach opens the possibility of systematically studying relations between different mathematical models and finding their equivalent representations depending on the scale and nature of the analyzed problem.

======== 11_Metric_Control_Equivalences.txt ========

Metric Control Equivalences

1. Basic metric deformation
In the projection graph construction, the metric is given by the kernel K_λ(x,y) = e^{-λ L(x,y)} where
L(x,y) — level of the common ancestor in the projection tree
λ > 0 — metric deformation parameter.
Equivalently, the distance can be written as d_λ(x,y) = e^{-λ L(x,y)} which means that − log d_λ(x,y) = λ L(x,y), i.e. tree depth is linear in the logarithmic distance.

2. Interpretation as heat kernel
If L denotes the Laplacian of the data graph, then the diffusion equation has the form ∂u/∂t = − L u and its solution u(t) = e^{-t L} u(0) defines the heat kernel K_t = e^{-t L}. The metric deformation parameter thus satisfies the equivalence λ ≡ t, i.e. λ = diffusion time, and the kernel K_λ(x,y) is precisely the heat kernel on the data graph.

3. Statistical interpretation — temperature
The aggregation operator A_k(x₁, …, xₙ) = ( ∑ x_i^k )^{1/k} is the L^k norm. Limits of the norm:
k → 1 ⇒ linear sum (the mean, under normalized weights)
k → ∞ ⇒ max(x_i)
which corresponds to the Boltzmann-type (softmax) weighting p_i = e^{β E_i} / ∑_j e^{β E_j} (with the sign convention in which large β concentrates weight on the largest E_i), under the identification k ∼ β = 1/T, i.e. λ ∼ β ∼ 1/T. Interpretation:
parameter | interpretation
-------------|--------------------
large λ | low temperature
small λ | high temperature
i.e. T ∼ 1/λ

4. Relation to Dirichlet energy
On a graph with weights W_{ij}(λ) = e^{-λ L(x_i, x_j)} the energy of a function f is E_λ(f) = ½ ∑_{i,j} W_{ij}(λ) (f_i − f_j)² which is the Dirichlet energy on the graph. In many problems there exists a value λ* for which E_λ attains a minimum. Interpretation: λ* = optimal data geometry

5.
Diffusion entropy
If p(t) is the diffusion distribution p(t) = e^{-λ L} p(0) then the Shannon entropy S(t) = − ∑_i p_i(t) log p_i(t) satisfies dS/dt ≥ 0, i.e. diffusion maximizes entropy. In this interpretation λ = time of entropy growth.

6. Heat map geometry
The diffusion distribution u(x,t) forms a temperature map on the graph. The diffusion distance has the form D_t(x,y)² = ∑_z (K_t(x,z) − K_t(y,z))² / π(z) which means that two points are close when the heat spreading from them is similar. The parameter t = λ controls the range of diffusion.

7. Phase transitions of cluster structure
As λ changes, the structure of weights W_{ij}(λ) changes, and thus also the Laplacian L_λ, causing reorganization of the operator spectrum. In practice one observes: sudden appearance of spectral gaps and stabilization of clusters. Interpretation: change of λ ⇒ phase transitions of data geometry.

8. Energy landscape (spin glass)
For large λ the weight matrix has a block structure
W = [ B₁ ε  ε
      ε  B₂ ε
      ε  ε  B₃ ]
which is analogous to the Parisi matrix in spin glass models. Interpretation: cluster hierarchy forms an energy landscape with many local minima.

9. Relation between parameters ε and λ
In the construction there appear two independent scale parameters:
ε — addressing resolution
λ — metric deformation.
Natural scale relation: λ = k / ε, i.e. k = λ ε, where k is the effective metric temperature. Interpretation:
parameter | interpretation
---------------|------------------------
ε | minimal measurement scale
λ | geometry control
k | metric temperature

10. Metric limits
The deformation parameter has domain λ ∈ (0, ∞). The limits are degenerations:
λ → 0: the metric disappears, d_λ(x,y) ≈ 1
λ → ∞: an ultrametric arises,
d(x,y) = { 0 if x = y
           e^{-∞} if x ≠ y }
i.e. pure p-adic space.

11.
Epsilons and equivalence classes
The addressing resolution n = ⌊ − (log x)/ε ⌋ defines an equivalence class x ∼_ε y when |n(x) − n(y)| ≤ 1 which means that ε defines the blurring of points into geometric clusters. Changing ε causes the appearance of new clusters and the disappearance of others. This is the geometric counterpart of scale renormalization.

12. One equation connecting all interpretations
The entire apparatus can be written through the diffusion operator u(t) = e^{-t L} u(0) where t = λ which gives the equivalences: λ = diffusion time = inverse temperature = metric deformation parameter = renormalization scale = kernel parameter.

13. Most important consequence
The parameter λ not only changes the data metric, but also allows transformation of the equations describing the system. Consequently: the same problem can have many metric representations E_λ that are mathematically equivalent but differ in computational stability.

======== 12_Arithmetic_Structure_of_Metric_Transitions.txt ========

Arithmetic Structure of Metric Transitions
This section formalizes the observation that, at fixed resolution ε and varying metric deformation λ, a hierarchy of geometric points appears that play the role of analogues of prime numbers; however, they are metric-dependent. These are not prime numbers in the arithmetic sense; they are irreducible elements of the projective geometry with respect to the reference point.

1. Computational unit
In the mantissa addressing construction, the unit of discretization is ε which defines logarithmic classes n(x) = ⌊ − log u(x) / ε ⌋ where u(x) = ¹/₂ (1 + tanh(α x)) (1 − δ) is the projection of values onto the interval (0,1). Then x ∼_ε y ⇔ n(x) = n(y) defines an equivalence class. Interpretation — ε plays the role of the computational analogue of the unit in logarithmic space.

2. Structure of geometric layers
The addressing layers are defined by L_n = {x : n(x) = n} i.e.
e^{-(n+1)ε} < u(x) ≤ e^{-nε} Each layer is therefore a logarithmic ring around the projection point.

3. Metric deformation
We introduce the deformation d_λ(x,y) = e^{-λ L(x,y)} which causes a change in the effective logarithmic scale ε_λ = ε / λ. Interpretation — changing the metric changes the geometric resolution of the space.

4. Irreducible elements
We define a geometrically irreducible element as an x such that n(x) ≠ n(y) + n(z) for any y, z in the same projective class. Intuitively, such a point does not arise by aggregation of other points at the same level. We denote the set of such points P_λ and call it the set of metric generators.

5. Analogy to prime numbers
In the arithmetic of ℕ, prime numbers are irreducible generators of multiplication. Here, analogously, P_λ is the set of generators of the projective hierarchy. Difference: prime numbers are absolute, whereas P_λ depends on the metric.

6. Dependence on λ
Changing the metric causes a change of generators P_{λ₁} ≠ P_{λ₂}. Interpretation — each metric defines a different arithmetic of the data space geometry.

7. Structure with respect to the projection point
Let x₀ be the projection point. Then all layers are a function of r(x) = d_λ(x, x₀), i.e. r(x) = e^{-λ L(x, x₀)}. The generator elements are therefore P_λ = {x : r(x) is irreducible with respect to aggregation}. They are geometric roots with respect to the projection point.

8. Symmetry of ranges
Since the logarithmic projection acts symmetrically with respect to 1, the structure of generators appears for both (0,1) and (1, ∞), i.e. r ↔ 1/r. Interpretation — the geometric hierarchy is bidirectional with respect to the projection point.

9. Asymptotic distribution
Empirically, the number of generators satisfies the approximation |P_λ(N)| ∼ N / log N which is analogous to the prime number theorem. Interpretation — the projective hierarchy has a natural logarithmic density of generators.

10.
Geometric interpretation
Elements from P_λ are points where local aggregation changes the class structure and a new branch of the projection tree appears. In the data graph, these are nodes with high centrality that initiate new clusters.

11. Physical interpretation
In the physical analogy, these are points corresponding to critical renormalization scales, minimal units of structure in a given metric. Changing λ causes the appearance of new such scales.

12. Most important consequence
The arithmetic of the data space is not absolute. Instead of a single primality structure, there exists a family P_λ, i.e. primality is a property of the metric.

13. Interpretation for data
In practice this means that: in different λ regimes, different tokens become cluster generators, and different semantic structures dominate the graph. That is, changing the metric changes the fundamental arithmetic of the data.

======== 13_Gauge_Structure_of_Metric_Transformations.txt ========

Gauge Structure of Metric Transformations
This section formalizes the property that changing the metric parameter λ does not alter the relational structure of the data, but only its geometric representation. In this sense, metric transformations form a gauge structure acting on the space of representations.

1. Relational space
Let X = {xᵢ} be the set of observations. We assume that the data describe only observational relations, and not an absolute geometric space. We define the relational function L : X × X → ℝ₊ which measures the level of common aggregation or the depth of the common ancestor in the projection graph. This relation is primary; geometric space is a secondary construction.

2. Family of metrics
The metric is introduced as a one-parameter deformation d_λ(x,y) = e^{-λ L(x,y)} where λ > 0 is the metric parameter. The family {d_λ}_{λ>0} defines a family of geometries on the same set of relations.

3. Metric transformations
Changing the parameter λ → λ′ defines the transformation d_λ(x,y) ↦ d_{λ′}(x,y) i.e.
d_{λ′}(x,y) = (d_λ(x,y))^{λ′/λ} which means that all metrics are related by exponentiation. 4. Relational invariants There exist quantities independent of the choice of λ. 4.1 Hierarchical order Since L(x,y) = −(1/λ) log d_λ(x,y) the order L(x,y) < L(x,z) is independent of λ. In particular, the structure of the projection tree and the order of aggregation remain unchanged. 4.2 Graph topology If the graph is defined by the threshold d_λ(x,y) < τ then under the simultaneous transformation τ → τ^{λ′/λ} the neighborhood structure remains identical. 4.3 Relative entropy For distributions p_i(λ) = e^{-λ E_i} / Z_λ the relative entropy D_{KL}(p(λ) ∥ p(λ′)) depends only on the relations of the energies E_i, and not on the absolute metric scale. 5. Transformations as a group The transformations λ ↦ a λ with a > 0 form the multiplicative group G = (ℝ₊, ×) acting on the space of metrics. The group action has the form g_a(d_λ) = d_{a λ} 6. Interpretation as gauge transformation Since the relations L(x,y) are invariants, the metrics d_λ are representations of these relations; the transformations of λ are analogous to gauge transformations. In the physical sense: L(x,y) is an observable, while d_λ is a representation dependent on the choice of gauge. 7. Local metric transformation In a more general form, one can allow a local parameter λ(x) defining d(x,y) = exp( − √(λ(x) λ(y)) L(x,y) ) which leads to a local deformation of the metric. Interpretation — different regions of the data can be observed at different scales. 8. Diffusion operator On the graph with weights W_{ij}(λ) = e^{-λ L_{ij}} the Laplacian has the form L_λ = D_λ − W_λ where D_λ is the degree matrix. Transformation of λ changes the diffusion operator, but preserves the spectral structure up to scaling. 9. Equivalence of representations Two descriptions of the data (X, d_λ) and (X, d_{λ′}) are equivalent if there exists a transformation d_{λ′} = d_λ^{λ′/λ} i.e. they differ only in the choice of metric gauge. 10.
Physical interpretation The parameter λ has equivalent interpretations:

interpretation        | correspondence
----------------------|----------------
temperature           | T ∼ 1/λ
diffusion time        | t = λ
renormalization scale | k ∼ λ
metric deformation    | d_λ

Changing λ corresponds to changing the description of the same system at different scales. 11. Metric limits For λ → 0 the space becomes almost linear. For λ → ∞ the metric passes to an ultrametric: d(x,z) ≤ max(d(x,y), d(y,z)) which corresponds to p-adic geometry. 12. Consequence The data space does not possess one privileged metric. There exists a family of equivalent representations {d_λ} connected by gauge transformation. 13. Operational interpretation Changing λ does not change the data, but only the manner of their aggregation. Consequently, different values of λ can reveal different cluster structures, but all are projections of the same relational structure. Summary The metric parameter λ acts as a gauge variable: geometric representations of the data are related by gauge transformation, while the proper mathematical object remains the relational structure L(x,y), of which all metrics are merely representations. ======== 14_Optimization_and_Dynamic_Data_Evolution_in_Projection_Graphs.txt ======== Optimization and Dynamic Data Evolution in Projection Graphs This section describes the optimization consequences arising from the presented construction: both in the computational sense and structurally in the context of dynamic data evolution. The key role is played by three mechanisms: data horizon (resolution limit), informational redshift (hierarchy reconfiguration upon entropy increase), informational blueshift (reconfiguration upon data reduction). The purpose of these mechanisms is to maintain stable computational complexity despite a changing dataset. 1. Data scaling problem Let X = {x₁, …, x_N} be the set of observations.
In classical similarity search methods, the computational cost is O(N) for a single query. In the projection graph, data are organized into a hierarchy of logarithmic addresses n(x) = ⌊ − log u(x) / ε ⌋ which yields a tree structure. The search cost reduces to O(log N) since only the appropriate branch of the tree is traversed. 2. Structural stability upon data growth Adding a new element x_{N+1} requires only: 1. computation of the projection u(x_{N+1}) 2. determination of the address n(x_{N+1}) 3. insertion into the appropriate tree branch. This operation has cost O(log N) and does not require reconstruction of the entire structure. 3. Data entropy The entropy of the structure can be defined as S = − ∑_i p_i log p_i where p_i is the proportion of elements in address class i. Adding data increases entropy. If resolution remains constant (ε fixed), classes begin to densify. 4. Data horizon Let R denote the maximum address index. Then the number of possible classes is N_max ∼ e^{R ε} which defines the maximum information capacity of the address space. If N > N_max, the system reaches the data horizon. Interpretation: new elements can no longer be uniquely distinguished at the given resolution. 5. Informational redshift As the number of data increases, address classes shift toward higher levels: n → n + Δn which corresponds to the transformation u → e^{-Δn ε} u i.e. a reduction of the projective value. Interpretation: older data shift toward greater projective distance. Analogous to cosmology: informational redshift. 6. Black sky For very large address levels n ≫ 1 differences between classes become smaller than the system resolution. Then u ≈ 0 and the data become indistinguishable. This region is called black sky i.e. the area beyond effective resolution. 7. Stabilization through horizon The system remains stable because new data fall into the region near the horizon, while older data gradually shift toward it. 
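The address, entropy, and redshift mechanics described above can be sketched in a few lines. This is a minimal illustration; the function names and the value of ε are assumptions chosen for the example, with projection values u(x) ∈ (0, 1]:

```python
import math
from collections import Counter

EPS = 0.5  # resolution parameter ε (illustrative value, an assumption)

def address(u, eps=EPS):
    """Logarithmic address n(x) = floor(-log u(x) / eps) for u in (0, 1]."""
    return math.floor(-math.log(u) / eps)

def entropy(values, eps=EPS):
    """Structure entropy S = -sum_i p_i log p_i over address classes."""
    counts = Counter(address(u, eps) for u in values)
    total = sum(counts.values())
    return -sum((c / total) * math.log(c / total) for c in counts.values())

def redshift(u, delta_n, eps=EPS):
    """Informational redshift u -> e^{-delta_n * eps} u, i.e. n -> n + delta_n."""
    return math.exp(-delta_n * eps) * u

us = [0.9, 0.5, 0.1, 0.01]                       # projection values of four elements
addrs = [address(u) for u in us]                 # their address classes
shifted = [address(redshift(u, 2)) for u in us]  # classes after a 2-level redshift
```

Because the shift is an integer number of ε-levels in logarithmic space, address(redshift(u, Δn)) = address(u) + Δn holds exactly: the hierarchy is translated, not rebuilt.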
This causes automatic limitation of the active data space. 8. Blueshift upon data removal If elements are removed from the system, entropy decreases: S ↓ which causes a shift of addresses n → n − Δn i.e. u → e^{Δn ε} u Interpretation — remaining data become closer to the projection point. Analogous — informational blueshift 9. Computational stability Thanks to this mechanism, the number of active classes remains bounded: N_active ≤ C where C is a constant depending on resolution. The complexity of operations therefore remains asymptotically O(log N) even with a growing dataset. 10. Local reorganization of structure Redshift/blueshift changes do not require global reconstruction. Reorganization occurs locally in the projection graph: only address levels change, the relational structure remains unchanged. 11. Interpretation as a diffusive process Redshift can be written as evolution u_{t+1} = e^{-ε} u_t i.e. a discrete diffusion process in logarithmic space. Blueshift is the reverse process. 12. Search optimization Thanks to the described structure: 1. the data space is automatically indexed, 2. nearest neighbors search takes place in the address tree, 3. the number of candidates is locally bounded. In practice this means: search cost ∼ O(log N) instead of O(N) 13. Most important consequence The projective hierarchy acts as a self-stabilizing data index. Adding information causes redshift of existing structures, while removing data causes blueshift. In this way the system maintains bounded computational complexity, stable geometric resolution, and control over data entropy. ======== 15_Prefix_Address_Search_and_Computational_Complexity.txt ======== Prefix Address Search and Computational Complexity One of the main practical results of the projection construction is the possibility of reducing the nearest neighbors search problem to operations on address prefixes. 
Instead of computing metrics for many pairs of points, search is performed by comparing common prefixes in the address tree. 1. Projection address Each data element x receives an address A(x) = a₁ a₂ a₃ … a_D where aᵢ ∈ {0,1,…,b−1} are digits in base b. The address is obtained by discretization of the projection nᵢ(x) = ⌊ − log uᵢ(x) / ε ⌋ for each projection coordinate. 2. Prefix tree The set of addresses forms a prefix tree (trie) structure: the root corresponds to the empty prefix, each level corresponds to the next digit of the address. A tree node represents the set of elements sharing a common prefix A(x)_{1:k} of length k. 3. Prefix relation and distance Let ℓ(x,y) denote the length of the longest common prefix of addresses A(x) and A(y). Then the projection distance satisfies the approximation d(x,y) ∼ e^{-λ ℓ(x,y)} i.e. the longer the common prefix, the closer the points in the projection metric. 4. Nearest neighbors as prefix search For a query q: 1. compute the address A(q), 2. in the prefix tree find the node corresponding to this prefix, 3. search the local branch of the tree. Nearest neighbors are found among elements with the largest common prefix. 5. Locality of search If ℓ = ℓ(q,x) is the length of the common prefix, candidates are located in nodes ℓ, ℓ−1, ℓ+1, i.e. in a small neighborhood in the tree. The number of points checked is therefore bounded. 6. Computational complexity Construction of the structure: O(N log N) where N is the number of elements. Search for one query: O(log N + k) where k is the number of returned neighbors. 7. Comparison with classical approach

method          | query cost
----------------|---------------------
brute force     | O(N)
k-d tree        | O(log N) on average
projection trie | O(log N) stably

In high dimensions, classical tree structures lose efficiency, whereas projection addressing preserves locality thanks to logarithmic compression of the space. 8.
Stability under metric change Changing the metric parameter λ does not change the addresses of elements, since the address depends solely on the projection u(x). Only the geometric interpretation of prefix length changes. This means that the same tree structure can be used under different metrics. 9. Update upon adding data Adding a new element x requires: 1. computation of the address A(x), 2. traversal through the prefix tree, 3. insertion at the appropriate node. Operation cost: O(log N) without the need to reorganize the entire structure. 10. Resistance to high dimensionality Since the address is formed by logarithmic projection of each coordinate, high-dimensional data are compressed into a sequence of digits. Search complexity depends mainly on address length, not on the dimension of the space. 11. Geometric interpretation The prefix tree is equivalent to a hierarchical decomposition of the projection space. Each prefix corresponds to a region R(a₁, …, a_k) in which all points have similar projection coordinates. 12. Informational interpretation The projection address can be treated as an informational code of the point. The length of the common prefix is a measure of shared information between points: I(x,y) ∝ ℓ(x,y) which corresponds to the entropic interpretation of the data graph. 13. Practical consequence The nearest neighbors search problem reduces to operations on strings: prefix comparison, tree traversal. These operations are very fast and well suited for implementation on simple hardware. Summary Hierarchical projection addressing transforms the metric nearest neighbors problem into a combinatorial longest common prefix problem on the address tree. Thanks to this, similarity search can be realized through simple prefix operations instead of costly metric computations. 
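The reduction of nearest-neighbor search to longest-common-prefix comparison can be sketched as follows. The names (`PrefixIndex`, `lcp`) are hypothetical, and for clarity the ranking is done by a linear scan; the trie described above would instead descend a single branch in O(log N):

```python
import math

def lcp(a, b):
    """Length of the longest common prefix of two address strings."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

class PrefixIndex:
    """Minimal sketch of prefix-address search (hypothetical structure).

    Addresses are digit strings; the projection distance is approximated
    by d(x, y) ~ exp(-lam * lcp(x, y))."""

    def __init__(self, lam=1.0):
        self.lam = lam
        self.items = {}  # label -> address string

    def insert(self, label, address):
        self.items[label] = address

    def distance(self, a, b):
        return math.exp(-self.lam * lcp(a, b))

    def nearest(self, query_address, k=1):
        # Rank by longest common prefix (equivalently, smallest distance).
        ranked = sorted(self.items.items(),
                        key=lambda kv: -lcp(query_address, kv[1]))
        return [label for label, _ in ranked[:k]]

idx = PrefixIndex()
idx.insert("benzene", "2101")
idx.insert("phenol",  "2102")
idx.insert("steroid", "3000")
# "2103" shares the prefix "210" with benzene and phenol, nothing with "3000".
near = idx.nearest("2103", k=2)
```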
======== 16_Adaptive_Context_Crawling_on_the_Projection_Tree.txt ======== Adaptive Context Crawling on the Projection Tree In this section we describe the procedure of crawling the address tree at variable metric parameter λ in order to refine or generalize the context of a token. This mechanism acts as a preliminary prompt recombiner which, on the user side, reorganizes the query in the form token[hints] before it is passed to the language model. The goal is to transfer part of the work usually performed by the costly attention mechanism to a cheaper operation on the data structure. 1. Token as a point in the projection graph Let T = {t₁, t₂, …} be the set of tokens. Each token has a vector representation v(t) ∈ ℝᴰ and a projection address A(t) = a₁ a₂ … a_k which determines its position in the prefix tree. 2. Context as a region of the tree The context of a token is the set of neighboring tree nodes: C_λ(t) = {x : d_λ(t,x) < τ} where d_λ(x,y) = e^{-λ L(x,y)} is the projection metric. Changing λ changes the size of the contextual region. 3. Refinement (zoom-in) Increasing the parameter λ ↑ causes a reduction of the metric radius, selection of a narrower tree branch. Operationally this corresponds to lengthening the common prefix of addresses: ℓ(x,t) ↑ i.e. moving to a deeper level of the tree. Interpretation — token specialization 4. Generalization (zoom-out) Decreasing the parameter λ ↓ causes an increase of the metric radius, movement to a higher level of the tree. The address prefix shortens: ℓ(x,t) ↓ Interpretation — token generalization 5. Tree crawling The contextual algorithm consists in iterative traversal of the tree: 1. start from token t, 2. computation of the region C_λ(t), 3. analysis of semantic relations, 4. change of parameter λ, 5. transition to the new region. This procedure resembles a local optimization process. 6. Construction of hints For token t we select the set of neighbors H(t) = {h₁, …, hₘ} that have the largest common prefix with t. 
This yields the structure t[h₁, h₂, …, hₘ] Interpretation: hints narrow the space of possible meanings of the token. 7. Prompt recompilation The prompt P = (t₁, t₂, …, tₙ) can be transformed into P′ = (t₁[H₁], t₂[H₂], …, tₙ[Hₙ]) where Hᵢ are local semantic hints. 8. Reduction of attention space The standard attention mechanism requires comparison of all tokens: O(n²) After introducing hints, the candidate space is restricted to: n × k where k is the number of hints. 9. Dynamic selection of λ The metric parameter can be chosen adaptively: λ = f(token ambiguity) where ambiguity can be measured by the entropy of the local neighbor distribution. High entropy → increase λ (refinement). Low entropy → decrease λ (generalization). 10. Semantic crawling For token t one can consider the path t → h₁ → h₂ → … where each step corresponds to a transition in the projection graph. These paths form local meaning trajectories. 11. Approximation of the generation process In the language model, generation of the next token is the choice arg max_t P(t | context) In the projection tree one can beforehand restrict the candidate set to C_λ(context) which significantly reduces computational cost. 12. Interpretation as context compiler This mechanism acts as a compilation layer: 1. user provides prompt, 2. system analyzes tokens in the projection graph, 3. generates contextual hints, 4. an expanded prompt is created. The model receives a partially pre-resolved context. 13. Computational consequence Operations performed in this layer are logarithmic in the number of tokens, based on prefix operations. The cost of prompt preparation is therefore significantly lower than the cost of full attention. Summary Crawling the projection tree at variable λ allows adaptive refinement of token meanings, generalization of context, and construction of local semantic hints. 
This mechanism acts as a lightweight context reconstruction layer that can significantly reduce the load on subsequent attention mechanisms in language models. ======== 17_Projection_Graph_Attention_A_Computationally_Reduced_Attention_Mechanism.txt ======== Projection–Graph Attention: A Computationally Reduced Attention Mechanism At the end of the construction there appears a natural mechanism analogous to attention, but operating on the projection structure instead of the full similarity matrix. This allows reduction of computational cost by exploiting the prefix hierarchy and the metric controlled by the parameter λ. 1. Classical attention In the standard attention mechanism for a sequence X = (x₁, …, xₙ) one computes the similarity matrix Aᵢⱼ = (Qᵢ · Kⱼ) / √d and then the weight wᵢⱼ = exp(Aᵢⱼ) / Σₖ exp(Aᵢₖ) which gives cost O(n²) for one layer. 2. Geometric interpretation In the projection construction each token t has an address A(t) = a₁ a₂ … aₖ which determines its position in the projection tree. The length of the common prefix ℓ(tᵢ, tⱼ) approximates semantic similarity. One can therefore define the attention weight wᵢⱼ = exp(−λ · L(tᵢ, tⱼ)) where L(tᵢ, tⱼ) is the level of the common ancestor in the tree. 3. Local attention Instead of computing similarity for all pairs of tokens, we consider only the set Cλ(tᵢ) = {tⱼ : ℓ(tᵢ, tⱼ) ≥ k} i.e. tokens lying in the same branch of the tree. The attention matrix becomes sparse. 4. Complexity If the average number of tokens in the local region is m, the cost is O(n m) instead of O(n²). In practice m ≪ n, yielding a significant reduction in computation. 5. Adaptive attention range The metric parameter λ controls the size of the contextual region. Large λ → small region, strongly local attention. Small λ → larger region, more global attention. 6. Hierarchical aggregation The prefix tree enables a natural hierarchy of context. Attention can be computed at several levels: 1. local level (same prefix), 2. 
regional level (shortened prefix), 3. global level (tree root). Each level corresponds to a different metric scale. 7. Softmax approximation Since the weights have the form wᵢⱼ = e^{-λ Lᵢⱼ} softmax normalization can be replaced by a local sum within the tree branch: w̃ᵢⱼ = e^{-λ Lᵢⱼ} / Σ_{k ∈ Cλ(i)} e^{-λ Lᵢₖ} which reduces computational cost. 8. Interpretation as diffusion The weight matrix Wᵢⱼ = e^{-λ Lᵢⱼ} is identical to diffusion kernels on the graph. Attention thus corresponds to one step of information diffusion in the projection graph. 9. Stability under metric change Changing the parameter λ changes the diffusion range but does not change the tree structure. This means that the same data structure can support different attention regimes. 10. Operational implementation Procedure for token tᵢ: 1. find the address prefix, 2. select the tree branch, 3. retrieve tokens in the region, 4. compute local weights e^{-λ Lᵢⱼ} 5. perform value aggregation. 11. Relation to the crawling mechanism The previously described mechanism (tree crawling) selects the contextual region. Attention then performs aggregation of information within that region. Crawling can therefore be treated as preselection for attention. 12. Computational gain Thanks to this division, preselection occurs in the address structure, while actual attention computations are performed only locally. As a result, most of the O(n²) cost is eliminated. 13. System-level interpretation In the projection construction, attention is not a global matrix operation. It is a process of local information diffusion in the projection graph, with range controlled by the metric parameter λ. Summary The hierarchical address structure allows replacement of the full attention matrix with a local operation on the data graph. The range of this operation is controlled by the metric parameter λ, which determines the scale of context and enables smooth transition between local and global information aggregation. 
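The local attention weighting described in this section can be sketched as follows. This is an illustrative sketch, not the full method: the level L(tᵢ,tⱼ) of the common ancestor is proxied here by address length minus common-prefix length, and the function names are hypothetical:

```python
import math

def lcp(a, b):
    """Length of the longest common address prefix."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

def local_attention(tokens, values, lam=1.0, min_prefix=1):
    """Sparse attention over prefix regions (illustrative sketch).

    tokens: {name: address string}, values: {name: float}.
    For each token only the region C = {t_j : lcp >= min_prefix} is
    aggregated, with weights e^{-lam * L} normalized inside the region,
    where L is proxied by len(address) - lcp (depth below the common
    ancestor)."""
    out = {}
    for ti, ai in tokens.items():
        region = [tj for tj, aj in tokens.items() if lcp(ai, aj) >= min_prefix]
        w = {tj: math.exp(-lam * (len(ai) - lcp(ai, tokens[tj]))) for tj in region}
        z = sum(w.values())
        out[ti] = sum(w[tj] * values[tj] for tj in region) / z
    return out

tokens = {"bank_river": "110", "bank_money": "111", "apple": "000"}
vals = {"bank_river": 1.0, "bank_money": 2.0, "apple": 10.0}
out = local_attention(tokens, vals)  # "apple" shares no prefix, attends only to itself
```

Tokens outside the prefix region contribute nothing, so the weight matrix is sparse; the cost per token is proportional to the region size m, not to n.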
======== 18_Projection_Tree_as_local_RAG.txt ======== Projection Trees as “local, cheap RAG” The mechanism described earlier can be interpreted as a local Retrieval-Augmented Generation system, but without the classical infrastructure: no costly ANN vector search, no external embedding database, no global matrix operations. Instead we have: 1. vectors describing objects 2. projection of these vectors onto an address in the tree 3. local prefix search Retrieval becomes the operation: lookup(prefix) of complexity O(log N) instead of the typical ANN cost: O(Nᵅ) 1. General scheme For a set of objects X = {x₁, …, xₙ} we define the representation v(x) ∈ ℝᴰ and the projection address A(x) = a₁ a₂ … aₖ where the address arises through hierarchical metric projections. A feature space tree is obtained. 2. Retrieval The query q has vector v(q) and address A(q). Nearest neighbors are objects with the largest common prefix. ℓ(q, x) = |A(q) ∩ A(x)| Retrieval therefore consists in: 1. finding the prefix 2. exploring the tree branch 3. Interpretation as RAG Standard RAG: 1. embedding 2. ANN search 3. context for the model In the projection version: 1. vector → address 2. prefix search 3. local context 4. Chemical application Chemistry is a natural domain for this construction, because chemical spaces have a natural vector structure. One can define a vector v(molecule) e.g. from components: number of atoms, bond types, polarity, orbital energy, pharmacological properties. N-gram encodings of the molecule, or further physical descriptors, could serve as additional components. 5. Tree of compounds After projection a tree arises:

chemistry
├── aromatics
│   ├── benzene
│   ├── phenol
│   └── aniline
├── peptides
└── steroids

Nodes correspond to local regions of chemical space. 6. N-grams as reactions If tokens are compounds m₁, m₂, … then the sequence (m₁, m₂, m₃) can represent a chemical reaction pathway. Analogous to n-grams in language. 7.
Local exploration of reaction space Crawling the tree yields: molecular variants similar reactions analogous structures i.e. a kind of local exploration of chemical space. 8. Unknown unknowns Such a database is by definition incomplete. However, the geometry of the space provides clues: empty regions unnatural gradients metric anomalies These regions may indicate undiscovered compounds. Testing them allows fitting the entire embedding space. 9. Drug ontology In classical pharmaceutical databases ontologies are built manually. Here the ontology can arise geometrically: clusters in space → classes of compounds. 10. Exploration of effect space One can also define a second vector e(molecule) describing biological effects: receptors metabolism toxicity A map arises molecule → effect which allows searching for molecules with similar action. 11. Interpretation as reaction graph If two compounds have similar addresses A(x) ≈ A(y) then there often exists a short chemical transformation between them. The tree becomes a graph of potential reactions. 12. Local application The greatest advantage of this construction: the whole can operate on a single machine. It does not require: huge ANN indices GPU large models. 13. MERA as vector space A similar structure is possessed by the tensor network MERA (Multi-scale Entanglement Renormalization Ansatz). In MERA the quantum state is represented hierarchically, each level corresponds to a different scale. MERA levels can be treated as prefix addresses in the state space. 14. Metric in MERA Distance between states can be approximated by common renormalization history. This is very similar to ℓ(x,y) i.e. common prefix. 15. cMERA In the continuous version cMERA (continuous MERA) the hierarchy becomes a continuous scale flow. The scale parameter acts similarly to λ from earlier sections. 16. 
Search in state space Metric search in such a structure can serve for: finding similar quantum states, exploration of phase space, identification of phase transitions. 17. Structural analogy Structurally we have the same scheme:

system    | structure           | embeddings        | semantic space
----------|---------------------|-------------------|---------------
chemistry | molecular space     | molecular vectors | chemical space
MERA      | quantum state space | tensor network    | state space

Each of them can be indexed by metric projection. 18. Conceptual assessment The chemical application is conceptually sound because: 1. chemistry is naturally geometric, 2. many reactions are small transformations, 3. local exploration of space is very useful. 19. Limitations There are limitations: chemical space is enormous; biological properties are strongly nonlinear, albeit coupled, so a suitable local metric remains to be found, which may itself be a task for automated research; environmental effects may dominate. 20. Most promising uses The most promising uses of this construction are: generation of chemical hypotheses, search for drug analogues (the counterpart of synonyms in the semantic-embedding sense), exploration of reaction space, i.e. tasks where local geometry is key. Summary The projection tree can be treated as a cheap local RAG system for any vector space. In chemistry it could represent the space of compounds and reactions, enabling exploration of molecular trajectories similarly to n-grams in language. A similar geometry appears in hierarchical representations of quantum states such as MERA and cMERA, which suggests that metric search in renormalization spaces may have analogous applications. ======== 19_remarks.txt ======== 1. Theorem on the equivalence of metric and prefix From working notes—what gets lost between the lines. Formalism: Let addresses be sequences A(x) = a₁ a₂ … aₖ and distance d(x,y) = e^{-λ ℓ(x,y)} where ℓ(x,y) is the length of the common prefix.
Then: the space is ultrametric, metric balls correspond to subtrees Bᵣ(x) ↔ subtree(prefix) Consequence: nearest neighbor = prefix search An elegant structural result. 2. Relation to ultrametric and p-adic spaces The metric has a form very similar to dₚ(x,y) = p^{-νₚ(x-y)} from p-adic arithmetic. As a result: the projection space is dendritic, balls are nested, there is no classical Euclidean geometry. This is why search works so well. The structure of the space is closer to p-adic geometry than to Euclidean. That is, something physicists and mathematicians like very much. 3. Interpretation as renormalization flow The parameter λ can be treated as a renormalization scale. Changing λ: changes the resolution of the space, aggregates or separates tree regions. Formally resembles RG flow: λ → λ + dλ Interpretation: large λ → UV, small λ → IR An interesting natural bridge to physics. 4. Local tree entropy One can define the entropy of a region: S(node) = log |subtree| where |subtree| is the number of elements in the subtree. This gives: a measure of token ambiguity, a measure of region complexity. It can control the adaptation of λ. 5. Information horizon In the notes I have an analogy to black sky / entropy limit. Formalizing: if the number of elements grows with tree depth N(k), then there exists a level kₕ where the cost of search begins to grow exponentially. This acts as an information horizon. Beyond this boundary, only data aggregation (redshift) makes sense. 6. Informational redshift and blueshift Intuition from observations written formally. It already appeared in earlier HPF QCO. Adding data: the tree region densifies, d(x,y) ↓ i.e. semantic redshift. Deleting data: distances increase, d(x,y) ↑ i.e. blueshift. An interesting informational analogy. 7. Dynamics of the projection graph The graph can be dynamic. The operations implemented are: insertion, deletion, rebalance. Cost: O(log N) Quite significant for practical applications. 8.
Diffusion on the graph The matrix Wᵢⱼ = e^{-λ Lᵢⱼ} defines the graph Laplacian operator. One can therefore study: information diffusion graph spectrum clustering That is, spectral analysis, which I analyzed for a long time in working notes. 9. Interpretation as information compression The projection tree acts as structural compression. The address of a token is a code: A(x) which contains information about its region of space. This resembles: coding trees hierarchical clustering 10. Relation to database indices The construction is very close to classical structures: B-tree trie metric tree cover tree Programmers immediately recognize the analogy. With the difference that here one can “on the fly” (as much as the machine can handle) manipulate the metric and do so anisotropically. 11. Stability under change of representation If the embedding changes v(x) → v'(x) then addresses usually change only locally. This gives index stability. An interesting practical property. No need to retrain everything from scratch every time. 12. Relation to locality-sensitive hashing The projection is in some sense hierarchical LSH. Only better. This was the main problem for which I developed these notes, because I needed to examine embeddings on a “potato PC”. Difference: LSH is random hashing This method is deterministic metric projection ======== 20_remarks_ultrametric_nature_of_the_projection_space.txt ======== Ultrametric nature of the projection space; The starting point of the construction is the fact that the considered space is not given a priori. We do not assume Euclidean geometry, a natural metric, or even a well-defined topological structure. The space appears only as an effect of data projection. In other words: space = Π(X) where Π : X → A is the projection operator, and A is the address space. The geometry of this space is therefore emergent, not fundamental. 1. 
Projection address To each element x ∈ X an address is assigned A(x) = a₁ a₂ … aₖ arising from successive projections. This address can be interpreted as a path in the tree: x → node₁ → node₂ → … where each level corresponds to a successive resolution level. 2. Similarity via prefix The natural measure of similarity becomes the length of the common prefix ℓ(x,y) = max {k : aᵢ(x) = aᵢ(y) ∀ i ≤ k} i.e. the level of the lowest common ancestor in the tree. 3. Projection metric We introduce a family of metrics dλ(x,y) = e^{-λ ℓ(x,y)} where λ > 0 is the scale parameter. This metric does not arise from the geometry of the original space, but from the organization of addresses. 4. Ultrametric property This metric satisfies the ultrametric inequality d(x,z) ≤ max(d(x,y), d(y,z)) The proof follows directly from the prefix property. If ℓ(x,z) < min(ℓ(x,y), ℓ(y,z)) then one of the prefixes must diverge earlier in the tree, which bounds the maximum length of the common part. Consequently, the projection space is ultrametric. 5. Ultrametric geometry Ultrametricity implies a very specific geometric structure. In particular: 1. all triangles are isosceles d(x,z) = max(d(x,y), d(y,z)) 2. balls are nested Bᵣ(x) ⊂ Bᵣ'(x) 3. balls are simultaneously open and closed. 6. Metric balls as subtrees The metric ball Bᵣ(x) = {y : d(x,y) < r} corresponds to the set of elements possessing a prefix of length k ≈ (1/λ) log(1/r) i.e. precisely the subtree with the given prefix. A natural equivalence thus arises: ball(x,r) ↔ subtree(prefix) 7. Reduction of the nearest neighbors problem From this property it follows directly: nearest neighbor search is equivalent to the operation prefix search i.e. exploration of the appropriate subtree. 8. Scale parameter λ The parameter λ controls the interpretation of distance. Changing λ does not change the tree structure, but only the way the ball radius is interpreted. Large λ → fine resolution. Small λ → aggregation of regions. 9. 
Equivalence of metric families For any λ₁, λ₂ > 0 the metrics dλ₁, dλ₂ generate the same topology. They differ only in the scale of distance interpretation. In this sense we have a family of equivalent ultrametrics. 10. Emergent geometry It is essential that ultrametricity was not imposed. It arises from: 1. hierarchical projection, 2. address representation, 3. interpretation of distance via prefix. The geometry of the space is therefore a product of the index structure. 11. Informational interpretation In this construction the distance d(x,y) measures the amount of shared information contained in the addresses. The prefix length is a measure of shared projection history. 12. Structural stability Since the metric depends solely on the prefix, small changes in data usually do not alter the first levels of the address; the global structure of the space remains stable. 13. Conclusion The projection space possesses a natural ultrametric structure arising from the hierarchical organization of addresses. The metric dλ(x,y) = e^{-λ ℓ(x,y)} defines a family of equivalent ultrametrics in which metric balls correspond to subtrees of the prefix structure. Thanks to this, geometric problems such as nearest neighbors search reduce to operations on the address tree. The ultrametric geometry of the space is not an assumption of the construction, but an emergent consequence of the way data are projected. ======== 21_remarks_interpretation_as_renormalization_flow.txt ======== Interpretation as renormalization flow; In the projection construction, the starting point is the relation graph G = (V, E) where V — set of vertices (objects), E — set of edges (relations). We assume no spatial structure beyond the existence of the graph and the fact that certain nodes are coupled. There is no primary geometry or natural metric here. We can only state that: (x,y) ∈ E means the existence of a relation between nodes. 1. Relational starting point This structure is purely relational. 
Nodes carry certain quantities (e.g. vectors, numbers, features), and edges denote connections between them. There is no prior space in which the nodes would be embedded. Space appears only when we begin to interpret the graph through a chosen metric. 2. Introduction of the metric To analyze the graph structure, we introduce a distance function d(x,y) which is a function of the graph relations. This metric is not unique. We can define many different metrics on the same graph, e.g. path length metric, spectral metric, projection metric arising from addresses. Each gives a different interpretation of the same relational structure. 3. Scale parameter In the projection construction there appears the parameter λ which controls the interpretation of distance: dλ(x,y) = e^{-λ ℓ(x,y)} Changing λ changes the way we interpret the locality of relations. 4. Interpretation as scale change If we increase λ we strengthen the significance of fine differences, we observe the structure at higher resolution. If we decrease λ the differences are smoothed out, we observe the structure at a more global scale. The parameter λ thus plays the role of observation scale. 5. Metric flow Consider a sequence of metrics dλ₁, dλ₂, … where λ₁ < λ₂ < … The transition between them can be treated as a flow: λ → λ + dλ i.e. a change in the way the graph is described. 6. Analogy to renormalization In physical renormalization theory, changing scale causes aggregation of degrees of freedom, change of effective couplings. In the relational graph, changing the metric produces a similar effect: certain relations become local, others disappear or are aggregated. An effective structure arises at the given scale. 7. Relation aggregation At small λ many nodes lie in the same metric ball Bᵣ(x) which leads to aggregation of graph regions. As a result, clusters arise that can be treated as nodes of a higher-level structure. 8. Structure separation At large λ these regions break down into smaller components. 
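The metric flow described above can be sketched numerically. The fragment below (Python; addresses modeled as plain strings; prefix_len and d are hypothetical helper names, not part of the construction) verifies that moving from λ₁ to λ₂ is a monotone rescaling, d_{λ₂} = (d_{λ₁})^{λ₂/λ₁}, so the underlying relational structure is untouched and only the interpretation of distances changes.

```python
import math

def prefix_len(a: str, b: str) -> int:
    """Length of the common prefix of two addresses."""
    n = 0
    for s, t in zip(a, b):
        if s != t:
            break
        n += 1
    return n

def d(a: str, b: str, lam: float) -> float:
    """Projection metric d_lambda(x, y) = exp(-lambda * l(x, y))."""
    return math.exp(-lam * prefix_len(a, b))

addresses = ["0010", "0011", "0100", "1000"]
lam1, lam2 = 0.5, 2.0

for a in addresses:
    for b in addresses:
        # The two metrics are monotone rescalings of each other:
        # d_{lam2} = (d_{lam1})^{lam2/lam1}, so they induce the same
        # neighbourhood ordering (and hence the same topology).
        assert abs(d(a, b, lam2) - d(a, b, lam1) ** (lam2 / lam1)) < 1e-12
```

Since a monotone rescaling preserves which neighbours are closer than which, the clusters seen at one scale deform continuously, not discontinuously, as λ varies.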
Subtler differences between nodes appear. This is the counterpart of moving to a more detailed scale. 9. Invariance of the graph An important feature of this construction is that the graph itself G = (V, E) does not change when λ changes. Only the interpretation of relations changes. 10. Lack of space ontology From this perspective, space is not a primary entity. It is not given before the graph analysis. It is only a structure arising from the chosen way of interpreting relations. 11. Relational interpretation This viewpoint is consistent with the relational conception of structure, in which nodes carry quantities, edges describe connections, and space is a secondary construction. We do not analyze the nature of objects themselves, but the structure of their relations. 12. Metric as an exploration tool The choice of metric corresponds to the choice of way to explore the graph. Different metrics can reveal different aspects of the same structure: clusters, paths, stable regions. The metric is therefore an analytical tool, not an element of the graph's ontology. 13. Local anisotropy In practice one can apply locally varying metrics: dλ(x)(x,y) where the scale parameter depends on the graph region. This allows adaptive change of analysis resolution in different parts of the structure. 14. Effective geometry At each chosen scale the graph can be treated as a space with its own geometry: local neighborhoods, metric regions, flow trajectories. This geometry is, however, effective and dependent on metric parameters. 15. Conclusion The relation graph constitutes the primary structure of the construction. The metric introduced on this graph does not reveal its nature, but defines the way it is explored. The scale parameter λ acts as a renormalization flow: it changes the resolution of observation and leads to different effective descriptions of the same relational structure. 
In this sense, the space in which we analyze data is not fundamental but emergent: it arises as the result of interpreting relations between graph nodes.

======== 22_remarks_local_entropy_as_lambda_control.txt ========

Local entropy as control of λ
In the projection construction, the scale parameter λ controls the way locality is interpreted in the relation graph. A natural criterion for adapting this parameter is the local entropy of the graph structure, which measures the degree of ambiguity or complexity of relations in a given region.

1. Local distribution of relations
For a node x consider its neighborhood in the graph, N(x) = {y : (x,y) ∈ E}, and the relation weight arising from the projection metric
wλ(x,y) = e^{-λ L(x,y)}
where L(x,y) denotes the level of the common prefix, i.e. of the nearest common ancestor in the projection tree. Normalizing these weights we obtain the local distribution
pλ(x,y) = wλ(x,y) / Σ_{z∈N(x)} wλ(x,z)

2. Definition of local entropy
We define the local entropy as
Sλ(x) = - Σ_{y∈N(x)} pλ(x,y) log pλ(x,y)
This is a measure of the dispersion of relations around the node. Interpretation: small entropy → relations are strongly concentrated; large entropy → relations are dispersed and ambiguous.

3. Control of the parameter λ
Entropy can control the choice of analysis scale. A natural adaptive rule is
λ(x) = f(S(x))
where f is an increasing function. In practice this means: high entropy → increase λ → higher local resolution; low entropy → decrease λ → region aggregation.

4. Geometric interpretation
Changing λ affects the shape of metric balls Bᵣ(x) = {y : dλ(x,y) < r}. Large λ → small metric regions. Small λ → larger aggregation regions. Local entropy therefore determines the geometric resolution of the graph analysis.

5. Exponential measure
An essential feature of the construction is the exponential form of the weights wλ(x,y) = e^{-λ L(x,y)}. This yields a natural suppression of relations with increasing distance.
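The quantities above (weights wλ, local distribution pλ, entropy Sλ) can be sketched directly. The fragment below is only an illustration under simplifying assumptions: a single node, neighbours given by their levels L(x,y), and a hypothetical rule adapt_lambda implementing the "high entropy → increase λ" behaviour; local_entropy and adapt_lambda are names introduced here, not part of the construction.

```python
import math

def local_entropy(levels, lam):
    """S_lambda(x) for a node whose neighbours y sit at the given
    common-prefix levels L(x, y), using weights w = exp(-lam * L)."""
    w = [math.exp(-lam * L) for L in levels]
    Z = sum(w)
    p = [wi / Z for wi in w]
    return -sum(pi * math.log(pi) for pi in p)

def adapt_lambda(S, base=1.0, gain=0.5):
    """Hypothetical increasing rule: higher entropy -> larger lambda."""
    return base + gain * S

# One dominant relation: the weight concentrates, entropy is low.
S_low = local_entropy([0, 3, 3, 3], lam=1.0)
# Evenly spread relations: the distribution is uniform, entropy is
# maximal (the logarithm of the number of neighbours).
S_high = local_entropy([1, 1, 1, 1], lam=1.0)
```

With four equally weighted neighbours the entropy reaches its maximum log 4, while a single dominant relation pushes it well below that bound; the adaptive rule then assigns the ambiguous node a larger λ.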
This measure acts as a geometric filter: close relations are preserved, distant relations are exponentially suppressed.

6. Blurring of points
In this construction graph nodes do not represent ideal points of space. Each node rather represents an equivalence class of relations, i.e. a set of elements indistinguishable at the given scale. Formally the point is replaced by a region [x]ε whose size depends on metric parameters.

7. Existence of epsilons
The parameter ε determines the resolution at which elements can be distinguished. For a given λ,
ε ∼ e^{-λ}
which means that the resolution of the space is directly related to the scale parameter.

8. Construction of the set of real numbers
From this perspective the set of real numbers is not a primary structure. One can construct various projection analogues of the set ℝ which to first approximation behave like real numbers, but at higher orders reveal asymmetries arising from the metric construction.

9. Family of projection algebras
We thus obtain a family of algebras Aλ whose first component approximates the structure of ℝ well, and whose expansion contains additional components arising from equivalence classes and boundaries. These additional components describe point blurring and suppression effects.

10. Geometric analogy
Geometrically the construction resembles the structure of a quotient algebra ℝ × ℝ with the lower-triangular representation
( a 0 )
( b a )
where the first component corresponds to the classical value, and the second describes local deviations arising from the construction. (This paper is not about the construction of such algebras, so I skip the topic here; I have worked on it extensively elsewhere.)

11. Exponential character
In this construction the analogue of the set of real numbers arises from the exponential apparatus of the metric, dλ(x,y) = e^{-λ L(x,y)}, i.e. the algebraic structure appears as an effect of this metric, not as a preliminary assumption.

12.
Asymmetry of the expansion Since equivalence classes and boundaries are defined by graph relations, the algebraic expansion need not be symmetric. Local asymmetries, different boundary classes, different rates of relation suppression are possible. 13. Consequence for entropy Local entropy in practice measures the degree of blurring of such equivalence classes. A region of high entropy means many competing boundary classes, lack of unique point representation. 14. Adaptive geometry Control of the parameter λ by entropy therefore allows dynamic adjustment of analysis resolution, size of equivalence classes, interpretation of local geometry. 15. Conclusion The local entropy of the graph provides a natural mechanism for adapting the scale parameter λ. This parameter controls the metric resolution and the size of equivalence classes representing points of the projection space. This construction leads to a family of projection algebras which in first approximation preserve the properties of real numbers, but at higher orders reveal asymmetries arising from boundary classes and exponential suppression of relations. The geometric structure of the space is therefore an effect of the metric construction, and not vice versa. ======== 23_remark_technical.txt ======== Remark of an epistemological and technical nature. I introduce two important ideas: 1. the metric is not ontological, but operational, 2. different geometric descriptions can be equivalent representations of the same relational structure. This fits exactly with the entire construction I have already presented, namely: graph of relations → family of metrics → different effective geometries. Different metrics as equivalent representations of the same relational structure In the projective construction, the graph of relations G = (V, E) is the primary structure, while the metric introduced on this graph is an interpretive tool. 
In particular, the same graph can be analyzed using different families of metrics dα(x,y), where the parameter α denotes the choice of geometric apparatus. Different metrics can lead to different geometric descriptions of the same relational structure, which nevertheless preserve consistency with respect to observable relations between graph elements.

Transformations between metrics
If there exists a mapping T : (V, d₁) → (V, d₂) that preserves the relations essential for a given analysis (e.g. similarity order or clusters), then the two metric descriptions can be treated as equivalent representations of the same relational structure. In this sense, graph analysis does not consist in finding the "true" metric, but in finding metrics that reveal specific aspects of the relational structure.

Example: p-adic geometries
A good example of such a situation is given by physical models based on p-adic geometries, developed inter alia within mathematical physics by Igor Volovich. In these constructions the space has the metric
dₚ(x,y) = p^{-νₚ(x-y)}
which is ultrametric and possesses geometric properties very different from Euclidean geometry. However, many relational structures (e.g. hierarchies of interactions or tree-like structures of states) can be represented both in the p-adic metric and in other ultrametrics arising from the projection of the graph.

Intermediate metrics
If the graph of relations is treated as the primary structure, it is possible to construct families of metrics that interpolate between different geometric descriptions, dα(x,y), where the parameter α can represent a change of scale, a change of anisotropy, or a change of interpretation of relations. This makes it possible to investigate whether different geometric models describe the same relational structures.
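For concreteness, the p-adic example can be checked numerically. The sketch below (Python; nu_p and d_p are hypothetical helper names introduced here) computes dₚ(x,y) = p^{-νₚ(x-y)} and verifies the ultrametric inequality on a range of integer triples.

```python
import math
from itertools import product

def nu_p(n: int, p: int) -> float:
    """p-adic valuation: the largest k with p**k dividing n (inf for n == 0)."""
    if n == 0:
        return math.inf
    k = 0
    while n % p == 0:
        n //= p
        k += 1
    return k

def d_p(x: int, y: int, p: int) -> float:
    """p-adic metric d_p(x, y) = p**(-nu_p(x - y))."""
    v = nu_p(x - y, p)
    return 0.0 if v == math.inf else p ** (-v)

p = 3
points = range(-20, 21)
# The ultrametric inequality d(x,z) <= max(d(x,y), d(y,z)) holds
# for every triple of integers.
for x, y, z in product(points, repeat=3):
    assert d_p(x, z, p) <= max(d_p(x, y, p), d_p(y, z, p)) + 1e-12
```

Numbers sharing more initial digits of their p-adic expansion are closer: for example d₃(9, 0) = 3⁻² = 1/9, while d₃(1, 2) = 1.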
Consistency of solutions If two models based on different metrics lead to the same classes of relations, trajectories in the graph, cluster structures, then they can be treated as different representations of the same relational solution. In such an approach, geometry is not uniquely determined by the data structure, but is one of the possible ways of interpreting it. Observer’s perspective From an epistemological point of view, such analysis reflects a situation in which the observer is inside the system under study. We do not directly know the metric of the space in which relations between objects exist. Only the effects of these relations are available, observed through projections or measurements. Consequently, different geometric apparatuses can be treated as models describing the same relational structure from different perspectives. Significance for the projective construction In the context of the presented construction, this means that the graph of relations is the primary structure, metrics are tools for exploring this structure, and different geometries can reveal different aspects of the same relations. In particular, the projective construction allows transition between different families of metrics without changing the graph structure itself. ======== 24_remarks_Why_Prefix_Trees_Naturally_Induce_Ultrametric_Geometry.txt ======== Why Prefix Trees Naturally Induce Ultrametric Geometry; Hierarchical address structures, such as prefix trees (tries), naturally lead to ultrametric spaces. This follows directly from the common prefix property and the way the tree encodes relations between elements. 1. Address as a path in the tree Let the set of elements X be represented by addresses A(x) = a₁ a₂ … aₖ where each symbol aᵢ corresponds to a transition to the next level of the tree. The address can be interpreted as a path from the root to a node: root → a₁ → a₂ → ⋯ → aₖ Two elements are more similar the longer common prefix they share. 2. 
Lowest common ancestor For two elements x, y we define ℓ(x,y) as the length of the common prefix, i.e. the level of their lowest common ancestor in the tree. This quantity is a natural measure of similarity. 3. Prefix metric Based on this, one can introduce the distance d(x,y) = α^{-ℓ(x,y)} where α > 1 is a scaling constant. In practice, the exponential form is often used dλ(x,y) = e^{-λ ℓ(x,y)} 4. Ultrametric property This distance satisfies the ultrametric inequality d(x,z) ≤ max(d(x,y), d(y,z)) The proof follows from a simple fact: if two addresses diverge at a certain level of the tree, then all further symbols are already independent. This means that the length of the common prefix satisfies ℓ(x,z) ≥ min(ℓ(x,y), ℓ(y,z)) which directly leads to the ultrametric inequality. 5. Tree geometry In a space built on a tree, metric balls correspond to subtrees, all triangles are isosceles, distances have the character of discrete levels. This geometry is typical of ultrametric spaces. 6. Hierarchical nature of distance The distance between elements is not a function of their difference, but of the level at which their paths diverge. As a result, the geometry is hierarchical rather than linear. 7. Connection with p-adic numbers Exactly the same structure appears in p-adic numbers, where the distance is defined by the power of a prime dₚ(x,y) = p^{-νₚ(x-y)} where νₚ denotes the highest power of p that divides the difference x−y. Interpretatively, this means that numbers are closer the more initial digits they share in their p-adic expansion. This is exactly the same logic as the common prefix in a tree. 8. Prefix as a number expansion A tree address can be treated as an expansion of a number in a positional system. Then the prefix corresponds to the initial digits of the number, and the level of divergence corresponds to the first differing digit. In this way, a prefix tree naturally implements p-adic geometry. 9. 
Consequence for search Since metric balls are subtrees, nearest neighbor search reduces to: exploration of the appropriate prefix. This is why prefix structures are so efficient in similarity search tasks. ======== 25_exponential_metric_is_stable_for_prefix_structures.txt ======== Why the exponential metric is stable for prefix structures; Consider the space X built from addresses in a prefix tree. Each element has an address A(x) = a₁ a₂ … aₖ and for two elements we define the length of the common prefix ℓ(x,y) i.e. the level of the lowest common ancestor in the tree. 1. The most general prefix-dependent metric Any metric respecting the tree structure must have the form d(x,y) = f(ℓ(x,y)) where f : ℕ → ℝ⁺ is a decreasing function. The decreasing condition follows from the fact: longer common prefix → greater similarity → smaller distance. 2. Ultrametric condition The tree implies the property ℓ(x,z) ≥ min(ℓ(x,y), ℓ(y,z)) therefore the distance must satisfy d(x,z) ≤ max(d(x,y), d(y,z)) i.e. the ultrametric inequality. Substituting d = f(ℓ) we obtain the condition f(ℓ(x,z)) ≤ max(f(ℓ(x,y)), f(ℓ(y,z))) which holds for every monotonically decreasing function. This means: the ultrametric property itself does not select a specific function f. 3. Scale stability problem The key condition appears only when analyzing tree scaling. Moving one level down the tree should change the distance in a locally identical way, independent of the tree level. Formally: d_{k+1} = r · d_k for some constant r. In other words, the distance should depend only on the difference in levels, not on the absolute depth. 4. Functional equation Let us denote f(ℓ) = d_ℓ The constant scaling condition gives d_{ℓ+1} = r d_ℓ which leads to the equation f(ℓ+1) = r f(ℓ) 5. The unique solving function The solution to this equation is the exponential function f(ℓ) = C r^ℓ i.e. 
d(x,y) = C r^{ℓ(x,y)} Since the distance must decrease as the prefix increases, we take r = e^{-λ} which leads to the form dλ(x,y) = C e^{-λ ℓ(x,y)} 6. Interpretation Each level of the tree reduces the distance by a constant factor. Geometrically, this means the tree acts as a system of logarithmic scaling of the space. 7. Why other functions are unstable If we choose, for example: linear d(ℓ) = a − b ℓ → after some level the distance becomes negative. power d(ℓ) = ℓ^{-k} → change of tree level does not give constant scaling — the metric structure depends on absolute depth. logarithmic d(ℓ) = log(ℓ) → the metric ceases to be consistent with the prefix hierarchy. 8. Operational stability The exponential metric has one more property: operations on the tree preserve its structure. For example, when extending an address by one symbol: ℓ → ℓ+1 the distance changes only by multiplication by a constant. This is very important for indexing, search, projection operations. 9. Connection with p-adic metric The p-adic metric has exactly the same form dₚ(x,y) = p^{-νₚ(x-y)} i.e. it is also exponential with respect to the level of common prefix in the number expansion. 10. Conclusion For spaces defined by common prefix: every metric depends on the length of the prefix scale stability requires a constant rate of distance change the only function satisfying this condition is the exponential function Therefore the metric dλ(x,y) = e^{-λ ℓ(x,y)} is the natural and structurally stable metric for tree-like spaces. Why this is more important than it seems This has a very interesting consequence for the construction. If the space of relations induces a prefix tree, then exponential damping is not an arbitrary choice — it follows directly from the geometry of the structure. In other words, softmax, attention, weight damping etc. can be interpreted as a metric consequence of the hierarchy of relations, and not merely an optimization heuristic. 
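The two key claims of this file, constant scaling per level and ultrametricity of the resulting distance, are easy to verify numerically. Below is a small Python sketch (addresses as plain strings; common_prefix_level, f, and d are hypothetical helper names chosen for this illustration).

```python
import math

def common_prefix_level(a: str, b: str) -> int:
    """Level l(x, y) of the lowest common ancestor: shared prefix length."""
    n = 0
    for s, t in zip(a, b):
        if s != t:
            break
        n += 1
    return n

def f(level: int, C: float = 1.0, lam: float = 1.0) -> float:
    """f(l) = C * r**l with r = exp(-lam): the scale-stable choice."""
    return C * math.exp(-lam * level)

r = math.exp(-1.0)
# Constant level scaling: descending one level multiplies the distance
# by the same factor r, independent of the absolute depth.
for level in range(10):
    assert abs(f(level + 1) - r * f(level)) < 1e-12

# The induced distance satisfies the ultrametric inequality on any
# address set.
addresses = ["aaa", "aab", "aba", "abb", "baa"]
def d(a, b):
    return f(common_prefix_level(a, b))
for a in addresses:
    for b in addresses:
        for c in addresses:
            assert d(a, c) <= max(d(a, b), d(b, c)) + 1e-12
```

Note that extending an address by one symbol changes the distance only by the fixed factor r, which is the operational-stability property used above for indexing and search.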
======== 26_softmax_in_transformers_acts_as_ultrametric_projection.txt ========

Why softmax in transformers acts as an ultrametric projection
Consider the standard attention mechanism of the transformer introduced in the paper "Attention Is All You Need" by Ashish Vaswani and co-authors. In the attention layer we have the operation
Attention(Q, K, V) = softmax(QKᵀ / √d) V
where Q are the queries, K the keys, and V the values.

1. Softmax as an exponential selection operator
For a vector of similarities sᵢ = q · kᵢ, softmax gives the weights
wᵢ = e^{β sᵢ} / Σⱼ e^{β sⱼ}
where β = 1/√d plays the role of an inverse temperature. Key property: linear differences in s turn into exponential differences in weights.

2. Metric interpretation
Assume the interpretation sᵢ = −d(q, kᵢ), i.e. similarity as negative distance. Then
wᵢ ∝ e^{-β d(q, kᵢ)}
which is exactly exponential metric damping.

3. Connection with prefix geometry
If the distance has the structure d(q, kᵢ) = λ ℓ(q, kᵢ), where ℓ is the level of common structure (e.g. the common prefix of representations), then
wᵢ ∝ e^{-β λ ℓ(q, kᵢ)}
which is exactly the exponential metric on a tree, dλ = e^{-λ ℓ}. Thus softmax realizes a weight projection consistent with the ultrametric hierarchy.

4. "Dominant ancestor" effect
For large β, even a moderate gap maxᵢ sᵢ − sⱼ is amplified to e^{β(maxᵢ sᵢ − sⱼ)} ≫ 1, which gives wᵢ ≈ 1 for the best match. This corresponds to the situation in a tree: the closest common ancestor determines the relation. In ultrametrics it is typical that one branch dominates while the rest carries practically zero weight. Softmax generates exactly this effect.

5. Softmax as a soft tree projection
If the representations K form a similarity hierarchy, softmax acts as a soft selection of a subtree. Operationally:
1. we compute similarities,
2. we select the most consistent region of the space,
3. we aggregate values from that region.
This is exactly a projection operation onto a local subtree.

6.
Why softmax stabilizes hierarchies
Because the weights are exponential, a small difference in similarity causes strong damping of distant elements. As a result a natural structure emerges: one dominant cluster and a few weak corrections. This is a characteristic property of ultrametric spaces.

7. Difference between softmax and pure exponential metric
Softmax introduces an additional operation: the probabilistic normalization Σᵢ wᵢ = 1. A pure exponential metric, wᵢ = e^{-λ dᵢ}, does not require this normalization. In some models it turns out that removing softmax and keeping only the damping gives more stable behavior.

8. Why softmax is only an approximation
Softmax mixes two things:
1. similarity geometry,
2. probabilistic interpretation.
Tree geometry requires only the first element. Therefore many newer architectures introduce modifications such as linear attention, kernel attention, and exponential gating, which remove the softmax normalization.

9. Geometric interpretation of the transformer
In this perspective the transformer does three things:
1. projects data into a relational space,
2. estimates a similarity hierarchy,
3. performs an ultrametric projection via softmax.
In other words, the attention layer acts as a projection operator onto the nearest region of the relation tree.

10. Why my construction is interesting here
This approach does the reverse: instead of treating softmax as a heuristic, it shows that hierarchical geometry of relations implies an exponential metric. This means that many mechanisms in language models can be interpreted as consequences of the relational structure of the data, and not merely as the result of gradient optimization.

Very interesting consequence
The graph construction indeed generates a prefix space, so the transformer can be interpreted as an algorithm for exploring an ultrametric relational space. This is an interpretation that fits very well with the behavior of language models.
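The identification of softmax with normalized exponential damping is a one-line computation, and the "dominant ancestor" effect follows at large β. The sketch below is plain Python with arbitrarily chosen toy distances and β values; softmax here is a hand-rolled helper, not a library call.

```python
import math

def softmax(scores, beta=1.0):
    """Standard softmax with inverse temperature beta."""
    exps = [math.exp(beta * s) for s in scores]
    Z = sum(exps)
    return [e / Z for e in exps]

# Similarity as negative distance: s_i = -d(q, k_i).
distances = [0.1, 0.2, 1.5, 1.6]
scores = [-dist for dist in distances]

# softmax(beta * s) is exactly the normalised exponential damping
# w_i proportional to exp(-beta * d_i).
w = softmax(scores, beta=1.0)
damped = [math.exp(-1.0 * dist) for dist in distances]
Z = sum(damped)
assert all(abs(wi - di / Z) < 1e-12 for wi, di in zip(w, damped))

# Large beta: one branch dominates and distant elements are suppressed,
# the "dominant ancestor" effect.
w_sharp = softmax(scores, beta=50.0)
```

At β = 50 the nearest key already carries more than 99% of the total weight, which is the one-dominant-cluster structure described above.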
======== 27_why_the_operator_appears_in_many_systems.txt ========

Why the operator e^{-λ d} appears in many systems
The key fact is that the exponential weighting function is the only stable weighting function resulting from minimization of an energy functional under an information constraint. In various fields exactly the same construction appears.

1. General problem
Consider a system selecting elements xᵢ with costs Eᵢ = d(xᵢ), where d is some "distance" or cost. We want to assign weights wᵢ such that:
1. smaller costs are preferred,
2. some information distribution is preserved.

2. Informational energy functional
The standard construction consists in minimizing the functional
F = Σᵢ wᵢ Eᵢ + (1/λ) Σᵢ wᵢ log wᵢ
where the first term is the average energy and the second equals −S/λ, with S = −Σᵢ wᵢ log wᵢ the information entropy. This is exactly a free energy.

3. Normalization condition
We assume Σᵢ wᵢ = 1, i.e. the wᵢ form a probability distribution.

4. Solution of the variational problem
We minimize F under the normalization constraint (Lagrange multiplier). The solution is
wᵢ = e^{-λ Eᵢ} / Σⱼ e^{-λ Eⱼ}
which is exactly the softmax operator.

5. Geometric interpretation
If Eᵢ = d(x, xᵢ), then wᵢ ∝ e^{-λ d(x, xᵢ)}, which means the influence of a point decreases exponentially with distance.

6. The same operator in physics
The same construction appears in statistical physics as the Boltzmann distribution, described by Ludwig Boltzmann,
Pᵢ ∝ e^{-β Eᵢ}, where β = 1/(k T)
It minimizes energy at given entropy.

7. In information theory
In information theory the operator appears as the solution to the problem of minimizing energy under constrained information. Formally this is equivalent to minimizing the free energy E − T S.

8. In signal processing
The operator e^{-λ d} appears as the Gaussian kernel or its variants in signal filtering, kernel estimation, and radial interpolation.

9. In diffusion processes
In the heat equation the solution has the form G(x,t) ∼ e^{-d²/(4t)}, which is again an exponential function of distance.

10.
Why exactly exponential
The reason is mathematically deep. The exponential function is the only positive function preserving scale multiplicativity, i.e. solving the equation
f(x+y) = f(x) f(y)
This is exactly the condition of independence of energy contributions.

11. Informational interpretation
The operator e^{-λ d} can be interpreted as the informational cost of traversing the relational space. Each step increases the cost additively, but the influence decreases multiplicatively.

12. Why it appears everywhere
The operator appears when the system:
1. minimizes cost,
2. preserves entropy,
3. aggregates independent contributions.
These are quite universal conditions.

======== 28_remarks_LLM_simplification.txt ========

Let us pay attention to the constructions in LLMs
The transformer currently performs:
1. computation of similarity,
2. exponential transformation,
3. normalization,
4. aggregation.
But many of these operations are pure geometry, not learning. Neural networks learn only representations and similarity functions. The rest is deterministic.

Current models are "polynomial"
During training the network realizes the target function in a complicated way, by superposing many weighted functions. Neural networks implement functions through compositions of multiplications, additions, and activations. This yields polynomial approximation. Therefore a simple exponential operator often emerges only as a very complex approximation. This happens in an evolutionary way, but it is a waste of computational power.

My construction suggests an alternative
One can separate the architecture into:
geometric part: graph of relations, metric damping e^{-λ d},
learning part: data representation, relation estimation.
Then the NN does not need to simulate the entire computational structure, which is what currently makes it computationally expensive and highly inefficient.

This is an interesting direction
Such an approach leads to architectures that are more interpretable, more stable, and significantly cheaper computationally.
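The proposed separation can be sketched in a few lines. The fragment below is only an illustration under strong simplifications: Euclidean distance stands in for the relational metric, hand-picked vectors stand in for learned embeddings, and geometric_aggregate is a name invented for this sketch. The point is that the aggregation step is fully deterministic; only the embeddings would come from a learned model.

```python
import math

def geometric_aggregate(query, keys, values, lam=1.0):
    """Deterministic part: exponential metric damping w_i = exp(-lam * d_i),
    with no learned attention weights. Only the query/key/value embeddings
    would be produced by a learned model."""
    def dist(u, v):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))
    w = [math.exp(-lam * dist(query, k)) for k in keys]
    Z = sum(w)
    dim = len(values[0])
    return [sum(wi * v[j] for wi, v in zip(w, values)) / Z
            for j in range(dim)]

# Toy stand-ins for learned embeddings.
keys = [[0.0, 0.0], [1.0, 0.0], [5.0, 5.0]]
values = [[1.0, 0.0], [0.0, 1.0], [10.0, 10.0]]
out = geometric_aggregate([0.0, 0.0], keys, values, lam=2.0)
```

The nearest key dominates the output while the distant one is exponentially suppressed, exactly the behaviour the geometric part is supposed to provide for free.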
In a sense, this is a return to the idea of model-based learning instead of pure approximation. ======== 29_remarks_operator_in_LLM_graph.txt ======== Where does the operator e^{-λ d} come from and how does it describe information propagation in a graph. 1. Graph as a propagation space Let G = (V, E) be a graph with edge weights c(u,v) ≥ 0 interpreted as transition cost. Graph distance: d(x,y) = min_{γ:x→y} ∑_{(u,v)∈γ} c(u,v) i.e. the length of the cheapest path. 2. Arrival cost function We define the function ϕ(x) as the minimal cost of reaching point x. It satisfies the Bellman equation ϕ(x) = min_{y∼x} {ϕ(y) + c(y,x)} This is the discrete version of the Hamilton–Jacobi equation. 3. Continuous version In continuous space the Hamilton–Jacobi equation has the form H(x, ∇ϕ) = 0 For the Hamiltonian H(x,p) = |p| − 1 we obtain the eikonal equation |∇ϕ| = 1 whose solution is the geodesic distance ϕ(x) = d(x, x₀) 4. Informational interpretation ϕ(x) tells how much it costs for information to reach point x. The farther from the source, the higher the cost. 5. Transition to propagation amplitude Instead of cost we introduce amplitude u(x) with the definition u(x) = e^{-λ ϕ(x)} i.e. u(x) = e^{-λ d(x, x₀)} 6. Transport equation Substituting ϕ = −1/λ log u into the eikonal equation we obtain |∇(−log u)| = λ i.e. |∇u| ≈ λ u This leads to a propagation equation of diffusion/damping type. 7. Interpretation in the graph If information spreads over the graph and each edge damps the signal by a factor e^{-λ c(u,v)} then along a path γ the amplitude is ∏_{(u,v)∈γ} e^{-λ c(u,v)} i.e. e^{-λ ∑ c(u,v)} 8. Selection of the cheapest path The largest amplitude is given by the path with minimal cost. Therefore u(x) = e^{-λ d(x, x₀)} is the dominant solution of propagation. 9. Physical interpretation This is exactly the same structure as in the least action principle: A(γ) = ∫ L dt and propagation has weight e^{-λ A} 10. 
Informational interpretation Path cost = amount of information needed to describe it. The operator e^{-λ d} is therefore: the amplitude of information propagation through the graph of relations. 11. Why this is exactly the attention mechanism In attention we have wᵢ ∝ e^{-λ d(q, kᵢ)} i.e.: information from point kᵢ reaches query q with damping dependent on distance in the representation space. This is exactly propagation in the graph of relations. 12. Consequence If we adopt this point of view, the transformer does: 1. builds a graph of relations 2. computes an approximate geodesic 3. propagates information with damping i.e. implements discrete Hamilton–Jacobi dynamics in the representation space. Current models do this indirectly: the network “learns” representations then simulates geometry then simulates propagation. From my analysis it follows that one can: 1. explicitly define the graph of relations 2. explicitly compute the distance 3. use the operator e^{-λ d} for propagation. The neural network would only be needed for representation estimation. The most interesting aspect of the construction If the representation space has a prefix/ultrametric structure, then the Hamilton–Jacobi equation simplifies even further. In that case information propagation reduces to: transition between tree levels. And then the entire attention mechanism can be written as an algorithm of traversal over a prefix tree, rather than as a dense matrix QKᵀ. ======== 30_remarks_tranformers.txt ======== If the relational space has a hierarchical (tree-like / ultrametric) structure, then the dynamics of information propagation simplifies dramatically. Then the Hamilton–Jacobi-type equation no longer describes a continuous gradient field, but rather flow between tree levels, which mathematically is very close to renormalization flow. 1. Hamilton–Jacobi as cost propagation I consider the action function ϕ(x) describing the minimal cost of reaching from the source point x₀. 
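The propagation scheme described above (minimal path cost ϕ via the Bellman recursion, then amplitude u = e^{-λϕ}) can be sketched directly. The graph below is a toy example; shortest_costs is an ordinary Dijkstra routine written for this illustration, not part of the construction.

```python
import heapq
import math

def shortest_costs(graph, source):
    """Bellman/Dijkstra: phi(x) = minimal cost of reaching x from source."""
    phi = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        cost, node = heapq.heappop(heap)
        if cost > phi.get(node, math.inf):
            continue  # stale heap entry
        for nbr, c in graph[node]:
            nc = cost + c
            if nc < phi.get(nbr, math.inf):
                phi[nbr] = nc
                heapq.heappush(heap, (nc, nbr))
    return phi

# Toy relation graph with edge costs c(u, v).
graph = {
    "x0": [("a", 1.0), ("b", 2.0)],
    "a":  [("b", 0.5), ("c", 2.0)],
    "b":  [("c", 1.0)],
    "c":  [],
}
lam = 1.0
phi = shortest_costs(graph, "x0")
# Propagation amplitude u(x) = exp(-lam * phi(x)). Since per-edge
# damping multiplies along a path, the cheapest path dominates:
# for c, exp(-1.0)*exp(-0.5)*exp(-1.0) = exp(-2.5) = u["c"].
u = {v: math.exp(-lam * p) for v, p in phi.items()}
```

Here the route x0 → a → b → c (cost 2.5) beats x0 → a → c (cost 3.0), so the amplitude at c is exactly e^{-2.5}, the product of the per-edge damping factors along the cheapest path.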
In the continuous version the Hamilton–Jacobi equation has the form H(x, ∇ϕ) = 0 for a Hamiltonian depending on the norm of the gradient. Typical case: |∇ϕ| = 1 which gives the solution ϕ(x) = d(x, x₀) i.e. the geodesic distance. 2. Transition to propagation amplitude We define the amplitude u(x) = e^{-λ ϕ(x)} i.e. u(x) = e^{-λ d(x, x₀)} This transformation is standard in semiclassical analysis. 3. What changes in ultrametrics In an ultrametric space the distance has the form d(x,y) = α^{-ℓ(x,y)} where ℓ(x,y) = level of the lowest common ancestor. Key property: distance depends only on the first level of divergence in the tree. 4. Gradient disappears In continuous space distance changes are local: ∇d ≠ 0 Whereas in ultrametrics: all points in a given subtree are equidistant, distance change occurs only when jumping a tree level. Thus the gradient becomes a discrete operator. 5. Hamilton–Jacobi equation on a tree For a node v with parent p(v): ϕ(v) = ϕ(p(v)) + cᵥ where cᵥ is the cost of crossing the level. This is exactly the Bellman equation on a tree. 6. Propagation amplitude After the exponential transformation: u(v) = e^{-λ ϕ(v)} we obtain u(v) = e^{-λ cᵥ} u(p(v)) i.e. propagation consists in multiplication by a constant factor at each level. 7. Renormalization structure If we move to tree level k, the amplitude has the form uₖ = e^{-λ Σ_{i≤k} cᵢ} i.e. change of level corresponds to the transformation u_{k+1} = R(uₖ) where R(u) = e^{-λ c} u This is exactly a renormalization operator. 8. In physics renormalization flow describes transition between scales. Here: tree level = scale subtree aggregation = coarse-graining Each tree level integrates information from the lower level. 9. The attention layer in practice performs three operations: 1. computes similarity 2. filters relations 3. aggregates information. In a hierarchical space this can be written as: identification of tree level amplitude propagation node aggregation. 
This is exactly one step of renormalization flow.

10. Transformers as scale flow
If successive layers increase the context range, this corresponds to moving to higher tree levels. That is, layer k integrates information from regions of radius rₖ. This is classical RG dynamics.

11. The most important consequence
If the representation is ultrametric, the attention mechanism simplifies to:
1. finding the closest common prefix
2. moving to the corresponding tree level
3. amplitude propagation.
In other words, pathfinding in the relation tree.

12. The construction effectively performs: query → find the cheapest path in the relation graph, with the operator e^{-λ d} as the propagation weight. This corresponds exactly to shortest-path algorithms, message passing, and dynamic programming.

13. Current LLMs are suboptimal (and seriously so). Transformers implement this via QKᵀ, i.e. the full similarity matrix, at cost O(n²), whereas in a prefix tree searching a region is O(log n).

14. This leads to an architecture: deterministic geometry + NN for representation.
Deterministic part: graph of relations, metric, pathfinding, propagation e^{-λ d}.
Learning part: embedding, relation estimation.

15. In such an architecture the NN no longer simulates searching, aggregation, or filtering. It is left only with learning representations.

16. This is a method of geometric inference in relational space, i.e. inference through exploration of the relational space. Mathematically, it is pathfinding in a metric space with damping e^{-λ d}.

17. If the relational structure is truly hierarchical, then a significant part of LLM functionality can be realized by graph operations and deterministic propagation. NNs are needed only for learning representations and adapting the metric. The construction leads to an architecture very close to how graph search algorithms, dynamic programming, and symbolic inference systems operate: something between a transformer and a planning engine.
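The tree recursions of points 5–7 above, ϕ(v) = ϕ(p(v)) + cᵥ and u(v) = e^{-λ cᵥ} u(p(v)), can be checked on a toy tree. The node names and level costs below are invented for illustration:

```python
import math

# Hypothetical tree: node -> (parent, level-crossing cost c_v); the root has no parent.
tree = {
    "root": (None, 0.0),
    "a": ("root", 1.0), "b": ("root", 2.0),
    "a1": ("a", 0.5), "a2": ("a", 0.5),
}

def phi(v):
    """Bellman on a tree: phi(v) = phi(parent(v)) + c_v."""
    parent, c = tree[v]
    return c if parent is None else phi(parent) + c

def u(v, lam=1.0):
    """Amplitude u(v) = exp(-lam * phi(v))."""
    return math.exp(-lam * phi(v))
```

One level down multiplies the amplitude by the constant factor e^{-λ cᵥ}, which is the renormalization step R(u) = e^{-λ c} u from point 7.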
======== 31_to_the_edge_towards_implementation.txt ========
GloVe → ultrametric → tree structure on CPU (I describe what I already implemented on a weak PC): this is exactly what the theory predicts. The reason is purely structural: an ultrametric implies hierarchy, and a hierarchy has a tree representation with linear complexity.

1. Representation of relations in a general metric space
If I have n elements and a general metric d(xᵢ, xⱼ), then full information about the distances requires a matrix Dᵢⱼ = d(xᵢ, xⱼ) of size n × n, i.e. O(n²) elements. This is the standard problem in embeddings, attention, and kernel methods.

2. What changes with an ultrametric
An ultrametric satisfies the strong triangle inequality
d(x,z) ≤ max(d(x,y), d(y,z))
which leads to a very strong property: all triangles are isosceles or equilateral. From this it follows that relations between points are hierarchical.

3. Theorem on the representation of an ultrametric
Every finite ultrametric space is isometric to the leaf set of a weighted tree. That is:
d(x,y) = h(LCA(x,y))
where LCA = lowest common ancestor in the tree. The distance depends only on the height of that node.

4. Information needed for the representation
A tree with n leaves has:
• n leaves
• at most n−1 internal nodes.
Total: O(n) nodes. Each node stores one height value.

5. Computing distance
The distance between two elements is computed by:
1. finding the LCA
2. reading its height.
Cost: O(log n), or O(1) with preprocessing.

6. Cost comparison

representation   | memory | distance computation
-----------------|--------|---------------------
matrix           | O(n²)  | O(1)
ultrametric tree | O(n)   | O(log n)

For large n the memory savings are enormous.

7. What this means for embeddings
If an embedding can be well approximated by an ultrametric, then instead of storing all relations it suffices to store the hierarchy of clusters. This is exactly the structure of a prefix tree / dendrogram.

8.
Why language is almost ultrametric
Semantic relations are naturally hierarchical:

entity
├─ animal
│   ├─ dog
│   └─ cat
└─ object
    ├─ tool
    └─ vehicle

Therefore the semantic space has a strong ultrametric component.

9. What I did with GloVe
By transforming the embeddings into an ultrametric structure I effectively performed:
1. estimation of the semantic hierarchy
2. compression of relations
3. indexing of the meaning space.
That is why I could compute it without a GPU, with little memory, on a CPU.

10. Why don't transformers exploit this?
The transformer treats relations as a dense vector space. Attention computes QKᵀ, i.e. all pairs of similarities. Cost: O(n²), even if the relation structure is in reality hierarchical. So the transformer has to climb up onto the tree, while we stepped down from it :)

11. What the tree representation gives
If relations are encoded in a tree, attention can be replaced by:
1. finding a tree region
2. aggregating nodes.
Cost: O(n log n) or even O(n)

12. Interpretation as information compression
An ultrametric compresses information because many pairs of points have identical distances. Instead of storing each pair separately, we record only the level at which the clusters separate.

13. Conclusion for LLM architecture
If the relational space is approximately ultrametric, then a large part of the transformer's work can be performed by a tree index, prefix search, and graph propagation. The NN remains necessary for learning representations and adapting the metric.

14. Why this experiment is important
The fact that I did this on a CPU, with 16 GB of RAM, inside a 2D game engine, means that the hierarchical structure of embeddings is highly compressible. This is empirical evidence that many operations in LLMs are redundant. Which was obvious.

15. The underlying intuition
The transformer computes all relations, whereas an ultrametric space stores only the cluster structure. This reduces complexity from O(n²) to O(n). The construction effectively amounts to pathfinding in a relational space with a variable metric.
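The tree representation from points 3–5 of this file can be checked on a toy dendrogram. This is a minimal sketch with hand-built, hypothetical parent and height maps: the distance is the height of the LCA, and the strong triangle inequality follows:

```python
# Hypothetical dendrogram: child -> parent, plus a height for each internal node.
parent = {"dog": "animal", "cat": "animal", "tool": "object",
          "animal": "entity", "object": "entity"}
height = {"animal": 1.0, "object": 1.0, "entity": 2.0}

def ancestors(x):
    chain = [x]
    while chain[-1] in parent:
        chain.append(parent[chain[-1]])
    return chain

def d(x, y):
    """Ultrametric distance: the height of the lowest common ancestor."""
    if x == y:
        return 0.0
    anc = set(ancestors(x))
    for a in ancestors(y):        # first shared ancestor = LCA
        if a in anc:
            return height[a]
    raise ValueError("no common ancestor")
```

Every triangle here is isosceles with the two long sides equal, which is the fingerprint of an ultrametric.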
If the metric is well chosen, most computations reduce to tree exploration. And it can be chosen on the fly, locally.

======== 32_implementation.txt ========
What I built (I already implemented a large part of it while performing chronicler duties) is in practice an ultrametric index over embeddings plus a deterministic mechanism of semantic propagation. Below is the minimal logical framework in the form of an implementable architecture, without unnecessary NN elements. I divide it into layers that can be implemented independently.

0. Model assumptions
Input: embedding E, number of words N, dimension D
E : V → ℝᴰ where V is the vocabulary.
Goal: build a semantic index that enables:
1. neighbor search
2. semantic propagation
3. context generation
without O(N²).

1. Embedding normalization
We determine the global ranges v_min, v_max over all embedding components. Then the transformation:
u = ½ (1 + tanh(α v)) (1 − δ)
Reasons: restriction to (0,1), logarithm stability, damping of distribution tails. This is exactly logistic compression.

2. Logarithmic transformation
l = −log(u)
Why it works: it converts multiplication of dampings into addition and introduces a natural ultrametric scale. Then quantization:
n = ⌊ l / ε ⌋
i.e. the tree level. This is exactly a discretization of the exponential metric.

3. Construction of the ultrametric address
Each embedding dimension provides one digit of the tree. Address:
U_address = [d₁][d₂][d₃]...[d_D] where dᵢ = encode(nᵢ)
This is the equivalent of a prefix in the tree. Encoding in base _base is advantageous because it shortens the address and preserves order. Example base (the one I used so as not to trouble the file system):
u_dBase : 36,
u_nDIGITS : "0123456789abcdefghijklmnopqrstuvwxyz",

4. Tree construction
After generating the pairs (U_address, word_hex) we sort them lexicographically. An implicit prefix tree emerges. No explicit tree structure is needed, just a sorted array with binary search. Memory cost: O(N)

5. Neighbor search
For a word:
1.
generate its address
2. find its position in the array
3. search a window around it.
Pseudocode:
idx = binary_search(address)
candidates = range(idx−K, idx+K)
where K ≈ 512–4096. Then filter by cosine_similarity(vec_query, vec_candidate). This reduces the cost: O(N) → O(K)

6. Construction of the semantic graph
For each word we store e.g. top_1024_neighbors
// we probably won't need that many: with the ultrametric and λ ≈ 2, something like 16 or 32 suffices, since ε damps the rest; lowering λ expands the range; in an ultrametric the list shrinks to 0 as λ → ∞;
// in physics the matter is much more complicated, because the minimal-scale metric differs depending on the vector, and for certain metrics we do not know it at all, nor whether it is discretizable;
i.e. the structure:
word_node = { neighbors: [w₁,w₂,...], weights: [s₁,s₂,...] }
A graph emerges: G = (V, E) where |E| ≈ 1024 N

7. Semantic propagation
I have the operator wᵢ = e^{-λ dᵢ} where dᵢ is the semantic distance.
Algorithm:
state[word] = activation
propagation:
for neighbor in neighbors:
    state[neighbor] += state[word] * exp(-λ d)
This is exactly message passing.

8. Interpretation as an NN
Each word acts as a neuron: activation → neighbors → attenuation, i.e.
a_{t+1} = W a_t
where W is the sparse graph matrix.

9. Prompt precompilation mechanism
For a prompt user_prompt, the algorithm is:
1. tokenization
2. word activation
3. graph propagation
4. selection of the top-k semantic words.
Pseudocode:
active = tokens
for step in propagation_steps:
    expand neighbors
context = top_k(active)

10. Hint generation
I create the structure:
hint = { original_prompt, semantic_context, related_terms }
i.e. something like:
[ user_prompt ]
[ semantic_hint ]

11. The LLM then receives a refined context with polysemy already disambiguated, i.e. its attention operates on a much smaller space.

12. Final architecture
Pipeline:
embedding
↓
ultrametric index
↓
semantic graph
↓
prompt propagation
↓
context expansion
↓
LLM prompt

13.
Computational cost
Offline: O(N log N)
Online: O(K log N) where K ≪ N

14. It works on a "potato" // an i5-8400 CPU @ 2.80 GHz, 16 GB RAM, without using the GeForce GTX 1060 3 GB
Because the whole thing is: sorting, a prefix index, a sparse graph. No N² matrix, no GPU tensors.

15. What I built is in practice a deterministic semantic engine, which reconstructs the local geometry of the embedding and propagates meaning. The NN is needed only for embedding learning. This is not just a "prompt precompiler". It is an ultrametric semantic index plus a propagation engine, i.e. something between ANN search, a knowledge graph, and symbolic inference.

======== 33_summary_and_remarks.txt ========
1. The construction effectively does three things:
1. projection of the embedding into a quasi-ultrametric space
2. deterministic hierarchical addressing
3. a semantic index enabling local search
In short: ℝᵈ → Uᵈ → Σ*
where U = discrete ultrametric space, Σ* = dictionary address (string). This is exactly the indexing mechanism of a prefix tree. Mathematically I am computing x ↦ A(x), where A(x) = (a₁, a₂, …, a_d) and each component is an ultrametric level.

2. The most important thing in the construction
The transformation val → tanh → u → −log(u) → quantization, i.e.
n = ⌊ (−log(0.5 (1 + tanh(α x)) (1 − δ))) / ε ⌋
does three things:
1. stabilization: tanh removes the embedding tails
2. transition to the exponential metric: u ∼ e^{-r}, i.e. r = −log u
3. discretization of levels: r → n
This is precisely what creates the levels of the ultrametric tree.

3. Why it works
After the transformation I effectively have the metric
d(x,y) ≈ maxᵢ |nᵢ(x) − nᵢ(y)|
i.e. the Chebyshev metric on discrete levels. And the max metric is naturally linked to ultrametricity.

4. The strongest element of the system
The address U_address = digits(p₁) digits(p₂) ... digits(p₃₀₀) is effectively a coordinate in a 300-dimensional tree. After lexicographic sorting I obtain a linear representation of the tree. This is a very clever trick.

5.
It is worth noting that:

5.1 explicit metric
I could define an address metric
d_U(x,y) = maxᵢ |nᵢ(x) − nᵢ(y)|
or
d_U(x,y) = Σᵢ wᵢ |nᵢ(x) − nᵢ(y)|
because this would formalize why addresses are semantically local. But I am interested in the implementation.

5.2 error bound
I have the quantization rᵢ = nᵢ ε + ηᵢ where |ηᵢ| ≤ ε, i.e. the reconstruction error is bounded:
|rᵢ^{true} − rᵢ^{approx}| ≤ ε
which shows that the addresses preserve structure.

5.3 ultrametric interpretation
One can show that
d_U(x,z) ≤ max(d_U(x,y), d_U(y,z)) + O(ε)
i.e. I have a quasi-ultrametric. Quasi, because λ → ∞ behaves poorly in floating point. Just like zero.

6. Arming the framework

6.1 prefix search
From an address A(x) = 0F12A3... one can make queries of the form: prefix of length k, i.e. A(x)[0:k]. This gives logarithmic tree search.

6.2 locality window
After sorting one can take a window index ± W, which works like LSH. // and I am at the implementation stage;

6.3 multi-scale search
One can query with prefix length = 1, 2, ..., k, i.e. search from general semantics down to details.

6.4 adaptive ε
Instead of a constant ε I can use εᵢ = σᵢ, where σᵢ is the variance of the i-th embedding dimension. This equalizes information across dimensions.

6.5 random rotations
High-dimensional embeddings are basis-dependent. A single rotation x' = R x, with R orthogonal, can significantly improve the index.

7. A very interesting aspect of the system
The nearest-1024-neighbours lists are effectively a graph G = (V, E), where V = words and E = semantic relations. This is a small neural network on top of the embedding.

8. Interpretation as a neuron
Each word acts as a neuron: activation(word) → propagate to neighbours. This is exactly message passing on the semantic graph.

9. Advantage of the system
A transformer does O(n²). I do O(n log n) or even O(n).

10. Entropy pruning
If a given prefix contains too many words, entropy(prefix) > threshold, we increase the prefix length. This stabilizes search.

11.
One could potentially simplify the function chain tanh → log to
u = exp(−α |x|)
i.e. a simpler exponential map. But since I don't run it often anyway, let it stay without simplifications.

12. Ultrametric Addressing of Embedding Space
i.e.:
1. embedding projection
2. discrete ultrametric
3. prefix index
4. semantic graph.
This is a sensible architecture for a prompt precompiler.

13. It can work as a local semantic cache for an LLM. I.e.:
user prompt
↓
ultrametric search
↓
semantic hints
↓
LLM

======== 34_attention_simple.txt ========
Since I already have hierarchical ultrametric addresses U_address for word embeddings, the semantic space is represented as a prefix tree (trie). In such a tree, semantic similarity corresponds to the length of the common prefix of the addresses. On this basis one can build a hierarchical attention mechanism that replaces classical self-attention.

1. Reference point: classical attention
In the transformer we compute
A = softmax(Q Kᵀ / √d) V
for a sequence of length n. Cost: O(n² d), because we compare every token with every other. This is the main cost of LLM models. Out of thrift, this is exactly the solution I avoid.

2. The address representation A(x) = (a₁, a₂, …, a_d), after lexicographic sorting, gives a linear representation of the tree. The most important property: similar words have long common prefixes. I.e.:
similarity(x,y) ≈ prefix_length(A(x), A(y))

3. Instead of comparing token x with all y, I do the following:
1. compute the address A(x)
2. find its region in the tree
3. take elements from that region.
This is hierarchical nearest neighbor search.

4. Data structure
Minimal index:
Trie
├── prefix_1
│   ├── prefix_2
│   │   ├── words
│   │   └── ...
Each node stores:
{ count, centroid_vector, child_pointers }

5. Hierarchical attention
For a query q:
step 1: compute the address A(q)
step 2: descend the tree: root → prefix₁ → prefix₂ → ...
step 3: at level k collect the candidates.

6.
Computational cost
The tree has height O(log n), because each level divides the space. Search: O(log n) for one token. For a sequence of length n: O(n log n) instead of O(n²)

7. How the attention weight arises
Instead of softmax we build the weight from the ultrametric:
w(x,y) = exp(−λ d_U(x,y))
where d_U(x,y) = maxᵢ |nᵢ(x) − nᵢ(y)|
i.e. the same propagation e^{-λ d} as in the earlier notes.

8. Context computation
Context of a token:
c(x) = Σ_{y ∈ N(x)} w(x,y) v(y)
where N(x) are the candidates from the tree. I.e. local attention.

9. Multi-scale attention
The tree provides natural levels:
level 1 – topics
level 2 – concepts
level 3 – synonyms
One can compute the context from several levels:
c(x) = Σ_k β_k Σ_{y ∈ N_k(x)} w(x,y) v(y)
This corresponds to multi-head attention, only hierarchically.

10. Language embeddings have the property of semantic locality: meaning is nearby in the space. The tree exploits this property instead of comparing all pairs.

11. Minimal pseudocode
function hierarchical_attention(token):
    addr = compute_U_address(token)
    node = trie_root
    for level in prefix_levels:
        node = node.child(addr[level])
    candidates = node.words
    context = 0
    for w in candidates:
        dist = ultrametric_distance(token, w)
        weight = exp(-lambda * dist)
        context += weight * value(w)
    return context

12. No need for a GPU
Because there is no QKᵀ, i.e. no huge matrix multiplication. Instead there is a tree index and local sums. This works very well on a CPU. // so on a GPU we go to space :)

13. In practice
For n = 8000 tokens a transformer does n² = 64M comparisons, while hierarchical attention does n log n ≈ 8000 · 13 = 104k. A difference of ~600×.

14. The system can act as an external attention module for an LLM. Pipeline:
prompt
↓
ultrametric search
↓
semantic expansion
↓
LLM
The LLM receives an already reduced meaning space.

15. This shows that attention is only one implementation of information propagation in the semantic graph. My method does exactly the same, but through a tree instead of a full matrix.
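The whole pipeline of files 32–34 can be sketched end-to-end on a toy vocabulary. Everything below (the constants α, δ, ε, λ, the three embeddings, and the scalar values) is invented for illustration, and the sorted-array window stands in for the trie descent, as in the implementation notes:

```python
import bisect
import math

# Invented constants for the sketch: tanh gain, tail damping, quantization step, lambda.
ALPHA, DELTA, EPS, LAM = 1.0, 1e-3, 0.05, 1.0
DIGITS = "0123456789abcdefghijklmnopqrstuvwxyz"   # base-36, as in the notes

def levels(vec):
    """tanh squash to (0,1), -log to an exponential scale, quantize to tree levels."""
    out = []
    for v in vec:
        u = 0.5 * (1.0 + math.tanh(ALPHA * v)) * (1.0 - DELTA)
        out.append(int(-math.log(u) // EPS))
    return out

def address(vec):
    """One base-36 digit per dimension (levels clipped to one digit in this toy)."""
    return "".join(DIGITS[min(n, 35)] for n in levels(vec))

def d_U(a, b):
    """Ultrametric-style distance d_U = max_i |n_i(a) - n_i(b)|."""
    return max(abs(x - y) for x, y in zip(levels(a), levels(b)))

emb = {"cat": (0.10, 0.50), "dog": (0.12, 0.48), "car": (-2.0, 3.0)}
value = {"cat": 1.0, "dog": 2.0, "car": 5.0}
index = sorted((address(v), w) for w, v in emb.items())   # implicit prefix tree

def context(word, K=2):
    """Binary-search the sorted addresses, take a +/-K window, aggregate with e^{-lam d_U}."""
    keys = [a for a, _ in index]
    i = bisect.bisect_left(keys, address(emb[word]))
    window = index[max(0, i - K): i + K]
    return sum(math.exp(-LAM * d_U(emb[word], emb[w])) * value[w] for _, w in window)
```

With these numbers "cat" and "dog" land on the same address, so the window contains only them, and "car" never enters the sum: the O(K) local attention of the notes.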
======== 35_propagation_equation.txt ========
A shortened formal chain showing where exactly the propagator exp(−λ d) appears. The distance on the graph solves the Hamilton–Jacobi equation, and the propagator solves its Hopf–Cole transform.

1. Semantic graph
I have a graph G = (V, E) where V – words, E – semantic relations (e.g. from the list of 1024 neighbors); edges have cost c(x,y) ≥ 0, i.e. semantic distance. Geodesic distance:
d(x,y) = min_{γ:x→y} ∑_{e∈γ} c(e)

2. Value function
I define S(x) as the minimal cost of reaching node x. This is the classical optimal control problem.

3. Bellman optimality principle
S(x) = min_{y∼x} (S(y) + c(y,x))
i.e. the cost of reaching x is the cost of reaching a neighbor plus the edge cost. This is the discrete Hamilton–Jacobi equation.

4. Continuous form of Hamilton–Jacobi
In continuous space the equation has the form
∂ₜ S + H(∇S) = 0
where H is the Hamiltonian. For propagation with unit cost: H(p) = |p|, i.e.
∂ₜ S + |∇S| = 0
Solution: S(x) = d(x, x₀), i.e. the geodesic distance.

5. Hopf–Cole transformation
Now I introduce the transformation
ψ(x) = e^{-λ S(x)}  i.e.  S(x) = −(1/λ) log ψ(x)
This is the classical transformation that linearizes Hamilton–Jacobi equations.

6. Substitution into the Bellman equation
Substituting:
−(1/λ) log ψ(x) = min_{y∼x} (−(1/λ) log ψ(y) + c(y,x))

7. Transform
Multiplying by −λ (which turns min into max):
log ψ(x) = max_{y∼x} (log ψ(y) − λ c(y,x))
i.e.
ψ(x) = max_{y∼x} (ψ(y) e^{-λ c(y,x)})

8. Interpretation
The propagator appears:
K(x,y) = e^{-λ c(x,y)}
i.e. exactly my e^{-λ d}: the natural kernel of Hamilton–Jacobi propagation, not a heuristic.

9. Summation form (attention)
If instead of the maximum I use a sum (soft relaxation):
ψ(x) = ∑_{y∼x} ψ(y) e^{-λ c(x,y)}
This is exactly attention(x,y) = e^{-λ d(x,y)}

10. The ultrametric index provides the approximation d(x,y) ≈ d_U(x,y), i.e.
K(x,y) ≈ e^{-λ d_U(x,y)}
and d_U can be found in time O(log n) thanks to the tree.

11. Form of hierarchical attention
c(x) = ∑_{y∈N(x)} e^{-λ d_U(x,y)} v(y)
where N(x) are the candidates from the prefix tree.

12. Physical interpretation
This propagation is the solution of the equation ∂ₜ S + |∇S| = 0 on the semantic graph: information propagates like a wavefront of minimal cost.

13. Interpretation in the implementation
The ultrametric address gives an approximation of the distance d_U(x,y), and the propagation weight exp(-lambda * d) is exactly the Green function of Hamilton–Jacobi propagation.

14. The construction performs:
1. embedding → semantic graph
2. graph → geodesic metric
3. metric → propagator
i.e. the propagation rule e^{-λ d} follows directly from the Hamilton–Jacobi equation.

15. The system is not just a search heuristic. It is a numerical solver of the information propagation equation on the semantic graph.

======== 36_diffusion.txt ========
Continuing the formal chain: replacing the gradient term by local averaging over neighbors turns the same propagator exp(−λ d) into the kernel of a diffusion equation on the graph.

1. Starting point – the Hamilton–Jacobi equation
On the semantic graph G = (V, E) the cost function S(x,t) satisfies the discrete form
∂ₜ S(x,t) + H(∇_G S(x,t)) = 0
where the graph gradient can be written as differences on edges:
∇_G S(x,y) = S(y) − S(x)
and the Hamiltonian (for unit cost) takes the form H(p) = |p|, i.e.
∂ₜ S + |∇_G S| = 0
The stationary solution is the geodesic distance: S(x) = d(x, x₀)

2. Hopf–Cole transformation
I introduce the transformation
ψ(x,t) = e^{-λ S(x,t)}  i.e. inversely  S(x,t) = −(1/λ) log ψ(x,t)
This is the standard transformation converting Hamilton–Jacobi equations into linear equations.

3.
Substitution into the equation
I compute the derivative ∂ₜ S = −(1/λ)(∂ₜ ψ / ψ) and the gradient ∇_G S = −(1/λ)(∇_G ψ / ψ). Substituting into HJ:
−(1/λ)(∂ₜ ψ / ψ) + |−(1/λ)(∇_G ψ / ψ)| = 0
After multiplying by λψ (with ψ > 0):
−∂ₜ ψ + |∇_G ψ| = 0
This is already the amplitude transport equation.

4. Local averaging (relaxation)
If I replace the absolute value of the gradient with averaging over neighbors (a typical step in the transition from HJ to diffusion on a graph), I obtain:
∂ₜ ψ(x) = Σ_{y∼x} w(x,y) (ψ(y) − ψ(x))
This is classical diffusion on the graph. The operator
L ψ(x) = Σ_{y∼x} w(x,y) (ψ(y) − ψ(x))
is the graph Laplacian.

5. Weights from the propagator
The weights come from the previous section:
w(x,y) = e^{-λ d(x,y)}
i.e. exactly the propagation kernel. The diffusion equation:
∂ₜ ψ = L ψ
where the off-diagonal entries are L_{xy} = e^{-λ d(x,y)} (the diagonal carries the negative row sums).

6. Transition to the ultrametric
In the construction the geodesic distance is approximated by the ultrametric:
d(x,y) ≈ d_U(x,y) where d_U(x,y) = maxᵢ |nᵢ(x) − nᵢ(y)|
This metric is naturally represented by a prefix tree.

7. Laplacian on the tree
On an ultrametric tree diffusion takes the form:
∂ₜ ψ(v) = Σ_{u ∈ children(v)} w(v,u) (ψ(u) − ψ(v)) + w(v, parent(v)) (ψ(parent(v)) − ψ(v))
i.e. a flow of information between the node, its children, and its parent.

8. Semantic interpretation
On an ultrametric tree the levels denote:
level 0 — general categories
level 1 — topics
level 2 — concepts
level 3 — near-synonymous words
Diffusion means the propagation of semantic activation between levels of abstraction.

9. Form of the solution
Solution of the diffusion equation:
ψ(t) = e^{t L} ψ(0)
i.e.
ψ(x,t) = Σ_y K_t(x,y) ψ(y,0)
where the diffusion kernel has the form K_t(x,y) ∼ e^{-λ d_U(x,y)}

10. Connection with hierarchical attention
The mechanism
weight = exp(-lambda * d)
context = Σ weight * value
is exactly one iteration of diffusion. I.e.:
c(x) = Σ_y e^{-λ d_U(x,y)} v(y)

11. The tree accelerates the computation
On a general graph diffusion requires visiting all neighbors.
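One explicit-Euler step of ∂ₜψ = Lψ on a small dense graph illustrates this all-neighbors cost; the distances, λ, and the step size are invented for the sketch. Symmetric weights conserve the total activation:

```python
import math

# Invented symmetric distances on a 3-node semantic graph.
dist = {("a", "b"): 1.0, ("b", "c"): 1.0, ("a", "c"): 2.0}
nodes = ["a", "b", "c"]
lam, dt = 1.0, 0.1

def w(x, y):
    """Kernel weight w(x, y) = exp(-lam * d(x, y))."""
    return math.exp(-lam * dist[tuple(sorted((x, y)))])

def diffusion_step(psi):
    """One explicit-Euler step of d/dt psi = L psi, L the graph Laplacian."""
    out = {}
    for x in nodes:
        flow = sum(w(x, y) * (psi[y] - psi[x]) for y in nodes if y != x)
        out[x] = psi[x] + dt * flow
    return out

psi0 = {"a": 1.0, "b": 0.0, "c": 0.0}   # a unit of activation at "a"
psi1 = diffusion_step(psi0)
```

After one step the nearer node "b" has received more activation than "c", and the total is unchanged: exactly the dense iteration that the tree version below avoids recomputing for every pair.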
On an ultrametric tree the number of neighbors is limited and entire subtrees can be aggregated. Therefore the cost is O(n log n) instead of O(n²)

12. Physical interpretation of the construction
S = semantic potential, ψ = amplitude of meaning propagation. The equation ∂ₜ ψ = L ψ describes the diffusion of meaning in the semantic space.

13. In practice the framework does:
1. embedding → semantic graph
2. graph → ultrametric tree
3. tree → Laplacian operator
4. Laplacian → information diffusion.
I.e. I have a solver of the semantic diffusion equation.

14. The transformation ψ = e^{-λ S} converts the minimal-cost problem (Hamilton–Jacobi) into a propagation (diffusion) problem. Therefore the rule e^{-λ d} is not a heuristic for finding similar words; it is the fundamental kernel of information propagation in this space.

======== 37_fundamental_property.txt ========
The framework has one fundamental mathematical property:
K(x,y) = e^{-λ d(x,y)}
i.e. a propagation kernel depending only on the metric. This is a very special class of operators that appears in many fields of physics and mathematics when describing propagation in a space with an uncertain or emergent metric. Therefore the system can be treated as a universal solver of propagation on a graph with a parametrized metric. Important relations and applications that come to mind:

1. Mathematical class of the framework
Formally I have the operator
(T_λ f)(x) = Σ_y e^{-λ d(x,y)} f(y)
i.e. a Laplace kernel, a Green function of propagation, a diffusion operator. The parameter λ controls the metric scale. This means the system is in practice a multi-scale diffusion engine.

2. Spin glass
Spin glass models (e.g. the Sherrington–Kirkpatrick model) have a state space with ultrametric structure. Parisi showed that the energy minima organize into an ultrametric tree; this is the Parisi solution. Key relation: the overlap q(x,y) satisfies ultrametricity:
q(x,z) ≥ min(q(x,y), q(y,z))
i.e. exactly the same structure as my addresses.
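The operator (T_λ f)(x) = Σ_y e^{-λ d(x,y)} f(y) from point 1 can be sketched directly, together with a row-normalized Markov variant; the three-state space and its distances are invented for illustration:

```python
import math

# Invented symmetric distances on three semantic states.
dist = {("a", "b"): 1.0, ("b", "c"): 1.0, ("a", "c"): 2.0}
nodes = ["a", "b", "c"]
lam = 1.0

def d(x, y):
    return 0.0 if x == y else dist[tuple(sorted((x, y)))]

def T(f):
    """(T_lam f)(x) = sum_y exp(-lam * d(x, y)) * f(y): the propagation operator."""
    return {x: sum(math.exp(-lam * d(x, y)) * f[y] for y in nodes) for x in nodes}

def markov_row(x):
    """Row-normalized kernel P(x, y) = exp(-lam * d(x, y)) / Z_x: a random-walk step."""
    k = {y: math.exp(-lam * d(x, y)) for y in nodes}
    z = sum(k.values())
    return {y: v / z for y, v in k.items()}

f1 = T({"a": 1.0, "b": 0.0, "c": 0.0})   # propagate a unit activation at "a"
P = {x: markov_row(x) for x in nodes}
```

T spreads activation with exponential damping by distance; P is the same kernel turned into a stochastic matrix, i.e. a sampler of the space.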
Interpretation
My semantic tree is mathematically equivalent to the energy landscape of a spin glass.

3. Ultrametric diffusion
In the physics of glasses and proteins there appears the equation
∂ₜ ψ = −D (−Δ_U)^α ψ
where Δ_U = the ultrametric Laplacian. This describes relaxation in glasses and diffusion in energy landscapes. My operator e^{-λ d_U} is exactly the kernel of this diffusion.

4. Markov processes on trees
If I normalize the kernel:
P(x,y) = e^{-λ d(x,y)} / Z_x
I obtain a Markov chain. This means the framework can act as a random-walk process, a sampler of the semantic space.

5. Renormalization processes (RG)
The parameter λ plays the role of the renormalization scale.
For small λ: large semantic regions.
For large λ: local meaning.
This is an exact analogy to the renormalization group.

6. Memory models
In physics and neurobiology, memory models often have an energy landscape of spin-glass type, e.g. the Hopfield network. The construction works very similarly: state → diffusion → attractor, i.e. associative semantic memory.

7. Kernel methods in ML
The kernel K(x,y) = e^{-λ d(x,y)} is a classical radial kernel, e.g. e^{-γ ‖x−y‖}. This means the framework is compatible with kernel PCA, SVMs, and diffusion maps.

8. Diffusion models
Generative diffusion models solve the equation ∂ₜ p = Δ p, i.e. diffusion. The construction is diffusion in semantic space instead of pixel space.

9. Path integral
The propagator K(x,y) = e^{-λ d(x,y)} is an analog of the propagator in quantum mechanics:
K(x,y) = e^{-S(x,y)/ℏ}
where S is the action. The system can be interpreted as an integral over semantic trajectories.

10. Graph energy models
One can define the energy E(x) = Σ_y e^{-λ d(x,y)} and treat the system as a statistical Boltzmann model: P(x) ∼ e^{-E(x)}

11. Search in state space
The ultrametric tree acts as a semantic index and a search heuristic. This is similar to algorithms such as beam search and A*, but with an emergent metric.

12. Metric integration is the most interesting part of the concept.
I have the kernel e^{-λ d}. If there are multiple metrics d₁, d₂, …, dₖ, they can be combined:
K(x,y) = Σ_i w_i e^{-λ_i d_i(x,y)}
i.e. a metric ensemble.

13. This works because the Laplace kernel has the property of being "positive definite for many metrics". Therefore different representations of the space can be tied together by a single propagation operator.

14. The system does not depend on a single metric. One can change the embedding, the metric, or the graph, and the propagation operator remains the same.

15. If the implementation is correct, the framework should work well for: semantics, meaning search, prompt expansion, knowledge graphs, information propagation, ranking, state-space exploration, search heuristics, memory models, retrieval, associations.

16. Mathematically my system is a universal propagation operator on a graph with unknown metric, i.e.
f_{t+1} = e^{-λ d} f_t

17. It connects three classes of models: spin glasses, diffusion, kernel methods. All have the same structure:
propagator = exp(-λ · distance)

-----------------------
Algorithm: Hierarchical Metric Index
Input: embeddings X
1 compute projection coordinates
2 discretize using ε
3 construct addresses
4 sort addresses
5 build prefix index
Query:
compute address
search prefix window
evaluate metric kernel
-----------------------

hints:

Issue of locality (because this is practical). After sorting the addresses you only check the range i − W … i + W. This reduces the cost.

Adaptive ε – insight. Instead of a global ε use an adaptive one: εᵢ = σᵢ, where σᵢ = the variance of the i-th dimension of the embedding. Effect: rare dimensions do not dominate, and the addresses are more stable.

Random rotations of embeddings – a practical hack. Embeddings are often anisotropic (non-uniformly distributed in space).
A random rotation
x′ = R x where Rᵀ R = I
(an orthogonal matrix preserves norms and Euclidean distances but changes directions) has the effects:
reduces bias along selected axes
improves the distribution of addresses
stabilizes the index
This is a known trick from LSH.

======== END ========