CoV2D BrowserTM

Parikh vectors
2FGI_1 5GVX_1 7XDM_1 Letter Amino acid
24 17 31 E Glutamic acid
18 23 31 P Proline
17 38 27 R Arginine
20 37 33 D Aspartic acid
13 12 43 I Isoleucine
13 7 19 M Methionine
11 24 41 T Threonine
5 4 5 W Tryptophan
13 3 5 Y Tyrosine
20 56 79 A Alanine
5 3 19 C Cysteine
9 16 17 Q Glutamine
8 19 26 H Histidine
7 13 11 F Phenylalanine
23 47 56 V Valine
12 2 27 N Asparagine
19 42 61 G Glycine
34 40 61 L Leucine
22 7 34 K Lysine
17 20 30 S Serine
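The Parikh vector of a sequence simply counts how often each letter occurs. A minimal sketch follows; the function name `parikh_vector` and the constant `LETTERS` are illustrative names, not taken from the CoV2D code base, and the letter order copies the row order of the table above.

```python
from collections import Counter

# The 20 amino-acid letters, in the row order of the table above.
LETTERS = "EPRDIMTWYACQHFVNGLKS"


def parikh_vector(seq: str) -> dict[str, int]:
    """Return the number of occurrences of each amino-acid letter in seq."""
    counts = Counter(seq)
    return {letter: counts.get(letter, 0) for letter in LETTERS}


if __name__ == "__main__":
    # First 30 residues of the 2FGI_1 sequence listed on this page.
    prefix = "MVAGVSEYELPEDPRWELPRDRLVLGKPLG"
    print(parikh_vector(prefix))
```

The entries of a Parikh vector sum to the sequence length, which gives a quick sanity check: the 2FGI_1 column of the table sums to 310.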

>2FGI_1|Chains A, B|PROTEIN (FIBROBLAST GROWTH FACTOR (FGF) RECEPTOR 1)|Homo sapiens (9606)
>5GVX_1|Chain A|Trehalose-phosphate phosphatase|Mycobacterium tuberculosis (83332)
>7XDM_1|Chain A|Carbon monoxide dehydrogenase 2|Carboxydothermus hydrogenoformans (129958)
Protein code \(c\) LZ-complexity \(\mathrm{LZ}(w)\) Length \(n=|w|\) \(\frac{\mathrm{LZ}(w)}{n /\log_{20} n}\) \(p_w(1)\) \(p_w(2)\) \(p_w(3)\) Sequence \(w=f(c)\)
2FGI , Knot 137 310 0.84 40 199 302
MVAGVSEYELPEDPRWELPRDRLVLGKPLGEGAFGQVVLAEAIGLDKDKPNRVTKVAVKMLKSDATEKDLSDLISEMEMMKMIGKHKNIINLLGACTQDGPLYVIVEYASKGNLREYLQARRPPGLEYSYNPSHNPEEQLSSKDLVSCAYQVARGMEYLASKKCIHRDLAARNVLVTEDNVMKIADFGLARDIHHIDYYKKTTNGRLPVKWMAPEALFDRIYTHQSDVWSFGVLLWEIFTLGGSPYPGVPVEELFKLLKEGHRMDKPSNCTNELYMMMRDCWHAVPSQRPTFKQLVEDLDRIVALTSNQE
5GVX , Knot 170 430 0.80 40 185 386
MGSSHHHHHHSSGLVPRGSHMASMTGGQQMGRGSEFELVVRKLGPVTIDPRRHDAVLFDTTLDATQEMVRQLQEVGVGTGVFGSGLDVPIVAAGRLAVRPGRCVVVSAHSAGVTAARESGFALIIGVDRTGCRDALRRDGADTVVTDLSEVSVRTGDRRMSQLPDALQALGMADGLVARQPAVFFDFDGTLSDIVEDPDAAWLAPGALEALQKLAARCPIAVLSGRDLADVTQRVGLPGIWYAGSHGFELTAPDGTHHQNDAAAAAIPVLKQAAAELRQQLGPFPGVVVEHKRFGVAVHYRNAARDRVGKVAAAVRTAEQRHALRVTTGREVIELRPDVDWDKGKTLLWVLDHLPHSGSAPLVPIYLGDDITDEDAFDVVGPHGVPIVVRHTDDGDRATAALFALDSPARVAEFTDRLARQLREAPLRAT
7XDM , Knot 248 656 0.81 40 261 583
MGSSHHHHHHSSGLVPRGSHMARQNLKSTDRAVQQMLDKAKREGIQTVWDRYEAMKPQCGFGETGLCCRHCLQGPCRINPFGDEPKVGICGATAEVIVARGLDRSIAAGAAGHSGHAKHLAHTLKKAVQGKAASYMIKDRTKLHSIAKRLGIPTEGQKDEDIALEVAKAALADFHEKDTPVLWVTTVLPPSRVKVLSAHGLIPAGIDHEIAEIMHRTSMGCDADAQNLLLGGLRCSLADLAGCYMGTDLADILFGTPAPVVTESNLGVLKADAVNVAVHGHNPVLSDIIVSVSKEMENEARAAGATGINVVGICCTGNEVLMRHGIPACTHSVSQEMAMITGALDAMILDYQCIQPSVATIAECTGTTVITTMEMSKITGATHVNFAEEAAVENAKQILRLAIDTFKRRKGKPVEIPNIKTKVVAGFSTEAIINALSKLNANDPLKPLIDNVVNGNIRGVCLFAGCNNVKVPQDQNFTTIARKLLKQNVLVVATGCGAGALMRHGFMDPANVDELCGDGLKAVLTAIGEANGLGGPLPPVLHMGSCVDNSRAVALVAALANRLGVDLDRLPVVASAAEWMHEKAVAIGTWAVTIGLPTHIGVLPPITGSLPVTQILTSSVKDITGGYFIVELDPETAADKLLAAINERRAGLGLPW
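The page does not say which Lempel-Ziv variant produces the LZ-complexity column. The sketch below assumes the Lempel-Ziv (1976) exhaustive-history parsing and has not been checked against the values 137, 170 and 248 above. The names `lz76` and `normalised` are illustrative. The normalisation, by contrast, follows the column header directly.

```python
import math


def lz76(w: str) -> int:
    """Number of components in the Lempel-Ziv (1976) parsing of w (assumed variant)."""
    i, count, n = 0, 0, len(w)
    while i < n:
        length = 1
        # Grow the candidate while it already occurs starting before position i
        # (the earlier occurrence may overlap the candidate itself).
        while i + length <= n and w[i:i + length] in w[:i + length - 1]:
            length += 1
        count += 1
        i += length
    return count


def normalised(lz: int, n: int, alphabet: int = 20) -> float:
    """LZ(w) / (n / log_alphabet(n)), as in the table header."""
    return lz / (n / math.log(n, alphabet))


if __name__ == "__main__":
    print(lz76("0001101001000101"))      # parsing 0|001|10|100|1000|101
    print(round(normalised(137, 310), 3))
```

With the tabulated pairs (137, 310), (170, 430) and (248, 656) the ratio evaluates to about 0.846, 0.800 and 0.819, so the displayed 0.84, 0.80 and 0.81 appear to be truncated rather than rounded.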

Let \(P_w(n)\) be the set of distinct subwords (intervals) of length \(n\) in a word \(w\). Let \(p_w(n)\) be the cardinality of \(P_w(n)\). Let \(f(c)\) be the sequence in FASTA with 4-symbol Protein Data Bank code \(c\).
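As an illustration of these definitions, the set of subwords of a given length and its cardinality can be computed with a sliding window. This is a sketch only; `subwords` and `p` are hypothetical helper names, and the length argument is called `k` to avoid a clash with \(n=|w|\).

```python
def subwords(w: str, k: int) -> set[str]:
    """Set of distinct length-k subwords (contiguous intervals) of w."""
    return {w[i:i + k] for i in range(len(w) - k + 1)}


def p(w: str, k: int) -> int:
    """Subword complexity: the number of distinct length-k subwords of w."""
    return len(subwords(w, k))


if __name__ == "__main__":
    print(sorted(subwords("abcab", 2)))  # ['ab', 'bc', 'ca']
    print(p("abcab", 2))                 # 3
```

A word of length \(|w|\) has at most \(|w|-k+1\) distinct subwords of length \(k\), so for example \(p_w(3)\le 308\) for the 310-residue sequence.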

\(|P_{f(2FGI_1)}(2) \setminus P_{f(5GVX_1)}(2)|=91\), \(|P_{f(5GVX_1)}(2) \setminus P_{f(2FGI_1)}(2)|=77\). Let \( Z_k(x,y)=|P_x(k)\setminus P_y(k)|+|P_y(k)\setminus P_x(k)| \) be an LZ76-style (set of subwords) Jaccard distance numerator for \(P(k)\).

Hydrophobic-polar version of Sequence 1:
1111100001100101011000111101110111101111011110000100100111011000100001001100101101110000110111100001110111001001010001010011110000010001000100001100100110110011000010001110011100001101101111001001000000001011101111011100100000011011111101101110101111100110110010010010000001011100010111000101001100100111100000
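The page does not state which residues count as hydrophobic. Comparing the first 100 residues of 2FGI_1 with the binary string above suggests that A, F, G, I, L, M, P, V and W map to 1 and every other letter to 0. The sketch below uses that inferred set as an assumption; `HYDROPHOBIC` and `hp_encode` are illustrative names.

```python
# Assumption: hydrophobic set inferred from the page's own output, not documented.
HYDROPHOBIC = frozenset("AFGILMPVW")


def hp_encode(seq: str, hydrophobic: frozenset = HYDROPHOBIC) -> str:
    """Map each residue to '1' (hydrophobic) or '0' (polar)."""
    return "".join("1" if aa in hydrophobic else "0" for aa in seq)


if __name__ == "__main__":
    # First 30 residues of 2FGI_1.
    print(hp_encode("MVAGVSEYELPEDPRWELPRDRLVLGKPLG"))
```

The set is passed as a parameter so that a different hydrophobicity scale can be substituted if the inferred one turns out to be incomplete.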
Pair \(Z_2\) Length of longest common subword
2FGI_1,5GVX_1 168 4
2FGI_1,7XDM_1 172 4
5GVX_1,7XDM_1 166 22
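Both columns of this table can be sketched in a few lines. \(Z_k\) is the size of the symmetric difference of the two subword sets, so \(Z_2=91+77=168\) for the first pair. The values in the last column are consistent with contiguous matches: 5GVX_1 and 7XDM_1 share the 22-residue prefix MGSSHHHHHHSSGLVPRGSHMA. The helper names below are illustrative, and the dynamic programme is a standard technique, not necessarily the one the site uses.

```python
def subwords(w: str, k: int) -> set[str]:
    """Set of distinct length-k subwords of w."""
    return {w[i:i + k] for i in range(len(w) - k + 1)}


def z(x: str, y: str, k: int) -> int:
    """Z_k(x, y) = |P_x(k) \\ P_y(k)| + |P_y(k) \\ P_x(k)|."""
    px, py = subwords(x, k), subwords(y, k)
    return len(px - py) + len(py - px)


def longest_common_subword(x: str, y: str) -> int:
    """Length of the longest contiguous block occurring in both x and y."""
    best = 0
    prev = [0] * (len(y) + 1)
    for a in x:
        cur = [0] * (len(y) + 1)
        for j, b in enumerate(y, start=1):
            if a == b:
                # Extend the common block ending just before both positions.
                cur[j] = prev[j - 1] + 1
                best = max(best, cur[j])
        prev = cur
    return best


if __name__ == "__main__":
    # First 30 residues of 5GVX_1 and 7XDM_1.
    a = "MGSSHHHHHHSSGLVPRGSHMASMTGGQQM"
    b = "MGSSHHHHHHSSGLVPRGSHMARQNLKSTD"
    print(longest_common_subword(a, b))
    print(z("abab", "abcab", 2))
```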

Newick tree

 
(
	2FGI_1:85.66,
	(
		5GVX_1:83,7XDM_1:83
	):2.66
);

Let \(d\) denote the Otu–Sayood distance of the same name.
Let \(d_1\) denote the Otu–Sayood distance \(d_1\). (This makes the 4TYN sequence AAAAAA a close match...)
A rough expected distance is \((0.85)(0.8)(\frac{740}{\log_{20} 740}-\frac{310}{\log_{20}310})\approx 118\) (note that \(740=310+430\)).
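The arithmetic in this estimate can be reproduced directly. In the sketch below the factors 0.85 and 0.8 are copied verbatim from the text, 740 is taken to be the combined length 310 + 430 of the two query sequences, and the function names are illustrative.

```python
import math


def lz_scale(n: int, alphabet: int = 20) -> float:
    """The normalising quantity n / log_alphabet(n)."""
    return n / math.log(n, alphabet)


def expected_distance(n1: int, n2: int, f1: float = 0.85, f2: float = 0.8) -> float:
    """f1 * f2 * (scale of the concatenation minus scale of the first word)."""
    return f1 * f2 * (lz_scale(n1 + n2) - lz_scale(n1))


if __name__ == "__main__":
    print(round(expected_distance(310, 430)))
```

Here \(740/\log_{20}740\approx 335.5\) and \(310/\log_{20}310\approx 161.9\), so the bracket is about 173.7 and the product is about 118.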
Status Protein1 Protein2 d d1/2
Query variables 2FGI_1 5GVX_1 141 126

In notation analogous to [Theorem 16, Kjos-Hanssen, Niraula and Yoon (2022)],
\[ \delta= \alpha \mathrm{min} + (1-\alpha) \mathrm{max}= \begin{cases} d &\alpha=0,\\ d_1/2 &\alpha=1/2 \end{cases} \]
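The interpolation can be sketched as follows, assuming (as the cases \(\alpha=0\) and \(\alpha=1/2\) suggest, and in line with Otu and Sayood's definitions) that min and max range over the two complexity increments \(c(xy)-c(x)\) and \(c(yx)-c(y)\), so that \(d\) is their maximum and \(d_1\) their sum. The complexity function `lz76` is the assumed Lempel-Ziv (1976) parsing, and all names are illustrative.

```python
def lz76(w: str) -> int:
    """Number of components in the Lempel-Ziv (1976) parsing of w (assumed variant)."""
    i, count, n = 0, 0, len(w)
    while i < n:
        length = 1
        while i + length <= n and w[i:i + length] in w[:i + length - 1]:
            length += 1
        count += 1
        i += length
    return count


def increments(x: str, y: str, c=lz76) -> tuple[int, int]:
    """The two one-sided increments c(xy) - c(x) and c(yx) - c(y)."""
    return c(x + y) - c(x), c(y + x) - c(y)


def interpolate(a: float, b: float, alpha: float) -> float:
    """delta = alpha * min + (1 - alpha) * max."""
    return alpha * min(a, b) + (1 - alpha) * max(a, b)


def delta(x: str, y: str, alpha: float, c=lz76) -> float:
    """alpha = 0 gives d; alpha = 1/2 gives d1 / 2."""
    return interpolate(*increments(x, y, c), alpha)


if __name__ == "__main__":
    # The tabulated d = 141 and d1/2 = 126 correspond to increments 111 and 141.
    print(interpolate(111, 141, 0), interpolate(111, 141, 0.5))
```

Under this reading the tabulated pair \(d=141\), \(d_1/2=126\) implies that the smaller increment is \(2\cdot 126-141=111\).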