CoV2D Browser™

Parikh vectors
6CWJ_1 3OGR_1 6YII_1 Letter Amino acid
5 48 14 Y Tyrosine
14 47 20 N Asparagine
29 51 42 D Aspartic acid
3 2 7 C Cysteine
30 104 32 G Glycine
8 51 11 F Phenylalanine
12 62 12 P Proline
1 18 2 W Tryptophan
6 28 18 Q Glutamine
21 41 37 E Glutamic acid
22 42 41 R Arginine
7 25 13 H Histidine
18 42 28 K Lysine
16 61 30 T Threonine
47 80 62 A Alanine
25 56 33 I Isoleucine
21 90 37 L Leucine
11 6 7 M Methionine
26 83 35 S Serine
41 66 33 V Valine
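
A Parikh vector simply counts how often each letter occurs in a word. The following is a minimal sketch of how such a vector could be computed, not the code behind this page; the names `AMINO_ACIDS` and `parikh_vector` are hypothetical, and the letter order is taken from the rows of the table above.

```python
from collections import Counter

# One-letter amino acid codes, in the row order of the table above.
AMINO_ACIDS = "YNDCGFPWQERHKTAILMSV"


def parikh_vector(sequence, alphabet=AMINO_ACIDS):
    """Return the number of occurrences of each alphabet letter in `sequence`."""
    counts = Counter(sequence)
    # Counter returns 0 for letters that never occur.
    return [counts[letter] for letter in alphabet]


# First ten residues of 6CWJ_1, copied from the sequence listed on this page.
prefix = "HMQKVEVFRI"
for letter, count in zip(AMINO_ACIDS, parikh_vector(prefix)):
    if count:
        print(letter, count)
```

Applied to a full sequence instead of the ten-residue prefix, the same function would yield one column of the table.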

>6CWJ_1|Chains A, B, C, D|Cyanuric acid amidohydrolase|Moorella thermoacetica (strain ATCC 39073 / JCM 9320) (264732)
>3OGR_1|Chain A|Beta-galactosidase|Trichoderma reesei (51453)
>6YII_1|Chain A|Transcriptional regulator|Pseudomonas aeruginosa (287)
Protein code \(c\) LZ-complexity \(\mathrm{LZ}(w)\) Length \(n=|w|\) \(\frac{\mathrm{LZ}(w)}{n /\log_{20} n}\) \(p_w(1)\) \(p_w(2)\) \(p_w(3)\) Sequence \(w=f(c)\)
6CWJ , Knot 158 363 0.85 40 199 343
HMQKVEVFRIPTASPDDISGLATLIDSGKINPAEIVAILGKTEGNGCVNDFTRGFATQSLAMYLAEKLGISREEVVKKVAFIMSGGTEGVMTPHITVFVRKDVAAPAAPGKRLAVGVAFTRDFLPEELGRMEQVNEVARAVKEAMKDAQIDDPRDVHFVQIKCPLLTAERIEDAKRRGKDVVVNDTYKSMAYSRGASALGVALALGEISADKISNEAICHDWNLYSSVASTSAGVELLNDEIIVVGNSTNSASDLVIGHSVMKDAIDADAVRAALKDAGIRSDDEMDRIVNVLAKAEAASSGTVRGRRNTMLDDSDINHTRSARAVVNAVIASVVGDPMVYVSGGAEHQGPDGGGPIAVIARV
3OGR , Knot 365 1003 0.83 40 304 898
TSIGLHGGRPRDIILDDAKGPLQNIVTWDEHSLFVHGERVVIFSGEVHPFRLPVPSLYLDVFHKIKALGFNTVSFYVDWALLEGKPGRFRADGIFSLEPFFEAATKAGIYLLARPGPYINAEVSGGGFPGWLQRVKGKLRTDAPDYLHATDNYVAHIASIIAKAQITNGGPVILYQPENEYSGAAEGVLFPNKPYMQYVIDQARNAGIIVPLINNDAFPGGTGAPGTGLGSVDIYGHDGYPLGFDCAHPSAWPDNGLPTTWRQDHLNISPSTPFSLVEFQGGAFDPFGGWGFEQCSALVNHEFERVFYKNNMAAGVTIFNIYMTFGGTNWGNLGHPGGYTSYDYGASIREDRRIDREKYSELKLQGQFLKVSPGYITATPENATQGVYSDSQNIVITPLLAKESGDFFVVRHANYSSTDTASYTVKLPTSAGDLTIPQLGGSLTLTGRDSKIHVTDYPVGKFTLLYSTAEIFTWNEFAEKTVLVLYGGAQELHEFAVKNPFGSSKTAKAKKIEGSNVTIHTTSNLTVVLQWTASSARQVVQLGSLVIYMVDRNSAYNYWVPTLPGSGKQSAYGSSLMNPDSVIINGGYLIRSVAIKGNALSVQADFNVTTPLEIIGIPKGISKLAVNGKELGYSVSELGDWIAHPAIEIPHVQVPELTKLKWYKVDSLPEIRSNYDDSRWPLANLRTSNNTYAPLKTPVSLYGSDYGFHAGTLLFRGRFTARTARQQLFLSTQGGSAFASSVWLNDRFIGSFTGFDAASAANSSYTLDRLVRGRRYILTVVVDSTGLDENWTTGDDSMKAPRGILDYALTSSSGANVSISWKLTGNLGGEDYRDVFRGPLNEGGLFFERQGFHLPSPPLSDFTHGPSSSSSSSSPLDGIAHAGIAFYAAKLPLHLPAQEYDIPLSFVFDNATAAAPYRALLYVNGFQYGKYVSNIGPQTEFPVPEGILDYNGDNWIGVALWALESRGAKVPGLALKSKSPILTGRERVEVVKGPHFKKRHGAY
6YII , Knot 207 514 0.83 40 234 469
MSKSWNHDRAAKHIDQKIADVEEITIKDYVRDMSLESIPTSTAYRVDGVHMYADIMNLEDMLNITAVEGTECHKRTLRFLDQHYRAVKRILNKVDARRVDFHSQRLHSLFTKPYNTESGAETKRVQRAVATAQLIIDVLAETGDDDEQIPAAKVRIGIDTGLALAVNNGRSGYREPLFLGDPANHAAKLASNNKARGIYLTNNARKAIGLAESDEPEKSALTAIEIKACQDAAKLDVTSDEIVEEWREDLKKNPIGGYQFSRQTPPLRDMDIYSLTPANSKRQEMVSLYADIDGFTAYVADHINEKTDDVVRTLHVIRSELERVVTSDFEGRRVRFIGDCVQALSCDGTAHTTDEEKSVSEATRLAGALRSSFNLAIERLNAEGIETGDLGLAIGFDLGPIAVTRLGAKGNRVRCAIGRSVIESEKRQCACSGVETAIGQVAYDAASKAVQNLFGKSRKTSHLDYNEATEALADDGDASAKQARSEAYAGSAAIIRADERQVQPHSRQKVDGSR
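
The LZ-complexity and normalized-ratio columns can be illustrated with a short sketch. The exact Lempel-Ziv variant used by the browser is not stated here, so the code below assumes the classical LZ76 exhaustive-history parsing (each new phrase is the shortest block that has not already occurred starting at an earlier position); `lz76_complexity` and `normalized_lz` are hypothetical names.

```python
import math


def lz76_complexity(w):
    """Number of phrases in the LZ76 exhaustive-history parsing of w (assumed variant)."""
    i, phrases, n = 0, 0, len(w)
    while i < n:
        length = 1
        # Grow the phrase while it can still be copied from a start position < i.
        # The search window w[:i+length-1] allows overlapping copies.
        while i + length <= n and w[i:i + length] in w[:i + length - 1]:
            length += 1
        phrases += 1
        i += length
    return phrases


def normalized_lz(lz, n, alphabet_size=20):
    """LZ(w) divided by n / log_q(n), with q = 20 amino acid letters."""
    return lz / (n / math.log(n, alphabet_size))


# Classical binary example, parsed as 0|001|10|100|1000|101.
print(lz76_complexity("0001101001000101"))
# Ratio for the 6CWJ row, using the LZ(w) and n values from the table.
print(normalized_lz(158, 363))
```

With the table values for 6CWJ the ratio comes out just above 0.85, which suggests the table truncates rather than rounds.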

Let \(P_w(n)\) be the set of distinct subwords (intervals) of length \(n\) in a word \(w\). Let \(p_w(n)\) be the cardinality of \(P_w(n)\). Let \(f(c)\) be the sequence in FASTA format with 4-symbol Protein Data Bank code \(c\).
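
As an illustration of these definitions, here is a minimal sketch that reads \(P_w(n)\) as the set of contiguous length-\(n\) windows of \(w\). The function names are hypothetical and this is not the browser's own code.

```python
def subwords(w, n):
    """P_w(n): the set of distinct contiguous subwords of length n in w."""
    return {w[i:i + n] for i in range(len(w) - n + 1)}


def subword_complexity(w, n):
    """p_w(n): the number of distinct subwords of length n in w."""
    return len(subwords(w, n))


# First ten residues of 6CWJ_1, copied from the sequence listed on this page.
prefix = "HMQKVEVFRI"
print(sorted(subwords(prefix, 2)))
print(subword_complexity(prefix, 1), subword_complexity(prefix, 2))
```

If \(n\) exceeds the length of \(w\), the range is empty and the function returns the empty set, so \(p_w(n)=0\).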

\(|P_{f(6CWJ_1)}(2) \setminus P_{f(3OGR_1)}(2)|=26\), \(|P_{f(3OGR_1)}(2) \setminus P_{f(6CWJ_1)}(2)|=131\).

Let \( Z_k(x,y)=|P_x(k)\setminus P_y(k)|+|P_y(k)\setminus P_x(k)| \) be an LZ76-style (set of subwords) Jaccard distance numerator for \(P(k)\), that is, the size of the symmetric difference of \(P_x(k)\) and \(P_y(k)\).

Hydrophobic-polar version of Sequence 1:
010010110110101001011101100101011011111100010101001001110001110110011100001100111110110011101010111000111111110011111110001110011010010011011001100101001001011010011101001001000100111000000110001101111111110101001000110001010001100011101100011111000001001111001100110101101110011100000100110111010110010101000011000010000010111011110111011101011100011011111111101

Pair \(Z_2\) Length of longest common substring (contiguous subword)
6CWJ_1,3OGR_1 157 5
6CWJ_1,6YII_1 135 5
3OGR_1,6YII_1 126 4
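
The two columns of this table can be sketched as follows. The first function implements \(Z_k\) as the size of the symmetric difference of the two subword sets. The second reads the last column as the length of the longest common contiguous subword, which is an assumption: values of 4 or 5 seem too small for non-contiguous subsequences of sequences this long. All names are hypothetical.

```python
def subwords(w, k):
    """Set of distinct contiguous subwords of length k in w."""
    return {w[i:i + k] for i in range(len(w) - k + 1)}


def z_distance(x, y, k):
    """Z_k(x, y) = |P_x(k) \\ P_y(k)| + |P_y(k) \\ P_x(k)|."""
    px, py = subwords(x, k), subwords(y, k)
    return len(px - py) + len(py - px)


def longest_common_substring_length(x, y):
    """Length of the longest contiguous block shared by x and y (row-by-row DP)."""
    best = 0
    prev = [0] * (len(y) + 1)
    for a in x:
        cur = [0] * (len(y) + 1)
        for j, b in enumerate(y, start=1):
            if a == b:
                # Extend the common block ending at the previous positions.
                cur[j] = prev[j - 1] + 1
                best = max(best, cur[j])
        prev = cur
    return best


# Toy inputs; the real table uses the full FASTA sequences.
print(z_distance("ACDE", "CDEF", 2))
print(longest_common_substring_length("ACDEF", "XCDEY"))
```

For the first row of the table, the two one-sided differences reported above (26 and 131) add up to the listed \(Z_2\) value of 157.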

Newick tree

 
(
	6CWJ_1:76.30,
	(
		6YII_1:63,3OGR_1:63
	):13.30
);

Let \(d\) be the Otu–Sayood distance \(d\).
Let \(d_1\) be the Otu–Sayood distance \(d_1\). (This makes the 4TYN sequence AAAAAA a close match...)
Roughly speaking, an expected distance is \((0.85)(0.8)(\frac{1366 }{\log_{20} 1366}-\frac{363}{\log_{20}363})\approx 259\), where \(1366 = 363 + 1003\) is the combined length of the two sequences.
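
The arithmetic behind this estimate can be checked with a few lines. This is only a sketch of the stated formula; the helper name `lz_scale` is hypothetical, and the constants 0.85 and 0.8 are taken from the text without further interpretation.

```python
import math


def lz_scale(n, alphabet_size=20):
    """n / log_q(n), the normalizing quantity used for LZ-complexity on this page."""
    return n / math.log(n, alphabet_size)


# 1366 = 363 + 1003, the combined length of 6CWJ_1 and 3OGR_1.
expected = 0.85 * 0.8 * (lz_scale(1366) - lz_scale(363))
print(round(expected, 2))
```

The result is just under 260, so the figure of 259 quoted above appears to be truncated rather than rounded.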
Status Protein1 Protein2 d d1/2
Query variables 6CWJ_1 3OGR_1 324 218.5
Was not able to put for d
Was not able to put for d1

In notation analogous to [Theorem 16, Kjos-Hanssen, Niraula and Yoon (2022)],
\[ \delta= \alpha \mathrm{min} + (1-\alpha) \mathrm{max}= \begin{cases} d &\alpha=0,\\ d_1/2 &\alpha=1/2 \end{cases} \]
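
The interpolation can be checked against the table values. Taking the formula at face value, \(\alpha=0\) gives \(\max = d\) and \(\alpha=1/2\) gives \((\min+\max)/2 = d_1/2\), so \(\min\) can be recovered from the two reported numbers. The sketch below does this for the query pair; the names `delta`, `hi` and `lo` are hypothetical.

```python
def delta(alpha, lo, hi):
    """delta = alpha * min + (1 - alpha) * max."""
    return alpha * lo + (1 - alpha) * hi


# Values from the table row for 6CWJ_1 and 3OGR_1.
d = 324.0
half_d1 = 218.5

hi = d                  # alpha = 0 gives max = d
lo = 2 * half_d1 - hi   # alpha = 1/2 gives (min + max) / 2 = d1 / 2

print(lo, delta(0.0, lo, hi), delta(0.5, lo, hi))
```

Under this reading the smaller of the two quantities is 113 for this pair.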