Am Neumarkt
Machine learning and other gibberish on Telegram; https://t.me/amneumarkt
515
514
#visualization
Finland vs the UK; and Turkiye/Turkey is such a nice example of data quality issues.
513
#misc
Background: https://www.reddit.com/r/MapPorn/s/7SA4Ri7bZW
Someone made a visualization of “the percentage of workforce who are women by country”.
The visualization itself is terrible, especially the color scheme. However, it sparked a lot of discussion, and this one is the top comment.
511
#fun
I was digging into my google drive files and found this funny story in one of my slide decks.
Just in case, Guido is the creator of Python.
I do not have any reference for this screenshot as I didn’t include one in my slide. So I would call this unverified.
510
#research
https://clarivate.com/highly-cited-researchers/
Update: I got the full list of researchers as a csv file: https://drive.google.com/file/d/1iequOfuGECEcDWt9AhljJzxHQYdArguN/view?usp=sharing
509
#misc
OpenAI announces leadership transition https://openai.com/blog/openai-announces-leadership-transition
507
#fun
There are normal and professional contests, and then this:
The International Obfuscated C Code Contest
Alright, someone please host a Python version of it. It could go completely wild.
506
#dl
Google & USC benchmarked a prompt-based forecasting method, and the results are amazing.
Cao D, Jia F, Arik SO, Pfister T, Zheng Y, Ye W, et al. TEMPO: Prompt-based Generative Pre-trained Transformer for time series forecasting. arXiv [cs.LG]. 2023. Available: http://arxiv.org/abs/2310.04948
504
#misc
4090 ??? Why?
https://www.sec.gov/Archives/edgar/data/1045810/000104581023000217/nvda-20231017.htm
503
#misc
Thinking about the past few days, I realized I never use Stack Overflow at work anymore. Our company pays for GitHub Copilot, so I have completely switched to Copilot chats. It is a bit sad, but I think I can now live in a world without Stack Overflow.
https://www.theverge.com/2023/10/16/23919004/stack-overflow-layoff-ai-profitability
502
#llm
You have probably heard of the famous ChatDev (https://github.com/OpenBMB/ChatDev ), which works like a whole company, with all sorts of employees solving different parts of a complicated problem. If not, check out the video demo in the GitHub repo.
Now, Microsoft has joined the game. They built a framework for such multi-agent setups. The screenshot attached is a demo from the documentation. I tried it, and it is super easy to get started.
497
#ai
Plotting Progress in AI - Contextual AI https://contextual.ai/plotting-progress-in-ai/
496
#misc
When I was doing my PhD, I spent a lot of hours on Loeb’s papers. I was always hoping to see some evidence that there is already a party going on in the universe but we, as human beings, haven’t found the way to chime in. I even wrote some stories on this idea, e.g., https://blog.leima.is/stories/dream-a-new-world/
I don’t know. It is quite shameful to believe in something like this as an adult. But then again, who knows. At least I can have some fun.
491
490
#ml
Hand-Crafted Transformers
HandCrafted.ipynb - Colaboratory https://colab.research.google.com/github/newhouseb/handcrafted/blob/main/HandCrafted.ipynb
489
#ml
A family tree shows how transformers are evolving.
(HTML is probably the worst name for a model.)
488
#ml
Yes, Transformers are Effective for Time Series Forecasting (+ Autoformer) https://huggingface.co/blog/autoformer
487
486
#forecasting
Created some sections on forecasting with trees. This first draft provides first steps for applying trees to forecasting problems, as well as some useful theory about tree-based models.
485
#visualization
This is a great post. I certainly don't agree with the author that this is "the greatest statistical graphics ever created", but I can't prove otherwise either.
Also, just like coding, "design it twice (or even more times)" is a great way to produce the best charts. That is to say, we should keep making different versions and compare them with each other.
https://nightingaledvs.com/defying-chart-design-rules-for-clearer-data-insights/
https://nightingaledvs.com/defying-chart-design-rules-for-clearer-data-insights/
483
#visualization
Demographic projection for Germany
https://service.destatis.de/bevoelkerungspyramide/index.html#!y=2023&v=4&l=en&g
480
#timeseries
Finding a suitable metric to evaluate forecasting models is often the key to a forecasting project, right? We use metrics when developing models, and we also use metrics to monitor them.
There are a bunch of metrics to choose from or adapt. To make choosing and adapting metrics faster, I created a page on the properties of different metrics for time series forecasting problems. For reproducibility, I also included all the code used to write this page.
https://dl.leima.is/time-series/timeseries-metrics.forecasting/
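For a taste, here is a rough sketch of three common metrics (definitions vary slightly across sources; the linked page has the full treatment):
import numpy as np

def mape(y: np.ndarray, y_hat: np.ndarray) -> float:
    # mean absolute percentage error; undefined when y contains zeros
    return float(np.mean(np.abs((y - y_hat) / y)))

def smape(y: np.ndarray, y_hat: np.ndarray) -> float:
    # symmetric MAPE; one of several variants in the literature
    return float(np.mean(2 * np.abs(y - y_hat) / (np.abs(y) + np.abs(y_hat))))

def mase(y: np.ndarray, y_hat: np.ndarray, y_insample: np.ndarray, m: int = 1) -> float:
    # scale the error by the in-sample naive (seasonal) forecast error
    scale = np.mean(np.abs(y_insample[m:] - y_insample[:-m]))
    return float(np.mean(np.abs(y - y_hat)) / scale)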
478
#ml
Yeh, Catherine, Yida Chen, Aoyu Wu, Cynthia Chen, Fernanda Viégas, and Martin Wattenberg. 2023. "AttentionViz: A Global View of Transformer Attention." ArXiv [cs.HC]. arXiv. http://arxiv.org/abs/2305.03210.
475
#misc
Google “We Have No Moat, And Neither Does OpenAI” https://www.semianalysis.com/p/google-we-have-no-moat-and-neither
474
#ai
New Bing can deal with this kind of imposed misinformation. The references can be used to confirm the answer, which makes it more reliable than ChatGPT.
473
#misc
"The Godfather of AI" Quits Google and Warns of Danger Ahead - The New York Times https://www.nytimes.com/2023/05/01/technology/ai-google-chatbot-engineer-quits-hinton.html
472
#coding
I had some discussions with several people about writing good code during machine learning experimentation.
Whenever it comes to the part of writing formal code, opinions diverge. So, should we write good code that is easy to read with typing and tests, even in experiments?
The spirit of experimentation is fast and reliable. So naturally, the question comes down to what kind of coding style allows us to develop and run experiments, fast.
My experience with running experiments is that we will never run the code just once. Instead, we always come back to it and run it with different configurations or parameters. In this circumstance, how good shall my code be?
For typing and tests, I type most of my args but only write tests needed to develop and debug a function or class.
- Typing is important because people spend time figuring out what to put in there as an argument for a function. With typing, it is much faster.
- Here is an example for tests: if I need to know the shape of a tensor deep inside a method of a class, I spend a few seconds writing a simple test that lets me put breakpoints in the method and investigate (see the sketch after this list).
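A minimal, hypothetical illustration of both habits (the function and test names are made up):
import torch

def pool_embeddings(x: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    # mean-pool token embeddings (batch, seq, dim) over unmasked positions;
    # the type hints tell the caller what to pass without reading the body
    weights = mask.unsqueeze(-1).float()
    return (x * weights).sum(dim=1) / weights.sum(dim=1).clamp(min=1.0)

def test_pool_embeddings_shape():
    # a throwaway test: a convenient entry point for breakpoints inside the function
    x = torch.randn(4, 10, 16)
    mask = torch.ones(4, 10, dtype=torch.bool)
    assert pool_embeddings(x, mask).shape == (4, 16)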
But the above is a bit trivial. How about the design of the functions and classes? I suggest taking your time with those that are repeated in every experiment. We will hit a ceiling in development speed very quickly if we always keep the first, most naive design for these. In practice, I would say: design it twice and write it once.
One such example is data preprocessing. When dealing with the same data and problems, data transformations are usually quite similar in each experiment but differ a bit in the details. Finding the patterns and writing some slightly generic functions is helpful. There is always the risk of over-engineering, so I prefer to improve things little by little; I might generalize a function a bit in one experiment. And don't hesitate to throw away your code and rewrite. Rewriting will …
471
#misc
I was working and didn't have time to watch the live stream when they launched the Starship. After some Twitter browsing, I have to say, this thing is beautiful.
https://twitter.com/nextspaceflight/status/1649052544755470338
470
#ai
AI Frontiers: AI for health and the future of research with Peter Lee
A very cool discussion on the topic of large language models.
They mentioned the early-stage tests of Davinci from OpenAI. The model was able to reason through AP Biology questions, and much of the reasoning surprised them. When Ashley asked the person from OpenAI why Davinci reasons like that, the reply was that they don't know.
Not everyone expected that kind of reasoning from an LLM. In hindsight, "it is just a language model" is a very good question to raise. Nowadays with GPT models, it seems this question is no longer a question because it is becoming a fact. What is in the training texts, and what is language? Karpathy even made a joke about this:
The hottest new programming language is English https://twitter.com/karpathy/status/1617979122625712128?lang=en
469
#academia
Data science weekly mentioned this paper. https://arxiv.org/abs/2304.06035
Quote from the abstract:
A growing number of AI academics can no longer find the means and resources to compete at a global scale. This is a somewhat recent phenomenon, but an accelerating one, with private actors investing enormous compute resources into cutting edge AI research.
At first, I thought it was an April Fools' Day paper, but it seems serious. For example, the author mentions the strategy "Analysis Instead of Synthesis". This has already happened in many fields: global-scale, money-burning experiments in physics left many teams no choice but to take other teams' data and analyze them.
This is actually quite crazy. Thinking about how AI/ML is developing, it is almost a paradigm shift in research. I read a discussion on Reddit on a similar topic. Some people are concerned that medical research is also going to shift to the private sector because of AI, leaving many people no choice but to join the big medical corporations.
On the other hand, computing-resource requirements have also made it hard for smaller companies to compete in some fields. We need such a guide for business too.
467
#tool
I read about this on Reddit but never really looked into the details. It is actually amazing.
Just watch the video in the readme.
465
#code
To me, high cognitive load reduces my code quality. In theory, there are many tricks to reduce cognitive load, e.g., better modularity. In practice, they are not always carried out. Will ChatGPT help? Let's see.
https://www.caitlinhudon.com/posts/programming-beyond-cognitive-limitations-with-ai
464
#ts
I love the last paragraph, especially this sentence:
Unfortunately, I can't continue my debate with Clive Granger. I rather hoped he would come to accept my point of view.
Rob J Hyndman - The difference between prediction intervals and confidence intervals https://robjhyndman.com/hyndsight/intervals/
463
#data
Quite useful.
I use pyarrow a lot, and also a bit of polars, mostly because pandas is slow. With the new 2.0 release, all three libraries connect seamlessly to each other.
https://datapythonista.me/blog/pandas-20-and-the-arrow-revolution-part-i
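A small sketch of what the interop looks like, assuming pandas 2.0, polars, and pyarrow are installed (events.csv is a hypothetical file):
import pandas as pd
import polars as pl
import pyarrow as pa

# pandas 2.0 can parse with the pyarrow engine and keep Arrow-backed columns
df = pd.read_csv("events.csv", engine="pyarrow", dtype_backend="pyarrow")

# moving between the three libraries goes through Arrow tables
table = pa.Table.from_pandas(df)
df_polars = pl.from_arrow(table)
df_back = table.to_pandas(types_mapper=pd.ArrowDtype)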
462
#ai
The performance is not too bad. But… given this is about academic topics, it sounds terrible to have this level of hallucination.
459
#ai
A lot of big names signed it. (Not sure how they verify the signatories, though.)
Personally, I’m not buying it.
https://futureoflife.org/open-letter/pause-giant-ai-experiments/
458
#dl
I am experimenting with torch 2.0 and searching for potential training time improvements in lightning. The following article provides a very good introduction.
https://lightning.ai/pages/community/tutorial/how-to-speed-up-pytorch-model-training/
457
#ml
Pérez J, Barceló P, Marinkovic J. Attention is Turing-Complete. J Mach Learn Res. 2021;22: 1-35. Available: https://jmlr.org/papers/v22/20-302.html
456
#misc
This is how generative AI is changing our lives. Thinking about it now, the competitive advantages built on our technical skills are fading away.
What shall we invest in for a better career? Just integrate whatever comes along into our workflow? Or fundamentally change the way we think?
454
#misc
https://twitter.com/MushtaqBilalPhD/status/1637715463399456768?t=dCY5NP4Ddd_HtTj9PUugxQ&s=19
(Fact check needed)
451
#dl
https://github.com/Lightning-AI/lightning/releases/tag/2.0.0
You can compile a (torch 2.0) LightningModule now:
import torch
import lightning as L

model = LitModel()
# this will compile forward and {training,validation,test,predict}_step
compiled_model = torch.compile(model)
trainer = L.Trainer()
trainer.fit(compiled_model)
450
#ml
https://mlcontests.com/state-of-competitive-machine-learning-2022/
Quote from the report:
Successful competitors have mostly converged on a common set of tools: Python, PyData, PyTorch, and gradient-boosted decision trees.
Deep learning still has not replaced gradient-boosted decision trees when it comes to tabular data, though it does often seem to add value when ensembled with boosting methods. Transformers continue to dominate in NLP, and start to compete with convolutional neural nets in computer vision.
Competitions cover a broad range of research areas including computer vision, NLP, tabular data, robotics, time-series analysis, and many others. Large ensembles remain common among winners, though single-model solutions do win too.
There are several active machine learning competition platforms, as well as dozens of purpose-built websites for individual competitions. Competitive machine learning continues to grow in popularity, including in academia.
Around 50% of winners are solo winners; 50% of winners are first-time winners; 30% have won more than once before.
Some competitors are able to invest significantly into hardware used to train their solutions, though others who use free hardware like Google Colab are also still able to win competitions.
447
438
#data
In physics, people claim that more is different. In the data world, more is very different. I'm no expert in big data; I only learned about the scaling problem when I started working for corporations.
I like the following from the author.
data sizes increase much faster than compute sizes.
In deep learning, many models are following a scaling law of performance and dataset size. Indeed, more data brings in better performance. But the increase in performance becomes really slow. Business doesn’t need a perfect model. We also know computation costs money. At some point, we simply have to cut the dataset, even if we have all the data in the world.
So …, data hoarding is probably fine, but our models might not need that much.
437
#fun
The authors have got some style.
Source: https://twitter.com/mraginsky/status/1181712367966674945
434
#ml
google-research/tuning_playbook: A playbook for systematically maximizing the performance of deep learning models. https://github.com/google-research/tuning_playbook
433
#ml
Haha icecube
IceCube - Neutrinos in Deep Ice | Kaggle https://www.kaggle.com/competitions/icecube-neutrinos-in-deep-ice?utm_medium=email&utm_source=gamma&utm_campaign=comp-icecube-2023
432
#data
Just got my ticket.
I have been reviewing proposals for PyData this year. I saw some really cool proposals so I finally decided to attend the conference.
430
#ml
Top-10 Things in 2022 | Anima on AI https://anima-ai.org/2022/12/31/top-10-things-in-2022/
428
427
#ml
GPT writing papers… Both fancy and scary.
https://huggingface.co/stanford-crfm/pubmedgpt?text=Neuroplasticity
426
#data
I like the idea. My last dashboarding tool at work was streamlit. Streamlit is lightweight and fast, but it requires Python code and a Python server.
Evidence is mostly markdown and SQL. For many lightweight dashboarding tasks, this is just sweet.
Evidence is built on Node. I could run a server and provide live updates, but I can also build a static website by running npm run build.
Played with it a bit. Nothing to complain about at this point.
425
#visualization
Visualizations of energy consumption and prices in Germany. Given the low temperatures at the moment, it may be interesting to watch them evolve.
https://www.zeit.de/wirtschaft/energiemonitor-deutschland-gaspreis-spritpreis-energieversorgung
424
#fun
Denmark…
I thought French was complicated; now we all know Danish leads the race.
https://www.reddit.com/r/europe/comments/zo258s/how_to_say_number_92_in_european_countries/
422
#ml
In his MinT paper, Hyndman said he confused these two quantities in his previous paper.
MinT is a simple method to make forecasts with hierarchical structure coherent. Here coherent means the sum of the lower level forecasts equals the higher level forecasts.
For example, our time series has a structure like: sales of coca cola + sales of spirit = sales of beverages. If this relation holds for our forecasts, we have coherent forecasts.
This may sound trivial, but the problem is in fact hard. There are some trivial approaches, such as forecasting only the lower levels (coca cola, spirit) and then using the sum as the higher level (sales of beverages), but these are usually too naive to be effective.
MinT is a reconciliation method that combines the higher-level and lower-level forecasts to find an optimal combination/reconciliation.
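A minimal numpy sketch of the reconciliation step with toy numbers (using the identity as the error covariance, which reduces MinT to OLS reconciliation):
import numpy as np

# hierarchy: beverages = coca cola + spirit
# S maps the bottom-level series to every level
S = np.array([
    [1.0, 1.0],  # beverages
    [1.0, 0.0],  # coca cola
    [0.0, 1.0],  # spirit
])

y_hat = np.array([105.0, 60.0, 40.0])  # incoherent base forecasts: 60 + 40 != 105

W_inv = np.eye(3)  # MinT would use the inverse forecast-error covariance here
G = np.linalg.inv(S.T @ W_inv @ S) @ S.T @ W_inv
y_tilde = S @ G @ y_hat  # coherent: y_tilde[0] == y_tilde[1] + y_tilde[2]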
420
#ml
You spend 10k euros on GPUs, then realize the statistical baseline model is better.
https://github.com/Nixtla/statsforecast/tree/main/experiments/m3
414
#visualization
What is going on with Trolli? What happened?
https://www.visualcapitalist.com/gen-z-favorite-brands-compared-with-older-generations/
413
#ml #forecasting
Liu, Yong, Haixu Wu, Jianmin Wang, and Mingsheng Long. 2022. "Non-Stationary Transformers: Exploring the Stationarity in Time Series Forecasting." ArXiv [cs.LG], May. https://doi.org/10.48550/ARXIV.2205.14415.
412
#ml
https://arxiv.org/abs/2210.10101v1
Bernstein, Jeremy. 2022. "Optimisation & Generalisation in Networks of Neurons." ArXiv [cs.NE], October. https://doi.org/10.48550/ARXIV.2210.10101.
411
#ml
Fancy
Video: https://fb.watch/giO0tV4N4T Press: https://research.facebook.com/publications/dressing-avatars-deep-photorealistic-appearance-for-physically-simulated-clothing/
Xiang, Donglai, Timur Bagautdinov, Tuur Stuyck, Fabian Prada, Javier Romero, Weipeng Xu, Shunsuke Saito, et al. 2022. "Dressing Avatars: Deep Photorealistic Appearance for Physically Simulated Clothing." ArXiv [cs.GR], June. https://arxiv.org/abs/2206.15470
410
#ml
https://developer.nvidia.com/blog/how-optimize-data-transfers-cuda-cc/
I find this post very useful. I had always wondered what happens after my dataloader prepares everything for the GPU. I didn't know that CUDA has to copy the data again to create page-locked memory.
I used to set pin_memory=True in a PyTorch DataLoader and benchmark it. To be honest, I only observed very small improvements in most of my experiments, so I stopped caring about pin_memory.
After some digging, I also realized that the performance gain from setting pin_memory=True in a DataLoader is tricky. If we use neither multiprocessing nor reuse of the page-locked memory, it is hard to expect any performance gain.
(some other notes: https://datumorphism.leima.is/cards/machine-learning/practice/cuda-memory/)
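For reference, a minimal sketch of the setting where pinning can pay off (multiple workers plus non-blocking copies):
import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.randn(10_000, 128))

# pin_memory asks the workers to place batches in page-locked host memory,
# so the host-to-device copy below can run asynchronously
loader = DataLoader(dataset, batch_size=256, num_workers=4, pin_memory=True)

device = torch.device("cuda")
for (batch,) in loader:
    batch = batch.to(device, non_blocking=True)  # non_blocking only helps with pinned sources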
409
#ml
Amazon has been updating their Machine Learning University website, and it is getting more and more interesting. They recently added an article about linear regression, and its section on interpreting linear models is just fun.
https://mlu-explain.github.io/
( Time machine: https://t.me/amneumarkt/293 )
408
#showerthoughts
I’ve never thought about dark mode in LaTeX. It sounds weird at first, but now thinking about this, it’s actually a great style.
This is a dark style from Dracula. https://draculatheme.com/latex
407
#ML
This is interesting.
Toy Models of Superposition. [cited 15 Sep 2022]. Available: https://transformer-circuits.pub/2022/toy_model/index.html#learning
406
#python
Faster conda
https://www.anaconda.com/blog/a-faster-conda-for-a-growing-community
405
#fun
Germany is so small. My GitHub profile ranks 102nd in Germany by public contributions.
403
#visualization
Tracking of an eagle over a 20-year period. Source: https://twitter.com/Loca1ion/status/1566346534651924480?s=20&t=AKXn9U-L3fyhrJzeAXySlA
395
#fun
Some results from the Stable Diffusion model. See the comments for some examples.
394
#visualization
Hmm not so many contributions from wild animals.
Source: https://www.weforum.org/agenda/2021/08/total-biomass-weight-species-earth
Data from this paper: https://www.pnas.org/doi/10.1073/pnas.1711842115#T1
393
#ml
https://ai.googleblog.com/2022/08/optformer-towards-universal.html?m=1
I find this work counter-intuitive. They took some descriptions of machine learning optimization runs and trained a transformer to "guesstimate" the hyperparameters of a model. I understand that human beings develop some "feeling" for the hyperparameters after working with the data and model for a while. But it is usually hard to extrapolate such knowledge to completely new data and models. I guess our brain is doing some statistics based on our historical experiments, and we call this intuition. My "intuition" is that there is little generalizable knowledge in this problem. It would have been so great if they had investigated the saliency maps.
388
#fun
I became a beta tester of DALLE. Played with it for a while and it is quite fun. See the comments for some examples. Comment if you would like to test some prompts.
387
#fun
participants who spent more than six hours working on a tedious and mentally taxing assignment had higher levels of glutamate, an important signalling molecule in the brain. Too much glutamate can disrupt brain function, and a rest period could allow the brain to restore proper regulation of the molecule
383
#ml
Fotios Petropoulos initiated the forecasting encyclopaedia project. They published this paper recently.
Petropoulos, Fotios, Daniele Apiletti, Vassilios Assimakopoulos, Mohamed Zied Babai, Devon K. Barrow, Souhaib Ben Taieb, Christoph Bergmeir, et al. 2022. "Forecasting: Theory and Practice." International Journal of Forecasting 38 (3): 705-871.
https://www.sciencedirect.com/science/article/pii/S0169207021001758
Also available here: https://forecasting-encyclopedia.com/
The paper covers many recent advances in forecasting, including deep learning models. There are some important topics missing, but I'm sure they will cover them in future releases.
382
#career
so the job of data scientist will only continue to grow in its importance in the business landscape.
However, it will also continue to change. We expect to see continued differentiation of responsibilities and roles that all once fell under the data scientist category.
https://hbr.org/2022/07/is-data-scientist-still-the-sexiest-job-of-the-21st-century
381
#python
Guidelines for research coding. It is not the highest standard but is easy to follow.
380
#ml
https://arxiv.org/abs/2205.02302
Kreuzberger D, Kühl N, Hirschl S. Machine Learning Operations (MLOps): Overview, definition, and architecture. arXiv [cs.LG]. 2022 [cited 17 Jul 2022]. doi:10.48550/ARXIV.2205.02302
379
#ml
The recommended readings serve as a good curriculum for transformers.
376
375
#ml
I was playing with dalle-mini ( https://github.com/borisdayma/dalle-mini ).
So… in the eyes of Dalle-mini,
- science == chemistry (? I guess),
- scientists are men.
Tried several times, same conclusions.
It is so hard to fight against the bias in ML models.
Update: OpenAI is fixing this.
https://openai.com/blog/reducing-bias-and-improving-safety-in-dall-e-2/
371
#fun
[P] No, we don’t have to choose batch sizes as powers of 2: MachineLearning https://www.reddit.com/r/MachineLearning/comments/vs1wox/p_no_we_dont_have_to_choose_batch_sizes_as_powers/
370
#ml
Mitchell M, Wu S, Zaldivar A, Barnes P, Vasserman L, Hutchinson B, et al. Model cards for model reporting. Proceedings of the Conference on Fairness, Accountability, and Transparency. New York, NY, USA: ACM; 2019. doi:10.1145/3287560.3287596
369
#ml
This is also like one thousand years later…
PyMC 4.0 Release Announcement â PyMC project website https://www.pymc.io/blog/v4_announcement.html
368
#data
If you are building a simple dashboard using python, streamlit is a great tool to get started. One of the problems in the past was to create multipage apps.
To solve this problem, I created a template for multipage apps a year ago. https://github.com/emptymalei/streamlit-multipage-template
But today, streamlit officially introduced multipage support, and it looks great. I haven't built any dashboards for a while, but to me, this is still the go-to solution for a dashboard. https://blog.streamlit.io/introducing-multipage-apps/
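If I remember the announcement correctly, the mechanism is just a pages/ directory next to the entry script; each file there becomes a page in the sidebar (the file names below are made up):
my_app/
├── Home.py          # entry point: streamlit run Home.py
└── pages/
    ├── 1_Overview.py
    └── 2_Details.py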
367
#fun
Higharc is a start-up helping people design houses using generative design.
The demo looks amazing.
366
#ml
This is hilarious.
Source: https://mobile.twitter.com/arankomatsuzaki/status/1529278580189908993
364
#ml
I had heard about DeepETA before but never thought it was a transformer.
According to this blog post by Uber, they use an encoder-decoder architecture with linear attention.
This blog post also explains how they made a transformer fast.
DeepETA: How Uber Predicts Arrival Times Using Deep Learning https://eng.uber.com/deepeta-how-uber-predicts-arrival-times/
362
#github
I have been following an issue on math support for github markdown (github/markup/issues/274).
One thousand years later …
Math support in Markdown | The GitHub Blog https://github.blog/2022-05-19-math-support-in-markdown/
361
#misc
Quote from this article:
"It doesn't transmit from person to person as readily, and because it is related to the smallpox virus, there are already treatments and vaccines on hand for curbing its spread. So while scientists are concerned, because any new viral behaviour is worrying, they are not panicked."
360
#ml
Finally… We can now utilize the real power of M1 chips.
Introducing Accelerated PyTorch Training on Mac | PyTorch https://pytorch.org/blog/introducing-accelerated-pytorch-training-on-mac/
I have been following this issue: https://github.com/pytorch/pytorch/issues/47702#issuecomment-1130162835 There were even some fights.
359
356
#python
This post is a retro on how I learned Python.
Disclaimer: I cannot claim to be a master of Python. This is a retrospective of how I learned Python in different stages.
I started using Python back in 2012. Before this, I was mostly a Matlab/C user.
Python is easy to get started with, yet hard to master. People coming from other languages can easily make things work but will write some "disgusting" Python code. This is also why Python people talk about being "pythonic" all the time: instead of an actual style guide, it is rather a philosophy of style.
When we get started, we are most likely not interested in PEP 8 and PEP 257; we focus on making things work. After some lectures from the university (or whatever source), we start to get some sense of style. Then we write code and use Python in some projects, and we begin to realize that Python is strange, sometimes even nonsensical. Then we start learning about the philosophy behind it. At some point, we get peer reviews and probably fight each other over the philosophies we have accumulated through the years.
The attached drawing (in the comments) somehow captures the path I went through. It is not a monotonic path of any sort; it is most likely permutation invariant and cyclic. But the bottom line is that mastering Python requires a lot of struggle, fights, and relearning. One of the most effective methods is peer review, just as in any other learning task in our lives.
Peer review makes us think, and it is very important to find good reviewers. Don't just stay in a silo and admire our own code. To me, the whole journey helped me build one of the most important philosophies of my life: embrace open source and collaborate.
353
#fun
Could use this
How to Lie with Statistics - Wikipedia https://en.wikipedia.org/wiki/How_to_Lie_with_Statistics
352
#data
Stop squandering data: make units of measurement machine-readable https://www.nature.com/articles/d41586-022-01233-w
351
#ml
Highly recommended! If you are working on deep learning for forecasting, gluonts is a great package. It simplifies all the tedious data preprocessing, slicing, and backtesting work, so we can spend our time implementing the models themselves (and there are a lot of ready-to-use models). What's even better, we can use pytorch lightning!
See this repository for a list of transformer based forecasting models. https://github.com/kashif/pytorch-transformer-ts
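A rough sketch of the workflow, assuming a recent gluonts with the PyTorch backend (df is a hypothetical pandas DataFrame with a DatetimeIndex and a "sales" column):
from gluonts.dataset.pandas import PandasDataset
from gluonts.torch import DeepAREstimator

dataset = PandasDataset(df, target="sales")
estimator = DeepAREstimator(
    freq="D",
    prediction_length=14,
    trainer_kwargs={"max_epochs": 10},  # passed through to the lightning Trainer
)
predictor = estimator.train(dataset)
forecasts = list(predictor.predict(dataset))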
350
#ml
Came across this post this morning. I realized the reason I am not writing much Julia is simply that I don't know how to write quality code in Julia.
When we build a model in Python, we know all the details of making it quality code. For a new language, I'm just terrified by the amount of detail I need to be aware of.
Ah I’m getting older.
JAX vs Julia (vs PyTorch) · Patrick Kidger https://kidger.site/thoughts/jax-vs-julia/
349
#python
Anaconda open sourced this…
I have no idea what this is for…
348
#ml
I have heard about the information bottleneck so many times but never really went back to read the original papers.
I spent some time on it and found it quite interesting. Philosophically, it builds on what Vapnik described in The Nature of Statistical Learning, where he discussed how generalization works by enforcing parsimony. The most interesting thing in this information bottleneck paper is the quantified generalization gap and complexity gap; with these, we know where to go on the information plane.
It’s a good read.
Tishby N, Zaslavsky N. Deep Learning and the Information Bottleneck Principle. arXiv [cs.LG]. 2015. Available: http://arxiv.org/abs/1503.02406
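The core objective is compact; a sketch in LaTeX, where T is the compressed representation of the input X and beta trades compression against prediction of Y:
\min_{p(t|x)} \; I(X; T) - \beta \, I(T; Y)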
346
#work
I realized something interesting about time management.
If I open my calendar now, I see these "tiles" of meetings filling up most of my working hours. It looks bad, but it was even worse in the past. The thing is, if I spend my working hours in meetings, I have to work extra hours to do the thinking and analysis. It is rather cruel.
So what changed? I think I realized the power of Google Docs. Instead of many people talking and nobody listening, someone should write up a draft first and send it out to the colleagues. Then, once people get the link to the doc, everyone can add comments.
This doesn't seem very different from meetings. Oh, but it is very different. The workflow can be async. We are not forced to use our precious focus time to attend meetings. We can read and comment on the document whenever we like: when we are commuting, when we are taking a dump, when we are on a phone/tablet, just, any, time.
Apart from the async workflow, I also like the “think, comment and forget” idea. I feel people deliver better ideas when we think first, comment next, and forget about it unless there are replies to our comments. No pressure, no useless debates.
345
#ml #statistics
I read about conformal prediction a while ago and realized that I needed to understand more about hypothesis testing theory. As someone from the natural sciences, I mostly work within the Neyman-Pearson framework. So I explored it a bit and found two nice papers; see the list below. If you have other papers on similar topics, I would appreciate some comments.
- Perezgonzalez JD. Fisher, Neyman-Pearson or NHST? A tutorial for teaching data testing. Front Psychol. 2015;6: 223. doi:10.3389/fpsyg.2015.00223 https://www.frontiersin.org/articles/10.3389/fpsyg.2015.00223/full
- Lehmann EL. The Fisher, Neyman-Pearson Theories of Testing Hypotheses: One Theory or Two? J Am Stat Assoc. 1993;88: 1242-1249. doi:10.2307/2291263
344
343
342
341
#jobs
I vaguely feel there is a talent shortage in Germany. "Hiring is hard." I have heard this several times. Our team also needs more hires.
So the company came up with this: land a job at Zalando within 3 days after the final interviews!
https://jobs.zalando.com/en/jobs/4004181/?gh_src=%20f46af3281us
340
#fun
Every chemistry graduate will be in charge of a molecule. Someone has got to take care of "Titin" (189,819 characters), and she/he will have to recite the name first in every meeting: https://en.wiktionary.org/wiki/Appendix:Protologisms/Long_words/Titin#Noun
339
#smarthome #misc
I have, somehow, 5 different brands of smart home products in our little apartment. I have no idea what is going on in the smart home industry: every brand has its own app, hub, or even protocol. So I had to install five different apps to initialize the devices. I could, in principle, ditch these apps and use Google/Alexa only after setting them up; however, this is still extremely inconvenient, as Google/Alexa doesn't support all the fancy functions of the devices.
Any solutions to this problem?
338
#fun
Not bad.
The Big Data Game | Firebolt https://www.firebolt.io/big-data-game
337
#visualization #fun
The Dunning-Kruger effect is quite real.
Infographic: 50 Cognitive Biases in the Modern World https://www.visualcapitalist.com/50-cognitive-biases-in-the-modern-world/
336
#visualization
Plot Overview for Matplotlib Users / Observable https://observablehq.com/@observablehq/plot-overview-for-matplotlib-users
335
#ml
Interesting… There’re some discussions on the lottery ticket hypothesis.
334
#ml
A beautiful and systematic derivation showing how and why negative sampling works.
Negative sampling is a great technique to estimate the softmax, especially when the calculation of the partition function is intractable. It is used in word2vec and in many other models, such as node2vec.
Goldberg Y, Levy O. word2vec Explained: deriving Mikolov et al.'s negative-sampling word-embedding method. arXiv [cs.CL]. 2014. Available: http://arxiv.org/abs/1402.3722
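The objective they derive is compact enough to sketch in PyTorch; a hedged per-pair version of the skip-gram negative-sampling loss (the tensor shapes are my own choice):
import torch
import torch.nn.functional as F

def sgns_loss(center: torch.Tensor, context: torch.Tensor, negatives: torch.Tensor) -> torch.Tensor:
    # center, context: (dim,) embeddings; negatives: (k, dim) noise-word embeddings
    pos = F.logsigmoid(torch.dot(center, context))   # pull the observed pair together
    neg = F.logsigmoid(-(negatives @ center)).sum()  # push k sampled noise words away
    return -(pos + neg)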
333
#tool
I drafted a new release of the Hugo Connectome theme.
I like the command palette in VSCode. It is fast and accurate. So I added a command palette to the Hugo Connectome theme to help us navigate the notes and links.
Now we can use the command palette to navigate to backlinks, out links, references, and more.
See it in action: https://datumorphism.leima.is/wiki/time-series/state-space-models/ Use Command+K or Windows+K to activate the command palette.
- Type in search to search for notes.
- Type in Note ID to copy the current note id to the clipboard.
- Type in graph to see the graph view of all the notes.
- Type in references to go to references.
- Type in backlinks to select from backlinks to navigate to.
- Type in links to select from all outgoing links to navigate to.
Release: https://github.com/kausalflow/connectome/releases/tag/0.1.1
327
#ml
(WARNING: promotion of my own notes. This is a test.)
I learned something very interesting today: CRPS.
Suppose we would like to approximate the quantile function of some data points. If we assume a parametric model of the quantile function, e.g., Q(x|theta), how do we find the parameters from the given dataset? Naturally, we need a loss function that compares our quantile function to the data points. CRPS is a robust choice; I have seen it used in several papers on time series forecasting.
You can find more details here: https://datumorphism.leima.is/cards/time-series/crps/
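A minimal sketch of one practical route: CRPS equals twice the pinball (quantile) loss integrated over quantile levels, so averaging over a grid of levels approximates it (function names are mine):
import numpy as np

def pinball(y: np.ndarray, q_pred: np.ndarray, q: float) -> np.ndarray:
    # quantile (pinball) loss at level q
    diff = y - q_pred
    return np.maximum(q * diff, (q - 1) * diff)

def crps_from_quantiles(y, quantile_preds, levels):
    # approximate CRPS(F, y) = 2 * integral_0^1 pinball_q dq on a grid of levels
    losses = [pinball(y, qp, q).mean() for qp, q in zip(quantile_preds, levels)]
    return 2 * float(np.mean(losses))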
326
#ml
It's a lengthy article, but also a well-written one.
A few comments:
- The author wrote a paper on "The Next Decade in AI": https://arxiv.org/abs/2002.06177
- Make things work in their own domain. If we are going to come up with a "theory of everything" for computing or intelligence, we will hit the "mesoscopic" wall, where the bottom-up theories and the top-down approaches meet but cannot really connect. In the case of intelligence, the wall is determined by complexity (maybe MDL?). You can make symbols work for high complexities, but not always. A similar thing happens with neural networks.
- The neural-symbolic approach sounds good, but it is almost like patching bike wheels onto a train.
325
#visualization
Please click on the link and watch the animation. It’s 3D.
“The clever people at @NASA have created this deceptively simple yet highly effective data visualisation showing monthly global temperatures between 1880-2021”.: nextfuckinglevel https://www.reddit.com/r/nextfuckinglevel/comments/tejc0l/the_clever_people_at_nasa_have_created_this/?utm_source=share&utm_medium=ios_app&utm_name=iossmf
324
#ml
I share similar thoughts with the top comment by theXYZT.
If I may add to her comment, I would say: embrace the new approach even if it shatters our philosophy. But it is not only about what happened in the history of physics; it is about what we believe in science. In some sense, the purpose of interpretability and parsimony is for humans to come up with better ideas and to make us happy. If a universal model already works well enough and can be improved gradually, interpretability is not as important as predictability. This is more or less the first principle of science, if I may say so.
322
#python
I find poetry a great tool to manage Python requirements.
I used to manage Python requirements with requirements.txt (or environment.yaml) and install them using pip (or conda). The thing is, in this stack, we have to pin the version ranges manually. It is quite tedious, and we easily run into version conflicts in a large project.
Poetry is the savior here. When developing a package, we add some initial dependencies to pyproject.toml, a PEP-standard file. Whenever a new package is needed, we run poetry add package-name. Poetry tries to figure out compatible versions, and a lock file with the resolved versions is created or updated. To recreate an identical Python environment, we only need to run poetry install.
There is one drawback, and it may be quite painful at some point: recreating the lock file is extremely slow as the requirements grow in complexity. But this is not a problem of poetry; it is rather a constraint coming from PyPI. One mitigation is to use caching.
321
#tool
I have been using Hugo for my public notes. I built a theme called connectome a while ago. This theme has been serving as my note-taking theme.
When building my notes website on data science, I have noticed many problems with the connectome theme. And today, I fixed most of the problems. The connectome theme deserves some visibility now.
If you are using Hugo and would like to build a website for connected notes, like this one I have https://datumorphism.leima.is/ , the Hugo connectome theme can help a bit.
The Connectome Theme: https://github.com/kausalflow/connectome A template one could use to bootstrap a new website: https://github.com/kausalflow/hugo-connectome-theme-demo Tutorials: https://hugo-connectome.kausalflow.com/projects/tutorials/ Real-world example: https://datumorphism.leima.is/
If you would like to know more about how it was done, the idea is quite simple. Before we move on, one FAQ I got is: why Hugo? The answer is simple: speed.
The key components of the connectome theme are:
- automated backlinks, and
- a graph visualization of the whole notebook.
Behind the scenes, the heart of the theme is a metadata file that describes the connections between the notes.
For each note, we use the metadata to find all the notes that link to the current note and build backlinks from that.
320
#ML #RL #DeepMind
Magnetic control of tokamak plasmas through deep reinforcement learning | Nature https://www.nature.com/articles/s41586-021-04301-9
318
#ML
I made some slides to bootstrap a community in my company for sharing papers on graph-related methods (spectral methods, graph neural networks, etc.). The slides are mostly based on the first two chapters of the book by William Hamilton. I added some intuitive interpretations of key ideas. Some of these ideas are frequently used in graph neural networks and even transformers, and building intuition helps us unbox these neural networks. But the slides are only skeleton notes, so I will probably have to expand them at some point.
I am thinking about drawing more from the book and from this topic, maybe even making some short videos using these slides. Let's see how far I can go. I am way too busy now. (<- no excuse)
317
#ml
Lol, DeepMind and OpenAI:
https://deepmind.com/blog/article/Competitive-programming-with-AlphaCode
vs
316
Indeed much better. For example, I replaced grep with ack, and it is quite a bit faster.
https://www.ruanyifeng.com/blog/2022/01/cli-alternative-tools.html
314
#visualization
Seaborn is getting a new interface.
It would be great if the author defined a dunder method __add__() instead of using the .add() method. With dunder add, we could simply use + on layers.
Nevertheless, we can all move away from plotnine when the migration is done.
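A small sketch of the new interface as released (assuming seaborn >= 0.12), with a comment on how the dunder version could read:
import seaborn.objects as so
from seaborn import load_dataset

tips = load_dataset("tips")

# layers compose via .add(); with __add__ this could read
# so.Plot(...) + so.Dot() + so.Line()
(
    so.Plot(tips, x="total_bill", y="tip")
    .add(so.Dot())
    .add(so.Line(), so.PolyFit())
)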
313
#ds
Deepnote supports Great Expectations (GE) now.
I ran their template notebook:
312
#visualization
Beautiful, elegant, and informative. It reminds me of the chromatic storytelling visualizations of Netflix movies.
Full image: https://zenodo.org/record/5828349
Other discussions: https://www.reddit.com/r/dataisbeautiful/comments/s6vh8k/dutch_astronomer_cees_bassa_took_a_photo_of_the/
311
#python
I thought it was a trivial talk at the beginning, but I quickly realized that while I may know every piece of the code mentioned in the video, the philosophy is what makes it exciting.
He talked about some fundamental ideas of Python, e.g., protocols.
After watching this video, an idea came to me: pytorch lightning has implemented a lot of hooks in a very pythonic way. This is what makes pytorch lightning easy to use. (So if you run a lot of machine learning experiments, pytorch lightning is worth a try.)
308
#data #ds
Disclaimer: I'm no expert in state diagrams or statecharts.
It might be something trivial, but I find this useful: combined with some techniques from statecharts (something frontend people like a lot), state diagrams are a great way to document what our data goes through in data (pre)processing.
For complicated data transformations, we can draw the corresponding state diagram and follow our code to make sure it works as expected. The only difference is that we focus on the states of the data, not of any other system.
We can use some techniques from statecharts, such as hierarchies and parallels.
A state diagram is better than a flowchart in this scenario because we are more interested in the different states of the data. State diagrams automatically highlight the states, so we can easily spot the relevant part of the diagram without having to start from the beginning.
I have already documented some data transformations using state diagrams. I haven't tried, but it might also help us document our ML models.
References:
307
#visualization
Pu X, Kay M. A probabilistic grammar of graphics. Proceedings of the 2020 CHI Conference on Human Factors in Computing Systems. New York, NY, USA: ACM; 2020. doi:10.1145/3313831.3376466 Available at: https://dl.acm.org/doi/10.1145/3313831.3376466
A very good read if you are visualizing probability densities a lot. The paper began with a common mistake people make when visualizing densities. Then they proposed a systematic grammar of graphics for probabilities. They also provide a package (quite preliminary, see here https://github.com/MUCollective/pgog ).
306
#ml #science
I remember several years ago, when I was still doing my PhD, there was this contest on predicting protein structures, and none of the methods worked well. At that time, I would never have thought we could have anything like AlphaFold within a few years.
304
#ML #Transformers
Alammar J. The Illustrated Transformer. [cited 14 Dec 2021]. Available: http://jalammar.github.io/illustrated-transformer/
So good.
303
#DS #visualization
A new lightweight language for data analysis and visualization. It looks promising.
I hate jupyter notebooks and don't use them in most of my projects. One of the reasons is low reproducibility due to their non-reactive nature: if you change some old cells and forget to run a cell below, you may read wrong results. This new language is reactive; if old cells are changed, related results are also updated.
302
#ml #rl
How to Train your Decision-Making AIs https://thegradient.pub/how-to-train-your-decision-making-ais/
The author reviewed  “five types of human guidance to train AIs: evaluation, preference, goals, attention, and demonstrations without action labels”.
The last one reminds me of the movie Finch. In the movie, Finch was teaching the robot to walk by demonstrating walking but without “labels”.
301
#visualization
Hmmm, my plate is way off the planetary health diet recommendation.
300
#DS
Just in case you are also struggling with Python packages on Apple M1 Macs
I am using the third option: anaconda + miniforge.
299
294
#ML
SHAP (SHapley Additive exPlanations) is a system of methods to interpret machine learning models. The author of SHAP built an easy-to-use package to help us understand how the features are contributing to the machine learning model predictions. The package comes with a comprehensive tutorial for different machine learning frameworks.
- Python Package: slundberg/shap
- A tutorial on how to use it: https://www.aidancooper.co.uk/a-non-technical-guide-to-interpreting-shap-analyses/
The package is so popular that you might be using it already. So what is SHAP exactly? It is a series of methods based on Shapley values.
SHAP (SHapley Additive exPlanations) is a game-theoretic approach to explain the output of any machine learning model.
Regarding Shapley value: There are two key ideas in calculating a Shapley value.
- A method to measure the contribution of a certain combination of features to the final prediction.
- A method to combine these “contributions” into a score.
SHAP provides several methods to estimate Shapley values, adapted to different models.
The following two pages explain Shapley value and SHAP thoroughly.
- https://christophm.github.io/interpretable-ml-book/shap.html
- https://christophm.github.io/interpretable-ml-book/shapley.html
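For a feel of the API, a minimal sketch with a toy sklearn model (the dataset and model choices are mine, not from the tutorial):
import shap
from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor

X, y = load_diabetes(return_X_y=True, as_frame=True)
model = RandomForestRegressor(n_estimators=100).fit(X, y)

explainer = shap.Explainer(model)      # dispatches to a tree-specific estimator here
shap_values = explainer(X.iloc[:200])  # Shapley value estimates per row and feature
shap.plots.beeswarm(shap_values)       # global summary of feature contributions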
References:
- Lundberg SM, Lee SI. A unified approach to interpreting model predictions. Proceedings of the 31st International Conference on Neural Information Processing Systems. 2017. Available: http://papers.nips.cc/paper/2017/file/8a20a8621978632d76c43dfd28b67767-Paper.pdf
- Lundberg SM, Nair B, Vavilala MS, Horibe M, Eisses MJ, Adams T, et al. Explainable machine-learning predictions for the prevention of hypoxaemia during surgery. Nature Biomedical Engineering. 2018;2: 749-760. doi:10.1038/s41551-018-0304-0
I posted a similar article years ago in our Chinese data weekly newsletter but for a different story.
293
291
#visualization
Nicolas P. Rougier released his book on scientific visualization. He made some aesthetically pleasing figures. And the book is free.
290
#DS #Visualization
Okay, I’ll tell you the reason I wrote this post. It is because xkcd made this.
Choosing proper colormaps for our visualizations is important. It’s almost like shooting a photo using your phone. Some phones capture details in every corner, while some phones give us overexposed photos and we get no details in the bright regions.
A proper colormap should make sure we see the details we need to see. To illustrate the importance of colormaps, we use the two examples shown on the website of colorcet [1]. The two colormaps, "hot" and "fire", can be found in matplotlib and colorcet, respectively.
I cannot post multiple images in one message; please see the full post for a comparison of the two colormaps. Really, it is amazing. Find the link below: https://github.com/kausalflow/community/discussions/20
It is clear that "hot" brings in some overexposure. The other colormap, "fire", is a so-called perceptually uniform colormap. More experiments are performed in colorcet. Glasbey et al. showed some examples of inspecting different properties using different colormaps [2].
One method to make sure a colormap shows enough detail is to use perceptually uniform colormaps [3]. Kovesi provides a method to validate whether a colormap has uniform perceptual contrast [3].
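If you want to reproduce the comparison yourself, a small sketch along these lines should do (the synthetic image is my own choice):
import colorcet as cc
import matplotlib.pyplot as plt
import numpy as np

# smooth intensity ramps expose where a colormap loses detail
xx, yy = np.meshgrid(np.linspace(-2, 2, 400), np.linspace(-2, 2, 400))
img = np.exp(-(xx**2 + yy**2)) * np.sin(8 * xx)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.imshow(img, cmap="hot")          # matplotlib's "hot"
ax1.set_title("hot")
ax2.imshow(img, cmap=cc.cm["fire"])  # colorcet's perceptually uniform "fire"
ax2.set_title("fire")
plt.show()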
References and links mentioned in this post:
[1] Anaconda. colorcet 1.0.0 documentation. [cited 12 Nov 2021]. Available: https://colorcet.holoviz.org/
[2] Glasbey C, van der Heijden G, Toh VFK, Gray A. Colour displays for categorical images. Color Research & Application. 2007. pp. 304-309. doi:10.1002/col.20327
[3] Kovesi P. Good Colour Maps: How to Design Them. arXiv [cs.GR]. 2015. Available: http://arxiv.org/abs/1509.03700
289
#ML #fun
animegan v2! (I stole this animation from reddit. https://www.reddit.com/r/MachineLearning/comments/qo4kp8/r_p_animeganv2_face_portrait_v2/ )
Try it out:
- Telegram bot (works pretty well): https://t.me/face2stickerbot
- Dashboard (sometimes it doesn’t work): https://huggingface.co/spaces/akhaliq/AnimeGANv2
Code: https://github.com/bryandlee/animegan2-pytorch
Redditors made some funny photos too. https://www.reddit.com/r/MachineLearning/comments/qo4kp8/r_p_animeganv2_face_portrait_v2/
This post is also available here: https://community.kausalflow.com/c/ml-applications/animeganv2
288
#ML #news
- https://ai.googleblog.com/2021/11/model-ensembles-are-faster-than-you.html
- Wang X, Kondratyuk D, Christiansen E, Kitani KM, Alon Y, Eban E. Wisdom of Committees: An Overlooked Approach To Faster and More Accurate Models. arXiv [cs.CV]. 2020. Available: http://arxiv.org/abs/2012.01988
Most companies probably have several models to solve the same problem: model A, model B, even model C. The final result is some kind of aggregation of the three models, or the models are cascaded like in the attached figure. Either way, it takes a lot of computing resources to run the features through three models.
Wang et al. show that, for CV tasks, ensembles are no more resource-demanding than big models with similar performance.
287
#fun
Lol, thank you Mr Lossfunction. But, which sanitizer are you using?
https://www.reddit.com/r/learnmachinelearning/comments/qpolnw/data_cleaning_is_so_must/
286
#DS #news
This is a post about Zillow’s Zetimate Model.
Zillow (https://zillow.com/ ) is an online real-estate marketplace, and it is a big player. But last week, Zillow withdrew from the house-flipping market and planned to lay off a handful of employees.
There are rumors indicating that this action is related to their machine learning based price estimation tool, Zestimate ( https://www.zillow.com/z/zestimate/ ).
At first glance, Zestimate seems fine. Though the metrics shown on the website may not be that convincing, I am sure they have benchmarked more metrics than those shown there. There are some discussions on Reddit.
Anyways, this is not the best story for data scientists.
285
#ML
(See also https://bit.ly/3F1Kv2F )
Centered Kernel Alignment (CKA) is a similarity metric designed to measure the similarity between representations of features in neural networks [1].
CKA is based on the Hilbert-Schmidt Independence Criterion (HSIC). HSIC is defined using the centered kernels of the features being compared [2]. But HSIC is not invariant to isotropic scaling, which is required of a similarity metric for representations [1]. CKA is a normalization of HSIC.
The attached figure shows why CKA makes sense.
CKA has problems, too. Seita et al. argue that CKA is a metric based on intuitive tests, i.e., we compute cases that we believe should be similar and check whether the CKA values are consistent with this intuition. Seita et al. built a quantitative benchmark [3].
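For intuition, a sketch of the linear-kernel special case (my own condensed version, following the usual formulation):
import numpy as np

def linear_cka(X: np.ndarray, Y: np.ndarray) -> float:
    # X, Y: (n_samples, n_features) activations from two layers or networks
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    # with linear kernels, HSIC reduces to Frobenius norms of cross-covariances
    hsic = np.linalg.norm(Y.T @ X, "fro") ** 2
    # the normalization is what buys invariance to isotropic scaling
    return hsic / (np.linalg.norm(X.T @ X, "fro") * np.linalg.norm(Y.T @ Y, "fro"))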
282
#DS #ML
Microsoft created two repositories for machine learning and data science beginners. They created many sketches. I love this style.
281
#ML
( I am experimenting with a new platform. This post is also available at: https://community.kausalflow.com/c/ml-journal-club/probably-approximately-correct-pac-learning-and-bayesian-view )
The first time I read about PAC was in the book The Nature of Statistical Learning Theory by Vapnik [1].
PAC is a systematic theory on why learning from data is even feasible [2]. The idea is to quantify the errors when learning from data, and it turns out that an infinitesimally small error is possible under certain conditions, e.g., large datasets. Quote from Guedj [3]:
A PAC inequality states that with an arbitrarily high probability (hence "probably"), the performance (as provided by a loss function) of a learning algorithm is upper-bounded by a term decaying to an optimal value as more data is collected (hence "approximately correct").
Bayesian learning is a very important topic in machine learning. We implement the Bayes rule in the components of learning, e.g., the posterior in the loss function. There also exists a PAC theory for Bayesian learning that explains why Bayesian algorithms work. Guedj wrote a primer on this topic [3].
[1] Vladimir N. Vapnik. The Nature of Statistical Learning Theory. 2000. doi:10.1007/978-1-4757-3264-1
[2] Valiant LG. A theory of the learnable. Commun ACM. 1984;27: 1134-1142. doi:10.1145/1968.1972
[3] Guedj B. A Primer on PAC-Bayesian Learning. arXiv [stat.ML]. 2019. Available: http://arxiv.org/abs/1901.05353
279
#ML
(I am experimenting with a new platform. This post is also available at: https://community.kausalflow.com/c/ml-journal-club/how-do-neural-network-generalize )
There are some things that are quite hard to understand about deep neural networks. One of them is how a network generalizes.
[Zhang2016] shows some experiments on the amazing ability of neural networks to learn even completely random datasets. But such networks cannot generalize, as the data is random. How do we understand generalization? The authors mention some theories like VC dimension, Rademacher complexity, and uniform stability, but none of them is good enough.
Recently, I found the work of Simon et al [Simon2021]. The authors also wrote a blog about this paper [Simon2021Blog].
The idea is to simplify the problem of generalization by looking at how a neural network approximates a function f. This is approximating vectors in a Hilbert space, so we look at the similarity between the vector f and its neural network approximation f'. The similarity of these two vectors is related to the eigenvalues of the so-called "neural tangent kernel" (NTK). Using the NTK, they derived an amazingly simple quantity, learnability, which measures how Hilbert space vectors align with each other, that is, how good the neural network approximation is.
[Zhang2016]: Zhang C, Bengio S, Hardt M, Recht B, Vinyals O. Understanding deep learning requires rethinking generalization. arXiv [cs.LG]. 2016. Available: http://arxiv.org/abs/1611.03530
[Simon2021Blog]: Simon J. A First-Principles Theory of NeuralNetwork Generalization. In: The Berkeley Artificial Intelligence Research Blog [Internet]. [cited 26 Oct 2021]. Available: https://bair.berkeley.edu/blog/2021/10/25/eigenlearning/
[Simon2021]: Simon JB, Dickens M, DeWeese MR. Neural Tangent Kernel Eigenvalues Accurately Predict Generalization. arXiv [cs.LG]. 2021. Available: http://arxiv.org/abs/2110.03922
278
#visualization
"Fail": when visualizing data, the units being used have to be specified for any values shown.
But the style of the charts is attractive. :)
By chungischef Available at: https://www.reddit.com/r/dataisbeautiful/comments/q958if/recreation_of_a_classic_population_density_map/
277
275
#ML
Duan T, Avati A, Ding DY, Thai KK, Basu S, Ng AY, et al. NGBoost: Natural Gradient Boosting for probabilistic prediction. arXiv [cs.LG]. 2019. Available: http://arxiv.org/abs/1910.03225
(I had it on my reading list for a long time. However, I didn’t read it until today because the title and abstract are not attractive at all.) But this is a good paper. It goes deep to dig out the fundamental reasons why some methods work and others don’t.
When inferring probability distributions, it is straightforward to come up with methods based on parametrized distributions (statistical manifolds): by tuning the parameters, we adjust the distribution to fit our dataset best. The problem is the choice of the objective function and the optimization method. This paper describes a generic objective function and a framework to optimize the model along the natural gradient instead of the plain gradient w.r.t. the parameters. Different parametrizations of the objective are like coordinate transformations, and the chain rule only works if the transformations live in a "flat" space; but such a "flat" space is not necessarily a good choice for a high-dimensional problem. For a space that is approximately flat in small regions, we can define distance just as we do in differential geometry [1]. Meanwhile, just like covariant derivatives in differential geometry, a kind of covariant derivative can be found on statistical manifolds, and these are called natural gradients. Descending in the direction of the natural gradient navigates the landscape more efficiently.
[1] This is a Riemannian space.
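A sketch of the idea in LaTeX: the natural gradient step preconditions the plain gradient with the inverse Fisher information of the parametrized distribution:
\begin{align}
\theta_{t+1} &= \theta_t - \eta \, \nabla_\theta \mathcal{L}(\theta_t) && \text{(ordinary gradient)} \\
\theta_{t+1} &= \theta_t - \eta \, F(\theta_t)^{-1} \nabla_\theta \mathcal{L}(\theta_t) && \text{(natural gradient)} \\
F(\theta) &= \mathbb{E}_{x \sim p_\theta}\left[ \nabla_\theta \log p_\theta(x) \, \nabla_\theta \log p_\theta(x)^\top \right]
\end{align}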
274
#visualization #art #fun
More like a blog post… but the visualisation is cool. I posted it as a comment.
[2109.15079] Asimov’s Foundation – turning a data story into an NFT artwork https://arxiv.org/abs/2109.15079
272
#academia
This is not only Julia for biologists. It is for everyone who is not using Julia.
Roesch, Elisabeth, Joe G. Greener, Adam L. MacLean, Huda Nassar, Christopher Rackauckas, Timothy E. Holy, and Michael P. H. Stumpf. 2021. "Julia for Biologists." ArXiv [q-Bio.QM]. arXiv. http://arxiv.org/abs/2109.09973.
271
#visualization
I like this. I was testing visualization using antv's G6. It is not great for data analysis, as generating visualizations with it is quite tedious.
Observable's Plot is a much easier, more fluent package for data analysis.
270
#visualization
Neural Networks visualized in 3D
Source: https://youtu.be/3JQ3hYko51Y
269
#career
Comment: Same for many competitive careers
Beware survivorship bias in advice on science careers https://www.nature.com/articles/d41586-021-02634-z
268
#ML
scikit-learn reached 1.0. Nothing too exciting in the new features, but the major release probably means something.
Release Highlights for scikit-learn 1.0 — scikit-learn 1.0 documentation http://scikit-learn.org/stable/auto_examples/release_highlights/plot_release_highlights_1_0_0.html
267
#ML #fun
I read the story of using TensorFlow in Google Translate¹.
… Google Translate. Originally, the code that handled translation was a weighty 500,000 lines of code. The new, TensorFlow-based system has approximately 500, and it performs better than the old method.
This is crazy. Think about the maintenance of the code. A single person easily maintains 500 lines of code. 500,000 lines? No way.
Reference:
Pointer I. Programming PyTorch for Deep Learning: Creating and Deploying Deep Learning Applications. O'Reilly Media; 2019. ↩︎
266
#ML
Phys. Rev. X 11, 031059 (2021) - Statistical Mechanics of Deep Linear Neural Networks: The Backpropagating Kernel Renormalization https://journals.aps.org/prx/abstract/10.1103/PhysRevX.11.031059
265
#visualization
The Doomsday Datavisualizations - Bulletin of the Atomic Scientists
263
#ML
A Gentle Introduction to Graph Neural Networks https://distill.pub/2021/gnn-intro
262
#ML
The authors investigate the geometry formed by the responses of neurons to certain stimuli (tuning curves). Using the stimulus as the hidden variable, we can construct a geometry of neuron responses. The authors clarified the relations between this geometry and other measurements such as mutual information.
The story in this paper may not be interesting to machine learning practitioners. But the method of using the geometry of neuron responses to probe the brain is intriguing. We may borrow this method to investigate the internal mechanisms of artificial neural networks.
Kriegeskorte, Nikolaus, and Xue-Xin Wei. 2021. "Neural Tuning and Representational Geometry." Nature Reviews. Neuroscience, September. https://doi.org/10.1038/s41583-021-00502-3.
261
#ML #self-supervised #representation
Contrastive loss is widely used in representation learning. However, the mechanism behind it is not as straightforward as it seems.
Wang & Isola proposed a way to rewrite the contrastive loss into two components: alignment and uniformity. Samples in the feature space are normalized to unit vectors, so these vectors live on a hypersphere. The two components of the contrastive loss are
- alignment, which forces the positive samples to be aligned on the hypersphere, and
- uniformity, which distributes the samples uniformly on the hypersphere.
By optimizing such objectives, the samples are distributed on a hypersphere, with similar samples clustered, i.e., pointing in similar directions. Uniformity makes sure the samples use the whole hypersphere, so we don't waste "space".
References:
Wang T, Isola P. Understanding Contrastive Representation Learning through Alignment and Uniformity on the Hypersphere. arXiv [cs.LG]. 2020. Available: http://arxiv.org/abs/2005.10242
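The two losses are compact enough to quote as code. This follows the formulas in the paper, on L2-normalized features (the toy data at the end is my own):

```python
import torch
import torch.nn.functional as F

def align_loss(x, y, alpha=2):
    # x, y: positive pairs on the unit hypersphere, shape (N, d)
    return (x - y).norm(p=2, dim=1).pow(alpha).mean()

def uniform_loss(x, t=2):
    # log of the average pairwise Gaussian potential between samples
    return torch.pdist(x, p=2).pow(2).mul(-t).exp().mean().log()

# Toy usage on random, normalized features
x = F.normalize(torch.randn(128, 32), dim=1)
y = F.normalize(x + 0.1 * torch.randn(128, 32), dim=1)
print(align_loss(x, y).item(), uniform_loss(x).item())
```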
259
#cn #visualization
I saw Data Stitches recommended on the TMS channel: https://datastitches.substack.com/ I have followed it for a few issues; the quality is very good, and it often features excellent works.
Also recommending the TMS channel itself, https://t.me/tms_ur_way/1031, on time management, productivity, and life.
255
#ML
Jürgen Schmidhuber invented transformers in the 90s.
https://people.idsia.ch/~juergen/fast-weight-programmer-1991-transformer.html
254
#fun
This is cool.
https://github.blog/2021-08-31-request-for-proposals-defining-standardized-github-metrics/
253
#DS
Hullman J, Gelman A. Designing for interactive exploratory data analysis requires theories of graphical inference. Harvard Data Science Review. 2021. doi:10.1162/99608f92.3ab8a587 https://hdsr.mitpress.mit.edu/pub/w075glo6/release/2
Creating visualizations seems to be a creative task. At least for entry-level visualization tasks, we follow our hearts and build whatever is needed. However, visualizations are made for different purposes. Some visualizations are simply explorations and for us to get some feelings on the data. Some others are built for the validation of hypotheses. These are very different things.
Confirmation of an idea using charts is usually hard. In most cases, we need statistical tests to (dis)prove a hypothesis instead of just looking at the charts. Thus, visualizations become a tool to help us formulate a good question.
However, not everyone uses charts as hints only. Many use charts to draw conclusions. As a result, even experienced analysts draw spurious conclusions, and these so-called insights are not solid.
Visual analysis seems to be an adversarial game between humans and the visualizations. There are many different models for this process. A crude and probably stupid model can be illustrated with the example of analyzing the histogram of a variable. The histogram looks like a bell. It is symmetric. It is centered at 10 with an FWHM of 2.6. I guess this is a Gaussian distribution with mean 10 and sigma 1. This is the posterior p(model | chart). Imagine a curve like the one just guessed on top of the original curve. Would my guess and the actual curve overlap with each other? If not, what do we have to adjust? Do we need to introduce another parameter? Guess the parameters of the new distribution model and compare it with the actual curve again. This process is very similar to repeated Bayesian inference, though the actual analysis may be much more complicated, as the analyst would carry a lot of …
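As a sanity check on the numbers in this toy example (my own arithmetic, not from the paper):

```python
import numpy as np

# For a Gaussian, FWHM = 2 * sqrt(2 * ln 2) * sigma, about 2.355 * sigma.
fwhm = 2.6
sigma = fwhm / (2 * np.sqrt(2 * np.log(2)))
print(round(sigma, 2))  # ~1.1, consistent with the "sigma 1" guess above
```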
252
#ML
Though not the core of the model, I noticed that this model (MEB) uses the user search behavior on Bing to build the language model. If a search result on Bing is clicked by the user, it is considered to be a positive sample for the query, otherwise a negative sample.
In self-supervised learning, it has been shown that negative sampling is extremely important. This Bing search dataset naturally labels the positive and negative samples. Cool idea.
251
#science
Nielsen M. Reinventing discovery: The New Era of networked science. Princeton, NJ: Princeton University Press; 2011.
I found this book this morning and skimmed through it. It looks concise yet unique. The author discusses how the internet is changing the way human beings think as one collective intelligence. I like the chapters about how the data web is enabling more scientific discoveries.
250
#ML
https://thegradient.pub/systems-for-machine-learning/
challenges in data collection, verification, and serving tasks
249
https://github.com/soumith/ganhacks
Training GANs can be baffling. For example, the generator and the discriminator sometimes just don't "learn" at the same scale. Would you try to balance the generator loss and discriminator loss by hand? Soumith Chintala (@ FAIR) put together this list of tips for training GANs. "Don't balance loss via statistics" is one of Chintala's 17 tips. The list is quite inspiring.
248
I have downloaded the file so you don’t need to.
Anaconda-2021-SODS-Report-Final.pdf
247
#DS
This is an interesting report by Anaconda. We can more or less confirm from it that Python is still the king of languages for data science, with SQL right behind.
Quote from the report:
Between March 2020 to February 2021, the pandemic economic period, we saw 4.6 billion package downloads, a 48% increase from the previous year.
We have no data for other languages, so no comparisons can be made, but it is interesting to see Python growing so fast.
The roadblocks different data professionals face are quite different. Cloud engineers and MLOps engineers do not mention the skills gap in their organization that often. But data scientists/analysts mention skills gaps (e.g., data engineering, Docker, k8s) a lot. This might be related to cases where the organization doesn't even have cloud engineers/ops or MLOps.
See the next message for the PDF file.
246
#ML
Julia Computing got a lot of investment recently. I need to dive deeper into the Julia Language.
244
#Coding
I found a nice place to practice programming thinking. It is not as comprehensive as HackerRank/LeetCode, but the problems are quite fun.
243
#ML
Implicit Regularization in Tensor Factorization: Can Tensor Rank Shed Light on Generalization in Deep Learning? â Off the convex path http://www.offconvex.org/2021/07/08/imp-reg-tf/
242
#TIL
In PyTorch, conversion from Torch tensors to numpy arrays is very fast on CPUs, though torch tensors and numpy arrays are very different things. This is because of the Python buffer protocol. The protocol makes it possible to use binary data directly from C without copying the object.
https://docs.python.org/3/c-api/buffer.html
Reference: Eli Stevens, Luca Antiga. Deep Learning with PyTorch: Build, Train, and Tune Neural Networks Using Python Tools. Simon and Schuster; 2020.
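A quick demonstration of the zero-copy behavior (CPU tensors only): the tensor and the array share the same buffer, so mutating one is visible through the other.

```python
import numpy as np
import torch

t = torch.ones(3)
a = t.numpy()             # no copy; a view over the same memory
a[0] = 99.0
print(t)                  # tensor([99., 1., 1.])

b = np.zeros(3)
t2 = torch.from_numpy(b)  # also no copy
b[1] = 7.0
print(t2)                 # tensor([0., 7., 0.], dtype=torch.float64)
```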
241
#Academia
The Distill team's thoughts on interactive publishing and self-publishing in academia.
240
#ML
Great. TensorFlow now ships built-in decision forest models.
https://blog.tensorflow.org/2021/05/introducing-tensorflow-decision-forests.html?m=1
239
#fun
GitHub Copilot · Your AI pair programmer https://copilot.github.com/
This is crazy.
What is GitHub Copilot? GitHub Copilot is an AI pair programmer that helps you write code faster and with less work. GitHub Copilot draws context from comments and code, and suggests individual lines and whole functions instantly. GitHub Copilot is powered by OpenAI Codex, a new AI system created by OpenAI. The GitHub Copilot technical preview is available as a Visual Studio Code extension.
How good is GitHub Copilot? We recently benchmarked against a set of Python functions that have good test coverage in open source repos. We blanked out the function bodies and asked GitHub Copilot to fill them in. The model got this right 43% of the time on the first try, and 57% of the time when allowed 10 attempts. And it's getting smarter all the time.
238
#ML
A Turing lecture article by the three famous DL guys. It’s an overview of the history, development, and future of AI. There are two very interesting points in the outlook section:
- “From homogeneous layers to groups of neurons that represent entities.” In biological brains, there are memory engrams and motifs that almost do this.
- “Multiple time scales of adaption.” This is another key idea that has been discussed numerous times. One of the craziest things about our brain is the diversity of time scales of plasticity, i.e., different mechanisms change the brain on different time scales.
Reference: Bengio Y, Lecun Y, Hinton G. Deep learning for AI. Commun ACM. 2021;64: 58–65. doi:10.1145/3448250 https://dl.acm.org/doi/10.1145/3448250
237
#ML
Geometric Deep Learning is an attempt to unify deep learning using geometry. Instead of building deep neural networks that ignore the symmetries in the data and leave them to be discovered by the network, we build the symmetries of the problem into the network. For example, instead of flattening the matrix of a cat image into some predetermined order of pixels, note that applying a translation to the 2D image leaves the cat a cat without any doubt. This symmetry can be enforced in the network.
BTW, if you come from a physics background, you have most likely heard about symmetries in physical theories, e.g., Noether's theorem. In the history of physics, there was an era of many theories, yet most of them were connected or even unified under the umbrella of geometry. Geometric deep learning is another "benevolent propaganda" based on a similar idea.
References:
- Bronstein, Michael. "ICLR 2021 Keynote - "Geometric Deep Learning: The Erlangen Programme of ML" - M Bronstein." Video. YouTube, June 8, 2021. https://www.youtube.com/watch?v=w6Pw4MOzMuo.
- Bronstein MM, Bruna J, LeCun Y, Szlam A, Vandergheynst P. Geometric deep learning: going beyond Euclidean data. arXiv [cs.CV]. 2016. Available: http://arxiv.org/abs/1611.08097
- Bronstein MM, Bruna J, Cohen T, Veličković P. Geometric Deep Learning: Grids, Groups, Graphs, Geodesics, and Gauges. arXiv [cs.LG]. 2021. Available: http://arxiv.org/abs/2104.13478
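The toy fact behind the cat example is translation equivariance of convolution: shifting the image and then convolving gives the same result as convolving and then shifting (circularly, to avoid edge effects). A minimal check:

```python
import numpy as np
from scipy.ndimage import convolve

rng = np.random.default_rng(0)
img = rng.random((8, 8))
kernel = rng.random((3, 3))

# shift-then-convolve vs convolve-then-shift
lhs = convolve(np.roll(img, 1, axis=1), kernel, mode="wrap")
rhs = np.roll(convolve(img, kernel, mode="wrap"), 1, axis=1)
print(np.allclose(lhs, rhs))  # True
```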
235
#ML
The Bayesian hierarchical model provides a process for applying Bayesian inference hierarchically to update the posteriors. What is a Bayesian model? In a Bayesian linear regression problem, we can take the posterior from the previous data points and use it as our new prior for inference on new data. In other words, as more data comes in, our belief is updated. However, this becomes a problem if some clusters in the dataset have small sample sizes, aka small support. If we take these samples and fit the model on them, we may get a huge credible interval. One simple idea to mitigate this problem is to introduce constraints on how the priors can change. For example, we can introduce a hyperprior parametrized by new parameters. The model then becomes hierarchical, since we also have to model the new parameters.
The referenced post, “Bayesian Hierarchical Modeling at Scale”, provides some examples of coding such models using numpyro with performance in mind.
https://florianwilhelm.info/2020/10/bayesian_hierarchical_modelling_at_scale/
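A hypothetical minimal numpyro sketch of the idea (not the code from the referenced post): per-group means share a hyperprior, which shrinks estimates for groups with few samples toward the global mean.

```python
import jax.numpy as jnp
from jax import random
import numpyro
import numpyro.distributions as dist
from numpyro.infer import MCMC, NUTS

def model(group, n_groups, y=None):
    mu = numpyro.sample("mu", dist.Normal(0.0, 5.0))    # hyperprior mean
    tau = numpyro.sample("tau", dist.HalfNormal(5.0))   # hyperprior scale
    with numpyro.plate("groups", n_groups):
        theta = numpyro.sample("theta", dist.Normal(mu, tau))
    numpyro.sample("obs", dist.Normal(theta[group], 1.0), obs=y)

group = jnp.array([0, 0, 0, 1, 1, 2])  # group 2 has a single observation
y = jnp.array([1.0, 1.2, 0.8, 3.1, 2.9, 9.0])
mcmc = MCMC(NUTS(model), num_warmup=500, num_samples=500)
mcmc.run(random.PRNGKey(0), group, 3, y=y)
mcmc.print_summary()  # theta[2] gets a wide posterior, pulled toward mu
```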
233
#fun
Germany, birthplace of the automobile, just gave the green light to robotaxis
232
#DS
This paper serves as a good introduction to declarative data analytics tools.
Declarative analytics performs data analysis using a declarative syntax instead of functions for specific algorithms. Using declarative syntax, one can "describe what you want the program to achieve rather than how to achieve it". To be declarative, the language has to be specific to its tasks, which means we can only turn the knobs of some predefined model. To me, this is a deal-breaker.
Anyways, this paper is still a good read.
Makrynioti N, Vassalos V. Declarative Data Analytics: A Survey. IEEE Trans Knowl Data Eng. 2021;33: 2392–2411. doi:10.1109/TKDE.2019.2958084 http://dx.doi.org/10.1109/TKDE.2019.2958084
231
#DS
https://octo.github.com/projects/flat-data
Hmmm, so they gave it a name. I’ve built so many projects using this approach. I started building such data repos using CI/CD services way before github actions was born. Of course github actions made it much easier. One of them is the EU covid data tracking project ( https://github.com/covid19-eu-zh/covid19-eu-data ). It’s been running for more than a year with very little maintenance. Some covid projects even copied our EU covid data tracking setup.
I actually built a system (https://dataherb.github.io) to pull such github actions based data scraping repos together.
230
#ML
An interesting talk:
Dear all,
We are pleased to have Anna Golubeva speak on “Are wider nets better given the same number of parameters?” on Wednesday May 19th at 12:00 ET.
You can find further details here and listen to the talk here.
We hope you can join!
Best,
Sven
228
#career #DS
I believe this article is relevant. Most data scientists have very good academic records. These experiences of excellence compete with another required quality in the industry: The ability to survive in a less ideal yet competitive environment. We could be stubborn and find the environment that we fit well in or adapt based on the business playbook. Either way is good for us as long as we find the path that we love.
(I have a joke about this article: to reason productively, we do not need references for our claims at all.)
227
#DS #EDA #Visualization
If you are keen on data visualization, the new Observable Plot is something exciting for you. Observable Plot is based on d3, but it is easier to use in Observable Notebook. It also follows the guidelines of the layered grammar of graphics (e.g., marks, scales, transforms, facets).
226
#DS
(This is an automated post by IFTTT.)
It is always good for a data scientist to understand more about data engineering. With some basic data engineering knowledge in mind, we can navigate through the blueprint of a fully productionized data project at any time. In this blog post, I listed some of the key concepts and tools that I learned in the past.
This is my blog post on Datumorphism https://datumorphism.leima.is/wiki/data-engeering-for-data-scientist/checklist/
225
#DS #ML
The "AI Expert Roadmap". This can be used as a checklist of prerequisites for data people.
224
#statistics
This is the original paper of Fraser information.
Fisher information measures the second moment of the model sensitivity; Shannon information measures compressed information or variation of the information; Kullback (aka KL divergence) distinguishes two distributions. Instead of defining a measure of information for different conditions, Fraser tweaked the Shannon information slightly and made it more generic. The Fraser information can be reduced to Fisher information, Shannon information, and Kullback information under certain conditions.
It is such a simple yet powerful idea.
Fraser DAS. On Information in Statistics. Ann Math Statist. 1965;36: 890–896. doi:10.1214/aoms/1177700061 https://projecteuclid.org/journals/annals-of-mathematical-statistics/volume-36/issue-3/On-Information-in-Statistics/10.1214/aoms/1177700061.full
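For orientation, here are the three classical quantities named above in standard notation (Fraser's generic functional itself is defined in the paper):

```latex
% Fisher information: second moment of the score (model sensitivity)
I(\theta) = \mathbb{E}\!\left[\left(\partial_{\theta} \log f(X;\theta)\right)^{2}\right]
% Shannon information (differential entropy)
H(f) = -\int f(x) \log f(x)\,\mathrm{d}x
% Kullback information (KL divergence) between two distributions
D_{\mathrm{KL}}(f \,\|\, g) = \int f(x) \log \frac{f(x)}{g(x)}\,\mathrm{d}x
```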
223
222
221
#ML
Voss, et al., “Branch Specialization”, Distill, 2021. https://distill.pub/2020/circuits/branch-specialization/
TLDR;
- Branch: neuron clusters that are roughly segregated locally, e.g., AlexNet branches by design.
- Branch specialization: branches specialize in specific tasks, e.g., the two AlexNet branches specialize in different detectors (color detector or black-white filter).
- Is it a coincidence? No. Branch specialization repeatedly occurs in different trainings and different models.
- Do we find the same branch specializations in different models and tasks? Yes.
- Why? The authors’ proposal is that a positive feedback loop will be established between layers, and this loop enhances what the branch will do.
- Our brains have specialized regions too. Are there any connections?
220
#ML
Silla CN, Freitas AA. A survey of hierarchical classification across different application domains. Data Min Knowl Discov. 2011;22: 31–72. doi:10.1007/s10618-010-0175-9
A survey paper on hierarchical classification problems. It is a bit old, as it didn't consider classifier chains, but it summarizes most of the ideas in hierarchical classification.
The authors also proposed a framework for the categorization of such problems using two different dimensions (ranks).
219
217
#TIL
How the pandemic changed the way people collaborate.
- Siloing: From April 2019 to April 2020, modularity, a measure of workgroup siloing, rose around the world.
216
#DataScience
(Please refer to this post https://t.me/amneumarkt/199 for more background.)
I read the book “everyday data science”. I think it is not as good as I expected.
The book doesn't explain things clearly at all. Besides, I was expecting something that starts from everyday life and extrapolates to something more scientific.
I also mentioned previously that I would like to write a similar book. Attached is something I created recently that is quite close to the idea of my ideal book for everyday data science.
Cross Referencing Post: https://t.me/amneumarkt/199
213
210
#ML
How do we interpret the capacity of neural nets? Naively, we would represent the capacity using the number of parameters. Even for the Hopfield network, Hopfield introduced the concept of capacity using entropy, which in turn is related to the number of parameters.
But adding layers to neural nets also introduces regularization. This might be related to the capacity of the neural nets, but we do not have a clear clue.
This paper introduces a new perspective using sparse approximation theory. Sparse approximation theory represents the data by encouraging parsimony. The more parameters, the more accurately the model represents the training data. But this causes generalization issues, as similar data points in the test data may be pushed apart [Murdock2021].
By mapping the neural nets to shallow “overcomplete frames”, the capacity of the neural nets is easier to interpret.
[Murdock2021]: Murdock C, Lucey S. Reframing Neural Networks: Deep Structure in Overcomplete Representations. arXiv [cs.LG]. 2021. Available: http://arxiv.org/abs/2103.05804
209
#fun
India is growing so fast
Oh Germany…
Global AI Vibrancy Tool: Who's leading the global AI race? https://aiindex.stanford.edu/vibrancy/
206
#ML
I just found an elegant decision tree visualization package for sklearn.
I have been trying to explain decision tree results to many business people. It is very hard. This package makes it much easier to explain the results to a non-technical person.
205
204
#fun
Growth in data science interviews plateaued in 2020. Data science interviews only grew by 10% after previously growing by 80% year over year.
Data engineering specific interviews increased by 40% in the past year.
https://www.interviewquery.com/blog-data-science-interview-report
203
#ML #Physics
The easiest method to apply constraints to a dynamical system is the Lagrange multiplier, aka penalties in statistical learning. Penalties don't guarantee any conservation laws, as they are simply penalties, unless you find the multipliers carrying some physical meaning like what we have in Boltzmann statistics. This paper explains a simple method to hardcode conservation laws into a neural network architecture.
Paper: https://journals.aps.org/prl/abstract/10.1103/PhysRevLett.126.098302
TLDR: See the attached figure. Basically, the hardcoded conservation is realized using additional layers after the normal neural network predictions.
A quick bite of the paper: https://physics.aps.org/articles/v14/s25
Some thoughts: I like this paper. When physicists work on problems, they like dimensionlessness. This paper follows this convention. This is extremely important when you are working on a numerical problem. One should always make it dimensionless before implementing the equations in code.
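The paper's construction has its own details, but a minimal sketch of hardcoding a linear conservation law A @ y = c via an extra projection layer could look like this (A, c, and project are my hypothetical names):

```python
import torch

def project(y_raw, A, c):
    # Orthogonal projection of raw predictions onto {y : A y = c},
    # so the linear constraint holds exactly for any network output.
    AAt_inv = torch.linalg.inv(A @ A.T)
    return y_raw - A.T @ AAt_inv @ (A @ y_raw - c)

A = torch.tensor([[1.0, 1.0, 1.0]])    # e.g., conserve the total sum
c = torch.tensor([1.0])
y_raw = torch.tensor([0.2, 0.5, 0.6])  # some unconstrained prediction
y = project(y_raw, A, c)
print(y, A @ y)                        # A @ y == c up to float precision
```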
201
#event
If you are interested in free online AI Cons, Bosch CAI is organizing the AI Con 2021. This event starts tomorrow. https://www.ubivent.com/start/AI-CON-2021
200
#ML Haha
Deep Learning Activation Functions using Dance Moves https://www.reddit.com/r/learnmachinelearning/comments/lvehmi/deep_learning_activation_functions_using_dance/?utm_medium=android_app&utm_source=share
199
#DataScience
Ah I have always been thinking about writing a book like this. Just bought the book to educate myself on communications.
198
#ML
note2self:
From ref 1
we can take any expected utility maximization problem, and decompose it into an entropy minimization term plus a "make-the-world-look-like-this-specific-model" term.
This view should be combined with ref 2. If the utility is related to the curvature of the discrete state space, we are making a connection between entropy + KL divergence and curvature on graph. (This idea has to be polished in depth.)
Refs:
- Trivial proof but interesting perspective: https://www.lesswrong.com/posts/voLHQgNncnjjgAPH7/utility-maximization-description-length-minimization
- Samal A, Pharasi HK, Ramaia SJ, Kannan H, Saucan E, Jost J, Chakraborti A. Network geometry and market instability. R Soc Open Sci. 2021;8: 201734. http://doi.org/10.1098/rsos.201734
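A sketch of the decomposition in ref 1, assuming the utility is the log-probability of a target model q (my notation, not the post's):

```latex
\mathbb{E}_{x \sim p}\!\left[\log q(x)\right] = -H(p) - D_{\mathrm{KL}}(p \,\|\, q)
```

So maximizing this expected utility simultaneously minimizes the entropy of the world-distribution p and its divergence from the target model q.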
197
#dev
You can even use Chinese in GitHub Codespaces. Well, this is trivial if you have Chinese input methods on your computer. What if you are using a company computer and you would like to add some Chinese comments just for fun…
196
#fun
An interesting talk on the software used by Apollo.
https://media.ccc.de/v/34c3-9064-the_ultimate_apollo_guidance_computer_talk#t=3305
193
#ML
The new AI spring: a deflationary view
It’s actually fun to watch philosophers fighting each other. The author is trying to deflate the inflated expectations on AI by looking into why inflated expectations are harming our society. It’s not exactly based on evidence but still quite interesting to read.
https://link.springer.com/article/10.1007/s00146-019-00912-z
192
#neuroscience
Definitely weird. The authors used DNN to capture the firing behaviors of cortical neurons.
- A single hidden layer DNN (can you even call it Deep NN in this case?) can capture the neuronal activity without NMDA but with AMPA.
- With NMDA, the neuron requires more than 1 layer. This paper stops here.
WTH is this? Let's go back to the foundations of statistical learning. What the authors are looking for is a separation of the "stimulation" space. The "stimulation" space is basically a very simple (Poissonian) time series space. We just need to map inputs back to the same space but with different feature values. Since the feature space is so small, we will absolutely fit everything if we increase the expressive power of the DNN. The thing is, we already know that NMDA-based synapses require more expressive power, and we have very interpretable and good mathematical models for this… This research provides neither better predictability nor interpretability. Well done…
Maybe you have different opinions, prove me wrong.
191
#fun
We have been testing a new connected online workspace using Discord. Whoever is bored by home office can connect to a shared channel and chat.
Discord allows team voice chat and multiple screen sharing. By adding bots to the channel, the team can share music playlists. Discord allows detailed audio adjustment, so anyone can change the volume of any other user or even deafen themselves. So it is possible to stay connected for the whole day.
Being able to jump in, chat at any time, and share a working screen seems to make WFH fun.
190
#TIL My cheerful price for the work I am currently doing is very high…
https://www.lesswrong.com/posts/MzKKi7niyEqkBPnyu/your-cheerful-price
189
#productivity
I find vscode remote-ssh very helpful. For some projects with frequent maintenance fixes, I prepared all the required environment on a remote server. I only need to click on the remote-ssh connection to connect to this remote server and immediately start my work. This low overhead setup makes me less reluctant to fix stuff. It is also possible to connect to Docker containers. By setting up different containers we can work in completely different environments with a few clicks. This is crazy.
188
Passing the Data Baton: A Retrospective Analysis on Data Science Work and Workers
A paper on the different components of data-related work. The authors also propose a framework and a team structure for data workers.
187
#ML
Machine Learning, Kolmogorov Complexity, and Squishy Bunnies http://www.theorangeduck.com/page/machine-learning-kolmogorov-complexity-squishy-bunnies
183
#ML
[D] Convolution Neural Network Visualization - Made with Unity 3D and lots of Code / source - stefsietz (IG) https://www.reddit.com/r/MachineLearning/comments/leq2kf/d_convolution_neural_network_visualization_made/?utm_medium=android_app&utm_source=share
181
#research
ConnectedPapers is now integrated into arXiv.
This new perspective of references is often overlooked. It is not a gimmick at all.
180
#ML
"Everyone wants to do the model work, not the data work": Data Cascades in High-Stakes AI
TL;DR:
- Data quality is crucial in any AI system, especially high-stakes ones.
- Much data work is easily overlooked: politics (some data entries are not recorded or are misrecorded), humans in the loop of data-quality interventions for cleaning and wrangling (though upstream data creation should be well controlled too), etc.
- Data Cascades: it should be clear how issues cascade from upstream to downstream.
Data Cascades: compounding events causing negative, downstream effects from data issues, resulting in technical debt over time.
179
176
#market
Road freight between Britain and EU is down by a third, data shows https://www.theguardian.com/politics/2021/jan/31/road-freight-britain-eu-down-third-data-shows-brexit
Yeah, thanks to brexit
175
Hmm, why is the gov/feds interested in this Gamestop thing? (I have very limited knowledge about stocks.)
174
#data
https://ec.europa.eu/eurostat/cache/recovery-dashboard/
Eurostat built a dashboard to show the socioeconomic indicators of the EU during the coronavirus period. Most indicators are seen to be recovering.
173
#ML Sarcasm Detection with Sentiment Semantics Enhanced Multi-level Memory Network - ScienceDirect https://www.sciencedirect.com/science/article/abs/pii/S0925231220304689
Sheldon, this is your thing! (Didn’t read the paper. I just find this title a bit amusing.)
172
#ML http://akosiorek.github.io/ml/2018/03/14/what_is_wrong_with_vaes.html
- instabilities
- "last-mile" effort in optimization is too high
171
#science https://distill.pub/2017/research-debt/
Research debt is the accumulation of missing interpretive labor. It's extremely natural for young ideas to go through a stage of debt, like early prototypes in engineering.
That is because our value system doesn't respect interpreters…
170
#git https://mergebase.com/doing-git-wrong/2018/03/07/fun-with-git-pull-rebase/
Some people claim "git fetch; git rebase origin/master" is equivalent to "git pull -r", but it isn't.
git pull -r also deals with squashes.
#TIL
Rebase hell happens when several commits on your branch edit the same area, and upstream also touched the same area. The problem occurs because each conflict resolution will itself conflict with the subsequent commit in the series.
169
#TIL
TIL Larry Hillblom, the H of DHL, regularly took “sex safari” trips to Asia to prey on underage girls. When he died in a plane crash, 4 of the illegitimate children he fathered were able to claim $50 million each from his estate.
168
#fun
The authors have got too many questions regarding Chinese translations…
167
Introduces a method for organizing digital content based on coarse categories and sequential numbering. The classification part is quite similar to how I currently organize my files; the numbering scheme and its results offer some inspiration, reminiscent of the forms used when dealing with government agencies, such as I-140 and 1099-B. They look like random strings, but everyone who has dealt with them knows exactly what they are.
165
#statistics
https://en.wikipedia.org/wiki/Dependent_and_independent_variables#Statistics_synonyms
The broken jargon system in statistics…
Depending on the context, an independent variable is sometimes called a “predictor variable”, regressor, covariate, “controlled variable”, “manipulated variable”, “explanatory variable”, exposure variable (see reliability theory), “risk factor” (see medical statistics), “feature” (in machine learning and pattern recognition) or “input variable.” In econometrics, the term “control variable” is usually used instead of “covariate”.
163
https://www.youtube.com/watch?v=KXRtNwUju5g
This is one of the hidden problems of our world. In some sense, the US is destroying the world. If you look at Germany, plastic recycling is much easier with all these machines in the stores. (or, is it?)
161
160
#ML
An interesting idea for time series prediction. Instead of predicting the exact time series, the author proposed a method to predict the future using ordinal patterns.
The figure shows how to break the time series into 8 overlapping short-term series (each with three numbers). To transform a short-term series into a pattern, we write down its permutation pattern (for window size D=3, there are only 6 possible permutations). Then we use the permutation patterns of the past to predict the patterns of the future. BTW, this paper used the price of bitcoin as an example to test the method. The method is not meant to be super amazing; the point of the paper is to propose a simple way to predict the future using very limited resources.
This is the paper: https://royalsocietypublishing.org/doi/10.1098/rsos.201011 Short-term prediction through ordinal patterns
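A rough sketch of the pattern-extraction step for D=3, with a naive most-frequent-transition predictor (my simplification; the paper's prediction scheme has more detail):

```python
from collections import Counter, defaultdict

def pattern(window):
    # The permutation that sorts the window, e.g. [3, 1, 4] -> (1, 0, 2)
    return tuple(sorted(range(len(window)), key=lambda i: window[i]))

series = [3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5, 8]
patterns = [pattern(series[i:i + 3]) for i in range(len(series) - 2)]

# Count pattern-to-pattern transitions seen so far
transitions = defaultdict(Counter)
for a, b in zip(patterns, patterns[1:]):
    transitions[a][b] += 1

# Predict the next pattern as the most frequent successor of the last one
last = patterns[-1]
print(last, "->", transitions[last].most_common(1))
```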
159
158
#cn http://www.cddata.gov.cn/oportal/index Chengdu actually has an open data platform, and it is done quite well.
Update: I found that other cities have one too. Could it be that the province or some other level unified this, so they all have this open data platform?
157
#TIL
https://stackoverflow.com/a/28142831/1477359
I had the idea that a git fast-forward merge is more or less the same as a rebase, until I read this Stack Overflow answer.
I guess we should always rebase whenever possible to maintain a clean history.
156
#shameless
In the past years, I have been building a showcase of digital tools for academic researchers.
It started with some friends asking for recommendations of tools for reference management, visualization, note-taking, and so on ad infinitum.
So I built a GitHub repo to share what I have learned about these tools. This was way before the “awesome repo” concept. Later came the “GitHub awesome repo” shitstorm. Everyone is building an “awesome repo”. I created a website for a better user experience to flee from the shitstorm.
Tools for Academic Research is a website for digital tool listings. At the moment, there are 154 tools listed. You can browse by tags or categories to find whatever you need. Or add an item (books, tools, reviews, etc) you love.
153
#ML
Description of tables
http://feedproxy.google.com/~r/blogspot/gJZg/~3/jeMkmAfQxOk/totto-controlled-table-to-text.html
151
#career #business
[D] We Need More Data Engineers, Not Data Scientists https://www.reddit.com/r/MachineLearning/comments/kx0j1v/d_we_need_more_data_engineers_not_data_scientists/
The report: https://www.mihaileric.com/posts/we-need-data-engineers-not-data-scientists/
150
#business In 2015, there was a company called SixFold. They were one of the first heroes trying to disrupt an industry that has not changed much in a century: the freight market. They investigated the situation, established their hypothesis, and created an MVP. They did not succeed. The image is a summary of their post-mortem.
There are at least two learnings from this story.
- Think in terms of the utility function. Do not just point out blocks of reasons. Write down the utility function for the situation and make assumptions on the parameters.
- Swarm intelligence sometimes works better than one might expect. Improvements in swarm intelligence take a lot of effort if one does not have a smart plan.
Here is the article by their CEO: https://medium.com/@MartKelder/end-of-road-for-trucking-startup-palleter-523a4a906fe9
149
#ML
https://alan-turing-institute.github.io/skpro/introduction.html#a-motivating-example
Alan Turing Institute created a package called skpro for probabilistic modeling. Unlike many other probabilistic modeling packages, skpro integrates into sklearn pretty well.
148
https://www.chicagobooth.edu/why-booth/stories/in-memoriam-phd-student-yiran-fan
A 30-year-old Ph.D. student in a joint program of Chicago Booth and the Kenneth C. Griffin Department of Economics, Fan was shot and killed on Jan. 9.
Related news article: https://www.globaltimes.cn/page/202101/1212449.shtml
Fan was shot and killed in his car in the parking garage at an apartment building at about 1:50 pm Saturday. After shooting and killing Fan, the suspect, identified as 32-year-old Jason Nightengal by police, went on to shoot others across the city, reports said.
147
#ML #paper
https://www.nature.com/articles/s42256-020-00265-z
Intrinsic interpretability. arXiv: https://arxiv.org/abs/2002.01650
146
#fun
https://observablehq.com/@mbostock/hertzsprung-russell-diagram
Mike Bostock made a Hertzsprung–Russell Diagram using d3.js. It looks so cool.
145
#productivity
I have been using Obsidian as my primary note-taking app for a while. It was a rough start. Linking notes was simply not in my workflow; in some sense, I was not familiar with my notes after a while. So I started to do a notes review every two weeks. On each review, I go through my notes inbox and spend some time connecting new notes with the existing ones.
This is how my notes look now. They are mostly well connected. (The detached cluster is there because I archived the notes from my previous position.)
I also borrowed the domain concept from Dendron. I created folders with dot-delimited domains. For example, I have a folder named inbox.ml, which I use as my inbox for machine learning related notes. These notes are distributed to their corresponding folders during my notes review.
Those notes worth publishing will then be distributed to my websites. For example, https://datumorphism.leima.is/ is for data science related notes.
144
142
141
140
#intelligence #paper #ML Superintelligence Cannot be Contained: Lessons from Computability Theory https://www.jair.org/index.php/jair/article/view/12202
We argue that total containment is, in principle, impossible, due to fundamental limits inherent to computing itself. Assuming that a superintelligence will contain a program that includes all the programs that can be executed by a universal Turing machine on input potentially as complex as the state of the world, strict containment requires simulations of such a program, something theoretically (and practically) impossible.
139
#machinelearning
A nice colloquium paper: The unreasonable effectiveness of deep learning in artificial intelligence | PNAS https://www.pnas.org/content/117/48/30033
138
#intelligence
https://www.economicprinciples.org/
How the economic machine works, made easy
137
136
https://blog.waymo.com/2020/10/waymo-is-opening-its-fully-driverless.html
If you are in Phoenix.
135
https://github.com/volotat/DiffMorph #machinelearning #opensource
Differentiable Morphing
Image morphing without reference points by applying warp maps and optimizing over them.
134
#neuroscience
Source: https://science.sciencemag.org/content/370/6523/1410.full A gatekeeper for learning
Upon learning a hippocampus-dependent associative task, perirhinal inputs might act as a gate to modulate the excitability of apical dendrites and the impact of the feedback stream on layer 5 pyramidal neurons of the primary somatosensory cortex.
In some sense, perirhinal inputs are like config files for learning.
133
#data
Could you prevent a pandemic? A very 2020 video game https://play.acast.com/s/nature/2020festivespectacular
132
131
https://www.nature.com/articles/s41557-020-0544-y
Here we propose PauliNet, a deep-learning wavefunction ansatz that achieves nearly exact solutions of the electronic Schrödinger equation for molecules with up to 30 electrons
129
#data #covid19
UK gov has an official covid 19 API. https://coronavirus.data.gov.uk/details/developers-guide#structure-metrics
I found this funny typo in the documentation. The first one should be cumCasesByPublishDateRate.
128
127
#datascience
I ran into this hilarious comment on pie charts in a book called The Grammar of Graphics.
"To prevent bias, give the child the knife and someone else the first choice of slices."
125
#tools #writing https://www.losethevery.com/
“Very good english” is not very good english. Lose the very.
124
#datascience #career #academia
I regret quitting astrophysics
https://news.ycombinator.com/item?id=25444069
http://www.marcelhaas.com/index.php/2020/12/16/i-regret-quitting-astrophysics/
Me too. Though not an astrophysicist, I miss academia too.
123
120
118
#tools Space: The Integrated Team Environment https://www.jetbrains.com/space/
Wow, I love jetbrains.
117
#machinelearning https://arxiv.org/abs/2007.04504 Learning Differential Equations that are Easy to Solve
Jacob Kelly, Jesse Bettencourt, Matthew James Johnson, David Duvenaud
Differential equations parameterized by neural networks become expensive to solve numerically as training progresses. We propose a remedy that encourages learned dynamics to be easier to solve. Specifically, we introduce a differentiable surrogate for the time cost of standard numerical solvers, using higher-order derivatives of solution trajectories. These derivatives are efficient to compute with Taylor-mode automatic differentiation. Optimizing this additional objective trades model performance against the time cost of solving the learned dynamics. We demonstrate our approach by training substantially faster, while nearly as accurate, models in supervised classification, density estimation, and time-series modelling tasks.
116
#science The ergodicity problem in economics | Nature Physics https://www.nature.com/articles/s41567-019-0732-0
I read another paper about the hot hand/gambler's fallacy a while ago, and its author took a similar view. Here is the article: Surprised by the Hot Hand Fallacy? A Truth in the Law of Small Numbers, by Miller.
115
#ML
https://arxiv.org/abs/2012.04863
Skillearn: Machine Learning Inspired by Humans’ Learning Skills
Interesting idea. I didn’t know interleaving is already being used in ML.
114
112
https://events.ccc.de/2020/09/04/rc3-remote-chaos-experience/
CCC is hosting the 2020 event fully online. Everyone can join with a pay-as-you-wish ticket. Join if you like programming, hacking, social events, or learning something crazy and new.
111
110
109
#ML #paper
https://arxiv.org/abs/2012.00152 Every Model Learned by Gradient Descent Is Approximately a Kernel Machine
"Deep learning's successes are often attributed to its ability to automatically discover new representations of the data, rather than relying on handcrafted features like other learning methods."
108
A new search engine by a former chief scientist who helped develop the AI platform Einstein for Salesforce.
The new search engine is called “you”.
107
105
103
https://www.pnas.org/content/early/2020/12/02/2015954117
Oh Hi, it’s you, Mask. Or social distancing?
101
100
99
98
94
93
91
90
I did some investigation on the salary of tech employees working for the city of Cologne. It seems that the salary for IT employees is quite low. This may not be a fair representation of Germany as a whole, but Cologne is one of the most digitized cities in Germany, so I would guess it is a fair example.
For example, a data manager is in salary group 11 (net from 2,144 EUR to 2,993 EUR).
This is the job description: https://www.stadt-koeln.de/politik-und-verwaltung/ausbildung-karriere-bei-der-stadt/stellenangebote/datenmanagerin-beziehungsweise-datenmanager-mwd-im-amt-fuer-informationsverarbeitung
This is the salary calculation: https://oeffentlicher-dienst.info/c/t/rechner/tvoed/vka?id=tvoed-vka-2020&g=E_11&s=1&f=&z=&zv=&r=&awz=&zulage=&kk=&kkz=&zkf=&stkl=
89
Does Apple really log every app you run? A technical look
https://blog.jacopo.io/en/post/apple-ocsp/
TL;DR
- No, macOS does not send Apple a hash of your apps each time you run them. You should be aware that macOS might transmit some opaque information about the developer certificate of the apps you run. This information is sent out in clear text on your network.
- You probably shouldn't block ocsp.apple.com with Little Snitch or in your hosts file.
88
Going from Bad to Worse: From Internet Voting to Blockchain Voting
This article examines the suggestions that "voting over the Internet" or "voting on the blockchain" would increase election security, and finds such claims to be wanting and misleading. While current election systems are far from perfect, Internet- and blockchain-based voting would greatly increase the risk of undetectable, nation-scale election failures.
87
https://thegradient.pub/how-can-we-improve-peer-review-in-nlp/
(Anderson, 2009) argues that research paper merit is Zipf-distributed: many papers are clear rejects, while a few are clear accepts. In between those two extremes, decisions are very difficult, and any differences between the best rejected and the worst accepted paper are tiny, even given the best possible set of reviewers.
(And the following is quite discriminating towards non-English-speaking researchers.)
Work not-on-English: English is the "default" language to study (Bender, 2019), and work on other languages is easily accused of being "niche" and non-generalizable, even though English-only work is equally non-generalizable.
86
Wow what is gonna happen to microsoft
https://twitter.com/gvanrossum/status/1326932991566700549
Guido van Rossum @gvanrossum I decided that retirement was boring and have joined the Developer Division at Microsoft. To do what? Too many options to say! But it'll make using Python better for sure (and not just on Windows :-). There's lots of open source here. Watch this space.
85
I just finished the book Grokking Algorithms last night. https://www.manning.com/books/grokking-algorithms
I think it is a well-written book for people who are not from a CS background. The book has a lot of examples showing how the algorithms work step by step. To me, the most interesting chapter is dynamic programming. I had a lot of fun reading this. Highly recommended if you are interested in algorithms!
83
https://www.frontiersin.org/articles/10.3389/fncom.2012.00094/full
An emerging consensus for open evaluation: 18 visions for the future of scientific publishing
82
79
78
https://www.nytimes.com/interactive/2020/11/03/us/elections/forecast-president.html
Comment Am Neumarkt: The best information designers are summoned on each election day. It is a good time to learn about the best practices of data visualization. This "paths to victory" visualization is one of the best I have ever seen. If they put probabilities on each branch, it becomes a traditional decision tree for estimating risk, as used by investors. Does it tell us anything useful directly? Not really. Not all branches are created equal; without probabilities, it is as useless as a piece of blank paper. But it helps people run little experiments to get a feel for the competitiveness. In some sense, the probabilities are encoded in the reader's head: each reader provides a different reality of probabilities.
They also started to report uncertainties. I remember last time they used jittering pointers to educate people about the uncertainties. Now they show a range of estimates. Showing ranges is an important step forward.
76
https://www.scientificamerican.com/article/the-international-space-station-is-doomed-to-die-by-fire/
What is next? Space stations by private companies like bigelow?
75
74
73
71
Urban Dictionary Embeddings for Slang NLP Applications - ACL Anthology https://www.aclweb.org/anthology/2020.lrec-1.586
Haha very cool Water <-> butt-splash Soda <-> sodagasm
70
69
https://www.nature.com/articles/d41586-020-02986-y
Muotri, a neuroscientist at the University of California, San Diego (UCSD), has found some unusual ways to deploy his. He has connected organoids to walking robots, modified their genomes with Neanderthal genes, launched them into orbit aboard the International Space Station, and used them as models to develop more human-like artificial-intelligence systems.
68
The effect of influenza vaccination on trained immunity: impact on COVID-19 | medRxiv https://www.medrxiv.org/content/10.1101/2020.10.14.20212498v1
Hospital workers who got vaccinated were significantly less likely to develop COVID than those who did not
I believe that is just a simple sampling problem. People who had a flu shot this year are really careful about infectious diseases. They probably also sanitize more.
67
Explorer | Explore Human Knowledge Explorer by Batou. Navigate wikipedia visually!. Product topic: Web App, User Experience, Education, Artificial Intelligence, Tech View on Product Hunt
64
StellarX Create collaborative spaces & rich simulations without code. Product topic: Virtual Reality, Design Tools, Education, Artificial Intelligence, Augmented Reality, Tech View on Product Hunt
63
62
https://arxiv.org/abs/2010.06119
ReviewRobot: Explainable Paper Review Generation based on Knowledge Synthesis
To assist human review process, we build a novel ReviewRobot to automatically assign a review score and write comments for multiple categories. A good review needs to be knowledgeable, namely that the comments should be constructive and informative to help improve the paper; and explainable by providing detailed evidence. ReviewRobot achieves these goals via three steps: (1) We perform domain-specific Information Extraction to construct a knowledge graph (KG) from the target paper under review, a related work KG from the papers cited by the target paper, and a background KG from a large collection of previous papers in the domain. (2) By comparing these three KGs we predict a review score and detailed structured knowledge as evidence for each review category. (3) We carefully select and generalize human review sentences into templates, and apply these templates to transform the review scores and evidence into natural language comments. Experimental results show that our review score predictor reaches 71.4-100% accuracy. Human assessment by domain experts shows that 41.7%-70.5% of the comments generated by ReviewRobot are valid and constructive, and better than human-written ones 20% of the time. Thus, ReviewRobot can serve as an assistant for paper reviewers, program chairs and authors.
60
58
57
Jackson, D. E., & Ratnieks, F. L. W. (2006). Communication in ants. Current Biology, 16(15), R570–R574. https://doi.org/10.1016/j.cub.2006.07.015
I just realized that what we have been calling swarm intelligence is not very different from our single-agent intelligence. Both deal with information diffusion. Using diffusion, swarm intelligence shares global information with dumb agents. Our brain, on the other hand, uses information diffusion (with Ca as an agent) as a way to regulate neuron firing rates. This is also a way to share the global firing status with each neuron. It is even more interesting if we think of it as a hierarchical model. A "single agent" uses smaller agents for its own intelligence, and a "single agent" is also part of a larger agent. In the end, we are just part of Gaia.
56
https://www.nobelprize.org/prizes/physics/1972/cooper/biographical/
I just learned today that the Cooper in BCS for superconductivity and the Cooper in BCM in neuroscience are the same Cooper. This guy is amazing.
55
54
53
52
51
49
48
https://onlinelibrary.wiley.com/doi/full/10.1111/1475-6773.13553
"The intervention placed and retained frequent user, chronically homeless individuals in housing. It decreased psychiatric ED visits and shelter use, and increased outpatient mental health care, but not medical ED visits or hospitalizations. Limitations included more than one-third of usual care participants received another form of subsidized housing, potentially biasing results to the null, and loss of power due to high death rates. PSH can house high-risk individuals and reduce emergent psychiatric services and shelter use. Reductions in hospitalizations may be more difficult to realize."
46
44
43
https://greenelab.github.io/scihub-manuscript/v/8fcd0cd665f6fb5f39bed7e26b940aa27d4770ba/
I am a little bit scared whenever I think about how it accesses the papers. It is a black box, and we have no idea whether scihub does this in a way that every researcher would accept. On the other hand, it is not easy to live without scihub. There are legal alternatives like unpaywall and kopernio, but they are way behind the game. What shall we do? Require the author of scihub to open source the code? Continue using a black box that may hurt other people? I don't know.
41
37
“Docker for Mac uses https://github.com/moby/hyperkit to emulate the hypervisor capabilities and Hyperkit uses hypervisor.framework in its core. Hypervisor.framework is Mac’s native hypervisor solution. Hyperkit also uses VPNKit and DataKit to namespace network and filesystem respectively.”
hmmm I guess this is why docker on mac uses a lot of resources compared to its linux version
36
34
33
32
31
30
29
28
27
https://arxiv.org/abs/2005.12505
Very interesting research on voting mechanisms. They built a theory to understand how the Freemasons' member selection procedures shape the community. The Freemasons only admit a new member if the member is accepted by all current members.
25
24
22
21
20
19
18
16
15
https://distill.pub/2020/communicating-with-interactive-articles/
Details-on-demand is so popular and crucial to perception.
14
11
Bifrost Data Search Find the perfect image datasets for your next ML project. Product topic: Analytics, Robots, Developer Tools, Artificial Intelligence, Tech, Maker Tools View on Product Hunt
10
Orchest An open source tool for creating data science pipelines. Product topic: Productivity, Open Source, Developer Tools, Tech View on Product Hunt
6
Stat of the day Fascinating and important stats from the rest of world. Product topic: News, User Experience View on Product Hunt
5
4
Paletro Enable command palette (â§âP) in any application on macOS. Product topic: Mac, Productivity View on Product Hunt