What Is the User-Agent “ICC-Crawler”? A Detailed Guide to NICT’s Research Crawler, Its Purpose, Control Methods, and How to View It in the AI Era
- ICC-Crawler is an official crawler operated by the Universal Communication Research Institute of the National Institute of Information and Communications Technology (NICT). It automatically crawls the public web and collects pages, mainly for research and development in multilingual translation, information analysis, artificial intelligence technologies, and related fields.
- In guidance published for data collected on and after July 11, 2024, NICT explicitly states that the collected data may be used not only for its own research and development, but also, within the scope of the law, for joint research and third-party provision.
- It respects robots.txt and also supports Crawl-Delay.
- For these reasons, it is most practical to understand ICC-Crawler not as a general search-engine crawler, but as a research- and AI-related collection crawler operated by a Japanese public research institution.
The Basic Nature of ICC-Crawler
ICC-Crawler is a web crawler operated by NICT’s Universal Communication Research Institute. According to the official explanation, it is a program that automatically travels across the internet and collects web pages. In other words, unlike crawlers such as Googlebot or bingbot, which primarily exist to build search results, it is easier to understand ICC-Crawler as a crawler used by a research institution to collect the public web for research purposes.
This point is very important for site operators. With general commercial crawlers, the purpose is often relatively easy to understand, such as search traffic, ad delivery, link previews, or data sales. By contrast, ICC-Crawler is operated by a public research institution, and its stated primary purposes include research and development in advanced information processing technologies such as multilingual translation, information analysis, and artificial intelligence. So rather than dismissing it as merely “an unfamiliar bot,” it is more accurate to understand it as a collection entity related to research, language processing, and AI infrastructure building.
What is more, the current official guidance states that for information collected on and after July 11, 2024, in addition to NICT’s own research and development and related activities, the data may also be used for joint research, third-party research and development, or the use of NICT’s research outcomes by third parties, within the scope permitted by law. This is a particularly important point for operators to read carefully. In other words, ICC-Crawler is no longer described simply as a purely internal research crawler, but rather as a collection platform that also contemplates research collaboration.
This topic is useful for universities and research institutions, news media, corporate owned-media teams, specialist information sites, legal and intellectual property staff, and server administrators. For example, a media outlet with high-value specialist articles may want to think about whether it should be treated the same way as a search crawler. On the other hand, some operators may view cooperation with public research positively. ICC-Crawler is one concrete point where that decision becomes real.
Why Does ICC-Crawler Access Sites?
On NICT’s current page, the stated purposes for using the collected information include research and development in advanced information processing technologies such as multilingual translation, information analysis, various AI technologies, and related activities. What this tells us is that the core purpose of ICC-Crawler is data collection for information-processing research, including language processing and AI.
In addition, the guidance for data collected on and after July 11, 2024, explicitly says that the collected information and research outcomes may, within the scope permitted by law, be provided to third parties for joint research, third-party research and development, or the use of NICT’s research outcomes by third parties. Because of this, it is more accurate to understand that the information gathered by ICC-Crawler may not remain entirely within a closed laboratory environment, but may also flow into research collaboration and external use.
By contrast, the older guidance for data collected through July 10, 2024, stated that collected pages would not be used for purposes other than research. NICT itself therefore separates the old explanation and the current explanation on its official pages. As an operator, it is important not to rely on “the old impression,” but to check how the current explanation of the usage purpose has changed.
This difference matters a great deal in practice. Some people may feel that, because it is a public research institution, it is broadly acceptable. Others may feel that if third-party provision is included, they want to reconsider. ICC-Crawler is a User-Agent that raises quite modern questions about how public information should be handled.
Is ICC-Crawler a Search Crawler?
The short answer is no: ICC-Crawler is not on the same footing as general search-engine crawlers such as Googlebot or bingbot. NICT’s official explanation presents it as a crawler that collects web pages for research and development, not for building a search index. So it is not the primary counterpart in the SEO context. The basic way to understand it is as a research data collection crawler.
That said, its targets are normal public web pages, and technically it works by accessing sites and retrieving HTML and other content, just like other crawlers. So in access logs, it may look like “just another bot visit.” But the meaning of the access is different: it is not about forming search rankings, but about collecting resources for language data and information analysis. Once you understand that difference, it becomes easier to think separately about operations for search bots and operations for research crawlers.
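As a rough illustration of this operational split, a log-analysis script can tag research crawlers separately from search bots. This is a minimal sketch under stated assumptions: the token lists below are illustrative, not an official taxonomy, and real User-Agent strings should be checked against each operator’s published documentation.

```python
# Minimal sketch: classify User-Agent strings so research crawlers such as
# ICC-Crawler can be reported separately from search bots in log analysis.
# The token lists are illustrative assumptions, not an official taxonomy.
SEARCH_BOT_TOKENS = ("Googlebot", "bingbot")
RESEARCH_BOT_TOKENS = ("ICC-Crawler",)


def classify_user_agent(user_agent: str) -> str:
    """Return 'search', 'research', or 'other' for a User-Agent string."""
    if any(token in user_agent for token in SEARCH_BOT_TOKENS):
        return "search"
    if any(token in user_agent for token in RESEARCH_BOT_TOKENS):
        return "research"
    return "other"
```

A split like this makes it possible to report search-bot traffic (an SEO concern) and research-crawler traffic (a data-governance concern) as separate line items.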
For example, a site may want to prioritize traffic from Bing and Google while separately considering research- and AI-related collection. Conversely, another site may positively support public or academic research. ICC-Crawler is the kind of crawler that requires this kind of value-based decision.
Support for robots.txt and Crawl-Delay
ICC-Crawler officially states that it follows robots.txt. Both the current page and the old page explain that it reads the contents of the robots.txt published by the target host and follows any configured access restrictions. As a result, technical control is relatively straightforward, and it can generally be handled through ordinary robots.txt operations.
A notable feature is that it also supports Crawl-Delay. The official page states that if Crawl-Delay is set in robots.txt, the crawler uses the larger of the configured value and the crawler’s own minimum access interval. That means this is not just a binary choice between “allow” and “block.” There is also a middle-ground option of widening the interval between accesses to reduce load.
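A middle-ground configuration of that kind might look like the following. The 10-second value is an arbitrary illustration, not a recommended number:

```
User-agent: ICC-Crawler
Crawl-Delay: 10
```

Per the official description, the crawler then uses whichever is larger: this configured value or its own built-in minimum access interval.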
As an example of blocking site-wide collection, NICT itself gives the following robots.txt example:
User-agent: ICC-Crawler
Disallow: /
The official page also includes examples of blocking only specific directories or file types, and conversely, blocking everything while Allow-ing only certain parts. In other words, ICC-Crawler is a crawler that is fairly easy to control through standard Robots Exclusion Protocol practices.
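As an illustrative sketch of those patterns (the directory names here are placeholders, not NICT’s exact examples), blocking everything except one directory would look like:

```
User-agent: ICC-Crawler
Disallow: /
Allow: /public/
```

Under the Robots Exclusion Protocol, the more specific Allow rule takes precedence for paths under /public/, while everything else remains blocked for this User-Agent.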
What to Do When Problems Occur
NICT explains that it operates ICC-Crawler with the utmost care so as not to inconvenience target hosts. At the same time, it also states clearly that if a problem does occur, it will immediately stop collecting from the relevant host if contacted. This point is expressed quite clearly on both the current page and the older page.
The older guidance also said that if access does not stop even after configuring robots.txt, site operators should get in touch. So from the operator’s perspective, the process is fairly easy to follow: first use robots.txt, and if that does not solve the issue, contact them individually. Compared with unidentified scrapers, this makes it much easier to handle.
In practice, situations may arise where an operator wants to restrict only part of a public site, where there is an unexpected load spike, or where traffic appears to continue because of old caching behavior. In those cases, rather than immediately assuming malicious intent, it is more realistic to check the configuration and contact method described on the official page. Given that the operating body is a public research institution, ICC-Crawler can be described as a crawler where rule-based communication is comparatively easy.
Why the Difference Between the Old Page and the Current Page Matters
When you research ICC-Crawler, old and new information can easily get mixed together. The old page described the usage purposes as “building a web archive” and “collecting data for research and development in multilingual translation and information analysis,” and it said that the collected data would not be used for purposes other than research. By contrast, the current page says that for information collected on and after July 11, 2024, in addition to research and development in AI technologies and related fields, there is also the possibility of joint research and third-party provision.
If you miss this distinction, you might mistakenly assume that ICC-Crawler is simply “an old research bot, so the data stays internal.” But the current official explanation explicitly states that collected data and research outcomes may be provided to joint research partners or third parties within the scope of the law. For operators, it is therefore very important to make decisions based on the current policy.
Of course, how to evaluate that is up to each site’s own policy. If you value contributing to academic or public-interest research, you may choose to allow it. If, on the other hand, you want to think more carefully about secondary use of public content or research collaboration, you may choose to restrict it. What matters is making that decision based on the current explanation, not on old fragments of information.
What Kinds of Sites Should Think About ICC-Crawler?
The most directly affected sites are those with highly distinctive text assets. News sites, specialist commentary sites, research blogs, knowledge bases, educational content, and technical documentation are all likely to have value as targets for language processing and information-analysis research. For those sites, how they handle ICC-Crawler is directly connected to their policy on the research use of public information.
The next important group is organizations that care strongly about legal, intellectual property, and governance issues. Because the current page explicitly mentions the possibility of third-party provision, it is not enough to think only “it’s a public research institution, so it must be fine.” It is also necessary to clarify how far your organization is willing to let its public information circulate into research collaboration. This is especially true for corporate technical information, unique analysis, and media articles, where operators may want to have a clear policy even if the material is public.
On the other hand, for sites that emphasize public value, or for operators who positively support contributions to language resources and AI research, ICC-Crawler is not necessarily something to reject. The fact that it is run by a public research institution, and that both robots.txt and a contact channel are clearly documented, can make it easier to feel comfortable about the decision. In that sense, ICC-Crawler is a crawler whose handling should be an explicit choice, rather than a default in either direction.
Conclusion
ICC-Crawler is an official crawler operated by NICT’s Universal Communication Research Institute. It collects the public web for use in research and development on multilingual translation, information analysis, artificial intelligence technologies, and related fields. Rather than being a general search bot for ranking formation, it is more accurately understood as a research- and AI-related collection crawler operated by a public research institution.
Also, in the current guidance for data collected on and after July 11, 2024, it is explicitly stated that collected information and research outcomes may, within the scope of the law, be provided to joint research partners and third parties. This means the current operational description is somewhat broader than the older impression of “research purposes only.” That is an important point to keep in mind when making decisions today.
Technically, it respects robots.txt, supports Crawl-Delay, and provides a contact path for stopping collection if problems occur. So ICC-Crawler is not an unidentified scraper, but rather a research crawler whose purpose, operating policy, and refusal method are comparatively clearly disclosed. When you see it in your access logs, it is easier to understand if you treat it not as mere noise, but as a prompt to think about how far your organization wants to open its public information to research and AI-related use.
